2023-12-08 06:14:27

by Yu Zhao

Subject: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

Unmapped folios accessed through file descriptors can be
underprotected. Those folios are added to the oldest generation based
on:
1. The fact that they are less costly to reclaim (no need to walk the
rmap and flush the TLB) and have less impact on performance (don't
cause major PFs and can be non-blocking if needed again).
2. The observation that they are likely to be single-use. E.g., for
client use cases like Android, its apps parse configuration files
and store the data in heap (anon); for server use cases like MySQL,
it reads from InnoDB files and holds the cached data for tables in
buffer pools (anon).

However, the oldest generation can be very short lived, and if so, it
doesn't provide the PID controller with enough time to respond to a
surge of refaults. (Note that the PID controller uses weighted
refaults and those from evicted generations only take a half of the
whole weight.) In other words, for a short lived generation, the
moving average smooths out the spike quickly.
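As a rough illustration of that weighting (a toy userspace model with made-up
numbers, not the kernel's actual PID controller code):

  #include <stdio.h>

  /*
   * Toy model only: refaults charged to generations still being tracked
   * count with full weight, while refaults folded in from evicted
   * generations count with half weight. With a short-lived oldest
   * generation, most of a refault spike ends up in the half-weight
   * bucket, so the controller sees a much weaker signal.
   */
  static unsigned long weighted_refaults(unsigned long tracked, unsigned long evicted)
  {
          return tracked + evicted / 2;
  }

  int main(void)
  {
          /* the same spike of 1000 refaults, attributed differently */
          printf("long-lived oldest gen:  %lu\n", weighted_refaults(1000, 0));
          printf("short-lived oldest gen: %lu\n", weighted_refaults(100, 900));
          return 0;
  }

In the second case the controller sees barely half of the signal, which is
the "smoothed out" effect described above.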

To fix the problem:
1. For folios that are already on LRU, if they can be beyond the
tracking range of tiers, i.e., five accesses through file
descriptors, move them to the second oldest generation to give them
more time to age. (Note that tiers are used by the PID controller
to statistically determine whether folios accessed multiple times
through file descriptors are worth protecting.)
2. When adding unmapped folios to LRU, adjust the placement of them so
that they are not too close to the tail. The effect of this is
similar to the above.

On Android, launching 55 apps sequentially:
                          Before       After        Change
workingset_refault_anon   25641024     25598972     0%
workingset_refault_file   115016834    106178438    -8%

Fixes: ac35a4902374 ("mm: multi-gen LRU: minimal implementation")
Signed-off-by: Yu Zhao <[email protected]>
Reported-by: Charan Teja Kalla <[email protected]>
Tested-by: Kalesh Singh <[email protected]>
Cc: [email protected]
---
include/linux/mm_inline.h | 23 ++++++++++++++---------
mm/vmscan.c | 2 +-
mm/workingset.c | 6 +++---
3 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 9ae7def16cb2..f4fe593c1400 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -232,22 +232,27 @@ static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio,
if (folio_test_unevictable(folio) || !lrugen->enabled)
return false;
/*
- * There are three common cases for this page:
- * 1. If it's hot, e.g., freshly faulted in or previously hot and
- * migrated, add it to the youngest generation.
- * 2. If it's cold but can't be evicted immediately, i.e., an anon page
- * not in swapcache or a dirty page pending writeback, add it to the
- * second oldest generation.
- * 3. Everything else (clean, cold) is added to the oldest generation.
+ * There are four common cases for this page:
+ * 1. If it's hot, i.e., freshly faulted in, add it to the youngest
+ * generation, and it's protected over the rest below.
+ * 2. If it can't be evicted immediately, i.e., a dirty page pending
+ * writeback, add it to the second youngest generation.
+ * 3. If it should be evicted first, e.g., cold and clean from
+ * folio_rotate_reclaimable(), add it to the oldest generation.
+ * 4. Everything else falls between 2 & 3 above and is added to the
+ * second oldest generation if it's considered inactive, or the
+ * oldest generation otherwise. See lru_gen_is_active().
*/
if (folio_test_active(folio))
seq = lrugen->max_seq;
else if ((type == LRU_GEN_ANON && !folio_test_swapcache(folio)) ||
(folio_test_reclaim(folio) &&
(folio_test_dirty(folio) || folio_test_writeback(folio))))
- seq = lrugen->min_seq[type] + 1;
- else
+ seq = lrugen->max_seq - 1;
+ else if (reclaiming || lrugen->min_seq[type] + MIN_NR_GENS >= lrugen->max_seq)
seq = lrugen->min_seq[type];
+ else
+ seq = lrugen->min_seq[type] + 1;

gen = lru_gen_from_seq(seq);
flags = (gen + 1UL) << LRU_GEN_PGOFF;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4e3b835c6b4a..e67631c60ac0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4260,7 +4260,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
}

/* protected */
- if (tier > tier_idx) {
+ if (tier > tier_idx || refs == BIT(LRU_REFS_WIDTH)) {
int hist = lru_hist_from_seq(lrugen->min_seq[type]);

gen = folio_inc_gen(lruvec, folio, false);
diff --git a/mm/workingset.c b/mm/workingset.c
index 7d3dacab8451..2a2a34234df9 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -313,10 +313,10 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
* 1. For pages accessed through page tables, hotter pages pushed out
* hot pages which refaulted immediately.
* 2. For pages accessed multiple times through file descriptors,
- * numbers of accesses might have been out of the range.
+ * they would have been protected by sort_folio().
*/
- if (lru_gen_in_fault() || refs == BIT(LRU_REFS_WIDTH)) {
- folio_set_workingset(folio);
+ if (lru_gen_in_fault() || refs >= BIT(LRU_REFS_WIDTH) - 1) {
+ set_mask_bits(&folio->flags, 0, LRU_REFS_MASK | BIT(PG_workingset));
mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + type, delta);
}
unlock:
--
2.43.0.472.g3155946c3a-goog


2023-12-08 06:14:36

by Yu Zhao

Subject: [PATCH mm-unstable v1 2/4] mm/mglru: try to stop at high watermarks

The initial MGLRU patchset didn't include the memcg LRU support, and
it relied on should_abort_scan(), added by commit f76c83378851 ("mm:
multi-gen LRU: optimize multiple memcgs"), to "backoff to avoid
overshooting their aggregate reclaim target by too much".

Later on when the memcg LRU was added, should_abort_scan() was deemed
unnecessary, and the test results [1] showed no side effects after it
was removed by commit a579086c99ed ("mm: multi-gen LRU: remove
eviction fairness safeguard").

However, that test used memory.reclaim, which sets nr_to_reclaim to
SWAP_CLUSTER_MAX. So it can overshoot only by SWAP_CLUSTER_MAX-1
pages, i.e., from nr_reclaimed=nr_to_reclaim-1 to
nr_reclaimed=nr_to_reclaim+SWAP_CLUSTER_MAX-1. Compared with the batch
size kswapd sets to nr_to_reclaim, SWAP_CLUSTER_MAX is tiny. Therefore
that test isn't able to reproduce the worst case scenario, i.e.,
kswapd overshooting GBs on large systems and "consuming 100% CPU" (see
the Closes tag).
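To put that bound in numbers, a back-of-the-envelope sketch (assuming 4 KiB
pages; SWAP_CLUSTER_MAX is 32 in the kernel), which prints ~124 KiB, versus
kswapd batches that can reach GBs on large systems:

  #include <stdio.h>

  #define PAGE_SIZE_BYTES   4096UL  /* assuming 4 KiB pages */
  #define SWAP_CLUSTER_MAX  32UL    /* same value as the kernel constant */

  int main(void)
  {
          unsigned long overshoot = SWAP_CLUSTER_MAX - 1;

          /* worst-case overshoot for memory.reclaim, per the reasoning above */
          printf("memory.reclaim overshoot bound: %lu pages (~%lu KiB)\n",
                 overshoot, overshoot * PAGE_SIZE_BYTES / 1024);
          return 0;
  }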

Bring back a simplified version of should_abort_scan() on top of the
memcg LRU, so that kswapd stops when all eligible zones are above
their respective high watermarks plus a small delta to lower the
chance of KSWAPD_HIGH_WMARK_HIT_QUICKLY. Note that this only applies
to order-0 reclaim, meaning compaction-induced reclaim can still run
wild (which is a different problem).

On Android, launching 55 apps sequentially:
          Before       After        Change
pgpgin    838377172    802955040    -4%
pgpgout   38037080     34336300     -10%

[1] https://lore.kernel.org/[email protected]/

Fixes: a579086c99ed ("mm: multi-gen LRU: remove eviction fairness safeguard")
Signed-off-by: Yu Zhao <[email protected]>
Reported-by: Charan Teja Kalla <[email protected]>
Reported-by: Jaroslav Pulchart <[email protected]>
Closes: https://lore.kernel.org/CAK8fFZ4DY+GtBA40Pm7Nn5xCHy+51w3sfxPqkqpqakSXYyX+Wg@mail.gmail.com/
Tested-by: Jaroslav Pulchart <[email protected]>
Tested-by: Kalesh Singh <[email protected]>
Cc: [email protected]
---
mm/vmscan.c | 36 ++++++++++++++++++++++++++++--------
1 file changed, 28 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e67631c60ac0..10e964cd0efe 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4676,20 +4676,41 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool
return try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, false) ? -1 : 0;
}

-static unsigned long get_nr_to_reclaim(struct scan_control *sc)
+static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
{
+ int i;
+ enum zone_watermarks mark;
+
/* don't abort memcg reclaim to ensure fairness */
if (!root_reclaim(sc))
- return -1;
+ return false;

- return max(sc->nr_to_reclaim, compact_gap(sc->order));
+ if (sc->nr_reclaimed >= max(sc->nr_to_reclaim, compact_gap(sc->order)))
+ return true;
+
+ /* check the order to exclude compaction-induced reclaim */
+ if (!current_is_kswapd() || sc->order)
+ return false;
+
+ mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
+ WMARK_PROMO : WMARK_HIGH;
+
+ for (i = 0; i <= sc->reclaim_idx; i++) {
+ struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
+ unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH;
+
+ if (managed_zone(zone) && !zone_watermark_ok(zone, 0, size, sc->reclaim_idx, 0))
+ return false;
+ }
+
+ /* kswapd should abort if all eligible zones are safe */
+ return true;
}

static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
{
long nr_to_scan;
unsigned long scanned = 0;
- unsigned long nr_to_reclaim = get_nr_to_reclaim(sc);
int swappiness = get_swappiness(lruvec, sc);

/* clean file folios are more likely to exist */
@@ -4711,7 +4732,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
if (scanned >= nr_to_scan)
break;

- if (sc->nr_reclaimed >= nr_to_reclaim)
+ if (should_abort_scan(lruvec, sc))
break;

cond_resched();
@@ -4772,7 +4793,6 @@ static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
struct lru_gen_folio *lrugen;
struct mem_cgroup *memcg;
const struct hlist_nulls_node *pos;
- unsigned long nr_to_reclaim = get_nr_to_reclaim(sc);

bin = first_bin = get_random_u32_below(MEMCG_NR_BINS);
restart:
@@ -4805,7 +4825,7 @@ static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)

rcu_read_lock();

- if (sc->nr_reclaimed >= nr_to_reclaim)
+ if (should_abort_scan(lruvec, sc))
break;
}

@@ -4816,7 +4836,7 @@ static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)

mem_cgroup_put(memcg);

- if (sc->nr_reclaimed >= nr_to_reclaim)
+ if (!is_a_nulls(pos))
return;

/* restart if raced with lru_gen_rotate_memcg() */
--
2.43.0.472.g3155946c3a-goog

2023-12-08 06:14:41

by Yu Zhao

Subject: [PATCH mm-unstable v1 3/4] mm/mglru: respect min_ttl_ms with memcgs

While investigating kswapd "consuming 100% CPU" [1] (also see
"mm/mglru: try to stop at high watermarks"), it was discovered that
the memcg LRU can breach the thrashing protection imposed by
min_ttl_ms.

Before the memcg LRU:
  kswapd()
    shrink_node_memcgs()
      mem_cgroup_iter()
        inc_max_seq()  // always hit a different memcg
    lru_gen_age_node()
      mem_cgroup_iter()
        check the timestamp of the oldest generation

After the memcg LRU:
  kswapd()
    shrink_many()
      restart:
        iterate the memcg LRU:
          inc_max_seq()  // occasionally hit the same memcg
          if raced with lru_gen_rotate_memcg():
            goto restart
    lru_gen_age_node()
      mem_cgroup_iter()
        check the timestamp of the oldest generation

Specifically, when the restart happens in shrink_many(), it needs to
stick with the (memcg LRU) generation it began with. In other words,
it should neither re-read memcg_lru->seq nor age an lruvec of a
different generation. Otherwise it can hit the same memcg multiple
times without giving lru_gen_age_node() a chance to check the
timestamp of that memcg's oldest generation (against min_ttl_ms).

[1] https://lore.kernel.org/CAK8fFZ4DY+GtBA40Pm7Nn5xCHy+51w3sfxPqkqpqakSXYyX+Wg@mail.gmail.com/

Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists")
Signed-off-by: Yu Zhao <[email protected]>
Tested-by: T.J. Mercier <[email protected]>
Cc: [email protected]
---
include/linux/mmzone.h | 30 +++++++++++++++++-------------
mm/vmscan.c | 30 ++++++++++++++++--------------
2 files changed, 33 insertions(+), 27 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b23bc5390240..e3093ef9530f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -510,33 +510,37 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
* the old generation, is incremented when all its bins become empty.
*
* There are four operations:
- * 1. MEMCG_LRU_HEAD, which moves an memcg to the head of a random bin in its
+ * 1. MEMCG_LRU_HEAD, which moves a memcg to the head of a random bin in its
* current generation (old or young) and updates its "seg" to "head";
- * 2. MEMCG_LRU_TAIL, which moves an memcg to the tail of a random bin in its
+ * 2. MEMCG_LRU_TAIL, which moves a memcg to the tail of a random bin in its
* current generation (old or young) and updates its "seg" to "tail";
- * 3. MEMCG_LRU_OLD, which moves an memcg to the head of a random bin in the old
+ * 3. MEMCG_LRU_OLD, which moves a memcg to the head of a random bin in the old
* generation, updates its "gen" to "old" and resets its "seg" to "default";
- * 4. MEMCG_LRU_YOUNG, which moves an memcg to the tail of a random bin in the
+ * 4. MEMCG_LRU_YOUNG, which moves a memcg to the tail of a random bin in the
* young generation, updates its "gen" to "young" and resets its "seg" to
* "default".
*
* The events that trigger the above operations are:
* 1. Exceeding the soft limit, which triggers MEMCG_LRU_HEAD;
- * 2. The first attempt to reclaim an memcg below low, which triggers
+ * 2. The first attempt to reclaim a memcg below low, which triggers
* MEMCG_LRU_TAIL;
- * 3. The first attempt to reclaim an memcg below reclaimable size threshold,
+ * 3. The first attempt to reclaim a memcg below reclaimable size threshold,
* which triggers MEMCG_LRU_TAIL;
- * 4. The second attempt to reclaim an memcg below reclaimable size threshold,
+ * 4. The second attempt to reclaim a memcg below reclaimable size threshold,
* which triggers MEMCG_LRU_YOUNG;
- * 5. Attempting to reclaim an memcg below min, which triggers MEMCG_LRU_YOUNG;
+ * 5. Attempting to reclaim a memcg below min, which triggers MEMCG_LRU_YOUNG;
* 6. Finishing the aging on the eviction path, which triggers MEMCG_LRU_YOUNG;
- * 7. Offlining an memcg, which triggers MEMCG_LRU_OLD.
+ * 7. Offlining a memcg, which triggers MEMCG_LRU_OLD.
*
- * Note that memcg LRU only applies to global reclaim, and the round-robin
- * incrementing of their max_seq counters ensures the eventual fairness to all
- * eligible memcgs. For memcg reclaim, it still relies on mem_cgroup_iter().
+ * Notes:
+ * 1. Memcg LRU only applies to global reclaim, and the round-robin incrementing
+ * of their max_seq counters ensures the eventual fairness to all eligible
+ * memcgs. For memcg reclaim, it still relies on mem_cgroup_iter().
+ * 2. There are only two valid generations: old (seq) and young (seq+1).
+ * MEMCG_NR_GENS is set to three so that when reading the generation counter
+ * locklessly, a stale value (seq-1) does not wraparound to young.
*/
-#define MEMCG_NR_GENS 2
+#define MEMCG_NR_GENS 3
#define MEMCG_NR_BINS 8

struct lru_gen_memcg {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 10e964cd0efe..cac38e9cac86 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4117,6 +4117,9 @@ static void lru_gen_rotate_memcg(struct lruvec *lruvec, int op)
else
VM_WARN_ON_ONCE(true);

+ WRITE_ONCE(lruvec->lrugen.seg, seg);
+ WRITE_ONCE(lruvec->lrugen.gen, new);
+
hlist_nulls_del_rcu(&lruvec->lrugen.list);

if (op == MEMCG_LRU_HEAD || op == MEMCG_LRU_OLD)
@@ -4127,9 +4130,6 @@ static void lru_gen_rotate_memcg(struct lruvec *lruvec, int op)
pgdat->memcg_lru.nr_memcgs[old]--;
pgdat->memcg_lru.nr_memcgs[new]++;

- lruvec->lrugen.gen = new;
- WRITE_ONCE(lruvec->lrugen.seg, seg);
-
if (!pgdat->memcg_lru.nr_memcgs[old] && old == get_memcg_gen(pgdat->memcg_lru.seq))
WRITE_ONCE(pgdat->memcg_lru.seq, pgdat->memcg_lru.seq + 1);

@@ -4152,11 +4152,11 @@ void lru_gen_online_memcg(struct mem_cgroup *memcg)

gen = get_memcg_gen(pgdat->memcg_lru.seq);

+ lruvec->lrugen.gen = gen;
+
hlist_nulls_add_tail_rcu(&lruvec->lrugen.list, &pgdat->memcg_lru.fifo[gen][bin]);
pgdat->memcg_lru.nr_memcgs[gen]++;

- lruvec->lrugen.gen = gen;
-
spin_unlock_irq(&pgdat->memcg_lru.lock);
}
}
@@ -4663,7 +4663,7 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool
DEFINE_MAX_SEQ(lruvec);

if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg))
- return 0;
+ return -1;

if (!should_run_aging(lruvec, max_seq, sc, can_swap, &nr_to_scan))
return nr_to_scan;
@@ -4738,7 +4738,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
cond_resched();
}

- /* whether try_to_inc_max_seq() was successful */
+ /* whether this lruvec should be rotated */
return nr_to_scan < 0;
}

@@ -4792,13 +4792,13 @@ static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
struct lruvec *lruvec;
struct lru_gen_folio *lrugen;
struct mem_cgroup *memcg;
- const struct hlist_nulls_node *pos;
+ struct hlist_nulls_node *pos;

+ gen = get_memcg_gen(READ_ONCE(pgdat->memcg_lru.seq));
bin = first_bin = get_random_u32_below(MEMCG_NR_BINS);
restart:
op = 0;
memcg = NULL;
- gen = get_memcg_gen(READ_ONCE(pgdat->memcg_lru.seq));

rcu_read_lock();

@@ -4809,6 +4809,10 @@ static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
}

mem_cgroup_put(memcg);
+ memcg = NULL;
+
+ if (gen != READ_ONCE(lrugen->gen))
+ continue;

lruvec = container_of(lrugen, struct lruvec, lrugen);
memcg = lruvec_memcg(lruvec);
@@ -4893,16 +4897,14 @@ static void set_initial_priority(struct pglist_data *pgdat, struct scan_control
if (sc->priority != DEF_PRIORITY || sc->nr_to_reclaim < MIN_LRU_BATCH)
return;
/*
- * Determine the initial priority based on ((total / MEMCG_NR_GENS) >>
- * priority) * reclaimed_to_scanned_ratio = nr_to_reclaim, where the
- * estimated reclaimed_to_scanned_ratio = inactive / total.
+ * Determine the initial priority based on
+ * (total >> priority) * reclaimed_to_scanned_ratio = nr_to_reclaim,
+ * where reclaimed_to_scanned_ratio = inactive / total.
*/
reclaimable = node_page_state(pgdat, NR_INACTIVE_FILE);
if (get_swappiness(lruvec, sc))
reclaimable += node_page_state(pgdat, NR_INACTIVE_ANON);

- reclaimable /= MEMCG_NR_GENS;
-
/* round down reclaimable and round up sc->nr_to_reclaim */
priority = fls_long(reclaimable) - 1 - fls_long(sc->nr_to_reclaim - 1);

--
2.43.0.472.g3155946c3a-goog

2023-12-08 06:14:44

by Yu Zhao

Subject: [PATCH mm-unstable v1 4/4] mm/mglru: reclaim offlined memcgs harder

In the effort to reduce zombie memcgs [1], it was discovered that the
memcg LRU doesn't apply enough pressure on offlined memcgs.
Specifically, instead of rotating them to the tail of the current
generation (MEMCG_LRU_TAIL) for a second attempt, it moves them to the
next generation (MEMCG_LRU_YOUNG) after the first attempt.

Not applying enough pressure on offlined memcgs can cause them to
build up, and this can be particularly harmful to memory-constrained
systems.

On Pixel 8 Pro, launching apps for 50 cycles:
                Before   After   Change
Zombie memcgs   45       35      -22%

[1] https://lore.kernel.org/CABdmKX2M6koq4Q0Cmp_-=wbP0Qa190HdEGGaHfxNS05gAkUtPA@mail.gmail.com/

Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists")
Signed-off-by: Yu Zhao <[email protected]>
Reported-by: T.J. Mercier <[email protected]>
Tested-by: T.J. Mercier <[email protected]>
Cc: [email protected]
---
include/linux/mmzone.h | 8 ++++----
mm/vmscan.c | 24 ++++++++++++++++--------
2 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e3093ef9530f..2efd3be484fd 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -524,10 +524,10 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
* 1. Exceeding the soft limit, which triggers MEMCG_LRU_HEAD;
* 2. The first attempt to reclaim a memcg below low, which triggers
* MEMCG_LRU_TAIL;
- * 3. The first attempt to reclaim a memcg below reclaimable size threshold,
- * which triggers MEMCG_LRU_TAIL;
- * 4. The second attempt to reclaim a memcg below reclaimable size threshold,
- * which triggers MEMCG_LRU_YOUNG;
+ * 3. The first attempt to reclaim a memcg offlined or below reclaimable size
+ * threshold, which triggers MEMCG_LRU_TAIL;
+ * 4. The second attempt to reclaim a memcg offlined or below reclaimable size
+ * threshold, which triggers MEMCG_LRU_YOUNG;
* 5. Attempting to reclaim a memcg below min, which triggers MEMCG_LRU_YOUNG;
* 6. Finishing the aging on the eviction path, which triggers MEMCG_LRU_YOUNG;
* 7. Offlining a memcg, which triggers MEMCG_LRU_OLD.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cac38e9cac86..dad4b80b04cd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4626,7 +4626,12 @@ static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
}

/* try to scrape all its memory if this memcg was deleted */
- *nr_to_scan = mem_cgroup_online(memcg) ? (total >> sc->priority) : total;
+ if (!mem_cgroup_online(memcg)) {
+ *nr_to_scan = total;
+ return false;
+ }
+
+ *nr_to_scan = total >> sc->priority;

/*
* The aging tries to be lazy to reduce the overhead, while the eviction
@@ -4747,14 +4752,9 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
bool success;
unsigned long scanned = sc->nr_scanned;
unsigned long reclaimed = sc->nr_reclaimed;
- int seg = lru_gen_memcg_seg(lruvec);
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
struct pglist_data *pgdat = lruvec_pgdat(lruvec);

- /* see the comment on MEMCG_NR_GENS */
- if (!lruvec_is_sizable(lruvec, sc))
- return seg != MEMCG_LRU_TAIL ? MEMCG_LRU_TAIL : MEMCG_LRU_YOUNG;
-
mem_cgroup_calculate_protection(NULL, memcg);

if (mem_cgroup_below_min(NULL, memcg))
@@ -4762,7 +4762,7 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)

if (mem_cgroup_below_low(NULL, memcg)) {
/* see the comment on MEMCG_NR_GENS */
- if (seg != MEMCG_LRU_TAIL)
+ if (lru_gen_memcg_seg(lruvec) != MEMCG_LRU_TAIL)
return MEMCG_LRU_TAIL;

memcg_memory_event(memcg, MEMCG_LOW);
@@ -4778,7 +4778,15 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)

flush_reclaim_state(sc);

- return success ? MEMCG_LRU_YOUNG : 0;
+ if (success && mem_cgroup_online(memcg))
+ return MEMCG_LRU_YOUNG;
+
+ if (!success && lruvec_is_sizable(lruvec, sc))
+ return 0;
+
+ /* one retry if offlined or too small */
+ return lru_gen_memcg_seg(lruvec) != MEMCG_LRU_TAIL ?
+ MEMCG_LRU_TAIL : MEMCG_LRU_YOUNG;
}

#ifdef CONFIG_MEMCG
--
2.43.0.472.g3155946c3a-goog

2023-12-08 08:24:50

by Kairui Song

Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

Yu Zhao <[email protected]> wrote on Fri, Dec 8, 2023 at 14:14:
>
> Unmapped folios accessed through file descriptors can be
> underprotected. Those folios are added to the oldest generation based
> on:
> 1. The fact that they are less costly to reclaim (no need to walk the
> rmap and flush the TLB) and have less impact on performance (don't
> cause major PFs and can be non-blocking if needed again).
> 2. The observation that they are likely to be single-use. E.g., for
> client use cases like Android, its apps parse configuration files
> and store the data in heap (anon); for server use cases like MySQL,
> it reads from InnoDB files and holds the cached data for tables in
> buffer pools (anon).
>
> However, the oldest generation can be very short lived, and if so, it
> doesn't provide the PID controller with enough time to respond to a
> surge of refaults. (Note that the PID controller uses weighted
> refaults and those from evicted generations only take a half of the
> whole weight.) In other words, for a short lived generation, the
> moving average smooths out the spike quickly.
>
> To fix the problem:
> 1. For folios that are already on LRU, if they can be beyond the
> tracking range of tiers, i.e., five accesses through file
> descriptors, move them to the second oldest generation to give them
> more time to age. (Note that tiers are used by the PID controller
> to statistically determine whether folios accessed multiple times
> through file descriptors are worth protecting.)
> 2. When adding unmapped folios to LRU, adjust the placement of them so
> that they are not too close to the tail. The effect of this is
> similar to the above.
>
> On Android, launching 55 apps sequentially:
> Before After Change
> workingset_refault_anon 25641024 25598972 0%
> workingset_refault_file 115016834 106178438 -8%

Hi Yu,

Thank you for your amazing work on MGLRU.

I believe this is similar to the issue I was trying to resolve previously:
https://lwn.net/Articles/945266/
The idea is to use the refault distance to decide whether a page should be
placed in the oldest generation or some other gen, which, per my tests,
worked very well, and we have been using the refault distance for MGLRU in
multiple workloads.

There are a few issues left in my previous RFC series, e.g., anon pages
in MGLRU shouldn't be considered. I wanted to collect feedback or test
cases, but unfortunately it didn't seem to get much attention
upstream.

I think both this patch and my previous series are aimed at solving the
underprotected file pages issue, and I did a quick test using this
series. For the MongoDB test, the refault distance still seems to be a better
solution (I'm not saying these two optimizations are mutually exclusive,
though; they just have some conflicts in implementation and solve a
similar problem):

Previous result:
==================================================================
Execution Results after 905 seconds
------------------------------------------------------------------
              Executed       Time (µs)   Rate
STOCK_LEVEL       2542   27121571486.2   0.09 txn/s
------------------------------------------------------------------
TOTAL             2542   27121571486.2   0.09 txn/s

This patch:
==================================================================
Execution Results after 900 seconds
------------------------------------------------------------------
              Executed       Time (µs)   Rate
STOCK_LEVEL       1594   27061522574.4   0.06 txn/s
------------------------------------------------------------------
TOTAL             1594   27061522574.4   0.06 txn/s

Unpatched version is always around ~500.

I think there are a few points here:
- Refault distance makes use of the page shadow, so it can better
distinguish evicted pages with different access patterns (re-access
distance).
- A throttled refault distance can help hold part of the workingset when
memory is too small to hold the whole workingset.

So maybe parts of this patch and bits of my previous series can be
combined to work better on this issue. What do you think?
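For context, a minimal toy sketch of the refault-distance heuristic referred
to above, as the classic workingset code (mm/workingset.c) uses it; this is
not the RFC's actual code, and all names below are made up:

  #include <stdbool.h>
  #include <stdio.h>

  /*
   * Toy model: a shadow entry records the nonresident "clock" at eviction
   * time; on refault, the distance is how far that clock has advanced. If
   * the distance fits within the amount of protectable memory, the
   * refaulting page is treated as part of the workingset.
   */
  struct toy_shadow {
          unsigned long eviction;   /* clock value recorded at eviction */
  };

  static bool toy_workingset_test(unsigned long clock_now,
                                  const struct toy_shadow *shadow,
                                  unsigned long protectable_pages)
  {
          unsigned long refault_distance = clock_now - shadow->eviction;

          return refault_distance <= protectable_pages;
  }

  int main(void)
  {
          struct toy_shadow s = { .eviction = 1000 };

          /* re-accessed soon (distance 500) vs much later (distance 50000) */
          printf("%d %d\n",
                 toy_workingset_test(1500, &s, 4096),
                 toy_workingset_test(51000, &s, 4096));
          return 0;
  }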

>
> Fixes: ac35a4902374 ("mm: multi-gen LRU: minimal implementation")
> Signed-off-by: Yu Zhao <[email protected]>
> Reported-by: Charan Teja Kalla <[email protected]>
> Tested-by: Kalesh Singh <[email protected]>
> Cc: [email protected]
> ---
> include/linux/mm_inline.h | 23 ++++++++++++++---------
> mm/vmscan.c | 2 +-
> mm/workingset.c | 6 +++---
> 3 files changed, 18 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index 9ae7def16cb2..f4fe593c1400 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -232,22 +232,27 @@ static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio,
> if (folio_test_unevictable(folio) || !lrugen->enabled)
> return false;
> /*
> - * There are three common cases for this page:
> - * 1. If it's hot, e.g., freshly faulted in or previously hot and
> - * migrated, add it to the youngest generation.
> - * 2. If it's cold but can't be evicted immediately, i.e., an anon page
> - * not in swapcache or a dirty page pending writeback, add it to the
> - * second oldest generation.
> - * 3. Everything else (clean, cold) is added to the oldest generation.
> + * There are four common cases for this page:
> + * 1. If it's hot, i.e., freshly faulted in, add it to the youngest
> + * generation, and it's protected over the rest below.
> + * 2. If it can't be evicted immediately, i.e., a dirty page pending
> + * writeback, add it to the second youngest generation.
> + * 3. If it should be evicted first, e.g., cold and clean from
> + * folio_rotate_reclaimable(), add it to the oldest generation.
> + * 4. Everything else falls between 2 & 3 above and is added to the
> + * second oldest generation if it's considered inactive, or the
> + * oldest generation otherwise. See lru_gen_is_active().
> */
> if (folio_test_active(folio))
> seq = lrugen->max_seq;
> else if ((type == LRU_GEN_ANON && !folio_test_swapcache(folio)) ||
> (folio_test_reclaim(folio) &&
> (folio_test_dirty(folio) || folio_test_writeback(folio))))
> - seq = lrugen->min_seq[type] + 1;
> - else
> + seq = lrugen->max_seq - 1;
> + else if (reclaiming || lrugen->min_seq[type] + MIN_NR_GENS >= lrugen->max_seq)
> seq = lrugen->min_seq[type];
> + else
> + seq = lrugen->min_seq[type] + 1;

For example, maybe still keep the pages in the oldest gen by default, but
if the page has an eligible shadow, then put it in min_seq + 1?

>
> gen = lru_gen_from_seq(seq);
> flags = (gen + 1UL) << LRU_GEN_PGOFF;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4e3b835c6b4a..e67631c60ac0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4260,7 +4260,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> }
>
> /* protected */
> - if (tier > tier_idx) {
> + if (tier > tier_idx || refs == BIT(LRU_REFS_WIDTH)) {
> int hist = lru_hist_from_seq(lrugen->min_seq[type]);
>
> gen = folio_inc_gen(lruvec, folio, false);
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 7d3dacab8451..2a2a34234df9 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -313,10 +313,10 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
> * 1. For pages accessed through page tables, hotter pages pushed out
> * hot pages which refaulted immediately.
> * 2. For pages accessed multiple times through file descriptors,
> - * numbers of accesses might have been out of the range.
> + * they would have been protected by sort_folio().
> */
> - if (lru_gen_in_fault() || refs == BIT(LRU_REFS_WIDTH)) {
> - folio_set_workingset(folio);
> + if (lru_gen_in_fault() || refs >= BIT(LRU_REFS_WIDTH) - 1) {
> + set_mask_bits(&folio->flags, 0, LRU_REFS_MASK | BIT(PG_workingset));
> mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + type, delta);
> }

Also, this could be combined with the refault distance check for setting the
reference flag.

2023-12-11 22:02:22

by Yu Zhao

Subject: Re: [PATCH mm-unstable v1 2/4] mm/mglru: try to stop at high watermarks

On Fri, Dec 8, 2023 at 4:00 AM Hillf Danton <[email protected]> wrote:
>
> On Thu, 7 Dec 2023 23:14:05 -0700 Yu Zhao <[email protected]>
> > -static unsigned long get_nr_to_reclaim(struct scan_control *sc)
> > +static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
> > {
> > + int i;
> > + enum zone_watermarks mark;
> > +
> > /* don't abort memcg reclaim to ensure fairness */
> > if (!root_reclaim(sc))
> > - return -1;
> > + return false;
> >
> > - return max(sc->nr_to_reclaim, compact_gap(sc->order));
> > + if (sc->nr_reclaimed >= max(sc->nr_to_reclaim, compact_gap(sc->order)))
> > + return true;
> > +
> > + /* check the order to exclude compaction-induced reclaim */
> > + if (!current_is_kswapd() || sc->order)
> > + return false;
> > +
> > + mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
> > + WMARK_PROMO : WMARK_HIGH;
> > +
> > + for (i = 0; i <= sc->reclaim_idx; i++) {
> > + struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
> > + unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH;
> > +
> > + if (managed_zone(zone) && !zone_watermark_ok(zone, 0, size, sc->reclaim_idx, 0))
> > + return false;
> > + }
> > +
> > + /* kswapd should abort if all eligible zones are safe */
>
> This comment does not align with 86c79f6b5426
> ("mm: vmscan: do not reclaim from kswapd if there is any eligible zone").
> Anything special here?

I don't see how they don't align: they essentially say the same thing ("no
more than needed"), just in different units: zones or pages. IOW,
don't reclaim from more zones or pages than needed.

2023-12-11 22:07:39

by Yu Zhao

Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

On Fri, Dec 8, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
>
> Yu Zhao <[email protected]> wrote on Fri, Dec 8, 2023 at 14:14:
> >
> > Unmapped folios accessed through file descriptors can be
> > underprotected. Those folios are added to the oldest generation based
> > on:
> > 1. The fact that they are less costly to reclaim (no need to walk the
> > rmap and flush the TLB) and have less impact on performance (don't
> > cause major PFs and can be non-blocking if needed again).
> > 2. The observation that they are likely to be single-use. E.g., for
> > client use cases like Android, its apps parse configuration files
> > and store the data in heap (anon); for server use cases like MySQL,
> > it reads from InnoDB files and holds the cached data for tables in
> > buffer pools (anon).
> >
> > However, the oldest generation can be very short lived, and if so, it
> > doesn't provide the PID controller with enough time to respond to a
> > surge of refaults. (Note that the PID controller uses weighted
> > refaults and those from evicted generations only take a half of the
> > whole weight.) In other words, for a short lived generation, the
> > moving average smooths out the spike quickly.
> >
> > To fix the problem:
> > 1. For folios that are already on LRU, if they can be beyond the
> > tracking range of tiers, i.e., five accesses through file
> > descriptors, move them to the second oldest generation to give them
> > more time to age. (Note that tiers are used by the PID controller
> > to statistically determine whether folios accessed multiple times
> > through file descriptors are worth protecting.)
> > 2. When adding unmapped folios to LRU, adjust the placement of them so
> > that they are not too close to the tail. The effect of this is
> > similar to the above.
> >
> > On Android, launching 55 apps sequentially:
> > Before After Change
> > workingset_refault_anon 25641024 25598972 0%
> > workingset_refault_file 115016834 106178438 -8%
>
> Hi Yu,
>
> Thanks you for your amazing works on MGLRU.
>
> I believe this is the similar issue I was trying to resolve previously:
> https://lwn.net/Articles/945266/
> The idea is to use refault distance to decide if the page should be
> place in oldest generation or some other gen, which per my test,
> worked very well, and we have been using refault distance for MGLRU in
> multiple workloads.
>
> There are a few issues left in my previous RFC series, like anon pages
> in MGLRU shouldn't be considered, I wanted to collect feedback or test
> cases, but unfortunately it seems didn't get too much attention
> upstream.
>
> I think both this patch and my previous series are for solving the
> file pages underpertected issue, and I did a quick test using this
> series, for mongodb test, refault distance seems still a better
> solution (I'm not saying these two optimization are mutually exclusive
> though, just they do have some conflicts in implementation and solving
> similar problem):
>
> Previous result:
> ==================================================================
> Execution Results after 905 seconds
> ------------------------------------------------------------------
> Executed Time (µs) Rate
> STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
> ------------------------------------------------------------------
> TOTAL 2542 27121571486.2 0.09 txn/s
>
> This patch:
> ==================================================================
> Execution Results after 900 seconds
> ------------------------------------------------------------------
> Executed Time (µs) Rate
> STOCK_LEVEL 1594 27061522574.4 0.06 txn/s
> ------------------------------------------------------------------
> TOTAL 1594 27061522574.4 0.06 txn/s
>
> Unpatched version is always around ~500.

Thanks for the test results!

> I think there are a few points here:
> - Refault distance make use of page shadow so it can better
> distinguish evicted pages of different access pattern (re-access
> distance).
> - Throttled refault distance can help hold part of workingset when
> memory is too small to hold the whole workingset.
>
> So maybe part of this patch and the bits of previous series can be
> combined to work better on this issue, how do you think?

I'll try to find some time this week to look at your RFC. It'd be a
lot easier for me if you could share
1. your latest tree, preferably based on the mainline, and
2. your VM image containing the above test.

2023-12-12 06:52:52

by Kairui Song

Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

Yu Zhao <[email protected]> wrote on Tue, Dec 12, 2023 at 06:07:
>
> On Fri, Dec 8, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
> >
> > Yu Zhao <[email protected]> wrote on Fri, Dec 8, 2023 at 14:14:
> > >
> > > Unmapped folios accessed through file descriptors can be
> > > underprotected. Those folios are added to the oldest generation based
> > > on:
> > > 1. The fact that they are less costly to reclaim (no need to walk the
> > > rmap and flush the TLB) and have less impact on performance (don't
> > > cause major PFs and can be non-blocking if needed again).
> > > 2. The observation that they are likely to be single-use. E.g., for
> > > client use cases like Android, its apps parse configuration files
> > > and store the data in heap (anon); for server use cases like MySQL,
> > > it reads from InnoDB files and holds the cached data for tables in
> > > buffer pools (anon).
> > >
> > > However, the oldest generation can be very short lived, and if so, it
> > > doesn't provide the PID controller with enough time to respond to a
> > > surge of refaults. (Note that the PID controller uses weighted
> > > refaults and those from evicted generations only take a half of the
> > > whole weight.) In other words, for a short lived generation, the
> > > moving average smooths out the spike quickly.
> > >
> > > To fix the problem:
> > > 1. For folios that are already on LRU, if they can be beyond the
> > > tracking range of tiers, i.e., five accesses through file
> > > descriptors, move them to the second oldest generation to give them
> > > more time to age. (Note that tiers are used by the PID controller
> > > to statistically determine whether folios accessed multiple times
> > > through file descriptors are worth protecting.)
> > > 2. When adding unmapped folios to LRU, adjust the placement of them so
> > > that they are not too close to the tail. The effect of this is
> > > similar to the above.
> > >
> > > On Android, launching 55 apps sequentially:
> > > Before After Change
> > > workingset_refault_anon 25641024 25598972 0%
> > > workingset_refault_file 115016834 106178438 -8%
> >
> > Hi Yu,
> >
> > Thanks you for your amazing works on MGLRU.
> >
> > I believe this is the similar issue I was trying to resolve previously:
> > https://lwn.net/Articles/945266/
> > The idea is to use refault distance to decide if the page should be
> > place in oldest generation or some other gen, which per my test,
> > worked very well, and we have been using refault distance for MGLRU in
> > multiple workloads.
> >
> > There are a few issues left in my previous RFC series, like anon pages
> > in MGLRU shouldn't be considered, I wanted to collect feedback or test
> > cases, but unfortunately it seems didn't get too much attention
> > upstream.
> >
> > I think both this patch and my previous series are for solving the
> > file pages underpertected issue, and I did a quick test using this
> > series, for mongodb test, refault distance seems still a better
> > solution (I'm not saying these two optimization are mutually exclusive
> > though, just they do have some conflicts in implementation and solving
> > similar problem):
> >
> > Previous result:
> > ==================================================================
> > Execution Results after 905 seconds
> > ------------------------------------------------------------------
> > Executed Time (µs) Rate
> > STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
> > ------------------------------------------------------------------
> > TOTAL 2542 27121571486.2 0.09 txn/s
> >
> > This patch:
> > ==================================================================
> > Execution Results after 900 seconds
> > ------------------------------------------------------------------
> > Executed Time (µs) Rate
> > STOCK_LEVEL 1594 27061522574.4 0.06 txn/s
> > ------------------------------------------------------------------
> > TOTAL 1594 27061522574.4 0.06 txn/s
> >
> > Unpatched version is always around ~500.
>
> Thanks for the test results!
>
> > I think there are a few points here:
> > - Refault distance make use of page shadow so it can better
> > distinguish evicted pages of different access pattern (re-access
> > distance).
> > - Throttled refault distance can help hold part of workingset when
> > memory is too small to hold the whole workingset.
> >
> > So maybe part of this patch and the bits of previous series can be
> > combined to work better on this issue, how do you think?
>
> I'll try to find some time this week to look at your RFC. It'd be a

Thanks!

> lot easier for me if you could share
> 1. your latest tree, preferably based on the mainline, and
> 2. your VM image containing the above test.

Sure, I'll update the RFC and try to provide an easier test reproducer.

2023-12-13 03:03:35

by Kairui Song

Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

Kairui Song <[email protected]> wrote on Tue, Dec 12, 2023 at 14:52:
>
> Yu Zhao <[email protected]> wrote on Tue, Dec 12, 2023 at 06:07:
> >
> > On Fri, Dec 8, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
> > >
> > > Yu Zhao <[email protected]> wrote on Fri, Dec 8, 2023 at 14:14:
> > > >
> > > > Unmapped folios accessed through file descriptors can be
> > > > underprotected. Those folios are added to the oldest generation based
> > > > on:
> > > > 1. The fact that they are less costly to reclaim (no need to walk the
> > > > rmap and flush the TLB) and have less impact on performance (don't
> > > > cause major PFs and can be non-blocking if needed again).
> > > > 2. The observation that they are likely to be single-use. E.g., for
> > > > client use cases like Android, its apps parse configuration files
> > > > and store the data in heap (anon); for server use cases like MySQL,
> > > > it reads from InnoDB files and holds the cached data for tables in
> > > > buffer pools (anon).
> > > >
> > > > However, the oldest generation can be very short lived, and if so, it
> > > > doesn't provide the PID controller with enough time to respond to a
> > > > surge of refaults. (Note that the PID controller uses weighted
> > > > refaults and those from evicted generations only take a half of the
> > > > whole weight.) In other words, for a short lived generation, the
> > > > moving average smooths out the spike quickly.
> > > >
> > > > To fix the problem:
> > > > 1. For folios that are already on LRU, if they can be beyond the
> > > > tracking range of tiers, i.e., five accesses through file
> > > > descriptors, move them to the second oldest generation to give them
> > > > more time to age. (Note that tiers are used by the PID controller
> > > > to statistically determine whether folios accessed multiple times
> > > > through file descriptors are worth protecting.)
> > > > 2. When adding unmapped folios to LRU, adjust the placement of them so
> > > > that they are not too close to the tail. The effect of this is
> > > > similar to the above.
> > > >
> > > > On Android, launching 55 apps sequentially:
> > > > Before After Change
> > > > workingset_refault_anon 25641024 25598972 0%
> > > > workingset_refault_file 115016834 106178438 -8%
> > >
> > > Hi Yu,
> > >
> > > Thanks you for your amazing works on MGLRU.
> > >
> > > I believe this is the similar issue I was trying to resolve previously:
> > > https://lwn.net/Articles/945266/
> > > The idea is to use refault distance to decide if the page should be
> > > place in oldest generation or some other gen, which per my test,
> > > worked very well, and we have been using refault distance for MGLRU in
> > > multiple workloads.
> > >
> > > There are a few issues left in my previous RFC series, like anon pages
> > > in MGLRU shouldn't be considered, I wanted to collect feedback or test
> > > cases, but unfortunately it seems didn't get too much attention
> > > upstream.
> > >
> > > I think both this patch and my previous series are for solving the
> > > file pages underpertected issue, and I did a quick test using this
> > > series, for mongodb test, refault distance seems still a better
> > > solution (I'm not saying these two optimization are mutually exclusive
> > > though, just they do have some conflicts in implementation and solving
> > > similar problem):
> > >
> > > Previous result:
> > > ==================================================================
> > > Execution Results after 905 seconds
> > > ------------------------------------------------------------------
> > > Executed Time (µs) Rate
> > > STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
> > > ------------------------------------------------------------------
> > > TOTAL 2542 27121571486.2 0.09 txn/s
> > >
> > > This patch:
> > > ==================================================================
> > > Execution Results after 900 seconds
> > > ------------------------------------------------------------------
> > > Executed Time (µs) Rate
> > > STOCK_LEVEL 1594 27061522574.4 0.06 txn/s
> > > ------------------------------------------------------------------
> > > TOTAL 1594 27061522574.4 0.06 txn/s
> > >
> > > Unpatched version is always around ~500.
> >
> > Thanks for the test results!
> >
> > > I think there are a few points here:
> > > - Refault distance make use of page shadow so it can better
> > > distinguish evicted pages of different access pattern (re-access
> > > distance).
> > > - Throttled refault distance can help hold part of workingset when
> > > memory is too small to hold the whole workingset.
> > >
> > > So maybe part of this patch and the bits of previous series can be
> > > combined to work better on this issue, how do you think?
> >
> > I'll try to find some time this week to look at your RFC. It'd be a

Hi Yu,

I'm working on V4 of the RFC now, which just updates some comments and
skips anon page re-activation in the refault path for MGLRU, which was not
very helpful; only some tiny adjustments.
And I found it easier to test with fio, using the following test script:

#!/bin/bash
swapoff -a

modprobe brd rd_nr=1 rd_size=16777216
mkfs.ext4 /dev/ram0
mount /dev/ram0 /mnt

mkdir -p /sys/fs/cgroup/benchmark
cd /sys/fs/cgroup/benchmark

echo 4G > memory.max
echo $$ > cgroup.procs
echo 3 > /proc/sys/vm/drop_caches

fio -name=mglru --numjobs=12 --directory=/mnt --size=1024m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=zipf:0.5 --norandommap \
--time_based --ramp_time=5m --runtime=5m --group_reporting

zipf:0.5 is used here to simulate a cached read with a slight bias
towards certain pages.
Unpatched 6.7-rc4:
Run status group 0 (all jobs):
READ: bw=6548MiB/s (6866MB/s), 6548MiB/s-6548MiB/s
(6866MB/s-6866MB/s), io=1918GiB (2060GB), run=300001-300001msec

Patched with RFC v4:
Run status group 0 (all jobs):
READ: bw=7270MiB/s (7623MB/s), 7270MiB/s-7270MiB/s
(7623MB/s-7623MB/s), io=2130GiB (2287GB), run=300001-300001msec

Patched with this series:
Run status group 0 (all jobs):
READ: bw=7098MiB/s (7442MB/s), 7098MiB/s-7098MiB/s
(7442MB/s-7442MB/s), io=2079GiB (2233GB), run=300002-300002msec

MGLRU off:
Run status group 0 (all jobs):
READ: bw=6525MiB/s (6842MB/s), 6525MiB/s-6525MiB/s
(6842MB/s-6842MB/s), io=1912GiB (2052GB), run=300002-300002msec

- If I change zipf:0.5 to random:
Unpatched 6.7-rc4:
Patched with this series:
Run status group 0 (all jobs):
READ: bw=5975MiB/s (6265MB/s), 5975MiB/s-5975MiB/s
(6265MB/s-6265MB/s), io=1750GiB (1879GB), run=300002-300002msec

Patched with RFC v4:
Run status group 0 (all jobs):
READ: bw=5987MiB/s (6278MB/s), 5987MiB/s-5987MiB/s
(6278MB/s-6278MB/s), io=1754GiB (1883GB), run=300001-300001msec

Patched with this series:
Run status group 0 (all jobs):
READ: bw=5839MiB/s (6123MB/s), 5839MiB/s-5839MiB/s
(6123MB/s-6123MB/s), io=1711GiB (1837GB), run=300001-300001msec

MGLRU off:
Run status group 0 (all jobs):
READ: bw=5689MiB/s (5965MB/s), 5689MiB/s-5689MiB/s
(5965MB/s-5965MB/s), io=1667GiB (1790GB), run=300003-300003msec

fio uses a ramdisk, so LRU accuracy will have a smaller impact. The MongoDB
test I provided before uses a SATA SSD, so it will have a much higher
impact. I'll provide a script to set up the test case and run it; it's
more complex to set up than fio since it involves setting up multiple
replicas and auth and hundreds of GB of test fixtures. I'm currently
occupied by some other tasks but will try my best to send them out as
soon as possible.

2023-12-13 08:00:42

by Yu Zhao

Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

On Tue, Dec 12, 2023 at 8:03 PM Kairui Song <[email protected]> wrote:
>
> Kairui Song <[email protected]> wrote on Tue, Dec 12, 2023 at 14:52:
> >
> > Yu Zhao <[email protected]> wrote on Tue, Dec 12, 2023 at 06:07:
> > >
> > > On Fri, Dec 8, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
> > > >
> > > > Yu Zhao <[email protected]> wrote on Fri, Dec 8, 2023 at 14:14:
> > > > >
> > > > > Unmapped folios accessed through file descriptors can be
> > > > > underprotected. Those folios are added to the oldest generation based
> > > > > on:
> > > > > 1. The fact that they are less costly to reclaim (no need to walk the
> > > > > rmap and flush the TLB) and have less impact on performance (don't
> > > > > cause major PFs and can be non-blocking if needed again).
> > > > > 2. The observation that they are likely to be single-use. E.g., for
> > > > > client use cases like Android, its apps parse configuration files
> > > > > and store the data in heap (anon); for server use cases like MySQL,
> > > > > it reads from InnoDB files and holds the cached data for tables in
> > > > > buffer pools (anon).
> > > > >
> > > > > However, the oldest generation can be very short lived, and if so, it
> > > > > doesn't provide the PID controller with enough time to respond to a
> > > > > surge of refaults. (Note that the PID controller uses weighted
> > > > > refaults and those from evicted generations only take a half of the
> > > > > whole weight.) In other words, for a short lived generation, the
> > > > > moving average smooths out the spike quickly.
> > > > >
> > > > > To fix the problem:
> > > > > 1. For folios that are already on LRU, if they can be beyond the
> > > > > tracking range of tiers, i.e., five accesses through file
> > > > > descriptors, move them to the second oldest generation to give them
> > > > > more time to age. (Note that tiers are used by the PID controller
> > > > > to statistically determine whether folios accessed multiple times
> > > > > through file descriptors are worth protecting.)
> > > > > 2. When adding unmapped folios to LRU, adjust the placement of them so
> > > > > that they are not too close to the tail. The effect of this is
> > > > > similar to the above.
> > > > >
> > > > > On Android, launching 55 apps sequentially:
> > > > > Before After Change
> > > > > workingset_refault_anon 25641024 25598972 0%
> > > > > workingset_refault_file 115016834 106178438 -8%
> > > >
> > > > Hi Yu,
> > > >
> > > > Thanks you for your amazing works on MGLRU.
> > > >
> > > > I believe this is the similar issue I was trying to resolve previously:
> > > > https://lwn.net/Articles/945266/
> > > > The idea is to use refault distance to decide if the page should be
> > > > place in oldest generation or some other gen, which per my test,
> > > > worked very well, and we have been using refault distance for MGLRU in
> > > > multiple workloads.
> > > >
> > > > There are a few issues left in my previous RFC series, like anon pages
> > > > in MGLRU shouldn't be considered, I wanted to collect feedback or test
> > > > cases, but unfortunately it seems didn't get too much attention
> > > > upstream.
> > > >
> > > > I think both this patch and my previous series are for solving the
> > > > file pages underpertected issue, and I did a quick test using this
> > > > series, for mongodb test, refault distance seems still a better
> > > > solution (I'm not saying these two optimization are mutually exclusive
> > > > though, just they do have some conflicts in implementation and solving
> > > > similar problem):
> > > >
> > > > Previous result:
> > > > ==================================================================
> > > > Execution Results after 905 seconds
> > > > ------------------------------------------------------------------
> > > > Executed Time (µs) Rate
> > > > STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
> > > > ------------------------------------------------------------------
> > > > TOTAL 2542 27121571486.2 0.09 txn/s
> > > >
> > > > This patch:
> > > > ==================================================================
> > > > Execution Results after 900 seconds
> > > > ------------------------------------------------------------------
> > > > Executed Time (µs) Rate
> > > > STOCK_LEVEL 1594 27061522574.4 0.06 txn/s
> > > > ------------------------------------------------------------------
> > > > TOTAL 1594 27061522574.4 0.06 txn/s
> > > >
> > > > Unpatched version is always around ~500.
> > >
> > > Thanks for the test results!
> > >
> > > > I think there are a few points here:
> > > > - Refault distance make use of page shadow so it can better
> > > > distinguish evicted pages of different access pattern (re-access
> > > > distance).
> > > > - Throttled refault distance can help hold part of workingset when
> > > > memory is too small to hold the whole workingset.
> > > >
> > > > So maybe part of this patch and the bits of previous series can be
> > > > combined to work better on this issue, how do you think?
> > >
> > > I'll try to find some time this week to look at your RFC. It'd be a
>
> Hi Yu,
>
> I'm working on V4 of the RFC now, which just update some comments, and
> skip anon page re-activation in refault path for mglru which was not
> very helpful, only some tiny adjustment.
> And I found it easier to test with fio, using following test script:
>
> #!/bin/bash
> swapoff -a
>
> modprobe brd rd_nr=1 rd_size=16777216
> mkfs.ext4 /dev/ram0
> mount /dev/ram0 /mnt
>
> mkdir -p /sys/fs/cgroup/benchmark
> cd /sys/fs/cgroup/benchmark
>
> echo 4G > memory.max
> echo $$ > cgroup.procs
> echo 3 > /proc/sys/vm/drop_caches
>
> fio -name=mglru --numjobs=12 --directory=/mnt --size=1024m \
> --buffered=1 --ioengine=io_uring --iodepth=128 \
> --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
> --rw=randread --random_distribution=zipf:0.5 --norandommap \
> --time_based --ramp_time=5m --runtime=5m --group_reporting
>
> zipf:0.5 is used here to simulate a cached read with slight bias
> towards certain pages.
> Unpatched 6.7-rc4:
> Run status group 0 (all jobs):
> READ: bw=6548MiB/s (6866MB/s), 6548MiB/s-6548MiB/s
> (6866MB/s-6866MB/s), io=1918GiB (2060GB), run=300001-300001msec
>
> Patched with RFC v4:
> Run status group 0 (all jobs):
> READ: bw=7270MiB/s (7623MB/s), 7270MiB/s-7270MiB/s
> (7623MB/s-7623MB/s), io=2130GiB (2287GB), run=300001-300001msec
>
> Patched with this series:
> Run status group 0 (all jobs):
> READ: bw=7098MiB/s (7442MB/s), 7098MiB/s-7098MiB/s
> (7442MB/s-7442MB/s), io=2079GiB (2233GB), run=300002-300002msec
>
> MGLRU off:
> Run status group 0 (all jobs):
> READ: bw=6525MiB/s (6842MB/s), 6525MiB/s-6525MiB/s
> (6842MB/s-6842MB/s), io=1912GiB (2052GB), run=300002-300002msec
>
> - If I change zipf:0.5 to random:
> Unpatched 6.7-rc4:
> Patched with this series:
> Run status group 0 (all jobs):
> READ: bw=5975MiB/s (6265MB/s), 5975MiB/s-5975MiB/s
> (6265MB/s-6265MB/s), io=1750GiB (1879GB), run=300002-300002msec
>
> Patched with RFC v4:
> Run status group 0 (all jobs):
> READ: bw=5987MiB/s (6278MB/s), 5987MiB/s-5987MiB/s
> (6278MB/s-6278MB/s), io=1754GiB (1883GB), run=300001-300001msec
>
> Patched with this series:
> Run status group 0 (all jobs):
> READ: bw=5839MiB/s (6123MB/s), 5839MiB/s-5839MiB/s
> (6123MB/s-6123MB/s), io=1711GiB (1837GB), run=300001-300001msec
>
> MGLRU off:
> Run status group 0 (all jobs):
> READ: bw=5689MiB/s (5965MB/s), 5689MiB/s-5689MiB/s
> (5965MB/s-5965MB/s), io=1667GiB (1790GB), run=300003-300003msec
>
> fio uses a ramdisk, so LRU accuracy has a smaller impact. The MongoDB
> test I provided before uses a SATA SSD, so it will have a much higher
> impact. I'll provide a script to set up the test case and run it; it's
> more complex to set up than fio since it involves setting up multiple
> replicas, auth, and hundreds of GB of test fixtures. I'm currently
> occupied by some other tasks but will try my best to send it out as
> soon as possible.

Thanks! Apparently your RFC did show better IOPS with both access
patterns, which was a surprise to me because it had higher refaults
and usually higher refaults result in worse performance.

So I'm still trying to figure out why it turned out the opposite. My
current guess is that:
1. It had a very small but stable inactive LRU list, which was able to
fit into the L3 cache entirely.
2. It counted few folios as workingset and therefore incurred less
overhead from CONFIG_PSI and/or CONFIG_TASK_DELAY_ACCT.

Did you save workingset_refault_file when you ran the test? If so, can
you check the difference between this series and your RFC?
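
For reference, a quick way to capture that counter around each fio run,
assuming the cgroup v2 "benchmark" group from the test script above, is
to sample the cgroup's memory.stat before and after the run and compare
the deltas:

grep workingset_refault_file /sys/fs/cgroup/benchmark/memory.stat

(workingset_refault_anon is reported in the same file.)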

2023-12-14 03:09:47

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

On Wed, Dec 13, 2023 at 12:59:14AM -0700, Yu Zhao wrote:
> On Tue, Dec 12, 2023 at 8:03 PM Kairui Song <[email protected]> wrote:
> >
> > Kairui Song <[email protected]> wrote on Tue, Dec 12, 2023 at 14:52:
> > >
> > > Yu Zhao <[email protected]> wrote on Tue, Dec 12, 2023 at 06:07:
> > > >
> > > > On Fri, Dec 8, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
> > > > >
> > > > > Yu Zhao <[email protected]> wrote on Fri, Dec 8, 2023 at 14:14:
> > > > > >
> > > > > > Unmapped folios accessed through file descriptors can be
> > > > > > underprotected. Those folios are added to the oldest generation based
> > > > > > on:
> > > > > > 1. The fact that they are less costly to reclaim (no need to walk the
> > > > > > rmap and flush the TLB) and have less impact on performance (don't
> > > > > > cause major PFs and can be non-blocking if needed again).
> > > > > > 2. The observation that they are likely to be single-use. E.g., for
> > > > > > client use cases like Android, its apps parse configuration files
> > > > > > and store the data in heap (anon); for server use cases like MySQL,
> > > > > > it reads from InnoDB files and holds the cached data for tables in
> > > > > > buffer pools (anon).
> > > > > >
> > > > > > However, the oldest generation can be very short lived, and if so, it
> > > > > > doesn't provide the PID controller with enough time to respond to a
> > > > > > surge of refaults. (Note that the PID controller uses weighted
> > > > > > refaults and those from evicted generations only take a half of the
> > > > > > whole weight.) In other words, for a short lived generation, the
> > > > > > moving average smooths out the spike quickly.
> > > > > >
> > > > > > To fix the problem:
> > > > > > 1. For folios that are already on LRU, if they can be beyond the
> > > > > > tracking range of tiers, i.e., five accesses through file
> > > > > > descriptors, move them to the second oldest generation to give them
> > > > > > more time to age. (Note that tiers are used by the PID controller
> > > > > > to statistically determine whether folios accessed multiple times
> > > > > > through file descriptors are worth protecting.)
> > > > > > 2. When adding unmapped folios to LRU, adjust the placement of them so
> > > > > > that they are not too close to the tail. The effect of this is
> > > > > > similar to the above.
> > > > > >
> > > > > > On Android, launching 55 apps sequentially:
> > > > > > Before After Change
> > > > > > workingset_refault_anon 25641024 25598972 0%
> > > > > > workingset_refault_file 115016834 106178438 -8%
> > > > >
> > > > > Hi Yu,
> > > > >
> > > > > Thanks you for your amazing works on MGLRU.
> > > > >
> > > > > I believe this is the similar issue I was trying to resolve previously:
> > > > > https://lwn.net/Articles/945266/
> > > > > The idea is to use refault distance to decide if the page should be
> > > > > place in oldest generation or some other gen, which per my test,
> > > > > worked very well, and we have been using refault distance for MGLRU in
> > > > > multiple workloads.
> > > > >
> > > > > There are a few issues left in my previous RFC series, like anon pages
> > > > > in MGLRU shouldn't be considered, I wanted to collect feedback or test
> > > > > cases, but unfortunately it seems didn't get too much attention
> > > > > upstream.
> > > > >
> > > > > I think both this patch and my previous series are for solving the
> > > > > file pages underpertected issue, and I did a quick test using this
> > > > > series, for mongodb test, refault distance seems still a better
> > > > > solution (I'm not saying these two optimization are mutually exclusive
> > > > > though, just they do have some conflicts in implementation and solving
> > > > > similar problem):
> > > > >
> > > > > Previous result:
> > > > > ==================================================================
> > > > > Execution Results after 905 seconds
> > > > > ------------------------------------------------------------------
> > > > > Executed Time (µs) Rate
> > > > > STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
> > > > > ------------------------------------------------------------------
> > > > > TOTAL 2542 27121571486.2 0.09 txn/s
> > > > >
> > > > > This patch:
> > > > > ==================================================================
> > > > > Execution Results after 900 seconds
> > > > > ------------------------------------------------------------------
> > > > > Executed Time (µs) Rate
> > > > > STOCK_LEVEL 1594 27061522574.4 0.06 txn/s
> > > > > ------------------------------------------------------------------
> > > > > TOTAL 1594 27061522574.4 0.06 txn/s
> > > > >
> > > > > Unpatched version is always around ~500.
> > > >
> > > > Thanks for the test results!
> > > >
> > > > > I think there are a few points here:
> > > > > - Refault distance makes use of page shadows, so it can better
> > > > > distinguish evicted pages with different access patterns (re-access
> > > > > distance).
> > > > > - Throttled refault distance can help hold part of the workingset
> > > > > when memory is too small to hold the whole workingset.
> > > > >
> > > > > So maybe parts of this patch and bits of the previous series can be
> > > > > combined to work better on this issue; what do you think?
> > > >
> > > > I'll try to find some time this week to look at your RFC. It'd be a
> >
> > Hi Yu,
> >
> > I'm working on V4 of the RFC now, which just updates some comments and
> > skips anon page re-activation in the refault path for MGLRU, which was
> > not very helpful; only some tiny adjustments.
> > And I found it easier to test with fio, using the following test script:
> >
> > #!/bin/bash
> > swapoff -a
> >
> > modprobe brd rd_nr=1 rd_size=16777216
> > mkfs.ext4 /dev/ram0
> > mount /dev/ram0 /mnt
> >
> > mkdir -p /sys/fs/cgroup/benchmark
> > cd /sys/fs/cgroup/benchmark
> >
> > echo 4G > memory.max
> > echo $$ > cgroup.procs
> > echo 3 > /proc/sys/vm/drop_caches
> >
> > fio -name=mglru --numjobs=12 --directory=/mnt --size=1024m \
> > --buffered=1 --ioengine=io_uring --iodepth=128 \
> > --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
> > --rw=randread --random_distribution=zipf:0.5 --norandommap \
> > --time_based --ramp_time=5m --runtime=5m --group_reporting
> >
> > zipf:0.5 is used here to simulate cached reads with a slight bias
> > towards certain pages.
> > Unpatched 6.7-rc4:
> > Run status group 0 (all jobs):
> > READ: bw=6548MiB/s (6866MB/s), 6548MiB/s-6548MiB/s
> > (6866MB/s-6866MB/s), io=1918GiB (2060GB), run=300001-300001msec
> >
> > Patched with RFC v4:
> > Run status group 0 (all jobs):
> > READ: bw=7270MiB/s (7623MB/s), 7270MiB/s-7270MiB/s
> > (7623MB/s-7623MB/s), io=2130GiB (2287GB), run=300001-300001msec
> >
> > Patched with this series:
> > Run status group 0 (all jobs):
> > READ: bw=7098MiB/s (7442MB/s), 7098MiB/s-7098MiB/s
> > (7442MB/s-7442MB/s), io=2079GiB (2233GB), run=300002-300002msec
> >
> > MGLRU off:
> > Run status group 0 (all jobs):
> > READ: bw=6525MiB/s (6842MB/s), 6525MiB/s-6525MiB/s
> > (6842MB/s-6842MB/s), io=1912GiB (2052GB), run=300002-300002msec
> >
> > - If I change zipf:0.5 to random:
> > Unpatched 6.7-rc4:
> > Run status group 0 (all jobs):
> > READ: bw=5975MiB/s (6265MB/s), 5975MiB/s-5975MiB/s
> > (6265MB/s-6265MB/s), io=1750GiB (1879GB), run=300002-300002msec
> >
> > Patched with RFC v4:
> > Run status group 0 (all jobs):
> > READ: bw=5987MiB/s (6278MB/s), 5987MiB/s-5987MiB/s
> > (6278MB/s-6278MB/s), io=1754GiB (1883GB), run=300001-300001msec
> >
> > Patched with this series:
> > Run status group 0 (all jobs):
> > READ: bw=5839MiB/s (6123MB/s), 5839MiB/s-5839MiB/s
> > (6123MB/s-6123MB/s), io=1711GiB (1837GB), run=300001-300001msec
> >
> > MGLRU off:
> > Run status group 0 (all jobs):
> > READ: bw=5689MiB/s (5965MB/s), 5689MiB/s-5689MiB/s
> > (5965MB/s-5965MB/s), io=1667GiB (1790GB), run=300003-300003msec
> >
> > fio uses a ramdisk, so LRU accuracy has a smaller impact. The MongoDB
> > test I provided before uses a SATA SSD, so it will have a much higher
> > impact. I'll provide a script to set up the test case and run it; it's
> > more complex to set up than fio since it involves setting up multiple
> > replicas, auth, and hundreds of GB of test fixtures. I'm currently
> > occupied by some other tasks but will try my best to send it out as
> > soon as possible.
>
> Thanks! Apparently your RFC did show better IOPS with both access
> patterns, which was a surprise to me because it had higher refaults
> and usually higher refaults result in worse performance.
>
> So I'm still trying to figure out why it turned out the opposite. My
> current guess is that:
> 1. It had a very small but stable inactive LRU list, which was able to
> fit into the L3 cache entirely.
> 2. It counted few folios as workingset and therefore incurred less
> overhead from CONFIG_PSI and/or CONFIG_TASK_DELAY_ACCT.
>
> Did you save workingset_refault_file when you ran the test? If so, can
> you check the difference between this series and your RFC?


It seems I was right about #1 above. After I scaled your test up by 20x,
I saw my series performed ~5% faster with zipf and ~9% faster with random
accesses.
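
(Taking the aggregate READ bandwidths reported below: 19.4 GiB/s vs
18.4 GiB/s for zipf is ~5.4%, and 14.7 GiB/s vs 13.4 GiB/s for random is
~9.7%.)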

IOW, I increased rd_size from 16GB to 320GB, memory.max from 4GB to 80GB,
--numjobs from 12 to 60 and --size from 1GB to 4GB.
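
That keeps the same 3:1 ratio of file data to memory as the original
script: 60 jobs x 4 GiB = 240 GiB of files against an 80 GiB memory.max,
versus 12 GiB against 4 GiB before. Assuming rd_size is still given in
KiB as in the original modprobe line, the scaled ramdisk would be created
with something like:

modprobe brd rd_nr=1 rd_size=335544320    # 320 GiB, i.e. 20x 16777216

(the rest of the setup is in fio.sh below.)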

v6.7-rc5 + this series
======================

zipf
----

mglru: (groupid=0, jobs=60): err= 0: pid=12155: Wed Dec 13 17:50:36 2023
read: IOPS=5074k, BW=19.4GiB/s (20.8GB/s)(5807GiB/300007msec)
slat (usec): min=36, max=109326, avg=363.67, stdev=1829.97
clat (nsec): min=783, max=113292k, avg=1136755.10, stdev=3162056.05
lat (usec): min=37, max=149232, avg=1500.43, stdev=3644.21
clat percentiles (usec):
| 1.00th=[ 490], 5.00th=[ 519], 10.00th=[ 537], 20.00th=[ 553],
| 30.00th=[ 570], 40.00th=[ 586], 50.00th=[ 627], 60.00th=[ 840],
| 70.00th=[ 988], 80.00th=[ 1074], 90.00th=[ 1188], 95.00th=[ 1336],
| 99.00th=[ 7308], 99.50th=[31327], 99.90th=[36963], 99.95th=[45351],
| 99.99th=[53216]
bw ( MiB/s): min= 8332, max=27116, per=100.00%, avg=19846.67, stdev=58.20, samples=35903
iops : min=2133165, max=6941826, avg=5080741.79, stdev=14899.13, samples=35903
lat (nsec) : 1000=0.01%
lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%
lat (usec) : 250=0.01%, 500=1.76%, 750=52.94%, 1000=16.65%
lat (msec) : 2=26.22%, 4=0.15%, 10=1.36%, 20=0.01%, 50=0.90%
lat (msec) : 100=0.02%, 250=0.01%
cpu : usr=5.42%, sys=87.59%, ctx=470315, majf=0, minf=2184
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
submit : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=100.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.1%
issued rwts: total=1522384845,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
READ: bw=19.4GiB/s (20.8GB/s), 19.4GiB/s-19.4GiB/s (20.8GB/s-20.8GB/s), io=5807GiB (6236GB), run=300007-300007msec

Disk stats (read/write):
ram0: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
mglru: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=128

random
------

mglru: (groupid=0, jobs=60): err= 0: pid=12576: Wed Dec 13 18:00:50 2023
read: IOPS=3853k, BW=14.7GiB/s (15.8GB/s)(4410GiB/300014msec)
slat (usec): min=58, max=118605, avg=486.45, stdev=2311.45
clat (usec): min=3, max=169810, avg=1496.60, stdev=3982.89
lat (usec): min=73, max=170019, avg=1983.06, stdev=4585.87
clat percentiles (usec):
| 1.00th=[ 586], 5.00th=[ 627], 10.00th=[ 644], 20.00th=[ 668],
| 30.00th=[ 693], 40.00th=[ 725], 50.00th=[ 816], 60.00th=[ 1123],
| 70.00th=[ 1221], 80.00th=[ 1352], 90.00th=[ 1516], 95.00th=[ 1713],
| 99.00th=[31851], 99.50th=[34866], 99.90th=[41681], 99.95th=[54264],
| 99.99th=[61080]
bw ( MiB/s): min= 6049, max=21328, per=100.00%, avg=15070.00, stdev=45.96, samples=35940
iops : min=1548543, max=5459997, avg=3857912.87, stdev=11765.30, samples=35940
lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 100=0.01%, 250=0.01%
lat (usec) : 500=0.01%, 750=44.64%, 1000=8.20%
lat (msec) : 2=43.84%, 4=0.27%, 10=1.79%, 20=0.01%, 50=1.20%
lat (msec) : 100=0.07%, 250=0.01%
cpu : usr=3.19%, sys=89.87%, ctx=463840, majf=0, minf=2248
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
submit : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.1%
issued rwts: total=1155923744,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
READ: bw=14.7GiB/s (15.8GB/s), 14.7GiB/s-14.7GiB/s (15.8GB/s-15.8GB/s), io=4410GiB (4735GB), run=300014-300014msec

Disk stats (read/write):
ram0: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

memcg 3 /zipf
node 0
0 1521654 0 0x
0 0r 0e 0p 0 0 0
1 0r 0e 0p 0 0 0
2 0r 0e 0p 0 0 0
3 0r 0e 0p 0 0 0
0 0 0 0 0 0
1 1521654 0 21
0 0 0 0 1077016797r 1111542014e 0p
1 0 0 0 317997853r 324814007e 0p
2 0 0 0 68064253r 68866308e 124302p
3 0 0 0 0r 0e 12282816p
0 0 0 0 0 0
2 1521654 0 0
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
3 1521654 0 0
0 0R 0T 0 0R 0T 0
1 0R 0T 0 0R 0T 0
2 0R 0T 0 0R 0T 0
3 0R 0T 0 0R 0T 0
0L 0O 0Y 0N 0F 0A
node 1
0 1521654 0 0
0 0r 0e 0p 0r 0e 0p
1 0r 0e 0p 0r 0e 0p
2 0r 0e 0p 0r 0e 0p
3 0r 0e 0p 0r 0e 0p
0 0 0 0 0 0
1 1521654 0 0
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
2 1521654 0 0
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
3 1521654 0 0
0 0R 0T 0 0R 0T 0
1 0R 0T 0 0R 0T 0
2 0R 0T 0 0R 0T 0
3 0R 0T 0 0R 0T 0
0L 0O 0Y 0N 0F 0A
memcg 4 /random
node 0
0 600431 0 0x
0 0r 0e 0p 0 0 0
1 0r 0e 0p 0 0 0
2 0r 0e 0p 0 0 0
3 0r 0e 0p 0 0 0
0 0 0 0 0 0
1 600431 0 11169201
0 0 0 0 1071724785r 1103937007e 0p
1 0 0 0 376193810r 384852629e 0p
2 0 0 0 77315518r 78596395e 0p
3 0 0 0 0r 0e 9593442p
0 0 0 0 0 0
2 600431 1 9593442
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
3 600431 36 754
0 0R 0T 0 0R 0T 0
1 0R 0T 0 0R 0T 0
2 0R 0T 0 0R 0T 0
3 0R 0T 0 0R 0T 0
0L 0O 0Y 0N 0F 0A
node 1
0 600431 0 0
0 0r 0e 0p 0r 0e 0p
1 0r 0e 0p 0r 0e 0p
2 0r 0e 0p 0r 0e 0p
3 0r 0e 0p 0r 0e 0p
0 0 0 0 0 0
1 600431 0 0
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
2 600431 0 0
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
3 600431 0 0
0 0R 0T 0 0R 0T 0
1 0R 0T 0 0R 0T 0
2 0R 0T 0 0R 0T 0
3 0R 0T 0 0R 0T 0
0L 0O 0Y 0N 0F 0A

v6.7-rc5 + RFC v3
=================

zipf
----

mglru: (groupid=0, jobs=60): err= 0: pid=11600: Wed Dec 13 18:34:31 2023
read: IOPS=4816k, BW=18.4GiB/s (19.7GB/s)(5512GiB/300014msec)
slat (usec): min=3, max=121722, avg=384.46, stdev=2066.10
clat (nsec): min=356, max=174717k, avg=1197513.60, stdev=3568734.58
lat (usec): min=3, max=174919, avg=1581.97, stdev=4112.49
clat percentiles (usec):
| 1.00th=[ 486], 5.00th=[ 515], 10.00th=[ 529], 20.00th=[ 553],
| 30.00th=[ 570], 40.00th=[ 594], 50.00th=[ 652], 60.00th=[ 898],
| 70.00th=[ 988], 80.00th=[ 1139], 90.00th=[ 1254], 95.00th=[ 1369],
| 99.00th=[ 6915], 99.50th=[35914], 99.90th=[42206], 99.95th=[52167],
| 99.99th=[61604]
bw ( MiB/s): min= 7716, max=26325, per=100.00%, avg=18836.65, stdev=57.20, samples=35880
iops : min=1975306, max=6739280, avg=4822176.85, stdev=14642.35, samples=35880
lat (nsec) : 500=0.01%, 750=0.01%, 1000=0.01%
lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 100=0.01%, 250=0.01%
lat (usec) : 500=2.57%, 750=50.99%, 1000=17.56%
lat (msec) : 2=26.41%, 4=0.16%, 10=1.41%, 20=0.01%, 50=0.84%
lat (msec) : 100=0.05%, 250=0.01%
cpu : usr=4.95%, sys=88.09%, ctx=457609, majf=0, minf=2184
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
submit : 0=0.0%, 4=0.1%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.1%
issued rwts: total=1445015808,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
READ: bw=18.4GiB/s (19.7GB/s), 18.4GiB/s-18.4GiB/s (19.7GB/s-19.7GB/s), io=5512GiB (5919GB), run=300014-300014msec

Disk stats (read/write):
ram0: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
mglru: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=128

random
------

mglru: (groupid=0, jobs=60): err= 0: pid=12024: Wed Dec 13 18:44:45 2023
read: IOPS=3519k, BW=13.4GiB/s (14.4GB/s)(4027GiB/300011msec)
slat (usec): min=54, max=136278, avg=534.57, stdev=2738.72
clat (usec): min=3, max=176186, avg=1638.66, stdev=4714.55
lat (usec): min=78, max=176426, avg=2173.23, stdev=5426.40
clat percentiles (usec):
| 1.00th=[ 627], 5.00th=[ 676], 10.00th=[ 693], 20.00th=[ 725],
| 30.00th=[ 766], 40.00th=[ 816], 50.00th=[ 1090], 60.00th=[ 1205],
| 70.00th=[ 1270], 80.00th=[ 1369], 90.00th=[ 1500], 95.00th=[ 1614],
| 99.00th=[38536], 99.50th=[41681], 99.90th=[47973], 99.95th=[65799],
| 99.99th=[72877]
bw ( MiB/s): min= 5586, max=20476, per=100.00%, avg=13760.26, stdev=45.33, samples=35904
iops : min=1430070, max=5242110, avg=3522621.15, stdev=11604.46, samples=35904
lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 100=0.01%, 250=0.01%
lat (usec) : 500=0.01%, 750=26.33%, 1000=21.81%
lat (msec) : 2=48.54%, 4=0.16%, 10=1.91%, 20=0.01%, 50=1.17%
lat (msec) : 100=0.09%, 250=0.01%
cpu : usr=2.74%, sys=90.35%, ctx=481356, majf=0, minf=2244
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
submit : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.1%
issued rwts: total=1055590880,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
READ: bw=13.4GiB/s (14.4GB/s), 13.4GiB/s-13.4GiB/s (14.4GB/s-14.4GB/s), io=4027GiB (4324GB), run=300011-300011msec

Disk stats (read/write):
ram0: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

memcg 3 /zipf
node 0
0 1522519 0 22
0 0r 0e 0p 996363383r 1092111170e 0p
1 0r 0e 0p 274581982r 235766575e 0p
2 0r 0e 0p 85176438r 71356676e 96114p
3 0r 0e 0p 12470364r 11510461e 221796p
0 0 0 0 0 0
1 1522519 0 0
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
2 1522519 0 0
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
3 1522519 0 0
0 0R 0T 0 0R 0T 0
1 0R 0T 0 0R 0T 0
2 0R 0T 0 0R 0T 0
3 0R 0T 0 0R 0T 0
0L 0O 0Y 0N 0F 0A
node 1
0 1522519 0 0
0 0r 0e 0p 0r 0e 0p
1 0r 0e 0p 0r 0e 0p
2 0r 0e 0p 0r 0e 0p
3 0r 0e 0p 0r 0e 0p
0 0 0 0 0 0
1 1522519 0 0
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
2 1522519 0 0
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
3 1522519 0 0
0 0R 0T 0 0R 0T 0
1 0R 0T 0 0R 0T 0
2 0R 0T 0 0R 0T 0
3 0R 0T 0 0R 0T 0
0L 0O 0Y 0N 0F 0A
memcg 4 /random
node 0
0 600413 0 2289676
0 0r 0e 0p 875605725r 960492874e 0p
1 0r 0e 0p 411230731r 383704269e 0p
2 0r 0e 0p 112639317r 97774351e 0p
3 0r 0e 0p 2103334r 1766407e 0p
0 0 0 0 0 0
1 600413 1 0
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
2 600413 0 0
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
3 600413 35 18466878
0 0R 0T 0 0R 0T 0
1 0R 0T 0 0R 0T 0
2 0R 0T 0 0R 0T 0
3 0R 0T 0 0R 0T 0
0L 0O 0Y 0N 0F 0A
node 1
0 600413 0 0
0 0r 0e 0p 0r 0e 0p
1 0r 0e 0p 0r 0e 0p
2 0r 0e 0p 0r 0e 0p
3 0r 0e 0p 0r 0e 0p
0 0 0 0 0 0
1 600413 0 0
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
2 600413 0 0
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
3 600413 0 0
0 0R 0T 0 0R 0T 0
1 0R 0T 0 0R 0T 0
2 0R 0T 0 0R 0T 0
3 0R 0T 0 0R 0T 0
0L 0O 0Y 0N 0F 0A

# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83
node 0 size: 385748 MB
node 0 free: 383735 MB
node 1 cpus: 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111
node 1 size: 387047 MB
node 1 free: 137896 MB
node distances:
node 0 1
0: 10 21
1: 21 10

# git diff
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 970bd6ff38c4..ca51cfdf34af 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -75,7 +75,7 @@ static int brd_insert_page(struct brd_device *brd, sector_t sector, gfp_t gfp)
if (page)
return 0;

- page = alloc_page(gfp | __GFP_ZERO | __GFP_HIGHMEM);
+ page = alloc_pages_node(1, gfp | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE, 0);
if (!page)
return -ENOMEM;
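
(For what it's worth, this hunk appears to pin the ramdisk's backing
pages to NUMA node 1: alloc_pages_node(1, ... | __GFP_THISNODE, 0)
allocates order-0 pages from node 1 only, while the "numactl -N 0 -m 0"
invocation below keeps fio and its page cache on node 0, presumably so
the ramdisk itself does not compete for node 0 memory.)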


# cat /proc/swaps
Filename Type Size Used Priority

# cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

# fio --version
fio-3.36

# cat fio.sh
mkfs.ext4 /dev/ram0
mount /dev/ram0 /mnt
cd /sys/fs/cgroup/

mkdir zipf
cd zipf
echo 80G >memory.max
echo $$ >cgroup.procs
echo 3 >/proc/sys/vm/drop_caches
fio -name=mglru --numjobs=60 --directory=/mnt --size=4096m --buffered=1 --ioengine=io_uring --iodepth=128 --iodepth_batch_submit=32 --iodepth_batch_complete=32 --rw=randread --random_distribution=zipf:0.5 --norandommap --time_based --ramp_time=5m --runtime=5m --group_reporting

umount /mnt
mount /dev/ram0 /mnt
cd ..

mkdir random
cd random/
echo $$ >cgroup.procs
echo 80G >memory.max
echo 3 >/proc/sys/vm/drop_caches
fio -name=mglru --numjobs=60 --directory=/mnt --size=4096m --buffered=1 --ioengine=io_uring --iodepth=128 --iodepth_batch_submit=32 --iodepth_batch_complete=32 --rw=randread --random_distribution=random --norandommap --time_based --ramp_time=5m --runtime=5m --group_reporting

cat /sys/kernel/debug/lru_gen_full

# numactl -N 0 -m 0 bash fio.sh

# zcat /proc/config.gz
# Automatically generated file; DO NOT EDIT.
# Linux/x86_64 6.7.0 Kernel Configuration
#
CONFIG_CC_VERSION_TEXT="clang version google3-trunk (a91cb9ce39dc42e6a7a2c4fe97580e51eb1c2961)"
CONFIG_GCC_VERSION=0
CONFIG_CC_IS_CLANG=y
CONFIG_CLANG_VERSION=99990000
CONFIG_AS_IS_LLVM=y
CONFIG_AS_VERSION=99990000
CONFIG_LD_VERSION=0
CONFIG_LD_IS_LLD=y
CONFIG_LLD_VERSION=160666
CONFIG_CC_HAS_ASM_GOTO_OUTPUT=y
CONFIG_CC_HAS_ASM_GOTO_TIED_OUTPUT=y
CONFIG_TOOLS_SUPPORT_RELR=y
CONFIG_CC_HAS_ASM_INLINE=y
CONFIG_CC_HAS_NO_PROFILE_FN_ATTR=y
CONFIG_PAHOLE_VERSION=124
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_TABLE_SORT=y
CONFIG_THREAD_INFO_IN_TASK=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
# CONFIG_COMPILE_TEST is not set
# CONFIG_WERROR is not set
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_BUILD_SALT=""
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_HAVE_KERNEL_ZSTD=y
# CONFIG_KERNEL_GZIP is not set
# CONFIG_KERNEL_BZIP2 is not set
CONFIG_KERNEL_LZMA=y
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
# CONFIG_KERNEL_ZSTD is not set
CONFIG_DEFAULT_INIT=""
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
# CONFIG_WATCH_QUEUE is not set
CONFIG_CROSS_MEMORY_ATTACH=y
# CONFIG_USELIB is not set
CONFIG_AUDIT=y
CONFIG_HAVE_ARCH_AUDITSYSCALL=y
CONFIG_AUDITSYSCALL=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_GENERIC_IRQ_MIGRATION=y
CONFIG_HARDIRQS_SW_RESEND=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_IRQ_MSI_IOMMU=y
CONFIG_GENERIC_IRQ_MATRIX_ALLOCATOR=y
CONFIG_GENERIC_IRQ_RESERVATION_MODE=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
# CONFIG_GENERIC_IRQ_DEBUGFS is not set
# end of IRQ subsystem

CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_INIT=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_HAVE_POSIX_CPU_TIMERS_TASK_WORK=y
CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y
CONFIG_CONTEXT_TRACKING=y
CONFIG_CONTEXT_TRACKING_IDLE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
# CONFIG_NO_HZ_IDLE is not set
CONFIG_NO_HZ_FULL=y
CONFIG_CONTEXT_TRACKING_USER=y
# CONFIG_CONTEXT_TRACKING_USER_FORCE is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US=125
# end of Timers subsystem

CONFIG_BPF=y
CONFIG_HAVE_EBPF_JIT=y
CONFIG_ARCH_WANT_DEFAULT_BPF_JIT=y

#
# BPF subsystem
#
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT=y
# CONFIG_BPF_JIT_ALWAYS_ON is not set
CONFIG_BPF_JIT_DEFAULT_ON=y
CONFIG_BPF_UNPRIV_DEFAULT_OFF=y
# CONFIG_BPF_PRELOAD is not set
CONFIG_BPF_LSM=y
# end of BPF subsystem

CONFIG_PREEMPT_NONE_BUILD=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_DYNAMIC is not set
# CONFIG_SCHED_CORE is not set

#
# CPU/Task time and stats accounting
#
CONFIG_VIRT_CPU_ACCOUNTING=y
CONFIG_VIRT_CPU_ACCOUNTING_GEN=y
CONFIG_IRQ_TIME_ACCOUNTING=y
CONFIG_HAVE_SCHED_AVG_IRQ=y
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
CONFIG_TASKSTATS=y
# CONFIG_TASK_DELAY_ACCT is not set
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y
# CONFIG_PSI is not set
# end of CPU/Task time and stats accounting

CONFIG_CPU_ISOLATION=y

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_TREE_SRCU=y
CONFIG_TASKS_RCU_GENERIC=y
CONFIG_TASKS_RUDE_RCU=y
CONFIG_TASKS_TRACE_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
CONFIG_RCU_NOCB_CPU=y
# CONFIG_RCU_NOCB_CPU_DEFAULT_ALL is not set
# CONFIG_RCU_LAZY is not set
# end of RCU Subsystem

CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
# CONFIG_IKHEADERS is not set
CONFIG_LOG_BUF_SHIFT=20
CONFIG_LOG_CPU_MAX_BUF_SHIFT=12
# CONFIG_PRINTK_INDEX is not set
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y

#
# Scheduler features
#
# end of Scheduler features

CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH=y
CONFIG_CC_HAS_INT128=y
CONFIG_CC_IMPLICIT_FALLTHROUGH="-Wimplicit-fallthrough"
CONFIG_GCC11_NO_ARRAY_BOUNDS=y
CONFIG_ARCH_SUPPORTS_INT128=y
# CONFIG_NUMA_BALANCING is not set
CONFIG_CGROUPS=y
CONFIG_PAGE_COUNTER=y
# CONFIG_CGROUP_FAVOR_DYNMODS is not set
CONFIG_MEMCG=y
CONFIG_MEMCG_KMEM=y
CONFIG_BLK_CGROUP=y
CONFIG_CGROUP_WRITEBACK=y
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_CFS_BANDWIDTH=y
# CONFIG_RT_GROUP_SCHED is not set
CONFIG_SCHED_MM_CID=y
CONFIG_CGROUP_PIDS=y
# CONFIG_CGROUP_RDMA is not set
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_HUGETLB=y
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_PERF=y
CONFIG_CGROUP_BPF=y
CONFIG_CGROUP_MISC=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_SOCK_CGROUP_DATA=y
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_TIME_NS=y
CONFIG_IPC_NS=y
CONFIG_USER_NS=y
CONFIG_PID_NS=y
CONFIG_NET_NS=y
CONFIG_CHECKPOINT_RESTORE=y
# CONFIG_SCHED_AUTOGROUP is not set
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE="gbuild-obj/initramfs-symlink.cpio.xz"
CONFIG_INITRAMFS_ROOT_UID=0
CONFIG_INITRAMFS_ROOT_GID=0
CONFIG_RD_GZIP=y
CONFIG_RD_BZIP2=y
CONFIG_RD_LZMA=y
CONFIG_RD_XZ=y
CONFIG_RD_LZO=y
CONFIG_RD_LZ4=y
CONFIG_RD_ZSTD=y
# CONFIG_INITRAMFS_COMPRESSION_GZIP is not set
# CONFIG_INITRAMFS_COMPRESSION_BZIP2 is not set
# CONFIG_INITRAMFS_COMPRESSION_LZMA is not set
CONFIG_INITRAMFS_COMPRESSION_XZ=y
# CONFIG_INITRAMFS_COMPRESSION_LZO is not set
# CONFIG_INITRAMFS_COMPRESSION_LZ4 is not set
# CONFIG_INITRAMFS_COMPRESSION_ZSTD is not set
# CONFIG_INITRAMFS_COMPRESSION_NONE is not set
# CONFIG_BOOT_CONFIG is not set
CONFIG_INITRAMFS_PRESERVE_MTIME=y
CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_LD_ORPHAN_WARN=y
CONFIG_LD_ORPHAN_WARN_LEVEL="warn"
CONFIG_SYSCTL=y
CONFIG_HAVE_UID16=y
CONFIG_SYSCTL_EXCEPTION_TRACE=y
CONFIG_HAVE_PCSPKR_PLATFORM=y
CONFIG_EXPERT=y
CONFIG_UID16=y
CONFIG_MULTIUSER=y
CONFIG_SGETMASK_SYSCALL=y
CONFIG_SYSFS_SYSCALL=y
CONFIG_FHANDLE=y
CONFIG_POSIX_TIMERS=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_FUTEX_PI=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_IO_URING=y
CONFIG_ADVISE_SYSCALLS=y
CONFIG_MEMBARRIER=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_SELFTEST is not set
CONFIG_KALLSYMS_ALL=y
CONFIG_KALLSYMS_ABSOLUTE_PERCPU=y
CONFIG_KALLSYMS_BASE_RELATIVE=y
CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE=y
CONFIG_KCMP=y
CONFIG_RSEQ=y
CONFIG_CACHESTAT_SYSCALL=y
# CONFIG_DEBUG_RSEQ is not set
CONFIG_HAVE_PERF_EVENTS=y
CONFIG_GUEST_PERF_EVENTS=y
# CONFIG_PC104 is not set

#
# Kernel Performance Events And Counters
#
CONFIG_PERF_EVENTS=y
# CONFIG_DEBUG_PERF_USE_VMALLOC is not set
# end of Kernel Performance Events And Counters

CONFIG_PROFILING=y
CONFIG_TRACEPOINTS=y

#
# Kexec and crash features
#
CONFIG_CRASH_CORE=y
CONFIG_KEXEC_CORE=y
CONFIG_KEXEC=y
# CONFIG_KEXEC_FILE is not set
CONFIG_CRASH_DUMP=y
CONFIG_CRASH_HOTPLUG=y
CONFIG_CRASH_MAX_MEMORY_RANGES=8192
# end of Kexec and crash features
# end of General setup

CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_MMU=y
CONFIG_ARCH_MMAP_RND_BITS_MIN=28
CONFIG_ARCH_MMAP_RND_BITS_MAX=32
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_AUDIT_ARCH=y
CONFIG_HAVE_INTEL_TXT=y
CONFIG_X86_64_SMP=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_DYNAMIC_PHYSICAL_MASK=y
CONFIG_PGTABLE_LEVELS=5
CONFIG_CC_HAS_SANE_STACKPROTECTOR=y

#
# Processor type and features
#
CONFIG_SMP=y
CONFIG_X86_X2APIC=y
CONFIG_X86_MPPARSE=y
# CONFIG_GOLDFISH is not set
CONFIG_X86_CPU_RESCTRL=y
CONFIG_X86_EXTENDED_PLATFORM=y
# CONFIG_X86_NUMACHIP is not set
# CONFIG_X86_VSMP is not set
# CONFIG_X86_UV is not set
# CONFIG_X86_GOLDFISH is not set
# CONFIG_X86_INTEL_MID is not set
# CONFIG_X86_INTEL_LPSS is not set
# CONFIG_X86_AMD_PLATFORM_DEVICE is not set
# CONFIG_IOSF_MBI is not set
CONFIG_X86_SUPPORTS_MEMORY_FAILURE=y
CONFIG_SCHED_OMIT_FRAME_POINTER=y
# CONFIG_HYPERVISOR_GUEST is not set
# CONFIG_MK8 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
# CONFIG_MATOM is not set
CONFIG_GENERIC_CPU=y
CONFIG_X86_INTERNODE_CACHE_SHIFT=6
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_IA32_FEAT_CTL=y
CONFIG_X86_VMX_FEATURE_NAMES=y
# CONFIG_PROCESSOR_SELECT is not set
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_HYGON=y
CONFIG_CPU_SUP_CENTAUR=y
CONFIG_CPU_SUP_ZHAOXIN=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
# CONFIG_GART_IOMMU is not set
CONFIG_BOOT_VESA_SUPPORT=y
# CONFIG_MAXSMP is not set
CONFIG_NR_CPUS_RANGE_BEGIN=2
CONFIG_NR_CPUS_RANGE_END=512
CONFIG_NR_CPUS_DEFAULT=64
CONFIG_NR_CPUS=512
CONFIG_SCHED_CLUSTER=y
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
# CONFIG_SCHED_MC_PRIO is not set
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS=y
CONFIG_X86_MCE=y
CONFIG_X86_MCELOG_LEGACY=y
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_AMD=y
CONFIG_X86_MCE_THRESHOLD=y
CONFIG_X86_MCE_INJECT=m

#
# Performance monitoring
#
CONFIG_PERF_EVENTS_INTEL_UNCORE=y
CONFIG_PERF_EVENTS_INTEL_RAPL=y
CONFIG_PERF_EVENTS_INTEL_CSTATE=y
# CONFIG_PERF_EVENTS_AMD_POWER is not set
CONFIG_PERF_EVENTS_AMD_UNCORE=y
# CONFIG_PERF_EVENTS_AMD_BRS is not set
# end of Performance monitoring

CONFIG_X86_16BIT=y
CONFIG_X86_ESPFIX64=y
CONFIG_X86_VSYSCALL_EMULATION=y
CONFIG_X86_IOPL_IOPERM=y
CONFIG_MICROCODE=y
# CONFIG_MICROCODE_LATE_LOADING is not set
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
CONFIG_X86_5LEVEL=y
CONFIG_X86_DIRECT_GBPAGES=y
# CONFIG_X86_CPA_STATISTICS is not set
CONFIG_X86_MEM_ENCRYPT=y
CONFIG_AMD_MEM_ENCRYPT=y
# CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT is not set
CONFIG_NUMA=y
CONFIG_AMD_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
# CONFIG_NUMA_EMU is not set
CONFIG_NODES_SHIFT=4
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
CONFIG_ARCH_PROC_KCORE_TEXT=y
CONFIG_ILLEGAL_POINTER_VALUE=0xdead000000000000
# CONFIG_X86_PMEM_LEGACY is not set
# CONFIG_X86_CHECK_BIOS_CORRUPTION is not set
CONFIG_MTRR=y
# CONFIG_MTRR_SANITIZER is not set
CONFIG_X86_PAT=y
CONFIG_ARCH_USES_PG_UNCACHED=y
CONFIG_X86_UMIP=y
CONFIG_CC_HAS_IBT=y
CONFIG_X86_CET=y
CONFIG_X86_KERNEL_IBT=y
CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS=y
# CONFIG_X86_INTEL_TSX_MODE_OFF is not set
CONFIG_X86_INTEL_TSX_MODE_ON=y
# CONFIG_X86_INTEL_TSX_MODE_AUTO is not set
# CONFIG_X86_SGX is not set
# CONFIG_X86_USER_SHADOW_STACK is not set
# CONFIG_INTEL_TDX_HOST is not set
CONFIG_EFI=y
CONFIG_EFI_STUB=y
CONFIG_EFI_HANDOVER_PROTOCOL=y
# CONFIG_EFI_MIXED is not set
# CONFIG_EFI_FAKE_MEMMAP is not set
CONFIG_EFI_RUNTIME_MAP=y
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
CONFIG_SCHED_HRTICK=y
CONFIG_ARCH_SUPPORTS_KEXEC=y
CONFIG_ARCH_SUPPORTS_KEXEC_FILE=y
CONFIG_ARCH_SUPPORTS_KEXEC_SIG=y
CONFIG_ARCH_SUPPORTS_KEXEC_SIG_FORCE=y
CONFIG_ARCH_SUPPORTS_KEXEC_BZIMAGE_VERIFY_SIG=y
CONFIG_ARCH_SUPPORTS_KEXEC_JUMP=y
CONFIG_ARCH_SUPPORTS_CRASH_DUMP=y
CONFIG_ARCH_SUPPORTS_CRASH_HOTPLUG=y
CONFIG_ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION=y
CONFIG_PHYSICAL_START=0x1000000
CONFIG_RELOCATABLE=y
CONFIG_RANDOMIZE_BASE=y
CONFIG_X86_NEED_RELOCS=y
CONFIG_PHYSICAL_ALIGN=0x200000
CONFIG_DYNAMIC_MEMORY_LAYOUT=y
CONFIG_RANDOMIZE_MEMORY=y
CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING=0x0
# CONFIG_ADDRESS_MASKING is not set
CONFIG_HOTPLUG_CPU=y
# CONFIG_COMPAT_VDSO is not set
CONFIG_LEGACY_VSYSCALL_XONLY=y
# CONFIG_LEGACY_VSYSCALL_NONE is not set
CONFIG_CMDLINE_BOOL=y
CONFIG_CMDLINE="oops=panic panic=10 io_delay=0xed libata.allow_tpm=1 nmi_watchdog=panic tco_start=1 quiet slab_nomerge fb_tunnels=none mce=print_all msr.allow_writes=on acpi_enforce_resources=lax video=efifb:off hest_disable=1 erst_disable=1 bert_disable=1 retbleed=off eagerfpu=on kvm_amd.nested=0"
# CONFIG_CMDLINE_OVERRIDE is not set
CONFIG_MODIFY_LDT_SYSCALL=y
# CONFIG_STRICT_SIGALTSTACK_SIZE is not set
CONFIG_HAVE_LIVEPATCH=y
# CONFIG_LIVEPATCH is not set
# end of Processor type and features

CONFIG_CC_HAS_SLS=y
CONFIG_CC_HAS_RETURN_THUNK=y
CONFIG_CC_HAS_ENTRY_PADDING=y
CONFIG_FUNCTION_PADDING_CFI=11
CONFIG_FUNCTION_PADDING_BYTES=16
CONFIG_SPECULATION_MITIGATIONS=y
CONFIG_PAGE_TABLE_ISOLATION=y
# CONFIG_RETPOLINE is not set
CONFIG_CPU_IBPB_ENTRY=y
CONFIG_CPU_IBRS_ENTRY=y
CONFIG_SLS=y
# CONFIG_GDS_FORCE_MITIGATION is not set
CONFIG_ARCH_HAS_ADD_PAGES=y

#
# Power management and ACPI options
#
CONFIG_SUSPEND=y
CONFIG_SUSPEND_FREEZER=y
# CONFIG_SUSPEND_SKIP_SYNC is not set
# CONFIG_HIBERNATION is not set
CONFIG_PM_SLEEP=y
CONFIG_PM_SLEEP_SMP=y
# CONFIG_PM_AUTOSLEEP is not set
# CONFIG_PM_USERSPACE_AUTOSLEEP is not set
# CONFIG_PM_WAKELOCKS is not set
CONFIG_PM=y
CONFIG_PM_DEBUG=y
# CONFIG_PM_ADVANCED_DEBUG is not set
# CONFIG_PM_TEST_SUSPEND is not set
CONFIG_PM_SLEEP_DEBUG=y
# CONFIG_DPM_WATCHDOG is not set
CONFIG_PM_TRACE=y
CONFIG_PM_TRACE_RTC=y
# CONFIG_WQ_POWER_EFFICIENT_DEFAULT is not set
# CONFIG_ENERGY_MODEL is not set
CONFIG_ARCH_SUPPORTS_ACPI=y
CONFIG_ACPI=y
CONFIG_ACPI_LEGACY_TABLES_LOOKUP=y
CONFIG_ARCH_MIGHT_HAVE_ACPI_PDC=y
CONFIG_ACPI_SYSTEM_POWER_STATES_SUPPORT=y
# CONFIG_ACPI_DEBUGGER is not set
CONFIG_ACPI_SPCR_TABLE=y
# CONFIG_ACPI_FPDT is not set
CONFIG_ACPI_LPIT=y
CONFIG_ACPI_SLEEP=y
CONFIG_ACPI_REV_OVERRIDE_POSSIBLE=y
# CONFIG_ACPI_EC_DEBUGFS is not set
CONFIG_ACPI_AC=y
CONFIG_ACPI_BATTERY=y
CONFIG_ACPI_BUTTON=y
CONFIG_ACPI_FAN=y
# CONFIG_ACPI_TAD is not set
# CONFIG_ACPI_DOCK is not set
CONFIG_ACPI_CPU_FREQ_PSS=y
CONFIG_ACPI_PROCESSOR_CSTATE=y
CONFIG_ACPI_PROCESSOR_IDLE=y
CONFIG_ACPI_PROCESSOR=y
# CONFIG_ACPI_IPMI is not set
CONFIG_ACPI_HOTPLUG_CPU=y
# CONFIG_ACPI_PROCESSOR_AGGREGATOR is not set
CONFIG_ACPI_THERMAL=m
CONFIG_ARCH_HAS_ACPI_TABLE_UPGRADE=y
CONFIG_ACPI_TABLE_UPGRADE=y
# CONFIG_ACPI_DEBUG is not set
# CONFIG_ACPI_PCI_SLOT is not set
CONFIG_ACPI_CONTAINER=y
CONFIG_ACPI_HOTPLUG_IOAPIC=y
# CONFIG_ACPI_SBS is not set
# CONFIG_ACPI_HED is not set
# CONFIG_ACPI_CUSTOM_METHOD is not set
# CONFIG_ACPI_BGRT is not set
# CONFIG_ACPI_REDUCED_HARDWARE_ONLY is not set
CONFIG_ACPI_NFIT=y
# CONFIG_NFIT_SECURITY_DEBUG is not set
CONFIG_ACPI_NUMA=y
# CONFIG_ACPI_HMAT is not set
CONFIG_HAVE_ACPI_APEI=y
CONFIG_HAVE_ACPI_APEI_NMI=y
CONFIG_ACPI_APEI=y
# CONFIG_ACPI_APEI_GHES is not set
# CONFIG_ACPI_APEI_MEMORY_FAILURE is not set
CONFIG_ACPI_APEI_EINJ=m
# CONFIG_ACPI_APEI_ERST_DEBUG is not set
# CONFIG_ACPI_DPTF is not set
# CONFIG_ACPI_CONFIGFS is not set
# CONFIG_ACPI_PFRUT is not set
# CONFIG_ACPI_FFH is not set
# CONFIG_PMIC_OPREGION is not set
CONFIG_ACPI_PRMT=y
CONFIG_X86_PM_TIMER=y

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_GOV_ATTR_SET=y
CONFIG_CPU_FREQ_GOV_COMMON=y
CONFIG_CPU_FREQ_STAT=y
CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=m
CONFIG_CPU_FREQ_GOV_USERSPACE=m
CONFIG_CPU_FREQ_GOV_ONDEMAND=m
# CONFIG_CPU_FREQ_GOV_CONSERVATIVE is not set
# CONFIG_CPU_FREQ_GOV_SCHEDUTIL is not set

#
# CPU frequency scaling drivers
#
# CONFIG_X86_INTEL_PSTATE is not set
# CONFIG_X86_PCC_CPUFREQ is not set
# CONFIG_X86_AMD_PSTATE is not set
# CONFIG_X86_AMD_PSTATE_UT is not set
CONFIG_X86_ACPI_CPUFREQ=m
CONFIG_X86_ACPI_CPUFREQ_CPB=y
CONFIG_X86_POWERNOW_K8=m
# CONFIG_X86_AMD_FREQ_SENSITIVITY is not set
CONFIG_X86_SPEEDSTEP_CENTRINO=m
# CONFIG_X86_P4_CLOCKMOD is not set

#
# shared options
#
# end of CPU Frequency scaling

#
# CPU Idle
#
CONFIG_CPU_IDLE=y
# CONFIG_CPU_IDLE_GOV_LADDER is not set
CONFIG_CPU_IDLE_GOV_MENU=y
# CONFIG_CPU_IDLE_GOV_TEO is not set
# end of CPU Idle

CONFIG_INTEL_IDLE=y
# end of Power management and ACPI options

#
# Bus options (PCI etc.)
#
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_MMCONF_FAM10H=y
# CONFIG_PCI_CNB20LE_QUIRK is not set
# CONFIG_ISA_BUS is not set
# CONFIG_ISA_DMA_API is not set
CONFIG_AMD_NB=y
# end of Bus options (PCI etc.)

#
# Binary Emulations
#
CONFIG_IA32_EMULATION=y
# CONFIG_IA32_EMULATION_DEFAULT_DISABLED is not set
CONFIG_COMPAT_32=y
CONFIG_COMPAT=y
CONFIG_COMPAT_FOR_U64_ALIGNMENT=y
# end of Binary Emulations

CONFIG_HAVE_KVM=y
CONFIG_HAVE_KVM_PFNCACHE=y
CONFIG_HAVE_KVM_IRQCHIP=y
CONFIG_HAVE_KVM_IRQFD=y
CONFIG_HAVE_KVM_IRQ_ROUTING=y
CONFIG_HAVE_KVM_DIRTY_RING=y
CONFIG_HAVE_KVM_DIRTY_RING_TSO=y
CONFIG_HAVE_KVM_DIRTY_RING_ACQ_REL=y
CONFIG_HAVE_KVM_EVENTFD=y
CONFIG_KVM_MMIO=y
CONFIG_KVM_ASYNC_PF=y
CONFIG_HAVE_KVM_MSI=y
CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT=y
CONFIG_KVM_VFIO=y
CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT=y
CONFIG_KVM_COMPAT=y
CONFIG_HAVE_KVM_IRQ_BYPASS=y
CONFIG_HAVE_KVM_NO_POLL=y
CONFIG_KVM_XFER_TO_GUEST_WORK=y
CONFIG_HAVE_KVM_PM_NOTIFIER=y
CONFIG_KVM_GENERIC_HARDWARE_ENABLING=y
CONFIG_VIRTUALIZATION=y
CONFIG_KVM=y
CONFIG_KVM_WERROR=y
CONFIG_KVM_INTEL=y
CONFIG_KVM_AMD=y
CONFIG_KVM_AMD_SEV=y
CONFIG_KVM_SMM=y
# CONFIG_KVM_XEN is not set
# CONFIG_KVM_PROVE_MMU is not set
CONFIG_KVM_MAX_NR_VCPUS=1024
CONFIG_AS_AVX512=y
CONFIG_AS_SHA1_NI=y
CONFIG_AS_SHA256_NI=y
CONFIG_AS_TPAUSE=y
CONFIG_AS_GFNI=y
CONFIG_AS_WRUSS=y

#
# General architecture-dependent options
#
CONFIG_HOTPLUG_SMT=y
CONFIG_HOTPLUG_CORE_SYNC=y
CONFIG_HOTPLUG_CORE_SYNC_DEAD=y
CONFIG_HOTPLUG_CORE_SYNC_FULL=y
CONFIG_HOTPLUG_SPLIT_STARTUP=y
CONFIG_HOTPLUG_PARALLEL=y
CONFIG_GENERIC_ENTRY=y
CONFIG_KPROBES=y
CONFIG_JUMP_LABEL=y
# CONFIG_STATIC_KEYS_SELFTEST is not set
# CONFIG_STATIC_CALL_SELFTEST is not set
CONFIG_OPTPROBES=y
CONFIG_KPROBES_ON_FTRACE=y
CONFIG_UPROBES=y
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_ARCH_USE_BUILTIN_BSWAP=y
CONFIG_KRETPROBES=y
CONFIG_KRETPROBE_ON_RETHOOK=y
CONFIG_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_OPTPROBES=y
CONFIG_HAVE_KPROBES_ON_FTRACE=y
CONFIG_ARCH_CORRECT_STACKTRACE_ON_KRETPROBE=y
CONFIG_HAVE_FUNCTION_ERROR_INJECTION=y
CONFIG_HAVE_NMI=y
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_TRACE_IRQFLAGS_NMI_SUPPORT=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_HAVE_DMA_CONTIGUOUS=y
CONFIG_GENERIC_SMP_IDLE_THREAD=y
CONFIG_ARCH_HAS_FORTIFY_SOURCE=y
CONFIG_ARCH_HAS_SET_MEMORY=y
CONFIG_ARCH_HAS_SET_DIRECT_MAP=y
CONFIG_ARCH_HAS_CPU_FINALIZE_INIT=y
CONFIG_HAVE_ARCH_THREAD_STRUCT_WHITELIST=y
CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT=y
CONFIG_ARCH_WANTS_NO_INSTR=y
CONFIG_HAVE_ASM_MODVERSIONS=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
CONFIG_HAVE_RSEQ=y
CONFIG_HAVE_RUST=y
CONFIG_HAVE_FUNCTION_ARG_ACCESS_API=y
CONFIG_HAVE_HW_BREAKPOINT=y
CONFIG_HAVE_MIXED_BREAKPOINTS_REGS=y
CONFIG_HAVE_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_PERF_EVENTS_NMI=y
CONFIG_HAVE_HARDLOCKUP_DETECTOR_PERF=y
CONFIG_HAVE_PERF_REGS=y
CONFIG_HAVE_PERF_USER_STACK_DUMP=y
CONFIG_HAVE_ARCH_JUMP_LABEL=y
CONFIG_HAVE_ARCH_JUMP_LABEL_RELATIVE=y
CONFIG_MMU_GATHER_MERGE_VMAS=y
CONFIG_MMU_LAZY_TLB_REFCOUNT=y
CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG=y
CONFIG_ARCH_HAS_NMI_SAFE_THIS_CPU_OPS=y
CONFIG_HAVE_CMPXCHG_LOCAL=y
CONFIG_HAVE_CMPXCHG_DOUBLE=y
CONFIG_ARCH_WANT_COMPAT_IPC_PARSE_VERSION=y
CONFIG_ARCH_WANT_OLD_COMPAT_IPC=y
CONFIG_HAVE_ARCH_SECCOMP=y
CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
CONFIG_SECCOMP=y
CONFIG_SECCOMP_FILTER=y
# CONFIG_SECCOMP_CACHE_DEBUG is not set
CONFIG_HAVE_ARCH_STACKLEAK=y
CONFIG_HAVE_STACKPROTECTOR=y
CONFIG_STACKPROTECTOR=y
CONFIG_STACKPROTECTOR_STRONG=y
CONFIG_ARCH_SUPPORTS_LTO_CLANG=y
CONFIG_ARCH_SUPPORTS_LTO_CLANG_THIN=y
CONFIG_HAS_LTO_CLANG=y
CONFIG_LTO_NONE=y
# CONFIG_LTO_CLANG_FULL is not set
# CONFIG_LTO_CLANG_THIN is not set
CONFIG_ARCH_SUPPORTS_CFI_CLANG=y
# CONFIG_CFI_CLANG is not set
CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES=y
CONFIG_HAVE_CONTEXT_TRACKING_USER=y
CONFIG_HAVE_CONTEXT_TRACKING_USER_OFFSTACK=y
CONFIG_HAVE_VIRT_CPU_ACCOUNTING_GEN=y
CONFIG_HAVE_IRQ_TIME_ACCOUNTING=y
CONFIG_HAVE_MOVE_PUD=y
CONFIG_HAVE_MOVE_PMD=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD=y
CONFIG_HAVE_ARCH_HUGE_VMAP=y
CONFIG_HAVE_ARCH_HUGE_VMALLOC=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_PMD_MKWRITE=y
CONFIG_HAVE_ARCH_SOFT_DIRTY=y
CONFIG_HAVE_MOD_ARCH_SPECIFIC=y
CONFIG_MODULES_USE_ELF_RELA=y
CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK=y
CONFIG_HAVE_SOFTIRQ_ON_OWN_STACK=y
CONFIG_SOFTIRQ_ON_OWN_STACK=y
CONFIG_ARCH_HAS_ELF_RANDOMIZE=y
CONFIG_HAVE_ARCH_MMAP_RND_BITS=y
CONFIG_HAVE_EXIT_THREAD=y
CONFIG_ARCH_MMAP_RND_BITS=28
CONFIG_HAVE_ARCH_MMAP_RND_COMPAT_BITS=y
CONFIG_ARCH_MMAP_RND_COMPAT_BITS=8
CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES=y
CONFIG_PAGE_SIZE_LESS_THAN_64KB=y
CONFIG_PAGE_SIZE_LESS_THAN_256KB=y
CONFIG_HAVE_OBJTOOL=y
CONFIG_HAVE_JUMP_LABEL_HACK=y
CONFIG_HAVE_NOINSTR_HACK=y
CONFIG_HAVE_NOINSTR_VALIDATION=y
CONFIG_HAVE_UACCESS_VALIDATION=y
CONFIG_HAVE_STACK_VALIDATION=y
CONFIG_OLD_SIGSUSPEND3=y
CONFIG_COMPAT_OLD_SIGACTION=y
CONFIG_COMPAT_32BIT_TIME=y
CONFIG_HAVE_ARCH_VMAP_STACK=y
# CONFIG_VMAP_STACK is not set
CONFIG_HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET=y
CONFIG_RANDOMIZE_KSTACK_OFFSET=y
# CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT is not set
CONFIG_ARCH_HAS_STRICT_KERNEL_RWX=y
CONFIG_STRICT_KERNEL_RWX=y
CONFIG_ARCH_HAS_STRICT_MODULE_RWX=y
CONFIG_STRICT_MODULE_RWX=y
CONFIG_HAVE_ARCH_PREL32_RELOCATIONS=y
CONFIG_ARCH_USE_MEMREMAP_PROT=y
# CONFIG_LOCK_EVENT_COUNTS is not set
CONFIG_ARCH_HAS_MEM_ENCRYPT=y
CONFIG_ARCH_HAS_CC_PLATFORM=y
CONFIG_HAVE_STATIC_CALL=y
CONFIG_HAVE_STATIC_CALL_INLINE=y
CONFIG_HAVE_PREEMPT_DYNAMIC=y
CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y
CONFIG_ARCH_WANT_LD_ORPHAN_WARN=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_ARCH_SUPPORTS_PAGE_TABLE_CHECK=y
CONFIG_ARCH_HAS_ELFCORE_COMPAT=y
CONFIG_ARCH_HAS_PARANOID_L1D_FLUSH=y
CONFIG_DYNAMIC_SIGFRAME=y
CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y

#
# GCOV-based kernel profiling
#
# CONFIG_GCOV_KERNEL is not set
CONFIG_ARCH_HAS_GCOV_PROFILE_ALL=y
# end of GCOV-based kernel profiling

CONFIG_HAVE_GCC_PLUGINS=y
CONFIG_FUNCTION_ALIGNMENT_4B=y
CONFIG_FUNCTION_ALIGNMENT_16B=y
CONFIG_FUNCTION_ALIGNMENT=16
# end of General architecture-dependent options

CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
# CONFIG_MODULE_DEBUG is not set
# CONFIG_MODULE_FORCE_LOAD is not set
CONFIG_MODULE_UNLOAD=y
CONFIG_MODULE_FORCE_UNLOAD=y
# CONFIG_MODULE_UNLOAD_TAINT_TRACKING is not set
CONFIG_MODVERSIONS=y
CONFIG_ASM_MODVERSIONS=y
# CONFIG_MODULE_SRCVERSION_ALL is not set
# CONFIG_MODULE_SIG is not set
CONFIG_MODULE_COMPRESS_NONE=y
# CONFIG_MODULE_COMPRESS_GZIP is not set
# CONFIG_MODULE_COMPRESS_XZ is not set
# CONFIG_MODULE_COMPRESS_ZSTD is not set
# CONFIG_MODULE_ALLOW_MISSING_NAMESPACE_IMPORTS is not set
CONFIG_MODPROBE_PATH="/sbin/modprobe"
# CONFIG_TRIM_UNUSED_KSYMS is not set
CONFIG_MODULES_TREE_LOOKUP=y
CONFIG_BLOCK=y
CONFIG_BLOCK_LEGACY_AUTOLOAD=y
CONFIG_BLK_CGROUP_RWSTAT=y
CONFIG_BLK_DEV_BSG_COMMON=y
CONFIG_BLK_ICQ=y
CONFIG_BLK_DEV_BSGLIB=y
# CONFIG_BLK_DEV_INTEGRITY is not set
# CONFIG_BLK_DEV_ZONED is not set
CONFIG_BLK_DEV_THROTTLING=y
# CONFIG_BLK_DEV_THROTTLING_LOW is not set
# CONFIG_BLK_WBT is not set
# CONFIG_BLK_CGROUP_IOLATENCY is not set
# CONFIG_BLK_CGROUP_IOCOST is not set
# CONFIG_BLK_CGROUP_IOPRIO is not set
CONFIG_BLK_DEBUG_FS=y
CONFIG_BLK_SED_OPAL=y
# CONFIG_BLK_INLINE_ENCRYPTION is not set

#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
# CONFIG_ACORN_PARTITION is not set
# CONFIG_AIX_PARTITION is not set
# CONFIG_OSF_PARTITION is not set
# CONFIG_AMIGA_PARTITION is not set
# CONFIG_ATARI_PARTITION is not set
# CONFIG_MAC_PARTITION is not set
CONFIG_MSDOS_PARTITION=y
# CONFIG_BSD_DISKLABEL is not set
# CONFIG_MINIX_SUBPARTITION is not set
# CONFIG_SOLARIS_X86_PARTITION is not set
# CONFIG_UNIXWARE_DISKLABEL is not set
# CONFIG_LDM_PARTITION is not set
# CONFIG_SGI_PARTITION is not set
# CONFIG_ULTRIX_PARTITION is not set
# CONFIG_SUN_PARTITION is not set
# CONFIG_KARMA_PARTITION is not set
CONFIG_EFI_PARTITION=y
# CONFIG_SYSV68_PARTITION is not set
# CONFIG_CMDLINE_PARTITION is not set
# end of Partition Types

CONFIG_BLK_MQ_PCI=y
CONFIG_BLK_MQ_VIRTIO=y
CONFIG_BLK_PM=y
CONFIG_BLOCK_HOLDER_DEPRECATED=y
CONFIG_BLK_MQ_STACKING=y

#
# IO Schedulers
#
CONFIG_MQ_IOSCHED_DEADLINE=y
CONFIG_MQ_IOSCHED_KYBER=y
CONFIG_IOSCHED_BFQ=y
CONFIG_BFQ_GROUP_IOSCHED=y
CONFIG_BFQ_CGROUP_DEBUG=y
# end of IO Schedulers

CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_ASN1=m
CONFIG_INLINE_SPIN_UNLOCK_IRQ=y
CONFIG_INLINE_READ_UNLOCK=y
CONFIG_INLINE_READ_UNLOCK_IRQ=y
CONFIG_INLINE_WRITE_UNLOCK=y
CONFIG_INLINE_WRITE_UNLOCK_IRQ=y
CONFIG_ARCH_SUPPORTS_ATOMIC_RMW=y
CONFIG_MUTEX_SPIN_ON_OWNER=y
CONFIG_RWSEM_SPIN_ON_OWNER=y
CONFIG_LOCK_SPIN_ON_OWNER=y
CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y
CONFIG_QUEUED_SPINLOCKS=y
CONFIG_ARCH_USE_QUEUED_RWLOCKS=y
CONFIG_QUEUED_RWLOCKS=y
CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE=y
CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE=y
CONFIG_ARCH_HAS_SYSCALL_WRAPPER=y
CONFIG_FREEZER=y

#
# Executable file formats
#
CONFIG_BINFMT_ELF=y
CONFIG_COMPAT_BINFMT_ELF=y
CONFIG_ELFCORE=y
CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS=y
CONFIG_BINFMT_SCRIPT=y
CONFIG_BINFMT_MISC=y
CONFIG_COREDUMP=y
# end of Executable file formats

#
# Memory Management options
#
CONFIG_ZPOOL=y
CONFIG_SWAP=y
CONFIG_ZSWAP=y
# CONFIG_ZSWAP_DEFAULT_ON is not set
# CONFIG_ZSWAP_EXCLUSIVE_LOADS_DEFAULT_ON is not set
# CONFIG_ZSWAP_COMPRESSOR_DEFAULT_DEFLATE is not set
CONFIG_ZSWAP_COMPRESSOR_DEFAULT_LZO=y
# CONFIG_ZSWAP_COMPRESSOR_DEFAULT_842 is not set
# CONFIG_ZSWAP_COMPRESSOR_DEFAULT_LZ4 is not set
# CONFIG_ZSWAP_COMPRESSOR_DEFAULT_LZ4HC is not set
# CONFIG_ZSWAP_COMPRESSOR_DEFAULT_ZSTD is not set
CONFIG_ZSWAP_COMPRESSOR_DEFAULT="lzo"
# CONFIG_ZSWAP_ZPOOL_DEFAULT_ZBUD is not set
# CONFIG_ZSWAP_ZPOOL_DEFAULT_Z3FOLD is not set
CONFIG_ZSWAP_ZPOOL_DEFAULT_ZSMALLOC=y
CONFIG_ZSWAP_ZPOOL_DEFAULT="zsmalloc"
# CONFIG_ZBUD is not set
# CONFIG_Z3FOLD is not set
CONFIG_ZSMALLOC=y
# CONFIG_ZSMALLOC_STAT is not set
CONFIG_ZSMALLOC_CHAIN_SIZE=8

#
# SLAB allocator options
#
CONFIG_SLAB_DEPRECATED=y
# CONFIG_SLUB is not set
CONFIG_SLAB=y
CONFIG_SLAB_MERGE_DEFAULT=y
CONFIG_SLAB_FREELIST_RANDOM=y
# CONFIG_SLAB_FREELIST_HARDENED is not set
# end of SLAB allocator options

CONFIG_SHUFFLE_PAGE_ALLOCATOR=y
# CONFIG_COMPAT_BRK is not set
CONFIG_SPARSEMEM=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_VMEMMAP=y
CONFIG_ARCH_WANT_OPTIMIZE_DAX_VMEMMAP=y
CONFIG_ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP=y
CONFIG_HAVE_FAST_GUP=y
CONFIG_MEMORY_ISOLATION=y
CONFIG_EXCLUSIVE_SYSTEM_RAM=y
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
# CONFIG_MEMORY_HOTPLUG is not set
CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE=y
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK=y
CONFIG_MEMORY_BALLOON=y
CONFIG_BALLOON_COMPACTION=y
CONFIG_COMPACTION=y
CONFIG_COMPACT_UNEVICTABLE_DEFAULT=1
CONFIG_PAGE_REPORTING=y
CONFIG_MIGRATION=y
CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION=y
CONFIG_ARCH_ENABLE_THP_MIGRATION=y
CONFIG_CONTIG_ALLOC=y
CONFIG_PCP_BATCH_SCALE_MAX=5
CONFIG_PHYS_ADDR_T_64BIT=y
CONFIG_MMU_NOTIFIER=y
# CONFIG_KSM is not set
CONFIG_DEFAULT_MMAP_MIN_ADDR=65536
CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y
CONFIG_MEMORY_FAILURE=y
# CONFIG_HWPOISON_INJECT is not set
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ARCH_WANTS_THP_SWAP=y
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
# CONFIG_TRANSPARENT_HUGEPAGE_MADVISE is not set
CONFIG_THP_SWAP=y
CONFIG_READ_ONLY_THP_FOR_FS=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_USE_PERCPU_NUMA_NODE_ID=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
# CONFIG_CMA is not set
CONFIG_MEM_SOFT_DIRTY=y
CONFIG_GENERIC_EARLY_IOREMAP=y
# CONFIG_DEFERRED_STRUCT_PAGE_INIT is not set
CONFIG_PAGE_IDLE_FLAG=y
CONFIG_IDLE_PAGE_TRACKING=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_ARCH_HAS_CURRENT_STACK_POINTER=y
CONFIG_ARCH_HAS_PTE_DEVMAP=y
CONFIG_ARCH_HAS_ZONE_DMA_SET=y
# CONFIG_ZONE_DMA is not set
CONFIG_ZONE_DMA32=y
CONFIG_HMM_MIRROR=y
CONFIG_ARCH_USES_HIGH_VMA_FLAGS=y
CONFIG_ARCH_HAS_PKEYS=y
CONFIG_VM_EVENT_COUNTERS=y
# CONFIG_PERCPU_STATS is not set
# CONFIG_GUP_TEST is not set
# CONFIG_DMAPOOL_TEST is not set
CONFIG_ARCH_HAS_PTE_SPECIAL=y
CONFIG_MEMFD_CREATE=y
CONFIG_SECRETMEM=y
CONFIG_ANON_VMA_NAME=y
CONFIG_HAVE_ARCH_USERFAULTFD_WP=y
CONFIG_HAVE_ARCH_USERFAULTFD_MINOR=y
CONFIG_USERFAULTFD=y
CONFIG_PTE_MARKER_UFFD_WP=y
CONFIG_LRU_GEN=y
CONFIG_LRU_GEN_ENABLED=y
# CONFIG_LRU_GEN_STATS is not set
CONFIG_ARCH_SUPPORTS_PER_VMA_LOCK=y
CONFIG_PER_VMA_LOCK=y
CONFIG_LOCK_MM_AND_FIND_VMA=y

#
# Data Access Monitoring
#
# CONFIG_DAMON is not set
# end of Data Access Monitoring
# end of Memory Management options

CONFIG_NET=y
CONFIG_NET_INGRESS=y
CONFIG_NET_EGRESS=y
CONFIG_NET_XGRESS=y
CONFIG_NET_REDIRECT=y
CONFIG_SKB_EXTENSIONS=y

#
# Networking options
#
CONFIG_PACKET=y
CONFIG_PACKET_DIAG=y
CONFIG_UNIX=y
CONFIG_UNIX_SCM=y
CONFIG_AF_UNIX_OOB=y
CONFIG_UNIX_DIAG=y
# CONFIG_TLS is not set
# CONFIG_XFRM_USER is not set
# CONFIG_NET_KEY is not set
# CONFIG_SMC is not set
CONFIG_XDP_SOCKETS=y
# CONFIG_XDP_SOCKETS_DIAG is not set
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
# CONFIG_IP_FIB_TRIE_STATS is not set
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_MULTIPATH=y
CONFIG_IP_ROUTE_VERBOSE=y
CONFIG_IP_ROUTE_CLASSID=y
# CONFIG_IP_PNP is not set
CONFIG_NET_IPIP=y
CONFIG_NET_IPGRE_DEMUX=y
CONFIG_NET_IP_TUNNEL=y
CONFIG_NET_IPGRE=y
# CONFIG_NET_IPGRE_BROADCAST is not set
# CONFIG_IP_MROUTE is not set
CONFIG_SYN_COOKIES=y
# CONFIG_NET_IPVTI is not set
CONFIG_NET_UDP_TUNNEL=y
CONFIG_NET_FOU=y
CONFIG_NET_FOU_IP_TUNNELS=y
# CONFIG_INET_AH is not set
# CONFIG_INET_ESP is not set
# CONFIG_INET_IPCOMP is not set
CONFIG_INET_TABLE_PERTURB_ORDER=16
CONFIG_INET_TUNNEL=y
CONFIG_INET_DIAG=y
CONFIG_INET_TCP_DIAG=y
CONFIG_INET_UDP_DIAG=y
# CONFIG_INET_RAW_DIAG is not set
CONFIG_INET_DIAG_DESTROY=y
CONFIG_TCP_CONG_ADVANCED=y
# CONFIG_TCP_CONG_BIC is not set
CONFIG_TCP_CONG_CUBIC=y
# CONFIG_TCP_CONG_WESTWOOD is not set
# CONFIG_TCP_CONG_HTCP is not set
# CONFIG_TCP_CONG_HSTCP is not set
# CONFIG_TCP_CONG_HYBLA is not set
# CONFIG_TCP_CONG_VEGAS is not set
# CONFIG_TCP_CONG_NV is not set
# CONFIG_TCP_CONG_SCALABLE is not set
# CONFIG_TCP_CONG_LP is not set
# CONFIG_TCP_CONG_VENO is not set
# CONFIG_TCP_CONG_YEAH is not set
# CONFIG_TCP_CONG_ILLINOIS is not set
CONFIG_TCP_CONG_DCTCP=m
# CONFIG_TCP_CONG_CDG is not set
CONFIG_TCP_CONG_BBR=y
CONFIG_DEFAULT_CUBIC=y
# CONFIG_DEFAULT_BBR is not set
# CONFIG_DEFAULT_RENO is not set
CONFIG_DEFAULT_TCP_CONG="cubic"
CONFIG_TCP_SIGPOOL=y
# CONFIG_TCP_AO is not set
CONFIG_TCP_MD5SIG=y
CONFIG_IPV6=y
CONFIG_IPV6_ROUTER_PREF=y
CONFIG_IPV6_ROUTE_INFO=y
# CONFIG_IPV6_OPTIMISTIC_DAD is not set
# CONFIG_INET6_AH is not set
# CONFIG_INET6_ESP is not set
# CONFIG_INET6_IPCOMP is not set
# CONFIG_IPV6_MIP6 is not set
# CONFIG_IPV6_ILA is not set
CONFIG_INET6_TUNNEL=y
# CONFIG_IPV6_VTI is not set
CONFIG_IPV6_SIT=y
CONFIG_IPV6_SIT_6RD=y
CONFIG_IPV6_NDISC_NODETYPE=y
CONFIG_IPV6_TUNNEL=y
CONFIG_IPV6_GRE=y
CONFIG_IPV6_FOU=y
CONFIG_IPV6_FOU_TUNNEL=y
CONFIG_IPV6_MULTIPLE_TABLES=y
# CONFIG_IPV6_SUBTREES is not set
# CONFIG_IPV6_MROUTE is not set
# CONFIG_IPV6_SEG6_LWTUNNEL is not set
# CONFIG_IPV6_SEG6_HMAC is not set
# CONFIG_IPV6_RPL_LWTUNNEL is not set
# CONFIG_IPV6_IOAM6_LWTUNNEL is not set
# CONFIG_NETLABEL is not set
# CONFIG_MPTCP is not set
# CONFIG_NETWORK_SECMARK is not set
CONFIG_NET_PTP_CLASSIFY=y
# CONFIG_NETWORK_PHY_TIMESTAMPING is not set
CONFIG_NETFILTER=y
CONFIG_NETFILTER_ADVANCED=y
CONFIG_BRIDGE_NETFILTER=m

#
# Core Netfilter Configuration
#
CONFIG_NETFILTER_INGRESS=y
CONFIG_NETFILTER_EGRESS=y
CONFIG_NETFILTER_SKIP_EGRESS=y
CONFIG_NETFILTER_NETLINK=m
CONFIG_NETFILTER_FAMILY_BRIDGE=y
CONFIG_NETFILTER_BPF_LINK=y
# CONFIG_NETFILTER_NETLINK_ACCT is not set
CONFIG_NETFILTER_NETLINK_QUEUE=m
CONFIG_NETFILTER_NETLINK_LOG=m
# CONFIG_NETFILTER_NETLINK_OSF is not set
CONFIG_NF_CONNTRACK=m
CONFIG_NF_LOG_SYSLOG=m
CONFIG_NETFILTER_CONNCOUNT=m
CONFIG_NF_CONNTRACK_MARK=y
# CONFIG_NF_CONNTRACK_ZONES is not set
# CONFIG_NF_CONNTRACK_PROCFS is not set
CONFIG_NF_CONNTRACK_EVENTS=y
# CONFIG_NF_CONNTRACK_TIMEOUT is not set
# CONFIG_NF_CONNTRACK_TIMESTAMP is not set
# CONFIG_NF_CONNTRACK_LABELS is not set
CONFIG_NF_CT_PROTO_DCCP=y
CONFIG_NF_CT_PROTO_SCTP=y
CONFIG_NF_CT_PROTO_UDPLITE=y
# CONFIG_NF_CONNTRACK_AMANDA is not set
CONFIG_NF_CONNTRACK_FTP=m
# CONFIG_NF_CONNTRACK_H323 is not set
# CONFIG_NF_CONNTRACK_IRC is not set
# CONFIG_NF_CONNTRACK_NETBIOS_NS is not set
# CONFIG_NF_CONNTRACK_SNMP is not set
# CONFIG_NF_CONNTRACK_PPTP is not set
# CONFIG_NF_CONNTRACK_SANE is not set
# CONFIG_NF_CONNTRACK_SIP is not set
# CONFIG_NF_CONNTRACK_TFTP is not set
CONFIG_NF_CT_NETLINK=m
# CONFIG_NETFILTER_NETLINK_GLUE_CT is not set
CONFIG_NF_NAT=m
CONFIG_NF_NAT_FTP=m
CONFIG_NF_NAT_MASQUERADE=y
# CONFIG_NF_TABLES is not set
CONFIG_NETFILTER_XTABLES=y
# CONFIG_NETFILTER_XTABLES_COMPAT is not set

#
# Xtables combined modules
#
CONFIG_NETFILTER_XT_MARK=m
CONFIG_NETFILTER_XT_CONNMARK=m
CONFIG_NETFILTER_XT_SET=m

#
# Xtables targets
#
# CONFIG_NETFILTER_XT_TARGET_AUDIT is not set
# CONFIG_NETFILTER_XT_TARGET_CHECKSUM is not set
CONFIG_NETFILTER_XT_TARGET_CLASSIFY=m
CONFIG_NETFILTER_XT_TARGET_CONNMARK=m
CONFIG_NETFILTER_XT_TARGET_CT=m
CONFIG_NETFILTER_XT_TARGET_DSCP=m
CONFIG_NETFILTER_XT_TARGET_HL=m
# CONFIG_NETFILTER_XT_TARGET_HMARK is not set
# CONFIG_NETFILTER_XT_TARGET_IDLETIMER is not set
CONFIG_NETFILTER_XT_TARGET_LOG=m
CONFIG_NETFILTER_XT_TARGET_MARK=m
CONFIG_NETFILTER_XT_NAT=m
# CONFIG_NETFILTER_XT_TARGET_NETMAP is not set
CONFIG_NETFILTER_XT_TARGET_NFLOG=m
CONFIG_NETFILTER_XT_TARGET_NFQUEUE=m
CONFIG_NETFILTER_XT_TARGET_NOTRACK=m
CONFIG_NETFILTER_XT_TARGET_RATEEST=m
# CONFIG_NETFILTER_XT_TARGET_REDIRECT is not set
CONFIG_NETFILTER_XT_TARGET_MASQUERADE=m
# CONFIG_NETFILTER_XT_TARGET_TEE is not set
CONFIG_NETFILTER_XT_TARGET_TPROXY=y
CONFIG_NETFILTER_XT_TARGET_TRACE=m
CONFIG_NETFILTER_XT_TARGET_TCPMSS=y
CONFIG_NETFILTER_XT_TARGET_TCPOPTSTRIP=m

#
# Xtables matches
#
CONFIG_NETFILTER_XT_MATCH_ADDRTYPE=m
# CONFIG_NETFILTER_XT_MATCH_BPF is not set
# CONFIG_NETFILTER_XT_MATCH_CGROUP is not set
# CONFIG_NETFILTER_XT_MATCH_CLUSTER is not set
CONFIG_NETFILTER_XT_MATCH_COMMENT=m
CONFIG_NETFILTER_XT_MATCH_CONNBYTES=m
# CONFIG_NETFILTER_XT_MATCH_CONNLABEL is not set
CONFIG_NETFILTER_XT_MATCH_CONNLIMIT=m
CONFIG_NETFILTER_XT_MATCH_CONNMARK=m
CONFIG_NETFILTER_XT_MATCH_CONNTRACK=m
# CONFIG_NETFILTER_XT_MATCH_CPU is not set
CONFIG_NETFILTER_XT_MATCH_DCCP=m
# CONFIG_NETFILTER_XT_MATCH_DEVGROUP is not set
CONFIG_NETFILTER_XT_MATCH_DSCP=m
CONFIG_NETFILTER_XT_MATCH_ECN=m
CONFIG_NETFILTER_XT_MATCH_ESP=m
CONFIG_NETFILTER_XT_MATCH_HASHLIMIT=m
CONFIG_NETFILTER_XT_MATCH_HELPER=m
CONFIG_NETFILTER_XT_MATCH_HL=m
# CONFIG_NETFILTER_XT_MATCH_IPCOMP is not set
CONFIG_NETFILTER_XT_MATCH_IPRANGE=m
# CONFIG_NETFILTER_XT_MATCH_L2TP is not set
CONFIG_NETFILTER_XT_MATCH_LENGTH=m
# CONFIG_NETFILTER_XT_MATCH_LIMIT is not set
CONFIG_NETFILTER_XT_MATCH_MAC=m
CONFIG_NETFILTER_XT_MATCH_MARK=m
CONFIG_NETFILTER_XT_MATCH_MULTIPORT=y
# CONFIG_NETFILTER_XT_MATCH_NFACCT is not set
# CONFIG_NETFILTER_XT_MATCH_OSF is not set
CONFIG_NETFILTER_XT_MATCH_OWNER=m
# CONFIG_NETFILTER_XT_MATCH_PHYSDEV is not set
CONFIG_NETFILTER_XT_MATCH_PKTTYPE=m
CONFIG_NETFILTER_XT_MATCH_QUOTA=m
CONFIG_NETFILTER_XT_MATCH_RATEEST=m
CONFIG_NETFILTER_XT_MATCH_REALM=m
CONFIG_NETFILTER_XT_MATCH_RECENT=m
CONFIG_NETFILTER_XT_MATCH_SCTP=m
CONFIG_NETFILTER_XT_MATCH_SOCKET=m
CONFIG_NETFILTER_XT_MATCH_STATE=m
CONFIG_NETFILTER_XT_MATCH_STATISTIC=m
CONFIG_NETFILTER_XT_MATCH_STRING=m
CONFIG_NETFILTER_XT_MATCH_TCPMSS=y
CONFIG_NETFILTER_XT_MATCH_TIME=m
CONFIG_NETFILTER_XT_MATCH_U32=m
# end of Core Netfilter Configuration

CONFIG_IP_SET=m
CONFIG_IP_SET_MAX=1024
# CONFIG_IP_SET_BITMAP_IP is not set
# CONFIG_IP_SET_BITMAP_IPMAC is not set
# CONFIG_IP_SET_BITMAP_PORT is not set
CONFIG_IP_SET_HASH_IP=m
# CONFIG_IP_SET_HASH_IPMARK is not set
# CONFIG_IP_SET_HASH_IPPORT is not set
# CONFIG_IP_SET_HASH_IPPORTIP is not set
# CONFIG_IP_SET_HASH_IPPORTNET is not set
# CONFIG_IP_SET_HASH_IPMAC is not set
# CONFIG_IP_SET_HASH_MAC is not set
# CONFIG_IP_SET_HASH_NETPORTNET is not set
CONFIG_IP_SET_HASH_NET=m
# CONFIG_IP_SET_HASH_NETNET is not set
# CONFIG_IP_SET_HASH_NETPORT is not set
# CONFIG_IP_SET_HASH_NETIFACE is not set
# CONFIG_IP_SET_LIST_SET is not set
# CONFIG_IP_VS is not set

#
# IP: Netfilter Configuration
#
CONFIG_NF_DEFRAG_IPV4=y
CONFIG_NF_SOCKET_IPV4=m
CONFIG_NF_TPROXY_IPV4=y
# CONFIG_NF_DUP_IPV4 is not set
# CONFIG_NF_LOG_ARP is not set
# CONFIG_NF_LOG_IPV4 is not set
CONFIG_NF_REJECT_IPV4=m
CONFIG_IP_NF_IPTABLES=y
# CONFIG_IP_NF_MATCH_AH is not set
CONFIG_IP_NF_MATCH_ECN=m
# CONFIG_IP_NF_MATCH_RPFILTER is not set
CONFIG_IP_NF_MATCH_TTL=m
CONFIG_IP_NF_FILTER=m
CONFIG_IP_NF_TARGET_REJECT=m
# CONFIG_IP_NF_TARGET_SYNPROXY is not set
CONFIG_IP_NF_NAT=m
CONFIG_IP_NF_TARGET_MASQUERADE=m
# CONFIG_IP_NF_TARGET_NETMAP is not set
# CONFIG_IP_NF_TARGET_REDIRECT is not set
CONFIG_IP_NF_MANGLE=y
CONFIG_IP_NF_TARGET_ECN=m
CONFIG_IP_NF_TARGET_TTL=m
CONFIG_IP_NF_RAW=m
# CONFIG_IP_NF_SECURITY is not set
# CONFIG_IP_NF_ARPTABLES is not set
# end of IP: Netfilter Configuration

#
# IPv6: Netfilter Configuration
#
CONFIG_NF_SOCKET_IPV6=m
CONFIG_NF_TPROXY_IPV6=y
# CONFIG_NF_DUP_IPV6 is not set
CONFIG_NF_REJECT_IPV6=m
CONFIG_NF_LOG_IPV6=m
CONFIG_IP6_NF_IPTABLES=y
# CONFIG_IP6_NF_MATCH_AH is not set
# CONFIG_IP6_NF_MATCH_EUI64 is not set
CONFIG_IP6_NF_MATCH_FRAG=m
# CONFIG_IP6_NF_MATCH_OPTS is not set
CONFIG_IP6_NF_MATCH_HL=m
CONFIG_IP6_NF_MATCH_IPV6HEADER=m
# CONFIG_IP6_NF_MATCH_MH is not set
# CONFIG_IP6_NF_MATCH_RPFILTER is not set
# CONFIG_IP6_NF_MATCH_RT is not set
# CONFIG_IP6_NF_MATCH_SRH is not set
CONFIG_IP6_NF_TARGET_HL=m
CONFIG_IP6_NF_FILTER=m
CONFIG_IP6_NF_TARGET_REJECT=m
# CONFIG_IP6_NF_TARGET_SYNPROXY is not set
CONFIG_IP6_NF_MANGLE=y
CONFIG_IP6_NF_RAW=m
# CONFIG_IP6_NF_SECURITY is not set
# CONFIG_IP6_NF_NAT is not set
# end of IPv6: Netfilter Configuration

CONFIG_NF_DEFRAG_IPV6=y
# CONFIG_NF_CONNTRACK_BRIDGE is not set
# CONFIG_BRIDGE_NF_EBTABLES is not set
# CONFIG_BPFILTER is not set
# CONFIG_IP_DCCP is not set
CONFIG_IP_SCTP=m
# CONFIG_SCTP_DBG_OBJCNT is not set
CONFIG_SCTP_DEFAULT_COOKIE_HMAC_MD5=y
# CONFIG_SCTP_DEFAULT_COOKIE_HMAC_SHA1 is not set
# CONFIG_SCTP_DEFAULT_COOKIE_HMAC_NONE is not set
CONFIG_SCTP_COOKIE_HMAC_MD5=y
# CONFIG_SCTP_COOKIE_HMAC_SHA1 is not set
CONFIG_INET_SCTP_DIAG=m
# CONFIG_RDS is not set
# CONFIG_TIPC is not set
# CONFIG_ATM is not set
# CONFIG_L2TP is not set
CONFIG_STP=m
CONFIG_BRIDGE=m
CONFIG_BRIDGE_IGMP_SNOOPING=y
# CONFIG_BRIDGE_VLAN_FILTERING is not set
# CONFIG_BRIDGE_MRP is not set
# CONFIG_BRIDGE_CFM is not set
# CONFIG_NET_DSA is not set
CONFIG_VLAN_8021Q=m
# CONFIG_VLAN_8021Q_GVRP is not set
# CONFIG_VLAN_8021Q_MVRP is not set
CONFIG_LLC=m
# CONFIG_LLC2 is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_PHONET is not set
# CONFIG_6LOWPAN is not set
# CONFIG_IEEE802154 is not set
CONFIG_NET_SCHED=y

#
# Queueing/Scheduling
#
CONFIG_NET_SCH_HTB=y
CONFIG_NET_SCH_HFSC=m
CONFIG_NET_SCH_PRIO=y
CONFIG_NET_SCH_MULTIQ=m
CONFIG_NET_SCH_RED=m
# CONFIG_NET_SCH_SFB is not set
CONFIG_NET_SCH_SFQ=m
CONFIG_NET_SCH_TEQL=m
CONFIG_NET_SCH_TBF=m
# CONFIG_NET_SCH_CBS is not set
# CONFIG_NET_SCH_ETF is not set
CONFIG_NET_SCH_MQPRIO_LIB=m
# CONFIG_NET_SCH_TAPRIO is not set
CONFIG_NET_SCH_GRED=m
CONFIG_NET_SCH_NETEM=m
CONFIG_NET_SCH_DRR=m
CONFIG_NET_SCH_MQPRIO=m
# CONFIG_NET_SCH_SKBPRIO is not set
# CONFIG_NET_SCH_CHOKE is not set
# CONFIG_NET_SCH_QFQ is not set
CONFIG_NET_SCH_CODEL=y
CONFIG_NET_SCH_FQ_CODEL=m
# CONFIG_NET_SCH_CAKE is not set
CONFIG_NET_SCH_FQ=y
# CONFIG_NET_SCH_HHF is not set
CONFIG_NET_SCH_PIE=m
# CONFIG_NET_SCH_FQ_PIE is not set
CONFIG_NET_SCH_INGRESS=y
# CONFIG_NET_SCH_PLUG is not set
# CONFIG_NET_SCH_ETS is not set
# CONFIG_NET_SCH_DEFAULT is not set

#
# Classification
#
CONFIG_NET_CLS=y
CONFIG_NET_CLS_BASIC=m
CONFIG_NET_CLS_ROUTE4=m
CONFIG_NET_CLS_FW=m
CONFIG_NET_CLS_U32=y
# CONFIG_CLS_U32_PERF is not set
# CONFIG_CLS_U32_MARK is not set
CONFIG_NET_CLS_FLOW=m
# CONFIG_NET_CLS_CGROUP is not set
CONFIG_NET_CLS_BPF=y
# CONFIG_NET_CLS_FLOWER is not set
# CONFIG_NET_CLS_MATCHALL is not set
CONFIG_NET_EMATCH=y
CONFIG_NET_EMATCH_STACK=32
CONFIG_NET_EMATCH_CMP=m
CONFIG_NET_EMATCH_NBYTE=m
CONFIG_NET_EMATCH_U32=m
CONFIG_NET_EMATCH_META=m
CONFIG_NET_EMATCH_TEXT=m
# CONFIG_NET_EMATCH_IPSET is not set
# CONFIG_NET_EMATCH_IPT is not set
CONFIG_NET_CLS_ACT=y
CONFIG_NET_ACT_POLICE=m
CONFIG_NET_ACT_GACT=m
# CONFIG_GACT_PROB is not set
CONFIG_NET_ACT_MIRRED=m
# CONFIG_NET_ACT_SAMPLE is not set
CONFIG_NET_ACT_IPT=m
CONFIG_NET_ACT_NAT=m
CONFIG_NET_ACT_PEDIT=m
CONFIG_NET_ACT_SIMP=m
CONFIG_NET_ACT_SKBEDIT=m
# CONFIG_NET_ACT_CSUM is not set
# CONFIG_NET_ACT_MPLS is not set
# CONFIG_NET_ACT_VLAN is not set
CONFIG_NET_ACT_BPF=y
# CONFIG_NET_ACT_CONNMARK is not set
# CONFIG_NET_ACT_CTINFO is not set
# CONFIG_NET_ACT_SKBMOD is not set
# CONFIG_NET_ACT_IFE is not set
# CONFIG_NET_ACT_TUNNEL_KEY is not set
# CONFIG_NET_ACT_GATE is not set
# CONFIG_NET_TC_SKB_EXT is not set
CONFIG_NET_SCH_FIFO=y
CONFIG_DCB=y
# CONFIG_DNS_RESOLVER is not set
CONFIG_BATMAN_ADV=m
# CONFIG_BATMAN_ADV_BATMAN_V is not set
CONFIG_BATMAN_ADV_BLA=y
CONFIG_BATMAN_ADV_DAT=y
CONFIG_BATMAN_ADV_NC=y
CONFIG_BATMAN_ADV_MCAST=y
# CONFIG_BATMAN_ADV_DEBUG is not set
# CONFIG_BATMAN_ADV_TRACING is not set
# CONFIG_OPENVSWITCH is not set
CONFIG_VSOCKETS=y
CONFIG_VSOCKETS_DIAG=y
CONFIG_VSOCKETS_LOOPBACK=y
CONFIG_VIRTIO_VSOCKETS=y
CONFIG_VIRTIO_VSOCKETS_COMMON=y
CONFIG_NETLINK_DIAG=y
# CONFIG_MPLS is not set
# CONFIG_NET_NSH is not set
# CONFIG_HSR is not set
# CONFIG_NET_SWITCHDEV is not set
CONFIG_NET_L3_MASTER_DEV=y
# CONFIG_QRTR is not set
# CONFIG_NET_NCSI is not set
CONFIG_PCPU_DEV_REFCNT=y
CONFIG_MAX_SKB_FRAGS=17
CONFIG_RPS=y
CONFIG_RFS_ACCEL=y
CONFIG_SOCK_RX_QUEUE_MAPPING=y
CONFIG_XPS=y
# CONFIG_CGROUP_NET_PRIO is not set
# CONFIG_CGROUP_NET_CLASSID is not set
CONFIG_NET_RX_BUSY_POLL=y
CONFIG_BQL=y
CONFIG_BPF_STREAM_PARSER=y
CONFIG_NET_FLOW_LIMIT=y

#
# Network testing
#
CONFIG_NET_PKTGEN=m
# CONFIG_NET_DROP_MONITOR is not set
# end of Network testing
# end of Networking options

# CONFIG_HAMRADIO is not set
# CONFIG_CAN is not set
# CONFIG_BT is not set
# CONFIG_AF_RXRPC is not set
# CONFIG_AF_KCM is not set
CONFIG_STREAM_PARSER=y
# CONFIG_MCTP is not set
CONFIG_FIB_RULES=y
# CONFIG_WIRELESS is not set
# CONFIG_RFKILL is not set
CONFIG_NET_9P=m
CONFIG_NET_9P_FD=m
CONFIG_NET_9P_VIRTIO=m
# CONFIG_NET_9P_RDMA is not set
# CONFIG_NET_9P_DEBUG is not set
# CONFIG_CAIF is not set
# CONFIG_CEPH_LIB is not set
# CONFIG_NFC is not set
# CONFIG_PSAMPLE is not set
# CONFIG_NET_IFE is not set
# CONFIG_LWTUNNEL is not set
CONFIG_DST_CACHE=y
CONFIG_GRO_CELLS=y
CONFIG_NET_SELFTESTS=y
CONFIG_NET_SOCK_MSG=y
CONFIG_NET_DEVLINK=y
CONFIG_PAGE_POOL=y
# CONFIG_PAGE_POOL_STATS is not set
CONFIG_FAILOVER=m
CONFIG_ETHTOOL_NETLINK=y

#
# Device Drivers
#
CONFIG_HAVE_EISA=y
# CONFIG_EISA is not set
CONFIG_HAVE_PCI=y
CONFIG_PCI=y
CONFIG_PCI_DOMAINS=y
# CONFIG_PCIEPORTBUS is not set
CONFIG_PCIEASPM=y
CONFIG_PCIEASPM_DEFAULT=y
# CONFIG_PCIEASPM_POWERSAVE is not set
# CONFIG_PCIEASPM_POWER_SUPERSAVE is not set
# CONFIG_PCIEASPM_PERFORMANCE is not set
# CONFIG_PCIE_PTM is not set
CONFIG_PCI_MSI=y
CONFIG_PCI_QUIRKS=y
# CONFIG_PCI_DEBUG is not set
# CONFIG_PCI_REALLOC_ENABLE_AUTO is not set
# CONFIG_PCI_STUB is not set
CONFIG_PCI_PF_STUB=m
CONFIG_PCI_ATS=y
CONFIG_PCI_LOCKLESS_CONFIG=y
CONFIG_PCI_IOV=y
CONFIG_PCI_PRI=y
CONFIG_PCI_PASID=y
CONFIG_PCI_LABEL=y
# CONFIG_PCIE_BUS_TUNE_OFF is not set
CONFIG_PCIE_BUS_DEFAULT=y
# CONFIG_PCIE_BUS_SAFE is not set
# CONFIG_PCIE_BUS_PERFORMANCE is not set
# CONFIG_PCIE_BUS_PEER2PEER is not set
CONFIG_VGA_ARB=y
CONFIG_VGA_ARB_MAX_GPUS=16
CONFIG_HOTPLUG_PCI=y
# CONFIG_HOTPLUG_PCI_ACPI is not set
# CONFIG_HOTPLUG_PCI_CPCI is not set
# CONFIG_HOTPLUG_PCI_SHPC is not set

#
# PCI controller drivers
#
# CONFIG_VMD is not set

#
# Cadence-based PCIe controllers
#
# end of Cadence-based PCIe controllers

#
# DesignWare-based PCIe controllers
#
# CONFIG_PCI_MESON is not set
# CONFIG_PCIE_DW_PLAT_HOST is not set
# end of DesignWare-based PCIe controllers

#
# Mobiveil-based PCIe controllers
#
# end of Mobiveil-based PCIe controllers
# end of PCI controller drivers

#
# PCI Endpoint
#
# CONFIG_PCI_ENDPOINT is not set
# end of PCI Endpoint

#
# PCI switch controller drivers
#
# CONFIG_PCI_SW_SWITCHTEC is not set
# end of PCI switch controller drivers

# CONFIG_CXL_BUS is not set
# CONFIG_PCCARD is not set
# CONFIG_RAPIDIO is not set

#
# Generic Driver Options
#
CONFIG_AUXILIARY_BUS=y
# CONFIG_UEVENT_HELPER is not set
CONFIG_DEVTMPFS=y
# CONFIG_DEVTMPFS_MOUNT is not set
# CONFIG_DEVTMPFS_SAFE is not set
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y

#
# Firmware loader
#
CONFIG_FW_LOADER=y
CONFIG_FW_LOADER_PAGED_BUF=y
CONFIG_FW_LOADER_SYSFS=y
CONFIG_EXTRA_FIRMWARE=""
CONFIG_FW_LOADER_USER_HELPER=y
# CONFIG_FW_LOADER_USER_HELPER_FALLBACK is not set
# CONFIG_FW_LOADER_COMPRESS is not set
CONFIG_FW_CACHE=y
# CONFIG_FW_UPLOAD is not set
# end of Firmware loader

CONFIG_ALLOW_DEV_COREDUMP=y
# CONFIG_DEBUG_DRIVER is not set
CONFIG_DEBUG_DEVRES=y
# CONFIG_DEBUG_TEST_DRIVER_REMOVE is not set
# CONFIG_TEST_ASYNC_DRIVER_PROBE is not set
CONFIG_GENERIC_CPU_AUTOPROBE=y
CONFIG_GENERIC_CPU_VULNERABILITIES=y
CONFIG_DMA_SHARED_BUFFER=y
# CONFIG_DMA_FENCE_TRACE is not set
# CONFIG_FW_DEVLINK_SYNC_STATE_TIMEOUT is not set
# end of Generic Driver Options

#
# Bus devices
#
# CONFIG_MHI_BUS is not set
# CONFIG_MHI_BUS_EP is not set
# end of Bus devices

#
# Cache Drivers
#
# end of Cache Drivers

CONFIG_CONNECTOR=y
CONFIG_PROC_EVENTS=y

#
# Firmware Drivers
#

#
# ARM System Control and Management Interface Protocol
#
# end of ARM System Control and Management Interface Protocol

# CONFIG_EDD is not set
CONFIG_FIRMWARE_MEMMAP=y
CONFIG_DMIID=y
# CONFIG_DMI_SYSFS is not set
CONFIG_DMI_SCAN_MACHINE_NON_EFI_FALLBACK=y
# CONFIG_ISCSI_IBFT is not set
# CONFIG_FW_CFG_SYSFS is not set
CONFIG_SYSFB=y
# CONFIG_SYSFB_SIMPLEFB is not set
CONFIG_GOOGLE_FIRMWARE=y
CONFIG_GOOGLE_SMI=y
# CONFIG_GOOGLE_COREBOOT_TABLE is not set
# CONFIG_GOOGLE_MEMCONSOLE_X86_LEGACY is not set

#
# EFI (Extensible Firmware Interface) Support
#
CONFIG_EFI_ESRT=y
CONFIG_EFI_VARS_PSTORE=y
# CONFIG_EFI_VARS_PSTORE_DEFAULT_DISABLE is not set
CONFIG_EFI_DXE_MEM_ATTRIBUTES=y
CONFIG_EFI_RUNTIME_WRAPPERS=y
# CONFIG_EFI_BOOTLOADER_CONTROL is not set
# CONFIG_EFI_CAPSULE_LOADER is not set
# CONFIG_EFI_TEST is not set
# CONFIG_APPLE_PROPERTIES is not set
# CONFIG_RESET_ATTACK_MITIGATION is not set
# CONFIG_EFI_RCI2_TABLE is not set
# CONFIG_EFI_DISABLE_PCI_DMA is not set
CONFIG_EFI_EARLYCON=y
# CONFIG_EFI_CUSTOM_SSDT_OVERLAYS is not set
# CONFIG_EFI_DISABLE_RUNTIME is not set
# CONFIG_EFI_COCO_SECRET is not set
CONFIG_UNACCEPTED_MEMORY=y
# end of EFI (Extensible Firmware Interface) Support

CONFIG_UEFI_CPER=y
CONFIG_UEFI_CPER_X86=y

#
# Qualcomm firmware drivers
#
# end of Qualcomm firmware drivers

#
# Tegra firmware driver
#
# end of Tegra firmware driver
# end of Firmware Drivers

# CONFIG_GNSS is not set
CONFIG_MTD=y
# CONFIG_MTD_TESTS is not set

#
# Partition parsers
#
CONFIG_MTD_CMDLINE_PARTS=y
# CONFIG_MTD_REDBOOT_PARTS is not set
# end of Partition parsers

#
# User Modules And Translation Layers
#
CONFIG_MTD_BLKDEVS=y
# CONFIG_MTD_BLOCK is not set
CONFIG_MTD_BLOCK_RO=y

#
# Note that in some cases UBI block is preferred. See MTD_UBI_BLOCK.
#
# CONFIG_FTL is not set
# CONFIG_NFTL is not set
# CONFIG_INFTL is not set
# CONFIG_RFD_FTL is not set
# CONFIG_SSFDC is not set
# CONFIG_SM_FTL is not set
CONFIG_MTD_OOPS=y
# CONFIG_MTD_SWAP is not set
# CONFIG_MTD_PARTITIONED_MASTER is not set

#
# RAM/ROM/Flash chip drivers
#
CONFIG_MTD_CFI=y
# CONFIG_MTD_JEDECPROBE is not set
CONFIG_MTD_GEN_PROBE=y
# CONFIG_MTD_CFI_ADV_OPTIONS is not set
CONFIG_MTD_MAP_BANK_WIDTH_1=y
CONFIG_MTD_MAP_BANK_WIDTH_2=y
CONFIG_MTD_MAP_BANK_WIDTH_4=y
CONFIG_MTD_CFI_I1=y
CONFIG_MTD_CFI_I2=y
CONFIG_MTD_CFI_INTELEXT=y
CONFIG_MTD_CFI_AMDSTD=y
# CONFIG_MTD_CFI_STAA is not set
CONFIG_MTD_CFI_UTIL=y
# CONFIG_MTD_RAM is not set
# CONFIG_MTD_ROM is not set
# CONFIG_MTD_ABSENT is not set
# end of RAM/ROM/Flash chip drivers

#
# Mapping drivers for chip access
#
# CONFIG_MTD_COMPLEX_MAPPINGS is not set
# CONFIG_MTD_PHYSMAP is not set
# CONFIG_MTD_INTEL_VR_NOR is not set
# CONFIG_MTD_PLATRAM is not set
# end of Mapping drivers for chip access

#
# Self-contained MTD device drivers
#
# CONFIG_MTD_PMC551 is not set
# CONFIG_MTD_DATAFLASH is not set
# CONFIG_MTD_MCHP23K256 is not set
# CONFIG_MTD_MCHP48L640 is not set
# CONFIG_MTD_SST25L is not set
# CONFIG_MTD_SLRAM is not set
# CONFIG_MTD_PHRAM is not set
# CONFIG_MTD_MTDRAM is not set
# CONFIG_MTD_BLOCK2MTD is not set

#
# Disk-On-Chip Device Drivers
#
# CONFIG_MTD_DOCG3 is not set
# end of Self-contained MTD device drivers

#
# NAND
#
# CONFIG_MTD_ONENAND is not set
# CONFIG_MTD_RAW_NAND is not set
# CONFIG_MTD_SPI_NAND is not set

#
# ECC engine support
#
# CONFIG_MTD_NAND_ECC_SW_HAMMING is not set
# CONFIG_MTD_NAND_ECC_SW_BCH is not set
# CONFIG_MTD_NAND_ECC_MXIC is not set
# end of ECC engine support
# end of NAND

#
# LPDDR & LPDDR2 PCM memory drivers
#
# CONFIG_MTD_LPDDR is not set
# end of LPDDR & LPDDR2 PCM memory drivers

CONFIG_MTD_SPI_NOR=y
CONFIG_MTD_SPI_NOR_USE_4K_SECTORS=y
# CONFIG_MTD_SPI_NOR_SWP_DISABLE is not set
CONFIG_MTD_SPI_NOR_SWP_DISABLE_ON_VOLATILE=y
# CONFIG_MTD_SPI_NOR_SWP_KEEP is not set
# CONFIG_MTD_UBI is not set
# CONFIG_MTD_HYPERBUS is not set
# CONFIG_OF is not set
CONFIG_ARCH_MIGHT_HAVE_PC_PARPORT=y
# CONFIG_PARPORT is not set
CONFIG_PNP=y
CONFIG_PNP_DEBUG_MESSAGES=y

#
# Protocols
#
CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
# CONFIG_BLK_DEV_NULL_BLK is not set
CONFIG_CDROM=m
# CONFIG_BLK_DEV_PCIESSD_MTIP32XX is not set
# CONFIG_ZRAM is not set
CONFIG_BLK_DEV_LOOP=m
CONFIG_BLK_DEV_LOOP_MIN_COUNT=8
# CONFIG_BLK_DEV_DRBD is not set
# CONFIG_BLK_DEV_NBD is not set
CONFIG_BLK_DEV_RAM=y
CONFIG_BLK_DEV_RAM_COUNT=1
CONFIG_BLK_DEV_RAM_SIZE=335544320
# CONFIG_CDROM_PKTCDVD is not set
# CONFIG_ATA_OVER_ETH is not set
CONFIG_VIRTIO_BLK=m
# CONFIG_BLK_DEV_RBD is not set
# CONFIG_BLK_DEV_UBLK is not set

#
# NVME Support
#
CONFIG_NVME_CORE=y
CONFIG_BLK_DEV_NVME=y
# CONFIG_NVME_MULTIPATH is not set
# CONFIG_NVME_VERBOSE_ERRORS is not set
# CONFIG_NVME_HWMON is not set
# CONFIG_NVME_RDMA is not set
# CONFIG_NVME_FC is not set
# CONFIG_NVME_TCP is not set
# CONFIG_NVME_HOST_AUTH is not set
# CONFIG_NVME_TARGET is not set
# end of NVME Support

#
# Misc devices
#
# CONFIG_AD525X_DPOT is not set
# CONFIG_DUMMY_IRQ is not set
# CONFIG_IBM_ASM is not set
# CONFIG_PHANTOM is not set
# CONFIG_TIFM_CORE is not set
# CONFIG_ICS932S401 is not set
# CONFIG_ENCLOSURE_SERVICES is not set
# CONFIG_HP_ILO is not set
# CONFIG_APDS9802ALS is not set
# CONFIG_ISL29003 is not set
# CONFIG_ISL29020 is not set
# CONFIG_SENSORS_TSL2550 is not set
# CONFIG_SENSORS_BH1770 is not set
# CONFIG_SENSORS_APDS990X is not set
# CONFIG_HMC6352 is not set
# CONFIG_DS1682 is not set
# CONFIG_LATTICE_ECP3_CONFIG is not set
# CONFIG_SRAM is not set
# CONFIG_DW_XDATA_PCIE is not set
# CONFIG_PCI_ENDPOINT_TEST is not set
# CONFIG_XILINX_SDFEC is not set
# CONFIG_C2PORT is not set

#
# EEPROM support
#
# CONFIG_EEPROM_AT24 is not set
# CONFIG_EEPROM_AT25 is not set
# CONFIG_EEPROM_MAX6875 is not set
# CONFIG_EEPROM_93CX6 is not set
# CONFIG_EEPROM_93XX46 is not set
# CONFIG_EEPROM_IDT_89HPESX is not set
# CONFIG_EEPROM_EE1004 is not set
# end of EEPROM support

# CONFIG_CB710_CORE is not set

#
# Texas Instruments shared transport line discipline
#
# CONFIG_TI_ST is not set
# end of Texas Instruments shared transport line discipline

# CONFIG_SENSORS_LIS3_I2C is not set
# CONFIG_ALTERA_STAPL is not set
# CONFIG_INTEL_MEI is not set
# CONFIG_INTEL_MEI_ME is not set
# CONFIG_INTEL_MEI_TXE is not set
# CONFIG_VMWARE_VMCI is not set
# CONFIG_GENWQE is not set
# CONFIG_ECHO is not set
# CONFIG_BCM_VK is not set
# CONFIG_MISC_ALCOR_PCI is not set
# CONFIG_MISC_RTSX_PCI is not set
# CONFIG_MISC_RTSX_USB is not set
# CONFIG_UACCE is not set
# CONFIG_PVPANIC is not set
# CONFIG_GP_PCI1XXXX is not set
# end of Misc devices

#
# SCSI device support
#
CONFIG_SCSI_MOD=y
CONFIG_RAID_ATTRS=m
CONFIG_SCSI_COMMON=y
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
CONFIG_SCSI_NETLINK=y
CONFIG_SCSI_PROC_FS=y

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
CONFIG_CHR_DEV_ST=m
CONFIG_BLK_DEV_SR=m
CONFIG_CHR_DEV_SG=y
CONFIG_BLK_DEV_BSG=y
# CONFIG_CHR_DEV_SCH is not set
CONFIG_SCSI_CONSTANTS=y
CONFIG_SCSI_LOGGING=y
# CONFIG_SCSI_SCAN_ASYNC is not set

#
# SCSI Transports
#
CONFIG_SCSI_SPI_ATTRS=m
CONFIG_SCSI_FC_ATTRS=m
CONFIG_SCSI_ISCSI_ATTRS=y
CONFIG_SCSI_SAS_ATTRS=m
CONFIG_SCSI_SAS_LIBSAS=m
CONFIG_SCSI_SAS_ATA=y
CONFIG_SCSI_SAS_HOST_SMP=y
# CONFIG_SCSI_SRP_ATTRS is not set
# end of SCSI Transports

CONFIG_SCSI_LOWLEVEL=y
CONFIG_ISCSI_TCP=y
# CONFIG_ISCSI_BOOT_SYSFS is not set
# CONFIG_SCSI_CXGB3_ISCSI is not set
# CONFIG_SCSI_CXGB4_ISCSI is not set
# CONFIG_SCSI_BNX2_ISCSI is not set
# CONFIG_BE2ISCSI is not set
# CONFIG_BLK_DEV_3W_XXXX_RAID is not set
# CONFIG_SCSI_HPSA is not set
# CONFIG_SCSI_3W_9XXX is not set
# CONFIG_SCSI_3W_SAS is not set
# CONFIG_SCSI_ACARD is not set
# CONFIG_SCSI_AACRAID is not set
CONFIG_SCSI_AIC7XXX=m
CONFIG_AIC7XXX_CMDS_PER_DEVICE=253
CONFIG_AIC7XXX_RESET_DELAY_MS=15000
CONFIG_AIC7XXX_DEBUG_ENABLE=y
CONFIG_AIC7XXX_DEBUG_MASK=0
CONFIG_AIC7XXX_REG_PRETTY_PRINT=y
# CONFIG_SCSI_AIC79XX is not set
# CONFIG_SCSI_AIC94XX is not set
# CONFIG_SCSI_MVSAS is not set
# CONFIG_SCSI_MVUMI is not set
# CONFIG_SCSI_ADVANSYS is not set
# CONFIG_SCSI_ARCMSR is not set
# CONFIG_SCSI_ESAS2R is not set
CONFIG_MEGARAID_NEWGEN=y
CONFIG_MEGARAID_MM=y
CONFIG_MEGARAID_MAILBOX=y
# CONFIG_MEGARAID_LEGACY is not set
CONFIG_MEGARAID_SAS=y
CONFIG_SCSI_MPT3SAS=m
CONFIG_SCSI_MPT2SAS_MAX_SGE=128
CONFIG_SCSI_MPT3SAS_MAX_SGE=128
CONFIG_SCSI_MPT2SAS=m
# CONFIG_SCSI_MPI3MR is not set
# CONFIG_SCSI_SMARTPQI is not set
# CONFIG_SCSI_HPTIOP is not set
# CONFIG_SCSI_BUSLOGIC is not set
# CONFIG_SCSI_MYRB is not set
# CONFIG_SCSI_MYRS is not set
# CONFIG_VMWARE_PVSCSI is not set
# CONFIG_LIBFC is not set
# CONFIG_SCSI_SNIC is not set
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_FDOMAIN_PCI is not set
# CONFIG_SCSI_ISCI is not set
# CONFIG_SCSI_IPS is not set
# CONFIG_SCSI_INITIO is not set
# CONFIG_SCSI_INIA100 is not set
# CONFIG_SCSI_STEX is not set
# CONFIG_SCSI_SYM53C8XX_2 is not set
# CONFIG_SCSI_IPR is not set
# CONFIG_SCSI_QLOGIC_1280 is not set
CONFIG_SCSI_QLA_FC=m
# CONFIG_SCSI_QLA_ISCSI is not set
# CONFIG_SCSI_LPFC is not set
# CONFIG_SCSI_DC395x is not set
# CONFIG_SCSI_AM53C974 is not set
# CONFIG_SCSI_WD719X is not set
# CONFIG_SCSI_DEBUG is not set
# CONFIG_SCSI_PMCRAID is not set
CONFIG_SCSI_PM8001=m
# CONFIG_SCSI_BFA_FC is not set
CONFIG_SCSI_VIRTIO=y
# CONFIG_SCSI_CHELSIO_FCOE is not set
# CONFIG_SCSI_DH is not set
# end of SCSI device support

CONFIG_ATA=y
CONFIG_SATA_HOST=y
CONFIG_PATA_TIMINGS=y
CONFIG_ATA_VERBOSE_ERROR=y
CONFIG_ATA_FORCE=y
CONFIG_ATA_ACPI=y
# CONFIG_SATA_ZPODD is not set
CONFIG_SATA_PMP=y

#
# Controllers with non-SFF native interface
#
CONFIG_SATA_AHCI=y
CONFIG_SATA_MOBILE_LPM_POLICY=0
# CONFIG_SATA_AHCI_PLATFORM is not set
# CONFIG_AHCI_DWC is not set
# CONFIG_SATA_INIC162X is not set
# CONFIG_SATA_ACARD_AHCI is not set
CONFIG_SATA_SIL24=m
CONFIG_ATA_SFF=y

#
# SFF controllers with custom DMA interface
#
# CONFIG_PDC_ADMA is not set
# CONFIG_SATA_QSTOR is not set
# CONFIG_SATA_SX4 is not set
CONFIG_ATA_BMDMA=y

#
# SATA SFF controllers with BMDMA
#
CONFIG_ATA_PIIX=y
# CONFIG_SATA_DWC is not set
CONFIG_SATA_MV=m
# CONFIG_SATA_NV is not set
# CONFIG_SATA_PROMISE is not set
CONFIG_SATA_SIL=m
# CONFIG_SATA_SIS is not set
# CONFIG_SATA_SVW is not set
# CONFIG_SATA_ULI is not set
# CONFIG_SATA_VIA is not set
# CONFIG_SATA_VITESSE is not set

#
# PATA SFF controllers with BMDMA
#
# CONFIG_PATA_ALI is not set
# CONFIG_PATA_AMD is not set
# CONFIG_PATA_ARTOP is not set
# CONFIG_PATA_ATIIXP is not set
# CONFIG_PATA_ATP867X is not set
# CONFIG_PATA_CMD64X is not set
# CONFIG_PATA_CYPRESS is not set
# CONFIG_PATA_EFAR is not set
# CONFIG_PATA_HPT366 is not set
# CONFIG_PATA_HPT37X is not set
# CONFIG_PATA_HPT3X2N is not set
# CONFIG_PATA_HPT3X3 is not set
# CONFIG_PATA_IT8213 is not set
# CONFIG_PATA_IT821X is not set
# CONFIG_PATA_JMICRON is not set
# CONFIG_PATA_MARVELL is not set
# CONFIG_PATA_NETCELL is not set
# CONFIG_PATA_NINJA32 is not set
# CONFIG_PATA_NS87415 is not set
# CONFIG_PATA_OLDPIIX is not set
# CONFIG_PATA_OPTIDMA is not set
# CONFIG_PATA_PDC2027X is not set
# CONFIG_PATA_PDC_OLD is not set
# CONFIG_PATA_RADISYS is not set
# CONFIG_PATA_RDC is not set
# CONFIG_PATA_SCH is not set
# CONFIG_PATA_SERVERWORKS is not set
# CONFIG_PATA_SIL680 is not set
# CONFIG_PATA_SIS is not set
# CONFIG_PATA_TOSHIBA is not set
# CONFIG_PATA_TRIFLEX is not set
# CONFIG_PATA_VIA is not set
# CONFIG_PATA_WINBOND is not set

#
# PIO-only SFF controllers
#
# CONFIG_PATA_CMD640_PCI is not set
# CONFIG_PATA_MPIIX is not set
# CONFIG_PATA_NS87410 is not set
# CONFIG_PATA_OPTI is not set
# CONFIG_PATA_RZ1000 is not set

#
# Generic fallback / legacy drivers
#
# CONFIG_PATA_ACPI is not set
# CONFIG_ATA_GENERIC is not set
# CONFIG_PATA_LEGACY is not set
CONFIG_MD=y
CONFIG_BLK_DEV_MD=y
CONFIG_MD_AUTODETECT=y
CONFIG_MD_BITMAP_FILE=y
CONFIG_MD_LINEAR=m
CONFIG_MD_RAID0=m
CONFIG_MD_RAID1=y
# CONFIG_MD_RAID10 is not set
# CONFIG_MD_RAID456 is not set
# CONFIG_MD_MULTIPATH is not set
# CONFIG_MD_FAULTY is not set
# CONFIG_BCACHE is not set
CONFIG_BLK_DEV_DM_BUILTIN=y
CONFIG_BLK_DEV_DM=y
# CONFIG_DM_DEBUG is not set
CONFIG_DM_BUFIO=y
# CONFIG_DM_DEBUG_BLOCK_MANAGER_LOCKING is not set
CONFIG_DM_BIO_PRISON=y
CONFIG_DM_PERSISTENT_DATA=y
# CONFIG_DM_UNSTRIPED is not set
CONFIG_DM_CRYPT=y
# CONFIG_DM_SNAPSHOT is not set
CONFIG_DM_THIN_PROVISIONING=y
# CONFIG_DM_CACHE is not set
# CONFIG_DM_WRITECACHE is not set
# CONFIG_DM_EBS is not set
# CONFIG_DM_ERA is not set
# CONFIG_DM_CLONE is not set
CONFIG_DM_MIRROR=y
# CONFIG_DM_LOG_USERSPACE is not set
# CONFIG_DM_RAID is not set
CONFIG_DM_ZERO=y
CONFIG_DM_MULTIPATH=y
CONFIG_DM_MULTIPATH_QL=y
CONFIG_DM_MULTIPATH_ST=y
CONFIG_DM_MULTIPATH_HST=y
# CONFIG_DM_MULTIPATH_IOA is not set
# CONFIG_DM_DELAY is not set
# CONFIG_DM_DUST is not set
# CONFIG_DM_INIT is not set
# CONFIG_DM_UEVENT is not set
# CONFIG_DM_FLAKEY is not set
CONFIG_DM_VERITY=y
# CONFIG_DM_VERITY_VERIFY_ROOTHASH_SIG is not set
# CONFIG_DM_VERITY_FEC is not set
# CONFIG_DM_SWITCH is not set
# CONFIG_DM_LOG_WRITES is not set
# CONFIG_DM_INTEGRITY is not set
# CONFIG_DM_AUDIT is not set
# CONFIG_TARGET_CORE is not set
# CONFIG_FUSION is not set

#
# IEEE 1394 (FireWire) support
#
CONFIG_FIREWIRE=m
CONFIG_FIREWIRE_OHCI=m
# CONFIG_FIREWIRE_SBP2 is not set
# CONFIG_FIREWIRE_NET is not set
# CONFIG_FIREWIRE_NOSY is not set
# end of IEEE 1394 (FireWire) support

# CONFIG_MACINTOSH_DRIVERS is not set
CONFIG_NETDEVICES=y
CONFIG_MII=m
CONFIG_NET_CORE=y
CONFIG_BONDING=y
CONFIG_DUMMY=m
# CONFIG_WIREGUARD is not set
# CONFIG_EQUALIZER is not set
# CONFIG_NET_FC is not set
CONFIG_IFB=m
CONFIG_NET_TEAM=y
CONFIG_NET_TEAM_MODE_BROADCAST=y
CONFIG_NET_TEAM_MODE_ROUNDROBIN=y
CONFIG_NET_TEAM_MODE_RANDOM=y
CONFIG_NET_TEAM_MODE_ACTIVEBACKUP=y
CONFIG_NET_TEAM_MODE_LOADBALANCE=y
CONFIG_MACVLAN=m
CONFIG_MACVTAP=m
CONFIG_IPVLAN_L3S=y
CONFIG_IPVLAN=y
CONFIG_IPVTAP=y
# CONFIG_VXLAN is not set
# CONFIG_GENEVE is not set
# CONFIG_BAREUDP is not set
# CONFIG_GTP is not set
# CONFIG_AMT is not set
# CONFIG_MACSEC is not set
CONFIG_NETCONSOLE=m
# CONFIG_NETCONSOLE_DYNAMIC is not set
# CONFIG_NETCONSOLE_EXTENDED_LOG is not set
CONFIG_NETPOLL=y
CONFIG_NET_POLL_CONTROLLER=y
CONFIG_TUN=y
CONFIG_TAP=y
# CONFIG_TUN_VNET_CROSS_LE is not set
CONFIG_VETH=y
CONFIG_VIRTIO_NET=m
# CONFIG_NLMON is not set
# CONFIG_NETKIT is not set
# CONFIG_NET_VRF is not set
# CONFIG_VSOCKMON is not set
# CONFIG_ARCNET is not set
CONFIG_ETHERNET=y
CONFIG_MDIO=m
CONFIG_NET_VENDOR_3COM=y
# CONFIG_VORTEX is not set
# CONFIG_TYPHOON is not set
CONFIG_NET_VENDOR_ADAPTEC=y
# CONFIG_ADAPTEC_STARFIRE is not set
CONFIG_NET_VENDOR_AGERE=y
# CONFIG_ET131X is not set
CONFIG_NET_VENDOR_ALACRITECH=y
# CONFIG_SLICOSS is not set
CONFIG_NET_VENDOR_ALTEON=y
# CONFIG_ACENIC is not set
# CONFIG_ALTERA_TSE is not set
CONFIG_NET_VENDOR_AMAZON=y
# CONFIG_ENA_ETHERNET is not set
CONFIG_NET_VENDOR_AMD=y
# CONFIG_AMD8111_ETH is not set
# CONFIG_PCNET32 is not set
# CONFIG_AMD_XGBE is not set
# CONFIG_PDS_CORE is not set
CONFIG_NET_VENDOR_AQUANTIA=y
# CONFIG_AQTION is not set
CONFIG_NET_VENDOR_ARC=y
CONFIG_NET_VENDOR_ASIX=y
# CONFIG_SPI_AX88796C is not set
CONFIG_NET_VENDOR_ATHEROS=y
# CONFIG_ATL2 is not set
# CONFIG_ATL1 is not set
# CONFIG_ATL1E is not set
# CONFIG_ATL1C is not set
# CONFIG_ALX is not set
# CONFIG_CX_ECAT is not set
CONFIG_NET_VENDOR_BROADCOM=y
# CONFIG_B44 is not set
# CONFIG_BCMGENET is not set
CONFIG_BNX2=m
# CONFIG_CNIC is not set
CONFIG_TIGON3=m
CONFIG_TIGON3_HWMON=y
CONFIG_BNX2X=m
CONFIG_BNX2X_SRIOV=y
# CONFIG_SYSTEMPORT is not set
CONFIG_BNXT=m
CONFIG_BNXT_SRIOV=y
CONFIG_BNXT_FLOWER_OFFLOAD=y
# CONFIG_BNXT_DCB is not set
CONFIG_BNXT_HWMON=y
CONFIG_NET_VENDOR_CADENCE=y
CONFIG_NET_VENDOR_CAVIUM=y
# CONFIG_THUNDER_NIC_PF is not set
# CONFIG_THUNDER_NIC_VF is not set
# CONFIG_THUNDER_NIC_BGX is not set
# CONFIG_THUNDER_NIC_RGX is not set
# CONFIG_CAVIUM_PTP is not set
# CONFIG_LIQUIDIO is not set
# CONFIG_LIQUIDIO_VF is not set
CONFIG_NET_VENDOR_CHELSIO=y
# CONFIG_CHELSIO_T1 is not set
CONFIG_CHELSIO_T3=m
# CONFIG_CHELSIO_T4 is not set
# CONFIG_CHELSIO_T4VF is not set
CONFIG_NET_VENDOR_CISCO=y
# CONFIG_ENIC is not set
CONFIG_NET_VENDOR_CORTINA=y
CONFIG_NET_VENDOR_DAVICOM=y
# CONFIG_DM9051 is not set
# CONFIG_DNET is not set
CONFIG_NET_VENDOR_DEC=y
CONFIG_NET_TULIP=y
CONFIG_DE2104X=m
CONFIG_DE2104X_DSL=0
# CONFIG_TULIP is not set
# CONFIG_WINBOND_840 is not set
# CONFIG_DM9102 is not set
# CONFIG_ULI526X is not set
CONFIG_NET_VENDOR_DLINK=y
# CONFIG_DL2K is not set
# CONFIG_SUNDANCE is not set
CONFIG_NET_VENDOR_EMULEX=y
# CONFIG_BE2NET is not set
CONFIG_NET_VENDOR_ENGLEDER=y
# CONFIG_TSNEP is not set
CONFIG_NET_VENDOR_EZCHIP=y
CONFIG_NET_VENDOR_FUNGIBLE=y
# CONFIG_FUN_ETH is not set
CONFIG_NET_VENDOR_GOOGLE=y
CONFIG_GVE=m
CONFIG_NET_VENDOR_HUAWEI=y
# CONFIG_HINIC is not set
CONFIG_NET_VENDOR_I825XX=y
CONFIG_NET_VENDOR_INTEL=y
# CONFIG_E100 is not set
# CONFIG_E1000 is not set
CONFIG_E1000E=m
CONFIG_E1000E_HWTS=y
# CONFIG_IGB is not set
# CONFIG_IGBVF is not set
# CONFIG_IXGBE is not set
# CONFIG_IXGBEVF is not set
# CONFIG_I40E is not set
# CONFIG_I40EVF is not set
# CONFIG_ICE is not set
# CONFIG_FM10K is not set
# CONFIG_IGC is not set
# CONFIG_IDPF is not set
# CONFIG_JME is not set
CONFIG_NET_VENDOR_ADI=y
CONFIG_NET_VENDOR_LITEX=y
CONFIG_NET_VENDOR_MARVELL=y
# CONFIG_MVMDIO is not set
# CONFIG_SKGE is not set
CONFIG_SKY2=m
# CONFIG_SKY2_DEBUG is not set
# CONFIG_OCTEON_EP is not set
CONFIG_NET_VENDOR_MELLANOX=y
CONFIG_MLX4_EN=m
# CONFIG_MLX4_EN_DCB is not set
CONFIG_MLX4_CORE=m
CONFIG_MLX4_DEBUG=y
CONFIG_MLX4_CORE_GEN2=y
CONFIG_MLX5_CORE=m
# CONFIG_MLX5_FPGA is not set
CONFIG_MLX5_CORE_EN=y
CONFIG_MLX5_EN_ARFS=y
CONFIG_MLX5_EN_RXNFC=y
CONFIG_MLX5_MPFS=y
CONFIG_MLX5_CORE_EN_DCB=y
# CONFIG_MLX5_CORE_IPOIB is not set
# CONFIG_MLX5_SF is not set
# CONFIG_MLX5_DPLL is not set
# CONFIG_MLXSW_CORE is not set
# CONFIG_MLXFW is not set
CONFIG_NET_VENDOR_MICREL=y
# CONFIG_KS8842 is not set
# CONFIG_KS8851 is not set
# CONFIG_KS8851_MLL is not set
# CONFIG_KSZ884X_PCI is not set
CONFIG_NET_VENDOR_MICROCHIP=y
# CONFIG_ENC28J60 is not set
# CONFIG_ENCX24J600 is not set
# CONFIG_LAN743X is not set
# CONFIG_VCAP is not set
CONFIG_NET_VENDOR_MICROSEMI=y
CONFIG_NET_VENDOR_MICROSOFT=y
CONFIG_NET_VENDOR_MYRI=y
# CONFIG_MYRI10GE is not set
# CONFIG_FEALNX is not set
CONFIG_NET_VENDOR_NI=y
# CONFIG_NI_XGE_MANAGEMENT_ENET is not set
CONFIG_NET_VENDOR_NATSEMI=y
# CONFIG_NATSEMI is not set
# CONFIG_NS83820 is not set
CONFIG_NET_VENDOR_NETERION=y
# CONFIG_S2IO is not set
CONFIG_NET_VENDOR_NETRONOME=y
# CONFIG_NFP is not set
CONFIG_NET_VENDOR_8390=y
# CONFIG_NE2K_PCI is not set
CONFIG_NET_VENDOR_NVIDIA=y
# CONFIG_FORCEDETH is not set
CONFIG_NET_VENDOR_OKI=y
# CONFIG_ETHOC is not set
CONFIG_NET_VENDOR_PACKET_ENGINES=y
# CONFIG_HAMACHI is not set
# CONFIG_YELLOWFIN is not set
CONFIG_NET_VENDOR_PENSANDO=y
# CONFIG_IONIC is not set
CONFIG_NET_VENDOR_QLOGIC=y
# CONFIG_QLA3XXX is not set
# CONFIG_QLCNIC is not set
# CONFIG_NETXEN_NIC is not set
# CONFIG_QED is not set
CONFIG_NET_VENDOR_BROCADE=y
# CONFIG_BNA is not set
CONFIG_NET_VENDOR_QUALCOMM=y
# CONFIG_QCOM_EMAC is not set
# CONFIG_RMNET is not set
CONFIG_NET_VENDOR_RDC=y
# CONFIG_R6040 is not set
CONFIG_NET_VENDOR_REALTEK=y
# CONFIG_8139CP is not set
# CONFIG_8139TOO is not set
# CONFIG_R8169 is not set
CONFIG_NET_VENDOR_RENESAS=y
CONFIG_NET_VENDOR_ROCKER=y
CONFIG_NET_VENDOR_SAMSUNG=y
# CONFIG_SXGBE_ETH is not set
CONFIG_NET_VENDOR_SEEQ=y
CONFIG_NET_VENDOR_SILAN=y
# CONFIG_SC92031 is not set
CONFIG_NET_VENDOR_SIS=y
# CONFIG_SIS900 is not set
# CONFIG_SIS190 is not set
CONFIG_NET_VENDOR_SOLARFLARE=y
# CONFIG_SFC is not set
# CONFIG_SFC_FALCON is not set
# CONFIG_SFC_SIENA is not set
CONFIG_NET_VENDOR_SMSC=y
# CONFIG_EPIC100 is not set
# CONFIG_SMSC911X is not set
# CONFIG_SMSC9420 is not set
CONFIG_NET_VENDOR_SOCIONEXT=y
CONFIG_NET_VENDOR_STMICRO=y
# CONFIG_STMMAC_ETH is not set
CONFIG_NET_VENDOR_SUN=y
# CONFIG_HAPPYMEAL is not set
# CONFIG_SUNGEM is not set
# CONFIG_CASSINI is not set
# CONFIG_NIU is not set
CONFIG_NET_VENDOR_SYNOPSYS=y
# CONFIG_DWC_XLGMAC is not set
CONFIG_NET_VENDOR_TEHUTI=y
# CONFIG_TEHUTI is not set
CONFIG_NET_VENDOR_TI=y
# CONFIG_TI_CPSW_PHY_SEL is not set
# CONFIG_TLAN is not set
CONFIG_NET_VENDOR_VERTEXCOM=y
# CONFIG_MSE102X is not set
CONFIG_NET_VENDOR_VIA=y
# CONFIG_VIA_RHINE is not set
# CONFIG_VIA_VELOCITY is not set
CONFIG_NET_VENDOR_WANGXUN=y
# CONFIG_NGBE is not set
CONFIG_NET_VENDOR_WIZNET=y
# CONFIG_WIZNET_W5100 is not set
# CONFIG_WIZNET_W5300 is not set
CONFIG_NET_VENDOR_XILINX=y
# CONFIG_XILINX_EMACLITE is not set
# CONFIG_XILINX_AXI_EMAC is not set
# CONFIG_XILINX_LL_TEMAC is not set
# CONFIG_FDDI is not set
# CONFIG_HIPPI is not set
# CONFIG_NET_SB1000 is not set
CONFIG_PHYLIB=y
CONFIG_SWPHY=y
CONFIG_FIXED_PHY=y

#
# MII PHY device drivers
#
# CONFIG_AMD_PHY is not set
# CONFIG_ADIN_PHY is not set
# CONFIG_ADIN1100_PHY is not set
# CONFIG_AQUANTIA_PHY is not set
# CONFIG_AX88796B_PHY is not set
# CONFIG_BROADCOM_PHY is not set
# CONFIG_BCM54140_PHY is not set
# CONFIG_BCM7XXX_PHY is not set
# CONFIG_BCM84881_PHY is not set
# CONFIG_BCM87XX_PHY is not set
# CONFIG_CICADA_PHY is not set
# CONFIG_CORTINA_PHY is not set
# CONFIG_DAVICOM_PHY is not set
# CONFIG_ICPLUS_PHY is not set
# CONFIG_LXT_PHY is not set
# CONFIG_INTEL_XWAY_PHY is not set
# CONFIG_LSI_ET1011C_PHY is not set
# CONFIG_MARVELL_PHY is not set
# CONFIG_MARVELL_10G_PHY is not set
# CONFIG_MARVELL_88Q2XXX_PHY is not set
# CONFIG_MARVELL_88X2222_PHY is not set
# CONFIG_MAXLINEAR_GPHY is not set
# CONFIG_MEDIATEK_GE_PHY is not set
# CONFIG_MICREL_PHY is not set
# CONFIG_MICROCHIP_T1S_PHY is not set
# CONFIG_MICROCHIP_PHY is not set
# CONFIG_MICROCHIP_T1_PHY is not set
# CONFIG_MICROSEMI_PHY is not set
# CONFIG_MOTORCOMM_PHY is not set
# CONFIG_NATIONAL_PHY is not set
# CONFIG_NXP_CBTX_PHY is not set
# CONFIG_NXP_C45_TJA11XX_PHY is not set
# CONFIG_NXP_TJA11XX_PHY is not set
# CONFIG_NCN26000_PHY is not set
# CONFIG_QSEMI_PHY is not set
# CONFIG_REALTEK_PHY is not set
# CONFIG_RENESAS_PHY is not set
# CONFIG_ROCKCHIP_PHY is not set
# CONFIG_SMSC_PHY is not set
# CONFIG_STE10XP is not set
# CONFIG_TERANETICS_PHY is not set
# CONFIG_DP83822_PHY is not set
# CONFIG_DP83TC811_PHY is not set
# CONFIG_DP83848_PHY is not set
# CONFIG_DP83867_PHY is not set
# CONFIG_DP83869_PHY is not set
# CONFIG_DP83TD510_PHY is not set
# CONFIG_VITESSE_PHY is not set
# CONFIG_XILINX_GMII2RGMII is not set
# CONFIG_MICREL_KS8995MA is not set
# CONFIG_PSE_CONTROLLER is not set
CONFIG_MDIO_DEVICE=y
CONFIG_MDIO_BUS=y
CONFIG_FWNODE_MDIO=y
CONFIG_ACPI_MDIO=y
CONFIG_MDIO_DEVRES=y
# CONFIG_MDIO_BITBANG is not set
# CONFIG_MDIO_BCM_UNIMAC is not set
# CONFIG_MDIO_MVUSB is not set
# CONFIG_MDIO_THUNDER is not set

#
# MDIO Multiplexers
#

#
# PCS device drivers
#
# end of PCS device drivers

# CONFIG_PPP is not set
# CONFIG_SLIP is not set
CONFIG_USB_NET_DRIVERS=y
# CONFIG_USB_CATC is not set
# CONFIG_USB_KAWETH is not set
# CONFIG_USB_PEGASUS is not set
# CONFIG_USB_RTL8150 is not set
# CONFIG_USB_RTL8152 is not set
# CONFIG_USB_LAN78XX is not set
CONFIG_USB_USBNET=m
# CONFIG_USB_NET_AX8817X is not set
# CONFIG_USB_NET_AX88179_178A is not set
CONFIG_USB_NET_CDCETHER=m
CONFIG_USB_NET_CDC_EEM=m
CONFIG_USB_NET_CDC_NCM=m
# CONFIG_USB_NET_HUAWEI_CDC_NCM is not set
# CONFIG_USB_NET_CDC_MBIM is not set
# CONFIG_USB_NET_DM9601 is not set
# CONFIG_USB_NET_SR9700 is not set
# CONFIG_USB_NET_SR9800 is not set
# CONFIG_USB_NET_SMSC75XX is not set
# CONFIG_USB_NET_SMSC95XX is not set
# CONFIG_USB_NET_GL620A is not set
# CONFIG_USB_NET_NET1080 is not set
# CONFIG_USB_NET_PLUSB is not set
# CONFIG_USB_NET_MCS7830 is not set
# CONFIG_USB_NET_RNDIS_HOST is not set
CONFIG_USB_NET_CDC_SUBSET=m
# CONFIG_USB_ALI_M5632 is not set
# CONFIG_USB_AN2720 is not set
# CONFIG_USB_BELKIN is not set
# CONFIG_USB_ARMLINUX is not set
# CONFIG_USB_EPSON2888 is not set
# CONFIG_USB_KC2190 is not set
# CONFIG_USB_NET_ZAURUS is not set
# CONFIG_USB_NET_CX82310_ETH is not set
# CONFIG_USB_NET_KALMIA is not set
# CONFIG_USB_NET_QMI_WWAN is not set
# CONFIG_USB_NET_INT51X1 is not set
# CONFIG_USB_IPHETH is not set
# CONFIG_USB_SIERRA_NET is not set
# CONFIG_USB_VL600 is not set
# CONFIG_USB_NET_CH9200 is not set
# CONFIG_USB_NET_AQC111 is not set
CONFIG_USB_RTL8153_ECM=m
# CONFIG_WLAN is not set
# CONFIG_WAN is not set

#
# Wireless WAN
#
# CONFIG_WWAN is not set
# end of Wireless WAN

# CONFIG_VMXNET3 is not set
# CONFIG_FUJITSU_ES is not set
# CONFIG_NETDEVSIM is not set
CONFIG_NET_FAILOVER=m
# CONFIG_ISDN is not set

#
# Input device support
#
CONFIG_INPUT=y
# CONFIG_INPUT_FF_MEMLESS is not set
# CONFIG_INPUT_SPARSEKMAP is not set
# CONFIG_INPUT_MATRIXKMAP is not set
CONFIG_INPUT_VIVALDIFMAP=y

#
# Userland interfaces
#
# CONFIG_INPUT_MOUSEDEV is not set
# CONFIG_INPUT_JOYDEV is not set
CONFIG_INPUT_EVDEV=m
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
# CONFIG_KEYBOARD_ADP5588 is not set
# CONFIG_KEYBOARD_ADP5589 is not set
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_QT1050 is not set
# CONFIG_KEYBOARD_QT1070 is not set
# CONFIG_KEYBOARD_QT2160 is not set
# CONFIG_KEYBOARD_DLINK_DIR685 is not set
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_GPIO is not set
# CONFIG_KEYBOARD_GPIO_POLLED is not set
# CONFIG_KEYBOARD_TCA6416 is not set
# CONFIG_KEYBOARD_TCA8418 is not set
# CONFIG_KEYBOARD_MATRIX is not set
# CONFIG_KEYBOARD_LM8333 is not set
# CONFIG_KEYBOARD_MAX7359 is not set
# CONFIG_KEYBOARD_MCS is not set
# CONFIG_KEYBOARD_MPR121 is not set
# CONFIG_KEYBOARD_NEWTON is not set
# CONFIG_KEYBOARD_OPENCORES is not set
# CONFIG_KEYBOARD_STOWAWAY is not set
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_XTKBD is not set
# CONFIG_KEYBOARD_CYPRESS_SF is not set
# CONFIG_INPUT_MOUSE is not set
# CONFIG_INPUT_JOYSTICK is not set
# CONFIG_INPUT_TABLET is not set
# CONFIG_INPUT_TOUCHSCREEN is not set
CONFIG_INPUT_MISC=y
# CONFIG_INPUT_AD714X is not set
# CONFIG_INPUT_BMA150 is not set
# CONFIG_INPUT_E3X0_BUTTON is not set
CONFIG_INPUT_PCSPKR=m
# CONFIG_INPUT_MMA8450 is not set
# CONFIG_INPUT_GPIO_BEEPER is not set
# CONFIG_INPUT_GPIO_DECODER is not set
# CONFIG_INPUT_GPIO_VIBRA is not set
# CONFIG_INPUT_ATLAS_BTNS is not set
# CONFIG_INPUT_ATI_REMOTE2 is not set
# CONFIG_INPUT_KEYSPAN_REMOTE is not set
# CONFIG_INPUT_KXTJ9 is not set
# CONFIG_INPUT_POWERMATE is not set
# CONFIG_INPUT_YEALINK is not set
# CONFIG_INPUT_CM109 is not set
# CONFIG_INPUT_UINPUT is not set
# CONFIG_INPUT_PCF8574 is not set
# CONFIG_INPUT_GPIO_ROTARY_ENCODER is not set
# CONFIG_INPUT_DA7280_HAPTICS is not set
# CONFIG_INPUT_ADXL34X is not set
# CONFIG_INPUT_IQS269A is not set
# CONFIG_INPUT_IQS626A is not set
# CONFIG_INPUT_IQS7222 is not set
# CONFIG_INPUT_CMA3000 is not set
# CONFIG_INPUT_IDEAPAD_SLIDEBAR is not set
# CONFIG_INPUT_DRV260X_HAPTICS is not set
# CONFIG_INPUT_DRV2665_HAPTICS is not set
# CONFIG_INPUT_DRV2667_HAPTICS is not set
# CONFIG_RMI4_CORE is not set

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_ARCH_MIGHT_HAVE_PC_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=y
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
# CONFIG_SERIO_RAW is not set
# CONFIG_SERIO_ALTERA_PS2 is not set
# CONFIG_SERIO_PS2MULT is not set
# CONFIG_SERIO_ARC_PS2 is not set
# CONFIG_SERIO_GPIO_PS2 is not set
# CONFIG_USERIO is not set
# CONFIG_GAMEPORT is not set
# end of Hardware I/O ports
# end of Input device support

#
# Character devices
#
CONFIG_TTY=y
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_VT_CONSOLE_SLEEP=y
CONFIG_HW_CONSOLE=y
CONFIG_VT_HW_CONSOLE_BINDING=y
CONFIG_UNIX98_PTYS=y
CONFIG_LEGACY_PTYS=y
CONFIG_LEGACY_PTY_COUNT=256
CONFIG_LEGACY_TIOCSTI=y
CONFIG_LDISC_AUTOLOAD=y

#
# Serial drivers
#
CONFIG_SERIAL_EARLYCON=y
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_DEPRECATED_OPTIONS=y
CONFIG_SERIAL_8250_PNP=y
# CONFIG_SERIAL_8250_16550A_VARIANTS is not set
# CONFIG_SERIAL_8250_FINTEK is not set
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_SERIAL_8250_DMA=y
CONFIG_SERIAL_8250_PCILIB=y
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_EXAR=y
CONFIG_SERIAL_8250_NR_UARTS=32
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_MANY_PORTS=y
# CONFIG_SERIAL_8250_PCI1XXXX is not set
CONFIG_SERIAL_8250_SHARE_IRQ=y
CONFIG_SERIAL_8250_DETECT_IRQ=y
CONFIG_SERIAL_8250_RSA=y
CONFIG_SERIAL_8250_DWLIB=y
# CONFIG_SERIAL_8250_DW is not set
# CONFIG_SERIAL_8250_RT288X is not set
CONFIG_SERIAL_8250_LPSS=y
CONFIG_SERIAL_8250_MID=y
CONFIG_SERIAL_8250_PERICOM=y

#
# Non-8250 serial port support
#
# CONFIG_SERIAL_MAX3100 is not set
# CONFIG_SERIAL_MAX310X is not set
# CONFIG_SERIAL_UARTLITE is not set
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
# CONFIG_SERIAL_JSM is not set
# CONFIG_SERIAL_LANTIQ is not set
# CONFIG_SERIAL_SCCNXP is not set
# CONFIG_SERIAL_SC16IS7XX is not set
# CONFIG_SERIAL_ALTERA_JTAGUART is not set
# CONFIG_SERIAL_ALTERA_UART is not set
# CONFIG_SERIAL_ARC is not set
# CONFIG_SERIAL_RP2 is not set
# CONFIG_SERIAL_FSL_LPUART is not set
# CONFIG_SERIAL_FSL_LINFLEXUART is not set
# end of Serial drivers

CONFIG_SERIAL_MCTRL_GPIO=y
# CONFIG_SERIAL_NONSTANDARD is not set
# CONFIG_N_GSM is not set
# CONFIG_NOZOMI is not set
# CONFIG_NULL_TTY is not set
CONFIG_HVC_DRIVER=y
# CONFIG_SERIAL_DEV_BUS is not set
# CONFIG_TTY_PRINTK is not set
CONFIG_VIRTIO_CONSOLE=y
CONFIG_IPMI_HANDLER=y
CONFIG_IPMI_DMI_DECODE=y
CONFIG_IPMI_PLAT_DATA=y
# CONFIG_IPMI_PANIC_EVENT is not set
CONFIG_IPMI_DEVICE_INTERFACE=y
CONFIG_IPMI_SI=y
# CONFIG_IPMI_SSIF is not set
CONFIG_IPMI_WATCHDOG=y
CONFIG_IPMI_POWEROFF=m
CONFIG_HW_RANDOM=y
# CONFIG_HW_RANDOM_TIMERIOMEM is not set
CONFIG_HW_RANDOM_INTEL=y
CONFIG_HW_RANDOM_AMD=y
# CONFIG_HW_RANDOM_BA431 is not set
CONFIG_HW_RANDOM_VIA=y
CONFIG_HW_RANDOM_VIRTIO=y
# CONFIG_HW_RANDOM_XIPHERA is not set
# CONFIG_APPLICOM is not set
# CONFIG_MWAVE is not set
CONFIG_DEVMEM=y
CONFIG_NVRAM=m
CONFIG_DEVPORT=y
CONFIG_HPET=y
# CONFIG_HPET_MMAP is not set
# CONFIG_HANGCHECK_TIMER is not set
# CONFIG_TCG_TPM is not set
# CONFIG_TELCLOCK is not set
# CONFIG_XILLYBUS is not set
# CONFIG_XILLYUSB is not set
# end of Character devices

#
# I2C support
#
CONFIG_I2C=y
CONFIG_ACPI_I2C_OPREGION=y
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_COMPAT=y
CONFIG_I2C_CHARDEV=y
CONFIG_I2C_MUX=m

#
# Multiplexer I2C Chip support
#
# CONFIG_I2C_MUX_GPIO is not set
# CONFIG_I2C_MUX_LTC4306 is not set
CONFIG_I2C_MUX_PCA9541=m
CONFIG_I2C_MUX_PCA954x=m
# CONFIG_I2C_MUX_REG is not set
# CONFIG_I2C_MUX_MLXCPLD is not set
# end of Multiplexer I2C Chip support

CONFIG_I2C_HELPER_AUTO=y
CONFIG_I2C_SMBUS=y

#
# I2C Hardware Bus support
#

#
# PC SMBus host controller drivers
#
# CONFIG_I2C_ALI1535 is not set
# CONFIG_I2C_ALI1563 is not set
# CONFIG_I2C_ALI15X3 is not set
# CONFIG_I2C_AMD756 is not set
# CONFIG_I2C_AMD8111 is not set
# CONFIG_I2C_AMD_MP2 is not set
CONFIG_I2C_I801=y
# CONFIG_I2C_ISCH is not set
# CONFIG_I2C_ISMT is not set
CONFIG_I2C_PIIX4=m
# CONFIG_I2C_NFORCE2 is not set
# CONFIG_I2C_NVIDIA_GPU is not set
# CONFIG_I2C_SIS5595 is not set
# CONFIG_I2C_SIS630 is not set
# CONFIG_I2C_SIS96X is not set
# CONFIG_I2C_VIA is not set
# CONFIG_I2C_VIAPRO is not set

#
# ACPI drivers
#
# CONFIG_I2C_SCMI is not set

#
# I2C system bus drivers (mostly embedded / system-on-chip)
#
# CONFIG_I2C_CBUS_GPIO is not set
# CONFIG_I2C_DESIGNWARE_PCI is not set
# CONFIG_I2C_GPIO is not set
# CONFIG_I2C_OCORES is not set
# CONFIG_I2C_PCA_PLATFORM is not set
# CONFIG_I2C_SIMTEC is not set
# CONFIG_I2C_XILINX is not set

#
# External I2C/SMBus adapter drivers
#
# CONFIG_I2C_DIOLAN_U2C is not set
# CONFIG_I2C_CP2615 is not set
# CONFIG_I2C_PCI1XXXX is not set
# CONFIG_I2C_ROBOTFUZZ_OSIF is not set
# CONFIG_I2C_TAOS_EVM is not set
# CONFIG_I2C_TINY_USB is not set

#
# Other I2C/SMBus bus drivers
#
# CONFIG_I2C_MLXCPLD is not set
# CONFIG_I2C_VIRTIO is not set
# end of I2C Hardware Bus support

# CONFIG_I2C_STUB is not set
# CONFIG_I2C_SLAVE is not set
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
# end of I2C support

# CONFIG_I3C is not set
CONFIG_SPI=y
# CONFIG_SPI_DEBUG is not set
CONFIG_SPI_MASTER=y
CONFIG_SPI_MEM=y

#
# SPI Master Controller Drivers
#
# CONFIG_SPI_ALTERA is not set
# CONFIG_SPI_AXI_SPI_ENGINE is not set
CONFIG_SPI_BITBANG=m
# CONFIG_SPI_CADENCE is not set
# CONFIG_SPI_DESIGNWARE is not set
# CONFIG_SPI_GPIO is not set
CONFIG_SPI_INTEL=y
CONFIG_SPI_INTEL_PCI=y
CONFIG_SPI_INTEL_PLATFORM=y
# CONFIG_SPI_MICROCHIP_CORE is not set
# CONFIG_SPI_MICROCHIP_CORE_QSPI is not set
# CONFIG_SPI_LANTIQ_SSC is not set
# CONFIG_SPI_OC_TINY is not set
# CONFIG_SPI_PCI1XXXX is not set
# CONFIG_SPI_PXA2XX is not set
# CONFIG_SPI_SC18IS602 is not set
# CONFIG_SPI_SIFIVE is not set
# CONFIG_SPI_MXIC is not set
# CONFIG_SPI_XCOMM is not set
CONFIG_SPI_XILINX=m
# CONFIG_SPI_ZYNQMP_GQSPI is not set
# CONFIG_SPI_AMD is not set

#
# SPI Multiplexer support
#
# CONFIG_SPI_MUX is not set

#
# SPI Protocol Masters
#
CONFIG_SPI_SPIDEV=m
# CONFIG_SPI_LOOPBACK_TEST is not set
# CONFIG_SPI_TLE62X0 is not set
# CONFIG_SPI_SLAVE is not set
CONFIG_SPI_DYNAMIC=y
# CONFIG_SPMI is not set
# CONFIG_HSI is not set
CONFIG_PPS=y
# CONFIG_PPS_DEBUG is not set

#
# PPS clients support
#
# CONFIG_PPS_CLIENT_KTIMER is not set
# CONFIG_PPS_CLIENT_LDISC is not set
# CONFIG_PPS_CLIENT_GPIO is not set

#
# PPS generators support
#

#
# PTP clock support
#
CONFIG_PTP_1588_CLOCK=y
CONFIG_PTP_1588_CLOCK_OPTIONAL=y

#
# Enable PHYLIB and NETWORK_PHY_TIMESTAMPING to see the additional clocks.
#
# CONFIG_PTP_1588_CLOCK_IDT82P33 is not set
# CONFIG_PTP_1588_CLOCK_IDTCM is not set
# CONFIG_PTP_1588_CLOCK_MOCK is not set
# end of PTP clock support

# CONFIG_PINCTRL is not set
CONFIG_GPIOLIB=y
CONFIG_GPIOLIB_FASTPATH_LIMIT=512
CONFIG_GPIO_ACPI=y
# CONFIG_DEBUG_GPIO is not set
# CONFIG_GPIO_SYSFS is not set
CONFIG_GPIO_CDEV=y
CONFIG_GPIO_CDEV_V1=y

#
# Memory mapped GPIO drivers
#
# CONFIG_GPIO_AMDPT is not set
# CONFIG_GPIO_DWAPB is not set
# CONFIG_GPIO_EXAR is not set
# CONFIG_GPIO_GENERIC_PLATFORM is not set
# CONFIG_GPIO_ICH is not set
# CONFIG_GPIO_MB86S7X is not set
# CONFIG_GPIO_AMD_FCH is not set
# end of Memory mapped GPIO drivers

#
# Port-mapped I/O GPIO drivers
#
# CONFIG_GPIO_VX855 is not set
# CONFIG_GPIO_F7188X is not set
# CONFIG_GPIO_IT87 is not set
# CONFIG_GPIO_SCH311X is not set
# CONFIG_GPIO_WINBOND is not set
# CONFIG_GPIO_WS16C48 is not set
# end of Port-mapped I/O GPIO drivers

#
# I2C GPIO expanders
#
# CONFIG_GPIO_FXL6408 is not set
# CONFIG_GPIO_DS4520 is not set
# CONFIG_GPIO_MAX7300 is not set
# CONFIG_GPIO_MAX732X is not set
# CONFIG_GPIO_PCA953X is not set
# CONFIG_GPIO_PCA9570 is not set
# CONFIG_GPIO_PCF857X is not set
# CONFIG_GPIO_TPIC2810 is not set
# end of I2C GPIO expanders

#
# MFD GPIO expanders
#
# CONFIG_GPIO_ELKHARTLAKE is not set
# end of MFD GPIO expanders

#
# PCI GPIO expanders
#
# CONFIG_GPIO_AMD8111 is not set
# CONFIG_GPIO_BT8XX is not set
# CONFIG_GPIO_ML_IOH is not set
# CONFIG_GPIO_PCI_IDIO_16 is not set
# CONFIG_GPIO_PCIE_IDIO_24 is not set
# CONFIG_GPIO_RDC321X is not set
# end of PCI GPIO expanders

#
# SPI GPIO expanders
#
# CONFIG_GPIO_MAX3191X is not set
# CONFIG_GPIO_MAX7301 is not set
# CONFIG_GPIO_MC33880 is not set
# CONFIG_GPIO_PISOSR is not set
# CONFIG_GPIO_XRA1403 is not set
# end of SPI GPIO expanders

#
# USB GPIO expanders
#
# end of USB GPIO expanders

#
# Virtual GPIO drivers
#
# CONFIG_GPIO_AGGREGATOR is not set
# CONFIG_GPIO_LATCH is not set
# CONFIG_GPIO_MOCKUP is not set
# CONFIG_GPIO_VIRTIO is not set
# CONFIG_GPIO_SIM is not set
# end of Virtual GPIO drivers

CONFIG_W1=m
CONFIG_W1_CON=y

#
# 1-wire Bus Masters
#
# CONFIG_W1_MASTER_MATROX is not set
# CONFIG_W1_MASTER_DS2490 is not set
CONFIG_W1_MASTER_DS2482=m
# CONFIG_W1_MASTER_GPIO is not set
# CONFIG_W1_MASTER_SGI is not set
# end of 1-wire Bus Masters

#
# 1-wire Slaves
#
CONFIG_W1_SLAVE_THERM=m
# CONFIG_W1_SLAVE_SMEM is not set
# CONFIG_W1_SLAVE_DS2405 is not set
# CONFIG_W1_SLAVE_DS2408 is not set
# CONFIG_W1_SLAVE_DS2413 is not set
# CONFIG_W1_SLAVE_DS2406 is not set
# CONFIG_W1_SLAVE_DS2423 is not set
# CONFIG_W1_SLAVE_DS2805 is not set
# CONFIG_W1_SLAVE_DS2430 is not set
# CONFIG_W1_SLAVE_DS2431 is not set
# CONFIG_W1_SLAVE_DS2433 is not set
# CONFIG_W1_SLAVE_DS2438 is not set
# CONFIG_W1_SLAVE_DS250X is not set
# CONFIG_W1_SLAVE_DS2780 is not set
# CONFIG_W1_SLAVE_DS2781 is not set
# CONFIG_W1_SLAVE_DS28E04 is not set
# CONFIG_W1_SLAVE_DS28E17 is not set
# end of 1-wire Slaves

# CONFIG_POWER_RESET is not set
CONFIG_POWER_SUPPLY=y
# CONFIG_POWER_SUPPLY_DEBUG is not set
CONFIG_POWER_SUPPLY_HWMON=y
# CONFIG_IP5XXX_POWER is not set
# CONFIG_TEST_POWER is not set
# CONFIG_CHARGER_ADP5061 is not set
# CONFIG_BATTERY_CW2015 is not set
# CONFIG_BATTERY_DS2760 is not set
# CONFIG_BATTERY_DS2780 is not set
# CONFIG_BATTERY_DS2781 is not set
# CONFIG_BATTERY_DS2782 is not set
# CONFIG_BATTERY_SAMSUNG_SDI is not set
# CONFIG_BATTERY_SBS is not set
# CONFIG_CHARGER_SBS is not set
# CONFIG_MANAGER_SBS is not set
# CONFIG_BATTERY_BQ27XXX is not set
# CONFIG_BATTERY_MAX17042 is not set
# CONFIG_BATTERY_MAX1721X is not set
# CONFIG_CHARGER_MAX8903 is not set
# CONFIG_CHARGER_LP8727 is not set
# CONFIG_CHARGER_GPIO is not set
# CONFIG_CHARGER_LT3651 is not set
# CONFIG_CHARGER_LTC4162L is not set
# CONFIG_CHARGER_MAX77976 is not set
# CONFIG_CHARGER_BQ2415X is not set
# CONFIG_CHARGER_BQ24257 is not set
# CONFIG_CHARGER_BQ24735 is not set
# CONFIG_CHARGER_BQ2515X is not set
# CONFIG_CHARGER_BQ25890 is not set
# CONFIG_CHARGER_BQ25980 is not set
# CONFIG_CHARGER_BQ256XX is not set
# CONFIG_BATTERY_GAUGE_LTC2941 is not set
# CONFIG_BATTERY_GOLDFISH is not set
# CONFIG_BATTERY_RT5033 is not set
# CONFIG_CHARGER_RT9455 is not set
# CONFIG_CHARGER_BD99954 is not set
# CONFIG_BATTERY_UG3105 is not set
# CONFIG_FUEL_GAUGE_MM8013 is not set
CONFIG_HWMON=y
CONFIG_HWMON_VID=m
# CONFIG_HWMON_DEBUG_CHIP is not set

#
# Native drivers
#
# CONFIG_SENSORS_ABITUGURU is not set
# CONFIG_SENSORS_ABITUGURU3 is not set
# CONFIG_SENSORS_AD7314 is not set
# CONFIG_SENSORS_AD7414 is not set
# CONFIG_SENSORS_AD7418 is not set
# CONFIG_SENSORS_ADM1021 is not set
# CONFIG_SENSORS_ADM1025 is not set
# CONFIG_SENSORS_ADM1026 is not set
# CONFIG_SENSORS_ADM1029 is not set
# CONFIG_SENSORS_ADM1031 is not set
# CONFIG_SENSORS_ADM1177 is not set
# CONFIG_SENSORS_ADM9240 is not set
# CONFIG_SENSORS_ADT7310 is not set
# CONFIG_SENSORS_ADT7410 is not set
# CONFIG_SENSORS_ADT7411 is not set
# CONFIG_SENSORS_ADT7462 is not set
# CONFIG_SENSORS_ADT7470 is not set
# CONFIG_SENSORS_ADT7475 is not set
# CONFIG_SENSORS_AHT10 is not set
# CONFIG_SENSORS_AQUACOMPUTER_D5NEXT is not set
# CONFIG_SENSORS_AS370 is not set
# CONFIG_SENSORS_ASC7621 is not set
# CONFIG_SENSORS_AXI_FAN_CONTROL is not set
# CONFIG_SENSORS_K8TEMP is not set
CONFIG_SENSORS_K10TEMP=m
# CONFIG_SENSORS_FAM15H_POWER is not set
# CONFIG_SENSORS_APPLESMC is not set
# CONFIG_SENSORS_ASB100 is not set
# CONFIG_SENSORS_ATXP1 is not set
# CONFIG_SENSORS_CORSAIR_CPRO is not set
# CONFIG_SENSORS_CORSAIR_PSU is not set
# CONFIG_SENSORS_DRIVETEMP is not set
# CONFIG_SENSORS_DS620 is not set
# CONFIG_SENSORS_DS1621 is not set
# CONFIG_SENSORS_DELL_SMM is not set
# CONFIG_SENSORS_I5K_AMB is not set
# CONFIG_SENSORS_F71805F is not set
# CONFIG_SENSORS_F71882FG is not set
# CONFIG_SENSORS_F75375S is not set
# CONFIG_SENSORS_FSCHMD is not set
# CONFIG_SENSORS_FTSTEUTATES is not set
# CONFIG_SENSORS_GL518SM is not set
# CONFIG_SENSORS_GL520SM is not set
# CONFIG_SENSORS_G760A is not set
# CONFIG_SENSORS_G762 is not set
# CONFIG_SENSORS_HIH6130 is not set
# CONFIG_SENSORS_HS3001 is not set
# CONFIG_SENSORS_IBMAEM is not set
# CONFIG_SENSORS_IBMPEX is not set
# CONFIG_SENSORS_I5500 is not set
CONFIG_SENSORS_CORETEMP=m
CONFIG_SENSORS_IT87=m
# CONFIG_SENSORS_JC42 is not set
# CONFIG_SENSORS_POWERZ is not set
# CONFIG_SENSORS_POWR1220 is not set
# CONFIG_SENSORS_LINEAGE is not set
# CONFIG_SENSORS_LTC2945 is not set
# CONFIG_SENSORS_LTC2947_I2C is not set
# CONFIG_SENSORS_LTC2947_SPI is not set
# CONFIG_SENSORS_LTC2990 is not set
# CONFIG_SENSORS_LTC2991 is not set
# CONFIG_SENSORS_LTC2992 is not set
# CONFIG_SENSORS_LTC4151 is not set
# CONFIG_SENSORS_LTC4215 is not set
# CONFIG_SENSORS_LTC4222 is not set
# CONFIG_SENSORS_LTC4245 is not set
# CONFIG_SENSORS_LTC4260 is not set
# CONFIG_SENSORS_LTC4261 is not set
# CONFIG_SENSORS_MAX1111 is not set
# CONFIG_SENSORS_MAX127 is not set
# CONFIG_SENSORS_MAX16065 is not set
# CONFIG_SENSORS_MAX1619 is not set
# CONFIG_SENSORS_MAX1668 is not set
# CONFIG_SENSORS_MAX197 is not set
# CONFIG_SENSORS_MAX31722 is not set
# CONFIG_SENSORS_MAX31730 is not set
# CONFIG_SENSORS_MAX31760 is not set
# CONFIG_MAX31827 is not set
# CONFIG_SENSORS_MAX6620 is not set
# CONFIG_SENSORS_MAX6621 is not set
# CONFIG_SENSORS_MAX6639 is not set
# CONFIG_SENSORS_MAX6642 is not set
# CONFIG_SENSORS_MAX6650 is not set
# CONFIG_SENSORS_MAX6697 is not set
# CONFIG_SENSORS_MAX31790 is not set
# CONFIG_SENSORS_MC34VR500 is not set
# CONFIG_SENSORS_MCP3021 is not set
# CONFIG_SENSORS_TC654 is not set
# CONFIG_SENSORS_TPS23861 is not set
# CONFIG_SENSORS_MR75203 is not set
# CONFIG_SENSORS_ADCXX is not set
# CONFIG_SENSORS_LM63 is not set
# CONFIG_SENSORS_LM70 is not set
# CONFIG_SENSORS_LM73 is not set
# CONFIG_SENSORS_LM75 is not set
# CONFIG_SENSORS_LM77 is not set
# CONFIG_SENSORS_LM78 is not set
# CONFIG_SENSORS_LM80 is not set
# CONFIG_SENSORS_LM83 is not set
# CONFIG_SENSORS_LM85 is not set
# CONFIG_SENSORS_LM87 is not set
# CONFIG_SENSORS_LM90 is not set
# CONFIG_SENSORS_LM92 is not set
# CONFIG_SENSORS_LM93 is not set
# CONFIG_SENSORS_LM95234 is not set
# CONFIG_SENSORS_LM95241 is not set
# CONFIG_SENSORS_LM95245 is not set
# CONFIG_SENSORS_PC87360 is not set
# CONFIG_SENSORS_PC87427 is not set
# CONFIG_SENSORS_NCT6683 is not set
# CONFIG_SENSORS_NCT6775 is not set
# CONFIG_SENSORS_NCT6775_I2C is not set
# CONFIG_SENSORS_NCT7802 is not set
# CONFIG_SENSORS_NCT7904 is not set
# CONFIG_SENSORS_NPCM7XX is not set
# CONFIG_SENSORS_NZXT_KRAKEN2 is not set
# CONFIG_SENSORS_NZXT_SMART2 is not set
# CONFIG_SENSORS_OCC_P8_I2C is not set
# CONFIG_SENSORS_OXP is not set
# CONFIG_SENSORS_PCF8591 is not set
# CONFIG_PMBUS is not set
# CONFIG_SENSORS_SBTSI is not set
# CONFIG_SENSORS_SBRMI is not set
# CONFIG_SENSORS_SHT15 is not set
# CONFIG_SENSORS_SHT21 is not set
# CONFIG_SENSORS_SHT3x is not set
# CONFIG_SENSORS_SHT4x is not set
# CONFIG_SENSORS_SHTC1 is not set
# CONFIG_SENSORS_SIS5595 is not set
# CONFIG_SENSORS_DME1737 is not set
# CONFIG_SENSORS_EMC1403 is not set
# CONFIG_SENSORS_EMC2103 is not set
# CONFIG_SENSORS_EMC2305 is not set
# CONFIG_SENSORS_EMC6W201 is not set
# CONFIG_SENSORS_SMSC47M1 is not set
# CONFIG_SENSORS_SMSC47M192 is not set
# CONFIG_SENSORS_SMSC47B397 is not set
# CONFIG_SENSORS_SCH5627 is not set
# CONFIG_SENSORS_SCH5636 is not set
# CONFIG_SENSORS_STTS751 is not set
# CONFIG_SENSORS_ADC128D818 is not set
# CONFIG_SENSORS_ADS7828 is not set
# CONFIG_SENSORS_ADS7871 is not set
# CONFIG_SENSORS_AMC6821 is not set
# CONFIG_SENSORS_INA209 is not set
# CONFIG_SENSORS_INA2XX is not set
# CONFIG_SENSORS_INA238 is not set
# CONFIG_SENSORS_INA3221 is not set
# CONFIG_SENSORS_TC74 is not set
# CONFIG_SENSORS_THMC50 is not set
# CONFIG_SENSORS_TMP102 is not set
# CONFIG_SENSORS_TMP103 is not set
# CONFIG_SENSORS_TMP108 is not set
# CONFIG_SENSORS_TMP401 is not set
# CONFIG_SENSORS_TMP421 is not set
# CONFIG_SENSORS_TMP464 is not set
# CONFIG_SENSORS_TMP513 is not set
# CONFIG_SENSORS_VIA_CPUTEMP is not set
# CONFIG_SENSORS_VIA686A is not set
# CONFIG_SENSORS_VT1211 is not set
# CONFIG_SENSORS_VT8231 is not set
# CONFIG_SENSORS_W83773G is not set
# CONFIG_SENSORS_W83781D is not set
# CONFIG_SENSORS_W83791D is not set
# CONFIG_SENSORS_W83792D is not set
# CONFIG_SENSORS_W83793 is not set
# CONFIG_SENSORS_W83795 is not set
# CONFIG_SENSORS_W83L785TS is not set
# CONFIG_SENSORS_W83L786NG is not set
# CONFIG_SENSORS_W83627HF is not set
# CONFIG_SENSORS_W83627EHF is not set

#
# ACPI drivers
#
# CONFIG_SENSORS_ACPI_POWER is not set
# CONFIG_SENSORS_ATK0110 is not set
# CONFIG_SENSORS_ASUS_EC is not set
CONFIG_THERMAL=y
# CONFIG_THERMAL_NETLINK is not set
# CONFIG_THERMAL_STATISTICS is not set
CONFIG_THERMAL_EMERGENCY_POWEROFF_DELAY_MS=0
CONFIG_THERMAL_HWMON=y
CONFIG_THERMAL_WRITABLE_TRIPS=y
CONFIG_THERMAL_DEFAULT_GOV_STEP_WISE=y
# CONFIG_THERMAL_DEFAULT_GOV_FAIR_SHARE is not set
# CONFIG_THERMAL_DEFAULT_GOV_USER_SPACE is not set
# CONFIG_THERMAL_GOV_FAIR_SHARE is not set
CONFIG_THERMAL_GOV_STEP_WISE=y
# CONFIG_THERMAL_GOV_BANG_BANG is not set
CONFIG_THERMAL_GOV_USER_SPACE=y
# CONFIG_THERMAL_EMULATION is not set

#
# Intel thermal drivers
#
# CONFIG_INTEL_POWERCLAMP is not set
CONFIG_X86_THERMAL_VECTOR=y
CONFIG_INTEL_TCC=y
CONFIG_X86_PKG_TEMP_THERMAL=m
# CONFIG_INTEL_SOC_DTS_THERMAL is not set

#
# ACPI INT340X thermal drivers
#
# CONFIG_INT340X_THERMAL is not set
# end of ACPI INT340X thermal drivers

# CONFIG_INTEL_PCH_THERMAL is not set
# CONFIG_INTEL_TCC_COOLING is not set
# CONFIG_INTEL_HFI_THERMAL is not set
# end of Intel thermal drivers

CONFIG_WATCHDOG=y
CONFIG_WATCHDOG_CORE=y
# CONFIG_WATCHDOG_NOWAYOUT is not set
CONFIG_WATCHDOG_HANDLE_BOOT_ENABLED=y
CONFIG_WATCHDOG_OPEN_TIMEOUT=0
# CONFIG_WATCHDOG_SYSFS is not set
# CONFIG_WATCHDOG_HRTIMER_PRETIMEOUT is not set

#
# Watchdog Pretimeout Governors
#
# CONFIG_WATCHDOG_PRETIMEOUT_GOV is not set

#
# Watchdog Device Drivers
#
# CONFIG_SOFT_WATCHDOG is not set
# CONFIG_WDAT_WDT is not set
# CONFIG_XILINX_WATCHDOG is not set
# CONFIG_ZIIRAVE_WATCHDOG is not set
# CONFIG_CADENCE_WATCHDOG is not set
# CONFIG_DW_WATCHDOG is not set
# CONFIG_MAX63XX_WATCHDOG is not set
# CONFIG_ACQUIRE_WDT is not set
# CONFIG_ADVANTECH_WDT is not set
# CONFIG_ADVANTECH_EC_WDT is not set
# CONFIG_ALIM1535_WDT is not set
# CONFIG_ALIM7101_WDT is not set
# CONFIG_EBC_C384_WDT is not set
# CONFIG_EXAR_WDT is not set
# CONFIG_F71808E_WDT is not set
# CONFIG_SP5100_TCO is not set
# CONFIG_SBC_FITPC2_WATCHDOG is not set
# CONFIG_EUROTECH_WDT is not set
# CONFIG_IB700_WDT is not set
# CONFIG_IBMASR is not set
# CONFIG_WAFER_WDT is not set
# CONFIG_I6300ESB_WDT is not set
# CONFIG_IE6XX_WDT is not set
CONFIG_ITCO_WDT=y
# CONFIG_ITCO_VENDOR_SUPPORT is not set
CONFIG_IT8712F_WDT=m
# CONFIG_IT87_WDT is not set
# CONFIG_HP_WATCHDOG is not set
# CONFIG_SC1200_WDT is not set
# CONFIG_PC87413_WDT is not set
# CONFIG_NV_TCO is not set
# CONFIG_60XX_WDT is not set
# CONFIG_CPU5_WDT is not set
# CONFIG_SMSC_SCH311X_WDT is not set
# CONFIG_SMSC37B787_WDT is not set
# CONFIG_TQMX86_WDT is not set
# CONFIG_VIA_WDT is not set
# CONFIG_W83627HF_WDT is not set
# CONFIG_W83877F_WDT is not set
# CONFIG_W83977F_WDT is not set
# CONFIG_MACHZ_WDT is not set
# CONFIG_SBC_EPX_C3_WATCHDOG is not set
# CONFIG_NI903X_WDT is not set
# CONFIG_NIC7018_WDT is not set
# CONFIG_MEN_A21_WDT is not set

#
# PCI-based Watchdog Cards
#
# CONFIG_PCIPCWATCHDOG is not set
# CONFIG_WDTPCI is not set

#
# USB-based Watchdog Cards
#
# CONFIG_USBPCWATCHDOG is not set
CONFIG_SSB_POSSIBLE=y
# CONFIG_SSB is not set
CONFIG_BCMA_POSSIBLE=y
# CONFIG_BCMA is not set

#
# Multifunction device drivers
#
CONFIG_MFD_CORE=y
# CONFIG_MFD_AS3711 is not set
# CONFIG_MFD_SMPRO is not set
# CONFIG_PMIC_ADP5520 is not set
# CONFIG_MFD_AAT2870_CORE is not set
# CONFIG_MFD_BCM590XX is not set
# CONFIG_MFD_BD9571MWV is not set
# CONFIG_MFD_AXP20X_I2C is not set
# CONFIG_MFD_CS42L43_I2C is not set
# CONFIG_MFD_MADERA is not set
# CONFIG_PMIC_DA903X is not set
# CONFIG_MFD_DA9052_SPI is not set
# CONFIG_MFD_DA9052_I2C is not set
# CONFIG_MFD_DA9055 is not set
# CONFIG_MFD_DA9062 is not set
# CONFIG_MFD_DA9063 is not set
# CONFIG_MFD_DA9150 is not set
# CONFIG_MFD_DLN2 is not set
# CONFIG_MFD_MC13XXX_SPI is not set
# CONFIG_MFD_MC13XXX_I2C is not set
# CONFIG_MFD_MP2629 is not set
CONFIG_LPC_ICH=y
# CONFIG_LPC_SCH is not set
# CONFIG_MFD_INTEL_LPSS_ACPI is not set
# CONFIG_MFD_INTEL_LPSS_PCI is not set
# CONFIG_MFD_INTEL_PMC_BXT is not set
# CONFIG_MFD_IQS62X is not set
# CONFIG_MFD_JANZ_CMODIO is not set
# CONFIG_MFD_KEMPLD is not set
# CONFIG_MFD_88PM800 is not set
# CONFIG_MFD_88PM805 is not set
# CONFIG_MFD_88PM860X is not set
# CONFIG_MFD_MAX14577 is not set
# CONFIG_MFD_MAX77541 is not set
# CONFIG_MFD_MAX77693 is not set
# CONFIG_MFD_MAX77843 is not set
# CONFIG_MFD_MAX8907 is not set
# CONFIG_MFD_MAX8925 is not set
# CONFIG_MFD_MAX8997 is not set
# CONFIG_MFD_MAX8998 is not set
# CONFIG_MFD_MT6360 is not set
# CONFIG_MFD_MT6370 is not set
# CONFIG_MFD_MT6397 is not set
# CONFIG_MFD_MENF21BMC is not set
# CONFIG_MFD_OCELOT is not set
# CONFIG_EZX_PCAP is not set
# CONFIG_MFD_VIPERBOARD is not set
# CONFIG_MFD_RETU is not set
# CONFIG_MFD_PCF50633 is not set
# CONFIG_MFD_SY7636A is not set
# CONFIG_MFD_RDC321X is not set
# CONFIG_MFD_RT4831 is not set
# CONFIG_MFD_RT5033 is not set
# CONFIG_MFD_RT5120 is not set
# CONFIG_MFD_RC5T583 is not set
# CONFIG_MFD_SI476X_CORE is not set
# CONFIG_MFD_SM501 is not set
# CONFIG_MFD_SKY81452 is not set
# CONFIG_MFD_SYSCON is not set
# CONFIG_MFD_TI_AM335X_TSCADC is not set
# CONFIG_MFD_LP3943 is not set
# CONFIG_MFD_LP8788 is not set
# CONFIG_MFD_TI_LMU is not set
# CONFIG_MFD_PALMAS is not set
# CONFIG_TPS6105X is not set
# CONFIG_TPS65010 is not set
# CONFIG_TPS6507X is not set
# CONFIG_MFD_TPS65086 is not set
# CONFIG_MFD_TPS65090 is not set
# CONFIG_MFD_TI_LP873X is not set
# CONFIG_MFD_TPS6586X is not set
# CONFIG_MFD_TPS65910 is not set
# CONFIG_MFD_TPS65912_I2C is not set
# CONFIG_MFD_TPS65912_SPI is not set
# CONFIG_MFD_TPS6594_I2C is not set
# CONFIG_MFD_TPS6594_SPI is not set
# CONFIG_TWL4030_CORE is not set
# CONFIG_TWL6040_CORE is not set
# CONFIG_MFD_WL1273_CORE is not set
# CONFIG_MFD_LM3533 is not set
# CONFIG_MFD_TQMX86 is not set
# CONFIG_MFD_VX855 is not set
# CONFIG_MFD_ARIZONA_I2C is not set
# CONFIG_MFD_ARIZONA_SPI is not set
# CONFIG_MFD_WM8400 is not set
# CONFIG_MFD_WM831X_I2C is not set
# CONFIG_MFD_WM831X_SPI is not set
# CONFIG_MFD_WM8350_I2C is not set
# CONFIG_MFD_WM8994 is not set
# CONFIG_MFD_ATC260X_I2C is not set
# CONFIG_MFD_INTEL_M10_BMC_SPI is not set
# end of Multifunction device drivers

# CONFIG_REGULATOR is not set
# CONFIG_RC_CORE is not set

#
# CEC support
#
# CONFIG_MEDIA_CEC_SUPPORT is not set
# end of CEC support

# CONFIG_MEDIA_SUPPORT is not set

#
# Graphics support
#
CONFIG_APERTURE_HELPERS=y
CONFIG_VIDEO_CMDLINE=y
# CONFIG_AUXDISPLAY is not set
CONFIG_AGP=y
CONFIG_AGP_AMD64=y
# CONFIG_AGP_INTEL is not set
# CONFIG_AGP_SIS is not set
# CONFIG_AGP_VIA is not set
# CONFIG_VGA_SWITCHEROO is not set
# CONFIG_DRM is not set
# CONFIG_DRM_DEBUG_MODESET_LOCK is not set
CONFIG_DRM_PANEL_ORIENTATION_QUIRKS=y

#
# Frame buffer Devices
#
CONFIG_FB=y
# CONFIG_FB_CIRRUS is not set
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
# CONFIG_FB_ARC is not set
# CONFIG_FB_ASILIANT is not set
# CONFIG_FB_IMSTT is not set
# CONFIG_FB_VGA16 is not set
# CONFIG_FB_UVESA is not set
# CONFIG_FB_VESA is not set
CONFIG_FB_EFI=y
# CONFIG_FB_N411 is not set
# CONFIG_FB_HGA is not set
# CONFIG_FB_OPENCORES is not set
# CONFIG_FB_S1D13XXX is not set
# CONFIG_FB_NVIDIA is not set
# CONFIG_FB_RIVA is not set
# CONFIG_FB_I740 is not set
# CONFIG_FB_LE80578 is not set
# CONFIG_FB_MATROX is not set
# CONFIG_FB_RADEON is not set
# CONFIG_FB_ATY128 is not set
# CONFIG_FB_ATY is not set
# CONFIG_FB_S3 is not set
# CONFIG_FB_SAVAGE is not set
# CONFIG_FB_SIS is not set
# CONFIG_FB_VIA is not set
# CONFIG_FB_NEOMAGIC is not set
# CONFIG_FB_KYRO is not set
# CONFIG_FB_3DFX is not set
# CONFIG_FB_VOODOO1 is not set
# CONFIG_FB_VT8623 is not set
# CONFIG_FB_TRIDENT is not set
# CONFIG_FB_ARK is not set
# CONFIG_FB_PM3 is not set
# CONFIG_FB_CARMINE is not set
# CONFIG_FB_SMSCUFX is not set
# CONFIG_FB_UDL is not set
# CONFIG_FB_IBM_GXT4500 is not set
# CONFIG_FB_VIRTUAL is not set
# CONFIG_FB_METRONOME is not set
# CONFIG_FB_MB862XX is not set
# CONFIG_FB_SIMPLE is not set
# CONFIG_FB_SSD1307 is not set
# CONFIG_FB_SM712 is not set
CONFIG_FB_CORE=y
CONFIG_FB_NOTIFY=y
# CONFIG_FIRMWARE_EDID is not set
CONFIG_FB_DEVICE=y
CONFIG_FB_CFB_FILLRECT=y
CONFIG_FB_CFB_COPYAREA=y
CONFIG_FB_CFB_IMAGEBLIT=y
# CONFIG_FB_FOREIGN_ENDIAN is not set
CONFIG_FB_IOMEM_FOPS=y
CONFIG_FB_IOMEM_HELPERS=y
# CONFIG_FB_MODE_HELPERS is not set
# CONFIG_FB_TILEBLITTING is not set
# end of Frame buffer Devices

#
# Backlight & LCD device support
#
# CONFIG_LCD_CLASS_DEVICE is not set
# CONFIG_BACKLIGHT_CLASS_DEVICE is not set
# end of Backlight & LCD device support

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
CONFIG_DUMMY_CONSOLE=y
CONFIG_DUMMY_CONSOLE_COLUMNS=80
CONFIG_DUMMY_CONSOLE_ROWS=25
CONFIG_FRAMEBUFFER_CONSOLE=y
# CONFIG_FRAMEBUFFER_CONSOLE_LEGACY_ACCELERATION is not set
# CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY is not set
# CONFIG_FRAMEBUFFER_CONSOLE_ROTATION is not set
# CONFIG_FRAMEBUFFER_CONSOLE_DEFERRED_TAKEOVER is not set
# end of Console display driver support

# CONFIG_LOGO is not set
# end of Graphics support

# CONFIG_SOUND is not set
CONFIG_HID_SUPPORT=y
CONFIG_HID=y
# CONFIG_HID_BATTERY_STRENGTH is not set
CONFIG_HIDRAW=y
# CONFIG_UHID is not set
CONFIG_HID_GENERIC=y

#
# Special HID drivers
#
# CONFIG_HID_A4TECH is not set
# CONFIG_HID_ACCUTOUCH is not set
# CONFIG_HID_ACRUX is not set
# CONFIG_HID_APPLEIR is not set
# CONFIG_HID_AUREAL is not set
# CONFIG_HID_BELKIN is not set
# CONFIG_HID_BETOP_FF is not set
# CONFIG_HID_CHERRY is not set
# CONFIG_HID_CHICONY is not set
# CONFIG_HID_COUGAR is not set
# CONFIG_HID_MACALLY is not set
# CONFIG_HID_CMEDIA is not set
# CONFIG_HID_CP2112 is not set
# CONFIG_HID_CREATIVE_SB0540 is not set
# CONFIG_HID_CYPRESS is not set
# CONFIG_HID_DRAGONRISE is not set
# CONFIG_HID_EMS_FF is not set
# CONFIG_HID_ELECOM is not set
# CONFIG_HID_ELO is not set
# CONFIG_HID_EVISION is not set
# CONFIG_HID_EZKEY is not set
# CONFIG_HID_FT260 is not set
# CONFIG_HID_GEMBIRD is not set
# CONFIG_HID_GFRM is not set
# CONFIG_HID_GLORIOUS is not set
# CONFIG_HID_HOLTEK is not set
# CONFIG_HID_GOOGLE_STADIA_FF is not set
# CONFIG_HID_VIVALDI is not set
# CONFIG_HID_KEYTOUCH is not set
# CONFIG_HID_KYE is not set
# CONFIG_HID_UCLOGIC is not set
# CONFIG_HID_WALTOP is not set
# CONFIG_HID_VIEWSONIC is not set
# CONFIG_HID_VRC2 is not set
# CONFIG_HID_XIAOMI is not set
# CONFIG_HID_GYRATION is not set
# CONFIG_HID_ICADE is not set
# CONFIG_HID_ITE is not set
# CONFIG_HID_JABRA is not set
# CONFIG_HID_TWINHAN is not set
# CONFIG_HID_KENSINGTON is not set
# CONFIG_HID_LCPOWER is not set
# CONFIG_HID_LENOVO is not set
# CONFIG_HID_LETSKETCH is not set
# CONFIG_HID_MAGICMOUSE is not set
# CONFIG_HID_MALTRON is not set
# CONFIG_HID_MAYFLASH is not set
# CONFIG_HID_MEGAWORLD_FF is not set
# CONFIG_HID_REDRAGON is not set
# CONFIG_HID_MICROSOFT is not set
# CONFIG_HID_MONTEREY is not set
# CONFIG_HID_MULTITOUCH is not set
# CONFIG_HID_NTI is not set
# CONFIG_HID_NTRIG is not set
# CONFIG_HID_ORTEK is not set
# CONFIG_HID_PANTHERLORD is not set
# CONFIG_HID_PENMOUNT is not set
# CONFIG_HID_PETALYNX is not set
# CONFIG_HID_PICOLCD is not set
# CONFIG_HID_PLANTRONICS is not set
# CONFIG_HID_PXRC is not set
# CONFIG_HID_RAZER is not set
# CONFIG_HID_PRIMAX is not set
# CONFIG_HID_RETRODE is not set
# CONFIG_HID_ROCCAT is not set
# CONFIG_HID_SAITEK is not set
# CONFIG_HID_SAMSUNG is not set
# CONFIG_HID_SEMITEK is not set
# CONFIG_HID_SIGMAMICRO is not set
# CONFIG_HID_SPEEDLINK is not set
# CONFIG_HID_STEAM is not set
# CONFIG_HID_STEELSERIES is not set
# CONFIG_HID_SUNPLUS is not set
# CONFIG_HID_RMI is not set
# CONFIG_HID_GREENASIA is not set
# CONFIG_HID_SMARTJOYPLUS is not set
# CONFIG_HID_TIVO is not set
# CONFIG_HID_TOPSEED is not set
# CONFIG_HID_TOPRE is not set
# CONFIG_HID_THRUSTMASTER is not set
# CONFIG_HID_UDRAW_PS3 is not set
# CONFIG_HID_WACOM is not set
# CONFIG_HID_XINMO is not set
# CONFIG_HID_ZEROPLUS is not set
# CONFIG_HID_ZYDACRON is not set
# CONFIG_HID_SENSOR_HUB is not set
# CONFIG_HID_ALPS is not set
# CONFIG_HID_MCP2221 is not set
# end of Special HID drivers

#
# HID-BPF support
#
# CONFIG_HID_BPF is not set
# end of HID-BPF support

#
# USB HID support
#
CONFIG_USB_HID=y
CONFIG_HID_PID=y
CONFIG_USB_HIDDEV=y
# end of USB HID support

CONFIG_I2C_HID=y
# CONFIG_I2C_HID_ACPI is not set
# CONFIG_I2C_HID_OF is not set

#
# Intel ISH HID support
#
# CONFIG_INTEL_ISH_HID is not set
# end of Intel ISH HID support

#
# AMD SFH HID Support
#
# CONFIG_AMD_SFH_HID is not set
# end of AMD SFH HID Support

CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_SUPPORT=y
CONFIG_USB_COMMON=y
# CONFIG_USB_ULPI_BUS is not set
# CONFIG_USB_CONN_GPIO is not set
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB=y
CONFIG_USB_PCI=y
CONFIG_USB_PCI_AMD=y
CONFIG_USB_ANNOUNCE_NEW_DEVICES=y

#
# Miscellaneous USB options
#
CONFIG_USB_DEFAULT_PERSIST=y
# CONFIG_USB_FEW_INIT_RETRIES is not set
# CONFIG_USB_DYNAMIC_MINORS is not set
# CONFIG_USB_OTG is not set
# CONFIG_USB_OTG_PRODUCTLIST is not set
# CONFIG_USB_OTG_DISABLE_EXTERNAL_HUB is not set
CONFIG_USB_AUTOSUSPEND_DELAY=2
# CONFIG_USB_MON is not set

#
# USB Host Controller Drivers
#
# CONFIG_USB_C67X00_HCD is not set
CONFIG_USB_XHCI_HCD=m
# CONFIG_USB_XHCI_DBGCAP is not set
CONFIG_USB_XHCI_PCI=m
# CONFIG_USB_XHCI_PCI_RENESAS is not set
# CONFIG_USB_XHCI_PLATFORM is not set
CONFIG_USB_EHCI_HCD=m
# CONFIG_USB_EHCI_ROOT_HUB_TT is not set
# CONFIG_USB_EHCI_TT_NEWSCHED is not set
CONFIG_USB_EHCI_PCI=m
# CONFIG_USB_EHCI_FSL is not set
# CONFIG_USB_EHCI_HCD_PLATFORM is not set
# CONFIG_USB_OXU210HP_HCD is not set
# CONFIG_USB_ISP116X_HCD is not set
# CONFIG_USB_MAX3421_HCD is not set
CONFIG_USB_OHCI_HCD=m
CONFIG_USB_OHCI_HCD_PCI=m
# CONFIG_USB_OHCI_HCD_PLATFORM is not set
CONFIG_USB_UHCI_HCD=m
# CONFIG_USB_SL811_HCD is not set
# CONFIG_USB_R8A66597_HCD is not set
# CONFIG_USB_HCD_TEST_MODE is not set

#
# USB Device Class drivers
#
CONFIG_USB_ACM=m
# CONFIG_USB_PRINTER is not set
# CONFIG_USB_WDM is not set
# CONFIG_USB_TMC is not set

#
# NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may
#

#
# also be needed; see USB_STORAGE Help for more info
#
CONFIG_USB_STORAGE=m
# CONFIG_USB_STORAGE_DEBUG is not set
# CONFIG_USB_STORAGE_REALTEK is not set
# CONFIG_USB_STORAGE_DATAFAB is not set
# CONFIG_USB_STORAGE_FREECOM is not set
# CONFIG_USB_STORAGE_ISD200 is not set
# CONFIG_USB_STORAGE_USBAT is not set
# CONFIG_USB_STORAGE_SDDR09 is not set
# CONFIG_USB_STORAGE_SDDR55 is not set
# CONFIG_USB_STORAGE_JUMPSHOT is not set
# CONFIG_USB_STORAGE_ALAUDA is not set
# CONFIG_USB_STORAGE_ONETOUCH is not set
# CONFIG_USB_STORAGE_KARMA is not set
# CONFIG_USB_STORAGE_CYPRESS_ATACB is not set
# CONFIG_USB_STORAGE_ENE_UB6250 is not set
# CONFIG_USB_UAS is not set

#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set
# CONFIG_USBIP_CORE is not set

#
# USB dual-mode controller drivers
#
# CONFIG_USB_CDNS_SUPPORT is not set
# CONFIG_USB_MUSB_HDRC is not set
# CONFIG_USB_DWC3 is not set
# CONFIG_USB_DWC2 is not set
# CONFIG_USB_CHIPIDEA is not set
# CONFIG_USB_ISP1760 is not set

#
# USB port drivers
#
CONFIG_USB_SERIAL=m
CONFIG_USB_SERIAL_GENERIC=y
# CONFIG_USB_SERIAL_SIMPLE is not set
# CONFIG_USB_SERIAL_AIRCABLE is not set
# CONFIG_USB_SERIAL_ARK3116 is not set
# CONFIG_USB_SERIAL_BELKIN is not set
# CONFIG_USB_SERIAL_CH341 is not set
# CONFIG_USB_SERIAL_WHITEHEAT is not set
# CONFIG_USB_SERIAL_DIGI_ACCELEPORT is not set
# CONFIG_USB_SERIAL_CP210X is not set
# CONFIG_USB_SERIAL_CYPRESS_M8 is not set
# CONFIG_USB_SERIAL_EMPEG is not set
CONFIG_USB_SERIAL_FTDI_SIO=m
# CONFIG_USB_SERIAL_VISOR is not set
# CONFIG_USB_SERIAL_IPAQ is not set
# CONFIG_USB_SERIAL_IR is not set
# CONFIG_USB_SERIAL_EDGEPORT is not set
# CONFIG_USB_SERIAL_EDGEPORT_TI is not set
# CONFIG_USB_SERIAL_F81232 is not set
# CONFIG_USB_SERIAL_F8153X is not set
# CONFIG_USB_SERIAL_GARMIN is not set
# CONFIG_USB_SERIAL_IPW is not set
# CONFIG_USB_SERIAL_IUU is not set
# CONFIG_USB_SERIAL_KEYSPAN_PDA is not set
CONFIG_USB_SERIAL_KEYSPAN=m
# CONFIG_USB_SERIAL_KLSI is not set
# CONFIG_USB_SERIAL_KOBIL_SCT is not set
# CONFIG_USB_SERIAL_MCT_U232 is not set
# CONFIG_USB_SERIAL_METRO is not set
# CONFIG_USB_SERIAL_MOS7720 is not set
# CONFIG_USB_SERIAL_MOS7840 is not set
# CONFIG_USB_SERIAL_MXUPORT is not set
# CONFIG_USB_SERIAL_NAVMAN is not set
CONFIG_USB_SERIAL_PL2303=m
# CONFIG_USB_SERIAL_OTI6858 is not set
# CONFIG_USB_SERIAL_QCAUX is not set
# CONFIG_USB_SERIAL_QUALCOMM is not set
# CONFIG_USB_SERIAL_SPCP8X5 is not set
# CONFIG_USB_SERIAL_SAFE is not set
# CONFIG_USB_SERIAL_SIERRAWIRELESS is not set
# CONFIG_USB_SERIAL_SYMBOL is not set
# CONFIG_USB_SERIAL_TI is not set
# CONFIG_USB_SERIAL_CYBERJACK is not set
# CONFIG_USB_SERIAL_OPTION is not set
# CONFIG_USB_SERIAL_OMNINET is not set
# CONFIG_USB_SERIAL_OPTICON is not set
# CONFIG_USB_SERIAL_XSENS_MT is not set
# CONFIG_USB_SERIAL_WISHBONE is not set
# CONFIG_USB_SERIAL_SSU100 is not set
# CONFIG_USB_SERIAL_QT2 is not set
# CONFIG_USB_SERIAL_UPD78F0730 is not set
# CONFIG_USB_SERIAL_XR is not set
# CONFIG_USB_SERIAL_DEBUG is not set

#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_ADUTUX is not set
# CONFIG_USB_SEVSEG is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_IDMOUSE is not set
# CONFIG_USB_APPLEDISPLAY is not set
# CONFIG_APPLE_MFI_FASTCHARGE is not set
# CONFIG_USB_LJCA is not set
# CONFIG_USB_SISUSBVGA is not set
# CONFIG_USB_LD is not set
# CONFIG_USB_TRANCEVIBRATOR is not set
# CONFIG_USB_IOWARRIOR is not set
CONFIG_USB_TEST=m
# CONFIG_USB_EHSET_TEST_FIXTURE is not set
# CONFIG_USB_ISIGHTFW is not set
# CONFIG_USB_YUREX is not set
CONFIG_USB_EZUSB_FX2=m
# CONFIG_USB_HUB_USB251XB is not set
# CONFIG_USB_HSIC_USB3503 is not set
# CONFIG_USB_HSIC_USB4604 is not set
# CONFIG_USB_LINK_LAYER_TEST is not set
# CONFIG_USB_CHAOSKEY is not set

#
# USB Physical Layer drivers
#
# CONFIG_NOP_USB_XCEIV is not set
# CONFIG_USB_GPIO_VBUS is not set
# CONFIG_USB_ISP1301 is not set
# end of USB Physical Layer drivers

# CONFIG_USB_GADGET is not set
# CONFIG_TYPEC is not set
# CONFIG_USB_ROLE_SWITCH is not set
# CONFIG_MMC is not set
# CONFIG_SCSI_UFSHCD is not set
# CONFIG_MEMSTICK is not set
# CONFIG_NEW_LEDS is not set
# CONFIG_ACCESSIBILITY is not set
CONFIG_INFINIBAND=m
CONFIG_INFINIBAND_USER_MAD=m
CONFIG_INFINIBAND_USER_ACCESS=m
CONFIG_INFINIBAND_USER_MEM=y
CONFIG_INFINIBAND_ON_DEMAND_PAGING=y
CONFIG_INFINIBAND_ADDR_TRANS=y
CONFIG_INFINIBAND_ADDR_TRANS_CONFIGFS=y
CONFIG_INFINIBAND_VIRT_DMA=y
# CONFIG_INFINIBAND_BNXT_RE is not set
# CONFIG_INFINIBAND_EFA is not set
# CONFIG_INFINIBAND_ERDMA is not set
CONFIG_MLX4_INFINIBAND=m
CONFIG_MLX5_INFINIBAND=m
# CONFIG_INFINIBAND_MTHCA is not set
# CONFIG_INFINIBAND_OCRDMA is not set
# CONFIG_INFINIBAND_USNIC is not set
# CONFIG_INFINIBAND_RDMAVT is not set
# CONFIG_RDMA_RXE is not set
# CONFIG_RDMA_SIW is not set
# CONFIG_INFINIBAND_IPOIB is not set
# CONFIG_INFINIBAND_SRP is not set
# CONFIG_INFINIBAND_ISER is not set
# CONFIG_INFINIBAND_RTRS_CLIENT is not set
# CONFIG_INFINIBAND_RTRS_SERVER is not set
# CONFIG_INFINIBAND_OPA_VNIC is not set
CONFIG_EDAC_ATOMIC_SCRUB=y
CONFIG_EDAC_SUPPORT=y
# CONFIG_EDAC is not set
CONFIG_RTC_LIB=y
CONFIG_RTC_MC146818_LIB=y
CONFIG_RTC_CLASS=y
CONFIG_RTC_HCTOSYS=y
CONFIG_RTC_HCTOSYS_DEVICE="rtc0"
CONFIG_RTC_SYSTOHC=y
CONFIG_RTC_SYSTOHC_DEVICE="rtc0"
# CONFIG_RTC_DEBUG is not set
CONFIG_RTC_NVMEM=y

#
# RTC interfaces
#
CONFIG_RTC_INTF_SYSFS=y
CONFIG_RTC_INTF_PROC=y
CONFIG_RTC_INTF_DEV=y
# CONFIG_RTC_INTF_DEV_UIE_EMUL is not set
# CONFIG_RTC_DRV_TEST is not set

#
# I2C RTC drivers
#
# CONFIG_RTC_DRV_ABB5ZES3 is not set
# CONFIG_RTC_DRV_ABEOZ9 is not set
# CONFIG_RTC_DRV_ABX80X is not set
# CONFIG_RTC_DRV_DS1307 is not set
# CONFIG_RTC_DRV_DS1374 is not set
# CONFIG_RTC_DRV_DS1672 is not set
# CONFIG_RTC_DRV_MAX6900 is not set
# CONFIG_RTC_DRV_RS5C372 is not set
# CONFIG_RTC_DRV_ISL1208 is not set
# CONFIG_RTC_DRV_ISL12022 is not set
# CONFIG_RTC_DRV_X1205 is not set
# CONFIG_RTC_DRV_PCF8523 is not set
# CONFIG_RTC_DRV_PCF85063 is not set
# CONFIG_RTC_DRV_PCF85363 is not set
# CONFIG_RTC_DRV_PCF8563 is not set
# CONFIG_RTC_DRV_PCF8583 is not set
# CONFIG_RTC_DRV_M41T80 is not set
# CONFIG_RTC_DRV_BQ32K is not set
# CONFIG_RTC_DRV_S35390A is not set
# CONFIG_RTC_DRV_FM3130 is not set
# CONFIG_RTC_DRV_RX8010 is not set
# CONFIG_RTC_DRV_RX8581 is not set
# CONFIG_RTC_DRV_RX8025 is not set
# CONFIG_RTC_DRV_EM3027 is not set
# CONFIG_RTC_DRV_RV3028 is not set
# CONFIG_RTC_DRV_RV3032 is not set
# CONFIG_RTC_DRV_RV8803 is not set
# CONFIG_RTC_DRV_SD3078 is not set

#
# SPI RTC drivers
#
# CONFIG_RTC_DRV_M41T93 is not set
# CONFIG_RTC_DRV_M41T94 is not set
# CONFIG_RTC_DRV_DS1302 is not set
# CONFIG_RTC_DRV_DS1305 is not set
# CONFIG_RTC_DRV_DS1343 is not set
# CONFIG_RTC_DRV_DS1347 is not set
# CONFIG_RTC_DRV_DS1390 is not set
# CONFIG_RTC_DRV_MAX6916 is not set
# CONFIG_RTC_DRV_R9701 is not set
# CONFIG_RTC_DRV_RX4581 is not set
# CONFIG_RTC_DRV_RS5C348 is not set
# CONFIG_RTC_DRV_MAX6902 is not set
# CONFIG_RTC_DRV_PCF2123 is not set
# CONFIG_RTC_DRV_MCP795 is not set
CONFIG_RTC_I2C_AND_SPI=y

#
# SPI and I2C RTC drivers
#
# CONFIG_RTC_DRV_DS3232 is not set
# CONFIG_RTC_DRV_PCF2127 is not set
# CONFIG_RTC_DRV_RV3029C2 is not set
# CONFIG_RTC_DRV_RX6110 is not set

#
# Platform RTC drivers
#
CONFIG_RTC_DRV_CMOS=y
# CONFIG_RTC_DRV_DS1286 is not set
# CONFIG_RTC_DRV_DS1511 is not set
# CONFIG_RTC_DRV_DS1553 is not set
# CONFIG_RTC_DRV_DS1685_FAMILY is not set
# CONFIG_RTC_DRV_DS1742 is not set
# CONFIG_RTC_DRV_DS2404 is not set
# CONFIG_RTC_DRV_STK17TA8 is not set
# CONFIG_RTC_DRV_M48T86 is not set
# CONFIG_RTC_DRV_M48T35 is not set
# CONFIG_RTC_DRV_M48T59 is not set
# CONFIG_RTC_DRV_MSM6242 is not set
# CONFIG_RTC_DRV_RP5C01 is not set

#
# on-CPU RTC drivers
#
# CONFIG_RTC_DRV_FTRTC010 is not set

#
# HID Sensor RTC drivers
#
# CONFIG_RTC_DRV_GOLDFISH is not set
CONFIG_DMADEVICES=y
# CONFIG_DMADEVICES_DEBUG is not set

#
# DMA Devices
#
CONFIG_DMA_ENGINE=y
CONFIG_DMA_VIRTUAL_CHANNELS=y
CONFIG_DMA_ACPI=y
# CONFIG_ALTERA_MSGDMA is not set
# CONFIG_INTEL_IDMA64 is not set
# CONFIG_INTEL_IDXD is not set
# CONFIG_INTEL_IDXD_COMPAT is not set
# CONFIG_INTEL_IOATDMA is not set
# CONFIG_PLX_DMA is not set
# CONFIG_XILINX_DMA is not set
# CONFIG_XILINX_XDMA is not set
# CONFIG_AMD_PTDMA is not set
# CONFIG_QCOM_HIDMA_MGMT is not set
# CONFIG_QCOM_HIDMA is not set
CONFIG_DW_DMAC_CORE=y
# CONFIG_DW_DMAC is not set
# CONFIG_DW_DMAC_PCI is not set
# CONFIG_DW_EDMA is not set
CONFIG_HSU_DMA=y
# CONFIG_SF_PDMA is not set
# CONFIG_INTEL_LDMA is not set

#
# DMA Clients
#
# CONFIG_ASYNC_TX_DMA is not set
# CONFIG_DMATEST is not set

#
# DMABUF options
#
# CONFIG_SYNC_FILE is not set
# CONFIG_UDMABUF is not set
# CONFIG_DMABUF_MOVE_NOTIFY is not set
# CONFIG_DMABUF_DEBUG is not set
# CONFIG_DMABUF_SELFTESTS is not set
# CONFIG_DMABUF_HEAPS is not set
# CONFIG_DMABUF_SYSFS_STATS is not set
# end of DMABUF options

# CONFIG_UIO is not set
# CONFIG_VFIO is not set
CONFIG_IRQ_BYPASS_MANAGER=y
# CONFIG_VIRT_DRIVERS is not set
CONFIG_VIRTIO_ANCHOR=y
CONFIG_VIRTIO=y
CONFIG_VIRTIO_PCI_LIB=y
CONFIG_VIRTIO_PCI_LIB_LEGACY=y
CONFIG_VIRTIO_MENU=y
CONFIG_VIRTIO_PCI=y
CONFIG_VIRTIO_PCI_LEGACY=y
# CONFIG_VIRTIO_PMEM is not set
CONFIG_VIRTIO_BALLOON=m
# CONFIG_VIRTIO_INPUT is not set
# CONFIG_VIRTIO_MMIO is not set
# CONFIG_VDPA is not set
CONFIG_VHOST_IOTLB=y
CONFIG_VHOST_TASK=y
CONFIG_VHOST=y
CONFIG_VHOST_MENU=y
CONFIG_VHOST_NET=y
CONFIG_VHOST_VSOCK=y
# CONFIG_VHOST_CROSS_ENDIAN_LEGACY is not set

#
# Microsoft Hyper-V guest support
#
# end of Microsoft Hyper-V guest support

# CONFIG_GREYBUS is not set
# CONFIG_COMEDI is not set
# CONFIG_STAGING is not set
# CONFIG_CHROME_PLATFORMS is not set
# CONFIG_MELLANOX_PLATFORM is not set
CONFIG_SURFACE_PLATFORMS=y
# CONFIG_SURFACE_3_POWER_OPREGION is not set
# CONFIG_SURFACE_GPE is not set
# CONFIG_SURFACE_HOTPLUG is not set
# CONFIG_SURFACE_PRO3_BUTTON is not set
CONFIG_X86_PLATFORM_DEVICES=y
# CONFIG_ACPI_WMI is not set
# CONFIG_ACERHDF is not set
# CONFIG_ACER_WIRELESS is not set
# CONFIG_AMD_PMF is not set
# CONFIG_AMD_PMC is not set
# CONFIG_AMD_HSMP is not set
# CONFIG_ADV_SWBUTTON is not set
# CONFIG_ASUS_WIRELESS is not set
# CONFIG_ASUS_TF103C_DOCK is not set
CONFIG_X86_PLATFORM_DRIVERS_DELL=y
CONFIG_DCDBAS=m
CONFIG_DELL_RBU=m
CONFIG_DELL_SMBIOS=m
CONFIG_DELL_SMBIOS_SMM=y
CONFIG_DELL_SMO8800=m
# CONFIG_FUJITSU_TABLET is not set
# CONFIG_GPD_POCKET_FAN is not set
# CONFIG_X86_PLATFORM_DRIVERS_HP is not set
# CONFIG_WIRELESS_HOTKEY is not set
# CONFIG_IBM_RTL is not set
# CONFIG_SENSORS_HDAPS is not set
CONFIG_INTEL_IFS=m
# CONFIG_INTEL_SAR_INT1092 is not set
# CONFIG_INTEL_PMC_CORE is not set

#
# Intel Speed Select Technology interface support
#
# CONFIG_INTEL_SPEED_SELECT_INTERFACE is not set
# end of Intel Speed Select Technology interface support

#
# Intel Uncore Frequency Control
#
# CONFIG_INTEL_UNCORE_FREQ_CONTROL is not set
# end of Intel Uncore Frequency Control

# CONFIG_INTEL_HID_EVENT is not set
# CONFIG_INTEL_VBTN is not set
# CONFIG_INTEL_INT0002_VGPIO is not set
# CONFIG_INTEL_PUNIT_IPC is not set
# CONFIG_INTEL_RST is not set
# CONFIG_INTEL_SMARTCONNECT is not set
# CONFIG_INTEL_VSEC is not set
# CONFIG_MSI_EC is not set
# CONFIG_BARCO_P50_GPIO is not set
# CONFIG_SAMSUNG_Q10 is not set
# CONFIG_TOSHIBA_BT_RFKILL is not set
# CONFIG_TOSHIBA_HAPS is not set
# CONFIG_ACPI_CMPC is not set
# CONFIG_SYSTEM76_ACPI is not set
# CONFIG_TOPSTAR_LAPTOP is not set
# CONFIG_SERIAL_MULTI_INSTANTIATE is not set
# CONFIG_MLX_PLATFORM is not set
# CONFIG_INTEL_IPS is not set
# CONFIG_INTEL_SCU_PCI is not set
# CONFIG_INTEL_SCU_PLATFORM is not set
# CONFIG_SIEMENS_SIMATIC_IPC is not set
# CONFIG_WINMATE_FM07_KEYS is not set
CONFIG_P2SB=y
# CONFIG_COMMON_CLK is not set
# CONFIG_HWSPINLOCK is not set

#
# Clock Source drivers
#
CONFIG_CLKEVT_I8253=y
CONFIG_I8253_LOCK=y
CONFIG_CLKBLD_I8253=y
# end of Clock Source drivers

# CONFIG_MAILBOX is not set
CONFIG_IOMMU_IOVA=y
CONFIG_IOMMU_API=y
CONFIG_IOMMU_SUPPORT=y

#
# Generic IOMMU Pagetable Support
#
CONFIG_IOMMU_IO_PGTABLE=y
# end of Generic IOMMU Pagetable Support

# CONFIG_IOMMU_DEBUGFS is not set
# CONFIG_IOMMU_DEFAULT_DMA_STRICT is not set
CONFIG_IOMMU_DEFAULT_DMA_LAZY=y
# CONFIG_IOMMU_DEFAULT_PASSTHROUGH is not set
CONFIG_IOMMU_DMA=y
CONFIG_AMD_IOMMU=y
CONFIG_DMAR_TABLE=y
CONFIG_INTEL_IOMMU=y
# CONFIG_INTEL_IOMMU_SVM is not set
# CONFIG_INTEL_IOMMU_DEFAULT_ON is not set
CONFIG_INTEL_IOMMU_FLOPPY_WA=y
CONFIG_INTEL_IOMMU_SCALABLE_MODE_DEFAULT_ON=y
CONFIG_INTEL_IOMMU_PERF_EVENTS=y
# CONFIG_IOMMUFD is not set
CONFIG_IRQ_REMAP=y
# CONFIG_VIRTIO_IOMMU is not set

#
# Remoteproc drivers
#
# CONFIG_REMOTEPROC is not set
# end of Remoteproc drivers

#
# Rpmsg drivers
#
# CONFIG_RPMSG_VIRTIO is not set
# end of Rpmsg drivers

# CONFIG_SOUNDWIRE is not set

#
# SOC (System On Chip) specific Drivers
#

#
# Amlogic SoC drivers
#
# end of Amlogic SoC drivers

#
# Broadcom SoC drivers
#
# end of Broadcom SoC drivers

#
# NXP/Freescale QorIQ SoC drivers
#
# end of NXP/Freescale QorIQ SoC drivers

#
# fujitsu SoC drivers
#
# end of fujitsu SoC drivers

#
# i.MX SoC drivers
#
# end of i.MX SoC drivers

#
# Enable LiteX SoC Builder specific drivers
#
# end of Enable LiteX SoC Builder specific drivers

# CONFIG_WPCM450_SOC is not set

#
# Qualcomm SoC drivers
#
# end of Qualcomm SoC drivers

# CONFIG_SOC_TI is not set

#
# Xilinx SoC drivers
#
# end of Xilinx SoC drivers
# end of SOC (System On Chip) specific Drivers

#
# PM Domains
#

#
# Amlogic PM Domains
#
# end of Amlogic PM Domains

#
# Broadcom PM Domains
#
# end of Broadcom PM Domains

#
# i.MX PM Domains
#
# end of i.MX PM Domains

#
# Qualcomm PM Domains
#
# end of Qualcomm PM Domains
# end of PM Domains

# CONFIG_PM_DEVFREQ is not set
# CONFIG_EXTCON is not set
# CONFIG_MEMORY is not set
# CONFIG_IIO is not set
# CONFIG_NTB is not set
# CONFIG_PWM is not set

#
# IRQ chip support
#
# end of IRQ chip support

# CONFIG_IPACK_BUS is not set
# CONFIG_RESET_CONTROLLER is not set

#
# PHY Subsystem
#
CONFIG_GENERIC_PHY=y
# CONFIG_USB_LGM_PHY is not set
# CONFIG_PHY_CAN_TRANSCEIVER is not set

#
# PHY drivers for Broadcom platforms
#
# CONFIG_BCM_KONA_USB2_PHY is not set
# end of PHY drivers for Broadcom platforms

# CONFIG_PHY_PXA_28NM_HSIC is not set
# CONFIG_PHY_PXA_28NM_USB2 is not set
# CONFIG_PHY_INTEL_LGM_EMMC is not set
# end of PHY Subsystem

# CONFIG_POWERCAP is not set
# CONFIG_MCB is not set

#
# Performance monitor support
#
# end of Performance monitor support

CONFIG_RAS=y
# CONFIG_RAS_CEC is not set
# CONFIG_USB4 is not set

#
# Android
#
# CONFIG_ANDROID_BINDER_IPC is not set
# end of Android

CONFIG_LIBNVDIMM=y
# CONFIG_BLK_DEV_PMEM is not set
# CONFIG_BTT is not set
CONFIG_NVDIMM_KEYS=y
# CONFIG_NVDIMM_SECURITY_TEST is not set
# CONFIG_DAX is not set
CONFIG_NVMEM=y
CONFIG_NVMEM_SYSFS=y

#
# Layout Types
#
# CONFIG_NVMEM_LAYOUT_SL28_VPD is not set
# CONFIG_NVMEM_LAYOUT_ONIE_TLV is not set
# end of Layout Types

# CONFIG_NVMEM_RMEM is not set

#
# HW tracing support
#
# CONFIG_STM is not set
# CONFIG_INTEL_TH is not set
# end of HW tracing support

# CONFIG_FPGA is not set
# CONFIG_TEE is not set
# CONFIG_SIOX is not set
# CONFIG_SLIMBUS is not set
# CONFIG_INTERCONNECT is not set
# CONFIG_COUNTER is not set
# CONFIG_MOST is not set
# CONFIG_PECI is not set
# CONFIG_HTE is not set
# end of Device Drivers

#
# File systems
#
CONFIG_DCACHE_WORD_ACCESS=y
# CONFIG_VALIDATE_FS_PARSER is not set
CONFIG_FS_IOMAP=y
CONFIG_BUFFER_HEAD=y
CONFIG_LEGACY_DIRECT_IO=y
# CONFIG_EXT2_FS is not set
# CONFIG_EXT3_FS is not set
CONFIG_EXT4_FS=y
CONFIG_EXT4_USE_FOR_EXT2=y
# CONFIG_EXT4_FS_POSIX_ACL is not set
CONFIG_EXT4_FS_SECURITY=y
# CONFIG_EXT4_DEBUG is not set
CONFIG_JBD2=y
CONFIG_JBD2_DEBUG=y
CONFIG_FS_MBCACHE=y
# CONFIG_REISERFS_FS is not set
# CONFIG_JFS_FS is not set
# CONFIG_XFS_FS is not set
# CONFIG_GFS2_FS is not set
# CONFIG_OCFS2_FS is not set
# CONFIG_BTRFS_FS is not set
# CONFIG_NILFS2_FS is not set
# CONFIG_F2FS_FS is not set
# CONFIG_BCACHEFS_FS is not set
CONFIG_FS_POSIX_ACL=y
CONFIG_EXPORTFS=y
# CONFIG_EXPORTFS_BLOCK_OPS is not set
CONFIG_FILE_LOCKING=y
CONFIG_FS_ENCRYPTION=y
CONFIG_FS_ENCRYPTION_ALGS=y
# CONFIG_FS_VERITY is not set
CONFIG_FSNOTIFY=y
CONFIG_DNOTIFY=y
CONFIG_INOTIFY_USER=y
CONFIG_FANOTIFY=y
# CONFIG_FANOTIFY_ACCESS_PERMISSIONS is not set
CONFIG_QUOTA=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
# CONFIG_QUOTA_DEBUG is not set
CONFIG_QUOTA_TREE=y
# CONFIG_QFMT_V1 is not set
CONFIG_QFMT_V2=y
CONFIG_QUOTACTL=y
# CONFIG_AUTOFS_FS is not set
CONFIG_FUSE_FS=y
CONFIG_CUSE=m
# CONFIG_VIRTIO_FS is not set
CONFIG_OVERLAY_FS=y
# CONFIG_OVERLAY_FS_REDIRECT_DIR is not set
CONFIG_OVERLAY_FS_REDIRECT_ALWAYS_FOLLOW=y
# CONFIG_OVERLAY_FS_INDEX is not set
# CONFIG_OVERLAY_FS_XINO_AUTO is not set
# CONFIG_OVERLAY_FS_METACOPY is not set
# CONFIG_OVERLAY_FS_DEBUG is not set

#
# Caches
#
CONFIG_NETFS_SUPPORT=m
# CONFIG_NETFS_STATS is not set
# CONFIG_FSCACHE is not set
# end of Caches

#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=m
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
# CONFIG_UDF_FS is not set
# end of CD-ROM/DVD Filesystems

#
# DOS/FAT/EXFAT/NT Filesystems
#
CONFIG_FAT_FS=m
CONFIG_MSDOS_FS=m
CONFIG_VFAT_FS=m
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-1"
# CONFIG_FAT_DEFAULT_UTF8 is not set
# CONFIG_EXFAT_FS is not set
# CONFIG_NTFS_FS is not set
# CONFIG_NTFS3_FS is not set
# end of DOS/FAT/EXFAT/NT Filesystems

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_VMCORE=y
# CONFIG_PROC_VMCORE_DEVICE_DUMP is not set
CONFIG_PROC_SYSCTL=y
CONFIG_PROC_PAGE_MONITOR=y
CONFIG_PROC_CHILDREN=y
CONFIG_PROC_PID_ARCH_STATUS=y
CONFIG_PROC_CPU_RESCTRL=y
CONFIG_KERNFS=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_TMPFS_XATTR=y
CONFIG_TMPFS_INODE64=y
# CONFIG_TMPFS_QUOTA is not set
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y
# CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON is not set
CONFIG_ARCH_HAS_GIGANTIC_PAGE=y
CONFIG_CONFIGFS_FS=y
CONFIG_EFIVAR_FS=m
# end of Pseudo filesystems

CONFIG_MISC_FILESYSTEMS=y
# CONFIG_ORANGEFS_FS is not set
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
# CONFIG_ECRYPT_FS is not set
# CONFIG_HFS_FS is not set
# CONFIG_HFSPLUS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
# CONFIG_JFFS2_FS is not set
# CONFIG_CRAMFS is not set
# CONFIG_SQUASHFS is not set
# CONFIG_VXFS_FS is not set
# CONFIG_MINIX_FS is not set
# CONFIG_OMFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
# CONFIG_QNX6FS_FS is not set
# CONFIG_ROMFS_FS is not set
CONFIG_PSTORE=y
CONFIG_PSTORE_DEFAULT_KMSG_BYTES=10240
CONFIG_PSTORE_COMPRESS=y
# CONFIG_PSTORE_CONSOLE is not set
# CONFIG_PSTORE_PMSG is not set
# CONFIG_PSTORE_FTRACE is not set
# CONFIG_PSTORE_RAM is not set
# CONFIG_PSTORE_BLK is not set
# CONFIG_SYSV_FS is not set
# CONFIG_UFS_FS is not set
# CONFIG_EROFS_FS is not set
CONFIG_NETWORK_FILESYSTEMS=y
# CONFIG_NFS_FS is not set
# CONFIG_NFSD is not set
# CONFIG_CEPH_FS is not set
# CONFIG_CIFS is not set
# CONFIG_SMB_SERVER is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set
CONFIG_9P_FS=m
# CONFIG_9P_FS_POSIX_ACL is not set
# CONFIG_9P_FS_SECURITY is not set
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="iso8859-1"
CONFIG_NLS_CODEPAGE_437=y
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
# CONFIG_NLS_CODEPAGE_850 is not set
# CONFIG_NLS_CODEPAGE_852 is not set
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
# CONFIG_NLS_CODEPAGE_860 is not set
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
# CONFIG_NLS_CODEPAGE_936 is not set
# CONFIG_NLS_CODEPAGE_950 is not set
# CONFIG_NLS_CODEPAGE_932 is not set
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
# CONFIG_NLS_CODEPAGE_1250 is not set
# CONFIG_NLS_CODEPAGE_1251 is not set
CONFIG_NLS_ASCII=y
CONFIG_NLS_ISO8859_1=y
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
# CONFIG_NLS_ISO8859_15 is not set
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
# CONFIG_NLS_MAC_ROMAN is not set
# CONFIG_NLS_MAC_CELTIC is not set
# CONFIG_NLS_MAC_CENTEURO is not set
# CONFIG_NLS_MAC_CROATIAN is not set
# CONFIG_NLS_MAC_CYRILLIC is not set
# CONFIG_NLS_MAC_GAELIC is not set
# CONFIG_NLS_MAC_GREEK is not set
# CONFIG_NLS_MAC_ICELAND is not set
# CONFIG_NLS_MAC_INUIT is not set
# CONFIG_NLS_MAC_ROMANIAN is not set
# CONFIG_NLS_MAC_TURKISH is not set
CONFIG_NLS_UTF8=y
# CONFIG_DLM is not set
# CONFIG_UNICODE is not set
CONFIG_IO_WQ=y
# end of File systems

#
# Security options
#
CONFIG_KEYS=y
# CONFIG_KEYS_REQUEST_CACHE is not set
CONFIG_PERSISTENT_KEYRINGS=y
# CONFIG_TRUSTED_KEYS is not set
CONFIG_ENCRYPTED_KEYS=y
# CONFIG_USER_DECRYPTED_DATA is not set
# CONFIG_KEY_DH_OPERATIONS is not set
# CONFIG_SECURITY_DMESG_RESTRICT is not set
CONFIG_SECURITY=y
# CONFIG_SECURITYFS is not set
CONFIG_SECURITY_NETWORK=y
# CONFIG_SECURITY_INFINIBAND is not set
# CONFIG_SECURITY_PATH is not set
# CONFIG_INTEL_TXT is not set
# CONFIG_HARDENED_USERCOPY is not set
# CONFIG_FORTIFY_SOURCE is not set
# CONFIG_STATIC_USERMODEHELPER is not set
# CONFIG_SECURITY_SELINUX is not set
# CONFIG_SECURITY_SMACK is not set
# CONFIG_SECURITY_TOMOYO is not set
# CONFIG_SECURITY_APPARMOR is not set
# CONFIG_SECURITY_LOADPIN is not set
# CONFIG_SECURITY_YAMA is not set
# CONFIG_SECURITY_SAFESETID is not set
# CONFIG_SECURITY_LOCKDOWN_LSM is not set
# CONFIG_SECURITY_LANDLOCK is not set
CONFIG_INTEGRITY=y
# CONFIG_INTEGRITY_SIGNATURE is not set
CONFIG_INTEGRITY_AUDIT=y
# CONFIG_IMA is not set
# CONFIG_IMA_SECURE_AND_OR_TRUSTED_BOOT is not set
# CONFIG_EVM is not set
CONFIG_DEFAULT_SECURITY_DAC=y
CONFIG_LSM="landlock,lockdown,yama,loadpin,safesetid,bpf"

#
# Kernel hardening options
#

#
# Memory initialization
#
CONFIG_CC_HAS_AUTO_VAR_INIT_PATTERN=y
CONFIG_CC_HAS_AUTO_VAR_INIT_ZERO_BARE=y
CONFIG_CC_HAS_AUTO_VAR_INIT_ZERO=y
CONFIG_INIT_STACK_NONE=y
# CONFIG_INIT_STACK_ALL_PATTERN is not set
# CONFIG_INIT_STACK_ALL_ZERO is not set
# CONFIG_INIT_ON_ALLOC_DEFAULT_ON is not set
# CONFIG_INIT_ON_FREE_DEFAULT_ON is not set
CONFIG_CC_HAS_ZERO_CALL_USED_REGS=y
# CONFIG_ZERO_CALL_USED_REGS is not set
# end of Memory initialization

#
# Hardening of kernel data structures
#
CONFIG_LIST_HARDENED=y
CONFIG_BUG_ON_DATA_CORRUPTION=y
# end of Hardening of kernel data structures

CONFIG_CC_HAS_RANDSTRUCT=y
CONFIG_RANDSTRUCT_NONE=y
# CONFIG_RANDSTRUCT_FULL is not set
# end of Kernel hardening options
# end of Security options

CONFIG_CRYPTO=y

#
# Crypto core or helper
#
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_ALGAPI2=y
CONFIG_CRYPTO_AEAD=y
CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_SIG2=y
CONFIG_CRYPTO_SKCIPHER=y
CONFIG_CRYPTO_SKCIPHER2=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_RNG=y
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_AKCIPHER2=y
CONFIG_CRYPTO_AKCIPHER=m
CONFIG_CRYPTO_KPP2=y
CONFIG_CRYPTO_ACOMP2=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_MANAGER2=y
# CONFIG_CRYPTO_USER is not set
CONFIG_CRYPTO_MANAGER_DISABLE_TESTS=y
CONFIG_CRYPTO_NULL=y
CONFIG_CRYPTO_NULL2=y
# CONFIG_CRYPTO_PCRYPT is not set
CONFIG_CRYPTO_CRYPTD=y
CONFIG_CRYPTO_AUTHENC=y
# CONFIG_CRYPTO_TEST is not set
CONFIG_CRYPTO_SIMD=y
# end of Crypto core or helper

#
# Public-key cryptography
#
CONFIG_CRYPTO_RSA=m
# CONFIG_CRYPTO_DH is not set
# CONFIG_CRYPTO_ECDH is not set
# CONFIG_CRYPTO_ECDSA is not set
# CONFIG_CRYPTO_ECRDSA is not set
# CONFIG_CRYPTO_SM2 is not set
# CONFIG_CRYPTO_CURVE25519 is not set
# end of Public-key cryptography

#
# Block ciphers
#
CONFIG_CRYPTO_AES=y
# CONFIG_CRYPTO_AES_TI is not set
# CONFIG_CRYPTO_ANUBIS is not set
# CONFIG_CRYPTO_ARIA is not set
# CONFIG_CRYPTO_BLOWFISH is not set
# CONFIG_CRYPTO_CAMELLIA is not set
# CONFIG_CRYPTO_CAST5 is not set
# CONFIG_CRYPTO_CAST6 is not set
CONFIG_CRYPTO_DES=y
# CONFIG_CRYPTO_FCRYPT is not set
# CONFIG_CRYPTO_KHAZAD is not set
# CONFIG_CRYPTO_SEED is not set
# CONFIG_CRYPTO_SERPENT is not set
# CONFIG_CRYPTO_SM4_GENERIC is not set
# CONFIG_CRYPTO_TEA is not set
# CONFIG_CRYPTO_TWOFISH is not set
# end of Block ciphers

#
# Length-preserving ciphers and modes
#
# CONFIG_CRYPTO_ADIANTUM is not set
CONFIG_CRYPTO_ARC4=y
# CONFIG_CRYPTO_CHACHA20 is not set
CONFIG_CRYPTO_CBC=y
# CONFIG_CRYPTO_CFB is not set
# CONFIG_CRYPTO_CTR is not set
CONFIG_CRYPTO_CTS=y
CONFIG_CRYPTO_ECB=y
# CONFIG_CRYPTO_HCTR2 is not set
# CONFIG_CRYPTO_KEYWRAP is not set
# CONFIG_CRYPTO_LRW is not set
# CONFIG_CRYPTO_OFB is not set
# CONFIG_CRYPTO_PCBC is not set
CONFIG_CRYPTO_XTS=y
# end of Length-preserving ciphers and modes

#
# AEAD (authenticated encryption with associated data) ciphers
#
# CONFIG_CRYPTO_AEGIS128 is not set
# CONFIG_CRYPTO_CHACHA20POLY1305 is not set
# CONFIG_CRYPTO_CCM is not set
# CONFIG_CRYPTO_GCM is not set
# CONFIG_CRYPTO_SEQIV is not set
# CONFIG_CRYPTO_ECHAINIV is not set
CONFIG_CRYPTO_ESSIV=y
# end of AEAD (authenticated encryption with associated data) ciphers

#
# Hashes, digests, and MACs
#
# CONFIG_CRYPTO_BLAKE2B is not set
# CONFIG_CRYPTO_CMAC is not set
# CONFIG_CRYPTO_GHASH is not set
CONFIG_CRYPTO_HMAC=y
# CONFIG_CRYPTO_MD4 is not set
CONFIG_CRYPTO_MD5=y
# CONFIG_CRYPTO_MICHAEL_MIC is not set
# CONFIG_CRYPTO_POLY1305 is not set
# CONFIG_CRYPTO_RMD160 is not set
CONFIG_CRYPTO_SHA1=y
CONFIG_CRYPTO_SHA256=y
CONFIG_CRYPTO_SHA512=y
CONFIG_CRYPTO_SHA3=m
# CONFIG_CRYPTO_SM3_GENERIC is not set
# CONFIG_CRYPTO_STREEBOG is not set
CONFIG_CRYPTO_VMAC=y
# CONFIG_CRYPTO_WP512 is not set
# CONFIG_CRYPTO_XCBC is not set
# CONFIG_CRYPTO_XXHASH is not set
# end of Hashes, digests, and MACs

#
# CRCs (cyclic redundancy checks)
#
CONFIG_CRYPTO_CRC32C=y
# CONFIG_CRYPTO_CRC32 is not set
# CONFIG_CRYPTO_CRCT10DIF is not set
# end of CRCs (cyclic redundancy checks)

#
# Compression
#
# CONFIG_CRYPTO_DEFLATE is not set
CONFIG_CRYPTO_LZO=y
# CONFIG_CRYPTO_842 is not set
# CONFIG_CRYPTO_LZ4 is not set
# CONFIG_CRYPTO_LZ4HC is not set
# CONFIG_CRYPTO_ZSTD is not set
# end of Compression

#
# Random number generation
#
# CONFIG_CRYPTO_ANSI_CPRNG is not set
# CONFIG_CRYPTO_DRBG_MENU is not set
# CONFIG_CRYPTO_JITTERENTROPY is not set
# end of Random number generation

#
# Userspace interface
#
CONFIG_CRYPTO_USER_API=y
CONFIG_CRYPTO_USER_API_HASH=y
CONFIG_CRYPTO_USER_API_SKCIPHER=y
# CONFIG_CRYPTO_USER_API_RNG is not set
# CONFIG_CRYPTO_USER_API_AEAD is not set
CONFIG_CRYPTO_USER_API_ENABLE_OBSOLETE=y
# end of Userspace interface

CONFIG_CRYPTO_HASH_INFO=y

#
# Accelerated Cryptographic Algorithms for CPU (x86)
#
# CONFIG_CRYPTO_CURVE25519_X86 is not set
CONFIG_CRYPTO_AES_NI_INTEL=y
# CONFIG_CRYPTO_BLOWFISH_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA_AESNI_AVX_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA_AESNI_AVX2_X86_64 is not set
# CONFIG_CRYPTO_CAST5_AVX_X86_64 is not set
# CONFIG_CRYPTO_CAST6_AVX_X86_64 is not set
# CONFIG_CRYPTO_DES3_EDE_X86_64 is not set
# CONFIG_CRYPTO_SERPENT_SSE2_X86_64 is not set
# CONFIG_CRYPTO_SERPENT_AVX_X86_64 is not set
# CONFIG_CRYPTO_SERPENT_AVX2_X86_64 is not set
# CONFIG_CRYPTO_SM4_AESNI_AVX_X86_64 is not set
# CONFIG_CRYPTO_SM4_AESNI_AVX2_X86_64 is not set
# CONFIG_CRYPTO_TWOFISH_X86_64 is not set
# CONFIG_CRYPTO_TWOFISH_X86_64_3WAY is not set
# CONFIG_CRYPTO_TWOFISH_AVX_X86_64 is not set
# CONFIG_CRYPTO_ARIA_AESNI_AVX_X86_64 is not set
# CONFIG_CRYPTO_ARIA_AESNI_AVX2_X86_64 is not set
# CONFIG_CRYPTO_ARIA_GFNI_AVX512_X86_64 is not set
# CONFIG_CRYPTO_CHACHA20_X86_64 is not set
# CONFIG_CRYPTO_AEGIS128_AESNI_SSE2 is not set
# CONFIG_CRYPTO_NHPOLY1305_SSE2 is not set
# CONFIG_CRYPTO_NHPOLY1305_AVX2 is not set
# CONFIG_CRYPTO_BLAKE2S_X86 is not set
# CONFIG_CRYPTO_POLYVAL_CLMUL_NI is not set
# CONFIG_CRYPTO_POLY1305_X86_64 is not set
# CONFIG_CRYPTO_SHA1_SSSE3 is not set
# CONFIG_CRYPTO_SHA256_SSSE3 is not set
# CONFIG_CRYPTO_SHA512_SSSE3 is not set
# CONFIG_CRYPTO_SM3_AVX_X86_64 is not set
# CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL is not set
# CONFIG_CRYPTO_CRC32C_INTEL is not set
# CONFIG_CRYPTO_CRC32_PCLMUL is not set
# end of Accelerated Cryptographic Algorithms for CPU (x86)

CONFIG_CRYPTO_HW=y
# CONFIG_CRYPTO_DEV_PADLOCK is not set
# CONFIG_CRYPTO_DEV_ATMEL_ECC is not set
# CONFIG_CRYPTO_DEV_ATMEL_SHA204A is not set
CONFIG_CRYPTO_DEV_CCP=y
CONFIG_CRYPTO_DEV_CCP_DD=y
CONFIG_CRYPTO_DEV_SP_CCP=y
CONFIG_CRYPTO_DEV_CCP_CRYPTO=m
CONFIG_CRYPTO_DEV_SP_PSP=y
# CONFIG_CRYPTO_DEV_CCP_DEBUGFS is not set
# CONFIG_CRYPTO_DEV_NITROX_CNN55XX is not set
# CONFIG_CRYPTO_DEV_QAT_DH895xCC is not set
# CONFIG_CRYPTO_DEV_QAT_C3XXX is not set
# CONFIG_CRYPTO_DEV_QAT_C62X is not set
# CONFIG_CRYPTO_DEV_QAT_4XXX is not set
# CONFIG_CRYPTO_DEV_QAT_DH895xCCVF is not set
# CONFIG_CRYPTO_DEV_QAT_C3XXXVF is not set
# CONFIG_CRYPTO_DEV_QAT_C62XVF is not set
# CONFIG_CRYPTO_DEV_VIRTIO is not set
# CONFIG_CRYPTO_DEV_SAFEXCEL is not set
# CONFIG_CRYPTO_DEV_AMLOGIC_GXL is not set
CONFIG_ASYMMETRIC_KEY_TYPE=y
CONFIG_ASYMMETRIC_PUBLIC_KEY_SUBTYPE=m
CONFIG_X509_CERTIFICATE_PARSER=m
# CONFIG_PKCS8_PRIVATE_KEY_PARSER is not set
CONFIG_PKCS7_MESSAGE_PARSER=m
# CONFIG_FIPS_SIGNATURE_SELFTEST is not set

#
# Certificates for signature checking
#
# CONFIG_SYSTEM_BLACKLIST_KEYRING is not set
# end of Certificates for signature checking

CONFIG_BINARY_PRINTF=y

#
# Library routines
#
# CONFIG_PACKING is not set
CONFIG_BITREVERSE=y
CONFIG_GENERIC_STRNCPY_FROM_USER=y
CONFIG_GENERIC_STRNLEN_USER=y
CONFIG_GENERIC_NET_UTILS=y
# CONFIG_CORDIC is not set
# CONFIG_PRIME_NUMBERS is not set
CONFIG_RATIONAL=y
CONFIG_GENERIC_PCI_IOMAP=y
CONFIG_GENERIC_IOMAP=y
CONFIG_ARCH_USE_CMPXCHG_LOCKREF=y
CONFIG_ARCH_HAS_FAST_MULTIPLIER=y
CONFIG_ARCH_USE_SYM_ANNOTATIONS=y

#
# Crypto library routines
#
CONFIG_CRYPTO_LIB_UTILS=y
CONFIG_CRYPTO_LIB_AES=y
CONFIG_CRYPTO_LIB_ARC4=y
CONFIG_CRYPTO_LIB_BLAKE2S_GENERIC=y
# CONFIG_CRYPTO_LIB_CHACHA is not set
# CONFIG_CRYPTO_LIB_CURVE25519 is not set
CONFIG_CRYPTO_LIB_DES=y
CONFIG_CRYPTO_LIB_POLY1305_RSIZE=11
# CONFIG_CRYPTO_LIB_POLY1305 is not set
# CONFIG_CRYPTO_LIB_CHACHA20POLY1305 is not set
CONFIG_CRYPTO_LIB_SHA1=y
CONFIG_CRYPTO_LIB_SHA256=y
# end of Crypto library routines

# CONFIG_CRC_CCITT is not set
CONFIG_CRC16=y
# CONFIG_CRC_T10DIF is not set
# CONFIG_CRC64_ROCKSOFT is not set
CONFIG_CRC_ITU_T=m
CONFIG_CRC32=y
# CONFIG_CRC32_SELFTEST is not set
CONFIG_CRC32_SLICEBY8=y
# CONFIG_CRC32_SLICEBY4 is not set
# CONFIG_CRC32_SARWATE is not set
# CONFIG_CRC32_BIT is not set
# CONFIG_CRC64 is not set
# CONFIG_CRC4 is not set
# CONFIG_CRC7 is not set
CONFIG_LIBCRC32C=y
# CONFIG_CRC8 is not set
CONFIG_XXHASH=y
# CONFIG_RANDOM32_SELFTEST is not set
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=y
CONFIG_LZO_COMPRESS=y
CONFIG_LZO_DECOMPRESS=y
CONFIG_LZ4_DECOMPRESS=y
CONFIG_ZSTD_COMMON=y
CONFIG_ZSTD_DECOMPRESS=y
CONFIG_XZ_DEC=y
CONFIG_XZ_DEC_X86=y
CONFIG_XZ_DEC_POWERPC=y
CONFIG_XZ_DEC_ARM=y
CONFIG_XZ_DEC_ARMTHUMB=y
CONFIG_XZ_DEC_SPARC=y
# CONFIG_XZ_DEC_MICROLZMA is not set
CONFIG_XZ_DEC_BCJ=y
# CONFIG_XZ_DEC_TEST is not set
CONFIG_DECOMPRESS_GZIP=y
CONFIG_DECOMPRESS_BZIP2=y
CONFIG_DECOMPRESS_LZMA=y
CONFIG_DECOMPRESS_XZ=y
CONFIG_DECOMPRESS_LZO=y
CONFIG_DECOMPRESS_LZ4=y
CONFIG_DECOMPRESS_ZSTD=y
CONFIG_GENERIC_ALLOCATOR=y
CONFIG_TEXTSEARCH=y
CONFIG_TEXTSEARCH_KMP=m
CONFIG_TEXTSEARCH_BM=m
CONFIG_TEXTSEARCH_FSM=m
CONFIG_BTREE=y
CONFIG_INTERVAL_TREE=y
CONFIG_XARRAY_MULTI=y
CONFIG_ASSOCIATIVE_ARRAY=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_IOPORT_MAP=y
CONFIG_HAS_DMA=y
CONFIG_DMA_OPS=y
CONFIG_NEED_SG_DMA_FLAGS=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_ARCH_DMA_ADDR_T_64BIT=y
CONFIG_ARCH_HAS_FORCE_DMA_UNENCRYPTED=y
CONFIG_SWIOTLB=y
# CONFIG_SWIOTLB_DYNAMIC is not set
CONFIG_DMA_COHERENT_POOL=y
# CONFIG_DMA_API_DEBUG is not set
# CONFIG_DMA_MAP_BENCHMARK is not set
CONFIG_SGL_ALLOC=y
CONFIG_CHECK_SIGNATURE=y
# CONFIG_FORCE_NR_CPUS is not set
CONFIG_CPU_RMAP=y
CONFIG_DQL=y
CONFIG_GLOB=y
# CONFIG_GLOB_SELFTEST is not set
CONFIG_NLATTR=y
CONFIG_CLZ_TAB=y
CONFIG_IRQ_POLL=y
CONFIG_MPILIB=m
CONFIG_DIMLIB=y
CONFIG_OID_REGISTRY=m
CONFIG_UCS2_STRING=y
CONFIG_HAVE_GENERIC_VDSO=y
CONFIG_GENERIC_GETTIMEOFDAY=y
CONFIG_GENERIC_VDSO_TIME_NS=y
CONFIG_FONT_SUPPORT=y
# CONFIG_FONTS is not set
CONFIG_FONT_8x8=y
CONFIG_FONT_8x16=y
CONFIG_SG_POOL=y
CONFIG_ARCH_HAS_PMEM_API=y
CONFIG_MEMREGION=y
CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION=y
CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE=y
CONFIG_ARCH_HAS_COPY_MC=y
CONFIG_ARCH_STACKWALK=y
CONFIG_SBITMAP=y
# CONFIG_LWQ_TEST is not set
# end of Library routines

CONFIG_FIRMWARE_TABLE=y

#
# Kernel hacking
#

#
# printk and dmesg options
#
CONFIG_PRINTK_TIME=y
# CONFIG_PRINTK_CALLER is not set
# CONFIG_STACKTRACE_BUILD_ID is not set
CONFIG_CONSOLE_LOGLEVEL_DEFAULT=7
CONFIG_CONSOLE_LOGLEVEL_QUIET=4
CONFIG_MESSAGE_LOGLEVEL_DEFAULT=4
# CONFIG_BOOT_PRINTK_DELAY is not set
# CONFIG_DYNAMIC_DEBUG is not set
# CONFIG_DYNAMIC_DEBUG_CORE is not set
CONFIG_SYMBOLIC_ERRNAME=y
CONFIG_DEBUG_BUGVERBOSE=y
# end of printk and dmesg options

CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_MISC=y

#
# Compile-time checks and compiler options
#
CONFIG_DEBUG_INFO=y
CONFIG_AS_HAS_NON_CONST_LEB128=y
# CONFIG_DEBUG_INFO_NONE is not set
CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y
# CONFIG_DEBUG_INFO_DWARF4 is not set
# CONFIG_DEBUG_INFO_DWARF5 is not set
# CONFIG_DEBUG_INFO_REDUCED is not set
CONFIG_DEBUG_INFO_COMPRESSED_NONE=y
# CONFIG_DEBUG_INFO_COMPRESSED_ZLIB is not set
# CONFIG_DEBUG_INFO_COMPRESSED_ZSTD is not set
# CONFIG_DEBUG_INFO_SPLIT is not set
CONFIG_DEBUG_INFO_BTF=y
CONFIG_PAHOLE_HAS_SPLIT_BTF=y
CONFIG_PAHOLE_HAS_BTF_TAG=y
CONFIG_PAHOLE_HAS_LANG_EXCLUDE=y
CONFIG_DEBUG_INFO_BTF_MODULES=y
CONFIG_MODULE_ALLOW_BTF_MISMATCH=y
CONFIG_GDB_SCRIPTS=y
CONFIG_FRAME_WARN=2048
# CONFIG_STRIP_ASM_SYMS is not set
# CONFIG_HEADERS_INSTALL is not set
# CONFIG_SECTION_MISMATCH_WARN_ONLY is not set
# CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B is not set
CONFIG_FRAME_POINTER=y
CONFIG_OBJTOOL=y
# CONFIG_STACK_VALIDATION is not set
# CONFIG_VMLINUX_MAP is not set
# CONFIG_DEBUG_FORCE_WEAK_PER_CPU is not set
# end of Compile-time checks and compiler options

#
# Generic Kernel Debugging Instruments
#
CONFIG_MAGIC_SYSRQ=y
CONFIG_MAGIC_SYSRQ_DEFAULT_ENABLE=0x1
CONFIG_MAGIC_SYSRQ_SERIAL=y
CONFIG_MAGIC_SYSRQ_SERIAL_SEQUENCE=""
CONFIG_DEBUG_FS=y
CONFIG_DEBUG_FS_ALLOW_ALL=y
# CONFIG_DEBUG_FS_DISALLOW_MOUNT is not set
# CONFIG_DEBUG_FS_ALLOW_NONE is not set
CONFIG_HAVE_ARCH_KGDB=y
# CONFIG_KGDB is not set
CONFIG_ARCH_HAS_UBSAN_SANITIZE_ALL=y
# CONFIG_UBSAN is not set
CONFIG_HAVE_ARCH_KCSAN=y
CONFIG_HAVE_KCSAN_COMPILER=y
# CONFIG_KCSAN is not set
# end of Generic Kernel Debugging Instruments

#
# Networking Debugging
#
# CONFIG_NET_DEV_REFCNT_TRACKER is not set
# CONFIG_NET_NS_REFCNT_TRACKER is not set
# CONFIG_DEBUG_NET is not set
# end of Networking Debugging

#
# Memory Debugging
#
# CONFIG_PAGE_EXTENSION is not set
# CONFIG_DEBUG_PAGEALLOC is not set
# CONFIG_DEBUG_SLAB is not set
# CONFIG_PAGE_OWNER is not set
# CONFIG_PAGE_TABLE_CHECK is not set
# CONFIG_PAGE_POISONING is not set
# CONFIG_DEBUG_PAGE_REF is not set
# CONFIG_DEBUG_RODATA_TEST is not set
CONFIG_ARCH_HAS_DEBUG_WX=y
# CONFIG_DEBUG_WX is not set
CONFIG_GENERIC_PTDUMP=y
# CONFIG_PTDUMP_DEBUGFS is not set
CONFIG_HAVE_DEBUG_KMEMLEAK=y
# CONFIG_DEBUG_KMEMLEAK is not set
# CONFIG_PER_VMA_LOCK_STATS is not set
# CONFIG_DEBUG_OBJECTS is not set
# CONFIG_SHRINKER_DEBUG is not set
# CONFIG_DEBUG_STACK_USAGE is not set
# CONFIG_SCHED_STACK_END_CHECK is not set
CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE=y
# CONFIG_DEBUG_VM is not set
# CONFIG_DEBUG_VM_PGTABLE is not set
CONFIG_ARCH_HAS_DEBUG_VIRTUAL=y
# CONFIG_DEBUG_VIRTUAL is not set
CONFIG_DEBUG_MEMORY_INIT=y
# CONFIG_DEBUG_PER_CPU_MAPS is not set
CONFIG_ARCH_SUPPORTS_KMAP_LOCAL_FORCE_MAP=y
# CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP is not set
CONFIG_HAVE_ARCH_KASAN=y
CONFIG_HAVE_ARCH_KASAN_VMALLOC=y
CONFIG_CC_HAS_KASAN_GENERIC=y
CONFIG_CC_HAS_KASAN_SW_TAGS=y
CONFIG_CC_HAS_WORKING_NOSANITIZE_ADDRESS=y
# CONFIG_KASAN is not set
CONFIG_HAVE_ARCH_KFENCE=y
CONFIG_KFENCE=y
CONFIG_KFENCE_SAMPLE_INTERVAL=100
CONFIG_KFENCE_NUM_OBJECTS=1023
# CONFIG_KFENCE_DEFERRABLE is not set
# CONFIG_KFENCE_STATIC_KEYS is not set
CONFIG_KFENCE_STRESS_TEST_FAULTS=0
CONFIG_HAVE_ARCH_KMSAN=y
CONFIG_HAVE_KMSAN_COMPILER=y
# end of Memory Debugging

# CONFIG_DEBUG_SHIRQ is not set

#
# Debug Oops, Lockups and Hangs
#
# CONFIG_PANIC_ON_OOPS is not set
CONFIG_PANIC_ON_OOPS_VALUE=0
CONFIG_PANIC_TIMEOUT=0
CONFIG_LOCKUP_DETECTOR=y
CONFIG_SOFTLOCKUP_DETECTOR=y
# CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set
CONFIG_HAVE_HARDLOCKUP_DETECTOR_BUDDY=y
CONFIG_HARDLOCKUP_DETECTOR=y
# CONFIG_HARDLOCKUP_DETECTOR_PREFER_BUDDY is not set
CONFIG_HARDLOCKUP_DETECTOR_PERF=y
# CONFIG_HARDLOCKUP_DETECTOR_BUDDY is not set
# CONFIG_HARDLOCKUP_DETECTOR_ARCH is not set
CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER=y
CONFIG_HARDLOCKUP_CHECK_TIMESTAMP=y
# CONFIG_BOOTPARAM_HARDLOCKUP_PANIC is not set
# CONFIG_DETECT_HUNG_TASK is not set
# CONFIG_WQ_WATCHDOG is not set
# CONFIG_WQ_CPU_INTENSIVE_REPORT is not set
# CONFIG_TEST_LOCKUP is not set
# end of Debug Oops, Lockups and Hangs

#
# Scheduler Debugging
#
CONFIG_SCHED_DEBUG=y
CONFIG_SCHED_INFO=y
CONFIG_SCHEDSTATS=y
# end of Scheduler Debugging

# CONFIG_DEBUG_TIMEKEEPING is not set

#
# Lock Debugging (spinlocks, mutexes, etc...)
#
CONFIG_LOCK_DEBUGGING_SUPPORT=y
# CONFIG_PROVE_LOCKING is not set
# CONFIG_LOCK_STAT is not set
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_MUTEXES is not set
# CONFIG_DEBUG_WW_MUTEX_SLOWPATH is not set
# CONFIG_DEBUG_RWSEMS is not set
# CONFIG_DEBUG_LOCK_ALLOC is not set
# CONFIG_DEBUG_ATOMIC_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_LOCK_TORTURE_TEST is not set
# CONFIG_WW_MUTEX_SELFTEST is not set
# CONFIG_SCF_TORTURE_TEST is not set
# CONFIG_CSD_LOCK_WAIT_DEBUG is not set
# end of Lock Debugging (spinlocks, mutexes, etc...)

# CONFIG_NMI_CHECK_CPU is not set
# CONFIG_DEBUG_IRQFLAGS is not set
CONFIG_STACKTRACE=y
# CONFIG_WARN_ALL_UNSEEDED_RANDOM is not set
# CONFIG_DEBUG_KOBJECT is not set

#
# Debug kernel data structures
#
# CONFIG_DEBUG_LIST is not set
# CONFIG_DEBUG_PLIST is not set
# CONFIG_DEBUG_SG is not set
# CONFIG_DEBUG_NOTIFIERS is not set
# CONFIG_DEBUG_MAPLE_TREE is not set
# end of Debug kernel data structures

# CONFIG_DEBUG_CREDENTIALS is not set

#
# RCU Debugging
#
# CONFIG_RCU_SCALE_TEST is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_REF_SCALE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=21
CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=0
# CONFIG_RCU_CPU_STALL_CPUTIME is not set
CONFIG_RCU_TRACE=y
# CONFIG_RCU_EQS_DEBUG is not set
# end of RCU Debugging

# CONFIG_DEBUG_WQ_FORCE_RR_CPU is not set
# CONFIG_CPU_HOTPLUG_STATE_CONTROL is not set
# CONFIG_LATENCYTOP is not set
# CONFIG_DEBUG_CGROUP_REF is not set
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_RETHOOK=y
CONFIG_RETHOOK=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_RETVAL=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS=y
CONFIG_HAVE_DYNAMIC_FTRACE_NO_PATCHABLE=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_SYSCALL_TRACEPOINTS=y
CONFIG_HAVE_FENTRY=y
CONFIG_HAVE_OBJTOOL_MCOUNT=y
CONFIG_HAVE_OBJTOOL_NOP_MCOUNT=y
CONFIG_HAVE_C_RECORDMCOUNT=y
CONFIG_HAVE_BUILDTIME_MCOUNT_SORT=y
CONFIG_BUILDTIME_MCOUNT_SORT=y
CONFIG_TRACE_CLOCK=y
CONFIG_RING_BUFFER=y
CONFIG_EVENT_TRACING=y
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_TRACING=y
CONFIG_GENERIC_TRACER=y
CONFIG_TRACING_SUPPORT=y
CONFIG_FTRACE=y
# CONFIG_BOOTTIME_TRACING is not set
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
# CONFIG_FUNCTION_GRAPH_RETVAL is not set
CONFIG_DYNAMIC_FTRACE=y
CONFIG_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS=y
CONFIG_DYNAMIC_FTRACE_WITH_ARGS=y
# CONFIG_FPROBE is not set
# CONFIG_FUNCTION_PROFILER is not set
# CONFIG_STACK_TRACER is not set
# CONFIG_IRQSOFF_TRACER is not set
# CONFIG_SCHED_TRACER is not set
# CONFIG_HWLAT_TRACER is not set
# CONFIG_OSNOISE_TRACER is not set
# CONFIG_TIMERLAT_TRACER is not set
# CONFIG_MMIOTRACE is not set
CONFIG_FTRACE_SYSCALLS=y
# CONFIG_TRACER_SNAPSHOT is not set
CONFIG_BRANCH_PROFILE_NONE=y
# CONFIG_PROFILE_ANNOTATED_BRANCHES is not set
# CONFIG_PROFILE_ALL_BRANCHES is not set
CONFIG_BLK_DEV_IO_TRACE=y
CONFIG_PROBE_EVENTS_BTF_ARGS=y
CONFIG_KPROBE_EVENTS=y
# CONFIG_KPROBE_EVENTS_ON_NOTRACE is not set
CONFIG_UPROBE_EVENTS=y
CONFIG_BPF_EVENTS=y
CONFIG_DYNAMIC_EVENTS=y
CONFIG_PROBE_EVENTS=y
# CONFIG_BPF_KPROBE_OVERRIDE is not set
CONFIG_FTRACE_MCOUNT_RECORD=y
CONFIG_FTRACE_MCOUNT_USE_OBJTOOL=y
# CONFIG_SYNTH_EVENTS is not set
# CONFIG_USER_EVENTS is not set
# CONFIG_HIST_TRIGGERS is not set
# CONFIG_TRACE_EVENT_INJECT is not set
# CONFIG_TRACEPOINT_BENCHMARK is not set
# CONFIG_RING_BUFFER_BENCHMARK is not set
# CONFIG_TRACE_EVAL_MAP_FILE is not set
# CONFIG_FTRACE_RECORD_RECURSION is not set
# CONFIG_FTRACE_STARTUP_TEST is not set
# CONFIG_FTRACE_SORT_STARTUP_TEST is not set
# CONFIG_RING_BUFFER_STARTUP_TEST is not set
# CONFIG_RING_BUFFER_VALIDATE_TIME_DELTAS is not set
# CONFIG_PREEMPTIRQ_DELAY_TEST is not set
# CONFIG_KPROBE_EVENT_GEN_TEST is not set
# CONFIG_RV is not set
CONFIG_PROVIDE_OHCI1394_DMA_INIT=y
# CONFIG_SAMPLES is not set
CONFIG_HAVE_SAMPLE_FTRACE_DIRECT=y
CONFIG_HAVE_SAMPLE_FTRACE_DIRECT_MULTI=y
CONFIG_ARCH_HAS_DEVMEM_IS_ALLOWED=y
CONFIG_STRICT_DEVMEM=y
# CONFIG_IO_STRICT_DEVMEM is not set

#
# x86 Debugging
#
CONFIG_EARLY_PRINTK_USB=y
CONFIG_X86_VERBOSE_BOOTUP=y
CONFIG_EARLY_PRINTK=y
CONFIG_EARLY_PRINTK_DBGP=y
# CONFIG_EARLY_PRINTK_USB_XDBC is not set
# CONFIG_EFI_PGT_DUMP is not set
# CONFIG_DEBUG_TLBFLUSH is not set
CONFIG_HAVE_MMIOTRACE_SUPPORT=y
# CONFIG_X86_DECODER_SELFTEST is not set
CONFIG_IO_DELAY_0X80=y
# CONFIG_IO_DELAY_0XED is not set
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
CONFIG_DEBUG_BOOT_PARAMS=y
# CONFIG_CPA_DEBUG is not set
# CONFIG_DEBUG_ENTRY is not set
# CONFIG_DEBUG_NMI_SELFTEST is not set
CONFIG_X86_DEBUG_FPU=y
# CONFIG_PUNIT_ATOM_DEBUG is not set
# CONFIG_UNWINDER_ORC is not set
CONFIG_UNWINDER_FRAME_POINTER=y
# CONFIG_UNWINDER_GUESS is not set
# end of x86 Debugging

#
# Kernel Testing and Coverage
#
# CONFIG_KUNIT is not set
# CONFIG_NOTIFIER_ERROR_INJECTION is not set
CONFIG_FUNCTION_ERROR_INJECTION=y
# CONFIG_FAULT_INJECTION is not set
CONFIG_ARCH_HAS_KCOV=y
CONFIG_CC_HAS_SANCOV_TRACE_PC=y
# CONFIG_KCOV is not set
CONFIG_RUNTIME_TESTING_MENU=y
# CONFIG_TEST_DHRY is not set
CONFIG_LKDTM=y
# CONFIG_TEST_MIN_HEAP is not set
# CONFIG_TEST_DIV64 is not set
# CONFIG_BACKTRACE_SELF_TEST is not set
# CONFIG_TEST_REF_TRACKER is not set
# CONFIG_RBTREE_TEST is not set
# CONFIG_REED_SOLOMON_TEST is not set
# CONFIG_INTERVAL_TREE_TEST is not set
# CONFIG_PERCPU_TEST is not set
# CONFIG_ATOMIC64_SELFTEST is not set
# CONFIG_TEST_HEXDUMP is not set
# CONFIG_STRING_SELFTEST is not set
# CONFIG_TEST_STRING_HELPERS is not set
# CONFIG_TEST_KSTRTOX is not set
# CONFIG_TEST_PRINTF is not set
# CONFIG_TEST_SCANF is not set
# CONFIG_TEST_BITMAP is not set
# CONFIG_TEST_UUID is not set
# CONFIG_TEST_XARRAY is not set
# CONFIG_TEST_MAPLE_TREE is not set
# CONFIG_TEST_RHASHTABLE is not set
# CONFIG_TEST_IDA is not set
# CONFIG_TEST_LKM is not set
# CONFIG_TEST_BITOPS is not set
# CONFIG_TEST_VMALLOC is not set
# CONFIG_TEST_USER_COPY is not set
CONFIG_TEST_BPF=m
# CONFIG_TEST_BLACKHOLE_DEV is not set
# CONFIG_FIND_BIT_BENCHMARK is not set
# CONFIG_TEST_FIRMWARE is not set
# CONFIG_TEST_SYSCTL is not set
# CONFIG_TEST_UDELAY is not set
# CONFIG_TEST_STATIC_KEYS is not set
# CONFIG_TEST_KMOD is not set
# CONFIG_TEST_MEMCAT_P is not set
# CONFIG_TEST_MEMINIT is not set
# CONFIG_TEST_FREE_PAGES is not set
# CONFIG_TEST_FPU is not set
# CONFIG_TEST_CLOCKSOURCE_WATCHDOG is not set
# CONFIG_TEST_OBJPOOL is not set
CONFIG_ARCH_USE_MEMTEST=y
# CONFIG_MEMTEST is not set
# end of Kernel Testing and Coverage

#
# Rust hacking
#
# end of Rust hacking
# end of Kernel hacking

CONFIG_DIORITE_IDPF=y
CONFIG_DIORITE_IDPF_HW=m
# CONFIG_IDPF_GGL_PET_GBMC_WATCHDOG is not set

2023-12-14 18:38:46

by Kairui Song

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

Yu Zhao <[email protected]> 于2023年12月14日周四 11:09写道:
> On Wed, Dec 13, 2023 at 12:59:14AM -0700, Yu Zhao wrote:
> > On Tue, Dec 12, 2023 at 8:03 PM Kairui Song <[email protected]> wrote:
> > >
> > > Kairui Song <[email protected]> 于2023年12月12日周二 14:52写道:
> > > >
> > > > Yu Zhao <[email protected]> 于2023年12月12日周二 06:07写道:
> > > > >
> > > > > On Fri, Dec 8, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
> > > > > >
> > > > > > Yu Zhao <[email protected]> 于2023年12月8日周五 14:14写道:
> > > > > > >
> > > > > > > Unmapped folios accessed through file descriptors can be
> > > > > > > underprotected. Those folios are added to the oldest generation based
> > > > > > > on:
> > > > > > > 1. The fact that they are less costly to reclaim (no need to walk the
> > > > > > > rmap and flush the TLB) and have less impact on performance (don't
> > > > > > > cause major PFs and can be non-blocking if needed again).
> > > > > > > 2. The observation that they are likely to be single-use. E.g., for
> > > > > > > client use cases like Android, its apps parse configuration files
> > > > > > > and store the data in heap (anon); for server use cases like MySQL,
> > > > > > > it reads from InnoDB files and holds the cached data for tables in
> > > > > > > buffer pools (anon).
> > > > > > >
> > > > > > > However, the oldest generation can be very short lived, and if so, it
> > > > > > > doesn't provide the PID controller with enough time to respond to a
> > > > > > > surge of refaults. (Note that the PID controller uses weighted
> > > > > > > refaults and those from evicted generations only take a half of the
> > > > > > > whole weight.) In other words, for a short lived generation, the
> > > > > > > moving average smooths out the spike quickly.
> > > > > > >
> > > > > > > To fix the problem:
> > > > > > > 1. For folios that are already on LRU, if they can be beyond the
> > > > > > > tracking range of tiers, i.e., five accesses through file
> > > > > > > descriptors, move them to the second oldest generation to give them
> > > > > > > more time to age. (Note that tiers are used by the PID controller
> > > > > > > to statistically determine whether folios accessed multiple times
> > > > > > > through file descriptors are worth protecting.)
> > > > > > > 2. When adding unmapped folios to LRU, adjust the placement of them so
> > > > > > > that they are not too close to the tail. The effect of this is
> > > > > > > similar to the above.
> > > > > > >
> > > > > > > On Android, launching 55 apps sequentially:
> > > > > > > Before After Change
> > > > > > > workingset_refault_anon 25641024 25598972 0%
> > > > > > > workingset_refault_file 115016834 106178438 -8%
> > > > > >
> > > > > > Hi Yu,
> > > > > >
> > > > > > Thanks you for your amazing works on MGLRU.
> > > > > >
> > > > > > I believe this is the similar issue I was trying to resolve previously:
> > > > > > https://lwn.net/Articles/945266/
> > > > > > The idea is to use refault distance to decide if the page should be
> > > > > > place in oldest generation or some other gen, which per my test,
> > > > > > worked very well, and we have been using refault distance for MGLRU in
> > > > > > multiple workloads.
> > > > > >
> > > > > > There are a few issues left in my previous RFC series, like anon pages
> > > > > > in MGLRU shouldn't be considered, I wanted to collect feedback or test
> > > > > > cases, but unfortunately it seems didn't get too much attention
> > > > > > upstream.
> > > > > >
> > > > > > I think both this patch and my previous series are for solving the
> > > > > > file pages underpertected issue, and I did a quick test using this
> > > > > > series, for mongodb test, refault distance seems still a better
> > > > > > solution (I'm not saying these two optimization are mutually exclusive
> > > > > > though, just they do have some conflicts in implementation and solving
> > > > > > similar problem):
> > > > > >
> > > > > > Previous result:
> > > > > > ==================================================================
> > > > > > Execution Results after 905 seconds
> > > > > > ------------------------------------------------------------------
> > > > > > Executed Time (µs) Rate
> > > > > > STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
> > > > > > ------------------------------------------------------------------
> > > > > > TOTAL 2542 27121571486.2 0.09 txn/s
> > > > > >
> > > > > > This patch:
> > > > > > ==================================================================
> > > > > > Execution Results after 900 seconds
> > > > > > ------------------------------------------------------------------
> > > > > > Executed Time (µs) Rate
> > > > > > STOCK_LEVEL 1594 27061522574.4 0.06 txn/s
> > > > > > ------------------------------------------------------------------
> > > > > > TOTAL 1594 27061522574.4 0.06 txn/s
> > > > > >
> > > > > > Unpatched version is always around ~500.
> > > > >
> > > > > Thanks for the test results!
> > > > >
> > > > > > I think there are a few points here:
> > > > > > - Refault distance make use of page shadow so it can better
> > > > > > distinguish evicted pages of different access pattern (re-access
> > > > > > distance).
> > > > > > - Throttled refault distance can help hold part of workingset when
> > > > > > memory is too small to hold the whole workingset.
> > > > > >
> > > > > > So maybe part of this patch and the bits of previous series can be
> > > > > > combined to work better on this issue, how do you think?
> > > > >
> > > > > I'll try to find some time this week to look at your RFC. It'd be a
> > >
> > > Hi Yu,
> > >
> > > I'm working on V4 of the RFC now, which just update some comments, and
> > > skip anon page re-activation in refault path for mglru which was not
> > > very helpful, only some tiny adjustment.
> > > And I found it easier to test with fio, using following test script:
> > >
> > > #!/bin/bash
> > > swapoff -a
> > >
> > > modprobe brd rd_nr=1 rd_size=16777216
> > > mkfs.ext4 /dev/ram0
> > > mount /dev/ram0 /mnt
> > >
> > > mkdir -p /sys/fs/cgroup/benchmark
> > > cd /sys/fs/cgroup/benchmark
> > >
> > > echo 4G > memory.max
> > > echo $$ > cgroup.procs
> > > echo 3 > /proc/sys/vm/drop_caches
> > >
> > > fio -name=mglru --numjobs=12 --directory=/mnt --size=1024m \
> > > --buffered=1 --ioengine=io_uring --iodepth=128 \
> > > --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
> > > --rw=randread --random_distribution=zipf:0.5 --norandommap \
> > > --time_based --ramp_time=5m --runtime=5m --group_reporting
> > >
> > > zipf:0.5 is used here to simulate a cached read with slight bias
> > > towards certain pages.
> > > Unpatched 6.7-rc4:
> > > Run status group 0 (all jobs):
> > > READ: bw=6548MiB/s (6866MB/s), 6548MiB/s-6548MiB/s
> > > (6866MB/s-6866MB/s), io=1918GiB (2060GB), run=300001-300001msec
> > >
> > > Patched with RFC v4:
> > > Run status group 0 (all jobs):
> > > READ: bw=7270MiB/s (7623MB/s), 7270MiB/s-7270MiB/s
> > > (7623MB/s-7623MB/s), io=2130GiB (2287GB), run=300001-300001msec
> > >
> > > Patched with this series:
> > > Run status group 0 (all jobs):
> > > READ: bw=7098MiB/s (7442MB/s), 7098MiB/s-7098MiB/s
> > > (7442MB/s-7442MB/s), io=2079GiB (2233GB), run=300002-300002msec
> > >
> > > MGLRU off:
> > > Run status group 0 (all jobs):
> > > READ: bw=6525MiB/s (6842MB/s), 6525MiB/s-6525MiB/s
> > > (6842MB/s-6842MB/s), io=1912GiB (2052GB), run=300002-300002msec
> > >
> > > - If I change zipf:0.5 to random:
> > > Unpatched 6.7-rc4:
> > > Patched with this series:
> > > Run status group 0 (all jobs):
> > > READ: bw=5975MiB/s (6265MB/s), 5975MiB/s-5975MiB/s
> > > (6265MB/s-6265MB/s), io=1750GiB (1879GB), run=300002-300002msec
> > >
> > > Patched with RFC v4:
> > > Run status group 0 (all jobs):
> > > READ: bw=5987MiB/s (6278MB/s), 5987MiB/s-5987MiB/s
> > > (6278MB/s-6278MB/s), io=1754GiB (1883GB), run=300001-300001msec
> > >
> > > Patched with this series:
> > > Run status group 0 (all jobs):
> > > READ: bw=5839MiB/s (6123MB/s), 5839MiB/s-5839MiB/s
> > > (6123MB/s-6123MB/s), io=1711GiB (1837GB), run=300001-300001msec
> > >
> > > MGLRU off:
> > > Run status group 0 (all jobs):
> > > READ: bw=5689MiB/s (5965MB/s), 5689MiB/s-5689MiB/s
> > > (5965MB/s-5965MB/s), io=1667GiB (1790GB), run=300003-300003msec
> > >
> > > fio uses ramdisk so LRU accuracy will have smaller impact. The Mongodb
> > > test I provided before uses a SATA SSD so it will have a much higher
> > > impact. I'll provides a script to setup the test case and run it, it's
> > > more complex to setup than fio since involving setting up multiple
> > > replicas and auth and hundreds of GB of test fixtures, I'm currently
> > > occupied by some other tasks but will try best to send them out as
> > > soon as possible.
> >
> > Thanks! Apparently your RFC did show better IOPS with both access
> > patterns, which was a surprise to me because it had higher refaults
> > and usually higher refautls result in worse performance.
> >
> > So I'm still trying to figure out why it turned out the opposite. My
> > current guess is that:
> > 1. It had a very small but stable inactive LRU list, which was able to
> > fit into the L3 cache entirely.
> > 2. It counted few folios as workingset and therefore incurred less
> > overhead from CONFIG_PSI and/or CONFIG_TASK_DELAY_ACCT.
> >
> > Did you save workingset_refault_file when you ran the test? If so, can
> > you check the difference between this series and your RFC?
>
>
> It seems I was right about #1 above. After I scaled your test up by 20x,
> I saw my series performed ~5% faster with zipf and ~9% faster with random
> accesses.

Hi Yu,

Thank you so much for testing and sharing this result.

I'm not sure about #1: the ramdisk size and the accessed data are far
larger than the L3 (16M on my CPU) even in the scaled-down test, and
both random and zipf show similar results.

>
> IOW, I made rd_size from 16GB to 320GB, memory.max from 4GB to 80GB,
> --numjobs from 12 to 60 and --size from 1GB to 4GB.
>
> v6.7-c5 + this series
> =====================
>
> zipf
> ----
>
> mglru: (groupid=0, jobs=60): err= 0: pid=12155: Wed Dec 13 17:50:36 2023
> read: IOPS=5074k, BW=19.4GiB/s (20.8GB/s)(5807GiB/300007msec)
> slat (usec): min=36, max=109326, avg=363.67, stdev=1829.97
> clat (nsec): min=783, max=113292k, avg=1136755.10, stdev=3162056.05
> lat (usec): min=37, max=149232, avg=1500.43, stdev=3644.21
> clat percentiles (usec):
> | 1.00th=[ 490], 5.00th=[ 519], 10.00th=[ 537], 20.00th=[ 553],
> | 30.00th=[ 570], 40.00th=[ 586], 50.00th=[ 627], 60.00th=[ 840],
> | 70.00th=[ 988], 80.00th=[ 1074], 90.00th=[ 1188], 95.00th=[ 1336],
> | 99.00th=[ 7308], 99.50th=[31327], 99.90th=[36963], 99.95th=[45351],
> | 99.99th=[53216]
> bw ( MiB/s): min= 8332, max=27116, per=100.00%, avg=19846.67, stdev=58.20, samples=35903
> iops : min=2133165, max=6941826, avg=5080741.79, stdev=14899.13, samples=35903
> lat (nsec) : 1000=0.01%
> lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%
> lat (usec) : 250=0.01%, 500=1.76%, 750=52.94%, 1000=16.65%
> lat (msec) : 2=26.22%, 4=0.15%, 10=1.36%, 20=0.01%, 50=0.90%
> lat (msec) : 100=0.02%, 250=0.01%
> cpu : usr=5.42%, sys=87.59%, ctx=470315, majf=0, minf=2184
> IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
> submit : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=100.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.1%
> issued rwts: total=1522384845,0,0,0 short=0,0,0,0 dropped=0,0,0,0
> latency : target=0, window=0, percentile=100.00%, depth=128
>
> Run status group 0 (all jobs):
> READ: bw=19.4GiB/s (20.8GB/s), 19.4GiB/s-19.4GiB/s (20.8GB/s-20.8GB/s), io=5807GiB (6236GB), run=300007-300007msec
>
> Disk stats (read/write):
> ram0: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> mglru: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=128
>
> random
> ------
>
> mglru: (groupid=0, jobs=60): err= 0: pid=12576: Wed Dec 13 18:00:50 2023
> read: IOPS=3853k, BW=14.7GiB/s (15.8GB/s)(4410GiB/300014msec)
> slat (usec): min=58, max=118605, avg=486.45, stdev=2311.45
> clat (usec): min=3, max=169810, avg=1496.60, stdev=3982.89
> lat (usec): min=73, max=170019, avg=1983.06, stdev=4585.87
> clat percentiles (usec):
> | 1.00th=[ 586], 5.00th=[ 627], 10.00th=[ 644], 20.00th=[ 668],
> | 30.00th=[ 693], 40.00th=[ 725], 50.00th=[ 816], 60.00th=[ 1123],
> | 70.00th=[ 1221], 80.00th=[ 1352], 90.00th=[ 1516], 95.00th=[ 1713],
> | 99.00th=[31851], 99.50th=[34866], 99.90th=[41681], 99.95th=[54264],
> | 99.99th=[61080]
> bw ( MiB/s): min= 6049, max=21328, per=100.00%, avg=15070.00, stdev=45.96, samples=35940
> iops : min=1548543, max=5459997, avg=3857912.87, stdev=11765.30, samples=35940
> lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 100=0.01%, 250=0.01%
> lat (usec) : 500=0.01%, 750=44.64%, 1000=8.20%
> lat (msec) : 2=43.84%, 4=0.27%, 10=1.79%, 20=0.01%, 50=1.20%
> lat (msec) : 100=0.07%, 250=0.01%
> cpu : usr=3.19%, sys=89.87%, ctx=463840, majf=0, minf=2248
> IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
> submit : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.1%
> issued rwts: total=1155923744,0,0,0 short=0,0,0,0 dropped=0,0,0,0
> latency : target=0, window=0, percentile=100.00%, depth=128
>
> Run status group 0 (all jobs):
> READ: bw=14.7GiB/s (15.8GB/s), 14.7GiB/s-14.7GiB/s (15.8GB/s-15.8GB/s), io=4410GiB (4735GB), run=300014-300014msec
>
> Disk stats (read/write):
> ram0: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>
> memcg 3 /zipf
> node 0
> 0 1521654 0 0x
> 0 0r 0e 0p 0 0 0
> 1 0r 0e 0p 0 0 0
> 2 0r 0e 0p 0 0 0
> 3 0r 0e 0p 0 0 0
> 0 0 0 0 0 0
> 1 1521654 0 21
> 0 0 0 0 1077016797r 1111542014e 0p
> 1 0 0 0 317997853r 324814007e 0p
> 2 0 0 0 68064253r 68866308e 124302p
> 3 0 0 0 0r 0e 12282816p
> 0 0 0 0 0 0
> 2 1521654 0 0
> 0 0 0 0 0 0 0
> 1 0 0 0 0 0 0
> 2 0 0 0 0 0 0
> 3 0 0 0 0 0 0
> 0 0 0 0 0 0
> 3 1521654 0 0
> 0 0R 0T 0 0R 0T 0
> 1 0R 0T 0 0R 0T 0
> 2 0R 0T 0 0R 0T 0
> 3 0R 0T 0 0R 0T 0
> 0L 0O 0Y 0N 0F 0A
> node 1
> 0 1521654 0 0
> 0 0r 0e 0p 0r 0e 0p
> 1 0r 0e 0p 0r 0e 0p
> 2 0r 0e 0p 0r 0e 0p
> 3 0r 0e 0p 0r 0e 0p
> 0 0 0 0 0 0
> 1 1521654 0 0
> 0 0 0 0 0 0 0
> 1 0 0 0 0 0 0
> 2 0 0 0 0 0 0
> 3 0 0 0 0 0 0
> 0 0 0 0 0 0
> 2 1521654 0 0
> 0 0 0 0 0 0 0
> 1 0 0 0 0 0 0
> 2 0 0 0 0 0 0
> 3 0 0 0 0 0 0
> 0 0 0 0 0 0
> 3 1521654 0 0
> 0 0R 0T 0 0R 0T 0
> 1 0R 0T 0 0R 0T 0
> 2 0R 0T 0 0R 0T 0
> 3 0R 0T 0 0R 0T 0
> 0L 0O 0Y 0N 0F 0A
> memcg 4 /random
> node 0
> 0 600431 0 0x
> 0 0r 0e 0p 0 0 0
> 1 0r 0e 0p 0 0 0
> 2 0r 0e 0p 0 0 0
> 3 0r 0e 0p 0 0 0
> 0 0 0 0 0 0
> 1 600431 0 11169201
> 0 0 0 0 1071724785r 1103937007e 0p
> 1 0 0 0 376193810r 384852629e 0p
> 2 0 0 0 77315518r 78596395e 0p
> 3 0 0 0 0r 0e 9593442p
> 0 0 0 0 0 0
> 2 600431 1 9593442
> 0 0 0 0 0 0 0
> 1 0 0 0 0 0 0
> 2 0 0 0 0 0 0
> 3 0 0 0 0 0 0
> 0 0 0 0 0 0
> 3 600431 36 754
> 0 0R 0T 0 0R 0T 0
> 1 0R 0T 0 0R 0T 0
> 2 0R 0T 0 0R 0T 0
> 3 0R 0T 0 0R 0T 0
> 0L 0O 0Y 0N 0F 0A
> node 1
> 0 600431 0 0
> 0 0r 0e 0p 0r 0e 0p
> 1 0r 0e 0p 0r 0e 0p
> 2 0r 0e 0p 0r 0e 0p
> 3 0r 0e 0p 0r 0e 0p
> 0 0 0 0 0 0
> 1 600431 0 0
> 0 0 0 0 0 0 0
> 1 0 0 0 0 0 0
> 2 0 0 0 0 0 0
> 3 0 0 0 0 0 0
> 0 0 0 0 0 0
> 2 600431 0 0
> 0 0 0 0 0 0 0
> 1 0 0 0 0 0 0
> 2 0 0 0 0 0 0
> 3 0 0 0 0 0 0
> 0 0 0 0 0 0
> 3 600431 0 0
> 0 0R 0T 0 0R 0T 0
> 1 0R 0T 0 0R 0T 0
> 2 0R 0T 0 0R 0T 0
> 3 0R 0T 0 0R 0T 0
> 0L 0O 0Y 0N 0F 0A
>
> v6.7-c5 + RFC v3
> ================
>
> zipf
> ----
>
> mglru: (groupid=0, jobs=60): err= 0: pid=11600: Wed Dec 13 18:34:31 2023
> read: IOPS=4816k, BW=18.4GiB/s (19.7GB/s)(5512GiB/300014msec)
> slat (usec): min=3, max=121722, avg=384.46, stdev=2066.10
> clat (nsec): min=356, max=174717k, avg=1197513.60, stdev=3568734.58
> lat (usec): min=3, max=174919, avg=1581.97, stdev=4112.49
> clat percentiles (usec):
> | 1.00th=[ 486], 5.00th=[ 515], 10.00th=[ 529], 20.00th=[ 553],
> | 30.00th=[ 570], 40.00th=[ 594], 50.00th=[ 652], 60.00th=[ 898],
> | 70.00th=[ 988], 80.00th=[ 1139], 90.00th=[ 1254], 95.00th=[ 1369],
> | 99.00th=[ 6915], 99.50th=[35914], 99.90th=[42206], 99.95th=[52167],
> | 99.99th=[61604]
> bw ( MiB/s): min= 7716, max=26325, per=100.00%, avg=18836.65, stdev=57.20, samples=35880
> iops : min=1975306, max=6739280, avg=4822176.85, stdev=14642.35, samples=35880
> lat (nsec) : 500=0.01%, 750=0.01%, 1000=0.01%
> lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 100=0.01%, 250=0.01%
> lat (usec) : 500=2.57%, 750=50.99%, 1000=17.56%
> lat (msec) : 2=26.41%, 4=0.16%, 10=1.41%, 20=0.01%, 50=0.84%
> lat (msec) : 100=0.05%, 250=0.01%
> cpu : usr=4.95%, sys=88.09%, ctx=457609, majf=0, minf=2184
> IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
> submit : 0=0.0%, 4=0.1%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.1%
> issued rwts: total=1445015808,0,0,0 short=0,0,0,0 dropped=0,0,0,0
> latency : target=0, window=0, percentile=100.00%, depth=128
>
> Run status group 0 (all jobs):
> READ: bw=18.4GiB/s (19.7GB/s), 18.4GiB/s-18.4GiB/s (19.7GB/s-19.7GB/s), io=5512GiB (5919GB), run=300014-300014msec
>
> Disk stats (read/write):
> ram0: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> mglru: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=128
>
> random
> ------
>
> mglru: (groupid=0, jobs=60): err= 0: pid=12024: Wed Dec 13 18:44:45 2023
> read: IOPS=3519k, BW=13.4GiB/s (14.4GB/s)(4027GiB/300011msec)
> slat (usec): min=54, max=136278, avg=534.57, stdev=2738.72
> clat (usec): min=3, max=176186, avg=1638.66, stdev=4714.55
> lat (usec): min=78, max=176426, avg=2173.23, stdev=5426.40
> clat percentiles (usec):
> | 1.00th=[ 627], 5.00th=[ 676], 10.00th=[ 693], 20.00th=[ 725],
> | 30.00th=[ 766], 40.00th=[ 816], 50.00th=[ 1090], 60.00th=[ 1205],
> | 70.00th=[ 1270], 80.00th=[ 1369], 90.00th=[ 1500], 95.00th=[ 1614],
> | 99.00th=[38536], 99.50th=[41681], 99.90th=[47973], 99.95th=[65799],
> | 99.99th=[72877]
> bw ( MiB/s): min= 5586, max=20476, per=100.00%, avg=13760.26, stdev=45.33, samples=35904
> iops : min=1430070, max=5242110, avg=3522621.15, stdev=11604.46, samples=35904
> lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 100=0.01%, 250=0.01%
> lat (usec) : 500=0.01%, 750=26.33%, 1000=21.81%
> lat (msec) : 2=48.54%, 4=0.16%, 10=1.91%, 20=0.01%, 50=1.17%
> lat (msec) : 100=0.09%, 250=0.01%
> cpu : usr=2.74%, sys=90.35%, ctx=481356, majf=0, minf=2244
> IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
> submit : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.1%
> issued rwts: total=1055590880,0,0,0 short=0,0,0,0 dropped=0,0,0,0
> latency : target=0, window=0, percentile=100.00%, depth=128
>
> Run status group 0 (all jobs):
> READ: bw=13.4GiB/s (14.4GB/s), 13.4GiB/s-13.4GiB/s (14.4GB/s-14.4GB/s), io=4027GiB (4324GB), run=300011-300011msec
>
> Disk stats (read/write):
> ram0: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>
> memcg 3 /zipf
> node 0
> 0 1522519 0 22
> 0 0r 0e 0p 996363383r 1092111170e 0p
> 1 0r 0e 0p 274581982r 235766575e 0p
> 2 0r 0e 0p 85176438r 71356676e 96114p
> 3 0r 0e 0p 12470364r 11510461e 221796p
> 0 0 0 0 0 0
> 1 1522519 0 0
> 0 0 0 0 0 0 0
> 1 0 0 0 0 0 0
> 2 0 0 0 0 0 0
> 3 0 0 0 0 0 0
> 0 0 0 0 0 0
> 2 1522519 0 0
> 0 0 0 0 0 0 0
> 1 0 0 0 0 0 0
> 2 0 0 0 0 0 0
> 3 0 0 0 0 0 0
> 0 0 0 0 0 0
> 3 1522519 0 0
> 0 0R 0T 0 0R 0T 0
> 1 0R 0T 0 0R 0T 0
> 2 0R 0T 0 0R 0T 0
> 3 0R 0T 0 0R 0T 0
> 0L 0O 0Y 0N 0F 0A
> node 1
> 0 1522519 0 0
> 0 0r 0e 0p 0r 0e 0p
> 1 0r 0e 0p 0r 0e 0p
> 2 0r 0e 0p 0r 0e 0p
> 3 0r 0e 0p 0r 0e 0p
> 0 0 0 0 0 0
> 1 1522519 0 0
> 0 0 0 0 0 0 0
> 1 0 0 0 0 0 0
> 2 0 0 0 0 0 0
> 3 0 0 0 0 0 0
> 0 0 0 0 0 0
> 2 1522519 0 0
> 0 0 0 0 0 0 0
> 1 0 0 0 0 0 0
> 2 0 0 0 0 0 0
> 3 0 0 0 0 0 0
> 0 0 0 0 0 0
> 3 1522519 0 0
> 0 0R 0T 0 0R 0T 0
> 1 0R 0T 0 0R 0T 0
> 2 0R 0T 0 0R 0T 0
> 3 0R 0T 0 0R 0T 0
> 0L 0O 0Y 0N 0F 0A
> memcg 4 /random
> node 0
> 0 600413 0 2289676
> 0 0r 0e 0p 875605725r 960492874e 0p
> 1 0r 0e 0p 411230731r 383704269e 0p
> 2 0r 0e 0p 112639317r 97774351e 0p
> 3 0r 0e 0p 2103334r 1766407e 0p
> 0 0 0 0 0 0
> 1 600413 1 0
> 0 0 0 0 0 0 0
> 1 0 0 0 0 0 0
> 2 0 0 0 0 0 0
> 3 0 0 0 0 0 0
> 0 0 0 0 0 0
> 2 600413 0 0
> 0 0 0 0 0 0 0
> 1 0 0 0 0 0 0
> 2 0 0 0 0 0 0
> 3 0 0 0 0 0 0
> 0 0 0 0 0 0
> 3 600413 35 18466878
> 0 0R 0T 0 0R 0T 0
> 1 0R 0T 0 0R 0T 0
> 2 0R 0T 0 0R 0T 0
> 3 0R 0T 0 0R 0T 0
> 0L 0O 0Y 0N 0F 0A
> node 1
> 0 600413 0 0
> 0 0r 0e 0p 0r 0e 0p
> 1 0r 0e 0p 0r 0e 0p
> 2 0r 0e 0p 0r 0e 0p
> 3 0r 0e 0p 0r 0e 0p
> 0 0 0 0 0 0
> 1 600413 0 0
> 0 0 0 0 0 0 0
> 1 0 0 0 0 0 0
> 2 0 0 0 0 0 0
> 3 0 0 0 0 0 0
> 0 0 0 0 0 0
> 2 600413 0 0
> 0 0 0 0 0 0 0
> 1 0 0 0 0 0 0
> 2 0 0 0 0 0 0
> 3 0 0 0 0 0 0
> 0 0 0 0 0 0
> 3 600413 0 0
> 0 0R 0T 0 0R 0T 0
> 1 0R 0T 0 0R 0T 0
> 2 0R 0T 0 0R 0T 0
> 3 0R 0T 0 0R 0T 0
> 0L 0O 0Y 0N 0F 0A

And I reran the scaled-down zipf test again:

RFC:
Jobs: 12 (f=12): [r(12)][100.0%][r=7267MiB/s][r=1860k IOPS][eta 00m:00s]
mglru: (groupid=0, jobs=12): err= 0: pid=5159: Thu Dec 14 23:57:01 2023
read: IOPS=1862k, BW=7274MiB/s (7628MB/s)(2131GiB/300001msec)
slat (usec): min=60, max=4711, avg=195.05, stdev=138.41
clat (usec): min=2, max=5097, avg=619.70, stdev=215.90
lat (usec): min=112, max=5271, avg=814.78, stdev=237.75
clat percentiles (usec):
| 1.00th=[ 388], 5.00th=[ 408], 10.00th=[ 424], 20.00th=[ 457],
| 30.00th=[ 482], 40.00th=[ 502], 50.00th=[ 523], 60.00th=[ 545],
| 70.00th=[ 603], 80.00th=[ 889], 90.00th=[ 988], 95.00th=[ 1037],
| 99.00th=[ 1106], 99.50th=[ 1139], 99.90th=[ 1237], 99.95th=[ 1369],
| 99.99th=[ 1483]
bw ( MiB/s): min= 6526, max= 8474, per=100.00%, avg=7284.26,
stdev=48.62, samples=7176
iops : min=1670753, max=2169575, avg=1864770.39,
stdev=12446.01, samples=7176
lat (usec) : 4=0.01%, 10=0.01%, 250=0.01%, 500=38.35%, 750=33.88%
lat (usec) : 1000=19.46%
lat (msec) : 2=8.30%, 4=0.01%, 10=0.01%
cpu : usr=8.62%, sys=91.24%, ctx=531703, majf=0, minf=700
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
submit : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.1%
issued rwts: total=558664800,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
READ: bw=7274MiB/s (7628MB/s), 7274MiB/s-7274MiB/s
(7628MB/s-7628MB/s), io=2131GiB (2288GB), run=300001-300001msec

workingset_refault_file 628192729

memcg 73 /benchmark
node 0
0 1092186 0 0x
0 0r 0e 0p 0 0 0
1 0r 0e 0p 0 0 0
2 0r 0e 0p 0 0 0
3 0r 0e 0p 0 0 0
0 0 0 0 0 0
1 1092186 0 4283
0 0 0 0 507816078r 511714221e 0p
1 0 0 0 4682206r 3201136e 0p
2 0 0 0 64762r 43587e 0p
3 0 0 0 0r 0e 0p
0 0 0 0 0 0
2 1092186 0 0
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
3 1092186 0 750308
0 0R 0T 0 49689099R 52516254T 0
1 0R 0T 0 5786054R 5786054T 0
2 0R 0T 0 1140749R 1140749T 0
3 0R 0T 0 0R 0T 0
0L 0O 0Y 0N 0F 0A

This series:
Jobs: 12 (f=12): [r(12)][100.0%][r=6447MiB/s][r=1650k IOPS][eta 00m:00s]
mglru: (groupid=0, jobs=12): err= 0: pid=3665: Fri Dec 15 00:16:06 2023
read: IOPS=1830k, BW=7148MiB/s (7495MB/s)(2094GiB/300001msec)
slat (usec): min=59, max=35006, avg=198.58, stdev=201.99
clat (nsec): min=972, max=37489k, avg=630651.61, stdev=384748.50
lat (usec): min=108, max=39688, avg=829.26, stdev=461.06
clat percentiles (usec):
| 1.00th=[ 355], 5.00th=[ 379], 10.00th=[ 392], 20.00th=[ 424],
| 30.00th=[ 478], 40.00th=[ 510], 50.00th=[ 529], 60.00th=[ 553],
| 70.00th=[ 635], 80.00th=[ 898], 90.00th=[ 1012], 95.00th=[ 1090],
| 99.00th=[ 1221], 99.50th=[ 1401], 99.90th=[ 2606], 99.95th=[ 3654],
| 99.99th=[18220]
bw ( MiB/s): min= 4870, max= 9145, per=100.00%, avg=7157.39,
stdev=81.13, samples=7176
iops : min=1246811, max=2341342, avg=1832289.80,
stdev=20768.76, samples=7176
lat (nsec) : 1000=0.01%
lat (usec) : 4=0.01%, 10=0.01%, 250=0.01%, 500=36.53%, 750=36.20%
lat (usec) : 1000=15.90%
lat (msec) : 2=11.18%, 4=0.15%, 10=0.02%, 20=0.01%, 50=0.01%
cpu : usr=8.59%, sys=91.27%, ctx=512635, majf=0, minf=711
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
submit : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.1%
issued rwts: total=548956313,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
READ: bw=7148MiB/s (7495MB/s), 7148MiB/s-7148MiB/s
(7495MB/s-7495MB/s), io=2094GiB (2249GB), run=300001-300001msec

workingset_refault_file 596790506

memcg 68 /benchmark
node 0
122 160248 0 0x
0 0r 0e 0p 0 0 0
1 0r 0e 0p 0 0 0
2 0r 0e 0p 0 0 0
3 0r 0e 0p 0 0 0
0 0 0 0 0 0
123 155360 0 239405
0 0 0 0 301462r 1186271e 0p
1 0 0 0 80013r 218961e 0p
2 0 0 0 0r 0e 516139p
3 0 0 0 0r 0e 0p
0 0 0 0 0 0
124 150495 0 516188
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
125 145582 0 1345
0 0R 0T 0 2577270R 4518284T 0
1 0R 0T 0 290933R 369324T 0
2 0R 0T 0 0R 752170T 0
3 0R 0T 0 0R 0T 0
388483L 17226O 18419Y 95408N 1314F 578A

I think the problem might be that this series ages faster and so has
higher overhead in some cases. In your test the workload is scaled up,
so MGLRU just keeps reclaiming the last generation with no aging,
while my RFC adds extra overhead due to workingset checking and memcg
flushing (the memcg flushing patch in the unstable tree may help?).
Also, the current refault distance checking model is simply glued onto
MGLRU and has some known issues; the most obvious one is that the
refault distance check can't prevent the file page underprotection
issue at all when the active list is small or empty, and using
active/inactive is not accurate enough for MGLRU, so it doesn't
perform well enough.
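
(To make the aging point easier to check, here is a small illustrative
helper -- assuming a kernel that exposes the MGLRU debugfs interface at
/sys/kernel/debug/lru_gen -- which just samples the generation list and
the refault counters during a run, to see whether new generations are
being created or eviction keeps hitting the oldest one:)

#!/bin/bash
# Illustrative only: sample MGLRU generations and refault counters
# every 10 seconds while the benchmark runs.
while sleep 10; do
    date
    grep workingset_refault /proc/vmstat
    # per-memcg, per-node list of generations (seq, age in ms, sizes)
    cat /sys/kernel/debug/lru_gen
done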

As for the MongoDB test, I still haven't had time to tidy up the setup
scripts and the modified repo, sorry about that; in the past few days
I've only had time to look at this issue late at night... but a quick
test shows an interesting reading too:

RFC:
==================================================================
Execution Results after 902 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 2544 27114484261.0 0.09 txn/s
------------------------------------------------------------------
TOTAL 2544 27114484261.0 0.09 txn/s

workingset_refault_anon 10512
workingset_refault_file 22751782

memcg 44 /system.slice/docker-1313de5323016713a0efa95d3b3f1aeafc9f43df80051bd013f3d29f1e13fa58.scope
node 0
12 190714 41736 640699
0 0r 2e 0p 0r 1293703e 0p
1 0r 0e 0p 0r 0e 463477p
2 0r 0e 0p 0r 0e 5029378p
3 0r 0e 0p 0r 0e 0p
0 0 0 0 0 0
13 139686 462351 5483828
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
14 86529 692892 3795
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
15 41548 47767 366
0 12R 1113T 0 3497R 1857252T 0
1 0R 0T 0 1000193R 1692818T 0
2 0R 0T 0 0R 5422505T 0
3 0R 0T 0 0R 0T 0
3889671L 42917O 3674613Y 11910N 7609F 7547A

This series:
==================================================================
Execution Results after 904 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 1668 27108414456.6 0.06 txn/s
------------------------------------------------------------------
TOTAL 1668 27108414456.6 0.06 txn/s

workingset_refault_anon 35277
workingset_refault_file 20335355

memcg 77 /system.slice/docker-731f3d33dca1dbea9d763a7a9519bb92c4ca1bbdb06c6a23d5203f8baad97f6e.scope
node 0
14 218191 0x 0x
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
15 170722 1923 6172558
0 0r 0e 0p 9r 29052e 0p
1 0r 0e 0p 0r 10643e 0p
2 0r 0e 0p 0r 0e 5714p
3 0r 0e 0p 0r 0e 0p
0 0 0 0 0 0
16 127628 1223689 10249
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 0 0 0 0 0
17 79949 40444 408
0 1413R 5628T 0 352479R 1259370T 0
1 0R 0T 0 252950R 439843T 0
2 0R 1T 0 0R 5083446T 0
3 0R 0T 0 0R 0T 0
18667726L 229222O 17641112Y 40116N 36473F 35963A

I've turned off all unrelated features (PSI, delayacct) for the above
tests. When PSI is on, the MongoDB test shows 70 - 100 PSI SOME, as
it's not using a very high performance disk.
I think this could suggest that evicting file pages early is sometimes
not that costly. And since page shadows can store fine-grained data
about a page's access distance, maybe I can tune the refault distance
checking model for MGLRU and combine it with this series, which may
help make the protection policy more balanced (not too fast, and still
accurate)?
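
(For completeness, a minimal sketch of how those features can be
toggled on a recent kernel -- assuming the kernel.task_delayacct sysctl
and the psi= boot parameter are available; these are not the exact
commands used for the runs above:)

# delay accounting can be toggled at runtime; PSI is controlled by the
# psi=0/psi=1 kernel command line parameter (or CONFIG_PSI=n at build time)
sysctl kernel.task_delayacct=0
cat /proc/pressure/io 2>/dev/null || echo 'PSI disabled'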

2023-12-14 23:51:52

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

On Thu, Dec 14, 2023 at 11:38 AM Kairui Song <[email protected]> wrote:
>
> Yu Zhao <[email protected]> 于2023年12月14日周四 11:09写道:
> > On Wed, Dec 13, 2023 at 12:59:14AM -0700, Yu Zhao wrote:
> > > On Tue, Dec 12, 2023 at 8:03 PM Kairui Song <[email protected]> wrote:
> > > >
> > > > Kairui Song <[email protected]> 于2023年12月12日周二 14:52写道:
> > > > >
> > > > > Yu Zhao <[email protected]> 于2023年12月12日周二 06:07写道:
> > > > > >
> > > > > > On Fri, Dec 8, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
> > > > > > >
> > > > > > > Yu Zhao <[email protected]> 于2023年12月8日周五 14:14写道:
> > > > > > > >
> > > > > > > > Unmapped folios accessed through file descriptors can be
> > > > > > > > underprotected. Those folios are added to the oldest generation based
> > > > > > > > on:
> > > > > > > > 1. The fact that they are less costly to reclaim (no need to walk the
> > > > > > > > rmap and flush the TLB) and have less impact on performance (don't
> > > > > > > > cause major PFs and can be non-blocking if needed again).
> > > > > > > > 2. The observation that they are likely to be single-use. E.g., for
> > > > > > > > client use cases like Android, its apps parse configuration files
> > > > > > > > and store the data in heap (anon); for server use cases like MySQL,
> > > > > > > > it reads from InnoDB files and holds the cached data for tables in
> > > > > > > > buffer pools (anon).
> > > > > > > >
> > > > > > > > However, the oldest generation can be very short lived, and if so, it
> > > > > > > > doesn't provide the PID controller with enough time to respond to a
> > > > > > > > surge of refaults. (Note that the PID controller uses weighted
> > > > > > > > refaults and those from evicted generations only take a half of the
> > > > > > > > whole weight.) In other words, for a short lived generation, the
> > > > > > > > moving average smooths out the spike quickly.
> > > > > > > >
> > > > > > > > To fix the problem:
> > > > > > > > 1. For folios that are already on LRU, if they can be beyond the
> > > > > > > > tracking range of tiers, i.e., five accesses through file
> > > > > > > > descriptors, move them to the second oldest generation to give them
> > > > > > > > more time to age. (Note that tiers are used by the PID controller
> > > > > > > > to statistically determine whether folios accessed multiple times
> > > > > > > > through file descriptors are worth protecting.)
> > > > > > > > 2. When adding unmapped folios to LRU, adjust the placement of them so
> > > > > > > > that they are not too close to the tail. The effect of this is
> > > > > > > > similar to the above.
> > > > > > > >
> > > > > > > > On Android, launching 55 apps sequentially:
> > > > > > > > Before After Change
> > > > > > > > workingset_refault_anon 25641024 25598972 0%
> > > > > > > > workingset_refault_file 115016834 106178438 -8%
> > > > > > >
> > > > > > > Hi Yu,
> > > > > > >
> > > > > > > Thanks you for your amazing works on MGLRU.
> > > > > > >
> > > > > > > I believe this is the similar issue I was trying to resolve previously:
> > > > > > > https://lwn.net/Articles/945266/
> > > > > > > The idea is to use refault distance to decide if the page should be
> > > > > > > place in oldest generation or some other gen, which per my test,
> > > > > > > worked very well, and we have been using refault distance for MGLRU in
> > > > > > > multiple workloads.
> > > > > > >
> > > > > > > There are a few issues left in my previous RFC series, like anon pages
> > > > > > > in MGLRU shouldn't be considered, I wanted to collect feedback or test
> > > > > > > cases, but unfortunately it seems didn't get too much attention
> > > > > > > upstream.
> > > > > > >
> > > > > > > I think both this patch and my previous series are for solving the
> > > > > > > file pages underpertected issue, and I did a quick test using this
> > > > > > > series, for mongodb test, refault distance seems still a better
> > > > > > > solution (I'm not saying these two optimization are mutually exclusive
> > > > > > > though, just they do have some conflicts in implementation and solving
> > > > > > > similar problem):
> > > > > > >
> > > > > > > Previous result:
> > > > > > > ==================================================================
> > > > > > > Execution Results after 905 seconds
> > > > > > > ------------------------------------------------------------------
> > > > > > > Executed Time (µs) Rate
> > > > > > > STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
> > > > > > > ------------------------------------------------------------------
> > > > > > > TOTAL 2542 27121571486.2 0.09 txn/s
> > > > > > >
> > > > > > > This patch:
> > > > > > > ==================================================================
> > > > > > > Execution Results after 900 seconds
> > > > > > > ------------------------------------------------------------------
> > > > > > > Executed Time (µs) Rate
> > > > > > > STOCK_LEVEL 1594 27061522574.4 0.06 txn/s
> > > > > > > ------------------------------------------------------------------
> > > > > > > TOTAL 1594 27061522574.4 0.06 txn/s
> > > > > > >
> > > > > > > Unpatched version is always around ~500.
> > > > > >
> > > > > > Thanks for the test results!
> > > > > >
> > > > > > > I think there are a few points here:
> > > > > > > - Refault distance make use of page shadow so it can better
> > > > > > > distinguish evicted pages of different access pattern (re-access
> > > > > > > distance).
> > > > > > > - Throttled refault distance can help hold part of workingset when
> > > > > > > memory is too small to hold the whole workingset.
> > > > > > >
> > > > > > > So maybe part of this patch and the bits of previous series can be
> > > > > > > combined to work better on this issue, how do you think?
> > > > > >
> > > > > > I'll try to find some time this week to look at your RFC. It'd be a
> > > >
> > > > Hi Yu,
> > > >
> > > > I'm working on V4 of the RFC now, which just update some comments, and
> > > > skip anon page re-activation in refault path for mglru which was not
> > > > very helpful, only some tiny adjustment.
> > > > And I found it easier to test with fio, using following test script:
> > > >
> > > > #!/bin/bash
> > > > swapoff -a
> > > >
> > > > modprobe brd rd_nr=1 rd_size=16777216
> > > > mkfs.ext4 /dev/ram0
> > > > mount /dev/ram0 /mnt
> > > >
> > > > mkdir -p /sys/fs/cgroup/benchmark
> > > > cd /sys/fs/cgroup/benchmark
> > > >
> > > > echo 4G > memory.max
> > > > echo $$ > cgroup.procs
> > > > echo 3 > /proc/sys/vm/drop_caches
> > > >
> > > > fio -name=mglru --numjobs=12 --directory=/mnt --size=1024m \
> > > > --buffered=1 --ioengine=io_uring --iodepth=128 \
> > > > --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
> > > > --rw=randread --random_distribution=zipf:0.5 --norandommap \
> > > > --time_based --ramp_time=5m --runtime=5m --group_reporting
> > > >
> > > > zipf:0.5 is used here to simulate a cached read with slight bias
> > > > towards certain pages.
> > > > Unpatched 6.7-rc4:
> > > > Run status group 0 (all jobs):
> > > > READ: bw=6548MiB/s (6866MB/s), 6548MiB/s-6548MiB/s
> > > > (6866MB/s-6866MB/s), io=1918GiB (2060GB), run=300001-300001msec
> > > >
> > > > Patched with RFC v4:
> > > > Run status group 0 (all jobs):
> > > > READ: bw=7270MiB/s (7623MB/s), 7270MiB/s-7270MiB/s
> > > > (7623MB/s-7623MB/s), io=2130GiB (2287GB), run=300001-300001msec
> > > >
> > > > Patched with this series:
> > > > Run status group 0 (all jobs):
> > > > READ: bw=7098MiB/s (7442MB/s), 7098MiB/s-7098MiB/s
> > > > (7442MB/s-7442MB/s), io=2079GiB (2233GB), run=300002-300002msec
> > > >
> > > > MGLRU off:
> > > > Run status group 0 (all jobs):
> > > > READ: bw=6525MiB/s (6842MB/s), 6525MiB/s-6525MiB/s
> > > > (6842MB/s-6842MB/s), io=1912GiB (2052GB), run=300002-300002msec
> > > >
> > > > - If I change zipf:0.5 to random:
> > > > Unpatched 6.7-rc4:
> > > > Patched with this series:
> > > > Run status group 0 (all jobs):
> > > > READ: bw=5975MiB/s (6265MB/s), 5975MiB/s-5975MiB/s
> > > > (6265MB/s-6265MB/s), io=1750GiB (1879GB), run=300002-300002msec
> > > >
> > > > Patched with RFC v4:
> > > > Run status group 0 (all jobs):
> > > > READ: bw=5987MiB/s (6278MB/s), 5987MiB/s-5987MiB/s
> > > > (6278MB/s-6278MB/s), io=1754GiB (1883GB), run=300001-300001msec
> > > >
> > > > Patched with this series:
> > > > Run status group 0 (all jobs):
> > > > READ: bw=5839MiB/s (6123MB/s), 5839MiB/s-5839MiB/s
> > > > (6123MB/s-6123MB/s), io=1711GiB (1837GB), run=300001-300001msec
> > > >
> > > > MGLRU off:
> > > > Run status group 0 (all jobs):
> > > > READ: bw=5689MiB/s (5965MB/s), 5689MiB/s-5689MiB/s
> > > > (5965MB/s-5965MB/s), io=1667GiB (1790GB), run=300003-300003msec
> > > >
> > > > fio uses ramdisk so LRU accuracy will have smaller impact. The Mongodb
> > > > test I provided before uses a SATA SSD so it will have a much higher
> > > > impact. I'll provides a script to setup the test case and run it, it's
> > > > more complex to setup than fio since involving setting up multiple
> > > > replicas and auth and hundreds of GB of test fixtures, I'm currently
> > > > occupied by some other tasks but will try best to send them out as
> > > > soon as possible.
> > >
> > > Thanks! Apparently your RFC did show better IOPS with both access
> > > patterns, which was a surprise to me because it had higher refaults
> > > and usually higher refautls result in worse performance.
> > >
> > > So I'm still trying to figure out why it turned out the opposite. My
> > > current guess is that:
> > > 1. It had a very small but stable inactive LRU list, which was able to
> > > fit into the L3 cache entirely.
> > > 2. It counted few folios as workingset and therefore incurred less
> > > overhead from CONFIG_PSI and/or CONFIG_TASK_DELAY_ACCT.
> > >
> > > Did you save workingset_refault_file when you ran the test? If so, can
> > > you check the difference between this series and your RFC?
> >
> >
> > It seems I was right about #1 above. After I scaled your test up by 20x,
> > I saw my series performed ~5% faster with zipf and ~9% faster with random
> > accesses.
>
> Hi Yu,
>
> Thank you so much for testing and sharing this result.
>
> I'm not sure about #1, the ramdisk size, access data, are far larger
> than L3 (16M on my CPU) even in down scaled test, and both random/zipf
> shows similar result.

It's the LRU list, not the pages. IOW, the kernel data structure, not
the contents of the LRU pages. Does that make sense?
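
As a rough back-of-the-envelope (illustrative numbers only, assuming
4KiB folios and 16 bytes of list_head linkage per folio):

awk 'BEGIN { folios = 4 * 1024^3 / 4096; printf "list linkage: ~%.0f MiB vs page contents: 4096 MiB\n", folios * 16 / 1024^2 }'

The linkage (plus the rest of the folio metadata) is what reclaim
actually walks, so a small, stable inactive list can stay
cache-resident even though the page contents never fit in a 16M L3.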

> > IOW, I made rd_size from 16GB to 320GB, memory.max from 4GB to 80GB,
> > --numjobs from 12 to 60 and --size from 1GB to 4GB.

Would you be able to try a larger configuration like above instead?
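
Concretely, the scaled-up script would look roughly like this (your
script with the numbers above swapped in; brd takes rd_size in KiB, so
320GB is 335544320 -- untested as written here):

#!/bin/bash
swapoff -a

modprobe brd rd_nr=1 rd_size=335544320
mkfs.ext4 /dev/ram0
mount /dev/ram0 /mnt

mkdir -p /sys/fs/cgroup/benchmark
cd /sys/fs/cgroup/benchmark

echo 80G > memory.max
echo $$ > cgroup.procs
echo 3 > /proc/sys/vm/drop_caches

fio -name=mglru --numjobs=60 --directory=/mnt --size=4096m \
    --buffered=1 --ioengine=io_uring --iodepth=128 \
    --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
    --rw=randread --random_distribution=zipf:0.5 --norandommap \
    --time_based --ramp_time=5m --runtime=5m --group_reporting

(and the same with --random_distribution=random for the random run)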

2023-12-15 04:56:52

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

On Thu, Dec 14, 2023 at 04:51:00PM -0700, Yu Zhao wrote:
> On Thu, Dec 14, 2023 at 11:38 AM Kairui Song <[email protected]> wrote:
> >
> > Yu Zhao <[email protected]> 于2023年12月14日周四 11:09写道:
> > > On Wed, Dec 13, 2023 at 12:59:14AM -0700, Yu Zhao wrote:
> > > > On Tue, Dec 12, 2023 at 8:03 PM Kairui Song <[email protected]> wrote:
> > > > >
> > > > > Kairui Song <[email protected]> 于2023年12月12日周二 14:52写道:
> > > > > >
> > > > > > Yu Zhao <[email protected]> 于2023年12月12日周二 06:07写道:
> > > > > > >
> > > > > > > On Fri, Dec 8, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
> > > > > > > >
> > > > > > > > Yu Zhao <[email protected]> 于2023年12月8日周五 14:14写道:
> > > > > > > > >
> > > > > > > > > Unmapped folios accessed through file descriptors can be
> > > > > > > > > underprotected. Those folios are added to the oldest generation based
> > > > > > > > > on:
> > > > > > > > > 1. The fact that they are less costly to reclaim (no need to walk the
> > > > > > > > > rmap and flush the TLB) and have less impact on performance (don't
> > > > > > > > > cause major PFs and can be non-blocking if needed again).
> > > > > > > > > 2. The observation that they are likely to be single-use. E.g., for
> > > > > > > > > client use cases like Android, its apps parse configuration files
> > > > > > > > > and store the data in heap (anon); for server use cases like MySQL,
> > > > > > > > > it reads from InnoDB files and holds the cached data for tables in
> > > > > > > > > buffer pools (anon).
> > > > > > > > >
> > > > > > > > > However, the oldest generation can be very short lived, and if so, it
> > > > > > > > > doesn't provide the PID controller with enough time to respond to a
> > > > > > > > > surge of refaults. (Note that the PID controller uses weighted
> > > > > > > > > refaults and those from evicted generations only take a half of the
> > > > > > > > > whole weight.) In other words, for a short lived generation, the
> > > > > > > > > moving average smooths out the spike quickly.
> > > > > > > > >
> > > > > > > > > To fix the problem:
> > > > > > > > > 1. For folios that are already on LRU, if they can be beyond the
> > > > > > > > > tracking range of tiers, i.e., five accesses through file
> > > > > > > > > descriptors, move them to the second oldest generation to give them
> > > > > > > > > more time to age. (Note that tiers are used by the PID controller
> > > > > > > > > to statistically determine whether folios accessed multiple times
> > > > > > > > > through file descriptors are worth protecting.)
> > > > > > > > > 2. When adding unmapped folios to LRU, adjust the placement of them so
> > > > > > > > > that they are not too close to the tail. The effect of this is
> > > > > > > > > similar to the above.
> > > > > > > > >
> > > > > > > > > On Android, launching 55 apps sequentially:
> > > > > > > > > Before After Change
> > > > > > > > > workingset_refault_anon 25641024 25598972 0%
> > > > > > > > > workingset_refault_file 115016834 106178438 -8%
> > > > > > > >
> > > > > > > > Hi Yu,
> > > > > > > >
> > > > > > > > Thanks you for your amazing works on MGLRU.
> > > > > > > >
> > > > > > > > I believe this is the similar issue I was trying to resolve previously:
> > > > > > > > https://lwn.net/Articles/945266/
> > > > > > > > The idea is to use refault distance to decide if the page should be
> > > > > > > > place in oldest generation or some other gen, which per my test,
> > > > > > > > worked very well, and we have been using refault distance for MGLRU in
> > > > > > > > multiple workloads.
> > > > > > > >
> > > > > > > > There are a few issues left in my previous RFC series, like anon pages
> > > > > > > > in MGLRU shouldn't be considered, I wanted to collect feedback or test
> > > > > > > > cases, but unfortunately it seems didn't get too much attention
> > > > > > > > upstream.
> > > > > > > >
> > > > > > > > I think both this patch and my previous series are for solving the
> > > > > > > > file pages underpertected issue, and I did a quick test using this
> > > > > > > > series, for mongodb test, refault distance seems still a better
> > > > > > > > solution (I'm not saying these two optimization are mutually exclusive
> > > > > > > > though, just they do have some conflicts in implementation and solving
> > > > > > > > similar problem):
> > > > > > > >
> > > > > > > > Previous result:
> > > > > > > > ==================================================================
> > > > > > > > Execution Results after 905 seconds
> > > > > > > > ------------------------------------------------------------------
> > > > > > > > Executed Time (µs) Rate
> > > > > > > > STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
> > > > > > > > ------------------------------------------------------------------
> > > > > > > > TOTAL 2542 27121571486.2 0.09 txn/s
> > > > > > > >
> > > > > > > > This patch:
> > > > > > > > ==================================================================
> > > > > > > > Execution Results after 900 seconds
> > > > > > > > ------------------------------------------------------------------
> > > > > > > > Executed Time (µs) Rate
> > > > > > > > STOCK_LEVEL 1594 27061522574.4 0.06 txn/s
> > > > > > > > ------------------------------------------------------------------
> > > > > > > > TOTAL 1594 27061522574.4 0.06 txn/s
> > > > > > > >
> > > > > > > > Unpatched version is always around ~500.
> > > > > > >
> > > > > > > Thanks for the test results!
> > > > > > >
> > > > > > > > I think there are a few points here:
> > > > > > > > - Refault distance make use of page shadow so it can better
> > > > > > > > distinguish evicted pages of different access pattern (re-access
> > > > > > > > distance).
> > > > > > > > - Throttled refault distance can help hold part of workingset when
> > > > > > > > memory is too small to hold the whole workingset.
> > > > > > > >
> > > > > > > > So maybe part of this patch and the bits of previous series can be
> > > > > > > > combined to work better on this issue, how do you think?
> > > > > > >
> > > > > > > I'll try to find some time this week to look at your RFC. It'd be a
> > > > >
> > > > > Hi Yu,
> > > > >
> > > > > I'm working on V4 of the RFC now, which just update some comments, and
> > > > > skip anon page re-activation in refault path for mglru which was not
> > > > > very helpful, only some tiny adjustment.
> > > > > And I found it easier to test with fio, using following test script:
> > > > >
> > > > > #!/bin/bash
> > > > > swapoff -a
> > > > >
> > > > > modprobe brd rd_nr=1 rd_size=16777216
> > > > > mkfs.ext4 /dev/ram0
> > > > > mount /dev/ram0 /mnt
> > > > >
> > > > > mkdir -p /sys/fs/cgroup/benchmark
> > > > > cd /sys/fs/cgroup/benchmark
> > > > >
> > > > > echo 4G > memory.max
> > > > > echo $$ > cgroup.procs
> > > > > echo 3 > /proc/sys/vm/drop_caches
> > > > >
> > > > > fio -name=mglru --numjobs=12 --directory=/mnt --size=1024m \
> > > > > --buffered=1 --ioengine=io_uring --iodepth=128 \
> > > > > --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
> > > > > --rw=randread --random_distribution=zipf:0.5 --norandommap \
> > > > > --time_based --ramp_time=5m --runtime=5m --group_reporting
> > > > >
> > > > > zipf:0.5 is used here to simulate a cached read with slight bias
> > > > > towards certain pages.
> > > > > Unpatched 6.7-rc4:
> > > > > Run status group 0 (all jobs):
> > > > > READ: bw=6548MiB/s (6866MB/s), 6548MiB/s-6548MiB/s
> > > > > (6866MB/s-6866MB/s), io=1918GiB (2060GB), run=300001-300001msec
> > > > >
> > > > > Patched with RFC v4:
> > > > > Run status group 0 (all jobs):
> > > > > READ: bw=7270MiB/s (7623MB/s), 7270MiB/s-7270MiB/s
> > > > > (7623MB/s-7623MB/s), io=2130GiB (2287GB), run=300001-300001msec
> > > > >
> > > > > Patched with this series:
> > > > > Run status group 0 (all jobs):
> > > > > READ: bw=7098MiB/s (7442MB/s), 7098MiB/s-7098MiB/s
> > > > > (7442MB/s-7442MB/s), io=2079GiB (2233GB), run=300002-300002msec
> > > > >
> > > > > MGLRU off:
> > > > > Run status group 0 (all jobs):
> > > > > READ: bw=6525MiB/s (6842MB/s), 6525MiB/s-6525MiB/s
> > > > > (6842MB/s-6842MB/s), io=1912GiB (2052GB), run=300002-300002msec
> > > > >
> > > > > - If I change zipf:0.5 to random:
> > > > > Unpatched 6.7-rc4:
> > > > > Patched with this series:
> > > > > Run status group 0 (all jobs):
> > > > > READ: bw=5975MiB/s (6265MB/s), 5975MiB/s-5975MiB/s
> > > > > (6265MB/s-6265MB/s), io=1750GiB (1879GB), run=300002-300002msec
> > > > >
> > > > > Patched with RFC v4:
> > > > > Run status group 0 (all jobs):
> > > > > READ: bw=5987MiB/s (6278MB/s), 5987MiB/s-5987MiB/s
> > > > > (6278MB/s-6278MB/s), io=1754GiB (1883GB), run=300001-300001msec
> > > > >
> > > > > Patched with this series:
> > > > > Run status group 0 (all jobs):
> > > > > READ: bw=5839MiB/s (6123MB/s), 5839MiB/s-5839MiB/s
> > > > > (6123MB/s-6123MB/s), io=1711GiB (1837GB), run=300001-300001msec
> > > > >
> > > > > MGLRU off:
> > > > > Run status group 0 (all jobs):
> > > > > READ: bw=5689MiB/s (5965MB/s), 5689MiB/s-5689MiB/s
> > > > > (5965MB/s-5965MB/s), io=1667GiB (1790GB), run=300003-300003msec
> > > > >
> > > > > fio uses ramdisk so LRU accuracy will have smaller impact. The Mongodb
> > > > > test I provided before uses a SATA SSD so it will have a much higher
> > > > > impact. I'll provides a script to setup the test case and run it, it's
> > > > > more complex to setup than fio since involving setting up multiple
> > > > > replicas and auth and hundreds of GB of test fixtures, I'm currently
> > > > > occupied by some other tasks but will try best to send them out as
> > > > > soon as possible.
> > > >
> > > > Thanks! Apparently your RFC did show better IOPS with both access
> > > > patterns, which was a surprise to me because it had higher refaults
> > > > and usually higher refautls result in worse performance.

And thanks for providing the refault counts I requested -- your data
below confirms what I mentioned above:

For fio:
Your RFC This series Change
workingset_refault_file 628192729 596790506 -5%
IOPS 1862k 1830k -2%

For MongoDB:
Your RFC This series Change
workingset_refault_anon 10512 35277 +236%
workingset_refault_file 22751782 20335355 -11%
total 22762294 20370632 -11%
TPS 0.09 0.06 -33%
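
(The percentages above can be recomputed directly from the quoted
counters, e.g.:)

awk 'BEGIN {
    printf "anon:  %+.0f%%\n", (35277 - 10512) / 10512 * 100
    printf "file:  %+.0f%%\n", (20335355 - 22751782) / 22751782 * 100
    printf "total: %+.0f%%\n", (20370632 - 22762294) / 22762294 * 100
}'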

For MongoDB, this series should be a big win (but apparently it's not),
especially when using zram, since an anon refault should be a lot
cheaper than a file refault.

So, I'm baffled...

One important detail I forgot to mention: based on your data from
lru_gen_full, I think there is another difference between our Kconfigs:

Your Kconfig My Kconfig Max possible
LRU_REFS_WIDTH 1 2 2

This means you can only track 3 accesses through file descriptors
at most, and when you hit 3, the folio is moved to the next generation
by sort_folio() in the eviction path. IOW, your aging runs faster than
mine sine more folios are moved to the next generation (mine only does
so when it hits 5).

In case you want to try a larger LRU_REFS_WIDTH, you can make
CONFIG_NODES_SHIFT smaller or disable CONFIG_IDLE_PAGE_TRACKING.
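
A quick way to check the relevant knobs in a build tree (illustrative;
LRU_REFS_WIDTH itself is derived at build time from the leftover page
flag bits, so only the options that consume those bits show up in
.config):

grep -E 'CONFIG_NODES_SHIFT|CONFIG_IDLE_PAGE_TRACKING|CONFIG_LRU_GEN' .config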

2023-12-17 18:32:31

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

On Thu, Dec 14, 2023 at 4:51 PM Yu Zhao <[email protected]> wrote:
>
> On Thu, Dec 14, 2023 at 11:38 AM Kairui Song <[email protected]> wrote:
> >
> > Yu Zhao <[email protected]> 于2023年12月14日周四 11:09写道:
> > > On Wed, Dec 13, 2023 at 12:59:14AM -0700, Yu Zhao wrote:
> > > > On Tue, Dec 12, 2023 at 8:03 PM Kairui Song <[email protected]> wrote:
> > > > >
> > > > > Kairui Song <[email protected]> 于2023年12月12日周二 14:52写道:
> > > > > >
> > > > > > Yu Zhao <[email protected]> 于2023年12月12日周二 06:07写道:
> > > > > > >
> > > > > > > On Fri, Dec 8, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
> > > > > > > >
> > > > > > > > Yu Zhao <[email protected]> 于2023年12月8日周五 14:14写道:
> > > > > > > > >
> > > > > > > > > Unmapped folios accessed through file descriptors can be
> > > > > > > > > underprotected. Those folios are added to the oldest generation based
> > > > > > > > > on:
> > > > > > > > > 1. The fact that they are less costly to reclaim (no need to walk the
> > > > > > > > > rmap and flush the TLB) and have less impact on performance (don't
> > > > > > > > > cause major PFs and can be non-blocking if needed again).
> > > > > > > > > 2. The observation that they are likely to be single-use. E.g., for
> > > > > > > > > client use cases like Android, its apps parse configuration files
> > > > > > > > > and store the data in heap (anon); for server use cases like MySQL,
> > > > > > > > > it reads from InnoDB files and holds the cached data for tables in
> > > > > > > > > buffer pools (anon).
> > > > > > > > >
> > > > > > > > > However, the oldest generation can be very short lived, and if so, it
> > > > > > > > > doesn't provide the PID controller with enough time to respond to a
> > > > > > > > > surge of refaults. (Note that the PID controller uses weighted
> > > > > > > > > refaults and those from evicted generations only take a half of the
> > > > > > > > > whole weight.) In other words, for a short lived generation, the
> > > > > > > > > moving average smooths out the spike quickly.
> > > > > > > > >
> > > > > > > > > To fix the problem:
> > > > > > > > > 1. For folios that are already on LRU, if they can be beyond the
> > > > > > > > > tracking range of tiers, i.e., five accesses through file
> > > > > > > > > descriptors, move them to the second oldest generation to give them
> > > > > > > > > more time to age. (Note that tiers are used by the PID controller
> > > > > > > > > to statistically determine whether folios accessed multiple times
> > > > > > > > > through file descriptors are worth protecting.)
> > > > > > > > > 2. When adding unmapped folios to LRU, adjust the placement of them so
> > > > > > > > > that they are not too close to the tail. The effect of this is
> > > > > > > > > similar to the above.
> > > > > > > > >
> > > > > > > > > On Android, launching 55 apps sequentially:
> > > > > > > > > Before After Change
> > > > > > > > > workingset_refault_anon 25641024 25598972 0%
> > > > > > > > > workingset_refault_file 115016834 106178438 -8%
> > > > > > > >
> > > > > > > > Hi Yu,
> > > > > > > >
> > > > > > > > Thanks you for your amazing works on MGLRU.
> > > > > > > >
> > > > > > > > I believe this is the similar issue I was trying to resolve previously:
> > > > > > > > https://lwn.net/Articles/945266/
> > > > > > > > The idea is to use refault distance to decide if the page should be
> > > > > > > > place in oldest generation or some other gen, which per my test,
> > > > > > > > worked very well, and we have been using refault distance for MGLRU in
> > > > > > > > multiple workloads.
> > > > > > > >
> > > > > > > > There are a few issues left in my previous RFC series, like anon pages
> > > > > > > > in MGLRU shouldn't be considered, I wanted to collect feedback or test
> > > > > > > > cases, but unfortunately it seems didn't get too much attention
> > > > > > > > upstream.
> > > > > > > >
> > > > > > > > I think both this patch and my previous series are for solving the
> > > > > > > > file pages underpertected issue, and I did a quick test using this
> > > > > > > > series, for mongodb test, refault distance seems still a better
> > > > > > > > solution (I'm not saying these two optimization are mutually exclusive
> > > > > > > > though, just they do have some conflicts in implementation and solving
> > > > > > > > similar problem):
> > > > > > > >
> > > > > > > > Previous result:
> > > > > > > > ==================================================================
> > > > > > > > Execution Results after 905 seconds
> > > > > > > > ------------------------------------------------------------------
> > > > > > > > Executed Time (µs) Rate
> > > > > > > > STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
> > > > > > > > ------------------------------------------------------------------
> > > > > > > > TOTAL 2542 27121571486.2 0.09 txn/s
> > > > > > > >
> > > > > > > > This patch:
> > > > > > > > ==================================================================
> > > > > > > > Execution Results after 900 seconds
> > > > > > > > ------------------------------------------------------------------
> > > > > > > > Executed Time (µs) Rate
> > > > > > > > STOCK_LEVEL 1594 27061522574.4 0.06 txn/s
> > > > > > > > ------------------------------------------------------------------
> > > > > > > > TOTAL 1594 27061522574.4 0.06 txn/s
> > > > > > > >
> > > > > > > > Unpatched version is always around ~500.
> > > > > > >
> > > > > > > Thanks for the test results!
> > > > > > >
> > > > > > > > I think there are a few points here:
> > > > > > > > - Refault distance makes use of the page shadow entries, so it can better
> > > > > > > > distinguish evicted pages with different access patterns (re-access
> > > > > > > > distance).
> > > > > > > > - A throttled refault distance can help hold on to part of the workingset
> > > > > > > > when memory is too small to hold the whole workingset.
> > > > > > > >
> > > > > > > > So maybe parts of this patch and bits of the previous series can be
> > > > > > > > combined to work better on this issue; what do you think?
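
To make the refault-distance idea above concrete, here is a minimal, purely
illustrative sketch; it is not code from this patch or from the RFC, and the
helper name, its parameters and the bucketing are hypothetical. The gist: the
smaller the refault distance recovered from the shadow entry, relative to how
much the memcg can protect, the younger the generation the refaulting folio is
placed in.

/*
 * Illustrative only -- not code from either series. The caller is assumed
 * to have recovered refault_distance from the shadow entry (as
 * mm/workingset.c already does for the classic LRU) and to know the
 * lruvec's current oldest/youngest generation numbers.
 */
static int refault_target_gen(unsigned long refault_distance,
			      unsigned long nr_protectable,
			      int min_gen, int max_gen)
{
	int nr_gens = max_gen - min_gen + 1;
	int i;

	/*
	 * A distance within 1/nr_gens of the protectable memory maps to the
	 * youngest generation, within 2/nr_gens to the second youngest, etc.
	 */
	for (i = 1; i < nr_gens; i++) {
		if (refault_distance < nr_protectable * i / nr_gens)
			return max_gen - (i - 1);
	}

	/* Re-accessed long after memory could have held it: oldest gen. */
	return min_gen;
}
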
> > > > > > >
> > > > > > > I'll try to find some time this week to look at your RFC. It'd be a
> > > > >
> > > > > Hi Yu,
> > > > >
> > > > > I'm working on V4 of the RFC now, which just updates some comments and
> > > > > skips anon page re-activation in the refault path for MGLRU, which was not
> > > > > very helpful; only some tiny adjustments.
> > > > > And I found it easier to test with fio, using the following test script:
> > > > >
> > > > > #!/bin/bash
> > > > > swapoff -a
> > > > >
> > > > > modprobe brd rd_nr=1 rd_size=16777216
> > > > > mkfs.ext4 /dev/ram0
> > > > > mount /dev/ram0 /mnt
> > > > >
> > > > > mkdir -p /sys/fs/cgroup/benchmark
> > > > > cd /sys/fs/cgroup/benchmark
> > > > >
> > > > > echo 4G > memory.max
> > > > > echo $$ > cgroup.procs
> > > > > echo 3 > /proc/sys/vm/drop_caches
> > > > >
> > > > > fio -name=mglru --numjobs=12 --directory=/mnt --size=1024m \
> > > > > --buffered=1 --ioengine=io_uring --iodepth=128 \
> > > > > --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
> > > > > --rw=randread --random_distribution=zipf:0.5 --norandommap \
> > > > > --time_based --ramp_time=5m --runtime=5m --group_reporting
> > > > >
> > > > > zipf:0.5 is used here to simulate a cached read with slight bias
> > > > > towards certain pages.
> > > > > Unpatched 6.7-rc4:
> > > > > Run status group 0 (all jobs):
> > > > > READ: bw=6548MiB/s (6866MB/s), 6548MiB/s-6548MiB/s
> > > > > (6866MB/s-6866MB/s), io=1918GiB (2060GB), run=300001-300001msec
> > > > >
> > > > > Patched with RFC v4:
> > > > > Run status group 0 (all jobs):
> > > > > READ: bw=7270MiB/s (7623MB/s), 7270MiB/s-7270MiB/s
> > > > > (7623MB/s-7623MB/s), io=2130GiB (2287GB), run=300001-300001msec
> > > > >
> > > > > Patched with this series:
> > > > > Run status group 0 (all jobs):
> > > > > READ: bw=7098MiB/s (7442MB/s), 7098MiB/s-7098MiB/s
> > > > > (7442MB/s-7442MB/s), io=2079GiB (2233GB), run=300002-300002msec
> > > > >
> > > > > MGLRU off:
> > > > > Run status group 0 (all jobs):
> > > > > READ: bw=6525MiB/s (6842MB/s), 6525MiB/s-6525MiB/s
> > > > > (6842MB/s-6842MB/s), io=1912GiB (2052GB), run=300002-300002msec
> > > > >
> > > > > - If I change zipf:0.5 to random:
> > > > > Unpatched 6.7-rc4:
> > > > > Patched with this series:
> > > > > Run status group 0 (all jobs):
> > > > > READ: bw=5975MiB/s (6265MB/s), 5975MiB/s-5975MiB/s
> > > > > (6265MB/s-6265MB/s), io=1750GiB (1879GB), run=300002-300002msec
> > > > >
> > > > > Patched with RFC v4:
> > > > > Run status group 0 (all jobs):
> > > > > READ: bw=5987MiB/s (6278MB/s), 5987MiB/s-5987MiB/s
> > > > > (6278MB/s-6278MB/s), io=1754GiB (1883GB), run=300001-300001msec
> > > > >
> > > > > Patched with this series:
> > > > > Run status group 0 (all jobs):
> > > > > READ: bw=5839MiB/s (6123MB/s), 5839MiB/s-5839MiB/s
> > > > > (6123MB/s-6123MB/s), io=1711GiB (1837GB), run=300001-300001msec
> > > > >
> > > > > MGLRU off:
> > > > > Run status group 0 (all jobs):
> > > > > READ: bw=5689MiB/s (5965MB/s), 5689MiB/s-5689MiB/s
> > > > > (5965MB/s-5965MB/s), io=1667GiB (1790GB), run=300003-300003msec
> > > > >
> > > > > fio uses a ramdisk, so LRU accuracy will have a smaller impact. The MongoDB
> > > > > test I provided before uses a SATA SSD, so it will have a much higher
> > > > > impact. I'll provide a script to set up the test case and run it; it's
> > > > > more complex to set up than fio since it involves setting up multiple
> > > > > replicas, auth, and hundreds of GB of test fixtures. I'm currently
> > > > > occupied by some other tasks but will try my best to send them out as
> > > > > soon as possible.
> > > >
> > > > Thanks! Apparently your RFC did show better IOPS with both access
> > > > patterns, which was a surprise to me because it had higher refaults
> > > > and usually higher refaults result in worse performance.
> > > >
> > > > So I'm still trying to figure out why it turned out the opposite. My
> > > > current guess is that:
> > > > 1. It had a very small but stable inactive LRU list, which was able to
> > > > fit into the L3 cache entirely.
> > > > 2. It counted few folios as workingset and therefore incurred less
> > > > overhead from CONFIG_PSI and/or CONFIG_TASK_DELAY_ACCT.
> > > >
> > > > Did you save workingset_refault_file when you ran the test? If so, can
> > > > you check the difference between this series and your RFC?
> > >
> > >
> > > It seems I was right about #1 above. After I scaled your test up by 20x,
> > > I saw my series performed ~5% faster with zipf and ~9% faster with random
> > > accesses.
> >
> > Hi Yu,
> >
> > Thank you so much for testing and sharing this result.
> >
> > I'm not sure about #1; the ramdisk size and accessed data are far larger
> > than the L3 (16M on my CPU) even in the down-scaled test, and both random/zipf
> > show similar results.
>
> It's the LRU list, not the pages. IOW, the kernel data structure, not the
> content of the LRU pages. Does that make sense?

FYI. Willy just reminded me that he explained it a lot better than I
did: https://lore.kernel.org/linux-mm/[email protected]/

> > > IOW, I made rd_size from 16GB to 320GB, memory.max from 4GB to 80GB,
> > > --numjobs from 12 to 60 and --size from 1GB to 4GB.
>
> Would you be able to try a larger configuration like above instead?

2023-12-18 18:15:20

by Kairui Song

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

Yu Zhao <[email protected]> 于2023年12月15日周五 12:56写道:
>
> And thanks for providing the refaults I requested for -- your data
> below confirms what I mentioned above:
>
> For fio:
> Your RFC This series Change
> workingset_refault_file 628192729 596790506 -5%
> IOPS 1862k 1830k -2%
>
> For MongoDB:
> Your RFC This series Change
> workingset_refault_anon 10512 35277 +30%
> workingset_refault_file 22751782 20335355 -11%
> total 22762294 20370632 -11%
> TPS 0.09 0.06 -33%
>
> For MongoDB, this series should be a big win (but apparently it's not),
> especially when using zram, since an anon refault should be a lot
> cheaper than a file refault.
>
> So, I'm baffled...
>
> One important detail I forgot to mention: based on your data from
> lru_gen_full, I think there is another difference between our Kconfigs:
>
> Your Kconfig My Kconfig Max possible
> LRU_REFS_WIDTH 1 2 2

Hi Yu,

Thanks for the info; my fault, I forgot to update my config as I was
testing some other features.
But after I changed LRU_REFS_WIDTH to 2 by disabling IDLE_PAGE, things
got much worse for the MongoDB test:

With LRU_REFS_WIDTH == 2:

This patch:
==================================================================
Execution Results after 919 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 488 27598136201.9 0.02 txn/s
------------------------------------------------------------------
TOTAL 488 27598136201.9 0.02 txn/s

memcg 86 /system.slice/docker-1c3a90be9f0a072f5719332419550cd0e1455f2cd5863bc2780ca4d3f913ece5.scope
 node 0
    1     948187          0x          0x
        0          0          0          0          0          0          0
        1          0          0          0          0          0          0
        2          0          0          0          0          0          0
        3          0          0          0          0          0          0
                   0          0          0          0          0          0
    2     948187           0     6051788
        0          0r         0e         0p     11916r     66442e         0p
        1          0r         0e         0p       903r     16888e         0p
        2          0r         0e         0p       459r      9764e         0p
        3          0r         0e         0p         0r         0e      2874p
                   0          0          0          0          0          0
    3     948187     1353160        6351
        0          0          0          0          0          0          0
        1          0          0          0          0          0          0
        2          0          0          0          0          0          0
        3          0          0          0          0          0          0
                   0          0          0          0          0          0
    4      73045       23573          12
        0          0R         0T          0   3498607R   4868605T          0
        1          0R         0T          0   3012246R   3270261T          0
        2          0R         0T          0   2498608R   2839104T          0
        3          0R         0T          0         0R   1983947T          0
             1486579L         0O   1380614Y      2945N      2945F      2734A

workingset_refault_anon 0
workingset_refault_file 18130598

total used free shared buff/cache available
Mem: 31978 6705 312 20 24960 24786
Swap: 31977 4 31973

RFC:
==================================================================
Execution Results after 908 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 2252 27159962888.2 0.08 txn/s
------------------------------------------------------------------
TOTAL 2252 27159962888.2 0.08 txn/s

workingset_refault_anon 22585
workingset_refault_file 22715256

memcg 66 /system.slice/docker-0989446ff78106e32d3f400a0cf371c9a703281bded86d6d6bb1af706ebb25da.scope
 node 0
   22     563007        2274     1198225
        0          0r         1e         0p         0r    697076e         0p
        1          0r         0e         0p         0r         0e    325661p
        2          0r         0e         0p         0r         0e    888728p
        3          0r         0e         0p         0r         0e   3602238p
                   0          0          0          0          0          0
   23     532222        7525     4948747
        0          0          0          0          0          0          0
        1          0          0          0          0          0          0
        2          0          0          0          0          0          0
        3          0          0          0          0          0          0
                   0          0          0          0          0          0
   24     500367     1214667        3292
        0          0          0          0          0          0          0
        1          0          0          0          0          0          0
        2          0          0          0          0          0          0
        3          0          0          0          0          0          0
                   0          0          0          0          0          0
   25     469692       40797         466
        0          0R       271T          0         0R   1162165T          0
        1          0R         0T          0    774028R   1205332T          0
        2          0R         0T          0         0R    932484T          0
        3          0R         1T          0         0R   4252158T          0
            25178380L    156515O  23953602Y     59234N     49391F     48664A

total used free shared buff/cache available
Mem: 31978 6968 338 5 24671 24555
Swap: 31977 1533 30444

Using the same MongoDB config (a 3-replica cluster, with every replica using the same config):
{
    "net": {
        "bindIpAll": true,
        "ipv6": false,
        "maxIncomingConnections": 10000,
    },
    "setParameter": {
        "disabledSecureAllocatorDomains": "*"
    },
    "replication": {
        "oplogSizeMB": 10480,
        "replSetName": "issa-tpcc_0"
    },
    "security": {
        "keyFile": "/data/db/keyfile"
    },
    "storage": {
        "dbPath": "/data/db/",
        "syncPeriodSecs": 60,
        "directoryPerDB": true,
        "wiredTiger": {
            "engineConfig": {
                "cacheSizeGB": 5
            }
        }
    },
    "systemLog": {
        "destination": "file",
        "logAppend": true,
        "logRotate": "rename",
        "path": "/data/db/mongod.log",
        "verbosity": 0
    }
}

The test environment has 32G of memory and 16 cores.

Per my analysis, the access pattern of the MongoDB test is that pages
are re-accessed long after they are evicted, so the PID controller won't
protect the higher tiers. The RFC makes use of the long-lived shadow
entries to feed back into the PID controller/generations, so the result
is much better. It still needs more adjusting though; I will try to do a
rebase on top of mm-unstable, which includes your patch.
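
To illustrate why late re-accesses defeat tier protection, here is a toy
sketch; it is not kernel code, and NR_HIST, struct tier_hist and the two
helpers are made up for illustration. The point is that a refault can only
be credited to a tier while the eviction history it belongs to is still
being kept, so a refault that arrives after that history has been recycled
no longer improves the refault/eviction ratio the PID controller compares.

/* Toy illustration only -- names and constants are hypothetical. */
#define NR_HIST 2

struct tier_hist {
	unsigned long refaulted[NR_HIST];
	unsigned long evicted[NR_HIST];
};

/* Called when a folio belonging to this tier is evicted. */
static void record_eviction(struct tier_hist *th, unsigned long hist_seq)
{
	th->evicted[hist_seq % NR_HIST]++;
}

/* Called on refault; evict_seq is recovered from the shadow entry. */
static void record_refault(struct tier_hist *th, unsigned long cur_seq,
			   unsigned long evict_seq)
{
	/* Re-accessed too late: that history slot has been recycled. */
	if (cur_seq - evict_seq >= NR_HIST)
		return;

	th->refaulted[evict_seq % NR_HIST]++;
}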

I've no idea why workingset_refault_* is higher in the better case;
this is clearly an IO-bound workload, with memory and IO busy while the
CPU is not fully utilized...

I've uploaded my local reproducer here:
https://github.com/ryncsn/emm-test-project/tree/master/mongo-cluster
https://github.com/ryncsn/py-tpcc

2023-12-19 03:21:52

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

On Mon, Dec 18, 2023 at 11:05 AM Kairui Song <[email protected]> wrote:
>
> I've uploaded my local reproducer here:
> https://github.com/ryncsn/emm-test-project/tree/master/mongo-cluster
> https://github.com/ryncsn/py-tpcc

Thanks for the repos -- I'm trying them right now. Which MongoDB
version did you use? setup.sh didn't seem to install it.

Also do you have a QEMU image? It'd be a lot easier for me to
duplicate the exact environment by looking into it.

2023-12-19 03:45:13

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

On Mon, Dec 18, 2023 at 8:21 PM Yu Zhao <[email protected]> wrote:
>
> Thanks for the repos -- I'm trying them right now. Which MongoDB
> version did you use? setup.sh didn't seem to install it.
>
> Also do you have a QEMU image? It'd be a lot easier for me to
> duplicate the exact environment by looking into it.

I ended up using docker.io/mongodb/mongodb-community-server:latest,
and it's not working:

# docker exec -it mongo-r1 mongosh --eval \
'"rs.initiate({
_id: "issa-tpcc_0",
members: [
{_id: 0, host: "mongo-r1"},
{_id: 1, host: "mongo-r2"},
{_id: 2, host: "mongo-r3"}
]
})"'
Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
Error: can only create exec sessions on running containers: container
state improper

2023-12-19 19:00:48

by Kairui Song

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

Yu Zhao <[email protected]> 于2023年12月19日周二 11:45写道:
>
> On Mon, Dec 18, 2023 at 8:21 PM Yu Zhao <[email protected]> wrote:
> >
> > On Mon, Dec 18, 2023 at 11:05 AM Kairui Song <[email protected]> wrote:
> > >
> > > Yu Zhao <[email protected]> 于2023年12月15日周五 12:56写道:
> > > >
> > > > On Thu, Dec 14, 2023 at 04:51:00PM -0700, Yu Zhao wrote:
> > > > > On Thu, Dec 14, 2023 at 11:38 AM Kairui Song <[email protected]> wrote:
> > > > > >
> > > > > > Yu Zhao <[email protected]> 于2023年12月14日周四 11:09写道:
> > > > > > > On Wed, Dec 13, 2023 at 12:59:14AM -0700, Yu Zhao wrote:
> > > > > > > > On Tue, Dec 12, 2023 at 8:03 PM Kairui Song <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > Kairui Song <[email protected]> 于2023年12月12日周二 14:52写道:
> > > > > > > > > >
> > > > > > > > > > Yu Zhao <[email protected]> 于2023年12月12日周二 06:07写道:
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Dec 8, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Yu Zhao <[email protected]> 于2023年12月8日周五 14:14写道:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Unmapped folios accessed through file descriptors can be
> > > > > > > > > > > > > underprotected. Those folios are added to the oldest generation based
> > > > > > > > > > > > > on:
> > > > > > > > > > > > > 1. The fact that they are less costly to reclaim (no need to walk the
> > > > > > > > > > > > > rmap and flush the TLB) and have less impact on performance (don't
> > > > > > > > > > > > > cause major PFs and can be non-blocking if needed again).
> > > > > > > > > > > > > 2. The observation that they are likely to be single-use. E.g., for
> > > > > > > > > > > > > client use cases like Android, its apps parse configuration files
> > > > > > > > > > > > > and store the data in heap (anon); for server use cases like MySQL,
> > > > > > > > > > > > > it reads from InnoDB files and holds the cached data for tables in
> > > > > > > > > > > > > buffer pools (anon).
> > > > > > > > > > > > >
> > > > > > > > > > > > > However, the oldest generation can be very short lived, and if so, it
> > > > > > > > > > > > > doesn't provide the PID controller with enough time to respond to a
> > > > > > > > > > > > > surge of refaults. (Note that the PID controller uses weighted
> > > > > > > > > > > > > refaults and those from evicted generations only take a half of the
> > > > > > > > > > > > > whole weight.) In other words, for a short lived generation, the
> > > > > > > > > > > > > moving average smooths out the spike quickly.
> > > > > > > > > > > > >
> > > > > > > > > > > > > To fix the problem:
> > > > > > > > > > > > > 1. For folios that are already on LRU, if they can be beyond the
> > > > > > > > > > > > > tracking range of tiers, i.e., five accesses through file
> > > > > > > > > > > > > descriptors, move them to the second oldest generation to give them
> > > > > > > > > > > > > more time to age. (Note that tiers are used by the PID controller
> > > > > > > > > > > > > to statistically determine whether folios accessed multiple times
> > > > > > > > > > > > > through file descriptors are worth protecting.)
> > > > > > > > > > > > > 2. When adding unmapped folios to LRU, adjust the placement of them so
> > > > > > > > > > > > > that they are not too close to the tail. The effect of this is
> > > > > > > > > > > > > similar to the above.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Android, launching 55 apps sequentially:
> > > > > > > > > > > > > Before After Change
> > > > > > > > > > > > > workingset_refault_anon 25641024 25598972 0%
> > > > > > > > > > > > > workingset_refault_file 115016834 106178438 -8%
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Yu,
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks you for your amazing works on MGLRU.
> > > > > > > > > > > >
> > > > > > > > > > > > I believe this is the similar issue I was trying to resolve previously:
> > > > > > > > > > > > https://lwn.net/Articles/945266/
> > > > > > > > > > > > The idea is to use refault distance to decide if the page should be
> > > > > > > > > > > > place in oldest generation or some other gen, which per my test,
> > > > > > > > > > > > worked very well, and we have been using refault distance for MGLRU in
> > > > > > > > > > > > multiple workloads.
> > > > > > > > > > > >
> > > > > > > > > > > > There are a few issues left in my previous RFC series, like anon pages
> > > > > > > > > > > > in MGLRU shouldn't be considered, I wanted to collect feedback or test
> > > > > > > > > > > > cases, but unfortunately it seems didn't get too much attention
> > > > > > > > > > > > upstream.
> > > > > > > > > > > >
> > > > > > > > > > > > I think both this patch and my previous series are for solving the
> > > > > > > > > > > > file pages underpertected issue, and I did a quick test using this
> > > > > > > > > > > > series, for mongodb test, refault distance seems still a better
> > > > > > > > > > > > solution (I'm not saying these two optimization are mutually exclusive
> > > > > > > > > > > > though, just they do have some conflicts in implementation and solving
> > > > > > > > > > > > similar problem):
> > > > > > > > > > > >
> > > > > > > > > > > > Previous result:
> > > > > > > > > > > > ==================================================================
> > > > > > > > > > > > Execution Results after 905 seconds
> > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > Executed Time (µs) Rate
> > > > > > > > > > > > STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
> > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > TOTAL 2542 27121571486.2 0.09 txn/s
> > > > > > > > > > > >
> > > > > > > > > > > > This patch:
> > > > > > > > > > > > ==================================================================
> > > > > > > > > > > > Execution Results after 900 seconds
> > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > Executed Time (µs) Rate
> > > > > > > > > > > > STOCK_LEVEL 1594 27061522574.4 0.06 txn/s
> > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > TOTAL 1594 27061522574.4 0.06 txn/s
> > > > > > > > > > > >
> > > > > > > > > > > > Unpatched version is always around ~500.
> > > > > > > > > > >
> > > > > > > > > > > Thanks for the test results!
> > > > > > > > > > >
> > > > > > > > > > > > I think there are a few points here:
> > > > > > > > > > > > - Refault distance make use of page shadow so it can better
> > > > > > > > > > > > distinguish evicted pages of different access pattern (re-access
> > > > > > > > > > > > distance).
> > > > > > > > > > > > - Throttled refault distance can help hold part of workingset when
> > > > > > > > > > > > memory is too small to hold the whole workingset.
> > > > > > > > > > > >
> > > > > > > > > > > > So maybe part of this patch and the bits of previous series can be
> > > > > > > > > > > > combined to work better on this issue, how do you think?
> > > > > > > > > > >
> > > > > > > > > > > I'll try to find some time this week to look at your RFC. It'd be a
> > > > > > > > >
> > > > > > > > > Hi Yu,
> > > > > > > > >
> > > > > > > > > I'm working on V4 of the RFC now, which just update some comments, and
> > > > > > > > > skip anon page re-activation in refault path for mglru which was not
> > > > > > > > > very helpful, only some tiny adjustment.
> > > > > > > > > And I found it easier to test with fio, using following test script:
> > > > > > > > >
> > > > > > > > > #!/bin/bash
> > > > > > > > > swapoff -a
> > > > > > > > >
> > > > > > > > > modprobe brd rd_nr=1 rd_size=16777216
> > > > > > > > > mkfs.ext4 /dev/ram0
> > > > > > > > > mount /dev/ram0 /mnt
> > > > > > > > >
> > > > > > > > > mkdir -p /sys/fs/cgroup/benchmark
> > > > > > > > > cd /sys/fs/cgroup/benchmark
> > > > > > > > >
> > > > > > > > > echo 4G > memory.max
> > > > > > > > > echo $$ > cgroup.procs
> > > > > > > > > echo 3 > /proc/sys/vm/drop_caches
> > > > > > > > >
> > > > > > > > > fio -name=mglru --numjobs=12 --directory=/mnt --size=1024m \
> > > > > > > > > --buffered=1 --ioengine=io_uring --iodepth=128 \
> > > > > > > > > --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
> > > > > > > > > --rw=randread --random_distribution=zipf:0.5 --norandommap \
> > > > > > > > > --time_based --ramp_time=5m --runtime=5m --group_reporting
> > > > > > > > >
> > > > > > > > > zipf:0.5 is used here to simulate a cached read with slight bias
> > > > > > > > > towards certain pages.
> > > > > > > > > Unpatched 6.7-rc4:
> > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > READ: bw=6548MiB/s (6866MB/s), 6548MiB/s-6548MiB/s
> > > > > > > > > (6866MB/s-6866MB/s), io=1918GiB (2060GB), run=300001-300001msec
> > > > > > > > >
> > > > > > > > > Patched with RFC v4:
> > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > READ: bw=7270MiB/s (7623MB/s), 7270MiB/s-7270MiB/s
> > > > > > > > > (7623MB/s-7623MB/s), io=2130GiB (2287GB), run=300001-300001msec
> > > > > > > > >
> > > > > > > > > Patched with this series:
> > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > READ: bw=7098MiB/s (7442MB/s), 7098MiB/s-7098MiB/s
> > > > > > > > > (7442MB/s-7442MB/s), io=2079GiB (2233GB), run=300002-300002msec
> > > > > > > > >
> > > > > > > > > MGLRU off:
> > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > READ: bw=6525MiB/s (6842MB/s), 6525MiB/s-6525MiB/s
> > > > > > > > > (6842MB/s-6842MB/s), io=1912GiB (2052GB), run=300002-300002msec
> > > > > > > > >
> > > > > > > > > - If I change zipf:0.5 to random:
> > > > > > > > > Unpatched 6.7-rc4:
> > > > > > > > > Patched with this series:
> > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > READ: bw=5975MiB/s (6265MB/s), 5975MiB/s-5975MiB/s
> > > > > > > > > (6265MB/s-6265MB/s), io=1750GiB (1879GB), run=300002-300002msec
> > > > > > > > >
> > > > > > > > > Patched with RFC v4:
> > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > READ: bw=5987MiB/s (6278MB/s), 5987MiB/s-5987MiB/s
> > > > > > > > > (6278MB/s-6278MB/s), io=1754GiB (1883GB), run=300001-300001msec
> > > > > > > > >
> > > > > > > > > Patched with this series:
> > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > READ: bw=5839MiB/s (6123MB/s), 5839MiB/s-5839MiB/s
> > > > > > > > > (6123MB/s-6123MB/s), io=1711GiB (1837GB), run=300001-300001msec
> > > > > > > > >
> > > > > > > > > MGLRU off:
> > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > READ: bw=5689MiB/s (5965MB/s), 5689MiB/s-5689MiB/s
> > > > > > > > > (5965MB/s-5965MB/s), io=1667GiB (1790GB), run=300003-300003msec
> > > > > > > > >
> > > > > > > > > fio uses ramdisk so LRU accuracy will have smaller impact. The Mongodb
> > > > > > > > > test I provided before uses a SATA SSD so it will have a much higher
> > > > > > > > > impact. I'll provides a script to setup the test case and run it, it's
> > > > > > > > > more complex to setup than fio since involving setting up multiple
> > > > > > > > > replicas and auth and hundreds of GB of test fixtures, I'm currently
> > > > > > > > > occupied by some other tasks but will try best to send them out as
> > > > > > > > > soon as possible.
> > > > > > > >
> > > > > > > > Thanks! Apparently your RFC did show better IOPS with both access
> > > > > > > > patterns, which was a surprise to me because it had higher refaults
> > > > > > > > and usually higher refautls result in worse performance.
> > > >
> > > > And thanks for providing the refaults I requested for -- your data
> > > > below confirms what I mentioned above:
> > > >
> > > > For fio:
> > > > Your RFC This series Change
> > > > workingset_refault_file 628192729 596790506 -5%
> > > > IOPS 1862k 1830k -2%
> > > >
> > > > For MongoDB:
> > > > Your RFC This series Change
> > > > workingset_refault_anon 10512 35277 +30%
> > > > workingset_refault_file 22751782 20335355 -11%
> > > > total 22762294 20370632 -11%
> > > > TPS 0.09 0.06 -33%
> > > >
> > > > For MongoDB, this series should be a big win (but apparently it's not),
> > > > especially when using zram, since an anon refault should be a lot
> > > > cheaper than a file refault.
> > > >
> > > > So, I'm baffled...
> > > >
> > > > One important detail I forgot to mention: based on your data from
> > > > lru_gen_full, I think there is another difference between our Kconfigs:
> > > >
> > > > Your Kconfig My Kconfig Max possible
> > > > LRU_REFS_WIDTH 1 2 2
> > >
> > > Hi Yu,
> > >
> > > Thanks for the info, my fault, I forgot to update my config as I was
> > > testing some other features.
> > > Buf after I changed LRU_REFS_WIDTH to 2 by disabling IDLE_PAGE, thing
> > > got much worse for MongoDB test:
> > >
> > > With LRU_REFS_WIDTH == 2:
> > >
> > > This patch:
> > > ==================================================================
> > > Execution Results after 919 seconds
> > > ------------------------------------------------------------------
> > > Executed Time (µs) Rate
> > > STOCK_LEVEL 488 27598136201.9 0.02 txn/s
> > > ------------------------------------------------------------------
> > > TOTAL 488 27598136201.9 0.02 txn/s
> > >
> > > memcg 86 /system.slice/docker-1c3a90be9f0a072f5719332419550cd0e1455f2cd5863bc2780ca4d3f913ece5.scope
> > > node 0
> > > 1 948187 0x 0x
> > > 0 0 0 0 0
> > > 0 0·
> > > 1 0 0 0 0
> > > 0 0·
> > > 2 0 0 0 0
> > > 0 0·
> > > 3 0 0 0 0
> > > 0 0·
> > > 0 0 0 0
> > > 0 0·
> > > 2 948187 0 6051788·
> > > 0 0r 0e 0p 11916r
> > > 66442e 0p
> > > 1 0r 0e 0p 903r
> > > 16888e 0p
> > > 2 0r 0e 0p 459r
> > > 9764e 0p
> > > 3 0r 0e 0p 0r
> > > 0e 2874p
> > > 0 0 0 0
> > > 0 0·
> > > 3 948187 1353160 6351·
> > > 0 0 0 0 0
> > > 0 0·
> > > 1 0 0 0 0
> > > 0 0·
> > > 2 0 0 0 0
> > > 0 0·
> > > 3 0 0 0 0
> > > 0 0·
> > > 0 0 0 0
> > > 0 0·
> > > 4 73045 23573 12·
> > > 0 0R 0T 0 3498607R
> > > 4868605T 0·
> > > 1 0R 0T 0 3012246R
> > > 3270261T 0·
> > > 2 0R 0T 0 2498608R
> > > 2839104T 0·
> > > 3 0R 0T 0 0R
> > > 1983947T 0·
> > > 1486579L 0O 1380614Y 2945N
> > > 2945F 2734A
> > >
> > > workingset_refault_anon 0
> > > workingset_refault_file 18130598
> > >
> > > total used free shared buff/cache available
> > > Mem: 31978 6705 312 20 24960 24786
> > > Swap: 31977 4 31973
> > >
> > > RFC:
> > > ==================================================================
> > > Execution Results after 908 seconds
> > > ------------------------------------------------------------------
> > > Executed Time (µs) Rate
> > > STOCK_LEVEL 2252 27159962888.2 0.08 txn/s
> > > ------------------------------------------------------------------
> > > TOTAL 2252 27159962888.2 0.08 txn/s
> > >
> > > workingset_refault_anon 22585
> > > workingset_refault_file 22715256
> > >
> > > memcg 66 /system.slice/docker-0989446ff78106e32d3f400a0cf371c9a703281bded86d6d6bb1af706ebb25da.scope
> > > node 0
> > > 22 563007 2274 1198225·
> > > 0 0r 1e 0p 0r
> > > 697076e 0p
> > > 1 0r 0e 0p 0r
> > > 0e 325661p
> > > 2 0r 0e 0p 0r
> > > 0e 888728p
> > > 3 0r 0e 0p 0r
> > > 0e 3602238p
> > > 0 0 0 0
> > > 0 0·
> > > 23 532222 7525 4948747·
> > > 0 0 0 0 0
> > > 0 0·
> > > 1 0 0 0 0
> > > 0 0·
> > > 2 0 0 0 0
> > > 0 0·
> > > 3 0 0 0 0
> > > 0 0·
> > > 0 0 0 0
> > > 0 0·
> > > 24 500367 1214667 3292·
> > > 0 0 0 0 0
> > > 0 0·
> > > 1 0 0 0 0
> > > 0 0·
> > > 2 0 0 0 0
> > > 0 0·
> > > 3 0 0 0 0
> > > 0 0·
> > > 0 0 0 0
> > > 0 0·
> > > 25 469692 40797 466·
> > > 0 0R 271T 0 0R
> > > 1162165T 0·
> > > 1 0R 0T 0 774028R
> > > 1205332T 0·
> > > 2 0R 0T 0 0R
> > > 932484T 0·
> > > 3 0R 1T 0 0R
> > > 4252158T 0·
> > > 25178380L 156515O 23953602Y 59234N
> > > 49391F 48664A
> > >
> > > total used free shared buff/cache available
> > > Mem: 31978 6968 338 5 24671 24555
> > > Swap: 31977 1533 30444
> > >
> > > Using same mongodb config (a 3 replica cluster using the same config):
> > > {
> > > "net": {
> > > "bindIpAll": true,
> > > "ipv6": false,
> > > "maxIncomingConnections": 10000,
> > > },
> > > "setParameter": {
> > > "disabledSecureAllocatorDomains": "*"
> > > },
> > > "replication": {
> > > "oplogSizeMB": 10480,
> > > "replSetName": "issa-tpcc_0"
> > > },
> > > "security": {
> > > "keyFile": "/data/db/keyfile"
> > > },
> > > "storage": {
> > > "dbPath": "/data/db/",
> > > "syncPeriodSecs": 60,
> > > "directoryPerDB": true,
> > > "wiredTiger": {
> > > "engineConfig": {
> > > "cacheSizeGB": 5
> > > }
> > > }
> > > },
> > > "systemLog": {
> > > "destination": "file",
> > > "logAppend": true,
> > > "logRotate": "rename",
> > > "path": "/data/db/mongod.log",
> > > "verbosity": 0
> > > }
> > > }
> > >
> > > The test environment have 32g memory and 16 core.
> > >
> > > Per my analyze, the access pattern for the mongodb test is that page
> > > will be re-access long after it's evicted so PID controller won't
> > > protect higher tier. That RFC will make use of the long existing
> > > shadow to do feedback to PID/Gen so the result will be much better.
> > > Still need more adjusting though, will try to do a rebase on top of
> > > mm-unstable which includes your patch.
> > >
> > > I've no idea why the workingset_refault_* is higher in the better
> > > case, this a clearly an IO bound workload, Memory and IO is busy while
> > > CPU is not full...
> > >
> > > I've uploaded my local reproducer here:
> > > https://github.com/ryncsn/emm-test-project/tree/master/mongo-cluster
> > > https://github.com/ryncsn/py-tpcc
> >
> > Thanks for the repos -- I'm trying them right now. Which MongoDB
> > version did you use? setup.sh didn't seem to install it.
> >
> > Also do you have a QEMU image? It'd be a lot easier for me to
> > duplicate the exact environment by looking into it.
>
> I ended up using docker.io/mongodb/mongodb-community-server:latest,
> and it's not working:
>
> # docker exec -it mongo-r1 mongosh --eval \
> '"rs.initiate({
> _id: "issa-tpcc_0",
> members: [
> {_id: 0, host: "mongo-r1"},
> {_id: 1, host: "mongo-r2"},
> {_id: 2, host: "mongo-r3"}
> ]
> })"'
> Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
> Error: can only create exec sessions on running containers: container
> state improper

Hi Yu,

I've updated the test repo:
https://github.com/ryncsn/emm-test-project/tree/master/mongo-cluster

I've tested it on top of the latest Fedora Cloud Image 39 and it worked
well for me; the README now contains detailed, easy-to-follow steps to
reproduce this test.

Also, I've updated the patch series. I plan to send out RFC v4 later, but
I need another day or two to tidy it up and collect test results:
https://github.com/ryncsn/linux/commits/kasong/devel/refault-distance-v4/

You may want to test on top of it; I'd be very grateful for any feedback.

It's on top of the current mm-unstable so that it works well with your fix
too. I managed to tweak it to be compatible with this series, but it seems
it might cause over-protection of pages, so the performance is slightly
worse than with RFC v3.
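
For anyone who hasn't read the RFC, here is a minimal userspace model of
the refault-distance idea being discussed (an illustration only, not the
RFC's actual code; MAX_NR_GENS, the function name and the clamping policy
are assumptions made for this sketch): a refaulting page whose shadow
entry shows it was evicted only a generation or two ago is placed into a
correspondingly younger generation, while a page re-accessed long after
eviction goes to the oldest one.

/*
 * Illustration only -- not the RFC's actual code. MAX_NR_GENS and the
 * clamping policy are assumptions made for this sketch.
 */
#include <stdio.h>

#define MAX_NR_GENS 4UL		/* MGLRU keeps at most this many generations */

/*
 * Pick a target generation for a refaulting page: the smaller its
 * refault distance (in generations), the younger the generation it
 * lands in; anything beyond the generation window goes to the oldest.
 */
static unsigned long target_gen(unsigned long max_seq, unsigned long evict_seq)
{
	unsigned long distance = max_seq - evict_seq;

	if (distance >= MAX_NR_GENS - 1)
		return max_seq - (MAX_NR_GENS - 1);	/* oldest generation */
	return max_seq - distance;			/* proportionally younger */
}

int main(void)
{
	unsigned long max_seq = 122;	/* e.g. the max_seq in the lru_gen dump below */

	/* evicted one generation ago -> distance 1 -> second youngest gen */
	printf("evicted at 121 -> gen %lu\n", target_gen(max_seq, 121));
	/* evicted long before -> clamped to the oldest generation */
	printf("evicted at 110 -> gen %lu\n", target_gen(max_seq, 110));
	return 0;
}

The actual series derives the eviction point from the workingset shadow
entry and also feeds it back to the PID controller, as described earlier
in the thread; the sketch above only models the placement half.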

This commit message contains my latest test results for the MongoDB case:
https://github.com/ryncsn/linux/commit/cd84e5c8e2449d33d411bce1d863bc391f36d7c8
As you can see, it's an IO-bound task (100% IO utilization and low CPU
usage) and the anon pages are mostly idle; using ZRAM or the same disk as
swap results in similar performance on the patched kernel.

As for the aging overhead issue I suspected before (the FIO regression due
to more frequent aging), I think it's real, and I added two patches:
https://github.com/ryncsn/linux/commit/f80cc280752da59272870378947aad6c822be2b4
https://github.com/ryncsn/linux/commit/01d091c98077a74bc70153cc7a0179a17da4f26f

In the test cases we talked about above, where more than ~100 generations
are created during the FIO run, I suspected that the aging overhead was
large and draining performance.
After these two patches, for a similar test case, FIO improved from this:

Run status group 0 (all jobs):
READ: bw=7593MiB/s (7962MB/s), 7593MiB/s-7593MiB/s
(7962MB/s-7962MB/s), io=2225GiB (2389GB), run=300002-300002msec
workingset_refault_anon 0
workingset_refault_file 641594126

To this:
Run status group 0 (all jobs):
READ: bw=7747MiB/s (8124MB/s), 7747MiB/s-7747MiB/s
(8124MB/s-8124MB/s), io=2270GiB (2437GB), run=300001-300001msec
workingset_refault_anon 0
workingset_refault_file 641511205

The lru_gen stats are similar for both cases:
memcg 66 /benchmark
 node 0
  119     155874          0          0x
    0         0r         0e         0p          0          0          0
    1         0r         0e         0p          0          0          0
    2         0r         0e         0p          0          0          0
    3         0r         0e         0p          0          0          0
               0          0          0          0          0          0
  120     151024          0      71410
    0          0          0          0         0r    587382e         0p
    1          0          0          0         0r         0e    117796p
    2          0          0          0         0r         0e    193086p
    3          0          0          0         0r         0e    371926p
               0          0          0          0          0          0
  121     146375          0     682854
    0          0          0          0          0          0          0
    1          0          0          0          0          0          0
    2          0          0          0          0          0          0
    3          0          0          0          0          0          0
               0          0          0          0          0          0
  122     141469          0       1348
    0         0R         0T          0         0R   5132602T          0
    1         0R         0T          0     86010R    244504T          0
    2         0R         0T          0         0R    196061T          0
    3         0R         0T          0         0R    397253T          0
        367101L     15850O     15820Y     93396N      1275F       459A

The overhead of cmpxchg on page flag updates is unavoidable though. I
think I could send out the two bulk-update patches first for a proper
review?
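
To make the cmpxchg point concrete: the generation number lives in a few
bits of a shared per-page flags word, so every per-page generation update
is a compare-and-swap retry loop along the lines of the userspace sketch
below (the GEN_* bit layout here is an assumption for illustration, not
the kernel's actual page flags layout). Batching amortizes the
surrounding work across many pages, but each page still pays for at least
one such atomic update, hence "unavoidable".

/*
 * Illustration only -- a generic cmpxchg retry loop that updates a
 * bitfield inside a shared flags word without disturbing other bits.
 */
#include <stdatomic.h>
#include <stdio.h>

#define GEN_SHIFT 3UL
#define GEN_MASK  (0x7UL << GEN_SHIFT)	/* 3 bits hold the generation */

static void set_gen(atomic_ulong *flags, unsigned long gen)
{
	unsigned long old = atomic_load(flags);
	unsigned long new;

	do {
		/* keep every other flag bit, replace only the generation bits */
		new = (old & ~GEN_MASK) | ((gen << GEN_SHIFT) & GEN_MASK);
		/* on failure, 'old' is refreshed and the loop retries */
	} while (!atomic_compare_exchange_weak(flags, &old, new));
}

int main(void)
{
	atomic_ulong flags;

	atomic_init(&flags, 0x1UL);	/* pretend an unrelated flag bit is set */
	set_gen(&flags, 5);
	printf("flags = %#lx\n", (unsigned long)atomic_load(&flags));	/* prints 0x29 */
	return 0;
}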

2023-12-20 06:39:24

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

On Tue, Dec 19, 2023 at 11:58 AM Kairui Song <[email protected]> wrote:
>
> Yu Zhao <[email protected]> 于2023年12月19日周二 11:45写道:
> >
> > On Mon, Dec 18, 2023 at 8:21 PM Yu Zhao <[email protected]> wrote:
> > >
> > > On Mon, Dec 18, 2023 at 11:05 AM Kairui Song <[email protected]> wrote:
> > > >
> > > > Yu Zhao <[email protected]> 于2023年12月15日周五 12:56写道:
> > > > >
> > > > > On Thu, Dec 14, 2023 at 04:51:00PM -0700, Yu Zhao wrote:
> > > > > > On Thu, Dec 14, 2023 at 11:38 AM Kairui Song <[email protected]> wrote:
> > > > > > >
> > > > > > > Yu Zhao <[email protected]> 于2023年12月14日周四 11:09写道:
> > > > > > > > On Wed, Dec 13, 2023 at 12:59:14AM -0700, Yu Zhao wrote:
> > > > > > > > > On Tue, Dec 12, 2023 at 8:03 PM Kairui Song <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > Kairui Song <[email protected]> 于2023年12月12日周二 14:52写道:
> > > > > > > > > > >
> > > > > > > > > > > Yu Zhao <[email protected]> 于2023年12月12日周二 06:07写道:
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Dec 8, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Yu Zhao <[email protected]> 于2023年12月8日周五 14:14写道:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Unmapped folios accessed through file descriptors can be
> > > > > > > > > > > > > > underprotected. Those folios are added to the oldest generation based
> > > > > > > > > > > > > > on:
> > > > > > > > > > > > > > 1. The fact that they are less costly to reclaim (no need to walk the
> > > > > > > > > > > > > > rmap and flush the TLB) and have less impact on performance (don't
> > > > > > > > > > > > > > cause major PFs and can be non-blocking if needed again).
> > > > > > > > > > > > > > 2. The observation that they are likely to be single-use. E.g., for
> > > > > > > > > > > > > > client use cases like Android, its apps parse configuration files
> > > > > > > > > > > > > > and store the data in heap (anon); for server use cases like MySQL,
> > > > > > > > > > > > > > it reads from InnoDB files and holds the cached data for tables in
> > > > > > > > > > > > > > buffer pools (anon).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > However, the oldest generation can be very short lived, and if so, it
> > > > > > > > > > > > > > doesn't provide the PID controller with enough time to respond to a
> > > > > > > > > > > > > > surge of refaults. (Note that the PID controller uses weighted
> > > > > > > > > > > > > > refaults and those from evicted generations only take a half of the
> > > > > > > > > > > > > > whole weight.) In other words, for a short lived generation, the
> > > > > > > > > > > > > > moving average smooths out the spike quickly.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > To fix the problem:
> > > > > > > > > > > > > > 1. For folios that are already on LRU, if they can be beyond the
> > > > > > > > > > > > > > tracking range of tiers, i.e., five accesses through file
> > > > > > > > > > > > > > descriptors, move them to the second oldest generation to give them
> > > > > > > > > > > > > > more time to age. (Note that tiers are used by the PID controller
> > > > > > > > > > > > > > to statistically determine whether folios accessed multiple times
> > > > > > > > > > > > > > through file descriptors are worth protecting.)
> > > > > > > > > > > > > > 2. When adding unmapped folios to LRU, adjust the placement of them so
> > > > > > > > > > > > > > that they are not too close to the tail. The effect of this is
> > > > > > > > > > > > > > similar to the above.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Android, launching 55 apps sequentially:
> > > > > > > > > > > > > > Before After Change
> > > > > > > > > > > > > > workingset_refault_anon 25641024 25598972 0%
> > > > > > > > > > > > > > workingset_refault_file 115016834 106178438 -8%
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi Yu,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks you for your amazing works on MGLRU.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I believe this is the similar issue I was trying to resolve previously:
> > > > > > > > > > > > > https://lwn.net/Articles/945266/
> > > > > > > > > > > > > The idea is to use refault distance to decide if the page should be
> > > > > > > > > > > > > place in oldest generation or some other gen, which per my test,
> > > > > > > > > > > > > worked very well, and we have been using refault distance for MGLRU in
> > > > > > > > > > > > > multiple workloads.
> > > > > > > > > > > > >
> > > > > > > > > > > > > There are a few issues left in my previous RFC series, like anon pages
> > > > > > > > > > > > > in MGLRU shouldn't be considered, I wanted to collect feedback or test
> > > > > > > > > > > > > cases, but unfortunately it seems didn't get too much attention
> > > > > > > > > > > > > upstream.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I think both this patch and my previous series are for solving the
> > > > > > > > > > > > > file pages underpertected issue, and I did a quick test using this
> > > > > > > > > > > > > series, for mongodb test, refault distance seems still a better
> > > > > > > > > > > > > solution (I'm not saying these two optimization are mutually exclusive
> > > > > > > > > > > > > though, just they do have some conflicts in implementation and solving
> > > > > > > > > > > > > similar problem):
> > > > > > > > > > > > >
> > > > > > > > > > > > > Previous result:
> > > > > > > > > > > > > ==================================================================
> > > > > > > > > > > > > Execution Results after 905 seconds
> > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > Executed Time (µs) Rate
> > > > > > > > > > > > > STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
> > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > TOTAL 2542 27121571486.2 0.09 txn/s
> > > > > > > > > > > > >
> > > > > > > > > > > > > This patch:
> > > > > > > > > > > > > ==================================================================
> > > > > > > > > > > > > Execution Results after 900 seconds
> > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > Executed Time (µs) Rate
> > > > > > > > > > > > > STOCK_LEVEL 1594 27061522574.4 0.06 txn/s
> > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > TOTAL 1594 27061522574.4 0.06 txn/s
> > > > > > > > > > > > >
> > > > > > > > > > > > > Unpatched version is always around ~500.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for the test results!
> > > > > > > > > > > >
> > > > > > > > > > > > > I think there are a few points here:
> > > > > > > > > > > > > - Refault distance make use of page shadow so it can better
> > > > > > > > > > > > > distinguish evicted pages of different access pattern (re-access
> > > > > > > > > > > > > distance).
> > > > > > > > > > > > > - Throttled refault distance can help hold part of workingset when
> > > > > > > > > > > > > memory is too small to hold the whole workingset.
> > > > > > > > > > > > >
> > > > > > > > > > > > > So maybe part of this patch and the bits of previous series can be
> > > > > > > > > > > > > combined to work better on this issue, how do you think?
> > > > > > > > > > > >
> > > > > > > > > > > > I'll try to find some time this week to look at your RFC. It'd be a
> > > > > > > > > >
> > > > > > > > > > Hi Yu,
> > > > > > > > > >
> > > > > > > > > > I'm working on V4 of the RFC now, which just update some comments, and
> > > > > > > > > > skip anon page re-activation in refault path for mglru which was not
> > > > > > > > > > very helpful, only some tiny adjustment.
> > > > > > > > > > And I found it easier to test with fio, using following test script:
> > > > > > > > > >
> > > > > > > > > > #!/bin/bash
> > > > > > > > > > swapoff -a
> > > > > > > > > >
> > > > > > > > > > modprobe brd rd_nr=1 rd_size=16777216
> > > > > > > > > > mkfs.ext4 /dev/ram0
> > > > > > > > > > mount /dev/ram0 /mnt
> > > > > > > > > >
> > > > > > > > > > mkdir -p /sys/fs/cgroup/benchmark
> > > > > > > > > > cd /sys/fs/cgroup/benchmark
> > > > > > > > > >
> > > > > > > > > > echo 4G > memory.max
> > > > > > > > > > echo $$ > cgroup.procs
> > > > > > > > > > echo 3 > /proc/sys/vm/drop_caches
> > > > > > > > > >
> > > > > > > > > > fio -name=mglru --numjobs=12 --directory=/mnt --size=1024m \
> > > > > > > > > > --buffered=1 --ioengine=io_uring --iodepth=128 \
> > > > > > > > > > --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
> > > > > > > > > > --rw=randread --random_distribution=zipf:0.5 --norandommap \
> > > > > > > > > > --time_based --ramp_time=5m --runtime=5m --group_reporting
> > > > > > > > > >
> > > > > > > > > > zipf:0.5 is used here to simulate a cached read with slight bias
> > > > > > > > > > towards certain pages.
> > > > > > > > > > Unpatched 6.7-rc4:
> > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > READ: bw=6548MiB/s (6866MB/s), 6548MiB/s-6548MiB/s
> > > > > > > > > > (6866MB/s-6866MB/s), io=1918GiB (2060GB), run=300001-300001msec
> > > > > > > > > >
> > > > > > > > > > Patched with RFC v4:
> > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > READ: bw=7270MiB/s (7623MB/s), 7270MiB/s-7270MiB/s
> > > > > > > > > > (7623MB/s-7623MB/s), io=2130GiB (2287GB), run=300001-300001msec
> > > > > > > > > >
> > > > > > > > > > Patched with this series:
> > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > READ: bw=7098MiB/s (7442MB/s), 7098MiB/s-7098MiB/s
> > > > > > > > > > (7442MB/s-7442MB/s), io=2079GiB (2233GB), run=300002-300002msec
> > > > > > > > > >
> > > > > > > > > > MGLRU off:
> > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > READ: bw=6525MiB/s (6842MB/s), 6525MiB/s-6525MiB/s
> > > > > > > > > > (6842MB/s-6842MB/s), io=1912GiB (2052GB), run=300002-300002msec
> > > > > > > > > >
> > > > > > > > > > - If I change zipf:0.5 to random:
> > > > > > > > > > Unpatched 6.7-rc4:
> > > > > > > > > > Patched with this series:
> > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > READ: bw=5975MiB/s (6265MB/s), 5975MiB/s-5975MiB/s
> > > > > > > > > > (6265MB/s-6265MB/s), io=1750GiB (1879GB), run=300002-300002msec
> > > > > > > > > >
> > > > > > > > > > Patched with RFC v4:
> > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > READ: bw=5987MiB/s (6278MB/s), 5987MiB/s-5987MiB/s
> > > > > > > > > > (6278MB/s-6278MB/s), io=1754GiB (1883GB), run=300001-300001msec
> > > > > > > > > >
> > > > > > > > > > Patched with this series:
> > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > READ: bw=5839MiB/s (6123MB/s), 5839MiB/s-5839MiB/s
> > > > > > > > > > (6123MB/s-6123MB/s), io=1711GiB (1837GB), run=300001-300001msec
> > > > > > > > > >
> > > > > > > > > > MGLRU off:
> > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > READ: bw=5689MiB/s (5965MB/s), 5689MiB/s-5689MiB/s
> > > > > > > > > > (5965MB/s-5965MB/s), io=1667GiB (1790GB), run=300003-300003msec
> > > > > > > > > >
> > > > > > > > > > fio uses ramdisk so LRU accuracy will have smaller impact. The Mongodb
> > > > > > > > > > test I provided before uses a SATA SSD so it will have a much higher
> > > > > > > > > > impact. I'll provides a script to setup the test case and run it, it's
> > > > > > > > > > more complex to setup than fio since involving setting up multiple
> > > > > > > > > > replicas and auth and hundreds of GB of test fixtures, I'm currently
> > > > > > > > > > occupied by some other tasks but will try best to send them out as
> > > > > > > > > > soon as possible.
> > > > > > > > >
> > > > > > > > > Thanks! Apparently your RFC did show better IOPS with both access
> > > > > > > > > patterns, which was a surprise to me because it had higher refaults
> > > > > > > > > and usually higher refautls result in worse performance.
> > > > >
> > > > > And thanks for providing the refaults I requested for -- your data
> > > > > below confirms what I mentioned above:
> > > > >
> > > > > For fio:
> > > > > Your RFC This series Change
> > > > > workingset_refault_file 628192729 596790506 -5%
> > > > > IOPS 1862k 1830k -2%
> > > > >
> > > > > For MongoDB:
> > > > > Your RFC This series Change
> > > > > workingset_refault_anon 10512 35277 +30%
> > > > > workingset_refault_file 22751782 20335355 -11%
> > > > > total 22762294 20370632 -11%
> > > > > TPS 0.09 0.06 -33%
> > > > >
> > > > > For MongoDB, this series should be a big win (but apparently it's not),
> > > > > especially when using zram, since an anon refault should be a lot
> > > > > cheaper than a file refault.
> > > > >
> > > > > So, I'm baffled...
> > > > >
> > > > > One important detail I forgot to mention: based on your data from
> > > > > lru_gen_full, I think there is another difference between our Kconfigs:
> > > > >
> > > > > Your Kconfig My Kconfig Max possible
> > > > > LRU_REFS_WIDTH 1 2 2
> > > >
> > > > Hi Yu,
> > > >
> > > > Thanks for the info, my fault, I forgot to update my config as I was
> > > > testing some other features.
> > > > Buf after I changed LRU_REFS_WIDTH to 2 by disabling IDLE_PAGE, thing
> > > > got much worse for MongoDB test:
> > > >
> > > > With LRU_REFS_WIDTH == 2:
> > > >
> > > > This patch:
> > > > ==================================================================
> > > > Execution Results after 919 seconds
> > > > ------------------------------------------------------------------
> > > > Executed Time (µs) Rate
> > > > STOCK_LEVEL 488 27598136201.9 0.02 txn/s
> > > > ------------------------------------------------------------------
> > > > TOTAL 488 27598136201.9 0.02 txn/s
> > > >
> > > > memcg 86 /system.slice/docker-1c3a90be9f0a072f5719332419550cd0e1455f2cd5863bc2780ca4d3f913ece5.scope
> > > > node 0
> > > > 1 948187 0x 0x
> > > > 0 0 0 0 0
> > > > 0 0·
> > > > 1 0 0 0 0
> > > > 0 0·
> > > > 2 0 0 0 0
> > > > 0 0·
> > > > 3 0 0 0 0
> > > > 0 0·
> > > > 0 0 0 0
> > > > 0 0·
> > > > 2 948187 0 6051788·
> > > > 0 0r 0e 0p 11916r
> > > > 66442e 0p
> > > > 1 0r 0e 0p 903r
> > > > 16888e 0p
> > > > 2 0r 0e 0p 459r
> > > > 9764e 0p
> > > > 3 0r 0e 0p 0r
> > > > 0e 2874p
> > > > 0 0 0 0
> > > > 0 0·
> > > > 3 948187 1353160 6351·
> > > > 0 0 0 0 0
> > > > 0 0·
> > > > 1 0 0 0 0
> > > > 0 0·
> > > > 2 0 0 0 0
> > > > 0 0·
> > > > 3 0 0 0 0
> > > > 0 0·
> > > > 0 0 0 0
> > > > 0 0·
> > > > 4 73045 23573 12·
> > > > 0 0R 0T 0 3498607R
> > > > 4868605T 0·
> > > > 1 0R 0T 0 3012246R
> > > > 3270261T 0·
> > > > 2 0R 0T 0 2498608R
> > > > 2839104T 0·
> > > > 3 0R 0T 0 0R
> > > > 1983947T 0·
> > > > 1486579L 0O 1380614Y 2945N
> > > > 2945F 2734A
> > > >
> > > > workingset_refault_anon 0
> > > > workingset_refault_file 18130598
> > > >
> > > > total used free shared buff/cache available
> > > > Mem: 31978 6705 312 20 24960 24786
> > > > Swap: 31977 4 31973
> > > >
> > > > RFC:
> > > > ==================================================================
> > > > Execution Results after 908 seconds
> > > > ------------------------------------------------------------------
> > > > Executed Time (µs) Rate
> > > > STOCK_LEVEL 2252 27159962888.2 0.08 txn/s
> > > > ------------------------------------------------------------------
> > > > TOTAL 2252 27159962888.2 0.08 txn/s
> > > >
> > > > workingset_refault_anon 22585
> > > > workingset_refault_file 22715256
> > > >
> > > > memcg 66 /system.slice/docker-0989446ff78106e32d3f400a0cf371c9a703281bded86d6d6bb1af706ebb25da.scope
> > > > node 0
> > > > 22 563007 2274 1198225·
> > > > 0 0r 1e 0p 0r
> > > > 697076e 0p
> > > > 1 0r 0e 0p 0r
> > > > 0e 325661p
> > > > 2 0r 0e 0p 0r
> > > > 0e 888728p
> > > > 3 0r 0e 0p 0r
> > > > 0e 3602238p
> > > > 0 0 0 0
> > > > 0 0·
> > > > 23 532222 7525 4948747·
> > > > 0 0 0 0 0
> > > > 0 0·
> > > > 1 0 0 0 0
> > > > 0 0·
> > > > 2 0 0 0 0
> > > > 0 0·
> > > > 3 0 0 0 0
> > > > 0 0·
> > > > 0 0 0 0
> > > > 0 0·
> > > > 24 500367 1214667 3292·
> > > > 0 0 0 0 0
> > > > 0 0·
> > > > 1 0 0 0 0
> > > > 0 0·
> > > > 2 0 0 0 0
> > > > 0 0·
> > > > 3 0 0 0 0
> > > > 0 0·
> > > > 0 0 0 0
> > > > 0 0·
> > > > 25 469692 40797 466·
> > > > 0 0R 271T 0 0R
> > > > 1162165T 0·
> > > > 1 0R 0T 0 774028R
> > > > 1205332T 0·
> > > > 2 0R 0T 0 0R
> > > > 932484T 0·
> > > > 3 0R 1T 0 0R
> > > > 4252158T 0·
> > > > 25178380L 156515O 23953602Y 59234N
> > > > 49391F 48664A
> > > >
> > > > total used free shared buff/cache available
> > > > Mem: 31978 6968 338 5 24671 24555
> > > > Swap: 31977 1533 30444
> > > >
> > > > Using same mongodb config (a 3 replica cluster using the same config):
> > > > {
> > > > "net": {
> > > > "bindIpAll": true,
> > > > "ipv6": false,
> > > > "maxIncomingConnections": 10000,
> > > > },
> > > > "setParameter": {
> > > > "disabledSecureAllocatorDomains": "*"
> > > > },
> > > > "replication": {
> > > > "oplogSizeMB": 10480,
> > > > "replSetName": "issa-tpcc_0"
> > > > },
> > > > "security": {
> > > > "keyFile": "/data/db/keyfile"
> > > > },
> > > > "storage": {
> > > > "dbPath": "/data/db/",
> > > > "syncPeriodSecs": 60,
> > > > "directoryPerDB": true,
> > > > "wiredTiger": {
> > > > "engineConfig": {
> > > > "cacheSizeGB": 5
> > > > }
> > > > }
> > > > },
> > > > "systemLog": {
> > > > "destination": "file",
> > > > "logAppend": true,
> > > > "logRotate": "rename",
> > > > "path": "/data/db/mongod.log",
> > > > "verbosity": 0
> > > > }
> > > > }
> > > >
> > > > The test environment have 32g memory and 16 core.
> > > >
> > > > Per my analyze, the access pattern for the mongodb test is that page
> > > > will be re-access long after it's evicted so PID controller won't
> > > > protect higher tier. That RFC will make use of the long existing
> > > > shadow to do feedback to PID/Gen so the result will be much better.
> > > > Still need more adjusting though, will try to do a rebase on top of
> > > > mm-unstable which includes your patch.
> > > >
> > > > I've no idea why the workingset_refault_* is higher in the better
> > > > case, this a clearly an IO bound workload, Memory and IO is busy while
> > > > CPU is not full...
> > > >
> > > > I've uploaded my local reproducer here:
> > > > https://github.com/ryncsn/emm-test-project/tree/master/mongo-cluster
> > > > https://github.com/ryncsn/py-tpcc
> > >
> > > Thanks for the repos -- I'm trying them right now. Which MongoDB
> > > version did you use? setup.sh didn't seem to install it.
> > >
> > > Also do you have a QEMU image? It'd be a lot easier for me to
> > > duplicate the exact environment by looking into it.
> >
> > I ended up using docker.io/mongodb/mongodb-community-server:latest,
> > and it's not working:
> >
> > # docker exec -it mongo-r1 mongosh --eval \
> > '"rs.initiate({
> > _id: "issa-tpcc_0",
> > members: [
> > {_id: 0, host: "mongo-r1"},
> > {_id: 1, host: "mongo-r2"},
> > {_id: 2, host: "mongo-r3"}
> > ]
> > })"'
> > Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
> > Error: can only create exec sessions on running containers: container
> > state improper
>
> Hi Yu,
>
> I've updated the test repo:
> https://github.com/ryncsn/emm-test-project/tree/master/mongo-cluster
>
> I've tested it on top of latest Fedora Cloud Image 39 and it worked
> well for me, the README now contains detailed and not hard to follow
> steps to reproduce this test.

Thanks. I followed the instructions to the letter, and it fell apart again
at line 46 (./tpcc.py).

Were you able to successfully run the benchmark on a fresh VM by
following the instructions? If not, I'd appreciate it if you could do
so and document all the missing steps.

2023-12-20 08:17:40

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

On Tue, Dec 19, 2023 at 11:38 PM Yu Zhao <[email protected]> wrote:
>
> On Tue, Dec 19, 2023 at 11:58 AM Kairui Song <[email protected]> wrote:
> >
> > Yu Zhao <[email protected]> 于2023年12月19日周二 11:45写道:
> > >
> > > On Mon, Dec 18, 2023 at 8:21 PM Yu Zhao <[email protected]> wrote:
> > > >
> > > > On Mon, Dec 18, 2023 at 11:05 AM Kairui Song <[email protected]> wrote:
> > > > >
> > > > > Yu Zhao <[email protected]> 于2023年12月15日周五 12:56写道:
> > > > > >
> > > > > > On Thu, Dec 14, 2023 at 04:51:00PM -0700, Yu Zhao wrote:
> > > > > > > On Thu, Dec 14, 2023 at 11:38 AM Kairui Song <[email protected]> wrote:
> > > > > > > >
> > > > > > > > Yu Zhao <[email protected]> 于2023年12月14日周四 11:09写道:
> > > > > > > > > On Wed, Dec 13, 2023 at 12:59:14AM -0700, Yu Zhao wrote:
> > > > > > > > > > On Tue, Dec 12, 2023 at 8:03 PM Kairui Song <[email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Kairui Song <[email protected]> 于2023年12月12日周二 14:52写道:
> > > > > > > > > > > >
> > > > > > > > > > > > Yu Zhao <[email protected]> 于2023年12月12日周二 06:07写道:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Dec 8, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Yu Zhao <[email protected]> 于2023年12月8日周五 14:14写道:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Unmapped folios accessed through file descriptors can be
> > > > > > > > > > > > > > > underprotected. Those folios are added to the oldest generation based
> > > > > > > > > > > > > > > on:
> > > > > > > > > > > > > > > 1. The fact that they are less costly to reclaim (no need to walk the
> > > > > > > > > > > > > > > rmap and flush the TLB) and have less impact on performance (don't
> > > > > > > > > > > > > > > cause major PFs and can be non-blocking if needed again).
> > > > > > > > > > > > > > > 2. The observation that they are likely to be single-use. E.g., for
> > > > > > > > > > > > > > > client use cases like Android, its apps parse configuration files
> > > > > > > > > > > > > > > and store the data in heap (anon); for server use cases like MySQL,
> > > > > > > > > > > > > > > it reads from InnoDB files and holds the cached data for tables in
> > > > > > > > > > > > > > > buffer pools (anon).
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > However, the oldest generation can be very short lived, and if so, it
> > > > > > > > > > > > > > > doesn't provide the PID controller with enough time to respond to a
> > > > > > > > > > > > > > > surge of refaults. (Note that the PID controller uses weighted
> > > > > > > > > > > > > > > refaults and those from evicted generations only take a half of the
> > > > > > > > > > > > > > > whole weight.) In other words, for a short lived generation, the
> > > > > > > > > > > > > > > moving average smooths out the spike quickly.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > To fix the problem:
> > > > > > > > > > > > > > > 1. For folios that are already on LRU, if they can be beyond the
> > > > > > > > > > > > > > > tracking range of tiers, i.e., five accesses through file
> > > > > > > > > > > > > > > descriptors, move them to the second oldest generation to give them
> > > > > > > > > > > > > > > more time to age. (Note that tiers are used by the PID controller
> > > > > > > > > > > > > > > to statistically determine whether folios accessed multiple times
> > > > > > > > > > > > > > > through file descriptors are worth protecting.)
> > > > > > > > > > > > > > > 2. When adding unmapped folios to LRU, adjust the placement of them so
> > > > > > > > > > > > > > > that they are not too close to the tail. The effect of this is
> > > > > > > > > > > > > > > similar to the above.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Android, launching 55 apps sequentially:
> > > > > > > > > > > > > > > Before After Change
> > > > > > > > > > > > > > > workingset_refault_anon 25641024 25598972 0%
> > > > > > > > > > > > > > > workingset_refault_file 115016834 106178438 -8%
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Yu,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks you for your amazing works on MGLRU.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I believe this is the similar issue I was trying to resolve previously:
> > > > > > > > > > > > > > https://lwn.net/Articles/945266/
> > > > > > > > > > > > > > The idea is to use refault distance to decide if the page should be
> > > > > > > > > > > > > > place in oldest generation or some other gen, which per my test,
> > > > > > > > > > > > > > worked very well, and we have been using refault distance for MGLRU in
> > > > > > > > > > > > > > multiple workloads.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > There are a few issues left in my previous RFC series, like anon pages
> > > > > > > > > > > > > > in MGLRU shouldn't be considered, I wanted to collect feedback or test
> > > > > > > > > > > > > > cases, but unfortunately it seems didn't get too much attention
> > > > > > > > > > > > > > upstream.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I think both this patch and my previous series are for solving the
> > > > > > > > > > > > > > file pages underpertected issue, and I did a quick test using this
> > > > > > > > > > > > > > series, for mongodb test, refault distance seems still a better
> > > > > > > > > > > > > > solution (I'm not saying these two optimization are mutually exclusive
> > > > > > > > > > > > > > though, just they do have some conflicts in implementation and solving
> > > > > > > > > > > > > > similar problem):
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Previous result:
> > > > > > > > > > > > > > ==================================================================
> > > > > > > > > > > > > > Execution Results after 905 seconds
> > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > Executed Time (µs) Rate
> > > > > > > > > > > > > > STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
> > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > TOTAL 2542 27121571486.2 0.09 txn/s
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This patch:
> > > > > > > > > > > > > > ==================================================================
> > > > > > > > > > > > > > Execution Results after 900 seconds
> > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > Executed Time (µs) Rate
> > > > > > > > > > > > > > STOCK_LEVEL 1594 27061522574.4 0.06 txn/s
> > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > TOTAL 1594 27061522574.4 0.06 txn/s
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Unpatched version is always around ~500.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for the test results!
> > > > > > > > > > > > >
> > > > > > > > > > > > > > I think there are a few points here:
> > > > > > > > > > > > > > - Refault distance make use of page shadow so it can better
> > > > > > > > > > > > > > distinguish evicted pages of different access pattern (re-access
> > > > > > > > > > > > > > distance).
> > > > > > > > > > > > > > - Throttled refault distance can help hold part of workingset when
> > > > > > > > > > > > > > memory is too small to hold the whole workingset.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > So maybe part of this patch and the bits of previous series can be
> > > > > > > > > > > > > > combined to work better on this issue, how do you think?
> > > > > > > > > > > > >
> > > > > > > > > > > > > I'll try to find some time this week to look at your RFC. It'd be a
> > > > > > > > > > >
> > > > > > > > > > > Hi Yu,
> > > > > > > > > > >
> > > > > > > > > > > I'm working on V4 of the RFC now, which just update some comments, and
> > > > > > > > > > > skip anon page re-activation in refault path for mglru which was not
> > > > > > > > > > > very helpful, only some tiny adjustment.
> > > > > > > > > > > And I found it easier to test with fio, using following test script:
> > > > > > > > > > >
> > > > > > > > > > > #!/bin/bash
> > > > > > > > > > > swapoff -a
> > > > > > > > > > >
> > > > > > > > > > > modprobe brd rd_nr=1 rd_size=16777216
> > > > > > > > > > > mkfs.ext4 /dev/ram0
> > > > > > > > > > > mount /dev/ram0 /mnt
> > > > > > > > > > >
> > > > > > > > > > > mkdir -p /sys/fs/cgroup/benchmark
> > > > > > > > > > > cd /sys/fs/cgroup/benchmark
> > > > > > > > > > >
> > > > > > > > > > > echo 4G > memory.max
> > > > > > > > > > > echo $$ > cgroup.procs
> > > > > > > > > > > echo 3 > /proc/sys/vm/drop_caches
> > > > > > > > > > >
> > > > > > > > > > > fio -name=mglru --numjobs=12 --directory=/mnt --size=1024m \
> > > > > > > > > > > --buffered=1 --ioengine=io_uring --iodepth=128 \
> > > > > > > > > > > --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
> > > > > > > > > > > --rw=randread --random_distribution=zipf:0.5 --norandommap \
> > > > > > > > > > > --time_based --ramp_time=5m --runtime=5m --group_reporting
> > > > > > > > > > >
> > > > > > > > > > > zipf:0.5 is used here to simulate a cached read with slight bias
> > > > > > > > > > > towards certain pages.
> > > > > > > > > > > Unpatched 6.7-rc4:
> > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > READ: bw=6548MiB/s (6866MB/s), 6548MiB/s-6548MiB/s
> > > > > > > > > > > (6866MB/s-6866MB/s), io=1918GiB (2060GB), run=300001-300001msec
> > > > > > > > > > >
> > > > > > > > > > > Patched with RFC v4:
> > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > READ: bw=7270MiB/s (7623MB/s), 7270MiB/s-7270MiB/s
> > > > > > > > > > > (7623MB/s-7623MB/s), io=2130GiB (2287GB), run=300001-300001msec
> > > > > > > > > > >
> > > > > > > > > > > Patched with this series:
> > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > READ: bw=7098MiB/s (7442MB/s), 7098MiB/s-7098MiB/s
> > > > > > > > > > > (7442MB/s-7442MB/s), io=2079GiB (2233GB), run=300002-300002msec
> > > > > > > > > > >
> > > > > > > > > > > MGLRU off:
> > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > READ: bw=6525MiB/s (6842MB/s), 6525MiB/s-6525MiB/s
> > > > > > > > > > > (6842MB/s-6842MB/s), io=1912GiB (2052GB), run=300002-300002msec
> > > > > > > > > > >
> > > > > > > > > > > - If I change zipf:0.5 to random:
> > > > > > > > > > > Unpatched 6.7-rc4:
> > > > > > > > > > > Patched with this series:
> > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > READ: bw=5975MiB/s (6265MB/s), 5975MiB/s-5975MiB/s
> > > > > > > > > > > (6265MB/s-6265MB/s), io=1750GiB (1879GB), run=300002-300002msec
> > > > > > > > > > >
> > > > > > > > > > > Patched with RFC v4:
> > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > READ: bw=5987MiB/s (6278MB/s), 5987MiB/s-5987MiB/s
> > > > > > > > > > > (6278MB/s-6278MB/s), io=1754GiB (1883GB), run=300001-300001msec
> > > > > > > > > > >
> > > > > > > > > > > Patched with this series:
> > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > READ: bw=5839MiB/s (6123MB/s), 5839MiB/s-5839MiB/s
> > > > > > > > > > > (6123MB/s-6123MB/s), io=1711GiB (1837GB), run=300001-300001msec
> > > > > > > > > > >
> > > > > > > > > > > MGLRU off:
> > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > READ: bw=5689MiB/s (5965MB/s), 5689MiB/s-5689MiB/s
> > > > > > > > > > > (5965MB/s-5965MB/s), io=1667GiB (1790GB), run=300003-300003msec
> > > > > > > > > > >
> > > > > > > > > > > fio uses ramdisk so LRU accuracy will have smaller impact. The Mongodb
> > > > > > > > > > > test I provided before uses a SATA SSD so it will have a much higher
> > > > > > > > > > > impact. I'll provides a script to setup the test case and run it, it's
> > > > > > > > > > > more complex to setup than fio since involving setting up multiple
> > > > > > > > > > > replicas and auth and hundreds of GB of test fixtures, I'm currently
> > > > > > > > > > > occupied by some other tasks but will try best to send them out as
> > > > > > > > > > > soon as possible.
> > > > > > > > > >
> > > > > > > > > > Thanks! Apparently your RFC did show better IOPS with both access
> > > > > > > > > > patterns, which was a surprise to me because it had higher refaults
> > > > > > > > > > and usually higher refautls result in worse performance.
> > > > > >
> > > > > > And thanks for providing the refaults I requested for -- your data
> > > > > > below confirms what I mentioned above:
> > > > > >
> > > > > > For fio:
> > > > > > Your RFC This series Change
> > > > > > workingset_refault_file 628192729 596790506 -5%
> > > > > > IOPS 1862k 1830k -2%
> > > > > >
> > > > > > For MongoDB:
> > > > > > Your RFC This series Change
> > > > > > workingset_refault_anon 10512 35277 +30%
> > > > > > workingset_refault_file 22751782 20335355 -11%
> > > > > > total 22762294 20370632 -11%
> > > > > > TPS 0.09 0.06 -33%
> > > > > >
> > > > > > For MongoDB, this series should be a big win (but apparently it's not),
> > > > > > especially when using zram, since an anon refault should be a lot
> > > > > > cheaper than a file refault.
> > > > > >
> > > > > > So, I'm baffled...
> > > > > >
> > > > > > One important detail I forgot to mention: based on your data from
> > > > > > lru_gen_full, I think there is another difference between our Kconfigs:
> > > > > >
> > > > > > Your Kconfig My Kconfig Max possible
> > > > > > LRU_REFS_WIDTH 1 2 2
> > > > >
> > > > > Hi Yu,
> > > > >
> > > > > Thanks for the info, my fault, I forgot to update my config as I was
> > > > > testing some other features.
> > > > > Buf after I changed LRU_REFS_WIDTH to 2 by disabling IDLE_PAGE, thing
> > > > > got much worse for MongoDB test:
> > > > >
> > > > > With LRU_REFS_WIDTH == 2:
> > > > >
> > > > > This patch:
> > > > > ==================================================================
> > > > > Execution Results after 919 seconds
> > > > > ------------------------------------------------------------------
> > > > > Executed Time (µs) Rate
> > > > > STOCK_LEVEL 488 27598136201.9 0.02 txn/s
> > > > > ------------------------------------------------------------------
> > > > > TOTAL 488 27598136201.9 0.02 txn/s
> > > > >
> > > > > memcg 86 /system.slice/docker-1c3a90be9f0a072f5719332419550cd0e1455f2cd5863bc2780ca4d3f913ece5.scope
> > > > > node 0
> > > > > 1 948187 0x 0x
> > > > > 0 0 0 0 0
> > > > > 0 0·
> > > > > 1 0 0 0 0
> > > > > 0 0·
> > > > > 2 0 0 0 0
> > > > > 0 0·
> > > > > 3 0 0 0 0
> > > > > 0 0·
> > > > > 0 0 0 0
> > > > > 0 0·
> > > > > 2 948187 0 6051788·
> > > > > 0 0r 0e 0p 11916r
> > > > > 66442e 0p
> > > > > 1 0r 0e 0p 903r
> > > > > 16888e 0p
> > > > > 2 0r 0e 0p 459r
> > > > > 9764e 0p
> > > > > 3 0r 0e 0p 0r
> > > > > 0e 2874p
> > > > > 0 0 0 0
> > > > > 0 0·
> > > > > 3 948187 1353160 6351·
> > > > > 0 0 0 0 0
> > > > > 0 0·
> > > > > 1 0 0 0 0
> > > > > 0 0·
> > > > > 2 0 0 0 0
> > > > > 0 0·
> > > > > 3 0 0 0 0
> > > > > 0 0·
> > > > > 0 0 0 0
> > > > > 0 0·
> > > > > 4 73045 23573 12·
> > > > > 0 0R 0T 0 3498607R
> > > > > 4868605T 0·
> > > > > 1 0R 0T 0 3012246R
> > > > > 3270261T 0·
> > > > > 2 0R 0T 0 2498608R
> > > > > 2839104T 0·
> > > > > 3 0R 0T 0 0R
> > > > > 1983947T 0·
> > > > > 1486579L 0O 1380614Y 2945N
> > > > > 2945F 2734A
> > > > >
> > > > > workingset_refault_anon 0
> > > > > workingset_refault_file 18130598
> > > > >
> > > > > total used free shared buff/cache available
> > > > > Mem: 31978 6705 312 20 24960 24786
> > > > > Swap: 31977 4 31973
> > > > >
> > > > > RFC:
> > > > > ==================================================================
> > > > > Execution Results after 908 seconds
> > > > > ------------------------------------------------------------------
> > > > > Executed Time (µs) Rate
> > > > > STOCK_LEVEL 2252 27159962888.2 0.08 txn/s
> > > > > ------------------------------------------------------------------
> > > > > TOTAL 2252 27159962888.2 0.08 txn/s
> > > > >
> > > > > workingset_refault_anon 22585
> > > > > workingset_refault_file 22715256
> > > > >
> > > > > memcg 66 /system.slice/docker-0989446ff78106e32d3f400a0cf371c9a703281bded86d6d6bb1af706ebb25da.scope
> > > > > node 0
> > > > > 22 563007 2274 1198225·
> > > > > 0 0r 1e 0p 0r
> > > > > 697076e 0p
> > > > > 1 0r 0e 0p 0r
> > > > > 0e 325661p
> > > > > 2 0r 0e 0p 0r
> > > > > 0e 888728p
> > > > > 3 0r 0e 0p 0r
> > > > > 0e 3602238p
> > > > > 0 0 0 0
> > > > > 0 0·
> > > > > 23 532222 7525 4948747·
> > > > > 0 0 0 0 0
> > > > > 0 0·
> > > > > 1 0 0 0 0
> > > > > 0 0·
> > > > > 2 0 0 0 0
> > > > > 0 0·
> > > > > 3 0 0 0 0
> > > > > 0 0·
> > > > > 0 0 0 0
> > > > > 0 0·
> > > > > 24 500367 1214667 3292·
> > > > > 0 0 0 0 0
> > > > > 0 0·
> > > > > 1 0 0 0 0
> > > > > 0 0·
> > > > > 2 0 0 0 0
> > > > > 0 0·
> > > > > 3 0 0 0 0
> > > > > 0 0·
> > > > > 0 0 0 0
> > > > > 0 0·
> > > > > 25 469692 40797 466·
> > > > > 0 0R 271T 0 0R
> > > > > 1162165T 0·
> > > > > 1 0R 0T 0 774028R
> > > > > 1205332T 0·
> > > > > 2 0R 0T 0 0R
> > > > > 932484T 0·
> > > > > 3 0R 1T 0 0R
> > > > > 4252158T 0·
> > > > > 25178380L 156515O 23953602Y 59234N
> > > > > 49391F 48664A
> > > > >
> > > > > total used free shared buff/cache available
> > > > > Mem: 31978 6968 338 5 24671 24555
> > > > > Swap: 31977 1533 30444
> > > > >
> > > > > Using same mongodb config (a 3 replica cluster using the same config):
> > > > > {
> > > > > "net": {
> > > > > "bindIpAll": true,
> > > > > "ipv6": false,
> > > > > "maxIncomingConnections": 10000,
> > > > > },
> > > > > "setParameter": {
> > > > > "disabledSecureAllocatorDomains": "*"
> > > > > },
> > > > > "replication": {
> > > > > "oplogSizeMB": 10480,
> > > > > "replSetName": "issa-tpcc_0"
> > > > > },
> > > > > "security": {
> > > > > "keyFile": "/data/db/keyfile"
> > > > > },
> > > > > "storage": {
> > > > > "dbPath": "/data/db/",
> > > > > "syncPeriodSecs": 60,
> > > > > "directoryPerDB": true,
> > > > > "wiredTiger": {
> > > > > "engineConfig": {
> > > > > "cacheSizeGB": 5
> > > > > }
> > > > > }
> > > > > },
> > > > > "systemLog": {
> > > > > "destination": "file",
> > > > > "logAppend": true,
> > > > > "logRotate": "rename",
> > > > > "path": "/data/db/mongod.log",
> > > > > "verbosity": 0
> > > > > }
> > > > > }
> > > > >
> > > > > The test environment has 32 GB of memory and 16 cores.
> > > > >
> > > > > Per my analysis, the access pattern of the MongoDB test is that a page
> > > > > is re-accessed long after it's evicted, so the PID controller won't
> > > > > protect the higher tiers. The RFC makes use of the long-lived shadow
> > > > > entries to feed back into the PID controller and the generations, so
> > > > > the result is much better. It still needs more tuning though; I will
> > > > > try to rebase on top of mm-unstable, which includes your patch.
> > > > >
> > > > > I have no idea why workingset_refault_* is higher in the better case;
> > > > > this is clearly an IO-bound workload, with memory and IO busy while
> > > > > the CPU is not fully utilized...
> > > > >
> > > > > I've uploaded my local reproducer here:
> > > > > https://github.com/ryncsn/emm-test-project/tree/master/mongo-cluster
> > > > > https://github.com/ryncsn/py-tpcc
> > > >
> > > > Thanks for the repos -- I'm trying them right now. Which MongoDB
> > > > version did you use? setup.sh didn't seem to install it.
> > > >
> > > > Also do you have a QEMU image? It'd be a lot easier for me to
> > > > duplicate the exact environment by looking into it.
> > >
> > > I ended up using docker.io/mongodb/mongodb-community-server:latest,
> > > and it's not working:
> > >
> > > # docker exec -it mongo-r1 mongosh --eval \
> > > '"rs.initiate({
> > > _id: "issa-tpcc_0",
> > > members: [
> > > {_id: 0, host: "mongo-r1"},
> > > {_id: 1, host: "mongo-r2"},
> > > {_id: 2, host: "mongo-r3"}
> > > ]
> > > })"'
> > > Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
> > > Error: can only create exec sessions on running containers: container
> > > state improper
> >
> > Hi Yu,
> >
> > I've updated the test repo:
> > https://github.com/ryncsn/emm-test-project/tree/master/mongo-cluster
> >
> > I've tested it on top of the latest Fedora Cloud Image 39 and it worked
> > well for me; the README now contains detailed, easy-to-follow steps to
> > reproduce this test.
>
> Thanks. I was following the instructions down to the letter and it
> fell apart again at line 46 (./tpcc.py).

I think you just broke it by
https://github.com/ryncsn/py-tpcc/commit/7b9b380d636cb84faa5b11b5562e531f924eeb7e

(But it's also possible you actually wanted me to use this latest
commit but forgot to account for it in your instructions.)

> Were you able to successfully run the benchmark on a fresh VM by
> following the instructions? If not, I'd appreciate it if you could do
> so and document all the missing steps.

2023-12-20 08:24:58

by Kairui Song

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

Yu Zhao <[email protected]> 于2023年12月20日周三 16:17写道:
>
> On Tue, Dec 19, 2023 at 11:38 PM Yu Zhao <[email protected]> wrote:
> >
> > On Tue, Dec 19, 2023 at 11:58 AM Kairui Song <[email protected]> wrote:
> > >
> > > Yu Zhao <[email protected]> 于2023年12月19日周二 11:45写道:
> > > >
> > > > On Mon, Dec 18, 2023 at 8:21 PM Yu Zhao <[email protected]> wrote:
> > > > >
> > > > > On Mon, Dec 18, 2023 at 11:05 AM Kairui Song <[email protected]> wrote:
> > > > > >
> > > > > > Yu Zhao <[email protected]> 于2023年12月15日周五 12:56写道:
> > > > > > >
> > > > > > > On Thu, Dec 14, 2023 at 04:51:00PM -0700, Yu Zhao wrote:
> > > > > > > > On Thu, Dec 14, 2023 at 11:38 AM Kairui Song <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > Yu Zhao <[email protected]> 于2023年12月14日周四 11:09写道:
> > > > > > > > > > On Wed, Dec 13, 2023 at 12:59:14AM -0700, Yu Zhao wrote:
> > > > > > > > > > > On Tue, Dec 12, 2023 at 8:03 PM Kairui Song <[email protected]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Kairui Song <[email protected]> 于2023年12月12日周二 14:52写道:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Yu Zhao <[email protected]> 于2023年12月12日周二 06:07写道:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, Dec 8, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Yu Zhao <[email protected]> 于2023年12月8日周五 14:14写道:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Unmapped folios accessed through file descriptors can be
> > > > > > > > > > > > > > > > underprotected. Those folios are added to the oldest generation based
> > > > > > > > > > > > > > > > on:
> > > > > > > > > > > > > > > > 1. The fact that they are less costly to reclaim (no need to walk the
> > > > > > > > > > > > > > > > rmap and flush the TLB) and have less impact on performance (don't
> > > > > > > > > > > > > > > > cause major PFs and can be non-blocking if needed again).
> > > > > > > > > > > > > > > > 2. The observation that they are likely to be single-use. E.g., for
> > > > > > > > > > > > > > > > client use cases like Android, its apps parse configuration files
> > > > > > > > > > > > > > > > and store the data in heap (anon); for server use cases like MySQL,
> > > > > > > > > > > > > > > > it reads from InnoDB files and holds the cached data for tables in
> > > > > > > > > > > > > > > > buffer pools (anon).
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > However, the oldest generation can be very short lived, and if so, it
> > > > > > > > > > > > > > > > doesn't provide the PID controller with enough time to respond to a
> > > > > > > > > > > > > > > > surge of refaults. (Note that the PID controller uses weighted
> > > > > > > > > > > > > > > > refaults and those from evicted generations only take a half of the
> > > > > > > > > > > > > > > > whole weight.) In other words, for a short lived generation, the
> > > > > > > > > > > > > > > > moving average smooths out the spike quickly.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > To fix the problem:
> > > > > > > > > > > > > > > > 1. For folios that are already on LRU, if they can be beyond the
> > > > > > > > > > > > > > > > tracking range of tiers, i.e., five accesses through file
> > > > > > > > > > > > > > > > descriptors, move them to the second oldest generation to give them
> > > > > > > > > > > > > > > > more time to age. (Note that tiers are used by the PID controller
> > > > > > > > > > > > > > > > to statistically determine whether folios accessed multiple times
> > > > > > > > > > > > > > > > through file descriptors are worth protecting.)
> > > > > > > > > > > > > > > > 2. When adding unmapped folios to LRU, adjust the placement of them so
> > > > > > > > > > > > > > > > that they are not too close to the tail. The effect of this is
> > > > > > > > > > > > > > > > similar to the above.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Android, launching 55 apps sequentially:
> > > > > > > > > > > > > > > > Before After Change
> > > > > > > > > > > > > > > > workingset_refault_anon 25641024 25598972 0%
> > > > > > > > > > > > > > > > workingset_refault_file 115016834 106178438 -8%
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Yu,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks you for your amazing works on MGLRU.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I believe this is the similar issue I was trying to resolve previously:
> > > > > > > > > > > > > > > https://lwn.net/Articles/945266/
> > > > > > > > > > > > > > > The idea is to use refault distance to decide if the page should be
> > > > > > > > > > > > > > > place in oldest generation or some other gen, which per my test,
> > > > > > > > > > > > > > > worked very well, and we have been using refault distance for MGLRU in
> > > > > > > > > > > > > > > multiple workloads.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > There are a few issues left in my previous RFC series, like anon pages
> > > > > > > > > > > > > > > in MGLRU shouldn't be considered, I wanted to collect feedback or test
> > > > > > > > > > > > > > > cases, but unfortunately it seems didn't get too much attention
> > > > > > > > > > > > > > > upstream.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I think both this patch and my previous series are for solving the
> > > > > > > > > > > > > > > file pages underpertected issue, and I did a quick test using this
> > > > > > > > > > > > > > > series, for mongodb test, refault distance seems still a better
> > > > > > > > > > > > > > > solution (I'm not saying these two optimization are mutually exclusive
> > > > > > > > > > > > > > > though, just they do have some conflicts in implementation and solving
> > > > > > > > > > > > > > > similar problem):
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Previous result:
> > > > > > > > > > > > > > > ==================================================================
> > > > > > > > > > > > > > > Execution Results after 905 seconds
> > > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > > Executed Time (µs) Rate
> > > > > > > > > > > > > > > STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
> > > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > > TOTAL 2542 27121571486.2 0.09 txn/s
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > This patch:
> > > > > > > > > > > > > > > ==================================================================
> > > > > > > > > > > > > > > Execution Results after 900 seconds
> > > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > > Executed Time (µs) Rate
> > > > > > > > > > > > > > > STOCK_LEVEL 1594 27061522574.4 0.06 txn/s
> > > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > > TOTAL 1594 27061522574.4 0.06 txn/s
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Unpatched version is always around ~500.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks for the test results!
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I think there are a few points here:
> > > > > > > > > > > > > > > - Refault distance make use of page shadow so it can better
> > > > > > > > > > > > > > > distinguish evicted pages of different access pattern (re-access
> > > > > > > > > > > > > > > distance).
> > > > > > > > > > > > > > > - Throttled refault distance can help hold part of workingset when
> > > > > > > > > > > > > > > memory is too small to hold the whole workingset.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > So maybe part of this patch and the bits of previous series can be
> > > > > > > > > > > > > > > combined to work better on this issue, how do you think?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I'll try to find some time this week to look at your RFC. It'd be a
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Yu,
> > > > > > > > > > > >
> > > > > > > > > > > > I'm working on V4 of the RFC now, which just update some comments, and
> > > > > > > > > > > > skip anon page re-activation in refault path for mglru which was not
> > > > > > > > > > > > very helpful, only some tiny adjustment.
> > > > > > > > > > > > And I found it easier to test with fio, using following test script:
> > > > > > > > > > > >
> > > > > > > > > > > > #!/bin/bash
> > > > > > > > > > > > swapoff -a
> > > > > > > > > > > >
> > > > > > > > > > > > modprobe brd rd_nr=1 rd_size=16777216
> > > > > > > > > > > > mkfs.ext4 /dev/ram0
> > > > > > > > > > > > mount /dev/ram0 /mnt
> > > > > > > > > > > >
> > > > > > > > > > > > mkdir -p /sys/fs/cgroup/benchmark
> > > > > > > > > > > > cd /sys/fs/cgroup/benchmark
> > > > > > > > > > > >
> > > > > > > > > > > > echo 4G > memory.max
> > > > > > > > > > > > echo $$ > cgroup.procs
> > > > > > > > > > > > echo 3 > /proc/sys/vm/drop_caches
> > > > > > > > > > > >
> > > > > > > > > > > > fio -name=mglru --numjobs=12 --directory=/mnt --size=1024m \
> > > > > > > > > > > > --buffered=1 --ioengine=io_uring --iodepth=128 \
> > > > > > > > > > > > --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
> > > > > > > > > > > > --rw=randread --random_distribution=zipf:0.5 --norandommap \
> > > > > > > > > > > > --time_based --ramp_time=5m --runtime=5m --group_reporting
> > > > > > > > > > > >
> > > > > > > > > > > > zipf:0.5 is used here to simulate a cached read with slight bias
> > > > > > > > > > > > towards certain pages.
> > > > > > > > > > > > Unpatched 6.7-rc4:
> > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > READ: bw=6548MiB/s (6866MB/s), 6548MiB/s-6548MiB/s
> > > > > > > > > > > > (6866MB/s-6866MB/s), io=1918GiB (2060GB), run=300001-300001msec
> > > > > > > > > > > >
> > > > > > > > > > > > Patched with RFC v4:
> > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > READ: bw=7270MiB/s (7623MB/s), 7270MiB/s-7270MiB/s
> > > > > > > > > > > > (7623MB/s-7623MB/s), io=2130GiB (2287GB), run=300001-300001msec
> > > > > > > > > > > >
> > > > > > > > > > > > Patched with this series:
> > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > READ: bw=7098MiB/s (7442MB/s), 7098MiB/s-7098MiB/s
> > > > > > > > > > > > (7442MB/s-7442MB/s), io=2079GiB (2233GB), run=300002-300002msec
> > > > > > > > > > > >
> > > > > > > > > > > > MGLRU off:
> > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > READ: bw=6525MiB/s (6842MB/s), 6525MiB/s-6525MiB/s
> > > > > > > > > > > > (6842MB/s-6842MB/s), io=1912GiB (2052GB), run=300002-300002msec
> > > > > > > > > > > >
> > > > > > > > > > > > - If I change zipf:0.5 to random:
> > > > > > > > > > > > Unpatched 6.7-rc4:
> > > > > > > > > > > > Patched with this series:
> > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > READ: bw=5975MiB/s (6265MB/s), 5975MiB/s-5975MiB/s
> > > > > > > > > > > > (6265MB/s-6265MB/s), io=1750GiB (1879GB), run=300002-300002msec
> > > > > > > > > > > >
> > > > > > > > > > > > Patched with RFC v4:
> > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > READ: bw=5987MiB/s (6278MB/s), 5987MiB/s-5987MiB/s
> > > > > > > > > > > > (6278MB/s-6278MB/s), io=1754GiB (1883GB), run=300001-300001msec
> > > > > > > > > > > >
> > > > > > > > > > > > Patched with this series:
> > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > READ: bw=5839MiB/s (6123MB/s), 5839MiB/s-5839MiB/s
> > > > > > > > > > > > (6123MB/s-6123MB/s), io=1711GiB (1837GB), run=300001-300001msec
> > > > > > > > > > > >
> > > > > > > > > > > > MGLRU off:
> > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > READ: bw=5689MiB/s (5965MB/s), 5689MiB/s-5689MiB/s
> > > > > > > > > > > > (5965MB/s-5965MB/s), io=1667GiB (1790GB), run=300003-300003msec
> > > > > > > > > > > >
> > > > > > > > > > > > fio uses ramdisk so LRU accuracy will have smaller impact. The Mongodb
> > > > > > > > > > > > test I provided before uses a SATA SSD so it will have a much higher
> > > > > > > > > > > > impact. I'll provides a script to setup the test case and run it, it's
> > > > > > > > > > > > more complex to setup than fio since involving setting up multiple
> > > > > > > > > > > > replicas and auth and hundreds of GB of test fixtures, I'm currently
> > > > > > > > > > > > occupied by some other tasks but will try best to send them out as
> > > > > > > > > > > > soon as possible.
> > > > > > > > > > >
> > > > > > > > > > > Thanks! Apparently your RFC did show better IOPS with both access
> > > > > > > > > > > patterns, which was a surprise to me because it had higher refaults
> > > > > > > > > > > and usually higher refautls result in worse performance.
> > > > > > >
> > > > > > > And thanks for providing the refaults I requested for -- your data
> > > > > > > below confirms what I mentioned above:
> > > > > > >
> > > > > > > For fio:
> > > > > > > Your RFC This series Change
> > > > > > > workingset_refault_file 628192729 596790506 -5%
> > > > > > > IOPS 1862k 1830k -2%
> > > > > > >
> > > > > > > For MongoDB:
> > > > > > > Your RFC This series Change
> > > > > > > workingset_refault_anon 10512 35277 +30%
> > > > > > > workingset_refault_file 22751782 20335355 -11%
> > > > > > > total 22762294 20370632 -11%
> > > > > > > TPS 0.09 0.06 -33%
> > > > > > >
> > > > > > > For MongoDB, this series should be a big win (but apparently it's not),
> > > > > > > especially when using zram, since an anon refault should be a lot
> > > > > > > cheaper than a file refault.
> > > > > > >
> > > > > > > So, I'm baffled...
> > > > > > >
> > > > > > > One important detail I forgot to mention: based on your data from
> > > > > > > lru_gen_full, I think there is another difference between our Kconfigs:
> > > > > > >
> > > > > > > Your Kconfig My Kconfig Max possible
> > > > > > > LRU_REFS_WIDTH 1 2 2
> > > > > >
> > > > > > Hi Yu,
> > > > > >
> > > > > > Thanks for the info, my fault, I forgot to update my config as I was
> > > > > > testing some other features.
> > > > > > Buf after I changed LRU_REFS_WIDTH to 2 by disabling IDLE_PAGE, thing
> > > > > > got much worse for MongoDB test:
> > > > > >
> > > > > > With LRU_REFS_WIDTH == 2:
> > > > > >
> > > > > > This patch:
> > > > > > ==================================================================
> > > > > > Execution Results after 919 seconds
> > > > > > ------------------------------------------------------------------
> > > > > > Executed Time (µs) Rate
> > > > > > STOCK_LEVEL 488 27598136201.9 0.02 txn/s
> > > > > > ------------------------------------------------------------------
> > > > > > TOTAL 488 27598136201.9 0.02 txn/s
> > > > > >
> > > > > > memcg 86 /system.slice/docker-1c3a90be9f0a072f5719332419550cd0e1455f2cd5863bc2780ca4d3f913ece5.scope
> > > > > > node 0
> > > > > >          1     948187          0x          0x
> > > > > >             0          0           0           0           0           0           0
> > > > > >             1          0           0           0           0           0           0
> > > > > >             2          0           0           0           0           0           0
> > > > > >             3          0           0           0           0           0           0
> > > > > >                        0           0           0           0           0           0
> > > > > >          2     948187           0     6051788
> > > > > >             0          0r          0e          0p      11916r      66442e          0p
> > > > > >             1          0r          0e          0p        903r      16888e          0p
> > > > > >             2          0r          0e          0p        459r       9764e          0p
> > > > > >             3          0r          0e          0p          0r          0e       2874p
> > > > > >                        0           0           0           0           0           0
> > > > > >          3     948187     1353160        6351
> > > > > >             0          0           0           0           0           0           0
> > > > > >             1          0           0           0           0           0           0
> > > > > >             2          0           0           0           0           0           0
> > > > > >             3          0           0           0           0           0           0
> > > > > >                        0           0           0           0           0           0
> > > > > >          4      73045       23573          12
> > > > > >             0          0R          0T          0     3498607R    4868605T          0
> > > > > >             1          0R          0T          0     3012246R    3270261T          0
> > > > > >             2          0R          0T          0     2498608R    2839104T          0
> > > > > >             3          0R          0T          0           0R    1983947T          0
> > > > > >                  1486579L          0O    1380614Y       2945N       2945F       2734A
> > > > > >
> > > > > > workingset_refault_anon 0
> > > > > > workingset_refault_file 18130598
> > > > > >
> > > > > > total used free shared buff/cache available
> > > > > > Mem: 31978 6705 312 20 24960 24786
> > > > > > Swap: 31977 4 31973
> > > > > >
> > > > > > RFC:
> > > > > > ==================================================================
> > > > > > Execution Results after 908 seconds
> > > > > > ------------------------------------------------------------------
> > > > > > Executed Time (µs) Rate
> > > > > > STOCK_LEVEL 2252 27159962888.2 0.08 txn/s
> > > > > > ------------------------------------------------------------------
> > > > > > TOTAL 2252 27159962888.2 0.08 txn/s
> > > > > >
> > > > > > workingset_refault_anon 22585
> > > > > > workingset_refault_file 22715256
> > > > > >
> > > > > > memcg 66 /system.slice/docker-0989446ff78106e32d3f400a0cf371c9a703281bded86d6d6bb1af706ebb25da.scope
> > > > > > node 0
> > > > > >         22     563007        2274     1198225
> > > > > >             0          0r          1e          0p          0r     697076e          0p
> > > > > >             1          0r          0e          0p          0r          0e     325661p
> > > > > >             2          0r          0e          0p          0r          0e     888728p
> > > > > >             3          0r          0e          0p          0r          0e    3602238p
> > > > > >                        0           0           0           0           0           0
> > > > > >         23     532222        7525     4948747
> > > > > >             0          0           0           0           0           0           0
> > > > > >             1          0           0           0           0           0           0
> > > > > >             2          0           0           0           0           0           0
> > > > > >             3          0           0           0           0           0           0
> > > > > >                        0           0           0           0           0           0
> > > > > >         24     500367     1214667        3292
> > > > > >             0          0           0           0           0           0           0
> > > > > >             1          0           0           0           0           0           0
> > > > > >             2          0           0           0           0           0           0
> > > > > >             3          0           0           0           0           0           0
> > > > > >                        0           0           0           0           0           0
> > > > > >         25     469692       40797         466
> > > > > >             0          0R        271T          0          0R    1162165T          0
> > > > > >             1          0R          0T          0     774028R    1205332T          0
> > > > > >             2          0R          0T          0          0R     932484T          0
> > > > > >             3          0R          1T          0          0R    4252158T          0
> > > > > >                 25178380L     156515O   23953602Y      59234N      49391F      48664A
> > > > > >
> > > > > > total used free shared buff/cache available
> > > > > > Mem: 31978 6968 338 5 24671 24555
> > > > > > Swap: 31977 1533 30444
> > > > > >
> > > > > > Using the same MongoDB config (a 3-replica cluster, every replica using the config below):
> > > > > > {
> > > > > > "net": {
> > > > > > "bindIpAll": true,
> > > > > > "ipv6": false,
> > > > > > "maxIncomingConnections": 10000,
> > > > > > },
> > > > > > "setParameter": {
> > > > > > "disabledSecureAllocatorDomains": "*"
> > > > > > },
> > > > > > "replication": {
> > > > > > "oplogSizeMB": 10480,
> > > > > > "replSetName": "issa-tpcc_0"
> > > > > > },
> > > > > > "security": {
> > > > > > "keyFile": "/data/db/keyfile"
> > > > > > },
> > > > > > "storage": {
> > > > > > "dbPath": "/data/db/",
> > > > > > "syncPeriodSecs": 60,
> > > > > > "directoryPerDB": true,
> > > > > > "wiredTiger": {
> > > > > > "engineConfig": {
> > > > > > "cacheSizeGB": 5
> > > > > > }
> > > > > > }
> > > > > > },
> > > > > > "systemLog": {
> > > > > > "destination": "file",
> > > > > > "logAppend": true,
> > > > > > "logRotate": "rename",
> > > > > > "path": "/data/db/mongod.log",
> > > > > > "verbosity": 0
> > > > > > }
> > > > > > }
> > > > > >
> > > > > > The test environment has 32 GB of memory and 16 cores.
> > > > > >
> > > > > > Per my analysis, the access pattern of the MongoDB test is that a page
> > > > > > is re-accessed long after it's evicted, so the PID controller won't
> > > > > > protect the higher tiers. The RFC makes use of the long-lived shadow
> > > > > > entries to feed back into the PID controller and the generations, so
> > > > > > the result is much better. It still needs more tuning though; I will
> > > > > > try to rebase on top of mm-unstable, which includes your patch.
> > > > > >
> > > > > > I have no idea why workingset_refault_* is higher in the better case;
> > > > > > this is clearly an IO-bound workload, with memory and IO busy while
> > > > > > the CPU is not fully utilized...
> > > > > >
> > > > > > I've uploaded my local reproducer here:
> > > > > > https://github.com/ryncsn/emm-test-project/tree/master/mongo-cluster
> > > > > > https://github.com/ryncsn/py-tpcc
> > > > >
> > > > > Thanks for the repos -- I'm trying them right now. Which MongoDB
> > > > > version did you use? setup.sh didn't seem to install it.
> > > > >
> > > > > Also do you have a QEMU image? It'd be a lot easier for me to
> > > > > duplicate the exact environment by looking into it.
> > > >
> > > > I ended up using docker.io/mongodb/mongodb-community-server:latest,
> > > > and it's not working:
> > > >
> > > > # docker exec -it mongo-r1 mongosh --eval \
> > > > '"rs.initiate({
> > > > _id: "issa-tpcc_0",
> > > > members: [
> > > > {_id: 0, host: "mongo-r1"},
> > > > {_id: 1, host: "mongo-r2"},
> > > > {_id: 2, host: "mongo-r3"}
> > > > ]
> > > > })"'
> > > > Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
> > > > Error: can only create exec sessions on running containers: container
> > > > state improper
> > >
> > > Hi Yu,
> > >
> > > I've updated the test repo:
> > > https://github.com/ryncsn/emm-test-project/tree/master/mongo-cluster
> > >
> > > I've tested it on top of the latest Fedora Cloud Image 39 and it worked
> > > well for me; the README now contains detailed, easy-to-follow steps to
> > > reproduce this test.
> >
> > Thanks. I was following the instructions down to the letter and it
> > fell apart again at line 46 (./tpcc.py).
>
> I think you just broke it by
> https://github.com/ryncsn/py-tpcc/commit/7b9b380d636cb84faa5b11b5562e531f924eeb7e
>
> (But it's also possible you actually wanted me to use this latest
> commit but forgot to account for it in your instructions.)
>
> > Were you able to successfully run the benchmark on a fresh VM by
> > following the instructions? If not, I'd appreciate it if you could do
> > so and document all the missing steps.

Ah, you are right. I attempted to convert it to Python 3 but found it
only brought more trouble, so I gave up, and the instructions still
use Python 2. However, I accidentally pushed the WIP Python 3 conversion
commit... I've reset the repo to
https://github.com/ryncsn/py-tpcc/commit/86e862c5cf3b2d1f51e0297742fa837c7a99ebf8,
which is working well. Sorry for the inconvenience.

2023-12-25 06:31:10

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

On Wed, Dec 20, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
>
> Yu Zhao <[email protected]> 于2023年12月20日周三 16:17写道:
> >
> > On Tue, Dec 19, 2023 at 11:38 PM Yu Zhao <[email protected]> wrote:
> > >
> > > On Tue, Dec 19, 2023 at 11:58 AM Kairui Song <[email protected]> wrote:
> > > >
> > > > Yu Zhao <[email protected]> 于2023年12月19日周二 11:45写道:
> > > > >
> > > > > On Mon, Dec 18, 2023 at 8:21 PM Yu Zhao <[email protected]> wrote:
> > > > > >
> > > > > > On Mon, Dec 18, 2023 at 11:05 AM Kairui Song <[email protected]> wrote:
> > > > > > >
> > > > > > > Yu Zhao <[email protected]> 于2023年12月15日周五 12:56写道:
> > > > > > > >
> > > > > > > > On Thu, Dec 14, 2023 at 04:51:00PM -0700, Yu Zhao wrote:
> > > > > > > > > On Thu, Dec 14, 2023 at 11:38 AM Kairui Song <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > Yu Zhao <[email protected]> 于2023年12月14日周四 11:09写道:
> > > > > > > > > > > On Wed, Dec 13, 2023 at 12:59:14AM -0700, Yu Zhao wrote:
> > > > > > > > > > > > On Tue, Dec 12, 2023 at 8:03 PM Kairui Song <[email protected]> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Kairui Song <[email protected]> 于2023年12月12日周二 14:52写道:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Yu Zhao <[email protected]> 于2023年12月12日周二 06:07写道:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Fri, Dec 8, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Yu Zhao <[email protected]> 于2023年12月8日周五 14:14写道:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Unmapped folios accessed through file descriptors can be
> > > > > > > > > > > > > > > > > underprotected. Those folios are added to the oldest generation based
> > > > > > > > > > > > > > > > > on:
> > > > > > > > > > > > > > > > > 1. The fact that they are less costly to reclaim (no need to walk the
> > > > > > > > > > > > > > > > > rmap and flush the TLB) and have less impact on performance (don't
> > > > > > > > > > > > > > > > > cause major PFs and can be non-blocking if needed again).
> > > > > > > > > > > > > > > > > 2. The observation that they are likely to be single-use. E.g., for
> > > > > > > > > > > > > > > > > client use cases like Android, its apps parse configuration files
> > > > > > > > > > > > > > > > > and store the data in heap (anon); for server use cases like MySQL,
> > > > > > > > > > > > > > > > > it reads from InnoDB files and holds the cached data for tables in
> > > > > > > > > > > > > > > > > buffer pools (anon).
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > However, the oldest generation can be very short lived, and if so, it
> > > > > > > > > > > > > > > > > doesn't provide the PID controller with enough time to respond to a
> > > > > > > > > > > > > > > > > surge of refaults. (Note that the PID controller uses weighted
> > > > > > > > > > > > > > > > > refaults and those from evicted generations only take a half of the
> > > > > > > > > > > > > > > > > whole weight.) In other words, for a short lived generation, the
> > > > > > > > > > > > > > > > > moving average smooths out the spike quickly.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > To fix the problem:
> > > > > > > > > > > > > > > > > 1. For folios that are already on LRU, if they can be beyond the
> > > > > > > > > > > > > > > > > tracking range of tiers, i.e., five accesses through file
> > > > > > > > > > > > > > > > > descriptors, move them to the second oldest generation to give them
> > > > > > > > > > > > > > > > > more time to age. (Note that tiers are used by the PID controller
> > > > > > > > > > > > > > > > > to statistically determine whether folios accessed multiple times
> > > > > > > > > > > > > > > > > through file descriptors are worth protecting.)
> > > > > > > > > > > > > > > > > 2. When adding unmapped folios to LRU, adjust the placement of them so
> > > > > > > > > > > > > > > > > that they are not too close to the tail. The effect of this is
> > > > > > > > > > > > > > > > > similar to the above.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Android, launching 55 apps sequentially:
> > > > > > > > > > > > > > > > > Before After Change
> > > > > > > > > > > > > > > > > workingset_refault_anon 25641024 25598972 0%
> > > > > > > > > > > > > > > > > workingset_refault_file 115016834 106178438 -8%
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi Yu,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks you for your amazing works on MGLRU.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I believe this is the similar issue I was trying to resolve previously:
> > > > > > > > > > > > > > > > https://lwn.net/Articles/945266/
> > > > > > > > > > > > > > > > The idea is to use refault distance to decide if the page should be
> > > > > > > > > > > > > > > > place in oldest generation or some other gen, which per my test,
> > > > > > > > > > > > > > > > worked very well, and we have been using refault distance for MGLRU in
> > > > > > > > > > > > > > > > multiple workloads.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > There are a few issues left in my previous RFC series, like anon pages
> > > > > > > > > > > > > > > > in MGLRU shouldn't be considered, I wanted to collect feedback or test
> > > > > > > > > > > > > > > > cases, but unfortunately it seems didn't get too much attention
> > > > > > > > > > > > > > > > upstream.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I think both this patch and my previous series are for solving the
> > > > > > > > > > > > > > > > file pages underpertected issue, and I did a quick test using this
> > > > > > > > > > > > > > > > series, for mongodb test, refault distance seems still a better
> > > > > > > > > > > > > > > > solution (I'm not saying these two optimization are mutually exclusive
> > > > > > > > > > > > > > > > though, just they do have some conflicts in implementation and solving
> > > > > > > > > > > > > > > > similar problem):
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Previous result:
> > > > > > > > > > > > > > > > ==================================================================
> > > > > > > > > > > > > > > > Execution Results after 905 seconds
> > > > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > > > Executed Time (µs) Rate
> > > > > > > > > > > > > > > > STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
> > > > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > > > TOTAL 2542 27121571486.2 0.09 txn/s
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > This patch:
> > > > > > > > > > > > > > > > ==================================================================
> > > > > > > > > > > > > > > > Execution Results after 900 seconds
> > > > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > > > Executed Time (µs) Rate
> > > > > > > > > > > > > > > > STOCK_LEVEL 1594 27061522574.4 0.06 txn/s
> > > > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > > > TOTAL 1594 27061522574.4 0.06 txn/s
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Unpatched version is always around ~500.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks for the test results!
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I think there are a few points here:
> > > > > > > > > > > > > > > > - Refault distance make use of page shadow so it can better
> > > > > > > > > > > > > > > > distinguish evicted pages of different access pattern (re-access
> > > > > > > > > > > > > > > > distance).
> > > > > > > > > > > > > > > > - Throttled refault distance can help hold part of workingset when
> > > > > > > > > > > > > > > > memory is too small to hold the whole workingset.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > So maybe part of this patch and the bits of previous series can be
> > > > > > > > > > > > > > > > combined to work better on this issue, how do you think?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I'll try to find some time this week to look at your RFC. It'd be a
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi Yu,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I'm working on V4 of the RFC now, which just update some comments, and
> > > > > > > > > > > > > skip anon page re-activation in refault path for mglru which was not
> > > > > > > > > > > > > very helpful, only some tiny adjustment.
> > > > > > > > > > > > > And I found it easier to test with fio, using following test script:
> > > > > > > > > > > > >
> > > > > > > > > > > > > #!/bin/bash
> > > > > > > > > > > > > swapoff -a
> > > > > > > > > > > > >
> > > > > > > > > > > > > modprobe brd rd_nr=1 rd_size=16777216
> > > > > > > > > > > > > mkfs.ext4 /dev/ram0
> > > > > > > > > > > > > mount /dev/ram0 /mnt
> > > > > > > > > > > > >
> > > > > > > > > > > > > mkdir -p /sys/fs/cgroup/benchmark
> > > > > > > > > > > > > cd /sys/fs/cgroup/benchmark
> > > > > > > > > > > > >
> > > > > > > > > > > > > echo 4G > memory.max
> > > > > > > > > > > > > echo $$ > cgroup.procs
> > > > > > > > > > > > > echo 3 > /proc/sys/vm/drop_caches
> > > > > > > > > > > > >
> > > > > > > > > > > > > fio -name=mglru --numjobs=12 --directory=/mnt --size=1024m \
> > > > > > > > > > > > > --buffered=1 --ioengine=io_uring --iodepth=128 \
> > > > > > > > > > > > > --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
> > > > > > > > > > > > > --rw=randread --random_distribution=zipf:0.5 --norandommap \
> > > > > > > > > > > > > --time_based --ramp_time=5m --runtime=5m --group_reporting
> > > > > > > > > > > > >
> > > > > > > > > > > > > zipf:0.5 is used here to simulate a cached read with slight bias
> > > > > > > > > > > > > towards certain pages.
> > > > > > > > > > > > > Unpatched 6.7-rc4:
> > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > READ: bw=6548MiB/s (6866MB/s), 6548MiB/s-6548MiB/s
> > > > > > > > > > > > > (6866MB/s-6866MB/s), io=1918GiB (2060GB), run=300001-300001msec
> > > > > > > > > > > > >
> > > > > > > > > > > > > Patched with RFC v4:
> > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > READ: bw=7270MiB/s (7623MB/s), 7270MiB/s-7270MiB/s
> > > > > > > > > > > > > (7623MB/s-7623MB/s), io=2130GiB (2287GB), run=300001-300001msec
> > > > > > > > > > > > >
> > > > > > > > > > > > > Patched with this series:
> > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > READ: bw=7098MiB/s (7442MB/s), 7098MiB/s-7098MiB/s
> > > > > > > > > > > > > (7442MB/s-7442MB/s), io=2079GiB (2233GB), run=300002-300002msec
> > > > > > > > > > > > >
> > > > > > > > > > > > > MGLRU off:
> > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > READ: bw=6525MiB/s (6842MB/s), 6525MiB/s-6525MiB/s
> > > > > > > > > > > > > (6842MB/s-6842MB/s), io=1912GiB (2052GB), run=300002-300002msec
> > > > > > > > > > > > >
> > > > > > > > > > > > > - If I change zipf:0.5 to random:
> > > > > > > > > > > > > Unpatched 6.7-rc4:
> > > > > > > > > > > > > Patched with this series:
> > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > READ: bw=5975MiB/s (6265MB/s), 5975MiB/s-5975MiB/s
> > > > > > > > > > > > > (6265MB/s-6265MB/s), io=1750GiB (1879GB), run=300002-300002msec
> > > > > > > > > > > > >
> > > > > > > > > > > > > Patched with RFC v4:
> > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > READ: bw=5987MiB/s (6278MB/s), 5987MiB/s-5987MiB/s
> > > > > > > > > > > > > (6278MB/s-6278MB/s), io=1754GiB (1883GB), run=300001-300001msec
> > > > > > > > > > > > >
> > > > > > > > > > > > > Patched with this series:
> > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > READ: bw=5839MiB/s (6123MB/s), 5839MiB/s-5839MiB/s
> > > > > > > > > > > > > (6123MB/s-6123MB/s), io=1711GiB (1837GB), run=300001-300001msec
> > > > > > > > > > > > >
> > > > > > > > > > > > > MGLRU off:
> > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > READ: bw=5689MiB/s (5965MB/s), 5689MiB/s-5689MiB/s
> > > > > > > > > > > > > (5965MB/s-5965MB/s), io=1667GiB (1790GB), run=300003-300003msec
> > > > > > > > > > > > >
> > > > > > > > > > > > > fio uses ramdisk so LRU accuracy will have smaller impact. The Mongodb
> > > > > > > > > > > > > test I provided before uses a SATA SSD so it will have a much higher
> > > > > > > > > > > > > impact. I'll provides a script to setup the test case and run it, it's
> > > > > > > > > > > > > more complex to setup than fio since involving setting up multiple
> > > > > > > > > > > > > replicas and auth and hundreds of GB of test fixtures, I'm currently
> > > > > > > > > > > > > occupied by some other tasks but will try best to send them out as
> > > > > > > > > > > > > soon as possible.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks! Apparently your RFC did show better IOPS with both access
> > > > > > > > > > > > patterns, which was a surprise to me because it had higher refaults
> > > > > > > > > > > > and usually higher refautls result in worse performance.
> > > > > > > >
> > > > > > > > And thanks for providing the refaults I requested for -- your data
> > > > > > > > below confirms what I mentioned above:
> > > > > > > >
> > > > > > > > For fio:
> > > > > > > > Your RFC This series Change
> > > > > > > > workingset_refault_file 628192729 596790506 -5%
> > > > > > > > IOPS 1862k 1830k -2%
> > > > > > > >
> > > > > > > > For MongoDB:
> > > > > > > > Your RFC This series Change
> > > > > > > > workingset_refault_anon 10512 35277 +30%
> > > > > > > > workingset_refault_file 22751782 20335355 -11%
> > > > > > > > total 22762294 20370632 -11%
> > > > > > > > TPS 0.09 0.06 -33%
> > > > > > > >
> > > > > > > > For MongoDB, this series should be a big win (but apparently it's not),
> > > > > > > > especially when using zram, since an anon refault should be a lot
> > > > > > > > cheaper than a file refault.
> > > > > > > >
> > > > > > > > So, I'm baffled...
> > > > > > > >
> > > > > > > > One important detail I forgot to mention: based on your data from
> > > > > > > > lru_gen_full, I think there is another difference between our Kconfigs:
> > > > > > > >
> > > > > > > > Your Kconfig My Kconfig Max possible
> > > > > > > > LRU_REFS_WIDTH 1 2 2
> > > > > > >
> > > > > > > Hi Yu,
> > > > > > >
> > > > > > > Thanks for the info, my fault, I forgot to update my config as I was
> > > > > > > testing some other features.
> > > > > > > Buf after I changed LRU_REFS_WIDTH to 2 by disabling IDLE_PAGE, thing
> > > > > > > got much worse for MongoDB test:
> > > > > > >
> > > > > > > With LRU_REFS_WIDTH == 2:
> > > > > > >
> > > > > > > This patch:
> > > > > > > ==================================================================
> > > > > > > Execution Results after 919 seconds
> > > > > > > ------------------------------------------------------------------
> > > > > > > Executed Time (µs) Rate
> > > > > > > STOCK_LEVEL 488 27598136201.9 0.02 txn/s
> > > > > > > ------------------------------------------------------------------
> > > > > > > TOTAL 488 27598136201.9 0.02 txn/s
> > > > > > >
> > > > > > > memcg 86 /system.slice/docker-1c3a90be9f0a072f5719332419550cd0e1455f2cd5863bc2780ca4d3f913ece5.scope
> > > > > > > node 0
> > > > > > >          1     948187          0x          0x
> > > > > > >             0          0           0           0           0           0           0
> > > > > > >             1          0           0           0           0           0           0
> > > > > > >             2          0           0           0           0           0           0
> > > > > > >             3          0           0           0           0           0           0
> > > > > > >                        0           0           0           0           0           0
> > > > > > >          2     948187           0     6051788
> > > > > > >             0          0r          0e          0p      11916r      66442e          0p
> > > > > > >             1          0r          0e          0p        903r      16888e          0p
> > > > > > >             2          0r          0e          0p        459r       9764e          0p
> > > > > > >             3          0r          0e          0p          0r          0e       2874p
> > > > > > >                        0           0           0           0           0           0
> > > > > > >          3     948187     1353160        6351
> > > > > > >             0          0           0           0           0           0           0
> > > > > > >             1          0           0           0           0           0           0
> > > > > > >             2          0           0           0           0           0           0
> > > > > > >             3          0           0           0           0           0           0
> > > > > > >                        0           0           0           0           0           0
> > > > > > >          4      73045       23573          12
> > > > > > >             0          0R          0T          0     3498607R    4868605T          0
> > > > > > >             1          0R          0T          0     3012246R    3270261T          0
> > > > > > >             2          0R          0T          0     2498608R    2839104T          0
> > > > > > >             3          0R          0T          0           0R    1983947T          0
> > > > > > >                  1486579L          0O    1380614Y       2945N       2945F       2734A
> > > > > > >
> > > > > > > workingset_refault_anon 0
> > > > > > > workingset_refault_file 18130598
> > > > > > >
> > > > > > > total used free shared buff/cache available
> > > > > > > Mem: 31978 6705 312 20 24960 24786
> > > > > > > Swap: 31977 4 31973
> > > > > > >
> > > > > > > RFC:
> > > > > > > ==================================================================
> > > > > > > Execution Results after 908 seconds
> > > > > > > ------------------------------------------------------------------
> > > > > > > Executed Time (µs) Rate
> > > > > > > STOCK_LEVEL 2252 27159962888.2 0.08 txn/s
> > > > > > > ------------------------------------------------------------------
> > > > > > > TOTAL 2252 27159962888.2 0.08 txn/s
> > > > > > >
> > > > > > > workingset_refault_anon 22585
> > > > > > > workingset_refault_file 22715256
> > > > > > >
> > > > > > > memcg 66 /system.slice/docker-0989446ff78106e32d3f400a0cf371c9a703281bded86d6d6bb1af706ebb25da.scope
> > > > > > > node 0
> > > > > > >         22     563007        2274     1198225
> > > > > > >             0          0r          1e          0p          0r     697076e          0p
> > > > > > >             1          0r          0e          0p          0r          0e     325661p
> > > > > > >             2          0r          0e          0p          0r          0e     888728p
> > > > > > >             3          0r          0e          0p          0r          0e    3602238p
> > > > > > >                        0           0           0           0           0           0
> > > > > > >         23     532222        7525     4948747
> > > > > > >             0          0           0           0           0           0           0
> > > > > > >             1          0           0           0           0           0           0
> > > > > > >             2          0           0           0           0           0           0
> > > > > > >             3          0           0           0           0           0           0
> > > > > > >                        0           0           0           0           0           0
> > > > > > >         24     500367     1214667        3292
> > > > > > >             0          0           0           0           0           0           0
> > > > > > >             1          0           0           0           0           0           0
> > > > > > >             2          0           0           0           0           0           0
> > > > > > >             3          0           0           0           0           0           0
> > > > > > >                        0           0           0           0           0           0
> > > > > > >         25     469692       40797         466
> > > > > > >             0          0R        271T          0          0R    1162165T          0
> > > > > > >             1          0R          0T          0     774028R    1205332T          0
> > > > > > >             2          0R          0T          0          0R     932484T          0
> > > > > > >             3          0R          1T          0          0R    4252158T          0
> > > > > > >                 25178380L     156515O   23953602Y      59234N      49391F      48664A
> > > > > > >
> > > > > > > total used free shared buff/cache available
> > > > > > > Mem: 31978 6968 338 5 24671 24555
> > > > > > > Swap: 31977 1533 30444
> > > > > > >
> > > > > > > Using the same MongoDB config (a 3-replica cluster, every replica using the config below):
> > > > > > > {
> > > > > > > "net": {
> > > > > > > "bindIpAll": true,
> > > > > > > "ipv6": false,
> > > > > > > "maxIncomingConnections": 10000,
> > > > > > > },
> > > > > > > "setParameter": {
> > > > > > > "disabledSecureAllocatorDomains": "*"
> > > > > > > },
> > > > > > > "replication": {
> > > > > > > "oplogSizeMB": 10480,
> > > > > > > "replSetName": "issa-tpcc_0"
> > > > > > > },
> > > > > > > "security": {
> > > > > > > "keyFile": "/data/db/keyfile"
> > > > > > > },
> > > > > > > "storage": {
> > > > > > > "dbPath": "/data/db/",
> > > > > > > "syncPeriodSecs": 60,
> > > > > > > "directoryPerDB": true,
> > > > > > > "wiredTiger": {
> > > > > > > "engineConfig": {
> > > > > > > "cacheSizeGB": 5
> > > > > > > }
> > > > > > > }
> > > > > > > },
> > > > > > > "systemLog": {
> > > > > > > "destination": "file",
> > > > > > > "logAppend": true,
> > > > > > > "logRotate": "rename",
> > > > > > > "path": "/data/db/mongod.log",
> > > > > > > "verbosity": 0
> > > > > > > }
> > > > > > > }
> > > > > > >
> > > > > > > The test environment has 32 GB of memory and 16 cores.
> > > > > > >
> > > > > > > Per my analysis, the access pattern of the MongoDB test is that a page
> > > > > > > is re-accessed long after it's evicted, so the PID controller won't
> > > > > > > protect the higher tiers. The RFC makes use of the long-lived shadow
> > > > > > > entries to feed back into the PID controller and the generations, so
> > > > > > > the result is much better. It still needs more tuning though; I will
> > > > > > > try to rebase on top of mm-unstable, which includes your patch.
> > > > > > >
> > > > > > > I have no idea why workingset_refault_* is higher in the better case;
> > > > > > > this is clearly an IO-bound workload, with memory and IO busy while
> > > > > > > the CPU is not fully utilized...
> > > > > > >
> > > > > > > I've uploaded my local reproducer here:
> > > > > > > https://github.com/ryncsn/emm-test-project/tree/master/mongo-cluster
> > > > > > > https://github.com/ryncsn/py-tpcc
> > > > > >
> > > > > > Thanks for the repos -- I'm trying them right now. Which MongoDB
> > > > > > version did you use? setup.sh didn't seem to install it.
> > > > > >
> > > > > > Also do you have a QEMU image? It'd be a lot easier for me to
> > > > > > duplicate the exact environment by looking into it.
> > > > >
> > > > > I ended up using docker.io/mongodb/mongodb-community-server:latest,
> > > > > and it's not working:
> > > > >
> > > > > # docker exec -it mongo-r1 mongosh --eval \
> > > > > '"rs.initiate({
> > > > > _id: "issa-tpcc_0",
> > > > > members: [
> > > > > {_id: 0, host: "mongo-r1"},
> > > > > {_id: 1, host: "mongo-r2"},
> > > > > {_id: 2, host: "mongo-r3"}
> > > > > ]
> > > > > })"'
> > > > > Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
> > > > > Error: can only create exec sessions on running containers: container
> > > > > state improper
> > > >
> > > > Hi Yu,
> > > >
> > > > I've updated the test repo:
> > > > https://github.com/ryncsn/emm-test-project/tree/master/mongo-cluster
> > > >
> > > > I've tested it on top of the latest Fedora Cloud Image 39 and it worked
> > > > well for me; the README now contains detailed, easy-to-follow steps to
> > > > reproduce this test.
> > >
> > > Thanks. I was following the instructions down to the letter and it
> > > fell apart again at line 46 (./tpcc.py).
> >
> > I think you just broke it by
> > https://github.com/ryncsn/py-tpcc/commit/7b9b380d636cb84faa5b11b5562e531f924eeb7e
> >
> > (But it's also possible you actually wanted me to use this latest
> > commit but forgot to account for it in your instructions.)
> >
> > > Were you able to successfully run the benchmark on a fresh VM by
> > > following the instructions? If not, I'd appreciate it if you could do
> > > so and document all the missing steps.
>
> Ah, you are right. I attempted to convert it to Python 3 but found it
> only brought more trouble, so I gave up, and the instructions still
> use Python 2. However, I accidentally pushed the WIP Python 3 conversion
> commit... I've reset the repo to
> https://github.com/ryncsn/py-tpcc/commit/86e862c5cf3b2d1f51e0297742fa837c7a99ebf8,
> which is working well. Sorry for the inconvenience.

Thanks -- I was able to reproduce results similar to yours.

It turned out the mystery (fewer refaults but worse performance) was caused by
  13.89%  13.89%  kswapd0  [kernel.vmlinux]  [k] __list_del_entry_valid_or_report

Apparently Fedora has CONFIG_DEBUG_LIST=y by default, and after I
turned it off (the only change I made), this series showed better TPS
(I used "--duration=10800" for more reliable results):
                             v6.7-rc6      RFC [1]    change
total txns                      25024        24672       +1%
workingset_refault_anon        573668       680248      -16%
workingset_refault_file     260631976    265808452       -2%

I think this is easy to explain: this series is "lazy", i.e., it
defers the protection to eviction time, whereas your RFC tries to do
it upfront, i.e., at (re)fault time. The advantage of the former is
that it has more up-to-date information, because a folio that is hot
when it's faulted in isn't necessarily still hot later when memory
pressure kicks in. The disadvantage is that it needs to protect folios
that are still hot at eviction time by moving them to a younger
generation, which is where the slowdown happened with CONFIG_DEBUG_LIST=y.
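
To make the contrast concrete, below is a minimal, self-contained sketch of
the two placement strategies described above. It is not the actual
mm/vmscan.c code, nor the RFC's code; the struct, the thresholds and the
four-generation layout are invented purely for illustration.

/*
 * A minimal, self-contained sketch of the two placement strategies discussed
 * above. This is NOT the actual mm/vmscan.c code or the RFC's code; all
 * names, thresholds and the four-generation layout are made up to illustrate
 * the difference between deciding at eviction time and at (re)fault time.
 */
#include <stdbool.h>
#include <stdio.h>

enum { NR_GENS_SKETCH = 4, PROTECT_THRESHOLD = 2, REFAULT_CUTOFF = 1000 };

struct folio_sketch {
	int gen;	/* 0 = oldest generation, NR_GENS_SKETCH - 1 = youngest */
	int refs;	/* accesses seen through file descriptors */
};

/*
 * "Lazy" (this series): decide at eviction time, when the information is
 * most up to date, at the cost of a list move under memory pressure, i.e.,
 * the list_del()/list_add() pair that CONFIG_DEBUG_LIST makes expensive.
 */
static bool protect_at_eviction(struct folio_sketch *folio)
{
	if (folio->gen == 0 && folio->refs >= PROTECT_THRESHOLD) {
		folio->gen = 1;		/* move to the second oldest generation */
		return true;		/* spared by this round of eviction */
	}
	return false;			/* left for eviction */
}

/*
 * "Upfront" (the RFC): decide at (re)fault time from the refault distance
 * carried by the shadow entry; cheaper at reclaim time, but the folio may
 * have gone cold by the time memory pressure actually kicks in.
 */
static void place_at_refault(struct folio_sketch *folio, unsigned long refault_distance)
{
	folio->gen = refault_distance < REFAULT_CUTOFF ? NR_GENS_SKETCH - 2 : 0;
}

int main(void)
{
	struct folio_sketch lazy = { .gen = 0, .refs = 3 };
	struct folio_sketch upfront = { .gen = 0, .refs = 0 };
	bool spared = protect_at_eviction(&lazy);

	printf("lazy path: spared=%d, gen=%d\n", spared, lazy.gen);
	place_at_refault(&upfront, 500);
	printf("upfront path: gen=%d\n", upfront.gen);
	return 0;
}

The only point of the sketch is that the lazy path pays with a list move
under reclaim, which is exactly where the CONFIG_DEBUG_LIST checks were
being hit.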

(It's not really a priority for me to investigate why
__list_del_entry_valid_or_report() is so heavy. Hopefully someone else
can shed some light on it.)
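
For context, CONFIG_DEBUG_LIST makes every list deletion validate the
neighbouring pointers before the entry is unlinked, and with MGLRU that
check runs once per folio moved between generation lists. A rough userspace
rendition of the idea (not the exact lib/list_debug.c code) is sketched
below:

/*
 * A rough userspace rendition of the kind of check CONFIG_DEBUG_LIST adds
 * to every list deletion (see lib/list_debug.c). This is NOT the exact
 * kernel code; it only shows why the extra loads, compares and branches
 * add up when many folios are moved between generation lists.
 */
#include <stdbool.h>
#include <stdio.h>

struct list_head {
	struct list_head *next, *prev;
};

static bool list_del_entry_valid_sketch(struct list_head *entry)
{
	struct list_head *prev = entry->prev;
	struct list_head *next = entry->next;

	/* The kernel additionally checks for LIST_POISON values here. */
	if (prev->next != entry || next->prev != entry) {
		fprintf(stderr, "list corruption around %p\n", (void *)entry);
		return false;
	}
	return true;
}

int main(void)
{
	struct list_head a, b, c;

	/* Build a tiny circular list: a <-> b <-> c <-> a. */
	a.next = &b; b.prev = &a;
	b.next = &c; c.prev = &b;
	c.next = &a; a.prev = &c;

	printf("b safe to delete: %d\n", list_del_entry_valid_sketch(&b));
	return 0;
}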

[1] v3 on top of v6.7-rc6 with "mm/mglru: fix underprotected page
cache" reverted.

2023-12-25 12:03:24

by Kairui Song

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

Yu Zhao <[email protected]> 于2023年12月25日周一 14:30写道:
>
> On Wed, Dec 20, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
> >
> > Yu Zhao <[email protected]> 于2023年12月20日周三 16:17写道:
> > >
> > > On Tue, Dec 19, 2023 at 11:38 PM Yu Zhao <[email protected]> wrote:
> > > >
> > > > On Tue, Dec 19, 2023 at 11:58 AM Kairui Song <[email protected]> wrote:
> > > > >
> > > > > Yu Zhao <[email protected]> 于2023年12月19日周二 11:45写道:
> > > > > >
> > > > > > On Mon, Dec 18, 2023 at 8:21 PM Yu Zhao <[email protected]> wrote:
> > > > > > >
> > > > > > > On Mon, Dec 18, 2023 at 11:05 AM Kairui Song <[email protected]> wrote:
> > > > > > > >
> > > > > > > > Yu Zhao <[email protected]> 于2023年12月15日周五 12:56写道:
> > > > > > > > >
> > > > > > > > > On Thu, Dec 14, 2023 at 04:51:00PM -0700, Yu Zhao wrote:
> > > > > > > > > > On Thu, Dec 14, 2023 at 11:38 AM Kairui Song <[email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Yu Zhao <[email protected]> 于2023年12月14日周四 11:09写道:
> > > > > > > > > > > > On Wed, Dec 13, 2023 at 12:59:14AM -0700, Yu Zhao wrote:
> > > > > > > > > > > > > On Tue, Dec 12, 2023 at 8:03 PM Kairui Song <[email protected]> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Kairui Song <[email protected]> 于2023年12月12日周二 14:52写道:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Yu Zhao <[email protected]> 于2023年12月12日周二 06:07写道:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Fri, Dec 8, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Yu Zhao <[email protected]> 于2023年12月8日周五 14:14写道:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Unmapped folios accessed through file descriptors can be
> > > > > > > > > > > > > > > > > > underprotected. Those folios are added to the oldest generation based
> > > > > > > > > > > > > > > > > > on:
> > > > > > > > > > > > > > > > > > 1. The fact that they are less costly to reclaim (no need to walk the
> > > > > > > > > > > > > > > > > > rmap and flush the TLB) and have less impact on performance (don't
> > > > > > > > > > > > > > > > > > cause major PFs and can be non-blocking if needed again).
> > > > > > > > > > > > > > > > > > 2. The observation that they are likely to be single-use. E.g., for
> > > > > > > > > > > > > > > > > > client use cases like Android, its apps parse configuration files
> > > > > > > > > > > > > > > > > > and store the data in heap (anon); for server use cases like MySQL,
> > > > > > > > > > > > > > > > > > it reads from InnoDB files and holds the cached data for tables in
> > > > > > > > > > > > > > > > > > buffer pools (anon).
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > However, the oldest generation can be very short lived, and if so, it
> > > > > > > > > > > > > > > > > > doesn't provide the PID controller with enough time to respond to a
> > > > > > > > > > > > > > > > > > surge of refaults. (Note that the PID controller uses weighted
> > > > > > > > > > > > > > > > > > refaults and those from evicted generations only take a half of the
> > > > > > > > > > > > > > > > > > whole weight.) In other words, for a short lived generation, the
> > > > > > > > > > > > > > > > > > moving average smooths out the spike quickly.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > To fix the problem:
> > > > > > > > > > > > > > > > > > 1. For folios that are already on LRU, if they can be beyond the
> > > > > > > > > > > > > > > > > > tracking range of tiers, i.e., five accesses through file
> > > > > > > > > > > > > > > > > > descriptors, move them to the second oldest generation to give them
> > > > > > > > > > > > > > > > > > more time to age. (Note that tiers are used by the PID controller
> > > > > > > > > > > > > > > > > > to statistically determine whether folios accessed multiple times
> > > > > > > > > > > > > > > > > > through file descriptors are worth protecting.)
> > > > > > > > > > > > > > > > > > 2. When adding unmapped folios to LRU, adjust the placement of them so
> > > > > > > > > > > > > > > > > > that they are not too close to the tail. The effect of this is
> > > > > > > > > > > > > > > > > > similar to the above.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Android, launching 55 apps sequentially:
> > > > > > > > > > > > > > > > > > Before After Change
> > > > > > > > > > > > > > > > > > workingset_refault_anon 25641024 25598972 0%
> > > > > > > > > > > > > > > > > > workingset_refault_file 115016834 106178438 -8%
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi Yu,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks you for your amazing works on MGLRU.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I believe this is the similar issue I was trying to resolve previously:
> > > > > > > > > > > > > > > > > https://lwn.net/Articles/945266/
> > > > > > > > > > > > > > > > > The idea is to use refault distance to decide if the page should be
> > > > > > > > > > > > > > > > > place in oldest generation or some other gen, which per my test,
> > > > > > > > > > > > > > > > > worked very well, and we have been using refault distance for MGLRU in
> > > > > > > > > > > > > > > > > multiple workloads.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > There are a few issues left in my previous RFC series, like anon pages
> > > > > > > > > > > > > > > > > in MGLRU shouldn't be considered, I wanted to collect feedback or test
> > > > > > > > > > > > > > > > > cases, but unfortunately it seems didn't get too much attention
> > > > > > > > > > > > > > > > > upstream.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I think both this patch and my previous series are trying to solve the
> > > > > > > > > > > > > > > > > issue of file pages being underprotected, and I did a quick test using
> > > > > > > > > > > > > > > > > this series; for the MongoDB test, refault distance still seems to be
> > > > > > > > > > > > > > > > > a better solution (I'm not saying these two optimizations are mutually
> > > > > > > > > > > > > > > > > exclusive, though -- they just have some conflicts in implementation
> > > > > > > > > > > > > > > > > and solve a similar problem):
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Previous result:
> > > > > > > > > > > > > > > > > ==================================================================
> > > > > > > > > > > > > > > > > Execution Results after 905 seconds
> > > > > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > > > > Executed Time (µs) Rate
> > > > > > > > > > > > > > > > > STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
> > > > > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > > > > TOTAL 2542 27121571486.2 0.09 txn/s
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > This patch:
> > > > > > > > > > > > > > > > > ==================================================================
> > > > > > > > > > > > > > > > > Execution Results after 900 seconds
> > > > > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > > > > Executed Time (µs) Rate
> > > > > > > > > > > > > > > > > STOCK_LEVEL 1594 27061522574.4 0.06 txn/s
> > > > > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > > > > TOTAL 1594 27061522574.4 0.06 txn/s
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > The unpatched version is always around 500.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks for the test results!
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I think there are a few points here:
> > > > > > > > > > > > > > > > > - Refault distance makes use of page shadows, so it can better
> > > > > > > > > > > > > > > > > distinguish evicted pages with different access patterns (re-access
> > > > > > > > > > > > > > > > > distance).
> > > > > > > > > > > > > > > > > - A throttled refault distance can help hold part of the workingset
> > > > > > > > > > > > > > > > > when memory is too small to hold the whole workingset.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > So maybe parts of this patch and bits of the previous series can be
> > > > > > > > > > > > > > > > > combined to work better on this issue; what do you think?
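
A minimal sketch of the refault-distance idea described above, assuming it works
roughly along the lines of the generic workingset shadow-entry scheme; the names
and the placement rule below are invented for illustration and are not the
actual kernel code:

/*
 * Minimal sketch only, not the kernel implementation; all names are invented.
 */
static unsigned long nonresident_age;   /* bumped once per eviction */

struct shadow {                         /* left behind in the page cache slot */
        unsigned long eviction;         /* snapshot of nonresident_age */
};

/* Called when a folio is evicted: remember "when" it left memory. */
static struct shadow folio_evicted(void)
{
        struct shadow s = { .eviction = nonresident_age++ };
        return s;
}

/* How many other folios were evicted between eviction and refault. */
static unsigned long refault_distance(struct shadow s)
{
        return nonresident_age - s.eviction;
}

/*
 * Hypothetical placement on refault: a short re-access distance suggests the
 * folio would have stayed resident had it been protected, so it starts in a
 * younger generation; otherwise it starts in the oldest one.
 */
static int pick_gen(struct shadow s, unsigned long workingset_pages, int max_gen)
{
        return refault_distance(s) <= workingset_pages ? max_gen : 0;
}

A real policy would also have to account for per-memcg and per-node state and
for anon vs file pages, which is glossed over here.
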
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I'll try to find some time this week to look at your RFC. It'd be a
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Yu,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I'm working on V4 of the RFC now, which just updates some comments and
> > > > > > > > > > > > > > skips anon page re-activation in the refault path for MGLRU, which was
> > > > > > > > > > > > > > not very helpful -- only some tiny adjustments.
> > > > > > > > > > > > > > And I found it easier to test with fio, using the following test script:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > #!/bin/bash
> > > > > > > > > > > > > > swapoff -a
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > modprobe brd rd_nr=1 rd_size=16777216
> > > > > > > > > > > > > > mkfs.ext4 /dev/ram0
> > > > > > > > > > > > > > mount /dev/ram0 /mnt
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > mkdir -p /sys/fs/cgroup/benchmark
> > > > > > > > > > > > > > cd /sys/fs/cgroup/benchmark
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > echo 4G > memory.max
> > > > > > > > > > > > > > echo $$ > cgroup.procs
> > > > > > > > > > > > > > echo 3 > /proc/sys/vm/drop_caches
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > fio -name=mglru --numjobs=12 --directory=/mnt --size=1024m \
> > > > > > > > > > > > > > --buffered=1 --ioengine=io_uring --iodepth=128 \
> > > > > > > > > > > > > > --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
> > > > > > > > > > > > > > --rw=randread --random_distribution=zipf:0.5 --norandommap \
> > > > > > > > > > > > > > --time_based --ramp_time=5m --runtime=5m --group_reporting
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > zipf:0.5 is used here to simulate cached reads with a slight bias
> > > > > > > > > > > > > > towards certain pages.
> > > > > > > > > > > > > > Unpatched 6.7-rc4:
> > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > READ: bw=6548MiB/s (6866MB/s), 6548MiB/s-6548MiB/s
> > > > > > > > > > > > > > (6866MB/s-6866MB/s), io=1918GiB (2060GB), run=300001-300001msec
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Patched with RFC v4:
> > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > READ: bw=7270MiB/s (7623MB/s), 7270MiB/s-7270MiB/s
> > > > > > > > > > > > > > (7623MB/s-7623MB/s), io=2130GiB (2287GB), run=300001-300001msec
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Patched with this series:
> > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > READ: bw=7098MiB/s (7442MB/s), 7098MiB/s-7098MiB/s
> > > > > > > > > > > > > > (7442MB/s-7442MB/s), io=2079GiB (2233GB), run=300002-300002msec
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > MGLRU off:
> > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > READ: bw=6525MiB/s (6842MB/s), 6525MiB/s-6525MiB/s
> > > > > > > > > > > > > > (6842MB/s-6842MB/s), io=1912GiB (2052GB), run=300002-300002msec
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > - If I change zipf:0.5 to random:
> > > > > > > > > > > > > > Unpatched 6.7-rc4:
> > > > > > > > > > > > > > Patched with this series:
> > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > READ: bw=5975MiB/s (6265MB/s), 5975MiB/s-5975MiB/s
> > > > > > > > > > > > > > (6265MB/s-6265MB/s), io=1750GiB (1879GB), run=300002-300002msec
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Patched with RFC v4:
> > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > READ: bw=5987MiB/s (6278MB/s), 5987MiB/s-5987MiB/s
> > > > > > > > > > > > > > (6278MB/s-6278MB/s), io=1754GiB (1883GB), run=300001-300001msec
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Patched with this series:
> > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > READ: bw=5839MiB/s (6123MB/s), 5839MiB/s-5839MiB/s
> > > > > > > > > > > > > > (6123MB/s-6123MB/s), io=1711GiB (1837GB), run=300001-300001msec
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > MGLRU off:
> > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > READ: bw=5689MiB/s (5965MB/s), 5689MiB/s-5689MiB/s
> > > > > > > > > > > > > > (5965MB/s-5965MB/s), io=1667GiB (1790GB), run=300003-300003msec
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > fio uses a ramdisk, so LRU accuracy will have a smaller impact. The
> > > > > > > > > > > > > > MongoDB test I provided before uses a SATA SSD, so it will have a much
> > > > > > > > > > > > > > higher impact. I'll provide a script to set up the test case and run
> > > > > > > > > > > > > > it; it's more complex to set up than fio since it involves setting up
> > > > > > > > > > > > > > multiple replicas, auth, and hundreds of GB of test fixtures. I'm
> > > > > > > > > > > > > > currently occupied by some other tasks but will try my best to send
> > > > > > > > > > > > > > them out as soon as possible.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks! Apparently your RFC did show better IOPS with both access
> > > > > > > > > > > > > patterns, which was a surprise to me because it had higher refaults,
> > > > > > > > > > > > > and usually higher refaults result in worse performance.
> > > > > > > > >
> > > > > > > > > And thanks for providing the refaults I requested -- your data
> > > > > > > > > below confirms what I mentioned above:
> > > > > > > > >
> > > > > > > > > For fio:
> > > > > > > > > Your RFC This series Change
> > > > > > > > > workingset_refault_file 628192729 596790506 -5%
> > > > > > > > > IOPS 1862k 1830k -2%
> > > > > > > > >
> > > > > > > > > For MongoDB:
> > > > > > > > > Your RFC This series Change
> > > > > > > > > workingset_refault_anon 10512 35277 +30%
> > > > > > > > > workingset_refault_file 22751782 20335355 -11%
> > > > > > > > > total 22762294 20370632 -11%
> > > > > > > > > TPS 0.09 0.06 -33%
> > > > > > > > >
> > > > > > > > > For MongoDB, this series should be a big win (but apparently it's not),
> > > > > > > > > especially when using zram, since an anon refault should be a lot
> > > > > > > > > cheaper than a file refault.
> > > > > > > > >
> > > > > > > > > So, I'm baffled...
> > > > > > > > >
> > > > > > > > > One important detail I forgot to mention: based on your data from
> > > > > > > > > lru_gen_full, I think there is another difference between our Kconfigs:
> > > > > > > > >
> > > > > > > > > Your Kconfig My Kconfig Max possible
> > > > > > > > > LRU_REFS_WIDTH 1 2 2
> > > > > > > >
> > > > > > > > Hi Yu,
> > > > > > > >
> > > > > > > > Thanks for the info -- my fault, I forgot to update my config as I was
> > > > > > > > testing some other features.
> > > > > > > > But after I changed LRU_REFS_WIDTH to 2 by disabling IDLE_PAGE, things
> > > > > > > > got much worse for the MongoDB test:
> > > > > > > >
> > > > > > > > With LRU_REFS_WIDTH == 2:
> > > > > > > >
> > > > > > > > This patch:
> > > > > > > > ==================================================================
> > > > > > > > Execution Results after 919 seconds
> > > > > > > > ------------------------------------------------------------------
> > > > > > > > Executed Time (µs) Rate
> > > > > > > > STOCK_LEVEL 488 27598136201.9 0.02 txn/s
> > > > > > > > ------------------------------------------------------------------
> > > > > > > > TOTAL 488 27598136201.9 0.02 txn/s
> > > > > > > >
> > > > > > > > memcg 86 /system.slice/docker-1c3a90be9f0a072f5719332419550cd0e1455f2cd5863bc2780ca4d3f913ece5.scope
> > > > > > > > node 0
> > > > > > > > 1 948187 0x 0x
> > > > > > > >         0 0 0 0 0 0 0
> > > > > > > >         1 0 0 0 0 0 0
> > > > > > > >         2 0 0 0 0 0 0
> > > > > > > >         3 0 0 0 0 0 0
> > > > > > > >                 0 0 0 0 0 0
> > > > > > > > 2 948187 0 6051788
> > > > > > > >         0 0r 0e 0p 11916r 66442e 0p
> > > > > > > >         1 0r 0e 0p 903r 16888e 0p
> > > > > > > >         2 0r 0e 0p 459r 9764e 0p
> > > > > > > >         3 0r 0e 0p 0r 0e 2874p
> > > > > > > >                 0 0 0 0 0 0
> > > > > > > > 3 948187 1353160 6351
> > > > > > > >         0 0 0 0 0 0 0
> > > > > > > >         1 0 0 0 0 0 0
> > > > > > > >         2 0 0 0 0 0 0
> > > > > > > >         3 0 0 0 0 0 0
> > > > > > > >                 0 0 0 0 0 0
> > > > > > > > 4 73045 23573 12
> > > > > > > >         0 0R 0T 0 3498607R 4868605T 0
> > > > > > > >         1 0R 0T 0 3012246R 3270261T 0
> > > > > > > >         2 0R 0T 0 2498608R 2839104T 0
> > > > > > > >         3 0R 0T 0 0R 1983947T 0
> > > > > > > >         1486579L 0O 1380614Y 2945N 2945F 2734A
> > > > > > > >
> > > > > > > > workingset_refault_anon 0
> > > > > > > > workingset_refault_file 18130598
> > > > > > > >
> > > > > > > > total used free shared buff/cache available
> > > > > > > > Mem: 31978 6705 312 20 24960 24786
> > > > > > > > Swap: 31977 4 31973
> > > > > > > >
> > > > > > > > RFC:
> > > > > > > > ==================================================================
> > > > > > > > Execution Results after 908 seconds
> > > > > > > > ------------------------------------------------------------------
> > > > > > > > Executed Time (µs) Rate
> > > > > > > > STOCK_LEVEL 2252 27159962888.2 0.08 txn/s
> > > > > > > > ------------------------------------------------------------------
> > > > > > > > TOTAL 2252 27159962888.2 0.08 txn/s
> > > > > > > >
> > > > > > > > workingset_refault_anon 22585
> > > > > > > > workingset_refault_file 22715256
> > > > > > > >
> > > > > > > > memcg 66 /system.slice/docker-0989446ff78106e32d3f400a0cf371c9a703281bded86d6d6bb1af706ebb25da.scope
> > > > > > > > node 0
> > > > > > > > 22 563007 2274 1198225
> > > > > > > >         0 0r 1e 0p 0r 697076e 0p
> > > > > > > >         1 0r 0e 0p 0r 0e 325661p
> > > > > > > >         2 0r 0e 0p 0r 0e 888728p
> > > > > > > >         3 0r 0e 0p 0r 0e 3602238p
> > > > > > > >                 0 0 0 0 0 0
> > > > > > > > 23 532222 7525 4948747
> > > > > > > >         0 0 0 0 0 0 0
> > > > > > > >         1 0 0 0 0 0 0
> > > > > > > >         2 0 0 0 0 0 0
> > > > > > > >         3 0 0 0 0 0 0
> > > > > > > >                 0 0 0 0 0 0
> > > > > > > > 24 500367 1214667 3292
> > > > > > > >         0 0 0 0 0 0 0
> > > > > > > >         1 0 0 0 0 0 0
> > > > > > > >         2 0 0 0 0 0 0
> > > > > > > >         3 0 0 0 0 0 0
> > > > > > > >                 0 0 0 0 0 0
> > > > > > > > 25 469692 40797 466
> > > > > > > >         0 0R 271T 0 0R 1162165T 0
> > > > > > > >         1 0R 0T 0 774028R 1205332T 0
> > > > > > > >         2 0R 0T 0 0R 932484T 0
> > > > > > > >         3 0R 1T 0 0R 4252158T 0
> > > > > > > >         25178380L 156515O 23953602Y 59234N 49391F 48664A
> > > > > > > >
> > > > > > > > total used free shared buff/cache available
> > > > > > > > Mem: 31978 6968 338 5 24671 24555
> > > > > > > > Swap: 31977 1533 30444
> > > > > > > >
> > > > > > > > Using the same MongoDB config (a 3-replica cluster using the same config):
> > > > > > > > {
> > > > > > > > "net": {
> > > > > > > > "bindIpAll": true,
> > > > > > > > "ipv6": false,
> > > > > > > > "maxIncomingConnections": 10000
> > > > > > > > },
> > > > > > > > "setParameter": {
> > > > > > > > "disabledSecureAllocatorDomains": "*"
> > > > > > > > },
> > > > > > > > "replication": {
> > > > > > > > "oplogSizeMB": 10480,
> > > > > > > > "replSetName": "issa-tpcc_0"
> > > > > > > > },
> > > > > > > > "security": {
> > > > > > > > "keyFile": "/data/db/keyfile"
> > > > > > > > },
> > > > > > > > "storage": {
> > > > > > > > "dbPath": "/data/db/",
> > > > > > > > "syncPeriodSecs": 60,
> > > > > > > > "directoryPerDB": true,
> > > > > > > > "wiredTiger": {
> > > > > > > > "engineConfig": {
> > > > > > > > "cacheSizeGB": 5
> > > > > > > > }
> > > > > > > > }
> > > > > > > > },
> > > > > > > > "systemLog": {
> > > > > > > > "destination": "file",
> > > > > > > > "logAppend": true,
> > > > > > > > "logRotate": "rename",
> > > > > > > > "path": "/data/db/mongod.log",
> > > > > > > > "verbosity": 0
> > > > > > > > }
> > > > > > > > }
> > > > > > > >
> > > > > > > > The test environment has 32G of memory and 16 cores.
> > > > > > > >
> > > > > > > > Per my analysis, the access pattern of the MongoDB test is that pages
> > > > > > > > are re-accessed long after they are evicted, so the PID controller
> > > > > > > > won't protect the higher tiers. The RFC makes use of the long-existing
> > > > > > > > shadow entries to feed back into the PID controller/generations, so
> > > > > > > > the result is much better.
> > > > > > > > It still needs more adjusting though; I will try to do a rebase on top
> > > > > > > > of mm-unstable, which includes your patch.
> > > > > > > >
> > > > > > > > I've no idea why workingset_refault_* is higher in the better case;
> > > > > > > > this is clearly an IO-bound workload, with memory and IO busy while
> > > > > > > > the CPU is not full...
> > > > > > > >
> > > > > > > > I've uploaded my local reproducer here:
> > > > > > > > https://github.com/ryncsn/emm-test-project/tree/master/mongo-cluster
> > > > > > > > https://github.com/ryncsn/py-tpcc
> > > > > > >
> > > > > > > Thanks for the repos -- I'm trying them right now. Which MongoDB
> > > > > > > version did you use? setup.sh didn't seem to install it.
> > > > > > >
> > > > > > > Also do you have a QEMU image? It'd be a lot easier for me to
> > > > > > > duplicate the exact environment by looking into it.
> > > > > >
> > > > > > I ended up using docker.io/mongodb/mongodb-community-server:latest,
> > > > > > and it's not working:
> > > > > >
> > > > > > # docker exec -it mongo-r1 mongosh --eval \
> > > > > > '"rs.initiate({
> > > > > > _id: "issa-tpcc_0",
> > > > > > members: [
> > > > > > {_id: 0, host: "mongo-r1"},
> > > > > > {_id: 1, host: "mongo-r2"},
> > > > > > {_id: 2, host: "mongo-r3"}
> > > > > > ]
> > > > > > })"'
> > > > > > Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
> > > > > > Error: can only create exec sessions on running containers: container
> > > > > > state improper
> > > > >
> > > > > Hi Yu,
> > > > >
> > > > > I've updated the test repo:
> > > > > https://github.com/ryncsn/emm-test-project/tree/master/mongo-cluster
> > > > >
> > > > > I've tested it on top of the latest Fedora Cloud Image 39 and it
> > > > > worked well for me; the README now contains detailed and
> > > > > easy-to-follow steps to reproduce this test.
> > > >
> > > > Thanks. I was following the instructions down to the letter and it
> > > > fell apart again at line 46 (./tpcc.py).
> > >
> > > I think you just broke it by
> > > https://github.com/ryncsn/py-tpcc/commit/7b9b380d636cb84faa5b11b5562e531f924eeb7e
> > >
> > > (But it's also possible you actually wanted me to use this latest
> > > commit but forgot to account for it in your instructions.)
> > >
> > > > Were you able to successfully run the benchmark on a fresh VM by
> > > > following the instructions? If not, I'd appreciate it if you could do
> > > > so and document all the missing steps.
> >
> > Ah, you are right. I attempted to convert it to Python 3 but found it
> > only brought more trouble, so I gave up, and the instructions still
> > use Python 2. However, I accidentally pushed the WIP Python 3
> > conversion commit... I've reset the repo to
> > https://github.com/ryncsn/py-tpcc/commit/86e862c5cf3b2d1f51e0297742fa837c7a99ebf8,
> > which is working well. Sorry for the inconvenience.
>
> Thanks -- I was able to reproduce results similar to yours.
>

Hi Yu,

Thanks for the testing, and merry xmas.

> It turned out the mystery (fewer refaults but worse performance) was caused by
> 13.89% 13.89% kswapd0 [kernel.vmlinux] [k]
> __list_del_entry_valid_or_report

I'm not sure about this. If the task were CPU bound, this could
explain it. But it's not: the performance gap is larger when tested on
a slow IO device.

The iostat output during my test run:
avg-cpu: %user %nice %system %iowait %steal %idle
7.40 0.00 2.42 83.37 0.00 6.80
Device            r/s    w/s     rkB/s  wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm  %util
vda             35.00   0.80    167.60  17.20   6.90   3.50 16.47 81.40    0.47    1.62   0.02     4.79    21.50  0.63   2.27
vdb           5999.30   4.80 104433.60  84.00   0.00   8.30  0.00 63.36    6.54    1.31  39.25    17.41    17.50  0.17 100.00
zram0            0.00   0.00      0.00   0.00   0.00   0.00  0.00  0.00    0.00    0.00   0.00     0.00     0.00  0.00   0.00

You can see the CPU is waiting for IO; %user is always around 10%.
The hotspot you posted only takes up 13.89% of the runtime, which
shouldn't cause such a large performance drop.
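
As a rough upper bound on that last point, treating the 13.89% of samples as
serial CPU time (which, if anything, overstates its possible impact on an
IO-bound run):

\[
  S_{\max} = \frac{1}{1 - 0.1389} \approx 1.16
\]

so even eliminating that hotspot entirely would buy at most about 16%, well
short of the TPS gaps reported earlier in the thread.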

>
> Apparently Fedora has CONFIG_DEBUG_LIST=y by default, and after I
> turned it off (the only change I made), this series showed better TPS
> (I used "--duration=10800" for more reliable results):
> v6.7-rc6 RFC [1] change
> total txns 25024 24672 +1%
> workingset_refault_anon 573668 680248 -16%
> workingset_refault_file 260631976 265808452 -2%

I have disabled CONFIG_DEBUG_LIST when doing the performance comparison tests.

I believe you are using a higher-performance SSD, so the bottleneck is
the CPU, and the RFC involves more lru/memcg counter updates/iterations,
so it is slower by 1%.

> I think this is easy to explain: this series is "lazy", i.e.,
> deferring the protection to eviction time, whereas your RFC tries to
> do it upfront, i.e., at (re)fault time. The advantage of the former is
> that it has more up-to-date information because a folio that is hot
> when it's faulted in doesn't mean it's still hot later when memory
> pressure kicks in. The disadvantage is that it needs to protect folios
> that are still hot at eviction time, by moving them to a younger
> generation, where the slow down happened with CONFIG_DEBUG_LIST=y.
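
To make the contrast concrete, a rough sketch of the two placement strategies
with invented types and helpers (they do not correspond to the kernel's data
structures):

/* Illustrative only; nothing here matches the actual kernel code. */
#include <stdbool.h>

struct folio_stub {
        unsigned int refs;      /* accesses seen through file descriptors */
        int gen;                /* generation index; 0 is the oldest */
};

/* "Upfront": the RFC decides placement when the folio is (re)faulted in. */
static void add_folio_upfront(struct folio_stub *f, int guessed_gen)
{
        f->gen = guessed_gen;   /* e.g. based on refault distance at fault time */
}

/*
 * "Lazy": this series keeps the folio in an old generation and re-checks at
 * eviction time, when its access history is most up to date; protecting it
 * means moving it to a younger generation, i.e. a cross-list move.
 */
static bool should_evict_lazy(struct folio_stub *f, int max_gen)
{
        if (f->refs > 1) {              /* still hot when reclaim reaches it */
                f->gen = max_gen - 1;   /* protect: move to a younger gen */
                return false;
        }
        return true;                    /* cold: evict from the oldest gen */
}

The cross-generation list move in the lazy path is the operation that
CONFIG_DEBUG_LIST makes more expensive.
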
>
> (It's not really a priority for me to investigate why
> __list_del_entry_valid_or_report() is so heavy. Hopefully someone else
> can shed some light on it.)

I've just set up another cluster with a high-performance SSD, where
the CPU is now the bottleneck, to better understand this. I will try
to do more tests to see if I can find out something.

2023-12-25 21:53:33

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

On Mon, Dec 25, 2023 at 5:03 AM Kairui Song <[email protected]> wrote:
>
> Hi Yu,
>
> Thanks for the testing, and merry xmas.
>
> > It turned out the mystery (fewer refaults but worse performance) was caused by
> > 13.89% 13.89% kswapd0 [kernel.vmlinux] [k]
> > __list_del_entry_valid_or_report
>
> I'm not sure about this. If the task were CPU bound, this could
> explain it. But it's not: the performance gap is larger when tested on
> a slow IO device.
>
> The iostat output during my test run:
> avg-cpu: %user %nice %system %iowait %steal %idle
> 7.40 0.00 2.42 83.37 0.00 6.80
> Device            r/s    w/s     rkB/s  wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm  %util
> vda             35.00   0.80    167.60  17.20   6.90   3.50 16.47 81.40    0.47    1.62   0.02     4.79    21.50  0.63   2.27
> vdb           5999.30   4.80 104433.60  84.00   0.00   8.30  0.00 63.36    6.54    1.31  39.25    17.41    17.50  0.17 100.00
> zram0            0.00   0.00      0.00   0.00   0.00   0.00  0.00  0.00    0.00    0.00   0.00     0.00     0.00  0.00   0.00

I ran the benchmark on the slowest bare metal I have that roughly
matches your CPU/DRAM configurations (ThinkPad P1 G4
https://support.lenovo.com/us/en/solutions/pd031426).

But it seems you used a VM (vda/vdb) -- I never run performance
benchmarks in VMs because the host and hypervisor can complicate
things. For example, in this case, is it possible that the host page
cache cached your disk image containing the database files?
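
For what it's worth, a small illustration of that concern, assuming the image
sits on a host filesystem: unless the image file is opened with O_DIRECT
(which is what a hypervisor cache mode such as QEMU's cache=none uses), guest
disk reads can be satisfied from the host page cache. The path below is made
up:

/* Sketch only: read a VM image directly, bypassing the host page cache. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
        void *buf;

        /* O_DIRECT requires block-aligned buffers, offsets and sizes. */
        if (posix_memalign(&buf, 4096, 4096))
                return 1;

        int fd = open("/var/lib/libvirt/images/test.img", O_RDONLY | O_DIRECT);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* This read goes to the backing device, not the host page cache. */
        ssize_t n = pread(fd, buf, 4096, 0);
        printf("read %zd bytes directly from the backing device\n", n);

        close(fd);
        free(buf);
        return 0;
}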

> You can see the CPU is waiting for IO; %user is always around 10%.
> The hotspot you posted only takes up 13.89% of the runtime, which
> shouldn't cause such a large performance drop.
>
> >
> > Apparently Fedora has CONFIG_DEBUG_LIST=y by default, and after I
> > turned it off (the only change I made), this series showed better TPS
> > (I used "--duration=10800" for more reliable results):
> > v6.7-rc6 RFC [1] change
> > total txns 25024 24672 +1%
> > workingset_refault_anon 573668 680248 -16%
> > workingset_refault_file 260631976 265808452 -2%
>
> I have disabled CONFIG_DEBUG_LIST when doing the performance comparison tests.
>
> I believe you are using a higher-performance SSD, so the bottleneck is
> the CPU, and the RFC involves more lru/memcg counter updates/iterations,
> so it is slower by 1%.
>
> > I think this is easy to explain: this series is "lazy", i.e.,
> > deferring the protection to eviction time, whereas your RFC tries to
> > do it upfront, i.e., at (re)fault time. The advantage of the former is
> > that it has more up-to-date information because a folio that is hot
> > when it's faulted in doesn't mean it's still hot later when memory
> > pressure kicks in. The disadvantage is that it needs to protect folios
> > that are still hot at eviction time, by moving them to a younger
> > generation, where the slow down happened with CONFIG_DEBUG_LIST=y.
> >
> > (It's not really a priority for me to investigate why
> > __list_del_entry_valid_or_report() is so heavy. Hopefully someone else
> > can shed some light on it.)
>
> I've just set up another cluster with a high-performance SSD, where
> the CPU is now the bottleneck, to better understand this. I will try
> to do more tests to see if I can find out something.

I'd suggest we both stick to bare metal until we can reconcile our
test results. Otherwise, there'd be too many moving parts for us to
get to the bottom of this.

2023-12-25 22:01:49

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

On Mon, Dec 25, 2023 at 2:52 PM Yu Zhao <[email protected]> wrote:
>
> On Mon, Dec 25, 2023 at 5:03 AM Kairui Song <[email protected]> wrote:
> >
> > Yu Zhao <[email protected]> 于2023年12月25日周一 14:30写道:
> > >
> > > On Wed, Dec 20, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
> > > >
> > > > Yu Zhao <[email protected]> 于2023年12月20日周三 16:17写道:
> > > > >
> > > > > On Tue, Dec 19, 2023 at 11:38 PM Yu Zhao <[email protected]> wrote:
> > > > > >
> > > > > > On Tue, Dec 19, 2023 at 11:58 AM Kairui Song <[email protected]> wrote:
> > > > > > >
> > > > > > > Yu Zhao <[email protected]> 于2023年12月19日周二 11:45写道:
> > > > > > > >
> > > > > > > > On Mon, Dec 18, 2023 at 8:21 PM Yu Zhao <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > On Mon, Dec 18, 2023 at 11:05 AM Kairui Song <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > Yu Zhao <[email protected]> 于2023年12月15日周五 12:56写道:
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Dec 14, 2023 at 04:51:00PM -0700, Yu Zhao wrote:
> > > > > > > > > > > > On Thu, Dec 14, 2023 at 11:38 AM Kairui Song <[email protected]> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Yu Zhao <[email protected]> 于2023年12月14日周四 11:09写道:
> > > > > > > > > > > > > > On Wed, Dec 13, 2023 at 12:59:14AM -0700, Yu Zhao wrote:
> > > > > > > > > > > > > > > On Tue, Dec 12, 2023 at 8:03 PM Kairui Song <[email protected]> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Kairui Song <[email protected]> 于2023年12月12日周二 14:52写道:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Yu Zhao <[email protected]> 于2023年12月12日周二 06:07写道:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Fri, Dec 8, 2023 at 1:24 AM Kairui Song <[email protected]> wrote:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Yu Zhao <[email protected]> 于2023年12月8日周五 14:14写道:
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Unmapped folios accessed through file descriptors can be
> > > > > > > > > > > > > > > > > > > > underprotected. Those folios are added to the oldest generation based
> > > > > > > > > > > > > > > > > > > > on:
> > > > > > > > > > > > > > > > > > > > 1. The fact that they are less costly to reclaim (no need to walk the
> > > > > > > > > > > > > > > > > > > > rmap and flush the TLB) and have less impact on performance (don't
> > > > > > > > > > > > > > > > > > > > cause major PFs and can be non-blocking if needed again).
> > > > > > > > > > > > > > > > > > > > 2. The observation that they are likely to be single-use. E.g., for
> > > > > > > > > > > > > > > > > > > > client use cases like Android, its apps parse configuration files
> > > > > > > > > > > > > > > > > > > > and store the data in heap (anon); for server use cases like MySQL,
> > > > > > > > > > > > > > > > > > > > it reads from InnoDB files and holds the cached data for tables in
> > > > > > > > > > > > > > > > > > > > buffer pools (anon).
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > However, the oldest generation can be very short lived, and if so, it
> > > > > > > > > > > > > > > > > > > > doesn't provide the PID controller with enough time to respond to a
> > > > > > > > > > > > > > > > > > > > surge of refaults. (Note that the PID controller uses weighted
> > > > > > > > > > > > > > > > > > > > refaults and those from evicted generations only take a half of the
> > > > > > > > > > > > > > > > > > > > whole weight.) In other words, for a short lived generation, the
> > > > > > > > > > > > > > > > > > > > moving average smooths out the spike quickly.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > To fix the problem:
> > > > > > > > > > > > > > > > > > > > 1. For folios that are already on LRU, if they can be beyond the
> > > > > > > > > > > > > > > > > > > > tracking range of tiers, i.e., five accesses through file
> > > > > > > > > > > > > > > > > > > > descriptors, move them to the second oldest generation to give them
> > > > > > > > > > > > > > > > > > > > more time to age. (Note that tiers are used by the PID controller
> > > > > > > > > > > > > > > > > > > > to statistically determine whether folios accessed multiple times
> > > > > > > > > > > > > > > > > > > > through file descriptors are worth protecting.)
> > > > > > > > > > > > > > > > > > > > 2. When adding unmapped folios to LRU, adjust the placement of them so
> > > > > > > > > > > > > > > > > > > > that they are not too close to the tail. The effect of this is
> > > > > > > > > > > > > > > > > > > > similar to the above.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > On Android, launching 55 apps sequentially:
> > > > > > > > > > > > > > > > > > > > Before After Change
> > > > > > > > > > > > > > > > > > > > workingset_refault_anon 25641024 25598972 0%
> > > > > > > > > > > > > > > > > > > > workingset_refault_file 115016834 106178438 -8%
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Hi Yu,
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Thanks you for your amazing works on MGLRU.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > I believe this is the similar issue I was trying to resolve previously:
> > > > > > > > > > > > > > > > > > > https://lwn.net/Articles/945266/
> > > > > > > > > > > > > > > > > > > The idea is to use refault distance to decide if the page should be
> > > > > > > > > > > > > > > > > > > place in oldest generation or some other gen, which per my test,
> > > > > > > > > > > > > > > > > > > worked very well, and we have been using refault distance for MGLRU in
> > > > > > > > > > > > > > > > > > > multiple workloads.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > There are a few issues left in my previous RFC series, like anon pages
> > > > > > > > > > > > > > > > > > > in MGLRU shouldn't be considered, I wanted to collect feedback or test
> > > > > > > > > > > > > > > > > > > cases, but unfortunately it seems didn't get too much attention
> > > > > > > > > > > > > > > > > > > upstream.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > I think both this patch and my previous series are for solving the
> > > > > > > > > > > > > > > > > > > file pages underpertected issue, and I did a quick test using this
> > > > > > > > > > > > > > > > > > > series, for mongodb test, refault distance seems still a better
> > > > > > > > > > > > > > > > > > > solution (I'm not saying these two optimization are mutually exclusive
> > > > > > > > > > > > > > > > > > > though, just they do have some conflicts in implementation and solving
> > > > > > > > > > > > > > > > > > > similar problem):
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Previous result:
> > > > > > > > > > > > > > > > > > > ==================================================================
> > > > > > > > > > > > > > > > > > > Execution Results after 905 seconds
> > > > > > > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > > > > > > Executed Time (µs) Rate
> > > > > > > > > > > > > > > > > > > STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
> > > > > > > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > > > > > > TOTAL 2542 27121571486.2 0.09 txn/s
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > This patch:
> > > > > > > > > > > > > > > > > > > ==================================================================
> > > > > > > > > > > > > > > > > > > Execution Results after 900 seconds
> > > > > > > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > > > > > > Executed Time (µs) Rate
> > > > > > > > > > > > > > > > > > > STOCK_LEVEL 1594 27061522574.4 0.06 txn/s
> > > > > > > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > > > > > > TOTAL 1594 27061522574.4 0.06 txn/s
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Unpatched version is always around ~500.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Thanks for the test results!
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > I think there are a few points here:
> > > > > > > > > > > > > > > > > > > - Refault distance makes use of page shadows so it can better
> > > > > > > > > > > > > > > > > > > distinguish evicted pages with different access patterns (re-access
> > > > > > > > > > > > > > > > > > > distance).
> > > > > > > > > > > > > > > > > > > - Throttled refault distance can help hold part of the workingset when
> > > > > > > > > > > > > > > > > > > memory is too small to hold the whole workingset.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > So maybe parts of this patch and bits of the previous series can be
> > > > > > > > > > > > > > > > > > > combined to work better on this issue, what do you think?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I'll try to find some time this week to look at your RFC. It'd be a
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi Yu,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I'm working on V4 of the RFC now, which just updates some comments and
> > > > > > > > > > > > > > > > skips anon page re-activation in the refault path for MGLRU, which was
> > > > > > > > > > > > > > > > not very helpful; only some tiny adjustments.
> > > > > > > > > > > > > > > > And I found it easier to test with fio, using the following test script:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > #!/bin/bash
> > > > > > > > > > > > > > > > swapoff -a
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > modprobe brd rd_nr=1 rd_size=16777216
> > > > > > > > > > > > > > > > mkfs.ext4 /dev/ram0
> > > > > > > > > > > > > > > > mount /dev/ram0 /mnt
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > mkdir -p /sys/fs/cgroup/benchmark
> > > > > > > > > > > > > > > > cd /sys/fs/cgroup/benchmark
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > echo 4G > memory.max
> > > > > > > > > > > > > > > > echo $$ > cgroup.procs
> > > > > > > > > > > > > > > > echo 3 > /proc/sys/vm/drop_caches
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > fio -name=mglru --numjobs=12 --directory=/mnt --size=1024m \
> > > > > > > > > > > > > > > > --buffered=1 --ioengine=io_uring --iodepth=128 \
> > > > > > > > > > > > > > > > --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
> > > > > > > > > > > > > > > > --rw=randread --random_distribution=zipf:0.5 --norandommap \
> > > > > > > > > > > > > > > > --time_based --ramp_time=5m --runtime=5m --group_reporting
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > zipf:0.5 is used here to simulate a cached read with a slight bias
> > > > > > > > > > > > > > > > towards certain pages.
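
(For readers unfamiliar with fio's zipf option, the standalone sketch
below -- plain C, illustrative only and unrelated to fio's internals,
with the page count picked as an assumption to roughly match
--size=1024m -- draws page indices with probability proportional to
1/k^0.5, i.e. the mild skew described above: low-numbered pages are
favoured, but the tail still gets plenty of accesses.)

/*
 * Toy illustration of a zipf(0.5)-like access pattern, similar in
 * spirit to fio's --random_distribution=zipf:0.5; not fio's code.
 */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define NR_PAGES 262144         /* 1024m of 4KiB pages per job */

int main(void)
{
        static double cdf[NR_PAGES];
        double total = 0.0;
        int i;

        /* weight of page k is k^-0.5: a mild bias towards low pages */
        for (i = 0; i < NR_PAGES; i++) {
                total += pow(i + 1, -0.5);
                cdf[i] = total;
        }

        /* draw a few sample page indices by inverting the CDF */
        for (i = 0; i < 10; i++) {
                double r = (double)rand() / RAND_MAX * total;
                int lo = 0, hi = NR_PAGES - 1;

                while (lo < hi) {       /* first index with cdf[] >= r */
                        int mid = (lo + hi) / 2;

                        if (cdf[mid] < r)
                                lo = mid + 1;
                        else
                                hi = mid;
                }
                printf("access page %d\n", lo);
        }
        return 0;
}

(Build with something like "gcc -O2 zipf_sketch.c -lm"; the file name is
only a placeholder.)
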
> > > > > > > > > > > > > > > > Unpatched 6.7-rc4:
> > > > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > > > READ: bw=6548MiB/s (6866MB/s), 6548MiB/s-6548MiB/s
> > > > > > > > > > > > > > > > (6866MB/s-6866MB/s), io=1918GiB (2060GB), run=300001-300001msec
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Patched with RFC v4:
> > > > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > > > READ: bw=7270MiB/s (7623MB/s), 7270MiB/s-7270MiB/s
> > > > > > > > > > > > > > > > (7623MB/s-7623MB/s), io=2130GiB (2287GB), run=300001-300001msec
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Patched with this series:
> > > > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > > > READ: bw=7098MiB/s (7442MB/s), 7098MiB/s-7098MiB/s
> > > > > > > > > > > > > > > > (7442MB/s-7442MB/s), io=2079GiB (2233GB), run=300002-300002msec
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > MGLRU off:
> > > > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > > > READ: bw=6525MiB/s (6842MB/s), 6525MiB/s-6525MiB/s
> > > > > > > > > > > > > > > > (6842MB/s-6842MB/s), io=1912GiB (2052GB), run=300002-300002msec
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > - If I change zipf:0.5 to random:
> > > > > > > > > > > > > > > > Unpatched 6.7-rc4:
> > > > > > > > > > > > > > > > Patched with this series:
> > > > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > > > READ: bw=5975MiB/s (6265MB/s), 5975MiB/s-5975MiB/s
> > > > > > > > > > > > > > > > (6265MB/s-6265MB/s), io=1750GiB (1879GB), run=300002-300002msec
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Patched with RFC v4:
> > > > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > > > READ: bw=5987MiB/s (6278MB/s), 5987MiB/s-5987MiB/s
> > > > > > > > > > > > > > > > (6278MB/s-6278MB/s), io=1754GiB (1883GB), run=300001-300001msec
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Patched with this series:
> > > > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > > > READ: bw=5839MiB/s (6123MB/s), 5839MiB/s-5839MiB/s
> > > > > > > > > > > > > > > > (6123MB/s-6123MB/s), io=1711GiB (1837GB), run=300001-300001msec
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > MGLRU off:
> > > > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > > > READ: bw=5689MiB/s (5965MB/s), 5689MiB/s-5689MiB/s
> > > > > > > > > > > > > > > > (5965MB/s-5965MB/s), io=1667GiB (1790GB), run=300003-300003msec
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > fio uses a ramdisk so LRU accuracy will have a smaller impact. The MongoDB
> > > > > > > > > > > > > > > > test I provided before uses a SATA SSD so it will have a much higher
> > > > > > > > > > > > > > > > impact. I'll provide a script to set up the test case and run it; it's
> > > > > > > > > > > > > > > > more complex to set up than fio since it involves setting up multiple
> > > > > > > > > > > > > > > > replicas, auth, and hundreds of GB of test fixtures. I'm currently
> > > > > > > > > > > > > > > > occupied by some other tasks but will try my best to send it out as
> > > > > > > > > > > > > > > > soon as possible.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks! Apparently your RFC did show better IOPS with both access
> > > > > > > > > > > > > > > patterns, which was a surprise to me because it had higher refaults,
> > > > > > > > > > > > > > > and usually higher refaults result in worse performance.
> > > > > > > > > > >
> > > > > > > > > > > And thanks for providing the refaults I requested -- your data
> > > > > > > > > > > below confirms what I mentioned above:
> > > > > > > > > > >
> > > > > > > > > > > For fio:
> > > > > > > > > > > Your RFC This series Change
> > > > > > > > > > > workingset_refault_file 628192729 596790506 -5%
> > > > > > > > > > > IOPS 1862k 1830k -2%
> > > > > > > > > > >
> > > > > > > > > > > For MongoDB:
> > > > > > > > > > > Your RFC This series Change
> > > > > > > > > > > workingset_refault_anon 10512 35277 +30%
> > > > > > > > > > > workingset_refault_file 22751782 20335355 -11%
> > > > > > > > > > > total 22762294 20370632 -11%
> > > > > > > > > > > TPS 0.09 0.06 -33%
> > > > > > > > > > >
> > > > > > > > > > > For MongoDB, this series should be a big win (but apparently it's not),
> > > > > > > > > > > especially when using zram, since an anon refault should be a lot
> > > > > > > > > > > cheaper than a file refault.
> > > > > > > > > > >
> > > > > > > > > > > So, I'm baffled...
> > > > > > > > > > >
> > > > > > > > > > > One important detail I forgot to mention: based on your data from
> > > > > > > > > > > lru_gen_full, I think there is another difference between our Kconfigs:
> > > > > > > > > > >
> > > > > > > > > > > Your Kconfig My Kconfig Max possible
> > > > > > > > > > > LRU_REFS_WIDTH 1 2 2
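
(A rough way to see why LRU_REFS_WIDTH matters: the toy program below
models the per-folio refs field as a saturating counter. It is plain
userspace C and only a sketch of the idea -- not the kernel's actual
tier calculation in include/linux/mm_inline.h. With a 1-bit counter, a
folio accessed twice and a folio accessed five times look identical to
the PID controller; with 2 bits they can still be told apart.)

#include <stdio.h>

/* saturating counter: "width" bits can count at most 2^width - 1 */
static unsigned int saturating_inc(unsigned int refs, unsigned int width)
{
        unsigned int max = (1u << width) - 1;   /* 1 bit -> 1, 2 bits -> 3 */

        return refs < max ? refs + 1 : max;
}

int main(void)
{
        unsigned int width;

        for (width = 1; width <= 2; width++) {
                unsigned int refs = 0, access;

                printf("LRU_REFS_WIDTH=%u:", width);
                for (access = 1; access <= 5; access++) {
                        refs = saturating_inc(refs, width);
                        printf(" access %u -> refs %u", access, refs);
                }
                printf("\n");
        }
        return 0;
}
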
> > > > > > > > > >
> > > > > > > > > > Hi Yu,
> > > > > > > > > >
> > > > > > > > > > Thanks for the info, my fault, I forgot to update my config as I was
> > > > > > > > > > testing some other features.
> > > > > > > > > > But after I changed LRU_REFS_WIDTH to 2 by disabling IDLE_PAGE, things
> > > > > > > > > > got much worse for the MongoDB test:
> > > > > > > > > >
> > > > > > > > > > With LRU_REFS_WIDTH == 2:
> > > > > > > > > >
> > > > > > > > > > This patch:
> > > > > > > > > > ==================================================================
> > > > > > > > > > Execution Results after 919 seconds
> > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > Executed Time (µs) Rate
> > > > > > > > > > STOCK_LEVEL 488 27598136201.9 0.02 txn/s
> > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > TOTAL 488 27598136201.9 0.02 txn/s
> > > > > > > > > >
> > > > > > > > > > memcg 86 /system.slice/docker-1c3a90be9f0a072f5719332419550cd0e1455f2cd5863bc2780ca4d3f913ece5.scope
> > > > > > > > > > node 0
> > > > > > > > > > 1 948187 0x 0x
> > > > > > > > > > 0 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 1 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 2 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 3 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 2 948187 0 6051788·
> > > > > > > > > > 0 0r 0e 0p 11916r
> > > > > > > > > > 66442e 0p
> > > > > > > > > > 1 0r 0e 0p 903r
> > > > > > > > > > 16888e 0p
> > > > > > > > > > 2 0r 0e 0p 459r
> > > > > > > > > > 9764e 0p
> > > > > > > > > > 3 0r 0e 0p 0r
> > > > > > > > > > 0e 2874p
> > > > > > > > > > 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 3 948187 1353160 6351·
> > > > > > > > > > 0 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 1 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 2 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 3 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 4 73045 23573 12·
> > > > > > > > > > 0 0R 0T 0 3498607R
> > > > > > > > > > 4868605T 0·
> > > > > > > > > > 1 0R 0T 0 3012246R
> > > > > > > > > > 3270261T 0·
> > > > > > > > > > 2 0R 0T 0 2498608R
> > > > > > > > > > 2839104T 0·
> > > > > > > > > > 3 0R 0T 0 0R
> > > > > > > > > > 1983947T 0·
> > > > > > > > > > 1486579L 0O 1380614Y 2945N
> > > > > > > > > > 2945F 2734A
> > > > > > > > > >
> > > > > > > > > > workingset_refault_anon 0
> > > > > > > > > > workingset_refault_file 18130598
> > > > > > > > > >
> > > > > > > > > > total used free shared buff/cache available
> > > > > > > > > > Mem: 31978 6705 312 20 24960 24786
> > > > > > > > > > Swap: 31977 4 31973
> > > > > > > > > >
> > > > > > > > > > RFC:
> > > > > > > > > > ==================================================================
> > > > > > > > > > Execution Results after 908 seconds
> > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > Executed Time (µs) Rate
> > > > > > > > > > STOCK_LEVEL 2252 27159962888.2 0.08 txn/s
> > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > TOTAL 2252 27159962888.2 0.08 txn/s
> > > > > > > > > >
> > > > > > > > > > workingset_refault_anon 22585
> > > > > > > > > > workingset_refault_file 22715256
> > > > > > > > > >
> > > > > > > > > > memcg 66 /system.slice/docker-0989446ff78106e32d3f400a0cf371c9a703281bded86d6d6bb1af706ebb25da.scope
> > > > > > > > > > node 0
> > > > > > > > > > 22 563007 2274 1198225·
> > > > > > > > > > 0 0r 1e 0p 0r
> > > > > > > > > > 697076e 0p
> > > > > > > > > > 1 0r 0e 0p 0r
> > > > > > > > > > 0e 325661p
> > > > > > > > > > 2 0r 0e 0p 0r
> > > > > > > > > > 0e 888728p
> > > > > > > > > > 3 0r 0e 0p 0r
> > > > > > > > > > 0e 3602238p
> > > > > > > > > > 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 23 532222 7525 4948747·
> > > > > > > > > > 0 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 1 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 2 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 3 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 24 500367 1214667 3292·
> > > > > > > > > > 0 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 1 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 2 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 3 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 0 0 0 0
> > > > > > > > > > 0 0·
> > > > > > > > > > 25 469692 40797 466·
> > > > > > > > > > 0 0R 271T 0 0R
> > > > > > > > > > 1162165T 0·
> > > > > > > > > > 1 0R 0T 0 774028R
> > > > > > > > > > 1205332T 0·
> > > > > > > > > > 2 0R 0T 0 0R
> > > > > > > > > > 932484T 0·
> > > > > > > > > > 3 0R 1T 0 0R
> > > > > > > > > > 4252158T 0·
> > > > > > > > > > 25178380L 156515O 23953602Y 59234N
> > > > > > > > > > 49391F 48664A
> > > > > > > > > >
> > > > > > > > > > total used free shared buff/cache available
> > > > > > > > > > Mem: 31978 6968 338 5 24671 24555
> > > > > > > > > > Swap: 31977 1533 30444
> > > > > > > > > >
> > > > > > > > > > Using the same mongodb config (a 3-replica cluster, all replicas using the same config):
> > > > > > > > > > {
> > > > > > > > > > "net": {
> > > > > > > > > > "bindIpAll": true,
> > > > > > > > > > "ipv6": false,
> > > > > > > > > > "maxIncomingConnections": 10000
> > > > > > > > > > },
> > > > > > > > > > "setParameter": {
> > > > > > > > > > "disabledSecureAllocatorDomains": "*"
> > > > > > > > > > },
> > > > > > > > > > "replication": {
> > > > > > > > > > "oplogSizeMB": 10480,
> > > > > > > > > > "replSetName": "issa-tpcc_0"
> > > > > > > > > > },
> > > > > > > > > > "security": {
> > > > > > > > > > "keyFile": "/data/db/keyfile"
> > > > > > > > > > },
> > > > > > > > > > "storage": {
> > > > > > > > > > "dbPath": "/data/db/",
> > > > > > > > > > "syncPeriodSecs": 60,
> > > > > > > > > > "directoryPerDB": true,
> > > > > > > > > > "wiredTiger": {
> > > > > > > > > > "engineConfig": {
> > > > > > > > > > "cacheSizeGB": 5
> > > > > > > > > > }
> > > > > > > > > > }
> > > > > > > > > > },
> > > > > > > > > > "systemLog": {
> > > > > > > > > > "destination": "file",
> > > > > > > > > > "logAppend": true,
> > > > > > > > > > "logRotate": "rename",
> > > > > > > > > > "path": "/data/db/mongod.log",
> > > > > > > > > > "verbosity": 0
> > > > > > > > > > }
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > The test environment has 32G of memory and 16 cores.
> > > > > > > > > >
> > > > > > > > > > Per my analysis, the access pattern of the mongodb test is that pages
> > > > > > > > > > will be re-accessed long after they are evicted, so the PID controller
> > > > > > > > > > won't protect the higher tiers. That RFC makes use of the long-existing
> > > > > > > > > > shadows to feed back into the PID controller/generations, so the result
> > > > > > > > > > will be much better. Still needs more adjusting though; I will try to do
> > > > > > > > > > a rebase on top of mm-unstable, which includes your patch.
> > > > > > > > > >
> > > > > > > > > > I've no idea why workingset_refault_* is higher in the better case;
> > > > > > > > > > this is clearly an IO-bound workload, and memory and IO are busy while
> > > > > > > > > > the CPU is not full...
> > > > > > > > > >
> > > > > > > > > > I've uploaded my local reproducer here:
> > > > > > > > > > https://github.com/ryncsn/emm-test-project/tree/master/mongo-cluster
> > > > > > > > > > https://github.com/ryncsn/py-tpcc
> > > > > > > > >
> > > > > > > > > Thanks for the repos -- I'm trying them right now. Which MongoDB
> > > > > > > > > version did you use? setup.sh didn't seem to install it.
> > > > > > > > >
> > > > > > > > > Also do you have a QEMU image? It'd be a lot easier for me to
> > > > > > > > > duplicate the exact environment by looking into it.
> > > > > > > >
> > > > > > > > I ended up using docker.io/mongodb/mongodb-community-server:latest,
> > > > > > > > and it's not working:
> > > > > > > >
> > > > > > > > # docker exec -it mongo-r1 mongosh --eval \
> > > > > > > > '"rs.initiate({
> > > > > > > > _id: "issa-tpcc_0",
> > > > > > > > members: [
> > > > > > > > {_id: 0, host: "mongo-r1"},
> > > > > > > > {_id: 1, host: "mongo-r2"},
> > > > > > > > {_id: 2, host: "mongo-r3"}
> > > > > > > > ]
> > > > > > > > })"'
> > > > > > > > Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
> > > > > > > > Error: can only create exec sessions on running containers: container
> > > > > > > > state improper
> > > > > > >
> > > > > > > Hi Yu,
> > > > > > >
> > > > > > > I've updated the test repo:
> > > > > > > https://github.com/ryncsn/emm-test-project/tree/master/mongo-cluster
> > > > > > >
> > > > > > > I've tested it on top of the latest Fedora Cloud Image 39 and it worked
> > > > > > > well for me; the README now contains detailed and easy-to-follow
> > > > > > > steps to reproduce this test.
> > > > > >
> > > > > > Thanks. I was following the instructions down to the letter and it
> > > > > > fell apart again at line 46 (./tpcc.py).
> > > > >
> > > > > I think you just broke it by
> > > > > https://github.com/ryncsn/py-tpcc/commit/7b9b380d636cb84faa5b11b5562e531f924eeb7e
> > > > >
> > > > > (But it's also possible you actually wanted me to use this latest
> > > > > commit but forgot to account for it in your instructions.)
> > > > >
> > > > > > Were you able to successfully run the benchmark on a fresh VM by
> > > > > > following the instructions? If not, I'd appreciate it if you could do
> > > > > > so and document all the missing steps.
> > > >
> > > > Ah, you are right, I attempted to convert it to Python3 but found it
> > > > only brought more trouble, so I gave up and the instructions are still
> > > > using Python2. However I accidentally pushed the WIP Python3 conversion
> > > > commit... I've reset the repo to
> > > > https://github.com/ryncsn/py-tpcc/commit/86e862c5cf3b2d1f51e0297742fa837c7a99ebf8,
> > > > which is working well. Sorry for the inconvenience.
> > >
> > > Thanks -- I was able to reproduce results similar to yours.
> > >
> >
> > Hi Yu,
> >
> > Thanks for the testing, and merry xmas.
> >
> > > It turned out the mystery (fewer refaults but worse performance) was caused by
> > > 13.89% 13.89% kswapd0 [kernel.vmlinux] [k]
> > > __list_del_entry_valid_or_report
> >
> > I'm not sure about this; if the task were CPU bound, this could
> > explain it. But it's not, and the performance gap is larger when tested
> > on a slow IO device.
> >
> > The iostat output during my test run:
> > avg-cpu: %user %nice %system %iowait %steal %idle
> > 7.40 0.00 2.42 83.37 0.00 6.80
> > Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s
> > %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
> > vda 35.00 0.80 167.60 17.20 6.90 3.50
> > 16.47 81.40 0.47 1.62 0.02 4.79 21.50 0.63 2.27
> > vdb 5999.30 4.80 104433.60 84.00 0.00 8.30
> > 0.00 63.36 6.54 1.31 39.25 17.41 17.50 0.17 100.00
> > zram0 0.00 0.00 0.00 0.00 0.00 0.00
> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
>
> I ran the benchmark on the slowest bare metal I have that roughly
> matches your CPU/DRAM configurations (ThinkPad P1 G4
> https://support.lenovo.com/us/en/solutions/pd031426).
>
> But it seems you used a VM (vda/vdb) -- I never run performance
> benchmarks in VMs because the host and hypervisor can complicate
> things, for example, in this case, is it possible the host page cache
> cached your disk image containing the database files?
>
> > You can see the CPU is waiting for IO, %user is always around 10%.
> > The hotspot you posted only takes up 13.89% of the runtime, which
> > shouldn't cause such a large performance drop.
> >
> > >
> > > Apparently Fedora has CONFIG_DEBUG_LIST=y by default, and after I
> > > turned it off (the only change I made), this series showed better TPS
> > > (I used "--duration=10800" for more reliable results):
> > > v6.7-rc6 RFC [1] change
> > > total txns 25024 24672 +1%
> > > workingset_refault_anon 573668 680248 -16%
> > > workingset_refault_file 260631976 265808452 -2%
> >
> > I have disabled CONFIG_DEBUG_LIST when doing performance comparison test.

Also I'd suggest we both use the same distro you shared with me and
the default .config except CONFIG_DEBUG_LIST=n, and v6.7-rc6 for now.

(I'm attaching the default .config based on /boot/config-6.5.6-300.fc39.x86_64.)

> > I believe you are using a higher-performance SSD, so the bottleneck is
> > the CPU, and the RFC involves more lru/memcg counter updates/iteration,
> > so it is slower by 1%.
> >
> > > I think this is easy to explain: this series is "lazy", i.e.,
> > > deferring the protection to eviction time, whereas your RFC tries to
> > > do it upfront, i.e., at (re)fault time. The advantage of the former is
> > > that it has more up-to-date information, because a folio being hot
> > > when it's faulted in doesn't mean it's still hot later when memory
> > > pressure kicks in. The disadvantage is that it needs to protect folios
> > > that are still hot at eviction time by moving them to a younger
> > > generation, which is where the slowdown happened with CONFIG_DEBUG_LIST=y.
> > >
> > > (It's not really a priority for me to investigate why
> > > __list_del_entry_valid_or_report() is so heavy. Hopefully someone else
> > > can shed some light on it.)
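
(For context on why that symbol can dominate a kswapd profile: with
CONFIG_DEBUG_LIST=y, every list unlink first validates the neighbouring
pointers, so each folio moved to a younger generation pays extra, often
cache-missing, loads and branches. Below is a simplified userspace
sketch of that check, loosely modelled on lib/list_debug.c rather than
being the exact kernel code.)

#include <stdbool.h>
#include <stdio.h>

struct list_head {
        struct list_head *next, *prev;
};

/* simplified stand-in for the debug-build validation before an unlink */
static bool sketch_del_entry_valid(struct list_head *entry)
{
        if (entry->prev->next != entry || entry->next->prev != entry) {
                fprintf(stderr, "list_del corruption on %p\n", (void *)entry);
                return false;
        }
        return true;
}

static void sketch_list_del(struct list_head *entry)
{
        /* two extra pointer dereferences (often cache misses) per unlink */
        if (!sketch_del_entry_valid(entry))
                return;
        entry->next->prev = entry->prev;
        entry->prev->next = entry->next;
}

int main(void)
{
        struct list_head head = { &head, &head };
        struct list_head node;

        /* insert node right after head, then unlink it again */
        node.next = head.next;
        node.prev = &head;
        head.next->prev = &node;
        head.next = &node;

        sketch_list_del(&node);
        return 0;
}
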
> >
> > I've just set up another cluster with a high-performance SSD, where the
> > CPU is now the bottleneck, to better understand this. Will try to do
> > more tests to see if I can find out something.
>
> I'd suggest we both stick to bare metal until we can reconcile our
> test results. Otherwise, there'd be too many moving parts for us to
> get to the bottom of this.


Attachments:
config (261.79 kB)

2024-01-10 19:17:41

by Kairui Song

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

Yu Zhao <[email protected]> wrote on Tue, Dec 26, 2023 at 06:01:
>
> Also I'd suggest we both use the same distro you shared with me and
> the default .config except CONFIG_DEBUG_LIST=n, and v6.7-rc6 for now.
>
> (I'm attaching the default .config based on /boot/config-6.5.6-300.fc39.x86_64.)
>

Hi Yu

I've been adapting and testing the refault distance series based on the
latest 6.7. Also I found a serious bug in my previous V3, so I updated
it here with some important changes (using a separate refault
distance model, instead of gluing it to the active/inactive model):
https://github.com/ryncsn/linux/commits/kasong/devel/refault-distance-2024-1/
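
(For readers following along, here is a minimal userspace sketch of the
refault-distance idea the series is built around -- hypothetical names,
and far simpler than the real logic in mm/workingset.c and the branch
above: on eviction a folio leaves behind a shadow recording the current
eviction "clock"; on refault, the distance between then and now decides
whether the folio still counts as workingset and deserves a younger
generation.)

#include <stdbool.h>
#include <stdio.h>

static unsigned long eviction_clock;    /* advances once per eviction */

struct shadow {
        unsigned long evicted_at;       /* clock snapshot at eviction */
};

/* on eviction, remember "when" the folio left memory */
static struct shadow make_shadow(void)
{
        struct shadow s = { .evicted_at = eviction_clock++ };

        return s;
}

/*
 * On refault, the number of evictions since this folio was evicted
 * approximates its re-access distance: if it is smaller than the number
 * of folios that could have been kept, the folio is workingset and
 * deserves a younger generation; otherwise the oldest one is fine.
 */
static bool refault_is_workingset(struct shadow s, unsigned long nr_protected)
{
        unsigned long distance = eviction_clock - s.evicted_at;

        return distance < nr_protected;
}

int main(void)
{
        struct shadow early = make_shadow();
        unsigned long i;

        for (i = 0; i < 1000; i++)      /* 1000 more evictions go by */
                make_shadow();

        printf("workingset: %d\n", refault_is_workingset(early, 4096));
        return 0;
}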

So far I can conclude that the previous result was not caused by the host
cache. I set up a bare metal test environment, strictly using your
config, and gathered some data (I also updated the refault distance
patch series, updated version in the link above; also the bare metal
machine has a fast NVMe so the performance gap wasn't as large, but it is
still stably observable):

With latest 6.7 (Your config):
==================================================================
Execution Results after 905 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 4025 27162035181.5 0.15 txn/s
------------------------------------------------------------------
TOTAL 4025 27162035181.5 0.15 txn/s

vmstat:
workingset_nodes 82996
workingset_refault_anon 269371
workingset_refault_file 37671044
workingset_activate_anon 121061
workingset_activate_file 8927227
workingset_restore_anon 121061
workingset_restore_file 2578269
workingset_nodereclaim 62394

lru_gen_full:
memcg 67 /machine.slice/libpod-38b33777db34724cf8edfbef1ac2e4fd0621f14151e241bbf1430d397d3dee51.scope/container
node 0
34 60565 21248 1254331
0 0r 0e 0p 121186r
169948e 0p
1 0r 0e 0p 156224r
222553e 0p
2 0r 0e 0p 0r
0e 4227858p
3 0r 0e 0p 0r
0e 0p
0 0 0 0
0 0
35 41132 714504 4300280
0 0 0 0 0
0 0
1 0 0 0 0
0 0
2 0 0 0 0
0 0
3 0 0 0 0
0 0
0 0 0 0
0 0
36 20586 473476 2105
0 0 0 0 0
0 0
1 0 0 0 0
0 0
2 0 0 0 0
0 0
3 0 0 0 0
0 0
0 0 0 0
0 0
37 2035 817 876
0 6647R 9116T 0 166836R
871850T 0
1 0R 0T 0 110807R
296447T 0
2 0R 268T 0 0R
4655276T 0
3 0R 0T 0 0R
0T 0
12510062L 639646O 11048666Y 45512N
24520F 23613A
iostat:
avg-cpu: %user %nice %system %iowait %steal %idle
76.29 0.00 12.09 3.50 1.44 6.69

Device tps kB_read/s kB_wrtn/s kB_dscd/s
kB_read kB_wrtn kB_dscd
dm-0 16.12 684.50 36.93 0.00
649996 35070 0
dm-1 0.05 1.10 0.00 0.00
1044 0 0
nvme0n1 16.47 700.22 39.09 0.00 664922
37118 0
nvme1n1 4905.93 205287.92 1030.70 0.00
194939353 978740 0
zram0 4440.17 5356.90 12404.81 0.00
5086856 11779480 0

free -m:
total used free shared buff/cache available
Mem: 31830 9475 403 0 21951 21918
Swap: 31829 6500 25329

With latest refault distance series (Your config):
==================================================================
Execution Results after 902 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 4260 27065448172.8 0.16 txn/s
------------------------------------------------------------------
TOTAL 4260 27065448172.8 0.16 txn/s

workingset_nodes 113824
workingset_refault_anon 293426
workingset_refault_file 42700484
workingset_activate_anon 0
workingset_activate_file 13410174
workingset_restore_anon 289106
workingset_restore_file 5592042
workingset_nodereclaim 33249

memcg 67 /machine.slice/libpod-8eff6b7b65e34fe0497ff5c0c88c750f6896c43a06bb26e8cd6470de596be76e.scope/container
node 0
15 261222 266350 65081
0 0r 0e 0p 185212r
2314329e 0p
1 0r 0e 0p 40887r
710312e 0p
2 0r 0e 0p 0r
0e 5026031p
3 0r 0e 0p 0r
0e 0p
0 0 0 0
0 0
16 199341 267661 5034442
0 0 0 0 0
0 0
1 0 0 0 0
0 0
2 0 0 0 0
0 0
3 0 0 0 0
0 0
0 0 0 0
0 0
17 120655 547852 592
0 0 0 0 0
0 0
1 0 0 0 0
0 0
2 0 0 0 0
0 0
3 0 0 0 0
0 0
0 0 0 0
0 0
18 55172 127769 3855
0 1910R 2975T 0 1894614R
4375361T 0
1 0R 0T 0 2099208R
2861460T 0
2 0R 27T 0 446000R
5569781T 0
3 0R 0T 0 0R
0T 0
2817512L 35421O 2579377Y 10452N
5517F 5414A

avg-cpu: %user %nice %system %iowait %steal %idle
76.34 0.00 11.25 4.22 1.29 6.90

Device tps kB_read/s kB_wrtn/s kB_dscd/s
kB_read kB_wrtn kB_dscd
dm-0 12.85 563.18 30.75 0.00
532390 29070 0
dm-1 0.05 1.10 0.00 0.00
1044 0 0
nvme0n1 13.22 578.97 32.92 0.00
547315 31119 0
nvme1n1 5384.11 229164.12 1038.95 0.00
216635713 982152 0
zram0 3590.88 4730.84 9633.71 0.00
4472204 9107032 0

total used free shared buff/cache available
Mem: 31830 10854 520 0 20455 20541
Swap: 31829 4508 27321

You can see that refault distance is actually protecting more anon pages
now, total IO on ZRAM is lower, it's mostly CPU bound, and the NVMe is
fast enough, resulting in better performance.

Things get more interesting if I disable the page idle flag (so the refs
bits are extended; in your config, the refs field is only one bit, so it
may overprotect file pages):

Latest 6.7 (your config with page idle flag disabled):
==================================================================
Execution Results after 904 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 4016 27122163703.9 0.15 txn/s
------------------------------------------------------------------
TOTAL 4016 27122163703.9 0.15 txn/s

workingset_nodes 99637
workingset_refault_anon 309548
workingset_refault_file 45896663
workingset_activate_anon 129401
workingset_activate_file 18461236
workingset_restore_anon 129400
workingset_restore_file 4963707
workingset_nodereclaim 43970

memcg 67 /machine.slice/libpod-7546463bd2b257a9b799817ca11bee1389d7deec20032529098520a89a207d7e.scope/container
node 0
27 103004 328070 269733
0 0r 0e 0p 509949r
1957117e 0p
1 0r 0e 0p 141642r
319695e 0p
2 0r 0e 0p 777835r
793518e 0p
3 0r 0e 0p 0r
0e 4333835p
0 0 0 0
0 0
28 82361 24748 5192182
0 0 0 0 0
0 0
1 0 0 0 0
0 0
2 0 0 0 0
0 0
3 0 0 0 0
0 0
0 0 0 0
0 0
29 57025 786386 5681
0 0 0 0 0
0 0
1 0 0 0 0
0 0
2 0 0 0 0
0 0
3 0 0 0 0
0 0
0 0 0 0
0 0
30 18619 76289 1273
0 4295R 8888T 0 222326R
1044601T 0
1 0R 0T 0 117646R
301735T 0
2 0R 0T 0 433431R
825516T 0
3 0R 1T 0 0R
4076839T 0
13369819L 603360O 11981074Y 47388N
26235F 25276A

avg-cpu: %user %nice %system %iowait %steal %idle
74.90 0.00 11.96 4.92 1.62 6.60

Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
dm-0 14.93 645.44 36.54 0.00 610150 34540 0
dm-1 0.05 1.10 0.00 0.00 1044 0 0
nvme0n1 15.30 661.23 38.71 0.00 625076 36589 0
nvme1n1 6352.42 240726.35 1036.47 0.00 227565845 979808 0
zram0 4189.65 4883.27 11876.36 0.00 4616304 11227080 0

total used free shared buff/cache available
Mem: 31830 9529 509 0 21791 21867
Swap: 31829 6441 25388

Refault distance series (your config with page idle flag disabled):
==================================================================
Execution Results after 901 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 4268 27060267967.7 0.16 txn/s
------------------------------------------------------------------
TOTAL 4268 27060267967.7 0.16 txn/s

workingset_nodes 115394
workingset_refault_anon 144774
workingset_refault_file 41055081
workingset_activate_anon 8
workingset_activate_file 13194460
workingset_restore_anon 144629
workingset_restore_file 187419
workingset_nodereclaim 19598

memcg 66 /machine.slice/libpod-4866051af817731602b37017b0e71feb2a8f2cbaa949f577e0444af01b4f3b0c.scope/container
node 0
12 213402 18054 1287510
0 0r 0e 0p 0r
15755e 0p
1 0r 0e 0p 0r
4771e 0p
2 0r 0e 0p 908r
6810e 0p
3 0r 0e 0p 0r
0e 3533888p
0 0 0 0
0 0
13 141209 10690 3571958
0 0 0 0 0
0 0
1 0 0 0 0
0 0
2 0 0 0 0
0 0
3 0 0 0 0
0 0
0 0 0 0
0 0
14 69327 1165064 34657
0 0 0 0 0
0 0
1 0 0 0 0
0 0
2 0 0 0 0
0 0
3 0 0 0 0
0 0
0 0 0 0
0 0
15 6404 21574 3363
0 953R 1773T 0 1263395R
3816639T 0
1 0R 0T 0 1164069R
1936973T 0
2 0R 0T 0 350041R
409121T 0
3 0R 3T 0 12305R
4767303T 0
3622197L 36916O 3338446Y 10409N
7120F 6945A

avg-cpu: %user %nice %system %iowait %steal %idle
75.79 0.00 10.68 3.91 1.18 8.44

Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
dm-0 12.66 547.71 38.73 0.00 526782 37248 0
dm-1 0.05 1.09 0.00 0.00 1044 0 0
nvme0n1 13.02 563.23 40.86 0.00 541708 39297 0
nvme1n1 4916.00 217529.48 1018.04 0.00 209217677 979136 0
zram0 1744.90 1970.86 5009.75 0.00 1895556 4818328 0

total used free shared buff/cache available
Mem: 31830 11713 485 0 19630 19684
Swap: 31829 2847 28982

The refault distance series is still better, and refaults are also
lower for both anon and file pages.

------
I did some more tests using MySQL and other workloads; no performance
drop observed so far.
And with a looped MongoDB test (keeping the 900s test running in a
loop) using my previous VM env (the SATA SSD vdb uses cache bypass, so
this is not a host cache issue here), I found one thing interesting
(refs is set to 2 bits in the config):

Loop test using 6.7:
STOCK_LEVEL 874 27246011368.3 0.03 txn/s
STOCK_LEVEL 1533 27023181579.6 0.06 txn/s
STOCK_LEVEL 1122 28044867987.6 0.04 txn/s
STOCK_LEVEL 1032 27378070931.9 0.04 txn/s
STOCK_LEVEL 1021 27612530579.1 0.04 txn/s
STOCK_LEVEL 750 28076187896.3 0.03 txn/s
STOCK_LEVEL 780 27519993034.8 0.03 txn/s
Refault stat here:
workingset_refault_anon 126369
workingset_refault_file 170389428
STOCK_LEVEL 750 27464016123.5 0.03 txn/s
STOCK_LEVEL 780 27529550313.0 0.03 txn/s
STOCK_LEVEL 750 28296286486.1 0.03 txn/s
STOCK_LEVEL 690 27504193850.3 0.03 txn/s
STOCK_LEVEL 716 28089360754.5 0.03 txn/s
STOCK_LEVEL 607 27852180474.3 0.02 txn/s
STOCK_LEVEL 689 27703367075.4 0.02 txn/s
STOCK_LEVEL 630 28184685482.7 0.02 txn/s
STOCK_LEVEL 450 28667721196.2 0.02 txn/s
STOCK_LEVEL 450 28047985314.4 0.02 txn/s
STOCK_LEVEL 450 28125609857.3 0.02 txn/s
STOCK_LEVEL 420 27393478488.0 0.02 txn/s
STOCK_LEVEL 420 27435537312.3 0.02 txn/s
STOCK_LEVEL 420 29060748699.2 0.01 txn/s
STOCK_LEVEL 420 28155584095.2 0.01 txn/s
STOCK_LEVEL 420 27888635407.0 0.02 txn/s
STOCK_LEVEL 420 27307856858.5 0.02 txn/s
STOCK_LEVEL 420 28842280889.0 0.01 txn/s
STOCK_LEVEL 390 27640696814.1 0.01 txn/s
STOCK_LEVEL 420 28471605716.7 0.01 txn/s
STOCK_LEVEL 420 27648174237.5 0.02 txn/s
STOCK_LEVEL 420 27848217938.7 0.02 txn/s
STOCK_LEVEL 420 27344698602.2 0.02 txn/s
STOCK_LEVEL 420 27046819537.2 0.02 txn/s
STOCK_LEVEL 420 27855626843.2 0.02 txn/s
STOCK_LEVEL 420 27971873627.9 0.02 txn/s
STOCK_LEVEL 420 28007014046.4 0.01 txn/s
STOCK_LEVEL 420 28445164626.1 0.01 txn/s
STOCK_LEVEL 420 27902621006.5 0.02 txn/s
STOCK_LEVEL 420 28282574433.3 0.01 txn/s
STOCK_LEVEL 390 27161599608.7 0.01 txn/s

Using the refault distance series:
STOCK_LEVEL 2605 27120667462.8 0.10 txn/s
STOCK_LEVEL 3000 27106854857.2 0.11 txn/s
STOCK_LEVEL 2925 27066601064.4 0.11 txn/s
STOCK_LEVEL 2757 27035248005.2 0.10 txn/s
STOCK_LEVEL 1325 28053716046.8 0.05 txn/s
STOCK_LEVEL 717 27455091366.3 0.03 txn/s
STOCK_LEVEL 967 27404085208.2 0.04 txn/s
Refault stat here:
workingset_refault_anon 109337
workingset_refault_file 191249716
STOCK_LEVEL 716 27448213557.2 0.03 txn/s
STOCK_LEVEL 807 28607974517.8 0.03 txn/s
STOCK_LEVEL 760 28081442513.2 0.03 txn/s
STOCK_LEVEL 745 28594555797.6 0.03 txn/s
STOCK_LEVEL 450 27999536348.3 0.02 txn/s
STOCK_LEVEL 598 27095531895.4 0.02 txn/s
STOCK_LEVEL 711 27623112841.1 0.03 txn/s
STOCK_LEVEL 540 28358770820.6 0.02 txn/s
STOCK_LEVEL 480 27734277554.5 0.02 txn/s
STOCK_LEVEL 450 27313906125.3 0.02 txn/s
STOCK_LEVEL 480 27487299100.4 0.02 txn/s
STOCK_LEVEL 480 27804589683.5 0.02 txn/s
STOCK_LEVEL 480 28242205820.8 0.02 txn/s
STOCK_LEVEL 480 27540680102.3 0.02 txn/s
STOCK_LEVEL 450 27428645816.8 0.02 txn/s
STOCK_LEVEL 480 27946866129.2 0.02 txn/s
STOCK_LEVEL 480 27266068262.3 0.02 txn/s
STOCK_LEVEL 450 27267487051.5 0.02 txn/s
STOCK_LEVEL 480 27896369224.8 0.02 txn/s
STOCK_LEVEL 480 28784662706.1 0.02 txn/s
STOCK_LEVEL 450 27179853217.8 0.02 txn/s
STOCK_LEVEL 480 28170594101.7 0.02 txn/s
STOCK_LEVEL 450 28084651341.0 0.02 txn/s
STOCK_LEVEL 480 27901608868.6 0.02 txn/s
STOCK_LEVEL 480 27323790886.6 0.02 txn/s
STOCK_LEVEL 480 28891008895.4 0.02 txn/s
STOCK_LEVEL 480 27964563148.0 0.02 txn/s
STOCK_LEVEL 450 27942421198.4 0.02 txn/s
STOCK_LEVEL 480 28833968825.8 0.02 txn/s
STOCK_LEVEL 480 28090975437.9 0.02 txn/s
STOCK_LEVEL 480 27915246877.4 0.02 txn/s

It seems the performance drains as the test keeps running (possibly
caused by MongoDB's anon usage rising, or by the DB's internal
caching/logging), which explains why the performance gap looks smaller
in a long-term test. The VM also has poor IO performance, so the test
runs much more slowly and takes a long time to warm up.

But I think it's clear that the refault distance series boosts
performance during warm-up, and it also looks better for long-running
workloads, especially on machines with low IO performance.

I still can't explain why workingset_refault is higher for the
better-performing case in the VM environment... I can re-setup, reboot
or randomize the test and the performance stays the same here. My
guess is that it may be related to readahead or some kernel-space IO
path issue; the actual IO usage is lower when the refault distance
series is applied.

I noticed a slight performance regression (1 - 3%) for pure in-memory
FIO though; the "bulk series" I sent previously can help improve it.

There is a bug in my previous V3 that causes the PID controller to
lose control over the long term (due to a buggy bit operation, my
bad), which I've fixed in the link above. I can send out a new series
if you think it's acceptable.

2024-01-11 18:25:01

by Kairui Song

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

Yu Zhao <[email protected]> 于2024年1月11日周四 15:02写道:
> Could you try the attached patch on the mainline v6.7 and see how it
> compares with the results above? Thanks.

Hi Yu,

Thanks for the patch. It helped to some degree, but is not as
effective. On that exclusive bare metal machine, I redid the setup,
rebased on 6.7 mainline and reran the test:

Refault distance series:
==================================================================
Execution Results after 901 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 4224 27030724835.9 0.16 txn/s
------------------------------------------------------------------
TOTAL 4224 27030724835.9 0.16 txn/s

workingset_nodes 111349
workingset_refault_anon 261331
workingset_refault_file 42862224
workingset_activate_anon 0
workingset_activate_file 13803763
workingset_restore_anon 250743
workingset_restore_file 599031
workingset_nodereclaim 23708

memcg 67 /machine.slice/libpod-edbf5a3cb2574c60180c1fb5ddb2fb160df00bcee3758b7649f2b31baa97ed78.scope/container
node 0
10 347163 518379 207449
0 0r 2e 0p 33017r
1726749e 0p
1 0r 0e 0p 7278r
496268e 0p
2 0r 0e 0p 19789r
55418e 0p
3 0r 0e 0p 0r
0e 4747801p
0 0 0 0
0 0
11 283279 154400 4791558
0 0 0 0 0
0 0
1 0 0 0 0
0 0
2 0 0 0 0
0 0
3 0 0 0 0
0 0
0 0 0 0
0 0
12 158723 431513 37647
0 0 0 0 0
0 0
1 0 0 0 0
0 0
2 0 0 0 0
0 0
3 0 0 0 0
0 0
0 0 0 0
0 0
13 44775 104986 27258
0 576R 982T 0 2488768R
5769505T 0
1 0R 0T 0 2335910R
3357277T 0
2 0R 0T 0 647398R
753021T 0
3 0R 20T 0 52725R
4740516T 0
2819476L 31196O 2551928Y 8298N
5549F 5329A

Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
dm-0 12.81 546.32 39.04 0.00 520178 37171 0
dm-1 0.05 1.10 0.00 0.00 1044 0 0
nvme0n1 13.17 561.99 41.19 0.00 535103 39219 0
nvme1n1 5220.39 227385.96 1028.17 0.00 216505545 978976 0
zram0 2440.61 2856.32 6907.13 0.00 2719644 6576628 0

total used free shared buff/cache available
Mem: 31830 11251 332 0 20246 20144
Swap: 31829 3761 28068

Your attachment:
==================================================================
Execution Results after 905 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 4070 27170023578.4 0.15 txn/s
------------------------------------------------------------------
TOTAL 4070 27170023578.4 0.15 txn/s

workingset_nodes 121864
workingset_refault_anon 430917
workingset_refault_file 42915675
workingset_activate_anon 100194
workingset_activate_file 21619480
workingset_restore_anon 100194
workingset_restore_file 165054
workingset_nodereclaim 26851

memcg 65 /machine.slice/libpod-c6d8c5fedb9b390ec7f1db7d0d7c57d6a284a94e74a3923d93ea0ce4e4ffdf28.scope/container
node 0
8 418689 55033 106862
0 16r 17e 0p 2789768r
6034831e 0p
1 0r 0e 0p 239664r
490278e 0p
2 0r 0e 0p 79145r
126408e 0p
3 23r 23e 0p 23404r
27107e 4736933p
0 0 0 0
0 0
9 322798 237713 4759110
0 0 0 0 0
0 0
1 0 0 0 0
0 0
2 0 0 0 0
0 0
3 0 0 0 0
0 0
0 0 0 0
0 0
10 182729 942701 5348
0 0 0 0 0
0 0
1 0 0 0 0
0 0
2 0 0 0 0
0 0
3 0 0 0 0
0 0
0 0 0 0
0 0
11 120287 560 375
0 25187R 29324T 0 1679308R
4256147T 0
1 0R 0T 0 153592R
364122T 0
2 0R 0T 0 51825R
98646T 0
3 101R 2944T 0 13985R
4743515T 0
7702245L 865749O 6514831Y 16843N
15088F 14167A

Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
dm-0 11.49 489.97 41.80 0.00 488006 41633 0
dm-1 0.05 1.05 0.00 0.00 1044 0 0
nvme0n1 11.83 504.95 43.86 0.00 502932 43682 0
nvme0n1 5145.44 218803.29 984.46 0.00 217928081 980520 0
zram0 3164.11 4399.55 8257.84 0.00 4381952 8224812 0

total used free shared buff/cache available
Mem: 31830 11583 310 1 19935 19809
Swap: 31829 3710 28119

The refault distance series still has better performance and lower
total IO.

Similar result on that VM:
==================================================================
Execution Results after 907 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 1667 27151581934.5 0.06 txn/s
------------------------------------------------------------------
TOTAL 1667 27151581934.5 0.06 txn/s

While the refault distance series had about 2500 - 2600 txns, mainline
6.7 had about 800 - 900 txns.

Loop test so far:
Using the refault distance series (previous result; it doesn't change
much anyway):
STOCK_LEVEL 2605 27120667462.8 0.10 txn/s
STOCK_LEVEL 3000 27106854857.2 0.11 txn/s
STOCK_LEVEL 2925 27066601064.4 0.11 txn/s
STOCK_LEVEL 2757 27035248005.2 0.10 txn/s
STOCK_LEVEL 1325 28053716046.8 0.05 txn/s
STOCK_LEVEL 717 27455091366.3 0.03 txn/s
STOCK_LEVEL 967 27404085208.2 0.04 txn/s
Refault stat here:
workingset_refault_anon 109337
workingset_refault_file 191249716

Using the attached patch:
STOCK_LEVEL 1667 27151581934.5 0.06 txn/s
STOCK_LEVEL 2999 27085125092.3 0.11 txn/s
STOCK_LEVEL 2874 27120635371.2 0.11 txn/s
STOCK_LEVEL 2658 27139142413.9 0.10 txn/s
STOCK_LEVEL 1254 27526009063.7 0.05 txn/s
STOCK_LEVEL 993 28065506801.8 0.04 txn/s
STOCK_LEVEL 954 27226012906.3 0.04 txn/s
Refault stat here:
workingset_refault_anon 383579
workingset_refault_file 205493832

The peak performance is almost equal, but it still starts slowly, and
refaults are higher too. File refaults might be skewed by some
IO-layer issue, but anon refaults are always accurate.

I see the improvements you made in the attached patch; I think they
actually do not conflict with the refault distance series. Maybe they
can be combined for an even better result.

Refault distance (originally used by the active/inactive LRU) is used
here to prioritize evicted pages based on their eviction distance and
to add extra feedback to the PID controller and generation selection.
Meanwhile, the PID info recorded in the page flags/shadow represents a
page's access pattern before eviction, and all the checks and logic
around it can also be improved.
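
For reference, here is a minimal standalone sketch of the classic
refault distance check, loosely modeled on mm/workingset.c; the
constants and the shadow packing below are simplified illustrations,
not the kernel's exact definitions:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Simplified illustration: the shadow entry left behind at eviction
 * stores the LRU "age" (an eviction counter). On refault, the distance
 * between then and now tells whether the LRU could have kept the folio
 * had it been larger. The kernel also packs memcg/node/workingset bits
 * into the shadow; that is omitted here.
 */
#define EVICTION_BITS  40
#define EVICTION_MASK  ((1ULL << EVICTION_BITS) - 1)

static uint64_t pack_shadow(uint64_t eviction)
{
        return eviction & EVICTION_MASK;
}

static bool refault_in_workingset(uint64_t shadow, uint64_t now,
                                  uint64_t workingset_size)
{
        uint64_t eviction = shadow & EVICTION_MASK;
        uint64_t distance = (now - eviction) & EVICTION_MASK;

        /* Short distance: recently evicted, worth protecting. */
        return distance <= workingset_size;
}

int main(void)
{
        uint64_t shadow = pack_shadow(1000);

        printf("refault in working set: %d\n",
               refault_in_workingset(shadow, 1500, 2048));
        return 0;
}

The series, as described above, feeds a comparable distance signal into
MGLRU's PID controller and generation placement instead of a binary
activation decision.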

One critical effect of the refault distance series that boosts the
MongoDB startup (and I haven't seen any negative effect of it on other
tests / workloads / benchmarks yet, except the overhead of the memcg
statistics themselves) is that it prevents overprotection of tier 0
pages: a tier 0 page that is evicted but refaults very quickly
(refault distance < LRU size / MAX_NR_GENS; this value may be worth
further adjustment, but with LRU size / MAX_NR_GENS it can be imagined
as a small shadow generation holding these page shadows...) will be
categorised as tier 1 and protected. Otherwise, if I got everything
right, when most pages are stuck in tier 0 and keep refaulting, tier 0
has a very high refault rate and no pages are protected, until
randomness causes quick repeated reads of some pages, so they get
promoted to tier 3 and protected.
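
Roughly, that policy can be sketched like this (illustrative only, not
the actual patch; lru_size and the plain integer tier are
simplifications made for the sketch):

/*
 * Illustrative-only: a tier 0 folio that refaults with a short
 * distance relative to the LRU size is treated as tier 1 so it can be
 * protected; everything else keeps its tier.
 */
#define MAX_NR_GENS     4

static inline int tier_after_quick_refault(unsigned long refault_distance,
                                           unsigned long lru_size, int tier)
{
        if (tier == 0 && refault_distance < lru_size / MAX_NR_GENS)
                return 1;       /* quick refault: lift out of tier 0 */

        return tier;
}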

Now min_seq contains the lower-tier pages, and new pages are added to
min_seq too, so min_seq stays around for a long time, while min_seq + 1
holds the protected, full-refs tier 3 pages; those stay long enough to
get promoted to tier 3 again, so they are always kept in memory. At
that point MongoDB performs well even without the refault distance
series, but reaching it may take a long time (~15 min for the MongoDB
test on the SATA SSD, which is based on a real workload), long enough
to cause a real issue.

And this also means the PID controller won't react to workload changes
fast enough.

The series also adjusts the refs value of anon refaults based on
refault distance; it tries to split the whole LRU into at least two
generations for refaulted pages (only a page with refault distance <
LRU size / MIN_NR_GENS gets the full refs set; otherwise refs - 1 is
set as a penalty for a page that was evicted and unused for a long
time, which complies with the LRU's nature). This actually seems to
decrease anon page refaults.
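
A rough sketch of that refs adjustment (again illustrative only; the
function name and lru_size are hypothetical, MIN_NR_GENS matches the
kernel's value of 2):

/*
 * Illustrative-only: a short refault distance keeps the full refs
 * value, a long one pays a one-step refs penalty for having been
 * evicted and unused for a long time.
 */
#define MIN_NR_GENS     2

static inline int refs_after_anon_refault(unsigned long refault_distance,
                                          unsigned long lru_size, int refs)
{
        if (refault_distance < lru_size / MIN_NR_GENS)
                return refs;                    /* keep full refs */

        return refs > 0 ? refs - 1 : 0;         /* penalty */
}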

There are some other issues that the refault distance series is trying
to solve too, e.g. if a user agent forces MGLRU to age periodically
for proactive memory reclaim, or MGLRU simply ages quickly, min_seq
grows periodically and the PID controller doesn't get enough feedback
with the previous logic.
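
For context, such a user agent usually just writes the aging command to
the lru_gen debugfs interface; a minimal sketch (the command format
follows Documentation/admin-guide/mm/multigen_lru.rst, and the IDs
below are placeholders a real agent would read back from
/sys/kernel/debug/lru_gen first):

#include <stdio.h>

/* Force MGLRU aging: "+ memcg_id node_id max_gen [can_swap [force_scan]]" */
int main(void)
{
        FILE *f = fopen("/sys/kernel/debug/lru_gen", "w");

        if (!f)
                return 1;

        fprintf(f, "+ 67 0 37\n");      /* placeholder memcg/node/max_gen */
        return fclose(f) ? 1 : 0;
}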

2024-01-12 01:46:35

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

On Thu, Jan 11, 2024 at 11:24 AM Kairui Song <[email protected]> wrote:
>
> Yu Zhao <[email protected]> 于2024年1月11日周四 15:02写道:
> > Could you try the attached patch on the mainline v6.7 and see how it
> > compares with the results above? Thanks.
>
> Hi Yu,
>
> Thanks for the patch, it helped in some degrees, but not as effective:
> On that exclusive baremetal, I did a resetup, rebase on 6.7 mainline
> and reran the test:
>
> Refault distance series:
> ==================================================================
> Execution Results after 901 seconds
> ------------------------------------------------------------------
> Executed Time (µs) Rate
> STOCK_LEVEL 4224 27030724835.9 0.16 txn/s
> ------------------------------------------------------------------
> TOTAL 4224 27030724835.9 0.16 txn/s
>
> workingset_nodes 111349
> workingset_refault_anon 261331
> workingset_refault_file 42862224
> workingset_activate_anon 0
> workingset_activate_file 13803763
> workingset_restore_anon 250743
> workingset_restore_file 599031
> workingset_nodereclaim 23708
>
> memcg 67 /machine.slice/libpod-edbf5a3cb2574c60180c1fb5ddb2fb160df00bcee3758b7649f2b31baa97ed78.scope/container
> node 0
> 10 347163 518379 207449
> 0 0r 2e 0p 33017r
> 1726749e 0p
> 1 0r 0e 0p 7278r
> 496268e 0p
> 2 0r 0e 0p 19789r
> 55418e 0p
> 3 0r 0e 0p 0r
> 0e 4747801p
> 0 0 0 0
> 0 0
> 11 283279 154400 4791558
> 0 0 0 0 0
> 0 0
> 1 0 0 0 0
> 0 0
> 2 0 0 0 0
> 0 0
> 3 0 0 0 0
> 0 0
> 0 0 0 0
> 0 0
> 12 158723 431513 37647
> 0 0 0 0 0
> 0 0
> 1 0 0 0 0
> 0 0
> 2 0 0 0 0
> 0 0
> 3 0 0 0 0
> 0 0
> 0 0 0 0
> 0 0
> 13 44775 104986 27258
> 0 576R 982T 0 2488768R
> 5769505T 0
> 1 0R 0T 0 2335910R
> 3357277T 0
> 2 0R 0T 0 647398R
> 753021T 0
> 3 0R 20T 0 52725R
> 4740516T 0
> 2819476L 31196O 2551928Y 8298N
> 5549F 5329A
>
> Device tps kB_read/s kB_wrtn/s kB_dscd/s
> kB_read kB_wrtn kB_dscd
> dm-0 12.81 546.32 39.04 0.00
> 520178 37171 0
> dm-1 0.05 1.10 0.00 0.00
> 1044 0 0
> nvme0n1 13.17 561.99 41.19 0.00
> 535103 39219 0
> nvme1n1 5220.39 227385.96 1028.17 0.00
> 216505545 978976 0
> zram0 2440.61 2856.32 6907.13 0.00
> 2719644 6576628 0
>
> total used free shared buff/cache available
> Mem: 31830 11251 332 0 20246 20144
> Swap: 31829 3761 28068
>
> Your attachment:
> ==================================================================
> Execution Results after 905 seconds
> ------------------------------------------------------------------
> Executed Time (µs) Rate
> STOCK_LEVEL 4070 27170023578.4 0.15 txn/s
> ------------------------------------------------------------------
> TOTAL 4070 27170023578.4 0.15 txn/s
>
> workingset_nodes 121864
> workingset_refault_anon 430917
> workingset_refault_file 42915675
> workingset_activate_anon 100194
> workingset_activate_file 21619480
> workingset_restore_anon 100194
> workingset_restore_file 165054
> workingset_nodereclaim 26851
>
> memcg 65 /machine.slice/libpod-c6d8c5fedb9b390ec7f1db7d0d7c57d6a284a94e74a3923d93ea0ce4e4ffdf28.scope/container
> node 0
> 8 418689 55033 106862
> 0 16r 17e 0p 2789768r
> 6034831e 0p
> 1 0r 0e 0p 239664r
> 490278e 0p
> 2 0r 0e 0p 79145r
> 126408e 0p
> 3 23r 23e 0p 23404r
> 27107e 4736933p
> 0 0 0 0
> 0 0
> 9 322798 237713 4759110
> 0 0 0 0 0
> 0 0
> 1 0 0 0 0
> 0 0
> 2 0 0 0 0
> 0 0
> 3 0 0 0 0
> 0 0
> 0 0 0 0
> 0 0
> 10 182729 942701 5348
> 0 0 0 0 0
> 0 0
> 1 0 0 0 0
> 0 0
> 2 0 0 0 0
> 0 0
> 3 0 0 0 0
> 0 0
> 0 0 0 0
> 0 0
> 11 120287 560 375
> 0 25187R 29324T 0 1679308R
> 4256147T 0
> 1 0R 0T 0 153592R
> 364122T 0
> 2 0R 0T 0 51825R
> 98646T 0
> 3 101R 2944T 0 13985R
> 4743515T 0
> 7702245L 865749O 6514831Y 16843N
> 15088F 14167A
>
> Device tps kB_read/s kB_wrtn/s kB_dscd/s
> kB_read kB_wrtn kB_dscd
> dm-0 11.49 489.97 41.80 0.00
> 488006 41633 0
> dm-1 0.05 1.05 0.00 0.00
> 1044 0 0
> nvme0n1 11.83 504.95 43.86 0.00
> 502932 43682 0
> nvme0n1 5145.44 218803.29 984.46 0.00
> 217928081 980520 0
> zram0 3164.11 4399.55 8257.84 0.00
> 4381952 8224812 0
>
> total used free shared buff/cache available
> Mem: 31830 11583 310 1 19935 19809
> Swap: 31829 3710 28119
>
> Refault distance series still have a better performance and lower total IO.
>
> Similar result on that VM:
> ==================================================================
> Execution Results after 907 seconds
> ------------------------------------------------------------------
> Executed Time (µs) Rate
> STOCK_LEVEL 1667 27151581934.5 0.06 txn/s
> ------------------------------------------------------------------
> TOTAL 1667 27151581934.5 0.06 txn/s
>
> While refault distance series had about ~2500 - 2600 txns, mainline
> 6.7 had about ~800 - 900 txns.
>
> Loop test so far:
> Using refault distance seriese (previous result, it doesn't change much anyway):
> STOCK_LEVEL 2605 27120667462.8 0.10 txn/s
> STOCK_LEVEL 3000 27106854857.2 0.11 txn/s
> STOCK_LEVEL 2925 27066601064.4 0.11 txn/s
> STOCK_LEVEL 2757 27035248005.2 0.10 txn/s
> STOCK_LEVEL 1325 28053716046.8 0.05 txn/s
> STOCK_LEVEL 717 27455091366.3 0.03 txn/s
> STOCK_LEVEL 967 27404085208.2 0.04 txn/s
> Refault stat here:
> workingset_refault_anon 109337
> workingset_refault_file 191249716
>
> Using the attached patch:
> STOCK_LEVEL 1667 27151581934.5 0.06 txn/s
> STOCK_LEVEL 2999 27085125092.3 0.11 txn/s
> STOCK_LEVEL 2874 27120635371.2 0.11 txn/s
> STOCK_LEVEL 2658 27139142413.9 0.10 txn/s
> STOCK_LEVEL 1254 27526009063.7 0.05 txn/s
> STOCK_LEVEL 993 28065506801.8 0.04 txn/s
> STOCK_LEVEL 954 27226012906.3 0.04 txn/s
> Refault stat here:
> workingset_refault_anon 383579
> workingset_refault_file 205493832
>
> The peak performance almost equal, but still starts slow, refault is
> higher too. File refault might be interfered due to some IO layer
> issue, but anon refault is always accurate.
>
> I see the improvement you did in the attachment patch, I think
> actually they are not in conflict with the refault distance series.
> Maybe they can be combined into a even better result.
>
> Refault distance (which originally used by active/inactive LRU) is
> used here to give evicted pages priorities based on eviction distance
> and add extra feedback to PID and gen. While the PID info recorded in
> page flags/shadow represents pages's access pattern before eviction,
> and all the check and logics about it can also be improved.
>
> One critical effect of the refault distance series that boost the
> MongoDB startup (and I haven't see any negative effect of it on other
> test / workload / benchmark yet, except the overhead of memcg
> statistics itself) is it prevents overprotecting of tier 0 page: that
> is, a tier 0 page evicted but refaulted very quickly (refault distance
> < LRU / MAX_NR_GEN, this value may worth some more adjustment, but
> with LRU / MAX_NR_GEN, it can be imaged as an idea that having a small
> shadow gen holding these page shadows...) will be categorised as tier
> 1 and get protect. Other wise, if I got everything right, when most
> pages are stuck in tier 0 and keep refaulting, tier 0 will have a very
> high refault rate, and no pages will be protect, until randomness
> causes quick repeated read of some page, so they get promoted to tier
> 3 get get protected.
>
> Now min_seq contains lower tier pages and new pages will be added to
> min_seq too, so min_seq will stay for a long time, while min_seq + 1
> holds protected full ref tier 3 pages and they stay long enough to get
> promoted as tier 3 again, so they will always be kept in memory.
> Now MongoDB will perform well even without refault distance series,
> but this period may take a long time (~15 min for the MongoDB test for
> SATA SSD, which is based on a real workload), long enough to cause
> real issue.
>
> And this also means PID won't react to workload change fast enough.
>
> Also the anon refault's refs value is adjusted by refault distance too
> in the series, it tries to split the whole LRU as at least two gens
> for refaulted pages (only page with refault distance < LRU /
> MIN_NR_GEN will have full refs set, else will have refs - 1 set as
> penalty for long time evicted and unused page, which complies with
> LRU's nature). Which seems actually decreased refault of anon pages.
>
> There are some other issue that refault distance series is trying to
> solve too, eg. if there is a user agent force MGLRU to age
> periodically for proactive memory reclaim, or MGLRU simply ages fast,
> min_seq will grow periodically and PID won't catch enough feedback
> using previous logic.

Thanks. So far I've been taking shots in the dark since I haven't been
able to reproduce your results on bare metal or in VMs. So either the
benchmark itself is not reliable, which according to your results is
unlikely, or I've been using different hardware configurations. Do you
think you can share an off-the-shelf hardware configuration that I can
buy and use to reliably reproduce your results? Ideally we'd use the
exact same model from, for example, Dell, HP or Lenovo.