2024-03-08 03:11:53

by 李培锋

Subject: [PATCH v2 0/2] reclaim contended folios asynchronously instead of promoting them

From: Peifeng Li <[email protected]>

Commit 6d4675e60135 ("mm: don't be stuck to rmap lock on reclaim path")
prevents the reclaim path from becoming stuck on the rmap lock. However,
it reinserts those folios at the head of the LRU during shrink_folio_list,
even if those folios are very cold.

This can have a detrimental effect on performance by increasing refaults
and the likelihood of OOM (Out of Memory) killing.

This patchset introduces a new kthread, kshrinkd, to asynchronously
reclaim contended folios rather than promoting them, thereby preventing
excessive violations of LRU ordering. We observed a noticeable decrease
in refaults and OOM kills as a result.

-v2:
* rewrite the commit messages;
* rebase on top of mm-unstable
-v1:
https://lore.kernel.org/linux-mm/[email protected]/

Peifeng Li (2):
mm/rmap: provide folio_referenced with the options to try_lock or lock
mm: vmscan: reclaim contended folios asynchronously instead of
promoting them

include/linux/mmzone.h | 6 +
include/linux/rmap.h | 5 +-
include/linux/swap.h | 3 +
include/linux/vm_event_item.h | 2 +
mm/memory_hotplug.c | 2 +
mm/rmap.c | 5 +-
mm/vmscan.c | 205 +++++++++++++++++++++++++++++++++-
mm/vmstat.c | 2 +
8 files changed, 221 insertions(+), 9 deletions(-)

--
2.34.1



2024-03-08 03:12:19

by 李培锋

Subject: [PATCH v2 1/2] mm/rmap: provide folio_referenced with the options to try_lock or lock

From: Peifeng Li <[email protected]>

Commit 6d4675e60135 ("mm: don't be stuck to rmap lock on reclaim path")
unconditionally switched folio_referenced() to try_lock to avoid rmap
lock contention. This patch introduces a parameter that allows
folio_referenced() to genuinely wait for and hold the lock in certain
scenarios.
Until the new context is introduced, every caller keeps try_lock set to
true, preserving the current behavior.
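
For illustration (not part of the diff below), the two modes a caller
can request after this change:

	unsigned long vm_flags;

	/* existing reclaim paths: keep today's behavior and bail out on
	 * rmap lock contention (folio_referenced() may return -1)
	 */
	folio_referenced(folio, 1, sc->target_mem_cgroup, &vm_flags,
			 1 /* rw_try_lock */);

	/* a later caller that must not bail out: block until the rmap
	 * lock is taken
	 */
	folio_referenced(folio, 1, sc->target_mem_cgroup, &vm_flags,
			 0 /* rw_try_lock */);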

Signed-off-by: Peifeng Li <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
include/linux/rmap.h | 5 +++--
mm/rmap.c | 5 +++--
mm/vmscan.c | 16 ++++++++++++++--
3 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index b7944a833668..846b2617a9f2 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -623,7 +623,8 @@ static inline int folio_try_share_anon_rmap_pmd(struct folio *folio,
* Called from mm/vmscan.c to handle paging out
*/
int folio_referenced(struct folio *, int is_locked,
- struct mem_cgroup *memcg, unsigned long *vm_flags);
+ struct mem_cgroup *memcg, unsigned long *vm_flags,
+ unsigned int rw_try_lock);

void try_to_migrate(struct folio *folio, enum ttu_flags flags);
void try_to_unmap(struct folio *, enum ttu_flags flags);
@@ -739,7 +740,7 @@ struct anon_vma *folio_lock_anon_vma_read(struct folio *folio,

static inline int folio_referenced(struct folio *folio, int is_locked,
struct mem_cgroup *memcg,
- unsigned long *vm_flags)
+ unsigned long *vm_flags, unsigned int rw_try_lock)
{
*vm_flags = 0;
return 0;
diff --git a/mm/rmap.c b/mm/rmap.c
index 3746a5531018..7d01f81ca587 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -952,6 +952,7 @@ static bool invalid_folio_referenced_vma(struct vm_area_struct *vma, void *arg)
* @is_locked: Caller holds lock on the folio.
* @memcg: target memory cgroup
* @vm_flags: A combination of all the vma->vm_flags which referenced the folio.
+ * @rw_try_lock: whether to use try_lock in rmap_walk
*
* Quick test_and_clear_referenced for all mappings of a folio,
*
@@ -959,7 +960,7 @@ static bool invalid_folio_referenced_vma(struct vm_area_struct *vma, void *arg)
* the function bailed out due to rmap lock contention.
*/
int folio_referenced(struct folio *folio, int is_locked,
- struct mem_cgroup *memcg, unsigned long *vm_flags)
+ struct mem_cgroup *memcg, unsigned long *vm_flags, unsigned int rw_try_lock)
{
int we_locked = 0;
struct folio_referenced_arg pra = {
@@ -970,7 +971,7 @@ int folio_referenced(struct folio *folio, int is_locked,
.rmap_one = folio_referenced_one,
.arg = (void *)&pra,
.anon_lock = folio_lock_anon_vma_read,
- .try_lock = true,
+ .try_lock = rw_try_lock ? true : false,
.invalid_vma = invalid_folio_referenced_vma,
};

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a0e53999a865..509b5e0dffd3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -147,6 +147,9 @@ struct scan_control {
/* Always discard instead of demoting to lower tier memory */
unsigned int no_demotion:1;

+ /* whether to use try_lock in rmap_walk */
+ unsigned int rw_try_lock:1;
+
/* Allocation order */
s8 order;

@@ -850,7 +853,7 @@ static enum folio_references folio_check_references(struct folio *folio,
unsigned long vm_flags;

referenced_ptes = folio_referenced(folio, 1, sc->target_mem_cgroup,
- &vm_flags);
+ &vm_flags, sc->rw_try_lock);
referenced_folio = folio_test_clear_referenced(folio);

/*
@@ -1522,6 +1525,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
struct scan_control sc = {
.gfp_mask = GFP_KERNEL,
.may_unmap = 1,
+ .rw_try_lock = 1,
};
struct reclaim_stat stat;
unsigned int nr_reclaimed;
@@ -2059,7 +2063,7 @@ static void shrink_active_list(unsigned long nr_to_scan,

/* Referenced or rmap lock contention: rotate */
if (folio_referenced(folio, 0, sc->target_mem_cgroup,
- &vm_flags) != 0) {
+ &vm_flags, sc->rw_try_lock) != 0) {
/*
* Identify referenced, file-backed active folios and
* give them one more trip around the active list. So
@@ -2114,6 +2118,7 @@ static unsigned int reclaim_folio_list(struct list_head *folio_list,
.may_unmap = 1,
.may_swap = 1,
.no_demotion = 1,
+ .rw_try_lock = 1,
};

nr_reclaimed = shrink_folio_list(folio_list, pgdat, &sc, &dummy_stat, ignore_references);
@@ -5459,6 +5464,7 @@ static ssize_t lru_gen_seq_write(struct file *file, const char __user *src,
.may_swap = true,
.reclaim_idx = MAX_NR_ZONES - 1,
.gfp_mask = GFP_KERNEL,
+ .rw_try_lock = 1,
};

buf = kvmalloc(len + 1, GFP_KERNEL);
@@ -6436,6 +6442,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
.may_writepage = !laptop_mode,
.may_unmap = 1,
.may_swap = 1,
+ .rw_try_lock = 1,
};

/*
@@ -6481,6 +6488,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
.may_unmap = 1,
.reclaim_idx = MAX_NR_ZONES - 1,
.may_swap = !noswap,
+ .rw_try_lock = 1,
};

WARN_ON_ONCE(!current->reclaim_state);
@@ -6527,6 +6535,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
.may_unmap = 1,
.may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
.proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
+ .rw_try_lock = 1,
};
/*
* Traverse the ZONELIST_FALLBACK zonelist of the current node to put
@@ -6788,6 +6797,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
.gfp_mask = GFP_KERNEL,
.order = order,
.may_unmap = 1,
+ .rw_try_lock = 1,
};

set_task_reclaim_state(current, &sc.reclaim_state);
@@ -7257,6 +7267,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
.may_unmap = 1,
.may_swap = 1,
.hibernation_mode = 1,
+ .rw_try_lock = 1,
};
struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask);
unsigned long nr_reclaimed;
@@ -7415,6 +7426,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
.may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
.may_swap = 1,
.reclaim_idx = gfp_zone(gfp_mask),
+ .rw_try_lock = 1,
};
unsigned long pflags;

--
2.34.1


2024-03-08 03:12:43

by 李培锋

Subject: [PATCH v2 2/2] mm: vmscan: reclaim contended folios asynchronously instead of promoting them

From: Peifeng Li <[email protected]>

Commit 6d4675e60135 ("mm: don't be stuck to rmap lock on reclaim path")
prevents the reclaim path from becoming stuck on the rmap lock. However,
it reinserts those folios at the head of the LRU during shrink_folio_list,
even if those folios are very cold.

While running an Android phone with 6GiB of memory for 2 hours, I
observed that 321728 folios were incorrectly placed back at the head of
the inactive LRU due to lock contention, which amounts to approximately
44 folios per second. Similarly, the same test conducted on 4GiB phones
shows that about 106 folios are improperly promoted per second. This can
have a detrimental effect on performance by increasing refaults and
the likelihood of OOM (Out of Memory) killing.

For this reason, this patch introduces a separate list for contended
folios and wakes up a new kthread, kshrinkd, to reclaim them
asynchronously, thus preventing excessive violations of LRU ordering.
The new thread sets try_lock to false and always waits until it holds
the rmap lock.

Below is some data collected from two phones running monkey for two
hours (less is better):

Phone with 6GiB memory:
                     w/o patch     w/patch       delta
workingset_refault   1451043114    1408015823    -2.9%
lmkd count           9231          9009          -2.4%

Phone with 4GiB memory:
                     w/o patch     w/patch       delta
workingset_refault   2674649801    2581150132    -3.4%
lmkd count           13800         13061         -5.3%

The Monkey is a program that runs on your emulator or device and generates
pseudo-random streams of user events such as clicks, touches, or gestures,
as well as a number of system-level events.

The Android low memory killer daemon (lmkd) process monitors the memory
state of a running Android system and reacts to high memory pressure by
killing the least essential processes to keep the system performing at
acceptable levels.
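
In short, the hand-off between the reclaim path and kshrinkd, condensed
from the diff below:

	/* reclaim context: sc->rw_try_lock == 1, sc->need_kshrinkd == 1 */
	shrink_folio_list()
	  folio_check_references() -> FOLIOREF_LOCK_CONTENDED
	    -> collect the folio on a local contended_folios list
	  splice contended_folios onto pgdat->kshrinkd_folios under
	  pgdat->kf_lock, then wake pgdat->kshrinkd_wait

	/* kshrinkd, one kthread per node: sc.rw_try_lock == 0 */
	kshrinkd()
	  splice pgdat->kshrinkd_folios onto a private list
	  shrink_folio_list() again, this time waiting for the rmap lock
	  put any folios that still fail to reclaim back on the LRU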

Signed-off-by: Peifeng Li <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
include/linux/mmzone.h | 6 ++
include/linux/swap.h | 3 +
include/linux/vm_event_item.h | 2 +
mm/memory_hotplug.c | 2 +
mm/vmscan.c | 189 +++++++++++++++++++++++++++++++++-
mm/vmstat.c | 2 +
6 files changed, 201 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c11b7cde81ef..19acacf92cc9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1332,6 +1332,12 @@ typedef struct pglist_data {

int kswapd_failures; /* Number of 'reclaimed == 0' runs */

+ struct list_head kshrinkd_folios; /* rmap_walk contended folios list */
+ spinlock_t kf_lock; /* Protect kshrinkd_folios list */
+
+ struct task_struct *kshrinkd; /* reclaim kshrinkd_folios */
+ wait_queue_head_t kshrinkd_wait;
+
#ifdef CONFIG_COMPACTION
int kcompactd_max_order;
enum zone_type kcompactd_highest_zoneidx;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2955f7a78d8d..6d15b577b6a3 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -438,6 +438,9 @@ void check_move_unevictable_folios(struct folio_batch *fbatch);
extern void __meminit kswapd_run(int nid);
extern void __meminit kswapd_stop(int nid);

+extern void kshrinkd_run(int nid);
+extern void kshrinkd_stop(int nid);
+
#ifdef CONFIG_SWAP

int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page,
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 747943bc8cc2..ee95ab138c87 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -38,9 +38,11 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
PGLAZYFREED,
PGREFILL,
PGREUSE,
+ PGSTEAL_KSHRINKD,
PGSTEAL_KSWAPD,
PGSTEAL_DIRECT,
PGSTEAL_KHUGEPAGED,
+ PGSCAN_KSHRINKD,
PGSCAN_KSWAPD,
PGSCAN_DIRECT,
PGSCAN_KHUGEPAGED,
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index a444e2d7dd2b..5e1c326a8bde 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1218,6 +1218,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,

kswapd_run(nid);
kcompactd_run(nid);
+ kshrinkd_run(nid);

writeback_set_ratelimit();

@@ -2098,6 +2099,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
}

if (arg.status_change_nid >= 0) {
+ kshrinkd_stop(node);
kcompactd_stop(node);
kswapd_stop(node);
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 509b5e0dffd3..ef540a520b47 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -150,6 +150,9 @@ struct scan_control {
/* whether to use try_lock in rmap_walk */
unsigned int rw_try_lock:1;

+ /* wake kshrinkd to reclaim folios whose rmap trylock was contended */
+ unsigned int need_kshrinkd:1;
+
/* Allocation order */
s8 order;

@@ -201,6 +204,17 @@ struct scan_control {
*/
int vm_swappiness = 60;

+/*
+ * Wake up kshrinkd to reclaim the folios whose rmap locks were found
+ * contended during shrink_folio_list, instead of putting them back at
+ * the head of the LRU, to avoid breaking LRU ordering.
+ */
+static void wakeup_kshrinkd(struct pglist_data *pgdat)
+{
+ if (likely(pgdat->kshrinkd))
+ wake_up_interruptible(&pgdat->kshrinkd_wait);
+}
+
#ifdef CONFIG_MEMCG

/* Returns true for reclaim through cgroup limits or cgroup interfaces. */
@@ -844,6 +858,7 @@ enum folio_references {
FOLIOREF_RECLAIM_CLEAN,
FOLIOREF_KEEP,
FOLIOREF_ACTIVATE,
+ FOLIOREF_LOCK_CONTENDED,
};

static enum folio_references folio_check_references(struct folio *folio,
@@ -864,8 +879,12 @@ static enum folio_references folio_check_references(struct folio *folio,
return FOLIOREF_ACTIVATE;

/* rmap lock contention: rotate */
- if (referenced_ptes == -1)
- return FOLIOREF_KEEP;
+ if (referenced_ptes == -1) {
+ if (sc->need_kshrinkd && folio_pgdat(folio)->kshrinkd)
+ return FOLIOREF_LOCK_CONTENDED;
+ else
+ return FOLIOREF_KEEP;
+ }

if (referenced_ptes) {
/*
@@ -1035,6 +1054,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
struct folio_batch free_folios;
LIST_HEAD(ret_folios);
LIST_HEAD(demote_folios);
+ LIST_HEAD(contended_folios);
unsigned int nr_reclaimed = 0;
unsigned int pgactivate = 0;
bool do_demote_pass;
@@ -1052,6 +1072,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
enum folio_references references = FOLIOREF_RECLAIM;
bool dirty, writeback;
unsigned int nr_pages;
+ bool lock_contended = false;

cond_resched();

@@ -1193,6 +1214,9 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
case FOLIOREF_KEEP:
stat->nr_ref_keep += nr_pages;
goto keep_locked;
+ case FOLIOREF_LOCK_CONTENDED:
+ lock_contended = true;
+ goto keep_locked;
case FOLIOREF_RECLAIM:
case FOLIOREF_RECLAIM_CLEAN:
; /* try to reclaim the folio below */
@@ -1470,7 +1494,10 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
keep_locked:
folio_unlock(folio);
keep:
- list_add(&folio->lru, &ret_folios);
+ if (unlikely(lock_contended))
+ list_add(&folio->lru, &contended_folios);
+ else
+ list_add(&folio->lru, &ret_folios);
VM_BUG_ON_FOLIO(folio_test_lru(folio) ||
folio_test_unevictable(folio), folio);
}
@@ -1512,6 +1539,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
free_unref_folios(&free_folios);

list_splice(&ret_folios, folio_list);
+
+ if (!list_empty(&contended_folios)) {
+ spin_lock_irq(&pgdat->kf_lock);
+ list_splice(&contended_folios, &pgdat->kshrinkd_folios);
+ spin_unlock_irq(&pgdat->kf_lock);
+ wakeup_kshrinkd(pgdat);
+ }
+
count_vm_events(PGACTIVATE, pgactivate);

if (plug)
@@ -1526,6 +1561,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
.gfp_mask = GFP_KERNEL,
.may_unmap = 1,
.rw_try_lock = 1,
+ .need_kshrinkd = 0,
};
struct reclaim_stat stat;
unsigned int nr_reclaimed;
@@ -2119,6 +2155,7 @@ static unsigned int reclaim_folio_list(struct list_head *folio_list,
.may_swap = 1,
.no_demotion = 1,
.rw_try_lock = 1,
+ .need_kshrinkd = 0,
};

nr_reclaimed = shrink_folio_list(folio_list, pgdat, &sc, &dummy_stat, ignore_references);
@@ -5465,6 +5502,7 @@ static ssize_t lru_gen_seq_write(struct file *file, const char __user *src,
.reclaim_idx = MAX_NR_ZONES - 1,
.gfp_mask = GFP_KERNEL,
.rw_try_lock = 1,
+ .need_kshrinkd = 0,
};

buf = kvmalloc(len + 1, GFP_KERNEL);
@@ -6443,6 +6481,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
.may_unmap = 1,
.may_swap = 1,
.rw_try_lock = 1,
+ .need_kshrinkd = 1,
};

/*
@@ -6489,6 +6528,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
.reclaim_idx = MAX_NR_ZONES - 1,
.may_swap = !noswap,
.rw_try_lock = 1,
+ .need_kshrinkd = 0,
};

WARN_ON_ONCE(!current->reclaim_state);
@@ -6536,6 +6576,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
.may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
.proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
.rw_try_lock = 1,
+ .need_kshrinkd = 0,
};
/*
* Traverse the ZONELIST_FALLBACK zonelist of the current node to put
@@ -6798,6 +6839,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
.order = order,
.may_unmap = 1,
.rw_try_lock = 1,
+ .need_kshrinkd = 1,
};

set_task_reclaim_state(current, &sc.reclaim_state);
@@ -7268,6 +7310,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
.may_swap = 1,
.hibernation_mode = 1,
.rw_try_lock = 1,
+ .need_kshrinkd = 0,
};
struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask);
unsigned long nr_reclaimed;
@@ -7338,6 +7381,145 @@ static int __init kswapd_init(void)

module_init(kswapd_init)

+static int kshrinkd_should_run(pg_data_t *pgdat)
+{
+ int should_run;
+
+ spin_lock_irq(&pgdat->kf_lock);
+ should_run = !list_empty(&pgdat->kshrinkd_folios);
+ spin_unlock_irq(&pgdat->kf_lock);
+
+ return should_run;
+}
+
+static unsigned long kshrinkd_reclaim_folios(struct list_head *folio_list,
+ struct pglist_data *pgdat)
+{
+ struct reclaim_stat dummy_stat;
+ unsigned int nr_reclaimed = 0;
+ struct scan_control sc = {
+ .gfp_mask = GFP_KERNEL,
+ .may_writepage = 1,
+ .may_unmap = 1,
+ .may_swap = 1,
+ .no_demotion = 1,
+ .rw_try_lock = 0,
+ .need_kshrinkd = 0,
+ };
+
+ if (list_empty(folio_list))
+ return nr_reclaimed;
+
+ nr_reclaimed = shrink_folio_list(folio_list, pgdat, &sc, &dummy_stat, false);
+
+ return nr_reclaimed;
+}
+
+/*
+ * The background kshrinkd daemon, started as a kernel thread
+ * from the init process.
+ *
+ * kshrinkd reclaims folios whose rmap locks were found contended
+ * during shrink_folio_list, instead of putting them back at the
+ * head of the LRU directly, to avoid breaking LRU ordering.
+ */
+
+static int kshrinkd(void *p)
+{
+ pg_data_t *pgdat;
+ LIST_HEAD(tmp_contended_folios);
+
+ pgdat = (pg_data_t *)p;
+
+ current->flags |= PF_MEMALLOC | PF_KSWAPD;
+ set_freezable();
+
+ while (!kthread_should_stop()) {
+ unsigned long nr_reclaimed = 0;
+ unsigned long nr_putback = 0;
+
+ wait_event_freezable(pgdat->kshrinkd_wait,
+ kshrinkd_should_run(pgdat));
+
+ /* splice rmap_walk contended folios to tmp-list */
+ spin_lock_irq(&pgdat->kf_lock);
+ list_splice(&pgdat->kshrinkd_folios, &tmp_contended_folios);
+ INIT_LIST_HEAD(&pgdat->kshrinkd_folios);
+ spin_unlock_irq(&pgdat->kf_lock);
+
+ /* reclaim rmap_walk contended folios */
+ nr_reclaimed = kshrinkd_reclaim_folios(&tmp_contended_folios, pgdat);
+ __count_vm_events(PGSTEAL_KSHRINKD, nr_reclaimed);
+
+ /* put the folios that failed to be reclaimed back on the LRU */
+ while (!list_empty(&tmp_contended_folios)) {
+ struct folio *folio = lru_to_folio(&tmp_contended_folios);
+
+ nr_putback += folio_nr_pages(folio);
+ list_del(&folio->lru);
+ folio_putback_lru(folio);
+ }
+
+ __count_vm_events(PGSCAN_KSHRINKD, nr_reclaimed + nr_putback);
+ }
+
+ current->flags &= ~(PF_MEMALLOC | PF_KSWAPD);
+
+ return 0;
+}
+
+/*
+ * This kshrinkd start function will be called by init and node-hot-add.
+ */
+void kshrinkd_run(int nid)
+{
+ pg_data_t *pgdat = NODE_DATA(nid);
+
+ if (pgdat->kshrinkd)
+ return;
+
+ pgdat->kshrinkd = kthread_run(kshrinkd, pgdat, "kshrinkd%d", nid);
+ if (IS_ERR(pgdat->kshrinkd)) {
+ /* failure to start kshrinkd */
+ WARN_ON_ONCE(system_state < SYSTEM_RUNNING);
+ pr_err("Failed to start kshrinkd on node %d\n", nid);
+ pgdat->kshrinkd = NULL;
+ }
+}
+
+/*
+ * Called by memory hotplug when all memory in a node is offlined. Caller must
+ * be holding mem_hotplug_begin/done().
+ */
+void kshrinkd_stop(int nid)
+{
+ struct task_struct *kshrinkd = NODE_DATA(nid)->kshrinkd;
+
+ if (kshrinkd) {
+ kthread_stop(kshrinkd);
+ NODE_DATA(nid)->kshrinkd = NULL;
+ }
+}
+
+static int __init kshrinkd_init(void)
+{
+ int nid;
+
+ for_each_node_state(nid, N_MEMORY) {
+ pg_data_t *pgdat = NODE_DATA(nid);
+
+ spin_lock_init(&pgdat->kf_lock);
+ init_waitqueue_head(&pgdat->kshrinkd_wait);
+ INIT_LIST_HEAD(&pgdat->kshrinkd_folios);
+
+ kshrinkd_run(nid);
+ }
+
+ return 0;
+}
+
+module_init(kshrinkd_init)
+
#ifdef CONFIG_NUMA
/*
* Node reclaim mode
@@ -7427,6 +7609,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
.may_swap = 1,
.reclaim_idx = gfp_zone(gfp_mask),
.rw_try_lock = 1,
+ .need_kshrinkd = 1,
};
unsigned long pflags;

diff --git a/mm/vmstat.c b/mm/vmstat.c
index db79935e4a54..76d8a3b2d1a8 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1279,9 +1279,11 @@ const char * const vmstat_text[] = {

"pgrefill",
"pgreuse",
+ "pgsteal_kshrinkd",
"pgsteal_kswapd",
"pgsteal_direct",
"pgsteal_khugepaged",
+ "pgscan_kshrinkd",
"pgscan_kswapd",
"pgscan_direct",
"pgscan_khugepaged",
--
2.34.1


2024-03-08 04:57:05

by Matthew Wilcox

Subject: Re: [PATCH v2 0/2] reclaim contended folios asynchronously instead of promoting them

On Fri, Mar 08, 2024 at 11:11:24AM +0800, [email protected] wrote:
> Commit 6d4675e60135 ("mm: don't be stuck to rmap lock on reclaim path")
> prevents the reclaim path from becoming stuck on the rmap lock. However,
> it reinserts those folios at the head of the LRU during shrink_folio_list,
> even if those folios are very cold.

This seems like a lot of new code. Did you consider something simpler
like this?

Also, this is Minchan's patch you're complaining about. Add him to the
cc.

+++ b/mm/vmscan.c
@@ -817,6 +817,7 @@ enum folio_references {
FOLIOREF_RECLAIM,
FOLIOREF_RECLAIM_CLEAN,
FOLIOREF_KEEP,
+ FOLIOREF_RESCAN,
FOLIOREF_ACTIVATE,
};

@@ -837,9 +838,9 @@ static enum folio_references folio_check_references(struct folio *folio,
if (vm_flags & VM_LOCKED)
return FOLIOREF_ACTIVATE;

- /* rmap lock contention: rotate */
+ /* rmap lock contention: keep at the tail */
if (referenced_ptes == -1)
- return FOLIOREF_KEEP;
+ return FOLIOREF_RESCAN;

if (referenced_ptes) {
/*
@@ -1164,6 +1165,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
case FOLIOREF_ACTIVATE:
goto activate_locked;
case FOLIOREF_KEEP:
+ case FOLIOREF_RESCAN:
stat->nr_ref_keep += nr_pages;
goto keep_locked;
case FOLIOREF_RECLAIM:
@@ -1446,7 +1448,10 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
keep_locked:
folio_unlock(folio);
keep:
- list_add(&folio->lru, &ret_folios);
+ if (references == FOLIOREF_RESCAN)
+ list_add(&folio->lru, &rescan_folios);
+ else
+ list_add(&folio->lru, &ret_folios);
VM_BUG_ON_FOLIO(folio_test_lru(folio) ||
folio_test_unevictable(folio), folio);
}


2024-03-12 09:25:49

by 李培锋

Subject: Re: [PATCH v2 0/2] reclaim contended folios asynchronously instead of promoting them


On 2024/3/8 14:41, 李培锋 wrote:
>
>
>> On 2024/3/8 12:56, Matthew Wilcox wrote:
>> On Fri, Mar 08, 2024 at 11:11:24AM +0800, [email protected] wrote:
>>> Commit 6d4675e60135 ("mm: don't be stuck to rmap lock on reclaim path")
>>> prevents the reclaim path from becoming stuck on the rmap lock. However,
>>> it reinserts those folios at the head of the LRU during shrink_folio_list,
>>> even if those folios are very cold.
>> This seems like a lot of new code. Did you consider something simpler
>> like this?
>>
>> Also, this is Minchan's patch you're complaining about. Add him to the
>> cc.
>>
>> +++ b/mm/vmscan.c
>> @@ -817,6 +817,7 @@ enum folio_references {
>> FOLIOREF_RECLAIM,
>> FOLIOREF_RECLAIM_CLEAN,
>> FOLIOREF_KEEP,
>> + FOLIOREF_RESCAN,
>> FOLIOREF_ACTIVATE,
>> };
>>
>> @@ -837,9 +838,9 @@ static enum folio_references folio_check_references(struct folio *folio,
>> if (vm_flags & VM_LOCKED)
>> return FOLIOREF_ACTIVATE;
>>
>> - /* rmap lock contention: rotate */
>> + /* rmap lock contention: keep at the tail */
>> if (referenced_ptes == -1)
>> - return FOLIOREF_KEEP;
>> + return FOLIOREF_RESCAN;
>>
>> if (referenced_ptes) {
>> /*
>> @@ -1164,6 +1165,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>> case FOLIOREF_ACTIVATE:
>> goto activate_locked;
>> case FOLIOREF_KEEP:
>> + case FOLIOREF_RESCAN:
>> stat->nr_ref_keep += nr_pages;
>> goto keep_locked;
>> case FOLIOREF_RECLAIM:
>> @@ -1446,7 +1448,10 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>> keep_locked:
>> folio_unlock(folio);
>> keep:
>> - list_add(&folio->lru, &ret_folios);
>> + if (references == FOLIOREF_RESCAN)
>> + list_add(&folio->lru, &rescan_folios);
>> + else
>> + list_add(&folio->lru, &ret_folios);
>> VM_BUG_ON_FOLIO(folio_test_lru(folio) ||
>> folio_test_unevictable(folio), folio);
>> }
>
> Actually, we have tested the implementation method you mentioned:
> putting the contended folios back at the tail of the LRU during
> shrink_folio_list and rescanning them in the next shrink_folio_list.
>
> In some cases, we found another serious problem: more and more
> contended folios piled up at the tail of the LRU, which led to a
> serious low-memory situation, because none of the isolated folios
> could be reclaimed due to lock contention during shrink_folio_list.

Let me provide more detail.

In fact, we have tested the implementation you mentioned: if a folio
hits rmap lock contention during shrink_folio_list, it is put back at
the tail of the LRU and rescanned in the next shrink_folio_list.

During testing, we found a serious problem: in some runs of
shrink_folio_list, none of the isolated folios could be reclaimed
because of rmap lock contention, resulting in very inefficient memory
reclaim and insufficient free memory.

The specific reasons are as follows:

Under memory pressure, folios that are put back at the tail of the LRU
due to rmap lock contention during shrink_folio_list will soon be
isolated again by shrink_inactive_list and retried by the next
shrink_folio_list. But in the short term these folios are still likely
to fail to reclaim due to the same rmap lock contention, and are put
back at the tail of the LRU once more.

As the test progressed, more and more folios with a high probability of
rmap lock contention accumulated at the tail of the LRU, until
eventually none of the folios isolated by shrink_inactive_list could be
reclaimed in shrink_folio_list.

The shrink_inactive_list() procedure does the following:

shrink_inactive_list()
 -> isolate_lru_folios():
    isolate 32 folios from the tail of the LRU (some of which may have
    been put back there by the previous shrink_folio_list because of
    rmap lock contention)
 -> shrink_folio_list():
    reclaim folios and put the rmap lock-contended folios back at the
    tail of the LRU

For example, assume that all the folios put back to the LRU because of
rmap lock contention in the previous shrink_folio_list still cannot be
reclaimed, again because of rmap lock contention:

1st shrink_inactive_list():
 -> isolate_lru_folios(): isolate 32 folios
 -> shrink_folio_list(): reclaim 24 folios, put back 8 lock-contended folios

2nd shrink_inactive_list():
 -> isolate_lru_folios(): isolate 32 folios, including 8 lock-contended folios
 -> shrink_folio_list(): reclaim 16 folios, put back 16 lock-contended folios

3rd shrink_inactive_list():
 -> isolate_lru_folios(): isolate 32 folios, including 16 lock-contended folios
 -> shrink_folio_list(): reclaim 8 folios, put back 24 lock-contended folios

4th shrink_inactive_list():
 -> isolate_lru_folios(): isolate 32 folios, including 24 lock-contended folios
 -> shrink_folio_list(): reclaim 0 folios, put back 32 lock-contended folios

5th shrink_inactive_list():
 -> isolate_lru_folios(): isolate 32 folios, including 32 lock-contended folios
 -> shrink_folio_list(): reclaim 0 folios, put back 32 lock-contended folios
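
For illustration, the progression above can be reproduced with a toy
model, assuming contended folios stay contended across passes and that
8 of the newly scanned folios hit rmap lock contention in each pass:

	#include <stdio.h>

	/*
	 * Toy model: every pass isolates 32 folios from the LRU tail,
	 * previously contended folios remain contended, and 8 of the
	 * newly scanned folios become contended as well.
	 */
	int main(void)
	{
		int contended = 0;	/* contended folios waiting at the LRU tail */

		for (int pass = 1; pass <= 5; pass++) {
			int isolated = 32;
			int fresh = isolated - contended;
			int newly_contended = fresh >= 8 ? 8 : fresh;
			int reclaimed = fresh - newly_contended;

			contended += newly_contended;	/* all go back to the tail */
			printf("pass %d: reclaimed %d, put back %d contended folios\n",
			       pass, reclaimed, contended);
		}
		return 0;
	}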