2024-02-19 14:18:06

by 李培锋

Subject: [PATCH 0/2] Support kshrinkd

From: lipeifeng <[email protected]>

Commit 6d4675e60135 ("mm: don't be stuck to rmap lock on reclaim path")
keeps the reclaim path from getting stuck on the rmap lock. But it
leaves some folios in the LRU out of aging order, because folios that
are contended in rmap_walk are put back at the head of the LRU during
shrink_folio_list even if they are very cold.

This patchset sets up a new kthread, kshrinkd, to reclaim the folios
that were contended in rmap_walk during shrink_folio_list, to avoid
breaking the rules of LRU.

lipeifeng (2):
mm/rmap: support folio_referenced to control if try_lock in rmap_walk
mm: support kshrinkd

include/linux/mmzone.h | 6 ++
include/linux/rmap.h | 5 +-
include/linux/swap.h | 3 +
include/linux/vm_event_item.h | 2 +
mm/memory_hotplug.c | 2 +
mm/rmap.c | 5 +-
mm/vmscan.c | 205 ++++++++++++++++++++++++++++++++++++++++--
mm/vmstat.c | 2 +
8 files changed, 221 insertions(+), 9 deletions(-)

--
2.7.4
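
To make the ordering concern in the cover letter above concrete, here is a minimal userspace sketch in C. It is a toy model, not kernel code: `struct toy_lru` and the `toy_*` helpers are hypothetical names invented for illustration, and the list is a plain array whose slot 0 is the head and whose last slot is the tail that reclaim scans first. It only shows that a folio taken from the tail and put back at the head ends up behind every other folio on the list.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LRU_MAX 64

/* Toy inactive list: slot 0 is the head, slot len - 1 is the tail. */
struct toy_lru {
	int ids[LRU_MAX];
	size_t len;
};

/* Remove and return the folio at the tail, the next reclaim candidate. */
static int toy_isolate_tail(struct toy_lru *lru)
{
	assert(lru->len > 0);
	return lru->ids[--lru->len];
}

/* Put a folio back at the head, as the cover letter describes for contended folios. */
static void toy_putback_head(struct toy_lru *lru, int id)
{
	assert(lru->len < LRU_MAX);
	memmove(&lru->ids[1], &lru->ids[0], lru->len * sizeof(lru->ids[0]));
	lru->ids[0] = id;
	lru->len++;
}

/* Number of other folios that sit between `id` and the tail. */
static size_t toy_folios_ahead(const struct toy_lru *lru, int id)
{
	for (size_t i = 0; i < lru->len; i++)
		if (lru->ids[i] == id)
			return lru->len - 1 - i;
	return (size_t)-1; /* not on the list */
}
```

For example, with four folios on the list, isolating the tail folio and putting it back at the head leaves three other folios ahead of it, however cold it is.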



2024-02-19 14:18:10

by 李培锋

Subject: [PATCH 2/2] mm: support kshrinkd

From: lipeifeng <[email protected]>

Commit 6d4675e60135 ("mm: don't be stuck to rmap lock on reclaim path")
keeps the reclaim path from getting stuck on the rmap lock. But it
leaves some folios in the LRU out of aging order, because a folio that
is contended in rmap_walk is put back at the head of the LRU during
shrink_folio_list even if it is very cold.

A 300-hour Monkey test on a phone shows that almost one-third of the
contended pages can be freed successfully on the next attempt, so
putting those folios back at the head of the LRU breaks the rules of
LRU:
- pgsteal_kshrinkd 262577
- pgscan_kshrinkd 795503

For this reason, the patch sets up a new kthread, kshrinkd, to reclaim
the folios that were contended in rmap_walk during shrink_folio_list,
to avoid breaking the rules of LRU.

Signed-off-by: lipeifeng <[email protected]>
---
include/linux/mmzone.h | 6 ++
include/linux/swap.h | 3 +
include/linux/vm_event_item.h | 2 +
mm/memory_hotplug.c | 2 +
mm/vmscan.c | 189 +++++++++++++++++++++++++++++++++++++++++-
mm/vmstat.c | 2 +
6 files changed, 201 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a497f18..83d7202 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1329,6 +1329,12 @@ typedef struct pglist_data {

int kswapd_failures; /* Number of 'reclaimed == 0' runs */

+ struct list_head kshrinkd_folios; /* folios contended in rmap_walk */
+ spinlock_t kf_lock; /* protects kshrinkd_folios */
+
+ struct task_struct *kshrinkd; /* kthread that reclaims kshrinkd_folios */
+ wait_queue_head_t kshrinkd_wait;
+
#ifdef CONFIG_COMPACTION
int kcompactd_max_order;
enum zone_type kcompactd_highest_zoneidx;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4db00dd..155fcb6 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -435,6 +435,9 @@ void check_move_unevictable_folios(struct folio_batch *fbatch);
extern void __meminit kswapd_run(int nid);
extern void __meminit kswapd_stop(int nid);

+extern void kshrinkd_run(int nid);
+extern void kshrinkd_stop(int nid);
+
#ifdef CONFIG_SWAP

int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page,
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 747943b..ee95ab1 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -38,9 +38,11 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
PGLAZYFREED,
PGREFILL,
PGREUSE,
+ PGSTEAL_KSHRINKD,
PGSTEAL_KSWAPD,
PGSTEAL_DIRECT,
PGSTEAL_KHUGEPAGED,
+ PGSCAN_KSHRINKD,
PGSCAN_KSWAPD,
PGSCAN_DIRECT,
PGSCAN_KHUGEPAGED,
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 2189099..1b6c4c6 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1209,6 +1209,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,

kswapd_run(nid);
kcompactd_run(nid);
+ kshrinkd_run(nid);

writeback_set_ratelimit();

@@ -2092,6 +2093,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
}

if (arg.status_change_nid >= 0) {
+ kshrinkd_stop(node);
kcompactd_stop(node);
kswapd_stop(node);
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0296d48..63e4fd4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -139,6 +139,9 @@ struct scan_control {
/* if try_lock in rmap_walk */
unsigned int rw_try_lock:1;

+ /* hand folios to kshrinkd if the rmap_walk trylock is contended */
+ unsigned int need_kshrinkd:1;
+
/* Allocation order */
s8 order;

@@ -190,6 +193,17 @@ struct scan_control {
*/
int vm_swappiness = 60;

+/*
+ * Wake up kshrinkd to reclaim the folios that hit lock contention in
+ * rmap_walk during shrink_folio_list, instead of putting them back at
+ * the head of the LRU, to avoid breaking the rules of LRU.
+ */
+static void wakeup_kshrinkd(struct pglist_data *pgdat)
+{
+ if (likely(pgdat->kshrinkd))
+ wake_up_interruptible(&pgdat->kshrinkd_wait);
+}
+
#ifdef CONFIG_MEMCG

/* Returns true for reclaim through cgroup limits or cgroup interfaces. */
@@ -821,6 +835,7 @@ enum folio_references {
FOLIOREF_RECLAIM_CLEAN,
FOLIOREF_KEEP,
FOLIOREF_ACTIVATE,
+ FOLIOREF_LOCK_CONTENDED,
};

static enum folio_references folio_check_references(struct folio *folio,
@@ -841,8 +856,12 @@ static enum folio_references folio_check_references(struct folio *folio,
return FOLIOREF_ACTIVATE;

/* rmap lock contention: rotate */
- if (referenced_ptes == -1)
- return FOLIOREF_KEEP;
+ if (referenced_ptes == -1) {
+ if (sc->need_kshrinkd && folio_pgdat(folio)->kshrinkd)
+ return FOLIOREF_LOCK_CONTENDED;
+ else
+ return FOLIOREF_KEEP;
+ }

if (referenced_ptes) {
/*
@@ -1012,6 +1031,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
LIST_HEAD(ret_folios);
LIST_HEAD(free_folios);
LIST_HEAD(demote_folios);
+ LIST_HEAD(contended_folios);
unsigned int nr_reclaimed = 0;
unsigned int pgactivate = 0;
bool do_demote_pass;
@@ -1028,6 +1048,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
enum folio_references references = FOLIOREF_RECLAIM;
bool dirty, writeback;
unsigned int nr_pages;
+ bool lock_contended = false;

cond_resched();

@@ -1169,6 +1190,9 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
case FOLIOREF_KEEP:
stat->nr_ref_keep += nr_pages;
goto keep_locked;
+ case FOLIOREF_LOCK_CONTENDED:
+ lock_contended = true;
+ goto keep_locked;
case FOLIOREF_RECLAIM:
case FOLIOREF_RECLAIM_CLEAN:
; /* try to reclaim the folio below */
@@ -1449,7 +1473,10 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
keep_locked:
folio_unlock(folio);
keep:
- list_add(&folio->lru, &ret_folios);
+ if (unlikely(lock_contended))
+ list_add(&folio->lru, &contended_folios);
+ else
+ list_add(&folio->lru, &ret_folios);
VM_BUG_ON_FOLIO(folio_test_lru(folio) ||
folio_test_unevictable(folio), folio);
}
@@ -1491,6 +1518,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
free_unref_page_list(&free_folios);

list_splice(&ret_folios, folio_list);
+
+ if (!list_empty(&contended_folios)) {
+ spin_lock_irq(&pgdat->kf_lock);
+ list_splice(&contended_folios, &pgdat->kshrinkd_folios);
+ spin_unlock_irq(&pgdat->kf_lock);
+ wakeup_kshrinkd(pgdat);
+ }
+
count_vm_events(PGACTIVATE, pgactivate);

if (plug)
@@ -1505,6 +1540,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
.gfp_mask = GFP_KERNEL,
.may_unmap = 1,
.rw_try_lock = 1,
+ .need_kshrinkd = 0,
};
struct reclaim_stat stat;
unsigned int nr_reclaimed;
@@ -2101,6 +2137,7 @@ static unsigned int reclaim_folio_list(struct list_head *folio_list,
.may_swap = 1,
.no_demotion = 1,
.rw_try_lock = 1,
+ .need_kshrinkd = 0,
};

nr_reclaimed = shrink_folio_list(folio_list, pgdat, &sc, &dummy_stat, false);
@@ -5448,6 +5485,7 @@ static ssize_t lru_gen_seq_write(struct file *file, const char __user *src,
.reclaim_idx = MAX_NR_ZONES - 1,
.gfp_mask = GFP_KERNEL,
.rw_try_lock = 1,
+ .need_kshrinkd = 0,
};

buf = kvmalloc(len + 1, GFP_KERNEL);
@@ -6421,6 +6459,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
.may_unmap = 1,
.may_swap = 1,
.rw_try_lock = 1,
+ .need_kshrinkd = 1,
};

/*
@@ -6467,6 +6506,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
.reclaim_idx = MAX_NR_ZONES - 1,
.may_swap = !noswap,
.rw_try_lock = 1,
+ .need_kshrinkd = 0,
};

WARN_ON_ONCE(!current->reclaim_state);
@@ -6512,6 +6552,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
.may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
.proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
.rw_try_lock = 1,
+ .need_kshrinkd = 0,
};
/*
* Traverse the ZONELIST_FALLBACK zonelist of the current node to put
@@ -6774,6 +6815,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
.order = order,
.may_unmap = 1,
.rw_try_lock = 1,
+ .need_kshrinkd = 1,
};

set_task_reclaim_state(current, &sc.reclaim_state);
@@ -7234,6 +7276,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
.may_swap = 1,
.hibernation_mode = 1,
.rw_try_lock = 1,
+ .need_kshrinkd = 0,
};
struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask);
unsigned long nr_reclaimed;
@@ -7304,6 +7347,145 @@ static int __init kswapd_init(void)

module_init(kswapd_init)

+static int kshrinkd_should_run(pg_data_t *pgdat)
+{
+ int should_run;
+
+ spin_lock_irq(&pgdat->kf_lock);
+ should_run = !list_empty(&pgdat->kshrinkd_folios);
+ spin_unlock_irq(&pgdat->kf_lock);
+
+ return should_run;
+}
+
+static unsigned long kshrinkd_reclaim_folios(struct list_head *folio_list,
+ struct pglist_data *pgdat)
+{
+ struct reclaim_stat dummy_stat;
+ unsigned int nr_reclaimed = 0;
+ struct scan_control sc = {
+ .gfp_mask = GFP_KERNEL,
+ .may_writepage = 1,
+ .may_unmap = 1,
+ .may_swap = 1,
+ .no_demotion = 1,
+ .rw_try_lock = 0,
+ .need_kshrinkd = 0,
+ };
+
+ if (list_empty(folio_list))
+ return nr_reclaimed;
+
+ nr_reclaimed = shrink_folio_list(folio_list, pgdat, &sc, &dummy_stat, false);
+
+ return nr_reclaimed;
+}
+
+/*
+ * The background kshrink daemon, started as a kernel thread
+ * from the init process.
+ *
+ * kshrinkd reclaims the folios that hit lock contention in rmap_walk
+ * during shrink_folio_list, instead of putting them straight back at
+ * the head of the LRU, to avoid breaking the rules of LRU.
+ */
+
+static int kshrinkd(void *p)
+{
+ pg_data_t *pgdat;
+ LIST_HEAD(tmp_contended_folios);
+
+ pgdat = (pg_data_t *)p;
+
+ current->flags |= PF_MEMALLOC | PF_KSWAPD;
+ set_freezable();
+
+ while (!kthread_should_stop()) {
+ unsigned long nr_reclaimed = 0;
+ unsigned long nr_putback = 0;
+
+ wait_event_freezable(pgdat->kshrinkd_wait,
+ kshrinkd_should_run(pgdat));
+
+ /* splice rmap_walk contended folios to tmp-list */
+ spin_lock_irq(&pgdat->kf_lock);
+ list_splice(&pgdat->kshrinkd_folios, &tmp_contended_folios);
+ INIT_LIST_HEAD(&pgdat->kshrinkd_folios);
+ spin_unlock_irq(&pgdat->kf_lock);
+
+ /* reclaim rmap_walk contended folios */
+ nr_reclaimed = kshrinkd_reclaim_folios(&tmp_contended_folios, pgdat);
+ __count_vm_events(PGSTEAL_KSHRINKD, nr_reclaimed);
+
+ /* put the folios that could not be reclaimed back on the LRU */
+ while (!list_empty(&tmp_contended_folios)) {
+ struct folio *folio = lru_to_folio(&tmp_contended_folios);
+
+ nr_putback += folio_nr_pages(folio);
+ list_del(&folio->lru);
+ folio_putback_lru(folio);
+ }
+
+ __count_vm_events(PGSCAN_KSHRINKD, nr_reclaimed + nr_putback);
+ }
+
+ current->flags &= ~(PF_MEMALLOC | PF_KSWAPD);
+
+ return 0;
+}
+
+/*
+ * This kshrinkd start function will be called by init and node-hot-add.
+ */
+void kshrinkd_run(int nid)
+{
+ pg_data_t *pgdat = NODE_DATA(nid);
+
+ if (pgdat->kshrinkd)
+ return;
+
+ pgdat->kshrinkd = kthread_run(kshrinkd, pgdat, "kshrinkd%d", nid);
+ if (IS_ERR(pgdat->kshrinkd)) {
+ /* failure to start kshrinkd */
+ WARN_ON_ONCE(system_state < SYSTEM_RUNNING);
+ pr_err("Failed to start kshrinkd on node %d\n", nid);
+ pgdat->kshrinkd = NULL;
+ }
+}
+
+/*
+ * Called by memory hotplug when all memory in a node is offlined. Caller must
+ * be holding mem_hotplug_begin/done().
+ */
+void kshrinkd_stop(int nid)
+{
+ struct task_struct *kshrinkd = NODE_DATA(nid)->kshrinkd;
+
+ if (kshrinkd) {
+ kthread_stop(kshrinkd);
+ NODE_DATA(nid)->kshrinkd = NULL;
+ }
+}
+
+static int __init kshrinkd_init(void)
+{
+ int nid;
+
+ for_each_node_state(nid, N_MEMORY) {
+ pg_data_t *pgdat = NODE_DATA(nid);
+
+ spin_lock_init(&pgdat->kf_lock);
+ init_waitqueue_head(&pgdat->kshrinkd_wait);
+ INIT_LIST_HEAD(&pgdat->kshrinkd_folios);
+
+ kshrinkd_run(nid);
+ }
+
+ return 0;
+}
+
+module_init(kshrinkd_init)
+
#ifdef CONFIG_NUMA
/*
* Node reclaim mode
@@ -7393,6 +7575,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
.may_swap = 1,
.reclaim_idx = gfp_zone(gfp_mask),
.rw_try_lock = 1,
+ .need_kshrinkd = 1,
};
unsigned long pflags;

diff --git a/mm/vmstat.c b/mm/vmstat.c
index db79935..76d8a3b 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1279,9 +1279,11 @@ const char * const vmstat_text[] = {

"pgrefill",
"pgreuse",
+ "pgsteal_kshrinkd",
"pgsteal_kswapd",
"pgsteal_direct",
"pgsteal_khugepaged",
+ "pgscan_kshrinkd",
"pgscan_kswapd",
"pgscan_direct",
"pgscan_khugepaged",
--
2.7.4
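
The hand-off between shrink_folio_list and kshrinkd in the patch above follows a common pattern: the producer splices a private batch onto a shared list under a lock, and the consumer detaches the whole shared list under the same lock and processes it outside the lock. The following is a rough userspace sketch of that pattern only, assuming a pthread mutex in place of the spinlock. `struct node`, `struct toy_pgdat` and all helper names are hypothetical and do not exist in the kernel. The wait queue and the wake-up are left out.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct node {
	struct node *prev, *next;
};

static void node_init(struct node *head)
{
	head->prev = head;
	head->next = head;
}

static int node_list_empty(const struct node *head)
{
	return head->next == head;
}

/* Insert n right after head, i.e. at the front of the list. */
static void node_add(struct node *n, struct node *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* Move every entry of src to the front of dst and leave src empty. */
static void node_splice_init(struct node *src, struct node *dst)
{
	if (node_list_empty(src))
		return;
	src->next->prev = dst;
	src->prev->next = dst->next;
	dst->next->prev = src->prev;
	dst->next = src->next;
	node_init(src);
}

static size_t node_count(const struct node *head)
{
	size_t n = 0;

	for (const struct node *p = head->next; p != head; p = p->next)
		n++;
	return n;
}

/* Hypothetical stand-in for the per-node state the patch adds to pglist_data. */
struct toy_pgdat {
	pthread_mutex_t lock;	/* plays the role of kf_lock */
	struct node pending;	/* plays the role of kshrinkd_folios */
};

static void toy_pgdat_init(struct toy_pgdat *pgdat)
{
	pthread_mutex_init(&pgdat->lock, NULL);
	node_init(&pgdat->pending);
}

/* Producer side: hand a private batch over to the shared pending list. */
static void toy_queue_batch(struct toy_pgdat *pgdat, struct node *batch)
{
	pthread_mutex_lock(&pgdat->lock);
	node_splice_init(batch, &pgdat->pending);
	pthread_mutex_unlock(&pgdat->lock);
}

/* Consumer side: detach everything under the lock; returns the length of out. */
static size_t toy_drain(struct toy_pgdat *pgdat, struct node *out)
{
	pthread_mutex_lock(&pgdat->lock);
	node_splice_init(&pgdat->pending, out);
	pthread_mutex_unlock(&pgdat->lock);
	return node_count(out);
}
```

`node_splice_init()` empties the source list as part of the splice, which corresponds to the `list_splice()` plus `INIT_LIST_HEAD()` pair in the patch; the kernel's list API offers `list_splice_init()` for the same purpose.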


2024-02-19 16:51:22

by Matthew Wilcox

Subject: Re: [PATCH 0/2] Support kshrinkd

On Mon, Feb 19, 2024 at 10:17:01PM +0800, [email protected] wrote:
> 'commit 6d4675e60135 ("mm: don't be stuck to rmap lock on reclaim path")'
> The above patch would avoid reclaim path to stuck rmap lock.
> But it would cause some folios in LRU not sorted by aging because
> the contended-folios in rmap_walk would be putbacked to the head of LRU
> during shrink_folio_list even if the folios are very cold.
>
> The patchset setups new kthread:kshrinkd to reclaim the contended-folio
> in rmap_walk when shrink_folio_list, to avoid to break the rules of LRU.

Patch 1/2 didn't make it to my inbox or to lore. But you should talk
about the real world consequences of this in the cover letter. What do
we observe if this problem happens? How much extra performance will we
gain by applying this patch?


2024-02-20 02:05:01

by 李培锋

Subject: Re: [PATCH 0/2] Support kshrinkd


On 2024/2/20 0:51, Matthew Wilcox wrote:
> On Mon, Feb 19, 2024 at 10:17:01PM +0800, [email protected] wrote:
>> 'commit 6d4675e60135 ("mm: don't be stuck to rmap lock on reclaim path")'
>> The above patch would avoid reclaim path to stuck rmap lock.
>> But it would cause some folios in LRU not sorted by aging because
>> the contended-folios in rmap_walk would be putbacked to the head of LRU
>> during shrink_folio_list even if the folios are very cold.
>>
>> The patchset setups new kthread:kshrinkd to reclaim the contended-folio
>> in rmap_walk when shrink_folio_list, to avoid to break the rules of LRU.
> Patch 1/2 didn't make it to my inbox or to lore.
Hi Sir, I have resent it to you.
> But you should talk
> about the real world consequences of this in the cover letter. What do
> we observe if this problem happens? How much extra performance will we
> gain by applying this patch?

Hi Sir:

A 300-hour Monkey test on a phone with 16 GB of RAM shows that almost
one-third of the contended pages can be freed successfully on the next
attempt, so putting those folios back at the head of the LRU breaks the
rules of the inactive LRU.

- pgsteal_kshrinkd 262577
- pgscan_kshrinkd 795503

"pgsteal_kshrinkd" is the number of contended folios that can be freed
successfully but would otherwise have been put back at the head of the
inactive LRU: more than 1 GB (262577 folios).

On a mobile phone with 16 GB of RAM, the inactive lists total around
4.5 GB, so the contended folios break the rules of the inactive LRU.

- nr_inactive_anon 1020953
- nr_inactive_file 204801


Actually, the patchset has been merged in the Google kernel/common tree
since android12-5.10 and android13-5.15, and has shipped on more than
100 million Android phones for more than 1.5 years.

Because of GKI, however, the patches were implemented in the form of
hooks. The patches merged on the Google side are as follows:

https://android-review.googlesource.com/c/kernel/common/+/2163904

https://android-review.googlesource.com/c/kernel/common/+/2191343

https://android-review.googlesource.com/c/kernel/common/+/2550490

https://android-review.googlesource.com/c/kernel/common/+/2318311
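
As a quick check of the figures in this reply, the sketch below converts the quoted counters to sizes and computes the share of scanned pages that kshrinkd reclaimed. It assumes 4 KiB base pages and that the counters are in units of base pages; neither is stated in the thread. The helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB base pages; the thread does not state the page size. */
#define ASSUMED_PAGE_SIZE 4096ULL

/* Convert a page count to bytes under the assumed page size. */
static uint64_t pages_to_bytes(uint64_t pages)
{
	return pages * ASSUMED_PAGE_SIZE;
}

/* Convert bytes to whole mebibytes, rounded down. */
static uint64_t bytes_to_mib(uint64_t bytes)
{
	return bytes >> 20;
}

/* Integer percentage, rounded down, of scanned pages that were reclaimed. */
static unsigned int steal_percent(uint64_t stolen, uint64_t scanned)
{
	assert(scanned > 0);
	return (unsigned int)(stolen * 100 / scanned);
}
```

Under those assumptions, 262577 of 795503 pages is about 33%, 262577 pages is about 1025 MiB, and the two inactive counters together (1225754 pages) come to about 4788 MiB. Note that the kshrinkd counters are cumulative over the test run, while the inactive counters are a snapshot.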



2024-02-20 02:09:32

by 李培锋

Subject: Re: [PATCH 0/2] Support kshrinkd

Adding experts from Linux and Google.



2024-02-20 02:22:11

by 李培锋

Subject: Re: [PATCH 2/2] mm: support kshrinkd

Adding experts from Linux and Google.



2024-02-20 02:55:35

by Matthew Wilcox

Subject: Re: [PATCH 0/2] Support kshrinkd

On Tue, Feb 20, 2024 at 10:04:33AM +0800, 李培锋 wrote:
> Monkey-test in phone with 16G-ram for 300 hours shows that almost one-third
> of the contended-pages can be freed successfully next time, putting back
> those folios to LRU's head would break the rules of inactive-LRU.

You talk about "the rules of inactive LRU" like we care. The LRU is
an approximation at best. What are the *consequences*? Is there a
benchmark that executes more operations per second as a result of
this patch?


2024-02-20 03:19:45

by Barry Song

Subject: Re: [PATCH 2/2] mm: support kshrinkd

Hi Peifeng,

On Tue, Feb 20, 2024 at 3:21 PM 李培锋 <[email protected]> wrote:
>
> add experts from Linux and Google.
>
>
> On 2024/2/19 22:17, [email protected] wrote:
> > From: lipeifeng <[email protected]>
> >
> > 'commit 6d4675e60135 ("mm: don't be stuck to rmap lock on reclaim path")'
> > The above patch would avoid reclaim path to stuck rmap lock.
> > But it would cause some folios in LRU not sorted by aging because
> > the contended-folio in rmap_walk would be putback to the head of LRU
> > when shrink_folio_list even if the folio is very cold.
> >
> > Monkey-test in phone for 300 hours shows that almost one-third of the
> > contended-pages can be freed successfully next time, putting back those
> > folios to LRU's head would break the rules of LRU.

The commit message seems hard to read.

How seriously is LRU aging broken? What percentage of folios are
contended?

What is the negative impact if the contended folios are aged improperly?

> > - pgsteal_kshrinkd 262577
> > - pgscan_kshrinkd 795503
> >
> > For the above reason, the patch setups new kthread:kshrinkd to reclaim
> > the contended-folio in rmap_walk when shrink_folio_list, to avoid to
> > break the rules of LRU.

What benefit do real users get from the "fixed" aging that your approach
provides by putting contended folios on a separate list and having a
separate thread reclaim them?

> >
> > Signed-off-by: lipeifeng <[email protected]>
> > ---
> > include/linux/mmzone.h | 6 ++
> > include/linux/swap.h | 3 +
> > include/linux/vm_event_item.h | 2 +
> > mm/memory_hotplug.c | 2 +
> > mm/vmscan.c | 189 +++++++++++++++++++++++++++++++++++++++++-
> > mm/vmstat.c | 2 +
> > 6 files changed, 201 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index a497f18..83d7202 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -1329,6 +1329,12 @@ typedef struct pglist_data {
> >
> > int kswapd_failures; /* Number of 'reclaimed == 0' runs */
> >
> > + struct list_head kshrinkd_folios; /* rmap_walk contended folios list*/
> > + spinlock_t kf_lock; /* Protect kshrinkd_folios list*/
> > +
> > + struct task_struct *kshrinkd; /* reclaim kshrinkd_folios*/
> > + wait_queue_head_t kshrinkd_wait;
> > +
> > #ifdef CONFIG_COMPACTION
> > int kcompactd_max_order;
> > enum zone_type kcompactd_highest_zoneidx;
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 4db00dd..155fcb6 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -435,6 +435,9 @@ void check_move_unevictable_folios(struct folio_batch *fbatch);
> > extern void __meminit kswapd_run(int nid);
> > extern void __meminit kswapd_stop(int nid);
> >
> > +extern void kshrinkd_run(int nid);
> > +extern void kshrinkd_stop(int nid);
> > +
> > #ifdef CONFIG_SWAP
> >
> > int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page,
> > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> > index 747943b..ee95ab1 100644
> > --- a/include/linux/vm_event_item.h
> > +++ b/include/linux/vm_event_item.h
> > @@ -38,9 +38,11 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> > PGLAZYFREED,
> > PGREFILL,
> > PGREUSE,
> > + PGSTEAL_KSHRINKD,
> > PGSTEAL_KSWAPD,
> > PGSTEAL_DIRECT,
> > PGSTEAL_KHUGEPAGED,
> > + PGSCAN_KSHRINKD,
> > PGSCAN_KSWAPD,
> > PGSCAN_DIRECT,
> > PGSCAN_KHUGEPAGED,
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index 2189099..1b6c4c6 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -1209,6 +1209,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
> >
> > kswapd_run(nid);
> > kcompactd_run(nid);
> > + kshrinkd_run(nid);
> >
> > writeback_set_ratelimit();
> >
> > @@ -2092,6 +2093,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
> > }
> >
> > if (arg.status_change_nid >= 0) {
> > + kshrinkd_stop(node);
> > kcompactd_stop(node);
> > kswapd_stop(node);
> > }
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 0296d48..63e4fd4 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -139,6 +139,9 @@ struct scan_control {
> > /* if try_lock in rmap_walk */
> > unsigned int rw_try_lock:1;
> >
> > + /* need kshrinkd to reclaim if rwc trylock contended*/
> > + unsigned int need_kshrinkd:1;
> > +
> > /* Allocation order */
> > s8 order;
> >
> > @@ -190,6 +193,17 @@ struct scan_control {
> > */
> > int vm_swappiness = 60;
> >
> > +/*
> > + * Wakeup kshrinkd those folios which lock-contended in ramp_walk
> > + * during shrink_folio_list, instead of putting back to the head
> > + * of LRU, to avoid to break the rules of LRU.
> > + */
> > +static void wakeup_kshrinkd(struct pglist_data *pgdat)
> > +{
> > + if (likely(pgdat->kshrinkd))
> > + wake_up_interruptible(&pgdat->kshrinkd_wait);
> > +}
> > +
> > #ifdef CONFIG_MEMCG
> >
> > /* Returns true for reclaim through cgroup limits or cgroup interfaces. */
> > @@ -821,6 +835,7 @@ enum folio_references {
> > FOLIOREF_RECLAIM_CLEAN,
> > FOLIOREF_KEEP,
> > FOLIOREF_ACTIVATE,
> > + FOLIOREF_LOCK_CONTENDED,
> > };
> >
> > static enum folio_references folio_check_references(struct folio *folio,
> > @@ -841,8 +856,12 @@ static enum folio_references folio_check_references(struct folio *folio,
> > return FOLIOREF_ACTIVATE;
> >
> > /* rmap lock contention: rotate */
> > - if (referenced_ptes == -1)
> > - return FOLIOREF_KEEP;
> > + if (referenced_ptes == -1) {
> > + if (sc->need_kshrinkd && folio_pgdat(folio)->kshrinkd)
> > + return FOLIOREF_LOCK_CONTENDED;
> > + else
> > + return FOLIOREF_KEEP;
> > + }
> >
> > if (referenced_ptes) {
> > /*
> > @@ -1012,6 +1031,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> > LIST_HEAD(ret_folios);
> > LIST_HEAD(free_folios);
> > LIST_HEAD(demote_folios);
> > + LIST_HEAD(contended_folios);
> > unsigned int nr_reclaimed = 0;
> > unsigned int pgactivate = 0;
> > bool do_demote_pass;
> > @@ -1028,6 +1048,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> > enum folio_references references = FOLIOREF_RECLAIM;
> > bool dirty, writeback;
> > unsigned int nr_pages;
> > + bool lock_contended = false;
> >
> > cond_resched();
> >
> > @@ -1169,6 +1190,9 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> > case FOLIOREF_KEEP:
> > stat->nr_ref_keep += nr_pages;
> > goto keep_locked;
> > + case FOLIOREF_LOCK_CONTENDED:
> > + lock_contended = true;
> > + goto keep_locked;
> > case FOLIOREF_RECLAIM:
> > case FOLIOREF_RECLAIM_CLEAN:
> > ; /* try to reclaim the folio below */
> > @@ -1449,7 +1473,10 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> > keep_locked:
> > folio_unlock(folio);
> > keep:
> > - list_add(&folio->lru, &ret_folios);
> > + if (unlikely(lock_contended))
> > + list_add(&folio->lru, &contended_folios);
> > + else
> > + list_add(&folio->lru, &ret_folios);
> > VM_BUG_ON_FOLIO(folio_test_lru(folio) ||
> > folio_test_unevictable(folio), folio);
> > }
> > @@ -1491,6 +1518,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> > free_unref_page_list(&free_folios);
> >
> > list_splice(&ret_folios, folio_list);
> > +
> > + if (!list_empty(&contended_folios)) {
> > + spin_lock_irq(&pgdat->kf_lock);
> > + list_splice(&contended_folios, &pgdat->kshrinkd_folios);
> > + spin_unlock_irq(&pgdat->kf_lock);
> > + wakeup_kshrinkd(pgdat);
> > + }
> > +
> > count_vm_events(PGACTIVATE, pgactivate);
> >
> > if (plug)
> > @@ -1505,6 +1540,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
> > .gfp_mask = GFP_KERNEL,
> > .may_unmap = 1,
> > .rw_try_lock = 1,
> > + .need_kshrinkd = 0,
> > };
> > struct reclaim_stat stat;
> > unsigned int nr_reclaimed;
> > @@ -2101,6 +2137,7 @@ static unsigned int reclaim_folio_list(struct list_head *folio_list,
> > .may_swap = 1,
> > .no_demotion = 1,
> > .rw_try_lock = 1,
> > + .need_kshrinkd = 0,
> > };
> >
> > nr_reclaimed = shrink_folio_list(folio_list, pgdat, &sc, &dummy_stat, false);
> > @@ -5448,6 +5485,7 @@ static ssize_t lru_gen_seq_write(struct file *file, const char __user *src,
> > .reclaim_idx = MAX_NR_ZONES - 1,
> > .gfp_mask = GFP_KERNEL,
> > .rw_try_lock = 1,
> > + .need_kshrinkd = 0,
> > };
> >
> > buf = kvmalloc(len + 1, GFP_KERNEL);
> > @@ -6421,6 +6459,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> > .may_unmap = 1,
> > .may_swap = 1,
> > .rw_try_lock = 1,
> > + .need_kshrinkd = 1,
> > };
> >
> > /*
> > @@ -6467,6 +6506,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
> > .reclaim_idx = MAX_NR_ZONES - 1,
> > .may_swap = !noswap,
> > .rw_try_lock = 1,
> > + .need_kshrinkd = 0,
> > };
> >
> > WARN_ON_ONCE(!current->reclaim_state);
> > @@ -6512,6 +6552,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> > .may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
> > .proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
> > .rw_try_lock = 1,
> > + .need_kshrinkd = 0,
> > };
> > /*
> > * Traverse the ZONELIST_FALLBACK zonelist of the current node to put
> > @@ -6774,6 +6815,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
> > .order = order,
> > .may_unmap = 1,
> > .rw_try_lock = 1,
> > + .need_kshrinkd = 1,
> > };
> >
> > set_task_reclaim_state(current, &sc.reclaim_state);
> > @@ -7234,6 +7276,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
> > .may_swap = 1,
> > .hibernation_mode = 1,
> > .rw_try_lock = 1,
> > + .need_kshrinkd = 0,
> > };
> > struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask);
> > unsigned long nr_reclaimed;
> > @@ -7304,6 +7347,145 @@ static int __init kswapd_init(void)
> >
> > module_init(kswapd_init)
> >
> > +static int kshrinkd_should_run(pg_data_t *pgdat)
> > +{
> > + int should_run;
> > +
> > + spin_lock_irq(&pgdat->kf_lock);
> > + should_run = !list_empty(&pgdat->kshrinkd_folios);
> > + spin_unlock_irq(&pgdat->kf_lock);
> > +
> > + return should_run;
> > +}
> > +
> > +static unsigned long kshrinkd_reclaim_folios(struct list_head *folio_list,
> > + struct pglist_data *pgdat)
> > +{
> > + struct reclaim_stat dummy_stat;
> > + unsigned int nr_reclaimed = 0;
> > + struct scan_control sc = {
> > + .gfp_mask = GFP_KERNEL,
> > + .may_writepage = 1,
> > + .may_unmap = 1,
> > + .may_swap = 1,
> > + .no_demotion = 1,
> > + .rw_try_lock = 0,
> > + .need_kshrinkd = 0,
> > + };
> > +
> > + if (list_empty(folio_list))
> > + return nr_reclaimed;
> > +
> > + nr_reclaimed = shrink_folio_list(folio_list, pgdat, &sc, &dummy_stat, false);
> > +
> > + return nr_reclaimed;
> > +}
> > +
> > +/*
> > + * The background kshrink daemon, started as a kernel thread
> > + * from the init process.
> > + *
> > + * Kshrinkd is to reclaim the contended-folio in rmap_walk when
> > + * shrink_folio_list instead of putting back into the head of LRU
> > + * directly, to avoid to break the rules of LRU.
> > + */
> > +
> > +static int kshrinkd(void *p)
> > +{
> > + pg_data_t *pgdat;
> > + LIST_HEAD(tmp_contended_folios);
> > +
> > + pgdat = (pg_data_t *)p;
> > +
> > + current->flags |= PF_MEMALLOC | PF_KSWAPD;
> > + set_freezable();
> > +
> > + while (!kthread_should_stop()) {
> > + unsigned long nr_reclaimed = 0;
> > + unsigned long nr_putback = 0;
> > +
> > + wait_event_freezable(pgdat->kshrinkd_wait,
> > + kshrinkd_should_run(pgdat));
> > +
> > + /* splice rmap_walk contended folios to tmp-list */
> > + spin_lock_irq(&pgdat->kf_lock);
> > + list_splice(&pgdat->kshrinkd_folios, &tmp_contended_folios);
> > + INIT_LIST_HEAD(&pgdat->kshrinkd_folios);
> > + spin_unlock_irq(&pgdat->kf_lock);
> > +
> > + /* reclaim rmap_walk contended folios */
> > + nr_reclaimed = kshrinkd_reclaim_folios(&tmp_contended_folios, pgdat);
> > + __count_vm_events(PGSTEAL_KSHRINKD, nr_reclaimed);
> > +
> > + /* putback the folios which failed to reclaim to lru */
> > + while (!list_empty(&tmp_contended_folios)) {
> > + struct folio *folio = lru_to_folio(&tmp_contended_folios);
> > +
> > + nr_putback += folio_nr_pages(folio);
> > + list_del(&folio->lru);
> > + folio_putback_lru(folio);
> > + }
> > +
> > + __count_vm_events(PGSCAN_KSHRINKD, nr_reclaimed + nr_putback);
> > + }
> > +
> > + current->flags &= ~(PF_MEMALLOC | PF_KSWAPD);
> > +
> > + return 0;
> > +}
> > +
> > +/*
> > + * This kshrinkd start function will be called by init and node-hot-add.
> > + */
> > +void kshrinkd_run(int nid)
> > +{
> > + pg_data_t *pgdat = NODE_DATA(nid);
> > +
> > + if (pgdat->kshrinkd)
> > + return;
> > +
> > + pgdat->kshrinkd = kthread_run(kshrinkd, pgdat, "kshrinkd%d", nid);
> > + if (IS_ERR(pgdat->kshrinkd)) {
> > + /* failure to start kshrinkd */
> > + WARN_ON_ONCE(system_state < SYSTEM_RUNNING);
> > + pr_err("Failed to start kshrinkd on node %d\n", nid);
> > + pgdat->kshrinkd = NULL;
> > + }
> > +}
> > +
> > +/*
> > + * Called by memory hotplug when all memory in a node is offlined. Caller must
> > + * be holding mem_hotplug_begin/done().
> > + */
> > +void kshrinkd_stop(int nid)
> > +{
> > + struct task_struct *kshrinkd = NODE_DATA(nid)->kshrinkd;
> > +
> > + if (kshrinkd) {
> > + kthread_stop(kshrinkd);
> > + NODE_DATA(nid)->kshrinkd = NULL;
> > + }
> > +}
> > +
> > +static int __init kshrinkd_init(void)
> > +{
> > + int nid;
> > +
> > + for_each_node_state(nid, N_MEMORY) {
> > + pg_data_t *pgdat = NODE_DATA(nid);
> > +
> > + spin_lock_init(&pgdat->kf_lock);
> > + init_waitqueue_head(&pgdat->kshrinkd_wait);
> > + INIT_LIST_HEAD(&pgdat->kshrinkd_folios);
> > +
> > + kshrinkd_run(nid);
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +module_init(kshrinkd_init)
> > +
> > #ifdef CONFIG_NUMA
> > /*
> > * Node reclaim mode
> > @@ -7393,6 +7575,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
> > .may_swap = 1,
> > .reclaim_idx = gfp_zone(gfp_mask),
> > .rw_try_lock = 1,
> > + .need_kshrinkd = 1,
> > };
> > unsigned long pflags;
> >
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index db79935..76d8a3b 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -1279,9 +1279,11 @@ const char * const vmstat_text[] = {
> >
> > "pgrefill",
> > "pgreuse",
> > + "pgsteal_kshrinkd",
> > "pgsteal_kswapd",
> > "pgsteal_direct",
> > "pgsteal_khugepaged",
> > + "pgscan_kshrinkd",
> > "pgscan_kswapd",
> > "pgscan_direct",
> > "pgscan_khugepaged",
>

Thanks
Barry
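
The handoff that the quoted patch sets up (shrink_folio_list splices
contended folios onto a per-node list under kf_lock, and kshrinkd later
detaches the whole list under the same lock and works on it unlocked) can
be sketched in userspace C. This is only an illustration of the locking
pattern, not kernel code: struct item, struct node_queue and the queue_*
helpers are hypothetical names, and a toy C11 atomic_flag spinlock stands
in for the kernel spinlock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a folio; only the list linkage matters here. */
struct item {
	int id;
	struct item *next;
};

/* Hypothetical stand-in for the per-node state added by the patch. */
struct node_queue {
	atomic_flag lock;	/* toy spinlock, plays the role of kf_lock */
	struct item *head;	/* plays the role of kshrinkd_folios */
};

void queue_init(struct node_queue *q)
{
	atomic_flag_clear(&q->lock);
	q->head = NULL;
}

static void queue_lock(struct node_queue *q)
{
	while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void queue_unlock(struct node_queue *q)
{
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Producer side: hand a private batch to the shared list in one locked step. */
void queue_splice(struct node_queue *q, struct item *batch)
{
	struct item *tail = batch;

	if (!batch)
		return;
	/* Find the tail outside the lock; the batch is still private. */
	while (tail->next)
		tail = tail->next;

	queue_lock(q);
	tail->next = q->head;
	q->head = batch;
	queue_unlock(q);
}

/* Consumer side: detach everything under the lock, process it unlocked. */
struct item *queue_take_all(struct node_queue *q)
{
	struct item *all;

	queue_lock(q);
	all = q->head;
	q->head = NULL;
	queue_unlock(q);
	return all;
}

/* Count the entries of a detached (private) list. */
size_t list_length(const struct item *it)
{
	size_t n = 0;

	assert(it == NULL || it->next != it);	/* guard against a trivial cycle */
	for (; it; it = it->next)
		n++;
	return n;
}
```

For example, splicing the batch b -> a and then the single item c leaves
c -> b -> a on the shared list, and one queue_take_all() call returns all
three and leaves the shared list empty. In the kernel, the detach step in
the quoted kshrinkd() loop (list_splice followed by INIT_LIST_HEAD) does
the same thing as the existing list_splice_init() helper.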

2024-02-20 04:16:35

by 李培锋

Subject: Re: [PATCH 0/2] Support kshrinkd


On 2024/2/20 10:55, Matthew Wilcox wrote:
> On Tue, Feb 20, 2024 at 10:04:33AM +0800, 李培锋 wrote:
>> Monkey-test in phone with 16G-ram for 300 hours shows that almost one-third
>> of the contended-pages can be freed successfully next time, putting back
>> those folios to LRU's head would break the rules of inactive-LRU.
> You talk about "the rules of inactive LRU" like we care. The LRU is
> an approximation at best. What are the *consequences*? Is there a
> benchmark that executes more operations per second as a result of
> this patch?

Hi Sir,

1. The data above, from the 300-hour test on a 16 GB RAM device:

- 795503 folios were skipped during shrink_folio_list because of rmap
lock contention;

- 262577 of them could be reclaimed successfully, but would be put back
at the head of the inactive LRU.


2. Converted to a per-second rate:

- 0.243 folios per second would mistakenly be put back at the head of
the inactive LRU.


3. Issues:

There are two issues with the current situation:

1. Some cold pages are not freed in time. In the data from the 16 GB
device, almost 1 GB of folios were not freed in time during the test,
which makes shrink_folio_list inefficient.

This is especially true for folios that are very cold and belong to a
common virtual memory space: we found cases where more than 20 folios
were contended in rmap_walk and put back at the head of the inactive LRU
during a single shrink_folio_list pass (which isolates 32 folios), and
more background user processes were killed by lmkd. kshrinkd makes the
reclaim path more efficient and reduces the lmkd kill rate by 2%.


2. Keeping more cold folios at the head of the inactive LRU causes some
hot pages to be reclaimed instead, which leads to more file refaults and
anon swap-ins. I will follow up with data if needed.
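
The figures in this reply can be cross-checked from the two counters
quoted in the patch description. The sketch below recomputes the
"one-third" ratio, the 0.243 per-second rate and the "almost 1 GB"
estimate. The helper names are made up for illustration, and the size
estimate assumes 4 KiB pages with one page per folio, which the thread
does not state.

```c
#include <assert.h>
#include <stdio.h>

/* Counters quoted in the thread (300-hour monkey test on a 16 GB phone). */
#define PGSCAN_KSHRINKD   795503UL
#define PGSTEAL_KSHRINKD  262577UL
#define TEST_HOURS        300UL

/* Assumption: 4 KiB pages, one page per folio (not stated in the thread). */
#define ASSUMED_PAGE_BYTES 4096UL

/* Fraction of contended folios that were reclaimed on the retry. */
double steal_ratio(unsigned long steal, unsigned long scan)
{
	assert(scan != 0);
	return (double)steal / (double)scan;
}

/* Average number of events per second over a test of `hours` hours. */
double per_second(unsigned long events, unsigned long hours)
{
	assert(hours != 0);
	return (double)events / ((double)hours * 3600.0);
}

/* Memory covered by `folios` folios in GiB, under the page-size assumption. */
double folios_to_gib(unsigned long folios, unsigned long page_bytes)
{
	return (double)folios * (double)page_bytes / (1024.0 * 1024.0 * 1024.0);
}

/* Print the three derived figures next to each other. */
void print_summary(void)
{
	printf("reclaimed / contended : %.3f\n",
	       steal_ratio(PGSTEAL_KSHRINKD, PGSCAN_KSHRINKD));
	printf("reclaimable per second: %.3f\n",
	       per_second(PGSTEAL_KSHRINKD, TEST_HOURS));
	printf("reclaimable size (GiB): %.2f\n",
	       folios_to_gib(PGSTEAL_KSHRINKD, ASSUMED_PAGE_BYTES));
}
```

Under those assumptions the ratio comes out at about 0.33, the rate at
about 0.243 folios per second, and the total at about 1.0 GiB, which
matches the numbers given above.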