2023-10-18 08:22:27

by Huan Yang

Subject: [PATCH 0/2] some fixes for multi-gen LRU

For the multi-gen LRU, the mm_vmscan_lru_shrink_inactive trace output
has several problems:

First, the nr_scanned that evict_folios() passes to the tracepoint
counts every folio touched via `get_nr_to_scan` (so it is not nr_taken).
As a result, the trace may show entries like this:

```
kswapd0-89 [000] 64.887613: mm_vmscan_lru_shrink_inactive:
nid=0 nr_scanned=64 nr_reclaimed=0 nr_dirty=0 nr_writeback=0
nr_congested=0 nr_immediate=0 nr_activate_anon=0 nr_activate_file=0
nr_ref_keep=0 nr_unmap_fail=0 priority=4
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
```

In reality, far fewer folios were actually taken for shrinking. Patch 1
computes the difference between sc->nr_scanned before and after the
shrink, and passes that as nr_taken to the shrink_inactive tracepoint.
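
To make the counting idea concrete, here is a small userspace sketch of
the snapshot/delta pattern patch 1 uses (the struct, function names and
numbers are illustrative stand-ins, not the kernel's):

```c
#include <stdio.h>

/* Toy stand-in for the kernel's scan_control. */
struct scan_control {
	unsigned long nr_scanned;	/* bumped once per folio touched */
};

/* Pretend shrink pass that touches 13 folios. */
static void fake_shrink(struct scan_control *sc)
{
	int i;

	for (i = 0; i < 13; i++)
		sc->nr_scanned++;
}

int main(void)
{
	struct scan_control sc = { .nr_scanned = 51 };	/* left over from earlier scans */
	unsigned long nr_taken = sc.nr_scanned;		/* snapshot before the shrink */

	fake_shrink(&sc);

	/* The delta is what this shrink pass actually touched. */
	nr_taken = sc.nr_scanned - nr_taken;
	printf("nr_taken=%lu\n", nr_taken);	/* prints 13, not the cumulative 64 */
	return 0;
}
```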

Second, as the output above shows, we cannot tell where the scanned
folios went: in a multi-gen LRU shrink, many are kept because
folio_update_gen() promoted them (for example during look-around), so
they never show up in the existing counters.


Third, evict_folios() does not gather the reclaim_stat information after
shrink_folio_list() returns, so the dirty/writeback/congested statistics
are lost and we can neither set LRUVEC_CONGESTED nor invoke
reclaim_throttle().

Patch 1 addresses the first and second issues; patch 2 addresses the third.

Huan Yang (2):
  tracing: mm: multigen-lru: fix mglru trace
  mm: multi-gen lru: fix stat count

 include/linux/vmstat.h        |  2 ++
 include/trace/events/vmscan.h |  8 ++++-
 mm/vmscan.c                   | 61 ++++++++++++++++++++++++++++++++---
 3 files changed, 65 insertions(+), 6 deletions(-)

--
2.34.1


2023-10-18 08:22:31

by Huan Yang

Subject: [PATCH 1/2] tracing: mm: multigen-lru: fix mglru trace

Add two reclaim_stat counters:
nr_promote: pages kept by shrink_folio_list() because folio_update_gen()
promoted them.
nr_demote: pages passed on to NUMA demotion.

Then pass the correct nr_scanned from evict_folios() into
trace_mm_vmscan_lru_shrink_inactive().

The misleading output looked like this:
```
kswapd0-89 [000] 64.887613: mm_vmscan_lru_shrink_inactive:
nid=0 nr_scanned=64 nr_reclaimed=0 nr_dirty=0 nr_writeback=0
nr_congested=0 nr_immediate=0 nr_activate_anon=0 nr_activate_file=0
nr_ref_keep=0 nr_unmap_fail=0 priority=4
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
```
With this patch it reads:
```
<...>-9041 [006] 43.738481: mm_vmscan_lru_shrink_inactive:
nid=0 nr_scanned=13 nr_reclaimed=0 nr_dirty=0 nr_writeback=0
nr_congested=0 nr_immediate=0 nr_activate_anon=9 nr_activate_file=0
nr_ref_keep=0 nr_unmap_fail=0 nr_promote=4 nr_demote=0 priority=12
flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
```

Signed-off-by: Huan Yang <[email protected]>
---
 include/linux/vmstat.h        |  2 ++
 include/trace/events/vmscan.h |  8 +++++++-
 mm/vmscan.c                   | 26 +++++++++++++++++++++-----
 3 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index fed855bae6d8..ac2dd9168780 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -32,6 +32,8 @@ struct reclaim_stat {
 	unsigned nr_ref_keep;
 	unsigned nr_unmap_fail;
 	unsigned nr_lazyfree_fail;
+	unsigned nr_promote;
+	unsigned nr_demote;
 };

enum writeback_stat_item {
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 1a488c30afa5..9b403824a293 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -366,6 +366,8 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
 		__field(unsigned int, nr_activate1)
 		__field(unsigned long, nr_ref_keep)
 		__field(unsigned long, nr_unmap_fail)
+		__field(unsigned long, nr_promote)
+		__field(unsigned long, nr_demote)
 		__field(int, priority)
 		__field(int, reclaim_flags)
 	),
@@ -382,17 +384,21 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
 		__entry->nr_activate1 = stat->nr_activate[1];
 		__entry->nr_ref_keep = stat->nr_ref_keep;
 		__entry->nr_unmap_fail = stat->nr_unmap_fail;
+		__entry->nr_promote = stat->nr_promote;
+		__entry->nr_demote = stat->nr_demote;
 		__entry->priority = priority;
 		__entry->reclaim_flags = trace_reclaim_flags(file);
 	),

- TP_printk("nid=%d nr_scanned=%ld nr_reclaimed=%ld nr_dirty=%ld nr_writeback=%ld nr_congested=%ld nr_immediate=%ld nr_activate_anon=%d nr_activate_file=%d nr_ref_keep=%ld nr_unmap_fail=%ld priority=%d flags=%s",
+ TP_printk("nid=%d nr_scanned=%ld nr_reclaimed=%ld nr_dirty=%ld nr_writeback=%ld nr_congested=%ld nr_immediate=%ld nr_activate_anon=%d"
+ " nr_activate_file=%d nr_ref_keep=%ld nr_unmap_fail=%ld nr_promote=%ld nr_demote=%ld priority=%d flags=%s",
__entry->nid,
__entry->nr_scanned, __entry->nr_reclaimed,
__entry->nr_dirty, __entry->nr_writeback,
__entry->nr_congested, __entry->nr_immediate,
__entry->nr_activate0, __entry->nr_activate1,
__entry->nr_ref_keep, __entry->nr_unmap_fail,
+ __entry->nr_promote, __entry->nr_demote,
__entry->priority,
show_reclaim_flags(__entry->reclaim_flags))
);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2cc0cb41fb32..21099b9f21e0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1063,8 +1063,10 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,

 		/* folio_update_gen() tried to promote this page? */
 		if (lru_gen_enabled() && !ignore_references &&
-		    folio_mapped(folio) && folio_test_referenced(folio))
+		    folio_mapped(folio) && folio_test_referenced(folio)) {
+			stat->nr_promote += nr_pages;
 			goto keep_locked;
+		}

 		/*
 		 * The number of dirty pages determines if a node is marked
@@ -1193,6 +1195,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 		    (thp_migration_supported() || !folio_test_large(folio))) {
 			list_add(&folio->lru, &demote_folios);
 			folio_unlock(folio);
+			stat->nr_demote += nr_pages;
 			continue;
 		}

@@ -4495,6 +4498,8 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
 	int type;
 	int scanned;
 	int reclaimed;
+	unsigned long nr_taken = sc->nr_scanned;
+	unsigned int total_reclaimed = 0;
 	LIST_HEAD(list);
 	LIST_HEAD(clean);
 	struct folio *folio;
@@ -4521,10 +4526,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
 		return scanned;
 retry:
 	reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false);
-	sc->nr_reclaimed += reclaimed;
-	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
-			scanned, reclaimed, &stat, sc->priority,
-			type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
+	total_reclaimed += reclaimed;

 	list_for_each_entry_safe_reverse(folio, next, &list, lru) {
 		if (!folio_evictable(folio)) {
@@ -4582,6 +4584,20 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
 		goto retry;
 	}

+	/*
+	 * The nr_scanned that MGLRU's scan_folios() reports counts all the
+	 * folios it touched, not only the isolated ones. To learn how many
+	 * folios shrink_folio_list() actually operated on, record
+	 * sc->nr_scanned before the shrink (it is incremented for every
+	 * folio touched); the difference afterwards is the real nr_taken.
+	 */
+	nr_taken = sc->nr_scanned - nr_taken;
+
+	sc->nr_reclaimed += total_reclaimed;
+	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id, nr_taken,
+					    total_reclaimed, &stat,
+					    sc->priority, type);
+
 	return scanned;
 }

--
2.34.1

2023-10-18 08:22:33

by Huan Yang

Subject: [PATCH 2/2] mm: multi-gen lru: fix stat count

The multi-gen LRU reclaim path in evict_folios(), like
shrink_inactive_list(), gathers the folios it has isolated for reclaim
and invokes shrink_folio_list().

But when the shrink completes, it does not fold the reclaim_stat
counters back into sc, so information such as nr_dirty and nr_congested
is lost; we can neither throttle writeback, mark the lruvec
LRUVEC_CONGESTED, nor observe correct sc statistics from a bpf trace of
the shrink.

Fix this by copying the corresponding code from shrink_inactive_list()
to run at the end of the shrink.

Signed-off-by: Huan Yang <[email protected]>
---
mm/vmscan.c | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 21099b9f21e0..88d1d586aea5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4593,6 +4593,41 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
 	 */
 	nr_taken = sc->nr_scanned - nr_taken;

+	/*
+	 * If dirty folios are scanned that are not queued for IO, it
+	 * implies that flushers are not doing their job. This can
+	 * happen when memory pressure pushes dirty folios to the end of
+	 * the LRU before the dirty limits are breached and the dirty
+	 * data has expired. It can also happen when the proportion of
+	 * dirty folios grows not through writes but through memory
+	 * pressure reclaiming all the clean cache. And in some cases,
+	 * the flushers simply cannot keep up with the allocation
+	 * rate. Nudge the flusher threads in case they are asleep.
+	 */
+	if (unlikely(stat.nr_unqueued_dirty == nr_taken)) {
+		wakeup_flusher_threads(WB_REASON_VMSCAN);
+		/*
+		 * For cgroupv1 dirty throttling is achieved by waking up
+		 * the kernel flusher here and later waiting on folios
+		 * which are in writeback to finish (see shrink_folio_list()).
+		 *
+		 * Flusher may not be able to issue writeback quickly
+		 * enough for cgroupv1 writeback throttling to work
+		 * on a large system.
+		 */
+		if (!writeback_throttling_sane(sc))
+			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
+	}
+
+	sc->nr.dirty += stat.nr_dirty;
+	sc->nr.congested += stat.nr_congested;
+	sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
+	sc->nr.writeback += stat.nr_writeback;
+	sc->nr.immediate += stat.nr_immediate;
+	sc->nr.taken += nr_taken;
+	if (type)
+		sc->nr.file_taken += nr_taken;
+
 	sc->nr_reclaimed += total_reclaimed;
 	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id, nr_taken,
 					    total_reclaimed, &stat,
--
2.34.1

2023-10-18 16:22:37

by Yu Zhao

Subject: Re: [PATCH 2/2] mm: multi-gen lru: fix stat count

On Wed, Oct 18, 2023 at 2:22 AM Huan Yang <[email protected]> wrote:
>
> The multi-gen LRU reclaim path in evict_folios(), like
> shrink_inactive_list(), gathers the folios it has isolated for reclaim
> and invokes shrink_folio_list().
>
> But when the shrink completes, it does not fold the reclaim_stat
> counters back into sc, so information such as nr_dirty and nr_congested
> is lost; we can neither throttle writeback, mark the lruvec
> LRUVEC_CONGESTED, nor observe correct sc statistics from a bpf trace of
> the shrink.
>
> Fix this by copying the corresponding code from shrink_inactive_list()
> to run at the end of the shrink.

MGLRU doesn't try to write back dirty file pages in the reclaim path --
it filters them out in sort_folio() and leaves them to the page
writeback. (The page writeback is a dedicated component for this
purpose.) So there is nothing to fix.
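
(For context, the filter mentioned above is the dirty-file check in
sort_folio(); the sketch below is paraphrased from memory of kernels of
roughly this era, not quoted verbatim -- check your own tree:)

```c
	/* in sort_folio(): waiting for writeback */
	if (folio_test_locked(folio) || folio_test_writeback(folio) ||
	    (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
		/*
		 * Keep the folio on the LRU in a younger generation
		 * instead of isolating it for reclaim.
		 */
		gen = folio_inc_gen(lruvec, folio, true);
		list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
		return true;
	}
```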

2023-10-18 16:29:14

by Yu Zhao

Subject: Re: [PATCH 1/2] tracing: mm: multigen-lru: fix mglru trace

On Wed, Oct 18, 2023 at 2:22 AM Huan Yang <[email protected]> wrote:
>
> Add two reclaim_stat counters:
> nr_promote: pages kept by shrink_folio_list() because folio_update_gen()
> promoted them.
> nr_demote: pages passed on to NUMA demotion.

The above isn't specific to MGLRU, so they should be in a separate patchset.

> Then pass the correct nr_scanned from evict_folios() into
> trace_mm_vmscan_lru_shrink_inactive().
>
> The misleading output looked like this:
> ```
> kswapd0-89 [000] 64.887613: mm_vmscan_lru_shrink_inactive:
> nid=0 nr_scanned=64 nr_reclaimed=0 nr_dirty=0 nr_writeback=0
> nr_congested=0 nr_immediate=0 nr_activate_anon=0 nr_activate_file=0
> nr_ref_keep=0 nr_unmap_fail=0 priority=4
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
> ```
> With this patch it reads:
> ```
> <...>-9041 [006] 43.738481: mm_vmscan_lru_shrink_inactive:
> nid=0 nr_scanned=13 nr_reclaimed=0 nr_dirty=0 nr_writeback=0
> nr_congested=0 nr_immediate=0 nr_activate_anon=9 nr_activate_file=0
> nr_ref_keep=0 nr_unmap_fail=0 nr_promote=4 nr_demote=0 priority=12
> flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
> ```

Adding Jaewon & Kalesh to take a look.

2023-10-19 02:18:33

by Huan Yang

Subject: Re: [PATCH 2/2] mm: multi-gen lru: fix stat count

Hi Yu Zhao,

Thanks for your reply.

On 2023/10/19 0:21, Yu Zhao wrote:
> On Wed, Oct 18, 2023 at 2:22 AM Huan Yang <[email protected]> wrote:
>> The multi-gen LRU reclaim path in evict_folios(), like
>> shrink_inactive_list(), gathers the folios it has isolated for reclaim
>> and invokes shrink_folio_list().
>>
>> But when the shrink completes, it does not fold the reclaim_stat
>> counters back into sc, so information such as nr_dirty and nr_congested
>> is lost; we can neither throttle writeback, mark the lruvec
>> LRUVEC_CONGESTED, nor observe correct sc statistics from a bpf trace of
>> the shrink.
>>
>> Fix this by copying the corresponding code from shrink_inactive_list()
>> to run at the end of the shrink.
> MGLRU doesn't try to write back dirty file pages in the reclaim path --
> it filters them out in sort_folio() and leaves them to the page
> writeback. (The page writeback is a dedicated component for this
> purpose.) So there is nothing to fix.
Good to know; sort_folio() does filter some folios.
But I'd like to know: if we touch folios in shrink_folio_list(), can some
of them become dirty or enter writeback even though sort_folio() filtered
them earlier?

2023-10-19 02:25:23

by Huan Yang

Subject: Re: [PATCH 1/2] tracing: mm: multigen-lru: fix mglru trace


On 2023/10/19 0:28, Yu Zhao wrote:
> On Wed, Oct 18, 2023 at 2:22 AM Huan Yang <[email protected]> wrote:
>> Add two reclaim_stat counters:
>> nr_promote: pages kept by shrink_folio_list() because folio_update_gen()
>> promoted them.
>> nr_demote: pages passed on to NUMA demotion.
> The above isn't specific to MGLRU, so they should be in a separate patchset.

OK, nr_demote isn't MGLRU-specific, so separating it out makes sense.
That said, I don't actually need nr_demote myself; I only added it while
reading this code. :)
Could you check whether the nr_promote and nr_scanned fixes are needed
for MGLRU?

>
>> Then pass the correct nr_scanned from evict_folios() into
>> trace_mm_vmscan_lru_shrink_inactive().
>>
>> The misleading output looked like this:
>> ```
>> kswapd0-89 [000] 64.887613: mm_vmscan_lru_shrink_inactive:
>> nid=0 nr_scanned=64 nr_reclaimed=0 nr_dirty=0 nr_writeback=0
>> nr_congested=0 nr_immediate=0 nr_activate_anon=0 nr_activate_file=0
>> nr_ref_keep=0 nr_unmap_fail=0 priority=4
>> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>> ```
>> With this patch it reads:
>> ```
>> <...>-9041 [006] 43.738481: mm_vmscan_lru_shrink_inactive:
>> nid=0 nr_scanned=13 nr_reclaimed=0 nr_dirty=0 nr_writeback=0
>> nr_congested=0 nr_immediate=0 nr_activate_anon=9 nr_activate_file=0
>> nr_ref_keep=0 nr_unmap_fail=0 nr_promote=4 nr_demote=0 priority=12
>> flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
>> ```
> Adding Jaewon & Kalesh to take a look.
Thanks, Huan

2023-10-19 02:41:09

by Yu Zhao

Subject: Re: [PATCH 2/2] mm: multi-gen lru: fix stat count

On Wed, Oct 18, 2023 at 8:17 PM Huan Yang <[email protected]> wrote:
>
> Hi Yu Zhao,
>
> Thanks for your reply.
>
> On 2023/10/19 0:21, Yu Zhao wrote:
> > On Wed, Oct 18, 2023 at 2:22 AM Huan Yang <[email protected]> wrote:
> >> The multi-gen LRU reclaim path in evict_folios(), like
> >> shrink_inactive_list(), gathers the folios it has isolated for reclaim
> >> and invokes shrink_folio_list().
> >>
> >> But when the shrink completes, it does not fold the reclaim_stat
> >> counters back into sc, so information such as nr_dirty and nr_congested
> >> is lost; we can neither throttle writeback, mark the lruvec
> >> LRUVEC_CONGESTED, nor observe correct sc statistics from a bpf trace of
> >> the shrink.
> >>
> >> Fix this by copying the corresponding code from shrink_inactive_list()
> >> to run at the end of the shrink.
> > MGLRU doesn't try to write back dirty file pages in the reclaim path --
> > it filters them out in sort_folio() and leaves them to the page
> Good to know; sort_folio() does filter some folios.
> But I'd like to know: if we touch folios in shrink_folio_list(), can some
> of them become dirty or enter writeback even though sort_folio() filtered
> them earlier?

Good question: in that case MGLRU still doesn't try to write those
folios back because isolate_folio() cleared PG_reclaim and
shrink_folio_list() checks PG_reclaim:

		if (folio_test_dirty(folio)) {
			/*
			 * Only kswapd can writeback filesystem folios
			 * to avoid risk of stack overflow. But avoid
			 * injecting inefficient single-folio I/O into
			 * flusher writeback as much as possible: only
			 * write folios when we've encountered many
			 * dirty folios, and when we've already scanned
			 * the rest of the LRU for clean folios and see
			 * the same dirty folios again (with the reclaim
			 * flag set).
			 */
			if (folio_is_file_lru(folio) &&
			    (!current_is_kswapd() ||
			     !folio_test_reclaim(folio) ||
			     !test_bit(PGDAT_DIRTY, &pgdat->flags))) {
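
(The PG_reclaim clearing referred to above lives in isolate_folio(); a
paraphrased sketch, again from memory rather than verbatim:)

```c
	/* in isolate_folio(), before folios are handed to shrink_folio_list() */
	folio_clear_reclaim(folio);
	folio_clear_referenced(folio);
```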

2023-10-19 06:30:44

by Huan Yang

Subject: Re: [PATCH 2/2] mm: multi-gen lru: fix stat count


On 2023/10/19 10:39, Yu Zhao wrote:
> On Wed, Oct 18, 2023 at 8:17 PM Huan Yang <[email protected]> wrote:
>> Hi Yu Zhao,
>>
>> Thanks for your reply.
>>
>> 在 2023/10/19 0:21, Yu Zhao 写道:
>>> On Wed, Oct 18, 2023 at 2:22 AM Huan Yang <[email protected]> wrote:
>>>> The multi-gen LRU reclaim path in evict_folios(), like
>>>> shrink_inactive_list(), gathers the folios it has isolated for reclaim
>>>> and invokes shrink_folio_list().
>>>>
>>>> But when the shrink completes, it does not fold the reclaim_stat
>>>> counters back into sc, so information such as nr_dirty and nr_congested
>>>> is lost; we can neither throttle writeback, mark the lruvec
>>>> LRUVEC_CONGESTED, nor observe correct sc statistics from a bpf trace of
>>>> the shrink.
>>>>
>>>> Fix this by copying the corresponding code from shrink_inactive_list()
>>>> to run at the end of the shrink.
>>> MGLRU doesn't try to write back dirty file pages in the reclaim path --
>>> it filters them out in sort_folio() and leaves them to the page
>> Good to know; sort_folio() does filter some folios.
>> But I'd like to know: if we touch folios in shrink_folio_list(), can some
>> of them become dirty or enter writeback even though sort_folio() filtered
>> them earlier?
> Good question: in that case MGLRU still doesn't try to write those
> folios back because isolate_folio() cleared PG_reclaim and
> shrink_folio_list() checks PG_reclaim:

Thank you very much. So MGLRU differs from the classic LRU reclaim in
many ways. Why not give MGLRU its own shrink path, to avoid so many
per-folio checks?
Going further, would it make sense to attach anon/file reclaim hooks to
anon_vma/address_space, so that each folio has its own shrink path and
skips the checks it doesn't need?

>
> 		if (folio_test_dirty(folio)) {
> 			/*
> 			 * Only kswapd can writeback filesystem folios
> 			 * to avoid risk of stack overflow. But avoid
> 			 * injecting inefficient single-folio I/O into
> 			 * flusher writeback as much as possible: only
> 			 * write folios when we've encountered many
> 			 * dirty folios, and when we've already scanned
> 			 * the rest of the LRU for clean folios and see
> 			 * the same dirty folios again (with the reclaim
> 			 * flag set).
> 			 */
> 			if (folio_is_file_lru(folio) &&
> 			    (!current_is_kswapd() ||
> 			     !folio_test_reclaim(folio) ||
> 			     !test_bit(PGDAT_DIRTY, &pgdat->flags))) {
Thanks