2016-12-28 15:30:41

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 0/7] vm, vmscan: enhance vmscan tracepoints

Hi,
while debugging [1] I've realized that there is some room for
improvement in the set of tracepoints we currently offer. I had a hard
time drawing any conclusions from the existing ones. The resulting
problem turned out to be active list aging [2] and we are missing at
least two tracepoints to debug such a problem.

Some existing tracepoints could export more information so we can see
_why_ reclaim progress cannot be made, not only _how much_ we could
reclaim. The latter can already be seen quite reasonably from the
vmstat counters. It can be argued that we are exposing too many
implementation details in those tracepoints, but I consider them way
too low-level already to be usable by any kernel-independent userspace.
I would be _really_ surprised if anything but debugging tools has used
them.

Any feedback is highly appreciated.

[1] http://lkml.kernel.org/r/[email protected]
[2] http://lkml.kernel.org/r/[email protected]


2016-12-28 15:30:44

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 1/7] mm, vmscan: remove unused mm_vmscan_memcg_isolate

From: Michal Hocko <[email protected]>

The tracepoint has not been used since 925b7673cce3 ("mm: make per-memcg
LRU lists exclusive") so it can be removed.

Signed-off-by: Michal Hocko <[email protected]>
---
include/trace/events/vmscan.h | 31 +------------------------------
1 file changed, 1 insertion(+), 30 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index c88fd0934e7e..39bad8921ca1 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -269,8 +269,7 @@ TRACE_EVENT(mm_shrink_slab_end,
__entry->retval)
);

-DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
-
+TRACE_EVENT(mm_vmscan_lru_isolate,
TP_PROTO(int classzone_idx,
int order,
unsigned long nr_requested,
@@ -311,34 +310,6 @@ DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
__entry->file)
);

-DEFINE_EVENT(mm_vmscan_lru_isolate_template, mm_vmscan_lru_isolate,
-
- TP_PROTO(int classzone_idx,
- int order,
- unsigned long nr_requested,
- unsigned long nr_scanned,
- unsigned long nr_taken,
- isolate_mode_t isolate_mode,
- int file),
-
- TP_ARGS(classzone_idx, order, nr_requested, nr_scanned, nr_taken, isolate_mode, file)
-
-);
-
-DEFINE_EVENT(mm_vmscan_lru_isolate_template, mm_vmscan_memcg_isolate,
-
- TP_PROTO(int classzone_idx,
- int order,
- unsigned long nr_requested,
- unsigned long nr_scanned,
- unsigned long nr_taken,
- isolate_mode_t isolate_mode,
- int file),
-
- TP_ARGS(classzone_idx, order, nr_requested, nr_scanned, nr_taken, isolate_mode, file)
-
-);
-
TRACE_EVENT(mm_vmscan_writepage,

TP_PROTO(struct page *page),
--
2.10.2

2016-12-28 15:30:49

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 6/7] mm, vmscan: enhance mm_vmscan_lru_shrink_inactive tracepoint

From: Michal Hocko <[email protected]>

mm_vmscan_lru_shrink_inactive currently reports the number of scanned
and reclaimed pages. Beyond the overall effectiveness, this doesn't
tell us anything about how the reclaim went. Export and show other
counters which tell us why we couldn't reclaim some pages:
- nr_dirty, nr_writeback, nr_congested and nr_immediate tell
  us how many pages are blocked due to IO
- nr_activate tells us how many pages were moved to the active
  list
- nr_ref_keep reports how many pages are kept on the LRU due
  to references (mostly file pages which are about to go for
  another round through the inactive list)
- nr_unmap_fail reports how many pages failed to unmap

All these are rather low-level so they might change in the future, but
the tracepoint is already implementation specific so no tools should be
depending on its stability.

Signed-off-by: Michal Hocko <[email protected]>
---
include/trace/events/vmscan.h | 29 ++++++++++++++++++++++++++---
mm/vmscan.c | 14 ++++++++++++++
2 files changed, 40 insertions(+), 3 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index cc0b4c456c78..d27606f27af7 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -348,14 +348,27 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,

TP_PROTO(int nid,
unsigned long nr_scanned, unsigned long nr_reclaimed,
+ unsigned long nr_dirty, unsigned long nr_writeback,
+ unsigned long nr_congested, unsigned long nr_immediate,
+ unsigned long nr_activate, unsigned long nr_ref_keep,
+ unsigned long nr_unmap_fail,
int priority, int file),

- TP_ARGS(nid, nr_scanned, nr_reclaimed, priority, file),
+ TP_ARGS(nid, nr_scanned, nr_reclaimed, nr_dirty, nr_writeback,
+ nr_congested, nr_immediate, nr_activate, nr_ref_keep,
+ nr_unmap_fail, priority, file),

TP_STRUCT__entry(
__field(int, nid)
__field(unsigned long, nr_scanned)
__field(unsigned long, nr_reclaimed)
+ __field(unsigned long, nr_dirty)
+ __field(unsigned long, nr_writeback)
+ __field(unsigned long, nr_congested)
+ __field(unsigned long, nr_immediate)
+ __field(unsigned long, nr_activate)
+ __field(unsigned long, nr_ref_keep)
+ __field(unsigned long, nr_unmap_fail)
__field(int, priority)
__field(int, reclaim_flags)
),
@@ -364,14 +377,24 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
__entry->nid = nid;
__entry->nr_scanned = nr_scanned;
__entry->nr_reclaimed = nr_reclaimed;
+ __entry->nr_dirty = nr_dirty;
+ __entry->nr_writeback = nr_writeback;
+ __entry->nr_congested = nr_congested;
+ __entry->nr_immediate = nr_immediate;
+ __entry->nr_activate = nr_activate;
+ __entry->nr_ref_keep = nr_ref_keep;
+ __entry->nr_unmap_fail = nr_unmap_fail;
__entry->priority = priority;
__entry->reclaim_flags = trace_shrink_flags(file);
),

- TP_printk("nid=%d nr_scanned=%ld nr_reclaimed=%ld priority=%d flags=%s",
+ TP_printk("nid=%d nr_scanned=%ld nr_reclaimed=%ld nr_dirty=%ld nr_writeback=%ld nr_congested=%ld nr_immediate=%ld nr_activate=%ld nr_ref_keep=%ld nr_unmap_fail=%ld priority=%d flags=%s",
__entry->nid,
__entry->nr_scanned, __entry->nr_reclaimed,
- __entry->priority,
+ __entry->nr_dirty, __entry->nr_writeback,
+ __entry->nr_congested, __entry->nr_immediate,
+ __entry->nr_activate, __entry->nr_ref_keep,
+ __entry->nr_unmap_fail, __entry->priority,
show_reclaim_flags(__entry->reclaim_flags))
);

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f6f2d828968c..a701bdd6334a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -908,6 +908,9 @@ struct reclaim_stat {
unsigned nr_congested;
unsigned nr_writeback;
unsigned nr_immediate;
+ unsigned nr_activate;
+ unsigned nr_ref_keep;
+ unsigned nr_unmap_fail;
};

/*
@@ -929,6 +932,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
unsigned nr_reclaimed = 0;
unsigned nr_writeback = 0;
unsigned nr_immediate = 0;
+ unsigned nr_ref_keep = 0;
+ unsigned nr_unmap_fail = 0;

cond_resched();

@@ -1067,6 +1072,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
case PAGEREF_ACTIVATE:
goto activate_locked;
case PAGEREF_KEEP:
+ nr_ref_keep++;
goto keep_locked;
case PAGEREF_RECLAIM:
case PAGEREF_RECLAIM_CLEAN:
@@ -1104,6 +1110,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
(ttu_flags | TTU_BATCH_FLUSH | TTU_LZFREE) :
(ttu_flags | TTU_BATCH_FLUSH))) {
case SWAP_FAIL:
+ nr_unmap_fail++;
goto activate_locked;
case SWAP_AGAIN:
goto keep_locked;
@@ -1276,6 +1283,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
stat->nr_unqueued_dirty = nr_unqueued_dirty;
stat->nr_writeback = nr_writeback;
stat->nr_immediate = nr_immediate;
+ stat->nr_activate = pgactivate;
+ stat->nr_ref_keep = nr_ref_keep;
+ stat->nr_unmap_fail = nr_unmap_fail;
}
return nr_reclaimed;
}
@@ -1825,6 +1835,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,

trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
nr_scanned, nr_reclaimed,
+ stat.nr_dirty, stat.nr_writeback,
+ stat.nr_congested, stat.nr_immediate,
+ stat.nr_activate, stat.nr_ref_keep,
+ stat.nr_unmap_fail,
sc->priority, file);
return nr_reclaimed;
}
--
2.10.2

2016-12-28 15:30:46

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 3/7] mm, vmscan: show the number of skipped pages in mm_vmscan_lru_isolate

From: Michal Hocko <[email protected]>

mm_vmscan_lru_isolate shows the number of requested, scanned and taken
pages. This is mostly OK but on 32b systems the number of scanned pages
is quite misleading because it includes both the scanned and skipped
pages. Moreover the skipped part is scaled based on the number of taken
pages. Let's report the exact numbers without any additional logic and
add the number of skipped pages. This should make the reported data
much easier to interpret.

Signed-off-by: Michal Hocko <[email protected]>
---
include/trace/events/vmscan.h | 8 ++++++--
mm/vmscan.c | 10 +++++-----
2 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index d34cc0ced2be..6af4dae46db2 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -274,17 +274,19 @@ TRACE_EVENT(mm_vmscan_lru_isolate,
int order,
unsigned long nr_requested,
unsigned long nr_scanned,
+ unsigned long nr_skipped,
unsigned long nr_taken,
isolate_mode_t isolate_mode,
int file),

- TP_ARGS(classzone_idx, order, nr_requested, nr_scanned, nr_taken, isolate_mode, file),
+ TP_ARGS(classzone_idx, order, nr_requested, nr_scanned, nr_skipped, nr_taken, isolate_mode, file),

TP_STRUCT__entry(
__field(int, classzone_idx)
__field(int, order)
__field(unsigned long, nr_requested)
__field(unsigned long, nr_scanned)
+ __field(unsigned long, nr_skipped)
__field(unsigned long, nr_taken)
__field(isolate_mode_t, isolate_mode)
__field(int, file)
@@ -295,17 +297,19 @@ TRACE_EVENT(mm_vmscan_lru_isolate,
__entry->order = order;
__entry->nr_requested = nr_requested;
__entry->nr_scanned = nr_scanned;
+ __entry->nr_skipped = nr_skipped;
__entry->nr_taken = nr_taken;
__entry->isolate_mode = isolate_mode;
__entry->file = file;
),

- TP_printk("isolate_mode=%d classzone=%d order=%d nr_requested=%lu nr_scanned=%lu nr_taken=%lu file=%d",
+ TP_printk("isolate_mode=%d classzone=%d order=%d nr_requested=%lu nr_scanned=%lu nr_skipped=%lu nr_taken=%lu file=%d",
__entry->isolate_mode,
__entry->classzone_idx,
__entry->order,
__entry->nr_requested,
__entry->nr_scanned,
+ __entry->nr_skipped,
__entry->nr_taken,
__entry->file)
);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2302a1a58c6e..4f7c0d66d629 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1428,6 +1428,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
unsigned long nr_taken = 0;
unsigned long nr_zone_taken[MAX_NR_ZONES] = { 0 };
unsigned long nr_skipped[MAX_NR_ZONES] = { 0, };
+ unsigned long skipped = 0, total_skipped = 0;
unsigned long scan, nr_pages;
LIST_HEAD(pages_skipped);

@@ -1479,14 +1480,13 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
*/
if (!list_empty(&pages_skipped)) {
int zid;
- unsigned long total_skipped = 0;

for (zid = 0; zid < MAX_NR_ZONES; zid++) {
if (!nr_skipped[zid])
continue;

__count_zid_vm_events(PGSCAN_SKIP, zid, nr_skipped[zid]);
- total_skipped += nr_skipped[zid];
+ skipped += nr_skipped[zid];
}

/*
@@ -1494,13 +1494,13 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
* close to unreclaimable. If the LRU list is empty, account
* skipped pages as a full scan.
*/
- scan += list_empty(src) ? total_skipped : total_skipped >> 2;
+ total_skipped = list_empty(src) ? skipped : skipped >> 2;

list_splice(&pages_skipped, src);
}
- *nr_scanned = scan;
+ *nr_scanned = scan + total_skipped;
trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan, scan,
- nr_taken, mode, is_file_lru(lru));
+ skipped, nr_taken, mode, is_file_lru(lru));
update_lru_sizes(lruvec, lru, nr_zone_taken, nr_taken);
return nr_taken;
}
--
2.10.2

2016-12-28 15:31:15

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 7/7] mm, vmscan: add mm_vmscan_inactive_list_is_low tracepoint

From: Michal Hocko <[email protected]>

Currently we have tracepoints for both active and inactive LRU list
reclaim but we do not have any which would tell us why we decided to
age the active list. Without that it is quite hard to diagnose
active/inactive list balancing. Add the mm_vmscan_inactive_list_is_low
tracepoint to give us this information.

Signed-off-by: Michal Hocko <[email protected]>
---
include/trace/events/vmscan.h | 40 ++++++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 23 ++++++++++++++---------
2 files changed, 54 insertions(+), 9 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index d27606f27af7..02c038c570a9 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -15,6 +15,7 @@
#define RECLAIM_WB_MIXED 0x0010u
#define RECLAIM_WB_SYNC 0x0004u /* Unused, all reclaim async */
#define RECLAIM_WB_ASYNC 0x0008u
+#define RECLAIM_WB_LRU (RECLAIM_WB_ANON|RECLAIM_WB_FILE)

#define show_reclaim_flags(flags) \
(flags) ? __print_flags(flags, "|", \
@@ -436,6 +437,45 @@ TRACE_EVENT(mm_vmscan_lru_shrink_active,
show_reclaim_flags(__entry->reclaim_flags))
);

+TRACE_EVENT(mm_vmscan_inactive_list_is_low,
+
+ TP_PROTO(int nid, int reclaim_idx,
+ unsigned long total_inactive, unsigned long inactive,
+ unsigned long total_active, unsigned long active,
+ unsigned long ratio, int file),
+
+ TP_ARGS(nid, reclaim_idx, total_inactive, inactive, total_active, active, ratio, file),
+
+ TP_STRUCT__entry(
+ __field(int, nid)
+ __field(int, reclaim_idx)
+ __field(unsigned long, total_inactive)
+ __field(unsigned long, inactive)
+ __field(unsigned long, total_active)
+ __field(unsigned long, active)
+ __field(unsigned long, ratio)
+ __field(int, reclaim_flags)
+ ),
+
+ TP_fast_assign(
+ __entry->nid = nid;
+ __entry->reclaim_idx = reclaim_idx;
+ __entry->total_inactive = total_inactive;
+ __entry->inactive = inactive;
+ __entry->total_active = total_active;
+ __entry->active = active;
+ __entry->ratio = ratio;
+ __entry->reclaim_flags = trace_shrink_flags(file) & RECLAIM_WB_LRU;
+ ),
+
+ TP_printk("nid=%d reclaim_idx=%d total_inactive=%ld inactive=%ld total_active=%ld active=%ld ratio=%ld flags=%s",
+ __entry->nid,
+ __entry->reclaim_idx,
+ __entry->total_inactive, __entry->inactive,
+ __entry->total_active, __entry->active,
+ __entry->ratio,
+ show_reclaim_flags(__entry->reclaim_flags))
+);
#endif /* _TRACE_VMSCAN_H */

/* This part must be outside protection */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a701bdd6334a..8021401213e0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2041,11 +2041,11 @@ static void shrink_active_list(unsigned long nr_to_scan,
* 10TB 320 32GB
*/
static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
- struct scan_control *sc)
+ struct scan_control *sc, bool trace)
{
unsigned long inactive_ratio;
- unsigned long inactive;
- unsigned long active;
+ unsigned long total_inactive, inactive;
+ unsigned long total_active, active;
unsigned long gb;
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
int zid;
@@ -2057,8 +2057,8 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
if (!file && !total_swap_pages)
return false;

- inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
- active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
+ total_inactive = inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
+ total_active = active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);

/*
* For zone-constrained allocations, it is necessary to check if
@@ -2087,6 +2087,11 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
else
inactive_ratio = 1;

+ if (trace)
+ trace_mm_vmscan_inactive_list_is_low(pgdat->node_id,
+ sc->reclaim_idx,
+ total_inactive, inactive,
+ total_active, active, inactive_ratio, file);
return inactive * inactive_ratio < active;
}

@@ -2094,7 +2099,7 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
struct lruvec *lruvec, struct scan_control *sc)
{
if (is_active_lru(lru)) {
- if (inactive_list_is_low(lruvec, is_file_lru(lru), sc))
+ if (inactive_list_is_low(lruvec, is_file_lru(lru), sc, true))
shrink_active_list(nr_to_scan, lruvec, sc, lru);
return 0;
}
@@ -2225,7 +2230,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
* lruvec even if it has plenty of old anonymous pages unless the
* system is under heavy pressure.
*/
- if (!inactive_list_is_low(lruvec, true, sc) &&
+ if (!inactive_list_is_low(lruvec, true, sc, false) &&
lruvec_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) {
scan_balance = SCAN_FILE;
goto out;
@@ -2450,7 +2455,7 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
* Even if we did not try to evict anon pages at all, we want to
* rebalance the anon lru active/inactive ratio.
*/
- if (inactive_list_is_low(lruvec, false, sc))
+ if (inactive_list_is_low(lruvec, false, sc, true))
shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
sc, LRU_ACTIVE_ANON);
}
@@ -3100,7 +3105,7 @@ static void age_active_anon(struct pglist_data *pgdat,
do {
struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);

- if (inactive_list_is_low(lruvec, false, sc))
+ if (inactive_list_is_low(lruvec, false, sc, true))
shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
sc, LRU_ACTIVE_ANON);

--
2.10.2

2016-12-28 15:31:28

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 5/7] mm, vmscan: extract shrink_page_list reclaim counters into a struct

From: Michal Hocko <[email protected]>

shrink_page_list returns quite a few counters to its caller. Extract
the existing 5 into struct reclaim_stat because this makes the code
easier to follow and also allows further counters to be returned.

While we are at it, make all of them unsigned rather than unsigned long
as we do not really need the full 64b for them (we never scan more than
SWAP_CLUSTER_MAX pages at once). This should reduce some stack space.

This patch shouldn't introduce any functional change.

Signed-off-by: Michal Hocko <[email protected]>
---
mm/vmscan.c | 61 ++++++++++++++++++++++++++++++-------------------------------
1 file changed, 30 insertions(+), 31 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3f0774f30a42..f6f2d828968c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -902,6 +902,14 @@ static void page_check_dirty_writeback(struct page *page,
mapping->a_ops->is_dirty_writeback(page, dirty, writeback);
}

+struct reclaim_stat {
+ unsigned nr_dirty;
+ unsigned nr_unqueued_dirty;
+ unsigned nr_congested;
+ unsigned nr_writeback;
+ unsigned nr_immediate;
+};
+
/*
* shrink_page_list() returns the number of reclaimed pages
*/
@@ -909,22 +917,18 @@ static unsigned long shrink_page_list(struct list_head *page_list,
struct pglist_data *pgdat,
struct scan_control *sc,
enum ttu_flags ttu_flags,
- unsigned long *ret_nr_dirty,
- unsigned long *ret_nr_unqueued_dirty,
- unsigned long *ret_nr_congested,
- unsigned long *ret_nr_writeback,
- unsigned long *ret_nr_immediate,
+ struct reclaim_stat *stat,
bool force_reclaim)
{
LIST_HEAD(ret_pages);
LIST_HEAD(free_pages);
int pgactivate = 0;
- unsigned long nr_unqueued_dirty = 0;
- unsigned long nr_dirty = 0;
- unsigned long nr_congested = 0;
- unsigned long nr_reclaimed = 0;
- unsigned long nr_writeback = 0;
- unsigned long nr_immediate = 0;
+ unsigned nr_unqueued_dirty = 0;
+ unsigned nr_dirty = 0;
+ unsigned nr_congested = 0;
+ unsigned nr_reclaimed = 0;
+ unsigned nr_writeback = 0;
+ unsigned nr_immediate = 0;

cond_resched();

@@ -1266,11 +1270,13 @@ static unsigned long shrink_page_list(struct list_head *page_list,
list_splice(&ret_pages, page_list);
count_vm_events(PGACTIVATE, pgactivate);

- *ret_nr_dirty += nr_dirty;
- *ret_nr_congested += nr_congested;
- *ret_nr_unqueued_dirty += nr_unqueued_dirty;
- *ret_nr_writeback += nr_writeback;
- *ret_nr_immediate += nr_immediate;
+ if (stat) {
+ stat->nr_dirty = nr_dirty;
+ stat->nr_congested = nr_congested;
+ stat->nr_unqueued_dirty = nr_unqueued_dirty;
+ stat->nr_writeback = nr_writeback;
+ stat->nr_immediate = nr_immediate;
+ }
return nr_reclaimed;
}

@@ -1282,7 +1288,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
.priority = DEF_PRIORITY,
.may_unmap = 1,
};
- unsigned long ret, dummy1, dummy2, dummy3, dummy4, dummy5;
+ unsigned long ret;
struct page *page, *next;
LIST_HEAD(clean_pages);

@@ -1295,8 +1301,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
}

ret = shrink_page_list(&clean_pages, zone->zone_pgdat, &sc,
- TTU_UNMAP|TTU_IGNORE_ACCESS,
- &dummy1, &dummy2, &dummy3, &dummy4, &dummy5, true);
+ TTU_UNMAP|TTU_IGNORE_ACCESS, NULL, true);
list_splice(&clean_pages, page_list);
mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, -ret);
return ret;
@@ -1696,11 +1701,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
unsigned long nr_scanned;
unsigned long nr_reclaimed = 0;
unsigned long nr_taken;
- unsigned long nr_dirty = 0;
- unsigned long nr_congested = 0;
- unsigned long nr_unqueued_dirty = 0;
- unsigned long nr_writeback = 0;
- unsigned long nr_immediate = 0;
+ struct reclaim_stat stat = {};
isolate_mode_t isolate_mode = 0;
int file = is_file_lru(lru);
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
@@ -1745,9 +1746,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
return 0;

nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, TTU_UNMAP,
- &nr_dirty, &nr_unqueued_dirty, &nr_congested,
- &nr_writeback, &nr_immediate,
- false);
+ &stat, false);

spin_lock_irq(&pgdat->lru_lock);

@@ -1781,7 +1780,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
* of pages under pages flagged for immediate reclaim and stall if any
* are encountered in the nr_immediate check below.
*/
- if (nr_writeback && nr_writeback == nr_taken)
+ if (stat.nr_writeback && stat.nr_writeback == nr_taken)
set_bit(PGDAT_WRITEBACK, &pgdat->flags);

/*
@@ -1793,7 +1792,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
* Tag a zone as congested if all the dirty pages scanned were
* backed by a congested BDI and wait_iff_congested will stall.
*/
- if (nr_dirty && nr_dirty == nr_congested)
+ if (stat.nr_dirty && stat.nr_dirty == stat.nr_congested)
set_bit(PGDAT_CONGESTED, &pgdat->flags);

/*
@@ -1802,7 +1801,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
* the pgdat PGDAT_DIRTY and kswapd will start writing pages from
* reclaim context.
*/
- if (nr_unqueued_dirty == nr_taken)
+ if (stat.nr_unqueued_dirty == nr_taken)
set_bit(PGDAT_DIRTY, &pgdat->flags);

/*
@@ -1811,7 +1810,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
* that pages are cycling through the LRU faster than
* they are written so also forcibly stall.
*/
- if (nr_immediate && current_may_throttle())
+ if (stat.nr_immediate && current_may_throttle())
congestion_wait(BLK_RW_ASYNC, HZ/10);
}

--
2.10.2

2016-12-28 15:31:47

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 4/7] mm, vmscan: show LRU name in mm_vmscan_lru_isolate tracepoint

From: Michal Hocko <[email protected]>

mm_vmscan_lru_isolate currently prints only whether the LRU we isolate
from is file or anonymous but we do not know whether it is the active
or the inactive list. It is useful to know both. Change the tracepoint
to show the symbolic name of the lru instead.

Signed-off-by: Michal Hocko <[email protected]>
---
include/trace/events/vmscan.h | 20 ++++++++++++++------
mm/vmscan.c | 2 +-
2 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 6af4dae46db2..cc0b4c456c78 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -36,6 +36,14 @@
(RECLAIM_WB_ASYNC) \
)

+#define show_lru_name(lru) \
+ __print_symbolic(lru, \
+ {LRU_INACTIVE_ANON, "LRU_INACTIVE_ANON"}, \
+ {LRU_ACTIVE_ANON, "LRU_ACTIVE_ANON"}, \
+ {LRU_INACTIVE_FILE, "LRU_INACTIVE_FILE"}, \
+ {LRU_ACTIVE_FILE, "LRU_ACTIVE_FILE"}, \
+ {LRU_UNEVICTABLE, "LRU_UNEVICTABLE"})
+
TRACE_EVENT(mm_vmscan_kswapd_sleep,

TP_PROTO(int nid),
@@ -277,9 +285,9 @@ TRACE_EVENT(mm_vmscan_lru_isolate,
unsigned long nr_skipped,
unsigned long nr_taken,
isolate_mode_t isolate_mode,
- int file),
+ int lru),

- TP_ARGS(classzone_idx, order, nr_requested, nr_scanned, nr_skipped, nr_taken, isolate_mode, file),
+ TP_ARGS(classzone_idx, order, nr_requested, nr_scanned, nr_skipped, nr_taken, isolate_mode, lru),

TP_STRUCT__entry(
__field(int, classzone_idx)
@@ -289,7 +297,7 @@ TRACE_EVENT(mm_vmscan_lru_isolate,
__field(unsigned long, nr_skipped)
__field(unsigned long, nr_taken)
__field(isolate_mode_t, isolate_mode)
- __field(int, file)
+ __field(int, lru)
),

TP_fast_assign(
@@ -300,10 +308,10 @@ TRACE_EVENT(mm_vmscan_lru_isolate,
__entry->nr_skipped = nr_skipped;
__entry->nr_taken = nr_taken;
__entry->isolate_mode = isolate_mode;
- __entry->file = file;
+ __entry->lru = lru;
),

- TP_printk("isolate_mode=%d classzone=%d order=%d nr_requested=%lu nr_scanned=%lu nr_skipped=%lu nr_taken=%lu file=%d",
+ TP_printk("isolate_mode=%d classzone=%d order=%d nr_requested=%lu nr_scanned=%lu nr_skipped=%lu nr_taken=%lu lru=%s",
__entry->isolate_mode,
__entry->classzone_idx,
__entry->order,
@@ -311,7 +319,7 @@ TRACE_EVENT(mm_vmscan_lru_isolate,
__entry->nr_scanned,
__entry->nr_skipped,
__entry->nr_taken,
- __entry->file)
+ show_lru_name(__entry->lru))
);

TRACE_EVENT(mm_vmscan_writepage,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4f7c0d66d629..3f0774f30a42 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1500,7 +1500,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
}
*nr_scanned = scan + total_skipped;
trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan, scan,
- skipped, nr_taken, mode, is_file_lru(lru));
+ skipped, nr_taken, mode, lru);
update_lru_sizes(lruvec, lru, nr_zone_taken, nr_taken);
return nr_taken;
}
--
2.10.2

2016-12-28 15:32:02

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 2/7] mm, vmscan: add active list aging tracepoint

From: Michal Hocko <[email protected]>

Our reclaim process has several tracepoints to tell us more about how
things are progressing. We are, however, missing a tracepoint to track
active list aging. Introduce mm_vmscan_lru_shrink_active which reports
the number of scanned, rotated, deactivated and freed pages from the
particular node's active list.

Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/gfp.h | 2 +-
include/trace/events/vmscan.h | 38 ++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 6 +++++-
mm/vmscan.c | 22 +++++++++++++++++-----
4 files changed, 61 insertions(+), 7 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 4175dca4ac39..61aa9b49e86d 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -503,7 +503,7 @@ void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask);
extern void __free_pages(struct page *page, unsigned int order);
extern void free_pages(unsigned long addr, unsigned int order);
extern void free_hot_cold_page(struct page *page, bool cold);
-extern void free_hot_cold_page_list(struct list_head *list, bool cold);
+extern int free_hot_cold_page_list(struct list_head *list, bool cold);

struct page_frag_cache;
extern void __page_frag_drain(struct page *page, unsigned int order,
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 39bad8921ca1..d34cc0ced2be 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -363,6 +363,44 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
show_reclaim_flags(__entry->reclaim_flags))
);

+TRACE_EVENT(mm_vmscan_lru_shrink_active,
+
+ TP_PROTO(int nid, unsigned long nr_scanned, unsigned long nr_freed,
+ unsigned long nr_unevictable, unsigned long nr_deactivated,
+ unsigned long nr_rotated, int priority, int file),
+
+ TP_ARGS(nid, nr_scanned, nr_freed, nr_unevictable, nr_deactivated, nr_rotated, priority, file),
+
+ TP_STRUCT__entry(
+ __field(int, nid)
+ __field(unsigned long, nr_scanned)
+ __field(unsigned long, nr_freed)
+ __field(unsigned long, nr_unevictable)
+ __field(unsigned long, nr_deactivated)
+ __field(unsigned long, nr_rotated)
+ __field(int, priority)
+ __field(int, reclaim_flags)
+ ),
+
+ TP_fast_assign(
+ __entry->nid = nid;
+ __entry->nr_scanned = nr_scanned;
+ __entry->nr_freed = nr_freed;
+ __entry->nr_unevictable = nr_unevictable;
+ __entry->nr_deactivated = nr_deactivated;
+ __entry->nr_rotated = nr_rotated;
+ __entry->priority = priority;
+ __entry->reclaim_flags = trace_shrink_flags(file);
+ ),
+
+ TP_printk("nid=%d nr_scanned=%ld nr_freed=%ld nr_unevictable=%ld nr_deactivated=%ld nr_rotated=%ld priority=%d flags=%s",
+ __entry->nid,
+ __entry->nr_scanned, __entry->nr_freed, __entry->nr_unevictable,
+ __entry->nr_deactivated, __entry->nr_rotated,
+ __entry->priority,
+ show_reclaim_flags(__entry->reclaim_flags))
+);
+
#endif /* _TRACE_VMSCAN_H */

/* This part must be outside protection */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1c24112308d6..77d204660857 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2487,14 +2487,18 @@ void free_hot_cold_page(struct page *page, bool cold)
/*
* Free a list of 0-order pages
*/
-void free_hot_cold_page_list(struct list_head *list, bool cold)
+int free_hot_cold_page_list(struct list_head *list, bool cold)
{
struct page *page, *next;
+ int ret = 0;

list_for_each_entry_safe(page, next, list, lru) {
trace_mm_page_free_batched(page, cold);
free_hot_cold_page(page, cold);
+ ret++;
}
+
+ return ret;
}

/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c4abf08861d2..2302a1a58c6e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1846,9 +1846,11 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
*
* The downside is that we have to touch page->_refcount against each page.
* But we had to alter page->flags anyway.
+ *
+ * Returns the number of pages moved to the given lru.
*/

-static void move_active_pages_to_lru(struct lruvec *lruvec,
+static int move_active_pages_to_lru(struct lruvec *lruvec,
struct list_head *list,
struct list_head *pages_to_free,
enum lru_list lru)
@@ -1857,6 +1859,7 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
unsigned long pgmoved = 0;
struct page *page;
int nr_pages;
+ int nr_moved = 0;

while (!list_empty(list)) {
page = lru_to_page(list);
@@ -1882,11 +1885,15 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
spin_lock_irq(&pgdat->lru_lock);
} else
list_add(&page->lru, pages_to_free);
+ } else {
+ nr_moved++;
}
}

if (!is_active_lru(lru))
__count_vm_events(PGDEACTIVATE, pgmoved);
+
+ return nr_moved;
}

static void shrink_active_list(unsigned long nr_to_scan,
@@ -1902,7 +1909,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
LIST_HEAD(l_inactive);
struct page *page;
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
- unsigned long nr_rotated = 0;
+ unsigned long nr_rotated = 0, nr_unevictable = 0;
+ unsigned long nr_freed, nr_deactivate, nr_activate;
isolate_mode_t isolate_mode = 0;
int file = is_file_lru(lru);
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
@@ -1935,6 +1943,7 @@ static void shrink_active_list(unsigned long nr_to_scan,

if (unlikely(!page_evictable(page))) {
putback_lru_page(page);
+ nr_unevictable++;
continue;
}

@@ -1980,13 +1989,16 @@ static void shrink_active_list(unsigned long nr_to_scan,
*/
reclaim_stat->recent_rotated[file] += nr_rotated;

- move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
- move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
+ nr_activate = move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
+ nr_deactivate = move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
spin_unlock_irq(&pgdat->lru_lock);

mem_cgroup_uncharge_list(&l_hold);
- free_hot_cold_page_list(&l_hold, true);
+ nr_freed = free_hot_cold_page_list(&l_hold, true);
+ trace_mm_vmscan_lru_shrink_active(pgdat->node_id, nr_scanned, nr_freed,
+ nr_unevictable, nr_deactivate, nr_rotated,
+ sc->priority, file);
}

/*
--
2.10.2

2016-12-28 15:50:36

by Nikolay Borisov

Subject: Re: [PATCH 4/7] mm, vmscan: show LRU name in mm_vmscan_lru_isolate tracepoint



On 28.12.2016 17:30, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> mm_vmscan_lru_isolate currently prints only whether the LRU we isolate
> from is file or anonymous but we do not know which LRU this is. It is
> useful to know whether the list is file or anonymous as well. Change
Maybe you wanted to say whether the list is ACTIVE/INACTIVE?

> the tracepoint to show symbolic names of the lru rather.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> include/trace/events/vmscan.h | 20 ++++++++++++++------
> mm/vmscan.c | 2 +-
> 2 files changed, 15 insertions(+), 7 deletions(-)
>
> diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
> index 6af4dae46db2..cc0b4c456c78 100644
> --- a/include/trace/events/vmscan.h
> +++ b/include/trace/events/vmscan.h
> @@ -36,6 +36,14 @@
> (RECLAIM_WB_ASYNC) \
> )
>
> +#define show_lru_name(lru) \
> + __print_symbolic(lru, \
> + {LRU_INACTIVE_ANON, "LRU_INACTIVE_ANON"}, \
> + {LRU_ACTIVE_ANON, "LRU_ACTIVE_ANON"}, \
> + {LRU_INACTIVE_FILE, "LRU_INACTIVE_FILE"}, \
> + {LRU_ACTIVE_FILE, "LRU_ACTIVE_FILE"}, \
> + {LRU_UNEVICTABLE, "LRU_UNEVICTABLE"})
> +
> TRACE_EVENT(mm_vmscan_kswapd_sleep,
>
> TP_PROTO(int nid),
> @@ -277,9 +285,9 @@ TRACE_EVENT(mm_vmscan_lru_isolate,
> unsigned long nr_skipped,
> unsigned long nr_taken,
> isolate_mode_t isolate_mode,
> - int file),
> + int lru),
>
> - TP_ARGS(classzone_idx, order, nr_requested, nr_scanned, nr_skipped, nr_taken, isolate_mode, file),
> + TP_ARGS(classzone_idx, order, nr_requested, nr_scanned, nr_skipped, nr_taken, isolate_mode, lru),
>
> TP_STRUCT__entry(
> __field(int, classzone_idx)
> @@ -289,7 +297,7 @@ TRACE_EVENT(mm_vmscan_lru_isolate,
> __field(unsigned long, nr_skipped)
> __field(unsigned long, nr_taken)
> __field(isolate_mode_t, isolate_mode)
> - __field(int, file)
> + __field(int, lru)
> ),
>
> TP_fast_assign(
> @@ -300,10 +308,10 @@ TRACE_EVENT(mm_vmscan_lru_isolate,
> __entry->nr_skipped = nr_skipped;
> __entry->nr_taken = nr_taken;
> __entry->isolate_mode = isolate_mode;
> - __entry->file = file;
> + __entry->lru = lru;
> ),
>
> - TP_printk("isolate_mode=%d classzone=%d order=%d nr_requested=%lu nr_scanned=%lu nr_skipped=%lu nr_taken=%lu file=%d",
> + TP_printk("isolate_mode=%d classzone=%d order=%d nr_requested=%lu nr_scanned=%lu nr_skipped=%lu nr_taken=%lu lru=%s",
> __entry->isolate_mode,
> __entry->classzone_idx,
> __entry->order,
> @@ -311,7 +319,7 @@ TRACE_EVENT(mm_vmscan_lru_isolate,
> __entry->nr_scanned,
> __entry->nr_skipped,
> __entry->nr_taken,
> - __entry->file)
> + show_lru_name(__entry->lru))
> );
>
> TRACE_EVENT(mm_vmscan_writepage,
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4f7c0d66d629..3f0774f30a42 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1500,7 +1500,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> }
> *nr_scanned = scan + total_skipped;
> trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan, scan,
> - skipped, nr_taken, mode, is_file_lru(lru));
> + skipped, nr_taken, mode, lru);
> update_lru_sizes(lruvec, lru, nr_zone_taken, nr_taken);
> return nr_taken;
> }
>

2016-12-28 16:00:35

by Michal Hocko

Subject: Re: [PATCH 4/7] mm, vmscan: show LRU name in mm_vmscan_lru_isolate tracepoint

On Wed 28-12-16 17:50:31, Nikolay Borisov wrote:
>
>
> On 28.12.2016 17:30, Michal Hocko wrote:
> > From: Michal Hocko <[email protected]>
> >
> > mm_vmscan_lru_isolate currently prints only whether the LRU we isolate
> > from is file or anonymous but we do not know which LRU this is. It is
> > useful to know whether the list is file or anonymous as well. Change
>
> Maybe you wanted to say whether the list is ACTIVE/INACTIVE ?

You are right. I will update the wording to:
"
mm_vmscan_lru_isolate currently prints only whether the LRU we isolate
from is file or anonymous but we do not know which LRU this is. It is
useful to know whether the list is active or inactive as well as we
use the same function to isolate pages for both of them. Change
the tracepoint to show symbolic names of the lru rather.
"

Does it sound better?

Thanks!
--
Michal Hocko
SUSE Labs

2016-12-28 16:40:21

by Nikolay Borisov

Subject: Re: [PATCH 4/7] mm, vmscan: show LRU name in mm_vmscan_lru_isolate tracepoint



On 28.12.2016 18:00, Michal Hocko wrote:
> On Wed 28-12-16 17:50:31, Nikolay Borisov wrote:
>>
>>
>> On 28.12.2016 17:30, Michal Hocko wrote:
>>> From: Michal Hocko <[email protected]>
>>>
>>> mm_vmscan_lru_isolate currently prints only whether the LRU we isolate
>>> from is file or anonymous but we do not know which LRU this is. It is
>>> useful to know whether the list is file or anonymous as well. Change
>>
>> Maybe you wanted to say whether the list is ACTIVE/INACTIVE ?
>
> You are right. I will update the wording to:
> "
> mm_vmscan_lru_isolate currently prints only whether the LRU we isolate
> from is file or anonymous but we do not know which LRU this is. It is
> useful to know whether the list is active or inactive as well as we
> use the same function to isolate pages for both of them. Change
> the tracepoint to show symbolic names of the lru rather.
> "
>
> Does it sound better?

It's better. Just one more nit about the "as well as we
use the same function to isolate pages for both of them" part.

I think this can be reworded better. The way I understand it is: it's
better to know whether it's active/inactive, since we are using the same
function to do both, correct? If so, then perhaps the following is a
bit clearer:

"
It is useful to know whether the list is active or inactive, since we
are using the same function to isolate pages from both of them and it's
hard to distinguish otherwise.
"

But as I said - it's a minor nit.


>
> Thanks!
>

2016-12-28 16:49:59

by Michal Hocko

Subject: Re: [PATCH 4/7] mm, vmscan: show LRU name in mm_vmscan_lru_isolate tracepoint

On Wed 28-12-16 18:40:16, Nikolay Borisov wrote:
[...]
> "
> It is useful to know whether the list is active or inactive, since we
> are using the same function to isolate pages from both of them and it's
> hard to distinguish otherwise.
> "

OK, updated. Thanks!
--
Michal Hocko
SUSE Labs

2016-12-29 05:34:03

by Minchan Kim

Subject: Re: [PATCH 2/7] mm, vmscan: add active list aging tracepoint

On Wed, Dec 28, 2016 at 04:30:27PM +0100, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> Our reclaim process has several tracepoints to tell us more about how
> things are progressing. We are, however, missing a tracepoint to track
> active list aging. Introduce mm_vmscan_lru_shrink_active which reports
> the number of scanned, rotated, deactivated and freed pages from the
> particular node's active list.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> include/linux/gfp.h | 2 +-
> include/trace/events/vmscan.h | 38 ++++++++++++++++++++++++++++++++++++++
> mm/page_alloc.c | 6 +++++-
> mm/vmscan.c | 22 +++++++++++++++++-----
> 4 files changed, 61 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 4175dca4ac39..61aa9b49e86d 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -503,7 +503,7 @@ void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask);
> extern void __free_pages(struct page *page, unsigned int order);
> extern void free_pages(unsigned long addr, unsigned int order);
> extern void free_hot_cold_page(struct page *page, bool cold);
> -extern void free_hot_cold_page_list(struct list_head *list, bool cold);
> +extern int free_hot_cold_page_list(struct list_head *list, bool cold);
>
> struct page_frag_cache;
> extern void __page_frag_drain(struct page *page, unsigned int order,
> diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
> index 39bad8921ca1..d34cc0ced2be 100644
> --- a/include/trace/events/vmscan.h
> +++ b/include/trace/events/vmscan.h
> @@ -363,6 +363,44 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
> show_reclaim_flags(__entry->reclaim_flags))
> );
>
> +TRACE_EVENT(mm_vmscan_lru_shrink_active,
> +
> + TP_PROTO(int nid, unsigned long nr_scanned, unsigned long nr_freed,
> + unsigned long nr_unevictable, unsigned long nr_deactivated,
> + unsigned long nr_rotated, int priority, int file),
> +
> + TP_ARGS(nid, nr_scanned, nr_freed, nr_unevictable, nr_deactivated, nr_rotated, priority, file),

I agree it is helpful, and it was when I investigated an aging problem
on 32-bit after node-lru was introduced. However, the question is whether
we really need all those kinds of information. Isn't it enough with just
nr_taken, nr_deactivated, priority and file?

Also, please look at a minor thing below.

Thanks.

> +
> + TP_STRUCT__entry(
> + __field(int, nid)
> + __field(unsigned long, nr_scanned)
> + __field(unsigned long, nr_freed)
> + __field(unsigned long, nr_unevictable)
> + __field(unsigned long, nr_deactivated)
> + __field(unsigned long, nr_rotated)
> + __field(int, priority)
> + __field(int, reclaim_flags)
> + ),
> +
> + TP_fast_assign(
> + __entry->nid = nid;
> + __entry->nr_scanned = nr_scanned;
> + __entry->nr_freed = nr_freed;
> + __entry->nr_unevictable = nr_unevictable;
> + __entry->nr_deactivated = nr_deactivated;
> + __entry->nr_rotated = nr_rotated;
> + __entry->priority = priority;
> + __entry->reclaim_flags = trace_shrink_flags(file);
> + ),
> +
> + TP_printk("nid=%d nr_scanned=%ld nr_freed=%ld nr_unevictable=%ld nr_deactivated=%ld nr_rotated=%ld priority=%d flags=%s",
> + __entry->nid,
> + __entry->nr_scanned, __entry->nr_freed, __entry->nr_unevictable,
> + __entry->nr_deactivated, __entry->nr_rotated,
> + __entry->priority,
> + show_reclaim_flags(__entry->reclaim_flags))
> +);
> +
> #endif /* _TRACE_VMSCAN_H */
>
> /* This part must be outside protection */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1c24112308d6..77d204660857 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2487,14 +2487,18 @@ void free_hot_cold_page(struct page *page, bool cold)
> /*
> * Free a list of 0-order pages
> */
> -void free_hot_cold_page_list(struct list_head *list, bool cold)
> +int free_hot_cold_page_list(struct list_head *list, bool cold)
> {
> struct page *page, *next;
> + int ret = 0;
>
> list_for_each_entry_safe(page, next, list, lru) {
> trace_mm_page_free_batched(page, cold);
> free_hot_cold_page(page, cold);
> + ret++;
> }
> +
> + return ret;
> }
>
> /*
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c4abf08861d2..2302a1a58c6e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1846,9 +1846,11 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> *
> * The downside is that we have to touch page->_refcount against each page.
> * But we had to alter page->flags anyway.
> + *
> + * Returns the number of pages moved to the given lru.
> */
>
> -static void move_active_pages_to_lru(struct lruvec *lruvec,
> +static int move_active_pages_to_lru(struct lruvec *lruvec,
> struct list_head *list,
> struct list_head *pages_to_free,
> enum lru_list lru)
> @@ -1857,6 +1859,7 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
> unsigned long pgmoved = 0;
> struct page *page;
> int nr_pages;
> + int nr_moved = 0;
>
> while (!list_empty(list)) {
> page = lru_to_page(list);
> @@ -1882,11 +1885,15 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
> spin_lock_irq(&pgdat->lru_lock);
> } else
> list_add(&page->lru, pages_to_free);
> + } else {
> + nr_moved++;
> }
> }
>
> if (!is_active_lru(lru))
> __count_vm_events(PGDEACTIVATE, pgmoved);
> +
> + return nr_moved;
> }
>
> static void shrink_active_list(unsigned long nr_to_scan,
> @@ -1902,7 +1909,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
> LIST_HEAD(l_inactive);
> struct page *page;
> struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> - unsigned long nr_rotated = 0;
> + unsigned long nr_rotated = 0, nr_unevictable = 0;
> + unsigned long nr_freed, nr_deactivate, nr_activate;
> isolate_mode_t isolate_mode = 0;
> int file = is_file_lru(lru);
> struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> @@ -1935,6 +1943,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>
> if (unlikely(!page_evictable(page))) {
> putback_lru_page(page);
> + nr_unevictable++;
> continue;
> }
>
> @@ -1980,13 +1989,16 @@ static void shrink_active_list(unsigned long nr_to_scan,
> */
> reclaim_stat->recent_rotated[file] += nr_rotated;
>
> - move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
> - move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
> + nr_activate = move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);

Who uses nr_activate here?

> + nr_deactivate = move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
> __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
> spin_unlock_irq(&pgdat->lru_lock);
>
> mem_cgroup_uncharge_list(&l_hold);
> - free_hot_cold_page_list(&l_hold, true);
> + nr_freed = free_hot_cold_page_list(&l_hold, true);
> + trace_mm_vmscan_lru_shrink_active(pgdat->node_id, nr_scanned, nr_freed,
> + nr_unevictable, nr_deactivate, nr_rotated,
> + sc->priority, file);
> }
>
> /*
> --
> 2.10.2
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2016-12-29 05:53:37

by Minchan Kim

Subject: Re: [PATCH 3/7] mm, vmscan: show the number of skipped pages in mm_vmscan_lru_isolate

On Wed, Dec 28, 2016 at 04:30:28PM +0100, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> mm_vmscan_lru_isolate shows the number of requested, scanned and taken
> pages. This is mostly OK but on 32b systems the number of scanned pages
> is quite misleading because it includes both the scanned and skipped
> pages. Moreover the skipped part is scaled based on the number of taken
> pages. Let's report the exact numbers without any additional logic and
> add the number of skipped pages. This should make the reported data much
> more easier to interpret.
>
> Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Minchan Kim <[email protected]>

2016-12-29 06:02:08

by Minchan Kim

Subject: Re: [PATCH 4/7] mm, vmscan: show LRU name in mm_vmscan_lru_isolate tracepoint

On Wed, Dec 28, 2016 at 04:30:29PM +0100, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> mm_vmscan_lru_isolate currently prints only whether the LRU we isolate
> from is file or anonymous but we do not know which LRU this is. It is
> useful to know whether the list is file or anonymous as well. Change
> the tracepoint to show symbolic names of the lru rather.
>
> Signed-off-by: Michal Hocko <[email protected]>

Not exactly the same as this, but the idea is almost the same.
I used an almost identical tracepoint to investigate an aging (i.e.,
deactivating) problem in a 32-bit kernel with node-lru.
It was enough; namely, I didn't need a tracepoint in shrink_active_list
like your first patch adds.
Your first patch is more straightforward and informative, but since you
introduced this patch, I want to ask here:
isn't this patch enough, without your first one, to find such a problem?

Thanks.

> ---
> include/trace/events/vmscan.h | 20 ++++++++++++++------
> mm/vmscan.c | 2 +-
> 2 files changed, 15 insertions(+), 7 deletions(-)
>
> diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
> index 6af4dae46db2..cc0b4c456c78 100644
> --- a/include/trace/events/vmscan.h
> +++ b/include/trace/events/vmscan.h
> @@ -36,6 +36,14 @@
> (RECLAIM_WB_ASYNC) \
> )
>
> +#define show_lru_name(lru) \
> + __print_symbolic(lru, \
> + {LRU_INACTIVE_ANON, "LRU_INACTIVE_ANON"}, \
> + {LRU_ACTIVE_ANON, "LRU_ACTIVE_ANON"}, \
> + {LRU_INACTIVE_FILE, "LRU_INACTIVE_FILE"}, \
> + {LRU_ACTIVE_FILE, "LRU_ACTIVE_FILE"}, \
> + {LRU_UNEVICTABLE, "LRU_UNEVICTABLE"})
> +
> TRACE_EVENT(mm_vmscan_kswapd_sleep,
>
> TP_PROTO(int nid),
> @@ -277,9 +285,9 @@ TRACE_EVENT(mm_vmscan_lru_isolate,
> unsigned long nr_skipped,
> unsigned long nr_taken,
> isolate_mode_t isolate_mode,
> - int file),
> + int lru),
>
> - TP_ARGS(classzone_idx, order, nr_requested, nr_scanned, nr_skipped, nr_taken, isolate_mode, file),
> + TP_ARGS(classzone_idx, order, nr_requested, nr_scanned, nr_skipped, nr_taken, isolate_mode, lru),
>
> TP_STRUCT__entry(
> __field(int, classzone_idx)
> @@ -289,7 +297,7 @@ TRACE_EVENT(mm_vmscan_lru_isolate,
> __field(unsigned long, nr_skipped)
> __field(unsigned long, nr_taken)
> __field(isolate_mode_t, isolate_mode)
> - __field(int, file)
> + __field(int, lru)
> ),
>
> TP_fast_assign(
> @@ -300,10 +308,10 @@ TRACE_EVENT(mm_vmscan_lru_isolate,
> __entry->nr_skipped = nr_skipped;
> __entry->nr_taken = nr_taken;
> __entry->isolate_mode = isolate_mode;
> - __entry->file = file;
> + __entry->lru = lru;
> ),
>
> - TP_printk("isolate_mode=%d classzone=%d order=%d nr_requested=%lu nr_scanned=%lu nr_skipped=%lu nr_taken=%lu file=%d",
> + TP_printk("isolate_mode=%d classzone=%d order=%d nr_requested=%lu nr_scanned=%lu nr_skipped=%lu nr_taken=%lu lru=%s",
> __entry->isolate_mode,
> __entry->classzone_idx,
> __entry->order,
> @@ -311,7 +319,7 @@ TRACE_EVENT(mm_vmscan_lru_isolate,
> __entry->nr_scanned,
> __entry->nr_skipped,
> __entry->nr_taken,
> - __entry->file)
> + show_lru_name(__entry->lru))
> );
>
> TRACE_EVENT(mm_vmscan_writepage,
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4f7c0d66d629..3f0774f30a42 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1500,7 +1500,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> }
> *nr_scanned = scan + total_skipped;
> trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan, scan,
> - skipped, nr_taken, mode, is_file_lru(lru));
> + skipped, nr_taken, mode, lru);
> update_lru_sizes(lruvec, lru, nr_zone_taken, nr_taken);
> return nr_taken;
> }
> --
> 2.10.2
>

2016-12-29 07:40:48

by Hillf Danton

Subject: Re: [PATCH 1/7] mm, vmscan: remove unused mm_vmscan_memcg_isolate


On Wednesday, December 28, 2016 11:30 PM Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> the trace point is not used since 925b7673cce3 ("mm: make per-memcg LRU
> lists exclusive") so it can be removed.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
Acked-by: Hillf Danton <[email protected]>


2016-12-29 07:49:53

by Hillf Danton

Subject: Re: [PATCH 2/7] mm, vmscan: add active list aging tracepoint


On Wednesday, December 28, 2016 11:30 PM Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> Our reclaim process has several tracepoints to tell us more about how
> things are progressing. We are, however, missing a tracepoint to track
> active list aging. Introduce mm_vmscan_lru_shrink_active which reports
> the number of scanned, rotated, deactivated and freed pages from the
> particular node's active list.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
Acked-by: Hillf Danton <[email protected]>


2016-12-29 07:52:51

by Michal Hocko

Subject: Re: [PATCH 2/7] mm, vmscan: add active list aging tracepoint

On Thu 29-12-16 14:33:59, Minchan Kim wrote:
> On Wed, Dec 28, 2016 at 04:30:27PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <[email protected]>
> >
> > Our reclaim process has several tracepoints to tell us more about how
> > things are progressing. We are, however, missing a tracepoint to track
> > active list aging. Introduce mm_vmscan_lru_shrink_active which reports
> > the number of scanned, rotated, deactivated and freed pages from the
> > particular node's active list.
> >
> > Signed-off-by: Michal Hocko <[email protected]>
> > ---
> > include/linux/gfp.h | 2 +-
> > include/trace/events/vmscan.h | 38 ++++++++++++++++++++++++++++++++++++++
> > mm/page_alloc.c | 6 +++++-
> > mm/vmscan.c | 22 +++++++++++++++++-----
> > 4 files changed, 61 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index 4175dca4ac39..61aa9b49e86d 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -503,7 +503,7 @@ void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask);
> > extern void __free_pages(struct page *page, unsigned int order);
> > extern void free_pages(unsigned long addr, unsigned int order);
> > extern void free_hot_cold_page(struct page *page, bool cold);
> > -extern void free_hot_cold_page_list(struct list_head *list, bool cold);
> > +extern int free_hot_cold_page_list(struct list_head *list, bool cold);
> >
> > struct page_frag_cache;
> > extern void __page_frag_drain(struct page *page, unsigned int order,
> > diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
> > index 39bad8921ca1..d34cc0ced2be 100644
> > --- a/include/trace/events/vmscan.h
> > +++ b/include/trace/events/vmscan.h
> > @@ -363,6 +363,44 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
> > show_reclaim_flags(__entry->reclaim_flags))
> > );
> >
> > +TRACE_EVENT(mm_vmscan_lru_shrink_active,
> > +
> > + TP_PROTO(int nid, unsigned long nr_scanned, unsigned long nr_freed,
> > + unsigned long nr_unevictable, unsigned long nr_deactivated,
> > + unsigned long nr_rotated, int priority, int file),
> > +
> > + TP_ARGS(nid, nr_scanned, nr_freed, nr_unevictable, nr_deactivated, nr_rotated, priority, file),
>
> I agree it is helpful. And it was when I investigated aging problem of 32bit
> when node-lru was introduced. However, the question is we really need all those
> kinds of information? just enough with nr_taken, nr_deactivated, priority, file?

Dunno. Is it harmful to add this information? I like it more when the
numbers just add up and you have a clear picture. You never know what
might be useful when debugging a weird behavior.

[...]
> > - move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
> > - move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
> > + nr_activate = move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
>
> Who use nr_active in here?

this is an omission. I just forgot to add it... Thanks for noticing.

--
Michal Hocko
SUSE Labs

2016-12-29 07:57:34

by Michal Hocko

Subject: Re: [PATCH 4/7] mm, vmscan: show LRU name in mm_vmscan_lru_isolate tracepoint

On Thu 29-12-16 15:02:04, Minchan Kim wrote:
> On Wed, Dec 28, 2016 at 04:30:29PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <[email protected]>
> >
> > mm_vmscan_lru_isolate currently prints only whether the LRU we isolate
> > from is file or anonymous but we do not know which LRU this is. It is
> > useful to know whether the list is file or anonymous as well. Change
> > the tracepoint to show symbolic names of the lru rather.
> >
> > Signed-off-by: Michal Hocko <[email protected]>
>
> Not exactly same with this but idea is almost same.
> I used almost same tracepoint to investigate agging(i.e., deactivating) problem
> in 32b kernel with node-lru.
> It was enough. Namely, I didn't need tracepoint in shrink_active_list like your
> first patch.
> Your first patch is more straightforwad and information. But as you introduced
> this patch, I want to ask in here.
> Isn't it enough with this patch without your first one to find a such problem?

I assume this should be a reply to
http://lkml.kernel.org/r/[email protected], right?
And you are right that for the particular problem it was enough to have
a tracepoint inside inactive_list_is_low and the shrink_active_list one
wasn't really needed. On the other hand, aging issues are really hard to
debug as well, so I think both are useful. The first one tells us
_why_ we do aging while the latter tells us _how_ we do it.
--
Michal Hocko
SUSE Labs

2016-12-29 07:57:45

by Hillf Danton

Subject: Re: [PATCH 3/7] mm, vmscan: show the number of skipped pages in mm_vmscan_lru_isolate

On Wednesday, December 28, 2016 11:30 PM Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> mm_vmscan_lru_isolate shows the number of requested, scanned and taken
> pages. This is mostly OK but on 32b systems the number of scanned pages
> is quite misleading because it includes both the scanned and skipped
> pages. Moreover the skipped part is scaled based on the number of taken
> pages. Let's report the exact numbers without any additional logic and
> add the number of skipped pages. This should make the reported data much
> more easier to interpret.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
Acked-by: Hillf Danton <[email protected]>

2016-12-29 07:59:47

by Hillf Danton

Subject: Re: [PATCH 4/7] mm, vmscan: show LRU name in mm_vmscan_lru_isolate tracepoint


On Wednesday, December 28, 2016 11:30 PM Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> mm_vmscan_lru_isolate currently prints only whether the LRU we isolate
> from is file or anonymous but we do not know which LRU this is. It is
> useful to know whether the list is file or anonymous as well. Change
> the tracepoint to show symbolic names of the lru rather.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
Acked-by: Hillf Danton <[email protected]>

2016-12-29 08:06:16

by Hillf Danton

Subject: Re: [PATCH 5/7] mm, vmscan: extract shrink_page_list reclaim counters into a struct



On Wednesday, December 28, 2016 11:31 PM Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> shrink_page_list returns quite some counters back to its caller. Extract
> the existing 5 into struct reclaim_stat because this makes the code
> easier to follow and also allows further counters to be returned.
>
> While we are at it, make all of them unsigned rather than unsigned long
> as we do not really need full 64b for them (we never scan more than
> SWAP_CLUSTER_MAX pages at once). This should reduce some stack space.
>
> This patch shouldn't introduce any functional change.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
Acked-by: Hillf Danton <[email protected]>

2016-12-29 08:11:34

by Hillf Danton

Subject: Re: [PATCH 6/7] mm, vmscan: enhance mm_vmscan_lru_shrink_inactive tracepoint



On Wednesday, December 28, 2016 11:31 PM Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> mm_vmscan_lru_shrink_inactive will currently report the number of
> scanned and reclaimed pages. This doesn't give us an idea how the
> reclaim went except for the overall effectiveness though. Export
> and show other counters which will tell us why we couldn't reclaim
> some pages.
> - nr_dirty, nr_writeback, nr_congested and nr_immediate tells
> us how many pages are blocked due to IO
> - nr_activate tells us how many pages were moved to the active
> list
> - nr_ref_keep reports how many pages are kept on the LRU due
> to references (mostly for the file pages which are about to
> go for another round through the inactive list)
> - nr_unmap_fail - how many pages failed to unmap
>
> All these are rather low level so they might change in future but the
> tracepoint is already implementation specific so no tools should be
> depending on its stability.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
Acked-by: Hillf Danton <[email protected]>


2016-12-29 08:25:12

by Hillf Danton

Subject: Re: [PATCH 7/7] mm, vmscan: add mm_vmscan_inactive_list_is_low tracepoint

On Wednesday, December 28, 2016 11:31 PM Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> Currently we have tracepoints for both active and inactive LRU lists
> reclaim but we do not have any which would tell us why we we decided to
> age the active list. Without that it is quite hard to diagnose
> active/inactive lists balancing. Add mm_vmscan_inactive_list_is_low
> tracepoint to tell us this information.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
Acked-by: Hillf Danton <[email protected]>


2016-12-30 01:48:57

by Minchan Kim

Subject: Re: [PATCH 2/7] mm, vmscan: add active list aging tracepoint

On Thu, Dec 29, 2016 at 08:52:46AM +0100, Michal Hocko wrote:
> On Thu 29-12-16 14:33:59, Minchan Kim wrote:
> > On Wed, Dec 28, 2016 at 04:30:27PM +0100, Michal Hocko wrote:
> > > From: Michal Hocko <[email protected]>
> > >
> > > Our reclaim process has several tracepoints to tell us more about how
> > > things are progressing. We are, however, missing a tracepoint to track
> > > active list aging. Introduce mm_vmscan_lru_shrink_active which reports
> > > the number of scanned, rotated, deactivated and freed pages from the
> > > particular node's active list.
> > >
> > > Signed-off-by: Michal Hocko <[email protected]>
> > > ---
> > > include/linux/gfp.h | 2 +-
> > > include/trace/events/vmscan.h | 38 ++++++++++++++++++++++++++++++++++++++
> > > mm/page_alloc.c | 6 +++++-
> > > mm/vmscan.c | 22 +++++++++++++++++-----
> > > 4 files changed, 61 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > > index 4175dca4ac39..61aa9b49e86d 100644
> > > --- a/include/linux/gfp.h
> > > +++ b/include/linux/gfp.h
> > > @@ -503,7 +503,7 @@ void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask);
> > > extern void __free_pages(struct page *page, unsigned int order);
> > > extern void free_pages(unsigned long addr, unsigned int order);
> > > extern void free_hot_cold_page(struct page *page, bool cold);
> > > -extern void free_hot_cold_page_list(struct list_head *list, bool cold);
> > > +extern int free_hot_cold_page_list(struct list_head *list, bool cold);
> > >
> > > struct page_frag_cache;
> > > extern void __page_frag_drain(struct page *page, unsigned int order,
> > > diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
> > > index 39bad8921ca1..d34cc0ced2be 100644
> > > --- a/include/trace/events/vmscan.h
> > > +++ b/include/trace/events/vmscan.h
> > > @@ -363,6 +363,44 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
> > > show_reclaim_flags(__entry->reclaim_flags))
> > > );
> > >
> > > +TRACE_EVENT(mm_vmscan_lru_shrink_active,
> > > +
> > > + TP_PROTO(int nid, unsigned long nr_scanned, unsigned long nr_freed,
> > > + unsigned long nr_unevictable, unsigned long nr_deactivated,
> > > + unsigned long nr_rotated, int priority, int file),
> > > +
> > > + TP_ARGS(nid, nr_scanned, nr_freed, nr_unevictable, nr_deactivated, nr_rotated, priority, file),
> >
> > I agree it is helpful. And it was when I investigated aging problem of 32bit
> > when node-lru was introduced. However, the question is we really need all those
> > kinds of information? just enough with nr_taken, nr_deactivated, priority, file?
>
> Dunno. Is it harmful to add this information? I like it more when the
> numbers just add up and you have a clear picture. You never know what
> might be useful when debugging a weird behavior.

Michal, I'm not a huge fan of "might be useful", even though it's a small
piece of code. It adds all kinds of overhead (memory footprint, runtime
performance, maintenance) without any proven benefit.

If we allow such things, people will start adding more with just "why not,
it might be useful, you never know the future", and it ends up turning
Linux into a fiction-novel mess.

If it's really needed, someday someone will follow up and send (or ask for)
a patch with a detailed description of why the stat is important and how it
helps solve some problem. From that we can learn about the workload and the
way the problem was solved, and the git history keeps the valuable
description so newcomers can catch up with the community easily. Then,
finally, the overheads are justified and the patch gets merged.

Please add only the must-haves for the goal you described.

2016-12-30 01:56:28

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 4/7] mm, vmscan: show LRU name in mm_vmscan_lru_isolate tracepoint

On Thu, Dec 29, 2016 at 08:56:49AM +0100, Michal Hocko wrote:
> On Thu 29-12-16 15:02:04, Minchan Kim wrote:
> > On Wed, Dec 28, 2016 at 04:30:29PM +0100, Michal Hocko wrote:
> > > From: Michal Hocko <[email protected]>
> > >
> > > mm_vmscan_lru_isolate currently prints only whether the LRU we isolate
> > > from is file or anonymous but we do not know which LRU this is. It is
> > > useful to know whether the list is file or anonymous as well. Change
> > > the tracepoint to show symbolic names of the lru rather.
> > >
> > > Signed-off-by: Michal Hocko <[email protected]>
> >
> > Not exactly the same as this, but the idea is almost the same.
> > I used almost the same tracepoint to investigate an aging (i.e., deactivating)
> > problem in a 32b kernel with node-lru.
> > It was enough. Namely, I didn't need a tracepoint in shrink_active_list like
> > your first patch adds.
> > Your first patch is more straightforward and informative. But since you
> > introduced this patch, I want to ask here:
> > isn't this patch, without your first one, enough to find such a problem?
>
> I assume this should be a reply to
> http://lkml.kernel.org/r/[email protected], right?

I don't know; my browser says "No such Message-ID known"

> And you are right that for the particular problem it was enough to have
> a tracepoint inside inactive_list_is_low and shrink_active_list one
> wasn't really needed. On the other hand aging issues are really hard to

What kind of aging issue? What's the problem? How can such a tracepoint help?
Please describe.

> debug as well and so I think that both are useful. The first one tell us
> _why_ we do aging while the later _how_ we do that.

Solve the reported problem you already know about first. That would be merged
without any doubt; then send the other "it might be useful" patches along with
a useful scenario.

2016-12-30 09:11:22

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 0/7] vm, vmscan: enhance vmscan tracepoints

On Wed, Dec 28, 2016 at 04:30:25PM +0100, Michal Hocko wrote:
> Hi,
> while debugging [1] I've realized that there is some room for
> improvements in the tracepoints set we offer currently. I had hard times
> to make any conclusion from the existing ones. The resulting problem
> turned out to be active list aging [2] and we are missing at least two
> tracepoints to debug such a problem.
>
> Some existing tracepoints could export more information to see _why_ the
> reclaim progress cannot be made not only _how much_ we could reclaim.
> The later could be seen quite reasonably from the vmstat counters
> already. It can be argued that we are showing too many implementation
> details in those tracepoints but I consider them way too lowlevel
> already to be usable by any kernel independent userspace. I would be
> _really_ surprised if anything but debugging tools have used them.
>
> Any feedback is highly appreciated.
>

There is some minor overhead introduced in some paths regardless of
whether the tracepoints are active or not but I suspect it's negligible
in the context of the overhead of reclaim in general so;

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
SUSE Labs

2016-12-30 09:26:43

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 2/7] mm, vmscan: add active list aging tracepoint

On Fri 30-12-16 10:48:53, Minchan Kim wrote:
> On Thu, Dec 29, 2016 at 08:52:46AM +0100, Michal Hocko wrote:
> > On Thu 29-12-16 14:33:59, Minchan Kim wrote:
> > > On Wed, Dec 28, 2016 at 04:30:27PM +0100, Michal Hocko wrote:
[...]
> > > > +TRACE_EVENT(mm_vmscan_lru_shrink_active,
> > > > +
> > > > + TP_PROTO(int nid, unsigned long nr_scanned, unsigned long nr_freed,
> > > > + unsigned long nr_unevictable, unsigned long nr_deactivated,
> > > > + unsigned long nr_rotated, int priority, int file),
> > > > +
> > > > + TP_ARGS(nid, nr_scanned, nr_freed, nr_unevictable, nr_deactivated, nr_rotated, priority, file),
> > >
> > > I agree it is helpful. And it was when I investigated aging problem of 32bit
> > > when node-lru was introduced. However, the question is we really need all those
> > > kinds of information? just enough with nr_taken, nr_deactivated, priority, file?
> >
> > Dunno. Is it harmful to add this information? I like it more when the
> > numbers just add up and you have a clear picture. You never know what
> > might be useful when debugging a weird behavior.
>
> Michal, I'm not huge fan of "might be useful" although it's a small piece of code.

But these are tracepoints. One of their primary reasons to exist is
to help debug things. And it is really hard to predict what might be
useful in advance. It is not like the patch would export numbers which
would be irrelevant to the reclaim.

> It adds just all of kinds overheads (memory footprint, runtime performance,
> maintainance) without any proved benefit.

Does it really add any measurable overhead or maintenance burden? I
think the only place we could argue about is free_hot_cold_page_list
which is used in hot paths.

I think we can sacrifice it. The same goes for culled unevictable
pages. We wouldn't know what makes up the missing part,
nr_taken-(nr_activate+nr_deactivate), because those pages could be either freed
or moved to the unevictable list, but that could be handled by a separate
tracepoint in putback_lru_page, which sounds like a useful thing anyway.

> If we allow such things, people would start adding more things with just "why not,
> it might be useful. you never know the future" and it ends up making linux fiction
> novel mess.

I agree with this concern in general, but does it apply in this
particular case?

> If it's necessary, someday, someone will catch up and will send or ask patch with
> detailed description "why the stat is important and how it is good for us to solve
> some problem".

I can certainly enhance the changelog. See below.

> From that, we can learn workload, way to solve the problem and git
> history has the valuable description so new comers can keep the community up easily.
> So, finally, overheads are justified and get merged.
>
> Please add must-have for your goal described.

My primary point is that tracepoints which do not give us a good picture
are quite useless and force us to add trace_printk or other means to
get further information. Then I wonder why we should have an incomplete
tracepoint at all.

Anyway, what do you think about this updated patch? I have kept Hillf's
A-b so please let me know if it is no longer valid.
---
From 5f1bc22ad1e54050b4da3228d68945e70342ebb6 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Tue, 27 Dec 2016 13:18:20 +0100
Subject: [PATCH] mm, vmscan: add active list aging tracepoint

Our reclaim process has several tracepoints to tell us more about how
things are progressing. We are, however, missing a tracepoint to track
active list aging. Introduce mm_vmscan_lru_shrink_active which reports
the number of
- nr_scanned, nr_taken pages to tell us the LRU isolation
  effectiveness.
- nr_rotated pages, which tells us that we are hitting referenced
  pages which get deactivated. If this is a large part of the
  reported nr_deactivated pages then the active list is too small.
- nr_activated pages, which tells us how many pages are kept on the
  active list - mostly exec pages. A high number can indicate
  that we might be thrashing on executables.

Changes since v1
- report nr_taken pages as per Minchan
- report nr_activated as per Minchan
- do not report nr_freed pages because that would add a tiny overhead to
free_hot_cold_page_list which is a hot path
- do not report nr_unevictable because we can report this number via a
different and more generic tracepoint in putback_lru_page
- fix move_active_pages_to_lru to report the proper page count when we
  hit large pages

Acked-by: Hillf Danton <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
include/trace/events/vmscan.h | 38 ++++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 18 ++++++++++++++----
2 files changed, 52 insertions(+), 4 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 39bad8921ca1..f9ef242ece1b 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -363,6 +363,44 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
show_reclaim_flags(__entry->reclaim_flags))
);

+TRACE_EVENT(mm_vmscan_lru_shrink_active,
+
+ TP_PROTO(int nid, unsigned long nr_scanned, unsigned long nr_taken,
+ unsigned long nr_activate, unsigned long nr_deactivated,
+ unsigned long nr_rotated, int priority, int file),
+
+ TP_ARGS(nid, nr_scanned, nr_taken, nr_activate, nr_deactivated, nr_rotated, priority, file),
+
+ TP_STRUCT__entry(
+ __field(int, nid)
+ __field(unsigned long, nr_scanned)
+ __field(unsigned long, nr_taken)
+ __field(unsigned long, nr_activate)
+ __field(unsigned long, nr_deactivated)
+ __field(unsigned long, nr_rotated)
+ __field(int, priority)
+ __field(int, reclaim_flags)
+ ),
+
+ TP_fast_assign(
+ __entry->nid = nid;
+ __entry->nr_scanned = nr_scanned;
+ __entry->nr_taken = nr_taken;
+ __entry->nr_activate = nr_activate;
+ __entry->nr_deactivated = nr_deactivated;
+ __entry->nr_rotated = nr_rotated;
+ __entry->priority = priority;
+ __entry->reclaim_flags = trace_shrink_flags(file);
+ ),
+
+ TP_printk("nid=%d nr_scanned=%ld nr_taken=%ld nr_activated=%ld nr_deactivated=%ld nr_rotated=%ld priority=%d flags=%s",
+ __entry->nid,
+ __entry->nr_scanned, __entry->nr_taken,
+ __entry->nr_activate, __entry->nr_deactivated, __entry->nr_rotated,
+ __entry->priority,
+ show_reclaim_flags(__entry->reclaim_flags))
+);
+
#endif /* _TRACE_VMSCAN_H */

/* This part must be outside protection */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c4abf08861d2..4da4d8d0496c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1846,9 +1846,11 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
*
* The downside is that we have to touch page->_refcount against each page.
* But we had to alter page->flags anyway.
+ *
+ * Returns the number of pages moved to the given lru.
*/

-static void move_active_pages_to_lru(struct lruvec *lruvec,
+static unsigned move_active_pages_to_lru(struct lruvec *lruvec,
struct list_head *list,
struct list_head *pages_to_free,
enum lru_list lru)
@@ -1857,6 +1859,7 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
unsigned long pgmoved = 0;
struct page *page;
int nr_pages;
+ int nr_moved = 0;

while (!list_empty(list)) {
page = lru_to_page(list);
@@ -1882,11 +1885,15 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
spin_lock_irq(&pgdat->lru_lock);
} else
list_add(&page->lru, pages_to_free);
+ } else {
+ nr_moved += nr_pages;
}
}

if (!is_active_lru(lru))
__count_vm_events(PGDEACTIVATE, pgmoved);
+
+ return nr_moved;
}

static void shrink_active_list(unsigned long nr_to_scan,
@@ -1902,7 +1909,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
LIST_HEAD(l_inactive);
struct page *page;
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
- unsigned long nr_rotated = 0;
+ unsigned nr_deactivate, nr_activate;
+ unsigned nr_rotated = 0;
isolate_mode_t isolate_mode = 0;
int file = is_file_lru(lru);
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
@@ -1980,13 +1988,15 @@ static void shrink_active_list(unsigned long nr_to_scan,
*/
reclaim_stat->recent_rotated[file] += nr_rotated;

- move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
- move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
+ nr_activate = move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
+ nr_deactivate = move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
spin_unlock_irq(&pgdat->lru_lock);

mem_cgroup_uncharge_list(&l_hold);
free_hot_cold_page_list(&l_hold, true);
+ trace_mm_vmscan_lru_shrink_active(pgdat->node_id, nr_scanned, nr_taken,
+ nr_activate, nr_deactivate, nr_rotated, sc->priority, file);
}

/*
--
2.10.2


--
Michal Hocko
SUSE Labs

2016-12-30 09:33:15

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 4/7] mm, vmscan: show LRU name in mm_vmscan_lru_isolate tracepoint

On Fri 30-12-16 10:56:25, Minchan Kim wrote:
> On Thu, Dec 29, 2016 at 08:56:49AM +0100, Michal Hocko wrote:
> > On Thu 29-12-16 15:02:04, Minchan Kim wrote:
> > > On Wed, Dec 28, 2016 at 04:30:29PM +0100, Michal Hocko wrote:
> > > > From: Michal Hocko <[email protected]>
> > > >
> > > > mm_vmscan_lru_isolate currently prints only whether the LRU we isolate
> > > > from is file or anonymous but we do not know which LRU this is. It is
> > > > useful to know whether the list is file or anonymous as well. Change
> > > > the tracepoint to show symbolic names of the lru rather.
> > > >
> > > > Signed-off-by: Michal Hocko <[email protected]>
> > >
> > > Not exactly the same as this, but the idea is almost the same.
> > > I used almost the same tracepoint to investigate an aging (i.e., deactivating)
> > > problem in a 32b kernel with node-lru.
> > > It was enough. Namely, I didn't need a tracepoint in shrink_active_list like
> > > your first patch adds.
> > > Your first patch is more straightforward and informative. But since you
> > > introduced this patch, I want to ask here:
> > > isn't this patch, without your first one, enough to find such a problem?
> >
> > I assume this should be a reply to
> > http://lkml.kernel.org/r/[email protected], right?
>
> I don't know my browser says "No such Message-ID known"

Hmm, not sure why it didn't get archived at lkml.kernel.org.
I meant https://lkml.org/lkml/2016/12/28/167

> > And you are right that for the particular problem it was enough to have
> > a tracepoint inside inactive_list_is_low and shrink_active_list one
> > wasn't really needed. On the other hand aging issues are really hard to
>
> What kinds of aging issue? What's the problem? How such tracepoint can help?
> Please describe.

If you do not see that the active list is shrunk then you do not know why it
is not shrunk. It might be the active/inactive ratio or just a plain bug
like the 32b issue you and I were debugging.

> > debug as well and so I think that both are useful. The first one tell us
> > _why_ we do aging while the later _how_ we do that.
>
> Solve reported problem first you already knew. It would be no doubt
> to merge and then send other patches about "it might be useful" with
> useful scenario.

I am not sure I understand. The point of tracepoints is to be
pro-actively helpful, not only to add something that has been useful in
one-off cases. A particular debugging session might be really helpful to
tell us what we are missing, and that was the case here to a large extent.
Once I was looking there, I just wanted to save the pain of adding more
debugging information in the future and allow people to debug their issues
without forcing them to recompile the kernel. I believe this is one of
the strong use cases for tracepoints in the first place.

--
Michal Hocko
SUSE Labs

2016-12-30 09:37:00

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/7] vm, vmscan: enhance vmscan tracepoints

On Fri 30-12-16 09:11:17, Mel Gorman wrote:
> On Wed, Dec 28, 2016 at 04:30:25PM +0100, Michal Hocko wrote:
> > Hi,
> > while debugging [1] I've realized that there is some room for
> > improvements in the tracepoints set we offer currently. I had hard times
> > to make any conclusion from the existing ones. The resulting problem
> > turned out to be active list aging [2] and we are missing at least two
> > tracepoints to debug such a problem.
> >
> > Some existing tracepoints could export more information to see _why_ the
> > reclaim progress cannot be made not only _how much_ we could reclaim.
> > The later could be seen quite reasonably from the vmstat counters
> > already. It can be argued that we are showing too many implementation
> > details in those tracepoints but I consider them way too lowlevel
> > already to be usable by any kernel independent userspace. I would be
> > _really_ surprised if anything but debugging tools have used them.
> >
> > Any feedback is highly appreciated.
> >
>
> There is some minor overhead introduced in some paths regardless of
> whether the tracepoints are active or not but I suspect it's negligible
> in the context of the overhead of reclaim in general so;

I will work on improving some of them. E.g. I've dropped the change to
free_hot_cold_page_list because that is indeed a hot path, but other than
that there shouldn't be even a medium-hot path that sees any overhead,
as far as I can see. If you are aware of any, please let me know and I
will think about how to improve it.

> Acked-by: Mel Gorman <[email protected]>

Thanks!
--
Michal Hocko
SUSE Labs

2016-12-30 09:44:14

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCH 2/7] mm, vmscan: add active list aging tracepoint


On Friday, December 30, 2016 5:27 PM Michal Hocko wrote:
> Anyway, what do you think about this updated patch? I have kept Hillf's
> A-b so please let me know if it is no longer valid.
>
My mind is not changed:)

Happy new year folks!

Hillf

2016-12-30 10:20:49

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 0/7] vm, vmscan: enhance vmscan tracepoints

On Fri, Dec 30, 2016 at 10:36:55AM +0100, Michal Hocko wrote:
> On Fri 30-12-16 09:11:17, Mel Gorman wrote:
> > On Wed, Dec 28, 2016 at 04:30:25PM +0100, Michal Hocko wrote:
> > > Hi,
> > > while debugging [1] I've realized that there is some room for
> > > improvements in the tracepoints set we offer currently. I had hard times
> > > to make any conclusion from the existing ones. The resulting problem
> > > turned out to be active list aging [2] and we are missing at least two
> > > tracepoints to debug such a problem.
> > >
> > > Some existing tracepoints could export more information to see _why_ the
> > > reclaim progress cannot be made not only _how much_ we could reclaim.
> > > The later could be seen quite reasonably from the vmstat counters
> > > already. It can be argued that we are showing too many implementation
> > > details in those tracepoints but I consider them way too lowlevel
> > > already to be usable by any kernel independent userspace. I would be
> > > _really_ surprised if anything but debugging tools have used them.
> > >
> > > Any feedback is highly appreciated.
> > >
> >
> > There is some minor overhead introduced in some paths regardless of
> > whether the tracepoints are active or not but I suspect it's negligible
> > in the context of the overhead of reclaim in general so;
>
> I will work on improving some of them. E.g. I've dropped the change to
> free_hot_cold_page_list because that is indeed a hot path but other than
> that there shouldn't be any even medium hot path which should see any
> overhead I can see. If you are aware of any, please let me know and I
> will think about how to improve it.
>

I didn't spot one. The path where I saw the most overhead is already
quite heavy.

--
Mel Gorman
SUSE Labs

2016-12-30 16:05:04

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 2/7] mm, vmscan: add active list aging tracepoint

On Fri, Dec 30, 2016 at 10:26:37AM +0100, Michal Hocko wrote:
> On Fri 30-12-16 10:48:53, Minchan Kim wrote:
> > On Thu, Dec 29, 2016 at 08:52:46AM +0100, Michal Hocko wrote:
> > > On Thu 29-12-16 14:33:59, Minchan Kim wrote:
> > > > On Wed, Dec 28, 2016 at 04:30:27PM +0100, Michal Hocko wrote:
> [...]
> > > > > +TRACE_EVENT(mm_vmscan_lru_shrink_active,
> > > > > +
> > > > > + TP_PROTO(int nid, unsigned long nr_scanned, unsigned long nr_freed,
> > > > > + unsigned long nr_unevictable, unsigned long nr_deactivated,
> > > > > + unsigned long nr_rotated, int priority, int file),
> > > > > +
> > > > > + TP_ARGS(nid, nr_scanned, nr_freed, nr_unevictable, nr_deactivated, nr_rotated, priority, file),
> > > >
> > > > I agree it is helpful. And it was when I investigated aging problem of 32bit
> > > > when node-lru was introduced. However, the question is we really need all those
> > > > kinds of information? just enough with nr_taken, nr_deactivated, priority, file?
> > >
> > > Dunno. Is it harmful to add this information? I like it more when the
> > > numbers just add up and you have a clear picture. You never know what
> > > might be useful when debugging a weird behavior.
> >
> > Michal, I'm not huge fan of "might be useful" although it's a small piece of code.
>
> But these are tracepoints. One of their primary reasons to exist is
> to help debug things. And it is really hard to predict what might be
> useful in advance. It is not like the patch would export numbers which
> would be irrelevant to the reclaim.

What's different?

Please think about what happens if everyone wants to add something with the
reason "it's a tracepoint which helps debugging and we cannot assume what
might be useful in the future."

>
> > It adds just all of kinds overheads (memory footprint, runtime performance,
> > maintainance) without any proved benefit.
>
> Does it really add any measurable overhead or the maintenance burden? I

Don't limit your thought to this particular case; expand the idea to
others who want to see some random value via a tracepoint with just a
"might-be-good" justification. We will lose the grounds to prevent that
trend if we merge any tracepoint expansion patch with just a
"might-be-useful" reason. Eventually, that would bite us.

> think the only place we could argue about is free_hot_cold_page_list
> which is used in hot paths.

From the point of view of shrinking the active list, what we want to know
is just (nr_taken|nr_deactivated|priority|file), and that's enough,
I think. So if you want to add nr_freed, nr_unevictable, nr_rotated,
please describe what problem we can solve with each of those numbers.

>
> I think we can sacrifice it. The same for culled unevictable
> pages. We wouldn't know what is the missing part
> nr_taken-(nr_activate+nr_deactivate) because it could be either freed or
> moved to the unevictable list but that could be handled in a separate
> tracepoint in putback_lru_page which sounds like a useful thing I guess.
>
> > If we allow such things, people would start adding more things with just "why not,
> > it might be useful. you never know the future" and it ends up making linux fiction
> > novel mess.
>
> I agree with this concern in general, but is this the case in this
> particular case?

I believe it's not different.

>
> > If it's necessary, someday, someone will catch up and will send or ask patch with
> > detailed description "why the stat is important and how it is good for us to solve
> > some problem".
>
> I can certainly enhance the changelog. See below.
>
> > From that, we can learn workload, way to solve the problem and git
> > history has the valuable description so new comers can keep the community up easily.
> > So, finally, overheads are justified and get merged.
> >
> > Please add must-have for your goal described.
>
> My primary point is that tracepoints which do not give us a good picture
> are quite useless and force us to add trace_printk or other means to
> give us further information. Then I wonder why to have an incomplete
> tracepoint at all.
>
> Anyway, what do you think about this updated patch? I have kept Hillf's
> A-b so please let me know if it is no longer valid.
> ---
> From 5f1bc22ad1e54050b4da3228d68945e70342ebb6 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Tue, 27 Dec 2016 13:18:20 +0100
> Subject: [PATCH] mm, vmscan: add active list aging tracepoint
>
> Our reclaim process has several tracepoints to tell us more about how
> things are progressing. We are, however, missing a tracepoint to track
> active list aging. Introduce mm_vmscan_lru_shrink_active which reports

I agree with this part.

> the number of
> - nr_scanned, nr_taken pages to tell us the LRU isolation
> effectiveness.

I agree with nr_taken for knowing shrinking effectiveness but don't
agree with nr_scanned. If we want to know LRU isolation effectiveness
with nr_scanned and nr_taken, isolate_lru_pages will do.

> - nr_rotated pages which tells us that we are hitting referenced
> pages which are deactivated. If this is a large part of the
> reported nr_deactivated pages then the active list is too small

It might be, but not exactly. If your goal is to know the LRU size, it can be
done in get_scan_count. I tend to agree that the LRU size is helpful for
performance analysis, because a decreased LRU size signals memory shortage
and then a performance drop.

> - nr_activated pages which tells us how many pages are keept on the
kept

> active list - mostly exec pages. A high number can indicate

file-based exec pages

> that we might be trashing on executables.

And you are welcome to drop nr_unevictable and nr_freed.

I will be off until Monday next week, so please understand if my response
is slow.

Thanks.

>
> Changes since v1
> - report nr_taken pages as per Minchan
> - report nr_activated as per Minchan
> - do not report nr_freed pages because that would add a tiny overhead to
> free_hot_cold_page_list which is a hot path
> - do not report nr_unevictable because we can report this number via a
> different and more generic tracepoint in putback_lru_page
> - fix move_active_pages_to_lru to report proper page count when we hit
> into large pages
>
> Acked-by: Hillf Danton <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> include/trace/events/vmscan.h | 38 ++++++++++++++++++++++++++++++++++++++
> mm/vmscan.c | 18 ++++++++++++++----
> 2 files changed, 52 insertions(+), 4 deletions(-)
>
> diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
> index 39bad8921ca1..f9ef242ece1b 100644
> --- a/include/trace/events/vmscan.h
> +++ b/include/trace/events/vmscan.h
> @@ -363,6 +363,44 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
> show_reclaim_flags(__entry->reclaim_flags))
> );
>
> +TRACE_EVENT(mm_vmscan_lru_shrink_active,
> +
> + TP_PROTO(int nid, unsigned long nr_scanned, unsigned long nr_taken,
> + unsigned long nr_activate, unsigned long nr_deactivated,
> + unsigned long nr_rotated, int priority, int file),
> +
> + TP_ARGS(nid, nr_scanned, nr_taken, nr_activate, nr_deactivated, nr_rotated, priority, file),
> +
> + TP_STRUCT__entry(
> + __field(int, nid)
> + __field(unsigned long, nr_scanned)
> + __field(unsigned long, nr_taken)
> + __field(unsigned long, nr_activate)
> + __field(unsigned long, nr_deactivated)
> + __field(unsigned long, nr_rotated)
> + __field(int, priority)
> + __field(int, reclaim_flags)
> + ),
> +
> + TP_fast_assign(
> + __entry->nid = nid;
> + __entry->nr_scanned = nr_scanned;
> + __entry->nr_taken = nr_taken;
> + __entry->nr_activate = nr_activate;
> + __entry->nr_deactivated = nr_deactivated;
> + __entry->nr_rotated = nr_rotated;
> + __entry->priority = priority;
> + __entry->reclaim_flags = trace_shrink_flags(file);
> + ),
> +
> + TP_printk("nid=%d nr_scanned=%ld nr_taken=%ld nr_activated=%ld nr_deactivated=%ld nr_rotated=%ld priority=%d flags=%s",
> + __entry->nid,
> + __entry->nr_scanned, __entry->nr_taken,
> + __entry->nr_activate, __entry->nr_deactivated, __entry->nr_rotated,
> + __entry->priority,
> + show_reclaim_flags(__entry->reclaim_flags))
> +);
> +
> #endif /* _TRACE_VMSCAN_H */
>
> /* This part must be outside protection */
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c4abf08861d2..4da4d8d0496c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1846,9 +1846,11 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> *
> * The downside is that we have to touch page->_refcount against each page.
> * But we had to alter page->flags anyway.
> + *
> + * Returns the number of pages moved to the given lru.
> */
>
> -static void move_active_pages_to_lru(struct lruvec *lruvec,
> +static unsigned move_active_pages_to_lru(struct lruvec *lruvec,
> struct list_head *list,
> struct list_head *pages_to_free,
> enum lru_list lru)
> @@ -1857,6 +1859,7 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
> unsigned long pgmoved = 0;
> struct page *page;
> int nr_pages;
> + int nr_moved = 0;
>
> while (!list_empty(list)) {
> page = lru_to_page(list);
> @@ -1882,11 +1885,15 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
> spin_lock_irq(&pgdat->lru_lock);
> } else
> list_add(&page->lru, pages_to_free);
> + } else {
> + nr_moved += nr_pages;
> }
> }
>
> if (!is_active_lru(lru))
> __count_vm_events(PGDEACTIVATE, pgmoved);
> +
> + return nr_moved;
> }
>
> static void shrink_active_list(unsigned long nr_to_scan,
> @@ -1902,7 +1909,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
> LIST_HEAD(l_inactive);
> struct page *page;
> struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> - unsigned long nr_rotated = 0;
> + unsigned nr_deactivate, nr_activate;
> + unsigned nr_rotated = 0;
> isolate_mode_t isolate_mode = 0;
> int file = is_file_lru(lru);
> struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> @@ -1980,13 +1988,15 @@ static void shrink_active_list(unsigned long nr_to_scan,
> */
> reclaim_stat->recent_rotated[file] += nr_rotated;
>
> - move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
> - move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
> + nr_activate = move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
> + nr_deactivate = move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
> __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
> spin_unlock_irq(&pgdat->lru_lock);
>
> mem_cgroup_uncharge_list(&l_hold);
> free_hot_cold_page_list(&l_hold, true);
> + trace_mm_vmscan_lru_shrink_active(pgdat->node_id, nr_scanned, nr_taken,
> + nr_activate, nr_deactivate, nr_rotated, sc->priority, file);
> }
>
> /*
> --
> 2.10.2
>
>
> --
> Michal Hocko
> SUSE Labs
>

2016-12-30 16:37:48

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 2/7] mm, vmscan: add active list aging tracepoint

On Sat 31-12-16 01:04:56, Minchan Kim wrote:
[...]
> > From 5f1bc22ad1e54050b4da3228d68945e70342ebb6 Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <[email protected]>
> > Date: Tue, 27 Dec 2016 13:18:20 +0100
> > Subject: [PATCH] mm, vmscan: add active list aging tracepoint
> >
> > Our reclaim process has several tracepoints to tell us more about how
> > things are progressing. We are, however, missing a tracepoint to track
> > active list aging. Introduce mm_vmscan_lru_shrink_active which reports
>
> I agree this part.
>
> > the number of
> > - nr_scanned, nr_taken pages to tell us the LRU isolation
> > effectiveness.
>
> I agree nr_taken for knowing shrinking effectiveness but don't
> agree nr_scanned. If we want to know LRU isolation effectiveness
> with nr_scanned and nr_taken, isolate_lru_pages will do.

Yes it will. On the other hand, the number is there and there is no
additional overhead, maintenance or otherwise, to provide it.
The inactive counterpart has done the same for quite some time already. So
why exactly does that matter? Don't get me wrong, but isn't this more on
the nit-picking side than necessary? Or do I just misunderstand your
concerns? It is not like we are providing a stable user API, as the
tracepoint is clearly implementation specific and not something to be
used for anything other than debugging.

> > - nr_rotated pages which tells us that we are hitting referenced
> > pages which are deactivated. If this is a large part of the
> > reported nr_deactivated pages then the active list is too small
>
> It might be but not exactly. If your goal is to know LRU size, it can be
> done in get_scan_count. I tend to agree LRU size is helpful for
> performance analysis because decreased LRU size signals memory shortage
> then performance drop.

No, I am not really interested in the exact size, but rather in being able
to find out whether we are aging the active list too early...

>
> > - nr_activated pages which tells us how many pages are keept on the
> kept

fixed

>
> > active list - mostly exec pages. A high number can indicate
>
> file-based exec pages

OK, fixed

>
> > that we might be trashing on executables.
>
> And welcome to drop nr_unevictable, nr_freed.
>
> I will be off until next week monday so please understand if my response
> is slow.

There is no reason to hurry...
--
Michal Hocko
SUSE Labs

2016-12-30 17:30:30

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 2/7] mm, vmscan: add active list aging tracepoint

On Fri 30-12-16 17:37:42, Michal Hocko wrote:
> On Sat 31-12-16 01:04:56, Minchan Kim wrote:
[...]
> > > - nr_rotated pages which tells us that we are hitting referenced
> > > pages which are deactivated. If this is a large part of the
> > > reported nr_deactivated pages then the active list is too small
> >
> > It might be but not exactly. If your goal is to know LRU size, it can be
> > done in get_scan_count. I tend to agree LRU size is helpful for
> > performance analysis because decreased LRU size signals memory shortage
> > then performance drop.
>
> No, I am not really interested in the exact size but rather to allow to
> find whether we are aging the active list too early...

But thinking about this some more, maybe sticking with the nr_rotated
terminology is rather confusing, and displaying the value as nr_referenced
would be clearer.
--
Michal Hocko
SUSE Labs