2019-06-03 22:00:18

by Johannes Weiner

Subject: [PATCH 00/11] mm: fix page aging across multiple cgroups

When applications are put into unconfigured cgroups for memory
accounting purposes, the cgrouping itself should not change the
behavior of the page reclaim code. We expect the VM to reclaim the
coldest pages in the system. But right now the VM can reclaim hot
pages in one cgroup while there is eligible cold cache in others.

This is because one part of the reclaim algorithm isn't truly cgroup
hierarchy aware: the inactive/active list balancing. That is the part
that is supposed to protect hot cache data from one-off streaming IO.

The recursive cgroup reclaim scheme will scan and rotate the physical
LRU lists of each eligible cgroup at the same rate in a round-robin
fashion, thereby establishing a relative order among the pages of all
those cgroups. However, the inactive/active balancing decisions are
made locally within each cgroup, so when a cgroup is running low on
cold pages, its hot pages will get reclaimed - even when sibling
cgroups have plenty of cold cache eligible in the same reclaim run.
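
Schematically, the current behavior looks something like the sketch
below: the subtree walk establishes a common LRU order, but each aging
decision only consults that one cgroup's own lists. (This is a heavily
simplified illustration of the existing shrink_node()/shrink_list()
flow, not the literal code.)

/* sketch of the status quo: shared LRU order, but local aging decisions */
memcg = mem_cgroup_iter(root, NULL, &reclaim);
do {
        struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);

        /* the inactive:active balance only sees this cgroup's own lists... */
        if (inactive_list_is_low(lruvec, true, sc, true))
                shrink_active_list(nr_to_scan, lruvec, sc, LRU_ACTIVE_FILE);

        /* ...while the scan pressure is applied across the whole subtree */
        shrink_inactive_list(nr_to_scan, lruvec, sc, LRU_INACTIVE_FILE);
} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));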

For example:

[root@ham ~]# head -n1 /proc/meminfo
MemTotal: 1016336 kB

[root@ham ~]# ./reclaimtest2.sh
Establishing 50M active files in cgroup A...
Hot pages cached: 12800/12800 workingset-a
Linearly scanning through 18G of file data in cgroup B:
real 0m4.269s
user 0m0.051s
sys 0m4.182s
Hot pages cached: 134/12800 workingset-a

The streaming IO in B, which doesn't benefit from caching at all,
pushes out most of the workingset in A.

Solution

This series fixes the problem by elevating inactive/active balancing
decisions to the toplevel of the reclaim run. This is either a cgroup
that hit its limit, or straight-up global reclaim if there is physical
memory pressure. From there, it takes a recursive view of the cgroup
subtree to decide whether page deactivation is necessary.

In the test above, the VM will then recognize that cgroup B has plenty
of eligible cold cache, and that the hot pages in A can be spared:

[root@ham ~]# ./reclaimtest2.sh
Establishing 50M active files in cgroup A...
Hot pages cached: 12800/12800 workingset-a
Linearly scanning through 18G of file data in cgroup B:
real 0m4.244s
user 0m0.064s
sys 0m4.177s
Hot pages cached: 12800/12800 workingset-a

Implementation

Whether active pages can be deactivated or not is influenced by two
factors: the inactive list dropping below a minimum size relative to
the active list, and the occurrence of refaults.

After some cleanups and preparations, this patch series first moves
refault detection to the reclaim root, then enforces the minimum
inactive size based on a recursive view of the cgroup tree's LRUs.
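
In heavily simplified pseudocode, the direction is to make that call
once per reclaim run, at the lruvec of the reclaim root, and have the
per-cgroup loop merely apply it. This is a sketch of the intent only:
decide_deactivation() is a made-up placeholder for the logic the last
two patches add, and shrink_lruvec() is introduced later in the series.

/* sketch: one aging decision per reclaim run, taken at the root */
target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);

/* placeholder for the recursive size + refault checks added below */
sc->may_deactivate = decide_deactivation(target_lruvec, sc);

memcg = mem_cgroup_iter(root, NULL, &reclaim);
do {
        /* every cgroup in the subtree follows the same decision */
        shrink_lruvec(mem_cgroup_lruvec(memcg, pgdat), sc);
} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));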

History

Note that this actually never worked correctly in Linux cgroups. In
the past it worked for global reclaim and leaf limit reclaim only (we
used to have two physical LRU linkages per page), but it never worked
for intermediate limit reclaim over multiple leaf cgroups.

We're noticing this now because 1) we're putting everything into
cgroups for accounting, not just the things we want to control, and 2)
we're moving away from leaf limits that invoke reclaim on individual
cgroups, toward large tree reclaim that is triggered by high-level
limits or physical memory pressure and influenced by local protections
such as memory.low and memory.min.

Requirements

These changes are based on the fast recursive memcg stats merged in
5.2-rc1. The patches are against v5.2-rc2-mmots-2019-05-29-20-56-12
plus the page cache fix in https://lkml.org/lkml/2019/5/24/813.

include/linux/memcontrol.h | 37 +--
include/linux/mmzone.h | 30 +-
include/linux/swap.h | 2 +-
mm/memcontrol.c | 6 +-
mm/page_alloc.c | 2 +-
mm/vmscan.c | 667 ++++++++++++++++++++++---------------------
mm/workingset.c | 74 +++--
7 files changed, 437 insertions(+), 381 deletions(-)



2019-06-03 22:02:41

by Johannes Weiner

Subject: [PATCH 03/11] mm: vmscan: simplify lruvec_lru_size()

This function currently takes the node or lruvec size and subtracts
the zones that are excluded by the classzone index of the
allocation. It uses four different types of counters to do this.

Just add up the eligible zones.
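
For readability - the archive has stripped the leading whitespace from
the diff - the function after this patch reads roughly as follows
(reconstructed from the hunk below, indentation approximate):

unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx)
{
        unsigned long size = 0;
        int zid;

        /* sum up only the zones eligible for this allocation's classzone */
        for (zid = 0; zid <= zone_idx; zid++) {
                struct zone *zone = &lruvec_pgdat(lruvec)->node_zones[zid];

                if (!managed_zone(zone))
                        continue;

                if (!mem_cgroup_disabled())
                        size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
                else
                        size += zone_page_state(zone, NR_ZONE_LRU_BASE + lru);
        }
        return size;
}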

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/vmscan.c | 19 +++++--------------
1 file changed, 5 insertions(+), 14 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 853be16ee5e2..69c4c82a9b5a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -342,30 +342,21 @@ unsigned long zone_reclaimable_pages(struct zone *zone)
*/
unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx)
{
- unsigned long lru_size;
+ unsigned long size = 0;
int zid;

- if (!mem_cgroup_disabled())
- lru_size = lruvec_page_state_local(lruvec, NR_LRU_BASE + lru);
- else
- lru_size = node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
-
- for (zid = zone_idx + 1; zid < MAX_NR_ZONES; zid++) {
+ for (zid = 0; zid <= zone_idx; zid++) {
struct zone *zone = &lruvec_pgdat(lruvec)->node_zones[zid];
- unsigned long size;

if (!managed_zone(zone))
continue;

if (!mem_cgroup_disabled())
- size = mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
+ size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
else
- size = zone_page_state(&lruvec_pgdat(lruvec)->node_zones[zid],
- NR_ZONE_LRU_BASE + lru);
- lru_size -= min(size, lru_size);
+ size += zone_page_state(zone, NR_ZONE_LRU_BASE + lru);
}
-
- return lru_size;
+ return size;

}

--
2.21.0

2019-06-03 22:03:02

by Johannes Weiner

Subject: [PATCH 07/11] mm: vmscan: split shrink_node() into node part and memcgs part

This function is getting long and unwieldy. The new shrink_node()
handles the generic (node) reclaim aspects:
- global vmpressure notifications
- writeback and congestion throttling
- reclaim/compaction management
- kswapd giving up on unreclaimable nodes

It then calls shrink_node_memcgs() which handles cgroup specifics:
- the cgroup tree traversal
- memory.low considerations
- per-cgroup slab shrinking callbacks
- per-cgroup vmpressure notifications
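
Boiled down, the hot part of the split looks like this (excerpted from
the diff below; the node-level accounting, throttling and retry logic
that follows the call is unchanged):

again:
        memset(&sc->nr, 0, sizeof(sc->nr));

        nr_reclaimed = sc->nr_reclaimed;
        nr_scanned = sc->nr_scanned;

        /* cgroup tree walk, memory.low, slab shrinkers, per-memcg vmpressure */
        shrink_node_memcgs(pgdat, sc);

        /* ...node-level: reclaim_state, subtree vmpressure, congestion, retry */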

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/vmscan.c | 29 ++++++++++++++++++-----------
1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b85111474ee2..ee79b39d0538 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2665,24 +2665,15 @@ static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
(memcg && memcg_congested(pgdat, memcg));
}

-static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
+static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
{
- struct reclaim_state *reclaim_state = current->reclaim_state;
struct mem_cgroup *root = sc->target_mem_cgroup;
struct mem_cgroup_reclaim_cookie reclaim = {
.pgdat = pgdat,
.priority = sc->priority,
};
- unsigned long nr_reclaimed, nr_scanned;
- bool reclaimable = false;
struct mem_cgroup *memcg;

-again:
- memset(&sc->nr, 0, sizeof(sc->nr));
-
- nr_reclaimed = sc->nr_reclaimed;
- nr_scanned = sc->nr_scanned;
-
memcg = mem_cgroup_iter(root, NULL, &reclaim);
do {
struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
@@ -2750,6 +2741,22 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
break;
}
} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
+}
+
+static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
+{
+ struct reclaim_state *reclaim_state = current->reclaim_state;
+ struct mem_cgroup *root = sc->target_mem_cgroup;
+ unsigned long nr_reclaimed, nr_scanned;
+ bool reclaimable = false;
+
+again:
+ memset(&sc->nr, 0, sizeof(sc->nr));
+
+ nr_reclaimed = sc->nr_reclaimed;
+ nr_scanned = sc->nr_scanned;
+
+ shrink_node_memcgs(pgdat, sc);

if (reclaim_state) {
sc->nr_reclaimed += reclaim_state->reclaimed_slab;
@@ -2757,7 +2764,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
}

/* Record the subtree's reclaim efficiency */
- vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
+ vmpressure(sc->gfp_mask, root, true,
sc->nr_scanned - nr_scanned,
sc->nr_reclaimed - nr_reclaimed);

--
2.21.0

2019-06-03 22:04:15

by Johannes Weiner

Subject: [PATCH 04/11] mm: vmscan: naming fixes: cgroup_reclaim() and writeback_working()

Seven years after introducing the global_reclaim() function, I still
have to do a double take when reading a callsite. I don't know how others
do it. This is a terrible name.

Invert the meaning and rename it to cgroup_reclaim().

[ After all, "global reclaim" is just regular reclaim invoked from the
page allocator. It's reclaim on behalf of a cgroup limit that is a
special case of reclaim, and should be explicit - not the reverse. ]

sane_reclaim() isn't very descriptive either: it tests whether we can
use the regular writeback throttling - available during regular page
reclaim or cgroup2 limit reclaim - or need to use the broken
wait_on_page_writeback() method. Rename it to writeback_working().
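
After the rename, the CONFIG_MEMCG versions of the two helpers end up
looking like this (pieced together from the diff below):

static bool cgroup_reclaim(struct scan_control *sc)
{
        return sc->target_mem_cgroup;
}

/* writeback_working - is the usual dirty throttling mechanism available? */
static bool writeback_working(struct scan_control *sc)
{
        if (!cgroup_reclaim(sc))
                return true;
#ifdef CONFIG_CGROUP_WRITEBACK
        if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
                return true;
#endif
        return false;
}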

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/vmscan.c | 38 ++++++++++++++++++--------------------
1 file changed, 18 insertions(+), 20 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 69c4c82a9b5a..afd5e2432a8e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -239,13 +239,13 @@ static void unregister_memcg_shrinker(struct shrinker *shrinker)
#endif /* CONFIG_MEMCG_KMEM */

#ifdef CONFIG_MEMCG
-static bool global_reclaim(struct scan_control *sc)
+static bool cgroup_reclaim(struct scan_control *sc)
{
- return !sc->target_mem_cgroup;
+ return sc->target_mem_cgroup;
}

/**
- * sane_reclaim - is the usual dirty throttling mechanism operational?
+ * writeback_working - is the usual dirty throttling mechanism available?
* @sc: scan_control in question
*
* The normal page dirty throttling mechanism in balance_dirty_pages() is
@@ -257,11 +257,9 @@ static bool global_reclaim(struct scan_control *sc)
* This function tests whether the vmscan currently in progress can assume
* that the normal dirty throttling mechanism is operational.
*/
-static bool sane_reclaim(struct scan_control *sc)
+static bool writeback_working(struct scan_control *sc)
{
- struct mem_cgroup *memcg = sc->target_mem_cgroup;
-
- if (!memcg)
+ if (!cgroup_reclaim(sc))
return true;
#ifdef CONFIG_CGROUP_WRITEBACK
if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
@@ -293,12 +291,12 @@ static bool memcg_congested(pg_data_t *pgdat,

}
#else
-static bool global_reclaim(struct scan_control *sc)
+static bool cgroup_reclaim(struct scan_control *sc)
{
- return true;
+ return false;
}

-static bool sane_reclaim(struct scan_control *sc)
+static bool writeback_working(struct scan_control *sc)
{
return true;
}
@@ -1211,7 +1209,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
goto activate_locked;

/* Case 2 above */
- } else if (sane_reclaim(sc) ||
+ } else if (writeback_working(sc) ||
!PageReclaim(page) || !may_enter_fs) {
/*
* This is slightly racy - end_page_writeback()
@@ -1806,7 +1804,7 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
if (current_is_kswapd())
return 0;

- if (!sane_reclaim(sc))
+ if (!writeback_working(sc))
return 0;

if (file) {
@@ -1957,7 +1955,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
reclaim_stat->recent_scanned[file] += nr_taken;

item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
- if (global_reclaim(sc))
+ if (!cgroup_reclaim(sc))
__count_vm_events(item, nr_scanned);
__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
spin_unlock_irq(&pgdat->lru_lock);
@@ -1971,7 +1969,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
spin_lock_irq(&pgdat->lru_lock);

item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
- if (global_reclaim(sc))
+ if (!cgroup_reclaim(sc))
__count_vm_events(item, nr_reclaimed);
__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
reclaim_stat->recent_rotated[0] += stat.nr_activate[0];
@@ -2239,7 +2237,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
* using the memory controller's swap limit feature would be
* too expensive.
*/
- if (!global_reclaim(sc) && !swappiness) {
+ if (cgroup_reclaim(sc) && !swappiness) {
scan_balance = SCAN_FILE;
goto out;
}
@@ -2263,7 +2261,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
* thrashing file LRU becomes infinitely more attractive than
* anon pages. Try to detect this based on file LRU size.
*/
- if (global_reclaim(sc)) {
+ if (!cgroup_reclaim(sc)) {
unsigned long pgdatfile;
unsigned long pgdatfree;
int z;
@@ -2494,7 +2492,7 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
* abort proportional reclaim if either the file or anon lru has already
* dropped to zero at the first pass.
*/
- scan_adjusted = (global_reclaim(sc) && !current_is_kswapd() &&
+ scan_adjusted = (!cgroup_reclaim(sc) && !current_is_kswapd() &&
sc->priority == DEF_PRIORITY);

blk_start_plug(&plug);
@@ -2816,7 +2814,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
* Legacy memcg will stall in page writeback so avoid forcibly
* stalling in wait_iff_congested().
*/
- if (!global_reclaim(sc) && sane_reclaim(sc) &&
+ if (cgroup_reclaim(sc) && writeback_working(sc) &&
sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
set_memcg_congestion(pgdat, root, true);

@@ -2911,7 +2909,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
* Take care memory controller reclaiming has small influence
* to global LRU.
*/
- if (global_reclaim(sc)) {
+ if (!cgroup_reclaim(sc)) {
if (!cpuset_zone_allowed(zone,
GFP_KERNEL | __GFP_HARDWALL))
continue;
@@ -3011,7 +3009,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
retry:
delayacct_freepages_start();

- if (global_reclaim(sc))
+ if (!cgroup_reclaim(sc))
__count_zid_vm_events(ALLOCSTALL, sc->reclaim_idx, 1);

do {
--
2.21.0

2019-06-03 22:08:29

by Johannes Weiner

Subject: [PATCH 01/11] mm: vmscan: move inactive_list_is_low() swap check to the caller

inactive_list_is_low() should be about one thing: checking the ratio
between the inactive and active lists. Kitchen-sink checks like the one
for swap space make the function hard to use and hard to modify from
its callsites. Luckily, most callers already have an understanding of
the swap situation, so it's easy to clean up.

get_scan_count() has its own, memcg-aware swap check, and doesn't even
get to the inactive_list_is_low() check on the anon list when there is
no swap space available.

shrink_list() is called on the results of get_scan_count(), so that
check is redundant too.

age_active_anon() has its own total_swap_pages check right before it
checks the list proportions.

The shrink_node_memcg() site is the only one that doesn't do its own
swap check. Add it there.

Then delete the swap check from inactive_list_is_low().
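
At the one call site this patch touches, the end state is a simple
guard in front of the ratio check - the hunk below, shown here with its
indentation restored:

        /*
         * Even if we did not try to evict anon pages at all, we want to
         * rebalance the anon lru active/inactive ratio.
         */
        if (total_swap_pages && inactive_list_is_low(lruvec, false, sc, true))
                shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
                                   sc, LRU_ACTIVE_ANON);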

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/vmscan.c | 9 +--------
1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 84dcb651d05c..f396424850aa 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2165,13 +2165,6 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
unsigned long refaults;
unsigned long gb;

- /*
- * If we don't have swap space, anonymous page deactivation
- * is pointless.
- */
- if (!file && !total_swap_pages)
- return false;
-
inactive = lruvec_lru_size(lruvec, inactive_lru, sc->reclaim_idx);
active = lruvec_lru_size(lruvec, active_lru, sc->reclaim_idx);

@@ -2592,7 +2585,7 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
* Even if we did not try to evict anon pages at all, we want to
* rebalance the anon lru active/inactive ratio.
*/
- if (inactive_list_is_low(lruvec, false, sc, true))
+ if (total_swap_pages && inactive_list_is_low(lruvec, false, sc, true))
shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
sc, LRU_ACTIVE_ANON);
}
--
2.21.0

2019-06-03 22:08:50

by Johannes Weiner

Subject: [PATCH 06/11] mm: vmscan: turn shrink_node_memcg() into shrink_lruvec()

An lruvec holds LRU pages owned by a certain NUMA node and cgroup.
Instead of awkwardly passing around a combination of a pgdat and a
memcg pointer, pass down the lruvec as soon as we can look it up.

Nested callers that need to access node or cgroup properties can look
them up if necessary, but there are only a few cases.
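
For callers, the pattern becomes: look the lruvec up once where node
and memcg are both known, then pass it down; a callee that still needs
the memcg can get it back with lruvec_memcg(). In short (matching the
conversions in the diff below):

        struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);

        shrink_lruvec(lruvec, sc);

        /* and inside, e.g. in get_scan_count(): */
        struct mem_cgroup *memcg = lruvec_memcg(lruvec);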

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/vmscan.c | 21 ++++++++++-----------
1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 304974481146..b85111474ee2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2210,9 +2210,10 @@ enum scan_balance {
* nr[0] = anon inactive pages to scan; nr[1] = anon active pages to scan
* nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
*/
-static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
- struct scan_control *sc, unsigned long *nr)
+static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
+ unsigned long *nr)
{
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
int swappiness = mem_cgroup_swappiness(memcg);
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
u64 fraction[2];
@@ -2460,13 +2461,8 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
}
}

-/*
- * This is a basic per-node page freer. Used by both kswapd and direct reclaim.
- */
-static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memcg,
- struct scan_control *sc)
+static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
{
- struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
unsigned long nr[NR_LRU_LISTS];
unsigned long targets[NR_LRU_LISTS];
unsigned long nr_to_scan;
@@ -2476,7 +2472,7 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
struct blk_plug plug;
bool scan_adjusted;

- get_scan_count(lruvec, memcg, sc, nr);
+ get_scan_count(lruvec, sc, nr);

/* Record the original scan target for proportional adjustments later */
memcpy(targets, nr, sizeof(nr));
@@ -2689,6 +2685,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)

memcg = mem_cgroup_iter(root, NULL, &reclaim);
do {
+ struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
unsigned long reclaimed;
unsigned long scanned;

@@ -2725,7 +2722,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)

reclaimed = sc->nr_reclaimed;
scanned = sc->nr_scanned;
- shrink_node_memcg(pgdat, memcg, sc);
+
+ shrink_lruvec(lruvec, sc);

if (sc->may_shrinkslab) {
shrink_slab(sc->gfp_mask, pgdat->node_id,
@@ -3243,6 +3241,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
pg_data_t *pgdat,
unsigned long *nr_scanned)
{
+ struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
struct scan_control sc = {
.nr_to_reclaim = SWAP_CLUSTER_MAX,
.target_mem_cgroup = memcg,
@@ -3268,7 +3267,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
* will pick up pages from other mem cgroup's as well. We hack
* the priority and make it zero.
*/
- shrink_node_memcg(pgdat, memcg, &sc);
+ shrink_lruvec(lruvec, &sc);

trace_mm_vmscan_memcg_softlimit_reclaim_end(
cgroup_ino(memcg->css.cgroup),
--
2.21.0

2019-06-03 22:09:10

by Johannes Weiner

Subject: [PATCH 10/11] mm: vmscan: detect file thrashing at the reclaim root

We use refault information to determine whether the cache workingset
is stable or transitioning, and dynamically adjust the inactive:active
file LRU ratio so as to maximize protection from one-off cache during
stable periods, and minimize IO during transitions.

With cgroups and their nested LRU lists, we currently don't do this
correctly. While recursive cgroup reclaim establishes a relative LRU
order among the pages of all involved cgroups, refaults only affect
the local LRU order in the cgroup in which they are occurring. As a
result, cache transitions can take longer in a cgrouped system as the
active pages of sibling cgroups aren't challenged when they should be.

[ Right now, this is somewhat theoretical, because the siblings, under
continued regular reclaim pressure, should eventually run out of
inactive pages - and since inactive:active *size* balancing is also
done on a cgroup-local level, we will challenge the active pages
eventually in most cases. But the next patch will move that relative
size enforcement to the reclaim root as well, and then this patch
here will be necessary to propagate refault pressure to siblings. ]

This patch moves refault detection to the root of reclaim. Instead of
remembering the cgroup owner of an evicted page, remember the cgroup
that caused the reclaim to happen. When refaults later occur, they'll
correctly influence the cross-cgroup LRU order that reclaim follows.

I.e. if global reclaim kicked out pages in some subgroup A/B/C, the
refault of those pages will challenge the global LRU order, and not
just the local order down inside C.
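
Condensed from the workingset.c hunks below, the eviction side then
does two things: age the hierarchy that owns the page, but stamp the
shadow entry with the clock of the lruvec that is driving the reclaim.
(A sketch with the debugging asserts omitted.)

void *workingset_eviction(struct mem_cgroup *target_memcg, struct page *page)
{
        struct pglist_data *pgdat = page_pgdat(page);
        struct lruvec *lruvec;

        /* age every lruvec from the owning cgroup up to the root */
        advance_inactive_age(page_memcg(page), pgdat);

        /* but record the eviction against the reclaiming cgroup's clock */
        lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
        return pack_shadow(mem_cgroup_id(lruvec_memcg(lruvec)), pgdat,
                           atomic_long_read(&lruvec->inactive_age),
                           PageWorkingset(page));
}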

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 5 +++
include/linux/swap.h | 2 +-
mm/vmscan.c | 32 ++++++++---------
mm/workingset.c | 72 +++++++++++++++++++++++++++++---------
4 files changed, 77 insertions(+), 34 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d33e09c51acc..70dcb6ab4e68 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -899,6 +899,11 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
return &pgdat->__lruvec;
}

+static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
+{
+ return NULL;
+}
+
static inline bool mm_match_cgroup(struct mm_struct *mm,
struct mem_cgroup *memcg)
{
diff --git a/include/linux/swap.h b/include/linux/swap.h
index de2c67a33b7e..c65c11158a6c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -307,7 +307,7 @@ struct vma_swap_readahead {
};

/* linux/mm/workingset.c */
-void *workingset_eviction(struct page *page);
+void *workingset_eviction(struct mem_cgroup *target_memcg, struct page *page);
void workingset_refault(struct page *page, void *shadow);
void workingset_activation(struct page *page);

diff --git a/mm/vmscan.c b/mm/vmscan.c
index cabf94dfa92d..6b7bd5708c73 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -834,7 +834,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
* gets returned with a refcount of 0.
*/
static int __remove_mapping(struct address_space *mapping, struct page *page,
- bool reclaimed)
+ bool reclaimed, struct mem_cgroup *target_memcg)
{
unsigned long flags;
int refcount;
@@ -909,7 +909,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
*/
if (reclaimed && page_is_file_cache(page) &&
!mapping_exiting(mapping) && !dax_mapping(mapping))
- shadow = workingset_eviction(page);
+ shadow = workingset_eviction(target_memcg, page);
__delete_from_page_cache(page, shadow);
xa_unlock_irqrestore(&mapping->i_pages, flags);

@@ -932,7 +932,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
*/
int remove_mapping(struct address_space *mapping, struct page *page)
{
- if (__remove_mapping(mapping, page, false)) {
+ if (__remove_mapping(mapping, page, false, NULL)) {
/*
* Unfreezing the refcount with 1 rather than 2 effectively
* drops the pagecache ref for us without requiring another
@@ -1410,7 +1410,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,

count_vm_event(PGLAZYFREED);
count_memcg_page_event(page, PGLAZYFREED);
- } else if (!mapping || !__remove_mapping(mapping, page, true))
+ } else if (!mapping || !__remove_mapping(mapping, page, true,
+ sc->target_mem_cgroup))
goto keep_locked;

unlock_page(page);
@@ -2119,6 +2120,7 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
enum lru_list inactive_lru = file * LRU_FILE;
unsigned long inactive, active;
unsigned long inactive_ratio;
+ struct lruvec *target_lruvec;
unsigned long refaults;
unsigned long gb;

@@ -2130,8 +2132,9 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
* is being established. Disable active list protection to get
* rid of the stale workingset quickly.
*/
- refaults = lruvec_page_state_local(lruvec, WORKINGSET_ACTIVATE);
- if (file && actual_reclaim && lruvec->refaults != refaults) {
+ target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
+ refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE);
+ if (file && actual_reclaim && target_lruvec->refaults != refaults) {
inactive_ratio = 0;
} else {
gb = (inactive + active) >> (30 - PAGE_SHIFT);
@@ -2937,19 +2940,14 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
sc->gfp_mask = orig_mask;
}

-static void snapshot_refaults(struct mem_cgroup *root_memcg, pg_data_t *pgdat)
+static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat)
{
- struct mem_cgroup *memcg;
-
- memcg = mem_cgroup_iter(root_memcg, NULL, NULL);
- do {
- unsigned long refaults;
- struct lruvec *lruvec;
+ struct lruvec *target_lruvec;
+ unsigned long refaults;

- lruvec = mem_cgroup_lruvec(memcg, pgdat);
- refaults = lruvec_page_state_local(lruvec, WORKINGSET_ACTIVATE);
- lruvec->refaults = refaults;
- } while ((memcg = mem_cgroup_iter(root_memcg, memcg, NULL)));
+ target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
+ refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE);
+ target_lruvec->refaults = refaults;
}

/*
diff --git a/mm/workingset.c b/mm/workingset.c
index 2aaa70bea99c..2ae67a6d770c 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -213,28 +213,53 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
*workingsetp = workingset;
}

+static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat)
+{
+ /*
+ * Reclaiming a cgroup means reclaiming all its children in a
+ * round-robin fashion. That means that each cgroup has an LRU
+ * order that is composed of the LRU orders of its child
+ * cgroups; and every page has an LRU position not just in the
+ * cgroup that owns it, but in all of that group's ancestors.
+ *
+ * So when the physical inactive list of a leaf cgroup ages,
+ * the virtual inactive lists of all its parents, including
+ * the root cgroup's, age as well.
+ */
+ do {
+ struct lruvec *lruvec;
+
+ lruvec = mem_cgroup_lruvec(memcg, pgdat);
+ atomic_long_inc(&lruvec->inactive_age);
+ } while (memcg && (memcg = parent_mem_cgroup(memcg)));
+}
+
/**
* workingset_eviction - note the eviction of a page from memory
+ * @target_memcg: the cgroup that is causing the reclaim
* @page: the page being evicted
*
* Returns a shadow entry to be stored in @page->mapping->i_pages in place
* of the evicted @page so that a later refault can be detected.
*/
-void *workingset_eviction(struct page *page)
+void *workingset_eviction(struct mem_cgroup *target_memcg, struct page *page)
{
struct pglist_data *pgdat = page_pgdat(page);
- struct mem_cgroup *memcg = page_memcg(page);
- int memcgid = mem_cgroup_id(memcg);
unsigned long eviction;
struct lruvec *lruvec;
+ int memcgid;

/* Page is fully exclusive and pins page->mem_cgroup */
VM_BUG_ON_PAGE(PageLRU(page), page);
VM_BUG_ON_PAGE(page_count(page), page);
VM_BUG_ON_PAGE(!PageLocked(page), page);

- lruvec = mem_cgroup_lruvec(memcg, pgdat);
- eviction = atomic_long_inc_return(&lruvec->inactive_age);
+ advance_inactive_age(page_memcg(page), pgdat);
+
+ /* XXX: target_memcg can be NULL, go through lruvec */
+ lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
+ memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
+ eviction = atomic_long_read(&lruvec->inactive_age);
return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
}

@@ -244,10 +269,13 @@ void *workingset_eviction(struct page *page)
* @shadow: shadow entry of the evicted page
*
* Calculates and evaluates the refault distance of the previously
- * evicted page in the context of the node it was allocated in.
+ * evicted page in the context of the node and the memcg whose memory
+ * pressure caused the eviction.
*/
void workingset_refault(struct page *page, void *shadow)
{
+ struct mem_cgroup *eviction_memcg;
+ struct lruvec *eviction_lruvec;
unsigned long refault_distance;
struct pglist_data *pgdat;
unsigned long active_file;
@@ -277,12 +305,12 @@ void workingset_refault(struct page *page, void *shadow)
* would be better if the root_mem_cgroup existed in all
* configurations instead.
*/
- memcg = mem_cgroup_from_id(memcgid);
- if (!mem_cgroup_disabled() && !memcg)
+ eviction_memcg = mem_cgroup_from_id(memcgid);
+ if (!mem_cgroup_disabled() && !eviction_memcg)
goto out;
- lruvec = mem_cgroup_lruvec(memcg, pgdat);
- refault = atomic_long_read(&lruvec->inactive_age);
- active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES);
+ eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
+ refault = atomic_long_read(&eviction_lruvec->inactive_age);
+ active_file = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);

/*
* Calculate the refault distance
@@ -302,6 +330,17 @@ void workingset_refault(struct page *page, void *shadow)
*/
refault_distance = (refault - eviction) & EVICTION_MASK;

+ /*
+ * The activation decision for this page is made at the level
+ * where the eviction occurred, as that is where the LRU order
+ * during page reclaim is being determined.
+ *
+ * However, the cgroup that will own the page is the one that
+ * is actually experiencing the refault event.
+ */
+ memcg = get_mem_cgroup_from_mm(current->mm);
+ lruvec = mem_cgroup_lruvec(memcg, pgdat);
+
inc_lruvec_state(lruvec, WORKINGSET_REFAULT);

/*
@@ -310,10 +349,10 @@ void workingset_refault(struct page *page, void *shadow)
* the memory was available to the page cache.
*/
if (refault_distance > active_file)
- goto out;
+ goto out_memcg;

SetPageActive(page);
- atomic_long_inc(&lruvec->inactive_age);
+ advance_inactive_age(memcg, pgdat);
inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);

/* Page was active prior to eviction */
@@ -321,6 +360,9 @@ void workingset_refault(struct page *page, void *shadow)
SetPageWorkingset(page);
inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
}
+
+out_memcg:
+ mem_cgroup_put(memcg);
out:
rcu_read_unlock();
}
@@ -332,7 +374,6 @@ void workingset_refault(struct page *page, void *shadow)
void workingset_activation(struct page *page)
{
struct mem_cgroup *memcg;
- struct lruvec *lruvec;

rcu_read_lock();
/*
@@ -345,8 +386,7 @@ void workingset_activation(struct page *page)
memcg = page_memcg_rcu(page);
if (!mem_cgroup_disabled() && !memcg)
goto out;
- lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
- atomic_long_inc(&lruvec->inactive_age);
+ advance_inactive_age(memcg, page_pgdat(page));
out:
rcu_read_unlock();
}
--
2.21.0

2019-06-03 22:10:15

by Johannes Weiner

Subject: [PATCH 05/11] mm: vmscan: replace shrink_node() loop with a retry jump

Most of the function body is inside a loop, which imposes an
additional indentation and scoping level that makes the code a bit
hard to follow and modify.

The looping only happens in case of reclaim-compaction, which isn't
the common case. So rather than adding yet another function level to
the reclaim path and having every reclaim invocation go through a
level that only exists for one specific corner case, use a retry goto.
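
The transformation is purely structural; as a minimal before/after
sketch of the pattern:

/* before: the whole body nested in a loop */
do {
        /* ~130 lines of reclaim work */
} while (should_continue_reclaim(pgdat, ..., sc));

/* after: flat body, with a retry jump for the reclaim/compaction case */
again:
        /* the same work, one indentation level less */
        if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
                                    sc->nr_scanned - nr_scanned, sc))
                goto again;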

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/vmscan.c | 266 ++++++++++++++++++++++++++--------------------------
1 file changed, 133 insertions(+), 133 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index afd5e2432a8e..304974481146 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2672,164 +2672,164 @@ static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
{
struct reclaim_state *reclaim_state = current->reclaim_state;
+ struct mem_cgroup *root = sc->target_mem_cgroup;
+ struct mem_cgroup_reclaim_cookie reclaim = {
+ .pgdat = pgdat,
+ .priority = sc->priority,
+ };
unsigned long nr_reclaimed, nr_scanned;
bool reclaimable = false;
+ struct mem_cgroup *memcg;

- do {
- struct mem_cgroup *root = sc->target_mem_cgroup;
- struct mem_cgroup_reclaim_cookie reclaim = {
- .pgdat = pgdat,
- .priority = sc->priority,
- };
- struct mem_cgroup *memcg;
-
- memset(&sc->nr, 0, sizeof(sc->nr));
+again:
+ memset(&sc->nr, 0, sizeof(sc->nr));

- nr_reclaimed = sc->nr_reclaimed;
- nr_scanned = sc->nr_scanned;
+ nr_reclaimed = sc->nr_reclaimed;
+ nr_scanned = sc->nr_scanned;

- memcg = mem_cgroup_iter(root, NULL, &reclaim);
- do {
- unsigned long reclaimed;
- unsigned long scanned;
+ memcg = mem_cgroup_iter(root, NULL, &reclaim);
+ do {
+ unsigned long reclaimed;
+ unsigned long scanned;

- switch (mem_cgroup_protected(root, memcg)) {
- case MEMCG_PROT_MIN:
- /*
- * Hard protection.
- * If there is no reclaimable memory, OOM.
- */
+ switch (mem_cgroup_protected(root, memcg)) {
+ case MEMCG_PROT_MIN:
+ /*
+ * Hard protection.
+ * If there is no reclaimable memory, OOM.
+ */
+ continue;
+ case MEMCG_PROT_LOW:
+ /*
+ * Soft protection.
+ * Respect the protection only as long as
+ * there is an unprotected supply
+ * of reclaimable memory from other cgroups.
+ */
+ if (!sc->memcg_low_reclaim) {
+ sc->memcg_low_skipped = 1;
continue;
- case MEMCG_PROT_LOW:
- /*
- * Soft protection.
- * Respect the protection only as long as
- * there is an unprotected supply
- * of reclaimable memory from other cgroups.
- */
- if (!sc->memcg_low_reclaim) {
- sc->memcg_low_skipped = 1;
- continue;
- }
- memcg_memory_event(memcg, MEMCG_LOW);
- break;
- case MEMCG_PROT_NONE:
- /*
- * All protection thresholds breached. We may
- * still choose to vary the scan pressure
- * applied based on by how much the cgroup in
- * question has exceeded its protection
- * thresholds (see get_scan_count).
- */
- break;
}
-
- reclaimed = sc->nr_reclaimed;
- scanned = sc->nr_scanned;
- shrink_node_memcg(pgdat, memcg, sc);
-
- if (sc->may_shrinkslab) {
- shrink_slab(sc->gfp_mask, pgdat->node_id,
- memcg, sc->priority);
- }
-
- /* Record the group's reclaim efficiency */
- vmpressure(sc->gfp_mask, memcg, false,
- sc->nr_scanned - scanned,
- sc->nr_reclaimed - reclaimed);
-
+ memcg_memory_event(memcg, MEMCG_LOW);
+ break;
+ case MEMCG_PROT_NONE:
/*
- * Kswapd have to scan all memory cgroups to fulfill
- * the overall scan target for the node.
- *
- * Limit reclaim, on the other hand, only cares about
- * nr_to_reclaim pages to be reclaimed and it will
- * retry with decreasing priority if one round over the
- * whole hierarchy is not sufficient.
+ * All protection thresholds breached. We may
+ * still choose to vary the scan pressure
+ * applied based on by how much the cgroup in
+ * question has exceeded its protection
+ * thresholds (see get_scan_count).
*/
- if (!current_is_kswapd() &&
- sc->nr_reclaimed >= sc->nr_to_reclaim) {
- mem_cgroup_iter_break(root, memcg);
- break;
- }
- } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
+ break;
+ }
+
+ reclaimed = sc->nr_reclaimed;
+ scanned = sc->nr_scanned;
+ shrink_node_memcg(pgdat, memcg, sc);

- if (reclaim_state) {
- sc->nr_reclaimed += reclaim_state->reclaimed_slab;
- reclaim_state->reclaimed_slab = 0;
+ if (sc->may_shrinkslab) {
+ shrink_slab(sc->gfp_mask, pgdat->node_id,
+ memcg, sc->priority);
}

- /* Record the subtree's reclaim efficiency */
- vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
- sc->nr_scanned - nr_scanned,
- sc->nr_reclaimed - nr_reclaimed);
+ /* Record the group's reclaim efficiency */
+ vmpressure(sc->gfp_mask, memcg, false,
+ sc->nr_scanned - scanned,
+ sc->nr_reclaimed - reclaimed);

- if (sc->nr_reclaimed - nr_reclaimed)
- reclaimable = true;
+ /*
+ * Kswapd have to scan all memory cgroups to fulfill
+ * the overall scan target for the node.
+ *
+ * Limit reclaim, on the other hand, only cares about
+ * nr_to_reclaim pages to be reclaimed and it will
+ * retry with decreasing priority if one round over the
+ * whole hierarchy is not sufficient.
+ */
+ if (!current_is_kswapd() &&
+ sc->nr_reclaimed >= sc->nr_to_reclaim) {
+ mem_cgroup_iter_break(root, memcg);
+ break;
+ }
+ } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));

- if (current_is_kswapd()) {
- /*
- * If reclaim is isolating dirty pages under writeback,
- * it implies that the long-lived page allocation rate
- * is exceeding the page laundering rate. Either the
- * global limits are not being effective at throttling
- * processes due to the page distribution throughout
- * zones or there is heavy usage of a slow backing
- * device. The only option is to throttle from reclaim
- * context which is not ideal as there is no guarantee
- * the dirtying process is throttled in the same way
- * balance_dirty_pages() manages.
- *
- * Once a node is flagged PGDAT_WRITEBACK, kswapd will
- * count the number of pages under pages flagged for
- * immediate reclaim and stall if any are encountered
- * in the nr_immediate check below.
- */
- if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken)
- set_bit(PGDAT_WRITEBACK, &pgdat->flags);
+ if (reclaim_state) {
+ sc->nr_reclaimed += reclaim_state->reclaimed_slab;
+ reclaim_state->reclaimed_slab = 0;
+ }

- /*
- * Tag a node as congested if all the dirty pages
- * scanned were backed by a congested BDI and
- * wait_iff_congested will stall.
- */
- if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
- set_bit(PGDAT_CONGESTED, &pgdat->flags);
+ /* Record the subtree's reclaim efficiency */
+ vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
+ sc->nr_scanned - nr_scanned,
+ sc->nr_reclaimed - nr_reclaimed);

- /* Allow kswapd to start writing pages during reclaim.*/
- if (sc->nr.unqueued_dirty == sc->nr.file_taken)
- set_bit(PGDAT_DIRTY, &pgdat->flags);
+ if (sc->nr_reclaimed - nr_reclaimed)
+ reclaimable = true;

- /*
- * If kswapd scans pages marked marked for immediate
- * reclaim and under writeback (nr_immediate), it
- * implies that pages are cycling through the LRU
- * faster than they are written so also forcibly stall.
- */
- if (sc->nr.immediate)
- congestion_wait(BLK_RW_ASYNC, HZ/10);
- }
+ if (current_is_kswapd()) {
+ /*
+ * If reclaim is isolating dirty pages under writeback,
+ * it implies that the long-lived page allocation rate
+ * is exceeding the page laundering rate. Either the
+ * global limits are not being effective at throttling
+ * processes due to the page distribution throughout
+ * zones or there is heavy usage of a slow backing
+ * device. The only option is to throttle from reclaim
+ * context which is not ideal as there is no guarantee
+ * the dirtying process is throttled in the same way
+ * balance_dirty_pages() manages.
+ *
+ * Once a node is flagged PGDAT_WRITEBACK, kswapd will
+ * count the number of pages under pages flagged for
+ * immediate reclaim and stall if any are encountered
+ * in the nr_immediate check below.
+ */
+ if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken)
+ set_bit(PGDAT_WRITEBACK, &pgdat->flags);

/*
- * Legacy memcg will stall in page writeback so avoid forcibly
- * stalling in wait_iff_congested().
+ * Tag a node as congested if all the dirty pages
+ * scanned were backed by a congested BDI and
+ * wait_iff_congested will stall.
*/
- if (cgroup_reclaim(sc) && writeback_working(sc) &&
- sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
- set_memcg_congestion(pgdat, root, true);
+ if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
+ set_bit(PGDAT_CONGESTED, &pgdat->flags);
+
+ /* Allow kswapd to start writing pages during reclaim.*/
+ if (sc->nr.unqueued_dirty == sc->nr.file_taken)
+ set_bit(PGDAT_DIRTY, &pgdat->flags);

/*
- * Stall direct reclaim for IO completions if underlying BDIs
- * and node is congested. Allow kswapd to continue until it
- * starts encountering unqueued dirty pages or cycling through
- * the LRU too quickly.
+ * If kswapd scans pages marked marked for immediate
+ * reclaim and under writeback (nr_immediate), it
+ * implies that pages are cycling through the LRU
+ * faster than they are written so also forcibly stall.
*/
- if (!sc->hibernation_mode && !current_is_kswapd() &&
- current_may_throttle() && pgdat_memcg_congested(pgdat, root))
- wait_iff_congested(BLK_RW_ASYNC, HZ/10);
+ if (sc->nr.immediate)
+ congestion_wait(BLK_RW_ASYNC, HZ/10);
+ }
+
+ /*
+ * Legacy memcg will stall in page writeback so avoid forcibly
+ * stalling in wait_iff_congested().
+ */
+ if (cgroup_reclaim(sc) && writeback_working(sc) &&
+ sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
+ set_memcg_congestion(pgdat, root, true);
+
+ /*
+ * Stall direct reclaim for IO completions if underlying BDIs
+ * and node is congested. Allow kswapd to continue until it
+ * starts encountering unqueued dirty pages or cycling through
+ * the LRU too quickly.
+ */
+ if (!sc->hibernation_mode && !current_is_kswapd() &&
+ current_may_throttle() && pgdat_memcg_congested(pgdat, root))
+ wait_iff_congested(BLK_RW_ASYNC, HZ/10);

- } while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
- sc->nr_scanned - nr_scanned, sc));
+ if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
+ sc->nr_scanned - nr_scanned, sc))
+ goto again;

/*
* Kswapd gives up on balancing particular nodes after too
--
2.21.0

2019-06-03 22:10:23

by Johannes Weiner

Subject: [PATCH 09/11] mm: vmscan: move file exhaustion detection to the node level

When file pages are lower than the watermark on a node, we try to
force-scan anonymous pages to counteract the balancing algorithm's
preference for new file pages when those are likely thrashing. This is
a node-level decision, but it's currently made each time we look at an
lruvec. This is unnecessarily expensive and also a layering violation
that makes the code harder to understand.

Clean this up by making the check once per node and setting a flag in
the scan_control.
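
Condensed, the producer/consumer split after this patch looks like this
(pieced together from the two hunks below; the watermark summing is
elided):

/* shrink_node(): computed once per node per reclaim iteration */
if (!cgroup_reclaim(sc)) {
        /* ... sum NR_FREE_PAGES, the file LRUs and zone high watermarks ... */
        sc->file_is_tiny = file + free <= total_high_wmark;
}

/* get_scan_count(): each lruvec merely consumes the flag */
if (sc->file_is_tiny &&
    !inactive_list_is_low(lruvec, false, sc, false) &&
    lruvec_lru_size(lruvec, LRU_INACTIVE_ANON,
                    sc->reclaim_idx) >> sc->priority) {
        scan_balance = SCAN_ANON;
        goto out;
}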

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/vmscan.c | 80 ++++++++++++++++++++++++++++-------------------------
1 file changed, 42 insertions(+), 38 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index eb535c572733..cabf94dfa92d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -104,6 +104,9 @@ struct scan_control {
/* One of the zones is ready for compaction */
unsigned int compaction_ready:1;

+ /* The file pages on the current node are dangerously low */
+ unsigned int file_is_tiny:1;
+
/* Allocation order */
s8 order;

@@ -2219,45 +2222,16 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
}

/*
- * Prevent the reclaimer from falling into the cache trap: as
- * cache pages start out inactive, every cache fault will tip
- * the scan balance towards the file LRU. And as the file LRU
- * shrinks, so does the window for rotation from references.
- * This means we have a runaway feedback loop where a tiny
- * thrashing file LRU becomes infinitely more attractive than
- * anon pages. Try to detect this based on file LRU size.
+ * If the system is almost out of file pages, force-scan anon.
+ * But only if there are enough inactive anonymous pages on
+ * the LRU. Otherwise, the small LRU gets thrashed.
*/
- if (!cgroup_reclaim(sc)) {
- unsigned long pgdatfile;
- unsigned long pgdatfree;
- int z;
- unsigned long total_high_wmark = 0;
-
- pgdatfree = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
- pgdatfile = node_page_state(pgdat, NR_ACTIVE_FILE) +
- node_page_state(pgdat, NR_INACTIVE_FILE);
-
- for (z = 0; z < MAX_NR_ZONES; z++) {
- struct zone *zone = &pgdat->node_zones[z];
- if (!managed_zone(zone))
- continue;
-
- total_high_wmark += high_wmark_pages(zone);
- }
-
- if (unlikely(pgdatfile + pgdatfree <= total_high_wmark)) {
- /*
- * Force SCAN_ANON if there are enough inactive
- * anonymous pages on the LRU in eligible zones.
- * Otherwise, the small LRU gets thrashed.
- */
- if (!inactive_list_is_low(lruvec, false, sc, false) &&
- lruvec_lru_size(lruvec, LRU_INACTIVE_ANON, sc->reclaim_idx)
- >> sc->priority) {
- scan_balance = SCAN_ANON;
- goto out;
- }
- }
+ if (sc->file_is_tiny &&
+ !inactive_list_is_low(lruvec, false, sc, false) &&
+ lruvec_lru_size(lruvec, LRU_INACTIVE_ANON,
+ sc->reclaim_idx) >> sc->priority) {
+ scan_balance = SCAN_ANON;
+ goto out;
}

/*
@@ -2718,6 +2692,36 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
nr_reclaimed = sc->nr_reclaimed;
nr_scanned = sc->nr_scanned;

+ /*
+ * Prevent the reclaimer from falling into the cache trap: as
+ * cache pages start out inactive, every cache fault will tip
+ * the scan balance towards the file LRU. And as the file LRU
+ * shrinks, so does the window for rotation from references.
+ * This means we have a runaway feedback loop where a tiny
+ * thrashing file LRU becomes infinitely more attractive than
+ * anon pages. Try to detect this based on file LRU size.
+ */
+ if (!cgroup_reclaim(sc)) {
+ unsigned long file;
+ unsigned long free;
+ int z;
+ unsigned long total_high_wmark = 0;
+
+ free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
+ file = node_page_state(pgdat, NR_ACTIVE_FILE) +
+ node_page_state(pgdat, NR_INACTIVE_FILE);
+
+ for (z = 0; z < MAX_NR_ZONES; z++) {
+ struct zone *zone = &pgdat->node_zones[z];
+ if (!managed_zone(zone))
+ continue;
+
+ total_high_wmark += high_wmark_pages(zone);
+ }
+
+ sc->file_is_tiny = file + free <= total_high_wmark;
+ }
+
shrink_node_memcgs(pgdat, sc);

if (reclaim_state) {
--
2.21.0

2019-06-03 22:12:15

by Johannes Weiner

Subject: [PATCH 08/11] mm: vmscan: harmonize writeback congestion tracking for nodes & memcgs

The current writeback congestion tracking has separate flags for
kswapd reclaim (node level) and cgroup limit reclaim (memcg-node
level). This is unnecessarily complicated: the lruvec is an existing
abstraction layer for that node-memcg intersection.

Introduce lruvec->flags and LRUVEC_CONGESTED. Then track that at the
reclaim root level, which is either the NUMA node for global reclaim,
or the cgroup-node intersection for cgroup reclaim.
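
Pieced together from the diff below, the new scheme keeps one bit per
reclaim-root lruvec (the conditions around setting it are simplified
here):

enum lruvec_flags {
        LRUVEC_CONGESTED,       /* lruvec has many dirty pages
                                 * backed by a congested BDI
                                 */
};

/* at the end of a shrink_node() iteration: */
target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
        set_bit(LRUVEC_CONGESTED, &target_lruvec->flags);

/* before throttling direct reclaim: */
if (test_bit(LRUVEC_CONGESTED, &target_lruvec->flags))
        wait_iff_congested(BLK_RW_ASYNC, HZ/10);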

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 6 +--
include/linux/mmzone.h | 11 ++++--
mm/vmscan.c | 80 ++++++++++++--------------------------
3 files changed, 36 insertions(+), 61 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index fc32cfaebf32..d33e09c51acc 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -144,9 +144,6 @@ struct mem_cgroup_per_node {
unsigned long usage_in_excess;/* Set to the value by which */
/* the soft limit is exceeded*/
bool on_tree;
- bool congested; /* memcg has many dirty pages */
- /* backed by a congested BDI */
-
struct mem_cgroup *memcg; /* Back pointer, we cannot */
/* use container_of */
};
@@ -401,6 +398,9 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
goto out;
}

+ if (!memcg)
+ memcg = root_mem_cgroup;
+
mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
lruvec = &mz->lruvec;
out:
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 95d63a395f40..b3ab64cf5619 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -293,6 +293,12 @@ struct zone_reclaim_stat {
unsigned long recent_scanned[2];
};

+enum lruvec_flags {
+ LRUVEC_CONGESTED, /* lruvec has many dirty pages
+ * backed by a congested BDI
+ */
+};
+
struct lruvec {
struct list_head lists[NR_LRU_LISTS];
struct zone_reclaim_stat reclaim_stat;
@@ -300,6 +306,8 @@ struct lruvec {
atomic_long_t inactive_age;
/* Refaults at the time of last reclaim cycle */
unsigned long refaults;
+ /* Various lruvec state flags (enum lruvec_flags) */
+ unsigned long flags;
#ifdef CONFIG_MEMCG
struct pglist_data *pgdat;
#endif
@@ -562,9 +570,6 @@ struct zone {
} ____cacheline_internodealigned_in_smp;

enum pgdat_flags {
- PGDAT_CONGESTED, /* pgdat has many dirty pages backed by
- * a congested BDI
- */
PGDAT_DIRTY, /* reclaim scanning has recently found
* many dirty file pages at the tail
* of the LRU.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ee79b39d0538..eb535c572733 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -267,29 +267,6 @@ static bool writeback_working(struct scan_control *sc)
#endif
return false;
}
-
-static void set_memcg_congestion(pg_data_t *pgdat,
- struct mem_cgroup *memcg,
- bool congested)
-{
- struct mem_cgroup_per_node *mn;
-
- if (!memcg)
- return;
-
- mn = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
- WRITE_ONCE(mn->congested, congested);
-}
-
-static bool memcg_congested(pg_data_t *pgdat,
- struct mem_cgroup *memcg)
-{
- struct mem_cgroup_per_node *mn;
-
- mn = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
- return READ_ONCE(mn->congested);
-
-}
#else
static bool cgroup_reclaim(struct scan_control *sc)
{
@@ -300,18 +277,6 @@ static bool writeback_working(struct scan_control *sc)
{
return true;
}
-
-static inline void set_memcg_congestion(struct pglist_data *pgdat,
- struct mem_cgroup *memcg, bool congested)
-{
-}
-
-static inline bool memcg_congested(struct pglist_data *pgdat,
- struct mem_cgroup *memcg)
-{
- return false;
-
-}
#endif

/*
@@ -2659,12 +2624,6 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
return true;
}

-static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
-{
- return test_bit(PGDAT_CONGESTED, &pgdat->flags) ||
- (memcg && memcg_congested(pgdat, memcg));
-}
-
static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
{
struct mem_cgroup *root = sc->target_mem_cgroup;
@@ -2748,8 +2707,11 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
struct reclaim_state *reclaim_state = current->reclaim_state;
struct mem_cgroup *root = sc->target_mem_cgroup;
unsigned long nr_reclaimed, nr_scanned;
+ struct lruvec *target_lruvec;
bool reclaimable = false;

+ target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
+
again:
memset(&sc->nr, 0, sizeof(sc->nr));

@@ -2792,14 +2754,6 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken)
set_bit(PGDAT_WRITEBACK, &pgdat->flags);

- /*
- * Tag a node as congested if all the dirty pages
- * scanned were backed by a congested BDI and
- * wait_iff_congested will stall.
- */
- if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
- set_bit(PGDAT_CONGESTED, &pgdat->flags);
-
/* Allow kswapd to start writing pages during reclaim.*/
if (sc->nr.unqueued_dirty == sc->nr.file_taken)
set_bit(PGDAT_DIRTY, &pgdat->flags);
@@ -2815,12 +2769,17 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
}

/*
+ * Tag a node/memcg as congested if all the dirty pages
+ * scanned were backed by a congested BDI and
+ * wait_iff_congested will stall.
+ *
* Legacy memcg will stall in page writeback so avoid forcibly
* stalling in wait_iff_congested().
*/
- if (cgroup_reclaim(sc) && writeback_working(sc) &&
+ if ((current_is_kswapd() ||
+ (cgroup_reclaim(sc) && writeback_working(sc))) &&
sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
- set_memcg_congestion(pgdat, root, true);
+ set_bit(LRUVEC_CONGESTED, &target_lruvec->flags);

/*
* Stall direct reclaim for IO completions if underlying BDIs
@@ -2828,8 +2787,9 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
* starts encountering unqueued dirty pages or cycling through
* the LRU too quickly.
*/
- if (!sc->hibernation_mode && !current_is_kswapd() &&
- current_may_throttle() && pgdat_memcg_congested(pgdat, root))
+ if (!current_is_kswapd() && current_may_throttle() &&
+ !sc->hibernation_mode &&
+ test_bit(LRUVEC_CONGESTED, &target_lruvec->flags))
wait_iff_congested(BLK_RW_ASYNC, HZ/10);

if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
@@ -3043,8 +3003,16 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
if (zone->zone_pgdat == last_pgdat)
continue;
last_pgdat = zone->zone_pgdat;
+
snapshot_refaults(sc->target_mem_cgroup, zone->zone_pgdat);
- set_memcg_congestion(last_pgdat, sc->target_mem_cgroup, false);
+
+ if (cgroup_reclaim(sc)) {
+ struct lruvec *lruvec;
+
+ lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup,
+ zone->zone_pgdat);
+ clear_bit(LRUVEC_CONGESTED, &lruvec->flags);
+ }
}

delayacct_freepages_end();
@@ -3419,7 +3387,9 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
/* Clear pgdat state for congested, dirty or under writeback. */
static void clear_pgdat_congested(pg_data_t *pgdat)
{
- clear_bit(PGDAT_CONGESTED, &pgdat->flags);
+ struct lruvec *lruvec = mem_cgroup_lruvec(NULL, pgdat);
+
+ clear_bit(LRUVEC_CONGESTED, &lruvec->flags);
clear_bit(PGDAT_DIRTY, &pgdat->flags);
clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
}
--
2.21.0

2019-06-03 22:18:15

by Johannes Weiner

Subject: [PATCH 11/11] mm: vmscan: enforce inactive:active ratio at the reclaim root

We split the LRU lists into an inactive and an active part to maximize
workingset protection while allowing just enough inactive cache space
to facilitate readahead and writeback for one-off file accesses (e.g. a
linear scan through a file, or logging); or just enough inactive anon
to maintain recent reference information when reclaim needs to swap.

With cgroups and their nested LRU lists, we currently don't do this
correctly. While recursive cgroup reclaim establishes a relative LRU
order among the pages of all involved cgroups, inactive:active size
decisions are done on a per-cgroup level. As a result, we'll reclaim a
cgroup's workingset when it doesn't have cold pages, even when one of
its siblings has plenty of cold pages that should be reclaimed first.

For example: workload A has 50M worth of hot cache but doesn't do any
one-off file accesses; meanwhile, parallel workload B scans files and
rarely accesses the same page twice.

If these workloads were to run in an uncgrouped system, A would be
protected from the high rate of cache faults from B. But if they were
put in parallel cgroups for memory accounting purposes, B's fast cache
fault rate would push out the hot cache pages of A. This is unexpected
and undesirable - the "scan resistance" of the page cache is broken.

This patch moves inactive:active size balancing decisions to the root
of reclaim - the same level where the LRU order is established.

It does this by looking at the recursive size of the inactive and the
active file sets of the cgroup subtree at the beginning of the reclaim
cycle, and then making a decision - scan or skip active pages - that
applies throughout the entire run and to every cgroup involved.

With that in place, in the test above, the VM will recognize that
there are plenty of inactive pages in the combined cache set of
workloads A and B and prefer the one-off cache in B over the hot pages
in A. The scan resistance of the cache is restored.
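
The size-based part of the decision itself is unchanged - inactive is
considered low when inactive * ratio < active, with the ratio growing
as the square root of memory size - it just operates on the recursive
counters of the reclaim root now (reconstructed from the diff below):

static bool inactive_is_low(struct lruvec *lruvec, enum lru_list inactive_lru)
{
        enum lru_list active_lru = inactive_lru + LRU_ACTIVE;
        unsigned long inactive, active;
        unsigned long inactive_ratio;
        unsigned long gb;

        /* recursive counters: include all cgroups below this lruvec */
        inactive = lruvec_page_state(lruvec, inactive_lru);
        active = lruvec_page_state(lruvec, active_lru);

        gb = (inactive + active) >> (30 - PAGE_SHIFT);
        if (gb)
                inactive_ratio = int_sqrt(10 * gb);
        else
                inactive_ratio = 1;

        return inactive * inactive_ratio < active;
}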

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/mmzone.h | 4 +-
mm/vmscan.c | 183 +++++++++++++++++++++++++----------------
2 files changed, 116 insertions(+), 71 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b3ab64cf5619..8d100905c2ec 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -270,12 +270,12 @@ enum lru_list {

#define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++)

-static inline int is_file_lru(enum lru_list lru)
+static inline bool is_file_lru(enum lru_list lru)
{
return (lru == LRU_INACTIVE_FILE || lru == LRU_ACTIVE_FILE);
}

-static inline int is_active_lru(enum lru_list lru)
+static inline bool is_active_lru(enum lru_list lru)
{
return (lru == LRU_ACTIVE_ANON || lru == LRU_ACTIVE_FILE);
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6b7bd5708c73..6af35bb02da0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -79,6 +79,11 @@ struct scan_control {
*/
struct mem_cgroup *target_mem_cgroup;

+ /* Can active pages be deactivated as part of reclaim? anon=0 file=1 */
+ unsigned int may_deactivate:2;
+ unsigned int force_deactivate:1;
+ unsigned int skipped_deactivate:1;
+
/* Writepage batching in laptop mode; RECLAIM_WRITE */
unsigned int may_writepage:1;

@@ -104,6 +109,9 @@ struct scan_control {
/* One of the zones is ready for compaction */
unsigned int compaction_ready:1;

+ /* There is easily reclaimable cold cache in the current node */
+ unsigned int cache_trim_mode:1;
+
/* The file pages on the current node are dangerously low */
unsigned int file_is_tiny:1;

@@ -2084,6 +2092,20 @@ static void shrink_active_list(unsigned long nr_to_scan,
nr_deactivate, nr_rotated, sc->priority, file);
}

+static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
+ struct lruvec *lruvec, struct scan_control *sc)
+{
+ if (is_active_lru(lru)) {
+ if (sc->may_deactivate & (1 << is_file_lru(lru)))
+ shrink_active_list(nr_to_scan, lruvec, sc, lru);
+ else
+ sc->skipped_deactivate = 1;
+ return 0;
+ }
+
+ return shrink_inactive_list(nr_to_scan, lruvec, sc, lru);
+}
+
/*
* The inactive anon list should be small enough that the VM never has
* to do too much work.
@@ -2112,59 +2134,25 @@ static void shrink_active_list(unsigned long nr_to_scan,
* 1TB 101 10GB
* 10TB 320 32GB
*/
-static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
- struct scan_control *sc, bool actual_reclaim)
+static bool inactive_is_low(struct lruvec *lruvec, enum lru_list inactive_lru)
{
- enum lru_list active_lru = file * LRU_FILE + LRU_ACTIVE;
- struct pglist_data *pgdat = lruvec_pgdat(lruvec);
- enum lru_list inactive_lru = file * LRU_FILE;
+ enum lru_list active_lru = inactive_lru + LRU_ACTIVE;
unsigned long inactive, active;
unsigned long inactive_ratio;
- struct lruvec *target_lruvec;
- unsigned long refaults;
unsigned long gb;

- inactive = lruvec_lru_size(lruvec, inactive_lru, sc->reclaim_idx);
- active = lruvec_lru_size(lruvec, active_lru, sc->reclaim_idx);
+ inactive = lruvec_page_state(lruvec, inactive_lru);
+ active = lruvec_page_state(lruvec, active_lru);

- /*
- * When refaults are being observed, it means a new workingset
- * is being established. Disable active list protection to get
- * rid of the stale workingset quickly.
- */
- target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
- refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE);
- if (file && actual_reclaim && target_lruvec->refaults != refaults) {
- inactive_ratio = 0;
- } else {
- gb = (inactive + active) >> (30 - PAGE_SHIFT);
- if (gb)
- inactive_ratio = int_sqrt(10 * gb);
- else
- inactive_ratio = 1;
- }
-
- if (actual_reclaim)
- trace_mm_vmscan_inactive_list_is_low(pgdat->node_id, sc->reclaim_idx,
- lruvec_lru_size(lruvec, inactive_lru, MAX_NR_ZONES), inactive,
- lruvec_lru_size(lruvec, active_lru, MAX_NR_ZONES), active,
- inactive_ratio, file);
+ gb = (inactive + active) >> (30 - PAGE_SHIFT);
+ if (gb)
+ inactive_ratio = int_sqrt(10 * gb);
+ else
+ inactive_ratio = 1;

return inactive * inactive_ratio < active;
}

-static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
- struct lruvec *lruvec, struct scan_control *sc)
-{
- if (is_active_lru(lru)) {
- if (inactive_list_is_low(lruvec, is_file_lru(lru), sc, true))
- shrink_active_list(nr_to_scan, lruvec, sc, lru);
- return 0;
- }
-
- return shrink_inactive_list(nr_to_scan, lruvec, sc, lru);
-}
-
enum scan_balance {
SCAN_EQUAL,
SCAN_FRACT,
@@ -2226,28 +2214,17 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,

/*
* If the system is almost out of file pages, force-scan anon.
- * But only if there are enough inactive anonymous pages on
- * the LRU. Otherwise, the small LRU gets thrashed.
*/
- if (sc->file_is_tiny &&
- !inactive_list_is_low(lruvec, false, sc, false) &&
- lruvec_lru_size(lruvec, LRU_INACTIVE_ANON,
- sc->reclaim_idx) >> sc->priority) {
+ if (sc->file_is_tiny) {
scan_balance = SCAN_ANON;
goto out;
}

/*
- * If there is enough inactive page cache, i.e. if the size of the
- * inactive list is greater than that of the active list *and* the
- * inactive list actually has some pages to scan on this priority, we
- * do not reclaim anything from the anonymous working set right now.
- * Without the second condition we could end up never scanning an
- * lruvec even if it has plenty of old anonymous pages unless the
- * system is under heavy pressure.
+ * If there is enough inactive page cache, we do not reclaim
+ * anything from the anonymous working set right now.
*/
- if (!inactive_list_is_low(lruvec, true, sc, false) &&
- lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, sc->reclaim_idx) >> sc->priority) {
+ if (sc->cache_trim_mode) {
scan_balance = SCAN_FILE;
goto out;
}
@@ -2512,7 +2489,7 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
* Even if we did not try to evict anon pages at all, we want to
* rebalance the anon lru active/inactive ratio.
*/
- if (total_swap_pages && inactive_list_is_low(lruvec, false, sc, true))
+ if (total_swap_pages && inactive_is_low(lruvec, LRU_INACTIVE_ANON))
shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
sc, LRU_ACTIVE_ANON);
}
@@ -2686,6 +2663,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
unsigned long nr_reclaimed, nr_scanned;
struct lruvec *target_lruvec;
bool reclaimable = false;
+ unsigned long file;

target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);

@@ -2695,6 +2673,44 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
nr_reclaimed = sc->nr_reclaimed;
nr_scanned = sc->nr_scanned;

+ /*
+ * Target desirable inactive:active list ratios for the anon
+ * and file LRU lists.
+ */
+ if (!sc->force_deactivate) {
+ unsigned long refaults;
+
+ if (inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
+ sc->may_deactivate |= 1;
+ else
+ sc->may_deactivate &= ~1;
+
+ /*
+ * When refaults are being observed, it means a new
+ * workingset is being established. Deactivate to get
+ * rid of any stale active pages quickly.
+ */
+ refaults = lruvec_page_state(target_lruvec,
+ WORKINGSET_ACTIVATE);
+ if (refaults != target_lruvec->refaults ||
+ inactive_is_low(target_lruvec, LRU_INACTIVE_FILE))
+ sc->may_deactivate |= 2;
+ else
+ sc->may_deactivate &= ~2;
+ } else
+ sc->may_deactivate = 3;
+
+ /*
+ * If we have plenty of inactive file pages that aren't
+ * thrashing, try to reclaim those first before touching
+ * anonymous pages.
+ */
+ file = lruvec_page_state(target_lruvec, LRU_INACTIVE_FILE);
+ if (file >> sc->priority && !(sc->may_deactivate & 2))
+ sc->cache_trim_mode = 1;
+ else
+ sc->cache_trim_mode = 0;
+
/*
* Prevent the reclaimer from falling into the cache trap: as
* cache pages start out inactive, every cache fault will tip
@@ -2705,10 +2721,9 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
* anon pages. Try to detect this based on file LRU size.
*/
if (!cgroup_reclaim(sc)) {
- unsigned long file;
- unsigned long free;
- int z;
unsigned long total_high_wmark = 0;
+ unsigned long free, anon;
+ int z;

free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
file = node_page_state(pgdat, NR_ACTIVE_FILE) +
@@ -2722,7 +2737,17 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
total_high_wmark += high_wmark_pages(zone);
}

- sc->file_is_tiny = file + free <= total_high_wmark;
+ /*
+ * If anon is low too, this isn't a runaway file
+ * reclaim problem, but rather just extreme memory
+ * pressure. Reclaim as per usual in that case.
+ */
+ anon = node_page_state(pgdat, NR_INACTIVE_ANON);
+
+ sc->file_is_tiny =
+ file + free <= total_high_wmark &&
+ !(sc->may_deactivate & 1) &&
+ anon >> sc->priority;
}

shrink_node_memcgs(pgdat, sc);
@@ -3026,9 +3051,27 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
if (sc->compaction_ready)
return 1;

+ /*
+ * We make inactive:active ratio decisions based on the node's
+ * composition of memory, but a restrictive reclaim_idx or a
+ * memory.low cgroup setting can exempt large amounts of
+ * memory from reclaim. Neither situation is very common, so
+ * instead of doing costly eligibility calculations of the
+ * entire cgroup subtree up front, we assume the estimates are
+ * good, and retry with forcible deactivation if that fails.
+ */
+ if (sc->skipped_deactivate) {
+ sc->priority = initial_priority;
+ sc->force_deactivate = 1;
+ sc->skipped_deactivate = 0;
+ goto retry;
+ }
+
/* Untapped cgroup reserves? Don't OOM, retry. */
if (sc->memcg_low_skipped) {
sc->priority = initial_priority;
+ sc->force_deactivate = 0;
+ sc->skipped_deactivate = 0;
sc->memcg_low_reclaim = 1;
sc->memcg_low_skipped = 0;
goto retry;
@@ -3310,18 +3353,20 @@ static void age_active_anon(struct pglist_data *pgdat,
struct scan_control *sc)
{
struct mem_cgroup *memcg;
+ struct lruvec *lruvec;

if (!total_swap_pages)
return;

+ lruvec = mem_cgroup_lruvec(NULL, pgdat);
+ if (!inactive_is_low(lruvec, LRU_INACTIVE_ANON))
+ return;
+
memcg = mem_cgroup_iter(NULL, NULL, NULL);
do {
- struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
-
- if (inactive_list_is_low(lruvec, false, sc, true))
- shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
- sc, LRU_ACTIVE_ANON);
-
+ lruvec = mem_cgroup_lruvec(memcg, pgdat);
+ shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
+ sc, LRU_ACTIVE_ANON);
memcg = mem_cgroup_iter(NULL, memcg, NULL);
} while (memcg);
}
--
2.21.0

2019-06-03 22:18:43

by Johannes Weiner

[permalink] [raw]
Subject: [PATCH 02/11] mm: clean up and clarify lruvec lookup procedure

There is a per-memcg lruvec and a NUMA node lruvec. Which one is being
used is somewhat confusing right now, and it's easy to make mistakes -
especially when it comes to global reclaim.

How it works: when memory cgroups are enabled, we always use the
root_mem_cgroup's per-node lruvecs. When memory cgroups are not
compiled in or disabled at runtime, we use pgdat->lruvec.

Document that in a comment.

Due to the way the reclaim code is generalized, all lookups use the
mem_cgroup_lruvec() helper function, and nobody should have to find
the right lruvec manually right now. But to avoid future mistakes,
rename the pgdat->lruvec member to pgdat->__lruvec and delete the
convenience wrapper that suggests it's a commonly accessed member.

While in this area, swap the mem_cgroup_lruvec() argument order. The
name suggests a memcg operation, yet it takes a pgdat first and a
memcg second. I have to double take every time I call this. Fix that.
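
For illustration, a caller after this change looks like the following
(a sketch only - example_lookup() is a made-up name, not part of the
patch):

static struct lruvec *example_lookup(struct mem_cgroup *memcg,
				     struct pglist_data *pgdat)
{
	/*
	 * Memcg first, node second, matching the function name. When
	 * memcg is compiled out or disabled at runtime, this resolves
	 * to &pgdat->__lruvec.
	 */
	return mem_cgroup_lruvec(memcg, pgdat);
}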

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 26 +++++++++++++-------------
include/linux/mmzone.h | 15 ++++++++-------
mm/memcontrol.c | 6 +++---
mm/page_alloc.c | 2 +-
mm/vmscan.c | 6 +++---
mm/workingset.c | 8 ++++----
6 files changed, 32 insertions(+), 31 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index fa1e8cb1b3e2..fc32cfaebf32 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -382,22 +382,22 @@ mem_cgroup_nodeinfo(struct mem_cgroup *memcg, int nid)
}

/**
- * mem_cgroup_lruvec - get the lru list vector for a node or a memcg zone
- * @node: node of the wanted lruvec
+ * mem_cgroup_lruvec - get the lru list vector for a memcg & node
* @memcg: memcg of the wanted lruvec
+ * @node: node of the wanted lruvec
*
- * Returns the lru list vector holding pages for a given @node or a given
- * @memcg and @zone. This can be the node lruvec, if the memory controller
- * is disabled.
+ * Returns the lru list vector holding pages for a given @memcg &
+ * @node combination. This can be the node lruvec, if the memory
+ * controller is disabled.
*/
-static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
- struct mem_cgroup *memcg)
+static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
+ struct pglist_data *pgdat)
{
struct mem_cgroup_per_node *mz;
struct lruvec *lruvec;

if (mem_cgroup_disabled()) {
- lruvec = node_lruvec(pgdat);
+ lruvec = &pgdat->__lruvec;
goto out;
}

@@ -716,7 +716,7 @@ static inline void __mod_lruvec_page_state(struct page *page,
return;
}

- lruvec = mem_cgroup_lruvec(pgdat, page->mem_cgroup);
+ lruvec = mem_cgroup_lruvec(page->mem_cgroup, pgdat);
__mod_lruvec_state(lruvec, idx, val);
}

@@ -887,16 +887,16 @@ static inline void mem_cgroup_migrate(struct page *old, struct page *new)
{
}

-static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
- struct mem_cgroup *memcg)
+static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
+ struct pglist_data *pgdat)
{
- return node_lruvec(pgdat);
+ return &pgdat->__lruvec;
}

static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
struct pglist_data *pgdat)
{
- return &pgdat->lruvec;
+ return &pgdat->__lruvec;
}

static inline bool mm_match_cgroup(struct mm_struct *mm,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 427b79c39b3c..95d63a395f40 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -761,7 +761,13 @@ typedef struct pglist_data {
#endif

/* Fields commonly accessed by the page reclaim scanner */
- struct lruvec lruvec;
+
+ /*
+ * NOTE: THIS IS UNUSED IF MEMCG IS ENABLED.
+ *
+ * Use mem_cgroup_lruvec() to look up lruvecs.
+ */
+ struct lruvec __lruvec;

unsigned long flags;

@@ -784,11 +790,6 @@ typedef struct pglist_data {
#define node_start_pfn(nid) (NODE_DATA(nid)->node_start_pfn)
#define node_end_pfn(nid) pgdat_end_pfn(NODE_DATA(nid))

-static inline struct lruvec *node_lruvec(struct pglist_data *pgdat)
-{
- return &pgdat->lruvec;
-}
-
static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat)
{
return pgdat->node_start_pfn + pgdat->node_spanned_pages;
@@ -826,7 +827,7 @@ static inline struct pglist_data *lruvec_pgdat(struct lruvec *lruvec)
#ifdef CONFIG_MEMCG
return lruvec->pgdat;
#else
- return container_of(lruvec, struct pglist_data, lruvec);
+ return container_of(lruvec, struct pglist_data, __lruvec);
#endif
}

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c193aef3ba9e..6de8ca735ee2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1200,7 +1200,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
struct lruvec *lruvec;

if (mem_cgroup_disabled()) {
- lruvec = &pgdat->lruvec;
+ lruvec = &pgdat->__lruvec;
goto out;
}

@@ -1518,7 +1518,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
static bool test_mem_cgroup_node_reclaimable(struct mem_cgroup *memcg,
int nid, bool noswap)
{
- struct lruvec *lruvec = mem_cgroup_lruvec(NODE_DATA(nid), memcg);
+ struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));

if (lruvec_page_state(lruvec, NR_INACTIVE_FILE) ||
lruvec_page_state(lruvec, NR_ACTIVE_FILE))
@@ -3406,7 +3406,7 @@ static int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
static unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
int nid, unsigned int lru_mask)
{
- struct lruvec *lruvec = mem_cgroup_lruvec(NODE_DATA(nid), memcg);
+ struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
unsigned long nr = 0;
enum lru_list lru;

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a345418b548e..cd8e64e536f7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6619,7 +6619,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)

pgdat_page_ext_init(pgdat);
spin_lock_init(&pgdat->lru_lock);
- lruvec_init(node_lruvec(pgdat));
+ lruvec_init(&pgdat->__lruvec);
}

static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx, int nid,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f396424850aa..853be16ee5e2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2477,7 +2477,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memcg,
struct scan_control *sc)
{
- struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);
+ struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
unsigned long nr[NR_LRU_LISTS];
unsigned long targets[NR_LRU_LISTS];
unsigned long nr_to_scan;
@@ -2988,7 +2988,7 @@ static void snapshot_refaults(struct mem_cgroup *root_memcg, pg_data_t *pgdat)
unsigned long refaults;
struct lruvec *lruvec;

- lruvec = mem_cgroup_lruvec(pgdat, memcg);
+ lruvec = mem_cgroup_lruvec(memcg, pgdat);
refaults = lruvec_page_state_local(lruvec, WORKINGSET_ACTIVATE);
lruvec->refaults = refaults;
} while ((memcg = mem_cgroup_iter(root_memcg, memcg, NULL)));
@@ -3351,7 +3351,7 @@ static void age_active_anon(struct pglist_data *pgdat,

memcg = mem_cgroup_iter(NULL, NULL, NULL);
do {
- struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);
+ struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);

if (inactive_list_is_low(lruvec, false, sc, true))
shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
diff --git a/mm/workingset.c b/mm/workingset.c
index e0b4edcb88c8..2aaa70bea99c 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -233,7 +233,7 @@ void *workingset_eviction(struct page *page)
VM_BUG_ON_PAGE(page_count(page), page);
VM_BUG_ON_PAGE(!PageLocked(page), page);

- lruvec = mem_cgroup_lruvec(pgdat, memcg);
+ lruvec = mem_cgroup_lruvec(memcg, pgdat);
eviction = atomic_long_inc_return(&lruvec->inactive_age);
return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
}
@@ -280,7 +280,7 @@ void workingset_refault(struct page *page, void *shadow)
memcg = mem_cgroup_from_id(memcgid);
if (!mem_cgroup_disabled() && !memcg)
goto out;
- lruvec = mem_cgroup_lruvec(pgdat, memcg);
+ lruvec = mem_cgroup_lruvec(memcg, pgdat);
refault = atomic_long_read(&lruvec->inactive_age);
active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES);

@@ -345,7 +345,7 @@ void workingset_activation(struct page *page)
memcg = page_memcg_rcu(page);
if (!mem_cgroup_disabled() && !memcg)
goto out;
- lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
+ lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
atomic_long_inc(&lruvec->inactive_age);
out:
rcu_read_unlock();
@@ -428,7 +428,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
struct lruvec *lruvec;
int i;

- lruvec = mem_cgroup_lruvec(NODE_DATA(sc->nid), sc->memcg);
+ lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
pages += lruvec_page_state_local(lruvec,
NR_LRU_BASE + i);
--
2.21.0

2019-11-07 02:51:53

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH 00/11] mm: fix page aging across multiple cgroups

On Mon, Jun 3, 2019 at 2:59 PM Johannes Weiner <[email protected]> wrote:
>
> When applications are put into unconfigured cgroups for memory
> accounting purposes, the cgrouping itself should not change the
> behavior of the page reclaim code. We expect the VM to reclaim the
> coldest pages in the system. But right now the VM can reclaim hot
> pages in one cgroup while there is eligible cold cache in others.
>
> This is because one part of the reclaim algorithm isn't truly cgroup
> hierarchy aware: the inactive/active list balancing. That is the part
> that is supposed to protect hot cache data from one-off streaming IO.
>
> The recursive cgroup reclaim scheme will scan and rotate the physical
> LRU lists of each eligible cgroup at the same rate in a round-robin
> fashion, thereby establishing a relative order among the pages of all
> those cgroups. However, the inactive/active balancing decisions are
> made locally within each cgroup, so when a cgroup is running low on
> cold pages, its hot pages will get reclaimed - even when sibling
> cgroups have plenty of cold cache eligible in the same reclaim run.
>
> For example:
>
> [root@ham ~]# head -n1 /proc/meminfo
> MemTotal: 1016336 kB
>
> [root@ham ~]# ./reclaimtest2.sh
> Establishing 50M active files in cgroup A...
> Hot pages cached: 12800/12800 workingset-a
> Linearly scanning through 18G of file data in cgroup B:
> real 0m4.269s
> user 0m0.051s
> sys 0m4.182s
> Hot pages cached: 134/12800 workingset-a
>

Can you share reclaimtest2.sh as well? Maybe a selftest to
monitor/test future changes.


> The streaming IO in B, which doesn't benefit from caching at all,
> pushes out most of the workingset in A.
>
> Solution
>
> This series fixes the problem by elevating inactive/active balancing
> decisions to the toplevel of the reclaim run. This is either a cgroup
> that hit its limit, or straight-up global reclaim if there is physical
> memory pressure. From there, it takes a recursive view of the cgroup
> subtree to decide whether page deactivation is necessary.
>
> In the test above, the VM will then recognize that cgroup B has plenty
> of eligible cold cache, and that thet hot pages in A can be spared:
>
> [root@ham ~]# ./reclaimtest2.sh
> Establishing 50M active files in cgroup A...
> Hot pages cached: 12800/12800 workingset-a
> Linearly scanning through 18G of file data in cgroup B:
> real 0m4.244s
> user 0m0.064s
> sys 0m4.177s
> Hot pages cached: 12800/12800 workingset-a
>
> Implementation
>
> Whether active pages can be deactivated or not is influenced by two
> factors: the inactive list dropping below a minimum size relative to
> the active list, and the occurence of refaults.
>
> After some cleanups and preparations, this patch series first moves
> refault detection to the reclaim root, then enforces the minimum
> inactive size based on a recursive view of the cgroup tree's LRUs.
>
> History
>
> Note that this actually never worked correctly in Linux cgroups. In
> the past it worked for global reclaim and leaf limit reclaim only (we
> used to have two physical LRU linkages per page), but it never worked
> for intermediate limit reclaim over multiple leaf cgroups.
>
> We're noticing this now because 1) we're putting everything into
> cgroups for accounting, not just the things we want to control and 2)
> we're moving away from leaf limits that invoke reclaim on individual
> cgroups, toward large tree reclaim, triggered by high-level limits or
> physical memory pressure, that is influenced by local protections such
> as memory.low and memory.min instead.
>
> Requirements
>
> These changes are based on the fast recursive memcg stats merged in
> 5.2-rc1. The patches are against v5.2-rc2-mmots-2019-05-29-20-56-12
> plus the page cache fix in https://lkml.org/lkml/2019/5/24/813.
>
> include/linux/memcontrol.h | 37 +--
> include/linux/mmzone.h | 30 +-
> include/linux/swap.h | 2 +-
> mm/memcontrol.c | 6 +-
> mm/page_alloc.c | 2 +-
> mm/vmscan.c | 667 ++++++++++++++++++++++---------------------
> mm/workingset.c | 74 +++--
> 7 files changed, 437 insertions(+), 381 deletions(-)
>
>

2019-11-07 02:52:19

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH 01/11] mm: vmscan: move inactive_list_is_low() swap check to the caller

On Mon, Jun 3, 2019 at 3:05 PM Johannes Weiner <[email protected]> wrote:
>
> inactive_list_is_low() should be about one thing: checking the ratio
> between inactive and active list. Kitchensink checks like the one for
> swap space make the function hard to use and modify its
> callsites. Luckily, most callers already have an understanding of the
> swap situation, so it's easy to clean up.
>
> get_scan_count() has its own, memcg-aware swap check, and doesn't even
> get to the inactive_list_is_low() check on the anon list when there is
> no swap space available.
>
> shrink_list() is called on the results of get_scan_count(), so that
> check is redundant too.
>
> age_active_anon() has its own totalswap_pages check right before it
> checks the list proportions.
>
> The shrink_node_memcg() site is the only one that doesn't do its own
> swap check. Add it there.
>
> Then delete the swap check from inactive_list_is_low().
>
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Shakeel Butt <[email protected]>


> ---
> mm/vmscan.c | 9 +--------
> 1 file changed, 1 insertion(+), 8 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 84dcb651d05c..f396424850aa 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2165,13 +2165,6 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
> unsigned long refaults;
> unsigned long gb;
>
> - /*
> - * If we don't have swap space, anonymous page deactivation
> - * is pointless.
> - */
> - if (!file && !total_swap_pages)
> - return false;
> -
> inactive = lruvec_lru_size(lruvec, inactive_lru, sc->reclaim_idx);
> active = lruvec_lru_size(lruvec, active_lru, sc->reclaim_idx);
>
> @@ -2592,7 +2585,7 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
> * Even if we did not try to evict anon pages at all, we want to
> * rebalance the anon lru active/inactive ratio.
> */
> - if (inactive_list_is_low(lruvec, false, sc, true))
> + if (total_swap_pages && inactive_list_is_low(lruvec, false, sc, true))
> shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
> sc, LRU_ACTIVE_ANON);
> }
> --
> 2.21.0
>

2019-11-07 02:52:27

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH 02/11] mm: clean up and clarify lruvec lookup procedure

On Mon, Jun 3, 2019 at 3:04 PM Johannes Weiner <[email protected]> wrote:
>
> There is a per-memcg lruvec and a NUMA node lruvec. Which one is being
> used is somewhat confusing right now, and it's easy to make mistakes -
> especially when it comes to global reclaim.
>
> How it works: when memory cgroups are enabled, we always use the
> root_mem_cgroup's per-node lruvecs. When memory cgroups are not
> compiled in or disabled at runtime, we use pgdat->lruvec.
>
> Document that in a comment.
>
> Due to the way the reclaim code is generalized, all lookups use the
> mem_cgroup_lruvec() helper function, and nobody should have to find
> the right lruvec manually right now. But to avoid future mistakes,
> rename the pgdat->lruvec member to pgdat->__lruvec and delete the
> convenience wrapper that suggests it's a commonly accessed member.
>
> While in this area, swap the mem_cgroup_lruvec() argument order. The
> name suggests a memcg operation, yet it takes a pgdat first and a
> memcg second. I have to double

check

> take every time I call this. Fix that.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Shakeel Butt <[email protected]>


> ---
> include/linux/memcontrol.h | 26 +++++++++++++-------------
> include/linux/mmzone.h | 15 ++++++++-------
> mm/memcontrol.c | 6 +++---
> mm/page_alloc.c | 2 +-
> mm/vmscan.c | 6 +++---
> mm/workingset.c | 8 ++++----
> 6 files changed, 32 insertions(+), 31 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index fa1e8cb1b3e2..fc32cfaebf32 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -382,22 +382,22 @@ mem_cgroup_nodeinfo(struct mem_cgroup *memcg, int nid)
> }
>
> /**
> - * mem_cgroup_lruvec - get the lru list vector for a node or a memcg zone
> - * @node: node of the wanted lruvec
> + * mem_cgroup_lruvec - get the lru list vector for a memcg & node
> * @memcg: memcg of the wanted lruvec
> + * @node: node of the wanted lruvec
> *
> - * Returns the lru list vector holding pages for a given @node or a given
> - * @memcg and @zone. This can be the node lruvec, if the memory controller
> - * is disabled.
> + * Returns the lru list vector holding pages for a given @memcg &
> + * @node combination. This can be the node lruvec, if the memory
> + * controller is disabled.
> */
> -static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
> - struct mem_cgroup *memcg)
> +static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
> + struct pglist_data *pgdat)
> {
> struct mem_cgroup_per_node *mz;
> struct lruvec *lruvec;
>
> if (mem_cgroup_disabled()) {
> - lruvec = node_lruvec(pgdat);
> + lruvec = &pgdat->__lruvec;
> goto out;
> }
>
> @@ -716,7 +716,7 @@ static inline void __mod_lruvec_page_state(struct page *page,
> return;
> }
>
> - lruvec = mem_cgroup_lruvec(pgdat, page->mem_cgroup);
> + lruvec = mem_cgroup_lruvec(page->mem_cgroup, pgdat);
> __mod_lruvec_state(lruvec, idx, val);
> }
>
> @@ -887,16 +887,16 @@ static inline void mem_cgroup_migrate(struct page *old, struct page *new)
> {
> }
>
> -static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
> - struct mem_cgroup *memcg)
> +static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
> + struct pglist_data *pgdat)
> {
> - return node_lruvec(pgdat);
> + return &pgdat->__lruvec;
> }
>
> static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
> struct pglist_data *pgdat)
> {
> - return &pgdat->lruvec;
> + return &pgdat->__lruvec;
> }
>
> static inline bool mm_match_cgroup(struct mm_struct *mm,
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 427b79c39b3c..95d63a395f40 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -761,7 +761,13 @@ typedef struct pglist_data {
> #endif
>
> /* Fields commonly accessed by the page reclaim scanner */
> - struct lruvec lruvec;
> +
> + /*
> + * NOTE: THIS IS UNUSED IF MEMCG IS ENABLED.
> + *
> + * Use mem_cgroup_lruvec() to look up lruvecs.
> + */
> + struct lruvec __lruvec;
>
> unsigned long flags;
>
> @@ -784,11 +790,6 @@ typedef struct pglist_data {
> #define node_start_pfn(nid) (NODE_DATA(nid)->node_start_pfn)
> #define node_end_pfn(nid) pgdat_end_pfn(NODE_DATA(nid))
>
> -static inline struct lruvec *node_lruvec(struct pglist_data *pgdat)
> -{
> - return &pgdat->lruvec;
> -}
> -
> static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat)
> {
> return pgdat->node_start_pfn + pgdat->node_spanned_pages;
> @@ -826,7 +827,7 @@ static inline struct pglist_data *lruvec_pgdat(struct lruvec *lruvec)
> #ifdef CONFIG_MEMCG
> return lruvec->pgdat;
> #else
> - return container_of(lruvec, struct pglist_data, lruvec);
> + return container_of(lruvec, struct pglist_data, __lruvec);
> #endif
> }
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c193aef3ba9e..6de8ca735ee2 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1200,7 +1200,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
> struct lruvec *lruvec;
>
> if (mem_cgroup_disabled()) {
> - lruvec = &pgdat->lruvec;
> + lruvec = &pgdat->__lruvec;
> goto out;
> }
>
> @@ -1518,7 +1518,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> static bool test_mem_cgroup_node_reclaimable(struct mem_cgroup *memcg,
> int nid, bool noswap)
> {
> - struct lruvec *lruvec = mem_cgroup_lruvec(NODE_DATA(nid), memcg);
> + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
>
> if (lruvec_page_state(lruvec, NR_INACTIVE_FILE) ||
> lruvec_page_state(lruvec, NR_ACTIVE_FILE))
> @@ -3406,7 +3406,7 @@ static int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
> static unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
> int nid, unsigned int lru_mask)
> {
> - struct lruvec *lruvec = mem_cgroup_lruvec(NODE_DATA(nid), memcg);
> + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
> unsigned long nr = 0;
> enum lru_list lru;
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a345418b548e..cd8e64e536f7 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6619,7 +6619,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
>
> pgdat_page_ext_init(pgdat);
> spin_lock_init(&pgdat->lru_lock);
> - lruvec_init(node_lruvec(pgdat));
> + lruvec_init(&pgdat->__lruvec);
> }
>
> static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx, int nid,
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f396424850aa..853be16ee5e2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2477,7 +2477,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
> static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memcg,
> struct scan_control *sc)
> {
> - struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);
> + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
> unsigned long nr[NR_LRU_LISTS];
> unsigned long targets[NR_LRU_LISTS];
> unsigned long nr_to_scan;
> @@ -2988,7 +2988,7 @@ static void snapshot_refaults(struct mem_cgroup *root_memcg, pg_data_t *pgdat)
> unsigned long refaults;
> struct lruvec *lruvec;
>
> - lruvec = mem_cgroup_lruvec(pgdat, memcg);
> + lruvec = mem_cgroup_lruvec(memcg, pgdat);
> refaults = lruvec_page_state_local(lruvec, WORKINGSET_ACTIVATE);
> lruvec->refaults = refaults;
> } while ((memcg = mem_cgroup_iter(root_memcg, memcg, NULL)));
> @@ -3351,7 +3351,7 @@ static void age_active_anon(struct pglist_data *pgdat,
>
> memcg = mem_cgroup_iter(NULL, NULL, NULL);
> do {
> - struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);
> + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
>
> if (inactive_list_is_low(lruvec, false, sc, true))
> shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
> diff --git a/mm/workingset.c b/mm/workingset.c
> index e0b4edcb88c8..2aaa70bea99c 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -233,7 +233,7 @@ void *workingset_eviction(struct page *page)
> VM_BUG_ON_PAGE(page_count(page), page);
> VM_BUG_ON_PAGE(!PageLocked(page), page);
>
> - lruvec = mem_cgroup_lruvec(pgdat, memcg);
> + lruvec = mem_cgroup_lruvec(memcg, pgdat);
> eviction = atomic_long_inc_return(&lruvec->inactive_age);
> return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
> }
> @@ -280,7 +280,7 @@ void workingset_refault(struct page *page, void *shadow)
> memcg = mem_cgroup_from_id(memcgid);
> if (!mem_cgroup_disabled() && !memcg)
> goto out;
> - lruvec = mem_cgroup_lruvec(pgdat, memcg);
> + lruvec = mem_cgroup_lruvec(memcg, pgdat);
> refault = atomic_long_read(&lruvec->inactive_age);
> active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES);
>
> @@ -345,7 +345,7 @@ void workingset_activation(struct page *page)
> memcg = page_memcg_rcu(page);
> if (!mem_cgroup_disabled() && !memcg)
> goto out;
> - lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
> + lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
> atomic_long_inc(&lruvec->inactive_age);
> out:
> rcu_read_unlock();
> @@ -428,7 +428,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
> struct lruvec *lruvec;
> int i;
>
> - lruvec = mem_cgroup_lruvec(NODE_DATA(sc->nid), sc->memcg);
> + lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
> for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
> pages += lruvec_page_state_local(lruvec,
> NR_LRU_BASE + i);
> --
> 2.21.0
>

2019-11-07 02:53:02

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH 03/11] mm: vmscan: simplify lruvec_lru_size()

On Mon, Jun 3, 2019 at 2:59 PM Johannes Weiner <[email protected]> wrote:
>
> This function currently takes the node or lruvec size and subtracts
> the zones that are excluded by the classzone index of the
> allocation. It uses four different types of counters to do this.
>
> Just add up the eligible zones.
>
> Signed-off-by: Johannes Weiner <[email protected]>

I think this became part of another series. Anyways:

Reviewed-by: Shakeel Butt <[email protected]>


> ---
> mm/vmscan.c | 19 +++++--------------
> 1 file changed, 5 insertions(+), 14 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 853be16ee5e2..69c4c82a9b5a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -342,30 +342,21 @@ unsigned long zone_reclaimable_pages(struct zone *zone)
> */
> unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx)
> {
> - unsigned long lru_size;
> + unsigned long size = 0;
> int zid;
>
> - if (!mem_cgroup_disabled())
> - lru_size = lruvec_page_state_local(lruvec, NR_LRU_BASE + lru);
> - else
> - lru_size = node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
> -
> - for (zid = zone_idx + 1; zid < MAX_NR_ZONES; zid++) {
> + for (zid = 0; zid <= zone_idx; zid++) {
> struct zone *zone = &lruvec_pgdat(lruvec)->node_zones[zid];
> - unsigned long size;
>
> if (!managed_zone(zone))
> continue;
>
> if (!mem_cgroup_disabled())
> - size = mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
> + size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
> else
> - size = zone_page_state(&lruvec_pgdat(lruvec)->node_zones[zid],
> - NR_ZONE_LRU_BASE + lru);
> - lru_size -= min(size, lru_size);
> + size += zone_page_state(zone, NR_ZONE_LRU_BASE + lru);
> }
> -
> - return lru_size;
> + return size;
>
> }
>
> --
> 2.21.0
>

2019-11-07 02:53:09

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH 04/11] mm: vmscan: naming fixes: cgroup_reclaim() and writeback_working()

On Mon, Jun 3, 2019 at 3:02 PM Johannes Weiner <[email protected]> wrote:
>
> Seven years after introducing the global_reclaim() function, I still
> have to double take when reading a callsite. I don't know how others
> do it. This is a terrible name.
>
> Invert the meaning and rename it to cgroup_reclaim().
>
> [ After all, "global reclaim" is just regular reclaim invoked from the
> page allocator. It's reclaim on behalf of a cgroup limit that is a
> special case of reclaim, and should be explicit - not the reverse. ]
>

Not really confusing for me at least but no objection on cgroup_reclaim().

> sane_reclaim() isn't very descriptive either: it tests whether we can
> use the regular writeback throttling - available during regular page
> reclaim or cgroup2 limit reclaim - or need to use the broken
> wait_on_page_writeback() method. Rename it to writeback_working().

Totally agree here.

>
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Shakeel Butt <[email protected]>


> ---
> mm/vmscan.c | 38 ++++++++++++++++++--------------------
> 1 file changed, 18 insertions(+), 20 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 69c4c82a9b5a..afd5e2432a8e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -239,13 +239,13 @@ static void unregister_memcg_shrinker(struct shrinker *shrinker)
> #endif /* CONFIG_MEMCG_KMEM */
>
> #ifdef CONFIG_MEMCG
> -static bool global_reclaim(struct scan_control *sc)
> +static bool cgroup_reclaim(struct scan_control *sc)
> {
> - return !sc->target_mem_cgroup;
> + return sc->target_mem_cgroup;
> }
>
> /**
> - * sane_reclaim - is the usual dirty throttling mechanism operational?
> + * writeback_working - is the usual dirty throttling mechanism operational?
> * @sc: scan_control in question
> *
> * The normal page dirty throttling mechanism in balance_dirty_pages() is
> @@ -257,11 +257,9 @@ static bool global_reclaim(struct scan_control *sc)
> * This function tests whether the vmscan currently in progress can assume
> * that the normal dirty throttling mechanism is operational.
> */
> -static bool sane_reclaim(struct scan_control *sc)
> +static bool writeback_working(struct scan_control *sc)
> {
> - struct mem_cgroup *memcg = sc->target_mem_cgroup;
> -
> - if (!memcg)
> + if (!cgroup_reclaim(sc))
> return true;
> #ifdef CONFIG_CGROUP_WRITEBACK
> if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
> @@ -293,12 +291,12 @@ static bool memcg_congested(pg_data_t *pgdat,
>
> }
> #else
> -static bool global_reclaim(struct scan_control *sc)
> +static bool cgroup_reclaim(struct scan_control *sc)
> {
> - return true;
> + return false;
> }
>
> -static bool sane_reclaim(struct scan_control *sc)
> +static bool writeback_working(struct scan_control *sc)
> {
> return true;
> }
> @@ -1211,7 +1209,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> goto activate_locked;
>
> /* Case 2 above */
> - } else if (sane_reclaim(sc) ||
> + } else if (writeback_working(sc) ||
> !PageReclaim(page) || !may_enter_fs) {
> /*
> * This is slightly racy - end_page_writeback()
> @@ -1806,7 +1804,7 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
> if (current_is_kswapd())
> return 0;
>
> - if (!sane_reclaim(sc))
> + if (!writeback_working(sc))
> return 0;
>
> if (file) {
> @@ -1957,7 +1955,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> reclaim_stat->recent_scanned[file] += nr_taken;
>
> item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
> - if (global_reclaim(sc))
> + if (!cgroup_reclaim(sc))
> __count_vm_events(item, nr_scanned);
> __count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
> spin_unlock_irq(&pgdat->lru_lock);
> @@ -1971,7 +1969,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> spin_lock_irq(&pgdat->lru_lock);
>
> item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
> - if (global_reclaim(sc))
> + if (!cgroup_reclaim(sc))
> __count_vm_events(item, nr_reclaimed);
> __count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
> reclaim_stat->recent_rotated[0] += stat.nr_activate[0];
> @@ -2239,7 +2237,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
> * using the memory controller's swap limit feature would be
> * too expensive.
> */
> - if (!global_reclaim(sc) && !swappiness) {
> + if (cgroup_reclaim(sc) && !swappiness) {
> scan_balance = SCAN_FILE;
> goto out;
> }
> @@ -2263,7 +2261,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
> * thrashing file LRU becomes infinitely more attractive than
> * anon pages. Try to detect this based on file LRU size.
> */
> - if (global_reclaim(sc)) {
> + if (!cgroup_reclaim(sc)) {
> unsigned long pgdatfile;
> unsigned long pgdatfree;
> int z;
> @@ -2494,7 +2492,7 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
> * abort proportional reclaim if either the file or anon lru has already
> * dropped to zero at the first pass.
> */
> - scan_adjusted = (global_reclaim(sc) && !current_is_kswapd() &&
> + scan_adjusted = (!cgroup_reclaim(sc) && !current_is_kswapd() &&
> sc->priority == DEF_PRIORITY);
>
> blk_start_plug(&plug);
> @@ -2816,7 +2814,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> * Legacy memcg will stall in page writeback so avoid forcibly
> * stalling in wait_iff_congested().
> */
> - if (!global_reclaim(sc) && sane_reclaim(sc) &&
> + if (cgroup_reclaim(sc) && writeback_working(sc) &&
> sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
> set_memcg_congestion(pgdat, root, true);
>
> @@ -2911,7 +2909,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> * Take care memory controller reclaiming has small influence
> * to global LRU.
> */
> - if (global_reclaim(sc)) {
> + if (!cgroup_reclaim(sc)) {
> if (!cpuset_zone_allowed(zone,
> GFP_KERNEL | __GFP_HARDWALL))
> continue;
> @@ -3011,7 +3009,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> retry:
> delayacct_freepages_start();
>
> - if (global_reclaim(sc))
> + if (!cgroup_reclaim(sc))
> __count_zid_vm_events(ALLOCSTALL, sc->reclaim_idx, 1);
>
> do {
> --
> 2.21.0
>

2019-11-07 02:53:32

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH 05/11] mm: vmscan: replace shrink_node() loop with a retry jump

On Mon, Jun 3, 2019 at 3:05 PM Johannes Weiner <[email protected]> wrote:
>
> Most of the function body is inside a loop, which imposes an
> additional indentation and scoping level that makes the code a bit
> hard to follow and modify.
>
> The looping only happens in case of reclaim-compaction, which isn't
> the common case. So rather than adding yet another function level to
> the reclaim path and have every reclaim invocation go through a level
> that only exists for one specific cornercase, use a retry goto.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Shakeel Butt <[email protected]>


> ---
> mm/vmscan.c | 266 ++++++++++++++++++++++++++--------------------------
> 1 file changed, 133 insertions(+), 133 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index afd5e2432a8e..304974481146 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2672,164 +2672,164 @@ static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
> static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> {
> struct reclaim_state *reclaim_state = current->reclaim_state;
> + struct mem_cgroup *root = sc->target_mem_cgroup;
> + struct mem_cgroup_reclaim_cookie reclaim = {
> + .pgdat = pgdat,
> + .priority = sc->priority,
> + };
> unsigned long nr_reclaimed, nr_scanned;
> bool reclaimable = false;
> + struct mem_cgroup *memcg;
>
> - do {
> - struct mem_cgroup *root = sc->target_mem_cgroup;
> - struct mem_cgroup_reclaim_cookie reclaim = {
> - .pgdat = pgdat,
> - .priority = sc->priority,
> - };
> - struct mem_cgroup *memcg;
> -
> - memset(&sc->nr, 0, sizeof(sc->nr));
> +again:
> + memset(&sc->nr, 0, sizeof(sc->nr));
>
> - nr_reclaimed = sc->nr_reclaimed;
> - nr_scanned = sc->nr_scanned;
> + nr_reclaimed = sc->nr_reclaimed;
> + nr_scanned = sc->nr_scanned;
>
> - memcg = mem_cgroup_iter(root, NULL, &reclaim);
> - do {
> - unsigned long reclaimed;
> - unsigned long scanned;
> + memcg = mem_cgroup_iter(root, NULL, &reclaim);
> + do {
> + unsigned long reclaimed;
> + unsigned long scanned;
>
> - switch (mem_cgroup_protected(root, memcg)) {
> - case MEMCG_PROT_MIN:
> - /*
> - * Hard protection.
> - * If there is no reclaimable memory, OOM.
> - */
> + switch (mem_cgroup_protected(root, memcg)) {
> + case MEMCG_PROT_MIN:
> + /*
> + * Hard protection.
> + * If there is no reclaimable memory, OOM.
> + */
> + continue;
> + case MEMCG_PROT_LOW:
> + /*
> + * Soft protection.
> + * Respect the protection only as long as
> + * there is an unprotected supply
> + * of reclaimable memory from other cgroups.
> + */
> + if (!sc->memcg_low_reclaim) {
> + sc->memcg_low_skipped = 1;
> continue;
> - case MEMCG_PROT_LOW:
> - /*
> - * Soft protection.
> - * Respect the protection only as long as
> - * there is an unprotected supply
> - * of reclaimable memory from other cgroups.
> - */
> - if (!sc->memcg_low_reclaim) {
> - sc->memcg_low_skipped = 1;
> - continue;
> - }
> - memcg_memory_event(memcg, MEMCG_LOW);
> - break;
> - case MEMCG_PROT_NONE:
> - /*
> - * All protection thresholds breached. We may
> - * still choose to vary the scan pressure
> - * applied based on by how much the cgroup in
> - * question has exceeded its protection
> - * thresholds (see get_scan_count).
> - */
> - break;
> }
> -
> - reclaimed = sc->nr_reclaimed;
> - scanned = sc->nr_scanned;
> - shrink_node_memcg(pgdat, memcg, sc);
> -
> - if (sc->may_shrinkslab) {
> - shrink_slab(sc->gfp_mask, pgdat->node_id,
> - memcg, sc->priority);
> - }
> -
> - /* Record the group's reclaim efficiency */
> - vmpressure(sc->gfp_mask, memcg, false,
> - sc->nr_scanned - scanned,
> - sc->nr_reclaimed - reclaimed);
> -
> + memcg_memory_event(memcg, MEMCG_LOW);
> + break;
> + case MEMCG_PROT_NONE:
> /*
> - * Kswapd have to scan all memory cgroups to fulfill
> - * the overall scan target for the node.
> - *
> - * Limit reclaim, on the other hand, only cares about
> - * nr_to_reclaim pages to be reclaimed and it will
> - * retry with decreasing priority if one round over the
> - * whole hierarchy is not sufficient.
> + * All protection thresholds breached. We may
> + * still choose to vary the scan pressure
> + * applied based on by how much the cgroup in
> + * question has exceeded its protection
> + * thresholds (see get_scan_count).
> */
> - if (!current_is_kswapd() &&
> - sc->nr_reclaimed >= sc->nr_to_reclaim) {
> - mem_cgroup_iter_break(root, memcg);
> - break;
> - }
> - } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
> + break;
> + }
> +
> + reclaimed = sc->nr_reclaimed;
> + scanned = sc->nr_scanned;
> + shrink_node_memcg(pgdat, memcg, sc);
>
> - if (reclaim_state) {
> - sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> - reclaim_state->reclaimed_slab = 0;
> + if (sc->may_shrinkslab) {
> + shrink_slab(sc->gfp_mask, pgdat->node_id,
> + memcg, sc->priority);
> }
>
> - /* Record the subtree's reclaim efficiency */
> - vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
> - sc->nr_scanned - nr_scanned,
> - sc->nr_reclaimed - nr_reclaimed);
> + /* Record the group's reclaim efficiency */
> + vmpressure(sc->gfp_mask, memcg, false,
> + sc->nr_scanned - scanned,
> + sc->nr_reclaimed - reclaimed);
>
> - if (sc->nr_reclaimed - nr_reclaimed)
> - reclaimable = true;
> + /*
> + * Kswapd have to scan all memory cgroups to fulfill
> + * the overall scan target for the node.
> + *
> + * Limit reclaim, on the other hand, only cares about
> + * nr_to_reclaim pages to be reclaimed and it will
> + * retry with decreasing priority if one round over the
> + * whole hierarchy is not sufficient.
> + */
> + if (!current_is_kswapd() &&
> + sc->nr_reclaimed >= sc->nr_to_reclaim) {
> + mem_cgroup_iter_break(root, memcg);
> + break;
> + }
> + } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
>
> - if (current_is_kswapd()) {
> - /*
> - * If reclaim is isolating dirty pages under writeback,
> - * it implies that the long-lived page allocation rate
> - * is exceeding the page laundering rate. Either the
> - * global limits are not being effective at throttling
> - * processes due to the page distribution throughout
> - * zones or there is heavy usage of a slow backing
> - * device. The only option is to throttle from reclaim
> - * context which is not ideal as there is no guarantee
> - * the dirtying process is throttled in the same way
> - * balance_dirty_pages() manages.
> - *
> - * Once a node is flagged PGDAT_WRITEBACK, kswapd will
> - * count the number of pages under pages flagged for
> - * immediate reclaim and stall if any are encountered
> - * in the nr_immediate check below.
> - */
> - if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken)
> - set_bit(PGDAT_WRITEBACK, &pgdat->flags);
> + if (reclaim_state) {
> + sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> + reclaim_state->reclaimed_slab = 0;
> + }
>
> - /*
> - * Tag a node as congested if all the dirty pages
> - * scanned were backed by a congested BDI and
> - * wait_iff_congested will stall.
> - */
> - if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
> - set_bit(PGDAT_CONGESTED, &pgdat->flags);
> + /* Record the subtree's reclaim efficiency */
> + vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
> + sc->nr_scanned - nr_scanned,
> + sc->nr_reclaimed - nr_reclaimed);
>
> - /* Allow kswapd to start writing pages during reclaim.*/
> - if (sc->nr.unqueued_dirty == sc->nr.file_taken)
> - set_bit(PGDAT_DIRTY, &pgdat->flags);
> + if (sc->nr_reclaimed - nr_reclaimed)
> + reclaimable = true;
>
> - /*
> - * If kswapd scans pages marked marked for immediate
> - * reclaim and under writeback (nr_immediate), it
> - * implies that pages are cycling through the LRU
> - * faster than they are written so also forcibly stall.
> - */
> - if (sc->nr.immediate)
> - congestion_wait(BLK_RW_ASYNC, HZ/10);
> - }
> + if (current_is_kswapd()) {
> + /*
> + * If reclaim is isolating dirty pages under writeback,
> + * it implies that the long-lived page allocation rate
> + * is exceeding the page laundering rate. Either the
> + * global limits are not being effective at throttling
> + * processes due to the page distribution throughout
> + * zones or there is heavy usage of a slow backing
> + * device. The only option is to throttle from reclaim
> + * context which is not ideal as there is no guarantee
> + * the dirtying process is throttled in the same way
> + * balance_dirty_pages() manages.
> + *
> + * Once a node is flagged PGDAT_WRITEBACK, kswapd will
> + * count the number of pages under pages flagged for
> + * immediate reclaim and stall if any are encountered
> + * in the nr_immediate check below.
> + */
> + if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken)
> + set_bit(PGDAT_WRITEBACK, &pgdat->flags);
>
> /*
> - * Legacy memcg will stall in page writeback so avoid forcibly
> - * stalling in wait_iff_congested().
> + * Tag a node as congested if all the dirty pages
> + * scanned were backed by a congested BDI and
> + * wait_iff_congested will stall.
> */
> - if (cgroup_reclaim(sc) && writeback_working(sc) &&
> - sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
> - set_memcg_congestion(pgdat, root, true);
> + if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
> + set_bit(PGDAT_CONGESTED, &pgdat->flags);
> +
> + /* Allow kswapd to start writing pages during reclaim.*/
> + if (sc->nr.unqueued_dirty == sc->nr.file_taken)
> + set_bit(PGDAT_DIRTY, &pgdat->flags);
>
> /*
> - * Stall direct reclaim for IO completions if underlying BDIs
> - * and node is congested. Allow kswapd to continue until it
> - * starts encountering unqueued dirty pages or cycling through
> - * the LRU too quickly.
> + * If kswapd scans pages marked marked for immediate
> + * reclaim and under writeback (nr_immediate), it
> + * implies that pages are cycling through the LRU
> + * faster than they are written so also forcibly stall.
> */
> - if (!sc->hibernation_mode && !current_is_kswapd() &&
> - current_may_throttle() && pgdat_memcg_congested(pgdat, root))
> - wait_iff_congested(BLK_RW_ASYNC, HZ/10);
> + if (sc->nr.immediate)
> + congestion_wait(BLK_RW_ASYNC, HZ/10);
> + }
> +
> + /*
> + * Legacy memcg will stall in page writeback so avoid forcibly
> + * stalling in wait_iff_congested().
> + */
> + if (cgroup_reclaim(sc) && writeback_working(sc) &&
> + sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
> + set_memcg_congestion(pgdat, root, true);
> +
> + /*
> + * Stall direct reclaim for IO completions if underlying BDIs
> + * and node is congested. Allow kswapd to continue until it
> + * starts encountering unqueued dirty pages or cycling through
> + * the LRU too quickly.
> + */
> + if (!sc->hibernation_mode && !current_is_kswapd() &&
> + current_may_throttle() && pgdat_memcg_congested(pgdat, root))
> + wait_iff_congested(BLK_RW_ASYNC, HZ/10);
>
> - } while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
> - sc->nr_scanned - nr_scanned, sc));
> + if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
> + sc->nr_scanned - nr_scanned, sc))
> + goto again;
>
> /*
> * Kswapd gives up on balancing particular nodes after too
> --
> 2.21.0
>

2019-11-07 02:53:40

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH 06/11] mm: vmscan: turn shrink_node_memcg() into shrink_lruvec()

On Mon, Jun 3, 2019 at 3:07 PM Johannes Weiner <[email protected]> wrote:
>
> A lruvec holds LRU pages owned by a certain NUMA node and cgroup.
> Instead of awkwardly passing around a combination of a pgdat and a
> memcg pointer, pass down the lruvec as soon as we can look it up.
>
> Nested callers that need to access node or cgroup properties can look
> them them up if necessary, but there are only a few cases.

*them

>
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Shakeel Butt <[email protected]>


> ---
> mm/vmscan.c | 21 ++++++++++-----------
> 1 file changed, 10 insertions(+), 11 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 304974481146..b85111474ee2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2210,9 +2210,10 @@ enum scan_balance {
> * nr[0] = anon inactive pages to scan; nr[1] = anon active pages to scan
> * nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
> */
> -static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
> - struct scan_control *sc, unsigned long *nr)
> +static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> + unsigned long *nr)
> {
> + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> int swappiness = mem_cgroup_swappiness(memcg);
> struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> u64 fraction[2];
> @@ -2460,13 +2461,8 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
> }
> }
>
> -/*
> - * This is a basic per-node page freer. Used by both kswapd and direct reclaim.
> - */
> -static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memcg,
> - struct scan_control *sc)
> +static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> {
> - struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
> unsigned long nr[NR_LRU_LISTS];
> unsigned long targets[NR_LRU_LISTS];
> unsigned long nr_to_scan;
> @@ -2476,7 +2472,7 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
> struct blk_plug plug;
> bool scan_adjusted;
>
> - get_scan_count(lruvec, memcg, sc, nr);
> + get_scan_count(lruvec, sc, nr);
>
> /* Record the original scan target for proportional adjustments later */
> memcpy(targets, nr, sizeof(nr));
> @@ -2689,6 +2685,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>
> memcg = mem_cgroup_iter(root, NULL, &reclaim);
> do {
> + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
> unsigned long reclaimed;
> unsigned long scanned;
>
> @@ -2725,7 +2722,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>
> reclaimed = sc->nr_reclaimed;
> scanned = sc->nr_scanned;
> - shrink_node_memcg(pgdat, memcg, sc);
> +
> + shrink_lruvec(lruvec, sc);
>
> if (sc->may_shrinkslab) {
> shrink_slab(sc->gfp_mask, pgdat->node_id,
> @@ -3243,6 +3241,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
> pg_data_t *pgdat,
> unsigned long *nr_scanned)
> {
> + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
> struct scan_control sc = {
> .nr_to_reclaim = SWAP_CLUSTER_MAX,
> .target_mem_cgroup = memcg,
> @@ -3268,7 +3267,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
> * will pick up pages from other mem cgroup's as well. We hack
> * the priority and make it zero.
> */
> - shrink_node_memcg(pgdat, memcg, &sc);
> + shrink_lruvec(lruvec, &sc);
>
> trace_mm_vmscan_memcg_softlimit_reclaim_end(
> cgroup_ino(memcg->css.cgroup),
> --
> 2.21.0
>

2019-11-07 02:54:06

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH 08/11] mm: vmscan: harmonize writeback congestion tracking for nodes & memcgs

On Mon, Jun 3, 2019 at 3:04 PM Johannes Weiner <[email protected]> wrote:
>
> The current writeback congestion tracking has separate flags for
> kswapd reclaim (node level) and cgroup limit reclaim (memcg-node
> level). This is unnecessarily complicated: the lruvec is an existing
> abstraction layer for that node-memcg intersection.
>
> Introduce lruvec->flags and LRUVEC_CONGESTED. Then track that at the
> reclaim root level, which is either the NUMA node for global reclaim,
> or the cgroup-node intersection for cgroup reclaim.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Shakeel Butt <[email protected]>


> ---
> include/linux/memcontrol.h | 6 +--
> include/linux/mmzone.h | 11 ++++--
> mm/vmscan.c | 80 ++++++++++++--------------------------
> 3 files changed, 36 insertions(+), 61 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index fc32cfaebf32..d33e09c51acc 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -144,9 +144,6 @@ struct mem_cgroup_per_node {
> unsigned long usage_in_excess;/* Set to the value by which */
> /* the soft limit is exceeded*/
> bool on_tree;
> - bool congested; /* memcg has many dirty pages */
> - /* backed by a congested BDI */
> -
> struct mem_cgroup *memcg; /* Back pointer, we cannot */
> /* use container_of */
> };
> @@ -401,6 +398,9 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
> goto out;
> }
>
> + if (!memcg)
> + memcg = root_mem_cgroup;
> +
> mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
> lruvec = &mz->lruvec;
> out:
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 95d63a395f40..b3ab64cf5619 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -293,6 +293,12 @@ struct zone_reclaim_stat {
> unsigned long recent_scanned[2];
> };
>
> +enum lruvec_flags {
> + LRUVEC_CONGESTED, /* lruvec has many dirty pages
> + * backed by a congested BDI
> + */
> +};
> +
> struct lruvec {
> struct list_head lists[NR_LRU_LISTS];
> struct zone_reclaim_stat reclaim_stat;
> @@ -300,6 +306,8 @@ struct lruvec {
> atomic_long_t inactive_age;
> /* Refaults at the time of last reclaim cycle */
> unsigned long refaults;
> + /* Various lruvec state flags (enum lruvec_flags) */
> + unsigned long flags;
> #ifdef CONFIG_MEMCG
> struct pglist_data *pgdat;
> #endif
> @@ -562,9 +570,6 @@ struct zone {
> } ____cacheline_internodealigned_in_smp;
>
> enum pgdat_flags {
> - PGDAT_CONGESTED, /* pgdat has many dirty pages backed by
> - * a congested BDI
> - */
> PGDAT_DIRTY, /* reclaim scanning has recently found
> * many dirty file pages at the tail
> * of the LRU.
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ee79b39d0538..eb535c572733 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -267,29 +267,6 @@ static bool writeback_working(struct scan_control *sc)
> #endif
> return false;
> }
> -
> -static void set_memcg_congestion(pg_data_t *pgdat,
> - struct mem_cgroup *memcg,
> - bool congested)
> -{
> - struct mem_cgroup_per_node *mn;
> -
> - if (!memcg)
> - return;
> -
> - mn = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
> - WRITE_ONCE(mn->congested, congested);
> -}
> -
> -static bool memcg_congested(pg_data_t *pgdat,
> - struct mem_cgroup *memcg)
> -{
> - struct mem_cgroup_per_node *mn;
> -
> - mn = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
> - return READ_ONCE(mn->congested);
> -
> -}
> #else
> static bool cgroup_reclaim(struct scan_control *sc)
> {
> @@ -300,18 +277,6 @@ static bool writeback_working(struct scan_control *sc)
> {
> return true;
> }
> -
> -static inline void set_memcg_congestion(struct pglist_data *pgdat,
> - struct mem_cgroup *memcg, bool congested)
> -{
> -}
> -
> -static inline bool memcg_congested(struct pglist_data *pgdat,
> - struct mem_cgroup *memcg)
> -{
> - return false;
> -
> -}
> #endif
>
> /*
> @@ -2659,12 +2624,6 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
> return true;
> }
>
> -static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
> -{
> - return test_bit(PGDAT_CONGESTED, &pgdat->flags) ||
> - (memcg && memcg_congested(pgdat, memcg));
> -}
> -
> static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> {
> struct mem_cgroup *root = sc->target_mem_cgroup;
> @@ -2748,8 +2707,11 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> struct reclaim_state *reclaim_state = current->reclaim_state;
> struct mem_cgroup *root = sc->target_mem_cgroup;
> unsigned long nr_reclaimed, nr_scanned;
> + struct lruvec *target_lruvec;
> bool reclaimable = false;
>
> + target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
> +
> again:
> memset(&sc->nr, 0, sizeof(sc->nr));
>
> @@ -2792,14 +2754,6 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken)
> set_bit(PGDAT_WRITEBACK, &pgdat->flags);
>
> - /*
> - * Tag a node as congested if all the dirty pages
> - * scanned were backed by a congested BDI and
> - * wait_iff_congested will stall.
> - */
> - if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
> - set_bit(PGDAT_CONGESTED, &pgdat->flags);
> -
> /* Allow kswapd to start writing pages during reclaim.*/
> if (sc->nr.unqueued_dirty == sc->nr.file_taken)
> set_bit(PGDAT_DIRTY, &pgdat->flags);
> @@ -2815,12 +2769,17 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> }
>
> /*
> + * Tag a node/memcg as congested if all the dirty pages
> + * scanned were backed by a congested BDI and
> + * wait_iff_congested will stall.
> + *
> * Legacy memcg will stall in page writeback so avoid forcibly
> * stalling in wait_iff_congested().
> */
> - if (cgroup_reclaim(sc) && writeback_working(sc) &&
> + if ((current_is_kswapd() ||
> + (cgroup_reclaim(sc) && writeback_working(sc))) &&
> sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
> - set_memcg_congestion(pgdat, root, true);
> + set_bit(LRUVEC_CONGESTED, &target_lruvec->flags);
>
> /*
> * Stall direct reclaim for IO completions if underlying BDIs
> @@ -2828,8 +2787,9 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> * starts encountering unqueued dirty pages or cycling through
> * the LRU too quickly.
> */
> - if (!sc->hibernation_mode && !current_is_kswapd() &&
> - current_may_throttle() && pgdat_memcg_congested(pgdat, root))
> + if (!current_is_kswapd() && current_may_throttle() &&
> + !sc->hibernation_mode &&
> + test_bit(LRUVEC_CONGESTED, &target_lruvec->flags))
> wait_iff_congested(BLK_RW_ASYNC, HZ/10);
>
> if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
> @@ -3043,8 +3003,16 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> if (zone->zone_pgdat == last_pgdat)
> continue;
> last_pgdat = zone->zone_pgdat;
> +
> snapshot_refaults(sc->target_mem_cgroup, zone->zone_pgdat);
> - set_memcg_congestion(last_pgdat, sc->target_mem_cgroup, false);
> +
> + if (cgroup_reclaim(sc)) {
> + struct lruvec *lruvec;
> +
> + lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup,
> + zone->zone_pgdat);
> + clear_bit(LRUVEC_CONGESTED, &lruvec->flags);
> + }
> }
>
> delayacct_freepages_end();
> @@ -3419,7 +3387,9 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
> /* Clear pgdat state for congested, dirty or under writeback. */
> static void clear_pgdat_congested(pg_data_t *pgdat)
> {
> - clear_bit(PGDAT_CONGESTED, &pgdat->flags);
> + struct lruvec *lruvec = mem_cgroup_lruvec(NULL, pgdat);
> +
> + clear_bit(LRUVEC_CONGESTED, &lruvec->flags);
> clear_bit(PGDAT_DIRTY, &pgdat->flags);
> clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
> }
> --
> 2.21.0
>
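
To gather the pieces of this change in one place, below is a condensed
user-space model of the new flag's lifecycle: set on the reclaim root's
lruvec when every scanned dirty page was congested, tested to throttle
direct reclaim, and cleared when the run ends or kswapd balances the
node. All types and helper names are illustrative stand-ins, not the
kernel code in the diff above.

/*
 * Condensed user-space model of the lruvec congestion flag.  One
 * per-lruvec bit covers both global reclaim (root memcg lruvec) and
 * cgroup limit reclaim (cgroup-node lruvec).
 */
#include <stdbool.h>
#include <stdio.h>

enum lruvec_flags { LRUVEC_CONGESTED };		/* bit index, as in the patch */

struct lruvec {
	unsigned long flags;
};

/*
 * Set on the reclaim root's lruvec when every dirty page scanned was
 * backed by a congested BDI.  (The patch additionally gates this on
 * kswapd, or on cgroup reclaim with working writeback.)
 */
static void note_congestion(struct lruvec *target,
			    unsigned long dirty, unsigned long congested)
{
	if (dirty && dirty == congested)
		target->flags |= 1UL << LRUVEC_CONGESTED;
}

/*
 * Direct reclaimers stall on a congested reclaim root; kswapd never
 * does.  (The patch also checks current_may_throttle() and skips the
 * stall in hibernation mode.)
 */
static bool should_throttle(const struct lruvec *target, bool is_kswapd)
{
	return !is_kswapd && (target->flags & (1UL << LRUVEC_CONGESTED));
}

/* Cleared when the reclaim run ends, or when kswapd balances the node. */
static void clear_congestion(struct lruvec *target)
{
	target->flags &= ~(1UL << LRUVEC_CONGESTED);
}

int main(void)
{
	struct lruvec v = { 0 };

	note_congestion(&v, 32, 32);	/* every dirty page scanned was congested */
	printf("throttle direct reclaim: %d\n", should_throttle(&v, false));
	printf("throttle kswapd:         %d\n", should_throttle(&v, true));
	clear_congestion(&v);
	printf("after clearing:          %d\n", should_throttle(&v, false));
	return 0;
}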

2019-11-07 02:54:19

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH 09/11] mm: vmscan: move file exhaustion detection to the node level

On Mon, Jun 3, 2019 at 3:08 PM Johannes Weiner <[email protected]> wrote:
>
> When file pages are lower than the watermark on a node, we try to
> force-scan anonymous pages to counteract the balancing algorithm's
> preference for new file pages when they are likely thrashing. This is
> a node-level decision, but it's currently made each time we look at an
> lruvec. This is unnecessarily expensive and also a layering violation
> that makes the code harder to understand.
>
> Clean this up by making the check once per node and setting a flag in
> the scan_control.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Shakeel Butt <[email protected]>


> ---
> mm/vmscan.c | 80 ++++++++++++++++++++++++++++-------------------------
> 1 file changed, 42 insertions(+), 38 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eb535c572733..cabf94dfa92d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -104,6 +104,9 @@ struct scan_control {
> /* One of the zones is ready for compaction */
> unsigned int compaction_ready:1;
>
> + /* The file pages on the current node are dangerously low */
> + unsigned int file_is_tiny:1;
> +
> /* Allocation order */
> s8 order;
>
> @@ -2219,45 +2222,16 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> }
>

Unrelated to the patch. I think we need to revisit all the heuristics
that were added here over the years. get_scan_count() has become
really complicated and weird.

> /*
> - * Prevent the reclaimer from falling into the cache trap: as
> - * cache pages start out inactive, every cache fault will tip
> - * the scan balance towards the file LRU. And as the file LRU
> - * shrinks, so does the window for rotation from references.
> - * This means we have a runaway feedback loop where a tiny
> - * thrashing file LRU becomes infinitely more attractive than
> - * anon pages. Try to detect this based on file LRU size.
> + * If the system is almost out of file pages, force-scan anon.
> + * But only if there are enough inactive anonymous pages on
> + * the LRU. Otherwise, the small LRU gets thrashed.
> */
> - if (!cgroup_reclaim(sc)) {
> - unsigned long pgdatfile;
> - unsigned long pgdatfree;
> - int z;
> - unsigned long total_high_wmark = 0;
> -
> - pgdatfree = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
> - pgdatfile = node_page_state(pgdat, NR_ACTIVE_FILE) +
> - node_page_state(pgdat, NR_INACTIVE_FILE);
> -
> - for (z = 0; z < MAX_NR_ZONES; z++) {
> - struct zone *zone = &pgdat->node_zones[z];
> - if (!managed_zone(zone))
> - continue;
> -
> - total_high_wmark += high_wmark_pages(zone);
> - }
> -
> - if (unlikely(pgdatfile + pgdatfree <= total_high_wmark)) {
> - /*
> - * Force SCAN_ANON if there are enough inactive
> - * anonymous pages on the LRU in eligible zones.
> - * Otherwise, the small LRU gets thrashed.
> - */
> - if (!inactive_list_is_low(lruvec, false, sc, false) &&
> - lruvec_lru_size(lruvec, LRU_INACTIVE_ANON, sc->reclaim_idx)
> - >> sc->priority) {
> - scan_balance = SCAN_ANON;
> - goto out;
> - }
> - }
> + if (sc->file_is_tiny &&
> + !inactive_list_is_low(lruvec, false, sc, false) &&
> + lruvec_lru_size(lruvec, LRU_INACTIVE_ANON,
> + sc->reclaim_idx) >> sc->priority) {
> + scan_balance = SCAN_ANON;
> + goto out;
> }
>
> /*
> @@ -2718,6 +2692,36 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> nr_reclaimed = sc->nr_reclaimed;
> nr_scanned = sc->nr_scanned;
>
> + /*
> + * Prevent the reclaimer from falling into the cache trap: as
> + * cache pages start out inactive, every cache fault will tip
> + * the scan balance towards the file LRU. And as the file LRU
> + * shrinks, so does the window for rotation from references.
> + * This means we have a runaway feedback loop where a tiny
> + * thrashing file LRU becomes infinitely more attractive than
> + * anon pages. Try to detect this based on file LRU size.
> + */
> + if (!cgroup_reclaim(sc)) {
> + unsigned long file;
> + unsigned long free;
> + int z;
> + unsigned long total_high_wmark = 0;
> +
> + free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
> + file = node_page_state(pgdat, NR_ACTIVE_FILE) +
> + node_page_state(pgdat, NR_INACTIVE_FILE);
> +
> + for (z = 0; z < MAX_NR_ZONES; z++) {
> + struct zone *zone = &pgdat->node_zones[z];
> + if (!managed_zone(zone))
> + continue;
> +
> + total_high_wmark += high_wmark_pages(zone);
> + }
> +
> + sc->file_is_tiny = file + free <= total_high_wmark;
> + }
> +
> shrink_node_memcgs(pgdat, sc);
>
> if (reclaim_state) {
> --
> 2.21.0
>
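
As a side note on the mechanics: the node-level check this patch hoists
out of get_scan_count() boils down to the computation sketched below,
done once per reclaim round and cached in the scan_control. The types
and names are simplified stand-ins, not the kernel's.

/*
 * Minimal model of the node-level "file is tiny" check: it is computed
 * once per reclaim round instead of once per lruvec.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_NR_ZONES 4

struct zone_model {
	bool managed;
	unsigned long high_wmark_pages;
};

struct node_model {
	unsigned long free_pages;
	unsigned long file_pages;	/* active + inactive file */
	struct zone_model zones[MAX_NR_ZONES];
};

/* True when reclaiming all file pages still wouldn't reach the high watermarks. */
static bool file_is_tiny(const struct node_model *node)
{
	unsigned long total_high_wmark = 0;
	int z;

	for (z = 0; z < MAX_NR_ZONES; z++) {
		if (!node->zones[z].managed)
			continue;
		total_high_wmark += node->zones[z].high_wmark_pages;
	}
	return node->file_pages + node->free_pages <= total_high_wmark;
}

int main(void)
{
	struct node_model node = {
		.free_pages = 100,
		.file_pages = 200,
		.zones = { { true, 512 }, { true, 256 } },
	};

	/*
	 * With the flag set, get_scan_count() would force-scan anon for
	 * every lruvec on this node (provided the inactive anon list is
	 * big enough).
	 */
	printf("file_is_tiny: %d\n", file_is_tiny(&node));
	return 0;
}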

2019-11-07 02:55:55

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH 07/11] mm: vmscan: split shrink_node() into node part and memcgs part

On Mon, Jun 3, 2019 at 3:00 PM Johannes Weiner <[email protected]> wrote:
>
> This function is getting long and unwieldy. The new shrink_node()
> handles the generic (node) reclaim aspects:
> - global vmpressure notifications
> - writeback and congestion throttling
> - reclaim/compaction management
> - kswapd giving up on unreclaimable nodes
>
> It then calls shrink_node_memcgs() which handles cgroup specifics:
> - the cgroup tree traversal
> - memory.low considerations
> - per-cgroup slab shrinking callbacks
> - per-cgroup vmpressure notifications
>
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Shakeel Butt <[email protected]>


> ---
> mm/vmscan.c | 29 ++++++++++++++++++-----------
> 1 file changed, 18 insertions(+), 11 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b85111474ee2..ee79b39d0538 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2665,24 +2665,15 @@ static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
> (memcg && memcg_congested(pgdat, memcg));
> }
>
> -static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> +static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> {
> - struct reclaim_state *reclaim_state = current->reclaim_state;
> struct mem_cgroup *root = sc->target_mem_cgroup;
> struct mem_cgroup_reclaim_cookie reclaim = {
> .pgdat = pgdat,
> .priority = sc->priority,
> };
> - unsigned long nr_reclaimed, nr_scanned;
> - bool reclaimable = false;
> struct mem_cgroup *memcg;
>
> -again:
> - memset(&sc->nr, 0, sizeof(sc->nr));
> -
> - nr_reclaimed = sc->nr_reclaimed;
> - nr_scanned = sc->nr_scanned;
> -
> memcg = mem_cgroup_iter(root, NULL, &reclaim);
> do {
> struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
> @@ -2750,6 +2741,22 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> break;
> }
> } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
> +}
> +
> +static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> +{
> + struct reclaim_state *reclaim_state = current->reclaim_state;
> + struct mem_cgroup *root = sc->target_mem_cgroup;
> + unsigned long nr_reclaimed, nr_scanned;
> + bool reclaimable = false;
> +
> +again:
> + memset(&sc->nr, 0, sizeof(sc->nr));
> +
> + nr_reclaimed = sc->nr_reclaimed;
> + nr_scanned = sc->nr_scanned;
> +
> + shrink_node_memcgs(pgdat, sc);
>
> if (reclaim_state) {
> sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> @@ -2757,7 +2764,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> }
>
> /* Record the subtree's reclaim efficiency */
> - vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
> + vmpressure(sc->gfp_mask, root, true,
> sc->nr_scanned - nr_scanned,
> sc->nr_reclaimed - nr_reclaimed);
>
> --
> 2.21.0
>
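
The resulting division of labor can be summarized with the structural
skeleton below. The stub types and empty bodies are illustrative only,
not the kernel code; they just mirror the split described in the
changelog above.

/* Structural skeleton of the split, with stub types; not the kernel code. */
#include <stdio.h>

struct scan_control { int priority; };
typedef struct pglist_data { int node_id; } pg_data_t;

static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
{
	/*
	 * cgroup specifics: cgroup tree traversal, memory.low checks,
	 * per-cgroup LRU and slab shrinking, per-cgroup vmpressure.
	 */
	(void)pgdat;
	(void)sc;
}

static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
{
	/*
	 * node-level work: global vmpressure, writeback/congestion
	 * throttling, reclaim/compaction management, kswapd giving up
	 * on unreclaimable nodes -- plus one call into the cgroup part:
	 */
	shrink_node_memcgs(pgdat, sc);
}

int main(void)
{
	struct scan_control sc = { .priority = 12 };
	pg_data_t node = { .node_id = 0 };

	shrink_node(&node, &sc);
	puts("node reclaim pass modeled");
	return 0;
}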

2019-11-07 17:51:23

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 00/11] mm: fix page aging across multiple cgroups

On Wed, Nov 06, 2019 at 06:50:25PM -0800, Shakeel Butt wrote:
> On Mon, Jun 3, 2019 at 2:59 PM Johannes Weiner <[email protected]> wrote:
> >
> > When applications are put into unconfigured cgroups for memory
> > accounting purposes, the cgrouping itself should not change the
> > behavior of the page reclaim code. We expect the VM to reclaim the
> > coldest pages in the system. But right now the VM can reclaim hot
> > pages in one cgroup while there is eligible cold cache in others.
> >
> > This is because one part of the reclaim algorithm isn't truly cgroup
> > hierarchy aware: the inactive/active list balancing. That is the part
> > that is supposed to protect hot cache data from one-off streaming IO.
> >
> > The recursive cgroup reclaim scheme will scan and rotate the physical
> > LRU lists of each eligible cgroup at the same rate in a round-robin
> > fashion, thereby establishing a relative order among the pages of all
> > those cgroups. However, the inactive/active balancing decisions are
> > made locally within each cgroup, so when a cgroup is running low on
> > cold pages, its hot pages will get reclaimed - even when sibling
> > cgroups have plenty of cold cache eligible in the same reclaim run.
> >
> > For example:
> >
> > [root@ham ~]# head -n1 /proc/meminfo
> > MemTotal: 1016336 kB
> >
> > [root@ham ~]# ./reclaimtest2.sh
> > Establishing 50M active files in cgroup A...
> > Hot pages cached: 12800/12800 workingset-a
> > Linearly scanning through 18G of file data in cgroup B:
> > real 0m4.269s
> > user 0m0.051s
> > sys 0m4.182s
> > Hot pages cached: 134/12800 workingset-a
> >
>
> Can you share reclaimtest2.sh as well? Maybe a selftest to
> monitor/test future changes.

I wish it were more portable, but it really only does what it says in
the log output, in a pretty hacky way, with all parameters hard-coded
to my test environment:

---

#!/bin/bash

# this should protect workingset-a from workingset-b

set -e
#set -x

echo Establishing 50M active files in cgroup A...
rmdir /cgroup/workingset-a 2>/dev/null || true
mkdir /cgroup/workingset-a
echo $$ > /cgroup/workingset-a/cgroup.procs
rm -f workingset-a
dd of=workingset-a bs=1M count=0 seek=50 2>/dev/null >/dev/null
cat workingset-a > /dev/null
cat workingset-a > /dev/null
cat workingset-a > /dev/null
cat workingset-a > /dev/null
cat workingset-a > /dev/null
cat workingset-a > /dev/null
cat workingset-a > /dev/null
cat workingset-a > /dev/null
echo -n "Hot pages cached: "
./mincore workingset-a

echo -n Linearly scanning through 2G of file data in cgroup B:
rmdir /cgroup/workingset-b 2>/dev/null || true
mkdir /cgroup/workingset-b
echo $$ > /cgroup/workingset-b/cgroup.procs
rm -f workingset-b
dd of=workingset-b bs=1M count=0 seek=2048 2>/dev/null >/dev/null
time (
cat workingset-b > /dev/null
cat workingset-b > /dev/null
cat workingset-b > /dev/null
cat workingset-b > /dev/null
cat workingset-b > /dev/null
cat workingset-b > /dev/null
cat workingset-b > /dev/null
cat workingset-b > /dev/null )
echo -n "Hot pages cached: "
./mincore workingset-a
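
The ./mincore helper itself isn't part of the script. Judging from the
"Hot pages cached: 12800/12800 workingset-a" lines, it prints how many
of a file's pages are resident in the page cache; a minimal stand-in
along those lines (an assumption, not the actual tool) could look like
this:

/*
 * mincore.c - report how many pages of a file are resident in the page
 * cache, as "resident/total filename".  A stand-in for the ./mincore
 * helper used by the script above; the real tool may differ.
 * Build: cc -O2 -o mincore mincore.c
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	struct stat st;
	if (fstat(fd, &st) < 0) {
		perror("fstat");
		return 1;
	}
	if (st.st_size == 0) {
		printf("0/0 %s\n", argv[1]);
		return 0;
	}

	long page_size = sysconf(_SC_PAGESIZE);
	size_t pages = (st.st_size + page_size - 1) / page_size;

	void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	unsigned char *vec = malloc(pages);
	if (!vec || mincore(map, st.st_size, vec) < 0) {
		perror("mincore");
		return 1;
	}

	/* Bit 0 of each vector byte indicates the page is resident. */
	size_t resident = 0;
	for (size_t i = 0; i < pages; i++)
		if (vec[i] & 1)
			resident++;

	printf("%zu/%zu %s\n", resident, pages, argv[1]);

	free(vec);
	munmap(map, st.st_size);
	close(fd);
	return 0;
}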

2019-11-08 03:45:02

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 01/11] mm: vmscan: move inactive_list_is_low() swap check to the caller

On Wed, 6 Nov 2019 18:50:37 -0800 Shakeel Butt <[email protected]> wrote:

> On Mon, Jun 3, 2019 at 3:05 PM Johannes Weiner <[email protected]> wrote:
> >
> ...
>
> Reviewed-by: Shakeel Butt <[email protected]>

hm, you appear to have reviewed a five-month-old version. Hoping that
nothing changed too much, I added your review tags to the one-week-old
version ;)