2014-11-25 18:24:04

by Johannes Weiner

Subject: [patch] mm: vmscan: invoke slab shrinkers from shrink_zone()

The slab shrinkers are currently invoked from the zonelist walkers in
kswapd, direct reclaim, and zone reclaim, all of which roughly gauge
the eligible LRU pages and assemble a nodemask to pass to NUMA-aware
shrinkers, which then again have to walk over the nodemask. This is
redundant code, extra runtime work, and fairly inaccurate when it
comes to the estimation of actually scannable LRU pages. The code
duplication will only get worse when making the shrinkers cgroup-aware
and requiring them to have out-of-band cgroup hierarchy walks as well.

Instead, invoke the shrinkers from shrink_zone(), which is where all
reclaimers end up, to avoid this duplication.

Take the count for eligible LRU pages out of get_scan_count(), which
considers many more factors than just the availability of swap space,
like zone_reclaimable_pages() currently does. Accumulate the number
over all visited lruvecs to get the per-zone value.

Some nodes have multiple zones due to memory addressing restrictions.
To avoid putting too much pressure on the shrinkers, only invoke them
once for each such node, using the class zone of the allocation as the
pivot zone.
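For example, on a typical x86_64 node with ZONE_DMA, ZONE_DMA32, and
ZONE_NORMAL, an ordinary GFP_KERNEL allocation has ZONE_NORMAL as its
class zone, so the node's shrinkers are invoked only while that zone is
being shrunk, rather than once per zone.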

For now, this integrates the slab shrinking better into the reclaim
logic and gets rid of duplicative invocations from kswapd, direct
reclaim, and zone reclaim. It also prepares for cgroup-awareness,
allowing memcg-capable shrinkers to be added at the lruvec level
without much duplication of both code and runtime work.

This changes kswapd behavior, which used to invoke the shrinkers for
each zone, but with scan ratios gathered from the entire node,
resulting in meaningless pressure quantities on multi-zone nodes.

Zone reclaim behavior also changes. It used to shrink slabs until the
same amount of pages were shrunk as were reclaimed from the LRUs. Now
it merely invokes the shrinkers once with the zone's scan ratio, which
makes the shrinkers go easier on caches that implement aging and would
prefer feeding back pressure from recently used slab objects to unused
LRU pages.
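
To make the scan ratio concrete, here is a quick, purely illustrative
user-space model of the per-invocation scan target, mirroring the delta
computation in shrink_slabs() in the mm/vmscan.c hunk below; deferred
work, batching and clamping are ignored and the input numbers are
hypothetical:

/*
 * Rough model of how many objects a single shrink_node_slabs() call
 * asks one shrinker to scan for a given page reclaim pressure ratio.
 */
#include <stdio.h>

static unsigned long long scan_target(unsigned long long nr_scanned,
                                      unsigned long long nr_eligible,
                                      unsigned long long freeable,
                                      unsigned int seeks)
{
        unsigned long long delta;

        delta = (4 * nr_scanned) / seeks;  /* bias by object recreation cost */
        delta *= freeable;                 /* scale to the size of the cache */
        delta /= nr_eligible + 1;          /* per eligible LRU page scanned */
        return delta;
}

int main(void)
{
        /* e.g. 32 of 200000 eligible LRU pages scanned, against a cache
         * holding 1000000 freeable objects, seeks == DEFAULT_SEEKS (2) */
        printf("%llu objects to scan\n",
               scan_target(32, 200000, 1000000, 2));
        return 0;
}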

Signed-off-by: Johannes Weiner <[email protected]>
---
drivers/staging/android/ashmem.c | 3 +-
fs/drop_caches.c | 11 ++-
include/linux/mm.h | 6 +-
include/linux/shrinker.h | 2 -
mm/memory-failure.c | 11 +--
mm/page_alloc.c | 6 +-
mm/vmscan.c | 208 +++++++++++++++------------------------
7 files changed, 98 insertions(+), 149 deletions(-)

diff --git a/drivers/staging/android/ashmem.c b/drivers/staging/android/ashmem.c
index 27eecfe1c410..8c7852742f4b 100644
--- a/drivers/staging/android/ashmem.c
+++ b/drivers/staging/android/ashmem.c
@@ -418,7 +418,7 @@ out:
}

/*
- * ashmem_shrink - our cache shrinker, called from mm/vmscan.c :: shrink_slab
+ * ashmem_shrink - our cache shrinker, called from mm/vmscan.c
*
* 'nr_to_scan' is the number of objects to scan for freeing.
*
@@ -785,7 +785,6 @@ static long ashmem_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
.nr_to_scan = LONG_MAX,
};
ret = ashmem_shrink_count(&ashmem_shrinker, &sc);
- nodes_setall(sc.nodes_to_scan);
ashmem_shrink_scan(&ashmem_shrinker, &sc);
}
break;
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index 1de7294aad20..2bc2c87f35e7 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -40,13 +40,14 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
static void drop_slab(void)
{
int nr_objects;
- struct shrink_control shrink = {
- .gfp_mask = GFP_KERNEL,
- };

- nodes_setall(shrink.nodes_to_scan);
do {
- nr_objects = shrink_slab(&shrink, 1000, 1000);
+ int nid;
+
+ nr_objects = 0;
+ for_each_online_node(nid)
+ nr_objects += shrink_node_slabs(GFP_KERNEL, nid,
+ 1000, 1000);
} while (nr_objects > 10);
}

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b922a16c9b5b..f652931aa4bd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2088,9 +2088,9 @@ int drop_caches_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
#endif

-unsigned long shrink_slab(struct shrink_control *shrink,
- unsigned long nr_pages_scanned,
- unsigned long lru_pages);
+unsigned long shrink_node_slabs(gfp_t gfp_mask, int nid,
+ unsigned long nr_scanned,
+ unsigned long nr_eligible);

#ifndef CONFIG_MMU
#define randomize_va_space 0
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 68c097077ef0..f4aee75f00b1 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -18,8 +18,6 @@ struct shrink_control {
*/
unsigned long nr_to_scan;

- /* shrink from these nodes */
- nodemask_t nodes_to_scan;
/* current node being shrunk (for NUMA aware shrinkers) */
int nid;
};
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 6b94969d91c5..feb803bf3443 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -239,19 +239,14 @@ void shake_page(struct page *p, int access)
}

/*
- * Only call shrink_slab here (which would also shrink other caches) if
- * access is not potentially fatal.
+ * Only call shrink_node_slabs here (which would also shrink
+ * other caches) if access is not potentially fatal.
*/
if (access) {
int nr;
int nid = page_to_nid(p);
do {
- struct shrink_control shrink = {
- .gfp_mask = GFP_KERNEL,
- };
- node_set(nid, shrink.nodes_to_scan);
-
- nr = shrink_slab(&shrink, 1000, 1000);
+ nr = shrink_node_slabs(GFP_KERNEL, nid, 1000, 1000);
if (page_count(p) == 1)
break;
} while (nr > 10);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b0e6eaba454c..efa4e6fe6a4d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6267,9 +6267,9 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
if (!PageLRU(page))
found++;
/*
- * If there are RECLAIMABLE pages, we need to check it.
- * But now, memory offline itself doesn't call shrink_slab()
- * and it still to be fixed.
+ * If there are RECLAIMABLE pages, we need to check
+ * it. But now, memory offline itself doesn't call
+ * shrink_node_slabs() and it still to be fixed.
*/
/*
* If the page is not RAM, page_count()should be 0.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a384339bf718..8c2b45bfe610 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -229,9 +229,10 @@ EXPORT_SYMBOL(unregister_shrinker);

#define SHRINK_BATCH 128

-static unsigned long
-shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
- unsigned long nr_pages_scanned, unsigned long lru_pages)
+static unsigned long shrink_slabs(struct shrink_control *shrinkctl,
+ struct shrinker *shrinker,
+ unsigned long nr_scanned,
+ unsigned long nr_eligible)
{
unsigned long freed = 0;
unsigned long long delta;
@@ -255,9 +256,9 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);

total_scan = nr;
- delta = (4 * nr_pages_scanned) / shrinker->seeks;
+ delta = (4 * nr_scanned) / shrinker->seeks;
delta *= freeable;
- do_div(delta, lru_pages + 1);
+ do_div(delta, nr_eligible + 1);
total_scan += delta;
if (total_scan < 0) {
pr_err("shrink_slab: %pF negative objects to delete nr=%ld\n",
@@ -289,8 +290,8 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
total_scan = freeable * 2;

trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
- nr_pages_scanned, lru_pages,
- freeable, delta, total_scan);
+ nr_scanned, nr_eligible,
+ freeable, delta, total_scan);

/*
* Normally, we should not scan less than batch_size objects in one
@@ -339,34 +340,37 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
return freed;
}

-/*
- * Call the shrink functions to age shrinkable caches
- *
- * Here we assume it costs one seek to replace a lru page and that it also
- * takes a seek to recreate a cache object. With this in mind we age equal
- * percentages of the lru and ageable caches. This should balance the seeks
- * generated by these structures.
+/**
+ * shrink_node_slabs - shrink slab caches of a given node
+ * @gfp_mask: allocation context
+ * @nid: node whose slab caches to target
+ * @nr_scanned: pressure numerator
+ * @nr_eligible: pressure denominator
*
- * If the vm encountered mapped pages on the LRU it increase the pressure on
- * slab to avoid swapping.
+ * Call the shrink functions to age shrinkable caches.
*
- * We do weird things to avoid (scanned*seeks*entries) overflowing 32 bits.
+ * @nid is passed along to shrinkers with SHRINKER_NUMA_AWARE set,
+ * unaware shrinkers will receive a node id of 0 instead.
*
- * `lru_pages' represents the number of on-LRU pages in all the zones which
- * are eligible for the caller's allocation attempt. It is used for balancing
- * slab reclaim versus page reclaim.
+ * @nr_scanned and @nr_eligible form a ratio that indicate how much of
+ * the available objects should be scanned. Page reclaim for example
+ * passes the number of pages scanned and the number of pages on the
+ * LRU lists that it considered on @nid, plus a bias in @nr_scanned
+ * when it encountered mapped pages. The ratio is further biased by
+ * the ->seeks setting of the shrink function, which indicates the
+ * cost to recreate an object relative to that of an LRU page.
*
- * Returns the number of slab objects which we shrunk.
+ * Returns the number of reclaimed slab objects.
*/
-unsigned long shrink_slab(struct shrink_control *shrinkctl,
- unsigned long nr_pages_scanned,
- unsigned long lru_pages)
+unsigned long shrink_node_slabs(gfp_t gfp_mask, int nid,
+ unsigned long nr_scanned,
+ unsigned long nr_eligible)
{
struct shrinker *shrinker;
unsigned long freed = 0;

- if (nr_pages_scanned == 0)
- nr_pages_scanned = SWAP_CLUSTER_MAX;
+ if (nr_scanned == 0)
+ nr_scanned = SWAP_CLUSTER_MAX;

if (!down_read_trylock(&shrinker_rwsem)) {
/*
@@ -380,20 +384,17 @@ unsigned long shrink_slab(struct shrink_control *shrinkctl,
}

list_for_each_entry(shrinker, &shrinker_list, list) {
- if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) {
- shrinkctl->nid = 0;
- freed += shrink_slab_node(shrinkctl, shrinker,
- nr_pages_scanned, lru_pages);
- continue;
- }
+ struct shrink_control sc = {
+ .gfp_mask = gfp_mask,
+ .nid = nid,
+ };

- for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
- if (node_online(shrinkctl->nid))
- freed += shrink_slab_node(shrinkctl, shrinker,
- nr_pages_scanned, lru_pages);
+ if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
+ sc.nid = 0;

- }
+ freed += shrink_slabs(&sc, shrinker, nr_scanned, nr_eligible);
}
+
up_read(&shrinker_rwsem);
out:
cond_resched();
@@ -1876,7 +1877,8 @@ enum scan_balance {
* nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
*/
static void get_scan_count(struct lruvec *lruvec, int swappiness,
- struct scan_control *sc, unsigned long *nr)
+ struct scan_control *sc, unsigned long *nr,
+ unsigned long *lru_pages)
{
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
u64 fraction[2];
@@ -2022,6 +2024,7 @@ out:
some_scanned = false;
/* Only use force_scan on second pass. */
for (pass = 0; !some_scanned && pass < 2; pass++) {
+ *lru_pages = 0;
for_each_evictable_lru(lru) {
int file = is_file_lru(lru);
unsigned long size;
@@ -2048,14 +2051,19 @@ out:
case SCAN_FILE:
case SCAN_ANON:
/* Scan one type exclusively */
- if ((scan_balance == SCAN_FILE) != file)
+ if ((scan_balance == SCAN_FILE) != file) {
+ size = 0;
scan = 0;
+ }
break;
default:
/* Look ma, no brain */
BUG();
}
+
+ *lru_pages += size;
nr[lru] = scan;
+
/*
* Skip the second pass and don't force_scan,
* if we found something to scan.
@@ -2069,7 +2077,7 @@ out:
* This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
*/
static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
- struct scan_control *sc)
+ struct scan_control *sc, unsigned long *lru_pages)
{
unsigned long nr[NR_LRU_LISTS];
unsigned long targets[NR_LRU_LISTS];
@@ -2080,7 +2088,7 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
struct blk_plug plug;
bool scan_adjusted;

- get_scan_count(lruvec, swappiness, sc, nr);
+ get_scan_count(lruvec, swappiness, sc, nr, lru_pages);

/* Record the original scan target for proportional adjustments later */
memcpy(targets, nr, sizeof(nr));
@@ -2258,7 +2266,8 @@ static inline bool should_continue_reclaim(struct zone *zone,
}
}

-static bool shrink_zone(struct zone *zone, struct scan_control *sc)
+static bool shrink_zone(struct zone *zone, struct scan_control *sc,
+ bool is_classzone)
{
unsigned long nr_reclaimed, nr_scanned;
bool reclaimable = false;
@@ -2269,6 +2278,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc)
.zone = zone,
.priority = sc->priority,
};
+ unsigned long zone_lru_pages = 0;
struct mem_cgroup *memcg;

nr_reclaimed = sc->nr_reclaimed;
@@ -2276,13 +2286,15 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc)

memcg = mem_cgroup_iter(root, NULL, &reclaim);
do {
+ unsigned long lru_pages;
struct lruvec *lruvec;
int swappiness;

lruvec = mem_cgroup_zone_lruvec(zone, memcg);
swappiness = mem_cgroup_swappiness(memcg);

- shrink_lruvec(lruvec, swappiness, sc);
+ shrink_lruvec(lruvec, swappiness, sc, &lru_pages);
+ zone_lru_pages += lru_pages;

/*
* Direct reclaim and kswapd have to scan all memory
@@ -2302,6 +2314,25 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc)
memcg = mem_cgroup_iter(root, memcg, &reclaim);
} while (memcg);

+ /*
+ * Shrink the slab caches in the same proportion that
+ * the eligible LRU pages were scanned.
+ */
+ if (global_reclaim(sc) && is_classzone) {
+ struct reclaim_state *reclaim_state;
+
+ shrink_node_slabs(sc->gfp_mask, zone_to_nid(zone),
+ sc->nr_scanned - nr_scanned,
+ zone_lru_pages);
+
+ reclaim_state = current->reclaim_state;
+ if (reclaim_state) {
+ sc->nr_reclaimed +=
+ reclaim_state->reclaimed_slab;
+ reclaim_state->reclaimed_slab = 0;
+ }
+ }
+
vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
sc->nr_scanned - nr_scanned,
sc->nr_reclaimed - nr_reclaimed);
@@ -2376,12 +2407,7 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
struct zone *zone;
unsigned long nr_soft_reclaimed;
unsigned long nr_soft_scanned;
- unsigned long lru_pages = 0;
- struct reclaim_state *reclaim_state = current->reclaim_state;
gfp_t orig_mask;
- struct shrink_control shrink = {
- .gfp_mask = sc->gfp_mask,
- };
enum zone_type requested_highidx = gfp_zone(sc->gfp_mask);
bool reclaimable = false;

@@ -2394,10 +2420,8 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
if (buffer_heads_over_limit)
sc->gfp_mask |= __GFP_HIGHMEM;

- nodes_clear(shrink.nodes_to_scan);
-
for_each_zone_zonelist_nodemask(zone, z, zonelist,
- gfp_zone(sc->gfp_mask), sc->nodemask) {
+ requested_highidx, sc->nodemask) {
if (!populated_zone(zone))
continue;
/*
@@ -2409,9 +2433,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
GFP_KERNEL | __GFP_HARDWALL))
continue;

- lru_pages += zone_reclaimable_pages(zone);
- node_set(zone_to_nid(zone), shrink.nodes_to_scan);
-
if (sc->priority != DEF_PRIORITY &&
!zone_reclaimable(zone))
continue; /* Let kswapd poll it */
@@ -2450,7 +2471,7 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
/* need some check for avoid more shrink_zone() */
}

- if (shrink_zone(zone, sc))
+ if (shrink_zone(zone, sc, zone_idx(zone) == requested_highidx))
reclaimable = true;

if (global_reclaim(sc) &&
@@ -2459,20 +2480,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
}

/*
- * Don't shrink slabs when reclaiming memory from over limit cgroups
- * but do shrink slab at least once when aborting reclaim for
- * compaction to avoid unevenly scanning file/anon LRU pages over slab
- * pages.
- */
- if (global_reclaim(sc)) {
- shrink_slab(&shrink, sc->nr_scanned, lru_pages);
- if (reclaim_state) {
- sc->nr_reclaimed += reclaim_state->reclaimed_slab;
- reclaim_state->reclaimed_slab = 0;
- }
- }
-
- /*
* Restore to original mask to avoid the impact on the caller if we
* promoted it to __GFP_HIGHMEM.
*/
@@ -2736,6 +2743,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
};
struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
int swappiness = mem_cgroup_swappiness(memcg);
+ unsigned long lru_pages;

sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
@@ -2751,7 +2759,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
* will pick up pages from other mem cgroup's as well. We hack
* the priority and make it zero.
*/
- shrink_lruvec(lruvec, swappiness, &sc);
+ shrink_lruvec(lruvec, swappiness, &sc, &lru_pages);

trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);

@@ -2932,15 +2940,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
static bool kswapd_shrink_zone(struct zone *zone,
int classzone_idx,
struct scan_control *sc,
- unsigned long lru_pages,
unsigned long *nr_attempted)
{
int testorder = sc->order;
unsigned long balance_gap;
- struct reclaim_state *reclaim_state = current->reclaim_state;
- struct shrink_control shrink = {
- .gfp_mask = sc->gfp_mask,
- };
bool lowmem_pressure;

/* Reclaim above the high watermark. */
@@ -2975,13 +2978,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
balance_gap, classzone_idx))
return true;

- shrink_zone(zone, sc);
- nodes_clear(shrink.nodes_to_scan);
- node_set(zone_to_nid(zone), shrink.nodes_to_scan);
-
- reclaim_state->reclaimed_slab = 0;
- shrink_slab(&shrink, sc->nr_scanned, lru_pages);
- sc->nr_reclaimed += reclaim_state->reclaimed_slab;
+ shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);

/* Account for the number of pages attempted to reclaim */
*nr_attempted += sc->nr_to_reclaim;
@@ -3042,7 +3039,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
count_vm_event(PAGEOUTRUN);

do {
- unsigned long lru_pages = 0;
unsigned long nr_attempted = 0;
bool raise_priority = true;
bool pgdat_needs_compaction = (order > 0);
@@ -3102,8 +3098,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
if (!populated_zone(zone))
continue;

- lru_pages += zone_reclaimable_pages(zone);
-
/*
* If any zone is currently balanced then kswapd will
* not call compaction as it is expected that the
@@ -3159,8 +3153,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
* that that high watermark would be met at 100%
* efficiency.
*/
- if (kswapd_shrink_zone(zone, end_zone, &sc,
- lru_pages, &nr_attempted))
+ if (kswapd_shrink_zone(zone, end_zone,
+ &sc, &nr_attempted))
raise_priority = false;
}

@@ -3612,10 +3606,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
.may_swap = 1,
};
- struct shrink_control shrink = {
- .gfp_mask = sc.gfp_mask,
- };
- unsigned long nr_slab_pages0, nr_slab_pages1;

cond_resched();
/*
@@ -3634,44 +3624,10 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
* priorities until we have enough memory freed.
*/
do {
- shrink_zone(zone, &sc);
+ shrink_zone(zone, &sc, true);
} while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0);
}

- nr_slab_pages0 = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
- if (nr_slab_pages0 > zone->min_slab_pages) {
- /*
- * shrink_slab() does not currently allow us to determine how
- * many pages were freed in this zone. So we take the current
- * number of slab pages and shake the slab until it is reduced
- * by the same nr_pages that we used for reclaiming unmapped
- * pages.
- */
- nodes_clear(shrink.nodes_to_scan);
- node_set(zone_to_nid(zone), shrink.nodes_to_scan);
- for (;;) {
- unsigned long lru_pages = zone_reclaimable_pages(zone);
-
- /* No reclaimable slab or very low memory pressure */
- if (!shrink_slab(&shrink, sc.nr_scanned, lru_pages))
- break;
-
- /* Freed enough memory */
- nr_slab_pages1 = zone_page_state(zone,
- NR_SLAB_RECLAIMABLE);
- if (nr_slab_pages1 + nr_pages <= nr_slab_pages0)
- break;
- }
-
- /*
- * Update nr_reclaimed by the number of slab pages we
- * reclaimed from this zone.
- */
- nr_slab_pages1 = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
- if (nr_slab_pages1 < nr_slab_pages0)
- sc.nr_reclaimed += nr_slab_pages0 - nr_slab_pages1;
- }
-
p->reclaim_state = NULL;
current->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE);
lockdep_clear_current_reclaim_state();
--
2.1.3


2014-11-26 21:41:24

by Andrew Morton

Subject: Re: [patch] mm: vmscan: invoke slab shrinkers from shrink_zone()

On Tue, 25 Nov 2014 13:23:50 -0500 Johannes Weiner <[email protected]> wrote:

> The slab shrinkers are currently invoked from the zonelist walkers in
> kswapd, direct reclaim, and zone reclaim, all of which roughly gauge
> the eligible LRU pages and assemble a nodemask to pass to NUMA-aware
> shrinkers, which then again have to walk over the nodemask. This is
> redundant code, extra runtime work, and fairly inaccurate when it
> comes to the estimation of actually scannable LRU pages. The code
> duplication will only get worse when making the shrinkers cgroup-aware
> and requiring them to have out-of-band cgroup hierarchy walks as well.
>
> Instead, invoke the shrinkers from shrink_zone(), which is where all
> reclaimers end up, to avoid this duplication.
>
> Take the count for eligible LRU pages out of get_scan_count(), which
> considers many more factors than just the availability of swap space,
> like zone_reclaimable_pages() currently does. Accumulate the number
> over all visited lruvecs to get the per-zone value.
>
> Some nodes have multiple zones due to memory addressing restrictions.
> To avoid putting too much pressure on the shrinkers, only invoke them
> once for each such node, using the class zone of the allocation as the
> pivot zone.
>
> For now, this integrates the slab shrinking better into the reclaim
> logic and gets rid of duplicative invocations from kswapd, direct
> reclaim, and zone reclaim. It also prepares for cgroup-awareness,
> allowing memcg-capable shrinkers to be added at the lruvec level
> without much duplication of both code and runtime work.
>
> This changes kswapd behavior, which used to invoke the shrinkers for
> each zone, but with scan ratios gathered from the entire node,
> resulting in meaningless pressure quantities on multi-zone nodes.

It's a troublesome patch - we've been poking at this code for years and
now it gets significantly upended. It all *seems* sensible, but any
warts will take time to identify.

> Zone reclaim behavior also changes. It used to shrink slabs until the
> same amount of pages were shrunk as were reclaimed from the LRUs. Now
> it merely invokes the shrinkers once with the zone's scan ratio, which
> makes the shrinkers go easier on caches that implement aging and would
> prefer feeding back pressure from recently used slab objects to unused
> LRU pages.

hm, "go easier on caches" means it changes reclaim balancing. Is the
result better or worse?

2014-11-28 16:06:47

by Vladimir Davydov

Subject: Re: [patch] mm: vmscan: invoke slab shrinkers from shrink_zone()

Hi Johannes,

The patch generally looks good to me, because it simplifies the code
flow significantly and makes it easier for me to introduce per memcg
slab reclaim (thanks!). However, it has one serious flaw. Please see the
comment inline.

On Tue, Nov 25, 2014 at 01:23:50PM -0500, Johannes Weiner wrote:
[...]
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a384339bf718..8c2b45bfe610 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
[...]
> @@ -2376,12 +2407,7 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> struct zone *zone;
> unsigned long nr_soft_reclaimed;
> unsigned long nr_soft_scanned;
> - unsigned long lru_pages = 0;
> - struct reclaim_state *reclaim_state = current->reclaim_state;
> gfp_t orig_mask;
> - struct shrink_control shrink = {
> - .gfp_mask = sc->gfp_mask,
> - };
> enum zone_type requested_highidx = gfp_zone(sc->gfp_mask);
> bool reclaimable = false;
>
> @@ -2394,10 +2420,8 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> if (buffer_heads_over_limit)
> sc->gfp_mask |= __GFP_HIGHMEM;
>
> - nodes_clear(shrink.nodes_to_scan);
> -
> for_each_zone_zonelist_nodemask(zone, z, zonelist,
> - gfp_zone(sc->gfp_mask), sc->nodemask) {
> + requested_highidx, sc->nodemask) {
> if (!populated_zone(zone))
> continue;
> /*
> @@ -2409,9 +2433,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> GFP_KERNEL | __GFP_HARDWALL))
> continue;
>
> - lru_pages += zone_reclaimable_pages(zone);
> - node_set(zone_to_nid(zone), shrink.nodes_to_scan);
> -
> if (sc->priority != DEF_PRIORITY &&
> !zone_reclaimable(zone))
> continue; /* Let kswapd poll it */
> @@ -2450,7 +2471,7 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> /* need some check for avoid more shrink_zone() */
> }
>
> - if (shrink_zone(zone, sc))
> + if (shrink_zone(zone, sc, zone_idx(zone) == requested_highidx))
> reclaimable = true;
>
> if (global_reclaim(sc) &&

If the highest zone (zone_idx=requested_highidx) is not populated, we
won't scan slab caches on direct reclaim, which may result in OOM kill
even if there are plenty of freeable dentries available.

It's especially relevant for VMs, which often have less than 4G of RAM,
in which case we will only have ZONE_DMA and ZONE_DMA32 populated and
empty ZONE_NORMAL on x86_64.

What about distributing the pressure proportionally to the number of
present pages on the zone? Something like this:

From 5face0e29300950575bf9ccbd995828e2f183da1 Mon Sep 17 00:00:00 2001
From: Vladimir Davydov <[email protected]>
Date: Fri, 28 Nov 2014 17:58:43 +0300
Subject: [PATCH] vmscan: fix slab vs page cache reclaim balance

Though page cache reclaim is done per-zone, the finest granularity for
slab cache reclaim is per node. To achieve proportional pressure being
put on them, we therefore shrink slab caches only when scanning the
class zone, which is the highest zone suitable for allocations. However,
the class zone may be empty, e.g. ZONE_NORMAL/ZONE_HIGH, which are class
zones for most allocations, are empty on x86_64 with < 4G of RAM. This
will result in slab cache being scanned only by kswapd, which in turn
may lead to a premature OOM kill.

This patch attempts to fix this by calling shrink_node_slabs per each
zone eligible while distributing the pressure between zones
proportionally to the number of present pages.

Signed-off-by: Vladimir Davydov <[email protected]>

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9130cf67bac1..dd80625a1be5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2298,8 +2298,7 @@ static inline bool should_continue_reclaim(struct zone *zone,
}
}

-static bool shrink_zone(struct zone *zone, struct scan_control *sc,
- bool is_classzone)
+static bool shrink_zone(struct zone *zone, struct scan_control *sc)
{
unsigned long nr_reclaimed, nr_scanned;
bool reclaimable = false;
@@ -2310,7 +2309,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
.zone = zone,
.priority = sc->priority,
};
- unsigned long zone_lru_pages = 0;
+ unsigned long nr_eligible = 0;
struct mem_cgroup *memcg;

nr_reclaimed = sc->nr_reclaimed;
@@ -2319,6 +2318,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
memcg = mem_cgroup_iter(root, NULL, &reclaim);
do {
unsigned long lru_pages;
+ unsigned long long tmp;
struct lruvec *lruvec;
int swappiness;

@@ -2326,7 +2326,17 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
swappiness = mem_cgroup_swappiness(memcg);

shrink_lruvec(lruvec, swappiness, sc, &lru_pages);
- zone_lru_pages += lru_pages;
+
+ /*
+ * Scale lru_pages inversely proportionally to the zone
+ * size in order to not over-reclaim slab caches, which
+ * are zone unaware.
+ */
+ tmp = lru_pages;
+ tmp *= zone->zone_pgdat->node_present_pages;
+ do_div(tmp, zone->present_pages);
+
+ nr_eligible += tmp;

/*
* Direct reclaim and kswapd have to scan all memory
@@ -2350,12 +2360,12 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
* Shrink the slab caches in the same proportion that
* the eligible LRU pages were scanned.
*/
- if (global_reclaim(sc) && is_classzone) {
+ if (global_reclaim(sc)) {
struct reclaim_state *reclaim_state;

shrink_node_slabs(sc->gfp_mask, zone_to_nid(zone),
sc->nr_scanned - nr_scanned,
- zone_lru_pages);
+ nr_eligible);

reclaim_state = current->reclaim_state;
if (reclaim_state) {
@@ -2503,7 +2513,7 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
/* need some check for avoid more shrink_zone() */
}

- if (shrink_zone(zone, sc, zone_idx(zone) == requested_highidx))
+ if (shrink_zone(zone, sc))
reclaimable = true;

if (global_reclaim(sc) &&
@@ -3010,7 +3020,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
balance_gap, classzone_idx))
return true;

- shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);
+ shrink_zone(zone, sc);

/* Account for the number of pages attempted to reclaim */
*nr_attempted += sc->nr_to_reclaim;
@@ -3656,7 +3666,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
* priorities until we have enough memory freed.
*/
do {
- shrink_zone(zone, &sc, true);
+ shrink_zone(zone, &sc);
} while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0);
}

2015-04-16 03:56:20

by Joonsoo Kim

Subject: Re: [patch] mm: vmscan: invoke slab shrinkers from shrink_zone()

Hello, Johannes.

Cc'ing Vlastimil, because this patch causes some regression on the
stress-highalloc test in mmtests, and he is an expert on compaction
and would be interested in it. :)

On Fri, Nov 28, 2014 at 07:06:37PM +0300, Vladimir Davydov wrote:
> Hi Johannes,
>
> The patch generally looks good to me, because it simplifies the code
> flow significantly and makes it easier for me to introduce per memcg
> slab reclaim (thanks!). However, it has one serious flaw. Please see the
> comment inline.
>
> On Tue, Nov 25, 2014 at 01:23:50PM -0500, Johannes Weiner wrote:
> [...]
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index a384339bf718..8c2b45bfe610 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> [...]
> > @@ -2376,12 +2407,7 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> > struct zone *zone;
> > unsigned long nr_soft_reclaimed;
> > unsigned long nr_soft_scanned;
> > - unsigned long lru_pages = 0;
> > - struct reclaim_state *reclaim_state = current->reclaim_state;
> > gfp_t orig_mask;
> > - struct shrink_control shrink = {
> > - .gfp_mask = sc->gfp_mask,
> > - };
> > enum zone_type requested_highidx = gfp_zone(sc->gfp_mask);
> > bool reclaimable = false;
> >
> > @@ -2394,10 +2420,8 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> > if (buffer_heads_over_limit)
> > sc->gfp_mask |= __GFP_HIGHMEM;
> >
> > - nodes_clear(shrink.nodes_to_scan);
> > -
> > for_each_zone_zonelist_nodemask(zone, z, zonelist,
> > - gfp_zone(sc->gfp_mask), sc->nodemask) {
> > + requested_highidx, sc->nodemask) {
> > if (!populated_zone(zone))
> > continue;
> > /*
> > @@ -2409,9 +2433,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> > GFP_KERNEL | __GFP_HARDWALL))
> > continue;
> >
> > - lru_pages += zone_reclaimable_pages(zone);
> > - node_set(zone_to_nid(zone), shrink.nodes_to_scan);
> > -
> > if (sc->priority != DEF_PRIORITY &&
> > !zone_reclaimable(zone))
> > continue; /* Let kswapd poll it */
> > @@ -2450,7 +2471,7 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> > /* need some check for avoid more shrink_zone() */
> > }
> >
> > - if (shrink_zone(zone, sc))
> > + if (shrink_zone(zone, sc, zone_idx(zone) == requested_highidx))
> > reclaimable = true;
> >
> > if (global_reclaim(sc) &&
>
> If the highest zone (zone_idx=requested_highidx) is not populated, we
> won't scan slab caches on direct reclaim, which may result in OOM kill
> even if there are plenty of freeable dentries available.
>
> It's especially relevant for VMs, which often have less than 4G of RAM,
> in which case we will only have ZONE_DMA and ZONE_DMA32 populated and
> empty ZONE_NORMAL on x86_64.

I hit a problem similar to the one Vladimir mentions above when I ran the
stress-highalloc benchmark. My test system has ZONE_DMA, ZONE_DMA32 and
ZONE_NORMAL zones, as follows:

Node 0, zone DMA
spanned 4095
present 3998
managed 3977
Node 0, zone DMA32
spanned 1044480
present 782333
managed 762561
Node 0, zone Normal
spanned 262144
present 262144
managed 245318

Presumably, requested_highidx is ZONE_NORMAL for almost all normal
allocation requests.

When I run the stress-highalloc benchmark, shrink_zone() on the
requested_highidx zone is frequently skipped in kswapd_shrink_zone()
because that zone is already balanced. But another zone, for example
DMA32, which has more memory, isn't balanced, so kswapd tries to reclaim
from that zone. Since zone_idx(zone) == classzone_idx isn't true for that
zone, shrink_slab() is skipped and we can't age slab objects at the same
ratio as the LRU pages. This could also happen on the direct reclaim path,
as Vladimir mentioned.

This causes the following success rate regression in phases 1 and 2 of the
stress-highalloc benchmark. In phases 1 and 2, many high-order allocations
are requested while many threads do kernel builds in parallel.

Base: Run 1
Ops 1 33.00 ( 0.00%)
Ops 2 43.00 ( 0.00%)
Ops 3 80.00 ( 0.00%)
Base: Run 2
Ops 1 33.00 ( 0.00%)
Ops 2 44.00 ( 0.00%)
Ops 3 80.00 ( 0.00%)
Base: Run 3
Ops 1 30.00 ( 0.00%)
Ops 2 44.00 ( 0.00%)
Ops 3 80.00 ( 0.00%)

Revert offending commit: Run 1
Ops 1 46.00 ( 0.00%)
Ops 2 53.00 ( 0.00%)
Ops 3 80.00 ( 0.00%)
Revert offending commit: Run 2
Ops 1 48.00 ( 0.00%)
Ops 2 55.00 ( 0.00%)
Ops 3 80.00 ( 0.00%)
Revert offending commit: Run 3
Ops 1 48.00 ( 0.00%)
Ops 2 55.00 ( 0.00%)
Ops 3 81.00 ( 0.00%)

I'm not sure how much weight we should give to this benchmark's regression,
because real-life compaction behaviour would differ from this benchmark.
Anyway, I have some questions related to this patch. I don't know this code
very well, so please correct me if I'm wrong.

I read the patch carefully and there are two main differences between before
and after. One is how the aging ratio is calculated. Before, we used the
number of LRU pages in the node, but this patch uses the number of LRU pages
in the zone. If I understand correctly, shrink_slab() works on a node rather
than a single zone, and I guess the ratio calculated from the zone's LRU page
count could fluctuate more than the node's. Is it reasonable to use the
zone's?

And should we guarantee one invocation of shrink_slab() in the above cases?
When I tested that, the benchmark results were partly restored.

Guarantee one time invocation: Run 1
Ops 1 30.00 ( 0.00%)
Ops 2 47.00 ( 0.00%)
Ops 3 80.00 ( 0.00%)
Guarantee one time invocation: Run 2
Ops 1 43.00 ( 0.00%)
Ops 2 45.00 ( 0.00%)
Ops 3 78.00 ( 0.00%)
Guarantee one time invocation: Run 3
Ops 1 39.00 ( 0.00%)
Ops 2 45.00 ( 0.00%)
Ops 3 80.00 ( 0.00%)

Thanks.

>
> What about distributing the pressure proportionally to the number of
> present pages on the zone? Something like this:
>
> >From 5face0e29300950575bf9ccbd995828e2f183da1 Mon Sep 17 00:00:00 2001
> From: Vladimir Davydov <[email protected]>
> Date: Fri, 28 Nov 2014 17:58:43 +0300
> Subject: [PATCH] vmscan: fix slab vs page cache reclaim balance
>
> Though page cache reclaim is done per-zone, the finest granularity for
> slab cache reclaim is per node. To achieve proportional pressure being
> put on them, we therefore shrink slab caches only when scanning the
> class zone, which is the highest zone suitable for allocations. However,
> the class zone may be empty, e.g. ZONE_NORMAL/ZONE_HIGH, which are class
> zones for most allocations, are empty on x86_64 with < 4G of RAM. This
> will result in slab cache being scanned only by kswapd, which in turn
> may lead to a premature OOM kill.
>
> This patch attempts to fix this by calling shrink_node_slabs per each
> zone eligible while distributing the pressure between zones
> proportionally to the number of present pages.
>
> Signed-off-by: Vladimir Davydov <[email protected]>
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9130cf67bac1..dd80625a1be5 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2298,8 +2298,7 @@ static inline bool should_continue_reclaim(struct zone *zone,
> }
> }
>
> -static bool shrink_zone(struct zone *zone, struct scan_control *sc,
> - bool is_classzone)
> +static bool shrink_zone(struct zone *zone, struct scan_control *sc)
> {
> unsigned long nr_reclaimed, nr_scanned;
> bool reclaimable = false;
> @@ -2310,7 +2309,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
> .zone = zone,
> .priority = sc->priority,
> };
> - unsigned long zone_lru_pages = 0;
> + unsigned long nr_eligible = 0;
> struct mem_cgroup *memcg;
>
> nr_reclaimed = sc->nr_reclaimed;
> @@ -2319,6 +2318,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
> memcg = mem_cgroup_iter(root, NULL, &reclaim);
> do {
> unsigned long lru_pages;
> + unsigned long long tmp;
> struct lruvec *lruvec;
> int swappiness;
>
> @@ -2326,7 +2326,17 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
> swappiness = mem_cgroup_swappiness(memcg);
>
> shrink_lruvec(lruvec, swappiness, sc, &lru_pages);
> - zone_lru_pages += lru_pages;
> +
> + /*
> + * Scale lru_pages inversely proportionally to the zone
> + * size in order to not over-reclaim slab caches, which
> + * are zone unaware.
> + */
> + tmp = lru_pages;
> + tmp *= zone->zone_pgdat->node_present_pages;
> + do_div(tmp, zone->present_pages);
> +
> + nr_eligible += tmp;
>
> /*
> * Direct reclaim and kswapd have to scan all memory
> @@ -2350,12 +2360,12 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
> * Shrink the slab caches in the same proportion that
> * the eligible LRU pages were scanned.
> */
> - if (global_reclaim(sc) && is_classzone) {
> + if (global_reclaim(sc)) {
> struct reclaim_state *reclaim_state;
>
> shrink_node_slabs(sc->gfp_mask, zone_to_nid(zone),
> sc->nr_scanned - nr_scanned,
> - zone_lru_pages);
> + nr_eligible);
>
> reclaim_state = current->reclaim_state;
> if (reclaim_state) {
> @@ -2503,7 +2513,7 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> /* need some check for avoid more shrink_zone() */
> }
>
> - if (shrink_zone(zone, sc, zone_idx(zone) == requested_highidx))
> + if (shrink_zone(zone, sc))
> reclaimable = true;
>
> if (global_reclaim(sc) &&
> @@ -3010,7 +3020,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
> balance_gap, classzone_idx))
> return true;
>
> - shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);
> + shrink_zone(zone, sc);
>
> /* Account for the number of pages attempted to reclaim */
> *nr_attempted += sc->nr_to_reclaim;
> @@ -3656,7 +3666,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> * priorities until we have enough memory freed.
> */
> do {
> - shrink_zone(zone, &sc, true);
> + shrink_zone(zone, &sc);
> } while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0);
> }
>
>

2015-04-16 14:34:36

by Johannes Weiner

Subject: Re: [patch] mm: vmscan: invoke slab shrinkers from shrink_zone()

Hi Joonsoo,

On Thu, Apr 16, 2015 at 12:57:36PM +0900, Joonsoo Kim wrote:
> Hello, Johannes.
>
> Ccing Vlastimil, because this patch causes some regression on
> stress-highalloc test in mmtests and he is a expert on compaction
> and would have interest on it. :)
>
> On Fri, Nov 28, 2014 at 07:06:37PM +0300, Vladimir Davydov wrote:
> > If the highest zone (zone_idx=requested_highidx) is not populated, we
> > won't scan slab caches on direct reclaim, which may result in OOM kill
> > even if there are plenty of freeable dentries available.
> >
> > It's especially relevant for VMs, which often have less than 4G of RAM,
> > in which case we will only have ZONE_DMA and ZONE_DMA32 populated and
> > empty ZONE_NORMAL on x86_64.
>
> I got similar problem mentioned above by Vladimir when I test stress-highest
> benchmark. My test system has ZONE_DMA and ZONE_DMA32 and ZONE_NORMAL zones
> like as following.
>
> Node 0, zone DMA
> spanned 4095
> present 3998
> managed 3977
> Node 0, zone DMA32
> spanned 1044480
> present 782333
> managed 762561
> Node 0, zone Normal
> spanned 262144
> present 262144
> managed 245318
>
> Perhaps, requested_highidx would be ZONE_NORMAL for almost normal
> allocation request.
>
> When I test stress-highalloc benchmark, shrink_zone() on requested_highidx
> zone in kswapd_shrink_zone() is frequently skipped because this zone is
> already balanced. But, another zone, for example, DMA32, which has more memory,
> isn't balanced so kswapd try to reclaim on that zone. But,
> zone_idx(zone) == classzone_idx isn't true for that zone so
> shrink_slab() is skipped and we can't age slab objects with same ratio
> of lru pages.

No, kswapd_shrink_zone() has the highest *unbalanced* zone as the
classzone. When Normal is balanced but DMA32 is not, then kswapd
scans DMA and DMA32 and invokes the shrinkers for DMA32.

> This could be also possible on direct reclaim path as Vladimir
> mentioned.

Direct reclaim ignores watermarks and always scans a zone. The
problem is only with completely unpopulated zones, but Vladimir
addressed that.

> This causes following success rate regression of phase 1,2 on stress-highalloc
> benchmark. The situation of phase 1,2 is that many high order allocations are
> requested while many threads do kernel build in parallel.

Yes, the patch made the shrinkers on multi-zone nodes less aggressive.
From the changelog:

This changes kswapd behavior, which used to invoke the shrinkers for each
zone, but with scan ratios gathered from the entire node, resulting in
meaningless pressure quantities on multi-zone nodes.

So the previous code *did* apply more pressure on the shrinkers, but
it didn't make any sense. The number of slab objects to scan for each
scanned LRU page depended on how many zones there were in a node, and
their relative sizes. So a node with a large DMA32 and a small Normal
would receive vastly different relative slab pressure than a node with
only one big zone Normal. That's not something we should revert to.

If we are too weak on objects compared to LRU pages then we should
adjust DEFAULT_SEEKS or individual shrinker settings.
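
For reference, those knobs live in struct shrinker. A minimal, hypothetical
registration would look roughly like this, with my_count() and my_scan()
standing in for a real cache's callbacks:

        static struct shrinker my_shrinker = {
                .count_objects = my_count,      /* number of freeable objects */
                .scan_objects  = my_scan,       /* free up to sc->nr_to_scan objects */
                .seeks         = DEFAULT_SEEKS, /* 2: relative cost to recreate an object */
                .batch         = 0,             /* 0 means the default SHRINK_BATCH */
                .flags         = SHRINKER_NUMA_AWARE,
        };

        register_shrinker(&my_shrinker);

Raising .seeks makes reclaim scan fewer of that cache's objects per scanned
LRU page; lowering it scans more.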

If we think our pressure ratio is accurate but we don't reclaim enough
compared to our compaction efforts, then any adjustments to improve
huge page success rate should come from the allocator/compaction side.

> Base: Run 1
> Ops 1 33.00 ( 0.00%)
> Ops 2 43.00 ( 0.00%)
> Ops 3 80.00 ( 0.00%)
> Base: Run 2
> Ops 1 33.00 ( 0.00%)
> Ops 2 44.00 ( 0.00%)
> Ops 3 80.00 ( 0.00%)
> Base: Run 3
> Ops 1 30.00 ( 0.00%)
> Ops 2 44.00 ( 0.00%)
> Ops 3 80.00 ( 0.00%)
>
> Revert offending commit: Run 1
> Ops 1 46.00 ( 0.00%)
> Ops 2 53.00 ( 0.00%)
> Ops 3 80.00 ( 0.00%)
> Revert offending commit: Run 2
> Ops 1 48.00 ( 0.00%)
> Ops 2 55.00 ( 0.00%)
> Ops 3 80.00 ( 0.00%)
> Revert offending commit: Run 3
> Ops 1 48.00 ( 0.00%)
> Ops 2 55.00 ( 0.00%)
> Ops 3 81.00 ( 0.00%)
>
> I'm not sure whether we should consider this benchmark's regression very much,
> because real life's compaction behavious would be different with this
> benchmark. Anyway, I have some questions related to this patch. I don't know
> this code very well so please correct me if I'm wrong.
>
> I read the patch carefully and there is two main differences between before
> and after. One is the way of aging ratio calculation. Before, we use number of
> lru pages in node, but, this patch uses number of lru pages in zone. As I
> understand correctly, shrink_slab() works for a node range rather than
> zone one. And, I guess that calculated ratio with zone's number of lru pages
> could be more fluctuate than node's one. Is it reasonable to use zone's one?

The page allocator distributes allocations evenly among the zones in a
node, so the fluctuation should be fairly low.

And we scan the LRUs in chunks of 32 pages, which gives us good enough
ratio granularity on even tiny zones (1/8th on a hypothetical 1M zone).
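(Assuming 4k pages, a 1M zone holds 256 pages, and 256 / 32 = 8 scan chunks,
hence the 1/8th figure.)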

> And, should we guarantee one time invocation of shrink_slab() in above cases?
> When I tested it, benchmark result is restored a little.
>
> Guarantee one time invocation: Run 1
> Ops 1 30.00 ( 0.00%)
> Ops 2 47.00 ( 0.00%)
> Ops 3 80.00 ( 0.00%)
> Guarantee one time invocation: Run 2
> Ops 1 43.00 ( 0.00%)
> Ops 2 45.00 ( 0.00%)
> Ops 3 78.00 ( 0.00%)
> Guarantee one time invocation: Run 3
> Ops 1 39.00 ( 0.00%)
> Ops 2 45.00 ( 0.00%)
> Ops 3 80.00 ( 0.00%)

It should already invoke the shrinkers at least once per node. Could
you tell me how you changed the code for this test?

Thanks,
Johannes

2015-04-16 23:18:12

by Dave Chinner

Subject: Re: [patch] mm: vmscan: invoke slab shrinkers from shrink_zone()

On Thu, Apr 16, 2015 at 10:34:13AM -0400, Johannes Weiner wrote:
> On Thu, Apr 16, 2015 at 12:57:36PM +0900, Joonsoo Kim wrote:
> > This causes following success rate regression of phase 1,2 on stress-highalloc
> > benchmark. The situation of phase 1,2 is that many high order allocations are
> > requested while many threads do kernel build in parallel.
>
> Yes, the patch made the shrinkers on multi-zone nodes less aggressive.
> From the changelog:
>
> This changes kswapd behavior, which used to invoke the shrinkers for each
> zone, but with scan ratios gathered from the entire node, resulting in
> meaningless pressure quantities on multi-zone nodes.
>
> So the previous code *did* apply more pressure on the shrinkers, but
> it didn't make any sense. The number of slab objects to scan for each
> scanned LRU page depended on how many zones there were in a node, and
> their relative sizes. So a node with a large DMA32 and a small Normal
> would receive vastly different relative slab pressure than a node with
> only one big zone Normal. That's not something we should revert to.
>
> If we are too weak on objects compared to LRU pages then we should
> adjust DEFAULT_SEEKS or individual shrinker settings.

Now this thread has my attention. Changing shrinker defaults will
seriously upset the memory balance under load (in unpredictable
ways) so I really don't think we should even consider changing
DEFAULT_SEEKS.

If there's a shrinker imbalance, we need to understand which
shrinker needs rebalancing, then modify that shrinker's
configuration and then observe the impact this has on the rest of
the system. This means looking at variance of the memory footprint
in steady state, reclaim overshoot and damping rates before steady
state is acheived, etc. Balancing multiple shrinkers (especially
those with dependencies on other caches) under memory
load is a non-trivial undertaking.

I don't see any evidence that we have a shrinker imbalance, so I
really suspect the problem is "shrinkers aren't doing enough work".
In that case, we need to increase the pressure being generated, not
start fiddling around with shrinker configurations.

> If we think our pressure ratio is accurate but we don't reclaim enough
> compared to our compaction efforts, then any adjustments to improve
> huge page successrate should come from the allocator/compaction side.

Right - if compaction is failing, then the problem is more likely
that it isn't generating enough pressure, and so the shrinkers
aren't doing the work we are expecting them to do. That's a problem
with compaction, not the shrinkers...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2015-04-17 05:05:36

by Joonsoo Kim

Subject: Re: [patch] mm: vmscan: invoke slab shrinkers from shrink_zone()

On Thu, Apr 16, 2015 at 10:34:13AM -0400, Johannes Weiner wrote:
> Hi Joonsoo,
>
> On Thu, Apr 16, 2015 at 12:57:36PM +0900, Joonsoo Kim wrote:
> > Hello, Johannes.
> >
> > Ccing Vlastimil, because this patch causes some regression on
> > stress-highalloc test in mmtests and he is a expert on compaction
> > and would have interest on it. :)
> >
> > On Fri, Nov 28, 2014 at 07:06:37PM +0300, Vladimir Davydov wrote:
> > > If the highest zone (zone_idx=requested_highidx) is not populated, we
> > > won't scan slab caches on direct reclaim, which may result in OOM kill
> > > even if there are plenty of freeable dentries available.
> > >
> > > It's especially relevant for VMs, which often have less than 4G of RAM,
> > > in which case we will only have ZONE_DMA and ZONE_DMA32 populated and
> > > empty ZONE_NORMAL on x86_64.
> >
> > I got similar problem mentioned above by Vladimir when I test stress-highest
> > benchmark. My test system has ZONE_DMA and ZONE_DMA32 and ZONE_NORMAL zones
> > like as following.
> >
> > Node 0, zone DMA
> > spanned 4095
> > present 3998
> > managed 3977
> > Node 0, zone DMA32
> > spanned 1044480
> > present 782333
> > managed 762561
> > Node 0, zone Normal
> > spanned 262144
> > present 262144
> > managed 245318
> >
> > Perhaps, requested_highidx would be ZONE_NORMAL for almost normal
> > allocation request.
> >
> > When I test stress-highalloc benchmark, shrink_zone() on requested_highidx
> > zone in kswapd_shrink_zone() is frequently skipped because this zone is
> > already balanced. But, another zone, for example, DMA32, which has more memory,
> > isn't balanced so kswapd try to reclaim on that zone. But,
> > zone_idx(zone) == classzone_idx isn't true for that zone so
> > shrink_slab() is skipped and we can't age slab objects with same ratio
> > of lru pages.
>
> No, kswapd_shrink_zone() has the highest *unbalanced* zone as the
> classzone. When Normal is balanced but DMA32 is not, then kswapd
> scans DMA and DMA32 and invokes the shrinkers for DMA32.

Hmm... there are some corner cases related to compaction.
kswapd picks the highest *unbalanced* zone based on sc->order, but
in kswapd_shrink_zone(), testorder can be changed to 0 if running
compaction is possible. In that case, the following check can be true and
kswapd skips shrinking that highest unbalanced zone.

    if (!lowmem_pressure &&
        zone_balanced(zone, testorder, balance_gap, classzone_idx))

If this happens and a lower zone is unbalanced, shrink_zone() would be
called for it, but shrink_slab() could be skipped.

Anyway, I suspected that this was the cause of the regression, but I found
that it isn't. When the highest unbalanced zone is skipped because
compaction is possible, the lower zones are also balanced and all shrinking
is skipped in my test. See below.

>
> > This could be also possible on direct reclaim path as Vladimir
> > mentioned.
>
> Direct reclaim ignores watermarks and always scans a zone. The
> problem is only with completely unpopulated zones, but Vladimir
> addressed that.

There is also a similar compaction-related corner case on the direct
reclaim path.

> > This causes following success rate regression of phase 1,2 on stress-highalloc
> > benchmark. The situation of phase 1,2 is that many high order allocations are
> > requested while many threads do kernel build in parallel.
>
> Yes, the patch made the shrinkers on multi-zone nodes less aggressive.
> >From the changelog:
>
> This changes kswapd behavior, which used to invoke the shrinkers for each
> zone, but with scan ratios gathered from the entire node, resulting in
> meaningless pressure quantities on multi-zone nodes.
>
> So the previous code *did* apply more pressure on the shrinkers, but
> it didn't make any sense. The number of slab objects to scan for each
> scanned LRU page depended on how many zones there were in a node, and
> their relative sizes. So a node with a large DMA32 and a small Normal
> would receive vastly different relative slab pressure than a node with
> only one big zone Normal. That's not something we should revert to.

Yes, I agree that the previous code didn't make any sense.

> If we are too weak on objects compared to LRU pages then we should
> adjust DEFAULT_SEEKS or individual shrinker settings.
>
> If we think our pressure ratio is accurate but we don't reclaim enough
> compared to our compaction efforts, then any adjustments to improve
> huge page successrate should come from the allocator/compaction side.

Yes, I agree. Before tackling the compaction side, I'd like to confirm
how the shrinker works and that it has no problems.

> > Base: Run 1
> > Ops 1 33.00 ( 0.00%)
> > Ops 2 43.00 ( 0.00%)
> > Ops 3 80.00 ( 0.00%)
> > Base: Run 2
> > Ops 1 33.00 ( 0.00%)
> > Ops 2 44.00 ( 0.00%)
> > Ops 3 80.00 ( 0.00%)
> > Base: Run 3
> > Ops 1 30.00 ( 0.00%)
> > Ops 2 44.00 ( 0.00%)
> > Ops 3 80.00 ( 0.00%)
> >
> > Revert offending commit: Run 1
> > Ops 1 46.00 ( 0.00%)
> > Ops 2 53.00 ( 0.00%)
> > Ops 3 80.00 ( 0.00%)
> > Revert offending commit: Run 2
> > Ops 1 48.00 ( 0.00%)
> > Ops 2 55.00 ( 0.00%)
> > Ops 3 80.00 ( 0.00%)
> > Revert offending commit: Run 3
> > Ops 1 48.00 ( 0.00%)
> > Ops 2 55.00 ( 0.00%)
> > Ops 3 81.00 ( 0.00%)
> >
> > I'm not sure whether we should consider this benchmark's regression very much,
> > because real life's compaction behavious would be different with this
> > benchmark. Anyway, I have some questions related to this patch. I don't know
> > this code very well so please correct me if I'm wrong.
> >
> > I read the patch carefully and there is two main differences between before
> > and after. One is the way of aging ratio calculation. Before, we use number of
> > lru pages in node, but, this patch uses number of lru pages in zone. As I
> > understand correctly, shrink_slab() works for a node range rather than
> > zone one. And, I guess that calculated ratio with zone's number of lru pages
> > could be more fluctuate than node's one. Is it reasonable to use zone's one?
>
> The page allocator distributes allocations evenly among the zones in a
> node, so the fluctuation should be fairly low.
>
> And we scan the LRUs in chunks of 32 pages, which gives us good enough
> ratio granularity on even tiny zones (1/8th on a hypothetical 1M zone).

Okay.

> > And, should we guarantee one time invocation of shrink_slab() in above cases?
> > When I tested it, benchmark result is restored a little.
> >
> > Guarantee one time invocation: Run 1
> > Ops 1 30.00 ( 0.00%)
> > Ops 2 47.00 ( 0.00%)
> > Ops 3 80.00 ( 0.00%)
> > Guarantee one time invocation: Run 2
> > Ops 1 43.00 ( 0.00%)
> > Ops 2 45.00 ( 0.00%)
> > Ops 3 78.00 ( 0.00%)
> > Guarantee one time invocation: Run 3
> > Ops 1 39.00 ( 0.00%)
> > Ops 2 45.00 ( 0.00%)
> > Ops 3 80.00 ( 0.00%)
>
> It should already invoke the shrinkers at least once per node. Could
> you tell me how you changed the code for this test?

Sorry about that. That was my mistake, and the above data is just
experimental error. Anyway, I changed the code as below.

@@ -2950,7 +2957,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
static bool kswapd_shrink_zone(struct zone *zone,
int classzone_idx,
struct scan_control *sc,
- unsigned long *nr_attempted)
+ unsigned long *nr_attempted,
+ bool *should_shrink_slab)
{
int testorder = sc->order;
unsigned long balance_gap;
@@ -2985,10 +2993,15 @@ static bool kswapd_shrink_zone(struct zone *zone,
*/
lowmem_pressure = (buffer_heads_over_limit && is_highmem(zone));
if (!lowmem_pressure && zone_balanced(zone, testorder,
- balance_gap, classzone_idx))
+ balance_gap, classzone_idx)) {
+ if (zone_idx(zone) == classzone_idx)
+ *should_shrink_slab = true;
return true;
+ }

- shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);
+ shrink_zone(zone, sc, (zone_idx(zone) == classzone_idx) ||
+ *should_shrink_slab);
+ *should_shrink_slab = false;

/* Account for the number of pages attempted to reclaim */
*nr_attempted += sc->nr_to_reclaim;
@@ -3052,6 +3065,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
unsigned long nr_attempted = 0;
bool raise_priority = true;
bool pgdat_needs_compaction = (order > 0);
+ bool should_shrink_slab = false;

sc.nr_reclaimed = 0;

@@ -3164,7 +3178,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
* efficiency.
*/
if (kswapd_shrink_zone(zone, end_zone,
- &sc, &nr_attempted))
+ &sc, &nr_attempted,
+ &should_shrink_slab))
raise_priority = false;
}


The reason I made this change is that I saw lots of '*should_shrink_slab = true'
cases happen. But I hadn't checked before whether shrink_zone() was actually
called because of should_shrink_slab. Now I have checked again and found that
calling shrink_zone() with 'should_shrink_slab = true' doesn't happen. Maybe,
in this case, the lower zones are also balanced.

I will investigate further what causes the compaction benchmark regression.
If it is related to less aggressive shrinking on multi-zone nodes, that is
fine and I will look at the compaction code.

Anyway, as mentioned above, if such skipping is theoretically possible,
should we invoke the shrinkers at least once per node, as in the change
above? I guess the probability is very low.

Thanks.

2015-04-17 05:08:22

by Joonsoo Kim

Subject: Re: [patch] mm: vmscan: invoke slab shrinkers from shrink_zone()

On Fri, Apr 17, 2015 at 09:17:53AM +1000, Dave Chinner wrote:
> On Thu, Apr 16, 2015 at 10:34:13AM -0400, Johannes Weiner wrote:
> > On Thu, Apr 16, 2015 at 12:57:36PM +0900, Joonsoo Kim wrote:
> > > This causes following success rate regression of phase 1,2 on stress-highalloc
> > > benchmark. The situation of phase 1,2 is that many high order allocations are
> > > requested while many threads do kernel build in parallel.
> >
> > Yes, the patch made the shrinkers on multi-zone nodes less aggressive.
> > From the changelog:
> >
> > This changes kswapd behavior, which used to invoke the shrinkers for each
> > zone, but with scan ratios gathered from the entire node, resulting in
> > meaningless pressure quantities on multi-zone nodes.
> >
> > So the previous code *did* apply more pressure on the shrinkers, but
> > it didn't make any sense. The number of slab objects to scan for each
> > scanned LRU page depended on how many zones there were in a node, and
> > their relative sizes. So a node with a large DMA32 and a small Normal
> > would receive vastly different relative slab pressure than a node with
> > only one big zone Normal. That's not something we should revert to.
> >
> > If we are too weak on objects compared to LRU pages then we should
> > adjust DEFAULT_SEEKS or individual shrinker settings.
>
> Now this thread has my attention. Changing shrinker defaults will
> seriously upset the memory balance under load (in unpredictable
> ways) so I really don't think we should even consider changing
> DEFAULT_SEEKS.
>
> If there's a shrinker imbalance, we need to understand which
> shrinker needs rebalancing, then modify that shrinker's
> configuration and then observe the impact this has on the rest of
> the system. This means looking at variance of the memory footprint
> in steady state, reclaim overshoot and damping rates before steady
> state is acheived, etc. Balancing multiple shrinkers (especially
> those with dependencies on other caches) under memory
> load is a non-trivial undertaking.
>
> I don't see any evidence that we have a shrinker imbalance, so I
> really suspect the problem is "shrinkers aren't doing enough work".
> In that case, we need to increase the pressure being generated, not
> start fiddling around with shrinker configurations.

Okay. I agree.

>
> > If we think our pressure ratio is accurate but we don't reclaim enough
> > compared to our compaction efforts, then any adjustments to improve
> > huge page successrate should come from the allocator/compaction side.
>
> Right - if compaction is failing, then the problem is more likely
> that it isn't generating enough pressure, and so the shrinkers
> aren't doing the work we are expecting them to do. That's a problem
> with compaction, not the shrinkers...

Yes, I agree with that. I will investigate compaction further.

Thanks.

2015-04-24 11:47:11

by Vlastimil Babka

Subject: Re: [patch] mm: vmscan: invoke slab shrinkers from shrink_zone()

On 04/17/2015 07:06 AM, Joonsoo Kim wrote:
> On Thu, Apr 16, 2015 at 10:34:13AM -0400, Johannes Weiner wrote:
>> Hi Joonsoo,
>>
>> On Thu, Apr 16, 2015 at 12:57:36PM +0900, Joonsoo Kim wrote:
>>> Hello, Johannes.
>>>
>>> Ccing Vlastimil, because this patch causes some regression on
>>> stress-highalloc test in mmtests and he is a expert on compaction
>>> and would have interest on it. :)

Thanks. It's not the first time stress-highalloc has indicated a
regression due to changes in reclaim, and some (not really related)
change to the compaction code undid the apparent regression. But one has
to keep in mind that the benchmark is far from perfect. Thanks for the
investigation, though.