2014-11-06 23:50:47

by Johannes Weiner

Subject: [rfc patch] mm: vmscan: invoke slab shrinkers for each lruvec

The slab shrinkers currently rely on the reclaim code providing an
ad-hoc concept of NUMA nodes that doesn't really exist: for all
scanned zones and lruvecs, the nodes and available LRU pages are
summed up, only to have the shrinkers then again walk that nodemask
when scanning slab caches. This duplication will only get worse and
more expensive once the shrinkers become aware of cgroups.

Instead, invoke the shrinkers for each lruvec scanned - which is
either the zone level, or in case of cgroups, the subset of a zone's
pages that is charged to the scanned memcg. The number of eligible
LRU pages is naturally available at that level - it is even more
accurate than simply looking at the global state and the number of
available swap pages, as get_scan_count() considers many other factors
when deciding which LRU pages to consider.

This invokes the shrinkers more often and with smaller page and scan
counts, but the ratios remain the same, and the shrinkers themselves
do not add significantly to the existing per-lruvec cost.

This integrates the slab shrinking nicely into the reclaim logic. Not
just conceptually, but it also allows kswapd, the direct reclaim code,
and zone reclaim to get rid of their ad-hoc custom slab shrinking.

Lastly, this facilitates making the shrinkers cgroup-aware without a
fantastic amount of code and runtime work duplication, and consolidation
will make hierarchy walk optimizations easier later on.

Signed-off-by: Johannes Weiner <[email protected]>
---
drivers/staging/android/ashmem.c | 1 -
fs/drop_caches.c | 15 ++--
include/linux/shrinker.h | 2 -
mm/memory-failure.c | 3 +-
mm/vmscan.c | 164 +++++++++++++--------------------------
5 files changed, 63 insertions(+), 122 deletions(-)

I put this together as a result of the discussion with Vladimir about
memcg-aware slab shrinkers this morning.

This might need some tuning, but it definitely looks like the right
thing to do conceptually. I'm currently playing with various slab-
and memcg-heavy workloads (many numa nodes + many cgroups = many
shrinker invocations) but so far the numbers look okay.

It would be great if other people could weigh in, and possibly also
expose it to their favorite filesystem and reclaim stress tests.

Thanks

diff --git a/drivers/staging/android/ashmem.c b/drivers/staging/android/ashmem.c
index ad4f5790a76f..776fba626278 100644
--- a/drivers/staging/android/ashmem.c
+++ b/drivers/staging/android/ashmem.c
@@ -785,7 +785,6 @@ static long ashmem_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
.nr_to_scan = LONG_MAX,
};
ret = ashmem_shrink_count(&ashmem_shrinker, &sc);
- nodes_setall(sc.nodes_to_scan);
ashmem_shrink_scan(&ashmem_shrinker, &sc);
}
break;
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index 1de7294aad20..ca602bf7d97a 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -40,13 +40,18 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
static void drop_slab(void)
{
int nr_objects;
- struct shrink_control shrink = {
- .gfp_mask = GFP_KERNEL,
- };

- nodes_setall(shrink.nodes_to_scan);
do {
- nr_objects = shrink_slab(&shrink, 1000, 1000);
+ int nid;
+
+ nr_objects = 0;
+ for_each_online_node(nid) {
+ struct shrink_control shrink = {
+ .gfp_mask = GFP_KERNEL,
+ .nid = nid,
+ };
+ nr_objects += shrink_slab(&shrink, 1000, 1000);
+ }
} while (nr_objects > 10);
}

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 68c097077ef0..f4aee75f00b1 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -18,8 +18,6 @@ struct shrink_control {
*/
unsigned long nr_to_scan;

- /* shrink from these nodes */
- nodemask_t nodes_to_scan;
/* current node being shrunk (for NUMA aware shrinkers) */
int nid;
};
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index e619625489c2..93ee9739f7fd 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -248,9 +248,8 @@ void shake_page(struct page *p, int access)
do {
struct shrink_control shrink = {
.gfp_mask = GFP_KERNEL,
+ .nid = nid,
};
- node_set(nid, shrink.nodes_to_scan);
-
nr = shrink_slab(&shrink, 1000, 1000);
if (page_count(p) == 1)
break;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a384339bf718..6a9ab5adf118 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -229,9 +229,10 @@ EXPORT_SYMBOL(unregister_shrinker);

#define SHRINK_BATCH 128

-static unsigned long
-shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
- unsigned long nr_pages_scanned, unsigned long lru_pages)
+static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
+ struct shrinker *shrinker,
+ unsigned long nr_pages_scanned,
+ unsigned long lru_pages)
{
unsigned long freed = 0;
unsigned long long delta;
@@ -239,10 +240,15 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
long freeable;
long nr;
long new_nr;
- int nid = shrinkctl->nid;
+ int nid;
long batch_size = shrinker->batch ? shrinker->batch
: SHRINK_BATCH;

+ if (shrinker->flags & SHRINKER_NUMA_AWARE)
+ nid = shrinkctl->nid;
+ else
+ nid = 0;
+
freeable = shrinker->count_objects(shrinker, shrinkctl);
if (freeable == 0)
return 0;
@@ -380,19 +386,9 @@ unsigned long shrink_slab(struct shrink_control *shrinkctl,
}

list_for_each_entry(shrinker, &shrinker_list, list) {
- if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) {
- shrinkctl->nid = 0;
- freed += shrink_slab_node(shrinkctl, shrinker,
+ freed += do_shrink_slab(shrinkctl, shrinker,
nr_pages_scanned, lru_pages);
- continue;
- }
-
- for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
- if (node_online(shrinkctl->nid))
- freed += shrink_slab_node(shrinkctl, shrinker,
- nr_pages_scanned, lru_pages);

- }
}
up_read(&shrinker_rwsem);
out:
@@ -1876,7 +1872,8 @@ enum scan_balance {
* nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
*/
static void get_scan_count(struct lruvec *lruvec, int swappiness,
- struct scan_control *sc, unsigned long *nr)
+ struct scan_control *sc, unsigned long *nr,
+ unsigned long *lru_pages)
{
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
u64 fraction[2];
@@ -2022,39 +2019,34 @@ out:
some_scanned = false;
/* Only use force_scan on second pass. */
for (pass = 0; !some_scanned && pass < 2; pass++) {
+ *lru_pages = 0;
for_each_evictable_lru(lru) {
int file = is_file_lru(lru);
unsigned long size;
unsigned long scan;

+ /* Scan one type exclusively */
+ if ((scan_balance == SCAN_FILE) != file) {
+ nr[lru] = 0;
+ continue;
+ }
+
size = get_lru_size(lruvec, lru);
- scan = size >> sc->priority;
+ *lru_pages += size;

+ scan = size >> sc->priority;
if (!scan && pass && force_scan)
scan = min(size, SWAP_CLUSTER_MAX);

- switch (scan_balance) {
- case SCAN_EQUAL:
- /* Scan lists relative to size */
- break;
- case SCAN_FRACT:
+ if (scan_balance == SCAN_FRACT) {
/*
* Scan types proportional to swappiness and
* their relative recent reclaim efficiency.
*/
scan = div64_u64(scan * fraction[file],
- denominator);
- break;
- case SCAN_FILE:
- case SCAN_ANON:
- /* Scan one type exclusively */
- if ((scan_balance == SCAN_FILE) != file)
- scan = 0;
- break;
- default:
- /* Look ma, no brain */
- BUG();
+ denominator);
}
+
nr[lru] = scan;
/*
* Skip the second pass and don't force_scan,
@@ -2077,10 +2069,17 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
enum lru_list lru;
unsigned long nr_reclaimed = 0;
unsigned long nr_to_reclaim = sc->nr_to_reclaim;
+ unsigned long nr_scanned = sc->nr_scanned;
+ unsigned long lru_pages;
struct blk_plug plug;
bool scan_adjusted;
+ struct shrink_control shrink = {
+ .gfp_mask = sc->gfp_mask,
+ .nid = zone_to_nid(lruvec_zone(lruvec)),
+ };
+ struct reclaim_state *reclaim_state = current->reclaim_state;

- get_scan_count(lruvec, swappiness, sc, nr);
+ get_scan_count(lruvec, swappiness, sc, nr, &lru_pages);

/* Record the original scan target for proportional adjustments later */
memcpy(targets, nr, sizeof(nr));
@@ -2173,6 +2172,23 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
sc->nr_reclaimed += nr_reclaimed;

/*
+ * Shrink slab caches in the same proportion that the eligible
+ * LRU pages were scanned.
+ *
+ * XXX: Skip memcg limit reclaim, as the slab shrinkers are
+ * not cgroup-aware yet and we can't know if the objects in
+ * the global lists contribute to the memcg limit.
+ */
+ if (global_reclaim(sc) && lru_pages) {
+ nr_scanned = sc->nr_scanned - nr_scanned;
+ shrink_slab(&shrink, nr_scanned, lru_pages);
+ if (reclaim_state) {
+ sc->nr_reclaimed += reclaim_state->reclaimed_slab;
+ reclaim_state->reclaimed_slab = 0;
+ }
+ }
+
+ /*
* Even if we did not try to evict anon pages at all, we want to
* rebalance the anon lru active/inactive ratio.
*/
@@ -2376,12 +2392,7 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
struct zone *zone;
unsigned long nr_soft_reclaimed;
unsigned long nr_soft_scanned;
- unsigned long lru_pages = 0;
- struct reclaim_state *reclaim_state = current->reclaim_state;
gfp_t orig_mask;
- struct shrink_control shrink = {
- .gfp_mask = sc->gfp_mask,
- };
enum zone_type requested_highidx = gfp_zone(sc->gfp_mask);
bool reclaimable = false;

@@ -2394,8 +2405,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
if (buffer_heads_over_limit)
sc->gfp_mask |= __GFP_HIGHMEM;

- nodes_clear(shrink.nodes_to_scan);
-
for_each_zone_zonelist_nodemask(zone, z, zonelist,
gfp_zone(sc->gfp_mask), sc->nodemask) {
if (!populated_zone(zone))
@@ -2409,9 +2418,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
GFP_KERNEL | __GFP_HARDWALL))
continue;

- lru_pages += zone_reclaimable_pages(zone);
- node_set(zone_to_nid(zone), shrink.nodes_to_scan);
-
if (sc->priority != DEF_PRIORITY &&
!zone_reclaimable(zone))
continue; /* Let kswapd poll it */
@@ -2459,20 +2465,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
}

/*
- * Don't shrink slabs when reclaiming memory from over limit cgroups
- * but do shrink slab at least once when aborting reclaim for
- * compaction to avoid unevenly scanning file/anon LRU pages over slab
- * pages.
- */
- if (global_reclaim(sc)) {
- shrink_slab(&shrink, sc->nr_scanned, lru_pages);
- if (reclaim_state) {
- sc->nr_reclaimed += reclaim_state->reclaimed_slab;
- reclaim_state->reclaimed_slab = 0;
- }
- }
-
- /*
* Restore to original mask to avoid the impact on the caller if we
* promoted it to __GFP_HIGHMEM.
*/
@@ -2932,15 +2924,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
static bool kswapd_shrink_zone(struct zone *zone,
int classzone_idx,
struct scan_control *sc,
- unsigned long lru_pages,
unsigned long *nr_attempted)
{
int testorder = sc->order;
unsigned long balance_gap;
- struct reclaim_state *reclaim_state = current->reclaim_state;
- struct shrink_control shrink = {
- .gfp_mask = sc->gfp_mask,
- };
bool lowmem_pressure;

/* Reclaim above the high watermark. */
@@ -2976,12 +2963,6 @@ static bool kswapd_shrink_zone(struct zone *zone,
return true;

shrink_zone(zone, sc);
- nodes_clear(shrink.nodes_to_scan);
- node_set(zone_to_nid(zone), shrink.nodes_to_scan);
-
- reclaim_state->reclaimed_slab = 0;
- shrink_slab(&shrink, sc->nr_scanned, lru_pages);
- sc->nr_reclaimed += reclaim_state->reclaimed_slab;

/* Account for the number of pages attempted to reclaim */
*nr_attempted += sc->nr_to_reclaim;
@@ -3042,7 +3023,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
count_vm_event(PAGEOUTRUN);

do {
- unsigned long lru_pages = 0;
unsigned long nr_attempted = 0;
bool raise_priority = true;
bool pgdat_needs_compaction = (order > 0);
@@ -3102,8 +3082,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
if (!populated_zone(zone))
continue;

- lru_pages += zone_reclaimable_pages(zone);
-
/*
* If any zone is currently balanced then kswapd will
* not call compaction as it is expected that the
@@ -3159,8 +3137,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
* that that high watermark would be met at 100%
* efficiency.
*/
- if (kswapd_shrink_zone(zone, end_zone, &sc,
- lru_pages, &nr_attempted))
+ if (kswapd_shrink_zone(zone, end_zone,
+ &sc, &nr_attempted))
raise_priority = false;
}

@@ -3612,10 +3590,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
.may_swap = 1,
};
- struct shrink_control shrink = {
- .gfp_mask = sc.gfp_mask,
- };
- unsigned long nr_slab_pages0, nr_slab_pages1;

cond_resched();
/*
@@ -3638,40 +3612,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
} while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0);
}

- nr_slab_pages0 = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
- if (nr_slab_pages0 > zone->min_slab_pages) {
- /*
- * shrink_slab() does not currently allow us to determine how
- * many pages were freed in this zone. So we take the current
- * number of slab pages and shake the slab until it is reduced
- * by the same nr_pages that we used for reclaiming unmapped
- * pages.
- */
- nodes_clear(shrink.nodes_to_scan);
- node_set(zone_to_nid(zone), shrink.nodes_to_scan);
- for (;;) {
- unsigned long lru_pages = zone_reclaimable_pages(zone);
-
- /* No reclaimable slab or very low memory pressure */
- if (!shrink_slab(&shrink, sc.nr_scanned, lru_pages))
- break;
-
- /* Freed enough memory */
- nr_slab_pages1 = zone_page_state(zone,
- NR_SLAB_RECLAIMABLE);
- if (nr_slab_pages1 + nr_pages <= nr_slab_pages0)
- break;
- }
-
- /*
- * Update nr_reclaimed by the number of slab pages we
- * reclaimed from this zone.
- */
- nr_slab_pages1 = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
- if (nr_slab_pages1 < nr_slab_pages0)
- sc.nr_reclaimed += nr_slab_pages0 - nr_slab_pages1;
- }
-
p->reclaim_state = NULL;
current->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE);
lockdep_clear_current_reclaim_state();
--
2.1.3


2014-11-07 09:18:28

by Vladimir Davydov

Subject: Re: [rfc patch] mm: vmscan: invoke slab shrinkers for each lruvec

Hi Johannes,

The general idea sounds sane to me. A few comments inline.

On Thu, Nov 06, 2014 at 06:50:28PM -0500, Johannes Weiner wrote:
[...]
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a384339bf718..6a9ab5adf118 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
[...]
> @@ -1876,7 +1872,8 @@ enum scan_balance {
> * nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
> */
> static void get_scan_count(struct lruvec *lruvec, int swappiness,
> - struct scan_control *sc, unsigned long *nr)
> + struct scan_control *sc, unsigned long *nr,
> + unsigned long *lru_pages)
> {
> struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> u64 fraction[2];
> @@ -2022,39 +2019,34 @@ out:
> some_scanned = false;
> /* Only use force_scan on second pass. */
> for (pass = 0; !some_scanned && pass < 2; pass++) {
> + *lru_pages = 0;
> for_each_evictable_lru(lru) {
> int file = is_file_lru(lru);
> unsigned long size;
> unsigned long scan;
>
> + /* Scan one type exclusively */
> + if ((scan_balance == SCAN_FILE) != file) {
> + nr[lru] = 0;
> + continue;
> + }
> +

Why do you move this piece of code? AFAIU, we only want to accumulate
the total number of evictable pages on the lruvec, so the patch for
shrink_lruvec should look much simpler. Is it a kind of cleanup? If so,
I guess it'd be better to submit it separately.

Anyways, this hunk doesn't look right to me. With it applied, if
scan_balance equals SCAN_EQUAL or SCAN_FRACT we won't scan file lists at
all.

> size = get_lru_size(lruvec, lru);
> - scan = size >> sc->priority;
> + *lru_pages += size;
>
> + scan = size >> sc->priority;
> if (!scan && pass && force_scan)
> scan = min(size, SWAP_CLUSTER_MAX);
>
> - switch (scan_balance) {
> - case SCAN_EQUAL:
> - /* Scan lists relative to size */
> - break;
> - case SCAN_FRACT:
> + if (scan_balance == SCAN_FRACT) {
> /*
> * Scan types proportional to swappiness and
> * their relative recent reclaim efficiency.
> */
> scan = div64_u64(scan * fraction[file],
> - denominator);
> - break;
> - case SCAN_FILE:
> - case SCAN_ANON:
> - /* Scan one type exclusively */
> - if ((scan_balance == SCAN_FILE) != file)
> - scan = 0;
> - break;
> - default:
> - /* Look ma, no brain */
> - BUG();
> + denominator);
> }
> +
> nr[lru] = scan;
> /*
> * Skip the second pass and don't force_scan,
> @@ -2077,10 +2069,17 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
> enum lru_list lru;
> unsigned long nr_reclaimed = 0;
> unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> + unsigned long nr_scanned = sc->nr_scanned;
> + unsigned long lru_pages;
> struct blk_plug plug;
> bool scan_adjusted;
> + struct shrink_control shrink = {
> + .gfp_mask = sc->gfp_mask,
> + .nid = zone_to_nid(lruvec_zone(lruvec)),
> + };
> + struct reclaim_state *reclaim_state = current->reclaim_state;
>
> - get_scan_count(lruvec, swappiness, sc, nr);
> + get_scan_count(lruvec, swappiness, sc, nr, &lru_pages);
>
> /* Record the original scan target for proportional adjustments later */
> memcpy(targets, nr, sizeof(nr));
> @@ -2173,6 +2172,23 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
> sc->nr_reclaimed += nr_reclaimed;
>
> /*
> + * Shrink slab caches in the same proportion that the eligible
> + * LRU pages were scanned.
> + *
> + * XXX: Skip memcg limit reclaim, as the slab shrinkers are
> + * not cgroup-aware yet and we can't know if the objects in
> + * the global lists contribute to the memcg limit.
> + */
> + if (global_reclaim(sc) && lru_pages) {
> + nr_scanned = sc->nr_scanned - nr_scanned;
> + shrink_slab(&shrink, nr_scanned, lru_pages);

I've a few concerns about slab-vs-pagecache reclaim proportion:

If a node has > 1 zones, then we will scan slabs more aggressively than
lru pages. Not sure, if it really matters, because on x86_64 most nodes
have 1 zone.

If there are > 1 nodes, NUMA-unaware shrinkers will get more pressure
than NUMA-aware ones. However, we have the same behavior in kswapd at
present. This might be an issue if there are many nodes.

If there are > 1 memory cgroups, slab shrinkers will get significantly
more pressure on global reclaim than they should. The introduction of
memcg-aware shrinkers will help, but only for memcg-aware shrinkers.
Other shrinkers (if there are any) will be treated unfairly. I think for
memcg-unaware shrinkers (i.e. for all shrinkers right now) we should
pass lru_pages=zone_reclaimable_pages.

BTW, maybe we'd better pass the scan priority for shrink_slab to
calculate the pressure instead of messing with nr_scanned/lru_pages?

> + if (reclaim_state) {
> + sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> + reclaim_state->reclaimed_slab = 0;
> + }

OFF TOPIC: I wonder why we need the reclaim_state. The main shrink
candidates, dentries and inodes, are mostly freed by RCU, so they won't
count there.

> + }
> +
> + /*
> * Even if we did not try to evict anon pages at all, we want to
> * rebalance the anon lru active/inactive ratio.
> */
[...]

Thanks,
Vladimir

2014-11-07 16:28:00

by Johannes Weiner

Subject: Re: [rfc patch] mm: vmscan: invoke slab shrinkers for each lruvec

On Fri, Nov 07, 2014 at 12:18:11PM +0300, Vladimir Davydov wrote:
> On Thu, Nov 06, 2014 at 06:50:28PM -0500, Johannes Weiner wrote:
> [...]
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index a384339bf718..6a9ab5adf118 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> [...]
> > @@ -1876,7 +1872,8 @@ enum scan_balance {
> > * nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
> > */
> > static void get_scan_count(struct lruvec *lruvec, int swappiness,
> > - struct scan_control *sc, unsigned long *nr)
> > + struct scan_control *sc, unsigned long *nr,
> > + unsigned long *lru_pages)
> > {
> > struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> > u64 fraction[2];
> > @@ -2022,39 +2019,34 @@ out:
> > some_scanned = false;
> > /* Only use force_scan on second pass. */
> > for (pass = 0; !some_scanned && pass < 2; pass++) {
> > + *lru_pages = 0;
> > for_each_evictable_lru(lru) {
> > int file = is_file_lru(lru);
> > unsigned long size;
> > unsigned long scan;
> >
> > + /* Scan one type exclusively */
> > + if ((scan_balance == SCAN_FILE) != file) {
> > + nr[lru] = 0;
> > + continue;
> > + }
> > +
>
> Why do you move this piece of code? AFAIU, we only want to accumulate
> the total number of evictable pages on the lruvec, so the patch for
> shrink_lruvec should look much simpler. Is it a kind of cleanup? If so,
> I guess it'd be better to submit it separately.

Yes, it started out as a separate patch to make the setting of
*lru_pages more readable.

> Anyways, this hunk doesn't look right to me. With it applied, if
> scan_balance equals SCAN_EQUAL or SCAN_FRACT we won't scan file lists at
> all.

Urgh, brain fart. I reverted to the original switch; it should
be readable enough.

> > @@ -2173,6 +2172,23 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
> > sc->nr_reclaimed += nr_reclaimed;
> >
> > /*
> > + * Shrink slab caches in the same proportion that the eligible
> > + * LRU pages were scanned.
> > + *
> > + * XXX: Skip memcg limit reclaim, as the slab shrinkers are
> > + * not cgroup-aware yet and we can't know if the objects in
> > + * the global lists contribute to the memcg limit.
> > + */
> > + if (global_reclaim(sc) && lru_pages) {
> > + nr_scanned = sc->nr_scanned - nr_scanned;
> > + shrink_slab(&shrink, nr_scanned, lru_pages);
>
> I've a few concerns about slab-vs-pagecache reclaim proportion:

Well, hopefully! :-)

> If a node has > 1 zones, then we will scan slabs more aggressively than
> lru pages. Not sure, if it really matters, because on x86_64 most nodes
> have 1 zone.

It shouldn't be a big problem, but it is also fairly easy to fix. We
can safely assume that most slab objects are allocated without any
zone restrictions (except that they have to be lowmem), so the aging
pressure of the Normal zone should be the most suitable to translate
to the slab objects as well. Thus, we should shrink slabs only when
scanning the LRU pages of the Normal zone.

> If there are > 1 nodes, NUMA-unaware shrinkers will get more pressure
> than NUMA-aware ones. However, we have the same behavior in kswapd at
> present. This might be an issue if there are many nodes.

There is nothing we can do about this. Kswapd does the majority of
reclaim, and by its nature has a per-node view. We can't provide a
bigger granularity than that.

> If there are > 1 memory cgroups, slab shrinkers will get significantly
> more pressure on global reclaim than they should. The introduction of
> memcg-aware shrinkers will help, but only for memcg-aware shrinkers.
> Other shrinkers (if there are any) will be treated unfairly. I think for
> memcg-unaware shrinkers (i.e. for all shrinkers right now) we should
> pass lru_pages=zone_reclaimable_pages.

Agreed, we need separate entry points for cgroup-aware and
cgroup-unaware shrinkers. I'm putting the node shrinker into
shrink_zone() for now, which will later remain memcg-unaware.

The memcg one you can later add in shrink_lruvec(), along with a
filter that skips over memcg-aware shrinkers on the node-level.

I'm propagating the pages that get_scan_count() considers up to
shrink_zone(), so they are available to you in shrink_lruvec().

> BTW, maybe we'd better pass the scan priority for shrink_slab to
> calculate the pressure instead of messing with nr_scanned/lru_pages?

The scan goal itself is lru_pages >> priority, so the priority level
is already encoded in the nr_scanned / lru_pages ratio.
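
To spell that out, the pressure calculation in do_shrink_slab() boils
down to this (deferred work and the clamping omitted, names as in the
patch, ->seeks defaulting to 2):

	delta  = (4 * nr_scanned) / shrinker->seeks; /* pages scanned, seek-adjusted */
	delta *= freeable;                           /* objects in this cache */
	do_div(delta, lru_pages + 1);                /* scale by scanned/eligible ratio */

With nr_scanned roughly lru_pages >> priority, each call asks for
about 2 * (freeable >> priority) objects, so the priority carries
over to the slab caches implicitly.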

> > + if (reclaim_state) {
> > + sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> > + reclaim_state->reclaimed_slab = 0;
> > + }
>
> OFF TOPIC: I wonder why we need the reclaim_state. The main shrink
> candidates, dentries and inodes, are mostly freed by RCU, so they won't
> count there.

Good point. I'll make a note of it, but might defer it in this
series.

Thanks!

2014-11-10 06:46:49

by Dave Chinner

Subject: Re: [rfc patch] mm: vmscan: invoke slab shrinkers for each lruvec

On Thu, Nov 06, 2014 at 06:50:28PM -0500, Johannes Weiner wrote:
> The slab shrinkers currently rely on the reclaim code providing an
> ad-hoc concept of NUMA nodes that doesn't really exist: for all
> scanned zones and lruvecs, the nodes and available LRU pages are
> summed up, only to have the shrinkers then again walk that nodemask
> when scanning slab caches. This duplication will only get worse and
> more expensive once the shrinkers become aware of cgroups.

As I said previously, it's not an "ad-hoc concept". It's exactly the
same NUMA abstraction that the VM presents to anyone who wants to
control memory allocation locality. i.e. node IDs and node masks.

> Instead, invoke the shrinkers for each lruvec scanned - which is
> either the zone level, or in case of cgroups, the subset of a zone's
> pages that is charged to the scanned memcg. The number of eligible
> LRU pages is naturally available at that level - it is even more
> accurate than simply looking at the global state and the number of
> available swap pages, as get_scan_count() considers many other factors
> when deciding which LRU pages to consider.
>
> This invokes the shrinkers more often and with smaller page and scan
> counts, but the ratios remain the same, and the shrinkers themselves
> do not add significantly to the existing per-lruvec cost.

That goes in the opposite direction to the one in which we've found
filesystem shrinkers operate most effectively. i.e. larger batches with less
frequent reclaim callouts tend to result in more consistent
performance because shrinkers take locks and do IO that other
application operations get stuck behind (shrink_batch exists
for this reason ;).

>
> This integrates the slab shrinking nicely into the reclaim logic. Not
> just conceptually, but it also allows kswapd, the direct reclaim code,
> and zone reclaim to get rid of their ad-hoc custom slab shrinking.
>
> Lastly, this facilitates making the shrinkers cgroup-aware without a
> fantastic amount of code and runtime work duplication, and consolidation
> will make hierarchy walk optimizations easier later on.

It still makes callers have to care about internal VM metrics
to calculate how much work they should do. Callers should be able to
pass in a measure of how much work the shrinker should do (e.g. as
a percentage of cache size). Even the VM can use this - it can take
it's scanned/pages variables and use them to calculate the
percentage of caches to free, and the shrinker loop can then be
completely free of any relationship to the LRU page reclaim
implementation.....

e.g. drop_caches() should just be able to call "shrink_slab_all()"
and not have to care about nodes, batch sizes, detect when caches
are empty, etc. Similarly shake_page() only wants
"shrink_slab_node_nr()" to free a small amount of objects from the
node it cares about each time.

i.e. we should be writing helpers to remove shrinker implementation
quirks from callers, not driving more of the quirks into external
callers...
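
To make that concrete, something like the following untested sketch
(the helper names are the ones above; the bodies just wrap the
shrink_slab() interface from this patch the way drop_caches and
shake_page already use it):

	unsigned long shrink_slab_all(void)
	{
		unsigned long freed = 0;
		int nid;

		for_each_online_node(nid) {
			struct shrink_control sc = {
				.gfp_mask = GFP_KERNEL,
				.nid = nid,
			};

			freed += shrink_slab(&sc, 1000, 1000);
		}
		return freed;
	}

	unsigned long shrink_slab_node_nr(int nid, unsigned long nr)
	{
		struct shrink_control sc = {
			.gfp_mask = GFP_KERNEL,
			.nid = nid,
		};

		/*
		 * The ratio-based interface gives no direct way to ask
		 * for "roughly nr objects" (which is rather the point),
		 * so this just mirrors what shake_page() does today.
		 */
		return shrink_slab(&sc, nr, nr);
	}

drop_caches and shake_page would then simply loop on the return value
until the caches stop giving anything back.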

> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> drivers/staging/android/ashmem.c | 1 -
> fs/drop_caches.c | 15 ++--
> include/linux/shrinker.h | 2 -
> mm/memory-failure.c | 3 +-
> mm/vmscan.c | 164 +++++++++++++--------------------------
> 5 files changed, 63 insertions(+), 122 deletions(-)
>
> I put this together as a result of the discussion with Vladimir about
> memcg-aware slab shrinkers this morning.
>
> This might need some tuning, but it definitely looks like the right
> thing to do conceptually. I'm currently playing with various slab-
> and memcg-heavy workloads (many numa nodes + many cgroups = many
> shrinker invocations) but so far the numbers look okay.
>
> It would be great if other people could weigh in, and possibly also
> expose it to their favorite filesystem and reclaim stress tests.

Causes substantial increase in performance variability on inode
intensive benchmarks. On my standard fsmark benchmark, I see file
creates sit at around 250,000/sec, but there are several-second-long
dips to about 50,000/sec, coinciding with the inode cache being
trimmed by several million inodes. Looking more deeply, this is due
to memory pressure driving inode writeback - we're reclaiming inodes
that haven't yet been written to disk, and so by reclaiming the
inode cache slab more frequently it's driving larger peaks of IO
and blocking ongoing filesystem operations more frequently.

My initial thoughts are that this correlates with the above comments
I made about frequency of shrinker calls and batch sizes, so I
suspect that the aggregation of shrinker-based reclaim work is
necessary to minimise the interference that reclaim causes at the
filesystem level...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2014-11-10 14:44:28

by Johannes Weiner

Subject: Re: [rfc patch] mm: vmscan: invoke slab shrinkers for each lruvec

On Mon, Nov 10, 2014 at 05:46:40PM +1100, Dave Chinner wrote:
> On Thu, Nov 06, 2014 at 06:50:28PM -0500, Johannes Weiner wrote:
> > The slab shrinkers currently rely on the reclaim code providing an
> > ad-hoc concept of NUMA nodes that doesn't really exist: for all
> > scanned zones and lruvecs, the nodes and available LRU pages are
> > summed up, only to have the shrinkers then again walk that nodemask
> > when scanning slab caches. This duplication will only get worse and
> > more expensive once the shrinkers become aware of cgroups.
>
> As I said previously, it's not an "ad-hoc concept". It's exactly the
> same NUMA abstraction that the VM presents to anyone who wants to
> control memory allocation locality. i.e. node IDs and node masks.

That's what the page allocator takes as input, but it's translated to
zones fairly quickly, and it definitely doesn't exist in the reclaim
code anymore. The way we reconstruct the node view on that level
really is ad-hoc. And that is a problem. It's not just duplicative,
it's also that reclaim really needs the ability to target types of
zones in order to meet allocation constraints, and right now we just
scatter-shoot slab objects and hope we free some desired zone memory.

I don't see any intrinsic value in symmetry here, especially since we
can easily encapsulate how the lists are organized for the shrinkers.
Vladimir is already pushing the interface in that direction: add
object, remove object, walk objects according to this shrink_control.
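
For illustration, that is essentially the shape of the list_lru API
that the inode and dentry shrinkers already sit on (signatures
roughly as in the current tree):

	bool list_lru_add(struct list_lru *lru, struct list_head *item);
	bool list_lru_del(struct list_lru *lru, struct list_head *item);
	unsigned long list_lru_count_node(struct list_lru *lru, int nid);
	unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
					 list_lru_walk_cb isolate, void *cb_arg,
					 unsigned long *nr_to_walk);

Whether the objects are bucketed per node, per zone, or per memcg
behind that interface is then an implementation detail the callers
never see.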

> > Instead, invoke the shrinkers for each lruvec scanned - which is
> > either the zone level, or in case of cgroups, the subset of a zone's
> > pages that is charged to the scanned memcg. The number of eligible
> > LRU pages is naturally available at that level - it is even more
> > accurate than simply looking at the global state and the number of
> > available swap pages, as get_scan_count() considers many other factors
> > when deciding which LRU pages to consider.
> >
> > This invokes the shrinkers more often and with smaller page and scan
> > counts, but the ratios remain the same, and the shrinkers themselves
> > do not add significantly to the existing per-lruvec cost.
>
> That goes in the opposite direction to the one in which we've found
> filesystem shrinkers operate most effectively. i.e. larger batches with less
> frequent reclaim callouts tend to result in more consistent
> performance because shrinkers take locks and do IO that other
> application operations get stuck behind (shrink_batch exists
> for this reason ;).

Kswapd, which does the majority of reclaim - unless pressure mounts
too high - already invokes the shrinkers per-zone, so this seems like
a granularity we should generally get away with.

> > This integrates the slab shrinking nicely into the reclaim logic. Not
> > just conceptually, but it also allows kswapd, the direct reclaim code,
> > and zone reclaim to get rid of their ad-hoc custom slab shrinking.
> >
> > Lastly, this facilitates making the shrinkers cgroup-aware without a
> > fantastic amount of code and runtime work duplication, and consolidation
> > will make hierarchy walk optimizations easier later on.
>
> It still makes callers have to care about internal VM metrics
> to calculate how much work they should do. Callers should be able to
> pass in a measure of how much work the shrinker should do (e.g. as
> a percentage of cache size). Even the VM can use this - it can take
> it's scanned/pages variables and use them to calculate the
> percentage of caches to free, and the shrinker loop can then be
> completely free of any relationship to the LRU page reclaim
> implementation.....
>
> e.g. drop_caches() should just be able to call "shrink_slab_all()"
> and not have to care about nodes, batch sizes, detect when caches
> are empty, etc. Similarly shake_page() only wants
> "shrink_slab_node_nr()" to free a small amount of objects from the
> node it cares about each time.
>
> i.e. we should be writing helpers to remove shrinker implementation
> quirks from callers, not driving more of the quirks into external
> callers...

Yes, that is something we should do, but it seems orthogonal for now.
We can always wrap up the interface once we agree on how to organize
the objects.

> > Signed-off-by: Johannes Weiner <[email protected]>
> > ---
> > drivers/staging/android/ashmem.c | 1 -
> > fs/drop_caches.c | 15 ++--
> > include/linux/shrinker.h | 2 -
> > mm/memory-failure.c | 3 +-
> > mm/vmscan.c | 164 +++++++++++++--------------------------
> > 5 files changed, 63 insertions(+), 122 deletions(-)
> >
> > I put this together as a result of the discussion with Vladimir about
> > memcg-aware slab shrinkers this morning.
> >
> > This might need some tuning, but it definitely looks like the right
> > thing to do conceptually. I'm currently playing with various slab-
> > and memcg-heavy workloads (many numa nodes + many cgroups = many
> > shrinker invocations) but so far the numbers look okay.
> >
> > It would be great if other people could weigh in, and possibly also
> > expose it to their favorite filesystem and reclaim stress tests.
>
> Causes substantial increase in performance variability on inode
> intensive benchmarks. On my standard fsmark benchmark, I see file
> creates sit at around 250,000/sec, but there are several-second-long
> dips to about 50,000/sec, coinciding with the inode cache being
> trimmed by several million inodes. Looking more deeply, this is due
> to memory pressure driving inode writeback - we're reclaiming inodes
> that haven't yet been written to disk, and so by reclaiming the
> inode cache slab more frequently it's driving larger peaks of IO
> and blocking ongoing filesystem operations more frequently.
>
> My initial thoughts are that this correlates with the above comments
> I made about frequency of shrinker calls and batch sizes, so I
> suspect that the aggregation of shrinker-based reclaim work is
> necessary to minimise the interference that reclaim causes at the
> filesystem level...

Could you share the benchmark you are running? Also, how many nodes
does this machine have?

My suspicion is that invoking the shrinkers per-zone against the same
old per-node lists incorrectly triples the slab pressure relative to
the lru pressure. Kswapd already does this, but I'm guessing it
matters less under lower pressure, and once pressure mounts direct
reclaim takes over, which my patch made more aggressive.

Organizing the slab objects on per-zone lists would solve this, and
would also allow proper zone reclaim to meet allocation constraints.
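
To put a rough number on the "triples" part: per zone, the shrinkers
are called with nr_scanned ~= zone_lru_pages >> priority and
lru_pages = zone_lru_pages, so with the default ->seeks of 2 each
call works out to roughly

	2 * freeable * (zone_lru_pages >> priority) / zone_lru_pages
	  ~ 2 * (freeable >> priority)

objects to scan, with freeable being the node-wide object count. A
node with, say, DMA, DMA32 and Normal populated gets three such calls
per reclaim pass against the same lists, i.e. about three times the
object pressure for one node's worth of LRU pressure.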

But for now, how about the following patch that invokes the shrinkers
once per node, but uses the allocator's preferred zone as the pivot
for nr_scanned / nr_eligible instead of summing up the node? That
doesn't increase the current number of slab invocations (it actually
reduces them for kswapd), but it would put the shrinker calls at a more
natural level for reclaim, and would also allow us to go ahead with
the cgroup-aware slab shrinkers.

---

From a5b19616822564ae880f0ae22b8a8cb96b2c71a5 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <[email protected]>
Date: Thu, 6 Nov 2014 11:30:27 -0500
Subject: [patch] mm: vmscan: invoke slab shrinkers from shrink_zone()

The slab shrinkers are currently invoked from the zonelist walkers in
kswapd, direct reclaim, and zone reclaim, all of which roughly gauge
the eligible LRU pages and assemble a nodemask to pass to NUMA-aware
shrinkers, which then again have to walk over the nodemask. This is
redundant code, extra runtime work, and fairly inaccurate when it
comes to the estimation of actually scannable LRU pages. The code
duplication will only get worse when making the shrinkers cgroup-aware
and requiring them to have out-of-band cgroup hierarchy walks as well.

Instead, invoke the shrinkers from shrink_zone(), which is where all
reclaimers end up, to avoid this duplication.

Take the count for eligible LRU pages out of get_scan_count(), which
considers many more factors than just the availability of swap space,
like zone_reclaimable_pages() currently does. Accumulate the number
over all visited lruvecs to get the per-zone value.

Some nodes have multiple zones due to memory addressing restrictions.
To avoid putting too much pressure on the shrinkers, only invoke them
once for each such node, using the class zone of the allocation as the
pivot zone. At some point, we may want to think about organizing slab
objects directly on per-zone lists instead, which would also allow
targeting them more directly under zone-constrained allocations.

For now, this integrates the slab shrinking better into the reclaim
logic and gets rid of duplicative invocations from kswapd, direct
reclaim, and zone reclaim. It also prepares for cgroup-awareness,
allowing memcg-capable shrinkers to be added at the lruvec level
without much duplication of both code and runtime work.

Not-yet-signed-off-by: Johannes Weiner <[email protected]>
---
drivers/staging/android/ashmem.c | 3 +-
fs/drop_caches.c | 11 ++-
include/linux/mm.h | 6 +-
include/linux/shrinker.h | 2 -
mm/memory-failure.c | 11 +--
mm/page_alloc.c | 6 +-
mm/vmscan.c | 208 +++++++++++++++------------------------
7 files changed, 98 insertions(+), 149 deletions(-)

diff --git a/drivers/staging/android/ashmem.c b/drivers/staging/android/ashmem.c
index ad4f5790a76f..46f8ef42559e 100644
--- a/drivers/staging/android/ashmem.c
+++ b/drivers/staging/android/ashmem.c
@@ -418,7 +418,7 @@ out:
}

/*
- * ashmem_shrink - our cache shrinker, called from mm/vmscan.c :: shrink_slab
+ * ashmem_shrink - our cache shrinker, called from mm/vmscan.c
*
* 'nr_to_scan' is the number of objects to scan for freeing.
*
@@ -785,7 +785,6 @@ static long ashmem_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
.nr_to_scan = LONG_MAX,
};
ret = ashmem_shrink_count(&ashmem_shrinker, &sc);
- nodes_setall(sc.nodes_to_scan);
ashmem_shrink_scan(&ashmem_shrinker, &sc);
}
break;
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index 1de7294aad20..2bc2c87f35e7 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -40,13 +40,14 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
static void drop_slab(void)
{
int nr_objects;
- struct shrink_control shrink = {
- .gfp_mask = GFP_KERNEL,
- };

- nodes_setall(shrink.nodes_to_scan);
do {
- nr_objects = shrink_slab(&shrink, 1000, 1000);
+ int nid;
+
+ nr_objects = 0;
+ for_each_online_node(nid)
+ nr_objects += shrink_node_slabs(GFP_KERNEL, nid,
+ 1000, 1000);
} while (nr_objects > 10);
}

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3c3bb024f8e8..6ce99da4d7a4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2089,9 +2089,9 @@ int drop_caches_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
#endif

-unsigned long shrink_slab(struct shrink_control *shrink,
- unsigned long nr_pages_scanned,
- unsigned long lru_pages);
+unsigned long shrink_node_slabs(gfp_t gfp_mask, int nid,
+ unsigned long nr_scanned,
+ unsigned long nr_eligible);

#ifndef CONFIG_MMU
#define randomize_va_space 0
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 68c097077ef0..f4aee75f00b1 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -18,8 +18,6 @@ struct shrink_control {
*/
unsigned long nr_to_scan;

- /* shrink from these nodes */
- nodemask_t nodes_to_scan;
/* current node being shrunk (for NUMA aware shrinkers) */
int nid;
};
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index e619625489c2..5174f6f27385 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -239,19 +239,14 @@ void shake_page(struct page *p, int access)
}

/*
- * Only call shrink_slab here (which would also shrink other caches) if
- * access is not potentially fatal.
+ * Only call shrink_node_slabs here (which would also shrink
+ * other caches) if access is not potentially fatal.
*/
if (access) {
int nr;
int nid = page_to_nid(p);
do {
- struct shrink_control shrink = {
- .gfp_mask = GFP_KERNEL,
- };
- node_set(nid, shrink.nodes_to_scan);
-
- nr = shrink_slab(&shrink, 1000, 1000);
+ nr = shrink_node_slabs(GFP_KERNEL, nid, 1000, 1000);
if (page_count(p) == 1)
break;
} while (nr > 10);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7536b50768ed..7d425b2a1a35 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6263,9 +6263,9 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
if (!PageLRU(page))
found++;
/*
- * If there are RECLAIMABLE pages, we need to check it.
- * But now, memory offline itself doesn't call shrink_slab()
- * and it still to be fixed.
+ * If there are RECLAIMABLE pages, we need to check
+ * it. But now, memory offline itself doesn't call
+ * shrink_node_slabs() and it still to be fixed.
*/
/*
* If the page is not RAM, page_count()should be 0.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a384339bf718..8c2b45bfe610 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -229,9 +229,10 @@ EXPORT_SYMBOL(unregister_shrinker);

#define SHRINK_BATCH 128

-static unsigned long
-shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
- unsigned long nr_pages_scanned, unsigned long lru_pages)
+static unsigned long shrink_slabs(struct shrink_control *shrinkctl,
+ struct shrinker *shrinker,
+ unsigned long nr_scanned,
+ unsigned long nr_eligible)
{
unsigned long freed = 0;
unsigned long long delta;
@@ -255,9 +256,9 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);

total_scan = nr;
- delta = (4 * nr_pages_scanned) / shrinker->seeks;
+ delta = (4 * nr_scanned) / shrinker->seeks;
delta *= freeable;
- do_div(delta, lru_pages + 1);
+ do_div(delta, nr_eligible + 1);
total_scan += delta;
if (total_scan < 0) {
pr_err("shrink_slab: %pF negative objects to delete nr=%ld\n",
@@ -289,8 +290,8 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
total_scan = freeable * 2;

trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
- nr_pages_scanned, lru_pages,
- freeable, delta, total_scan);
+ nr_scanned, nr_eligible,
+ freeable, delta, total_scan);

/*
* Normally, we should not scan less than batch_size objects in one
@@ -339,34 +340,37 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
return freed;
}

-/*
- * Call the shrink functions to age shrinkable caches
- *
- * Here we assume it costs one seek to replace a lru page and that it also
- * takes a seek to recreate a cache object. With this in mind we age equal
- * percentages of the lru and ageable caches. This should balance the seeks
- * generated by these structures.
+/**
+ * shrink_node_slabs - shrink slab caches of a given node
+ * @gfp_mask: allocation context
+ * @nid: node whose slab caches to target
+ * @nr_scanned: pressure numerator
+ * @nr_eligible: pressure denominator
*
- * If the vm encountered mapped pages on the LRU it increase the pressure on
- * slab to avoid swapping.
+ * Call the shrink functions to age shrinkable caches.
*
- * We do weird things to avoid (scanned*seeks*entries) overflowing 32 bits.
+ * @nid is passed along to shrinkers with SHRINKER_NUMA_AWARE set,
+ * unaware shrinkers will receive a node id of 0 instead.
*
- * `lru_pages' represents the number of on-LRU pages in all the zones which
- * are eligible for the caller's allocation attempt. It is used for balancing
- * slab reclaim versus page reclaim.
+ * @nr_scanned and @nr_eligible form a ratio that indicate how much of
+ * the available objects should be scanned. Page reclaim for example
+ * passes the number of pages scanned and the number of pages on the
+ * LRU lists that it considered on @nid, plus a bias in @nr_scanned
+ * when it encountered mapped pages. The ratio is further biased by
+ * the ->seeks setting of the shrink function, which indicates the
+ * cost to recreate an object relative to that of an LRU page.
*
- * Returns the number of slab objects which we shrunk.
+ * Returns the number of reclaimed slab objects.
*/
-unsigned long shrink_slab(struct shrink_control *shrinkctl,
- unsigned long nr_pages_scanned,
- unsigned long lru_pages)
+unsigned long shrink_node_slabs(gfp_t gfp_mask, int nid,
+ unsigned long nr_scanned,
+ unsigned long nr_eligible)
{
struct shrinker *shrinker;
unsigned long freed = 0;

- if (nr_pages_scanned == 0)
- nr_pages_scanned = SWAP_CLUSTER_MAX;
+ if (nr_scanned == 0)
+ nr_scanned = SWAP_CLUSTER_MAX;

if (!down_read_trylock(&shrinker_rwsem)) {
/*
@@ -380,20 +384,17 @@ unsigned long shrink_slab(struct shrink_control *shrinkctl,
}

list_for_each_entry(shrinker, &shrinker_list, list) {
- if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) {
- shrinkctl->nid = 0;
- freed += shrink_slab_node(shrinkctl, shrinker,
- nr_pages_scanned, lru_pages);
- continue;
- }
+ struct shrink_control sc = {
+ .gfp_mask = gfp_mask,
+ .nid = nid,
+ };

- for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
- if (node_online(shrinkctl->nid))
- freed += shrink_slab_node(shrinkctl, shrinker,
- nr_pages_scanned, lru_pages);
+ if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
+ sc.nid = 0;

- }
+ freed += shrink_slabs(&sc, shrinker, nr_scanned, nr_eligible);
}
+
up_read(&shrinker_rwsem);
out:
cond_resched();
@@ -1876,7 +1877,8 @@ enum scan_balance {
* nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
*/
static void get_scan_count(struct lruvec *lruvec, int swappiness,
- struct scan_control *sc, unsigned long *nr)
+ struct scan_control *sc, unsigned long *nr,
+ unsigned long *lru_pages)
{
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
u64 fraction[2];
@@ -2022,6 +2024,7 @@ out:
some_scanned = false;
/* Only use force_scan on second pass. */
for (pass = 0; !some_scanned && pass < 2; pass++) {
+ *lru_pages = 0;
for_each_evictable_lru(lru) {
int file = is_file_lru(lru);
unsigned long size;
@@ -2048,14 +2051,19 @@ out:
case SCAN_FILE:
case SCAN_ANON:
/* Scan one type exclusively */
- if ((scan_balance == SCAN_FILE) != file)
+ if ((scan_balance == SCAN_FILE) != file) {
+ size = 0;
scan = 0;
+ }
break;
default:
/* Look ma, no brain */
BUG();
}
+
+ *lru_pages += size;
nr[lru] = scan;
+
/*
* Skip the second pass and don't force_scan,
* if we found something to scan.
@@ -2069,7 +2077,7 @@ out:
* This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
*/
static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
- struct scan_control *sc)
+ struct scan_control *sc, unsigned long *lru_pages)
{
unsigned long nr[NR_LRU_LISTS];
unsigned long targets[NR_LRU_LISTS];
@@ -2080,7 +2088,7 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
struct blk_plug plug;
bool scan_adjusted;

- get_scan_count(lruvec, swappiness, sc, nr);
+ get_scan_count(lruvec, swappiness, sc, nr, lru_pages);

/* Record the original scan target for proportional adjustments later */
memcpy(targets, nr, sizeof(nr));
@@ -2258,7 +2266,8 @@ static inline bool should_continue_reclaim(struct zone *zone,
}
}

-static bool shrink_zone(struct zone *zone, struct scan_control *sc)
+static bool shrink_zone(struct zone *zone, struct scan_control *sc,
+ bool is_classzone)
{
unsigned long nr_reclaimed, nr_scanned;
bool reclaimable = false;
@@ -2269,6 +2278,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc)
.zone = zone,
.priority = sc->priority,
};
+ unsigned long zone_lru_pages = 0;
struct mem_cgroup *memcg;

nr_reclaimed = sc->nr_reclaimed;
@@ -2276,13 +2286,15 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc)

memcg = mem_cgroup_iter(root, NULL, &reclaim);
do {
+ unsigned long lru_pages;
struct lruvec *lruvec;
int swappiness;

lruvec = mem_cgroup_zone_lruvec(zone, memcg);
swappiness = mem_cgroup_swappiness(memcg);

- shrink_lruvec(lruvec, swappiness, sc);
+ shrink_lruvec(lruvec, swappiness, sc, &lru_pages);
+ zone_lru_pages += lru_pages;

/*
* Direct reclaim and kswapd have to scan all memory
@@ -2302,6 +2314,25 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc)
memcg = mem_cgroup_iter(root, memcg, &reclaim);
} while (memcg);

+ /*
+ * Shrink the slab caches in the same proportion that
+ * the eligible LRU pages were scanned.
+ */
+ if (global_reclaim(sc) && is_classzone) {
+ struct reclaim_state *reclaim_state;
+
+ shrink_node_slabs(sc->gfp_mask, zone_to_nid(zone),
+ sc->nr_scanned - nr_scanned,
+ zone_lru_pages);
+
+ reclaim_state = current->reclaim_state;
+ if (reclaim_state) {
+ sc->nr_reclaimed +=
+ reclaim_state->reclaimed_slab;
+ reclaim_state->reclaimed_slab = 0;
+ }
+ }
+
vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
sc->nr_scanned - nr_scanned,
sc->nr_reclaimed - nr_reclaimed);
@@ -2376,12 +2407,7 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
struct zone *zone;
unsigned long nr_soft_reclaimed;
unsigned long nr_soft_scanned;
- unsigned long lru_pages = 0;
- struct reclaim_state *reclaim_state = current->reclaim_state;
gfp_t orig_mask;
- struct shrink_control shrink = {
- .gfp_mask = sc->gfp_mask,
- };
enum zone_type requested_highidx = gfp_zone(sc->gfp_mask);
bool reclaimable = false;

@@ -2394,10 +2420,8 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
if (buffer_heads_over_limit)
sc->gfp_mask |= __GFP_HIGHMEM;

- nodes_clear(shrink.nodes_to_scan);
-
for_each_zone_zonelist_nodemask(zone, z, zonelist,
- gfp_zone(sc->gfp_mask), sc->nodemask) {
+ requested_highidx, sc->nodemask) {
if (!populated_zone(zone))
continue;
/*
@@ -2409,9 +2433,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
GFP_KERNEL | __GFP_HARDWALL))
continue;

- lru_pages += zone_reclaimable_pages(zone);
- node_set(zone_to_nid(zone), shrink.nodes_to_scan);
-
if (sc->priority != DEF_PRIORITY &&
!zone_reclaimable(zone))
continue; /* Let kswapd poll it */
@@ -2450,7 +2471,7 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
/* need some check for avoid more shrink_zone() */
}

- if (shrink_zone(zone, sc))
+ if (shrink_zone(zone, sc, zone_idx(zone) == requested_highidx))
reclaimable = true;

if (global_reclaim(sc) &&
@@ -2459,20 +2480,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
}

/*
- * Don't shrink slabs when reclaiming memory from over limit cgroups
- * but do shrink slab at least once when aborting reclaim for
- * compaction to avoid unevenly scanning file/anon LRU pages over slab
- * pages.
- */
- if (global_reclaim(sc)) {
- shrink_slab(&shrink, sc->nr_scanned, lru_pages);
- if (reclaim_state) {
- sc->nr_reclaimed += reclaim_state->reclaimed_slab;
- reclaim_state->reclaimed_slab = 0;
- }
- }
-
- /*
* Restore to original mask to avoid the impact on the caller if we
* promoted it to __GFP_HIGHMEM.
*/
@@ -2736,6 +2743,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
};
struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
int swappiness = mem_cgroup_swappiness(memcg);
+ unsigned long lru_pages;

sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
@@ -2751,7 +2759,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
* will pick up pages from other mem cgroup's as well. We hack
* the priority and make it zero.
*/
- shrink_lruvec(lruvec, swappiness, &sc);
+ shrink_lruvec(lruvec, swappiness, &sc, &lru_pages);

trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);

@@ -2932,15 +2940,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
static bool kswapd_shrink_zone(struct zone *zone,
int classzone_idx,
struct scan_control *sc,
- unsigned long lru_pages,
unsigned long *nr_attempted)
{
int testorder = sc->order;
unsigned long balance_gap;
- struct reclaim_state *reclaim_state = current->reclaim_state;
- struct shrink_control shrink = {
- .gfp_mask = sc->gfp_mask,
- };
bool lowmem_pressure;

/* Reclaim above the high watermark. */
@@ -2975,13 +2978,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
balance_gap, classzone_idx))
return true;

- shrink_zone(zone, sc);
- nodes_clear(shrink.nodes_to_scan);
- node_set(zone_to_nid(zone), shrink.nodes_to_scan);
-
- reclaim_state->reclaimed_slab = 0;
- shrink_slab(&shrink, sc->nr_scanned, lru_pages);
- sc->nr_reclaimed += reclaim_state->reclaimed_slab;
+ shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);

/* Account for the number of pages attempted to reclaim */
*nr_attempted += sc->nr_to_reclaim;
@@ -3042,7 +3039,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
count_vm_event(PAGEOUTRUN);

do {
- unsigned long lru_pages = 0;
unsigned long nr_attempted = 0;
bool raise_priority = true;
bool pgdat_needs_compaction = (order > 0);
@@ -3102,8 +3098,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
if (!populated_zone(zone))
continue;

- lru_pages += zone_reclaimable_pages(zone);
-
/*
* If any zone is currently balanced then kswapd will
* not call compaction as it is expected that the
@@ -3159,8 +3153,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
* that that high watermark would be met at 100%
* efficiency.
*/
- if (kswapd_shrink_zone(zone, end_zone, &sc,
- lru_pages, &nr_attempted))
+ if (kswapd_shrink_zone(zone, end_zone,
+ &sc, &nr_attempted))
raise_priority = false;
}

@@ -3612,10 +3606,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
.may_swap = 1,
};
- struct shrink_control shrink = {
- .gfp_mask = sc.gfp_mask,
- };
- unsigned long nr_slab_pages0, nr_slab_pages1;

cond_resched();
/*
@@ -3634,44 +3624,10 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
* priorities until we have enough memory freed.
*/
do {
- shrink_zone(zone, &sc);
+ shrink_zone(zone, &sc, true);
} while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0);
}

- nr_slab_pages0 = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
- if (nr_slab_pages0 > zone->min_slab_pages) {
- /*
- * shrink_slab() does not currently allow us to determine how
- * many pages were freed in this zone. So we take the current
- * number of slab pages and shake the slab until it is reduced
- * by the same nr_pages that we used for reclaiming unmapped
- * pages.
- */
- nodes_clear(shrink.nodes_to_scan);
- node_set(zone_to_nid(zone), shrink.nodes_to_scan);
- for (;;) {
- unsigned long lru_pages = zone_reclaimable_pages(zone);
-
- /* No reclaimable slab or very low memory pressure */
- if (!shrink_slab(&shrink, sc.nr_scanned, lru_pages))
- break;
-
- /* Freed enough memory */
- nr_slab_pages1 = zone_page_state(zone,
- NR_SLAB_RECLAIMABLE);
- if (nr_slab_pages1 + nr_pages <= nr_slab_pages0)
- break;
- }
-
- /*
- * Update nr_reclaimed by the number of slab pages we
- * reclaimed from this zone.
- */
- nr_slab_pages1 = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
- if (nr_slab_pages1 < nr_slab_pages0)
- sc.nr_reclaimed += nr_slab_pages0 - nr_slab_pages1;
- }
-
p->reclaim_state = NULL;
current->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE);
lockdep_clear_current_reclaim_state();
--
2.1.3

2014-11-11 06:34:26

by Dave Chinner

Subject: Re: [rfc patch] mm: vmscan: invoke slab shrinkers for each lruvec

On Mon, Nov 10, 2014 at 09:44:05AM -0500, Johannes Weiner wrote:
> On Mon, Nov 10, 2014 at 05:46:40PM +1100, Dave Chinner wrote:
> > On Thu, Nov 06, 2014 at 06:50:28PM -0500, Johannes Weiner wrote:
> > > The slab shrinkers currently rely on the reclaim code providing an
> > > ad-hoc concept of NUMA nodes that doesn't really exist: for all
> > > scanned zones and lruvecs, the nodes and available LRU pages are
> > > summed up, only to have the shrinkers then again walk that nodemask
> > > when scanning slab caches. This duplication will only get worse and
> > > more expensive once the shrinkers become aware of cgroups.
> >
> > As I said previously, it's not an "ad-hoc concept". It's exactly the
> > same NUMA abstraction that the VM presents to anyone who wants to
> > control memory allocation locality. i.e. node IDs and node masks.
>
> That's what the page allocator takes as input, but it's translated to
> zones fairly quickly, and it definitely doesn't exist in the reclaim
> code anymore. The way we reconstruct the node view on that level
> really is ad-hoc. And that is a problem. It's not just duplicative,
> it's also that reclaim really needs the ability to target types of
> zones in order to meet allocation constraints, and right now we just
> scatter-shoot slab objects and hope we free some desired zone memory.
>
> I don't see any intrinsic value in symmetry here, especially since we
> can easily encapsulate how the lists are organized for the shrinkers.
> Vladimir is already pushing the interface in that direction: add
> object, remove object, walk objects according to this shrink_control.

You probably don't see the value in symmetry because you look at it
from the inside. I look at shrinkers from outside the VM, and so I
see them from a very different view point. That is, I do not want
the internal architecture of the VM dictating how I must do things.
The shrinkers are extremely flexible and used in quite varied
subsystems for very different purposes. The shrinker API and
behaviour needs to be independent of the VM so we don't lose that
flexibility....
>
> > > Instead, invoke the shrinkers for each lruvec scanned - which is
> > > either the zone level, or in case of cgroups, the subset of a zone's
> > > pages that is charged to the scanned memcg. The number of eligible
> > > LRU pages is naturally available at that level - it is even more
> > > accurate than simply looking at the global state and the number of
> > > available swap pages, as get_scan_count() considers many other factors
> > > when deciding which LRU pages to consider.
> > >
> > > This invokes the shrinkers more often and with smaller page and scan
> > > counts, but the ratios remain the same, and the shrinkers themselves
> > > do not add significantly to the existing per-lruvec cost.
> >
> > That goes in the opposite direction to how we've found filesystem
> > shrinkers operate most effectively. i.e. larger batches with less
> > frequent reclaim callouts tend to result in more consistent
> > performance because shrinkers take locks and do IO that other
> > application operations get stuck behind (shrink_batch exists
> > for this reason ;).
>
> Kswapd, which does the majority of reclaim - unless pressure mounts
> too high - already invokes the shrinkers per-zone, so this seems like
> a granularity we should generally get away with.

Pressure almost always mounts too high on filesystem intensive
workloads. :/

i.e. Almost all significant memory allocation is done under GFP_NOFS
and almost all the kernel caches that need reclaim are filesystem
caches and hence will not do reclaim under GFP_NOFS allocation
contexts. Hence all reclaim gets deferred to kswapd very quickly
and so kswapd has to deal with the reclaim backlog of multiple CPUs
on each node generating significant memory pressure....
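
For reference, the pattern described here looks roughly like the sketch
below (fs_cache_scan() and its cache are illustrative, not code from any
particular filesystem): a filesystem shrinker refuses to do work from a
GFP_NOFS context, so the request is effectively deferred until kswapd
calls back in with __GFP_FS allowed.

#include <linux/gfp.h>
#include <linux/shrinker.h>

/*
 * Illustrative sketch only: a typical filesystem object-cache shrinker
 * bails out when the caller cannot re-enter the filesystem (no __GFP_FS),
 * leaving the work for kswapd, which scans with __GFP_FS allowed.
 */
static unsigned long fs_cache_scan(struct shrinker *shrink,
                                   struct shrink_control *sc)
{
        unsigned long freed = 0;

        /* Cannot take fs locks or issue IO from a GFP_NOFS context */
        if (!(sc->gfp_mask & __GFP_FS))
                return SHRINK_STOP;

        /* ... walk the cache and free up to sc->nr_to_scan objects ... */

        return freed;
}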

> > > It would be great if other people could weigh in, and possibly also
> > > expose it to their favorite filesystem and reclaim stress tests.
> >
> > Causes substantial increase in performance variability on inode
> > intensive benchmarks. On my standard fsmark benchmark, I see file
> > creates sit at around 250,000/sec, but there are several-second-long
> > dips to about 50,000/sec coinciding with the inode cache being
> > trimmed by several million inodes. Looking more deeply, this is due
> > to memory pressure driving inode writeback - we're reclaiming inodes
> > that haven't yet been written to disk, and so by reclaiming the
> > inode cache slab more frequently it's driving larger peaks of IO
> > and blocking ongoing filesystem operations more frequently.
> >
> > My initial thoughts are that this correlates with the above comments
> > I made about frequency of shrinker calls and batch sizes, so I
> > suspect that the aggregation of shrinker-based reclaim work is
> > necessary to minimise the interference that reclaim causes at the
> > filesystem level...
>
> Could you share the benchmark you are running?

I need to put this in the XFS FAQ. It's the same damn benchmark I've
been running for almost 5 years now....

https://lkml.org/lkml/2013/12/4/813

> Also, how many nodes
> does this machine have?

VM w/ 16p, 16Gb RAM, fakenuma=4, and a sparse 500TB filesystem image
on a pair of SSDs in RAID0.

> My suspicion is that invoking the shrinkers per-zone against the same
> old per-node lists incorrectly triples the slab pressure relative to
> the lru pressure. Kswapd already does this, but I'm guessing it
> matters less under lower pressure, and once pressure mounts direct
> reclaim takes over, which my patch made more aggressive.
>
> Organizing the slab objects on per-zone lists would solve this, and
> would also allow proper zone reclaim to meet allocation restraints.

Which is exactly the sort of increase in per-object cache + LRUs we
want to avoid. Nothing outside the VM has the concept of zones - all
our object caches are either global or per-node pools or LRUs. We
need to keep some semblance of global LRU in filesystem
caches because objects are hierarchical in importance. Hence the
more finely grained we chop up the LRUs the worse our working set
maintenance gets.
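
For context, the per-node LRUs referred to here are typically built on
the existing list_lru API; a rough sketch of a NUMA-aware shrinker on
top of it follows (the my_cache_* names are illustrative, not taken
from any real subsystem):

#include <linux/list_lru.h>
#include <linux/shrinker.h>

/* Illustrative per-node cache: one list_lru, segmented by NUMA node */
static struct list_lru my_cache_lru;

static enum lru_status my_cache_isolate(struct list_head *item,
                                        spinlock_t *lru_lock, void *arg)
{
        /* Illustrative: drop the object from the per-node list (freeing elided) */
        list_del_init(item);
        return LRU_REMOVED;
}

static unsigned long my_cache_count(struct shrinker *shrink,
                                    struct shrink_control *sc)
{
        /* Only the node named in sc->nid is reported */
        return list_lru_count_node(&my_cache_lru, sc->nid);
}

static unsigned long my_cache_scan(struct shrinker *shrink,
                                   struct shrink_control *sc)
{
        unsigned long nr_to_walk = sc->nr_to_scan;

        /* Only the node named in sc->nid is walked */
        return list_lru_walk_node(&my_cache_lru, sc->nid,
                                  my_cache_isolate, NULL, &nr_to_walk);
}

/* Registered with register_shrinker() after list_lru_init(&my_cache_lru) */
static struct shrinker my_cache_shrinker = {
        .count_objects  = my_cache_count,
        .scan_objects   = my_cache_scan,
        .seeks          = DEFAULT_SEEKS,
        .flags          = SHRINKER_NUMA_AWARE,
};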

Indeed, we've moved away from VM based page caches for our complex
subsystems because the VM simply cannot express the reclaim
relationships necessary for efficient maintenance of the working
set. The working set has nothing to do with the locality of the
objects maintained by the cache - it's all about the relationships
between the objects. Evicting objects based on memory locality does
not work, and that's why we do not use the VM based page caches.

This is precisely why I do not want the internal architecture of the
VM to dictate how we cache and reclaim objects - the VM just has no
concept of what the caches are used for and so therefore should not
be dictating structure of the cache or its reclaim algorithms.
Shrinkers are simply a method of applying a measure of pressure to a
cache, they do not and should not define how caches are structured
or how they scale.

> But for now, how about the following patch that invokes the shrinkers
> once per node, but uses the allocator's preferred zone as the pivot
> for nr_scanned / nr_eligible instead of summing up the node? That
> doesn't increase the current number of slab invocations, actually
> reduces them for kswapd, but would put the shrinker calls at a more
> natural level for reclaim, and would also allow us to go ahead with
> the cgroup-aware slab shrinkers.

I'll try to test it tomorrow.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2014-11-11 19:53:31

by Johannes Weiner

[permalink] [raw]
Subject: Re: [rfc patch] mm: vmscan: invoke slab shrinkers for each lruvec

On Tue, Nov 11, 2014 at 05:34:19PM +1100, Dave Chinner wrote:
> On Mon, Nov 10, 2014 at 09:44:05AM -0500, Johannes Weiner wrote:
> > On Mon, Nov 10, 2014 at 05:46:40PM +1100, Dave Chinner wrote:
> > > On Thu, Nov 06, 2014 at 06:50:28PM -0500, Johannes Weiner wrote:
> > > > The slab shrinkers currently rely on the reclaim code providing an
> > > > ad-hoc concept of NUMA nodes that doesn't really exist: for all
> > > > scanned zones and lruvecs, the nodes and available LRU pages are
> > > > summed up, only to have the shrinkers then again walk that nodemask
> > > > when scanning slab caches. This duplication will only get worse and
> > > > more expensive once the shrinkers become aware of cgroups.
> > >
> > > As I said previously, it's not an "ad-hoc concept". It's exactly the
> > > same NUMA abstraction that the VM presents to anyone who wants to
> > > control memory allocation locality. i.e. node IDs and node masks.
> >
> > That's what the page allocator takes as input, but it's translated to
> > zones fairly quickly, and it definitely doesn't exist in the reclaim
> > code anymore. The way we reconstruct the node view on that level
> > really is ad-hoc. And that is a problem. It's not just duplicative,
> > it's also that reclaim really needs the ability to target types of
> > zones in order to meet allocation constraints, and right now we just
> > scatter-shoot slab objects and hope we free some desired zone memory.
> >
> > I don't see any intrinsic value in symmetry here, especially since we
> > can easily encapsulate how the lists are organized for the shrinkers.
> > Vladimir is already pushing the interface in that direction: add
> > object, remove object, walk objects according to this shrink_control.
>
> You probably don't see the value in symmetry because you look at it
> from the inside. I look at shrinkers from outside the VM, and so I
> see them from a very different view point. That is, I do not want
> the internal architecture of the VM dictating how I must do things.
> The shrinkers are extremely flexible and used in quite varied
> subsystems for very different purposes. The shrinker API and
> behaviour needs to be independent of the VM so we don't lose that
> flexibility....
>
> > My suspicion is that invoking the shrinkers per-zone against the same
> > old per-node lists incorrectly triples the slab pressure relative to
> > the lru pressure. Kswapd already does this, but I'm guessing it
> > matters less under lower pressure, and once pressure mounts direct
> > reclaim takes over, which my patch made more aggressive.
> >
> > Organizing the slab objects on per-zone lists would solve this, and
> > would also allow proper zone reclaim to meet allocation restraints.
>
> Which is exactly the sort of increase in per-object cache + LRUs we
> want to avoid. Nothing outside the VM has the concept of zones - all
> our object caches are either global or per-node pools or LRUs. We
> > need to keep some semblance of global LRU in filesystem
> > caches because objects are hierarchical in importance. Hence the
> more finely grained we chop up the LRUs the worse our working set
> maintenance gets.
>
> Indeed, we've moved away from VM based page caches for our complex
> subsystems because the VM simply cannot express the reclaim
> relationships necessary for efficient maintenance of the working
> set. The working set has nothing to do with the locality of the
> objects maintained by the cache - it's all about the relationships
> between the objects. Evicting objects based on memory locality does
> not work, and that's why we do not use the VM based page caches.
>
> This is precisely why I do not want the internal architecture of the
> VM to dictate how we cache and reclaim objects - the VM just has no
> concept of what the caches are used for and so therefore should not
> be dictating structure of the cache or its reclaim algorithms.
> Shrinkers are simply a method of applying a measure of pressure to a
> cache, they do not and should not define how caches are structured
> or how they scale.

But that measure of pressure is defined by local activity. How can we
reasonably derive a global measure of pressure from independent local
situations?

The scanned/eligible ratio can't be meaningful without a unit attached
to it, and that unit is currently implementation defined. Just try to
figure out what it means for kswapd, direct reclaim, zone reclaim etc.

"One NUMA node" is a reasonable unit, and with NUMA-awareness in the
shrinkers we can finally have a meaningful quantity of pressure that
we can translate between the VM and auxiliary object caches. My 2nd
patch should at least make all reclaim paths agree on this unit and
invoke the shrinkers with well-defined arguments.
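
Concretely, the quantity in question is the scan target that
shrink_slab() derives from that ratio - roughly the following
(a simplified sketch of the proportional scaling, not the literal
kernel code):

/*
 * Simplified sketch of the proportional pressure calculation: the
 * object scan target is the shrinker's freeable count, scaled by the
 * fraction of eligible LRU pages that page reclaim scanned in the same
 * context, weighted by the relative cost of recreating an object
 * (->seeks).
 */
static unsigned long objects_to_scan(unsigned long freeable,
                                     unsigned long nr_scanned,
                                     unsigned long nr_eligible,
                                     unsigned int seeks)
{
        unsigned long delta;

        delta = (4UL * nr_scanned) / seeks;     /* cost-weighted page pressure */
        delta *= freeable;                      /* scale to the object population */
        delta /= nr_eligible + 1;               /* normalize by eligible LRU pages */

        return delta;
}

The open question is then simply what nr_scanned and nr_eligible
cover: a zone, a node, or the whole system.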

This is still not optimal for the VM and zone reclaim will age slab
objects in an unrelated zone on the same node disproportionately, but
we can probably live with that for the most part. It's an acceptable
trade-off when filesystems can group related objects on the same node.

Global object lists on the other hand are simply broken when we drive
them by node-local pressure - which is the best the VM can do. They
can work by accident, but they'll do the wrong thing depending on the
workload, and they'll be tuned to implementation details in the VM.
You might as well drive them by a timer or some other random event.

Using the shrinkers like that is not flexibility, it's just undefined
behavior. I wish we could stop this abuse.

> > But for now, how about the following patch that invokes the shrinkers
> > once per node, but uses the allocator's preferred zone as the pivot
> > for nr_scanned / nr_eligible instead of summing up the node? That
> > doesn't increase the current number of slab invocations, actually
> > reduces them for kswapd, but would put the shrinker calls at a more
> > natural level for reclaim, and would also allow us to go ahead with
> > the cgroup-aware slab shrinkers.
>
> I'll try to test it tomorrow.

Thanks!

2014-11-13 03:06:56

by Dave Chinner

[permalink] [raw]
Subject: Re: [rfc patch] mm: vmscan: invoke slab shrinkers for each lruvec

On Tue, Nov 11, 2014 at 02:53:10PM -0500, Johannes Weiner wrote:
> On Tue, Nov 11, 2014 at 05:34:19PM +1100, Dave Chinner wrote:
> > On Mon, Nov 10, 2014 at 09:44:05AM -0500, Johannes Weiner wrote:
> > > On Mon, Nov 10, 2014 at 05:46:40PM +1100, Dave Chinner wrote:
> > > > On Thu, Nov 06, 2014 at 06:50:28PM -0500, Johannes Weiner wrote:
> > > > > The slab shrinkers currently rely on the reclaim code providing an
> > > > > ad-hoc concept of NUMA nodes that doesn't really exist: for all
> > > > > scanned zones and lruvecs, the nodes and available LRU pages are
> > > > > summed up, only to have the shrinkers then again walk that nodemask
> > > > > when scanning slab caches. This duplication will only get worse and
> > > > > more expensive once the shrinkers become aware of cgroups.
> > > >
> > > > As I said previously, it's not an "ad-hoc concept". It's exactly the
> > > > same NUMA abstraction that the VM presents to anyone who wants to
> > > > control memory allocation locality. i.e. node IDs and node masks.
> > >
> > > That's what the page allocator takes as input, but it's translated to
> > > zones fairly quickly, and it definitely doesn't exist in the reclaim
> > > code anymore. The way we reconstruct the node view on that level
> > > really is ad-hoc. And that is a problem. It's not just duplicative,
> > > it's also that reclaim really needs the ability to target types of
> > > zones in order to meet allocation constraints, and right now we just
> > > scatter-shoot slab objects and hope we free some desired zone memory.
> > >
> > > I don't see any intrinsic value in symmetry here, especially since we
> > > can easily encapsulate how the lists are organized for the shrinkers.
> > > Vladimir is already pushing the interface in that direction: add
> > > object, remove object, walk objects according to this shrink_control.
> >
> > You probably don't see the value in symmetry because you look at it
> > from the inside. I look at shrinkers from outside the VM, and so I
> > see them from a very different view point. That is, I do not want
> > the internal architecture of the VM dictating how I must do things.
> > The shrinkers are extremely flexible and used in quite varied
> > subsystems for very different purposes. The shrinker API and
> > behaviour needs to be independent of the VM so we don't lose that
> > flexibility....
> >
> > > My suspicion is that invoking the shrinkers per-zone against the same
> > > old per-node lists incorrectly triples the slab pressure relative to
> > > the lru pressure. Kswapd already does this, but I'm guessing it
> > > matters less under lower pressure, and once pressure mounts direct
> > > reclaim takes over, which my patch made more aggressive.
> > >
> > > Organizing the slab objects on per-zone lists would solve this, and
> > > would also allow proper zone reclaim to meet allocation restraints.
> >
> > Which is exactly the sort of increase in per-object cache + LRUs we
> > want to avoid. Nothing outside the VM has the concept of zones - all
> > our object caches are either global or per-node pools or LRUs. We
> > need to keep some semblance of global LRU in filesystem
> > caches because objects are hierarchical in importance. Hence the
> > more finely grained we chop up the LRUs the worse our working set
> > maintenance gets.
> >
> > Indeed, we've moved away from VM based page caches for our complex
> > subsystems because the VM simply cannot express the reclaim
> > relationships necessary for efficient maintenance of the working
> > set. The working set has nothing to do with the locality of the
> > objects maintained by the cache - it's all about the relationships
> > between the objects. Evicting objects based on memory locality does
> > not work, and that's why we do not use the VM based page caches.
> >
> > This is precisely why I do not want the internal architecture of the
> > VM to dictate how we cache and reclaim objects - the VM just has no
> > concept of what the caches are used for and so therefore should not
> > be dictating structure of the cache or its reclaim algorithms.
> > Shrinkers are simply a method of applying a measure of pressure to a
> > cache, they do not and should not define how caches are structured
> > or how they scale.
>
> But that measure of pressure is defined by local activity. How can we
> reasonably derive a global measure of pressure from independent local
> situations?
>
> The scanned/eligible ratio can't be meaningful without a unit attached
> to it, and that unit is currently implementation defined. Just try to
> figure out what it means for kswapd, direct reclaim, zone reclaim etc.
>
> "One NUMA node" is a reasonable unit, and with NUMA-awareness in the
> shrinkers we can finally have a meaningful quantity of pressure that
> we can translate between the VM and auxiliary object caches. My 2nd
> patch should at least make all reclaim paths agree on this unit and
> invoke the shrinkers with well-defined arguments.
>
> This is still not optimal for the VM and zone reclaim will age slab
> objects in an unrelated zone on the same node disproportionately, but
> we can probably live with that for the most part. It's an acceptable
> trade-off when filesystems can group related objects on the same node.
>

Just because a subsystem can group objects by memory locality doesn't
mean it's a good idea. In fact, it's actively harmful to do so at
times.

An example from XFS - the metadata buffer cache. A metadata buffer
in memory is accessed instantly. In contrast, a metadata buffer that
has to be read from disk can take seconds to be read under heavy IO
load and memory pressure. We have a global filesystem metadata
buffer LRU that the shrinker walks and so we have global working set
management.

Cleaning dirty pages on a node under memory pressure often requires
allocation (i.e. delayed allocation) during ->writepage. Every
single allocation or freeing operation needs an allocation group
header buffer to be in memory as it's the root of the btree that
indexes free space.

So, let's make that LRU per-node and the shrinker NUMA aware and
follow VM reclaim locality node masks. The AG header buffer is on
the node with memory pressure, so we turf it out of cache because it
got to the end of the LRU.

Now memory reclaim goes back to scanning the page LRU because we
still need more pages. We find a dirty page, go to write it back,
and then we find we need the AG header for allocation. We now need to
allocate pages for the buffer, then read it in from disk before we
can allocate the blocks needed to clean the page. Because this is
kswapd that is running, allocation is node-local and occurs on the
node we have memory pressure on, increasing the amount of memory
pressure on the node. This causes more reclaim from the local buffer
cache LRU, leaving only actively referenced buffers on the LRU.

Eventually we finish the page LRU scanning pass and then we run the
shrinkers again. They trigger background cleaning of dirty buffers,
then the buffer cache shrinker sees the AG header buffer as one of
the only unreferenced buffers on the LRU for that node, so reclaims
it.

Memory pressure continues, page LRU scanning goes to write another
dirty page, and the AG header read/reclaim cycle starts again.
The result of this behaviour is that page writeback is several
orders of magnitude slower than it should be because we are not
maintaining the working set of metadata buffers in memory.

Caches are funny like this: if you ignore the dependencies between
them then the system performs like crap. The page cache does not
exist in isolation - it is dependent on several other caches (and
therefore shrinkers) behaving appropriately under memory pressure
for its algorithms to work effectively.

> Global object lists on the other hand are simply broken when we drive
> them by node-local pressure - which is the best the VM can do. They
> can work by accident, but they'll do the wrong thing depending on the
> workload, and they'll be tuned to implementation details in the VM.
> You might as well drive them by a timer or some other random event.

Seriously? The above metadata buffer LRU is a global object list
because the buffers are owned by the filesystem, not the VM. Access
patterns to the filesystem metadata determine what gets reclaimed,
not memory locality, and as the above example shows memory reclaim
performance is dependent on the filesystem ignoring memory reclaim
locality hints for critical filesystem metadata.
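
In shrinker terms, that pattern looks roughly like the sketch below
(the global_cache_* callbacks and the count/prune helpers are
hypothetical stand-ins, not XFS code): eviction order is global object
importance, so the node hint in sc->nid is at most advisory.

#include <linux/shrinker.h>

/* Hypothetical helpers walking one globally ordered LRU */
static unsigned long count_global_lru(void);
static unsigned long prune_global_lru(unsigned long nr_to_scan);

static unsigned long global_cache_count(struct shrinker *shrink,
                                        struct shrink_control *sc)
{
        /* No per-node breakdown to report - the LRU is shared */
        return count_global_lru();
}

static unsigned long global_cache_scan(struct shrinker *shrink,
                                       struct shrink_control *sc)
{
        /* sc->nid is ignored: eviction follows object importance, not locality */
        return prune_global_lru(sc->nr_to_scan);
}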

Perhaps another view:

The VM has a nice, neat bunch of LRUs with well-defined
locality derived from the hardware it directly manages.

Filesystems have nice, neat caches with well-defined locality
and reclamation algorithms that are derived directly from the
hardware they manage.

Graphics memory pools are nice, neat caches with well-defined
locality and reclamation algorithms that are derived directly from
the hardware they manage.

What's the difference between them? They all have different
definitions of "locality".

In the case of filesystems, locality of IO and using caches to
avoid repeated IO by maintaining the working set in memory is
*always* going to be more important than memory locality. Memory
locality of cached objects has almost no impact on filesystem
performance, and so when it comes to choosing between "working set
cached in memory" and "node low on memory" we are *always* going to
choose to keep the working set cached in memory.

IOWs, what fits the VM does not fit filesystems which does not fit
what graphics drivers need. The *best* we can do is supply some
abstract measure of *how much work to do* and hints about *where to
apply the work*. We have to leave it up to the subsystems to best
manage the impedance mismatch between what the VM wants and what the
subsystem can do - the VM cannot do this and because of that it
can't dictate how reclaim should be performed by shrinkers...

> Using the shrinkers like that is not flexibility, it's just undefined
> behavior. I wish we could stop this abuse.

It's not undefined, nor is it abuse - it's just not how you wish it
worked. There is no "one size fits all" answer here, and trying to
force shrinkers to behave according to the VM's idea of "reclaim
locality" instead of their own is counter-productive.

If you want to make things better, start by trying to implement
zonelist page reclaim itself as shrinker callouts, so that all memory
reclaim is controlled by the shrinker infrastructure. Just try it as
a mental exercise - it might make you think about this whole "caches
are all different" from a new perspective....

Cheers,

Dave.
--
Dave Chinner
[email protected]