From: Mel Gorman <[email protected]>
This series started with the idea of moving LRU lists to pgdat, but this
part was more important to tackle first. It was written against 4.2-rc1
but applies to 4.2-rc3.
The zonelist cache has been around for a long time but it is of dubious
merit and carries a lot of complexity. There are a few reasons why it needs
attention, explained in the first patch, but the most important is that a
failed THP allocation can cause a zone to be treated as "full". This
potentially causes unnecessary stalls, reclaim activity or remote fallbacks.
Maybe the issues could be fixed but it's not worth it. The series places a
small number of other micro-optimisations on top before examining the
watermarks.
High-order watermarks can cause high-order allocations to fail even though
pages are free. They were originally introduced to protect high-order atomic
allocations, but that can be handled much better using migrate types. This
series uses page grouping by mobility to preserve some pageblocks for
high-order allocations, with the size of the reservation depending on
demand. kswapd awareness is maintained by examining the free lists. By
patch 10 in this series there are no high-order watermark checks left,
while the properties that motivated their introduction are preserved.
An interesting side-effect of this series is that high-order atomic
allocations should be a lot more reliable as long as they start before heavy
fragmentation or memory pressure is encountered. This is because the
reserves are sized dynamically instead of depending on MIGRATE_RESERVE
alone. The traditional expected case here is atomic allocations for network
buffers using jumbo frames on devices that cannot handle scatter/gather. In
aggressive tests the failure rate of atomic order-3 allocations is reduced
by 98%. I would be very interested in hearing from someone who uses jumbo
frames with hardware that requires high-order atomic allocations to succeed
and who can test this series.
A potential side-effect of this series may be of interest to developers of
embedded platforms. There have been a number of patches recently aimed at
making high-order allocations fast or reliable in various different ways.
Usually they came under the heading of compaction, but they are likely to
have had limited success without modifying how grouping works. One patch
attempted to introduce an interface that allowed userspace to dump all of
memory in an attempt to make high-order allocations faster, which is
definitely a bad idea. With this series they get two other options as
out-of-tree patches:
1. Alter patch 9 of this series to only call unreserve_highatomic_pageblock
   if the system is about to go OOM (a rough sketch follows after this
   list). This should drop the failure rate for high-order atomic
   allocations to 0 or near 0 in a lot of cases. Your mileage will depend
   on the workload. Such a change would not suit mainline because it would
   push a lot of workloads into reclaim in cases where the HighAtomic
   reserves are too large.
2. Alter patch 9 of this series to reserve space for all high-order kernel
   allocations, not just atomic ones. This will make the high-order
   allocations more reliable and in many cases faster. However, the caveat
   may be excessive reclaim if those reserves become a large percentage of
   memory. I would recommend that you still try to avoid ever depending
   on high-order allocations for functional correctness. Alternatively,
   keep them as short-lived as possible so they fit in a small reserve.
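
As a minimal sketch of option 1, based on the call site patch 9 adds in
__alloc_pages_direct_reclaim: should_unreserve_highatomic() is a
hypothetical predicate (e.g. "direct reclaim and compaction have both
failed and the OOM killer is the next step") and is not part of this series.

	/*
	 * Sketch of out-of-tree option 1: only spill the HighAtomic reserve
	 * back to the request's migratetype when the system is close to OOM.
	 * should_unreserve_highatomic() is hypothetical.
	 */
	if (!page && !drained) {
		if (should_unreserve_highatomic(ac))
			unreserve_highatomic_pageblock(ac);
		drain_all_pages(NULL);
		drained = true;
		goto retry;
	}
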
With or without the out-of-tree modifications, this series should work well
with compaction series that aim to make more pages migratable so high-order
allocations are more successful.
include/linux/cpuset.h | 6 +
include/linux/gfp.h | 47 +++-
include/linux/mmzone.h | 94 +-------
init/main.c | 2 +-
mm/huge_memory.c | 2 +-
mm/internal.h | 1 +
mm/page_alloc.c | 565 ++++++++++++++-----------------------------------
mm/slab.c | 4 +-
mm/slob.c | 4 +-
mm/slub.c | 6 +-
mm/vmscan.c | 4 +-
mm/vmstat.c | 2 +-
12 files changed, 228 insertions(+), 509 deletions(-)
--
2.4.3
From: Mel Gorman <[email protected]>
The zonelist cache (zlc) was introduced to skip over zones that were
recently known to be full. At the time the paths it bypassed were the
cpuset checks, the watermark calculations and zone_reclaim. The situation
today is different and the complexity of zlc is harder to justify.
1) The cpuset checks are no-ops unless a cpuset is active and in general are
   a lot cheaper than they used to be.
2) zone_reclaim is now disabled by default and I suspect that was a large
source of the cost that zlc wanted to avoid. When it is enabled, it's
known to be a major source of stalling when nodes fill up and it's
unwise to hit every other user with the overhead.
3) Watermark checks are expensive to calculate for high-order
allocation requests. Later patches in this series will reduce the cost of
the watermark checking.
4) The most important issue is that in the current implementation it
is possible for a failed THP allocation to mark a zone full for order-0
allocations and cause a fallback to remote nodes.
The last issue could be addressed with additional complexity but it's
not clear that we need zlc at all so this patch deletes it. If stalls
due to repeated zone_reclaim are ever reported as an issue then we should
introduce deferring logic based on a timeout inside zone_reclaim itself
and leave the page allocator fast paths alone.
Impact on page-allocator microbenchmarks is negligible as they don't hit
the paths where the zlc comes into play. The impact was noticeable in a
workload called "stutter". One part uses a lot of anonymous memory, a
second measures mmap latency and a third copies a large file. In an ideal
world the application measuring mmap latency would not notice the activity
of the others. On a 4-node machine the results of this patch are
4-node machine stutter
4.2.0-rc1 4.2.0-rc1
vanilla nozlc-v1r20
Min mmap 53.9902 ( 0.00%) 49.3629 ( 8.57%)
1st-qrtle mmap 54.6776 ( 0.00%) 54.1201 ( 1.02%)
2nd-qrtle mmap 54.9242 ( 0.00%) 54.5961 ( 0.60%)
3rd-qrtle mmap 55.1817 ( 0.00%) 54.9338 ( 0.45%)
Max-90% mmap 55.3952 ( 0.00%) 55.3929 ( 0.00%)
Max-93% mmap 55.4766 ( 0.00%) 57.5712 ( -3.78%)
Max-95% mmap 55.5522 ( 0.00%) 57.8376 ( -4.11%)
Max-99% mmap 55.7938 ( 0.00%) 63.6180 (-14.02%)
Max mmap 6344.0292 ( 0.00%) 67.2477 ( 98.94%)
Mean mmap 57.3732 ( 0.00%) 54.5680 ( 4.89%)
Note the maximum stall latency, which was over 6 seconds and becomes 67ms
with this patch applied. However, also note that this benchmark is not
guaranteed to hit pathological cases every time and the mileage varies.
There is a secondary impact with more direct reclaim because zones are now
being considered instead of being skipped by the zlc.
4.1.0 4.1.0
vanilla nozlc-v1r4
Swap Ins 838 502
Swap Outs 1149395 2622895
DMA32 allocs 17839113 15863747
Normal allocs 129045707 137847920
Direct pages scanned 4070089 29046893
Kswapd pages scanned 17147837 17140694
Kswapd pages reclaimed 17146691 17139601
Direct pages reclaimed 1888879 4886630
Kswapd efficiency 99% 99%
Kswapd velocity 17523.721 17518.928
Direct efficiency 46% 16%
Direct velocity 4159.306 29687.854
Percentage direct scans 19% 62%
Page writes by reclaim 1149395.000 2622895.000
Page writes file 0 0
Page writes anon 1149395 2622895
The increase in the direct page scan and reclaim rates is noticeable. It is
possible this will not be a universal win on all workloads, but cycling
through zonelists waiting for zlc->last_full_zap to expire is not the right
decision.
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 73 -----------------
mm/page_alloc.c | 217 +------------------------------------------------
2 files changed, 2 insertions(+), 288 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 754c25966a0a..754289f371fa 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -585,75 +585,6 @@ static inline bool zone_is_empty(struct zone *zone)
* [1] : No fallback (__GFP_THISNODE)
*/
#define MAX_ZONELISTS 2
-
-
-/*
- * We cache key information from each zonelist for smaller cache
- * footprint when scanning for free pages in get_page_from_freelist().
- *
- * 1) The BITMAP fullzones tracks which zones in a zonelist have come
- * up short of free memory since the last time (last_fullzone_zap)
- * we zero'd fullzones.
- * 2) The array z_to_n[] maps each zone in the zonelist to its node
- * id, so that we can efficiently evaluate whether that node is
- * set in the current tasks mems_allowed.
- *
- * Both fullzones and z_to_n[] are one-to-one with the zonelist,
- * indexed by a zones offset in the zonelist zones[] array.
- *
- * The get_page_from_freelist() routine does two scans. During the
- * first scan, we skip zones whose corresponding bit in 'fullzones'
- * is set or whose corresponding node in current->mems_allowed (which
- * comes from cpusets) is not set. During the second scan, we bypass
- * this zonelist_cache, to ensure we look methodically at each zone.
- *
- * Once per second, we zero out (zap) fullzones, forcing us to
- * reconsider nodes that might have regained more free memory.
- * The field last_full_zap is the time we last zapped fullzones.
- *
- * This mechanism reduces the amount of time we waste repeatedly
- * reexaming zones for free memory when they just came up low on
- * memory momentarilly ago.
- *
- * The zonelist_cache struct members logically belong in struct
- * zonelist. However, the mempolicy zonelists constructed for
- * MPOL_BIND are intentionally variable length (and usually much
- * shorter). A general purpose mechanism for handling structs with
- * multiple variable length members is more mechanism than we want
- * here. We resort to some special case hackery instead.
- *
- * The MPOL_BIND zonelists don't need this zonelist_cache (in good
- * part because they are shorter), so we put the fixed length stuff
- * at the front of the zonelist struct, ending in a variable length
- * zones[], as is needed by MPOL_BIND.
- *
- * Then we put the optional zonelist cache on the end of the zonelist
- * struct. This optional stuff is found by a 'zlcache_ptr' pointer in
- * the fixed length portion at the front of the struct. This pointer
- * both enables us to find the zonelist cache, and in the case of
- * MPOL_BIND zonelists, (which will just set the zlcache_ptr to NULL)
- * to know that the zonelist cache is not there.
- *
- * The end result is that struct zonelists come in two flavors:
- * 1) The full, fixed length version, shown below, and
- * 2) The custom zonelists for MPOL_BIND.
- * The custom MPOL_BIND zonelists have a NULL zlcache_ptr and no zlcache.
- *
- * Even though there may be multiple CPU cores on a node modifying
- * fullzones or last_full_zap in the same zonelist_cache at the same
- * time, we don't lock it. This is just hint data - if it is wrong now
- * and then, the allocator will still function, perhaps a bit slower.
- */
-
-
-struct zonelist_cache {
- unsigned short z_to_n[MAX_ZONES_PER_ZONELIST]; /* zone->nid */
- DECLARE_BITMAP(fullzones, MAX_ZONES_PER_ZONELIST); /* zone full? */
- unsigned long last_full_zap; /* when last zap'd (jiffies) */
-};
-#else
-#define MAX_ZONELISTS 1
-struct zonelist_cache;
#endif
/*
@@ -683,11 +614,7 @@ struct zoneref {
* zonelist_node_idx() - Return the index of the node for an entry
*/
struct zonelist {
- struct zonelist_cache *zlcache_ptr; // NULL or &zlcache
struct zoneref _zonerefs[MAX_ZONES_PER_ZONELIST + 1];
-#ifdef CONFIG_NUMA
- struct zonelist_cache zlcache; // optional ...
-#endif
};
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 506eac8b38af..8db0b6d66165 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2248,122 +2248,6 @@ bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
}
#ifdef CONFIG_NUMA
-/*
- * zlc_setup - Setup for "zonelist cache". Uses cached zone data to
- * skip over zones that are not allowed by the cpuset, or that have
- * been recently (in last second) found to be nearly full. See further
- * comments in mmzone.h. Reduces cache footprint of zonelist scans
- * that have to skip over a lot of full or unallowed zones.
- *
- * If the zonelist cache is present in the passed zonelist, then
- * returns a pointer to the allowed node mask (either the current
- * tasks mems_allowed, or node_states[N_MEMORY].)
- *
- * If the zonelist cache is not available for this zonelist, does
- * nothing and returns NULL.
- *
- * If the fullzones BITMAP in the zonelist cache is stale (more than
- * a second since last zap'd) then we zap it out (clear its bits.)
- *
- * We hold off even calling zlc_setup, until after we've checked the
- * first zone in the zonelist, on the theory that most allocations will
- * be satisfied from that first zone, so best to examine that zone as
- * quickly as we can.
- */
-static nodemask_t *zlc_setup(struct zonelist *zonelist, int alloc_flags)
-{
- struct zonelist_cache *zlc; /* cached zonelist speedup info */
- nodemask_t *allowednodes; /* zonelist_cache approximation */
-
- zlc = zonelist->zlcache_ptr;
- if (!zlc)
- return NULL;
-
- if (time_after(jiffies, zlc->last_full_zap + HZ)) {
- bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
- zlc->last_full_zap = jiffies;
- }
-
- allowednodes = !in_interrupt() && (alloc_flags & ALLOC_CPUSET) ?
- &cpuset_current_mems_allowed :
- &node_states[N_MEMORY];
- return allowednodes;
-}
-
-/*
- * Given 'z' scanning a zonelist, run a couple of quick checks to see
- * if it is worth looking at further for free memory:
- * 1) Check that the zone isn't thought to be full (doesn't have its
- * bit set in the zonelist_cache fullzones BITMAP).
- * 2) Check that the zones node (obtained from the zonelist_cache
- * z_to_n[] mapping) is allowed in the passed in allowednodes mask.
- * Return true (non-zero) if zone is worth looking at further, or
- * else return false (zero) if it is not.
- *
- * This check -ignores- the distinction between various watermarks,
- * such as GFP_HIGH, GFP_ATOMIC, PF_MEMALLOC, ... If a zone is
- * found to be full for any variation of these watermarks, it will
- * be considered full for up to one second by all requests, unless
- * we are so low on memory on all allowed nodes that we are forced
- * into the second scan of the zonelist.
- *
- * In the second scan we ignore this zonelist cache and exactly
- * apply the watermarks to all zones, even it is slower to do so.
- * We are low on memory in the second scan, and should leave no stone
- * unturned looking for a free page.
- */
-static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zoneref *z,
- nodemask_t *allowednodes)
-{
- struct zonelist_cache *zlc; /* cached zonelist speedup info */
- int i; /* index of *z in zonelist zones */
- int n; /* node that zone *z is on */
-
- zlc = zonelist->zlcache_ptr;
- if (!zlc)
- return 1;
-
- i = z - zonelist->_zonerefs;
- n = zlc->z_to_n[i];
-
- /* This zone is worth trying if it is allowed but not full */
- return node_isset(n, *allowednodes) && !test_bit(i, zlc->fullzones);
-}
-
-/*
- * Given 'z' scanning a zonelist, set the corresponding bit in
- * zlc->fullzones, so that subsequent attempts to allocate a page
- * from that zone don't waste time re-examining it.
- */
-static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
-{
- struct zonelist_cache *zlc; /* cached zonelist speedup info */
- int i; /* index of *z in zonelist zones */
-
- zlc = zonelist->zlcache_ptr;
- if (!zlc)
- return;
-
- i = z - zonelist->_zonerefs;
-
- set_bit(i, zlc->fullzones);
-}
-
-/*
- * clear all zones full, called after direct reclaim makes progress so that
- * a zone that was recently full is not skipped over for up to a second
- */
-static void zlc_clear_zones_full(struct zonelist *zonelist)
-{
- struct zonelist_cache *zlc; /* cached zonelist speedup info */
-
- zlc = zonelist->zlcache_ptr;
- if (!zlc)
- return;
-
- bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
-}
-
static bool zone_local(struct zone *local_zone, struct zone *zone)
{
return local_zone->node == zone->node;
@@ -2374,28 +2258,7 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <
RECLAIM_DISTANCE;
}
-
#else /* CONFIG_NUMA */
-
-static nodemask_t *zlc_setup(struct zonelist *zonelist, int alloc_flags)
-{
- return NULL;
-}
-
-static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zoneref *z,
- nodemask_t *allowednodes)
-{
- return 1;
-}
-
-static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
-{
-}
-
-static void zlc_clear_zones_full(struct zonelist *zonelist)
-{
-}
-
static bool zone_local(struct zone *local_zone, struct zone *zone)
{
return true;
@@ -2405,7 +2268,6 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
{
return true;
}
-
#endif /* CONFIG_NUMA */
static void reset_alloc_batches(struct zone *preferred_zone)
@@ -2432,9 +2294,6 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
struct zoneref *z;
struct page *page = NULL;
struct zone *zone;
- nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
- int zlc_active = 0; /* set if using zonelist_cache */
- int did_zlc_setup = 0; /* just call zlc_setup() one time */
bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) &&
(gfp_mask & __GFP_WRITE);
int nr_fair_skipped = 0;
@@ -2451,9 +2310,6 @@ zonelist_scan:
ac->nodemask) {
unsigned long mark;
- if (IS_ENABLED(CONFIG_NUMA) && zlc_active &&
- !zlc_zone_worth_trying(zonelist, z, allowednodes))
- continue;
if (cpusets_enabled() &&
(alloc_flags & ALLOC_CPUSET) &&
!cpuset_zone_allowed(zone, gfp_mask))
@@ -2511,28 +2367,8 @@ zonelist_scan:
if (alloc_flags & ALLOC_NO_WATERMARKS)
goto try_this_zone;
- if (IS_ENABLED(CONFIG_NUMA) &&
- !did_zlc_setup && nr_online_nodes > 1) {
- /*
- * we do zlc_setup if there are multiple nodes
- * and before considering the first zone allowed
- * by the cpuset.
- */
- allowednodes = zlc_setup(zonelist, alloc_flags);
- zlc_active = 1;
- did_zlc_setup = 1;
- }
-
if (zone_reclaim_mode == 0 ||
!zone_allows_reclaim(ac->preferred_zone, zone))
- goto this_zone_full;
-
- /*
- * As we may have just activated ZLC, check if the first
- * eligible zone has failed zone_reclaim recently.
- */
- if (IS_ENABLED(CONFIG_NUMA) && zlc_active &&
- !zlc_zone_worth_trying(zonelist, z, allowednodes))
continue;
ret = zone_reclaim(zone, gfp_mask, order);
@@ -2549,19 +2385,6 @@ zonelist_scan:
ac->classzone_idx, alloc_flags))
goto try_this_zone;
- /*
- * Failed to reclaim enough to meet watermark.
- * Only mark the zone full if checking the min
- * watermark or if we failed to reclaim just
- * 1<<order pages or else the page allocator
- * fastpath will prematurely mark zones full
- * when the watermark is between the low and
- * min watermarks.
- */
- if (((alloc_flags & ALLOC_WMARK_MASK) == ALLOC_WMARK_MIN) ||
- ret == ZONE_RECLAIM_SOME)
- goto this_zone_full;
-
continue;
}
}
@@ -2574,9 +2397,6 @@ try_this_zone:
goto try_this_zone;
return page;
}
-this_zone_full:
- if (IS_ENABLED(CONFIG_NUMA) && zlc_active)
- zlc_mark_zone_full(zonelist, z);
}
/*
@@ -2597,12 +2417,6 @@ this_zone_full:
zonelist_rescan = true;
}
- if (unlikely(IS_ENABLED(CONFIG_NUMA) && zlc_active)) {
- /* Disable zlc cache for second zonelist scan */
- zlc_active = 0;
- zonelist_rescan = true;
- }
-
if (zonelist_rescan)
goto zonelist_scan;
@@ -2842,10 +2656,6 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
if (unlikely(!(*did_some_progress)))
return NULL;
- /* After successful reclaim, reconsider all zones for allocation */
- if (IS_ENABLED(CONFIG_NUMA))
- zlc_clear_zones_full(ac->zonelist);
-
retry:
page = get_page_from_freelist(gfp_mask, order,
alloc_flags & ~ALLOC_NO_WATERMARKS, ac);
@@ -3155,7 +2965,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
struct alloc_context ac = {
.high_zoneidx = gfp_zone(gfp_mask),
- .nodemask = nodemask,
+ .nodemask = nodemask ? : &cpuset_current_mems_allowed,
.migratetype = gfpflags_to_migratetype(gfp_mask),
};
@@ -3186,8 +2996,7 @@ retry_cpuset:
ac.zonelist = zonelist;
/* The preferred zone is used for statistics later */
preferred_zoneref = first_zones_zonelist(ac.zonelist, ac.high_zoneidx,
- ac.nodemask ? : &cpuset_current_mems_allowed,
- &ac.preferred_zone);
+ ac.nodemask, &ac.preferred_zone);
if (!ac.preferred_zone)
goto out;
ac.classzone_idx = zonelist_zone_idx(preferred_zoneref);
@@ -4167,20 +3976,6 @@ static void build_zonelists(pg_data_t *pgdat)
build_thisnode_zonelists(pgdat);
}
-/* Construct the zonelist performance cache - see further mmzone.h */
-static void build_zonelist_cache(pg_data_t *pgdat)
-{
- struct zonelist *zonelist;
- struct zonelist_cache *zlc;
- struct zoneref *z;
-
- zonelist = &pgdat->node_zonelists[0];
- zonelist->zlcache_ptr = zlc = &zonelist->zlcache;
- bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
- for (z = zonelist->_zonerefs; z->zone; z++)
- zlc->z_to_n[z - zonelist->_zonerefs] = zonelist_node_idx(z);
-}
-
#ifdef CONFIG_HAVE_MEMORYLESS_NODES
/*
* Return node id of node used for "local" allocations.
@@ -4241,12 +4036,6 @@ static void build_zonelists(pg_data_t *pgdat)
zonelist->_zonerefs[j].zone_idx = 0;
}
-/* non-NUMA variant of zonelist performance cache - just NULL zlcache_ptr */
-static void build_zonelist_cache(pg_data_t *pgdat)
-{
- pgdat->node_zonelists[0].zlcache_ptr = NULL;
-}
-
#endif /* CONFIG_NUMA */
/*
@@ -4287,14 +4076,12 @@ static int __build_all_zonelists(void *data)
if (self && !node_online(self->node_id)) {
build_zonelists(self);
- build_zonelist_cache(self);
}
for_each_online_node(nid) {
pg_data_t *pgdat = NODE_DATA(nid);
build_zonelists(pgdat);
- build_zonelist_cache(pgdat);
}
/*
--
2.4.3
From: Mel Gorman <[email protected]>
No user of zone_watermark_ok_safe() specifies alloc_flags. This patch
removes the unnecessary parameter.
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 2 +-
mm/page_alloc.c | 5 +++--
mm/vmscan.c | 4 ++--
3 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 754289f371fa..672ac437c43c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -729,7 +729,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx);
bool zone_watermark_ok(struct zone *z, unsigned int order,
unsigned long mark, int classzone_idx, int alloc_flags);
bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
- unsigned long mark, int classzone_idx, int alloc_flags);
+ unsigned long mark, int classzone_idx);
enum memmap_context {
MEMMAP_EARLY,
MEMMAP_HOTPLUG,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8db0b6d66165..4b35b196aeda 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2207,6 +2207,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
min -= min / 2;
if (alloc_flags & ALLOC_HARDER)
min -= min / 4;
+
#ifdef CONFIG_CMA
/* If allocation can't use CMA areas don't use free CMA pages */
if (!(alloc_flags & ALLOC_CMA))
@@ -2236,14 +2237,14 @@ bool zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
}
bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
- unsigned long mark, int classzone_idx, int alloc_flags)
+ unsigned long mark, int classzone_idx)
{
long free_pages = zone_page_state(z, NR_FREE_PAGES);
if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);
- return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
+ return __zone_watermark_ok(z, order, mark, classzone_idx, 0,
free_pages);
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e61445dce04e..f1d8eae285f2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2454,7 +2454,7 @@ static inline bool compaction_ready(struct zone *zone, int order)
balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP(
zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO));
watermark = high_wmark_pages(zone) + balance_gap + (2UL << order);
- watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, 0, 0);
+ watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, 0);
/*
* If compaction is deferred, reclaim up to a point where
@@ -2937,7 +2937,7 @@ static bool zone_balanced(struct zone *zone, int order,
unsigned long balance_gap, int classzone_idx)
{
if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone) +
- balance_gap, classzone_idx, 0))
+ balance_gap, classzone_idx))
return false;
if (IS_ENABLED(CONFIG_COMPACTION) && order && compaction_suitable(zone,
--
2.4.3
From: Mel Gorman <[email protected]>
File-backed pages that will be immediately dirtied are balanced between
zones, but doing so is unnecessarily expensive. Move consider_zone_dirty
into the alloc_context instead of recalculating it on every call to
get_page_from_freelist.
Signed-off-by: Mel Gorman <[email protected]>
---
mm/internal.h | 1 +
mm/page_alloc.c | 9 ++++++---
2 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 36b23f1e2ca6..8977348fbeec 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -129,6 +129,7 @@ struct alloc_context {
int classzone_idx;
int migratetype;
enum zone_type high_zoneidx;
+ bool consider_zone_dirty;
};
/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4b35b196aeda..7c2dc022f4ba 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2295,8 +2295,6 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
struct zoneref *z;
struct page *page = NULL;
struct zone *zone;
- bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) &&
- (gfp_mask & __GFP_WRITE);
int nr_fair_skipped = 0;
bool zonelist_rescan;
@@ -2355,7 +2353,7 @@ zonelist_scan:
* will require awareness of zones in the
* dirty-throttling and the flusher threads.
*/
- if (consider_zone_dirty && !zone_dirty_ok(zone))
+ if (ac->consider_zone_dirty && !zone_dirty_ok(zone))
continue;
mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
@@ -2995,6 +2993,10 @@ retry_cpuset:
/* We set it here, as __alloc_pages_slowpath might have changed it */
ac.zonelist = zonelist;
+
+ /* Dirty zone balancing only done in the fast path */
+ ac.consider_zone_dirty = (gfp_mask & __GFP_WRITE);
+
/* The preferred zone is used for statistics later */
preferred_zoneref = first_zones_zonelist(ac.zonelist, ac.high_zoneidx,
ac.nodemask, &ac.preferred_zone);
@@ -3012,6 +3014,7 @@ retry_cpuset:
* complete.
*/
alloc_mask = memalloc_noio_flags(gfp_mask);
+ ac.consider_zone_dirty = false;
page = __alloc_pages_slowpath(alloc_mask, order, &ac);
}
--
2.4.3
From: Mel Gorman <[email protected]>
There is a seqcounter that protects against spurious allocation failures
when a task is changing the allowed nodes in a cpuset. There is no need to
check the seqcounter until a cpuset exists.
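
As a minimal sketch (not a hunk from this patch) of the pattern being
optimised: the allocator samples the seqcount before an attempt and retries
if a concurrent cpuset update changed mems_allowed. try_allocation() is a
hypothetical stand-in for the real allocation attempt.

	static struct page *alloc_with_cpuset_retry(gfp_t gfp, unsigned int order)
	{
		struct page *page;
		unsigned int cookie;

	retry:
		/* With this patch, a no-op unless a cpuset has been created */
		cookie = read_mems_allowed_begin();
		page = try_allocation(gfp, order);	/* hypothetical attempt */
		if (!page && read_mems_allowed_retry(cookie))
			goto retry;	/* mems_allowed changed mid-attempt */
		return page;
	}
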
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/cpuset.h | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 1b357997cac5..6eb27cb480b7 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -104,6 +104,9 @@ extern void cpuset_print_task_mems_allowed(struct task_struct *p);
*/
static inline unsigned int read_mems_allowed_begin(void)
{
+ if (!cpusets_enabled())
+ return 0;
+
return read_seqcount_begin(¤t->mems_allowed_seq);
}
@@ -115,6 +118,9 @@ static inline unsigned int read_mems_allowed_begin(void)
*/
static inline bool read_mems_allowed_retry(unsigned int seq)
{
+ if (!cpusets_enabled())
+ return false;
+
return read_seqcount_retry(¤t->mems_allowed_seq, seq);
}
--
2.4.3
From: Mel Gorman <[email protected]>
During boot and suspend there is a restriction on the allowed GFP
flags. During boot it prevents blocking operations before the scheduler
is active. During suspend it avoids IO operations while storage is
unavailable. The restriction is currently applied by masking gfp_mask in
several allocator hot paths during normal operation, which is wasteful. Use
a jump label so the mask is only applied while the restriction is active.
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/gfp.h | 33 ++++++++++++++++++++++++++++-----
init/main.c | 2 +-
mm/page_alloc.c | 21 +++++++--------------
mm/slab.c | 4 ++--
mm/slob.c | 4 ++--
mm/slub.c | 6 +++---
6 files changed, 43 insertions(+), 27 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index ad35f300b9a4..6d3a2d430715 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -394,12 +394,35 @@ static inline void page_alloc_init_late(void)
/*
* gfp_allowed_mask is set to GFP_BOOT_MASK during early boot to restrict what
- * GFP flags are used before interrupts are enabled. Once interrupts are
- * enabled, it is set to __GFP_BITS_MASK while the system is running. During
- * hibernation, it is used by PM to avoid I/O during memory allocation while
- * devices are suspended.
+ * GFP flags are used before interrupts are enabled. During hibernation, it is
+ * used by PM to avoid I/O during memory allocation while devices are suspended.
*/
-extern gfp_t gfp_allowed_mask;
+extern gfp_t __gfp_allowed_mask;
+
+/* Only update the gfp_mask when it is restricted */
+extern struct static_key gfp_restricted_key;
+
+static inline gfp_t gfp_allowed_mask(gfp_t gfp_mask)
+{
+ if (static_key_false(&gfp_restricted_key))
+ return gfp_mask;
+
+ return gfp_mask & __gfp_allowed_mask;
+}
+
+static inline void unrestrict_gfp_allowed_mask(void)
+{
+ WARN_ON(!static_key_enabled(&gfp_restricted_key));
+ __gfp_allowed_mask = __GFP_BITS_MASK;
+ static_key_slow_dec(&gfp_restricted_key);
+}
+
+static inline void restrict_gfp_allowed_mask(gfp_t gfp_mask)
+{
+ WARN_ON(static_key_enabled(&gfp_restricted_key));
+ __gfp_allowed_mask = gfp_mask;
+ static_key_slow_inc(&gfp_restricted_key);
+}
/* Returns true if the gfp_mask allows use of ALLOC_NO_WATERMARK */
bool gfp_pfmemalloc_allowed(gfp_t gfp_mask);
diff --git a/init/main.c b/init/main.c
index c5d5626289ce..7e3a227559c6 100644
--- a/init/main.c
+++ b/init/main.c
@@ -983,7 +983,7 @@ static noinline void __init kernel_init_freeable(void)
wait_for_completion(&kthreadd_done);
/* Now the scheduler is fully set up and can do blocking allocations */
- gfp_allowed_mask = __GFP_BITS_MASK;
+ unrestrict_gfp_allowed_mask();
/*
* init can allocate pages on any node
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7c2dc022f4ba..56432b59b797 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -124,7 +124,9 @@ unsigned long totalcma_pages __read_mostly;
unsigned long dirty_balance_reserve __read_mostly;
int percpu_pagelist_fraction;
-gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
+
+gfp_t __gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
+struct static_key gfp_restricted_key __read_mostly = STATIC_KEY_INIT_TRUE;
#ifdef CONFIG_PM_SLEEP
/*
@@ -136,30 +138,21 @@ gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
* guaranteed not to run in parallel with that modification).
*/
-static gfp_t saved_gfp_mask;
-
void pm_restore_gfp_mask(void)
{
WARN_ON(!mutex_is_locked(&pm_mutex));
- if (saved_gfp_mask) {
- gfp_allowed_mask = saved_gfp_mask;
- saved_gfp_mask = 0;
- }
+ unrestrict_gfp_allowed_mask();
}
void pm_restrict_gfp_mask(void)
{
WARN_ON(!mutex_is_locked(&pm_mutex));
- WARN_ON(saved_gfp_mask);
- saved_gfp_mask = gfp_allowed_mask;
- gfp_allowed_mask &= ~GFP_IOFS;
+ restrict_gfp_allowed_mask(__GFP_BITS_MASK & ~GFP_IOFS);
}
bool pm_suspended_storage(void)
{
- if ((gfp_allowed_mask & GFP_IOFS) == GFP_IOFS)
- return false;
- return true;
+ return static_key_enabled(&gfp_restricted_key);
}
#endif /* CONFIG_PM_SLEEP */
@@ -2968,7 +2961,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
.migratetype = gfpflags_to_migratetype(gfp_mask),
};
- gfp_mask &= gfp_allowed_mask;
+ gfp_mask = gfp_allowed_mask(gfp_mask);
lockdep_trace_alloc(gfp_mask);
diff --git a/mm/slab.c b/mm/slab.c
index 200e22412a16..2c715b8c88f7 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3151,7 +3151,7 @@ slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
void *ptr;
int slab_node = numa_mem_id();
- flags &= gfp_allowed_mask;
+ flags = gfp_allowed_mask(flags);
lockdep_trace_alloc(flags);
@@ -3239,7 +3239,7 @@ slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
unsigned long save_flags;
void *objp;
- flags &= gfp_allowed_mask;
+ flags = gfp_allowed_mask(flags);
lockdep_trace_alloc(flags);
diff --git a/mm/slob.c b/mm/slob.c
index 4765f65019c7..23dbdac87fcb 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -430,7 +430,7 @@ __do_kmalloc_node(size_t size, gfp_t gfp, int node, unsigned long caller)
int align = max_t(size_t, ARCH_KMALLOC_MINALIGN, ARCH_SLAB_MINALIGN);
void *ret;
- gfp &= gfp_allowed_mask;
+ gfp = gfp_allowed_mask(gfp);
lockdep_trace_alloc(gfp);
@@ -536,7 +536,7 @@ static void *slob_alloc_node(struct kmem_cache *c, gfp_t flags, int node)
{
void *b;
- flags &= gfp_allowed_mask;
+ flags = gfp_allowed_mask(flags);
lockdep_trace_alloc(flags);
diff --git a/mm/slub.c b/mm/slub.c
index 816df0016555..9eb79f7a48ba 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1261,7 +1261,7 @@ static inline void kfree_hook(const void *x)
static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
gfp_t flags)
{
- flags &= gfp_allowed_mask;
+ flags = gfp_allowed_mask(flags);
lockdep_trace_alloc(flags);
might_sleep_if(flags & __GFP_WAIT);
@@ -1274,7 +1274,7 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
static inline void slab_post_alloc_hook(struct kmem_cache *s,
gfp_t flags, void *object)
{
- flags &= gfp_allowed_mask;
+ flags = gfp_allowed_mask(flags);
kmemcheck_slab_alloc(s, flags, object, slab_ksize(s));
kmemleak_alloc_recursive(object, s->object_size, 1, s->flags, flags);
memcg_kmem_put_cache(s);
@@ -1337,7 +1337,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
struct kmem_cache_order_objects oo = s->oo;
gfp_t alloc_gfp;
- flags &= gfp_allowed_mask;
+ flags = gfp_allowed_mask(flags);
if (flags & __GFP_WAIT)
local_irq_enable();
--
2.4.3
From: Mel Gorman <[email protected]>
The global variable page_group_by_mobility_disabled remembers whether page
grouping by mobility was disabled at boot time. It is more efficient to
track this with a jump label.
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/gfp.h | 2 +-
include/linux/mmzone.h | 7 ++++++-
mm/page_alloc.c | 15 ++++++---------
3 files changed, 13 insertions(+), 11 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 6d3a2d430715..5a27bbba63ed 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -151,7 +151,7 @@ static inline int gfpflags_to_migratetype(const gfp_t gfp_flags)
{
WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
- if (unlikely(page_group_by_mobility_disabled))
+ if (page_group_by_mobility_disabled())
return MIGRATE_UNMOVABLE;
/* Group based on mobility */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 672ac437c43c..c9497519340a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -73,7 +73,12 @@ enum {
for (order = 0; order < MAX_ORDER; order++) \
for (type = 0; type < MIGRATE_TYPES; type++)
-extern int page_group_by_mobility_disabled;
+extern struct static_key page_group_by_mobility_key;
+
+static inline bool page_group_by_mobility_disabled(void)
+{
+ return static_key_false(&page_group_by_mobility_key);
+}
#define NR_MIGRATETYPE_BITS (PB_migrate_end - PB_migrate + 1)
#define MIGRATETYPE_MASK ((1UL << NR_MIGRATETYPE_BITS) - 1)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 56432b59b797..403cf31f8cf9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -228,7 +228,7 @@ EXPORT_SYMBOL(nr_node_ids);
EXPORT_SYMBOL(nr_online_nodes);
#endif
-int page_group_by_mobility_disabled __read_mostly;
+struct static_key page_group_by_mobility_key __read_mostly = STATIC_KEY_INIT_FALSE;
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
static inline void reset_deferred_meminit(pg_data_t *pgdat)
@@ -303,8 +303,7 @@ static inline bool update_defer_init(pg_data_t *pgdat,
void set_pageblock_migratetype(struct page *page, int migratetype)
{
- if (unlikely(page_group_by_mobility_disabled &&
- migratetype < MIGRATE_PCPTYPES))
+ if (page_group_by_mobility_disabled() && migratetype < MIGRATE_PCPTYPES)
migratetype = MIGRATE_UNMOVABLE;
set_pageblock_flags_group(page, (unsigned long)migratetype,
@@ -1501,7 +1500,7 @@ static bool can_steal_fallback(unsigned int order, int start_mt)
if (order >= pageblock_order / 2 ||
start_mt == MIGRATE_RECLAIMABLE ||
start_mt == MIGRATE_UNMOVABLE ||
- page_group_by_mobility_disabled)
+ page_group_by_mobility_disabled())
return true;
return false;
@@ -1530,7 +1529,7 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
/* Claim the whole block if over half of it is free */
if (pages >= (1 << (pageblock_order-1)) ||
- page_group_by_mobility_disabled)
+ page_group_by_mobility_disabled())
set_pageblock_migratetype(page, start_type);
}
@@ -4156,15 +4155,13 @@ void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
* disabled and enable it later
*/
if (vm_total_pages < (pageblock_nr_pages * MIGRATE_TYPES))
- page_group_by_mobility_disabled = 1;
- else
- page_group_by_mobility_disabled = 0;
+ static_key_slow_inc(&page_group_by_mobility_key);
pr_info("Built %i zonelists in %s order, mobility grouping %s. "
"Total pages: %ld\n",
nr_online_nodes,
zonelist_order_name[current_zonelist_order],
- page_group_by_mobility_disabled ? "off" : "on",
+ page_group_by_mobility_disabled() ? "off" : "on",
vm_total_pages);
#ifdef CONFIG_NUMA
pr_info("Policy zone: %s\n", zone_names[policy_zone]);
--
2.4.3
From: Mel Gorman <[email protected]>
This patch redefines which GFP bits are used for specifying mobility and
the order of the migrate types. Once redefined, it is possible to convert
GFP flags to a migrate type with a simple mask and shift. The only downside
is that readers of OOM-kill messages and allocation failures may be used to
the existing values, but scripts/gfp-translate will help.
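
A worked example of the new mapping, using the bit values from the hunks
below:

	/*
	 * ___GFP_MOVABLE     = 0x08 (bit 3), ___GFP_RECLAIMABLE = 0x10 (bit 4)
	 * GFP_MOVABLE_MASK   = 0x18, GFP_MOVABLE_SHIFT = 3
	 *
	 * (gfp & GFP_MOVABLE_MASK) >> GFP_MOVABLE_SHIFT gives
	 *   0 -> MIGRATE_UNMOVABLE    (neither flag set)
	 *   1 -> MIGRATE_MOVABLE      (__GFP_MOVABLE set)
	 *   2 -> MIGRATE_RECLAIMABLE  (__GFP_RECLAIMABLE set)
	 * which is why MIGRATE_MOVABLE and MIGRATE_RECLAIMABLE swap places
	 * in the enum.
	 */
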
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/gfp.h | 12 +++++++-----
include/linux/mmzone.h | 2 +-
2 files changed, 8 insertions(+), 6 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 5a27bbba63ed..ec00a8263f5b 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -14,7 +14,7 @@ struct vm_area_struct;
#define ___GFP_HIGHMEM 0x02u
#define ___GFP_DMA32 0x04u
#define ___GFP_MOVABLE 0x08u
-#define ___GFP_WAIT 0x10u
+#define ___GFP_RECLAIMABLE 0x10u
#define ___GFP_HIGH 0x20u
#define ___GFP_IO 0x40u
#define ___GFP_FS 0x80u
@@ -29,7 +29,7 @@ struct vm_area_struct;
#define ___GFP_NOMEMALLOC 0x10000u
#define ___GFP_HARDWALL 0x20000u
#define ___GFP_THISNODE 0x40000u
-#define ___GFP_RECLAIMABLE 0x80000u
+#define ___GFP_WAIT 0x80000u
#define ___GFP_NOACCOUNT 0x100000u
#define ___GFP_NOTRACK 0x200000u
#define ___GFP_NO_KSWAPD 0x400000u
@@ -123,6 +123,7 @@ struct vm_area_struct;
/* This mask makes up all the page movable related flags */
#define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
+#define GFP_MOVABLE_SHIFT 3
/* Control page allocator reclaim behavior */
#define GFP_RECLAIM_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS|\
@@ -149,14 +150,15 @@ struct vm_area_struct;
/* Convert GFP flags to their corresponding migrate type */
static inline int gfpflags_to_migratetype(const gfp_t gfp_flags)
{
- WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
+ VM_WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
+ BUILD_BUG_ON(1UL << GFP_MOVABLE_SHIFT != ___GFP_MOVABLE);
+ BUILD_BUG_ON(___GFP_MOVABLE >> GFP_MOVABLE_SHIFT != MIGRATE_MOVABLE);
if (page_group_by_mobility_disabled())
return MIGRATE_UNMOVABLE;
/* Group based on mobility */
- return (((gfp_flags & __GFP_MOVABLE) != 0) << 1) |
- ((gfp_flags & __GFP_RECLAIMABLE) != 0);
+ return (gfp_flags & GFP_MOVABLE_MASK) >> GFP_MOVABLE_SHIFT;
}
#ifdef CONFIG_HIGHMEM
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c9497519340a..3afd1ca2ca98 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -37,8 +37,8 @@
enum {
MIGRATE_UNMOVABLE,
- MIGRATE_RECLAIMABLE,
MIGRATE_MOVABLE,
+ MIGRATE_RECLAIMABLE,
MIGRATE_PCPTYPES, /* the number of types on the pcp lists */
MIGRATE_RESERVE = MIGRATE_PCPTYPES,
#ifdef CONFIG_CMA
--
2.4.3
From: Mel Gorman <[email protected]>
MIGRATE_RESERVE preserves an old property of the buddy allocator that
existed prior to fragmentation avoidance -- min_free_kbytes worth of pages
tended to remain free until the only alternative was to fail the
allocation. At the time it was discovered that high-order atomic
allocations relied on this property, so MIGRATE_RESERVE was introduced. A
later patch will introduce an alternative, MIGRATE_HIGHATOMIC, so this
patch deletes MIGRATE_RESERVE and its supporting code to make the
replacement easier to review. Note that this patch in isolation may look
like a regression if someone is bisecting high-order atomic allocation
failures.
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 10 +---
mm/huge_memory.c | 2 +-
mm/page_alloc.c | 148 +++----------------------------------------------
mm/vmstat.c | 1 -
4 files changed, 11 insertions(+), 150 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3afd1ca2ca98..0faa196eb10a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -39,8 +39,6 @@ enum {
MIGRATE_UNMOVABLE,
MIGRATE_MOVABLE,
MIGRATE_RECLAIMABLE,
- MIGRATE_PCPTYPES, /* the number of types on the pcp lists */
- MIGRATE_RESERVE = MIGRATE_PCPTYPES,
#ifdef CONFIG_CMA
/*
* MIGRATE_CMA migration type is designed to mimic the way
@@ -63,6 +61,8 @@ enum {
MIGRATE_TYPES
};
+#define MIGRATE_PCPTYPES (MIGRATE_RECLAIMABLE+1)
+
#ifdef CONFIG_CMA
# define is_migrate_cma(migratetype) unlikely((migratetype) == MIGRATE_CMA)
#else
@@ -430,12 +430,6 @@ struct zone {
const char *name;
- /*
- * Number of MIGRATE_RESERVE page block. To maintain for just
- * optimization. Protected by zone->lock.
- */
- int nr_migrate_reserve_block;
-
#ifdef CONFIG_MEMORY_ISOLATION
/*
* Number of isolated pageblock. It is used to solve incorrect
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c107094f79ba..705ac13b969d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -113,7 +113,7 @@ static int set_recommended_min_free_kbytes(void)
for_each_populated_zone(zone)
nr_zones++;
- /* Make sure at least 2 hugepages are free for MIGRATE_RESERVE */
+ /* Ensure 2 pageblocks are free to assist fragmentation avoidance */
recommended_min = pageblock_nr_pages * nr_zones * 2;
/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 403cf31f8cf9..3249b0d9879e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -787,7 +787,6 @@ static void free_pcppages_bulk(struct zone *zone, int count,
if (unlikely(has_isolate_pageblock(zone)))
mt = get_pageblock_migratetype(page);
- /* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
__free_one_page(page, page_to_pfn(page), zone, 0, mt);
trace_mm_page_pcpu_drain(page, 0, mt);
} while (--to_free && --batch_free && !list_empty(list));
@@ -1369,15 +1368,14 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
* the free lists for the desirable migrate type are depleted
*/
static int fallbacks[MIGRATE_TYPES][4] = {
- [MIGRATE_UNMOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE, MIGRATE_RESERVE },
- [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_RESERVE },
- [MIGRATE_MOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_RESERVE },
+ [MIGRATE_UNMOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE, MIGRATE_TYPES },
+ [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_TYPES },
+ [MIGRATE_MOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_TYPES },
#ifdef CONFIG_CMA
- [MIGRATE_CMA] = { MIGRATE_RESERVE }, /* Never used */
+ [MIGRATE_CMA] = { MIGRATE_TYPES }, /* Never used */
#endif
- [MIGRATE_RESERVE] = { MIGRATE_RESERVE }, /* Never used */
#ifdef CONFIG_MEMORY_ISOLATION
- [MIGRATE_ISOLATE] = { MIGRATE_RESERVE }, /* Never used */
+ [MIGRATE_ISOLATE] = { MIGRATE_TYPES }, /* Never used */
#endif
};
@@ -1551,7 +1549,7 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
*can_steal = false;
for (i = 0;; i++) {
fallback_mt = fallbacks[migratetype][i];
- if (fallback_mt == MIGRATE_RESERVE)
+ if (fallback_mt == MIGRATE_TYPES)
break;
if (list_empty(&area->free_list[fallback_mt]))
@@ -1630,25 +1628,13 @@ static struct page *__rmqueue(struct zone *zone, unsigned int order,
{
struct page *page;
-retry_reserve:
page = __rmqueue_smallest(zone, order, migratetype);
-
- if (unlikely(!page) && migratetype != MIGRATE_RESERVE) {
+ if (unlikely(!page)) {
if (migratetype == MIGRATE_MOVABLE)
page = __rmqueue_cma_fallback(zone, order);
if (!page)
page = __rmqueue_fallback(zone, order, migratetype);
-
- /*
- * Use MIGRATE_RESERVE rather than fail an allocation. goto
- * is used because __rmqueue_smallest is an inline function
- * and we want just one call site
- */
- if (!page) {
- migratetype = MIGRATE_RESERVE;
- goto retry_reserve;
- }
}
trace_mm_page_alloc_zone_locked(page, order, migratetype);
@@ -3426,7 +3412,6 @@ static void show_migration_types(unsigned char type)
[MIGRATE_UNMOVABLE] = 'U',
[MIGRATE_RECLAIMABLE] = 'E',
[MIGRATE_MOVABLE] = 'M',
- [MIGRATE_RESERVE] = 'R',
#ifdef CONFIG_CMA
[MIGRATE_CMA] = 'C',
#endif
@@ -4235,120 +4220,6 @@ static inline unsigned long wait_table_bits(unsigned long size)
}
/*
- * Check if a pageblock contains reserved pages
- */
-static int pageblock_is_reserved(unsigned long start_pfn, unsigned long end_pfn)
-{
- unsigned long pfn;
-
- for (pfn = start_pfn; pfn < end_pfn; pfn++) {
- if (!pfn_valid_within(pfn) || PageReserved(pfn_to_page(pfn)))
- return 1;
- }
- return 0;
-}
-
-/*
- * Mark a number of pageblocks as MIGRATE_RESERVE. The number
- * of blocks reserved is based on min_wmark_pages(zone). The memory within
- * the reserve will tend to store contiguous free pages. Setting min_free_kbytes
- * higher will lead to a bigger reserve which will get freed as contiguous
- * blocks as reclaim kicks in
- */
-static void setup_zone_migrate_reserve(struct zone *zone)
-{
- unsigned long start_pfn, pfn, end_pfn, block_end_pfn;
- struct page *page;
- unsigned long block_migratetype;
- int reserve;
- int old_reserve;
-
- /*
- * Get the start pfn, end pfn and the number of blocks to reserve
- * We have to be careful to be aligned to pageblock_nr_pages to
- * make sure that we always check pfn_valid for the first page in
- * the block.
- */
- start_pfn = zone->zone_start_pfn;
- end_pfn = zone_end_pfn(zone);
- start_pfn = roundup(start_pfn, pageblock_nr_pages);
- reserve = roundup(min_wmark_pages(zone), pageblock_nr_pages) >>
- pageblock_order;
-
- /*
- * Reserve blocks are generally in place to help high-order atomic
- * allocations that are short-lived. A min_free_kbytes value that
- * would result in more than 2 reserve blocks for atomic allocations
- * is assumed to be in place to help anti-fragmentation for the
- * future allocation of hugepages at runtime.
- */
- reserve = min(2, reserve);
- old_reserve = zone->nr_migrate_reserve_block;
-
- /* When memory hot-add, we almost always need to do nothing */
- if (reserve == old_reserve)
- return;
- zone->nr_migrate_reserve_block = reserve;
-
- for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
- if (!early_page_nid_uninitialised(pfn, zone_to_nid(zone)))
- return;
-
- if (!pfn_valid(pfn))
- continue;
- page = pfn_to_page(pfn);
-
- /* Watch out for overlapping nodes */
- if (page_to_nid(page) != zone_to_nid(zone))
- continue;
-
- block_migratetype = get_pageblock_migratetype(page);
-
- /* Only test what is necessary when the reserves are not met */
- if (reserve > 0) {
- /*
- * Blocks with reserved pages will never free, skip
- * them.
- */
- block_end_pfn = min(pfn + pageblock_nr_pages, end_pfn);
- if (pageblock_is_reserved(pfn, block_end_pfn))
- continue;
-
- /* If this block is reserved, account for it */
- if (block_migratetype == MIGRATE_RESERVE) {
- reserve--;
- continue;
- }
-
- /* Suitable for reserving if this block is movable */
- if (block_migratetype == MIGRATE_MOVABLE) {
- set_pageblock_migratetype(page,
- MIGRATE_RESERVE);
- move_freepages_block(zone, page,
- MIGRATE_RESERVE);
- reserve--;
- continue;
- }
- } else if (!old_reserve) {
- /*
- * At boot time we don't need to scan the whole zone
- * for turning off MIGRATE_RESERVE.
- */
- break;
- }
-
- /*
- * If the reserve is met and this is a previous reserved block,
- * take it back
- */
- if (block_migratetype == MIGRATE_RESERVE) {
- set_pageblock_migratetype(page, MIGRATE_MOVABLE);
- move_freepages_block(zone, page, MIGRATE_MOVABLE);
- }
- }
-}
-
-/*
* Initially all pages are reserved - free ones are freed
* up by free_all_bootmem() once the early boot process is
* done. Non-atomic initialization, single-pass.
@@ -4387,9 +4258,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
* movable at startup. This will force kernel allocations
* to reserve their blocks rather than leaking throughout
* the address space during boot when many long-lived
- * kernel allocations are made. Later some blocks near
- * the start are marked MIGRATE_RESERVE by
- * setup_zone_migrate_reserve()
+ * kernel allocations are made.
*
* bitmap is created for zone's valid pfn range. but memmap
* can be created for invalid pages (for alignment)
@@ -5939,7 +5808,6 @@ static void __setup_per_zone_wmarks(void)
high_wmark_pages(zone) - low_wmark_pages(zone) -
atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
- setup_zone_migrate_reserve(zone);
spin_unlock_irqrestore(&zone->lock, flags);
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4f5cd974e11a..49963aa2dff3 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -901,7 +901,6 @@ static char * const migratetype_names[MIGRATE_TYPES] = {
"Unmovable",
"Reclaimable",
"Movable",
- "Reserve",
#ifdef CONFIG_CMA
"CMA",
#endif
--
2.4.3
From: Mel Gorman <[email protected]>
High-order watermark checking exists for two reasons -- kswapd high-order
awareness and protection for high-order atomic requests. Historically we
depended on MIGRATE_RESERVE to preserve min_free_kbytes as high-order free
pages for as long as possible. This patch introduces MIGRATE_HIGHATOMIC
that reserves pageblocks for high-order atomic allocations. This is expected
to be more reliable than MIGRATE_RESERVE was.
A MIGRATE_HIGHATOMIC pageblock is reserved when a high-order atomic
allocation succeeds from it, with the total number of reserved pageblocks
limited to roughly 10% of the zone.
The pageblocks are unreserved if an allocation fails after a direct
reclaim attempt.
The watermark checks account for the reserved pageblocks when the allocation
request is not a high-order atomic allocation.
The stutter benchmark was used to evaluate this, but while it was running
a systemtap script randomly allocated between 1 and 1G worth of order-3
pages using GFP_ATOMIC. In kernel 4.2-rc1 running this workload on a
single-node machine there were 339574 allocation failures. With this patch
applied there were 28798 failures -- a 92% reduction. On a 4-node machine,
allocation failures went from 76917 to 0.
There are minor theoretical side-effects. If the system is intensively
making large numbers of long-lived high-order atomic allocations then
there will be a lot of reserved pageblocks. This may push some workloads
into reclaim until the number of reserved pageblocks is reduced again. This
problem was not observed in reclaim-intensive workloads, but such workloads
are also not high-order-atomic intensive.
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 6 ++-
mm/page_alloc.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++---
mm/vmstat.c | 1 +
3 files changed, 112 insertions(+), 9 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0faa196eb10a..73a148ee79e3 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -39,6 +39,8 @@ enum {
MIGRATE_UNMOVABLE,
MIGRATE_MOVABLE,
MIGRATE_RECLAIMABLE,
+ MIGRATE_PCPTYPES, /* the number of types on the pcp lists */
+ MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,
#ifdef CONFIG_CMA
/*
* MIGRATE_CMA migration type is designed to mimic the way
@@ -61,8 +63,6 @@ enum {
MIGRATE_TYPES
};
-#define MIGRATE_PCPTYPES (MIGRATE_RECLAIMABLE+1)
-
#ifdef CONFIG_CMA
# define is_migrate_cma(migratetype) unlikely((migratetype) == MIGRATE_CMA)
#else
@@ -335,6 +335,8 @@ struct zone {
/* zone watermarks, access with *_wmark_pages(zone) macros */
unsigned long watermark[NR_WMARK];
+ unsigned long nr_reserved_highatomic;
+
/*
* We don't know if the memory that we're going to allocate will be freeable
* or/and it will be released eventually, so to avoid totally wasting several
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3249b0d9879e..e5755390a5e5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1568,6 +1568,76 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
return -1;
}
+/*
+ * Reserve a pageblock for exclusive use of high-order atomic allocations if
+ * there are no empty page blocks that contain a page with a suitable order
+ */
+static void reserve_highatomic_pageblock(struct page *page, struct zone *zone,
+ unsigned int alloc_order)
+{
+ int mt = get_pageblock_migratetype(page);
+ unsigned long max_managed, flags;
+
+ if (mt == MIGRATE_HIGHATOMIC)
+ return;
+
+ /*
+ * Limit the number reserved to 1 pageblock or roughly 10% of a zone.
+ * Check is race-prone but harmless.
+ */
+ max_managed = (zone->managed_pages / 10) + pageblock_nr_pages;
+ if (zone->nr_reserved_highatomic >= max_managed)
+ return;
+
+ /* Yoink! */
+ spin_lock_irqsave(&zone->lock, flags);
+ zone->nr_reserved_highatomic += pageblock_nr_pages;
+ set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
+ move_freepages_block(zone, page, MIGRATE_HIGHATOMIC);
+ spin_unlock_irqrestore(&zone->lock, flags);
+}
+
+/*
+ * Used when an allocation is about to fail under memory pressure. This
+ * potentially hurts the reliability of high-order allocations when under
+ * intense memory pressure but failed atomic allocations should be easier
+ * to recover from than an OOM.
+ */
+static void unreserve_highatomic_pageblock(const struct alloc_context *ac)
+{
+ struct zonelist *zonelist = ac->zonelist;
+ unsigned long flags;
+ struct zoneref *z;
+ struct zone *zone;
+ struct page *page;
+ int order;
+
+ for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->high_zoneidx,
+ ac->nodemask) {
+ /* Preserve at least one pageblock */
+ if (zone->nr_reserved_highatomic <= pageblock_nr_pages)
+ continue;
+
+ spin_lock_irqsave(&zone->lock, flags);
+ for (order = 0; order < MAX_ORDER; order++) {
+ struct free_area *area = &(zone->free_area[order]);
+
+ if (list_empty(&area->free_list[MIGRATE_HIGHATOMIC]))
+ continue;
+
+ page = list_entry(area->free_list[MIGRATE_HIGHATOMIC].next,
+ struct page, lru);
+
+ zone->nr_reserved_highatomic -= pageblock_nr_pages;
+ set_pageblock_migratetype(page, ac->migratetype);
+ move_freepages_block(zone, page, ac->migratetype);
+ spin_unlock_irqrestore(&zone->lock, flags);
+ return;
+ }
+ spin_unlock_irqrestore(&zone->lock, flags);
+ }
+}
+
/* Remove an element from the buddy allocator from the fallback list */
static inline struct page *
__rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype)
@@ -1619,15 +1689,26 @@ __rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype)
return NULL;
}
+static inline bool gfp_mask_atomic(gfp_t gfp_mask)
+{
+ return !(gfp_mask & (__GFP_WAIT | __GFP_NO_KSWAPD));
+}
+
/*
* Do the hard work of removing an element from the buddy allocator.
* Call me with the zone->lock already held.
*/
static struct page *__rmqueue(struct zone *zone, unsigned int order,
- int migratetype)
+ int migratetype, gfp_t gfp_flags)
{
struct page *page;
+ if (unlikely(order && gfp_mask_atomic(gfp_flags))) {
+ page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
+ if (page)
+ goto out;
+ }
+
page = __rmqueue_smallest(zone, order, migratetype);
if (unlikely(!page)) {
if (migratetype == MIGRATE_MOVABLE)
@@ -1637,6 +1718,7 @@ static struct page *__rmqueue(struct zone *zone, unsigned int order,
page = __rmqueue_fallback(zone, order, migratetype);
}
+out:
trace_mm_page_alloc_zone_locked(page, order, migratetype);
return page;
}
@@ -1654,7 +1736,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
spin_lock(&zone->lock);
for (i = 0; i < count; ++i) {
- struct page *page = __rmqueue(zone, order, migratetype);
+ struct page *page = __rmqueue(zone, order, migratetype, 0);
if (unlikely(page == NULL))
break;
@@ -2065,7 +2147,7 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
WARN_ON_ONCE(order > 1);
}
spin_lock_irqsave(&zone->lock, flags);
- page = __rmqueue(zone, order, migratetype);
+ page = __rmqueue(zone, order, migratetype, gfp_flags);
spin_unlock(&zone->lock);
if (!page)
goto failed;
@@ -2175,15 +2257,23 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
unsigned long mark, int classzone_idx, int alloc_flags,
long free_pages)
{
- /* free_pages may go negative - that's OK */
long min = mark;
int o;
long free_cma = 0;
+ /* free_pages may go negative - that's OK */
free_pages -= (1 << order) - 1;
+
if (alloc_flags & ALLOC_HIGH)
min -= min / 2;
- if (alloc_flags & ALLOC_HARDER)
+
+ /*
+ * If the caller is not atomic then discount the reserves. This will
+ * over-estimate the size of the atomic reserve but it avoids a search
+ */
+ if (likely(!(alloc_flags & ALLOC_HARDER)))
+ free_pages -= z->nr_reserved_highatomic;
+ else
min -= min / 4;
#ifdef CONFIG_CMA
@@ -2372,6 +2462,14 @@ try_this_zone:
if (page) {
if (prep_new_page(page, order, gfp_mask, alloc_flags))
goto try_this_zone;
+
+ /*
+ * If this is a high-order atomic allocation then check
+ * if the pageblock should be reserved for the future
+ */
+ if (unlikely(order && (alloc_flags & ALLOC_HARDER)))
+ reserve_highatomic_pageblock(page, zone, order);
+
return page;
}
}
@@ -2639,9 +2737,11 @@ retry:
/*
* If an allocation failed after direct reclaim, it could be because
- * pages are pinned on the per-cpu lists. Drain them and try again
+ * pages are pinned on the per-cpu lists or in high alloc reserves.
+ * Shrink them and try again
*/
if (!page && !drained) {
+ unreserve_highatomic_pageblock(ac);
drain_all_pages(NULL);
drained = true;
goto retry;
@@ -2686,7 +2786,7 @@ static inline int
gfp_to_alloc_flags(gfp_t gfp_mask)
{
int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
- const bool atomic = !(gfp_mask & (__GFP_WAIT | __GFP_NO_KSWAPD));
+ const bool atomic = gfp_mask_atomic(gfp_mask);
/* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 49963aa2dff3..3427a155f85e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -901,6 +901,7 @@ static char * const migratetype_names[MIGRATE_TYPES] = {
"Unmovable",
"Reclaimable",
"Movable",
+ "HighAtomic",
#ifdef CONFIG_CMA
"CMA",
#endif
--
2.4.3
From: Mel Gorman <[email protected]>
The primary purpose of watermarks is to ensure that reclaim can always
make forward progress in PF_MEMALLOC context (kswapd and direct reclaim).
These assume that order-0 allocations are all that is necessary for
forward progress.
High-order watermarks serve a different purpose. Kswapd had no high-order
awareness before they were introduced (https://lkml.org/lkml/2004/9/5/9).
This was particularly important when there were high-order atomic requests.
The watermarks both gave kswapd awareness and made a reserve for those
atomic requests.
There are two important side-effects of this. The most important is that
a non-atomic high-order request can fail even though free pages are available
and the order-0 watermarks are ok. The second is that high-order watermark
checks are expensive as the free list counts up to the requested order must
be examined.
With the introduction of MIGRATE_HIGHATOMIC it is no longer necessary to
have high-order watermarks. Kswapd and compaction still need high-order
awareness which is handled by checking that at least one suitable high-order
page is free.
In kernel 4.2-rc1, running the stutter benchmark with the GFP_ATOMIC order-3
allocation workload used earlier in the series produced 339574 allocation
failures on a single-node machine. With the HighAtomic reserves, that drops
to 28798 failures. With this patch applied, it drops to 9567 failures --
a 98% reduction compared to the vanilla kernel, or 67% in comparison to
having the HighAtomic reserves with high-order watermark checking.
The one potential side-effect of this is that, in a vanilla kernel, the
watermark checks may have kept free pages available for an atomic allocation.
Now we rely entirely on the HighAtomic reserves and on an early high-order
atomic allocation having created them. If the first high-order atomic
allocation happens after the system is already heavily fragmented then it
will fail.
Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 38 ++++++++++++++++++++++++--------------
1 file changed, 24 insertions(+), 14 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e5755390a5e5..e756df60dba6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2250,8 +2250,10 @@ static inline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
#endif /* CONFIG_FAIL_PAGE_ALLOC */
/*
- * Return true if free pages are above 'mark'. This takes into account the order
- * of the allocation.
+ * Return true if free base pages are above 'mark'. For high-order checks it
+ * will return true if the order-0 watermark is reached and there is at least
+ * one free page of a suitable size. Checking now avoids taking the zone lock
+ * to check in the allocation paths if no pages are free.
*/
static bool __zone_watermark_ok(struct zone *z, unsigned int order,
unsigned long mark, int classzone_idx, int alloc_flags,
@@ -2259,7 +2261,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
{
long min = mark;
int o;
- long free_cma = 0;
+ const bool atomic = (alloc_flags & ALLOC_HARDER);
/* free_pages may go negative - that's OK */
free_pages -= (1 << order) - 1;
@@ -2271,7 +2273,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
* If the caller is not atomic then discount the reserves. This will
 * over-estimate the size of the atomic reserve but it avoids a search
*/
- if (likely(!(alloc_flags & ALLOC_HARDER)))
+ if (likely(!atomic))
free_pages -= z->nr_reserved_highatomic;
else
min -= min / 4;
@@ -2279,22 +2281,30 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
#ifdef CONFIG_CMA
/* If allocation can't use CMA areas don't use free CMA pages */
if (!(alloc_flags & ALLOC_CMA))
- free_cma = zone_page_state(z, NR_FREE_CMA_PAGES);
+ free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
#endif
- if (free_pages - free_cma <= min + z->lowmem_reserve[classzone_idx])
+ if (free_pages <= min + z->lowmem_reserve[classzone_idx])
return false;
- for (o = 0; o < order; o++) {
- /* At the next order, this order's pages become unavailable */
- free_pages -= z->free_area[o].nr_free << o;
- /* Require fewer higher order pages to be free */
- min >>= 1;
+ /* order-0 watermarks are ok */
+ if (!order)
+ return true;
+
+ /* Check at least one high-order page is free */
+ for (o = order; o < MAX_ORDER; o++) {
+ struct free_area *area = &z->free_area[o];
+ int mt;
+
+ if (atomic && area->nr_free)
+ return true;
- if (free_pages <= min)
- return false;
+ for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
+ if (!list_empty(&area->free_list[mt]))
+ return true;
+ }
}
- return true;
+ return false;
}
bool zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
--
2.4.3
On Mon, 20 Jul 2015, Mel Gorman wrote:
> From: Mel Gorman <[email protected]>
>
> The zonelist cache (zlc) was introduced to skip over zones that were
> recently known to be full. At the time the paths it bypassed were the
> cpuset checks, the watermark calculations and zone_reclaim. The situation
> today is different and the complexity of zlc is harder to justify.
>
> 1) The cpuset checks are no-ops unless a cpuset is active and in general are
> a lot cheaper.
>
> 2) zone_reclaim is now disabled by default and I suspect that was a large
> source of the cost that zlc wanted to avoid. When it is enabled, it's
> known to be a major source of stalling when nodes fill up and it's
> unwise to hit every other user with the overhead.
>
> 3) Watermark checks are expensive to calculate for high-order
> allocation requests. Later patches in this series will reduce the cost of
> the watermark checking.
>
> 4) The most important issue is that in the current implementation it
> is possible for a failed THP allocation to mark a zone full for order-0
> allocations and cause a fallback to remote nodes.
>
> The last issue could be addressed with additional complexity but it's
> not clear that we need zlc at all so this patch deletes it. If stalls
> due to repeated zone_reclaim are ever reported as an issue then we should
> introduce deferring logic based on a timeout inside zone_reclaim itself
> and leave the page allocator fast paths alone.
>
> Impact on page-allocator microbenchmarks is negligible as they don't hit
> the paths where the zlc comes into play. The impact was noticeable in
> a workload called "stutter". One part uses a lot of anonymous memory,
> a second measures mmap latency and a third copies a large file. In an
> ideal world the latency application would not notice the mmap latency.
> On a 4-node machine the results of this patch are
>
> 4-node machine stutter
> 4.2.0-rc1 4.2.0-rc1
> vanilla nozlc-v1r20
> Min mmap 53.9902 ( 0.00%) 49.3629 ( 8.57%)
> 1st-qrtle mmap 54.6776 ( 0.00%) 54.1201 ( 1.02%)
> 2nd-qrtle mmap 54.9242 ( 0.00%) 54.5961 ( 0.60%)
> 3rd-qrtle mmap 55.1817 ( 0.00%) 54.9338 ( 0.45%)
> Max-90% mmap 55.3952 ( 0.00%) 55.3929 ( 0.00%)
> Max-93% mmap 55.4766 ( 0.00%) 57.5712 ( -3.78%)
> Max-95% mmap 55.5522 ( 0.00%) 57.8376 ( -4.11%)
> Max-99% mmap 55.7938 ( 0.00%) 63.6180 (-14.02%)
> Max mmap 6344.0292 ( 0.00%) 67.2477 ( 98.94%)
> Mean mmap 57.3732 ( 0.00%) 54.5680 ( 4.89%)
>
> Note the maximum stall latency which was 6 seconds and becomes 67ms with
> this patch applied. However, also note that it is not guaranteed this
> benchmark always hits pathological cases and the mileage varies. There is
> a secondary impact with more direct reclaim because zones are now being
> considered instead of being skipped by zlc.
>
> 4.1.0 4.1.0
> vanilla nozlc-v1r4
> Swap Ins 838 502
> Swap Outs 1149395 2622895
> DMA32 allocs 17839113 15863747
> Normal allocs 129045707 137847920
> Direct pages scanned 4070089 29046893
> Kswapd pages scanned 17147837 17140694
> Kswapd pages reclaimed 17146691 17139601
> Direct pages reclaimed 1888879 4886630
> Kswapd efficiency 99% 99%
> Kswapd velocity 17523.721 17518.928
> Direct efficiency 46% 16%
> Direct velocity 4159.306 29687.854
> Percentage direct scans 19% 62%
> Page writes by reclaim 1149395.000 2622895.000
> Page writes file 0 0
> Page writes anon 1149395 2622895
>
> The direct page scan and reclaim rates are noticeable. It is possible
> this will not be a universal win on all workloads but cycling through
> zonelists waiting for zlc->last_full_zap to expire is not the right
> decision.
>
> Signed-off-by: Mel Gorman <[email protected]>
I don't use a config that uses cpusets to restrict memory allocation
anymore, but it'd be interesting to see the impact that the spinlock and
cpuset hierarchy scan has for non-hardwalled allocations.
This removed the #define MAX_ZONELISTS 1 for UMA configs, which will cause
build errors, but once that's fixed:
Acked-by: David Rientjes <[email protected]>
I'm glad to see this go.
On Mon, 20 Jul 2015, Mel Gorman wrote:
> From: Mel Gorman <[email protected]>
>
> No user of zone_watermark_ok_safe() specifies alloc_flags. This patch
> removes the unnecessary parameter.
>
> Signed-off-by: Mel Gorman <[email protected]>
Acked-by: David Rientjes <[email protected]>
On Mon, 20 Jul 2015, Mel Gorman wrote:
> From: Mel Gorman <[email protected]>
>
> File-backed pages that will be immediately dirtied are balanced between
> zones but it's unnecessarily expensive. Move consider_zone_balanced into
> the alloc_context instead of checking bitmaps multiple times.
>
> Signed-off-by: Mel Gorman <[email protected]>
Acked-by: David Rientjes <[email protected]>
consider_zone_dirty eliminates zones over their dirty limits and
zone_dirty_ok() returns true if zones are under their dirty limits, so the
naming of both are a little strange. You might consider changing them
while you're here.
On Mon, 20 Jul 2015, Mel Gorman wrote:
> From: Mel Gorman <[email protected]>
>
> There is a seqcounter that protects spurious allocation fails when a task
> is changing the allowed nodes in a cpuset. There is no need to check the
> seqcounter until a cpuset exists.
>
> Signed-off-by: Mel Gorman <[email protected]>
Acked-by: David Rientjes <[email protected]>
but there's a typo in your email address in the signed-off-by line. Nice
to know you actually type them by hand though :)
On Tue, Jul 21, 2015 at 04:47:35PM -0700, David Rientjes wrote:
> On Mon, 20 Jul 2015, Mel Gorman wrote:
>
> > From: Mel Gorman <[email protected]>
> >
> > [...]
>
> I don't use a config that uses cpusets to restrict memory allocation
> anymore, but it'd be interesting to see the impact that the spinlock and
> cpuset hierarchy scan has for non-hardwalled allocations.
>
> This removed the #define MAX_ZONELISTS 1 for UMA configs, which will cause
> build errors, but once that's fixed:
>
The build error is now fixed. Thanks.
--
Mel Gorman
SUSE Labs
On Tue, Jul 21, 2015 at 05:08:42PM -0700, David Rientjes wrote:
> On Mon, 20 Jul 2015, Mel Gorman wrote:
>
> > From: Mel Gorman <[email protected]>
> >
> > File-backed pages that will be immediately dirtied are balanced between
> > zones but it's unnecessarily expensive. Move consider_zone_balanced into
> > the alloc_context instead of checking bitmaps multiple times.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
>
> Acked-by: David Rientjes <[email protected]>
>
Thanks.
> consider_zone_dirty eliminates zones over their dirty limits and
> zone_dirty_ok() returns true if zones are under their dirty limits, so the
> naming of both are a little strange. You might consider changing them
> while you're here.
Yeah, that seems sensible. I named the struct field spread_dirty_page so
the relevant check now looks like
if (ac->spread_dirty_page && !zone_dirty_ok(zone))
Alternative suggestions welcome but I think this is more meaningful than
consider_zone_dirty was.
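
The setters follow the same rename; roughly (a sketch against the current
patch, only the field name differs):

	/* Dirty zone balancing only done in the fast path */
	ac.spread_dirty_page = (gfp_mask & __GFP_WRITE);
	...
	/* Cleared before the slowpath so zones are not filtered on dirty limits there */
	ac.spread_dirty_page = false;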
--
Mel Gorman
SUSE Labs
On 07/20/2015 10:00 AM, Mel Gorman wrote:
> From: Mel Gorman <[email protected]>
>
> No user of zone_watermark_ok_safe() specifies alloc_flags. This patch
> removes the unnecessary parameter.
>
> Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 07/20/2015 10:00 AM, Mel Gorman wrote:
> From: Mel Gorman <[email protected]>
>
> File-backed pages that will be immediately dirtied are balanced between
> zones but it's unnecessarily expensive. Move consider_zone_balanced into
> the alloc_context instead of checking bitmaps multiple times.
>
> Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Agreed with the new ac->spread_dirty_page name (or rather the plural,
spread_dirty_pages?), and a nitpick below.
> ---
> mm/internal.h | 1 +
> mm/page_alloc.c | 9 ++++++---
> 2 files changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 36b23f1e2ca6..8977348fbeec 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -129,6 +129,7 @@ struct alloc_context {
> int classzone_idx;
> int migratetype;
> enum zone_type high_zoneidx;
> + bool consider_zone_dirty;
> };
>
> /*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4b35b196aeda..7c2dc022f4ba 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2295,8 +2295,6 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> struct zoneref *z;
> struct page *page = NULL;
> struct zone *zone;
> - bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) &&
> - (gfp_mask & __GFP_WRITE);
> int nr_fair_skipped = 0;
> bool zonelist_rescan;
>
> @@ -2355,7 +2353,7 @@ zonelist_scan:
mhocko recently suggested that I add this to ~/.gitconfig:
[diff "default"]
xfuncname = "^[[:alpha:]$_].*[^:]$"
so that git produces function names in hunk context instead of labels. I
gladly spread this arcane knowledge :)
> * will require awareness of zones in the
> * dirty-throttling and the flusher threads.
> */
This comment (in the part not shown) mentions ALLOC_WMARK_LOW as the
mechanism to distinguish fastpath from slowpath. This is no longer true,
so update it too?
> - if (consider_zone_dirty && !zone_dirty_ok(zone))
> + if (ac->consider_zone_dirty && !zone_dirty_ok(zone))
> continue;
>
> mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
> @@ -2995,6 +2993,10 @@ retry_cpuset:
>
> /* We set it here, as __alloc_pages_slowpath might have changed it */
> ac.zonelist = zonelist;
> +
> + /* Dirty zone balancing only done in the fast path */
> + ac.consider_zone_dirty = (gfp_mask & __GFP_WRITE);
> +
> /* The preferred zone is used for statistics later */
> preferred_zoneref = first_zones_zonelist(ac.zonelist, ac.high_zoneidx,
> ac.nodemask, &ac.preferred_zone);
> @@ -3012,6 +3014,7 @@ retry_cpuset:
> * complete.
> */
> alloc_mask = memalloc_noio_flags(gfp_mask);
> + ac.consider_zone_dirty = false;
>
> page = __alloc_pages_slowpath(alloc_mask, order, &ac);
> }
>
On 07/20/2015 10:00 AM, Mel Gorman wrote:
> From: Mel Gorman <[email protected]>
>
> There is a seqcounter that protects spurious allocation fails when a task
> is changing the allowed nodes in a cpuset. There is no need to check the
> seqcounter until a cpuset exists.
If cpusets become enabled between _begin and _retry, then it will retry
due to the comparison against the 0 returned earlier, but not crash, so
it's safe.
> Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
> ---
> include/linux/cpuset.h | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index 1b357997cac5..6eb27cb480b7 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -104,6 +104,9 @@ extern void cpuset_print_task_mems_allowed(struct task_struct *p);
> */
> static inline unsigned int read_mems_allowed_begin(void)
> {
> + if (!cpusets_enabled())
> + return 0;
> +
> return read_seqcount_begin(¤t->mems_allowed_seq);
> }
>
> @@ -115,6 +118,9 @@ static inline unsigned int read_mems_allowed_begin(void)
> */
> static inline bool read_mems_allowed_retry(unsigned int seq)
> {
> + if (!cpusets_enabled())
> + return false;
> +
> return read_seqcount_retry(¤t->mems_allowed_seq, seq);
> }
>
>
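
For reference, the caller side pairs these roughly like the retry_cpuset
loop in __alloc_pages_nodemask (sketch, allocation attempt abbreviated):

	unsigned int cpuset_mems_cookie;

retry_cpuset:
	cpuset_mems_cookie = read_mems_allowed_begin();

	/* ... attempt the allocation against current->mems_allowed ... */

	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
		goto retry_cpuset;

so a cookie of 0 taken before cpusets were enabled at worst costs one extra
pass through the loop.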
On 07/20/2015 10:00 AM, Mel Gorman wrote:
> From: Mel Gorman <[email protected]>
>
> During boot and suspend there is a restriction on the allowed GFP
> flags. During boot it prevents blocking operations before the scheduler
> is active. During suspend it is to avoid IO operations when storage is
> unavailable. The restriction on the mask is applied in some allocator
> hot-paths during normal operation which is wasteful. Use jump labels
> to only update the GFP mask when it is restricted.
>
> Signed-off-by: Mel Gorman <[email protected]>
[+CC Peterz, not trimmed due to that]
> ---
> include/linux/gfp.h | 33 ++++++++++++++++++++++++++++-----
> init/main.c | 2 +-
> mm/page_alloc.c | 21 +++++++--------------
> mm/slab.c | 4 ++--
> mm/slob.c | 4 ++--
> mm/slub.c | 6 +++---
> 6 files changed, 43 insertions(+), 27 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index ad35f300b9a4..6d3a2d430715 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -394,12 +394,35 @@ static inline void page_alloc_init_late(void)
>
> /*
> * gfp_allowed_mask is set to GFP_BOOT_MASK during early boot to restrict what
> - * GFP flags are used before interrupts are enabled. Once interrupts are
> - * enabled, it is set to __GFP_BITS_MASK while the system is running. During
> - * hibernation, it is used by PM to avoid I/O during memory allocation while
> - * devices are suspended.
> + * GFP flags are used before interrupts are enabled. During hibernation, it is
> + * used by PM to avoid I/O during memory allocation while devices are suspended.
> */
> -extern gfp_t gfp_allowed_mask;
> +extern gfp_t __gfp_allowed_mask;
> +
> +/* Only update the gfp_mask when it is restricted */
> +extern struct static_key gfp_restricted_key;
> +
> +static inline gfp_t gfp_allowed_mask(gfp_t gfp_mask)
> +{
> + if (static_key_false(&gfp_restricted_key))
This is where it uses static_key_false()...
> + return gfp_mask;
> +
> + return gfp_mask & __gfp_allowed_mask;
> +}
> +
> +static inline void unrestrict_gfp_allowed_mask(void)
> +{
> + WARN_ON(!static_key_enabled(&gfp_restricted_key));
> + __gfp_allowed_mask = __GFP_BITS_MASK;
> + static_key_slow_dec(&gfp_restricted_key);
> +}
> +
> +static inline void restrict_gfp_allowed_mask(gfp_t gfp_mask)
> +{
> + WARN_ON(static_key_enabled(&gfp_restricted_key));
> + __gfp_allowed_mask = gfp_mask;
> + static_key_slow_inc(&gfp_restricted_key);
> +}
>
> /* Returns true if the gfp_mask allows use of ALLOC_NO_WATERMARK */
> bool gfp_pfmemalloc_allowed(gfp_t gfp_mask);
> diff --git a/init/main.c b/init/main.c
> index c5d5626289ce..7e3a227559c6 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -983,7 +983,7 @@ static noinline void __init kernel_init_freeable(void)
> wait_for_completion(&kthreadd_done);
>
> /* Now the scheduler is fully set up and can do blocking allocations */
> - gfp_allowed_mask = __GFP_BITS_MASK;
> + unrestrict_gfp_allowed_mask();
>
> /*
> * init can allocate pages on any node
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7c2dc022f4ba..56432b59b797 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -124,7 +124,9 @@ unsigned long totalcma_pages __read_mostly;
> unsigned long dirty_balance_reserve __read_mostly;
>
> int percpu_pagelist_fraction;
> -gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
> +
> +gfp_t __gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
> +struct static_key gfp_restricted_key __read_mostly = STATIC_KEY_INIT_TRUE;
... and here it's combined with STATIC_KEY_INIT_TRUE. I've suspected
that this is not allowed, which Peter confirmed on IRC.
It's however true that the big comment at the top of
include/linux/jump_label.h only explicitly talks about combining
static_key_false() and static_key_true().
I'm not sure what's the correct idiom for a default-false static key
which however has to start as true on boot (Peter said such cases do
exist)...
>
> #ifdef CONFIG_PM_SLEEP
> /*
> @@ -136,30 +138,21 @@ gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
> * guaranteed not to run in parallel with that modification).
> */
>
> -static gfp_t saved_gfp_mask;
> -
> void pm_restore_gfp_mask(void)
> {
> WARN_ON(!mutex_is_locked(&pm_mutex));
> - if (saved_gfp_mask) {
> - gfp_allowed_mask = saved_gfp_mask;
> - saved_gfp_mask = 0;
> - }
> + unrestrict_gfp_allowed_mask();
> }
>
> void pm_restrict_gfp_mask(void)
> {
> WARN_ON(!mutex_is_locked(&pm_mutex));
> - WARN_ON(saved_gfp_mask);
> - saved_gfp_mask = gfp_allowed_mask;
> - gfp_allowed_mask &= ~GFP_IOFS;
> + restrict_gfp_allowed_mask(__GFP_BITS_MASK & ~GFP_IOFS);
> }
>
> bool pm_suspended_storage(void)
> {
> - if ((gfp_allowed_mask & GFP_IOFS) == GFP_IOFS)
> - return false;
> - return true;
> + return static_key_enabled(&gfp_restricted_key);
> }
> #endif /* CONFIG_PM_SLEEP */
>
> @@ -2968,7 +2961,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> .migratetype = gfpflags_to_migratetype(gfp_mask),
> };
>
> - gfp_mask &= gfp_allowed_mask;
> + gfp_mask = gfp_allowed_mask(gfp_mask);
>
> lockdep_trace_alloc(gfp_mask);
>
> diff --git a/mm/slab.c b/mm/slab.c
> index 200e22412a16..2c715b8c88f7 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -3151,7 +3151,7 @@ slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
> void *ptr;
> int slab_node = numa_mem_id();
>
> - flags &= gfp_allowed_mask;
> + flags = gfp_allowed_mask(flags);
>
> lockdep_trace_alloc(flags);
>
> @@ -3239,7 +3239,7 @@ slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
> unsigned long save_flags;
> void *objp;
>
> - flags &= gfp_allowed_mask;
> + flags = gfp_allowed_mask(flags);
>
> lockdep_trace_alloc(flags);
>
> diff --git a/mm/slob.c b/mm/slob.c
> index 4765f65019c7..23dbdac87fcb 100644
> --- a/mm/slob.c
> +++ b/mm/slob.c
> @@ -430,7 +430,7 @@ __do_kmalloc_node(size_t size, gfp_t gfp, int node, unsigned long caller)
> int align = max_t(size_t, ARCH_KMALLOC_MINALIGN, ARCH_SLAB_MINALIGN);
> void *ret;
>
> - gfp &= gfp_allowed_mask;
> + gfp = gfp_allowed_mask(gfp);
>
> lockdep_trace_alloc(gfp);
>
> @@ -536,7 +536,7 @@ static void *slob_alloc_node(struct kmem_cache *c, gfp_t flags, int node)
> {
> void *b;
>
> - flags &= gfp_allowed_mask;
> + flags = gfp_allowed_mask(flags);
>
> lockdep_trace_alloc(flags);
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 816df0016555..9eb79f7a48ba 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1261,7 +1261,7 @@ static inline void kfree_hook(const void *x)
> static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
> gfp_t flags)
> {
> - flags &= gfp_allowed_mask;
> + flags = gfp_allowed_mask(flags);
> lockdep_trace_alloc(flags);
> might_sleep_if(flags & __GFP_WAIT);
>
> @@ -1274,7 +1274,7 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
> static inline void slab_post_alloc_hook(struct kmem_cache *s,
> gfp_t flags, void *object)
> {
> - flags &= gfp_allowed_mask;
> + flags = gfp_allowed_mask(flags);
> kmemcheck_slab_alloc(s, flags, object, slab_ksize(s));
> kmemleak_alloc_recursive(object, s->object_size, 1, s->flags, flags);
> memcg_kmem_put_cache(s);
> @@ -1337,7 +1337,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
> struct kmem_cache_order_objects oo = s->oo;
> gfp_t alloc_gfp;
>
> - flags &= gfp_allowed_mask;
> + flags = gfp_allowed_mask(flags);
>
> if (flags & __GFP_WAIT)
> local_irq_enable();
>
On 07/20/2015 10:00 AM, Mel Gorman wrote:
> From: Mel Gorman <[email protected]>
>
> The global variable page_group_by_mobility_disabled remembers if page grouping
> by mobility was disabled at boot time. It's more efficient to do this by jump
> label.
>
> Signed-off-by: Mel Gorman <[email protected]>
[+CC Peterz]
> ---
> include/linux/gfp.h | 2 +-
> include/linux/mmzone.h | 7 ++++++-
> mm/page_alloc.c | 15 ++++++---------
> 3 files changed, 13 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 6d3a2d430715..5a27bbba63ed 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -151,7 +151,7 @@ static inline int gfpflags_to_migratetype(const gfp_t gfp_flags)
> {
> WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
>
> - if (unlikely(page_group_by_mobility_disabled))
> + if (page_group_by_mobility_disabled())
> return MIGRATE_UNMOVABLE;
>
> /* Group based on mobility */
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 672ac437c43c..c9497519340a 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -73,7 +73,12 @@ enum {
> for (order = 0; order < MAX_ORDER; order++) \
> for (type = 0; type < MIGRATE_TYPES; type++)
>
> -extern int page_group_by_mobility_disabled;
> +extern struct static_key page_group_by_mobility_key;
The "disabled" part is no longer in the name, but I suspect you didn't
want it to be too long?
> +
> +static inline bool page_group_by_mobility_disabled(void)
> +{
> + return static_key_false(&page_group_by_mobility_key);
> +}
>
> #define NR_MIGRATETYPE_BITS (PB_migrate_end - PB_migrate + 1)
> #define MIGRATETYPE_MASK ((1UL << NR_MIGRATETYPE_BITS) - 1)
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 56432b59b797..403cf31f8cf9 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -228,7 +228,7 @@ EXPORT_SYMBOL(nr_node_ids);
> EXPORT_SYMBOL(nr_online_nodes);
> #endif
>
> -int page_group_by_mobility_disabled __read_mostly;
> +struct static_key page_group_by_mobility_key __read_mostly = STATIC_KEY_INIT_FALSE;
>
> #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> static inline void reset_deferred_meminit(pg_data_t *pgdat)
> @@ -303,8 +303,7 @@ static inline bool update_defer_init(pg_data_t *pgdat,
>
> void set_pageblock_migratetype(struct page *page, int migratetype)
> {
> - if (unlikely(page_group_by_mobility_disabled &&
> - migratetype < MIGRATE_PCPTYPES))
> + if (page_group_by_mobility_disabled() && migratetype < MIGRATE_PCPTYPES)
> migratetype = MIGRATE_UNMOVABLE;
>
> set_pageblock_flags_group(page, (unsigned long)migratetype,
> @@ -1501,7 +1500,7 @@ static bool can_steal_fallback(unsigned int order, int start_mt)
> if (order >= pageblock_order / 2 ||
> start_mt == MIGRATE_RECLAIMABLE ||
> start_mt == MIGRATE_UNMOVABLE ||
> - page_group_by_mobility_disabled)
> + page_group_by_mobility_disabled())
> return true;
>
> return false;
> @@ -1530,7 +1529,7 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
>
> /* Claim the whole block if over half of it is free */
> if (pages >= (1 << (pageblock_order-1)) ||
> - page_group_by_mobility_disabled)
> + page_group_by_mobility_disabled())
> set_pageblock_migratetype(page, start_type);
> }
>
> @@ -4156,15 +4155,13 @@ void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
> * disabled and enable it later
> */
> if (vm_total_pages < (pageblock_nr_pages * MIGRATE_TYPES))
> - page_group_by_mobility_disabled = 1;
> - else
> - page_group_by_mobility_disabled = 0;
> + static_key_slow_inc(&page_group_by_mobility_key);
Um, so previously, booting with little memory would disable grouping by
mobility, and later hotplugging would enable it again, right? But that path
is now removed, so once disabled it stays disabled? That can't be right,
and I'm not sure about the effects of the recently introduced delayed
initialization here?
Looks like the API addition that Peter just posted would be useful here
:) http://marc.info/?l=linux-kernel&m=143808996921651&w=2
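E.g. something like this at the site above (just a sketch, assuming the
enable/disable helpers from that series and the key converted accordingly;
the key being enabled still means grouping is disabled):

	if (vm_total_pages < (pageblock_nr_pages * MIGRATE_TYPES))
		static_branch_enable(&page_group_by_mobility_key);
	else
		static_branch_disable(&page_group_by_mobility_key);

which would keep the disable/re-enable symmetry for memory hotplug.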
>
> pr_info("Built %i zonelists in %s order, mobility grouping %s. "
> "Total pages: %ld\n",
> nr_online_nodes,
> zonelist_order_name[current_zonelist_order],
> - page_group_by_mobility_disabled ? "off" : "on",
> + page_group_by_mobility_disabled() ? "off" : "on",
> vm_total_pages);
> #ifdef CONFIG_NUMA
> pr_info("Policy zone: %s\n", zone_names[policy_zone]);
>
On Tue, Jul 28, 2015 at 03:36:05PM +0200, Vlastimil Babka wrote:
> >+static inline gfp_t gfp_allowed_mask(gfp_t gfp_mask)
> >+{
> >+ if (static_key_false(&gfp_restricted_key))
>
> This is where it uses static_key_false()...
> >+struct static_key gfp_restricted_key __read_mostly = STATIC_KEY_INIT_TRUE;
>
> ... and here it's combined with STATIC_KEY_INIT_TRUE. I've suspected that
> this is not allowed, which Peter confirmed on IRC.
>
> It's however true that the big comment at the top of
> include/linux/jump_label.h only explicitly talks about combining
> static_key_false() and static_key_true().
>
> I'm not sure what's the correct idiom for a default-false static key which
> however has to start as true on boot (Peter said such cases do exist)...
There currently isn't one. But see the patchset I just sent to address
this:
lkml.kernel.org/r/[email protected]
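
With that series the above could read something like (sketch only, using
the proposed DEFINE_STATIC_KEY_TRUE / static_branch_unlikely names):

	/* true for GFP_BOOT_MASK at boot, expected false for most of runtime */
	DEFINE_STATIC_KEY_TRUE(gfp_restricted_key);

	static inline gfp_t gfp_allowed_mask(gfp_t gfp_mask)
	{
		/* restricted: apply the boot/suspend mask */
		if (static_branch_unlikely(&gfp_restricted_key))
			return gfp_mask & __gfp_allowed_mask;

		return gfp_mask;
	}

i.e. the initial value and the branch expectation are no longer tied
together.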
On Tue, Jul 28, 2015 at 03:36:05PM +0200, Vlastimil Babka wrote:
> >--- a/mm/page_alloc.c
> >+++ b/mm/page_alloc.c
> >@@ -124,7 +124,9 @@ unsigned long totalcma_pages __read_mostly;
> > unsigned long dirty_balance_reserve __read_mostly;
> >
> > int percpu_pagelist_fraction;
> >-gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
> >+
> >+gfp_t __gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
> >+struct static_key gfp_restricted_key __read_mostly = STATIC_KEY_INIT_TRUE;
>
> ... and here it's combined with STATIC_KEY_INIT_TRUE. I've suspected
> that this is not allowed, which Peter confirmed on IRC.
>
Thanks, I was not aware of hazards of that nature. I'll drop the
jump-label related patches from the series until the patches related to
the correct idiom are finalised. The micro-optimisations are not the
main point of this series and the savings are tiny.
--
Mel Gorman
SUSE Labs
On 07/20/2015 10:00 AM, Mel Gorman wrote:
> From: Mel Gorman <[email protected]>
>
> MIGRATE_RESERVE preserves an old property of the buddy allocator that existed
> prior to fragmentation avoidance -- min_free_kbytes worth of pages tended to
> remain free until the only alternative was to fail the allocation. At the
^ I think you meant contiguous instead of free? Is it because
splitting chooses lowest possible order, and grouping by mobility means you
might be splitting e.g. order-5 movable page instead of using order-0 unmovable
page? And that the fallback heuristics specifically select highest available
order? I think it's not that obvious, so worth mentioning.
> time it was discovered that high-order atomic allocations relied on this
> property so MIGRATE_RESERVE was introduced. A later patch will introduce
> an alternative MIGRATE_HIGHATOMIC so this patch deletes MIGRATE_RESERVE
> and supporting code so it'll be easier to review. Note that this patch
> in isolation may look like a false regression if someone was bisecting
> high-order atomic allocation failures.
>
> Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 07/20/2015 10:00 AM, Mel Gorman wrote:
> From: Mel Gorman <[email protected]>
>
> High-order watermark checking exists for two reasons -- kswapd high-order
> awareness and protection for high-order atomic requests. Historically we
> depended on MIGRATE_RESERVE to preserve min_free_kbytes as high-order free
> pages for as long as possible. This patch introduces MIGRATE_HIGHATOMIC
> that reserves pageblocks for high-order atomic allocations. This is expected
> to be more reliable than MIGRATE_RESERVE was.
>
> A MIGRATE_HIGHORDER pageblock is created when an allocation request steals
> a pageblock but limits the total number to 10% of the zone.
This looked weird, until I read the implementation and realized that "an
allocation request" is limited to high-order atomic allocation requests.
> The pageblocks are unreserved if an allocation fails after a direct
> reclaim attempt.
>
> The watermark checks account for the reserved pageblocks when the allocation
> request is not a high-order atomic allocation.
>
> The stutter benchmark was used to evaluate this but while it was running
> there was a systemtap script that randomly allocated between 1 and 1G worth
> of order-3 pages using GFP_ATOMIC. In kernel 4.2-rc1 running this workload
> on a single-node machine there were 339574 allocation failures. With this
> patch applied there were 28798 failures -- a 92% reduction. On a 4-node
> machine, allocation failures went from 76917 to 0 failures.
>
> There are minor theoretical side-effects. If the system is intensively
> making large numbers of long-lived high-order atomic allocations then
> there will be a lot of reserved pageblocks. This may push some workloads
> into reclaim until the number of reserved pageblocks is reduced again. This
> problem was not observed in reclaim intensive workloads but such workloads
> are also not atomic high-order intensive.
>
> Signed-off-by: Mel Gorman <[email protected]>
[...]
> +/*
> + * Used when an allocation is about to fail under memory pressure. This
> + * potentially hurts the reliability of high-order allocations when under
> + * intense memory pressure but failed atomic allocations should be easier
> + * to recover from than an OOM.
> + */
> +static void unreserve_highatomic_pageblock(const struct alloc_context *ac)
> +{
> + struct zonelist *zonelist = ac->zonelist;
> + unsigned long flags;
> + struct zoneref *z;
> + struct zone *zone;
> + struct page *page;
> + int order;
> +
> + for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->high_zoneidx,
> + ac->nodemask) {
This fixed order might bias some zones over others wrt unreserving. Is it OK?
> + /* Preserve at least one pageblock */
> + if (zone->nr_reserved_highatomic <= pageblock_nr_pages)
> + continue;
> +
> + spin_lock_irqsave(&zone->lock, flags);
> + for (order = 0; order < MAX_ORDER; order++) {
Would it make more sense to look in descending order for a higher chance of
unreserving a pageblock that's mostly free? Like the traditional page stealing does?
> + struct free_area *area = &(zone->free_area[order]);
> +
> + if (list_empty(&area->free_list[MIGRATE_HIGHATOMIC]))
> + continue;
> +
> + page = list_entry(area->free_list[MIGRATE_HIGHATOMIC].next,
> + struct page, lru);
> +
> + zone->nr_reserved_highatomic -= pageblock_nr_pages;
> + set_pageblock_migratetype(page, ac->migratetype);
Would it make more sense to assume MIGRATE_UNMOVABLE, as high-order allocations
present in the pageblock typically would be, and apply the traditional page
stealing heuristics to decide if it should be changed to ac->migratetype (if
that differs)?
> + move_freepages_block(zone, page, ac->migratetype);
> + spin_unlock_irqrestore(&zone->lock, flags);
> + return;
> + }
> + spin_unlock_irqrestore(&zone->lock, flags);
> + }
> +}
> +
> /* Remove an element from the buddy allocator from the fallback list */
> static inline struct page *
> __rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype)
> @@ -1619,15 +1689,26 @@ __rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype)
> return NULL;
> }
>
> +static inline bool gfp_mask_atomic(gfp_t gfp_mask)
> +{
> + return !(gfp_mask & (__GFP_WAIT | __GFP_NO_KSWAPD));
> +}
> +
> /*
> * Do the hard work of removing an element from the buddy allocator.
> * Call me with the zone->lock already held.
> */
> static struct page *__rmqueue(struct zone *zone, unsigned int order,
> - int migratetype)
> + int migratetype, gfp_t gfp_flags)
> {
> struct page *page;
>
> + if (unlikely(order && gfp_mask_atomic(gfp_flags))) {
> + page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
> + if (page)
> + goto out;
> + }
> +
> page = __rmqueue_smallest(zone, order, migratetype);
> if (unlikely(!page)) {
> if (migratetype == MIGRATE_MOVABLE)
> @@ -1637,6 +1718,7 @@ static struct page *__rmqueue(struct zone *zone, unsigned int order,
> page = __rmqueue_fallback(zone, order, migratetype);
> }
>
> +out:
> trace_mm_page_alloc_zone_locked(page, order, migratetype);
> return page;
> }
> @@ -1654,7 +1736,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>
> spin_lock(&zone->lock);
> for (i = 0; i < count; ++i) {
> - struct page *page = __rmqueue(zone, order, migratetype);
> + struct page *page = __rmqueue(zone, order, migratetype, 0);
> if (unlikely(page == NULL))
> break;
>
> @@ -2065,7 +2147,7 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
> WARN_ON_ONCE(order > 1);
> }
> spin_lock_irqsave(&zone->lock, flags);
> - page = __rmqueue(zone, order, migratetype);
> + page = __rmqueue(zone, order, migratetype, gfp_flags);
> spin_unlock(&zone->lock);
> if (!page)
> goto failed;
> @@ -2175,15 +2257,23 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
> unsigned long mark, int classzone_idx, int alloc_flags,
> long free_pages)
> {
> - /* free_pages may go negative - that's OK */
> long min = mark;
> int o;
> long free_cma = 0;
>
> + /* free_pages may go negative - that's OK */
> free_pages -= (1 << order) - 1;
> +
> if (alloc_flags & ALLOC_HIGH)
> min -= min / 2;
> - if (alloc_flags & ALLOC_HARDER)
> +
> + /*
> + * If the caller is not atomic then discount the reserves. This will
> + * over-estimate how the atomic reserve but it avoids a search
> + */
> + if (likely(!(alloc_flags & ALLOC_HARDER)))
> + free_pages -= z->nr_reserved_highatomic;
Hm, so in the case the maximum of 10% reserved blocks is already full, we deny
the allocation access to another 10% of the memory and push it to reclaim. This
seems rather excessive.
Searching would of course suck, as would attempting to replicate the handling of
NR_FREE_CMA_PAGES. Sigh.
> + else
> min -= min / 4;
>
> #ifdef CONFIG_CMA
> @@ -2372,6 +2462,14 @@ try_this_zone:
> if (page) {
> if (prep_new_page(page, order, gfp_mask, alloc_flags))
> goto try_this_zone;
> +
> + /*
> + * If this is a high-order atomic allocation then check
> + * if the pageblock should be reserved for the future
> + */
> + if (unlikely(order && (alloc_flags & ALLOC_HARDER)))
> + reserve_highatomic_pageblock(page, zone, order);
> +
> return page;
> }
> }
> @@ -2639,9 +2737,11 @@ retry:
>
> /*
> * If an allocation failed after direct reclaim, it could be because
> - * pages are pinned on the per-cpu lists. Drain them and try again
> + * pages are pinned on the per-cpu lists or in high alloc reserves.
> + * Shrink them them and try again
> */
> if (!page && !drained) {
> + unreserve_highatomic_pageblock(ac);
> drain_all_pages(NULL);
> drained = true;
> goto retry;
> @@ -2686,7 +2786,7 @@ static inline int
> gfp_to_alloc_flags(gfp_t gfp_mask)
> {
> int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
> - const bool atomic = !(gfp_mask & (__GFP_WAIT | __GFP_NO_KSWAPD));
> + const bool atomic = gfp_mask_atomic(gfp_mask);
>
> /* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
> BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 49963aa2dff3..3427a155f85e 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -901,6 +901,7 @@ static char * const migratetype_names[MIGRATE_TYPES] = {
> "Unmovable",
> "Reclaimable",
> "Movable",
> + "HighAtomic",
> #ifdef CONFIG_CMA
> "CMA",
> #endif
>
On Wed, Jul 29, 2015 at 11:59:15AM +0200, Vlastimil Babka wrote:
> On 07/20/2015 10:00 AM, Mel Gorman wrote:
> > From: Mel Gorman <[email protected]>
> >
> > MIGRATE_RESERVE preserves an old property of the buddy allocator that existed
> > prior to fragmentation avoidance -- min_free_kbytes worth of pages tended to
> > remain free until the only alternative was to fail the allocation. At the
>
> ^ I think you meant contiguous instead of free?
That is exactly what I meant.
> Is it because
> splitting chooses lowest possible order, and grouping by mobility means you
> might be splitting e.g. order-5 movable page instead of using order-0 unmovable
> page? And that the fallback heuristics specifically select highest available
> order? I think it's not that obvious, so worth mentioning.
>
Yes, the commit that introduced MIGRATE_RESERVE discusses it so I didn't
repeat it as the git digging is simply
1. Find the commit that introduced MIGRATE_HIGHATOMIC and see it
replaced MIGRATE_RESERVE
2. Find the commit that introduced MIGRATE_RESERVE
That locates 56fd56b868f1 ("Bias the location of pages freed for
min_free_kbytes in the same MAX_ORDER_NR_PAGES blocks").
> > time it was discovered that high-order atomic allocations relied on this
> > property so MIGRATE_RESERVE was introduced. A later patch will introduce
> > an alternative MIGRATE_HIGHATOMIC so this patch deletes MIGRATE_RESERVE
> > and supporting code so it'll be easier to review. Note that this patch
> > in isolation may look like a false regression if someone was bisecting
> > high-order atomic allocation failures.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
>
> Acked-by: Vlastimil Babka <[email protected]>
>
Thanks.
--
Mel Gorman
SUSE Labs
On 07/20/2015 10:00 AM, Mel Gorman wrote:
[...]
> static bool __zone_watermark_ok(struct zone *z, unsigned int order,
> unsigned long mark, int classzone_idx, int alloc_flags,
> @@ -2259,7 +2261,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
> {
> long min = mark;
> int o;
> - long free_cma = 0;
> + const bool atomic = (alloc_flags & ALLOC_HARDER);
>
> /* free_pages may go negative - that's OK */
> free_pages -= (1 << order) - 1;
> @@ -2271,7 +2273,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
> * If the caller is not atomic then discount the reserves. This will
> * over-estimate how the atomic reserve but it avoids a search
> */
> - if (likely(!(alloc_flags & ALLOC_HARDER)))
> + if (likely(!atomic))
> free_pages -= z->nr_reserved_highatomic;
> else
> min -= min / 4;
> @@ -2279,22 +2281,30 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
> #ifdef CONFIG_CMA
> /* If allocation can't use CMA areas don't use free CMA pages */
> if (!(alloc_flags & ALLOC_CMA))
> - free_cma = zone_page_state(z, NR_FREE_CMA_PAGES);
> + free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
> #endif
>
> - if (free_pages - free_cma <= min + z->lowmem_reserve[classzone_idx])
> + if (free_pages <= min + z->lowmem_reserve[classzone_idx])
> return false;
> - for (o = 0; o < order; o++) {
> - /* At the next order, this order's pages become unavailable */
> - free_pages -= z->free_area[o].nr_free << o;
>
> - /* Require fewer higher order pages to be free */
> - min >>= 1;
> + /* order-0 watermarks are ok */
> + if (!order)
> + return true;
> +
> + /* Check at least one high-order page is free */
> + for (o = order; o < MAX_ORDER; o++) {
> + struct free_area *area = &z->free_area[o];
> + int mt;
> +
> + if (atomic && area->nr_free)
> + return true;
This may be a false positive due to MIGRATE_CMA or MIGRATE_ISOLATE pages being
the only free ones. But maybe it doesn't matter that much?
>
> - if (free_pages <= min)
> - return false;
> + for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
> + if (!list_empty(&area->free_list[mt]))
> + return true;
> + }
This may be a false negative for ALLOC_CMA allocations, if the only free pages
are of MIGRATE_CMA. Arguably that's the worse case than a false positive?
> }
> - return true;
> + return false;
> }
>
> bool zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
>
On Wed, Jul 29, 2015 at 01:35:17PM +0200, Vlastimil Babka wrote:
> On 07/20/2015 10:00 AM, Mel Gorman wrote:
> > From: Mel Gorman <[email protected]>
> >
> > High-order watermark checking exists for two reasons -- kswapd high-order
> > awareness and protection for high-order atomic requests. Historically we
> > depended on MIGRATE_RESERVE to preserve min_free_kbytes as high-order free
> > pages for as long as possible. This patch introduces MIGRATE_HIGHATOMIC
> > that reserves pageblocks for high-order atomic allocations. This is expected
> > to be more reliable than MIGRATE_RESERVE was.
> >
> > A MIGRATE_HIGHORDER pageblock is created when an allocation request steals
> > a pageblock but limits the total number to 10% of the zone.
>
> This looked weird, until I read the implementation and realized that "an
> allocation request" is limited to high-order atomic allocation requests.
>
Which is an important detail for understanding the patch, thanks.
> > The pageblocks are unreserved if an allocation fails after a direct
> > reclaim attempt.
> >
> > The watermark checks account for the reserved pageblocks when the allocation
> > request is not a high-order atomic allocation.
> >
> > The stutter benchmark was used to evaluate this but while it was running
> > there was a systemtap script that randomly allocated between 1 and 1G worth
> > of order-3 pages using GFP_ATOMIC. In kernel 4.2-rc1 running this workload
> > on a single-node machine there were 339574 allocation failures. With this
> > patch applied there were 28798 failures -- a 92% reduction. On a 4-node
> > machine, allocation failures went from 76917 to 0 failures.
> >
> > There are minor theoretical side-effects. If the system is intensively
> > making large numbers of long-lived high-order atomic allocations then
> > there will be a lot of reserved pageblocks. This may push some workloads
> > into reclaim until the number of reserved pageblocks is reduced again. This
> > problem was not observed in reclaim intensive workloads but such workloads
> > are also not atomic high-order intensive.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
>
> [...]
>
> > +/*
> > + * Used when an allocation is about to fail under memory pressure. This
> > + * potentially hurts the reliability of high-order allocations when under
> > + * intense memory pressure but failed atomic allocations should be easier
> > + * to recover from than an OOM.
> > + */
> > +static void unreserve_highatomic_pageblock(const struct alloc_context *ac)
> > +{
> > + struct zonelist *zonelist = ac->zonelist;
> > + unsigned long flags;
> > + struct zoneref *z;
> > + struct zone *zone;
> > + struct page *page;
> > + int order;
> > +
> > + for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->high_zoneidx,
> > + ac->nodemask) {
>
> This fixed order might bias some zones over others wrt unreserving. Is it OK?
I could not think of a situation where it mattered. It'll always prefer
the highest zone over lower zones. Allocation requests that can use any
zone do not care. Allocation requests that are limited to lower zones are
protected for as long as possible.
>
> > + /* Preserve at least one pageblock */
> > + if (zone->nr_reserved_highatomic <= pageblock_nr_pages)
> > + continue;
> > +
> > + spin_lock_irqsave(&zone->lock, flags);
> > + for (order = 0; order < MAX_ORDER; order++) {
>
> Would it make more sense to look in descending order for a higher chance of
> unreserving a pageblock that's mostly free? Like the traditional page stealing does?
>
I don't think it's worth the search cost. Traditional page stealing is
searching because it's trying to minimise events that cause external
fragmentation. Here we'd gain very little. We are under some memory
pressure here; if enough pages are not free then another one will get
freed shortly. Either way, I doubt the difference is measurable.
> > + struct free_area *area = &(zone->free_area[order]);
> > +
> > + if (list_empty(&area->free_list[MIGRATE_HIGHATOMIC]))
> > + continue;
> > +
> > + page = list_entry(area->free_list[MIGRATE_HIGHATOMIC].next,
> > + struct page, lru);
> > +
> > + zone->nr_reserved_highatomic -= pageblock_nr_pages;
> > + set_pageblock_migratetype(page, ac->migratetype);
>
> Would it make more sense to assume MIGRATE_UNMOVABLE, as high-order allocations
> present in the pageblock typically would be, and apply the traditional page
> stealing heuristics to decide if it should be changed to ac->migratetype (if
> that differs)?
>
Superb spot, I had to think about this for a while and initially I was
thinking your suggestion was a no-brainer and obviously the right thing
to do.
On the pro side, it preserves the fragmentation logic because it'll force
the normal page stealing logic to be applied.
On the con side, we may reassign the pageblock twice -- once to
MIGRATE_UNMOVABLE and once to ac->migratetype. That one does not matter
but the second con is that we inadvertently increase the number of unmovable
blocks in some cases.
Let's say we default to MIGRATE_UNMOVABLE, ac->migratetype is MIGRATE_MOVABLE
and there are enough free pages to satisfy the allocation but not to steal
the whole pageblock. The end result is that we have a new unmovable
pageblock that may not be necessary, and the next unmovable allocation
placed there potentially stays forever. The key observation is that previously the
pageblock could have been short-lived high-order allocations that could
be completely free soon if it was assigned MIGRATE_MOVABLE. This may not
apply when SLUB is using high-order allocations but the point still
holds.
Grouping pages by mobility really needs to strive to keep the number of
unmovable blocks as low as possible. If ac->migratetype is
MIGRATE_UNMOVABLE then we lose nothing. If it's any other type then the
current code keeps the number of unmovable blocks as low as possible.
On that basis I think the current code is fine but it needs a comment to
record why it's like this.
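
Something along these lines above the set_pageblock_migratetype() call is
what I have in mind (sketch):

	/*
	 * Convert to ac->migratetype and avoid the normal pageblock
	 * stealing heuristics. Minimally, the caller is doing the
	 * work and needs the pages. More importantly, if the block
	 * was always converted to MIGRATE_UNMOVABLE or another type
	 * then the number of pageblocks that cannot be completely
	 * freed may increase.
	 */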
> > @@ -2175,15 +2257,23 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
> > unsigned long mark, int classzone_idx, int alloc_flags,
> > long free_pages)
> > {
> > - /* free_pages may go negative - that's OK */
> > long min = mark;
> > int o;
> > long free_cma = 0;
> >
> > + /* free_pages may go negative - that's OK */
> > free_pages -= (1 << order) - 1;
> > +
> > if (alloc_flags & ALLOC_HIGH)
> > min -= min / 2;
> > - if (alloc_flags & ALLOC_HARDER)
> > +
> > + /*
> > + * If the caller is not atomic then discount the reserves. This will
> > + * over-estimate how the atomic reserve but it avoids a search
> > + */
> > + if (likely(!(alloc_flags & ALLOC_HARDER)))
> > + free_pages -= z->nr_reserved_highatomic;
>
> Hm, so in the case the maximum of 10% reserved blocks is already full, we deny
> the allocation access to another 10% of the memory and push it to reclaim. This
> seems rather excessive.
It's necessary. If normal callers can use it then the reserve fills with
normal pages, the memory gets fragmented and high-order atomic allocations
fail due to fragmentation. Similarly, the number of MIGRATE_HIGHATOMIC
pageblocks cannot be unbounded or everything else will be continually pushed
into reclaim even if there is plenty of memory free.
--
Mel Gorman
SUSE Labs
On Wed, Jul 29, 2015 at 02:25:13PM +0200, Vlastimil Babka wrote:
> On 07/20/2015 10:00 AM, Mel Gorman wrote:
>
> [...]
>
> > static bool __zone_watermark_ok(struct zone *z, unsigned int order,
> > unsigned long mark, int classzone_idx, int alloc_flags,
> > @@ -2259,7 +2261,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
> > {
> > long min = mark;
> > int o;
> > - long free_cma = 0;
> > + const bool atomic = (alloc_flags & ALLOC_HARDER);
> >
> > /* free_pages may go negative - that's OK */
> > free_pages -= (1 << order) - 1;
> > @@ -2271,7 +2273,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
> > * If the caller is not atomic then discount the reserves. This will
> > * over-estimate how the atomic reserve but it avoids a search
> > */
> > - if (likely(!(alloc_flags & ALLOC_HARDER)))
> > + if (likely(!atomic))
> > free_pages -= z->nr_reserved_highatomic;
> > else
> > min -= min / 4;
> > @@ -2279,22 +2281,30 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
> > #ifdef CONFIG_CMA
> > /* If allocation can't use CMA areas don't use free CMA pages */
> > if (!(alloc_flags & ALLOC_CMA))
> > - free_cma = zone_page_state(z, NR_FREE_CMA_PAGES);
> > + free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
> > #endif
> >
> > - if (free_pages - free_cma <= min + z->lowmem_reserve[classzone_idx])
> > + if (free_pages <= min + z->lowmem_reserve[classzone_idx])
> > return false;
> > - for (o = 0; o < order; o++) {
> > - /* At the next order, this order's pages become unavailable */
> > - free_pages -= z->free_area[o].nr_free << o;
> >
> > - /* Require fewer higher order pages to be free */
> > - min >>= 1;
> > + /* order-0 watermarks are ok */
> > + if (!order)
> > + return true;
> > +
> > + /* Check at least one high-order page is free */
> > + for (o = order; o < MAX_ORDER; o++) {
> > + struct free_area *area = &z->free_area[o];
> > + int mt;
> > +
> > + if (atomic && area->nr_free)
> > + return true;
>
> This may be a false positive due to MIGRATE_CMA or MIGRATE_ISOLATE pages being
> the only free ones. But maybe it doesn't matter that much?
>
I don't think it does. If it's a false positive then a high-order
atomic allocation may fail, which is still meant to be a situation the
caller can cope with.
For MIGRATE_ISOLATE, it's a transient situation.
If this can be demonstrated as a problem for users of CMA then it would be
best to be certain there is a use case that requires more reliable high-order
atomic allocations *and* CMA at the same time. Ordinarily, CMA users are
also not atomic because they cannot migrate. If such an important use case
can be identified then it's a one-liner patch and a changelog that adds
if (!IS_ENABLED(CONFIG_CMA) && atomic && area->nr_free)
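In context, the one-liner would slot into the loop from the hunk quoted
above, roughly like this (illustrative only):

	/* Check at least one high-order page is free */
	for (o = order; o < MAX_ORDER; o++) {
		struct free_area *area = &z->free_area[o];
		int mt;

		/* Without CMA, any free page at this order is usable */
		if (!IS_ENABLED(CONFIG_CMA) && atomic && area->nr_free)
			return true;

		for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
			if (!list_empty(&area->free_list[mt]))
				return true;
		}
	}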
> >
> > - if (free_pages <= min)
> > - return false;
> > + for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
> > + if (!list_empty(&area->free_list[mt]))
> > + return true;
> > + }
>
> This may be a false negative for ALLOC_CMA allocations, if the only free pages
> are of MIGRATE_CMA. Arguably that's the worse case than a false positive?
>
I also think it is unlikely that there are many high-order atomic
allocations and CMA in use at the same time. If such a case is
identified then the check also needs to look at the CMA free list inside
the loop when CONFIG_CMA is enabled. Again, it's something I would
prefer to see a concrete use case for first.
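If such a use case did show up, the extra check could look something
like this inside the same loop (again just a sketch, not a posted
patch):

#ifdef CONFIG_CMA
		/* Count free MIGRATE_CMA pages for callers allowed to use CMA */
		if ((alloc_flags & ALLOC_CMA) &&
		    !list_empty(&area->free_list[MIGRATE_CMA]))
			return true;
#endif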
--
Mel Gorman
SUSE Labs
Hello, Mel.
On Mon, Jul 20, 2015 at 09:00:18AM +0100, Mel Gorman wrote:
> From: Mel Gorman <[email protected]>
>
> High-order watermark checking exists for two reasons -- kswapd high-order
> awareness and protection for high-order atomic requests. Historically we
> depended on MIGRATE_RESERVE to preserve min_free_kbytes as high-order free
> pages for as long as possible. This patch introduces MIGRATE_HIGHATOMIC
> that reserves pageblocks for high-order atomic allocations. This is expected
> to be more reliable than MIGRATE_RESERVE was.
I have some concerns about this patch.
1) This patch breaks the intention of __GFP_WAIT.
__GFP_WAIT is used when we want an allocation to succeed even if we need
to do some reclaim/compaction work. That implies the success of the
allocation is important. But a pageblock reserved for MIGRATE_HIGHATOMIC
makes atomic allocations (~__GFP_WAIT) more likely to succeed than
allocations with __GFP_WAIT in many situations. It breaks the basic
assumption behind the gfp flags and doesn't make any sense.
2) Who cares about the success of high-order atomic allocations with
this reliability?
In the case of an allocation without __GFP_WAIT, the requestor prepares
a sufficient fallback method. They just want to succeed if it is easily
possible. They don't want the allocation to succeed at the great cost of
slowing down the general workload, which this patch can cause by
accidentally reserving too much memory.
> A MIGRATE_HIGHORDER pageblock is created when an allocation request steals
> a pageblock but limits the total number to 10% of the zone.
When stealing happens, the pageblock can already be fragmented and we
can't fully utilize it without allowing order-0 allocations. This is
very wasteful.
> The pageblocks are unreserved if an allocation fails after a direct
> reclaim attempt.
>
> The watermark checks account for the reserved pageblocks when the allocation
> request is not a high-order atomic allocation.
>
> The stutter benchmark was used to evaluate this but while it was running
> there was a systemtap script that randomly allocated between 1 and 1G worth
> of order-3 pages using GFP_ATOMIC. In kernel 4.2-rc1 running this workload
> on a single-node machine there were 339574 allocation failures. With this
> patch applied there were 28798 failures -- a 92% reduction. On a 4-node
> machine, allocation failures went from 76917 to 0 failures.
There is some information missing that is needed to justify the
benchmark result. In particular, I'd like to know:
1) Detailed system setup (CPU, MEMORY, etc...)
2) Total number of attempts of the GFP_ATOMIC allocation requests
I don't know how you modified the stutter benchmark in mmtests but it
looks like there is no delay when continually requesting GFP_ATOMIC
allocations. 1G of order-3 allocation requests without delay seems
insane to me. Could you tell me how you modified that benchmark for
this patch?
> There are minor theoritical side-effects. If the system is intensively
> making large numbers of long-lived high-order atomic allocations then
> there will be a lot of reserved pageblocks. This may push some workloads
> into reclaim until the number of reserved pageblocks is reduced again. This
> problem was not observed in reclaim intensive workloads but such workloads
> are also not atomic high-order intensive.
I don't think this is a theoretical side-effect. It can happen easily.
Recently, the network subsystem made some of its high-order allocation
requests ~__GFP_WAIT (fb05e7a89f50: net: don't wait for order-3 page
allocation). And I've submitted a similar patch for slub today
(mm/slub: don't wait for high-order page allocation). That gives the
system more atomic high-order allocation requests, so this side-effect
becomes possible in many situations.
Thanks.
On Mon, Jul 20, 2015 at 09:00:19AM +0100, Mel Gorman wrote:
> From: Mel Gorman <[email protected]>
>
> The primary purpose of watermarks is to ensure that reclaim can always
> make forward progress in PF_MEMALLOC context (kswapd and direct reclaim).
> These assume that order-0 allocations are all that is necessary for
> forward progress.
>
> High-order watermarks serve a different purpose. Kswapd had no high-order
> awareness before they were introduced (https://lkml.org/lkml/2004/9/5/9).
> This was particularly important when there were high-order atomic requests.
> The watermarks both gave kswapd awareness and made a reserve for those
> atomic requests.
>
> There are two important side-effects of this. The most important is that
> a non-atomic high-order request can fail even though free pages are available
> and the order-0 watermarks are ok. The second is that high-order watermark
> checks are expensive as the free list counts up to the requested order must
> be examined.
>
> With the introduction of MIGRATE_HIGHATOMIC it is no longer necessary to
> have high-order watermarks. Kswapd and compaction still need high-order
> awareness which is handled by checking that at least one suitable high-order
> page is free.
I totally agree with removing watermark checking for orders from
PAGE_ALLOC_COSTLY_ORDER to MAX_ORDER. It doesn't make sense to maintain
high-order freepages whose allocation the MM doesn't guarantee anyway.
For example, on my system, when there is 1 order-9 freepage, an
allocation request for order-9 fails because the watermark check
requires at least 2 order-9 freepages for an order-9 allocation to
succeed.
But I think watermark checking for orders up to PAGE_ALLOC_COSTLY_ORDER
is different. If we maintain just 1 high-order freepage, successive
high-order allocation requests that should succeed always fall into the
allocation slow path and go into direct reclaim/compaction. That
enlarges many workloads' latency. We should prepare at least some number
of freepages to handle successive high-order allocation requests
gracefully.
So, how about the following?
1) kswapd checks the watermark as-is up to PAGE_ALLOC_COSTLY_ORDER. That
guarantees kswapd prepares some number of high-order freepages so
successive high-order allocation requests will be handled gracefully.
2) In the case of !kswapd, just check whether an appropriate freepage is
in the buddy lists or not.
Thanks.
On Mon, Jul 20, 2015 at 09:00:09AM +0100, Mel Gorman wrote:
> From: Mel Gorman <[email protected]>
>
> This series started with the idea to move LRU lists to pgdat but this
> part was more important to start with. It was written against 4.2-rc1 but
> applies to 4.2-rc3.
>
> The zonelist cache has been around for a long time but it is of dubious merit
> with a lot of complexity. There are a few reasons why it needs help that
> are explained in the first patch but the most important is that a failed
> THP allocation can cause a zone to be treated as "full". This potentially
> causes unnecessary stalls, reclaim activity or remote fallbacks. Maybe the
> issues could be fixed but it's not worth it. The series places a small
> number of other micro-optimisations on top before examining watermarks.
>
> High-order watermarks are something that can cause high-order allocations to
> fail even though pages are free. This was originally to protect high-order
> atomic allocations but there is a much better way that can be handled using
> migrate types. This series uses page grouping by mobility to preserve some
> pageblocks for high-order allocations with the size of the reservation
> depending on demand. kswapd awareness is maintained by examining the free
> lists. By patch 10 in this series, there are no high-order watermark checks
> while preserving the properties that motivated the introduction of the
> watermark checks.
I guess that removal of the zonelist cache and removal of the
high-order watermarks have different purposes and different sets of
readers. It would be better to separate these two kinds of patches next
time to help reviewers see what they want to see.
Thanks.
On Fri, Jul 31, 2015 at 02:54:07PM +0900, Joonsoo Kim wrote:
> Hello, Mel.
>
> On Mon, Jul 20, 2015 at 09:00:18AM +0100, Mel Gorman wrote:
> > From: Mel Gorman <[email protected]>
> >
> > High-order watermark checking exists for two reasons -- kswapd high-order
> > awareness and protection for high-order atomic requests. Historically we
> > depended on MIGRATE_RESERVE to preserve min_free_kbytes as high-order free
> > pages for as long as possible. This patch introduces MIGRATE_HIGHATOMIC
> > that reserves pageblocks for high-order atomic allocations. This is expected
> > to be more reliable than MIGRATE_RESERVE was.
>
> I have some concerns on this patch.
>
> 1) This patch breaks intention of __GFP_WAIT.
> __GFP_WAIT is used when we want to succeed allocation even if we need
> to do some reclaim/compaction work. That implies importance of
> allocation success. But, reserved pageblock for MIGRATE_HIGHATOMIC makes
> atomic allocation (~__GFP_WAIT) more successful than allocation with
> __GFP_WAIT in many situation. It breaks basic assumption of gfp flags
> and doesn't make any sense.
>
Currently allocation requests that do not specify __GFP_WAIT get the
ALLOC_HARDER flag which allows them to dip further into watermark reserves.
It already is the case that there are corner cases where a high atomic
allocation can succeed when a non-atomic allocation would reclaim.
> 2) Who care about success of high-order atomic allocation with this
> reliability?
Historically, network configurations with large MTUs on hardware that
could not scatter/gather. These days the network stack will also attempt
atomic order-3 allocations to reduce overhead, and SLUB also attempts
atomic high-order allocations for the same reason. It's why
MIGRATE_RESERVE exists at all, so the intent of the patch is to preserve
what MIGRATE_RESERVE was for but do it better.
> In case of allocation without __GFP_WAIT, requestor preare sufficient
> fallback method. They just want to success if it is easily successful.
> They don't want to succeed allocation with paying great cost that slow
> down general workload by this patch that can be accidentally reserve
> too much memory.
>
Not necessarily true. In the historical case, the network request was atomic
because it was from IRQ context and could not sleep.
> > A MIGRATE_HIGHORDER pageblock is created when an allocation request steals
> > a pageblock but limits the total number to 10% of the zone.
>
> When steals happens, pageblock already can be fragmented and we can't
> fully utilize this pageblock without allowing order-0 allocation. This
> is very waste.
>
If the pageblock was stolen, it implies there was at least 1 usable page
of the correct order. As the pageblock is then reserved, any pages that
free in that block stay free for use by high-order atomic allocations.
Else, the number of pageblocks will increase again until the 10% limit
is hit.
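A minimal sketch of the reservation step being described here (the
function name is mine and the locking is omitted; the cap is the 10% of
the zone from the changelog):

static void sketch_reserve_highatomic_pageblock(struct page *page,
						struct zone *zone)
{
	/* Respect the cap of roughly 10% of the zone */
	if (zone->nr_reserved_highatomic + pageblock_nr_pages >
	    zone->managed_pages / 10)
		return;

	/* Pages freed back into this block stay usable for atomics */
	zone->nr_reserved_highatomic += pageblock_nr_pages;
	set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
}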
> > The pageblocks are unreserved if an allocation fails after a direct
> > reclaim attempt.
> >
> > The watermark checks account for the reserved pageblocks when the allocation
> > request is not a high-order atomic allocation.
> >
> > The stutter benchmark was used to evaluate this but while it was running
> > there was a systemtap script that randomly allocated between 1 and 1G worth
> > of order-3 pages using GFP_ATOMIC. In kernel 4.2-rc1 running this workload
> > on a single-node machine there were 339574 allocation failures. With this
> > patch applied there were 28798 failures -- a 92% reduction. On a 4-node
> > machine, allocation failures went from 76917 to 0 failures.
>
> There is some missing information to justify benchmark result.
> Especially, I'd like to know:
>
> 1) Detailed system setup (CPU, MEMORY, etc...)
CPUs were 8 core Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz with 8G of RAM.
> 2) Total number of attempt of GFP_ATOMIC allocation request
>
Each attempt was between 1 and 1G randomly as described already.
> I don't know how you modify stutter benchmark in mmtests but it
> looks like there is no delay when continually requesting GFP_ATOMIC
> allocation.
> 1G of order-3 allocation request without delay seems insane
> to me. Could you tell me how you modify that benchmark for this patch?
>
The stutter benchmark was not modified. The watch-stress-highorder-atomic
monitor was run in parallel and that's what is doing the allocation. It's
true that up to 1G of order-3 allocations without delay would be insane
in a normal situation. The point was to show an extreme case where atomic
allocations were used and to test whether the reserves held up or not.
> > There are minor theoritical side-effects. If the system is intensively
> > making large numbers of long-lived high-order atomic allocations then
> > there will be a lot of reserved pageblocks. This may push some workloads
> > into reclaim until the number of reserved pageblocks is reduced again. This
> > problem was not observed in reclaim intensive workloads but such workloads
> > are also not atomic high-order intensive.
>
> I don't think this is theoritical side-effects. It can happen easily.
> Recently, network subsystem makes some of their high-order allocation
> request ~_GFP_WAIT (fb05e7a89f50: net: don't wait for order-3 page
> allocation). And, I've submitted similar patch for slub today
> (mm/slub: don't wait for high-order page allocation). That
> makes system atomic high-order allocation request more and this side-effect
> can be possible in many situation.
>
The key is long-lived allocations. The network subsystem frees theirs. I
was not able to trigger a situation in a variety of workloads where these
happened, which is why I classified it as theoretical.
--
Mel Gorman
SUSE Labs
On Fri, Jul 31, 2015 at 03:08:38PM +0900, Joonsoo Kim wrote:
> On Mon, Jul 20, 2015 at 09:00:19AM +0100, Mel Gorman wrote:
> > From: Mel Gorman <[email protected]>
> >
> > The primary purpose of watermarks is to ensure that reclaim can always
> > make forward progress in PF_MEMALLOC context (kswapd and direct reclaim).
> > These assume that order-0 allocations are all that is necessary for
> > forward progress.
> >
> > High-order watermarks serve a different purpose. Kswapd had no high-order
> > awareness before they were introduced (https://lkml.org/lkml/2004/9/5/9).
> > This was particularly important when there were high-order atomic requests.
> > The watermarks both gave kswapd awareness and made a reserve for those
> > atomic requests.
> >
> > There are two important side-effects of this. The most important is that
> > a non-atomic high-order request can fail even though free pages are available
> > and the order-0 watermarks are ok. The second is that high-order watermark
> > checks are expensive as the free list counts up to the requested order must
> > be examined.
> >
> > With the introduction of MIGRATE_HIGHATOMIC it is no longer necessary to
> > have high-order watermarks. Kswapd and compaction still need high-order
> > awareness which is handled by checking that at least one suitable high-order
> > page is free.
>
> I totally agree removing watermark checking for order from
> PAGE_ALLOC_COSTLY_ORDER to MAX_ORDER. It doesn't make sense to
> maintain such high-order freepage that MM don't guarantee allocation
> success. For example, in my system, when there is 1 order-9 freepage,
> allocation request for order-9 fails because watermark check requires
> at least 2 order-9 freepages in order to succeed order-9 allocation.
>
> But, I think watermark checking with order up to PAGE_ALLOC_COSTLY_ORDER is
> different. If we maintain just 1 high-order freepages, successive
> high-order allocation request that should be success always fall into
> allocation slow-path and go into the direct reclaim/compaction. It enlarges
> many workload's latency. We should prepare at least some number of freepage
> to handle successive high-order allocation request gracefully.
>
> So, how about following?
>
> 1) kswapd checks watermark as is up to PAGE_ALLOC_COSTLY_ORDER. It
> guarantees kswapd prepares some number of high-order freepages so
> successive high-order allocation request will be handlded gracefully.
> 2) In case of !kswapd, just check whether appropriate freepage is
> in buddy or not.
>
If !atomic allocations use the high-order reserves then they'll fragment
similarly to how they get fragmented today. It defeats the purpose of
the reserve. I noted in the leader that embedded platforms may choose to
carry an out-of-tree patch that makes the reserves a kernel reserve for
high-order pages but that I didn't think it was a good idea for mainline.
Your suggestion implies we would have two watermark checks: the fast
path, which obeys watermarks in the traditional way and which kswapd
would also use, and the slow path, which would use the watermark check
from this patch. It is quite complex when historically it was expected
that a !atomic high-order allocation request may take a long time.
Furthermore,
it's the case that kswapd gives up high-order reclaim requests very
quickly because there were cases where a high-order request would cause
kswapd to continually reclaim when the system was fragmented. I fear
that your suggestion would partially reintroduce the problem in the name
of trying to decrease the latency of a !atomic high-order allocation
request that is expected to be expensive sometimes.
--
Mel Gorman
SUSE Labs
On Fri, Jul 31, 2015 at 03:14:03PM +0900, Joonsoo Kim wrote:
> On Mon, Jul 20, 2015 at 09:00:09AM +0100, Mel Gorman wrote:
> > From: Mel Gorman <[email protected]>
> >
> > This series started with the idea to move LRU lists to pgdat but this
> > part was more important to start with. It was written against 4.2-rc1 but
> > applies to 4.2-rc3.
> >
> > The zonelist cache has been around for a long time but it is of dubious merit
> > with a lot of complexity. There are a few reasons why it needs help that
> > are explained in the first patch but the most important is that a failed
> > THP allocation can cause a zone to be treated as "full". This potentially
> > causes unnecessary stalls, reclaim activity or remote fallbacks. Maybe the
> > issues could be fixed but it's not worth it. The series places a small
> > number of other micro-optimisations on top before examining watermarks.
> >
> > High-order watermarks are something that can cause high-order allocations to
> > fail even though pages are free. This was originally to protect high-order
> > atomic allocations but there is a much better way that can be handled using
> > migrate types. This series uses page grouping by mobility to preserve some
> > pageblocks for high-order allocations with the size of the reservation
> > depending on demand. kswapd awareness is maintained by examining the free
> > lists. By patch 10 in this series, there are no high-order watermark checks
> > while preserving the properties that motivated the introduction of the
> > watermark checks.
>
> I guess that removal of zonelist cache and high-order watermarks has
> different purpose and different set of reader. It is better to
> separate this two kinds of patches next time to help reviewer to see
> what they want to see.
>
One of the reasons the zonelist cache existed was to avoid watermark
checks in some cases. The series also intends to reduce the cost of
watermark checks in some cases, which is why they are part of the same
series. I'm not comfortable
doing one without the other.
--
Mel Gorman
SUSE Labs
On 07/31/2015 09:11 AM, Mel Gorman wrote:
> On Fri, Jul 31, 2015 at 02:54:07PM +0900, Joonsoo Kim wrote:
>> Hello, Mel.
>>
>> On Mon, Jul 20, 2015 at 09:00:18AM +0100, Mel Gorman wrote:
>>> From: Mel Gorman <[email protected]>
>>>
>>> High-order watermark checking exists for two reasons -- kswapd high-order
>>> awareness and protection for high-order atomic requests. Historically we
>>> depended on MIGRATE_RESERVE to preserve min_free_kbytes as high-order free
>>> pages for as long as possible. This patch introduces MIGRATE_HIGHATOMIC
>>> that reserves pageblocks for high-order atomic allocations. This is expected
>>> to be more reliable than MIGRATE_RESERVE was.
>>
>> I have some concerns on this patch.
>>
>> 1) This patch breaks intention of __GFP_WAIT.
>> __GFP_WAIT is used when we want to succeed allocation even if we need
>> to do some reclaim/compaction work. That implies importance of
>> allocation success. But, reserved pageblock for MIGRATE_HIGHATOMIC makes
>> atomic allocation (~__GFP_WAIT) more successful than allocation with
>> __GFP_WAIT in many situation. It breaks basic assumption of gfp flags
>> and doesn't make any sense.
>>
>
> Currently allocation requests that do not specify __GFP_WAIT get the
> ALLOC_HARDER flag which allows them to dip further into watermark reserves.
> It already is the case that there are corner cases where a high atomic
> allocation can succeed when a non-atomic allocation would reclaim.
I think (and said so before elsewhere) that the problem is that we
don't currently distinguish allocations that can't wait (= they are
really atomic and have no order-0 fallback) from allocations that just
don't want to wait (= they have fallbacks). The second ones should
obviously
not access the current ALLOC_HARDER watermark-based reserves nor the
proposed highatomic reserves.
Well, we do look at the __GFP_NO_KSWAPD flag to treat an allocation as
non-atomic, so that covers THP allocations and two drivers. But the
recent networking commit fb05e7a89f50 didn't add the flag and nor does
Joonsoo's slub patch use it. Either we should rename the flag and employ
it where appropriate, or agree that access to reserves is an orthogonal
concern to waking up kswapd, and distinguish non-atomic non-__GFP_WAIT
allocations differently.
>>> A MIGRATE_HIGHORDER pageblock is created when an allocation request steals
>>> a pageblock but limits the total number to 10% of the zone.
>>
>> When steals happens, pageblock already can be fragmented and we can't
>> fully utilize this pageblock without allowing order-0 allocation. This
>> is very waste.
>>
>
> If the pageblock was stolen, it implies there was at least 1 usable page
> of the correct order. As the pageblock is then reserved, any pages that
> free in that block stay free for use by high-order atomic allocations.
> Else, the number of pageblocks will increase again until the 10% limit
> is hit.
It's however true that many of the "any pages free in that block" may be
order-0, so they both won't be useful to high-order atomic allocations,
and won't be available to other allocations, so they might remain unused.
On Fri, Jul 31, 2015 at 09:25:13AM +0200, Vlastimil Babka wrote:
> On 07/31/2015 09:11 AM, Mel Gorman wrote:
> >On Fri, Jul 31, 2015 at 02:54:07PM +0900, Joonsoo Kim wrote:
> >>Hello, Mel.
> >>
> >>On Mon, Jul 20, 2015 at 09:00:18AM +0100, Mel Gorman wrote:
> >>>From: Mel Gorman <[email protected]>
> >>>
> >>>High-order watermark checking exists for two reasons -- kswapd high-order
> >>>awareness and protection for high-order atomic requests. Historically we
> >>>depended on MIGRATE_RESERVE to preserve min_free_kbytes as high-order free
> >>>pages for as long as possible. This patch introduces MIGRATE_HIGHATOMIC
> >>>that reserves pageblocks for high-order atomic allocations. This is expected
> >>>to be more reliable than MIGRATE_RESERVE was.
> >>
> >>I have some concerns on this patch.
> >>
> >>1) This patch breaks intention of __GFP_WAIT.
> >>__GFP_WAIT is used when we want to succeed allocation even if we need
> >>to do some reclaim/compaction work. That implies importance of
> >>allocation success. But, reserved pageblock for MIGRATE_HIGHATOMIC makes
> >>atomic allocation (~__GFP_WAIT) more successful than allocation with
> >>__GFP_WAIT in many situation. It breaks basic assumption of gfp flags
> >>and doesn't make any sense.
> >>
> >
> >Currently allocation requests that do not specify __GFP_WAIT get the
> >ALLOC_HARDER flag which allows them to dip further into watermark reserves.
> >It already is the case that there are corner cases where a high atomic
> >allocation can succeed when a non-atomic allocation would reclaim.
>
> I think (and said so before elsewhere) is that the problem is that we don't
> currently distinguish allocations that can't wait (=are really atomic and
> have no order-0 fallback) and allocations that just don't want to wait
> (=they have fallbacks). The second ones should obviously not access the
> current ALLOC_HARDER watermark-based reserves nor the proposed highatomic
> reserves.
>
It's a separate issue though. There are a number of cases
1. can't wait because a spinlock is held or in interrupt
2. does not want to wait because a fallback option is available
3. does not want to wait or wake kswapd because a fallback option is available
4. should not fail as it would cause major difficulties
5. cannot fail because it's a functional failure
Case 5 is never meant to occur, but it might be the situation on
embedded platforms. If that is the case then they should consider
modifying MIGRATE_HIGHATOMIC in this patch series, but I don't think it
belongs in mainline as it has other consequences. Cases 1-4 are a
separate series.
Right now, this series does not drastically alter the concept that in
some cases atomic allocations will succeed without delay when a !atomic
allocation would have to reclaim.
There is certainly value to ironing out 1-4 on top and teaching SLUB,
THP and networking the distinction.
> Well we do look at __GFP_NO_KSWAPD flag to treat allocation as non-atomic,
> so that covers THP allocations and two drivers. But the recent networking
> commit fb05e7a89f50 didn't add the flag and nor does Joonsoo's slub patch
> use it. Either we should rename the flag and employ it where appropriate, or
> agree that access to reserves is orthogonal concern to waking up kswapd, and
> distinguish non-atomic non-__GFP_WAIT allocations differently.
>
Separate problem with a separate series. This one is about removing
the zonelist cache due to complexity and removing an odd anomaly where
allocations can fail due to how watermarks are calculated.
> >>>A MIGRATE_HIGHORDER pageblock is created when an allocation request steals
> >>>a pageblock but limits the total number to 10% of the zone.
> >>
> >>When steals happens, pageblock already can be fragmented and we can't
> >>fully utilize this pageblock without allowing order-0 allocation. This
> >>is very waste.
> >>
> >
> >If the pageblock was stolen, it implies there was at least 1 usable page
> >of the correct order. As the pageblock is then reserved, any pages that
> >free in that block stay free for use by high-order atomic allocations.
> >Else, the number of pageblocks will increase again until the 10% limit
> >is hit.
>
> It's however true that many of the "any pages free in that block" may be
> order-0, so they both won't be useful to high-order atomic allocations, and
> won't be available to other allocations, so they might remain unused.
I typoed slightly and missed a letter but the same outcome applies when
corrected -- any pages that are *freed* in that block stay free, if they
merge with buddies, for use by high-order atomic allocations. Otherwise,
the number of pageblocks will increase again until the 10% limit is
hit.
If the limit is hit and we are still failing then it's no different to
what can happen today except it took a lot longer and was a lot harder to
trigger. As the changelog pointed out, with this approach the allocation
failure rate was massively reduced but not eliminated.
--
Mel Gorman
SUSE Labs
On Fri, Jul 31, 2015 at 08:11:13AM +0100, Mel Gorman wrote:
> On Fri, Jul 31, 2015 at 02:54:07PM +0900, Joonsoo Kim wrote:
> > Hello, Mel.
> >
> > On Mon, Jul 20, 2015 at 09:00:18AM +0100, Mel Gorman wrote:
> > > From: Mel Gorman <[email protected]>
> > >
> > > High-order watermark checking exists for two reasons -- kswapd high-order
> > > awareness and protection for high-order atomic requests. Historically we
> > > depended on MIGRATE_RESERVE to preserve min_free_kbytes as high-order free
> > > pages for as long as possible. This patch introduces MIGRATE_HIGHATOMIC
> > > that reserves pageblocks for high-order atomic allocations. This is expected
> > > to be more reliable than MIGRATE_RESERVE was.
> >
> > I have some concerns on this patch.
> >
> > 1) This patch breaks intention of __GFP_WAIT.
> > __GFP_WAIT is used when we want to succeed allocation even if we need
> > to do some reclaim/compaction work. That implies importance of
> > allocation success. But, reserved pageblock for MIGRATE_HIGHATOMIC makes
> > atomic allocation (~__GFP_WAIT) more successful than allocation with
> > __GFP_WAIT in many situation. It breaks basic assumption of gfp flags
> > and doesn't make any sense.
> >
>
> Currently allocation requests that do not specify __GFP_WAIT get the
> ALLOC_HARDER flag which allows them to dip further into watermark reserves.
> It already is the case that there are corner cases where a high atomic
> allocation can succeed when a non-atomic allocation would reclaim.
I know that. It's a matter of magnitude. If your patch is applied,
GFP_ATOMIC almost always succeeds and there is no merit in using
__GFP_WAIT. If users can easily bypass the big overhead of
reclaim/compaction through GFP_ATOMIC allocations, they will decide to
use the GFP_ATOMIC flag rather than adding __GFP_WAIT.
>
> > 2) Who care about success of high-order atomic allocation with this
> > reliability?
>
> Historically network configurations with large MTUs that could not scatter
> gather. These days network will also attempt atomic order-3 allocations
> to reduce overhead. SLUB also attempts atomic high-order allocations to
> reduce overhead. It's why MIGRATE_RESERVE exists at all so the intent of
> the patch is to preserve what MIGRATE_RESERVE was for but do it better.
Normally, SLUB doesn't rely on the success of high-order allocations,
so it doesn't need such reliability. It can fall back to low-order
allocations. Moreover, we will soon be able to get the benefit of
high-order allocations by using kcompactd as suggested by Vlastimil.
> > In case of allocation without __GFP_WAIT, requestor preare sufficient
> > fallback method. They just want to success if it is easily successful.
> > They don't want to succeed allocation with paying great cost that slow
> > down general workload by this patch that can be accidentally reserve
> > too much memory.
> >
>
> Not necessary true. In the historical case, the network request was atomic
> because it was from IRQ context and could not sleep.
If some atomic high-order allocation requestors rely on the success of
atomic high-order allocations, they should be changed to reserve however
much they need themselves, not here in MM. MM can't do anything if the
allocation is requested in IRQ context. Reserving a lot of memory to
guarantee them doesn't make sense. And I don't see any recent claim
asking for such allocations to be made more reliable.
> > > A MIGRATE_HIGHORDER pageblock is created when an allocation request steals
> > > a pageblock but limits the total number to 10% of the zone.
> >
> > When steals happens, pageblock already can be fragmented and we can't
> > fully utilize this pageblock without allowing order-0 allocation. This
> > is very waste.
> >
>
> If the pageblock was stolen, it implies there was at least 1 usable page
> of the correct order. As the pageblock is then reserved, any pages that
> free in that block stay free for use by high-order atomic allocations.
> Else, the number of pageblocks will increase again until the 10% limit
> is hit.
It really depends on luck.
> > > The pageblocks are unreserved if an allocation fails after a direct
> > > reclaim attempt.
> > >
> > > The watermark checks account for the reserved pageblocks when the allocation
> > > request is not a high-order atomic allocation.
> > >
> > > The stutter benchmark was used to evaluate this but while it was running
> > > there was a systemtap script that randomly allocated between 1 and 1G worth
> > > of order-3 pages using GFP_ATOMIC. In kernel 4.2-rc1 running this workload
> > > on a single-node machine there were 339574 allocation failures. With this
> > > patch applied there were 28798 failures -- a 92% reduction. On a 4-node
> > > machine, allocation failures went from 76917 to 0 failures.
> >
> > There is some missing information to justify benchmark result.
> > Especially, I'd like to know:
> >
> > 1) Detailed system setup (CPU, MEMORY, etc...)
>
> CPUs were 8 core Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz with 8G of RAM.
>
> > 2) Total number of attempt of GFP_ATOMIC allocation request
> >
>
> Each attempt was between 1 and 1G randomly as described already.
So, the number of attempts was randomly chosen, but the number of
failures is reported as a fixed count. Please describe both at the same
level of statistics. Am I missing something?
> > I don't know how you modify stutter benchmark in mmtests but it
> > looks like there is no delay when continually requesting GFP_ATOMIC
> > allocation.
> > 1G of order-3 allocation request without delay seems insane
> > to me. Could you tell me how you modify that benchmark for this patch?
> >
>
> The stutter benchmark was not modified. The watch-stress-highorder-atomic
> monitor was run in parallel and that's what is doing the allocation. It's
> true that up to 1G of order-3 allocations without delay would be insane
> in a normal situation. The point was to show an extreme case where atomic
> allocations were used and to test whether the reserves held up or not.
You may change MAX_BURST in the stap script to a certain value so that
1G of successive attempts is possible. 1G of order-3 atomic allocations
without delay isn't a really helpful benchmark, because it doesn't
reflect any real world situation. Even an extreme case should reflect a
real world situation at some point. If the number of successive attempts
is brought back to a realistic value, do such large failures still
happen?
>
>
> > > There are minor theoritical side-effects. If the system is intensively
> > > making large numbers of long-lived high-order atomic allocations then
> > > there will be a lot of reserved pageblocks. This may push some workloads
> > > into reclaim until the number of reserved pageblocks is reduced again. This
> > > problem was not observed in reclaim intensive workloads but such workloads
> > > are also not atomic high-order intensive.
> >
> > I don't think this is theoritical side-effects. It can happen easily.
> > Recently, network subsystem makes some of their high-order allocation
> > request ~_GFP_WAIT (fb05e7a89f50: net: don't wait for order-3 page
> > allocation). And, I've submitted similar patch for slub today
> > (mm/slub: don't wait for high-order page allocation). That
> > makes system atomic high-order allocation request more and this side-effect
> > can be possible in many situation.
> >
>
> The key is long-lived allocations. The network subsystem frees theirs. I
> was not able to trigger a situation in a variety of workloads where these
> happened which is why I classified it as theoritical.
SLUB allocations would be long-lived.
Thanks.
On 07/29/2015 02:53 PM, Mel Gorman wrote:
> On Wed, Jul 29, 2015 at 01:35:17PM +0200, Vlastimil Babka wrote:
>> On 07/20/2015 10:00 AM, Mel Gorman wrote:
>>> +/*
>>> + * Used when an allocation is about to fail under memory pressure. This
>>> + * potentially hurts the reliability of high-order allocations when under
>>> + * intense memory pressure but failed atomic allocations should be easier
>>> + * to recover from than an OOM.
>>> + */
>>> +static void unreserve_highatomic_pageblock(const struct alloc_context *ac)
>>> +{
>>> + struct zonelist *zonelist = ac->zonelist;
>>> + unsigned long flags;
>>> + struct zoneref *z;
>>> + struct zone *zone;
>>> + struct page *page;
>>> + int order;
>>> +
>>> + for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->high_zoneidx,
>>> + ac->nodemask) {
>>
>> This fixed order might bias some zones over others wrt unreserving. Is it OK?
>
> I could not think of a situation where it mattered. It'll always be
> preferring highest zone over lower zones. Allocation requests that can
> use any zone that do not care. Allocation requests that are limited to
> lower zones are protected as long as possible.
Hmm... allocation requests will follow fair zone policy and thus the
highatomic reservations will be spread fairly among all zones? Unless
the allocations require lower zones of course.
But for unreservations, normal/high allocations failing under memory
pressure will lead to unreserving highatomic pageblocks first in the
higher zones and only then the lower zones, and that was my concern. But
it's true that failing allocations that require lower zones will lead to
unreserving the lower zones, so it might be ok in the end.
>
>>
>>> + /* Preserve at least one pageblock */
>>> + if (zone->nr_reserved_highatomic <= pageblock_nr_pages)
>>> + continue;
>>> +
>>> + spin_lock_irqsave(&zone->lock, flags);
>>> + for (order = 0; order < MAX_ORDER; order++) {
>>
>> Would it make more sense to look in descending order for a higher chance of
>> unreserving a pageblock that's mostly free? Like the traditional page stealing does?
>>
>
> I don't think it's worth the search cost. Traditional page stealing is
> searching because it's trying to minimise events that cause external
> fragmentation. Here we'd gain very little. We are under some memory
> pressure here, if enough pages are not free then another one will get
> freed shortly. Either way, I doubt the difference is measurable.
Hmm, I guess...
>
>>> + struct free_area *area = &(zone->free_area[order]);
>>> +
>>> + if (list_empty(&area->free_list[MIGRATE_HIGHATOMIC]))
>>> + continue;
>>> +
>>> + page = list_entry(area->free_list[MIGRATE_HIGHATOMIC].next,
>>> + struct page, lru);
>>> +
>>> + zone->nr_reserved_highatomic -= pageblock_nr_pages;
>>> + set_pageblock_migratetype(page, ac->migratetype);
>>
>> Would it make more sense to assume MIGRATE_UNMOVABLE, as high-order allocations
>> present in the pageblock typically would be, and apply the traditional page
>> stealing heuristics to decide if it should be changed to ac->migratetype (if
>> that differs)?
>>
>
> Superb spot, I had to think about this for a while and initially I was
> thinking your suggestion was a no-brainer and obviously the right thing
> to do.
>
> On the pro side, it preserves the fragmentation logic because it'll force
> the normal page stealing logic to be applied.
>
> On the con side, we may reassign the pageblock twice -- once to
> MIGRATE_UNMOVABLE and once to ac->migratetype. That one does not matter
> but the second con is that we inadvertly increase the number of unmovable
> blocks in some cases.
>
> Lets say we default to MIGRATE_UNMOVABLE, ac->migratetype is MIGRATE_MOVABLE
> and there are enough free pages to satisfy the allocation but not steal
> the whole pageblock. The end result is that we have a new unmovable
> pageblock that may not be necessary. The next unmovable allocation
> potentially is forever. They key observation is that previously the
> pageblock could have been short-lived high-order allocations that could
> be completely free soon if it was assigned MIGRATE_MOVABLE. This may not
> apply when SLUB is using high-order allocations but the point still
> holds.
Yeah, I see the point. The obvious counterexample is a pageblock that
we designate as MOVABLE and yet it contains some long-lived unmovable
allocation. More unmovable allocations could lead to choosing another
movable block as a fallback, while if we marked this pageblock as
unmovable, they could go here and not increase fragmentation.
The problem is, we can't know which one is the case. I've toyed with an
idea of MIGRATE_MIXED blocks that would be for cases where the
heuristics decide that e.g. an UNMOVABLE block is empty enough to change
it to MOVABLE, but still it may contain some unmovable allocations. Such
pageblocks should be preferred fallbacks for future unmovable
allocations before truly pristine movable pageblocks.
The scenario here could be another where MIGRATE_MIXED would make sense.
> Grouping pages by mobility really needs to strive to keep the number of
> unmovable blocks as low as possible.
More precisely, the number of blocks with unmovable allocations in them.
Which is a much harder objective.
> If ac->migratetype is
> MIGRATE_UNMOVABLE then we lose nothing. If it's any other type then the
> current code keeps the number of unmovable blocks as low as possible.
>
> On that basis I think the current code is fine but it needs a comment to
> record why it's like this.
>
>>> @@ -2175,15 +2257,23 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
>>> unsigned long mark, int classzone_idx, int alloc_flags,
>>> long free_pages)
>>> {
>>> - /* free_pages may go negative - that's OK */
>>> long min = mark;
>>> int o;
>>> long free_cma = 0;
>>>
>>> + /* free_pages may go negative - that's OK */
>>> free_pages -= (1 << order) - 1;
>>> +
>>> if (alloc_flags & ALLOC_HIGH)
>>> min -= min / 2;
>>> - if (alloc_flags & ALLOC_HARDER)
>>> +
>>> + /*
>>> + * If the caller is not atomic then discount the reserves. This will
>>> + * over-estimate how the atomic reserve but it avoids a search
>>> + */
>>> + if (likely(!(alloc_flags & ALLOC_HARDER)))
>>> + free_pages -= z->nr_reserved_highatomic;
>>
>> Hm, so in the case the maximum of 10% reserved blocks is already full, we deny
>> the allocation access to another 10% of the memory and push it to reclaim. This
>> seems rather excessive.
>
> It's necessary. If normal callers can use it then the reserve fills with
> normal pages, the memory gets fragmented and high-order atomic allocations
> fail due to fragmentation. Similarly, the number of MIGRATE_HIGHORDER
> pageblocks cannot be unbound or everything else will be continually pushed
> into reclaim even if there is plenty of memory free.
I understand denying normal allocations access to highatomic reserves
via watermarks is necessary. But my concern is that for each reserved
pageblock we effectively deny up to two pageblocks-worth-of-pages to
normal allocations. One pageblock that is marked as MIGRATE_HIGHATOMIC,
and once it becomes full, free_pages above are decreased twice - once by
the pageblock becoming full, and then again by subtracting
z->nr_reserved_highatomic. This extra gap is still usable by highatomic
allocations, but they will also potentially mark more pageblocks
highatomic and further increase the gap. In the worst case we have 10%
of pageblocks marked highatomic and full, and another 10% that is only
usable by highatomic allocations (but won't be marked as such), and if
no more highatomic allocations come then the 10% is wasted.
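To put rough numbers on that worst case (my arithmetic, using the 8G
single-node test machine mentioned earlier in the thread): about 800MB
of pageblocks could be marked MIGRATE_HIGHATOMIC and sit full, while the
watermark check still subtracts nr_reserved_highatomic (another ~800MB)
from free_pages for !atomic callers, so on the order of 1.6GB would
effectively be withheld from normal allocations.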
On Fri, Jul 31, 2015 at 09:25:13AM +0200, Vlastimil Babka wrote:
> On 07/31/2015 09:11 AM, Mel Gorman wrote:
> >On Fri, Jul 31, 2015 at 02:54:07PM +0900, Joonsoo Kim wrote:
> >>Hello, Mel.
> >>
> >>On Mon, Jul 20, 2015 at 09:00:18AM +0100, Mel Gorman wrote:
> >>>From: Mel Gorman <[email protected]>
> >>>
> >>>High-order watermark checking exists for two reasons -- kswapd high-order
> >>>awareness and protection for high-order atomic requests. Historically we
> >>>depended on MIGRATE_RESERVE to preserve min_free_kbytes as high-order free
> >>>pages for as long as possible. This patch introduces MIGRATE_HIGHATOMIC
> >>>that reserves pageblocks for high-order atomic allocations. This is expected
> >>>to be more reliable than MIGRATE_RESERVE was.
> >>
> >>I have some concerns on this patch.
> >>
> >>1) This patch breaks intention of __GFP_WAIT.
> >>__GFP_WAIT is used when we want to succeed allocation even if we need
> >>to do some reclaim/compaction work. That implies importance of
> >>allocation success. But, reserved pageblock for MIGRATE_HIGHATOMIC makes
> >>atomic allocation (~__GFP_WAIT) more successful than allocation with
> >>__GFP_WAIT in many situation. It breaks basic assumption of gfp flags
> >>and doesn't make any sense.
> >>
> >
> >Currently allocation requests that do not specify __GFP_WAIT get the
> >ALLOC_HARDER flag which allows them to dip further into watermark reserves.
> >It already is the case that there are corner cases where a high atomic
> >allocation can succeed when a non-atomic allocation would reclaim.
>
> I think (and said so before elsewhere) is that the problem is that
> we don't currently distinguish allocations that can't wait (=are
> really atomic and have no order-0 fallback) and allocations that
> just don't want to wait (=they have fallbacks). The second ones
> should obviously not access the current ALLOC_HARDER watermark-based
> reserves nor the proposed highatomic reserves.
Yes, I agree. If we distinguish such cases, I'm not sure that this kind
of reservation is needed. It would be better for atomic high-order
allocations in IRQ context to prepare their own fallback, such as
reserved memory or a low-order allocation.
>
> Well we do look at __GFP_NO_KSWAPD flag to treat allocation as
> non-atomic, so that covers THP allocations and two drivers. But the
> recent networking commit fb05e7a89f50 didn't add the flag and nor
> does Joonsoo's slub patch use it. Either we should rename the flag
> and employ it where appropriate, or agree that access to reserves is
> orthogonal concern to waking up kswapd, and distinguish non-atomic
> non-__GFP_WAIT allocations differently.
>
> >>>A MIGRATE_HIGHORDER pageblock is created when an allocation request steals
> >>>a pageblock but limits the total number to 10% of the zone.
> >>
> >>When steals happens, pageblock already can be fragmented and we can't
> >>fully utilize this pageblock without allowing order-0 allocation. This
> >>is very waste.
> >>
> >
> >If the pageblock was stolen, it implies there was at least 1 usable page
> >of the correct order. As the pageblock is then reserved, any pages that
> >free in that block stay free for use by high-order atomic allocations.
> >Else, the number of pageblocks will increase again until the 10% limit
> >is hit.
>
> It's however true that many of the "any pages free in that block"
> may be order-0, so they both won't be useful to high-order atomic
> allocations, and won't be available to other allocations, so they
> might remain unused.
Agreed.
Thanks.
On Fri, Jul 31, 2015 at 08:19:07AM +0100, Mel Gorman wrote:
> On Fri, Jul 31, 2015 at 03:08:38PM +0900, Joonsoo Kim wrote:
> > On Mon, Jul 20, 2015 at 09:00:19AM +0100, Mel Gorman wrote:
> > > From: Mel Gorman <[email protected]>
> > >
> > > The primary purpose of watermarks is to ensure that reclaim can always
> > > make forward progress in PF_MEMALLOC context (kswapd and direct reclaim).
> > > These assume that order-0 allocations are all that is necessary for
> > > forward progress.
> > >
> > > High-order watermarks serve a different purpose. Kswapd had no high-order
> > > awareness before they were introduced (https://lkml.org/lkml/2004/9/5/9).
> > > This was particularly important when there were high-order atomic requests.
> > > The watermarks both gave kswapd awareness and made a reserve for those
> > > atomic requests.
> > >
> > > There are two important side-effects of this. The most important is that
> > > a non-atomic high-order request can fail even though free pages are available
> > > and the order-0 watermarks are ok. The second is that high-order watermark
> > > checks are expensive as the free list counts up to the requested order must
> > > be examined.
> > >
> > > With the introduction of MIGRATE_HIGHATOMIC it is no longer necessary to
> > > have high-order watermarks. Kswapd and compaction still need high-order
> > > awareness which is handled by checking that at least one suitable high-order
> > > page is free.
> >
> > I totally agree removing watermark checking for order from
> > PAGE_ALLOC_COSTLY_ORDER to MAX_ORDER. It doesn't make sense to
> > maintain such high-order freepage that MM don't guarantee allocation
> > success. For example, in my system, when there is 1 order-9 freepage,
> > allocation request for order-9 fails because watermark check requires
> > at least 2 order-9 freepages in order to succeed order-9 allocation.
> >
> > But, I think watermark checking with order up to PAGE_ALLOC_COSTLY_ORDER is
> > different. If we maintain just 1 high-order freepages, successive
> > high-order allocation request that should be success always fall into
> > allocation slow-path and go into the direct reclaim/compaction. It enlarges
> > many workload's latency. We should prepare at least some number of freepage
> > to handle successive high-order allocation request gracefully.
> >
> > So, how about following?
> >
> > 1) kswapd checks watermark as is up to PAGE_ALLOC_COSTLY_ORDER. It
> > guarantees kswapd prepares some number of high-order freepages so
> > successive high-order allocation request will be handlded gracefully.
> > 2) In case of !kswapd, just check whether appropriate freepage is
> > in buddy or not.
> >
>
> If !atomic allocations use the high-order reserves then they'll fragment
> similarly to how they get fragmented today. It defeats the purpose of
> the reserve. I noted in the leader that embedded platforms may choose to
> carry an out-of-ftree patch that makes the reserves a kernel reserve for
> high-order pages but that I didn't think it was a good idea for mainline.
I assume that your previous patch isn't merged. !atomic allocations can
use the reserve that kswapd makes in normal pageblocks. That will
fragment similarly to how things fragment now, but it isn't an
unsolvable problem. If compaction is enhanced, we don't need to worry
about fragmentation, as I experienced on an embedded platform.
>
> Your suggestion implies we have two watermark checks. The fast path
> which obeys watermarks in the traditional way. kswapd would use the same
> watermark check. The slow path would use the watermark check in this
> path. It is quite complex when historically it was expected that a
> !atomic high-order allocation request may take a long time. Furthermore,
Why is it quite complex? The watermark check already applies different thresholds.
> it's the case that kswapd gives up high-order reclaim requests very
> quickly because there were cases where a high-order request would cause
> kswapd to continually reclaim when the system was fragmented. I fear
> that your suggestion would partially reintroduce the problem in the name
> of trying to decrease the latency of a !atomic high-order allocation
> request that is expected to be expensive sometimes.
!atomic high-order allocation requests are expected to be expensive
sometimes, but they don't want to be expensive. IMO, optimizing them is
MM's duty.
Thanks.
On Fri, Jul 31, 2015 at 05:26:41PM +0900, Joonsoo Kim wrote:
> On Fri, Jul 31, 2015 at 08:11:13AM +0100, Mel Gorman wrote:
> > On Fri, Jul 31, 2015 at 02:54:07PM +0900, Joonsoo Kim wrote:
> > > Hello, Mel.
> > >
> > > On Mon, Jul 20, 2015 at 09:00:18AM +0100, Mel Gorman wrote:
> > > > From: Mel Gorman <[email protected]>
> > > >
> > > > High-order watermark checking exists for two reasons -- kswapd high-order
> > > > awareness and protection for high-order atomic requests. Historically we
> > > > depended on MIGRATE_RESERVE to preserve min_free_kbytes as high-order free
> > > > pages for as long as possible. This patch introduces MIGRATE_HIGHATOMIC
> > > > that reserves pageblocks for high-order atomic allocations. This is expected
> > > > to be more reliable than MIGRATE_RESERVE was.
> > >
> > > I have some concerns on this patch.
> > >
> > > 1) This patch breaks intention of __GFP_WAIT.
> > > __GFP_WAIT is used when we want to succeed allocation even if we need
> > > to do some reclaim/compaction work. That implies importance of
> > > allocation success. But, reserved pageblock for MIGRATE_HIGHATOMIC makes
> > > atomic allocation (~__GFP_WAIT) more successful than allocation with
> > > __GFP_WAIT in many situation. It breaks basic assumption of gfp flags
> > > and doesn't make any sense.
> > >
> >
> > Currently allocation requests that do not specify __GFP_WAIT get the
> > ALLOC_HARDER flag which allows them to dip further into watermark reserves.
> > It already is the case that there are corner cases where a high atomic
> > allocation can succeed when a non-atomic allocation would reclaim.
>
> I know that. It's matter of magnitute. If your patch is applied,
> GFP_ATOMIC almost succeed and there is no merit to use GFP_WAIT.
Yes there is. If the reserves are too high then it will unnecessarily push
order-0 allocations into reclaim. The use for atomic should be just
that, atomic.
> If user can easily bypass big overhead from reclaim/compaction through
> GFP_ATOMIC allocation, they will decide to use GFP_ATOMIC flag rather than
> adding GFP_WAIT.
>
The overhead cannot be avoided, they simply hit failure instead. If the
degree of magnitude is a problem then I can drop the reserves from 10%
to 1% so it's closer to what MIGRATE_RESERVE does today.
> >
> > > 2) Who care about success of high-order atomic allocation with this
> > > reliability?
> >
> > Historically network configurations with large MTUs that could not scatter
> > gather. These days network will also attempt atomic order-3 allocations
> > to reduce overhead. SLUB also attempts atomic high-order allocations to
> > reduce overhead. It's why MIGRATE_RESERVE exists at all so the intent of
> > the patch is to preserve what MIGRATE_RESERVE was for but do it better.
>
> Normally, SLUB doesn't rely on success of high-order allocation. So,
> don't need to such reliability. It can fallback to low-order allocation.
> Moreover, we can get such benefit of high-order allocation by using
> kcompactd as suggested by Vlastimil soon.
>
Then dropping maximum reserves to 1%. Or replicate what MIGRATE_RESERVE
does and limit it to 2 pageblocks per zone.
> > > In case of allocation without __GFP_WAIT, requestor preare sufficient
> > > fallback method. They just want to success if it is easily successful.
> > > They don't want to succeed allocation with paying great cost that slow
> > > down general workload by this patch that can be accidentally reserve
> > > too much memory.
> > >
> >
> > Not necessary true. In the historical case, the network request was atomic
> > because it was from IRQ context and could not sleep.
>
> If some of atomic high-order allocation requestor rely on success of
> atomic high-order allocation, they should be changed as reserving how
> much they need. Not, here MM. MM can't do anything if allocation is
> requested in IRQ context. Reserving a lot of memory to guarantee
> them doesn't make sense. And, I don't see any recent claim to guarantee such
> allocation more reliable.
>
Ok, will limit to 2 pageblocks per zone on the next revision.
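A rough sketch of that limit, purely illustrative (the posted revision
may differ):

	/* Mirror MIGRATE_RESERVE: at most two HighAtomic pageblocks per zone */
	if (zone->nr_reserved_highatomic >= 2 * pageblock_nr_pages)
		return;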
> > > > A MIGRATE_HIGHORDER pageblock is created when an allocation request steals
> > > > a pageblock but limits the total number to 10% of the zone.
> > >
> > > When stealing happens, the pageblock can already be fragmented and we
> > > can't fully utilize this pageblock without allowing order-0 allocations.
> > > This is very wasteful.
> > >
> >
> > If the pageblock was stolen, it implies there was at least 1 usable page
> > of the correct order. As the pageblock is then reserved, any pages that
> > are freed in that block stay free for use by high-order atomic
> > allocations. Otherwise, the number of pageblocks will increase again
> > until the 10% limit is hit.
>
> It really depends on luck.
>
Success of high-order allocations *always* depended on the allocation/free
request stream. The series does not change that.
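For readers following along, the reservation path being described is
roughly of the following shape. This is a sketch rather than the exact hunk
from the patch: zone->nr_reserved_highatomic and MIGRATE_HIGHATOMIC are
taken from the hunks quoted later in the thread, while the control flow and
the 10% cap here are my reconstruction.

/*
 * Sketch only: after a high-order atomic allocation steals a pageblock,
 * mark it MIGRATE_HIGHATOMIC and account it against a per-zone cap.
 */
static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
{
	unsigned long max_managed, flags;
	int mt;

	/* the cap under discussion: 10% of the zone, maybe 1% or 2 blocks */
	max_managed = zone->managed_pages / 10;

	spin_lock_irqsave(&zone->lock, flags);
	if (zone->nr_reserved_highatomic >= max_managed)
		goto out;

	mt = get_pageblock_migratetype(page);
	if (mt != MIGRATE_HIGHATOMIC && !is_migrate_isolate(mt) &&
	    !is_migrate_cma(mt)) {
		zone->nr_reserved_highatomic += pageblock_nr_pages;
		set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
		/* keep the block's free pages on the highatomic free list */
		move_freepages_block(zone, page, MIGRATE_HIGHATOMIC);
	}
out:
	spin_unlock_irqrestore(&zone->lock, flags);
}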
> > > > The pageblocks are unreserved if an allocation fails after a direct
> > > > reclaim attempt.
> > > >
> > > > The watermark checks account for the reserved pageblocks when the allocation
> > > > request is not a high-order atomic allocation.
> > > >
> > > > The stutter benchmark was used to evaluate this but while it was running
> > > > there was a systemtap script that randomly allocated between 1 and 1G worth
> > > > of order-3 pages using GFP_ATOMIC. In kernel 4.2-rc1 running this workload
> > > > on a single-node machine there were 339574 allocation failures. With this
> > > > patch applied there were 28798 failures -- a 92% reduction. On a 4-node
> > > > machine, allocation failures went from 76917 to 0 failures.
> > >
> > > There is some missing information to justify the benchmark result.
> > > Especially, I'd like to know:
> > >
> > > 1) Detailed system setup (CPU, MEMORY, etc...)
> >
> > CPUs were 8 core Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz with 8G of RAM.
> >
> > > 2) Total number of attempts of GFP_ATOMIC allocation requests
> > >
> >
> > Each attempt allocated between 1 and 1G worth of pages, chosen randomly,
> > as described already.
>
> So the number of attempts was randomly chosen, but the number of failures
> is given as a fixed figure. Please describe the statistics at the same
> level of detail. Am I missing something?
>
I reported the number of failures relative to the number of successful
attempts. Reporting in greater detail would not help in any way because
it'd be for one specific case. The experiences for other workloads will
always be different. I'll put it another way -- what would you consider
to be a meaningful test? Obviously you have something in mind.
The intent of what I did was to create a workload that is known to cause
fragmentation and combine it with an unreasonable stream of atomic
high-order allocations to stress the worst-case. The average case is
unknowable because it depends on the workload and the requirements of
the hardware.
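For reference, the general shape of that load looks something like the
kernel-module sketch below. This is a hypothetical analogue, not the
mmtests watch-stress-highorder-atomic systemtap monitor itself; the module
name, the burst cap and freeing each page immediately are illustrative
assumptions.

/*
 * Illustrative sketch only: issue a random burst of order-3 GFP_ATOMIC
 * allocations, free each one immediately and count the failures.
 */
#include <linux/module.h>
#include <linux/gfp.h>
#include <linux/random.h>

/* 1G worth of order-3 (32K) allocations */
#define BURST_MAX	(1UL << 15)

static int __init atomic_burst_init(void)
{
	unsigned long attempts = 1 + (prandom_u32() % BURST_MAX);
	unsigned long i, failures = 0;

	for (i = 0; i < attempts; i++) {
		/* order-3 request, no sleeping and no direct reclaim */
		struct page *page = alloc_pages(GFP_ATOMIC, 3);

		if (!page) {
			failures++;
			continue;
		}
		__free_pages(page, 3);
	}

	pr_info("order-3 atomic burst: %lu attempts, %lu failures\n",
		attempts, failures);
	return 0;
}

static void __exit atomic_burst_exit(void)
{
}

module_init(atomic_burst_init);
module_exit(atomic_burst_exit);
MODULE_LICENSE("GPL");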
> > > I don't know how you modified the stutter benchmark in mmtests but it
> > > looks like there is no delay when continually requesting GFP_ATOMIC
> > > allocations.
> > > 1G of order-3 allocation requests without delay seems insane
> > > to me. Could you tell me how you modified that benchmark for this patch?
> > >
> >
> > The stutter benchmark was not modified. The watch-stress-highorder-atomic
> > monitor was run in parallel and that's what is doing the allocation. It's
> > true that up to 1G of order-3 allocations without delay would be insane
> > in a normal situation. The point was to show an extreme case where atomic
> > allocations were used and to test whether the reserves held up or not.
>
> You may change MAX_BURST in the stap script to a value such that 1G of
> successive attempts is possible. 1G of order-3 atomic allocations
> without delay isn't really a helpful benchmark, because it doesn't
> reflect any real world situation. Even an extreme case should
> reflect the real world at some point.
>
I did not claim it was a real world situation. It was the extreme case.
> If the number of successive attempts is brought back to a realistic value,
> do such large numbers of failures still happen?
>
It entirely depends on what you mean by realistic. The requirements of
embedded platforms are entirely different to those of a standard server.
This is why I evaluated a potential worst-case scenario -- massive storms
of atomic high-order allocations.
> > > > There are minor theoretical side-effects. If the system is intensively
> > > > making large numbers of long-lived high-order atomic allocations then
> > > > there will be a lot of reserved pageblocks. This may push some workloads
> > > > into reclaim until the number of reserved pageblocks is reduced again. This
> > > > problem was not observed in reclaim intensive workloads but such workloads
> > > > are also not atomic high-order intensive.
> > >
> > > I don't think these are theoretical side-effects. They can happen easily.
> > > Recently, the network subsystem made some of its high-order allocation
> > > requests ~__GFP_WAIT (fb05e7a89f50: net: don't wait for order-3 page
> > > allocation). And I've submitted a similar patch for slub today
> > > (mm/slub: don't wait for high-order page allocation). That
> > > gives the system more atomic high-order allocation requests and makes
> > > this side-effect possible in many situations.
> > >
> >
> > The key is long-lived allocations. The network subsystem frees theirs. I
> > was not able to trigger a situation in a variety of workloads where these
> > happened, which is why I classified it as theoretical.
>
> SLUB allocations would be long-lived.
>
I'll drop the reserves to 2 pageblocks per zone then.
--
Mel Gorman
SUSE Labs
On Fri, Jul 31, 2015 at 10:28:52AM +0200, Vlastimil Babka wrote:
> On 07/29/2015 02:53 PM, Mel Gorman wrote:
> >On Wed, Jul 29, 2015 at 01:35:17PM +0200, Vlastimil Babka wrote:
> >>On 07/20/2015 10:00 AM, Mel Gorman wrote:
> >>>+/*
> >>>+ * Used when an allocation is about to fail under memory pressure. This
> >>>+ * potentially hurts the reliability of high-order allocations when under
> >>>+ * intense memory pressure but failed atomic allocations should be easier
> >>>+ * to recover from than an OOM.
> >>>+ */
> >>>+static void unreserve_highatomic_pageblock(const struct alloc_context *ac)
> >>>+{
> >>>+ struct zonelist *zonelist = ac->zonelist;
> >>>+ unsigned long flags;
> >>>+ struct zoneref *z;
> >>>+ struct zone *zone;
> >>>+ struct page *page;
> >>>+ int order;
> >>>+
> >>>+ for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->high_zoneidx,
> >>>+ ac->nodemask) {
> >>
> >>This fixed order might bias some zones over others wrt unreserving. Is it OK?
> >
> >I could not think of a situation where it mattered. It'll always prefer
> >the highest zone over lower zones. Allocation requests that can use any
> >zone do not care. Allocation requests that are limited to lower zones are
> >protected as long as possible.
>
> Hmm... allocation requests will follow fair zone policy and thus the
> highatomic reservations will be spread fairly among all zones? Unless the
> allocations require lower zones of course.
>
Does that matter?
> But for unreservations, normal/high allocations failing under memory
> pressure will lead to unreserving highatomic pageblocks first in the higher
> zones and only then the lower zones, and that was my concern. But it's true
> that failing allocations that require lower zones will lead to unreserving
> the lower zones, so it might be ok in the end.
>
Again, I don't think it matters.
> >
> >>
> >>>+ /* Preserve at least one pageblock */
> >>>+ if (zone->nr_reserved_highatomic <= pageblock_nr_pages)
> >>>+ continue;
> >>>+
> >>>+ spin_lock_irqsave(&zone->lock, flags);
> >>>+ for (order = 0; order < MAX_ORDER; order++) {
> >>
> >>Would it make more sense to look in descending order for a higher chance of
> >>unreserving a pageblock that's mostly free? Like the traditional page stealing does?
> >>
> >
> >I don't think it's worth the search cost. Traditional page stealing
> >searches because it's trying to minimise events that cause external
> >fragmentation. Here we'd gain very little. We are under some memory
> >pressure here; if enough pages are not free then another one will get
> >freed shortly. Either way, I doubt the difference is measurable.
>
> Hmm, I guess...
>
> >
> >>>+ struct free_area *area = &(zone->free_area[order]);
> >>>+
> >>>+ if (list_empty(&area->free_list[MIGRATE_HIGHATOMIC]))
> >>>+ continue;
> >>>+
> >>>+ page = list_entry(area->free_list[MIGRATE_HIGHATOMIC].next,
> >>>+ struct page, lru);
> >>>+
> >>>+ zone->nr_reserved_highatomic -= pageblock_nr_pages;
> >>>+ set_pageblock_migratetype(page, ac->migratetype);
> >>
> >>Would it make more sense to assume MIGRATE_UNMOVABLE, as high-order allocations
> >>present in the pageblock typically would be, and apply the traditional page
> >>stealing heuristics to decide if it should be changed to ac->migratetype (if
> >>that differs)?
> >>
> >
> >Superb spot, I had to think about this for a while and initially I was
> >thinking your suggestion was a no-brainer and obviously the right thing
> >to do.
> >
> >On the pro side, it preserves the fragmentation logic because it'll force
> >the normal page stealing logic to be applied.
> >
> >On the con side, we may reassign the pageblock twice -- once to
> >MIGRATE_UNMOVABLE and once to ac->migratetype. That one does not matter
> >but the second con is that we inadvertently increase the number of
> >unmovable blocks in some cases.
> >
> >Let's say we default to MIGRATE_UNMOVABLE, ac->migratetype is
> >MIGRATE_MOVABLE and there are enough free pages to satisfy the allocation
> >but not steal the whole pageblock. The end result is that we have a new
> >unmovable pageblock that may not be necessary, and the next unmovable
> >allocation to land in it potentially keeps it unmovable forever. The key
> >observation is that previously the pageblock could have held short-lived
> >high-order allocations and could have been completely free soon if it was
> >assigned MIGRATE_MOVABLE. This may not apply when SLUB is using high-order
> >allocations but the point still holds.
>
> Yeah, I see the point. The obvious counterexample is a pageblock that we
> designate as MOVABLE and yet contains some long-lived unmovable allocation.
> More unmovable allocations could lead to choosing another movable block as a
> fallback, while if we marked this pageblock as unmovable, they could go here
> and not increase fragmentation.
>
> The problem is, we can't know which one is the case. I've toyed with an
> idea of MIGRATE_MIXED blocks that would be for cases where the heuristics
> decide that e.g. an UNMOVABLE block is empty enough to change it to MOVABLE,
> but still it may contain some unmovable allocations. Such pageblocks should
> be preferred fallbacks for future unmovable allocations before truly
> pristine movable pageblocks.
>
I tried that once upon a time. The number of blocks simply increased
over time and there was no sensible way to recover from it. I never
found a solution that worked for very long.
> >>>+
> >>>+ /*
> >>>+ * If the caller is not atomic then discount the reserves. This will
> >>>+ * over-estimate the size of the atomic reserve but it avoids a search
> >>>+ */
> >>>+ if (likely(!(alloc_flags & ALLOC_HARDER)))
> >>>+ free_pages -= z->nr_reserved_highatomic;
> >>
> >>Hm, so in the case the maximum of 10% reserved blocks is already full, we deny
> >>the allocation access to another 10% of the memory and push it to reclaim. This
> >>seems rather excessive.
> >
> >It's necessary. If normal callers can use it then the reserve fills with
> >normal pages, the memory gets fragmented and high-order atomic allocations
> >fail due to fragmentation. Similarly, the number of MIGRATE_HIGHATOMIC
> >pageblocks cannot be unbounded or everything else will be continually
> >pushed into reclaim even if there is plenty of memory free.
>
> I understand denying normal allocations access to highatomic reserves via
> watermarks is necessary. But my concern is that for each reserved pageblock
> we effectively deny up to two pageblocks-worth-of-pages to normal
> allocations. One pageblock that is marked as MIGRATE_HIGHATOMIC, and once it
> becomes full, free_pages above are decreased twice - once by the pageblock
> becoming full, and then again by subtracting z->nr_reserved_highatomic. This
> extra gap is still usable by highatomic allocations, but they will also
> potentially mark more pageblocks highatomic and further increase the gap. In
> the worst case we have 10% of pageblocks marked highatomic and full, and
> another 10% that is only usable by highatomic allocations (but won't be
> marked as such), and if no more highatomic allocations come then the 10% is
> wasted.
Similar to what I said to Joonsoo, I'll drop the size of the reserves in
the next revision. It brings things back in line with MIGRATE_RESERVE in
terms of the amount of memory we reserve while still removing the need
for high-order watermark checks. It means that the rate of atomic
high-order allocation failure will remain the same after the series as
before, but that should not matter.
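To make the accounting gap you describe concrete, here is a minimal
userspace sketch of the watermark arithmetic. It is a simplification and
not the kernel's __zone_watermark_ok(); the function names and numbers are
illustrative assumptions. It shows how a full highatomic pageblock counts
against normal allocations twice: once because its pages are no longer
free, and again because nr_reserved_highatomic is subtracted from
free_pages for callers without ALLOC_HARDER.

/*
 * Simplified sketch of the watermark accounting discussed above; the
 * names and numbers are illustrative, not taken from the kernel.
 */
#include <stdbool.h>
#include <stdio.h>

#define PAGEBLOCK_NR_PAGES	512L	/* order-9 pageblocks, 4K pages */

/* Highly simplified watermark check: true if the request may proceed */
static bool watermark_ok(long free_pages, long min_wmark,
			 long nr_reserved_highatomic, bool alloc_harder)
{
	if (alloc_harder)
		min_wmark -= min_wmark / 4;	/* atomic callers dig deeper */
	else
		free_pages -= nr_reserved_highatomic; /* reserve hidden from normal callers */

	return free_pages > min_wmark;
}

int main(void)
{
	long min_wmark = 4096;				/* arbitrary example */
	long reserved = 100 * PAGEBLOCK_NR_PAGES;	/* 100 highatomic blocks */

	/*
	 * Suppose the 100 reserved pageblocks have filled with atomic
	 * allocations, so their pages already dropped out of free_pages.
	 * A normal allocation then also has 'reserved' subtracted, so up
	 * to two pageblocks' worth per reserved block is denied to it.
	 */
	long free_pages = 40000;

	printf("normal caller, effective free %ld -> ok=%d\n",
	       free_pages - reserved,
	       watermark_ok(free_pages, min_wmark, reserved, false));
	printf("atomic caller, effective free %ld -> ok=%d\n",
	       free_pages,
	       watermark_ok(free_pages, min_wmark, reserved, true));
	return 0;
}

With those example numbers the normal caller is pushed to reclaim despite
40000 free pages while the atomic caller succeeds, which is the behaviour
that shrinking the reserves is meant to keep within reasonable bounds.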
--
Mel Gorman
SUSE Labs