2015-08-12 10:48:39

by Mel Gorman

Subject: [PATCH 00/10] Remove zonelist cache and high-order watermark checking v2

Changelog since V1
o Rebase to 4.2-rc5
o Distinguish between high priority callers and callers that avoid sleep
o Remove jump label related damage patches

Overall, the intent of this series is to remove the zonelist cache which
was introduced to avoid high overhead in the page allocator. Once this is
done, it is necessary to reduce the cost of watermark checks.

The zonelist cache has been around for a long time but it is of dubious
merit with a lot of complexity. Some issues are explained in the first
patch but the most important is that a failed THP allocation can cause a
zone to be treated as "full". This potentially causes unnecessary stalls,
reclaim activity or remote fallbacks. The issues could be fixed but it's
not worth it. The series places a small number of other micro-optimisations
on top before examining GFP flags and watermarks.

GFP flags specify the requirements of the caller. Historically, the
absence of __GFP_WAIT identified callers that could not sleep and could
access reserves. Clearing the flag was later abused by callers that simply
prefer to avoid sleeping and have other options. A patch is added to
distinguish between atomic callers, high-priority callers and those that
simply wish to avoid sleep.
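
As a rough sketch of the split (the real definitions are in patch 6; the
reserve semantics are as described above):

/*
 * __GFP_ATOMIC          caller cannot sleep, may use the atomic reserve
 * __GFP_HIGH            high priority, may use the high-priority reserve
 * __GFP_DIRECT_RECLAIM  caller may sleep and enter direct reclaim
 * __GFP_KSWAPD_RECLAIM  caller wants kswapd woken when memory runs low
 *
 * __GFP_WAIT keeps existing users working by meaning "both reclaim types"
 */
#define __GFP_WAIT	(__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM)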

High-order watermark enforcement can cause high-order allocations to fail
even though pages are free. The watermark checks both protect high-order
atomic allocations and make kswapd aware of high-order pages, but both
roles can be handled better using migrate types. This series uses page
grouping by mobility to reserve pageblocks for high-order allocations,
with the size of the reservation depending on demand. kswapd awareness
is maintained by examining the free lists. By patch 10 in this series,
there are no high-order watermark checks, while the properties that
motivated their introduction are preserved.
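
For illustration only, kswapd awareness without a high-order watermark can
be had by scanning the buddy free lists once the order-0 check passes; a
minimal sketch along those lines (helper name assumed, not the exact code
from patch 10):

static bool zone_has_free_order(struct zone *zone, unsigned int order)
{
	unsigned int o, mt;

	/* Any free page of the requested order or larger will do */
	for (o = order; o < MAX_ORDER; o++) {
		struct free_area *area = &zone->free_area[o];

		for (mt = 0; mt < MIGRATE_PCPTYPES; mt++)
			if (!list_empty(&area->free_list[mt]))
				return true;
	}
	return false;
}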

Documentation/vm/balance | 14 +-
arch/arm/mm/dma-mapping.c | 4 +-
arch/arm64/mm/dma-mapping.c | 4 +-
arch/x86/kernel/pci-dma.c | 2 +-
block/bio.c | 26 +-
block/blk-core.c | 16 +-
block/blk-ioc.c | 2 +-
block/blk-mq-tag.c | 2 +-
block/blk-mq.c | 8 +-
block/cfq-iosched.c | 4 +-
block/scsi_ioctl.c | 6 +-
drivers/block/drbd/drbd_bitmap.c | 2 +-
drivers/block/drbd/drbd_receiver.c | 2 +-
drivers/block/mtip32xx/mtip32xx.c | 2 +-
drivers/block/nvme-core.c | 4 +-
drivers/block/osdblk.c | 2 +-
drivers/block/paride/pd.c | 2 +-
drivers/block/pktcdvd.c | 4 +-
drivers/connector/connector.c | 3 +-
drivers/firewire/core-cdev.c | 2 +-
drivers/gpu/drm/i915/i915_gem.c | 4 +-
drivers/ide/ide-atapi.c | 2 +-
drivers/ide/ide-cd.c | 2 +-
drivers/ide/ide-cd_ioctl.c | 2 +-
drivers/ide/ide-devsets.c | 2 +-
drivers/ide/ide-disk.c | 2 +-
drivers/ide/ide-ioctls.c | 4 +-
drivers/ide/ide-park.c | 2 +-
drivers/ide/ide-pm.c | 4 +-
drivers/ide/ide-tape.c | 4 +-
drivers/ide/ide-taskfile.c | 4 +-
drivers/infiniband/core/sa_query.c | 2 +-
drivers/infiniband/hw/ipath/ipath_file_ops.c | 2 +-
drivers/infiniband/hw/qib/qib_init.c | 2 +-
drivers/iommu/amd_iommu.c | 2 +-
drivers/iommu/intel-iommu.c | 2 +-
drivers/md/dm-crypt.c | 6 +-
drivers/misc/vmw_balloon.c | 2 +-
drivers/mtd/mtdcore.c | 3 +-
drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 2 +-
drivers/scsi/scsi_error.c | 2 +-
drivers/scsi/scsi_lib.c | 4 +-
drivers/staging/android/ion/ion_system_heap.c | 2 +-
.../lustre/include/linux/libcfs/libcfs_private.h | 2 +-
drivers/usb/host/u132-hcd.c | 2 +-
fs/btrfs/disk-io.c | 2 +-
fs/btrfs/extent_io.c | 14 +-
fs/btrfs/volumes.c | 4 +-
fs/cachefiles/internal.h | 2 +-
fs/direct-io.c | 2 +-
fs/ext3/super.c | 2 +-
fs/ext4/super.c | 2 +-
fs/fscache/cookie.c | 2 +-
fs/fscache/page.c | 6 +-
fs/jbd/transaction.c | 4 +-
fs/jbd2/transaction.c | 4 +-
fs/nfs/file.c | 6 +-
fs/nilfs2/mdt.h | 2 +-
fs/xfs/xfs_qm.c | 2 +-
include/linux/cpuset.h | 6 +
include/linux/gfp.h | 68 ++-
include/linux/mmzone.h | 85 +--
include/linux/skbuff.h | 6 +-
include/net/sock.h | 2 +-
include/trace/events/gfpflags.h | 5 +-
kernel/audit.c | 6 +-
kernel/locking/lockdep.c | 2 +-
kernel/power/swap.c | 14 +-
kernel/smp.c | 2 +-
lib/idr.c | 4 +-
lib/percpu_ida.c | 2 +-
lib/radix-tree.c | 10 +-
mm/backing-dev.c | 2 +-
mm/dmapool.c | 2 +-
mm/failslab.c | 8 +-
mm/filemap.c | 2 +-
mm/huge_memory.c | 4 +-
mm/internal.h | 1 +
mm/memcontrol.c | 8 +-
mm/mempool.c | 10 +-
mm/migrate.c | 2 +-
mm/page_alloc.c | 569 +++++++--------------
mm/slab.c | 18 +-
mm/slub.c | 6 +-
mm/vmalloc.c | 2 +-
mm/vmscan.c | 6 +-
mm/vmstat.c | 2 +-
net/core/skbuff.c | 8 +-
net/core/sock.c | 6 +-
net/netlink/af_netlink.c | 2 +-
net/rxrpc/ar-connection.c | 2 +-
net/sctp/associola.c | 2 +-
security/integrity/ima/ima_crypto.c | 2 +-
93 files changed, 424 insertions(+), 684 deletions(-)

--
2.4.6


2015-08-12 10:48:37

by Mel Gorman

Subject: [PATCH 01/10] mm, page_alloc: Delete the zonelist_cache

The zonelist cache (zlc) was introduced to skip over zones that were
recently known to be full. This avoided expensive operations such as the
cpuset checks, watermark calculations and zone_reclaim. The situation
today is different and the complexity of zlc is harder to justify.

1) The cpuset checks are no-ops unless a cpuset is active and in general are
a lot cheaper.

2) zone_reclaim is now disabled by default and I suspect that was a large
source of the cost that zlc wanted to avoid. When it is enabled, it's
known to be a major source of stalling when nodes fill up and it's
unwise to hit every other user with the overhead.

3) Watermark checks are expensive to calculate for high-order
allocation requests. Later patches in this series will reduce the cost
of the watermark checking.

4) The most important issue is that in the current implementation it
is possible for a failed THP allocation to mark a zone full for order-0
allocations and cause a fallback to remote nodes.

The last issue could be addressed with additional complexity but as the
benefit of zlc is questionable, it is better to remove it. If stalls
due to zone_reclaim are ever reported then an alternative would be to
introduce deferring logic based on a timeout inside zone_reclaim itself
and leave the page allocator fast paths alone.
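
Purely to illustrate that alternative (nothing below exists; the field and
the arming logic are assumed):

/* Hypothetical: defer zone_reclaim() for a zone that recently failed */
static bool zone_reclaim_deferred(struct zone *zone)
{
	/*
	 * 'reclaim_defer_until' is an assumed field that zone_reclaim()
	 * would set to something like jiffies + HZ when it makes no
	 * progress, and check here before scanning the zone again.
	 */
	return time_before(jiffies, zone->reclaim_defer_until);
}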

The impact on page-allocator microbenchmarks is negligible as they don't
hit the paths where the zlc comes into play. The impact was noticeable
in a workload called "stutter". One part uses a lot of anonymous memory,
a second measures mmap latency and a third copies a large file. In an
ideal world the part measuring mmap latency would be unaffected by the
other two. On a 4-node machine the results of this patch are

4-node machine stutter
4.2.0-rc1 4.2.0-rc1
vanilla nozlc-v1r20
Min mmap 53.9902 ( 0.00%) 49.3629 ( 8.57%)
1st-qrtle mmap 54.6776 ( 0.00%) 54.1201 ( 1.02%)
2nd-qrtle mmap 54.9242 ( 0.00%) 54.5961 ( 0.60%)
3rd-qrtle mmap 55.1817 ( 0.00%) 54.9338 ( 0.45%)
Max-90% mmap 55.3952 ( 0.00%) 55.3929 ( 0.00%)
Max-93% mmap 55.4766 ( 0.00%) 57.5712 ( -3.78%)
Max-95% mmap 55.5522 ( 0.00%) 57.8376 ( -4.11%)
Max-99% mmap 55.7938 ( 0.00%) 63.6180 (-14.02%)
Max mmap 6344.0292 ( 0.00%) 67.2477 ( 98.94%)
Mean mmap 57.3732 ( 0.00%) 54.5680 ( 4.89%)

Note the maximum stall latency which was 6 seconds and becomes 67ms with
this patch applied. However, also note that it is not guaranteed this
benchmark always hits pathological cases and mileage varies. There is
a secondary impact with more direct reclaim because zones are now being
considered instead of being skipped by zlc.

4.1.0 4.1.0
vanilla nozlc-v1r4
Swap Ins 838 502
Swap Outs 1149395 2622895
DMA32 allocs 17839113 15863747
Normal allocs 129045707 137847920
Direct pages scanned 4070089 29046893
Kswapd pages scanned 17147837 17140694
Kswapd pages reclaimed 17146691 17139601
Direct pages reclaimed 1888879 4886630
Kswapd efficiency 99% 99%
Kswapd velocity 17523.721 17518.928
Direct efficiency 46% 16%
Direct velocity 4159.306 29687.854
Percentage direct scans 19% 62%
Page writes by reclaim 1149395.000 2622895.000
Page writes file 0 0
Page writes anon 1149395 2622895

The increase in direct page scan and reclaim rates is noticeable. It is possible
this will not be a universal win on all workloads but cycling through
zonelists waiting for zlc->last_full_zap to expire is not the right
decision.

Signed-off-by: Mel Gorman <[email protected]>
Acked-by: David Rientjes <[email protected]>
---
include/linux/mmzone.h | 71 ----------------
mm/page_alloc.c | 217 +------------------------------------------------
2 files changed, 2 insertions(+), 286 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 754c25966a0a..decc99a007f5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -585,75 +585,8 @@ static inline bool zone_is_empty(struct zone *zone)
* [1] : No fallback (__GFP_THISNODE)
*/
#define MAX_ZONELISTS 2
-
-
-/*
- * We cache key information from each zonelist for smaller cache
- * footprint when scanning for free pages in get_page_from_freelist().
- *
- * 1) The BITMAP fullzones tracks which zones in a zonelist have come
- * up short of free memory since the last time (last_fullzone_zap)
- * we zero'd fullzones.
- * 2) The array z_to_n[] maps each zone in the zonelist to its node
- * id, so that we can efficiently evaluate whether that node is
- * set in the current tasks mems_allowed.
- *
- * Both fullzones and z_to_n[] are one-to-one with the zonelist,
- * indexed by a zones offset in the zonelist zones[] array.
- *
- * The get_page_from_freelist() routine does two scans. During the
- * first scan, we skip zones whose corresponding bit in 'fullzones'
- * is set or whose corresponding node in current->mems_allowed (which
- * comes from cpusets) is not set. During the second scan, we bypass
- * this zonelist_cache, to ensure we look methodically at each zone.
- *
- * Once per second, we zero out (zap) fullzones, forcing us to
- * reconsider nodes that might have regained more free memory.
- * The field last_full_zap is the time we last zapped fullzones.
- *
- * This mechanism reduces the amount of time we waste repeatedly
- * reexaming zones for free memory when they just came up low on
- * memory momentarilly ago.
- *
- * The zonelist_cache struct members logically belong in struct
- * zonelist. However, the mempolicy zonelists constructed for
- * MPOL_BIND are intentionally variable length (and usually much
- * shorter). A general purpose mechanism for handling structs with
- * multiple variable length members is more mechanism than we want
- * here. We resort to some special case hackery instead.
- *
- * The MPOL_BIND zonelists don't need this zonelist_cache (in good
- * part because they are shorter), so we put the fixed length stuff
- * at the front of the zonelist struct, ending in a variable length
- * zones[], as is needed by MPOL_BIND.
- *
- * Then we put the optional zonelist cache on the end of the zonelist
- * struct. This optional stuff is found by a 'zlcache_ptr' pointer in
- * the fixed length portion at the front of the struct. This pointer
- * both enables us to find the zonelist cache, and in the case of
- * MPOL_BIND zonelists, (which will just set the zlcache_ptr to NULL)
- * to know that the zonelist cache is not there.
- *
- * The end result is that struct zonelists come in two flavors:
- * 1) The full, fixed length version, shown below, and
- * 2) The custom zonelists for MPOL_BIND.
- * The custom MPOL_BIND zonelists have a NULL zlcache_ptr and no zlcache.
- *
- * Even though there may be multiple CPU cores on a node modifying
- * fullzones or last_full_zap in the same zonelist_cache at the same
- * time, we don't lock it. This is just hint data - if it is wrong now
- * and then, the allocator will still function, perhaps a bit slower.
- */
-
-
-struct zonelist_cache {
- unsigned short z_to_n[MAX_ZONES_PER_ZONELIST]; /* zone->nid */
- DECLARE_BITMAP(fullzones, MAX_ZONES_PER_ZONELIST); /* zone full? */
- unsigned long last_full_zap; /* when last zap'd (jiffies) */
-};
#else
#define MAX_ZONELISTS 1
-struct zonelist_cache;
#endif

/*
@@ -683,11 +616,7 @@ struct zoneref {
* zonelist_node_idx() - Return the index of the node for an entry
*/
struct zonelist {
- struct zonelist_cache *zlcache_ptr; // NULL or &zlcache
struct zoneref _zonerefs[MAX_ZONES_PER_ZONELIST + 1];
-#ifdef CONFIG_NUMA
- struct zonelist_cache zlcache; // optional ...
-#endif
};

#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ef19f22b2b7d..41c0799b9049 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2250,122 +2250,6 @@ bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
}

#ifdef CONFIG_NUMA
-/*
- * zlc_setup - Setup for "zonelist cache". Uses cached zone data to
- * skip over zones that are not allowed by the cpuset, or that have
- * been recently (in last second) found to be nearly full. See further
- * comments in mmzone.h. Reduces cache footprint of zonelist scans
- * that have to skip over a lot of full or unallowed zones.
- *
- * If the zonelist cache is present in the passed zonelist, then
- * returns a pointer to the allowed node mask (either the current
- * tasks mems_allowed, or node_states[N_MEMORY].)
- *
- * If the zonelist cache is not available for this zonelist, does
- * nothing and returns NULL.
- *
- * If the fullzones BITMAP in the zonelist cache is stale (more than
- * a second since last zap'd) then we zap it out (clear its bits.)
- *
- * We hold off even calling zlc_setup, until after we've checked the
- * first zone in the zonelist, on the theory that most allocations will
- * be satisfied from that first zone, so best to examine that zone as
- * quickly as we can.
- */
-static nodemask_t *zlc_setup(struct zonelist *zonelist, int alloc_flags)
-{
- struct zonelist_cache *zlc; /* cached zonelist speedup info */
- nodemask_t *allowednodes; /* zonelist_cache approximation */
-
- zlc = zonelist->zlcache_ptr;
- if (!zlc)
- return NULL;
-
- if (time_after(jiffies, zlc->last_full_zap + HZ)) {
- bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
- zlc->last_full_zap = jiffies;
- }
-
- allowednodes = !in_interrupt() && (alloc_flags & ALLOC_CPUSET) ?
- &cpuset_current_mems_allowed :
- &node_states[N_MEMORY];
- return allowednodes;
-}
-
-/*
- * Given 'z' scanning a zonelist, run a couple of quick checks to see
- * if it is worth looking at further for free memory:
- * 1) Check that the zone isn't thought to be full (doesn't have its
- * bit set in the zonelist_cache fullzones BITMAP).
- * 2) Check that the zones node (obtained from the zonelist_cache
- * z_to_n[] mapping) is allowed in the passed in allowednodes mask.
- * Return true (non-zero) if zone is worth looking at further, or
- * else return false (zero) if it is not.
- *
- * This check -ignores- the distinction between various watermarks,
- * such as GFP_HIGH, GFP_ATOMIC, PF_MEMALLOC, ... If a zone is
- * found to be full for any variation of these watermarks, it will
- * be considered full for up to one second by all requests, unless
- * we are so low on memory on all allowed nodes that we are forced
- * into the second scan of the zonelist.
- *
- * In the second scan we ignore this zonelist cache and exactly
- * apply the watermarks to all zones, even it is slower to do so.
- * We are low on memory in the second scan, and should leave no stone
- * unturned looking for a free page.
- */
-static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zoneref *z,
- nodemask_t *allowednodes)
-{
- struct zonelist_cache *zlc; /* cached zonelist speedup info */
- int i; /* index of *z in zonelist zones */
- int n; /* node that zone *z is on */
-
- zlc = zonelist->zlcache_ptr;
- if (!zlc)
- return 1;
-
- i = z - zonelist->_zonerefs;
- n = zlc->z_to_n[i];
-
- /* This zone is worth trying if it is allowed but not full */
- return node_isset(n, *allowednodes) && !test_bit(i, zlc->fullzones);
-}
-
-/*
- * Given 'z' scanning a zonelist, set the corresponding bit in
- * zlc->fullzones, so that subsequent attempts to allocate a page
- * from that zone don't waste time re-examining it.
- */
-static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
-{
- struct zonelist_cache *zlc; /* cached zonelist speedup info */
- int i; /* index of *z in zonelist zones */
-
- zlc = zonelist->zlcache_ptr;
- if (!zlc)
- return;
-
- i = z - zonelist->_zonerefs;
-
- set_bit(i, zlc->fullzones);
-}
-
-/*
- * clear all zones full, called after direct reclaim makes progress so that
- * a zone that was recently full is not skipped over for up to a second
- */
-static void zlc_clear_zones_full(struct zonelist *zonelist)
-{
- struct zonelist_cache *zlc; /* cached zonelist speedup info */
-
- zlc = zonelist->zlcache_ptr;
- if (!zlc)
- return;
-
- bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
-}
-
static bool zone_local(struct zone *local_zone, struct zone *zone)
{
return local_zone->node == zone->node;
@@ -2376,28 +2260,7 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <
RECLAIM_DISTANCE;
}
-
#else /* CONFIG_NUMA */
-
-static nodemask_t *zlc_setup(struct zonelist *zonelist, int alloc_flags)
-{
- return NULL;
-}
-
-static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zoneref *z,
- nodemask_t *allowednodes)
-{
- return 1;
-}
-
-static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
-{
-}
-
-static void zlc_clear_zones_full(struct zonelist *zonelist)
-{
-}
-
static bool zone_local(struct zone *local_zone, struct zone *zone)
{
return true;
@@ -2407,7 +2270,6 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
{
return true;
}
-
#endif /* CONFIG_NUMA */

static void reset_alloc_batches(struct zone *preferred_zone)
@@ -2434,9 +2296,6 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
struct zoneref *z;
struct page *page = NULL;
struct zone *zone;
- nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
- int zlc_active = 0; /* set if using zonelist_cache */
- int did_zlc_setup = 0; /* just call zlc_setup() one time */
bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) &&
(gfp_mask & __GFP_WRITE);
int nr_fair_skipped = 0;
@@ -2453,9 +2312,6 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
ac->nodemask) {
unsigned long mark;

- if (IS_ENABLED(CONFIG_NUMA) && zlc_active &&
- !zlc_zone_worth_trying(zonelist, z, allowednodes))
- continue;
if (cpusets_enabled() &&
(alloc_flags & ALLOC_CPUSET) &&
!cpuset_zone_allowed(zone, gfp_mask))
@@ -2513,28 +2369,8 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
if (alloc_flags & ALLOC_NO_WATERMARKS)
goto try_this_zone;

- if (IS_ENABLED(CONFIG_NUMA) &&
- !did_zlc_setup && nr_online_nodes > 1) {
- /*
- * we do zlc_setup if there are multiple nodes
- * and before considering the first zone allowed
- * by the cpuset.
- */
- allowednodes = zlc_setup(zonelist, alloc_flags);
- zlc_active = 1;
- did_zlc_setup = 1;
- }
-
if (zone_reclaim_mode == 0 ||
!zone_allows_reclaim(ac->preferred_zone, zone))
- goto this_zone_full;
-
- /*
- * As we may have just activated ZLC, check if the first
- * eligible zone has failed zone_reclaim recently.
- */
- if (IS_ENABLED(CONFIG_NUMA) && zlc_active &&
- !zlc_zone_worth_trying(zonelist, z, allowednodes))
continue;

ret = zone_reclaim(zone, gfp_mask, order);
@@ -2551,19 +2387,6 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
ac->classzone_idx, alloc_flags))
goto try_this_zone;

- /*
- * Failed to reclaim enough to meet watermark.
- * Only mark the zone full if checking the min
- * watermark or if we failed to reclaim just
- * 1<<order pages or else the page allocator
- * fastpath will prematurely mark zones full
- * when the watermark is between the low and
- * min watermarks.
- */
- if (((alloc_flags & ALLOC_WMARK_MASK) == ALLOC_WMARK_MIN) ||
- ret == ZONE_RECLAIM_SOME)
- goto this_zone_full;
-
continue;
}
}
@@ -2576,9 +2399,6 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
goto try_this_zone;
return page;
}
-this_zone_full:
- if (IS_ENABLED(CONFIG_NUMA) && zlc_active)
- zlc_mark_zone_full(zonelist, z);
}

/*
@@ -2599,12 +2419,6 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
zonelist_rescan = true;
}

- if (unlikely(IS_ENABLED(CONFIG_NUMA) && zlc_active)) {
- /* Disable zlc cache for second zonelist scan */
- zlc_active = 0;
- zonelist_rescan = true;
- }
-
if (zonelist_rescan)
goto zonelist_scan;

@@ -2844,10 +2658,6 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
if (unlikely(!(*did_some_progress)))
return NULL;

- /* After successful reclaim, reconsider all zones for allocation */
- if (IS_ENABLED(CONFIG_NUMA))
- zlc_clear_zones_full(ac->zonelist);
-
retry:
page = get_page_from_freelist(gfp_mask, order,
alloc_flags & ~ALLOC_NO_WATERMARKS, ac);
@@ -3157,7 +2967,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
struct alloc_context ac = {
.high_zoneidx = gfp_zone(gfp_mask),
- .nodemask = nodemask,
+ .nodemask = nodemask ? : &cpuset_current_mems_allowed,
.migratetype = gfpflags_to_migratetype(gfp_mask),
};

@@ -3188,8 +2998,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
ac.zonelist = zonelist;
/* The preferred zone is used for statistics later */
preferred_zoneref = first_zones_zonelist(ac.zonelist, ac.high_zoneidx,
- ac.nodemask ? : &cpuset_current_mems_allowed,
- &ac.preferred_zone);
+ ac.nodemask, &ac.preferred_zone);
if (!ac.preferred_zone)
goto out;
ac.classzone_idx = zonelist_zone_idx(preferred_zoneref);
@@ -4169,20 +3978,6 @@ static void build_zonelists(pg_data_t *pgdat)
build_thisnode_zonelists(pgdat);
}

-/* Construct the zonelist performance cache - see further mmzone.h */
-static void build_zonelist_cache(pg_data_t *pgdat)
-{
- struct zonelist *zonelist;
- struct zonelist_cache *zlc;
- struct zoneref *z;
-
- zonelist = &pgdat->node_zonelists[0];
- zonelist->zlcache_ptr = zlc = &zonelist->zlcache;
- bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
- for (z = zonelist->_zonerefs; z->zone; z++)
- zlc->z_to_n[z - zonelist->_zonerefs] = zonelist_node_idx(z);
-}
-
#ifdef CONFIG_HAVE_MEMORYLESS_NODES
/*
* Return node id of node used for "local" allocations.
@@ -4243,12 +4038,6 @@ static void build_zonelists(pg_data_t *pgdat)
zonelist->_zonerefs[j].zone_idx = 0;
}

-/* non-NUMA variant of zonelist performance cache - just NULL zlcache_ptr */
-static void build_zonelist_cache(pg_data_t *pgdat)
-{
- pgdat->node_zonelists[0].zlcache_ptr = NULL;
-}
-
#endif /* CONFIG_NUMA */

/*
@@ -4289,14 +4078,12 @@ static int __build_all_zonelists(void *data)

if (self && !node_online(self->node_id)) {
build_zonelists(self);
- build_zonelist_cache(self);
}

for_each_online_node(nid) {
pg_data_t *pgdat = NODE_DATA(nid);

build_zonelists(pgdat);
- build_zonelist_cache(pgdat);
}

/*
--
2.4.6

2015-08-12 10:45:41

by Mel Gorman

Subject: [PATCH 02/10] mm, page_alloc: Remove unnecessary parameter from zone_watermark_ok_safe

No user of zone_watermark_ok_safe() specifies alloc_flags. This patch
removes the unnecessary parameter.

Signed-off-by: Mel Gorman <[email protected]>
Acked-by: David Rientjes <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
include/linux/mmzone.h | 2 +-
mm/page_alloc.c | 5 +++--
mm/vmscan.c | 4 ++--
3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index decc99a007f5..8b86ec5df968 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -731,7 +731,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx);
bool zone_watermark_ok(struct zone *z, unsigned int order,
unsigned long mark, int classzone_idx, int alloc_flags);
bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
- unsigned long mark, int classzone_idx, int alloc_flags);
+ unsigned long mark, int classzone_idx);
enum memmap_context {
MEMMAP_EARLY,
MEMMAP_HOTPLUG,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 41c0799b9049..5e1f6f4370bc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2209,6 +2209,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
min -= min / 2;
if (alloc_flags & ALLOC_HARDER)
min -= min / 4;
+
#ifdef CONFIG_CMA
/* If allocation can't use CMA areas don't use free CMA pages */
if (!(alloc_flags & ALLOC_CMA))
@@ -2238,14 +2239,14 @@ bool zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
}

bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
- unsigned long mark, int classzone_idx, int alloc_flags)
+ unsigned long mark, int classzone_idx)
{
long free_pages = zone_page_state(z, NR_FREE_PAGES);

if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);

- return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
+ return __zone_watermark_ok(z, order, mark, classzone_idx, 0,
free_pages);
}

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e61445dce04e..f1d8eae285f2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2454,7 +2454,7 @@ static inline bool compaction_ready(struct zone *zone, int order)
balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP(
zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO));
watermark = high_wmark_pages(zone) + balance_gap + (2UL << order);
- watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, 0, 0);
+ watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, 0);

/*
* If compaction is deferred, reclaim up to a point where
@@ -2937,7 +2937,7 @@ static bool zone_balanced(struct zone *zone, int order,
unsigned long balance_gap, int classzone_idx)
{
if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone) +
- balance_gap, classzone_idx, 0))
+ balance_gap, classzone_idx))
return false;

if (IS_ENABLED(CONFIG_COMPACTION) && order && compaction_suitable(zone,
--
2.4.6

2015-08-12 10:45:42

by Mel Gorman

Subject: [PATCH 03/10] mm, page_alloc: Remove unnecessary recalculations for dirty zone balancing

File-backed pages that will be immediately dirtied are balanced between
zones, but the check is unnecessarily expensive. Move the consider_zone_dirty
check into the alloc_context instead of recalculating it for every zone
considered. The patch also gives the field a more meaningful name,
spread_dirty_pages.

Signed-off-by: Mel Gorman <[email protected]>
Acked-by: David Rientjes <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
mm/internal.h | 1 +
mm/page_alloc.c | 11 +++++++----
2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 36b23f1e2ca6..9331f802a067 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -129,6 +129,7 @@ struct alloc_context {
int classzone_idx;
int migratetype;
enum zone_type high_zoneidx;
+ bool spread_dirty_pages;
};

/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5e1f6f4370bc..94f2f6bdd6d5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2297,8 +2297,6 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
struct zoneref *z;
struct page *page = NULL;
struct zone *zone;
- bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) &&
- (gfp_mask & __GFP_WRITE);
int nr_fair_skipped = 0;
bool zonelist_rescan;

@@ -2350,14 +2348,14 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
*
* XXX: For now, allow allocations to potentially
* exceed the per-zone dirty limit in the slowpath
- * (ALLOC_WMARK_LOW unset) before going into reclaim,
+ * (spread_dirty_pages unset) before going into reclaim,
* which is important when on a NUMA setup the allowed
* zones are together not big enough to reach the
* global limit. The proper fix for these situations
* will require awareness of zones in the
* dirty-throttling and the flusher threads.
*/
- if (consider_zone_dirty && !zone_dirty_ok(zone))
+ if (ac->spread_dirty_pages && !zone_dirty_ok(zone))
continue;

mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
@@ -2997,6 +2995,10 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,

/* We set it here, as __alloc_pages_slowpath might have changed it */
ac.zonelist = zonelist;
+
+ /* Dirty zone balancing only done in the fast path */
+ ac.spread_dirty_pages = (gfp_mask & __GFP_WRITE);
+
/* The preferred zone is used for statistics later */
preferred_zoneref = first_zones_zonelist(ac.zonelist, ac.high_zoneidx,
ac.nodemask, &ac.preferred_zone);
@@ -3014,6 +3016,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
* complete.
*/
alloc_mask = memalloc_noio_flags(gfp_mask);
+ ac.spread_dirty_pages = false;

page = __alloc_pages_slowpath(alloc_mask, order, &ac);
}
--
2.4.6

2015-08-12 10:48:36

by Mel Gorman

Subject: [PATCH 04/10] mm, page_alloc: Remove unnecessary taking of a seqlock when cpusets are disabled

There is a seqcounter that protects against spurious allocation failures
when a task is changing the allowed nodes in a cpuset. There is no need
to check the seqcounter until a cpuset exists.
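
For reference, these helpers guard the allocator's cpuset retry loop,
which condenses to roughly:

	unsigned int cpuset_mems_cookie;
	struct page *page;

	do {
		cpuset_mems_cookie = read_mems_allowed_begin();
		page = get_page_from_freelist(gfp_mask, order,
					      alloc_flags, &ac);
		/* retry only if mems_allowed changed during the scan */
	} while (!page && read_mems_allowed_retry(cpuset_mems_cookie));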

Signed-off-by: Mel Gorman <[email protected]>
Acked-by: David Rientjes <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
include/linux/cpuset.h | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 1b357997cac5..6eb27cb480b7 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -104,6 +104,9 @@ extern void cpuset_print_task_mems_allowed(struct task_struct *p);
*/
static inline unsigned int read_mems_allowed_begin(void)
{
+ if (!cpusets_enabled())
+ return 0;
+
return read_seqcount_begin(&current->mems_allowed_seq);
}

@@ -115,6 +118,9 @@ static inline unsigned int read_mems_allowed_begin(void)
*/
static inline bool read_mems_allowed_retry(unsigned int seq)
{
+ if (!cpusets_enabled())
+ return false;
+
return read_seqcount_retry(&current->mems_allowed_seq, seq);
}

--
2.4.6

2015-08-12 10:48:33

by Mel Gorman

Subject: [PATCH 05/10] mm, page_alloc: Use masks and shifts when converting GFP flags to migrate types

This patch redefines which GFP bits are used for specifying mobility and
the order of the migrate types. Once redefined, it is possible to convert
GFP flags to a migrate type with a simple mask and shift. The only downside
is that readers of OOM kill messages and allocation failures may be used
to the existing values, but scripts/gfp-translate will help.
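
The conversion then reduces to a mask and a shift; a worked example with
the values defined in this patch:

/* ___GFP_MOVABLE = 0x08 (bit 3), ___GFP_RECLAIMABLE = 0x10 (bit 4),
 * so GFP_MOVABLE_MASK = 0x18 and GFP_MOVABLE_SHIFT = 3. */
return (gfp_flags & GFP_MOVABLE_MASK) >> GFP_MOVABLE_SHIFT;
/*
 * (flags & 0x18) >> 3 yields:
 *   0 -> MIGRATE_UNMOVABLE    (neither bit set)
 *   1 -> MIGRATE_MOVABLE      (__GFP_MOVABLE)
 *   2 -> MIGRATE_RECLAIMABLE  (__GFP_RECLAIMABLE)
 *   3 -> both bits set, caught by the VM_WARN_ON in
 *        gfpflags_to_migratetype()
 */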

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/gfp.h | 12 +++++++-----
include/linux/mmzone.h | 2 +-
2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index ad35f300b9a4..43246850a85f 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -14,7 +14,7 @@ struct vm_area_struct;
#define ___GFP_HIGHMEM 0x02u
#define ___GFP_DMA32 0x04u
#define ___GFP_MOVABLE 0x08u
-#define ___GFP_WAIT 0x10u
+#define ___GFP_RECLAIMABLE 0x10u
#define ___GFP_HIGH 0x20u
#define ___GFP_IO 0x40u
#define ___GFP_FS 0x80u
@@ -29,7 +29,7 @@ struct vm_area_struct;
#define ___GFP_NOMEMALLOC 0x10000u
#define ___GFP_HARDWALL 0x20000u
#define ___GFP_THISNODE 0x40000u
-#define ___GFP_RECLAIMABLE 0x80000u
+#define ___GFP_WAIT 0x80000u
#define ___GFP_NOACCOUNT 0x100000u
#define ___GFP_NOTRACK 0x200000u
#define ___GFP_NO_KSWAPD 0x400000u
@@ -123,6 +123,7 @@ struct vm_area_struct;

/* This mask makes up all the page movable related flags */
#define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
+#define GFP_MOVABLE_SHIFT 3

/* Control page allocator reclaim behavior */
#define GFP_RECLAIM_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS|\
@@ -149,14 +150,15 @@ struct vm_area_struct;
/* Convert GFP flags to their corresponding migrate type */
static inline int gfpflags_to_migratetype(const gfp_t gfp_flags)
{
- WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
+ VM_WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
+ BUILD_BUG_ON(1UL << GFP_MOVABLE_SHIFT != ___GFP_MOVABLE);
+ BUILD_BUG_ON(___GFP_MOVABLE >> GFP_MOVABLE_SHIFT != MIGRATE_MOVABLE);

if (unlikely(page_group_by_mobility_disabled))
return MIGRATE_UNMOVABLE;

/* Group based on mobility */
- return (((gfp_flags & __GFP_MOVABLE) != 0) << 1) |
- ((gfp_flags & __GFP_RECLAIMABLE) != 0);
+ return (gfp_flags & GFP_MOVABLE_MASK) >> GFP_MOVABLE_SHIFT;
}

#ifdef CONFIG_HIGHMEM
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8b86ec5df968..79a0d033a2f3 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -37,8 +37,8 @@

enum {
MIGRATE_UNMOVABLE,
- MIGRATE_RECLAIMABLE,
MIGRATE_MOVABLE,
+ MIGRATE_RECLAIMABLE,
MIGRATE_PCPTYPES, /* the number of types on the pcp lists */
MIGRATE_RESERVE = MIGRATE_PCPTYPES,
#ifdef CONFIG_CMA
--
2.4.6

2015-08-12 10:46:41

by Mel Gorman

Subject: [PATCH 06/10] mm: page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd

The absence of __GFP_WAIT has been used to identify atomic context in
callers that hold spinlocks or are in interrupts. They are expected to be
high priority and to have access to one of two watermarks lower than "min".
__GFP_HIGH users get access to the first lower watermark, which can be
called the "high priority reserve". Atomic users and interrupts access yet
another lower watermark that can be called the "atomic reserve".

Over time, callers had a requirement to not block when fallback options
were available. Some have abused clearing __GFP_WAIT, leading to a
situation where an optimistic allocation with a fallback option can access
atomic reserves.

This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
cannot sleep and have no alternative. High priority users continue to use
__GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are
willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies callers
that want to wake kswapd for background reclaim. __GFP_WAIT is redefined
to mean that the caller is willing both to enter direct reclaim and to
wake kswapd for background reclaim.

This patch then converts a number of call sites:

o __GFP_ATOMIC is used by callers that are high priority and have memory
pools for those requests. GFP_ATOMIC uses this flag. Callers with
interrupts disabled still automatically use the atomic reserves.

o Callers that have a limited mempool to guarantee forward progress use
__GFP_DIRECT_RECLAIM. bio allocations fall into this category where
kswapd will still be woken but atomic reserves are not used as there
is a one-entry mempool to guarantee progress.

o Callers that are checking if they are non-blocking should use the
helper gfpflags_allows_blocking() where possible. This is because
checking for __GFP_WAIT as was done historically can now trigger false
positives. Some exceptions like dm-crypt.c exist where the code intent
is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
flag manipulations.

The key hazard to watch out for is callers that removed __GFP_WAIT and
were depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.
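
The blocking check used throughout the conversions is trivial; a sketch of
the helper matching the call sites below:

static inline bool gfpflags_allows_blocking(const gfp_t gfp_flags)
{
	return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
}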

Signed-off-by: Mel Gorman <[email protected]>
---
Documentation/vm/balance | 14 ++++----
arch/arm/mm/dma-mapping.c | 4 +--
arch/arm64/mm/dma-mapping.c | 4 +--
arch/x86/kernel/pci-dma.c | 2 +-
block/bio.c | 26 +++++++--------
block/blk-core.c | 16 ++++-----
block/blk-ioc.c | 2 +-
block/blk-mq-tag.c | 2 +-
block/blk-mq.c | 8 ++---
block/cfq-iosched.c | 4 +--
drivers/block/osdblk.c | 2 +-
drivers/connector/connector.c | 3 +-
drivers/firewire/core-cdev.c | 2 +-
drivers/gpu/drm/i915/i915_gem.c | 2 +-
drivers/infiniband/core/sa_query.c | 2 +-
drivers/iommu/amd_iommu.c | 2 +-
drivers/iommu/intel-iommu.c | 2 +-
drivers/md/dm-crypt.c | 6 ++--
drivers/mtd/mtdcore.c | 3 +-
drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 2 +-
drivers/staging/android/ion/ion_system_heap.c | 2 +-
drivers/usb/host/u132-hcd.c | 2 +-
fs/btrfs/disk-io.c | 2 +-
fs/btrfs/extent_io.c | 8 ++---
fs/btrfs/volumes.c | 4 +--
fs/ext3/super.c | 2 +-
fs/ext4/super.c | 2 +-
fs/fscache/cookie.c | 2 +-
fs/fscache/page.c | 6 ++--
fs/jbd/transaction.c | 4 +--
fs/jbd2/transaction.c | 4 +--
fs/nfs/file.c | 6 ++--
fs/xfs/xfs_qm.c | 2 +-
include/linux/gfp.h | 44 ++++++++++++++++++-------
include/linux/skbuff.h | 6 ++--
include/net/sock.h | 2 +-
include/trace/events/gfpflags.h | 5 +--
kernel/audit.c | 6 ++--
kernel/locking/lockdep.c | 2 +-
kernel/smp.c | 2 +-
lib/idr.c | 4 +--
lib/radix-tree.c | 10 +++---
mm/backing-dev.c | 2 +-
mm/dmapool.c | 2 +-
mm/memcontrol.c | 8 ++---
mm/mempool.c | 10 +++---
mm/page_alloc.c | 29 +++++++++-------
mm/slab.c | 18 +++++-----
mm/slub.c | 6 ++--
mm/vmalloc.c | 2 +-
mm/vmscan.c | 2 +-
net/core/skbuff.c | 8 ++---
net/core/sock.c | 6 ++--
net/sctp/associola.c | 2 +-
54 files changed, 181 insertions(+), 149 deletions(-)

diff --git a/Documentation/vm/balance b/Documentation/vm/balance
index c46e68cf9344..6f1f6fae30f5 100644
--- a/Documentation/vm/balance
+++ b/Documentation/vm/balance
@@ -1,12 +1,14 @@
Started Jan 2000 by Kanoj Sarcar <[email protected]>

-Memory balancing is needed for non __GFP_WAIT as well as for non
-__GFP_IO allocations.
+Memory balancing is needed for !__GFP_ATOMIC and !__GFP_KSWAPD_RECLAIM as
+well as for non __GFP_IO allocations.

-There are two reasons to be requesting non __GFP_WAIT allocations:
-the caller can not sleep (typically intr context), or does not want
-to incur cost overheads of page stealing and possible swap io for
-whatever reasons.
+The first reason why a caller may avoid reclaim is that the caller can not
+sleep due to holding a spinlock or is in interrupt context. The second may
+be that the caller is willing to fail the allocation without incurring the
+overhead of page stealing. This may happen for opportunistic high-order
+allocation requests that have order-0 fallback options. In such cases,
+the caller may also wish to avoid waking kswapd.

__GFP_IO allocation requests are made to prevent file system deadlocks.

diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index cba12f34ff77..100d3fbaebae 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -650,7 +650,7 @@ static void *__dma_alloc(struct device *dev, size_t size, dma_addr_t *handle,

if (is_coherent || nommu())
addr = __alloc_simple_buffer(dev, size, gfp, &page);
- else if (!(gfp & __GFP_WAIT))
+ else if (gfp & __GFP_ATOMIC)
addr = __alloc_from_pool(size, &page);
else if (!dev_get_cma_area(dev))
addr = __alloc_remap_buffer(dev, size, gfp, prot, &page, caller, want_vaddr);
@@ -1369,7 +1369,7 @@ static void *arm_iommu_alloc_attrs(struct device *dev, size_t size,
*handle = DMA_ERROR_CODE;
size = PAGE_ALIGN(size);

- if (!(gfp & __GFP_WAIT))
+ if (gfp & __GFP_ATOMIC)
return __iommu_alloc_atomic(dev, size, handle);

/*
diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index d16a1cead23f..713d963fb96b 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -100,7 +100,7 @@ static void *__dma_alloc_coherent(struct device *dev, size_t size,
if (IS_ENABLED(CONFIG_ZONE_DMA) &&
dev->coherent_dma_mask <= DMA_BIT_MASK(32))
flags |= GFP_DMA;
- if (IS_ENABLED(CONFIG_DMA_CMA) && (flags & __GFP_WAIT)) {
+ if (IS_ENABLED(CONFIG_DMA_CMA) && (flags & __GFP_DIRECT_RECLAIM)) {
struct page *page;
void *addr;

@@ -147,7 +147,7 @@ static void *__dma_alloc(struct device *dev, size_t size,

size = PAGE_ALIGN(size);

- if (!coherent && !(flags & __GFP_WAIT)) {
+ if (!coherent && (flags & __GFP_ATOMIC)) {
struct page *page = NULL;
void *addr = __alloc_from_pool(size, &page, flags);

diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index 353972c1946c..9a13ebb0f621 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -101,7 +101,7 @@ void *dma_generic_alloc_coherent(struct device *dev, size_t size,
again:
page = NULL;
/* CMA can be used only in the context which permits sleeping */
- if (flag & __GFP_WAIT) {
+ if (flag & __GFP_DIRECT_RECLAIM) {
page = dma_alloc_from_contiguous(dev, count, get_order(size));
if (page && page_to_phys(page) + size > dma_mask) {
dma_release_from_contiguous(dev, page, count);
diff --git a/block/bio.c b/block/bio.c
index d6e5ba3399f0..fbc558b50e67 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -211,7 +211,7 @@ struct bio_vec *bvec_alloc(gfp_t gfp_mask, int nr, unsigned long *idx,
bvl = mempool_alloc(pool, gfp_mask);
} else {
struct biovec_slab *bvs = bvec_slabs + *idx;
- gfp_t __gfp_mask = gfp_mask & ~(__GFP_WAIT | __GFP_IO);
+ gfp_t __gfp_mask = gfp_mask & ~(__GFP_DIRECT_RECLAIM | __GFP_IO);

/*
* Make this allocation restricted and don't dump info on
@@ -221,11 +221,11 @@ struct bio_vec *bvec_alloc(gfp_t gfp_mask, int nr, unsigned long *idx,
__gfp_mask |= __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN;

/*
- * Try a slab allocation. If this fails and __GFP_WAIT
+ * Try a slab allocation. If this fails and __GFP_DIRECT_RECLAIM
* is set, retry with the 1-entry mempool
*/
bvl = kmem_cache_alloc(bvs->slab, __gfp_mask);
- if (unlikely(!bvl && (gfp_mask & __GFP_WAIT))) {
+ if (unlikely(!bvl && (gfp_mask & __GFP_DIRECT_RECLAIM))) {
*idx = BIOVEC_MAX_IDX;
goto fallback;
}
@@ -393,12 +393,12 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
* If @bs is NULL, uses kmalloc() to allocate the bio; else the allocation is
* backed by the @bs's mempool.
*
- * When @bs is not NULL, if %__GFP_WAIT is set then bio_alloc will always be
- * able to allocate a bio. This is due to the mempool guarantees. To make this
- * work, callers must never allocate more than 1 bio at a time from this pool.
- * Callers that need to allocate more than 1 bio must always submit the
- * previously allocated bio for IO before attempting to allocate a new one.
- * Failure to do so can cause deadlocks under memory pressure.
+ * When @bs is not NULL, if %__GFP_DIRECT_RECLAIM is set then bio_alloc will
+ * always be able to allocate a bio. This is due to the mempool guarantees.
+ * To make this work, callers must never allocate more than 1 bio at a time
+ * from this pool. Callers that need to allocate more than 1 bio must always
+ * submit the previously allocated bio for IO before attempting to allocate
+ * a new one. Failure to do so can cause deadlocks under memory pressure.
*
* Note that when running under generic_make_request() (i.e. any block
* driver), bios are not submitted until after you return - see the code in
@@ -457,13 +457,13 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
* We solve this, and guarantee forward progress, with a rescuer
* workqueue per bio_set. If we go to allocate and there are
* bios on current->bio_list, we first try the allocation
- * without __GFP_WAIT; if that fails, we punt those bios we
- * would be blocking to the rescuer workqueue before we retry
- * with the original gfp_flags.
+ * without __GFP_DIRECT_RECLAIM; if that fails, we punt those
+ * bios we would be blocking to the rescuer workqueue before
+ * we retry with the original gfp_flags.
*/

if (current->bio_list && !bio_list_empty(current->bio_list))
- gfp_mask &= ~__GFP_WAIT;
+ gfp_mask &= ~__GFP_DIRECT_RECLAIM;

p = mempool_alloc(bs->bio_pool, gfp_mask);
if (!p && gfp_mask != saved_gfp) {
diff --git a/block/blk-core.c b/block/blk-core.c
index 627ed0c593fb..c53c2513d472 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1156,8 +1156,8 @@ static struct request *__get_request(struct request_list *rl, int rw_flags,
* @bio: bio to allocate request for (can be %NULL)
* @gfp_mask: allocation mask
*
- * Get a free request from @q. If %__GFP_WAIT is set in @gfp_mask, this
- * function keeps retrying under memory pressure and fails iff @q is dead.
+ * Get a free request from @q. If %__GFP_DIRECT_RECLAIM is set in @gfp_mask,
+ * this function keeps retrying under memory pressure and fails iff @q is dead.
*
* Must be called with @q->queue_lock held and,
* Returns ERR_PTR on failure, with @q->queue_lock held.
@@ -1177,7 +1177,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
if (!IS_ERR(rq))
return rq;

- if (!(gfp_mask & __GFP_WAIT) || unlikely(blk_queue_dying(q))) {
+ if (!gfpflags_allows_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) {
blk_put_rl(rl);
return rq;
}
@@ -1255,11 +1255,11 @@ EXPORT_SYMBOL(blk_get_request);
* BUG.
*
* WARNING: When allocating/cloning a bio-chain, careful consideration should be
- * given to how you allocate bios. In particular, you cannot use __GFP_WAIT for
- * anything but the first bio in the chain. Otherwise you risk waiting for IO
- * completion of a bio that hasn't been submitted yet, thus resulting in a
- * deadlock. Alternatively bios should be allocated using bio_kmalloc() instead
- * of bio_alloc(), as that avoids the mempool deadlock.
+ * given to how you allocate bios. In particular, you cannot use
+ * __GFP_DIRECT_RECLAIM for anything but the first bio in the chain. Otherwise
+ * you risk waiting for IO completion of a bio that hasn't been submitted yet,
+ * thus resulting in a deadlock. Alternatively bios should be allocated using
+ * bio_kmalloc() instead of bio_alloc(), as that avoids the mempool deadlock.
* If possible a big IO should be split into smaller parts when allocation
* fails. Partial allocation should not be an error, or you risk a live-lock.
*/
diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 1a27f45ec776..0e7e7d9ffc04 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -289,7 +289,7 @@ struct io_context *get_task_io_context(struct task_struct *task,
{
struct io_context *ioc;

- might_sleep_if(gfp_flags & __GFP_WAIT);
+ might_sleep_if(gfpflags_allows_blocking(gfp_flags));

do {
task_lock(task);
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 9b6e28830b82..7e27f6164298 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -264,7 +264,7 @@ static int bt_get(struct blk_mq_alloc_data *data,
if (tag != -1)
return tag;

- if (!(data->gfp & __GFP_WAIT))
+ if (!gfpflags_allows_blocking(data->gfp))
return -1;

bs = bt_wait_ptr(bt, hctx);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7d842db59699..df8cba632ec2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -85,7 +85,7 @@ static int blk_mq_queue_enter(struct request_queue *q, gfp_t gfp)
if (percpu_ref_tryget_live(&q->mq_usage_counter))
return 0;

- if (!(gfp & __GFP_WAIT))
+ if (!gfpflags_allows_blocking(gfp))
return -EBUSY;

ret = wait_event_interruptible(q->mq_freeze_wq,
@@ -261,11 +261,11 @@ struct request *blk_mq_alloc_request(struct request_queue *q, int rw, gfp_t gfp,

ctx = blk_mq_get_ctx(q);
hctx = q->mq_ops->map_queue(q, ctx->cpu);
- blk_mq_set_alloc_data(&alloc_data, q, gfp & ~__GFP_WAIT,
+ blk_mq_set_alloc_data(&alloc_data, q, gfp & ~__GFP_DIRECT_RECLAIM,
reserved, ctx, hctx);

rq = __blk_mq_alloc_request(&alloc_data, rw);
- if (!rq && (gfp & __GFP_WAIT)) {
+ if (!rq && (gfp & __GFP_DIRECT_RECLAIM)) {
__blk_mq_run_hw_queue(hctx);
blk_mq_put_ctx(ctx);

@@ -1221,7 +1221,7 @@ static struct request *blk_mq_map_request(struct request_queue *q,
ctx = blk_mq_get_ctx(q);
hctx = q->mq_ops->map_queue(q, ctx->cpu);
blk_mq_set_alloc_data(&alloc_data, q,
- __GFP_WAIT|GFP_ATOMIC, false, ctx, hctx);
+ __GFP_WAIT|__GFP_HIGH, false, ctx, hctx);
rq = __blk_mq_alloc_request(&alloc_data, rw);
ctx = alloc_data.ctx;
hctx = alloc_data.hctx;
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index c62bb2e650b8..4c7ca678856a 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -3674,7 +3674,7 @@ cfq_find_alloc_queue(struct cfq_data *cfqd, bool is_sync, struct cfq_io_cq *cic,
if (new_cfqq) {
cfqq = new_cfqq;
new_cfqq = NULL;
- } else if (gfp_mask & __GFP_WAIT) {
+ } else if (gfp_mask & __GFP_DIRECT_RECLAIM) {
rcu_read_unlock();
spin_unlock_irq(cfqd->queue->queue_lock);
new_cfqq = kmem_cache_alloc_node(cfq_pool,
@@ -4289,7 +4289,7 @@ cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
const bool is_sync = rq_is_sync(rq);
struct cfq_queue *cfqq;

- might_sleep_if(gfp_mask & __GFP_WAIT);
+ might_sleep_if(gfpflags_allows_blocking(gfp_mask));

spin_lock_irq(q->queue_lock);

diff --git a/drivers/block/osdblk.c b/drivers/block/osdblk.c
index e22942596207..1b709a4e3b5e 100644
--- a/drivers/block/osdblk.c
+++ b/drivers/block/osdblk.c
@@ -271,7 +271,7 @@ static struct bio *bio_chain_clone(struct bio *old_chain, gfp_t gfpmask)
goto err_out;

tmp->bi_bdev = NULL;
- gfpmask &= ~__GFP_WAIT;
+ gfpmask &= ~__GFP_DIRECT_RECLAIM;
tmp->bi_next = NULL;

if (!new_chain)
diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c
index 30f522848c73..6255c8df6ae9 100644
--- a/drivers/connector/connector.c
+++ b/drivers/connector/connector.c
@@ -124,7 +124,8 @@ int cn_netlink_send_mult(struct cn_msg *msg, u16 len, u32 portid, u32 __group,
if (group)
return netlink_broadcast(dev->nls, skb, portid, group,
gfp_mask);
- return netlink_unicast(dev->nls, skb, portid, !(gfp_mask&__GFP_WAIT));
+ return netlink_unicast(dev->nls, skb, portid,
+ !gfpflags_allows_blocking(gfp_mask));
}
EXPORT_SYMBOL_GPL(cn_netlink_send_mult);

diff --git a/drivers/firewire/core-cdev.c b/drivers/firewire/core-cdev.c
index 2a3973a7c441..dc611c8cad10 100644
--- a/drivers/firewire/core-cdev.c
+++ b/drivers/firewire/core-cdev.c
@@ -486,7 +486,7 @@ static int ioctl_get_info(struct client *client, union ioctl_arg *arg)
static int add_client_resource(struct client *client,
struct client_resource *resource, gfp_t gfp_mask)
{
- bool preload = !!(gfp_mask & __GFP_WAIT);
+ bool preload = !!(gfp_mask & __GFP_DIRECT_RECLAIM);
unsigned long flags;
int ret;

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 52b446b27b4d..c2b45081c5ab 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2225,7 +2225,7 @@ i915_gem_object_get_pages_gtt(struct drm_i915_gem_object *obj)
*/
mapping = file_inode(obj->base.filp)->i_mapping;
gfp = mapping_gfp_mask(mapping);
- gfp |= __GFP_NORETRY | __GFP_NOWARN | __GFP_NO_KSWAPD;
+ gfp |= __GFP_NORETRY | __GFP_NOWARN;
gfp &= ~(__GFP_IO | __GFP_WAIT);
sg = st->sgl;
st->nents = 0;
diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index ca919f429666..76cf4cee3d64 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -619,7 +619,7 @@ static void init_mad(struct ib_sa_mad *mad, struct ib_mad_agent *agent)

static int send_mad(struct ib_sa_query *query, int timeout_ms, gfp_t gfp_mask)
{
- bool preload = !!(gfp_mask & __GFP_WAIT);
+ bool preload = !!(gfp_mask & __GFP_DIRECT_RECLAIM);
unsigned long flags;
int ret, id;

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 658ee39e6569..f4adbe89cd20 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -2755,7 +2755,7 @@ static void *alloc_coherent(struct device *dev, size_t size,

page = alloc_pages(flag | __GFP_NOWARN, get_order(size));
if (!page) {
- if (!(flag & __GFP_WAIT))
+ if (!gfpflags_allows_blocking(flag))
return NULL;

page = dma_alloc_from_contiguous(dev, size >> PAGE_SHIFT,
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 0649b94f5958..0f614d66eb03 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -3566,7 +3566,7 @@ static void *intel_alloc_coherent(struct device *dev, size_t size,
flags |= GFP_DMA32;
}

- if (flags & __GFP_WAIT) {
+ if (gfpflags_allows_blocking(flags)) {
unsigned int count = size >> PAGE_SHIFT;

page = dma_alloc_from_contiguous(dev, count, order);
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 0f48fed44a17..6dda08385309 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -993,7 +993,7 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size)
struct bio_vec *bvec;

retry:
- if (unlikely(gfp_mask & __GFP_WAIT))
+ if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
mutex_lock(&cc->bio_alloc_lock);

clone = bio_alloc_bioset(GFP_NOIO, nr_iovecs, cc->bs);
@@ -1009,7 +1009,7 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size)
if (!page) {
crypt_free_buffer_pages(cc, clone);
bio_put(clone);
- gfp_mask |= __GFP_WAIT;
+ gfp_mask |= __GFP_DIRECT_RECLAIM;
goto retry;
}

@@ -1026,7 +1026,7 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size)
}

return_clone:
- if (unlikely(gfp_mask & __GFP_WAIT))
+ if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
mutex_unlock(&cc->bio_alloc_lock);

return clone;
diff --git a/drivers/mtd/mtdcore.c b/drivers/mtd/mtdcore.c
index 8bbbb751bf45..2dfb291a47c6 100644
--- a/drivers/mtd/mtdcore.c
+++ b/drivers/mtd/mtdcore.c
@@ -1188,8 +1188,7 @@ EXPORT_SYMBOL_GPL(mtd_writev);
*/
void *mtd_kmalloc_up_to(const struct mtd_info *mtd, size_t *size)
{
- gfp_t flags = __GFP_NOWARN | __GFP_WAIT |
- __GFP_NORETRY | __GFP_NO_KSWAPD;
+ gfp_t flags = __GFP_NOWARN | __GFP_DIRECT_RECLAIM | __GFP_NORETRY;
size_t min_alloc = max_t(size_t, mtd->writesize, PAGE_SIZE);
void *kbuf;

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index a90d7364334f..8458329a877e 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -689,7 +689,7 @@ static void *bnx2x_frag_alloc(const struct bnx2x_fastpath *fp, gfp_t gfp_mask)
{
if (fp->rx_frag_size) {
/* GFP_KERNEL allocations are used only during initialization */
- if (unlikely(gfp_mask & __GFP_WAIT))
+ if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
return (void *)__get_free_page(gfp_mask);

return netdev_alloc_frag(fp->rx_frag_size);
diff --git a/drivers/staging/android/ion/ion_system_heap.c b/drivers/staging/android/ion/ion_system_heap.c
index da2a63c0a9ba..2615e0ae4f0a 100644
--- a/drivers/staging/android/ion/ion_system_heap.c
+++ b/drivers/staging/android/ion/ion_system_heap.c
@@ -27,7 +27,7 @@
#include "ion_priv.h"

static gfp_t high_order_gfp_flags = (GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN |
- __GFP_NORETRY) & ~__GFP_WAIT;
+ __GFP_NORETRY) & ~__GFP_DIRECT_RECLAIM;
static gfp_t low_order_gfp_flags = (GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN);
static const unsigned int orders[] = {8, 4, 0};
static const int num_orders = ARRAY_SIZE(orders);
diff --git a/drivers/usb/host/u132-hcd.c b/drivers/usb/host/u132-hcd.c
index d51687780b61..06badad3ab75 100644
--- a/drivers/usb/host/u132-hcd.c
+++ b/drivers/usb/host/u132-hcd.c
@@ -2247,7 +2247,7 @@ static int u132_urb_enqueue(struct usb_hcd *hcd, struct urb *urb,
{
struct u132 *u132 = hcd_to_u132(hcd);
if (irqs_disabled()) {
- if (__GFP_WAIT & mem_flags) {
+ if (__GFP_DIRECT_RECLAIM & mem_flags) {
printk(KERN_ERR "invalid context for function that migh"
"t sleep\n");
return -EINVAL;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index f556c3732c2c..3dd4792b8099 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2566,7 +2566,7 @@ int open_ctree(struct super_block *sb,
fs_info->commit_interval = BTRFS_DEFAULT_COMMIT_INTERVAL;
fs_info->avg_delayed_ref_runtime = NSEC_PER_SEC >> 6; /* div by 64 */
/* readahead state */
- INIT_RADIX_TREE(&fs_info->reada_tree, GFP_NOFS & ~__GFP_WAIT);
+ INIT_RADIX_TREE(&fs_info->reada_tree, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
spin_lock_init(&fs_info->reada_lock);

fs_info->thread_pool_size = min_t(unsigned long,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 02d05817cbdf..35660da77921 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -594,7 +594,7 @@ int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
if (bits & (EXTENT_IOBITS | EXTENT_BOUNDARY))
clear = 1;
again:
- if (!prealloc && (mask & __GFP_WAIT)) {
+ if (!prealloc && (mask & __GFP_DIRECT_RECLAIM)) {
/*
* Don't care for allocation failure here because we might end
* up not needing the pre-allocated extent state at all, which
@@ -850,7 +850,7 @@ __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,

bits |= EXTENT_FIRST_DELALLOC;
again:
- if (!prealloc && (mask & __GFP_WAIT)) {
+ if (!prealloc && (mask & __GFP_DIRECT_RECLAIM)) {
prealloc = alloc_extent_state(mask);
BUG_ON(!prealloc);
}
@@ -1076,7 +1076,7 @@ int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
btrfs_debug_check_extent_io_range(tree, start, end);

again:
- if (!prealloc && (mask & __GFP_WAIT)) {
+ if (!prealloc && (mask & __GFP_DIRECT_RECLAIM)) {
/*
* Best effort, don't worry if extent state allocation fails
* here for the first iteration. We might have a cached state
@@ -4265,7 +4265,7 @@ int try_release_extent_mapping(struct extent_map_tree *map,
u64 start = page_offset(page);
u64 end = start + PAGE_CACHE_SIZE - 1;

- if ((mask & __GFP_WAIT) &&
+ if ((mask & __GFP_DIRECT_RECLAIM) &&
page->mapping->host->i_size > 16 * 1024 * 1024) {
u64 len;
while (start <= end) {
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index fbe7c104531c..b1968f36a39b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -156,8 +156,8 @@ static struct btrfs_device *__alloc_device(void)
spin_lock_init(&dev->reada_lock);
atomic_set(&dev->reada_in_flight, 0);
atomic_set(&dev->dev_stats_ccnt, 0);
- INIT_RADIX_TREE(&dev->reada_zones, GFP_NOFS & ~__GFP_WAIT);
- INIT_RADIX_TREE(&dev->reada_extents, GFP_NOFS & ~__GFP_WAIT);
+ INIT_RADIX_TREE(&dev->reada_zones, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
+ INIT_RADIX_TREE(&dev->reada_extents, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);

return dev;
}
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 5ed0044fbb37..9004c786716f 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -750,7 +750,7 @@ static int bdev_try_to_free_page(struct super_block *sb, struct page *page,
return 0;
if (journal)
return journal_try_to_free_buffers(journal, page,
- wait & ~__GFP_WAIT);
+ wait & ~__GFP_DIRECT_RECLAIM);
return try_to_free_buffers(page);
}

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 58987b5c514b..abe76d41ef1e 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1045,7 +1045,7 @@ static int bdev_try_to_free_page(struct super_block *sb, struct page *page,
return 0;
if (journal)
return jbd2_journal_try_to_free_buffers(journal, page,
- wait & ~__GFP_WAIT);
+ wait & ~__GFP_DIRECT_RECLAIM);
return try_to_free_buffers(page);
}

diff --git a/fs/fscache/cookie.c b/fs/fscache/cookie.c
index d403c69bee08..4304072161aa 100644
--- a/fs/fscache/cookie.c
+++ b/fs/fscache/cookie.c
@@ -111,7 +111,7 @@ struct fscache_cookie *__fscache_acquire_cookie(

/* radix tree insertion won't use the preallocation pool unless it's
* told it may not wait */
- INIT_RADIX_TREE(&cookie->stores, GFP_NOFS & ~__GFP_WAIT);
+ INIT_RADIX_TREE(&cookie->stores, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);

switch (cookie->def->type) {
case FSCACHE_COOKIE_TYPE_INDEX:
diff --git a/fs/fscache/page.c b/fs/fscache/page.c
index 483bbc613bf0..79483b3d8c6f 100644
--- a/fs/fscache/page.c
+++ b/fs/fscache/page.c
@@ -58,7 +58,7 @@ bool release_page_wait_timeout(struct fscache_cookie *cookie, struct page *page)

/*
* decide whether a page can be released, possibly by cancelling a store to it
- * - we're allowed to sleep if __GFP_WAIT is flagged
+ * - we're allowed to sleep if __GFP_DIRECT_RECLAIM is flagged
*/
bool __fscache_maybe_release_page(struct fscache_cookie *cookie,
struct page *page,
@@ -122,7 +122,7 @@ bool __fscache_maybe_release_page(struct fscache_cookie *cookie,
* allocator as the work threads writing to the cache may all end up
* sleeping on memory allocation, so we may need to impose a timeout
* too. */
- if (!(gfp & __GFP_WAIT) || !(gfp & __GFP_FS)) {
+ if (!(gfp & __GFP_DIRECT_RECLAIM) || !(gfp & __GFP_FS)) {
fscache_stat(&fscache_n_store_vmscan_busy);
return false;
}
@@ -132,7 +132,7 @@ bool __fscache_maybe_release_page(struct fscache_cookie *cookie,
_debug("fscache writeout timeout page: %p{%lx}",
page, page->index);

- gfp &= ~__GFP_WAIT;
+ gfp &= ~__GFP_DIRECT_RECLAIM;
goto try_again;
}
EXPORT_SYMBOL(__fscache_maybe_release_page);
diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index 1695ba8334a2..f45b90ba7c5c 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -1690,8 +1690,8 @@ __journal_try_to_free_buffer(journal_t *journal, struct buffer_head *bh)
* @journal: journal for operation
* @page: to try and free
* @gfp_mask: we use the mask to detect how hard should we try to release
- * buffers. If __GFP_WAIT and __GFP_FS is set, we wait for commit code to
- * release the buffers.
+ * buffers. If __GFP_DIRECT_RECLAIM and __GFP_FS is set, we wait for commit
+ * code to release the buffers.
*
*
* For all the buffers on this page,
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index f3d06174b051..06e18bcdb888 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1893,8 +1893,8 @@ __journal_try_to_free_buffer(journal_t *journal, struct buffer_head *bh)
* @journal: journal for operation
* @page: to try and free
* @gfp_mask: we use the mask to detect how hard should we try to release
- * buffers. If __GFP_WAIT and __GFP_FS is set, we wait for commit code to
- * release the buffers.
+ * buffers. If __GFP_DIRECT_RECLAIM and __GFP_FS is set, we wait for commit
+ * code to release the buffers.
*
*
* For all the buffers on this page,
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index cc4fa1ed61fc..5664e1938da1 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -480,8 +480,8 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
dfprintk(PAGECACHE, "NFS: release_page(%p)\n", page);

/* Always try to initiate a 'commit' if relevant, but only
- * wait for it if __GFP_WAIT is set. Even then, only wait 1
- * second and only if the 'bdi' is not congested.
+ * wait for it if __GFP_DIRECT_RECLAIM is set. Even then,
+ * only wait 1 second and only if the 'bdi' is not congested.
* Waiting indefinitely can cause deadlocks when the NFS
* server is on this machine, when a new TCP connection is
* needed and in other rare cases. There is no particular
@@ -491,7 +491,7 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
if (mapping) {
struct nfs_server *nfss = NFS_SERVER(mapping->host);
nfs_commit_inode(mapping->host, 0);
- if ((gfp & __GFP_WAIT) &&
+ if ((gfp & __GFP_DIRECT_RECLAIM) &&
!bdi_write_congested(&nfss->backing_dev_info)) {
wait_on_page_bit_killable_timeout(page, PG_private,
HZ);
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index eac9549efd52..587174fd4f2c 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -525,7 +525,7 @@ xfs_qm_shrink_scan(
unsigned long freed;
int error;

- if ((sc->gfp_mask & (__GFP_FS|__GFP_WAIT)) != (__GFP_FS|__GFP_WAIT))
+ if ((sc->gfp_mask & (__GFP_FS|__GFP_DIRECT_RECLAIM)) != (__GFP_FS|__GFP_DIRECT_RECLAIM))
return 0;

INIT_LIST_HEAD(&isol.buffers);
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 43246850a85f..dbd246a14e2f 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -29,12 +29,13 @@ struct vm_area_struct;
#define ___GFP_NOMEMALLOC 0x10000u
#define ___GFP_HARDWALL 0x20000u
#define ___GFP_THISNODE 0x40000u
-#define ___GFP_WAIT 0x80000u
+#define ___GFP_ATOMIC 0x80000u
#define ___GFP_NOACCOUNT 0x100000u
#define ___GFP_NOTRACK 0x200000u
-#define ___GFP_NO_KSWAPD 0x400000u
+#define ___GFP_DIRECT_RECLAIM 0x400000u
#define ___GFP_OTHER_NODE 0x800000u
#define ___GFP_WRITE 0x1000000u
+#define ___GFP_KSWAPD_RECLAIM 0x2000000u
/* If the above are modified, __GFP_BITS_SHIFT may need updating */

/*
@@ -68,7 +69,7 @@ struct vm_area_struct;
* __GFP_MOVABLE: Flag that this page will be movable by the page migration
* mechanism or reclaimed
*/
-#define __GFP_WAIT ((__force gfp_t)___GFP_WAIT) /* Can wait and reschedule? */
+#define __GFP_ATOMIC ((__force gfp_t)___GFP_ATOMIC) /* Caller cannot wait or reschedule */
#define __GFP_HIGH ((__force gfp_t)___GFP_HIGH) /* Should access emergency pools? */
#define __GFP_IO ((__force gfp_t)___GFP_IO) /* Can start physical IO? */
#define __GFP_FS ((__force gfp_t)___GFP_FS) /* Can call down to low-level FS? */
@@ -91,23 +92,37 @@ struct vm_area_struct;
#define __GFP_NOACCOUNT ((__force gfp_t)___GFP_NOACCOUNT) /* Don't account to kmemcg */
#define __GFP_NOTRACK ((__force gfp_t)___GFP_NOTRACK) /* Don't track with kmemcheck */

-#define __GFP_NO_KSWAPD ((__force gfp_t)___GFP_NO_KSWAPD)
#define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
#define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */

/*
+ * A caller that is willing to wait may enter direct reclaim and will
+ * wake kswapd to reclaim pages in the background until the high
+ * watermark is met. A caller may wish to clear __GFP_DIRECT_RECLAIM to
+ * avoid unnecessary delays when a fallback option is available but
+ * still allow kswapd to reclaim in the background. __GFP_KSWAPD_RECLAIM
+ * can be cleared when background reclaim itself would cause unnecessary
+ * disruption.
+ */
+#define __GFP_WAIT (__GFP_DIRECT_RECLAIM|__GFP_KSWAPD_RECLAIM)
+#define __GFP_DIRECT_RECLAIM ((__force gfp_t)___GFP_DIRECT_RECLAIM) /* Caller can reclaim */
+#define __GFP_KSWAPD_RECLAIM ((__force gfp_t)___GFP_KSWAPD_RECLAIM) /* kswapd can wake */
+
+/*
* This may seem redundant, but it's a way of annotating false positives vs.
* allocations that simply cannot be supported (e.g. page tables).
*/
#define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)

-#define __GFP_BITS_SHIFT 25 /* Room for N __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 26 /* Room for N __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))

-/* This equals 0, but use constants in case they ever change */
-#define GFP_NOWAIT (GFP_ATOMIC & ~__GFP_HIGH)
-/* GFP_ATOMIC means both !wait (__GFP_WAIT not set) and use emergency pool */
-#define GFP_ATOMIC (__GFP_HIGH)
+/*
+ * GFP_ATOMIC callers cannot sleep and need the allocation to succeed.
+ * A lower watermark is applied to allow access to "atomic reserves".
+ */
+#define GFP_ATOMIC (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
+#define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM)
#define GFP_NOIO (__GFP_WAIT)
#define GFP_NOFS (__GFP_WAIT | __GFP_IO)
#define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS)
@@ -117,9 +132,9 @@ struct vm_area_struct;
#define GFP_HIGHUSER (GFP_USER | __GFP_HIGHMEM)
#define GFP_HIGHUSER_MOVABLE (GFP_HIGHUSER | __GFP_MOVABLE)
#define GFP_IOFS (__GFP_IO | __GFP_FS)
-#define GFP_TRANSHUGE (GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
- __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \
- __GFP_NO_KSWAPD)
+#define GFP_TRANSHUGE ((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
+ __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \
+ ~__GFP_KSWAPD_RECLAIM)

/* This mask makes up all the page movable related flags */
#define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
@@ -161,6 +176,11 @@ static inline int gfpflags_to_migratetype(const gfp_t gfp_flags)
return (gfp_flags & GFP_MOVABLE_MASK) >> GFP_MOVABLE_SHIFT;
}

+static inline bool gfpflags_allows_blocking(const gfp_t gfp_flags)
+{
+ return gfp_flags & __GFP_DIRECT_RECLAIM;
+}
+
#ifdef CONFIG_HIGHMEM
#define OPT_ZONE_HIGHMEM ZONE_HIGHMEM
#else
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index d6cdd6e87d53..1086bfa3eb80 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1109,7 +1109,7 @@ static inline int skb_cloned(const struct sk_buff *skb)

static inline int skb_unclone(struct sk_buff *skb, gfp_t pri)
{
- might_sleep_if(pri & __GFP_WAIT);
+ might_sleep_if(gfpflags_allows_blocking(pri));

if (skb_cloned(skb))
return pskb_expand_head(skb, 0, 0, pri);
@@ -1193,7 +1193,7 @@ static inline int skb_shared(const struct sk_buff *skb)
*/
static inline struct sk_buff *skb_share_check(struct sk_buff *skb, gfp_t pri)
{
- might_sleep_if(pri & __GFP_WAIT);
+ might_sleep_if(gfpflags_allows_blocking(pri));
if (skb_shared(skb)) {
struct sk_buff *nskb = skb_clone(skb, pri);

@@ -1229,7 +1229,7 @@ static inline struct sk_buff *skb_share_check(struct sk_buff *skb, gfp_t pri)
static inline struct sk_buff *skb_unshare(struct sk_buff *skb,
gfp_t pri)
{
- might_sleep_if(pri & __GFP_WAIT);
+ might_sleep_if(gfpflags_allows_blocking(pri));
if (skb_cloned(skb)) {
struct sk_buff *nskb = skb_copy(skb, pri);

diff --git a/include/net/sock.h b/include/net/sock.h
index f21f0708ec59..3ab94d6b0b56 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2035,7 +2035,7 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp,
*/
static inline struct page_frag *sk_page_frag(struct sock *sk)
{
- if (sk->sk_allocation & __GFP_WAIT)
+ if (sk->sk_allocation & __GFP_DIRECT_RECLAIM)
return &current->task_frag;

return &sk->sk_frag;
diff --git a/include/trace/events/gfpflags.h b/include/trace/events/gfpflags.h
index d6fd8e5b14b7..dde6bf092c8a 100644
--- a/include/trace/events/gfpflags.h
+++ b/include/trace/events/gfpflags.h
@@ -20,7 +20,7 @@
{(unsigned long)GFP_ATOMIC, "GFP_ATOMIC"}, \
{(unsigned long)GFP_NOIO, "GFP_NOIO"}, \
{(unsigned long)__GFP_HIGH, "GFP_HIGH"}, \
- {(unsigned long)__GFP_WAIT, "GFP_WAIT"}, \
+ {(unsigned long)__GFP_ATOMIC, "GFP_ATOMIC"}, \
{(unsigned long)__GFP_IO, "GFP_IO"}, \
{(unsigned long)__GFP_COLD, "GFP_COLD"}, \
{(unsigned long)__GFP_NOWARN, "GFP_NOWARN"}, \
@@ -36,7 +36,8 @@
{(unsigned long)__GFP_RECLAIMABLE, "GFP_RECLAIMABLE"}, \
{(unsigned long)__GFP_MOVABLE, "GFP_MOVABLE"}, \
{(unsigned long)__GFP_NOTRACK, "GFP_NOTRACK"}, \
- {(unsigned long)__GFP_NO_KSWAPD, "GFP_NO_KSWAPD"}, \
+ {(unsigned long)__GFP_DIRECT_RECLAIM, "GFP_DIRECT_RECLAIM"}, \
+ {(unsigned long)__GFP_KSWAPD_RECLAIM, "GFP_KSWAPD_RECLAIM"}, \
{(unsigned long)__GFP_OTHER_NODE, "GFP_OTHER_NODE"} \
) : "GFP_NOWAIT"

diff --git a/kernel/audit.c b/kernel/audit.c
index f9e6065346db..6ab7a55dbdff 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -1357,16 +1357,16 @@ struct audit_buffer *audit_log_start(struct audit_context *ctx, gfp_t gfp_mask,
if (unlikely(audit_filter_type(type)))
return NULL;

- if (gfp_mask & __GFP_WAIT) {
+ if (gfp_mask & __GFP_DIRECT_RECLAIM) {
if (audit_pid && audit_pid == current->pid)
- gfp_mask &= ~__GFP_WAIT;
+ gfp_mask &= ~__GFP_DIRECT_RECLAIM;
else
reserve = 0;
}

while (audit_backlog_limit
&& skb_queue_len(&audit_skb_queue) > audit_backlog_limit + reserve) {
- if (gfp_mask & __GFP_WAIT && audit_backlog_wait_time) {
+ if (gfp_mask & __GFP_DIRECT_RECLAIM && audit_backlog_wait_time) {
long sleep_time;

sleep_time = timeout_start + audit_backlog_wait_time - jiffies;
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 8acfbf773e06..9aa39f20f593 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2738,7 +2738,7 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
return;

/* no reclaim without waiting on it */
- if (!(gfp_mask & __GFP_WAIT))
+ if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
return;

/* this guy won't enter reclaim */
diff --git a/kernel/smp.c b/kernel/smp.c
index 07854477c164..32ee47f6ac11 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -669,7 +669,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
cpumask_var_t cpus;
int cpu, ret;

- might_sleep_if(gfp_flags & __GFP_WAIT);
+ might_sleep_if(gfp_flags & __GFP_DIRECT_RECLAIM);

if (likely(zalloc_cpumask_var(&cpus, (gfp_flags|__GFP_NOWARN)))) {
preempt_disable();
diff --git a/lib/idr.c b/lib/idr.c
index 5335c43adf46..e5118fc82961 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -399,7 +399,7 @@ void idr_preload(gfp_t gfp_mask)
* allocation guarantee. Disallow usage from those contexts.
*/
WARN_ON_ONCE(in_interrupt());
- might_sleep_if(gfp_mask & __GFP_WAIT);
+ might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);

preempt_disable();

@@ -453,7 +453,7 @@ int idr_alloc(struct idr *idr, void *ptr, int start, int end, gfp_t gfp_mask)
struct idr_layer *pa[MAX_IDR_LEVEL + 1];
int id;

- might_sleep_if(gfp_mask & __GFP_WAIT);
+ might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);

/* sanity checks */
if (WARN_ON_ONCE(start < 0))
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index f9ebe1c82060..cc5fdc3fb734 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -188,7 +188,7 @@ radix_tree_node_alloc(struct radix_tree_root *root)
* preloading in the interrupt anyway as all the allocations have to
* be atomic. So just do normal allocation when in interrupt.
*/
- if (!(gfp_mask & __GFP_WAIT) && !in_interrupt()) {
+ if (!(gfp_mask & __GFP_DIRECT_RECLAIM) && !in_interrupt()) {
struct radix_tree_preload *rtp;

/*
@@ -249,7 +249,7 @@ radix_tree_node_free(struct radix_tree_node *node)
* with preemption not disabled.
*
* To make use of this facility, the radix tree must be initialised without
- * __GFP_WAIT being passed to INIT_RADIX_TREE().
+ * __GFP_DIRECT_RECLAIM being passed to INIT_RADIX_TREE().
*/
static int __radix_tree_preload(gfp_t gfp_mask)
{
@@ -286,12 +286,12 @@ static int __radix_tree_preload(gfp_t gfp_mask)
* with preemption not disabled.
*
* To make use of this facility, the radix tree must be initialised without
- * __GFP_WAIT being passed to INIT_RADIX_TREE().
+ * __GFP_DIRECT_RECLAIM being passed to INIT_RADIX_TREE().
*/
int radix_tree_preload(gfp_t gfp_mask)
{
/* Warn on non-sensical use... */
- WARN_ON_ONCE(!(gfp_mask & __GFP_WAIT));
+ WARN_ON_ONCE(!(gfp_mask & __GFP_DIRECT_RECLAIM));
return __radix_tree_preload(gfp_mask);
}
EXPORT_SYMBOL(radix_tree_preload);
@@ -303,7 +303,7 @@ EXPORT_SYMBOL(radix_tree_preload);
*/
int radix_tree_maybe_preload(gfp_t gfp_mask)
{
- if (gfp_mask & __GFP_WAIT)
+ if (gfp_mask & __GFP_DIRECT_RECLAIM)
return __radix_tree_preload(gfp_mask);
/* Preloading doesn't help anything with this gfp mask, skip it */
preempt_disable();
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index dac5bf59309d..2056d16807de 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -632,7 +632,7 @@ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
{
struct bdi_writeback *wb;

- might_sleep_if(gfp & __GFP_WAIT);
+ might_sleep_if(gfp & __GFP_DIRECT_RECLAIM);

if (!memcg_css->parent)
return &bdi->wb;
diff --git a/mm/dmapool.c b/mm/dmapool.c
index fd5fe4342e93..248f6e864a92 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -323,7 +323,7 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
size_t offset;
void *retval;

- might_sleep_if(mem_flags & __GFP_WAIT);
+ might_sleep_if(gfpflags_allows_blocking(mem_flags));

spin_lock_irqsave(&pool->lock, flags);
list_for_each_entry(page, &pool->page_list, page_list) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index acb93c554f6e..7155a556a8d4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2268,7 +2268,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
if (unlikely(task_in_memcg_oom(current)))
goto nomem;

- if (!(gfp_mask & __GFP_WAIT))
+ if (!gfpflags_allows_blocking(gfp_mask))
goto nomem;

mem_cgroup_events(mem_over_limit, MEMCG_MAX, 1);
@@ -2327,7 +2327,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
css_get_many(&memcg->css, batch);
if (batch > nr_pages)
refill_stock(memcg, batch - nr_pages);
- if (!(gfp_mask & __GFP_WAIT))
+ if (!gfpflags_allows_blocking(gfp_mask))
goto done;
/*
* If the hierarchy is above the normal consumption range,
@@ -4696,8 +4696,8 @@ static int mem_cgroup_do_precharge(unsigned long count)
{
int ret;

- /* Try a single bulk charge without reclaim first */
- ret = try_charge(mc.to, GFP_KERNEL & ~__GFP_WAIT, count);
+ /* Try a single bulk charge without reclaim first, kswapd may wake */
+ ret = try_charge(mc.to, GFP_KERNEL & ~__GFP_DIRECT_RECLAIM, count);
if (!ret) {
mc.precharge += count;
return ret;
diff --git a/mm/mempool.c b/mm/mempool.c
index 2cc08de8b1db..bfd2a0dd0e18 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -317,13 +317,13 @@ void * mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
gfp_t gfp_temp;

VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO);
- might_sleep_if(gfp_mask & __GFP_WAIT);
+ might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);

gfp_mask |= __GFP_NOMEMALLOC; /* don't allocate emergency reserves */
gfp_mask |= __GFP_NORETRY; /* don't loop in __alloc_pages */
gfp_mask |= __GFP_NOWARN; /* failures are OK */

- gfp_temp = gfp_mask & ~(__GFP_WAIT|__GFP_IO);
+ gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);

repeat_alloc:

@@ -346,7 +346,7 @@ void * mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
}

/*
- * We use gfp mask w/o __GFP_WAIT or IO for the first round. If
+ * We use gfp mask w/o direct reclaim or IO for the first round. If
* alloc failed with that and @pool was empty, retry immediately.
*/
if (gfp_temp != gfp_mask) {
@@ -355,8 +355,8 @@ void * mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
goto repeat_alloc;
}

- /* We must not sleep if !__GFP_WAIT */
- if (!(gfp_mask & __GFP_WAIT)) {
+ /* We must not sleep if !__GFP_DIRECT_RECLAIM */
+ if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) {
spin_unlock_irqrestore(&pool->lock, flags);
return NULL;
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 94f2f6bdd6d5..ccd235d02923 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2143,7 +2143,7 @@ static bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
return false;
if (fail_page_alloc.ignore_gfp_highmem && (gfp_mask & __GFP_HIGHMEM))
return false;
- if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & __GFP_WAIT))
+ if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
return false;

return should_fail(&fail_page_alloc.attr, 1 << order);
@@ -2459,7 +2459,7 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
if (test_thread_flag(TIF_MEMDIE) ||
(current->flags & (PF_MEMALLOC | PF_EXITING)))
filter &= ~SHOW_MEM_FILTER_NODES;
- if (in_interrupt() || !(gfp_mask & __GFP_WAIT))
+ if (in_interrupt() || !(gfp_mask & __GFP_WAIT) || (gfp_mask & __GFP_ATOMIC))
filter &= ~SHOW_MEM_FILTER_NODES;

if (fmt) {
@@ -2710,7 +2710,6 @@ static inline int
gfp_to_alloc_flags(gfp_t gfp_mask)
{
int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
- const bool atomic = !(gfp_mask & (__GFP_WAIT | __GFP_NO_KSWAPD));

/* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
@@ -2719,11 +2718,11 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
* The caller may dip into page reserves a bit more if the caller
* cannot run direct reclaim, or if the caller has realtime scheduling
* policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will
- * set both ALLOC_HARDER (atomic == true) and ALLOC_HIGH (__GFP_HIGH).
+ * set both ALLOC_HARDER (__GFP_ATOMIC) and ALLOC_HIGH (__GFP_HIGH).
*/
alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);

- if (atomic) {
+ if (gfp_mask & __GFP_ATOMIC) {
/*
* Not worth trying to allocate harder for __GFP_NOMEMALLOC even
* if it can't schedule.
@@ -2764,7 +2763,7 @@ static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
struct alloc_context *ac)
{
- const gfp_t wait = gfp_mask & __GFP_WAIT;
+ bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
struct page *page = NULL;
int alloc_flags;
unsigned long pages_reclaimed = 0;
@@ -2785,15 +2784,23 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
}

/*
+ * We also sanity check to catch abuse of atomic reserves by callers
+ * that are not in atomic context.
+ */
+ if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) ==
+ (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
+ gfp_mask &= ~__GFP_ATOMIC;
+
+ /*
* If this allocation cannot block and it is for a specific node, then
* fail early. There's no need to wakeup kswapd or retry for a
* speculative node-specific allocation.
*/
- if (IS_ENABLED(CONFIG_NUMA) && (gfp_mask & __GFP_THISNODE) && !wait)
+ if (IS_ENABLED(CONFIG_NUMA) && (gfp_mask & __GFP_THISNODE) && !can_direct_reclaim)
goto nopage;

retry:
- if (!(gfp_mask & __GFP_NO_KSWAPD))
+ if (gfp_mask & __GFP_KSWAPD_RECLAIM)
wake_all_kswapds(order, ac);

/*
@@ -2836,8 +2843,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
}
}

- /* Atomic allocations - we can't balance anything */
- if (!wait) {
+ /* Caller is not willing to reclaim, we can't balance anything */
+ if (!can_direct_reclaim) {
/*
* All existing users of the deprecated __GFP_NOFAIL are
* blockable, so warn of any new users that actually allow this
@@ -2974,7 +2981,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,

lockdep_trace_alloc(gfp_mask);

- might_sleep_if(gfp_mask & __GFP_WAIT);
+ might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);

if (should_fail_alloc_page(gfp_mask, order))
return NULL;
diff --git a/mm/slab.c b/mm/slab.c
index 200e22412a16..a7bcbcc5692e 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1030,12 +1030,12 @@ static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
}

/*
- * Construct gfp mask to allocate from a specific node but do not invoke reclaim
- * or warn about failures.
+ * Construct gfp mask to allocate from a specific node but do not direct reclaim
+ * or warn about failures. kswapd may still wake to reclaim in the background.
*/
static inline gfp_t gfp_exact_node(gfp_t flags)
{
- return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT;
+ return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_DIRECT_RECLAIM;
}
#endif

@@ -2625,7 +2625,7 @@ static int cache_grow(struct kmem_cache *cachep,

offset *= cachep->colour_off;

- if (local_flags & __GFP_WAIT)
+ if (local_flags & __GFP_DIRECT_RECLAIM)
local_irq_enable();

/*
@@ -2655,7 +2655,7 @@ static int cache_grow(struct kmem_cache *cachep,

cache_init_objs(cachep, page);

- if (local_flags & __GFP_WAIT)
+ if (local_flags & __GFP_DIRECT_RECLAIM)
local_irq_disable();
check_irq_off();
spin_lock(&n->list_lock);
@@ -2669,7 +2669,7 @@ static int cache_grow(struct kmem_cache *cachep,
opps1:
kmem_freepages(cachep, page);
failed:
- if (local_flags & __GFP_WAIT)
+ if (local_flags & __GFP_DIRECT_RECLAIM)
local_irq_disable();
return 0;
}
@@ -2861,7 +2861,7 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags,
static inline void cache_alloc_debugcheck_before(struct kmem_cache *cachep,
gfp_t flags)
{
- might_sleep_if(flags & __GFP_WAIT);
+ might_sleep_if(flags & __GFP_DIRECT_RECLAIM);
#if DEBUG
kmem_flagcheck(cachep, flags);
#endif
@@ -3049,11 +3049,11 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
*/
struct page *page;

- if (local_flags & __GFP_WAIT)
+ if (local_flags & __GFP_DIRECT_RECLAIM)
local_irq_enable();
kmem_flagcheck(cache, flags);
page = kmem_getpages(cache, local_flags, numa_mem_id());
- if (local_flags & __GFP_WAIT)
+ if (local_flags & __GFP_DIRECT_RECLAIM)
local_irq_disable();
if (page) {
/*
diff --git a/mm/slub.c b/mm/slub.c
index 816df0016555..b658a66ffce4 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1263,7 +1263,7 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
{
flags &= gfp_allowed_mask;
lockdep_trace_alloc(flags);
- might_sleep_if(flags & __GFP_WAIT);
+ might_sleep_if(gfpflags_allows_blocking(flags));

if (should_failslab(s->object_size, flags, s->flags))
return NULL;
@@ -1339,7 +1339,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)

flags &= gfp_allowed_mask;

- if (flags & __GFP_WAIT)
+ if (flags & __GFP_DIRECT_RECLAIM)
local_irq_enable();

flags |= s->allocflags;
@@ -1380,7 +1380,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
kmemcheck_mark_unallocated_pages(page, pages);
}

- if (flags & __GFP_WAIT)
+ if (flags & __GFP_DIRECT_RECLAIM)
local_irq_disable();
if (!page)
return NULL;
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 2faaa2976447..c6ce91b20c91 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1617,7 +1617,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
goto fail;
}
area->pages[i] = page;
- if (gfp_mask & __GFP_WAIT)
+ if (gfpflags_allows_blocking(gfp_mask))
cond_resched();
}

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f1d8eae285f2..bc9ab358b77a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3768,7 +3768,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
/*
* Do not scan if the allocation should not be delayed.
*/
- if (!(gfp_mask & __GFP_WAIT) || (current->flags & PF_MEMALLOC))
+ if (!gfpflags_allows_blocking(gfp_mask) || (current->flags & PF_MEMALLOC))
return ZONE_RECLAIM_NOSCAN;

/*
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b6a19ca0f99e..6f025e2544de 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -414,7 +414,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len,
len += NET_SKB_PAD;

if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
- (gfp_mask & (__GFP_WAIT | GFP_DMA))) {
+ (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
if (!skb)
goto skb_fail;
@@ -481,7 +481,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
len += NET_SKB_PAD + NET_IP_ALIGN;

if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
- (gfp_mask & (__GFP_WAIT | GFP_DMA))) {
+ (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
if (!skb)
goto skb_fail;
@@ -4452,7 +4452,7 @@ struct sk_buff *alloc_skb_with_frags(unsigned long header_len,
return NULL;

gfp_head = gfp_mask;
- if (gfp_head & __GFP_WAIT)
+ if (gfp_head & __GFP_DIRECT_RECLAIM)
gfp_head |= __GFP_REPEAT;

*errcode = -ENOBUFS;
@@ -4467,7 +4467,7 @@ struct sk_buff *alloc_skb_with_frags(unsigned long header_len,

while (order) {
if (npages >= 1 << order) {
- page = alloc_pages((gfp_mask & ~__GFP_WAIT) |
+ page = alloc_pages((gfp_mask & ~__GFP_DIRECT_RECLAIM) |
__GFP_COMP |
__GFP_NOWARN |
__GFP_NORETRY,
diff --git a/net/core/sock.c b/net/core/sock.c
index 193901d09757..02b705cc9eb3 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1879,8 +1879,10 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp)

pfrag->offset = 0;
if (SKB_FRAG_PAGE_ORDER) {
- pfrag->page = alloc_pages((gfp & ~__GFP_WAIT) | __GFP_COMP |
- __GFP_NOWARN | __GFP_NORETRY,
+ /* Avoid direct reclaim but allow kswapd to wake */
+ pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
+ __GFP_COMP | __GFP_NOWARN |
+ __GFP_NORETRY,
SKB_FRAG_PAGE_ORDER);
if (likely(pfrag->page)) {
pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
diff --git a/net/sctp/associola.c b/net/sctp/associola.c
index 197c3f59ecbf..c5fcdd6f85b7 100644
--- a/net/sctp/associola.c
+++ b/net/sctp/associola.c
@@ -1588,7 +1588,7 @@ int sctp_assoc_lookup_laddr(struct sctp_association *asoc,
/* Set an association id for a given association */
int sctp_assoc_set_id(struct sctp_association *asoc, gfp_t gfp)
{
- bool preload = !!(gfp & __GFP_WAIT);
+ bool preload = !!(gfp & __GFP_DIRECT_RECLAIM);
int ret;

/* If the id is already assigned, keep it. */
--
2.4.6
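
For illustration, a minimal sketch of how a caller might use the new
bits and the gfpflags_allows_blocking() helper added above;
example_fill_cache() and its retry policy are hypothetical, not part
of the series:

#include <linux/gfp.h>
#include <linux/kernel.h>
#include <linux/slab.h>

/* Hypothetical caller, illustration only: try a cheap attempt first.
 * Clearing __GFP_DIRECT_RECLAIM means the first attempt cannot block,
 * but kswapd is still woken (when __GFP_KSWAPD_RECLAIM is set) to
 * restore the watermarks in the background. */
static void *example_fill_cache(size_t size, gfp_t gfp)
{
        void *obj;

        might_sleep_if(gfpflags_allows_blocking(gfp));

        obj = kmalloc(size, (gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_NOWARN);
        if (obj || !gfpflags_allows_blocking(gfp))
                return obj;

        /* The caller may block: retry with direct reclaim permitted. */
        return kmalloc(size, gfp);
}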

2015-08-12 10:46:43

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 07/10] mm: page_alloc: Rename __GFP_WAIT to __GFP_RECLAIM

The absence of __GFP_WAIT was used to signal that the caller was in
atomic context and could not sleep. Now it is possible to distinguish
between true atomic context and callers that are merely unwilling to
sleep. The latter should clear only __GFP_DIRECT_RECLAIM so kswapd
will still be woken. As clearing the redefined __GFP_WAIT behaves
differently from clearing the old flag, there is a risk that people
will clear the wrong flags. This patch renames __GFP_WAIT to
__GFP_RECLAIM to clearly indicate what it does -- setting it allows
all reclaim activity while clearing it prevents any.
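
As a sketch of the distinction the rename is meant to keep obvious
(example_clear_reclaim() is hypothetical, not part of this patch):

#include <linux/gfp.h>
#include <linux/types.h>

/* Hypothetical helper, illustration only. After the rename:
 * clearing __GFP_DIRECT_RECLAIM leaves kswapd free to reclaim in the
 * background, while clearing __GFP_RECLAIM removes both bits and
 * prevents all reclaim -- unlike the original single-bit __GFP_WAIT,
 * whose clearing only disabled direct reclaim. */
static inline gfp_t example_clear_reclaim(gfp_t gfp, bool keep_kswapd)
{
        return keep_kswapd ? gfp & ~__GFP_DIRECT_RECLAIM
                           : gfp & ~__GFP_RECLAIM;
}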

Signed-off-by: Mel Gorman <[email protected]>
---
block/blk-mq.c | 2 +-
block/scsi_ioctl.c | 6 +++---
drivers/block/drbd/drbd_bitmap.c | 2 +-
drivers/block/drbd/drbd_receiver.c | 2 +-
drivers/block/mtip32xx/mtip32xx.c | 2 +-
drivers/block/nvme-core.c | 4 ++--
drivers/block/paride/pd.c | 2 +-
drivers/block/pktcdvd.c | 4 ++--
drivers/gpu/drm/i915/i915_gem.c | 2 +-
drivers/ide/ide-atapi.c | 2 +-
drivers/ide/ide-cd.c | 2 +-
drivers/ide/ide-cd_ioctl.c | 2 +-
drivers/ide/ide-devsets.c | 2 +-
drivers/ide/ide-disk.c | 2 +-
drivers/ide/ide-ioctls.c | 4 ++--
drivers/ide/ide-park.c | 2 +-
drivers/ide/ide-pm.c | 4 ++--
drivers/ide/ide-tape.c | 4 ++--
drivers/ide/ide-taskfile.c | 4 ++--
drivers/infiniband/hw/ipath/ipath_file_ops.c | 2 +-
drivers/infiniband/hw/qib/qib_init.c | 2 +-
drivers/misc/vmw_balloon.c | 2 +-
drivers/scsi/scsi_error.c | 2 +-
drivers/scsi/scsi_lib.c | 4 ++--
.../staging/lustre/include/linux/libcfs/libcfs_private.h | 2 +-
fs/btrfs/extent_io.c | 6 +++---
fs/cachefiles/internal.h | 2 +-
fs/direct-io.c | 2 +-
fs/nilfs2/mdt.h | 2 +-
include/linux/gfp.h | 16 ++++++++--------
kernel/power/swap.c | 14 +++++++-------
lib/percpu_ida.c | 2 +-
mm/failslab.c | 8 ++++----
mm/filemap.c | 2 +-
mm/huge_memory.c | 2 +-
mm/migrate.c | 2 +-
mm/page_alloc.c | 10 +++++-----
net/netlink/af_netlink.c | 2 +-
net/rxrpc/ar-connection.c | 2 +-
security/integrity/ima/ima_crypto.c | 2 +-
40 files changed, 71 insertions(+), 71 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index df8cba632ec2..873c7b4d14ec 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1221,7 +1221,7 @@ static struct request *blk_mq_map_request(struct request_queue *q,
ctx = blk_mq_get_ctx(q);
hctx = q->mq_ops->map_queue(q, ctx->cpu);
blk_mq_set_alloc_data(&alloc_data, q,
- __GFP_WAIT|__GFP_HIGH, false, ctx, hctx);
+ __GFP_RECLAIM|__GFP_HIGH, false, ctx, hctx);
rq = __blk_mq_alloc_request(&alloc_data, rw);
ctx = alloc_data.ctx;
hctx = alloc_data.hctx;
diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
index dda653ce7b24..0774799942e0 100644
--- a/block/scsi_ioctl.c
+++ b/block/scsi_ioctl.c
@@ -444,7 +444,7 @@ int sg_scsi_ioctl(struct request_queue *q, struct gendisk *disk, fmode_t mode,

}

- rq = blk_get_request(q, in_len ? WRITE : READ, __GFP_WAIT);
+ rq = blk_get_request(q, in_len ? WRITE : READ, __GFP_RECLAIM);
if (IS_ERR(rq)) {
err = PTR_ERR(rq);
goto error_free_buffer;
@@ -495,7 +495,7 @@ int sg_scsi_ioctl(struct request_queue *q, struct gendisk *disk, fmode_t mode,
break;
}

- if (bytes && blk_rq_map_kern(q, rq, buffer, bytes, __GFP_WAIT)) {
+ if (bytes && blk_rq_map_kern(q, rq, buffer, bytes, __GFP_RECLAIM)) {
err = DRIVER_ERROR << 24;
goto error;
}
@@ -536,7 +536,7 @@ static int __blk_send_generic(struct request_queue *q, struct gendisk *bd_disk,
struct request *rq;
int err;

- rq = blk_get_request(q, WRITE, __GFP_WAIT);
+ rq = blk_get_request(q, WRITE, __GFP_RECLAIM);
if (IS_ERR(rq))
return PTR_ERR(rq);
blk_rq_set_block_pc(rq);
diff --git a/drivers/block/drbd/drbd_bitmap.c b/drivers/block/drbd/drbd_bitmap.c
index 434c77dcc99e..2940da0011e0 100644
--- a/drivers/block/drbd/drbd_bitmap.c
+++ b/drivers/block/drbd/drbd_bitmap.c
@@ -1016,7 +1016,7 @@ static void bm_page_io_async(struct drbd_bm_aio_ctx *ctx, int page_nr) __must_ho
bm_set_page_unchanged(b->bm_pages[page_nr]);

if (ctx->flags & BM_AIO_COPY_PAGES) {
- page = mempool_alloc(drbd_md_io_page_pool, __GFP_HIGHMEM|__GFP_WAIT);
+ page = mempool_alloc(drbd_md_io_page_pool, __GFP_HIGHMEM|__GFP_RECLAIM);
copy_highpage(page, b->bm_pages[page_nr]);
bm_store_page_idx(page, page_nr);
} else
diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index c097909c589c..1d2046e68808 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -357,7 +357,7 @@ drbd_alloc_peer_req(struct drbd_peer_device *peer_device, u64 id, sector_t secto
}

if (has_payload && data_size) {
- page = drbd_alloc_pages(peer_device, nr_pages, (gfp_mask & __GFP_WAIT));
+ page = drbd_alloc_pages(peer_device, nr_pages, (gfp_mask & __GFP_RECLAIM));
if (!page)
goto fail;
}
diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c
index 4a2ef09e6704..a694b23cb8f9 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -173,7 +173,7 @@ static struct mtip_cmd *mtip_get_int_command(struct driver_data *dd)
{
struct request *rq;

- rq = blk_mq_alloc_request(dd->queue, 0, __GFP_WAIT, true);
+ rq = blk_mq_alloc_request(dd->queue, 0, __GFP_RECLAIM, true);
return blk_mq_rq_to_pdu(rq);
}

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index 7920c2741b47..0a8b1682305f 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -1033,11 +1033,11 @@ int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
req->special = (void *)0;

if (buffer && bufflen) {
- ret = blk_rq_map_kern(q, req, buffer, bufflen, __GFP_WAIT);
+ ret = blk_rq_map_kern(q, req, buffer, bufflen, __GFP_RECLAIM);
if (ret)
goto out;
} else if (ubuffer && bufflen) {
- ret = blk_rq_map_user(q, req, NULL, ubuffer, bufflen, __GFP_WAIT);
+ ret = blk_rq_map_user(q, req, NULL, ubuffer, bufflen, __GFP_RECLAIM);
if (ret)
goto out;
bio = req->bio;
diff --git a/drivers/block/paride/pd.c b/drivers/block/paride/pd.c
index b9242d78283d..562b5a4ca7b7 100644
--- a/drivers/block/paride/pd.c
+++ b/drivers/block/paride/pd.c
@@ -723,7 +723,7 @@ static int pd_special_command(struct pd_unit *disk,
struct request *rq;
int err = 0;

- rq = blk_get_request(disk->gd->queue, READ, __GFP_WAIT);
+ rq = blk_get_request(disk->gd->queue, READ, __GFP_RECLAIM);
if (IS_ERR(rq))
return PTR_ERR(rq);

diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 4c20c228184c..e372a5f08847 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -704,14 +704,14 @@ static int pkt_generic_packet(struct pktcdvd_device *pd, struct packet_command *
int ret = 0;

rq = blk_get_request(q, (cgc->data_direction == CGC_DATA_WRITE) ?
- WRITE : READ, __GFP_WAIT);
+ WRITE : READ, __GFP_RECLAIM);
if (IS_ERR(rq))
return PTR_ERR(rq);
blk_rq_set_block_pc(rq);

if (cgc->buflen) {
ret = blk_rq_map_kern(q, rq, cgc->buffer, cgc->buflen,
- __GFP_WAIT);
+ __GFP_RECLAIM);
if (ret)
goto out;
}
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index c2b45081c5ab..2ca8638c5b81 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2226,7 +2226,7 @@ i915_gem_object_get_pages_gtt(struct drm_i915_gem_object *obj)
mapping = file_inode(obj->base.filp)->i_mapping;
gfp = mapping_gfp_mask(mapping);
gfp |= __GFP_NORETRY | __GFP_NOWARN;
- gfp &= ~(__GFP_IO | __GFP_WAIT);
+ gfp &= ~(__GFP_IO | __GFP_RECLAIM);
sg = st->sgl;
st->nents = 0;
for (i = 0; i < page_count; i++) {
diff --git a/drivers/ide/ide-atapi.c b/drivers/ide/ide-atapi.c
index 1362ad80a76c..05352f490d60 100644
--- a/drivers/ide/ide-atapi.c
+++ b/drivers/ide/ide-atapi.c
@@ -92,7 +92,7 @@ int ide_queue_pc_tail(ide_drive_t *drive, struct gendisk *disk,
struct request *rq;
int error;

- rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
+ rq = blk_get_request(drive->queue, READ, __GFP_RECLAIM);
rq->cmd_type = REQ_TYPE_DRV_PRIV;
rq->special = (char *)pc;

diff --git a/drivers/ide/ide-cd.c b/drivers/ide/ide-cd.c
index 64a6b827b3dd..ef907fd5ba98 100644
--- a/drivers/ide/ide-cd.c
+++ b/drivers/ide/ide-cd.c
@@ -441,7 +441,7 @@ int ide_cd_queue_pc(ide_drive_t *drive, const unsigned char *cmd,
struct request *rq;
int error;

- rq = blk_get_request(drive->queue, write, __GFP_WAIT);
+ rq = blk_get_request(drive->queue, write, __GFP_RECLAIM);

memcpy(rq->cmd, cmd, BLK_MAX_CDB);
rq->cmd_type = REQ_TYPE_ATA_PC;
diff --git a/drivers/ide/ide-cd_ioctl.c b/drivers/ide/ide-cd_ioctl.c
index 066e39036518..474173eb31bb 100644
--- a/drivers/ide/ide-cd_ioctl.c
+++ b/drivers/ide/ide-cd_ioctl.c
@@ -303,7 +303,7 @@ int ide_cdrom_reset(struct cdrom_device_info *cdi)
struct request *rq;
int ret;

- rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
+ rq = blk_get_request(drive->queue, READ, __GFP_RECLAIM);
rq->cmd_type = REQ_TYPE_DRV_PRIV;
rq->cmd_flags = REQ_QUIET;
ret = blk_execute_rq(drive->queue, cd->disk, rq, 0);
diff --git a/drivers/ide/ide-devsets.c b/drivers/ide/ide-devsets.c
index b05a74d78ef5..0dd43b4fcec6 100644
--- a/drivers/ide/ide-devsets.c
+++ b/drivers/ide/ide-devsets.c
@@ -165,7 +165,7 @@ int ide_devset_execute(ide_drive_t *drive, const struct ide_devset *setting,
if (!(setting->flags & DS_SYNC))
return setting->set(drive, arg);

- rq = blk_get_request(q, READ, __GFP_WAIT);
+ rq = blk_get_request(q, READ, __GFP_RECLAIM);
rq->cmd_type = REQ_TYPE_DRV_PRIV;
rq->cmd_len = 5;
rq->cmd[0] = REQ_DEVSET_EXEC;
diff --git a/drivers/ide/ide-disk.c b/drivers/ide/ide-disk.c
index 56b9708894a5..37a8a907febe 100644
--- a/drivers/ide/ide-disk.c
+++ b/drivers/ide/ide-disk.c
@@ -477,7 +477,7 @@ static int set_multcount(ide_drive_t *drive, int arg)
if (drive->special_flags & IDE_SFLAG_SET_MULTMODE)
return -EBUSY;

- rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
+ rq = blk_get_request(drive->queue, READ, __GFP_RECLAIM);
rq->cmd_type = REQ_TYPE_ATA_TASKFILE;

drive->mult_req = arg;
diff --git a/drivers/ide/ide-ioctls.c b/drivers/ide/ide-ioctls.c
index aa2e9b77b20d..d05db2469209 100644
--- a/drivers/ide/ide-ioctls.c
+++ b/drivers/ide/ide-ioctls.c
@@ -125,7 +125,7 @@ static int ide_cmd_ioctl(ide_drive_t *drive, unsigned long arg)
if (NULL == (void *) arg) {
struct request *rq;

- rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
+ rq = blk_get_request(drive->queue, READ, __GFP_RECLAIM);
rq->cmd_type = REQ_TYPE_ATA_TASKFILE;
err = blk_execute_rq(drive->queue, NULL, rq, 0);
blk_put_request(rq);
@@ -221,7 +221,7 @@ static int generic_drive_reset(ide_drive_t *drive)
struct request *rq;
int ret = 0;

- rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
+ rq = blk_get_request(drive->queue, READ, __GFP_RECLAIM);
rq->cmd_type = REQ_TYPE_DRV_PRIV;
rq->cmd_len = 1;
rq->cmd[0] = REQ_DRIVE_RESET;
diff --git a/drivers/ide/ide-park.c b/drivers/ide/ide-park.c
index c80868520488..2d7dca56dd24 100644
--- a/drivers/ide/ide-park.c
+++ b/drivers/ide/ide-park.c
@@ -31,7 +31,7 @@ static void issue_park_cmd(ide_drive_t *drive, unsigned long timeout)
}
spin_unlock_irq(&hwif->lock);

- rq = blk_get_request(q, READ, __GFP_WAIT);
+ rq = blk_get_request(q, READ, __GFP_RECLAIM);
rq->cmd[0] = REQ_PARK_HEADS;
rq->cmd_len = 1;
rq->cmd_type = REQ_TYPE_DRV_PRIV;
diff --git a/drivers/ide/ide-pm.c b/drivers/ide/ide-pm.c
index 081e43458d50..e34af488693a 100644
--- a/drivers/ide/ide-pm.c
+++ b/drivers/ide/ide-pm.c
@@ -18,7 +18,7 @@ int generic_ide_suspend(struct device *dev, pm_message_t mesg)
}

memset(&rqpm, 0, sizeof(rqpm));
- rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
+ rq = blk_get_request(drive->queue, READ, __GFP_RECLAIM);
rq->cmd_type = REQ_TYPE_ATA_PM_SUSPEND;
rq->special = &rqpm;
rqpm.pm_step = IDE_PM_START_SUSPEND;
@@ -88,7 +88,7 @@ int generic_ide_resume(struct device *dev)
}

memset(&rqpm, 0, sizeof(rqpm));
- rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
+ rq = blk_get_request(drive->queue, READ, __GFP_RECLAIM);
rq->cmd_type = REQ_TYPE_ATA_PM_RESUME;
rq->cmd_flags |= REQ_PREEMPT;
rq->special = &rqpm;
diff --git a/drivers/ide/ide-tape.c b/drivers/ide/ide-tape.c
index f5d51d1d09ee..12fa04997dcc 100644
--- a/drivers/ide/ide-tape.c
+++ b/drivers/ide/ide-tape.c
@@ -852,7 +852,7 @@ static int idetape_queue_rw_tail(ide_drive_t *drive, int cmd, int size)
BUG_ON(cmd != REQ_IDETAPE_READ && cmd != REQ_IDETAPE_WRITE);
BUG_ON(size < 0 || size % tape->blk_size);

- rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
+ rq = blk_get_request(drive->queue, READ, __GFP_RECLAIM);
rq->cmd_type = REQ_TYPE_DRV_PRIV;
rq->cmd[13] = cmd;
rq->rq_disk = tape->disk;
@@ -860,7 +860,7 @@ static int idetape_queue_rw_tail(ide_drive_t *drive, int cmd, int size)

if (size) {
ret = blk_rq_map_kern(drive->queue, rq, tape->buf, size,
- __GFP_WAIT);
+ __GFP_RECLAIM);
if (ret)
goto out_put;
}
diff --git a/drivers/ide/ide-taskfile.c b/drivers/ide/ide-taskfile.c
index 0979e126fff1..a716693417a3 100644
--- a/drivers/ide/ide-taskfile.c
+++ b/drivers/ide/ide-taskfile.c
@@ -430,7 +430,7 @@ int ide_raw_taskfile(ide_drive_t *drive, struct ide_cmd *cmd, u8 *buf,
int error;
int rw = !(cmd->tf_flags & IDE_TFLAG_WRITE) ? READ : WRITE;

- rq = blk_get_request(drive->queue, rw, __GFP_WAIT);
+ rq = blk_get_request(drive->queue, rw, __GFP_RECLAIM);
rq->cmd_type = REQ_TYPE_ATA_TASKFILE;

/*
@@ -441,7 +441,7 @@ int ide_raw_taskfile(ide_drive_t *drive, struct ide_cmd *cmd, u8 *buf,
*/
if (nsect) {
error = blk_rq_map_kern(drive->queue, rq, buf,
- nsect * SECTOR_SIZE, __GFP_WAIT);
+ nsect * SECTOR_SIZE, __GFP_RECLAIM);
if (error)
goto put_req;
}
diff --git a/drivers/infiniband/hw/ipath/ipath_file_ops.c b/drivers/infiniband/hw/ipath/ipath_file_ops.c
index 450d15965005..c11f6c58ce53 100644
--- a/drivers/infiniband/hw/ipath/ipath_file_ops.c
+++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c
@@ -905,7 +905,7 @@ static int ipath_create_user_egr(struct ipath_portdata *pd)
* heavy filesystem activity makes these fail, and we can
* use compound pages.
*/
- gfp_flags = __GFP_WAIT | __GFP_IO | __GFP_COMP;
+ gfp_flags = __GFP_RECLAIM | __GFP_IO | __GFP_COMP;

egrcnt = dd->ipath_rcvegrcnt;
/* TID number offset for this port */
diff --git a/drivers/infiniband/hw/qib/qib_init.c b/drivers/infiniband/hw/qib/qib_init.c
index 7e00470adc30..4ff340fe904f 100644
--- a/drivers/infiniband/hw/qib/qib_init.c
+++ b/drivers/infiniband/hw/qib/qib_init.c
@@ -1680,7 +1680,7 @@ int qib_setup_eagerbufs(struct qib_ctxtdata *rcd)
* heavy filesystem activity makes these fail, and we can
* use compound pages.
*/
- gfp_flags = __GFP_WAIT | __GFP_IO | __GFP_COMP;
+ gfp_flags = __GFP_RECLAIM | __GFP_IO | __GFP_COMP;

egrcnt = rcd->rcvegrcnt;
egroff = rcd->rcvegr_tid_base;
diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c
index 191617492181..5a312958c094 100644
--- a/drivers/misc/vmw_balloon.c
+++ b/drivers/misc/vmw_balloon.c
@@ -85,7 +85,7 @@ MODULE_LICENSE("GPL");

/*
* Use __GFP_HIGHMEM to allow pages from HIGHMEM zone. We don't
- * allow wait (__GFP_WAIT) for NOSLEEP page allocations. Use
+ * allow wait (__GFP_RECLAIM) for NOSLEEP page allocations. Use
* __GFP_NOWARN, to suppress page allocation failure warnings.
*/
#define VMW_PAGE_ALLOC_NOSLEEP (__GFP_HIGHMEM|__GFP_NOWARN)
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index cfadccef045c..26416e21295d 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -1961,7 +1961,7 @@ static void scsi_eh_lock_door(struct scsi_device *sdev)
struct request *req;

/*
- * blk_get_request with GFP_KERNEL (__GFP_WAIT) sleeps until a
+ * blk_get_request with GFP_KERNEL (__GFP_RECLAIM) sleeps until a
* request becomes available
*/
req = blk_get_request(sdev->request_queue, READ, GFP_KERNEL);
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 448ebdaa3d69..2396259b682b 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -221,13 +221,13 @@ int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
int write = (data_direction == DMA_TO_DEVICE);
int ret = DRIVER_ERROR << 24;

- req = blk_get_request(sdev->request_queue, write, __GFP_WAIT);
+ req = blk_get_request(sdev->request_queue, write, __GFP_RECLAIM);
if (IS_ERR(req))
return ret;
blk_rq_set_block_pc(req);

if (bufflen && blk_rq_map_kern(sdev->request_queue, req,
- buffer, bufflen, __GFP_WAIT))
+ buffer, bufflen, __GFP_RECLAIM))
goto out;

req->cmd_len = COMMAND_SIZE(cmd[0]);
diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs_private.h b/drivers/staging/lustre/include/linux/libcfs/libcfs_private.h
index ed37d26eb20d..393270436a4b 100644
--- a/drivers/staging/lustre/include/linux/libcfs/libcfs_private.h
+++ b/drivers/staging/lustre/include/linux/libcfs/libcfs_private.h
@@ -113,7 +113,7 @@ do { \
do { \
LASSERT(!in_interrupt() || \
((size) <= LIBCFS_VMALLOC_SIZE && \
- ((mask) & __GFP_WAIT) == 0)); \
+ ((mask) & __GFP_RECLAIM) == 0)); \
} while (0)

#define LIBCFS_ALLOC_POST(ptr, size) \
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 35660da77921..92e284d0362e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -718,7 +718,7 @@ int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
if (start > end)
goto out;
spin_unlock(&tree->lock);
- if (mask & __GFP_WAIT)
+ if (mask & __GFP_RECLAIM)
cond_resched();
goto again;
}
@@ -1028,7 +1028,7 @@ __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
if (start > end)
goto out;
spin_unlock(&tree->lock);
- if (mask & __GFP_WAIT)
+ if (mask & __GFP_RECLAIM)
cond_resched();
goto again;
}
@@ -1253,7 +1253,7 @@ int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
if (start > end)
goto out;
spin_unlock(&tree->lock);
- if (mask & __GFP_WAIT)
+ if (mask & __GFP_RECLAIM)
cond_resched();
first_iteration = false;
goto again;
diff --git a/fs/cachefiles/internal.h b/fs/cachefiles/internal.h
index aecd0859eacb..9c4b737a54df 100644
--- a/fs/cachefiles/internal.h
+++ b/fs/cachefiles/internal.h
@@ -30,7 +30,7 @@ extern unsigned cachefiles_debug;
#define CACHEFILES_DEBUG_KLEAVE 2
#define CACHEFILES_DEBUG_KDEBUG 4

-#define cachefiles_gfp (__GFP_WAIT | __GFP_NORETRY | __GFP_NOMEMALLOC)
+#define cachefiles_gfp (__GFP_RECLAIM | __GFP_NORETRY | __GFP_NOMEMALLOC)

/*
* node records
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 745d2342651a..b97cf506a20e 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -360,7 +360,7 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,

/*
* bio_alloc() is guaranteed to return a bio when called with
- * __GFP_WAIT and we request a valid number of vectors.
+ * __GFP_RECLAIM and we request a valid number of vectors.
*/
bio = bio_alloc(GFP_KERNEL, nr_vecs);

diff --git a/fs/nilfs2/mdt.h b/fs/nilfs2/mdt.h
index fe529a87a208..03246cac3338 100644
--- a/fs/nilfs2/mdt.h
+++ b/fs/nilfs2/mdt.h
@@ -72,7 +72,7 @@ static inline struct nilfs_mdt_info *NILFS_MDT(const struct inode *inode)
}

/* Default GFP flags using highmem */
-#define NILFS_MDT_GFP (__GFP_WAIT | __GFP_IO | __GFP_HIGHMEM)
+#define NILFS_MDT_GFP (__GFP_RECLAIM | __GFP_IO | __GFP_HIGHMEM)

int nilfs_mdt_get_block(struct inode *, unsigned long, int,
void (*init_block)(struct inode *,
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index dbd246a14e2f..e066f3afae73 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -104,7 +104,7 @@ struct vm_area_struct;
* can be cleared when the reclaiming of pages would cause unnecessary
* disruption.
*/
-#define __GFP_WAIT (__GFP_DIRECT_RECLAIM|__GFP_KSWAPD_RECLAIM)
+#define __GFP_RECLAIM (__GFP_DIRECT_RECLAIM|__GFP_KSWAPD_RECLAIM)
#define __GFP_DIRECT_RECLAIM ((__force gfp_t)___GFP_DIRECT_RECLAIM) /* Caller can reclaim */
#define __GFP_KSWAPD_RECLAIM ((__force gfp_t)___GFP_KSWAPD_RECLAIM) /* kswapd can wake */

@@ -123,12 +123,12 @@ struct vm_area_struct;
*/
#define GFP_ATOMIC (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
#define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM)
-#define GFP_NOIO (__GFP_WAIT)
-#define GFP_NOFS (__GFP_WAIT | __GFP_IO)
-#define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS)
-#define GFP_TEMPORARY (__GFP_WAIT | __GFP_IO | __GFP_FS | \
+#define GFP_NOIO (__GFP_RECLAIM)
+#define GFP_NOFS (__GFP_RECLAIM | __GFP_IO)
+#define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
+#define GFP_TEMPORARY (__GFP_RECLAIM | __GFP_IO | __GFP_FS | \
__GFP_RECLAIMABLE)
-#define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
+#define GFP_USER (__GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
#define GFP_HIGHUSER (GFP_USER | __GFP_HIGHMEM)
#define GFP_HIGHUSER_MOVABLE (GFP_HIGHUSER | __GFP_MOVABLE)
#define GFP_IOFS (__GFP_IO | __GFP_FS)
@@ -141,12 +141,12 @@ struct vm_area_struct;
#define GFP_MOVABLE_SHIFT 3

/* Control page allocator reclaim behavior */
-#define GFP_RECLAIM_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS|\
+#define GFP_RECLAIM_MASK (__GFP_RECLAIM|__GFP_HIGH|__GFP_IO|__GFP_FS|\
__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\
__GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC)

/* Control slab gfp mask during early boot */
-#define GFP_BOOT_MASK (__GFP_BITS_MASK & ~(__GFP_WAIT|__GFP_IO|__GFP_FS))
+#define GFP_BOOT_MASK (__GFP_BITS_MASK & ~(__GFP_RECLAIM|__GFP_IO|__GFP_FS))

/* Control allocation constraints */
#define GFP_CONSTRAINT_MASK (__GFP_HARDWALL|__GFP_THISNODE)
diff --git a/kernel/power/swap.c b/kernel/power/swap.c
index 2f30ca91e4fa..3841af470cf9 100644
--- a/kernel/power/swap.c
+++ b/kernel/power/swap.c
@@ -261,7 +261,7 @@ static int hib_submit_io(int rw, pgoff_t page_off, void *addr,
struct bio *bio;
int error = 0;

- bio = bio_alloc(__GFP_WAIT | __GFP_HIGH, 1);
+ bio = bio_alloc(__GFP_RECLAIM | __GFP_HIGH, 1);
bio->bi_iter.bi_sector = page_off * (PAGE_SIZE >> 9);
bio->bi_bdev = hib_resume_bdev;

@@ -360,7 +360,7 @@ static int write_page(void *buf, sector_t offset, struct hib_bio_batch *hb)
return -ENOSPC;

if (hb) {
- src = (void *)__get_free_page(__GFP_WAIT | __GFP_NOWARN |
+ src = (void *)__get_free_page(__GFP_RECLAIM | __GFP_NOWARN |
__GFP_NORETRY);
if (src) {
copy_page(src, buf);
@@ -368,7 +368,7 @@ static int write_page(void *buf, sector_t offset, struct hib_bio_batch *hb)
ret = hib_wait_io(hb); /* Free pages */
if (ret)
return ret;
- src = (void *)__get_free_page(__GFP_WAIT |
+ src = (void *)__get_free_page(__GFP_RECLAIM |
__GFP_NOWARN |
__GFP_NORETRY);
if (src) {
@@ -676,7 +676,7 @@ static int save_image_lzo(struct swap_map_handle *handle,
nr_threads = num_online_cpus() - 1;
nr_threads = clamp_val(nr_threads, 1, LZO_THREADS);

- page = (void *)__get_free_page(__GFP_WAIT | __GFP_HIGH);
+ page = (void *)__get_free_page(__GFP_RECLAIM | __GFP_HIGH);
if (!page) {
printk(KERN_ERR "PM: Failed to allocate LZO page\n");
ret = -ENOMEM;
@@ -979,7 +979,7 @@ static int get_swap_reader(struct swap_map_handle *handle,
last = tmp;

tmp->map = (struct swap_map_page *)
- __get_free_page(__GFP_WAIT | __GFP_HIGH);
+ __get_free_page(__GFP_RECLAIM | __GFP_HIGH);
if (!tmp->map) {
release_swap_reader(handle);
return -ENOMEM;
@@ -1246,8 +1246,8 @@ static int load_image_lzo(struct swap_map_handle *handle,

for (i = 0; i < read_pages; i++) {
page[i] = (void *)__get_free_page(i < LZO_CMP_PAGES ?
- __GFP_WAIT | __GFP_HIGH :
- __GFP_WAIT | __GFP_NOWARN |
+ __GFP_RECLAIM | __GFP_HIGH :
+ __GFP_RECLAIM | __GFP_NOWARN |
__GFP_NORETRY);

if (!page[i]) {
diff --git a/lib/percpu_ida.c b/lib/percpu_ida.c
index f75715131f20..6d40944960de 100644
--- a/lib/percpu_ida.c
+++ b/lib/percpu_ida.c
@@ -135,7 +135,7 @@ static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
* TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, of course).
*
* @gfp indicates whether or not to wait until a free id is available (it's not
- * used for internal memory allocations); thus if passed __GFP_WAIT we may sleep
+ * used for internal memory allocations); thus if passed __GFP_RECLAIM we may sleep
* however long it takes until another thread frees an id (same semantics as a
* mempool).
*
diff --git a/mm/failslab.c b/mm/failslab.c
index fefaabaab76d..69f083146a37 100644
--- a/mm/failslab.c
+++ b/mm/failslab.c
@@ -3,11 +3,11 @@

static struct {
struct fault_attr attr;
- u32 ignore_gfp_wait;
+ u32 ignore_gfp_reclaim;
int cache_filter;
} failslab = {
.attr = FAULT_ATTR_INITIALIZER,
- .ignore_gfp_wait = 1,
+ .ignore_gfp_reclaim = 1,
.cache_filter = 0,
};

@@ -16,7 +16,7 @@ bool should_failslab(size_t size, gfp_t gfpflags, unsigned long cache_flags)
if (gfpflags & __GFP_NOFAIL)
return false;

- if (failslab.ignore_gfp_wait && (gfpflags & __GFP_WAIT))
+ if (failslab.ignore_gfp_reclaim && (gfpflags & __GFP_RECLAIM))
return false;

if (failslab.cache_filter && !(cache_flags & SLAB_FAILSLAB))
@@ -42,7 +42,7 @@ static int __init failslab_debugfs_init(void)
return PTR_ERR(dir);

if (!debugfs_create_bool("ignore-gfp-wait", mode, dir,
- &failslab.ignore_gfp_wait))
+ &failslab.ignore_gfp_reclaim))
goto fail;
if (!debugfs_create_bool("cache-filter", mode, dir,
&failslab.cache_filter))
diff --git a/mm/filemap.c b/mm/filemap.c
index 1283fc825458..986fe45a5d27 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2673,7 +2673,7 @@ EXPORT_SYMBOL(generic_file_write_iter);
* page is known to the local caching routines.
*
* The @gfp_mask argument specifies whether I/O may be performed to release
- * this page (__GFP_IO), and whether the call may block (__GFP_WAIT & __GFP_FS).
+ * this page (__GFP_IO), and whether the call may block (__GFP_RECLAIM & __GFP_FS).
*
*/
int try_to_release_page(struct page *page, gfp_t gfp_mask)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c107094f79ba..f563473b5e99 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -767,7 +767,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,

static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
{
- return (GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT)) | extra_gfp;
+ return (GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_RECLAIM)) | extra_gfp;
}

/* Caller must hold page table lock. */
diff --git a/mm/migrate.c b/mm/migrate.c
index ee401e4e5ef1..e92b55868c6d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1734,7 +1734,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
goto out_dropref;

new_page = alloc_pages_node(node,
- (GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_WAIT,
+ (GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
HPAGE_PMD_ORDER);
if (!new_page)
goto out_fail;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ccd235d02923..17064a3f4909 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2120,11 +2120,11 @@ static struct {
struct fault_attr attr;

u32 ignore_gfp_highmem;
- u32 ignore_gfp_wait;
+ u32 ignore_gfp_reclaim;
u32 min_order;
} fail_page_alloc = {
.attr = FAULT_ATTR_INITIALIZER,
- .ignore_gfp_wait = 1,
+ .ignore_gfp_reclaim = 1,
.ignore_gfp_highmem = 1,
.min_order = 1,
};
@@ -2143,7 +2143,7 @@ static bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
return false;
if (fail_page_alloc.ignore_gfp_highmem && (gfp_mask & __GFP_HIGHMEM))
return false;
- if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
+ if (fail_page_alloc.ignore_gfp_reclaim && (gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
return false;

return should_fail(&fail_page_alloc.attr, 1 << order);
@@ -2162,7 +2162,7 @@ static int __init fail_page_alloc_debugfs(void)
return PTR_ERR(dir);

if (!debugfs_create_bool("ignore-gfp-wait", mode, dir,
- &fail_page_alloc.ignore_gfp_wait))
+ &fail_page_alloc.ignore_gfp_reclaim))
goto fail;
if (!debugfs_create_bool("ignore-gfp-highmem", mode, dir,
&fail_page_alloc.ignore_gfp_highmem))
@@ -2459,7 +2459,7 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
if (test_thread_flag(TIF_MEMDIE) ||
(current->flags & (PF_MEMALLOC | PF_EXITING)))
filter &= ~SHOW_MEM_FILTER_NODES;
- if (in_interrupt() || !(gfp_mask & __GFP_WAIT) || (gfp_mask & __GFP_ATOMIC))
+ if (in_interrupt() || !(gfp_mask & __GFP_RECLAIM) || (gfp_mask & __GFP_ATOMIC))
filter &= ~SHOW_MEM_FILTER_NODES;

if (fmt) {
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index d8e2e3918ce2..4bee2392dbb2 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2061,7 +2061,7 @@ int netlink_broadcast_filtered(struct sock *ssk, struct sk_buff *skb, u32 portid
consume_skb(info.skb2);

if (info.delivered) {
- if (info.congested && (allocation & __GFP_WAIT))
+ if (info.congested && (allocation & __GFP_RECLAIM))
yield();
return 0;
}
diff --git a/net/rxrpc/ar-connection.c b/net/rxrpc/ar-connection.c
index 6631f4f1e39b..b5cd65401a28 100644
--- a/net/rxrpc/ar-connection.c
+++ b/net/rxrpc/ar-connection.c
@@ -500,7 +500,7 @@ int rxrpc_connect_call(struct rxrpc_sock *rx,
if (bundle->num_conns >= 20) {
_debug("too many conns");

- if (!(gfp & __GFP_WAIT)) {
+ if (!(gfp & __GFP_RECLAIM)) {
_leave(" = -EAGAIN");
return -EAGAIN;
}
diff --git a/security/integrity/ima/ima_crypto.c b/security/integrity/ima/ima_crypto.c
index e24121afb2f2..6eb62936c672 100644
--- a/security/integrity/ima/ima_crypto.c
+++ b/security/integrity/ima/ima_crypto.c
@@ -126,7 +126,7 @@ static void *ima_alloc_pages(loff_t max_size, size_t *allocated_size,
{
void *ptr;
int order = ima_maxorder;
- gfp_t gfp_mask = __GFP_WAIT | __GFP_NOWARN | __GFP_NORETRY;
+ gfp_t gfp_mask = __GFP_RECLAIM | __GFP_NOWARN | __GFP_NORETRY;

if (order)
order = min(get_order(max_size), order);
--
2.4.6

2015-08-12 10:46:39

by Mel Gorman

Subject: [PATCH 08/10] mm, page_alloc: Remove MIGRATE_RESERVE

MIGRATE_RESERVE preserves an old property of the buddy allocator that existed
prior to fragmentation avoidance -- min_free_kbytes worth of pages tended to
remain contiguous until the only alternative was to fail the allocation. At the
time it was discovered that high-order atomic allocations relied on this
property so MIGRATE_RESERVE was introduced. A later patch will introduce
an alternative, MIGRATE_HIGHATOMIC, so this patch deletes MIGRATE_RESERVE
and its supporting code to make the series easier to review. Note that this patch
in isolation may look like a false regression if someone was bisecting
high-order atomic allocation failures.

Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
include/linux/mmzone.h | 10 +---
mm/huge_memory.c | 2 +-
mm/page_alloc.c | 148 +++----------------------------------------------
mm/vmstat.c | 1 -
4 files changed, 11 insertions(+), 150 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 79a0d033a2f3..874755ca0abc 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -39,8 +39,6 @@ enum {
MIGRATE_UNMOVABLE,
MIGRATE_MOVABLE,
MIGRATE_RECLAIMABLE,
- MIGRATE_PCPTYPES, /* the number of types on the pcp lists */
- MIGRATE_RESERVE = MIGRATE_PCPTYPES,
#ifdef CONFIG_CMA
/*
* MIGRATE_CMA migration type is designed to mimic the way
@@ -63,6 +61,8 @@ enum {
MIGRATE_TYPES
};

+#define MIGRATE_PCPTYPES (MIGRATE_RECLAIMABLE+1)
+
#ifdef CONFIG_CMA
# define is_migrate_cma(migratetype) unlikely((migratetype) == MIGRATE_CMA)
#else
@@ -425,12 +425,6 @@ struct zone {

const char *name;

- /*
- * Number of MIGRATE_RESERVE page block. To maintain for just
- * optimization. Protected by zone->lock.
- */
- int nr_migrate_reserve_block;
-
#ifdef CONFIG_MEMORY_ISOLATION
/*
* Number of isolated pageblock. It is used to solve incorrect
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f563473b5e99..ccf998194ae1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -113,7 +113,7 @@ static int set_recommended_min_free_kbytes(void)
for_each_populated_zone(zone)
nr_zones++;

- /* Make sure at least 2 hugepages are free for MIGRATE_RESERVE */
+ /* Ensure 2 pageblocks are free to assist fragmentation avoidance */
recommended_min = pageblock_nr_pages * nr_zones * 2;

/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 17064a3f4909..4708dadeaadf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -793,7 +793,6 @@ static void free_pcppages_bulk(struct zone *zone, int count,
if (unlikely(has_isolate_pageblock(zone)))
mt = get_pageblock_migratetype(page);

- /* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
__free_one_page(page, page_to_pfn(page), zone, 0, mt);
trace_mm_page_pcpu_drain(page, 0, mt);
} while (--to_free && --batch_free && !list_empty(list));
@@ -1375,15 +1374,14 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
* the free lists for the desirable migrate type are depleted
*/
static int fallbacks[MIGRATE_TYPES][4] = {
- [MIGRATE_UNMOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE, MIGRATE_RESERVE },
- [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_RESERVE },
- [MIGRATE_MOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_RESERVE },
+ [MIGRATE_UNMOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE, MIGRATE_TYPES },
+ [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_TYPES },
+ [MIGRATE_MOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_TYPES },
#ifdef CONFIG_CMA
- [MIGRATE_CMA] = { MIGRATE_RESERVE }, /* Never used */
+ [MIGRATE_CMA] = { MIGRATE_TYPES }, /* Never used */
#endif
- [MIGRATE_RESERVE] = { MIGRATE_RESERVE }, /* Never used */
#ifdef CONFIG_MEMORY_ISOLATION
- [MIGRATE_ISOLATE] = { MIGRATE_RESERVE }, /* Never used */
+ [MIGRATE_ISOLATE] = { MIGRATE_TYPES }, /* Never used */
#endif
};

@@ -1557,7 +1555,7 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
*can_steal = false;
for (i = 0;; i++) {
fallback_mt = fallbacks[migratetype][i];
- if (fallback_mt == MIGRATE_RESERVE)
+ if (fallback_mt == MIGRATE_TYPES)
break;

if (list_empty(&area->free_list[fallback_mt]))
@@ -1636,25 +1634,13 @@ static struct page *__rmqueue(struct zone *zone, unsigned int order,
{
struct page *page;

-retry_reserve:
page = __rmqueue_smallest(zone, order, migratetype);
-
- if (unlikely(!page) && migratetype != MIGRATE_RESERVE) {
+ if (unlikely(!page)) {
if (migratetype == MIGRATE_MOVABLE)
page = __rmqueue_cma_fallback(zone, order);

if (!page)
page = __rmqueue_fallback(zone, order, migratetype);
-
- /*
- * Use MIGRATE_RESERVE rather than fail an allocation. goto
- * is used because __rmqueue_smallest is an inline function
- * and we want just one call site
- */
- if (!page) {
- migratetype = MIGRATE_RESERVE;
- goto retry_reserve;
- }
}

trace_mm_page_alloc_zone_locked(page, order, migratetype);
@@ -3443,7 +3429,6 @@ static void show_migration_types(unsigned char type)
[MIGRATE_UNMOVABLE] = 'U',
[MIGRATE_RECLAIMABLE] = 'E',
[MIGRATE_MOVABLE] = 'M',
- [MIGRATE_RESERVE] = 'R',
#ifdef CONFIG_CMA
[MIGRATE_CMA] = 'C',
#endif
@@ -4254,120 +4239,6 @@ static inline unsigned long wait_table_bits(unsigned long size)
}

/*
- * Check if a pageblock contains reserved pages
- */
-static int pageblock_is_reserved(unsigned long start_pfn, unsigned long end_pfn)
-{
- unsigned long pfn;
-
- for (pfn = start_pfn; pfn < end_pfn; pfn++) {
- if (!pfn_valid_within(pfn) || PageReserved(pfn_to_page(pfn)))
- return 1;
- }
- return 0;
-}
-
-/*
- * Mark a number of pageblocks as MIGRATE_RESERVE. The number
- * of blocks reserved is based on min_wmark_pages(zone). The memory within
- * the reserve will tend to store contiguous free pages. Setting min_free_kbytes
- * higher will lead to a bigger reserve which will get freed as contiguous
- * blocks as reclaim kicks in
- */
-static void setup_zone_migrate_reserve(struct zone *zone)
-{
- unsigned long start_pfn, pfn, end_pfn, block_end_pfn;
- struct page *page;
- unsigned long block_migratetype;
- int reserve;
- int old_reserve;
-
- /*
- * Get the start pfn, end pfn and the number of blocks to reserve
- * We have to be careful to be aligned to pageblock_nr_pages to
- * make sure that we always check pfn_valid for the first page in
- * the block.
- */
- start_pfn = zone->zone_start_pfn;
- end_pfn = zone_end_pfn(zone);
- start_pfn = roundup(start_pfn, pageblock_nr_pages);
- reserve = roundup(min_wmark_pages(zone), pageblock_nr_pages) >>
- pageblock_order;
-
- /*
- * Reserve blocks are generally in place to help high-order atomic
- * allocations that are short-lived. A min_free_kbytes value that
- * would result in more than 2 reserve blocks for atomic allocations
- * is assumed to be in place to help anti-fragmentation for the
- * future allocation of hugepages at runtime.
- */
- reserve = min(2, reserve);
- old_reserve = zone->nr_migrate_reserve_block;
-
- /* When memory hot-add, we almost always need to do nothing */
- if (reserve == old_reserve)
- return;
- zone->nr_migrate_reserve_block = reserve;
-
- for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
- if (!early_page_nid_uninitialised(pfn, zone_to_nid(zone)))
- return;
-
- if (!pfn_valid(pfn))
- continue;
- page = pfn_to_page(pfn);
-
- /* Watch out for overlapping nodes */
- if (page_to_nid(page) != zone_to_nid(zone))
- continue;
-
- block_migratetype = get_pageblock_migratetype(page);
-
- /* Only test what is necessary when the reserves are not met */
- if (reserve > 0) {
- /*
- * Blocks with reserved pages will never free, skip
- * them.
- */
- block_end_pfn = min(pfn + pageblock_nr_pages, end_pfn);
- if (pageblock_is_reserved(pfn, block_end_pfn))
- continue;
-
- /* If this block is reserved, account for it */
- if (block_migratetype == MIGRATE_RESERVE) {
- reserve--;
- continue;
- }
-
- /* Suitable for reserving if this block is movable */
- if (block_migratetype == MIGRATE_MOVABLE) {
- set_pageblock_migratetype(page,
- MIGRATE_RESERVE);
- move_freepages_block(zone, page,
- MIGRATE_RESERVE);
- reserve--;
- continue;
- }
- } else if (!old_reserve) {
- /*
- * At boot time we don't need to scan the whole zone
- * for turning off MIGRATE_RESERVE.
- */
- break;
- }
-
- /*
- * If the reserve is met and this is a previous reserved block,
- * take it back
- */
- if (block_migratetype == MIGRATE_RESERVE) {
- set_pageblock_migratetype(page, MIGRATE_MOVABLE);
- move_freepages_block(zone, page, MIGRATE_MOVABLE);
- }
- }
-}
-
-/*
* Initially all pages are reserved - free ones are freed
* up by free_all_bootmem() once the early boot process is
* done. Non-atomic initialization, single-pass.
@@ -4406,9 +4277,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
* movable at startup. This will force kernel allocations
* to reserve their blocks rather than leaking throughout
* the address space during boot when many long-lived
- * kernel allocations are made. Later some blocks near
- * the start are marked MIGRATE_RESERVE by
- * setup_zone_migrate_reserve()
+ * kernel allocations are made.
*
* bitmap is created for zone's valid pfn range. but memmap
* can be created for invalid pages (for alignment)
@@ -5958,7 +5827,6 @@ static void __setup_per_zone_wmarks(void)
high_wmark_pages(zone) - low_wmark_pages(zone) -
atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));

- setup_zone_migrate_reserve(zone);
spin_unlock_irqrestore(&zone->lock, flags);
}

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4f5cd974e11a..49963aa2dff3 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -901,7 +901,6 @@ static char * const migratetype_names[MIGRATE_TYPES] = {
"Unmovable",
"Reclaimable",
"Movable",
- "Reserve",
#ifdef CONFIG_CMA
"CMA",
#endif
--
2.4.6

2015-08-12 10:46:38

by Mel Gorman

Subject: [PATCH 09/10] mm, page_alloc: Reserve pageblocks for high-order atomic allocations on demand

High-order watermark checking exists for two reasons -- kswapd high-order
awareness and protection for high-order atomic requests. Historically the
kernel depended on MIGRATE_RESERVE to preserve min_free_kbytes as high-order
free pages for as long as possible. This patch introduces MIGRATE_HIGHATOMIC
that reserves pageblocks for high-order atomic allocations on demand and
avoids using those blocks for order-0 allocations. This is more flexible
and reliable than MIGRATE_RESERVE was.

A MIGRATE_HIGHATOMIC pageblock is created when a high-order atomic
allocation request steals a pageblock, with the total number of such
pageblocks limited to 1% of the zone.
Callers that speculatively abuse atomic allocations for long-lived
high-order allocations to access the reserve will quickly fail. Note that
SLUB is currently not such an abuser as it reclaims at least once. It is
possible that the pageblock stolen has few suitable high-order pages and
will need to steal again in the near future but there would need to be
strong justification to search all pageblocks for an ideal candidate.
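
As a rough illustration of the cap, the following minimal userspace sketch
reproduces the arithmetic (the zone size here is an assumption for the
example; the kernel derives it from zone->managed_pages and
pageblock_nr_pages):

#include <stdio.h>

#define MANAGED_PAGES		(1UL << 20)	/* assumed: 4GB zone of 4K pages */
#define PAGEBLOCK_NR_PAGES	(1UL << 9)	/* assumed: 2MB pageblocks (order-9) */

int main(void)
{
	/* Mirrors the patch's cap: roughly 1% of the zone plus one pageblock */
	unsigned long max_managed = (MANAGED_PAGES / 100) + PAGEBLOCK_NR_PAGES;

	printf("reserve cap: %lu pages (~%lu pageblocks)\n",
	       max_managed, max_managed / PAGEBLOCK_NR_PAGES);
	return 0;
}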

The pageblocks are unreserved if an allocation fails after a direct
reclaim attempt.

The watermark checks account for the reserved pageblocks when the allocation
request is not a high-order atomic allocation.

The reserved pageblocks cannot be used for order-0 allocations. This may
allow temporary wastage until a failed reclaim reassigns the pageblock. This
is deliberate as the intent of the reservation is to satisfy a limited
number of atomic high-order short-lived requests if the system requires them.
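
The fast-path behaviour can be summarised in a short sketch, condensed from
the __rmqueue hunk below (it assumes the kernel definitions introduced by
this patch and is not the literal code):

/*
 * Only high-order __GFP_ATOMIC requests may dequeue from the
 * MIGRATE_HIGHATOMIC free list. Order-0 and non-atomic requests fall
 * through to the requested migratetype and its normal fallbacks.
 */
static struct page *rmqueue_sketch(struct zone *zone, unsigned int order,
				   int migratetype, gfp_t gfp_flags)
{
	struct page *page = NULL;

	if (order > 0 && (gfp_flags & __GFP_ATOMIC))
		page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);

	if (!page)
		page = __rmqueue_smallest(zone, order, migratetype);

	return page;
}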

The stutter benchmark was used to evaluate this. While it was running, a
systemtap script randomly allocated between 1 high-order page and 12.5%
of memory's worth of order-3 pages using GFP_ATOMIC. This
is much larger than the potential reserve and it does not attempt to be
realistic. It is intended to stress random high-order allocations from
an unknown source and to show that there is a reduction in failures without
introducing an anomaly where atomic allocations are more reliable than
regular allocations. The amount of memory reserved varied throughout the
workload as reserves were created and reclaimed under memory pressure. The
allocation failures once the workload warmed up were as follows;

4.2-rc5-vanilla 70%
4.2-rc5-atomic-reserve 56%

The failure rate was also measured while building multiple kernels. The
failure rate was 14% without the patch and 6% with it applied.

Overall, this is a small reduction but the reserves are small relative to the
number of allocation requests. In early versions of the patch, the failure
rate reduced by a much larger amount but that required much larger reserves
and perversely made atomic allocations seem more reliable than regular allocations.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 6 ++-
mm/page_alloc.c | 117 ++++++++++++++++++++++++++++++++++++++++++++++---
mm/vmstat.c | 1 +
3 files changed, 116 insertions(+), 8 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 874755ca0abc..2fafab14e63a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -39,6 +39,8 @@ enum {
MIGRATE_UNMOVABLE,
MIGRATE_MOVABLE,
MIGRATE_RECLAIMABLE,
+ MIGRATE_PCPTYPES, /* the number of types on the pcp lists */
+ MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,
#ifdef CONFIG_CMA
/*
* MIGRATE_CMA migration type is designed to mimic the way
@@ -61,8 +63,6 @@ enum {
MIGRATE_TYPES
};

-#define MIGRATE_PCPTYPES (MIGRATE_RECLAIMABLE+1)
-
#ifdef CONFIG_CMA
# define is_migrate_cma(migratetype) unlikely((migratetype) == MIGRATE_CMA)
#else
@@ -330,6 +330,8 @@ struct zone {
/* zone watermarks, access with *_wmark_pages(zone) macros */
unsigned long watermark[NR_WMARK];

+ unsigned long nr_reserved_highatomic;
+
/*
* We don't know if the memory that we're going to allocate will be freeable
* or/and it will be released eventually, so to avoid totally wasting several
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4708dadeaadf..c7f78a6cd708 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1574,6 +1574,86 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
return -1;
}

+/*
+ * Reserve a pageblock for exclusive use of high-order atomic allocations if
+ * there are no empty page blocks that contain a page with a suitable order
+ */
+static void reserve_highatomic_pageblock(struct page *page, struct zone *zone,
+ unsigned int alloc_order)
+{
+ int mt = get_pageblock_migratetype(page);
+ unsigned long max_managed, flags;
+
+ if (mt == MIGRATE_HIGHATOMIC)
+ return;
+
+ /*
+ * Limit the number reserved to 1 pageblock or roughly 1% of a zone.
+ * Check is race-prone but harmless.
+ */
+ max_managed = (zone->managed_pages / 100) + pageblock_nr_pages;
+ if (zone->nr_reserved_highatomic >= max_managed)
+ return;
+
+ /* Yoink! */
+ spin_lock_irqsave(&zone->lock, flags);
+ zone->nr_reserved_highatomic += pageblock_nr_pages;
+ set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
+ move_freepages_block(zone, page, MIGRATE_HIGHATOMIC);
+ spin_unlock_irqrestore(&zone->lock, flags);
+}
+
+/*
+ * Used when an allocation is about to fail under memory pressure. This
+ * potentially hurts the reliability of high-order allocations when under
+ * intense memory pressure but failed atomic allocations should be easier
+ * to recover from than an OOM.
+ */
+static void unreserve_highatomic_pageblock(const struct alloc_context *ac)
+{
+ struct zonelist *zonelist = ac->zonelist;
+ unsigned long flags;
+ struct zoneref *z;
+ struct zone *zone;
+ struct page *page;
+ int order;
+
+ for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->high_zoneidx,
+ ac->nodemask) {
+ /* Preserve at least one pageblock */
+ if (zone->nr_reserved_highatomic <= pageblock_nr_pages)
+ continue;
+
+ spin_lock_irqsave(&zone->lock, flags);
+ for (order = 0; order < MAX_ORDER; order++) {
+ struct free_area *area = &(zone->free_area[order]);
+
+ if (list_empty(&area->free_list[MIGRATE_HIGHATOMIC]))
+ continue;
+
+ page = list_entry(area->free_list[MIGRATE_HIGHATOMIC].next,
+ struct page, lru);
+
+ zone->nr_reserved_highatomic -= pageblock_nr_pages;
+
+ /*
+ * Convert to ac->migratetype and avoid the normal
+ * pageblock stealing heuristics. Minimally, the caller
+ * is doing the work and needs the pages. More
+ * importantly, if the block was always converted to
+ * MIGRATE_UNMOVABLE or another type then the number
+ * of pageblocks that cannot be completely freed
+ * may increase.
+ */
+ set_pageblock_migratetype(page, ac->migratetype);
+ move_freepages_block(zone, page, ac->migratetype);
+ spin_unlock_irqrestore(&zone->lock, flags);
+ return;
+ }
+ spin_unlock_irqrestore(&zone->lock, flags);
+ }
+}
+
/* Remove an element from the buddy allocator from the fallback list */
static inline struct page *
__rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype)
@@ -1630,10 +1710,16 @@ __rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype)
* Call me with the zone->lock already held.
*/
static struct page *__rmqueue(struct zone *zone, unsigned int order,
- int migratetype)
+ int migratetype, gfp_t gfp_flags)
{
struct page *page;

+ if (unlikely(order && (gfp_flags & __GFP_ATOMIC))) {
+ page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
+ if (page)
+ goto out;
+ }
+
page = __rmqueue_smallest(zone, order, migratetype);
if (unlikely(!page)) {
if (migratetype == MIGRATE_MOVABLE)
@@ -1643,6 +1729,7 @@ static struct page *__rmqueue(struct zone *zone, unsigned int order,
page = __rmqueue_fallback(zone, order, migratetype);
}

+out:
trace_mm_page_alloc_zone_locked(page, order, migratetype);
return page;
}
@@ -1660,7 +1747,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,

spin_lock(&zone->lock);
for (i = 0; i < count; ++i) {
- struct page *page = __rmqueue(zone, order, migratetype);
+ struct page *page = __rmqueue(zone, order, migratetype, 0);
if (unlikely(page == NULL))
break;

@@ -2075,7 +2162,7 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
WARN_ON_ONCE(order > 1);
}
spin_lock_irqsave(&zone->lock, flags);
- page = __rmqueue(zone, order, migratetype);
+ page = __rmqueue(zone, order, migratetype, gfp_flags);
spin_unlock(&zone->lock);
if (!page)
goto failed;
@@ -2185,15 +2272,23 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
unsigned long mark, int classzone_idx, int alloc_flags,
long free_pages)
{
- /* free_pages may go negative - that's OK */
long min = mark;
int o;
long free_cma = 0;

+ /* free_pages may go negative - that's OK */
free_pages -= (1 << order) - 1;
+
if (alloc_flags & ALLOC_HIGH)
min -= min / 2;
- if (alloc_flags & ALLOC_HARDER)
+
+ /*
+ * If the caller is not atomic then discount the reserves. This will
+	 * over-estimate the size of the atomic reserve but it avoids a search
+ */
+ if (likely(!(alloc_flags & ALLOC_HARDER)))
+ free_pages -= z->nr_reserved_highatomic;
+ else
min -= min / 4;

#ifdef CONFIG_CMA
@@ -2382,6 +2477,14 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
if (page) {
if (prep_new_page(page, order, gfp_mask, alloc_flags))
goto try_this_zone;
+
+ /*
+ * If this is a high-order atomic allocation then check
+ * if the pageblock should be reserved for the future
+ */
+ if (unlikely(order && (alloc_flags & ALLOC_HARDER)))
+ reserve_highatomic_pageblock(page, zone, order);
+
return page;
}
}
@@ -2649,9 +2752,11 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,

/*
* If an allocation failed after direct reclaim, it could be because
- * pages are pinned on the per-cpu lists. Drain them and try again
+ * pages are pinned on the per-cpu lists or in high alloc reserves.
+	 * Shrink them and try again
*/
if (!page && !drained) {
+ unreserve_highatomic_pageblock(ac);
drain_all_pages(NULL);
drained = true;
goto retry;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 49963aa2dff3..3427a155f85e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -901,6 +901,7 @@ static char * const migratetype_names[MIGRATE_TYPES] = {
"Unmovable",
"Reclaimable",
"Movable",
+ "HighAtomic",
#ifdef CONFIG_CMA
"CMA",
#endif
--
2.4.6

2015-08-12 10:46:36

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 10/10] mm, page_alloc: Only enforce watermarks for order-0 allocations

The primary purpose of watermarks is to ensure that reclaim can always
make forward progress in PF_MEMALLOC context (kswapd and direct reclaim).
These assume that order-0 allocations are all that is necessary for
forward progress.

High-order watermarks serve a different purpose. Kswapd had no high-order
awareness before they were introduced (https://lkml.org/lkml/2004/9/5/9).
This was particularly important when there were high-order atomic requests.
The watermarks both gave kswapd awareness and made a reserve for those
atomic requests.

There are two important side-effects of this. The most important is that
a non-atomic high-order request can fail even though free pages are available
and the order-0 watermarks are ok. The second is that high-order watermark
checks are expensive as the free list counts for every order up to the
requested order must be examined.

With the introduction of MIGRATE_HIGHATOMIC it is no longer necessary to
have high-order watermarks. Kswapd and compaction still need high-order
awareness which is handled by checking that at least one suitable high-order
page is free.
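
To make the difference concrete, the high-order portion of the check reduces
to something like the following sketch (based on the hunk below, with the
CMA and lowmem_reserve handling omitted):

/* Before: every order below the request donates its free pages and halves
 * 'min', so the free list counts for all lower orders are examined. */
for (o = 0; o < order; o++) {
	free_pages -= z->free_area[o].nr_free << o;
	min >>= 1;
	if (free_pages <= min)
		return false;
}
return true;

/* After: once the order-0 watermark passes, a high-order request only needs
 * one free page at the requested order or above on a usable free list. */
for (o = order; o < MAX_ORDER; o++) {
	for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
		if (!list_empty(&z->free_area[o].free_list[mt]))
			return true;
	}
}
return false;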

With the patch applied, there was little difference in the allocation
failure rates as the atomic reserves are small relative to the number of
allocation attempts. The expected impact is that there will never be an
allocation failure report that shows suitable pages on the free lists.

The one potential side-effect of this is that in a vanilla kernel, the
watermark checks may have kept a free page for an atomic allocation. Now,
we are relying entirely on the HighAtomic reserves and on an early
high-order atomic allocation having created them. If the first high-order
atomic allocation occurs after the system is already heavily fragmented
then it'll fail.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 38 ++++++++++++++++++++++++--------------
1 file changed, 24 insertions(+), 14 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c7f78a6cd708..862fdfe2d219 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2265,8 +2265,10 @@ static inline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
#endif /* CONFIG_FAIL_PAGE_ALLOC */

/*
- * Return true if free pages are above 'mark'. This takes into account the order
- * of the allocation.
+ * Return true if free base pages are above 'mark'. For high-order checks it
+ * will return true if the order-0 watermark is reached and there is at least
+ * one free page of a suitable size. Checking now avoids taking the zone lock
+ * to check in the allocation paths if no pages are free.
*/
static bool __zone_watermark_ok(struct zone *z, unsigned int order,
unsigned long mark, int classzone_idx, int alloc_flags,
@@ -2274,7 +2276,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
{
long min = mark;
int o;
- long free_cma = 0;
+ const bool atomic = (alloc_flags & ALLOC_HARDER);

/* free_pages may go negative - that's OK */
free_pages -= (1 << order) - 1;
@@ -2286,7 +2288,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
* If the caller is not atomic then discount the reserves. This will
 * over-estimate the size of the atomic reserve but it avoids a search
*/
- if (likely(!(alloc_flags & ALLOC_HARDER)))
+ if (likely(!atomic))
free_pages -= z->nr_reserved_highatomic;
else
min -= min / 4;
@@ -2294,22 +2296,30 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
#ifdef CONFIG_CMA
/* If allocation can't use CMA areas don't use free CMA pages */
if (!(alloc_flags & ALLOC_CMA))
- free_cma = zone_page_state(z, NR_FREE_CMA_PAGES);
+ free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
#endif

- if (free_pages - free_cma <= min + z->lowmem_reserve[classzone_idx])
+ if (free_pages <= min + z->lowmem_reserve[classzone_idx])
return false;
- for (o = 0; o < order; o++) {
- /* At the next order, this order's pages become unavailable */
- free_pages -= z->free_area[o].nr_free << o;

- /* Require fewer higher order pages to be free */
- min >>= 1;
+ /* order-0 watermarks are ok */
+ if (!order)
+ return true;
+
+ /* Check at least one high-order page is free */
+ for (o = order; o < MAX_ORDER; o++) {
+ struct free_area *area = &z->free_area[o];
+ int mt;
+
+ if (atomic && area->nr_free)
+ return true;

- if (free_pages <= min)
- return false;
+ for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
+ if (!list_empty(&area->free_list[mt]))
+ return true;
+ }
}
- return true;
+ return false;
}

bool zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
--
2.4.6

2015-08-12 13:22:32

by Michal Hocko

Subject: Re: [PATCH 06/10] mm: page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd

On Wed 12-08-15 11:45:31, Mel Gorman wrote:
> __GFP_WAIT has been used to identify atomic context in callers that hold
> spinlocks or are in interrupts. They are expected to be high priority and
> have access to one of two watermarks lower than "min". __GFP_HIGH users get
> access to the first lower watermark and can be called the "high priority
> reserve". Atomic users and interrupts access yet another lower watermark
> that can be called the "atomic reserve".
>
> Over time, callers had a requirement to not block when fallback options
> were available. Some have abused __GFP_WAIT leading to a situation where
> an optimistic allocation with a fallback option can access atomic reserves.
>
> This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
> cannot sleep and have no alternative. High priority users continue to use
> __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are
> willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies callers
> that want to wake kswapd for background reclaim. __GFP_WAIT is redefined
> as a caller that is willing to enter direct reclaim and wake kswapd for
> background reclaim.

Yes this makes a lot of sense and it is a much more logical model than
what we have right now.
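
For reference, the description above suggests the common masks compose
roughly as follows (a sketch inferred from the series; the exact
definitions are an assumption, not quoted from the patch):

/* Both reclaim flags: willing to reclaim directly and to wake kswapd */
#define __GFP_RECLAIM	(__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM)

/* Cannot sleep; may dip into the atomic reserves, still wakes kswapd */
#define GFP_ATOMIC	(__GFP_HIGH | __GFP_ATOMIC | __GFP_KSWAPD_RECLAIM)

/* Prefers not to sleep and has a fallback: wakes kswapd, no direct reclaim */
#define GFP_NOWAIT	(__GFP_KSWAPD_RECLAIM)

/* Willing to sleep: direct reclaim, background kswapd reclaim, IO and FS */
#define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)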

> This patch then converts a number of sites
>
> o __GFP_ATOMIC is used by callers that are high priority and have memory
> pools for those requests. GFP_ATOMIC uses this flag. Callers with
> interrupts disabled still automatically use the atomic reserves.
>
> o Callers that have a limited mempool to guarantee forward progress use
> __GFP_DIRECT_RECLAIM. bio allocations fall into this category where
> kswapd will still be woken but atomic reserves are not used as there
> is a one-entry mempool to guarantee progress.
>
> o Callers that are checking if they are non-blocking should use the
> helper gfpflags_allows_blocking() where possible. This is because
> checking for __GFP_WAIT as was done historically now can trigger false
> positives. Some exceptions like dm-crypt.c exist where the code intent
> is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
> flag manipulations.
>
> The key hazard to watch out for is callers that removed __GFP_WAIT and
> were depending on access to atomic reserves for inconspicuous reasons.
> In some cases it may be appropriate for them to use __GFP_HIGH.

I will go through all the converted sites but the mm part looks great to
me!

> Signed-off-by: Mel Gorman <[email protected]>
> ---
> Documentation/vm/balance | 14 ++++----
> arch/arm/mm/dma-mapping.c | 4 +--
> arch/arm64/mm/dma-mapping.c | 4 +--
> arch/x86/kernel/pci-dma.c | 2 +-
> block/bio.c | 26 +++++++--------
> block/blk-core.c | 16 ++++-----
> block/blk-ioc.c | 2 +-
> block/blk-mq-tag.c | 2 +-
> block/blk-mq.c | 8 ++---
> block/cfq-iosched.c | 4 +--
> drivers/block/osdblk.c | 2 +-
> drivers/connector/connector.c | 3 +-
> drivers/firewire/core-cdev.c | 2 +-
> drivers/gpu/drm/i915/i915_gem.c | 2 +-
> drivers/infiniband/core/sa_query.c | 2 +-
> drivers/iommu/amd_iommu.c | 2 +-
> drivers/iommu/intel-iommu.c | 2 +-
> drivers/md/dm-crypt.c | 6 ++--
> drivers/mtd/mtdcore.c | 3 +-
> drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 2 +-
> drivers/staging/android/ion/ion_system_heap.c | 2 +-
> drivers/usb/host/u132-hcd.c | 2 +-
> fs/btrfs/disk-io.c | 2 +-
> fs/btrfs/extent_io.c | 8 ++---
> fs/btrfs/volumes.c | 4 +--
> fs/ext3/super.c | 2 +-
> fs/ext4/super.c | 2 +-
> fs/fscache/cookie.c | 2 +-
> fs/fscache/page.c | 6 ++--
> fs/jbd/transaction.c | 4 +--
> fs/jbd2/transaction.c | 4 +--
> fs/nfs/file.c | 6 ++--
> fs/xfs/xfs_qm.c | 2 +-
> include/linux/gfp.h | 44 ++++++++++++++++++-------
> include/linux/skbuff.h | 6 ++--
> include/net/sock.h | 2 +-
> include/trace/events/gfpflags.h | 5 +--
> kernel/audit.c | 6 ++--
> kernel/locking/lockdep.c | 2 +-
> kernel/smp.c | 2 +-
> lib/idr.c | 4 +--
> lib/radix-tree.c | 10 +++---
> mm/backing-dev.c | 2 +-
> mm/dmapool.c | 2 +-
> mm/memcontrol.c | 8 ++---
> mm/mempool.c | 10 +++---
> mm/page_alloc.c | 29 +++++++++-------
> mm/slab.c | 18 +++++-----
> mm/slub.c | 6 ++--
> mm/vmalloc.c | 2 +-
> mm/vmscan.c | 2 +-
> net/core/skbuff.c | 8 ++---
> net/core/sock.c | 6 ++--
> net/sctp/associola.c | 2 +-
> 54 files changed, 181 insertions(+), 149 deletions(-)
>
> diff --git a/Documentation/vm/balance b/Documentation/vm/balance
> index c46e68cf9344..6f1f6fae30f5 100644
> --- a/Documentation/vm/balance
> +++ b/Documentation/vm/balance
> @@ -1,12 +1,14 @@
> Started Jan 2000 by Kanoj Sarcar <[email protected]>
>
> -Memory balancing is needed for non __GFP_WAIT as well as for non
> -__GFP_IO allocations.
> +Memory balancing is needed for !__GFP_ATOMIC and !__GFP_KSWAPD_RECLAIM as
> +well as for non __GFP_IO allocations.
>
> -There are two reasons to be requesting non __GFP_WAIT allocations:
> -the caller can not sleep (typically intr context), or does not want
> -to incur cost overheads of page stealing and possible swap io for
> -whatever reasons.
> +The first reason why a caller may avoid reclaim is that the caller can not
> +sleep due to holding a spinlock or is in interrupt context. The second may
> +be that the caller is willing to fail the allocation without incurring the
> +overhead of page stealing. This may happen for opportunistic high-order
> +allocation requests that have order-0 fallback options. In such cases,
> +the caller may also wish to avoid waking kswapd.
>
> __GFP_IO allocation requests are made to prevent file system deadlocks.
>
> diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
> index cba12f34ff77..100d3fbaebae 100644
> --- a/arch/arm/mm/dma-mapping.c
> +++ b/arch/arm/mm/dma-mapping.c
> @@ -650,7 +650,7 @@ static void *__dma_alloc(struct device *dev, size_t size, dma_addr_t *handle,
>
> if (is_coherent || nommu())
> addr = __alloc_simple_buffer(dev, size, gfp, &page);
> - else if (!(gfp & __GFP_WAIT))
> + else if (gfp & __GFP_ATOMIC)
> addr = __alloc_from_pool(size, &page);
> else if (!dev_get_cma_area(dev))
> addr = __alloc_remap_buffer(dev, size, gfp, prot, &page, caller, want_vaddr);
> @@ -1369,7 +1369,7 @@ static void *arm_iommu_alloc_attrs(struct device *dev, size_t size,
> *handle = DMA_ERROR_CODE;
> size = PAGE_ALIGN(size);
>
> - if (!(gfp & __GFP_WAIT))
> + if (gfp & __GFP_ATOMIC)
> return __iommu_alloc_atomic(dev, size, handle);
>
> /*
> diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
> index d16a1cead23f..713d963fb96b 100644
> --- a/arch/arm64/mm/dma-mapping.c
> +++ b/arch/arm64/mm/dma-mapping.c
> @@ -100,7 +100,7 @@ static void *__dma_alloc_coherent(struct device *dev, size_t size,
> if (IS_ENABLED(CONFIG_ZONE_DMA) &&
> dev->coherent_dma_mask <= DMA_BIT_MASK(32))
> flags |= GFP_DMA;
> - if (IS_ENABLED(CONFIG_DMA_CMA) && (flags & __GFP_WAIT)) {
> + if (IS_ENABLED(CONFIG_DMA_CMA) && (flags & __GFP_DIRECT_RECLAIM)) {
> struct page *page;
> void *addr;
>
> @@ -147,7 +147,7 @@ static void *__dma_alloc(struct device *dev, size_t size,
>
> size = PAGE_ALIGN(size);
>
> - if (!coherent && !(flags & __GFP_WAIT)) {
> + if (!coherent && (flags & __GFP_ATOMIC)) {
> struct page *page = NULL;
> void *addr = __alloc_from_pool(size, &page, flags);
>
> diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
> index 353972c1946c..9a13ebb0f621 100644
> --- a/arch/x86/kernel/pci-dma.c
> +++ b/arch/x86/kernel/pci-dma.c
> @@ -101,7 +101,7 @@ void *dma_generic_alloc_coherent(struct device *dev, size_t size,
> again:
> page = NULL;
> /* CMA can be used only in the context which permits sleeping */
> - if (flag & __GFP_WAIT) {
> + if (flag & __GFP_DIRECT_RECLAIM) {
> page = dma_alloc_from_contiguous(dev, count, get_order(size));
> if (page && page_to_phys(page) + size > dma_mask) {
> dma_release_from_contiguous(dev, page, count);
> diff --git a/block/bio.c b/block/bio.c
> index d6e5ba3399f0..fbc558b50e67 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -211,7 +211,7 @@ struct bio_vec *bvec_alloc(gfp_t gfp_mask, int nr, unsigned long *idx,
> bvl = mempool_alloc(pool, gfp_mask);
> } else {
> struct biovec_slab *bvs = bvec_slabs + *idx;
> - gfp_t __gfp_mask = gfp_mask & ~(__GFP_WAIT | __GFP_IO);
> + gfp_t __gfp_mask = gfp_mask & ~(__GFP_DIRECT_RECLAIM | __GFP_IO);
>
> /*
> * Make this allocation restricted and don't dump info on
> @@ -221,11 +221,11 @@ struct bio_vec *bvec_alloc(gfp_t gfp_mask, int nr, unsigned long *idx,
> __gfp_mask |= __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN;
>
> /*
> - * Try a slab allocation. If this fails and __GFP_WAIT
> + * Try a slab allocation. If this fails and __GFP_DIRECT_RECLAIM
> * is set, retry with the 1-entry mempool
> */
> bvl = kmem_cache_alloc(bvs->slab, __gfp_mask);
> - if (unlikely(!bvl && (gfp_mask & __GFP_WAIT))) {
> + if (unlikely(!bvl && (gfp_mask & __GFP_DIRECT_RECLAIM))) {
> *idx = BIOVEC_MAX_IDX;
> goto fallback;
> }
> @@ -393,12 +393,12 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
> * If @bs is NULL, uses kmalloc() to allocate the bio; else the allocation is
> * backed by the @bs's mempool.
> *
> - * When @bs is not NULL, if %__GFP_WAIT is set then bio_alloc will always be
> - * able to allocate a bio. This is due to the mempool guarantees. To make this
> - * work, callers must never allocate more than 1 bio at a time from this pool.
> - * Callers that need to allocate more than 1 bio must always submit the
> - * previously allocated bio for IO before attempting to allocate a new one.
> - * Failure to do so can cause deadlocks under memory pressure.
> + * When @bs is not NULL, if %__GFP_DIRECT_RECLAIM is set then bio_alloc will
> + * always be able to allocate a bio. This is due to the mempool guarantees.
> + * To make this work, callers must never allocate more than 1 bio at a time
> + * from this pool. Callers that need to allocate more than 1 bio must always
> + * submit the previously allocated bio for IO before attempting to allocate
> + * a new one. Failure to do so can cause deadlocks under memory pressure.
> *
> * Note that when running under generic_make_request() (i.e. any block
> * driver), bios are not submitted until after you return - see the code in
> @@ -457,13 +457,13 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> * We solve this, and guarantee forward progress, with a rescuer
> * workqueue per bio_set. If we go to allocate and there are
> * bios on current->bio_list, we first try the allocation
> - * without __GFP_WAIT; if that fails, we punt those bios we
> - * would be blocking to the rescuer workqueue before we retry
> - * with the original gfp_flags.
> + * without __GFP_DIRECT_RECLAIM; if that fails, we punt those
> + * bios we would be blocking to the rescuer workqueue before
> + * we retry with the original gfp_flags.
> */
>
> if (current->bio_list && !bio_list_empty(current->bio_list))
> - gfp_mask &= ~__GFP_WAIT;
> + gfp_mask &= ~__GFP_DIRECT_RECLAIM;
>
> p = mempool_alloc(bs->bio_pool, gfp_mask);
> if (!p && gfp_mask != saved_gfp) {
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 627ed0c593fb..c53c2513d472 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1156,8 +1156,8 @@ static struct request *__get_request(struct request_list *rl, int rw_flags,
> * @bio: bio to allocate request for (can be %NULL)
> * @gfp_mask: allocation mask
> *
> - * Get a free request from @q. If %__GFP_WAIT is set in @gfp_mask, this
> - * function keeps retrying under memory pressure and fails iff @q is dead.
> + * Get a free request from @q. If %__GFP_DIRECT_RECLAIM is set in @gfp_mask,
> + * this function keeps retrying under memory pressure and fails iff @q is dead.
> *
> * Must be called with @q->queue_lock held and,
> * Returns ERR_PTR on failure, with @q->queue_lock held.
> @@ -1177,7 +1177,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
> if (!IS_ERR(rq))
> return rq;
>
> - if (!(gfp_mask & __GFP_WAIT) || unlikely(blk_queue_dying(q))) {
> + if (!gfpflags_allows_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) {
> blk_put_rl(rl);
> return rq;
> }
> @@ -1255,11 +1255,11 @@ EXPORT_SYMBOL(blk_get_request);
> * BUG.
> *
> * WARNING: When allocating/cloning a bio-chain, careful consideration should be
> - * given to how you allocate bios. In particular, you cannot use __GFP_WAIT for
> - * anything but the first bio in the chain. Otherwise you risk waiting for IO
> - * completion of a bio that hasn't been submitted yet, thus resulting in a
> - * deadlock. Alternatively bios should be allocated using bio_kmalloc() instead
> - * of bio_alloc(), as that avoids the mempool deadlock.
> + * given to how you allocate bios. In particular, you cannot use
> + * __GFP_DIRECT_RECLAIM for anything but the first bio in the chain. Otherwise
> + * you risk waiting for IO completion of a bio that hasn't been submitted yet,
> + * thus resulting in a deadlock. Alternatively bios should be allocated using
> + * bio_kmalloc() instead of bio_alloc(), as that avoids the mempool deadlock.
> * If possible a big IO should be split into smaller parts when allocation
> * fails. Partial allocation should not be an error, or you risk a live-lock.
> */
> diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> index 1a27f45ec776..0e7e7d9ffc04 100644
> --- a/block/blk-ioc.c
> +++ b/block/blk-ioc.c
> @@ -289,7 +289,7 @@ struct io_context *get_task_io_context(struct task_struct *task,
> {
> struct io_context *ioc;
>
> - might_sleep_if(gfp_flags & __GFP_WAIT);
> + might_sleep_if(gfpflags_allows_blocking(gfp_flags));
>
> do {
> task_lock(task);
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 9b6e28830b82..7e27f6164298 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -264,7 +264,7 @@ static int bt_get(struct blk_mq_alloc_data *data,
> if (tag != -1)
> return tag;
>
> - if (!(data->gfp & __GFP_WAIT))
> + if (!gfpflags_allows_blocking(data->gfp))
> return -1;
>
> bs = bt_wait_ptr(bt, hctx);
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 7d842db59699..df8cba632ec2 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -85,7 +85,7 @@ static int blk_mq_queue_enter(struct request_queue *q, gfp_t gfp)
> if (percpu_ref_tryget_live(&q->mq_usage_counter))
> return 0;
>
> - if (!(gfp & __GFP_WAIT))
> + if (!gfpflags_allows_blocking(gfp))
> return -EBUSY;
>
> ret = wait_event_interruptible(q->mq_freeze_wq,
> @@ -261,11 +261,11 @@ struct request *blk_mq_alloc_request(struct request_queue *q, int rw, gfp_t gfp,
>
> ctx = blk_mq_get_ctx(q);
> hctx = q->mq_ops->map_queue(q, ctx->cpu);
> - blk_mq_set_alloc_data(&alloc_data, q, gfp & ~__GFP_WAIT,
> + blk_mq_set_alloc_data(&alloc_data, q, gfp & ~__GFP_DIRECT_RECLAIM,
> reserved, ctx, hctx);
>
> rq = __blk_mq_alloc_request(&alloc_data, rw);
> - if (!rq && (gfp & __GFP_WAIT)) {
> + if (!rq && (gfp & __GFP_DIRECT_RECLAIM)) {
> __blk_mq_run_hw_queue(hctx);
> blk_mq_put_ctx(ctx);
>
> @@ -1221,7 +1221,7 @@ static struct request *blk_mq_map_request(struct request_queue *q,
> ctx = blk_mq_get_ctx(q);
> hctx = q->mq_ops->map_queue(q, ctx->cpu);
> blk_mq_set_alloc_data(&alloc_data, q,
> - __GFP_WAIT|GFP_ATOMIC, false, ctx, hctx);
> + __GFP_WAIT|__GFP_HIGH, false, ctx, hctx);
> rq = __blk_mq_alloc_request(&alloc_data, rw);
> ctx = alloc_data.ctx;
> hctx = alloc_data.hctx;
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index c62bb2e650b8..4c7ca678856a 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -3674,7 +3674,7 @@ cfq_find_alloc_queue(struct cfq_data *cfqd, bool is_sync, struct cfq_io_cq *cic,
> if (new_cfqq) {
> cfqq = new_cfqq;
> new_cfqq = NULL;
> - } else if (gfp_mask & __GFP_WAIT) {
> + } else if (gfp_mask & __GFP_DIRECT_RECLAIM) {
> rcu_read_unlock();
> spin_unlock_irq(cfqd->queue->queue_lock);
> new_cfqq = kmem_cache_alloc_node(cfq_pool,
> @@ -4289,7 +4289,7 @@ cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
> const bool is_sync = rq_is_sync(rq);
> struct cfq_queue *cfqq;
>
> - might_sleep_if(gfp_mask & __GFP_WAIT);
> + might_sleep_if(gfpflags_allows_blocking(gfp_mask));
>
> spin_lock_irq(q->queue_lock);
>
> diff --git a/drivers/block/osdblk.c b/drivers/block/osdblk.c
> index e22942596207..1b709a4e3b5e 100644
> --- a/drivers/block/osdblk.c
> +++ b/drivers/block/osdblk.c
> @@ -271,7 +271,7 @@ static struct bio *bio_chain_clone(struct bio *old_chain, gfp_t gfpmask)
> goto err_out;
>
> tmp->bi_bdev = NULL;
> - gfpmask &= ~__GFP_WAIT;
> + gfpmask &= ~__GFP_DIRECT_RECLAIM;
> tmp->bi_next = NULL;
>
> if (!new_chain)
> diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c
> index 30f522848c73..6255c8df6ae9 100644
> --- a/drivers/connector/connector.c
> +++ b/drivers/connector/connector.c
> @@ -124,7 +124,8 @@ int cn_netlink_send_mult(struct cn_msg *msg, u16 len, u32 portid, u32 __group,
> if (group)
> return netlink_broadcast(dev->nls, skb, portid, group,
> gfp_mask);
> - return netlink_unicast(dev->nls, skb, portid, !(gfp_mask&__GFP_WAIT));
> + return netlink_unicast(dev->nls, skb, portid,
> + !gfpflags_allows_blocking(gfp_mask));
> }
> EXPORT_SYMBOL_GPL(cn_netlink_send_mult);
>
> diff --git a/drivers/firewire/core-cdev.c b/drivers/firewire/core-cdev.c
> index 2a3973a7c441..dc611c8cad10 100644
> --- a/drivers/firewire/core-cdev.c
> +++ b/drivers/firewire/core-cdev.c
> @@ -486,7 +486,7 @@ static int ioctl_get_info(struct client *client, union ioctl_arg *arg)
> static int add_client_resource(struct client *client,
> struct client_resource *resource, gfp_t gfp_mask)
> {
> - bool preload = !!(gfp_mask & __GFP_WAIT);
> + bool preload = !!(gfp_mask & __GFP_DIRECT_RECLAIM);
> unsigned long flags;
> int ret;
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 52b446b27b4d..c2b45081c5ab 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2225,7 +2225,7 @@ i915_gem_object_get_pages_gtt(struct drm_i915_gem_object *obj)
> */
> mapping = file_inode(obj->base.filp)->i_mapping;
> gfp = mapping_gfp_mask(mapping);
> - gfp |= __GFP_NORETRY | __GFP_NOWARN | __GFP_NO_KSWAPD;
> + gfp |= __GFP_NORETRY | __GFP_NOWARN;
> gfp &= ~(__GFP_IO | __GFP_WAIT);
> sg = st->sgl;
> st->nents = 0;
> diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
> index ca919f429666..76cf4cee3d64 100644
> --- a/drivers/infiniband/core/sa_query.c
> +++ b/drivers/infiniband/core/sa_query.c
> @@ -619,7 +619,7 @@ static void init_mad(struct ib_sa_mad *mad, struct ib_mad_agent *agent)
>
> static int send_mad(struct ib_sa_query *query, int timeout_ms, gfp_t gfp_mask)
> {
> - bool preload = !!(gfp_mask & __GFP_WAIT);
> + bool preload = !!(gfp_mask & __GFP_DIRECT_RECLAIM);
> unsigned long flags;
> int ret, id;
>
> diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
> index 658ee39e6569..f4adbe89cd20 100644
> --- a/drivers/iommu/amd_iommu.c
> +++ b/drivers/iommu/amd_iommu.c
> @@ -2755,7 +2755,7 @@ static void *alloc_coherent(struct device *dev, size_t size,
>
> page = alloc_pages(flag | __GFP_NOWARN, get_order(size));
> if (!page) {
> - if (!(flag & __GFP_WAIT))
> + if (!gfpflags_allows_blocking(flag))
> return NULL;
>
> page = dma_alloc_from_contiguous(dev, size >> PAGE_SHIFT,
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 0649b94f5958..0f614d66eb03 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -3566,7 +3566,7 @@ static void *intel_alloc_coherent(struct device *dev, size_t size,
> flags |= GFP_DMA32;
> }
>
> - if (flags & __GFP_WAIT) {
> + if (gfpflags_allows_blocking(flags)) {
> unsigned int count = size >> PAGE_SHIFT;
>
> page = dma_alloc_from_contiguous(dev, count, order);
> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> index 0f48fed44a17..6dda08385309 100644
> --- a/drivers/md/dm-crypt.c
> +++ b/drivers/md/dm-crypt.c
> @@ -993,7 +993,7 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size)
> struct bio_vec *bvec;
>
> retry:
> - if (unlikely(gfp_mask & __GFP_WAIT))
> + if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
> mutex_lock(&cc->bio_alloc_lock);
>
> clone = bio_alloc_bioset(GFP_NOIO, nr_iovecs, cc->bs);
> @@ -1009,7 +1009,7 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size)
> if (!page) {
> crypt_free_buffer_pages(cc, clone);
> bio_put(clone);
> - gfp_mask |= __GFP_WAIT;
> + gfp_mask |= __GFP_DIRECT_RECLAIM;
> goto retry;
> }
>
> @@ -1026,7 +1026,7 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size)
> }
>
> return_clone:
> - if (unlikely(gfp_mask & __GFP_WAIT))
> + if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
> mutex_unlock(&cc->bio_alloc_lock);
>
> return clone;
> diff --git a/drivers/mtd/mtdcore.c b/drivers/mtd/mtdcore.c
> index 8bbbb751bf45..2dfb291a47c6 100644
> --- a/drivers/mtd/mtdcore.c
> +++ b/drivers/mtd/mtdcore.c
> @@ -1188,8 +1188,7 @@ EXPORT_SYMBOL_GPL(mtd_writev);
> */
> void *mtd_kmalloc_up_to(const struct mtd_info *mtd, size_t *size)
> {
> - gfp_t flags = __GFP_NOWARN | __GFP_WAIT |
> - __GFP_NORETRY | __GFP_NO_KSWAPD;
> + gfp_t flags = __GFP_NOWARN | __GFP_DIRECT_RECLAIM | __GFP_NORETRY;
> size_t min_alloc = max_t(size_t, mtd->writesize, PAGE_SIZE);
> void *kbuf;
>
> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
> index a90d7364334f..8458329a877e 100644
> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
> @@ -689,7 +689,7 @@ static void *bnx2x_frag_alloc(const struct bnx2x_fastpath *fp, gfp_t gfp_mask)
> {
> if (fp->rx_frag_size) {
> /* GFP_KERNEL allocations are used only during initialization */
> - if (unlikely(gfp_mask & __GFP_WAIT))
> + if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
> return (void *)__get_free_page(gfp_mask);
>
> return netdev_alloc_frag(fp->rx_frag_size);
> diff --git a/drivers/staging/android/ion/ion_system_heap.c b/drivers/staging/android/ion/ion_system_heap.c
> index da2a63c0a9ba..2615e0ae4f0a 100644
> --- a/drivers/staging/android/ion/ion_system_heap.c
> +++ b/drivers/staging/android/ion/ion_system_heap.c
> @@ -27,7 +27,7 @@
> #include "ion_priv.h"
>
> static gfp_t high_order_gfp_flags = (GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN |
> - __GFP_NORETRY) & ~__GFP_WAIT;
> + __GFP_NORETRY) & ~__GFP_DIRECT_RECLAIM;
> static gfp_t low_order_gfp_flags = (GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN);
> static const unsigned int orders[] = {8, 4, 0};
> static const int num_orders = ARRAY_SIZE(orders);
> diff --git a/drivers/usb/host/u132-hcd.c b/drivers/usb/host/u132-hcd.c
> index d51687780b61..06badad3ab75 100644
> --- a/drivers/usb/host/u132-hcd.c
> +++ b/drivers/usb/host/u132-hcd.c
> @@ -2247,7 +2247,7 @@ static int u132_urb_enqueue(struct usb_hcd *hcd, struct urb *urb,
> {
> struct u132 *u132 = hcd_to_u132(hcd);
> if (irqs_disabled()) {
> - if (__GFP_WAIT & mem_flags) {
> + if (__GFP_DIRECT_RECLAIM & mem_flags) {
> printk(KERN_ERR "invalid context for function that migh"
> "t sleep\n");
> return -EINVAL;
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index f556c3732c2c..3dd4792b8099 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2566,7 +2566,7 @@ int open_ctree(struct super_block *sb,
> fs_info->commit_interval = BTRFS_DEFAULT_COMMIT_INTERVAL;
> fs_info->avg_delayed_ref_runtime = NSEC_PER_SEC >> 6; /* div by 64 */
> /* readahead state */
> - INIT_RADIX_TREE(&fs_info->reada_tree, GFP_NOFS & ~__GFP_WAIT);
> + INIT_RADIX_TREE(&fs_info->reada_tree, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
> spin_lock_init(&fs_info->reada_lock);
>
> fs_info->thread_pool_size = min_t(unsigned long,
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 02d05817cbdf..35660da77921 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -594,7 +594,7 @@ int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
> if (bits & (EXTENT_IOBITS | EXTENT_BOUNDARY))
> clear = 1;
> again:
> - if (!prealloc && (mask & __GFP_WAIT)) {
> + if (!prealloc && (mask & __GFP_DIRECT_RECLAIM)) {
> /*
> * Don't care for allocation failure here because we might end
> * up not needing the pre-allocated extent state at all, which
> @@ -850,7 +850,7 @@ __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
>
> bits |= EXTENT_FIRST_DELALLOC;
> again:
> - if (!prealloc && (mask & __GFP_WAIT)) {
> + if (!prealloc && (mask & __GFP_DIRECT_RECLAIM)) {
> prealloc = alloc_extent_state(mask);
> BUG_ON(!prealloc);
> }
> @@ -1076,7 +1076,7 @@ int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
> btrfs_debug_check_extent_io_range(tree, start, end);
>
> again:
> - if (!prealloc && (mask & __GFP_WAIT)) {
> + if (!prealloc && (mask & __GFP_DIRECT_RECLAIM)) {
> /*
> * Best effort, don't worry if extent state allocation fails
> * here for the first iteration. We might have a cached state
> @@ -4265,7 +4265,7 @@ int try_release_extent_mapping(struct extent_map_tree *map,
> u64 start = page_offset(page);
> u64 end = start + PAGE_CACHE_SIZE - 1;
>
> - if ((mask & __GFP_WAIT) &&
> + if ((mask & __GFP_DIRECT_RECLAIM) &&
> page->mapping->host->i_size > 16 * 1024 * 1024) {
> u64 len;
> while (start <= end) {
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index fbe7c104531c..b1968f36a39b 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -156,8 +156,8 @@ static struct btrfs_device *__alloc_device(void)
> spin_lock_init(&dev->reada_lock);
> atomic_set(&dev->reada_in_flight, 0);
> atomic_set(&dev->dev_stats_ccnt, 0);
> - INIT_RADIX_TREE(&dev->reada_zones, GFP_NOFS & ~__GFP_WAIT);
> - INIT_RADIX_TREE(&dev->reada_extents, GFP_NOFS & ~__GFP_WAIT);
> + INIT_RADIX_TREE(&dev->reada_zones, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
> + INIT_RADIX_TREE(&dev->reada_extents, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
>
> return dev;
> }
> diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> index 5ed0044fbb37..9004c786716f 100644
> --- a/fs/ext3/super.c
> +++ b/fs/ext3/super.c
> @@ -750,7 +750,7 @@ static int bdev_try_to_free_page(struct super_block *sb, struct page *page,
> return 0;
> if (journal)
> return journal_try_to_free_buffers(journal, page,
> - wait & ~__GFP_WAIT);
> + wait & ~__GFP_DIRECT_RECLAIM);
> return try_to_free_buffers(page);
> }
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 58987b5c514b..abe76d41ef1e 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1045,7 +1045,7 @@ static int bdev_try_to_free_page(struct super_block *sb, struct page *page,
> return 0;
> if (journal)
> return jbd2_journal_try_to_free_buffers(journal, page,
> - wait & ~__GFP_WAIT);
> + wait & ~__GFP_DIRECT_RECLAIM);
> return try_to_free_buffers(page);
> }
>
> diff --git a/fs/fscache/cookie.c b/fs/fscache/cookie.c
> index d403c69bee08..4304072161aa 100644
> --- a/fs/fscache/cookie.c
> +++ b/fs/fscache/cookie.c
> @@ -111,7 +111,7 @@ struct fscache_cookie *__fscache_acquire_cookie(
>
> /* radix tree insertion won't use the preallocation pool unless it's
> * told it may not wait */
> - INIT_RADIX_TREE(&cookie->stores, GFP_NOFS & ~__GFP_WAIT);
> + INIT_RADIX_TREE(&cookie->stores, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
>
> switch (cookie->def->type) {
> case FSCACHE_COOKIE_TYPE_INDEX:
> diff --git a/fs/fscache/page.c b/fs/fscache/page.c
> index 483bbc613bf0..79483b3d8c6f 100644
> --- a/fs/fscache/page.c
> +++ b/fs/fscache/page.c
> @@ -58,7 +58,7 @@ bool release_page_wait_timeout(struct fscache_cookie *cookie, struct page *page)
>
> /*
> * decide whether a page can be released, possibly by cancelling a store to it
> - * - we're allowed to sleep if __GFP_WAIT is flagged
> + * - we're allowed to sleep if __GFP_DIRECT_RECLAIM is flagged
> */
> bool __fscache_maybe_release_page(struct fscache_cookie *cookie,
> struct page *page,
> @@ -122,7 +122,7 @@ bool __fscache_maybe_release_page(struct fscache_cookie *cookie,
> * allocator as the work threads writing to the cache may all end up
> * sleeping on memory allocation, so we may need to impose a timeout
> * too. */
> - if (!(gfp & __GFP_WAIT) || !(gfp & __GFP_FS)) {
> + if (!(gfp & __GFP_DIRECT_RECLAIM) || !(gfp & __GFP_FS)) {
> fscache_stat(&fscache_n_store_vmscan_busy);
> return false;
> }
> @@ -132,7 +132,7 @@ bool __fscache_maybe_release_page(struct fscache_cookie *cookie,
> _debug("fscache writeout timeout page: %p{%lx}",
> page, page->index);
>
> - gfp &= ~__GFP_WAIT;
> + gfp &= ~__GFP_DIRECT_RECLAIM;
> goto try_again;
> }
> EXPORT_SYMBOL(__fscache_maybe_release_page);
> diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
> index 1695ba8334a2..f45b90ba7c5c 100644
> --- a/fs/jbd/transaction.c
> +++ b/fs/jbd/transaction.c
> @@ -1690,8 +1690,8 @@ __journal_try_to_free_buffer(journal_t *journal, struct buffer_head *bh)
> * @journal: journal for operation
> * @page: to try and free
> * @gfp_mask: we use the mask to detect how hard should we try to release
> - * buffers. If __GFP_WAIT and __GFP_FS is set, we wait for commit code to
> - * release the buffers.
> + * buffers. If __GFP_DIRECT_RECLAIM and __GFP_FS is set, we wait for commit
> + * code to release the buffers.
> *
> *
> * For all the buffers on this page,
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index f3d06174b051..06e18bcdb888 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -1893,8 +1893,8 @@ __journal_try_to_free_buffer(journal_t *journal, struct buffer_head *bh)
> * @journal: journal for operation
> * @page: to try and free
> * @gfp_mask: we use the mask to detect how hard should we try to release
> - * buffers. If __GFP_WAIT and __GFP_FS is set, we wait for commit code to
> - * release the buffers.
> + * buffers. If __GFP_DIRECT_RECLAIM and __GFP_FS is set, we wait for commit
> + * code to release the buffers.
> *
> *
> * For all the buffers on this page,
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index cc4fa1ed61fc..5664e1938da1 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -480,8 +480,8 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
> dfprintk(PAGECACHE, "NFS: release_page(%p)\n", page);
>
> /* Always try to initiate a 'commit' if relevant, but only
> - * wait for it if __GFP_WAIT is set. Even then, only wait 1
> - * second and only if the 'bdi' is not congested.
> + * wait for it if __GFP_DIRECT_RECLAIM is set. Even then,
> + * only wait 1 second and only if the 'bdi' is not congested.
> * Waiting indefinitely can cause deadlocks when the NFS
> * server is on this machine, when a new TCP connection is
> * needed and in other rare cases. There is no particular
> @@ -491,7 +491,7 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
> if (mapping) {
> struct nfs_server *nfss = NFS_SERVER(mapping->host);
> nfs_commit_inode(mapping->host, 0);
> - if ((gfp & __GFP_WAIT) &&
> + if ((gfp & __GFP_DIRECT_RECLAIM) &&
> !bdi_write_congested(&nfss->backing_dev_info)) {
> wait_on_page_bit_killable_timeout(page, PG_private,
> HZ);
> diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> index eac9549efd52..587174fd4f2c 100644
> --- a/fs/xfs/xfs_qm.c
> +++ b/fs/xfs/xfs_qm.c
> @@ -525,7 +525,7 @@ xfs_qm_shrink_scan(
> unsigned long freed;
> int error;
>
> - if ((sc->gfp_mask & (__GFP_FS|__GFP_WAIT)) != (__GFP_FS|__GFP_WAIT))
> + if ((sc->gfp_mask & (__GFP_FS|__GFP_DIRECT_RECLAIM)) != (__GFP_FS|__GFP_DIRECT_RECLAIM))
> return 0;
>
> INIT_LIST_HEAD(&isol.buffers);
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 43246850a85f..dbd246a14e2f 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -29,12 +29,13 @@ struct vm_area_struct;
> #define ___GFP_NOMEMALLOC 0x10000u
> #define ___GFP_HARDWALL 0x20000u
> #define ___GFP_THISNODE 0x40000u
> -#define ___GFP_WAIT 0x80000u
> +#define ___GFP_ATOMIC 0x80000u
> #define ___GFP_NOACCOUNT 0x100000u
> #define ___GFP_NOTRACK 0x200000u
> -#define ___GFP_NO_KSWAPD 0x400000u
> +#define ___GFP_DIRECT_RECLAIM 0x400000u
> #define ___GFP_OTHER_NODE 0x800000u
> #define ___GFP_WRITE 0x1000000u
> +#define ___GFP_KSWAPD_RECLAIM 0x2000000u
> /* If the above are modified, __GFP_BITS_SHIFT may need updating */
>
> /*
> @@ -68,7 +69,7 @@ struct vm_area_struct;
> * __GFP_MOVABLE: Flag that this page will be movable by the page migration
> * mechanism or reclaimed
> */
> -#define __GFP_WAIT ((__force gfp_t)___GFP_WAIT) /* Can wait and reschedule? */
> +#define __GFP_ATOMIC ((__force gfp_t)___GFP_ATOMIC) /* Caller cannot wait or reschedule */
> #define __GFP_HIGH ((__force gfp_t)___GFP_HIGH) /* Should access emergency pools? */
> #define __GFP_IO ((__force gfp_t)___GFP_IO) /* Can start physical IO? */
> #define __GFP_FS ((__force gfp_t)___GFP_FS) /* Can call down to low-level FS? */
> @@ -91,23 +92,37 @@ struct vm_area_struct;
> #define __GFP_NOACCOUNT ((__force gfp_t)___GFP_NOACCOUNT) /* Don't account to kmemcg */
> #define __GFP_NOTRACK ((__force gfp_t)___GFP_NOTRACK) /* Don't track with kmemcheck */
>
> -#define __GFP_NO_KSWAPD ((__force gfp_t)___GFP_NO_KSWAPD)
> #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
> #define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */
>
> /*
> + * A caller that is willing to wait may enter direct reclaim and will
> + * wake kswapd to reclaim pages in the background until the high
> + * watermark is met. A caller may wish to clear __GFP_DIRECT_RECLAIM to
> + * avoid unnecessary delays when a fallback option is available but
> + * still allow kswapd to reclaim in the background. The kswapd flag
> + * can be cleared when the reclaiming of pages would cause unnecessary
> + * disruption.
> + */
> +#define __GFP_WAIT (__GFP_DIRECT_RECLAIM|__GFP_KSWAPD_RECLAIM)
> +#define __GFP_DIRECT_RECLAIM ((__force gfp_t)___GFP_DIRECT_RECLAIM) /* Caller can reclaim */
> +#define __GFP_KSWAPD_RECLAIM ((__force gfp_t)___GFP_KSWAPD_RECLAIM) /* kswapd can wake */
> +
> +/*
> * This may seem redundant, but it's a way of annotating false positives vs.
> * allocations that simply cannot be supported (e.g. page tables).
> */
> #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
>
> -#define __GFP_BITS_SHIFT 25 /* Room for N __GFP_FOO bits */
> +#define __GFP_BITS_SHIFT 26 /* Room for N __GFP_FOO bits */
> #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
>
> -/* This equals 0, but use constants in case they ever change */
> -#define GFP_NOWAIT (GFP_ATOMIC & ~__GFP_HIGH)
> -/* GFP_ATOMIC means both !wait (__GFP_WAIT not set) and use emergency pool */
> -#define GFP_ATOMIC (__GFP_HIGH)
> +/*
> + * GFP_ATOMIC callers cannot sleep and need the allocation to succeed.
> + * A lower watermark is applied to allow access to "atomic reserves"
> + */
> +#define GFP_ATOMIC (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
> +#define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM)
> #define GFP_NOIO (__GFP_WAIT)
> #define GFP_NOFS (__GFP_WAIT | __GFP_IO)
> #define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS)
> @@ -117,9 +132,9 @@ struct vm_area_struct;
> #define GFP_HIGHUSER (GFP_USER | __GFP_HIGHMEM)
> #define GFP_HIGHUSER_MOVABLE (GFP_HIGHUSER | __GFP_MOVABLE)
> #define GFP_IOFS (__GFP_IO | __GFP_FS)
> -#define GFP_TRANSHUGE (GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
> - __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \
> - __GFP_NO_KSWAPD)
> +#define GFP_TRANSHUGE ((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
> + __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \
> + ~__GFP_KSWAPD_RECLAIM)
>
> /* This mask makes up all the page movable related flags */
> #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
> @@ -161,6 +176,11 @@ static inline int gfpflags_to_migratetype(const gfp_t gfp_flags)
> return (gfp_flags & GFP_MOVABLE_MASK) >> GFP_MOVABLE_SHIFT;
> }
>
> +static inline bool gfpflags_allows_blocking(const gfp_t gfp_flags)
> +{
> + return gfp_flags & __GFP_DIRECT_RECLAIM;
> +}
> +
> #ifdef CONFIG_HIGHMEM
> #define OPT_ZONE_HIGHMEM ZONE_HIGHMEM
> #else
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index d6cdd6e87d53..1086bfa3eb80 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1109,7 +1109,7 @@ static inline int skb_cloned(const struct sk_buff *skb)
>
> static inline int skb_unclone(struct sk_buff *skb, gfp_t pri)
> {
> - might_sleep_if(pri & __GFP_WAIT);
> + might_sleep_if(gfpflags_allows_blocking(pri));
>
> if (skb_cloned(skb))
> return pskb_expand_head(skb, 0, 0, pri);
> @@ -1193,7 +1193,7 @@ static inline int skb_shared(const struct sk_buff *skb)
> */
> static inline struct sk_buff *skb_share_check(struct sk_buff *skb, gfp_t pri)
> {
> - might_sleep_if(pri & __GFP_WAIT);
> + might_sleep_if(gfpflags_allows_blocking(pri));
> if (skb_shared(skb)) {
> struct sk_buff *nskb = skb_clone(skb, pri);
>
> @@ -1229,7 +1229,7 @@ static inline struct sk_buff *skb_share_check(struct sk_buff *skb, gfp_t pri)
> static inline struct sk_buff *skb_unshare(struct sk_buff *skb,
> gfp_t pri)
> {
> - might_sleep_if(pri & __GFP_WAIT);
> + might_sleep_if(gfpflags_allows_blocking(pri));
> if (skb_cloned(skb)) {
> struct sk_buff *nskb = skb_copy(skb, pri);
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index f21f0708ec59..3ab94d6b0b56 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -2035,7 +2035,7 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp,
> */
> static inline struct page_frag *sk_page_frag(struct sock *sk)
> {
> - if (sk->sk_allocation & __GFP_WAIT)
> + if (sk->sk_allocation & __GFP_DIRECT_RECLAIM)
> return &current->task_frag;
>
> return &sk->sk_frag;
> diff --git a/include/trace/events/gfpflags.h b/include/trace/events/gfpflags.h
> index d6fd8e5b14b7..dde6bf092c8a 100644
> --- a/include/trace/events/gfpflags.h
> +++ b/include/trace/events/gfpflags.h
> @@ -20,7 +20,7 @@
> {(unsigned long)GFP_ATOMIC, "GFP_ATOMIC"}, \
> {(unsigned long)GFP_NOIO, "GFP_NOIO"}, \
> {(unsigned long)__GFP_HIGH, "GFP_HIGH"}, \
> - {(unsigned long)__GFP_WAIT, "GFP_WAIT"}, \
> + {(unsigned long)__GFP_ATOMIC, "GFP_ATOMIC"}, \
> {(unsigned long)__GFP_IO, "GFP_IO"}, \
> {(unsigned long)__GFP_COLD, "GFP_COLD"}, \
> {(unsigned long)__GFP_NOWARN, "GFP_NOWARN"}, \
> @@ -36,7 +36,8 @@
> {(unsigned long)__GFP_RECLAIMABLE, "GFP_RECLAIMABLE"}, \
> {(unsigned long)__GFP_MOVABLE, "GFP_MOVABLE"}, \
> {(unsigned long)__GFP_NOTRACK, "GFP_NOTRACK"}, \
> - {(unsigned long)__GFP_NO_KSWAPD, "GFP_NO_KSWAPD"}, \
> + {(unsigned long)__GFP_DIRECT_RECLAIM, "GFP_DIRECT_RECLAIM"}, \
> + {(unsigned long)__GFP_KSWAPD_RECLAIM, "GFP_KSWAPD_RECLAIM"}, \
> {(unsigned long)__GFP_OTHER_NODE, "GFP_OTHER_NODE"} \
> ) : "GFP_NOWAIT"
>
> diff --git a/kernel/audit.c b/kernel/audit.c
> index f9e6065346db..6ab7a55dbdff 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -1357,16 +1357,16 @@ struct audit_buffer *audit_log_start(struct audit_context *ctx, gfp_t gfp_mask,
> if (unlikely(audit_filter_type(type)))
> return NULL;
>
> - if (gfp_mask & __GFP_WAIT) {
> + if (gfp_mask & __GFP_DIRECT_RECLAIM) {
> if (audit_pid && audit_pid == current->pid)
> - gfp_mask &= ~__GFP_WAIT;
> + gfp_mask &= ~__GFP_DIRECT_RECLAIM;
> else
> reserve = 0;
> }
>
> while (audit_backlog_limit
> && skb_queue_len(&audit_skb_queue) > audit_backlog_limit + reserve) {
> - if (gfp_mask & __GFP_WAIT && audit_backlog_wait_time) {
> + if (gfp_mask & __GFP_DIRECT_RECLAIM && audit_backlog_wait_time) {
> long sleep_time;
>
> sleep_time = timeout_start + audit_backlog_wait_time - jiffies;
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index 8acfbf773e06..9aa39f20f593 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -2738,7 +2738,7 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
> return;
>
> /* no reclaim without waiting on it */
> - if (!(gfp_mask & __GFP_WAIT))
> + if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
> return;
>
> /* this guy won't enter reclaim */
> diff --git a/kernel/smp.c b/kernel/smp.c
> index 07854477c164..32ee47f6ac11 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -669,7 +669,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
> cpumask_var_t cpus;
> int cpu, ret;
>
> - might_sleep_if(gfp_flags & __GFP_WAIT);
> + might_sleep_if(gfp_flags & __GFP_DIRECT_RECLAIM);
>
> if (likely(zalloc_cpumask_var(&cpus, (gfp_flags|__GFP_NOWARN)))) {
> preempt_disable();
> diff --git a/lib/idr.c b/lib/idr.c
> index 5335c43adf46..e5118fc82961 100644
> --- a/lib/idr.c
> +++ b/lib/idr.c
> @@ -399,7 +399,7 @@ void idr_preload(gfp_t gfp_mask)
> * allocation guarantee. Disallow usage from those contexts.
> */
> WARN_ON_ONCE(in_interrupt());
> - might_sleep_if(gfp_mask & __GFP_WAIT);
> + might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
>
> preempt_disable();
>
> @@ -453,7 +453,7 @@ int idr_alloc(struct idr *idr, void *ptr, int start, int end, gfp_t gfp_mask)
> struct idr_layer *pa[MAX_IDR_LEVEL + 1];
> int id;
>
> - might_sleep_if(gfp_mask & __GFP_WAIT);
> + might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
>
> /* sanity checks */
> if (WARN_ON_ONCE(start < 0))
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index f9ebe1c82060..cc5fdc3fb734 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -188,7 +188,7 @@ radix_tree_node_alloc(struct radix_tree_root *root)
> * preloading in the interrupt anyway as all the allocations have to
> * be atomic. So just do normal allocation when in interrupt.
> */
> - if (!(gfp_mask & __GFP_WAIT) && !in_interrupt()) {
> + if (!(gfp_mask & __GFP_DIRECT_RECLAIM) && !in_interrupt()) {
> struct radix_tree_preload *rtp;
>
> /*
> @@ -249,7 +249,7 @@ radix_tree_node_free(struct radix_tree_node *node)
> * with preemption not disabled.
> *
> * To make use of this facility, the radix tree must be initialised without
> - * __GFP_WAIT being passed to INIT_RADIX_TREE().
> + * __GFP_DIRECT_RECLAIM being passed to INIT_RADIX_TREE().
> */
> static int __radix_tree_preload(gfp_t gfp_mask)
> {
> @@ -286,12 +286,12 @@ static int __radix_tree_preload(gfp_t gfp_mask)
> * with preemption not disabled.
> *
> * To make use of this facility, the radix tree must be initialised without
> - * __GFP_WAIT being passed to INIT_RADIX_TREE().
> + * __GFP_DIRECT_RECLAIM being passed to INIT_RADIX_TREE().
> */
> int radix_tree_preload(gfp_t gfp_mask)
> {
> /* Warn on non-sensical use... */
> - WARN_ON_ONCE(!(gfp_mask & __GFP_WAIT));
> + WARN_ON_ONCE(!(gfp_mask & __GFP_DIRECT_RECLAIM));
> return __radix_tree_preload(gfp_mask);
> }
> EXPORT_SYMBOL(radix_tree_preload);
> @@ -303,7 +303,7 @@ EXPORT_SYMBOL(radix_tree_preload);
> */
> int radix_tree_maybe_preload(gfp_t gfp_mask)
> {
> - if (gfp_mask & __GFP_WAIT)
> + if (gfp_mask & __GFP_DIRECT_RECLAIM)
> return __radix_tree_preload(gfp_mask);
> /* Preloading doesn't help anything with this gfp mask, skip it */
> preempt_disable();
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index dac5bf59309d..2056d16807de 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -632,7 +632,7 @@ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
> {
> struct bdi_writeback *wb;
>
> - might_sleep_if(gfp & __GFP_WAIT);
> + might_sleep_if(gfp & __GFP_DIRECT_RECLAIM);
>
> if (!memcg_css->parent)
> return &bdi->wb;
> diff --git a/mm/dmapool.c b/mm/dmapool.c
> index fd5fe4342e93..248f6e864a92 100644
> --- a/mm/dmapool.c
> +++ b/mm/dmapool.c
> @@ -323,7 +323,7 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
> size_t offset;
> void *retval;
>
> - might_sleep_if(mem_flags & __GFP_WAIT);
> + might_sleep_if(gfpflags_allows_blocking(mem_flags));
>
> spin_lock_irqsave(&pool->lock, flags);
> list_for_each_entry(page, &pool->page_list, page_list) {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index acb93c554f6e..7155a556a8d4 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2268,7 +2268,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> if (unlikely(task_in_memcg_oom(current)))
> goto nomem;
>
> - if (!(gfp_mask & __GFP_WAIT))
> + if (!gfpflags_allows_blocking(gfp_mask))
> goto nomem;
>
> mem_cgroup_events(mem_over_limit, MEMCG_MAX, 1);
> @@ -2327,7 +2327,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> css_get_many(&memcg->css, batch);
> if (batch > nr_pages)
> refill_stock(memcg, batch - nr_pages);
> - if (!(gfp_mask & __GFP_WAIT))
> + if (!gfpflags_allows_blocking(gfp_mask))
> goto done;
> /*
> * If the hierarchy is above the normal consumption range,
> @@ -4696,8 +4696,8 @@ static int mem_cgroup_do_precharge(unsigned long count)
> {
> int ret;
>
> - /* Try a single bulk charge without reclaim first */
> - ret = try_charge(mc.to, GFP_KERNEL & ~__GFP_WAIT, count);
> + /* Try a single bulk charge without reclaim first, kswapd may wake */
> + ret = try_charge(mc.to, GFP_KERNEL & ~__GFP_DIRECT_RECLAIM, count);
> if (!ret) {
> mc.precharge += count;
> return ret;
> diff --git a/mm/mempool.c b/mm/mempool.c
> index 2cc08de8b1db..bfd2a0dd0e18 100644
> --- a/mm/mempool.c
> +++ b/mm/mempool.c
> @@ -317,13 +317,13 @@ void * mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
> gfp_t gfp_temp;
>
> VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO);
> - might_sleep_if(gfp_mask & __GFP_WAIT);
> + might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
>
> gfp_mask |= __GFP_NOMEMALLOC; /* don't allocate emergency reserves */
> gfp_mask |= __GFP_NORETRY; /* don't loop in __alloc_pages */
> gfp_mask |= __GFP_NOWARN; /* failures are OK */
>
> - gfp_temp = gfp_mask & ~(__GFP_WAIT|__GFP_IO);
> + gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
>
> repeat_alloc:
>
> @@ -346,7 +346,7 @@ void * mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
> }
>
> /*
> - * We use gfp mask w/o __GFP_WAIT or IO for the first round. If
> + * We use gfp mask w/o direct reclaim or IO for the first round. If
> * alloc failed with that and @pool was empty, retry immediately.
> */
> if (gfp_temp != gfp_mask) {
> @@ -355,8 +355,8 @@ void * mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
> goto repeat_alloc;
> }
>
> - /* We must not sleep if !__GFP_WAIT */
> - if (!(gfp_mask & __GFP_WAIT)) {
> + /* We must not sleep if !__GFP_DIRECT_RECLAIM */
> + if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) {
> spin_unlock_irqrestore(&pool->lock, flags);
> return NULL;
> }
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 94f2f6bdd6d5..ccd235d02923 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2143,7 +2143,7 @@ static bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
> return false;
> if (fail_page_alloc.ignore_gfp_highmem && (gfp_mask & __GFP_HIGHMEM))
> return false;
> - if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & __GFP_WAIT))
> + if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
> return false;
>
> return should_fail(&fail_page_alloc.attr, 1 << order);
> @@ -2459,7 +2459,7 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
> if (test_thread_flag(TIF_MEMDIE) ||
> (current->flags & (PF_MEMALLOC | PF_EXITING)))
> filter &= ~SHOW_MEM_FILTER_NODES;
> - if (in_interrupt() || !(gfp_mask & __GFP_WAIT))
> + if (in_interrupt() || !(gfp_mask & __GFP_WAIT) || (gfp_mask & __GFP_ATOMIC))
> filter &= ~SHOW_MEM_FILTER_NODES;
>
> if (fmt) {
> @@ -2710,7 +2710,6 @@ static inline int
> gfp_to_alloc_flags(gfp_t gfp_mask)
> {
> int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
> - const bool atomic = !(gfp_mask & (__GFP_WAIT | __GFP_NO_KSWAPD));
>
> /* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
> BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
> @@ -2719,11 +2718,11 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> * The caller may dip into page reserves a bit more if the caller
> * cannot run direct reclaim, or if the caller has realtime scheduling
> * policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will
> - * set both ALLOC_HARDER (atomic == true) and ALLOC_HIGH (__GFP_HIGH).
> + * set both ALLOC_HARDER (__GFP_ATOMIC) and ALLOC_HIGH (__GFP_HIGH).
> */
> alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);
>
> - if (atomic) {
> + if (gfp_mask & __GFP_ATOMIC) {
> /*
> * Not worth trying to allocate harder for __GFP_NOMEMALLOC even
> * if it can't schedule.
> @@ -2764,7 +2763,7 @@ static inline struct page *
> __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> struct alloc_context *ac)
> {
> - const gfp_t wait = gfp_mask & __GFP_WAIT;
> + bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
> struct page *page = NULL;
> int alloc_flags;
> unsigned long pages_reclaimed = 0;
> @@ -2785,15 +2784,23 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> }
>
> /*
> + * We also sanity check to catch abuse of atomic reserves being used by
> + * callers that are not in atomic context.
> + */
> + if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) ==
> + (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
> + gfp_mask &= ~__GFP_ATOMIC;
> +
> + /*
> * If this allocation cannot block and it is for a specific node, then
> * fail early. There's no need to wakeup kswapd or retry for a
> * speculative node-specific allocation.
> */
> - if (IS_ENABLED(CONFIG_NUMA) && (gfp_mask & __GFP_THISNODE) && !wait)
> + if (IS_ENABLED(CONFIG_NUMA) && (gfp_mask & __GFP_THISNODE) && !can_direct_reclaim)
> goto nopage;
>
> retry:
> - if (!(gfp_mask & __GFP_NO_KSWAPD))
> + if (gfp_mask & __GFP_KSWAPD_RECLAIM)
> wake_all_kswapds(order, ac);
>
> /*
> @@ -2836,8 +2843,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> }
> }
>
> - /* Atomic allocations - we can't balance anything */
> - if (!wait) {
> + /* Caller is not willing to reclaim, we can't balance anything */
> + if (!can_direct_reclaim) {
> /*
> * All existing users of the deprecated __GFP_NOFAIL are
> * blockable, so warn of any new users that actually allow this
> @@ -2974,7 +2981,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>
> lockdep_trace_alloc(gfp_mask);
>
> - might_sleep_if(gfp_mask & __GFP_WAIT);
> + might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
>
> if (should_fail_alloc_page(gfp_mask, order))
> return NULL;
> diff --git a/mm/slab.c b/mm/slab.c
> index 200e22412a16..a7bcbcc5692e 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -1030,12 +1030,12 @@ static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
> }
>
> /*
> - * Construct gfp mask to allocate from a specific node but do not invoke reclaim
> - * or warn about failures.
> + * Construct gfp mask to allocate from a specific node but do not direct reclaim
> + * or warn about failures. kswapd may still wake to reclaim in the background.
> */
> static inline gfp_t gfp_exact_node(gfp_t flags)
> {
> - return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT;
> + return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_DIRECT_RECLAIM;
> }
> #endif
>
> @@ -2625,7 +2625,7 @@ static int cache_grow(struct kmem_cache *cachep,
>
> offset *= cachep->colour_off;
>
> - if (local_flags & __GFP_WAIT)
> + if (local_flags & __GFP_DIRECT_RECLAIM)
> local_irq_enable();
>
> /*
> @@ -2655,7 +2655,7 @@ static int cache_grow(struct kmem_cache *cachep,
>
> cache_init_objs(cachep, page);
>
> - if (local_flags & __GFP_WAIT)
> + if (local_flags & __GFP_DIRECT_RECLAIM)
> local_irq_disable();
> check_irq_off();
> spin_lock(&n->list_lock);
> @@ -2669,7 +2669,7 @@ static int cache_grow(struct kmem_cache *cachep,
> opps1:
> kmem_freepages(cachep, page);
> failed:
> - if (local_flags & __GFP_WAIT)
> + if (local_flags & __GFP_DIRECT_RECLAIM)
> local_irq_disable();
> return 0;
> }
> @@ -2861,7 +2861,7 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags,
> static inline void cache_alloc_debugcheck_before(struct kmem_cache *cachep,
> gfp_t flags)
> {
> - might_sleep_if(flags & __GFP_WAIT);
> + might_sleep_if(flags & __GFP_DIRECT_RECLAIM);
> #if DEBUG
> kmem_flagcheck(cachep, flags);
> #endif
> @@ -3049,11 +3049,11 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
> */
> struct page *page;
>
> - if (local_flags & __GFP_WAIT)
> + if (local_flags & __GFP_DIRECT_RECLAIM)
> local_irq_enable();
> kmem_flagcheck(cache, flags);
> page = kmem_getpages(cache, local_flags, numa_mem_id());
> - if (local_flags & __GFP_WAIT)
> + if (local_flags & __GFP_DIRECT_RECLAIM)
> local_irq_disable();
> if (page) {
> /*
> diff --git a/mm/slub.c b/mm/slub.c
> index 816df0016555..b658a66ffce4 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1263,7 +1263,7 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
> {
> flags &= gfp_allowed_mask;
> lockdep_trace_alloc(flags);
> - might_sleep_if(flags & __GFP_WAIT);
> + might_sleep_if(gfpflags_allows_blocking(flags));
>
> if (should_failslab(s->object_size, flags, s->flags))
> return NULL;
> @@ -1339,7 +1339,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
>
> flags &= gfp_allowed_mask;
>
> - if (flags & __GFP_WAIT)
> + if (flags & __GFP_DIRECT_RECLAIM)
> local_irq_enable();
>
> flags |= s->allocflags;
> @@ -1380,7 +1380,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
> kmemcheck_mark_unallocated_pages(page, pages);
> }
>
> - if (flags & __GFP_WAIT)
> + if (flags & __GFP_DIRECT_RECLAIM)
> local_irq_disable();
> if (!page)
> return NULL;
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 2faaa2976447..c6ce91b20c91 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1617,7 +1617,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> goto fail;
> }
> area->pages[i] = page;
> - if (gfp_mask & __GFP_WAIT)
> + if (gfpflags_allows_blocking(gfp_mask))
> cond_resched();
> }
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f1d8eae285f2..bc9ab358b77a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3768,7 +3768,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> /*
> * Do not scan if the allocation should not be delayed.
> */
> - if (!(gfp_mask & __GFP_WAIT) || (current->flags & PF_MEMALLOC))
> + if (!gfpflags_allows_blocking(gfp_mask) || (current->flags & PF_MEMALLOC))
> return ZONE_RECLAIM_NOSCAN;
>
> /*
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index b6a19ca0f99e..6f025e2544de 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -414,7 +414,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len,
> len += NET_SKB_PAD;
>
> if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
> - (gfp_mask & (__GFP_WAIT | GFP_DMA))) {
> + (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
> skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
> if (!skb)
> goto skb_fail;
> @@ -481,7 +481,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
> len += NET_SKB_PAD + NET_IP_ALIGN;
>
> if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
> - (gfp_mask & (__GFP_WAIT | GFP_DMA))) {
> + (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
> skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
> if (!skb)
> goto skb_fail;
> @@ -4452,7 +4452,7 @@ struct sk_buff *alloc_skb_with_frags(unsigned long header_len,
> return NULL;
>
> gfp_head = gfp_mask;
> - if (gfp_head & __GFP_WAIT)
> + if (gfp_head & __GFP_DIRECT_RECLAIM)
> gfp_head |= __GFP_REPEAT;
>
> *errcode = -ENOBUFS;
> @@ -4467,7 +4467,7 @@ struct sk_buff *alloc_skb_with_frags(unsigned long header_len,
>
> while (order) {
> if (npages >= 1 << order) {
> - page = alloc_pages((gfp_mask & ~__GFP_WAIT) |
> + page = alloc_pages((gfp_mask & ~__GFP_DIRECT_RECLAIM) |
> __GFP_COMP |
> __GFP_NOWARN |
> __GFP_NORETRY,
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 193901d09757..02b705cc9eb3 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1879,8 +1879,10 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp)
>
> pfrag->offset = 0;
> if (SKB_FRAG_PAGE_ORDER) {
> - pfrag->page = alloc_pages((gfp & ~__GFP_WAIT) | __GFP_COMP |
> - __GFP_NOWARN | __GFP_NORETRY,
> + /* Avoid direct reclaim but allow kswapd to wake */
> + pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
> + __GFP_COMP | __GFP_NOWARN |
> + __GFP_NORETRY,
> SKB_FRAG_PAGE_ORDER);
> if (likely(pfrag->page)) {
> pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
> diff --git a/net/sctp/associola.c b/net/sctp/associola.c
> index 197c3f59ecbf..c5fcdd6f85b7 100644
> --- a/net/sctp/associola.c
> +++ b/net/sctp/associola.c
> @@ -1588,7 +1588,7 @@ int sctp_assoc_lookup_laddr(struct sctp_association *asoc,
> /* Set an association id for a given association */
> int sctp_assoc_set_id(struct sctp_association *asoc, gfp_t gfp)
> {
> - bool preload = !!(gfp & __GFP_WAIT);
> + bool preload = !!(gfp & __GFP_DIRECT_RECLAIM);
> int ret;
>
> /* If the id is already assigned, keep it. */
> --
> 2.4.6
>

--
Michal Hocko
SUSE Labs

Subject: Re: [PATCH 05/10] mm, page_alloc: Use masks and shifts when converting GFP flags to migrate types

On Wed, 12 Aug 2015, Mel Gorman wrote:

> @@ -149,14 +150,15 @@ struct vm_area_struct;
> /* Convert GFP flags to their corresponding migrate type */
> static inline int gfpflags_to_migratetype(const gfp_t gfp_flags)
> {
> - WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
> + VM_WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
> + BUILD_BUG_ON(1UL << GFP_MOVABLE_SHIFT != ___GFP_MOVABLE);
> + BUILD_BUG_ON(___GFP_MOVABLE >> GFP_MOVABLE_SHIFT != MIGRATE_MOVABLE);

Add some parentheses here. Difficult to read. Compiler takes this as is?
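
Something like this (illustrative only) would be easier on the eyes:

	BUILD_BUG_ON((1UL << GFP_MOVABLE_SHIFT) != ___GFP_MOVABLE);
	BUILD_BUG_ON((___GFP_MOVABLE >> GFP_MOVABLE_SHIFT) != MIGRATE_MOVABLE);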

2015-08-13 00:17:15

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 04/10] mm, page_alloc: Remove unnecessary taking of a seqlock when cpusets are disabled

On Wed, 12 Aug 2015, Mel Gorman wrote:

> There is a seqcounter that protects against spurious allocation failures
> when a task is changing the allowed nodes in a cpuset. There is no need
> to check the seqcounter until a cpuset exists.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Acked-by: David Rientjes <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> ---
> include/linux/cpuset.h | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index 1b357997cac5..6eb27cb480b7 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -104,6 +104,9 @@ extern void cpuset_print_task_mems_allowed(struct task_struct *p);
> */
> static inline unsigned int read_mems_allowed_begin(void)
> {
> + if (!cpusets_enabled())
> + return 0;
> +
> return read_seqcount_begin(&current->mems_allowed_seq);
> }
>
> @@ -115,6 +118,9 @@ static inline unsigned int read_mems_allowed_begin(void)
> */
> static inline bool read_mems_allowed_retry(unsigned int seq)
> {
> + if (!cpusets_enabled())
> + return false;
> +
> return read_seqcount_retry(&current->mems_allowed_seq, seq);
> }
>

This patch is an obvious improvement, but I think it's also possible to
change this to be

if (nr_cpusets() <= 1)
return false;

and likewise in the existing cpusets_enabled() check in
get_page_from_freelist(). A root cpuset may not exclude mems on the
system so, even if mounted, there's no need to check or be worried about
concurrent change when there is only one cpuset.
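
For illustration only, the fast path check in get_page_from_freelist()
would then read something like this (untested, based on the existing
cpuset_zone_allowed() check):

	if (nr_cpusets() > 1 &&
	    (alloc_flags & ALLOC_CPUSET) &&
	    !cpuset_zone_allowed(zone, gfp_mask))
		continue;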

2015-08-17 11:58:53

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 04/10] mm, page_alloc: Remove unnecessary taking of a seqlock when cpusets are disabled

On Wed, Aug 12, 2015 at 05:16:50PM -0700, David Rientjes wrote:
> On Wed, 12 Aug 2015, Mel Gorman wrote:
>
> > There is a seqcounter that protects against spurious allocation failures
> > when a task is changing the allowed nodes in a cpuset. There is no need
> > to check the seqcounter until a cpuset exists.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > Acked-by: David Rientjes <[email protected]>
> > Acked-by: Vlastimil Babka <[email protected]>
> > ---
> > include/linux/cpuset.h | 6 ++++++
> > 1 file changed, 6 insertions(+)
> >
> > diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> > index 1b357997cac5..6eb27cb480b7 100644
> > --- a/include/linux/cpuset.h
> > +++ b/include/linux/cpuset.h
> > @@ -104,6 +104,9 @@ extern void cpuset_print_task_mems_allowed(struct task_struct *p);
> > */
> > static inline unsigned int read_mems_allowed_begin(void)
> > {
> > + if (!cpusets_enabled())
> > + return 0;
> > +
> > return read_seqcount_begin(&current->mems_allowed_seq);
> > }
> >
> > @@ -115,6 +118,9 @@ static inline unsigned int read_mems_allowed_begin(void)
> > */
> > static inline bool read_mems_allowed_retry(unsigned int seq)
> > {
> > + if (!cpusets_enabled())
> > + return false;
> > +
> > return read_seqcount_retry(&current->mems_allowed_seq, seq);
> > }
> >
>
> This patch is an obvious improvement, but I think it's also possible to
> change this to be
>
> if (nr_cpusets() <= 1)
> return false;
>
> and likewise in the existing cpusets_enabled() check in
> get_page_from_freelist(). A root cpuset may not exclude mems on the
> system so, even if mounted, there's no need to check or be worried about
> concurrent change when there is only one cpuset.

Good idea. I'll make this a separate patch on top and rename cpusets_enabled
to cpuset_mems_enabled to be clear about what it's checking.
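
As a sketch only (untested, and the final helper may end up looking
different):

	static inline bool cpuset_mems_enabled(void)
	{
		return nr_cpusets() > 1;
	}

	static inline bool read_mems_allowed_retry(unsigned int seq)
	{
		if (!cpuset_mems_enabled())
			return false;

		return read_seqcount_retry(&current->mems_allowed_seq, seq);
	}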

Thanks.

--
Mel Gorman
SUSE Labs

2015-08-19 14:44:45

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 06/10] mm: page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd

On 08/12/2015 12:45 PM, Mel Gorman wrote:
> __GFP_WAIT has been used to identify atomic context in callers that hold
> spinlocks or are in interrupts. They are expected to be high priority and
> have access to one of two watermarks lower than "min". __GFP_HIGH users get
> access to the first lower watermark and can be called the "high priority
> reserve". Atomic users and interrupts access yet another lower watermark
> that can be called the "atomic reserve".
>
> Over time, callers had a requirement to not block when fallback options
> were available. Some have abused __GFP_WAIT leading to a situation where
> an optimistic allocation with a fallback option can access atomic reserves.
>
> This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
> cannot sleep and have no alternative. High priority users continue to use
> __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are
> willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies callers
> that want to wake kswapd for background reclaim. __GFP_WAIT is redefined
> as a caller that is willing to enter direct reclaim and wake kswapd for
> background reclaim.
>
> This patch then converts a number of sites
>
> o __GFP_ATOMIC is used by callers that are high priority and have memory
> pools for those requests. GFP_ATOMIC uses this flag. Callers with
> interrupts disabled still automatically use the atomic reserves.
>
> o Callers that have a limited mempool to guarantee forward progress use
> __GFP_DIRECT_RECLAIM. bio allocations fall into this category where
> kswapd will still be woken but atomic reserves are not used as there
> is a one-entry mempool to guarantee progress.
>
> o Callers that are checking if they are non-blocking should use the
> helper gfpflags_allows_blocking() where possible. This is because
> checking for __GFP_WAIT, as was done historically, can now trigger false
> positives. Some exceptions like dm-crypt.c exist where the code intent
> is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
> flag manipulations.
>
> The key hazard to watch out for is callers that removed __GFP_WAIT and
> were depending on access to atomic reserves for inconspicuous reasons.
> In some cases it may be appropriate for them to use __GFP_HIGH.
>
> Signed-off-by: Mel Gorman <[email protected]>

I like the approach, it makes things much clearer. Note that this isn't full
review, just something that crossed my mind.

> @@ -117,9 +132,9 @@ struct vm_area_struct;
> #define GFP_HIGHUSER (GFP_USER | __GFP_HIGHMEM)
> #define GFP_HIGHUSER_MOVABLE (GFP_HIGHUSER | __GFP_MOVABLE)
> #define GFP_IOFS (__GFP_IO | __GFP_FS)
> -#define GFP_TRANSHUGE (GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
> - __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \
> - __GFP_NO_KSWAPD)
> +#define GFP_TRANSHUGE ((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
> + __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \
> + ~__GFP_KSWAPD_RECLAIM)

Unfortunately this is not as simple for all uses of GFP_TRANSHUGE.
Namely, the checks in __alloc_pages_slowpath() could previously rely on
__GFP_NO_KSWAPD as one of the distinguishing flags; to test for the lack of
__GFP_KSWAPD_RECLAIM instead, they have to be adjusted to stay functionally
equivalent.
Yes, it would be better if we could get rid of them, but that's out of scope
here. So, something like this?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index beda417..e019a89 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3073,7 +3073,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
goto got_pg;

/* Checks for THP-specific high-order allocations */
- if ((gfp_mask & GFP_TRANSHUGE) == GFP_TRANSHUGE) {
+ if ((gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM))
+ == GFP_TRANSHUGE) {
/*
* If compaction is deferred for high-order allocations, it is
* because sync compaction recently failed. If this is the case
@@ -3108,8 +3109,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
* fault, so use asynchronous memory compaction for THP unless it is
* khugepaged trying to collapse.
*/
- if ((gfp_mask & GFP_TRANSHUGE) != GFP_TRANSHUGE ||
- (current->flags & PF_KTHREAD))
+ if ((gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) !=
+ GFP_TRANSHUGE || (current->flags & PF_KTHREAD))
migration_mode = MIGRATE_SYNC_LIGHT;

/* Try direct reclaim and then allocating */


2015-08-20 09:15:08

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 06/10] mm: page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd

On Wed, Aug 19, 2015 at 04:44:40PM +0200, Vlastimil Babka wrote:
>
> Unfortunately this is not as simple for all uses of GFP_TRANSHUGE.
> Namely in __alloc_pages_slowpath() the checks could use __GFP_NO_KSWAPD as one
> of the distinguishing flags, but to test for lack of __GFP_KSWAPD_RECLAIM, they
> should be adjusted in order to be functionally equivalent.
> Yes, it would be better if we could get rid of them, but that's out of scope
> here. So, something like this?
>

Nicely spotted. The only modification I made was to add a helper because
the flags trick is sufficiently complex. That results in this:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9617e79d6931..0f92d4d42e2e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2774,6 +2774,11 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS);
}

+static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
+{
+ return (gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) == GFP_TRANSHUGE;
+}
+
static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
struct alloc_context *ac)
@@ -2889,7 +2894,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
goto got_pg;

/* Checks for THP-specific high-order allocations */
- if ((gfp_mask & GFP_TRANSHUGE) == GFP_TRANSHUGE) {
+ if (is_thp_gfp_mask(gfp_mask)) {
/*
* If compaction is deferred for high-order allocations, it is
* because sync compaction recently failed. If this is the case
@@ -2924,8 +2929,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
* fault, so use asynchronous memory compaction for THP unless it is
* khugepaged trying to collapse.
*/
- if ((gfp_mask & GFP_TRANSHUGE) != GFP_TRANSHUGE ||
- (current->flags & PF_KTHREAD))
+ if (!is_thp_gfp_mask(gfp_mask) || (current->flags & PF_KTHREAD))
migration_mode = MIGRATE_SYNC_LIGHT;

/* Try direct reclaim and then allocating */
--
Mel Gorman
SUSE Labs

2015-08-20 12:29:05

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 07/10] mm: page_alloc: Rename __GFP_WAIT to __GFP_RECLAIM

On Wed 12-08-15 11:45:32, Mel Gorman wrote:
> __GFP_WAIT was used to signal that the caller was in atomic context and
> could not sleep. Now it is possible to distinguish between true atomic
> context and callers that are not willing to sleep. The latter should clear
> __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing __GFP_WAIT
> behaves differently, there is a risk that people will clear the wrong
> flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly indicate
> what it does -- setting it allows all reclaim activity, clearing it
> prevents any.
>
> Signed-off-by: Mel Gorman <[email protected]>

I haven't checked all the converted places too deeply but they look
straightforward.

Acked-by: Michal Hocko <[email protected]>
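
As an aside, a minimal illustration of the distinction the rename
protects against (the caller here is hypothetical):

	/* forbid all reclaim: no direct reclaim, no kswapd wakeup */
	gfp_t gfp_none = GFP_KERNEL & ~__GFP_RECLAIM;

	/* do not block, but kswapd may still be woken in the background */
	gfp_t gfp_bg = GFP_KERNEL & ~__GFP_DIRECT_RECLAIM;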

> ---
> block/blk-mq.c | 2 +-
> block/scsi_ioctl.c | 6 +++---
> drivers/block/drbd/drbd_bitmap.c | 2 +-
> drivers/block/drbd/drbd_receiver.c | 2 +-
> drivers/block/mtip32xx/mtip32xx.c | 2 +-
> drivers/block/nvme-core.c | 4 ++--
> drivers/block/paride/pd.c | 2 +-
> drivers/block/pktcdvd.c | 4 ++--
> drivers/gpu/drm/i915/i915_gem.c | 2 +-
> drivers/ide/ide-atapi.c | 2 +-
> drivers/ide/ide-cd.c | 2 +-
> drivers/ide/ide-cd_ioctl.c | 2 +-
> drivers/ide/ide-devsets.c | 2 +-
> drivers/ide/ide-disk.c | 2 +-
> drivers/ide/ide-ioctls.c | 4 ++--
> drivers/ide/ide-park.c | 2 +-
> drivers/ide/ide-pm.c | 4 ++--
> drivers/ide/ide-tape.c | 4 ++--
> drivers/ide/ide-taskfile.c | 4 ++--
> drivers/infiniband/hw/ipath/ipath_file_ops.c | 2 +-
> drivers/infiniband/hw/qib/qib_init.c | 2 +-
> drivers/misc/vmw_balloon.c | 2 +-
> drivers/scsi/scsi_error.c | 2 +-
> drivers/scsi/scsi_lib.c | 4 ++--
> .../staging/lustre/include/linux/libcfs/libcfs_private.h | 2 +-
> fs/btrfs/extent_io.c | 6 +++---
> fs/cachefiles/internal.h | 2 +-
> fs/direct-io.c | 2 +-
> fs/nilfs2/mdt.h | 2 +-
> include/linux/gfp.h | 16 ++++++++--------
> kernel/power/swap.c | 14 +++++++-------
> lib/percpu_ida.c | 2 +-
> mm/failslab.c | 8 ++++----
> mm/filemap.c | 2 +-
> mm/huge_memory.c | 2 +-
> mm/migrate.c | 2 +-
> mm/page_alloc.c | 10 +++++-----
> net/netlink/af_netlink.c | 2 +-
> net/rxrpc/ar-connection.c | 2 +-
> security/integrity/ima/ima_crypto.c | 2 +-
> 40 files changed, 71 insertions(+), 71 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index df8cba632ec2..873c7b4d14ec 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1221,7 +1221,7 @@ static struct request *blk_mq_map_request(struct request_queue *q,
> ctx = blk_mq_get_ctx(q);
> hctx = q->mq_ops->map_queue(q, ctx->cpu);
> blk_mq_set_alloc_data(&alloc_data, q,
> - __GFP_WAIT|__GFP_HIGH, false, ctx, hctx);
> + __GFP_RECLAIM|__GFP_HIGH, false, ctx, hctx);
> rq = __blk_mq_alloc_request(&alloc_data, rw);
> ctx = alloc_data.ctx;
> hctx = alloc_data.hctx;
> diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
> index dda653ce7b24..0774799942e0 100644
> --- a/block/scsi_ioctl.c
> +++ b/block/scsi_ioctl.c
> @@ -444,7 +444,7 @@ int sg_scsi_ioctl(struct request_queue *q, struct gendisk *disk, fmode_t mode,
>
> }
>
> - rq = blk_get_request(q, in_len ? WRITE : READ, __GFP_WAIT);
> + rq = blk_get_request(q, in_len ? WRITE : READ, __GFP_RECLAIM);
> if (IS_ERR(rq)) {
> err = PTR_ERR(rq);
> goto error_free_buffer;
> @@ -495,7 +495,7 @@ int sg_scsi_ioctl(struct request_queue *q, struct gendisk *disk, fmode_t mode,
> break;
> }
>
> - if (bytes && blk_rq_map_kern(q, rq, buffer, bytes, __GFP_WAIT)) {
> + if (bytes && blk_rq_map_kern(q, rq, buffer, bytes, __GFP_RECLAIM)) {
> err = DRIVER_ERROR << 24;
> goto error;
> }
> @@ -536,7 +536,7 @@ static int __blk_send_generic(struct request_queue *q, struct gendisk *bd_disk,
> struct request *rq;
> int err;
>
> - rq = blk_get_request(q, WRITE, __GFP_WAIT);
> + rq = blk_get_request(q, WRITE, __GFP_RECLAIM);
> if (IS_ERR(rq))
> return PTR_ERR(rq);
> blk_rq_set_block_pc(rq);
> diff --git a/drivers/block/drbd/drbd_bitmap.c b/drivers/block/drbd/drbd_bitmap.c
> index 434c77dcc99e..2940da0011e0 100644
> --- a/drivers/block/drbd/drbd_bitmap.c
> +++ b/drivers/block/drbd/drbd_bitmap.c
> @@ -1016,7 +1016,7 @@ static void bm_page_io_async(struct drbd_bm_aio_ctx *ctx, int page_nr) __must_ho
> bm_set_page_unchanged(b->bm_pages[page_nr]);
>
> if (ctx->flags & BM_AIO_COPY_PAGES) {
> - page = mempool_alloc(drbd_md_io_page_pool, __GFP_HIGHMEM|__GFP_WAIT);
> + page = mempool_alloc(drbd_md_io_page_pool, __GFP_HIGHMEM|__GFP_RECLAIM);
> copy_highpage(page, b->bm_pages[page_nr]);
> bm_store_page_idx(page, page_nr);
> } else
> diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
> index c097909c589c..1d2046e68808 100644
> --- a/drivers/block/drbd/drbd_receiver.c
> +++ b/drivers/block/drbd/drbd_receiver.c
> @@ -357,7 +357,7 @@ drbd_alloc_peer_req(struct drbd_peer_device *peer_device, u64 id, sector_t secto
> }
>
> if (has_payload && data_size) {
> - page = drbd_alloc_pages(peer_device, nr_pages, (gfp_mask & __GFP_WAIT));
> + page = drbd_alloc_pages(peer_device, nr_pages, (gfp_mask & __GFP_RECLAIM));
> if (!page)
> goto fail;
> }
> diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c
> index 4a2ef09e6704..a694b23cb8f9 100644
> --- a/drivers/block/mtip32xx/mtip32xx.c
> +++ b/drivers/block/mtip32xx/mtip32xx.c
> @@ -173,7 +173,7 @@ static struct mtip_cmd *mtip_get_int_command(struct driver_data *dd)
> {
> struct request *rq;
>
> - rq = blk_mq_alloc_request(dd->queue, 0, __GFP_WAIT, true);
> + rq = blk_mq_alloc_request(dd->queue, 0, __GFP_RECLAIM, true);
> return blk_mq_rq_to_pdu(rq);
> }
>
> diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
> index 7920c2741b47..0a8b1682305f 100644
> --- a/drivers/block/nvme-core.c
> +++ b/drivers/block/nvme-core.c
> @@ -1033,11 +1033,11 @@ int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
> req->special = (void *)0;
>
> if (buffer && bufflen) {
> - ret = blk_rq_map_kern(q, req, buffer, bufflen, __GFP_WAIT);
> + ret = blk_rq_map_kern(q, req, buffer, bufflen, __GFP_RECLAIM);
> if (ret)
> goto out;
> } else if (ubuffer && bufflen) {
> - ret = blk_rq_map_user(q, req, NULL, ubuffer, bufflen, __GFP_WAIT);
> + ret = blk_rq_map_user(q, req, NULL, ubuffer, bufflen, __GFP_RECLAIM);
> if (ret)
> goto out;
> bio = req->bio;
> diff --git a/drivers/block/paride/pd.c b/drivers/block/paride/pd.c
> index b9242d78283d..562b5a4ca7b7 100644
> --- a/drivers/block/paride/pd.c
> +++ b/drivers/block/paride/pd.c
> @@ -723,7 +723,7 @@ static int pd_special_command(struct pd_unit *disk,
> struct request *rq;
> int err = 0;
>
> - rq = blk_get_request(disk->gd->queue, READ, __GFP_WAIT);
> + rq = blk_get_request(disk->gd->queue, READ, __GFP_RECLAIM);
> if (IS_ERR(rq))
> return PTR_ERR(rq);
>
> diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
> index 4c20c228184c..e372a5f08847 100644
> --- a/drivers/block/pktcdvd.c
> +++ b/drivers/block/pktcdvd.c
> @@ -704,14 +704,14 @@ static int pkt_generic_packet(struct pktcdvd_device *pd, struct packet_command *
> int ret = 0;
>
> rq = blk_get_request(q, (cgc->data_direction == CGC_DATA_WRITE) ?
> - WRITE : READ, __GFP_WAIT);
> + WRITE : READ, __GFP_RECLAIM);
> if (IS_ERR(rq))
> return PTR_ERR(rq);
> blk_rq_set_block_pc(rq);
>
> if (cgc->buflen) {
> ret = blk_rq_map_kern(q, rq, cgc->buffer, cgc->buflen,
> - __GFP_WAIT);
> + __GFP_RECLAIM);
> if (ret)
> goto out;
> }
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index c2b45081c5ab..2ca8638c5b81 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2226,7 +2226,7 @@ i915_gem_object_get_pages_gtt(struct drm_i915_gem_object *obj)
> mapping = file_inode(obj->base.filp)->i_mapping;
> gfp = mapping_gfp_mask(mapping);
> gfp |= __GFP_NORETRY | __GFP_NOWARN;
> - gfp &= ~(__GFP_IO | __GFP_WAIT);
> + gfp &= ~(__GFP_IO | __GFP_RECLAIM);
> sg = st->sgl;
> st->nents = 0;
> for (i = 0; i < page_count; i++) {
> diff --git a/drivers/ide/ide-atapi.c b/drivers/ide/ide-atapi.c
> index 1362ad80a76c..05352f490d60 100644
> --- a/drivers/ide/ide-atapi.c
> +++ b/drivers/ide/ide-atapi.c
> @@ -92,7 +92,7 @@ int ide_queue_pc_tail(ide_drive_t *drive, struct gendisk *disk,
> struct request *rq;
> int error;
>
> - rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
> + rq = blk_get_request(drive->queue, READ, __GFP_RECLAIM);
> rq->cmd_type = REQ_TYPE_DRV_PRIV;
> rq->special = (char *)pc;
>
> diff --git a/drivers/ide/ide-cd.c b/drivers/ide/ide-cd.c
> index 64a6b827b3dd..ef907fd5ba98 100644
> --- a/drivers/ide/ide-cd.c
> +++ b/drivers/ide/ide-cd.c
> @@ -441,7 +441,7 @@ int ide_cd_queue_pc(ide_drive_t *drive, const unsigned char *cmd,
> struct request *rq;
> int error;
>
> - rq = blk_get_request(drive->queue, write, __GFP_WAIT);
> + rq = blk_get_request(drive->queue, write, __GFP_RECLAIM);
>
> memcpy(rq->cmd, cmd, BLK_MAX_CDB);
> rq->cmd_type = REQ_TYPE_ATA_PC;
> diff --git a/drivers/ide/ide-cd_ioctl.c b/drivers/ide/ide-cd_ioctl.c
> index 066e39036518..474173eb31bb 100644
> --- a/drivers/ide/ide-cd_ioctl.c
> +++ b/drivers/ide/ide-cd_ioctl.c
> @@ -303,7 +303,7 @@ int ide_cdrom_reset(struct cdrom_device_info *cdi)
> struct request *rq;
> int ret;
>
> - rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
> + rq = blk_get_request(drive->queue, READ, __GFP_RECLAIM);
> rq->cmd_type = REQ_TYPE_DRV_PRIV;
> rq->cmd_flags = REQ_QUIET;
> ret = blk_execute_rq(drive->queue, cd->disk, rq, 0);
> diff --git a/drivers/ide/ide-devsets.c b/drivers/ide/ide-devsets.c
> index b05a74d78ef5..0dd43b4fcec6 100644
> --- a/drivers/ide/ide-devsets.c
> +++ b/drivers/ide/ide-devsets.c
> @@ -165,7 +165,7 @@ int ide_devset_execute(ide_drive_t *drive, const struct ide_devset *setting,
> if (!(setting->flags & DS_SYNC))
> return setting->set(drive, arg);
>
> - rq = blk_get_request(q, READ, __GFP_WAIT);
> + rq = blk_get_request(q, READ, __GFP_RECLAIM);
> rq->cmd_type = REQ_TYPE_DRV_PRIV;
> rq->cmd_len = 5;
> rq->cmd[0] = REQ_DEVSET_EXEC;
> diff --git a/drivers/ide/ide-disk.c b/drivers/ide/ide-disk.c
> index 56b9708894a5..37a8a907febe 100644
> --- a/drivers/ide/ide-disk.c
> +++ b/drivers/ide/ide-disk.c
> @@ -477,7 +477,7 @@ static int set_multcount(ide_drive_t *drive, int arg)
> if (drive->special_flags & IDE_SFLAG_SET_MULTMODE)
> return -EBUSY;
>
> - rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
> + rq = blk_get_request(drive->queue, READ, __GFP_RECLAIM);
> rq->cmd_type = REQ_TYPE_ATA_TASKFILE;
>
> drive->mult_req = arg;
> diff --git a/drivers/ide/ide-ioctls.c b/drivers/ide/ide-ioctls.c
> index aa2e9b77b20d..d05db2469209 100644
> --- a/drivers/ide/ide-ioctls.c
> +++ b/drivers/ide/ide-ioctls.c
> @@ -125,7 +125,7 @@ static int ide_cmd_ioctl(ide_drive_t *drive, unsigned long arg)
> if (NULL == (void *) arg) {
> struct request *rq;
>
> - rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
> + rq = blk_get_request(drive->queue, READ, __GFP_RECLAIM);
> rq->cmd_type = REQ_TYPE_ATA_TASKFILE;
> err = blk_execute_rq(drive->queue, NULL, rq, 0);
> blk_put_request(rq);
> @@ -221,7 +221,7 @@ static int generic_drive_reset(ide_drive_t *drive)
> struct request *rq;
> int ret = 0;
>
> - rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
> + rq = blk_get_request(drive->queue, READ, __GFP_RECLAIM);
> rq->cmd_type = REQ_TYPE_DRV_PRIV;
> rq->cmd_len = 1;
> rq->cmd[0] = REQ_DRIVE_RESET;
> diff --git a/drivers/ide/ide-park.c b/drivers/ide/ide-park.c
> index c80868520488..2d7dca56dd24 100644
> --- a/drivers/ide/ide-park.c
> +++ b/drivers/ide/ide-park.c
> @@ -31,7 +31,7 @@ static void issue_park_cmd(ide_drive_t *drive, unsigned long timeout)
> }
> spin_unlock_irq(&hwif->lock);
>
> - rq = blk_get_request(q, READ, __GFP_WAIT);
> + rq = blk_get_request(q, READ, __GFP_RECLAIM);
> rq->cmd[0] = REQ_PARK_HEADS;
> rq->cmd_len = 1;
> rq->cmd_type = REQ_TYPE_DRV_PRIV;
> diff --git a/drivers/ide/ide-pm.c b/drivers/ide/ide-pm.c
> index 081e43458d50..e34af488693a 100644
> --- a/drivers/ide/ide-pm.c
> +++ b/drivers/ide/ide-pm.c
> @@ -18,7 +18,7 @@ int generic_ide_suspend(struct device *dev, pm_message_t mesg)
> }
>
> memset(&rqpm, 0, sizeof(rqpm));
> - rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
> + rq = blk_get_request(drive->queue, READ, __GFP_RECLAIM);
> rq->cmd_type = REQ_TYPE_ATA_PM_SUSPEND;
> rq->special = &rqpm;
> rqpm.pm_step = IDE_PM_START_SUSPEND;
> @@ -88,7 +88,7 @@ int generic_ide_resume(struct device *dev)
> }
>
> memset(&rqpm, 0, sizeof(rqpm));
> - rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
> + rq = blk_get_request(drive->queue, READ, __GFP_RECLAIM);
> rq->cmd_type = REQ_TYPE_ATA_PM_RESUME;
> rq->cmd_flags |= REQ_PREEMPT;
> rq->special = &rqpm;
> diff --git a/drivers/ide/ide-tape.c b/drivers/ide/ide-tape.c
> index f5d51d1d09ee..12fa04997dcc 100644
> --- a/drivers/ide/ide-tape.c
> +++ b/drivers/ide/ide-tape.c
> @@ -852,7 +852,7 @@ static int idetape_queue_rw_tail(ide_drive_t *drive, int cmd, int size)
> BUG_ON(cmd != REQ_IDETAPE_READ && cmd != REQ_IDETAPE_WRITE);
> BUG_ON(size < 0 || size % tape->blk_size);
>
> - rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
> + rq = blk_get_request(drive->queue, READ, __GFP_RECLAIM);
> rq->cmd_type = REQ_TYPE_DRV_PRIV;
> rq->cmd[13] = cmd;
> rq->rq_disk = tape->disk;
> @@ -860,7 +860,7 @@ static int idetape_queue_rw_tail(ide_drive_t *drive, int cmd, int size)
>
> if (size) {
> ret = blk_rq_map_kern(drive->queue, rq, tape->buf, size,
> - __GFP_WAIT);
> + __GFP_RECLAIM);
> if (ret)
> goto out_put;
> }
> diff --git a/drivers/ide/ide-taskfile.c b/drivers/ide/ide-taskfile.c
> index 0979e126fff1..a716693417a3 100644
> --- a/drivers/ide/ide-taskfile.c
> +++ b/drivers/ide/ide-taskfile.c
> @@ -430,7 +430,7 @@ int ide_raw_taskfile(ide_drive_t *drive, struct ide_cmd *cmd, u8 *buf,
> int error;
> int rw = !(cmd->tf_flags & IDE_TFLAG_WRITE) ? READ : WRITE;
>
> - rq = blk_get_request(drive->queue, rw, __GFP_WAIT);
> + rq = blk_get_request(drive->queue, rw, __GFP_RECLAIM);
> rq->cmd_type = REQ_TYPE_ATA_TASKFILE;
>
> /*
> @@ -441,7 +441,7 @@ int ide_raw_taskfile(ide_drive_t *drive, struct ide_cmd *cmd, u8 *buf,
> */
> if (nsect) {
> error = blk_rq_map_kern(drive->queue, rq, buf,
> - nsect * SECTOR_SIZE, __GFP_WAIT);
> + nsect * SECTOR_SIZE, __GFP_RECLAIM);
> if (error)
> goto put_req;
> }
> diff --git a/drivers/infiniband/hw/ipath/ipath_file_ops.c b/drivers/infiniband/hw/ipath/ipath_file_ops.c
> index 450d15965005..c11f6c58ce53 100644
> --- a/drivers/infiniband/hw/ipath/ipath_file_ops.c
> +++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c
> @@ -905,7 +905,7 @@ static int ipath_create_user_egr(struct ipath_portdata *pd)
> * heavy filesystem activity makes these fail, and we can
> * use compound pages.
> */
> - gfp_flags = __GFP_WAIT | __GFP_IO | __GFP_COMP;
> + gfp_flags = __GFP_RECLAIM | __GFP_IO | __GFP_COMP;
>
> egrcnt = dd->ipath_rcvegrcnt;
> /* TID number offset for this port */
> diff --git a/drivers/infiniband/hw/qib/qib_init.c b/drivers/infiniband/hw/qib/qib_init.c
> index 7e00470adc30..4ff340fe904f 100644
> --- a/drivers/infiniband/hw/qib/qib_init.c
> +++ b/drivers/infiniband/hw/qib/qib_init.c
> @@ -1680,7 +1680,7 @@ int qib_setup_eagerbufs(struct qib_ctxtdata *rcd)
> * heavy filesystem activity makes these fail, and we can
> * use compound pages.
> */
> - gfp_flags = __GFP_WAIT | __GFP_IO | __GFP_COMP;
> + gfp_flags = __GFP_RECLAIM | __GFP_IO | __GFP_COMP;
>
> egrcnt = rcd->rcvegrcnt;
> egroff = rcd->rcvegr_tid_base;
> diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c
> index 191617492181..5a312958c094 100644
> --- a/drivers/misc/vmw_balloon.c
> +++ b/drivers/misc/vmw_balloon.c
> @@ -85,7 +85,7 @@ MODULE_LICENSE("GPL");
>
> /*
> * Use __GFP_HIGHMEM to allow pages from HIGHMEM zone. We don't
> - * allow wait (__GFP_WAIT) for NOSLEEP page allocations. Use
> + * allow wait (__GFP_RECLAIM) for NOSLEEP page allocations. Use
> * __GFP_NOWARN, to suppress page allocation failure warnings.
> */
> #define VMW_PAGE_ALLOC_NOSLEEP (__GFP_HIGHMEM|__GFP_NOWARN)
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index cfadccef045c..26416e21295d 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -1961,7 +1961,7 @@ static void scsi_eh_lock_door(struct scsi_device *sdev)
> struct request *req;
>
> /*
> - * blk_get_request with GFP_KERNEL (__GFP_WAIT) sleeps until a
> + * blk_get_request with GFP_KERNEL (__GFP_RECLAIM) sleeps until a
> * request becomes available
> */
> req = blk_get_request(sdev->request_queue, READ, GFP_KERNEL);
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 448ebdaa3d69..2396259b682b 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -221,13 +221,13 @@ int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
> int write = (data_direction == DMA_TO_DEVICE);
> int ret = DRIVER_ERROR << 24;
>
> - req = blk_get_request(sdev->request_queue, write, __GFP_WAIT);
> + req = blk_get_request(sdev->request_queue, write, __GFP_RECLAIM);
> if (IS_ERR(req))
> return ret;
> blk_rq_set_block_pc(req);
>
> if (bufflen && blk_rq_map_kern(sdev->request_queue, req,
> - buffer, bufflen, __GFP_WAIT))
> + buffer, bufflen, __GFP_RECLAIM))
> goto out;
>
> req->cmd_len = COMMAND_SIZE(cmd[0]);
> diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs_private.h b/drivers/staging/lustre/include/linux/libcfs/libcfs_private.h
> index ed37d26eb20d..393270436a4b 100644
> --- a/drivers/staging/lustre/include/linux/libcfs/libcfs_private.h
> +++ b/drivers/staging/lustre/include/linux/libcfs/libcfs_private.h
> @@ -113,7 +113,7 @@ do { \
> do { \
> LASSERT(!in_interrupt() || \
> ((size) <= LIBCFS_VMALLOC_SIZE && \
> - ((mask) & __GFP_WAIT) == 0)); \
> + ((mask) & __GFP_RECLAIM) == 0)); \
> } while (0)
>
> #define LIBCFS_ALLOC_POST(ptr, size) \
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 35660da77921..92e284d0362e 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -718,7 +718,7 @@ int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
> if (start > end)
> goto out;
> spin_unlock(&tree->lock);
> - if (mask & __GFP_WAIT)
> + if (mask & __GFP_RECLAIM)
> cond_resched();
> goto again;
> }
> @@ -1028,7 +1028,7 @@ __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
> if (start > end)
> goto out;
> spin_unlock(&tree->lock);
> - if (mask & __GFP_WAIT)
> + if (mask & __GFP_RECLAIM)
> cond_resched();
> goto again;
> }
> @@ -1253,7 +1253,7 @@ int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
> if (start > end)
> goto out;
> spin_unlock(&tree->lock);
> - if (mask & __GFP_WAIT)
> + if (mask & __GFP_RECLAIM)
> cond_resched();
> first_iteration = false;
> goto again;
> diff --git a/fs/cachefiles/internal.h b/fs/cachefiles/internal.h
> index aecd0859eacb..9c4b737a54df 100644
> --- a/fs/cachefiles/internal.h
> +++ b/fs/cachefiles/internal.h
> @@ -30,7 +30,7 @@ extern unsigned cachefiles_debug;
> #define CACHEFILES_DEBUG_KLEAVE 2
> #define CACHEFILES_DEBUG_KDEBUG 4
>
> -#define cachefiles_gfp (__GFP_WAIT | __GFP_NORETRY | __GFP_NOMEMALLOC)
> +#define cachefiles_gfp (__GFP_RECLAIM | __GFP_NORETRY | __GFP_NOMEMALLOC)
>
> /*
> * node records
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index 745d2342651a..b97cf506a20e 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -360,7 +360,7 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
>
> /*
> * bio_alloc() is guaranteed to return a bio when called with
> - * __GFP_WAIT and we request a valid number of vectors.
> + * __GFP_RECLAIM and we request a valid number of vectors.
> */
> bio = bio_alloc(GFP_KERNEL, nr_vecs);
>
> diff --git a/fs/nilfs2/mdt.h b/fs/nilfs2/mdt.h
> index fe529a87a208..03246cac3338 100644
> --- a/fs/nilfs2/mdt.h
> +++ b/fs/nilfs2/mdt.h
> @@ -72,7 +72,7 @@ static inline struct nilfs_mdt_info *NILFS_MDT(const struct inode *inode)
> }
>
> /* Default GFP flags using highmem */
> -#define NILFS_MDT_GFP (__GFP_WAIT | __GFP_IO | __GFP_HIGHMEM)
> +#define NILFS_MDT_GFP (__GFP_RECLAIM | __GFP_IO | __GFP_HIGHMEM)
>
> int nilfs_mdt_get_block(struct inode *, unsigned long, int,
> void (*init_block)(struct inode *,
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index dbd246a14e2f..e066f3afae73 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -104,7 +104,7 @@ struct vm_area_struct;
> * can be cleared when the reclaiming of pages would cause unnecessary
> * disruption.
> */
> -#define __GFP_WAIT (__GFP_DIRECT_RECLAIM|__GFP_KSWAPD_RECLAIM)
> +#define __GFP_RECLAIM (__GFP_DIRECT_RECLAIM|__GFP_KSWAPD_RECLAIM)
> #define __GFP_DIRECT_RECLAIM ((__force gfp_t)___GFP_DIRECT_RECLAIM) /* Caller can reclaim */
> #define __GFP_KSWAPD_RECLAIM ((__force gfp_t)___GFP_KSWAPD_RECLAIM) /* kswapd can wake */
>
> @@ -123,12 +123,12 @@ struct vm_area_struct;
> */
> #define GFP_ATOMIC (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
> #define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM)
> -#define GFP_NOIO (__GFP_WAIT)
> -#define GFP_NOFS (__GFP_WAIT | __GFP_IO)
> -#define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS)
> -#define GFP_TEMPORARY (__GFP_WAIT | __GFP_IO | __GFP_FS | \
> +#define GFP_NOIO (__GFP_RECLAIM)
> +#define GFP_NOFS (__GFP_RECLAIM | __GFP_IO)
> +#define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
> +#define GFP_TEMPORARY (__GFP_RECLAIM | __GFP_IO | __GFP_FS | \
> __GFP_RECLAIMABLE)
> -#define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
> +#define GFP_USER (__GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
> #define GFP_HIGHUSER (GFP_USER | __GFP_HIGHMEM)
> #define GFP_HIGHUSER_MOVABLE (GFP_HIGHUSER | __GFP_MOVABLE)
> #define GFP_IOFS (__GFP_IO | __GFP_FS)
> @@ -141,12 +141,12 @@ struct vm_area_struct;
> #define GFP_MOVABLE_SHIFT 3
>
> /* Control page allocator reclaim behavior */
> -#define GFP_RECLAIM_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS|\
> +#define GFP_RECLAIM_MASK (__GFP_RECLAIM|__GFP_HIGH|__GFP_IO|__GFP_FS|\
> __GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\
> __GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC)
>
> /* Control slab gfp mask during early boot */
> -#define GFP_BOOT_MASK (__GFP_BITS_MASK & ~(__GFP_WAIT|__GFP_IO|__GFP_FS))
> +#define GFP_BOOT_MASK (__GFP_BITS_MASK & ~(__GFP_RECLAIM|__GFP_IO|__GFP_FS))
>
> /* Control allocation constraints */
> #define GFP_CONSTRAINT_MASK (__GFP_HARDWALL|__GFP_THISNODE)
> diff --git a/kernel/power/swap.c b/kernel/power/swap.c
> index 2f30ca91e4fa..3841af470cf9 100644
> --- a/kernel/power/swap.c
> +++ b/kernel/power/swap.c
> @@ -261,7 +261,7 @@ static int hib_submit_io(int rw, pgoff_t page_off, void *addr,
> struct bio *bio;
> int error = 0;
>
> - bio = bio_alloc(__GFP_WAIT | __GFP_HIGH, 1);
> + bio = bio_alloc(__GFP_RECLAIM | __GFP_HIGH, 1);
> bio->bi_iter.bi_sector = page_off * (PAGE_SIZE >> 9);
> bio->bi_bdev = hib_resume_bdev;
>
> @@ -360,7 +360,7 @@ static int write_page(void *buf, sector_t offset, struct hib_bio_batch *hb)
> return -ENOSPC;
>
> if (hb) {
> - src = (void *)__get_free_page(__GFP_WAIT | __GFP_NOWARN |
> + src = (void *)__get_free_page(__GFP_RECLAIM | __GFP_NOWARN |
> __GFP_NORETRY);
> if (src) {
> copy_page(src, buf);
> @@ -368,7 +368,7 @@ static int write_page(void *buf, sector_t offset, struct hib_bio_batch *hb)
> ret = hib_wait_io(hb); /* Free pages */
> if (ret)
> return ret;
> - src = (void *)__get_free_page(__GFP_WAIT |
> + src = (void *)__get_free_page(__GFP_RECLAIM |
> __GFP_NOWARN |
> __GFP_NORETRY);
> if (src) {
> @@ -676,7 +676,7 @@ static int save_image_lzo(struct swap_map_handle *handle,
> nr_threads = num_online_cpus() - 1;
> nr_threads = clamp_val(nr_threads, 1, LZO_THREADS);
>
> - page = (void *)__get_free_page(__GFP_WAIT | __GFP_HIGH);
> + page = (void *)__get_free_page(__GFP_RECLAIM | __GFP_HIGH);
> if (!page) {
> printk(KERN_ERR "PM: Failed to allocate LZO page\n");
> ret = -ENOMEM;
> @@ -979,7 +979,7 @@ static int get_swap_reader(struct swap_map_handle *handle,
> last = tmp;
>
> tmp->map = (struct swap_map_page *)
> - __get_free_page(__GFP_WAIT | __GFP_HIGH);
> + __get_free_page(__GFP_RECLAIM | __GFP_HIGH);
> if (!tmp->map) {
> release_swap_reader(handle);
> return -ENOMEM;
> @@ -1246,8 +1246,8 @@ static int load_image_lzo(struct swap_map_handle *handle,
>
> for (i = 0; i < read_pages; i++) {
> page[i] = (void *)__get_free_page(i < LZO_CMP_PAGES ?
> - __GFP_WAIT | __GFP_HIGH :
> - __GFP_WAIT | __GFP_NOWARN |
> + __GFP_RECLAIM | __GFP_HIGH :
> + __GFP_RECLAIM | __GFP_NOWARN |
> __GFP_NORETRY);
>
> if (!page[i]) {
> diff --git a/lib/percpu_ida.c b/lib/percpu_ida.c
> index f75715131f20..6d40944960de 100644
> --- a/lib/percpu_ida.c
> +++ b/lib/percpu_ida.c
> @@ -135,7 +135,7 @@ static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
> * TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, of course).
> *
> * @gfp indicates whether or not to wait until a free id is available (it's not
> - * used for internal memory allocations); thus if passed __GFP_WAIT we may sleep
> + * used for internal memory allocations); thus if passed __GFP_RECLAIM we may sleep
> * however long it takes until another thread frees an id (same semantics as a
> * mempool).
> *
> diff --git a/mm/failslab.c b/mm/failslab.c
> index fefaabaab76d..69f083146a37 100644
> --- a/mm/failslab.c
> +++ b/mm/failslab.c
> @@ -3,11 +3,11 @@
>
> static struct {
> struct fault_attr attr;
> - u32 ignore_gfp_wait;
> + u32 ignore_gfp_reclaim;
> int cache_filter;
> } failslab = {
> .attr = FAULT_ATTR_INITIALIZER,
> - .ignore_gfp_wait = 1,
> + .ignore_gfp_reclaim = 1,
> .cache_filter = 0,
> };
>
> @@ -16,7 +16,7 @@ bool should_failslab(size_t size, gfp_t gfpflags, unsigned long cache_flags)
> if (gfpflags & __GFP_NOFAIL)
> return false;
>
> - if (failslab.ignore_gfp_wait && (gfpflags & __GFP_WAIT))
> + if (failslab.ignore_gfp_reclaim && (gfpflags & __GFP_RECLAIM))
> return false;
>
> if (failslab.cache_filter && !(cache_flags & SLAB_FAILSLAB))
> @@ -42,7 +42,7 @@ static int __init failslab_debugfs_init(void)
> return PTR_ERR(dir);
>
> if (!debugfs_create_bool("ignore-gfp-wait", mode, dir,
> - &failslab.ignore_gfp_wait))
> + &failslab.ignore_gfp_reclaim))
> goto fail;
> if (!debugfs_create_bool("cache-filter", mode, dir,
> &failslab.cache_filter))
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 1283fc825458..986fe45a5d27 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2673,7 +2673,7 @@ EXPORT_SYMBOL(generic_file_write_iter);
> * page is known to the local caching routines.
> *
> * The @gfp_mask argument specifies whether I/O may be performed to release
> - * this page (__GFP_IO), and whether the call may block (__GFP_WAIT & __GFP_FS).
> + * this page (__GFP_IO), and whether the call may block (__GFP_RECLAIM & __GFP_FS).
> *
> */
> int try_to_release_page(struct page *page, gfp_t gfp_mask)
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c107094f79ba..f563473b5e99 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -767,7 +767,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
>
> static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
> {
> - return (GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT)) | extra_gfp;
> + return (GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_RECLAIM)) | extra_gfp;
> }
>
> /* Caller must hold page table lock. */
> diff --git a/mm/migrate.c b/mm/migrate.c
> index ee401e4e5ef1..e92b55868c6d 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1734,7 +1734,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
> goto out_dropref;
>
> new_page = alloc_pages_node(node,
> - (GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_WAIT,
> + (GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
> HPAGE_PMD_ORDER);
> if (!new_page)
> goto out_fail;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ccd235d02923..17064a3f4909 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2120,11 +2120,11 @@ static struct {
> struct fault_attr attr;
>
> u32 ignore_gfp_highmem;
> - u32 ignore_gfp_wait;
> + u32 ignore_gfp_reclaim;
> u32 min_order;
> } fail_page_alloc = {
> .attr = FAULT_ATTR_INITIALIZER,
> - .ignore_gfp_wait = 1,
> + .ignore_gfp_reclaim = 1,
> .ignore_gfp_highmem = 1,
> .min_order = 1,
> };
> @@ -2143,7 +2143,7 @@ static bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
> return false;
> if (fail_page_alloc.ignore_gfp_highmem && (gfp_mask & __GFP_HIGHMEM))
> return false;
> - if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
> + if (fail_page_alloc.ignore_gfp_reclaim && (gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
> return false;
>
> return should_fail(&fail_page_alloc.attr, 1 << order);
> @@ -2162,7 +2162,7 @@ static int __init fail_page_alloc_debugfs(void)
> return PTR_ERR(dir);
>
> if (!debugfs_create_bool("ignore-gfp-wait", mode, dir,
> - &fail_page_alloc.ignore_gfp_wait))
> + &fail_page_alloc.ignore_gfp_reclaim))
> goto fail;
> if (!debugfs_create_bool("ignore-gfp-highmem", mode, dir,
> &fail_page_alloc.ignore_gfp_highmem))
> @@ -2459,7 +2459,7 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
> if (test_thread_flag(TIF_MEMDIE) ||
> (current->flags & (PF_MEMALLOC | PF_EXITING)))
> filter &= ~SHOW_MEM_FILTER_NODES;
> - if (in_interrupt() || !(gfp_mask & __GFP_WAIT) || (gfp_mask & __GFP_ATOMIC))
> + if (in_interrupt() || !(gfp_mask & __GFP_RECLAIM) || (gfp_mask & __GFP_ATOMIC))
> filter &= ~SHOW_MEM_FILTER_NODES;
>
> if (fmt) {
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index d8e2e3918ce2..4bee2392dbb2 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -2061,7 +2061,7 @@ int netlink_broadcast_filtered(struct sock *ssk, struct sk_buff *skb, u32 portid
> consume_skb(info.skb2);
>
> if (info.delivered) {
> - if (info.congested && (allocation & __GFP_WAIT))
> + if (info.congested && (allocation & __GFP_RECLAIM))
> yield();
> return 0;
> }
> diff --git a/net/rxrpc/ar-connection.c b/net/rxrpc/ar-connection.c
> index 6631f4f1e39b..b5cd65401a28 100644
> --- a/net/rxrpc/ar-connection.c
> +++ b/net/rxrpc/ar-connection.c
> @@ -500,7 +500,7 @@ int rxrpc_connect_call(struct rxrpc_sock *rx,
> if (bundle->num_conns >= 20) {
> _debug("too many conns");
>
> - if (!(gfp & __GFP_WAIT)) {
> + if (!(gfp & __GFP_RECLAIM)) {
> _leave(" = -EAGAIN");
> return -EAGAIN;
> }
> diff --git a/security/integrity/ima/ima_crypto.c b/security/integrity/ima/ima_crypto.c
> index e24121afb2f2..6eb62936c672 100644
> --- a/security/integrity/ima/ima_crypto.c
> +++ b/security/integrity/ima/ima_crypto.c
> @@ -126,7 +126,7 @@ static void *ima_alloc_pages(loff_t max_size, size_t *allocated_size,
> {
> void *ptr;
> int order = ima_maxorder;
> - gfp_t gfp_mask = __GFP_WAIT | __GFP_NOWARN | __GFP_NORETRY;
> + gfp_t gfp_mask = __GFP_RECLAIM | __GFP_NOWARN | __GFP_NORETRY;
>
> if (order)
> order = min(get_order(max_size), order);
> --
> 2.4.6
>

--
Michal Hocko
SUSE Labs

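For context, the net effect of the rename on the flag definitions, condensed
from the include/linux/gfp.h hunk quoted above (a simplified excerpt, not the
full header):

    /* __GFP_RECLAIM replaces __GFP_WAIT and covers both reclaim modes */
    #define __GFP_DIRECT_RECLAIM ((__force gfp_t)___GFP_DIRECT_RECLAIM) /* Caller can reclaim */
    #define __GFP_KSWAPD_RECLAIM ((__force gfp_t)___GFP_KSWAPD_RECLAIM) /* kswapd can wake */
    #define __GFP_RECLAIM (__GFP_DIRECT_RECLAIM|__GFP_KSWAPD_RECLAIM)

    /* Clearing only __GFP_DIRECT_RECLAIM still lets kswapd be woken... */
    #define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM)
    /* ...while the common masks permit all reclaim activity */
    #define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
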
2015-08-20 12:30:42

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 02/10] mm, page_alloc: Remove unnecessary parameter from zone_watermark_ok_safe

On Wed 12-08-15 11:45:27, Mel Gorman wrote:
> No user of zone_watermark_ok_safe() specifies alloc_flags. This patch
> removes the unnecessary parameter.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Acked-by: David Rientjes <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>

Acked-by: Michal Hocko <[email protected]>

> ---
> include/linux/mmzone.h | 2 +-
> mm/page_alloc.c | 5 +++--
> mm/vmscan.c | 4 ++--
> 3 files changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index decc99a007f5..8b86ec5df968 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -731,7 +731,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx);
> bool zone_watermark_ok(struct zone *z, unsigned int order,
> unsigned long mark, int classzone_idx, int alloc_flags);
> bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
> - unsigned long mark, int classzone_idx, int alloc_flags);
> + unsigned long mark, int classzone_idx);
> enum memmap_context {
> MEMMAP_EARLY,
> MEMMAP_HOTPLUG,
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 41c0799b9049..5e1f6f4370bc 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2209,6 +2209,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
> min -= min / 2;
> if (alloc_flags & ALLOC_HARDER)
> min -= min / 4;
> +
> #ifdef CONFIG_CMA
> /* If allocation can't use CMA areas don't use free CMA pages */
> if (!(alloc_flags & ALLOC_CMA))
> @@ -2238,14 +2239,14 @@ bool zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
> }
>
> bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
> - unsigned long mark, int classzone_idx, int alloc_flags)
> + unsigned long mark, int classzone_idx)
> {
> long free_pages = zone_page_state(z, NR_FREE_PAGES);
>
> if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
> free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);
>
> - return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
> + return __zone_watermark_ok(z, order, mark, classzone_idx, 0,
> free_pages);
> }
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e61445dce04e..f1d8eae285f2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2454,7 +2454,7 @@ static inline bool compaction_ready(struct zone *zone, int order)
> balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP(
> zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO));
> watermark = high_wmark_pages(zone) + balance_gap + (2UL << order);
> - watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, 0, 0);
> + watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, 0);
>
> /*
> * If compaction is deferred, reclaim up to a point where
> @@ -2937,7 +2937,7 @@ static bool zone_balanced(struct zone *zone, int order,
> unsigned long balance_gap, int classzone_idx)
> {
> if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone) +
> - balance_gap, classzone_idx, 0))
> + balance_gap, classzone_idx))
> return false;
>
> if (IS_ENABLED(CONFIG_COMPACTION) && order && compaction_suitable(zone,
> --
> 2.4.6
>

--
Michal Hocko
SUSE Labs

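A minimal before/after sketch of a conversion (a hypothetical call site; the
real conversions are the mm/vmscan.c hunks above):

    /* Before: every caller passed alloc_flags == 0 anyway */
    ok = zone_watermark_ok_safe(zone, order, mark, classzone_idx, 0);

    /* After: the parameter is gone; zone_watermark_ok_safe() now
     * passes 0 to __zone_watermark_ok() internally */
    ok = zone_watermark_ok_safe(zone, order, mark, classzone_idx);
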
2015-08-20 12:45:31

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 03/10] mm, page_alloc: Remove unnecessary recalculations for dirty zone balancing

On Wed 12-08-15 11:45:28, Mel Gorman wrote:
> File-backed pages that will be immediately are balanced between zones but
^written to...

> it's unnecessarily expensive.

to do WHAT? I guess you meant checking gfp_mask resp. alloc_mask? I
doubt it would make a noticeable difference as this is a slow path
already but I agree it doesn't make sense to check it again.

> Move consider_zone_balanced into the alloc_context
> instead of checking bitmaps multiple times. The patch also gives the parameter
> a more meaningful name.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Acked-by: David Rientjes <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>

Acked-by: Michal Hocko <[email protected]>

> ---
> mm/internal.h | 1 +
> mm/page_alloc.c | 11 +++++++----
> 2 files changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 36b23f1e2ca6..9331f802a067 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -129,6 +129,7 @@ struct alloc_context {
> int classzone_idx;
> int migratetype;
> enum zone_type high_zoneidx;
> + bool spread_dirty_pages;
> };
>
> /*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5e1f6f4370bc..94f2f6bdd6d5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2297,8 +2297,6 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> struct zoneref *z;
> struct page *page = NULL;
> struct zone *zone;
> - bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) &&
> - (gfp_mask & __GFP_WRITE);
> int nr_fair_skipped = 0;
> bool zonelist_rescan;
>
> @@ -2350,14 +2348,14 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> *
> * XXX: For now, allow allocations to potentially
> * exceed the per-zone dirty limit in the slowpath
> - * (ALLOC_WMARK_LOW unset) before going into reclaim,
> + * (spread_dirty_pages unset) before going into reclaim,
> * which is important when on a NUMA setup the allowed
> * zones are together not big enough to reach the
> * global limit. The proper fix for these situations
> * will require awareness of zones in the
> * dirty-throttling and the flusher threads.
> */
> - if (consider_zone_dirty && !zone_dirty_ok(zone))
> + if (ac->spread_dirty_pages && !zone_dirty_ok(zone))
> continue;
>
> mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
> @@ -2997,6 +2995,10 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>
> /* We set it here, as __alloc_pages_slowpath might have changed it */
> ac.zonelist = zonelist;
> +
> + /* Dirty zone balancing only done in the fast path */
> + ac.spread_dirty_pages = (gfp_mask & __GFP_WRITE);
> +
> /* The preferred zone is used for statistics later */
> preferred_zoneref = first_zones_zonelist(ac.zonelist, ac.high_zoneidx,
> ac.nodemask, &ac.preferred_zone);
> @@ -3014,6 +3016,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> * complete.
> */
> alloc_mask = memalloc_noio_flags(gfp_mask);
> + ac.spread_dirty_pages = false;
>
> page = __alloc_pages_slowpath(alloc_mask, order, &ac);
> }
> --
> 2.4.6
>

--
Michal Hocko
SUSE Labs

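In sketch form, the patch computes the dirty-balancing decision once per
allocation instead of re-deriving it for every zone (condensed from the
mm/page_alloc.c hunks above):

    /* Fast path: record whether dirty zone balancing applies */
    ac.spread_dirty_pages = (gfp_mask & __GFP_WRITE);

    /* get_page_from_freelist() then tests the cached value per zone */
    if (ac->spread_dirty_pages && !zone_dirty_ok(zone))
        continue;

    /* Slow path: dirty zone balancing is deliberately not applied */
    ac.spread_dirty_pages = false;
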
2015-08-20 12:46:12

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 07/10] mm: page_alloc: Rename __GFP_WAIT to __GFP_RECLAIM

On Thu 20-08-15 14:28:57, Michal Hocko wrote:
> On Wed 12-08-15 11:45:32, Mel Gorman wrote:
> > The absence of __GFP_WAIT was used to signal that the caller was in atomic
> > context and could not sleep. Now it is possible to distinguish between true atomic
> > context and callers that are not willing to sleep. The latter should clear
> > __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing __GFP_WAIT
> > behaves differently, there is a risk that people will clear the wrong
> > flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly indicate
> > what it does -- setting it allows all reclaim activity, clearing them
> > prevents it.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
>
> I haven't checked all the converted places too deeply but they look
> straightforward.
>
> Acked-by: Michal Hocko <[email protected]>

I meant @suse.com, dang the old one is hardwired into my hands...
--
Michal Hocko
SUSE Labs

2015-08-20 13:18:47

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm, page_alloc: Delete the zonelist_cache

On Wed 12-08-15 11:45:26, Mel Gorman wrote:
[...]
> 4-node machine stutter
> 4-node machine stutter
> 4.2.0-rc1 4.2.0-rc1
> vanilla nozlc-v1r20
> Min mmap 53.9902 ( 0.00%) 49.3629 ( 8.57%)
> 1st-qrtle mmap 54.6776 ( 0.00%) 54.1201 ( 1.02%)
> 2nd-qrtle mmap 54.9242 ( 0.00%) 54.5961 ( 0.60%)
> 3rd-qrtle mmap 55.1817 ( 0.00%) 54.9338 ( 0.45%)
> Max-90% mmap 55.3952 ( 0.00%) 55.3929 ( 0.00%)
> Max-93% mmap 55.4766 ( 0.00%) 57.5712 ( -3.78%)
> Max-95% mmap 55.5522 ( 0.00%) 57.8376 ( -4.11%)
> Max-99% mmap 55.7938 ( 0.00%) 63.6180 (-14.02%)
> Max mmap 6344.0292 ( 0.00%) 67.2477 ( 98.94%)
> Mean mmap 57.3732 ( 0.00%) 54.5680 ( 4.89%)

Do you have data for other loads? Because the reclaim counters look
quite discouraging to be honest.

> 4.1.0 4.1.0
> vanilla nozlc-v1r4
> Swap Ins 838 502
> Swap Outs 1149395 2622895

Twice as much swapouts is a lot.

> DMA32 allocs 17839113 15863747
> Normal allocs 129045707 137847920
> Direct pages scanned 4070089 29046893

7x more scans by direct reclaim also sounds bad.

> Kswapd pages scanned 17147837 17140694

while kswapd is doing the same amount of work, so we are moving a
considerable amount of reclaim activity into direct reclaim

> Kswapd pages reclaimed 17146691 17139601
> Direct pages reclaimed 1888879 4886630
> Kswapd efficiency 99% 99%
> Kswapd velocity 17523.721 17518.928
> Direct efficiency 46% 16%

which is just a wasted effort because the efficiency is really poor.
Is this the effect of hammering a single zone which would be skipped
otherwise while the allocation would succeed from another zone?

The latencies were not that much higher, though, which doesn't match these
numbers. Is it possible that other parts of the benchmark suffered? The
benchmark measured only the mmap part AFAIU.

> Direct velocity 4159.306 29687.854
> Percentage direct scans 19% 62%
> Page writes by reclaim 1149395.000 2622895.000
> Page writes file 0 0
> Page writes anon 1149395 2622895
>
> The direct page scan and reclaim rates are noticeable. It is possible
> this will not be a universal win on all workloads but cycling through
> zonelists waiting for zlc->last_full_zap to expire is not the right
> decision.

As much as I would like to see zlc go it seems that it won't be that
easy without regressing some loads. Or the numbers
--
Michal Hocko
SUSE Labs

2015-08-20 13:30:59

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm, page_alloc: Delete the zonelist_cache

On 08/12/2015 12:45 PM, Mel Gorman wrote:
> The zonelist cache (zlc) was introduced to skip over zones that were
> recently known to be full. This avoided expensive operations such as the
> cpuset checks, watermark calculations and zone_reclaim. The situation
> today is different and the complexity of zlc is harder to justify.
>
> 1) The cpuset checks are no-ops unless a cpuset is active and in general are
> a lot cheaper.
>
> 2) zone_reclaim is now disabled by default and I suspect that was a large
> source of the cost that zlc wanted to avoid. When it is enabled, it's
> known to be a major source of stalling when nodes fill up and it's
> unwise to hit every other user with the overhead.
>
> 3) Watermark checks are expensive to calculate for high-order
> allocation requests. Later patches in this series will reduce the cost
> of the watermark checking.
>
> 4) The most important issue is that in the current implementation it
> is possible for a failed THP allocation to mark a zone full for order-0
> allocations and cause a fallback to remote nodes.
>
> The last issue could be addressed with additional complexity but as the
> benefit of zlc is questionable, it is better to remove it. If stalls
> due to zone_reclaim are ever reported then an alternative would be to
> introduce deferring logic based on a timeout inside zone_reclaim itself
> and leave the page allocator fast paths alone.
>
> The impact on page-allocator microbenchmarks is negligible as they don't
> hit the paths where the zlc comes into play. The impact was noticeable
> in a workload called "stutter". One part uses a lot of anonymous memory,
> a second measures mmap latency and a third copies a large file. In an
> ideal world the latency application would not notice the mmap latency.
> On a 4-node machine the results of this patch are
>
> 4-node machine stutter
> 4.2.0-rc1 4.2.0-rc1
> vanilla nozlc-v1r20
> Min mmap 53.9902 ( 0.00%) 49.3629 ( 8.57%)
> 1st-qrtle mmap 54.6776 ( 0.00%) 54.1201 ( 1.02%)
> 2nd-qrtle mmap 54.9242 ( 0.00%) 54.5961 ( 0.60%)
> 3rd-qrtle mmap 55.1817 ( 0.00%) 54.9338 ( 0.45%)
> Max-90% mmap 55.3952 ( 0.00%) 55.3929 ( 0.00%)
> Max-93% mmap 55.4766 ( 0.00%) 57.5712 ( -3.78%)
> Max-95% mmap 55.5522 ( 0.00%) 57.8376 ( -4.11%)
> Max-99% mmap 55.7938 ( 0.00%) 63.6180 (-14.02%)
> Max mmap 6344.0292 ( 0.00%) 67.2477 ( 98.94%)
> Mean mmap 57.3732 ( 0.00%) 54.5680 ( 4.89%)
>
> Note the maximum stall latency which was 6 seconds and becomes 67ms with
> this patch applied. However, also note that it is not guaranteed this
> benchmark always hits pathological cases and the mileage varies. There is
> a secondary impact with more direct reclaim because zones are now being
> considered instead of being skipped by zlc.
>
> 4.1.0 4.1.0
> vanilla nozlc-v1r4
> Swap Ins 838 502
> Swap Outs 1149395 2622895
> DMA32 allocs 17839113 15863747
> Normal allocs 129045707 137847920
> Direct pages scanned 4070089 29046893
> Kswapd pages scanned 17147837 17140694
> Kswapd pages reclaimed 17146691 17139601
> Direct pages reclaimed 1888879 4886630
> Kswapd efficiency 99% 99%
> Kswapd velocity 17523.721 17518.928
> Direct efficiency 46% 16%
> Direct velocity 4159.306 29687.854
> Percentage direct scans 19% 62%
> Page writes by reclaim 1149395.000 2622895.000
> Page writes file 0 0
> Page writes anon 1149395 2622895

Interesting, kswapd has no decrease that would counter the increase in
direct reclaim. So there's more reclaim overall. Does it mean that
stutter doesn't like LRU and zlc was disrupting LRU?

> The direct page scan and reclaim rates are noticeable. It is possible
> this will not be a universal win on all workloads but cycling through
> zonelists waiting for zlc->last_full_zap to expire is not the right
> decision.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Acked-by: David Rientjes <[email protected]>

It doesn't seem that removal of zlc would increase overhead due to
"expensive operations no longer being avoided". Making some corner-case
benchmark(s) worse as a side-effect of different LRU approximation
shouldn't be a show-stopper. Hence

Acked-by: Vlastimil Babka <[email protected]>

just git grep found some lines that should also be deleted:

include/linux/mmzone.h: * If zlcache_ptr is not NULL, then it is just the address of zlcache,
include/linux/mmzone.h: * as explained above. If zlcache_ptr is NULL, there is no zlcache.

And:

> @@ -3157,7 +2967,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
> struct alloc_context ac = {
> .high_zoneidx = gfp_zone(gfp_mask),
> - .nodemask = nodemask,
> + .nodemask = nodemask ? : &cpuset_current_mems_allowed,
> .migratetype = gfpflags_to_migratetype(gfp_mask),
> };
>
> @@ -3188,8 +2998,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> ac.zonelist = zonelist;
> /* The preferred zone is used for statistics later */
> preferred_zoneref = first_zones_zonelist(ac.zonelist, ac.high_zoneidx,
> - ac.nodemask ? : &cpuset_current_mems_allowed,
> - &ac.preferred_zone);
> + ac.nodemask, &ac.preferred_zone);
> if (!ac.preferred_zone)
> goto out;
> ac.classzone_idx = zonelist_zone_idx(preferred_zoneref);

These hunks appear unrelated to zonelist cache? Also they move the
evaluation of cpuset_current_mems_allowed

2015-08-20 13:42:45

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm, page_alloc: Delete the zonelist_cache

On Thu, Aug 20, 2015 at 03:18:43PM +0200, Michal Hocko wrote:
> On Wed 12-08-15 11:45:26, Mel Gorman wrote:
> [...]
> > 4-node machine stutter
> > 4-node machine stutter
> > 4.2.0-rc1 4.2.0-rc1
> > vanilla nozlc-v1r20
> > Min mmap 53.9902 ( 0.00%) 49.3629 ( 8.57%)
> > 1st-qrtle mmap 54.6776 ( 0.00%) 54.1201 ( 1.02%)
> > 2nd-qrtle mmap 54.9242 ( 0.00%) 54.5961 ( 0.60%)
> > 3rd-qrtle mmap 55.1817 ( 0.00%) 54.9338 ( 0.45%)
> > Max-90% mmap 55.3952 ( 0.00%) 55.3929 ( 0.00%)
> > Max-93% mmap 55.4766 ( 0.00%) 57.5712 ( -3.78%)
> > Max-95% mmap 55.5522 ( 0.00%) 57.8376 ( -4.11%)
> > Max-99% mmap 55.7938 ( 0.00%) 63.6180 (-14.02%)
> > Max mmap 6344.0292 ( 0.00%) 67.2477 ( 98.94%)
> > Mean mmap 57.3732 ( 0.00%) 54.5680 ( 4.89%)
>
> Do you have data for other loads? Because the reclaim counters look
> quite discouraging to be honest.
>

None of the other workloads showed changes that were worth reporting.

> > 4.1.0 4.1.0
> > vanilla nozlc-v1r4
> > Swap Ins 838 502
> > Swap Outs 1149395 2622895
>
> Twice as much swapouts is a lot.
>
> > DMA32 allocs 17839113 15863747
> > Normal allocs 129045707 137847920
> > Direct pages scanned 4070089 29046893
>
> > 7x more scans by direct reclaim also sounds bad.
>

With this benchmark, the results for stutter will be highly variable as
it's hammering the system. The intent of the test was to measure stalls at
a time when desktop interactivity went to hell during IO and could stall
for several minutes. Due to it nature, there is intense reclaim *and*
compaction activity going on and there is no point drawing conclusions
from the reclaim stats that are inherently good or bad.

There will be differences in direct reclaim figures because instead of
looping in the page allocator waiting for zlc to clear, it'll enter direct
reclaim. In effect, the zlc causes processes to busy loop while kswapd
does the work. If it turns out that this is the correct behaviour then
we should do that explicitly, not rely on the broken zlc behaviour for
the same reason we no longer rely on sprinkling congestion_wait() all
over the place.

> > Kswapd pages scanned 17147837 17140694
>
> while kswapd is doing the same amount of work, so we are moving a
> considerable amount of reclaim activity into direct reclaim
>
> > Kswapd pages reclaimed 17146691 17139601
> > Direct pages reclaimed 1888879 4886630
> > Kswapd efficiency 99% 99%
> > Kswapd velocity 17523.721 17518.928
> > Direct efficiency 46% 16%
>
> which is just a wasted effort because the efficiency is really poor.
> Is this the effect of hammering a single zone which would be skipped
> otherwise while the allocation would succeed from another zone?
>

Very doubtful. It's more likely because the zlc was causing a process to
busy loop waiting for kswapd to make forward progress.

> The latencies were not that much higher, though, which doesn't match these
> numbers. Is it possible that other parts of the benchmark suffered? The
> benchmark measured only the mmap part AFAIU.
>

mmap latency, yes, but during it the system is getting hammered and the
latency is also affected by whether THPs were used or not.

> > Direct velocity 4159.306 29687.854
> > Percentage direct scans 19% 62%
> > Page writes by reclaim 1149395.000 2622895.000
> > Page writes file 0 0
> > Page writes anon 1149395 2622895
> >
> > The direct page scan and reclaim rates are noticeable. It is possible
> > this will not be a universal win on all workloads but cycling through
> > zonelists waiting for zlc->last_full_zap to expire is not the right
> > decision.
>
> As much as I would like to see zlc go it seems that it won't be that
> easy without regressing some loads. Or the numbers

If there are regressions on a real workload then it would be worth
considering why busy looping happened to behave better and then solve it
correctly.

--
Mel Gorman
SUSE Labs

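For readers following the busy-loop argument: the zlc marked zones "full" in
a bitmap that was only cleared on a timeout, so an allocator whose candidate
zones were all marked full could keep cycling the zonelist without doing any
reclaim itself. A rough sketch of the removed mechanism (field names
approximate the old mm/page_alloc.c code):

    struct zonelist_cache {
        DECLARE_BITMAP(fullzones, MAX_ZONES_PER_ZONELIST); /* zones marked full */
        unsigned long last_full_zap;    /* jiffies of the last bitmap clear */
    };

    /* A zone stayed marked full (even after a single failed THP
     * attempt) until the periodic zap expired, roughly: */
    if (time_after(jiffies, zlc->last_full_zap + HZ)) {
        bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
        zlc->last_full_zap = jiffies;
    }
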
2015-08-20 13:45:30

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 03/10] mm, page_alloc: Remove unnecessary recalculations for dirty zone balancing

On Thu, Aug 20, 2015 at 02:45:27PM +0200, Michal Hocko wrote:
> On Wed 12-08-15 11:45:28, Mel Gorman wrote:
> > File-backed pages that will be immediately are balanced between zones but
> ^written to...
>
> > it's unnecessarily expensive.
>
> to do WHAT? I guess you meant checking gfp_mask resp. alloc_mask? I
> doubt it would make a noticeable difference as this is a slow path
> already but I agree it doesn't make sense to check it again.
>

File-backed pages that will be immediately written are balanced between
zones. This heuristic tries to avoid having a single zone filled with
recently dirtied pages but the checks are unnecessarily expensive. Move
consider_zone_balanced into the alloc_context instead of checking bitmaps
multiple times. The patch also gives the parameter a more meaningful name.

?

--
Mel Gorman
SUSE Labs

2015-08-20 14:17:29

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm, page_alloc: Delete the zonelist_cache

On Thu, Aug 20, 2015 at 03:30:54PM +0200, Vlastimil Babka wrote:
> >Note the maximum stall latency which was 6 seconds and becomes 67ms with
> >this patch applied. However, also note that it is not guaranteed this
> >benchmark always hits pathological cases and the mileage varies. There is
> >a secondary impact with more direct reclaim because zones are now being
> >considered instead of being skipped by zlc.
> >
> > 4.1.0 4.1.0
> > vanilla nozlc-v1r4
> >Swap Ins 838 502
> >Swap Outs 1149395 2622895
> >DMA32 allocs 17839113 15863747
> >Normal allocs 129045707 137847920
> >Direct pages scanned 4070089 29046893
> >Kswapd pages scanned 17147837 17140694
> >Kswapd pages reclaimed 17146691 17139601
> >Direct pages reclaimed 1888879 4886630
> >Kswapd efficiency 99% 99%
> >Kswapd velocity 17523.721 17518.928
> >Direct efficiency 46% 16%
> >Direct velocity 4159.306 29687.854
> >Percentage direct scans 19% 62%
> >Page writes by reclaim 1149395.000 2622895.000
> >Page writes file 0 0
> >Page writes anon 1149395 2622895
>
> Interesting, kswapd has no decrease that would counter the increase in
> direct reclaim. So there's more reclaim overall. Does it mean that stutter
> doesn't like LRU and zlc was disrupting LRU?
>

The LRU is being heavily disrupted by both reclaim and compaction
activity. The test is not a reliable means of evaluating reclaim decisions
because of the compaction activity. The main purpose of stutter was as a
proxy measure of desktop interactivity during IO.

As the test does THP allocations, it can trigger the case where zlc can
disable a zone for no reason and instead busy-loop, which is just wrong.

> >The direct page scan and reclaim rates are noticeable. It is possible
> >this will not be a universal win on all workloads but cycling through
> >zonelists waiting for zlc->last_full_zap to expire is not the right
> >decision.
> >
> >Signed-off-by: Mel Gorman <[email protected]>
> >Acked-by: David Rientjes <[email protected]>
>
> It doesn't seem that removal of zlc would increase overhead due to
> "expensive operations no longer being avoided". Making some corner-case
> benchmark(s) worse as a side-effect of different LRU approximation shouldn't
> be a show-stopper. Hence
>
> Acked-by: Vlastimil Babka <[email protected]>
>

Thanks.

> just git grep found some lines that should also be deleted:
>
> include/linux/mmzone.h: * If zlcache_ptr is not NULL, then it is just the address of zlcache,
> include/linux/mmzone.h: * as explained above. If zlcache_ptr is NULL, there is no zlcache.
>

Thanks

> And:
>
> >@@ -3157,7 +2967,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> > gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
> > struct alloc_context ac = {
> > .high_zoneidx = gfp_zone(gfp_mask),
> >- .nodemask = nodemask,
> >+ .nodemask = nodemask ? : &cpuset_current_mems_allowed,
> > .migratetype = gfpflags_to_migratetype(gfp_mask),
> > };
> >
> >@@ -3188,8 +2998,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> > ac.zonelist = zonelist;
> > /* The preferred zone is used for statistics later */
> > preferred_zoneref = first_zones_zonelist(ac.zonelist, ac.high_zoneidx,
> >- ac.nodemask ? : &cpuset_current_mems_allowed,
> >- &ac.preferred_zone);
> >+ ac.nodemask, &ac.preferred_zone);
> > if (!ac.preferred_zone)
> > goto out;
> > ac.classzone_idx = zonelist_zone_idx(preferred_zoneref);
>
> These hunks appear unrelated to zonelist cache? Also they move the
> evaluation of cpuset_current_mems_allowed

They are rebase-related brain damage :(. I'll fix it and retest.

--
Mel Gorman
SUSE Labs

2015-08-20 14:25:33

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 03/10] mm, page_alloc: Remove unnecessary recalculations for dirty zone balancing

On Thu 20-08-15 14:45:26, Mel Gorman wrote:
> On Thu, Aug 20, 2015 at 02:45:27PM +0200, Michal Hocko wrote:
> > On Wed 12-08-15 11:45:28, Mel Gorman wrote:
> > > File-backed pages that will be immediately are balanced between zones but
> > ^written to...
> >
> > > it's unnecessarily expensive.
> >
> > to do WHAT? I guess you meant checking gfp_mask resp. alloc_mask? I
> > doubt it would make a noticeable difference as this is a slow path
> > already but I agree it doesn't make sense to check it again.
> >
>
> File-backed pages that will be immediately written are balanced between
> zones. This heuristic tries to avoid having a single zone filled with
> recently dirtied pages but the checks are unnecessarily expensive. Move
> consider_zone_balanced into the alloc_context instead of checking bitmaps
> multiple times. The patch also gives the parameter a more meaningful name.

Sounds much better. Thanks!
--
Michal Hocko
SUSE Labs

2015-08-20 14:46:03

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm, page_alloc: Delete the zonelist_cache

On 08/20/2015 04:17 PM, Mel Gorman wrote:
> On Thu, Aug 20, 2015 at 03:30:54PM +0200, Vlastimil Babka wrote:
>> These hunks appear unrelated to zonelist cache? Also they move the
>> evaluation of cpuset_current_mems_allowed

Ah forgot to delete the "Also" part. I wanted to write that it moves the
evaluation away from inside the read_mems_allowed_begin() -
read_mems_allowed_retry() pair. But then I realized it's just taking a
*reference* and not going through cpuset_current_mems_allowed yet, so
it's probably OK. Just out of place in this patch.

> They are rebase-related brain damage :(. I'll fix it and retest.
>

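The pair referred to here is the cpuset seqcount retry protocol in the
allocator entry point; roughly (a sketch of the pre-existing pattern, not
something this series changes):

    unsigned int cpuset_mems_cookie;

    retry_cpuset:
    cpuset_mems_cookie = read_mems_allowed_begin();
    /* ... set up the alloc_context and attempt the allocation ... */
    if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
        goto retry_cpuset;
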
2015-08-21 09:29:12

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm, page_alloc: Delete the zonelist_cache

On Thu 20-08-15 14:42:40, Mel Gorman wrote:
> On Thu, Aug 20, 2015 at 03:18:43PM +0200, Michal Hocko wrote:
> > On Wed 12-08-15 11:45:26, Mel Gorman wrote:
> > [...]
> > > 4-node machine stutter
> > > 4-node machine stutter
> > > 4.2.0-rc1 4.2.0-rc1
> > > vanilla nozlc-v1r20
> > > Min mmap 53.9902 ( 0.00%) 49.3629 ( 8.57%)
> > > 1st-qrtle mmap 54.6776 ( 0.00%) 54.1201 ( 1.02%)
> > > 2nd-qrtle mmap 54.9242 ( 0.00%) 54.5961 ( 0.60%)
> > > 3rd-qrtle mmap 55.1817 ( 0.00%) 54.9338 ( 0.45%)
> > > Max-90% mmap 55.3952 ( 0.00%) 55.3929 ( 0.00%)
> > > Max-93% mmap 55.4766 ( 0.00%) 57.5712 ( -3.78%)
> > > Max-95% mmap 55.5522 ( 0.00%) 57.8376 ( -4.11%)
> > > Max-99% mmap 55.7938 ( 0.00%) 63.6180 (-14.02%)
> > > Max mmap 6344.0292 ( 0.00%) 67.2477 ( 98.94%)
> > > Mean mmap 57.3732 ( 0.00%) 54.5680 ( 4.89%)
> >
> > Do you have data for other loads? Because the reclaim counters look
> > quite discouraging to be honest.
> >
>
> None of the other workloads showed changes that were worth reporting.

OK, that is a good sign. I would agree that an extreme and artificial
load shouldn't be considered as a blocker.

> > > 4.1.0 4.1.0
> > > vanilla nozlc-v1r4
> > > Swap Ins 838 502
> > > Swap Outs 1149395 2622895
> >
> > Twice as much swapouts is a lot.
> >
> > > DMA32 allocs 17839113 15863747
> > > Normal allocs 129045707 137847920
> > > Direct pages scanned 4070089 29046893
> >
> > 7x more scans by direct reclaim also sounds bad.
> >
>
> With this benchmark, the results for stutter will be highly variable as
> it's hammering the system. The intent of the test was to measure stalls at
> a time when desktop interactivity went to hell during IO and could stall
> for several minutes. Due to it nature, there is intense reclaim *and*
> compaction activity going on and there is no point drawing conclusions
> from the reclaim stats that are inherently good or bad.
>
> There will be differences in direct reclaim figures because instead of
> looping in the page allocator waiting for zlc to clear, it'll enter direct
> reclaim.

OK, I haven't considered this. kswapd might be stuck for quite some time
but all of them being stuck shouldn't be that likely. But still, this is
not a desirable behavior.

> In effect, the zlc causes processes to busy loop while kswapd
> does the work. If it turns out that this is the correct behaviour then
> we should do that explicitly, not rely on the broken zlc behaviour for
> the same reason we no longer rely on sprinkling congestion_wait() all
> over the place.

Fair point. I do agree that this should be done outside of
get_page_from_freelist. I am still surprised by the considerable
increase of swapouts but that should be handled separately if we see
that in the real world loads.

That being said
Acked-by: Michal Hocko <[email protected]>
--
Michal Hocko
SUSE Labs

2015-08-21 13:42:26

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 06/10] mm: page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd

On 08/12/2015 12:45 PM, Mel Gorman wrote:
> The absence of __GFP_WAIT has been used to identify atomic context in callers
> that hold spinlocks or are in interrupts. They are expected to be high priority
> and have access to one of two watermarks lower than "min". __GFP_HIGH users get
> access to the first lower watermark and can be called the "high priority
> reserve". Atomic users and interrupts access yet another lower watermark
> that can be called the "atomic reserve".
>
> Over time, callers had a requirement to not block when fallback options
> were available. Some have abused __GFP_WAIT leading to a situation where
> an optimistic allocation with a fallback option can access atomic reserves.
>
> This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
> cannot sleep and have no alternative. High priority users continue to use
> __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are
> willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies callers
> that want to wake kswapd for background reclaim. __GFP_WAIT is redefined
> to mean the caller is willing to enter direct reclaim and wake kswapd for
> background reclaim.
>
> This patch then converts a number of sites
>
> o __GFP_ATOMIC is used by callers that are high priority and have memory
> pools for those requests. GFP_ATOMIC uses this flag. Callers with
> interrupts disabled still automatically use the atomic reserves.

Hm I can't see where the latter happens? In gfp_to_alloc_flags(),
ALLOC_HARDER is set for __GFP_ATOMIC, or rt-tasks *not* in interrupt?
What am I missing?

> o Callers that have a limited mempool to guarantee forward progress use
> __GFP_DIRECT_RECLAIM. bio allocations fall into this category where
> kswapd will still be woken but atomic reserves are not used as there
> is a one-entry mempool to guarantee progress.
>
> o Callers that are checking if they are non-blocking should use the
> helper gfpflags_allows_blocking() where possible. This is because

A bit subjective but gfpflags_allow_blocking() sounds better to me.
Or shorter gfp_allows_blocking()?

> checking for __GFP_WAIT as was done historically now can trigger false
> positives. Some exceptions like dm-crypt.c exist where the code intent
> is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
> flag manipulations.
>
> The key hazard to watch out for is callers that removed __GFP_WAIT and
> were depending on access to atomic reserves for inconspicuous reasons.
> In some cases it may be appropriate for them to use __GFP_HIGH.

Hm we might also have a (non-fatal) hazard of callers that directly
combined __GFP_* flags that didn't include __GFP_WAIT, but did wake up
kswapd, and now might be missing __GFP_KSWAPD_RECLAIM. Did you try
checking for those? I imagine it's not a simple task...

> Signed-off-by: Mel Gorman <[email protected]>

>
> diff --git a/Documentation/vm/balance b/Documentation/vm/balance
> index c46e68cf9344..6f1f6fae30f5 100644
> --- a/Documentation/vm/balance
> +++ b/Documentation/vm/balance
> @@ -1,12 +1,14 @@
> Started Jan 2000 by Kanoj Sarcar <[email protected]>
>
> -Memory balancing is needed for non __GFP_WAIT as well as for non
> -__GFP_IO allocations.
> +Memory balancing is needed for !__GFP_ATOMIC and !__GFP_KSWAPD_RECLAIM as
> +well as for non __GFP_IO allocations.
>
> -There are two reasons to be requesting non __GFP_WAIT allocations:
> -the caller can not sleep (typically intr context), or does not want
> -to incur cost overheads of page stealing and possible swap io for
> -whatever reasons.
> +The first reason why a caller may avoid reclaim is that the caller can not
> +sleep due to holding a spinlock or is in interrupt context. The second may
> +be that the caller is willing to fail the allocation without incurring the
> +overhead of page stealing. This may happen for opportunistic high-order

I think "page stealing" has nowadays a different meaning in the
anti-fragmentation context? Should it just say "reclaim"?

> +allocation requests that have order-0 fallback options. In such cases,
> +the caller may also wish to avoid waking kswapd.
>
> __GFP_IO allocation requests are made to prevent file system deadlocks.
>
> diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
> index cba12f34ff77..100d3fbaebae 100644
> --- a/arch/arm/mm/dma-mapping.c
> +++ b/arch/arm/mm/dma-mapping.c
> @@ -650,7 +650,7 @@ static void *__dma_alloc(struct device *dev, size_t size, dma_addr_t *handle,
>
> if (is_coherent || nommu())
> addr = __alloc_simple_buffer(dev, size, gfp, &page);
> - else if (!(gfp & __GFP_WAIT))
> + else if (gfp & __GFP_ATOMIC)
> addr = __alloc_from_pool(size, &page);
> else if (!dev_get_cma_area(dev))
> addr = __alloc_remap_buffer(dev, size, gfp, prot, &page, caller, want_vaddr);
> @@ -1369,7 +1369,7 @@ static void *arm_iommu_alloc_attrs(struct device *dev, size_t size,
> *handle = DMA_ERROR_CODE;
> size = PAGE_ALIGN(size);
>
> - if (!(gfp & __GFP_WAIT))
> + if (gfp & __GFP_ATOMIC)
> return __iommu_alloc_atomic(dev, size, handle);
>
> /*
> diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
> index d16a1cead23f..713d963fb96b 100644
> --- a/arch/arm64/mm/dma-mapping.c
> +++ b/arch/arm64/mm/dma-mapping.c
> @@ -100,7 +100,7 @@ static void *__dma_alloc_coherent(struct device *dev, size_t size,
> if (IS_ENABLED(CONFIG_ZONE_DMA) &&
> dev->coherent_dma_mask <= DMA_BIT_MASK(32))
> flags |= GFP_DMA;
> - if (IS_ENABLED(CONFIG_DMA_CMA) && (flags & __GFP_WAIT)) {
> + if (IS_ENABLED(CONFIG_DMA_CMA) && (flags & __GFP_DIRECT_RECLAIM)) {
> struct page *page;
> void *addr;
>
> @@ -147,7 +147,7 @@ static void *__dma_alloc(struct device *dev, size_t size,
>
> size = PAGE_ALIGN(size);
>
> - if (!coherent && !(flags & __GFP_WAIT)) {
> + if (!coherent && (flags & __GFP_ATOMIC)) {
> struct page *page = NULL;
> void *addr = __alloc_from_pool(size, &page, flags);
>

Hmm these change the lack of __GFP_WAIT to expect __GFP_ATOMIC, so it's
potentially one of those "key hazards" mentioned in the changelog,
right? But here it's not just about using atomic reserves, but using a
completely different allocation function.
E.g. in case of arch/arm/mm/dma-mapping.c:__dma_alloc() I see it can go
to __alloc_remap_buffer -> __dma_alloc_remap -> dma_common_contiguous_remap
which does kmalloc(..., GFP_KERNEL) and has comment "Cannot be used in
non-sleeping contexts".

So I think callers that cannot sleep and did clear __GFP_WAIT before,
are now dangerous unless they set __GFP_ATOMIC?

> diff --git a/drivers/firewire/core-cdev.c b/drivers/firewire/core-cdev.c
> index 2a3973a7c441..dc611c8cad10 100644
> --- a/drivers/firewire/core-cdev.c
> +++ b/drivers/firewire/core-cdev.c
> @@ -486,7 +486,7 @@ static int ioctl_get_info(struct client *client, union ioctl_arg *arg)
> static int add_client_resource(struct client *client,
> struct client_resource *resource, gfp_t gfp_mask)
> {
> - bool preload = !!(gfp_mask & __GFP_WAIT);
> + bool preload = !!(gfp_mask & __GFP_DIRECT_RECLAIM);

Use the helper here to avoid !! as a bonus?

> --- a/drivers/infiniband/core/sa_query.c
> +++ b/drivers/infiniband/core/sa_query.c
> @@ -619,7 +619,7 @@ static void init_mad(struct ib_sa_mad *mad, struct ib_mad_agent *agent)
>
> static int send_mad(struct ib_sa_query *query, int timeout_ms, gfp_t gfp_mask)
> {
> - bool preload = !!(gfp_mask & __GFP_WAIT);
> + bool preload = !!(gfp_mask & __GFP_DIRECT_RECLAIM);
> unsigned long flags;
> int ret, id;
>

Same here.

> diff --git a/drivers/usb/host/u132-hcd.c b/drivers/usb/host/u132-hcd.c
> index d51687780b61..06badad3ab75 100644
> --- a/drivers/usb/host/u132-hcd.c
> +++ b/drivers/usb/host/u132-hcd.c
> @@ -2247,7 +2247,7 @@ static int u132_urb_enqueue(struct usb_hcd *hcd, struct urb *urb,
> {
> struct u132 *u132 = hcd_to_u132(hcd);
> if (irqs_disabled()) {
> - if (__GFP_WAIT & mem_flags) {
> + if (__GFP_DIRECT_RECLAIM & mem_flags) {
> printk(KERN_ERR "invalid context for function that migh"
> "t sleep\n");
> return -EINVAL;

And here - no other flag manipulations and it would match the printk.
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -594,7 +594,7 @@ int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
> if (bits & (EXTENT_IOBITS | EXTENT_BOUNDARY))
> clear = 1;
> again:
> - if (!prealloc && (mask & __GFP_WAIT)) {
> + if (!prealloc && (mask & __GFP_DIRECT_RECLAIM)) {
> /*
> * Don't care for allocation failure here because we might end
> * up not needing the pre-allocated extent state at all, which
> @@ -850,7 +850,7 @@ __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
>
> bits |= EXTENT_FIRST_DELALLOC;
> again:
> - if (!prealloc && (mask & __GFP_WAIT)) {
> + if (!prealloc && (mask & __GFP_DIRECT_RECLAIM)) {
> prealloc = alloc_extent_state(mask);
> BUG_ON(!prealloc);
> }
> @@ -1076,7 +1076,7 @@ int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
> btrfs_debug_check_extent_io_range(tree, start, end);
>
> again:
> - if (!prealloc && (mask & __GFP_WAIT)) {
> + if (!prealloc && (mask & __GFP_DIRECT_RECLAIM)) {
> /*
> * Best effort, don't worry if extent state allocation fails
> * here for the first iteration. We might have a cached state
> @@ -4265,7 +4265,7 @@ int try_release_extent_mapping(struct extent_map_tree *map,
> u64 start = page_offset(page);
> u64 end = start + PAGE_CACHE_SIZE - 1;
>
> - if ((mask & __GFP_WAIT) &&
> + if ((mask & __GFP_DIRECT_RECLAIM) &&
> page->mapping->host->i_size > 16 * 1024 * 1024) {
> u64 len;
> while (start <= end) {

Why not here as well.

> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -669,7 +669,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
> cpumask_var_t cpus;
> int cpu, ret;
>
> - might_sleep_if(gfp_flags & __GFP_WAIT);
> + might_sleep_if(gfp_flags & __GFP_DIRECT_RECLAIM);
>
> if (likely(zalloc_cpumask_var(&cpus, (gfp_flags|__GFP_NOWARN)))) {
> preempt_disable();
> diff --git a/lib/idr.c b/lib/idr.c
> index 5335c43adf46..e5118fc82961 100644
> --- a/lib/idr.c
> +++ b/lib/idr.c
> @@ -399,7 +399,7 @@ void idr_preload(gfp_t gfp_mask)
> * allocation guarantee. Disallow usage from those contexts.
> */
> WARN_ON_ONCE(in_interrupt());
> - might_sleep_if(gfp_mask & __GFP_WAIT);
> + might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
>
> preempt_disable();
>
> @@ -453,7 +453,7 @@ int idr_alloc(struct idr *idr, void *ptr, int start, int end, gfp_t gfp_mask)
> struct idr_layer *pa[MAX_IDR_LEVEL + 1];
> int id;
>
> - might_sleep_if(gfp_mask & __GFP_WAIT);
> + might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
>
> /* sanity checks */
> if (WARN_ON_ONCE(start < 0))
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index f9ebe1c82060..cc5fdc3fb734 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -188,7 +188,7 @@ radix_tree_node_alloc(struct radix_tree_root *root)
> * preloading in the interrupt anyway as all the allocations have to
> * be atomic. So just do normal allocation when in interrupt.
> */
> - if (!(gfp_mask & __GFP_WAIT) && !in_interrupt()) {
> + if (!(gfp_mask & __GFP_DIRECT_RECLAIM) && !in_interrupt()) {
> struct radix_tree_preload *rtp;
>
> /*

These too?

> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index dac5bf59309d..2056d16807de 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -632,7 +632,7 @@ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
> {
> struct bdi_writeback *wb;
>
> - might_sleep_if(gfp & __GFP_WAIT);
> + might_sleep_if(gfp & __GFP_DIRECT_RECLAIM);
>
> if (!memcg_css->parent)
> return &bdi->wb;

ditto

> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2143,7 +2143,7 @@ static bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
> return false;
> if (fail_page_alloc.ignore_gfp_highmem && (gfp_mask & __GFP_HIGHMEM))
> return false;
> - if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & __GFP_WAIT))
> + if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))

Should __GFP_ATOMIC really be here?

> diff --git a/net/sctp/associola.c b/net/sctp/associola.c
> index 197c3f59ecbf..c5fcdd6f85b7 100644
> --- a/net/sctp/associola.c
> +++ b/net/sctp/associola.c
> @@ -1588,7 +1588,7 @@ int sctp_assoc_lookup_laddr(struct sctp_association *asoc,
> /* Set an association id for a given association */
> int sctp_assoc_set_id(struct sctp_association *asoc, gfp_t gfp)
> {
> - bool preload = !!(gfp & __GFP_WAIT);
> + bool preload = !!(gfp & __GFP_DIRECT_RECLAIM);
> int ret;
>
> /* If the id is already assigned, keep it. */

helper?

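The helper requested throughout this review ends up looking roughly like the
following (a sketch; the name used in gfp.h is gfpflags_allow_blocking):

    static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
    {
        return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
    }

    /* e.g. in send_mad(), replacing the open-coded test: */
    bool preload = gfpflags_allow_blocking(gfp_mask);
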
2015-08-21 14:20:56

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 07/10] mm: page_alloc: Rename __GFP_WAIT to __GFP_RECLAIM

On 08/12/2015 12:45 PM, Mel Gorman wrote:
> The absence of __GFP_WAIT was used to signal that the caller was in atomic
> context and could not sleep. Now it is possible to distinguish between true atomic
> context and callers that are not willing to sleep. The latter should clear
> __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing __GFP_WAIT
> behaves differently, there is a risk that people will clear the wrong
> flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly indicate
> what it does -- setting it allows all reclaim activity, clearing it
> prevents it.
>
> Signed-off-by: Mel Gorman <[email protected]>

...

> diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
> index c097909c589c..1d2046e68808 100644
> --- a/drivers/block/drbd/drbd_receiver.c
> +++ b/drivers/block/drbd/drbd_receiver.c
> @@ -357,7 +357,7 @@ drbd_alloc_peer_req(struct drbd_peer_device *peer_device, u64 id, sector_t secto
> }
>
> if (has_payload && data_size) {
> - page = drbd_alloc_pages(peer_device, nr_pages, (gfp_mask & __GFP_WAIT));
> + page = drbd_alloc_pages(peer_device, nr_pages, (gfp_mask & __GFP_RECLAIM));

I think here it should test only for direct reclaim (via the helper) and
thus be moved to patch 06?

> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2226,7 +2226,7 @@ i915_gem_object_get_pages_gtt(struct drm_i915_gem_object *obj)
> mapping = file_inode(obj->base.filp)->i_mapping;
> gfp = mapping_gfp_mask(mapping);
> gfp |= __GFP_NORETRY | __GFP_NOWARN;
> - gfp &= ~(__GFP_IO | __GFP_WAIT);
> + gfp &= ~(__GFP_IO | __GFP_RECLAIM);

Why clear the kswapd reclaim here?

> diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs_private.h b/drivers/staging/lustre/include/linux/libcfs/libcfs_private.h
> index ed37d26eb20d..393270436a4b 100644
> --- a/drivers/staging/lustre/include/linux/libcfs/libcfs_private.h
> +++ b/drivers/staging/lustre/include/linux/libcfs/libcfs_private.h
> @@ -113,7 +113,7 @@ do { \
> do { \
> LASSERT(!in_interrupt() || \
> ((size) <= LIBCFS_VMALLOC_SIZE && \
> - ((mask) & __GFP_WAIT) == 0)); \
> + ((mask) & __GFP_RECLAIM) == 0)); \
> } while (0)

This should test only __GFP_DIRECT_RECLAIM?

> #define LIBCFS_ALLOC_POST(ptr, size) \
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 35660da77921..92e284d0362e 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -718,7 +718,7 @@ int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
> if (start > end)
> goto out;
> spin_unlock(&tree->lock);
> - if (mask & __GFP_WAIT)
> + if (mask & __GFP_RECLAIM)
> cond_resched();
> goto again;
> }
> @@ -1028,7 +1028,7 @@ __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
> if (start > end)
> goto out;
> spin_unlock(&tree->lock);
> - if (mask & __GFP_WAIT)
> + if (mask & __GFP_RECLAIM)
> cond_resched();
> goto again;
> }
> @@ -1253,7 +1253,7 @@ int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
> if (start > end)
> goto out;
> spin_unlock(&tree->lock);
> - if (mask & __GFP_WAIT)
> + if (mask & __GFP_RECLAIM)
> cond_resched();
> first_iteration = false;
> goto again;

This too?

> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index dbd246a14e2f..e066f3afae73 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -104,7 +104,7 @@ struct vm_area_struct;
> * can be cleared when the reclaiming of pages would cause unnecessary
> * disruption.
> */
> -#define __GFP_WAIT (__GFP_DIRECT_RECLAIM|__GFP_KSWAPD_RECLAIM)
> +#define __GFP_RECLAIM (__GFP_DIRECT_RECLAIM|__GFP_KSWAPD_RECLAIM)
> #define __GFP_DIRECT_RECLAIM ((__force gfp_t)___GFP_DIRECT_RECLAIM) /* Caller can reclaim */
> #define __GFP_KSWAPD_RECLAIM ((__force gfp_t)___GFP_KSWAPD_RECLAIM) /* kswapd can wake */
>
> @@ -123,12 +123,12 @@ struct vm_area_struct;
> */
> #define GFP_ATOMIC (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
> #define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM)
> -#define GFP_NOIO (__GFP_WAIT)
> -#define GFP_NOFS (__GFP_WAIT | __GFP_IO)
> -#define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS)
> -#define GFP_TEMPORARY (__GFP_WAIT | __GFP_IO | __GFP_FS | \
> +#define GFP_NOIO (__GFP_RECLAIM)
> +#define GFP_NOFS (__GFP_RECLAIM | __GFP_IO)
> +#define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
> +#define GFP_TEMPORARY (__GFP_RECLAIM | __GFP_IO | __GFP_FS | \
> __GFP_RECLAIMABLE)
> -#define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
> +#define GFP_USER (__GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
> #define GFP_HIGHUSER (GFP_USER | __GFP_HIGHMEM)
> #define GFP_HIGHUSER_MOVABLE (GFP_HIGHUSER | __GFP_MOVABLE)
> #define GFP_IOFS (__GFP_IO | __GFP_FS)

Hmm GFP_IOFS should maybe include __GFP_KSWAPD_RECLAIM? Although I
wonder if it makes sense to use it like "... | GFP_IOFS" and not just as
a mask "... & ~GFP_IOFS". Not including __GFP_KSWAPD_RECLAIM changes the
former use, while including it changes the latter one.
Maybe we should just remove it while at it? There's only a handful of
users. mm/ uses it as a mask, and the rest is in staging/lustre and it's
doing allocations like "__GFP_ZERO | GFP_IOFS" which looks like a
mistake to me - what good is IO or FS without DIRECT_RECLAIM?

It's probably best to remove it or change it to __GFP_IOFS. The form
without underscores suggests usage as parameter to alloc functions and
that's clearly wrong here.
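
To make the two usages concrete, a minimal sketch (gfp_mask and masked
are placeholder names, not code from the series):

	/* As a parameter (the staging/lustre style): without
	 * __GFP_KSWAPD_RECLAIM this neither enters direct reclaim
	 * nor wakes kswapd. */
	page = alloc_pages(__GFP_ZERO | GFP_IOFS, 0);

	/* As a mask (the mm/ style): if GFP_IOFS gained
	 * __GFP_KSWAPD_RECLAIM, this would also strip the kswapd
	 * wakeup as a side effect. */
	gfp_t masked = gfp_mask & ~GFP_IOFS;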

> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index d8e2e3918ce2..4bee2392dbb2 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -2061,7 +2061,7 @@ int netlink_broadcast_filtered(struct sock *ssk, struct sk_buff *skb, u32 portid
> consume_skb(info.skb2);
>
> if (info.delivered) {
> - if (info.congested && (allocation & __GFP_WAIT))
> + if (info.congested && (allocation & __GFP_RECLAIM))
> yield();

Just direct reclaim?

> return 0;
> }
> diff --git a/net/rxrpc/ar-connection.c b/net/rxrpc/ar-connection.c
> index 6631f4f1e39b..b5cd65401a28 100644
> --- a/net/rxrpc/ar-connection.c
> +++ b/net/rxrpc/ar-connection.c
> @@ -500,7 +500,7 @@ int rxrpc_connect_call(struct rxrpc_sock *rx,
> if (bundle->num_conns >= 20) {
> _debug("too many conns");
>
> - if (!(gfp & __GFP_WAIT)) {
> + if (!(gfp & __GFP_RECLAIM)) {
> _leave(" = -EAGAIN");
> return -EAGAIN;
> }

ditto?

2015-08-21 20:39:53

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 06/10] mm: page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd

On Fri, Aug 21, 2015 at 03:42:21PM +0200, Vlastimil Babka wrote:
> On 08/12/2015 12:45 PM, Mel Gorman wrote:
> >__GFP_WAIT has been used to identify atomic context in callers that hold
> >spinlocks or are in interrupts. They are expected to be high priority and
> >have access to one of two watermarks lower than "min". __GFP_HIGH users get
> >access to the first lower watermark and can be called the "high priority
> >reserve". Atomic users and interrupts access yet another lower watermark
> >that can be called the "atomic reserve".
> >
> >Over time, callers had a requirement to not block when fallback options
> >were available. Some have abused __GFP_WAIT leading to a situation where
> >an optimistic allocation with a fallback option can access atomic reserves.
> >
> >This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
> >cannot sleep and have no alternative. High priority users continue to use
> >__GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are
> >willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies callers
> >that want to wake kswapd for background reclaim. __GFP_WAIT is redefined
> >as a caller that is willing to enter direct reclaim and wake kswapd for
> >background reclaim.
> >
> >This patch then converts a number of sites
> >
> >o __GFP_ATOMIC is used by callers that are high priority and have memory
> > pools for those requests. GFP_ATOMIC uses this flag. Callers with
> > interrupts disabled still automatically use the atomic reserves.
>
> Hm I can't see where the latter happens? In gfp_to_alloc_flags(),
> ALLOC_HARDER is set for __GFP_ATOMIC, or rt-tasks *not* in interrupt? What
> am I missing?
>

It was a mistake from an earlier version of the patch that was itself
buggy. I forgot to fix the changelog properly.
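
For reference, the behaviour being described corresponds to logic along
these lines in gfp_to_alloc_flags() (a paraphrase of what the review
describes, not the verbatim kernel code):

	if (gfp_mask & __GFP_ATOMIC)
		alloc_flags |= ALLOC_HARDER;
	else if (unlikely(rt_task(current)) && !in_interrupt())
		alloc_flags |= ALLOC_HARDER;

i.e. interrupt context no longer gets the atomic reserves automatically;
callers must pass __GFP_ATOMIC.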

> >o Callers that have a limited mempool to guarantee forward progress use
> > __GFP_DIRECT_RECLAIM. bio allocations fall into this category where
> > kswapd will still be woken but atomic reserves are not used as there
> > is a one-entry mempool to guarantee progress.
> >
> >o Callers that are checking if they are non-blocking should use the
> > helper gfpflags_allows_blocking() where possible. This is because
>
> A bit subjective but gfpflags_allow_blocking() sounds better to me.
> Or shorter gfp_allows_blocking()?
>

I'll use gfpflags_allow_blocking.
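
For clarity, the helper would presumably reduce to something like this
minimal sketch:

	static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
	{
		return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
	}

so call sites such as the preload tests below become
"bool preload = gfpflags_allow_blocking(gfp_mask);" instead of
open-coding the !! on the flag.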

> > checking for __GFP_WAIT as was done historically now can trigger false
> > positives. Some exceptions like dm-crypt.c exist where the code intent
> > is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
> > flag manipulations.
> >
> >The key hazard to watch out for is callers that removed __GFP_WAIT and
> >were depending on access to atomic reserves for inconspicuous reasons.
> >In some cases it may be appropriate for them to use __GFP_HIGH.
>
> Hm we might also have a (non-fatal) hazard of callers that directly combined
> __GFP_* flags that didn't include __GFP_WAIT, but did wake up kswapd, and
> now might be missing __GFP_KSWAPD_RECLAIM. Did you try checking for those? I
> imagine it's not a simple task...
>

I hadn't searched but there are a small number of callers that
potentially care. I fixed them.

> >Signed-off-by: Mel Gorman <[email protected]>
>
> >
> >diff --git a/Documentation/vm/balance b/Documentation/vm/balance
> >index c46e68cf9344..6f1f6fae30f5 100644
> >--- a/Documentation/vm/balance
> >+++ b/Documentation/vm/balance
> >@@ -1,12 +1,14 @@
> > Started Jan 2000 by Kanoj Sarcar <[email protected]>
> >
> >-Memory balancing is needed for non __GFP_WAIT as well as for non
> >-__GFP_IO allocations.
> >+Memory balancing is needed for !__GFP_ATOMIC and !__GFP_KSWAPD_RECLAIM as
> >+well as for non __GFP_IO allocations.
> >
> >-There are two reasons to be requesting non __GFP_WAIT allocations:
> >-the caller can not sleep (typically intr context), or does not want
> >-to incur cost overheads of page stealing and possible swap io for
> >-whatever reasons.
> >+The first reason why a caller may avoid reclaim is that the caller can not
> >+sleep due to holding a spinlock or is in interrupt context. The second may
> >+be that the caller is willing to fail the allocation without incurring the
> >+overhead of page stealing. This may happen for opportunistic high-order
>
> I think "page stealing" has nowadays a different meaning in the
> anti-fragmentation context? Should it just say "reclaim"?
>

Good point, corrected.

> >+allocation requests that have order-0 fallback options. In such cases,
> >+the caller may also wish to avoid waking kswapd.
> >
> > __GFP_IO allocation requests are made to prevent file system deadlocks.
> >
> >diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
> >index cba12f34ff77..100d3fbaebae 100644
> >--- a/arch/arm/mm/dma-mapping.c
> >+++ b/arch/arm/mm/dma-mapping.c
> >@@ -650,7 +650,7 @@ static void *__dma_alloc(struct device *dev, size_t size, dma_addr_t *handle,
> >
> > if (is_coherent || nommu())
> > addr = __alloc_simple_buffer(dev, size, gfp, &page);
> >- else if (!(gfp & __GFP_WAIT))
> >+ else if (gfp & __GFP_ATOMIC)
> > addr = __alloc_from_pool(size, &page);
> > else if (!dev_get_cma_area(dev))
> > addr = __alloc_remap_buffer(dev, size, gfp, prot, &page, caller, want_vaddr);
> >@@ -1369,7 +1369,7 @@ static void *arm_iommu_alloc_attrs(struct device *dev, size_t size,
> > *handle = DMA_ERROR_CODE;
> > size = PAGE_ALIGN(size);
> >
> >- if (!(gfp & __GFP_WAIT))
> >+ if (gfp & __GFP_ATOMIC)
> > return __iommu_alloc_atomic(dev, size, handle);
> >
> > /*
> >diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
> >index d16a1cead23f..713d963fb96b 100644
> >--- a/arch/arm64/mm/dma-mapping.c
> >+++ b/arch/arm64/mm/dma-mapping.c
> >@@ -100,7 +100,7 @@ static void *__dma_alloc_coherent(struct device *dev, size_t size,
> > if (IS_ENABLED(CONFIG_ZONE_DMA) &&
> > dev->coherent_dma_mask <= DMA_BIT_MASK(32))
> > flags |= GFP_DMA;
> >- if (IS_ENABLED(CONFIG_DMA_CMA) && (flags & __GFP_WAIT)) {
> >+ if (IS_ENABLED(CONFIG_DMA_CMA) && (flags & __GFP_DIRECT_RECLAIM)) {
> > struct page *page;
> > void *addr;
> >
> >@@ -147,7 +147,7 @@ static void *__dma_alloc(struct device *dev, size_t size,
> >
> > size = PAGE_ALIGN(size);
> >
> >- if (!coherent && !(flags & __GFP_WAIT)) {
> >+ if (!coherent && (flags & __GFP_ATOMIC)) {
> > struct page *page = NULL;
> > void *addr = __alloc_from_pool(size, &page, flags);
> >
>
> Hmm these change the lack of __GFP_WAIT to expect __GFP_ATOMIC, so it's
> potentially one of those "key hazards" mentioned in the changelog, right?
> But here it's not just about using atomic reserves, but using a completely
> different allocation function.
> E.g. in case of arch/arm/mm/dma-mapping.c:__dma_alloc() I see it can go to
> __alloc_remap_buffer -> __dma_alloc_remap -> dma_common_contiguous_remap
> which does kmalloc(..., GFP_KERNEL) and has comment "Cannot be used in
> non-sleeping contexts".
>

I completely missed that. It needs to be a check for
__GFP_DIRECT_RECLAIM here, and similarly in the other arm file.
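
A sketch of the corrected test, assuming the same structure as the
quoted hunk:

	else if (!(gfp & __GFP_DIRECT_RECLAIM))
		addr = __alloc_from_pool(size, &page);

so callers that merely cleared __GFP_DIRECT_RECLAIM, without setting
__GFP_ATOMIC, still take the atomic pool path rather than reaching code
that may sleep.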

> So I think callers that cannot sleep and did clear __GFP_WAIT before, are
> now dangerous unless they set __GFP_ATOMIC?
>
> >diff --git a/drivers/firewire/core-cdev.c b/drivers/firewire/core-cdev.c
> >index 2a3973a7c441..dc611c8cad10 100644
> >--- a/drivers/firewire/core-cdev.c
> >+++ b/drivers/firewire/core-cdev.c
> >@@ -486,7 +486,7 @@ static int ioctl_get_info(struct client *client, union ioctl_arg *arg)
> > static int add_client_resource(struct client *client,
> > struct client_resource *resource, gfp_t gfp_mask)
> > {
> >- bool preload = !!(gfp_mask & __GFP_WAIT);
> >+ bool preload = !!(gfp_mask & __GFP_DIRECT_RECLAIM);
>
> Use the helper here to avoid !! as a bonus?
>

Done.

> >--- a/drivers/infiniband/core/sa_query.c
> >+++ b/drivers/infiniband/core/sa_query.c
> >@@ -619,7 +619,7 @@ static void init_mad(struct ib_sa_mad *mad, struct ib_mad_agent *agent)
> >
> > static int send_mad(struct ib_sa_query *query, int timeout_ms, gfp_t gfp_mask)
> > {
> >- bool preload = !!(gfp_mask & __GFP_WAIT);
> >+ bool preload = !!(gfp_mask & __GFP_DIRECT_RECLAIM);
> > unsigned long flags;
> > int ret, id;
> >
>
> Same here.
>

Done.

> >diff --git a/drivers/usb/host/u132-hcd.c b/drivers/usb/host/u132-hcd.c
> >index d51687780b61..06badad3ab75 100644
> >--- a/drivers/usb/host/u132-hcd.c
> >+++ b/drivers/usb/host/u132-hcd.c
> >@@ -2247,7 +2247,7 @@ static int u132_urb_enqueue(struct usb_hcd *hcd, struct urb *urb,
> > {
> > struct u132 *u132 = hcd_to_u132(hcd);
> > if (irqs_disabled()) {
> >- if (__GFP_WAIT & mem_flags) {
> >+ if (__GFP_DIRECT_RECLAIM & mem_flags) {
> > printk(KERN_ERR "invalid context for function that migh"
> > "t sleep\n");
> > return -EINVAL;
>
> And here - no other flag manipulations and it would match the printk.

Fixed

> >--- a/fs/btrfs/extent_io.c
> >+++ b/fs/btrfs/extent_io.c
> >@@ -594,7 +594,7 @@ int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
> > if (bits & (EXTENT_IOBITS | EXTENT_BOUNDARY))
> > clear = 1;
> > again:
> >- if (!prealloc && (mask & __GFP_WAIT)) {
> >+ if (!prealloc && (mask & __GFP_DIRECT_RECLAIM)) {
> > /*
> > * Don't care for allocation failure here because we might end
> > * up not needing the pre-allocated extent state at all, which
> >@@ -850,7 +850,7 @@ __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
> >
> > bits |= EXTENT_FIRST_DELALLOC;
> > again:
> >- if (!prealloc && (mask & __GFP_WAIT)) {
> >+ if (!prealloc && (mask & __GFP_DIRECT_RECLAIM)) {
> > prealloc = alloc_extent_state(mask);
> > BUG_ON(!prealloc);
> > }
> >@@ -1076,7 +1076,7 @@ int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
> > btrfs_debug_check_extent_io_range(tree, start, end);
> >
> > again:
> >- if (!prealloc && (mask & __GFP_WAIT)) {
> >+ if (!prealloc && (mask & __GFP_DIRECT_RECLAIM)) {
> > /*
> > * Best effort, don't worry if extent state allocation fails
> > * here for the first iteration. We might have a cached state
> >@@ -4265,7 +4265,7 @@ int try_release_extent_mapping(struct extent_map_tree *map,
> > u64 start = page_offset(page);
> > u64 end = start + PAGE_CACHE_SIZE - 1;
> >
> >- if ((mask & __GFP_WAIT) &&
> >+ if ((mask & __GFP_DIRECT_RECLAIM) &&
> > page->mapping->host->i_size > 16 * 1024 * 1024) {
> > u64 len;
> > while (start <= end) {
>
> Why not here as well.
>

Done

> >--- a/kernel/smp.c
> >+++ b/kernel/smp.c
> >@@ -669,7 +669,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
> > cpumask_var_t cpus;
> > int cpu, ret;
> >
> >- might_sleep_if(gfp_flags & __GFP_WAIT);
> >+ might_sleep_if(gfp_flags & __GFP_DIRECT_RECLAIM);
> >
> > if (likely(zalloc_cpumask_var(&cpus, (gfp_flags|__GFP_NOWARN)))) {
> > preempt_disable();
> >diff --git a/lib/idr.c b/lib/idr.c
> >index 5335c43adf46..e5118fc82961 100644
> >--- a/lib/idr.c
> >+++ b/lib/idr.c
> >@@ -399,7 +399,7 @@ void idr_preload(gfp_t gfp_mask)
> > * allocation guarantee. Disallow usage from those contexts.
> > */
> > WARN_ON_ONCE(in_interrupt());
> >- might_sleep_if(gfp_mask & __GFP_WAIT);
> >+ might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
> >
> > preempt_disable();
> >
> >@@ -453,7 +453,7 @@ int idr_alloc(struct idr *idr, void *ptr, int start, int end, gfp_t gfp_mask)
> > struct idr_layer *pa[MAX_IDR_LEVEL + 1];
> > int id;
> >
> >- might_sleep_if(gfp_mask & __GFP_WAIT);
> >+ might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
> >
> > /* sanity checks */
> > if (WARN_ON_ONCE(start < 0))
> >diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> >index f9ebe1c82060..cc5fdc3fb734 100644
> >--- a/lib/radix-tree.c
> >+++ b/lib/radix-tree.c
> >@@ -188,7 +188,7 @@ radix_tree_node_alloc(struct radix_tree_root *root)
> > * preloading in the interrupt anyway as all the allocations have to
> > * be atomic. So just do normal allocation when in interrupt.
> > */
> >- if (!(gfp_mask & __GFP_WAIT) && !in_interrupt()) {
> >+ if (!(gfp_mask & __GFP_DIRECT_RECLAIM) && !in_interrupt()) {
> > struct radix_tree_preload *rtp;
> >
> > /*
>
> These too?
>

Yep

> >diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> >index dac5bf59309d..2056d16807de 100644
> >--- a/mm/backing-dev.c
> >+++ b/mm/backing-dev.c
> >@@ -632,7 +632,7 @@ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
> > {
> > struct bdi_writeback *wb;
> >
> >- might_sleep_if(gfp & __GFP_WAIT);
> >+ might_sleep_if(gfp & __GFP_DIRECT_RECLAIM);
> >
> > if (!memcg_css->parent)
> > return &bdi->wb;
>
> ditto
>

Indeed.

> >--- a/mm/page_alloc.c
> >+++ b/mm/page_alloc.c
> >@@ -2143,7 +2143,7 @@ static bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
> > return false;
> > if (fail_page_alloc.ignore_gfp_highmem && (gfp_mask & __GFP_HIGHMEM))
> > return false;
> >- if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & __GFP_WAIT))
> >+ if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
>
> Should __GFP_ATOMIC really be here?
>

I felt it was safer because it is in line with the intent of
fail_page_alloc.ignore_gfp_wait.

> >diff --git a/net/sctp/associola.c b/net/sctp/associola.c
> >index 197c3f59ecbf..c5fcdd6f85b7 100644
> >--- a/net/sctp/associola.c
> >+++ b/net/sctp/associola.c
> >@@ -1588,7 +1588,7 @@ int sctp_assoc_lookup_laddr(struct sctp_association *asoc,
> > /* Set an association id for a given association */
> > int sctp_assoc_set_id(struct sctp_association *asoc, gfp_t gfp)
> > {
> >- bool preload = !!(gfp & __GFP_WAIT);
> >+ bool preload = !!(gfp & __GFP_DIRECT_RECLAIM);
> > int ret;
> >
> > /* If the id is already assigned, keep it. */
>
> helper?
>

Yes. Thanks very much

--
Mel Gorman
SUSE Labs

2015-08-21 20:56:54

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 07/10] mm: page_alloc: Rename __GFP_WAIT to __GFP_RECLAIM

On Fri, Aug 21, 2015 at 04:20:52PM +0200, Vlastimil Babka wrote:
> On 08/12/2015 12:45 PM, Mel Gorman wrote:
> >__GFP_WAIT was used to signal that the caller was in atomic context and
> >could not sleep. Now it is possible to distinguish between true atomic
> >context and callers that are not willing to sleep. The latter should clear
> >__GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing __GFP_WAIT
> >behaves differently, there is a risk that people will clear the wrong
> >flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly indicate
> >what it does -- setting it allows all reclaim activity, clearing it
> >prevents it.
> >
> >Signed-off-by: Mel Gorman <[email protected]>
>
> ...
>
> >diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
> >index c097909c589c..1d2046e68808 100644
> >--- a/drivers/block/drbd/drbd_receiver.c
> >+++ b/drivers/block/drbd/drbd_receiver.c
> >@@ -357,7 +357,7 @@ drbd_alloc_peer_req(struct drbd_peer_device *peer_device, u64 id, sector_t secto
> > }
> >
> > if (has_payload && data_size) {
> >- page = drbd_alloc_pages(peer_device, nr_pages, (gfp_mask & __GFP_WAIT));
> >+ page = drbd_alloc_pages(peer_device, nr_pages, (gfp_mask & __GFP_RECLAIM));
>
> I think here it should test only for direct reclaim (via the helper) and
> thus moved to patch 06?

Yeah.

> >--- a/drivers/gpu/drm/i915/i915_gem.c
> >+++ b/drivers/gpu/drm/i915/i915_gem.c
> >@@ -2226,7 +2226,7 @@ i915_gem_object_get_pages_gtt(struct drm_i915_gem_object *obj)
> > mapping = file_inode(obj->base.filp)->i_mapping;
> > gfp = mapping_gfp_mask(mapping);
> > gfp |= __GFP_NORETRY | __GFP_NOWARN;
> >- gfp &= ~(__GFP_IO | __GFP_WAIT);
> >+ gfp &= ~(__GFP_IO | __GFP_RECLAIM);
>
> Why clear the kswapd reclaim here?
>

Because in patch 6 it was using __GFP_NO_KSWAPD, so this is in line
with the expected behaviour of the code.
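
As a sketch of the equivalence (not the actual diff): clearing
__GFP_RECLAIM clears both reclaim bits, matching the pre-series
__GFP_NO_KSWAPD plus ~__GFP_WAIT intent:

	gfp &= ~(__GFP_IO | __GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM);
	/* is identical to the patch 7 form: */
	gfp &= ~(__GFP_IO | __GFP_RECLAIM);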

> >diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs_private.h b/drivers/staging/lustre/include/linux/libcfs/libcfs_private.h
> >index ed37d26eb20d..393270436a4b 100644
> >--- a/drivers/staging/lustre/include/linux/libcfs/libcfs_private.h
> >+++ b/drivers/staging/lustre/include/linux/libcfs/libcfs_private.h
> >@@ -113,7 +113,7 @@ do { \
> > do { \
> > LASSERT(!in_interrupt() || \
> > ((size) <= LIBCFS_VMALLOC_SIZE && \
> >- ((mask) & __GFP_WAIT) == 0)); \
> >+ ((mask) & __GFP_RECLAIM) == 0)); \
> > } while (0)
>
> This should test only __GFP_DIRECT_RECLAIM?
>

Yes and it should be in patch 6.

> > #define LIBCFS_ALLOC_POST(ptr, size) \
> >diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> >index 35660da77921..92e284d0362e 100644
> >--- a/fs/btrfs/extent_io.c
> >+++ b/fs/btrfs/extent_io.c
> >@@ -718,7 +718,7 @@ int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
> > if (start > end)
> > goto out;
> > spin_unlock(&tree->lock);
> >- if (mask & __GFP_WAIT)
> >+ if (mask & __GFP_RECLAIM)
> > cond_resched();
> > goto again;
> > }
> >@@ -1028,7 +1028,7 @@ __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
> > if (start > end)
> > goto out;
> > spin_unlock(&tree->lock);
> >- if (mask & __GFP_WAIT)
> >+ if (mask & __GFP_RECLAIM)
> > cond_resched();
> > goto again;
> > }
> >@@ -1253,7 +1253,7 @@ int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
> > if (start > end)
> > goto out;
> > spin_unlock(&tree->lock);
> >- if (mask & __GFP_WAIT)
> >+ if (mask & __GFP_RECLAIM)
> > cond_resched();
> > first_iteration = false;
> > goto again;
>
> This too?
>

Yes.

> >diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> >index dbd246a14e2f..e066f3afae73 100644
> >--- a/include/linux/gfp.h
> >+++ b/include/linux/gfp.h
> >@@ -104,7 +104,7 @@ struct vm_area_struct;
> > * can be cleared when the reclaiming of pages would cause unnecessary
> > * disruption.
> > */
> >-#define __GFP_WAIT (__GFP_DIRECT_RECLAIM|__GFP_KSWAPD_RECLAIM)
> >+#define __GFP_RECLAIM (__GFP_DIRECT_RECLAIM|__GFP_KSWAPD_RECLAIM)
> > #define __GFP_DIRECT_RECLAIM ((__force gfp_t)___GFP_DIRECT_RECLAIM) /* Caller can reclaim */
> > #define __GFP_KSWAPD_RECLAIM ((__force gfp_t)___GFP_KSWAPD_RECLAIM) /* kswapd can wake */
> >
> >@@ -123,12 +123,12 @@ struct vm_area_struct;
> > */
> > #define GFP_ATOMIC (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
> > #define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM)
> >-#define GFP_NOIO (__GFP_WAIT)
> >-#define GFP_NOFS (__GFP_WAIT | __GFP_IO)
> >-#define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS)
> >-#define GFP_TEMPORARY (__GFP_WAIT | __GFP_IO | __GFP_FS | \
> >+#define GFP_NOIO (__GFP_RECLAIM)
> >+#define GFP_NOFS (__GFP_RECLAIM | __GFP_IO)
> >+#define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
> >+#define GFP_TEMPORARY (__GFP_RECLAIM | __GFP_IO | __GFP_FS | \
> > __GFP_RECLAIMABLE)
> >-#define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
> >+#define GFP_USER (__GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
> > #define GFP_HIGHUSER (GFP_USER | __GFP_HIGHMEM)
> > #define GFP_HIGHUSER_MOVABLE (GFP_HIGHUSER | __GFP_MOVABLE)
> > #define GFP_IOFS (__GFP_IO | __GFP_FS)
>
> Hmm GFP_IOFS should maybe include __GFP_KSWAPD_RECLAIM? Although I wonder if
> it makes sense to use it like "... | GFP_IOFS" and not just as a mask "... &
> ~GFP_IOFS". Not including __GFP_KSWAPD_RECLAIM changes the former use, while
> including it changes the latter one.
> Maybe we should just remove it while at it? There's only a handful of users.
> mm/ uses it as a mask, and the rest is in staging/lustre and it's doing
> allocations like "__GFP_ZERO | GFP_IOFS" which looks like a mistake to me -
> what good is IO or FS without DIRECT_RECLAIM?
>
> It's probably best to remove it or change it to __GFP_IOFS. The form
> without underscores suggests usage as parameter to alloc functions and
> that's clearly wrong here.
>

I updated GFP_IOFS to include the flag but kept its existence. A few
sites needed to be converted to (__GFP_IO | __GFP_FS) to still be
correct. It's now part of patch 6.
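
For the mask users in mm/, that conversion is simply (sketch):

	gfp_t stripped = gfp_mask & ~(__GFP_IO | __GFP_FS);

which keeps the old semantics now that GFP_IOFS also carries
__GFP_KSWAPD_RECLAIM.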

> >diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> >index d8e2e3918ce2..4bee2392dbb2 100644
> >--- a/net/netlink/af_netlink.c
> >+++ b/net/netlink/af_netlink.c
> >@@ -2061,7 +2061,7 @@ int netlink_broadcast_filtered(struct sock *ssk, struct sk_buff *skb, u32 portid
> > consume_skb(info.skb2);
> >
> > if (info.delivered) {
> >- if (info.congested && (allocation & __GFP_WAIT))
> >+ if (info.congested && (allocation & __GFP_RECLAIM))
> > yield();
>
> Just direct reclaim?
>

Yeah

> > return 0;
> > }
> >diff --git a/net/rxrpc/ar-connection.c b/net/rxrpc/ar-connection.c
> >index 6631f4f1e39b..b5cd65401a28 100644
> >--- a/net/rxrpc/ar-connection.c
> >+++ b/net/rxrpc/ar-connection.c
> >@@ -500,7 +500,7 @@ int rxrpc_connect_call(struct rxrpc_sock *rx,
> > if (bundle->num_conns >= 20) {
> > _debug("too many conns");
> >
> >- if (!(gfp & __GFP_WAIT)) {
> >+ if (!(gfp & __GFP_RECLAIM)) {
> > _leave(" = -EAGAIN");
> > return -EAGAIN;
> > }
>
> ditto?
>

Yeah.

--
Mel Gorman
SUSE Labs