2015-06-08 13:56:45

by Mel Gorman

Subject: [RFC PATCH 00/25] Move LRU page reclaim from zones to nodes

This is an RFC series against 4.0 that moves LRUs from the zones to the
node. In concept, this is straightforward but there are a lot of details
so I'm posting it early to see what people think. The motivations are:

1. Currently, reclaim on node 0 behaves differently from reclaim on node 1
because the aging rules are subtly different. As a result, workloads may
exhibit different behaviour depending on which node they were scheduled on.

2. The residency of a page partially depends on which zone the page was
allocated from. The fair zone allocation policy mitigates this, but it is
only a partial solution and it introduces overhead in the page allocator
paths.

3. kswapd and the page allocator play special games with the order in which
they scan zones to avoid interfering with each other, but the result is
unpredictable.

4. The differing scan activity and ordering for zone reclaim are very
difficult to predict.

5. Slab shrinkers are node-based, which makes relating page reclaim to
slab reclaim harder than it should be.

The reason we have zone-based reclaim is that we used to have
large highmem zones in common configurations and it was necessary
to quickly find ZONE_NORMAL pages for reclaim. Today, this is much
less of a concern as machines with lots of memory will (or should) use
64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are
rare. Machines that do use highmem should have lower highmem:lowmem
ratios than the ones we worried about in the past.

Conceptually, moving to node LRUs should be easier to understand. The
page allocator plays fewer tricks to game reclaim and reclaim behaves
similarly on all nodes.
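
To illustrate the structural change, here is a simplified sketch (not the
patched structures verbatim; field names follow the diffs later in the
series) of the LRU state moving from struct zone to struct pglist_data:

	/* Before this series: each zone carries its own LRU state */
	struct zone {
		/* ... other fields ... */
		spinlock_t	lru_lock;
		struct lruvec	lruvec;		/* LRU lists + reclaim stats */
	};

	/* After: the LRU state lives in the node and is shared by all
	 * zones belonging to that node */
	typedef struct pglist_data {
		/* ... other fields ... */
		spinlock_t	lru_lock;
		struct lruvec	lruvec;
	} pg_data_t;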

The series is very long and bisection will be hazardous because results
can be misleading while the infrastructure is being reshuffled. The
rational bisection points are:

[PATCH 01/25] mm, vmstat: Add infrastructure for per-node vmstats
[PATCH 19/25] mm, vmscan: Account in vmstat for pages skipped during reclaim
[PATCH 21/25] mm, page_alloc: Defer zlc_setup until it is known it is required
[PATCH 23/25] mm, page_alloc: Delete the zonelist_cache
[PATCH 25/25] mm: page_alloc: Take fewer passes when allocating to the low watermark

It was tested on a UMA machine (8 cores, single socket) and a NUMA machine
(48 cores, 4 sockets). The page allocator tests showed marginal differences
in aim9, a page fault microbenchmark, a page allocator microbenchmark and
ebizzy. This was expected as the affected paths are small in comparison to
the overall workloads.

I also tested using fstest on zero-length files to stress slab reclaim. It
showed no major differences in performance or stats.

A THP-based test case that stresses compaction was inconclusive. It showed
differences in the THP allocation success rate and both gains and losses in
the time it takes to allocate THP depending on the number of threads running.

Tests did show differences in the pages allocated from each zone. This is
because the fair zone allocation policy is removed; with node-based LRU
reclaim it *should* not be necessary. It would be preferable, though, if
the original database workload that motivated the introduction of that
policy were retested with this series.

The raw figures as such are not that interesting -- things perform more
or less the same, which is what you'd hope.

arch/s390/appldata/appldata_mem.c | 2 +-
arch/tile/mm/pgtable.c | 18 +-
drivers/base/node.c | 73 +--
drivers/staging/android/lowmemorykiller.c | 12 +-
fs/fs-writeback.c | 8 +-
fs/fuse/file.c | 8 +-
fs/nfs/internal.h | 2 +-
fs/nfs/write.c | 2 +-
fs/proc/meminfo.c | 14 +-
include/linux/backing-dev.h | 2 +-
include/linux/memcontrol.h | 15 +-
include/linux/mm_inline.h | 4 +-
include/linux/mmzone.h | 224 ++++------
include/linux/swap.h | 11 +-
include/linux/topology.h | 2 +-
include/linux/vm_event_item.h | 11 +-
include/linux/vmstat.h | 94 +++-
include/linux/writeback.h | 2 +-
include/trace/events/vmscan.h | 10 +-
include/trace/events/writeback.h | 6 +-
kernel/power/snapshot.c | 10 +-
kernel/sysctl.c | 4 +-
mm/backing-dev.c | 14 +-
mm/compaction.c | 25 +-
mm/filemap.c | 16 +-
mm/huge_memory.c | 14 +-
mm/internal.h | 11 +-
mm/memcontrol.c | 37 +-
mm/memory-failure.c | 4 +-
mm/memory_hotplug.c | 2 +-
mm/mempolicy.c | 2 +-
mm/migrate.c | 31 +-
mm/mlock.c | 12 +-
mm/mmap.c | 4 +-
mm/nommu.c | 4 +-
mm/page-writeback.c | 109 ++---
mm/page_alloc.c | 489 ++++++--------------
mm/rmap.c | 16 +-
mm/shmem.c | 12 +-
mm/swap.c | 66 +--
mm/swap_state.c | 4 +-
mm/truncate.c | 2 +-
mm/vmscan.c | 718 ++++++++++++++----------------
mm/vmstat.c | 308 ++++++++++---
mm/workingset.c | 49 +-
45 files changed, 1225 insertions(+), 1258 deletions(-)

--
2.3.5


2015-06-08 13:57:10

by Mel Gorman

Subject: [PATCH 01/25] mm, vmstat: Add infrastructure for per-node vmstats

VM statistic counters used for reclaim decisions are zone-based. If the
kernel is to reclaim on a per-node basis then we need to track per-node
statistics, but there is no infrastructure for that. This patch adds the
infrastructure in preparation; it is currently unused.
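
As a rough usage sketch (illustrative only; NR_EXAMPLE_STAT is a
hypothetical item because this patch adds no node_stat_item entries yet),
the split between summing the old per-zone counters and reading the new
native per-node counters looks like:

	/* Existing zone-based counters for a node: sum the node's zones */
	unsigned long mlocked = sum_zone_node_page_state(nid, NR_MLOCK);

	/* New native per-node counters, updated and read via the pgdat */
	struct pglist_data *pgdat = NODE_DATA(nid);
	__mod_node_page_state(pgdat, NR_EXAMPLE_STAT, 1);
	unsigned long val = node_page_state(pgdat, NR_EXAMPLE_STAT);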

Signed-off-by: Mel Gorman <[email protected]>
---
drivers/base/node.c | 72 ++++++++-------
include/linux/mmzone.h | 12 +++
include/linux/vmstat.h | 94 +++++++++++++++++--
mm/page_alloc.c | 9 +-
mm/vmstat.c | 240 ++++++++++++++++++++++++++++++++++++++++++++-----
5 files changed, 364 insertions(+), 63 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 36fabe43cd44..0b6392789b66 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -74,16 +74,16 @@ static ssize_t node_read_meminfo(struct device *dev,
nid, K(i.totalram),
nid, K(i.freeram),
nid, K(i.totalram - i.freeram),
- nid, K(node_page_state(nid, NR_ACTIVE_ANON) +
- node_page_state(nid, NR_ACTIVE_FILE)),
- nid, K(node_page_state(nid, NR_INACTIVE_ANON) +
- node_page_state(nid, NR_INACTIVE_FILE)),
- nid, K(node_page_state(nid, NR_ACTIVE_ANON)),
- nid, K(node_page_state(nid, NR_INACTIVE_ANON)),
- nid, K(node_page_state(nid, NR_ACTIVE_FILE)),
- nid, K(node_page_state(nid, NR_INACTIVE_FILE)),
- nid, K(node_page_state(nid, NR_UNEVICTABLE)),
- nid, K(node_page_state(nid, NR_MLOCK)));
+ nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_ANON) +
+ sum_zone_node_page_state(nid, NR_ACTIVE_FILE)),
+ nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_ANON) +
+ sum_zone_node_page_state(nid, NR_INACTIVE_FILE)),
+ nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_ANON)),
+ nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_ANON)),
+ nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_FILE)),
+ nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_FILE)),
+ nid, K(sum_zone_node_page_state(nid, NR_UNEVICTABLE)),
+ nid, K(sum_zone_node_page_state(nid, NR_MLOCK)));

#ifdef CONFIG_HIGHMEM
n += sprintf(buf + n,
@@ -115,28 +115,28 @@ static ssize_t node_read_meminfo(struct device *dev,
"Node %d AnonHugePages: %8lu kB\n"
#endif
,
- nid, K(node_page_state(nid, NR_FILE_DIRTY)),
- nid, K(node_page_state(nid, NR_WRITEBACK)),
- nid, K(node_page_state(nid, NR_FILE_PAGES)),
- nid, K(node_page_state(nid, NR_FILE_MAPPED)),
- nid, K(node_page_state(nid, NR_ANON_PAGES)),
+ nid, K(sum_zone_node_page_state(nid, NR_FILE_DIRTY)),
+ nid, K(sum_zone_node_page_state(nid, NR_WRITEBACK)),
+ nid, K(sum_zone_node_page_state(nid, NR_FILE_PAGES)),
+ nid, K(sum_zone_node_page_state(nid, NR_FILE_MAPPED)),
+ nid, K(sum_zone_node_page_state(nid, NR_ANON_PAGES)),
nid, K(i.sharedram),
- nid, node_page_state(nid, NR_KERNEL_STACK) *
+ nid, sum_zone_node_page_state(nid, NR_KERNEL_STACK) *
THREAD_SIZE / 1024,
- nid, K(node_page_state(nid, NR_PAGETABLE)),
- nid, K(node_page_state(nid, NR_UNSTABLE_NFS)),
- nid, K(node_page_state(nid, NR_BOUNCE)),
- nid, K(node_page_state(nid, NR_WRITEBACK_TEMP)),
- nid, K(node_page_state(nid, NR_SLAB_RECLAIMABLE) +
- node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
- nid, K(node_page_state(nid, NR_SLAB_RECLAIMABLE)),
+ nid, K(sum_zone_node_page_state(nid, NR_PAGETABLE)),
+ nid, K(sum_zone_node_page_state(nid, NR_UNSTABLE_NFS)),
+ nid, K(sum_zone_node_page_state(nid, NR_BOUNCE)),
+ nid, K(sum_zone_node_page_state(nid, NR_WRITEBACK_TEMP)),
+ nid, K(sum_zone_node_page_state(nid, NR_SLAB_RECLAIMABLE) +
+ sum_zone_node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
+ nid, K(sum_zone_node_page_state(nid, NR_SLAB_RECLAIMABLE)),
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE))
+ nid, K(sum_zone_node_page_state(nid, NR_SLAB_UNRECLAIMABLE))
, nid,
- K(node_page_state(nid, NR_ANON_TRANSPARENT_HUGEPAGES) *
+ K(sum_zone_node_page_state(nid, NR_ANON_TRANSPARENT_HUGEPAGES) *
HPAGE_PMD_NR));
#else
- nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE)));
+ nid, K(sum_zone_node_page_state(nid, NR_SLAB_UNRECLAIMABLE)));
#endif
n += hugetlb_report_node_meminfo(nid, buf + n);
return n;
@@ -155,12 +155,12 @@ static ssize_t node_read_numastat(struct device *dev,
"interleave_hit %lu\n"
"local_node %lu\n"
"other_node %lu\n",
- node_page_state(dev->id, NUMA_HIT),
- node_page_state(dev->id, NUMA_MISS),
- node_page_state(dev->id, NUMA_FOREIGN),
- node_page_state(dev->id, NUMA_INTERLEAVE_HIT),
- node_page_state(dev->id, NUMA_LOCAL),
- node_page_state(dev->id, NUMA_OTHER));
+ sum_zone_node_page_state(dev->id, NUMA_HIT),
+ sum_zone_node_page_state(dev->id, NUMA_MISS),
+ sum_zone_node_page_state(dev->id, NUMA_FOREIGN),
+ sum_zone_node_page_state(dev->id, NUMA_INTERLEAVE_HIT),
+ sum_zone_node_page_state(dev->id, NUMA_LOCAL),
+ sum_zone_node_page_state(dev->id, NUMA_OTHER));
}
static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);

@@ -168,12 +168,18 @@ static ssize_t node_read_vmstat(struct device *dev,
struct device_attribute *attr, char *buf)
{
int nid = dev->id;
+ struct pglist_data *pgdat = NODE_DATA(nid);
int i;
int n = 0;

for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
n += sprintf(buf+n, "%s %lu\n", vmstat_text[i],
- node_page_state(nid, i));
+ sum_zone_node_page_state(nid, i));
+
+ for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
+ n += sprintf(buf+n, "%s %lu\n",
+ vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
+ node_page_state(pgdat, i));

return n;
}
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2782df47101e..c52a3e3f178c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -159,6 +159,10 @@ enum zone_stat_item {
NR_FREE_CMA_PAGES,
NR_VM_ZONE_STAT_ITEMS };

+enum node_stat_item {
+ NR_VM_NODE_STAT_ITEMS
+};
+
/*
* We do arithmetic on the LRU lists in various places in the code,
* so it is important to keep the active lists LRU_ACTIVE higher in
@@ -269,6 +273,11 @@ struct per_cpu_pageset {
#endif
};

+struct per_cpu_nodestat {
+ s8 stat_threshold;
+ s8 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
+};
+
#endif /* !__GENERATING_BOUNDS.H */

enum zone_type {
@@ -762,6 +771,9 @@ typedef struct pglist_data {
/* Number of pages migrated during the rate limiting time interval */
unsigned long numabalancing_migrate_nr_pages;
#endif
+
+ struct per_cpu_nodestat __percpu *per_cpu_nodestats;
+ atomic_long_t vm_stat[NR_VM_NODE_STAT_ITEMS];
} pg_data_t;

#define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 82e7db7f7100..99564636ddfc 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -106,20 +106,38 @@ static inline void vm_events_fold_cpu(int cpu)
zone_idx(zone), delta)

/*
- * Zone based page accounting with per cpu differentials.
+ * Zone and node-based page accounting with per cpu differentials.
*/
-extern atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
+extern atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS];
+extern atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS];

static inline void zone_page_state_add(long x, struct zone *zone,
enum zone_stat_item item)
{
atomic_long_add(x, &zone->vm_stat[item]);
- atomic_long_add(x, &vm_stat[item]);
+ atomic_long_add(x, &vm_zone_stat[item]);
+}
+
+static inline void node_page_state_add(long x, struct pglist_data *pgdat,
+ enum node_stat_item item)
+{
+ atomic_long_add(x, &pgdat->vm_stat[item]);
+ atomic_long_add(x, &vm_node_stat[item]);
}

static inline unsigned long global_page_state(enum zone_stat_item item)
{
- long x = atomic_long_read(&vm_stat[item]);
+ long x = atomic_long_read(&vm_zone_stat[item]);
+#ifdef CONFIG_SMP
+ if (x < 0)
+ x = 0;
+#endif
+ return x;
+}
+
+static inline unsigned long global_node_page_state(enum node_stat_item item)
+{
+ long x = atomic_long_read(&vm_node_stat[item]);
#ifdef CONFIG_SMP
if (x < 0)
x = 0;
@@ -138,6 +156,17 @@ static inline unsigned long zone_page_state(struct zone *zone,
return x;
}

+static inline unsigned long node_page_state(struct pglist_data *pgdat,
+ enum node_stat_item item)
+{
+ long x = atomic_long_read(&pgdat->vm_stat[item]);
+#ifdef CONFIG_SMP
+ if (x < 0)
+ x = 0;
+#endif
+ return x;
+}
+
/*
* More accurate version that also considers the currently pending
* deltas. For that we need to loop over all cpus to find the current
@@ -166,7 +195,7 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
* is called frequently in a NUMA machine, so try to be as
* frugal as possible.
*/
-static inline unsigned long node_page_state(int node,
+static inline unsigned long sum_zone_node_page_state(int node,
enum zone_stat_item item)
{
struct zone *zones = NODE_DATA(node)->node_zones;
@@ -189,27 +218,39 @@ extern void zone_statistics(struct zone *, struct zone *, gfp_t gfp);

#else

-#define node_page_state(node, item) global_page_state(item)
#define zone_statistics(_zl, _z, gfp) do { } while (0)

#endif /* CONFIG_NUMA */

#define add_zone_page_state(__z, __i, __d) mod_zone_page_state(__z, __i, __d)
#define sub_zone_page_state(__z, __i, __d) mod_zone_page_state(__z, __i, -(__d))
+#define add_node_page_state(__p, __i, __d) mod_node_page_state(__p, __i, __d)
+#define sub_node_page_state(__p, __i, __d) mod_node_page_state(__p, __i, -(__d))

#ifdef CONFIG_SMP
void __mod_zone_page_state(struct zone *, enum zone_stat_item item, int);
void __inc_zone_page_state(struct page *, enum zone_stat_item);
void __dec_zone_page_state(struct page *, enum zone_stat_item);

+void __mod_node_page_state(struct pglist_data *, enum node_stat_item item, int);
+void __inc_node_page_state(struct page *, enum node_stat_item);
+void __dec_node_page_state(struct page *, enum node_stat_item);
+
void mod_zone_page_state(struct zone *, enum zone_stat_item, int);
void inc_zone_page_state(struct page *, enum zone_stat_item);
void dec_zone_page_state(struct page *, enum zone_stat_item);

+void mod_node_page_state(struct pglist_data *, enum node_stat_item, int);
+void inc_node_page_state(struct page *, enum node_stat_item);
+void dec_node_page_state(struct page *, enum node_stat_item);
+
extern void inc_zone_state(struct zone *, enum zone_stat_item);
+extern void inc_node_state(struct pglist_data *, enum node_stat_item);
extern void __inc_zone_state(struct zone *, enum zone_stat_item);
+extern void __inc_node_state(struct pglist_data *, enum node_stat_item);
extern void dec_zone_state(struct zone *, enum zone_stat_item);
extern void __dec_zone_state(struct zone *, enum zone_stat_item);
+extern void __dec_node_state(struct pglist_data *, enum node_stat_item);

void cpu_vm_stats_fold(int cpu);
void refresh_zone_stat_thresholds(void);
@@ -232,16 +273,34 @@ static inline void __mod_zone_page_state(struct zone *zone,
zone_page_state_add(delta, zone, item);
}

+static inline void __mod_node_page_state(struct pglist_data *pgdat,
+ enum node_stat_item item, int delta)
+{
+ node_page_state_add(delta, pgdat, item);
+}
+
static inline void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
{
atomic_long_inc(&zone->vm_stat[item]);
- atomic_long_inc(&vm_stat[item]);
+ atomic_long_inc(&vm_zone_stat[item]);
+}
+
+static inline void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
+{
+ atomic_long_inc(&pgdat->vm_stat[item]);
+ atomic_long_inc(&vm_node_stat[item]);
}

static inline void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
{
atomic_long_dec(&zone->vm_stat[item]);
- atomic_long_dec(&vm_stat[item]);
+ atomic_long_dec(&vm_zone_stat[item]);
+}
+
+static inline void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
+{
+ atomic_long_dec(&pgdat->vm_stat[item]);
+ atomic_long_dec(&vm_node_stat[item]);
}

static inline void __inc_zone_page_state(struct page *page,
@@ -250,12 +309,26 @@ static inline void __inc_zone_page_state(struct page *page,
__inc_zone_state(page_zone(page), item);
}

+static inline void __inc_node_page_state(struct page *page,
+ enum node_stat_item item)
+{
+ __inc_node_state(page_zone(page)->zone_pgdat, item);
+}
+
+
static inline void __dec_zone_page_state(struct page *page,
enum zone_stat_item item)
{
__dec_zone_state(page_zone(page), item);
}

+static inline void __dec_node_page_state(struct page *page,
+ enum node_stat_item item)
+{
+ __dec_node_state(page_zone(page)->zone_pgdat, item);
+}
+
+
/*
* We only use atomic operations to update counters. So there is no need to
* disable interrupts.
@@ -264,7 +337,12 @@ static inline void __dec_zone_page_state(struct page *page,
#define dec_zone_page_state __dec_zone_page_state
#define mod_zone_page_state __mod_zone_page_state

+#define inc_node_page_state __inc_node_page_state
+#define dec_node_page_state __dec_node_page_state
+#define mod_node_page_state __mod_node_page_state
+
#define inc_zone_state __inc_zone_state
+#define inc_node_state __inc_node_state
#define dec_zone_state __dec_zone_state

#define set_pgdat_percpu_threshold(pgdat, callback) { }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 40e29429e7b0..b58c27b13061 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3134,8 +3134,8 @@ void si_meminfo_node(struct sysinfo *val, int nid)
for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++)
managed_pages += pgdat->node_zones[zone_type].managed_pages;
val->totalram = managed_pages;
- val->sharedram = node_page_state(nid, NR_SHMEM);
- val->freeram = node_page_state(nid, NR_FREE_PAGES);
+ val->sharedram = sum_zone_node_page_state(nid, NR_SHMEM);
+ val->freeram = sum_zone_node_page_state(nid, NR_FREE_PAGES);
#ifdef CONFIG_HIGHMEM
val->totalhigh = pgdat->node_zones[ZONE_HIGHMEM].managed_pages;
val->freehigh = zone_page_state(&pgdat->node_zones[ZONE_HIGHMEM],
@@ -4335,6 +4335,11 @@ static void __meminit setup_zone_pageset(struct zone *zone)
zone->pageset = alloc_percpu(struct per_cpu_pageset);
for_each_possible_cpu(cpu)
zone_pageset_init(zone, cpu);
+
+ if (!zone->zone_pgdat->per_cpu_nodestats) {
+ zone->zone_pgdat->per_cpu_nodestats =
+ alloc_percpu(struct per_cpu_nodestat);
+ }
}

/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4f5cd974e11a..effafdb80975 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -86,8 +86,10 @@ void vm_events_fold_cpu(int cpu)
*
* vm_stat contains the global counters
*/
-atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS] __cacheline_aligned_in_smp;
-EXPORT_SYMBOL(vm_stat);
+atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS] __cacheline_aligned_in_smp;
+atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS] __cacheline_aligned_in_smp;
+EXPORT_SYMBOL(vm_zone_stat);
+EXPORT_SYMBOL(vm_node_stat);

#ifdef CONFIG_SMP

@@ -176,9 +178,13 @@ void refresh_zone_stat_thresholds(void)

threshold = calculate_normal_threshold(zone);

- for_each_online_cpu(cpu)
+ for_each_online_cpu(cpu) {
+ struct pglist_data *pgdat = zone->zone_pgdat;
per_cpu_ptr(zone->pageset, cpu)->stat_threshold
= threshold;
+ per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->stat_threshold
+ = threshold;
+ }

/*
* Only set percpu_drift_mark if there is a danger that
@@ -238,6 +244,26 @@ void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
}
EXPORT_SYMBOL(__mod_zone_page_state);

+void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
+ int delta)
+{
+ struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
+ s8 __percpu *p = pcp->vm_node_stat_diff + item;
+ long x;
+ long t;
+
+ x = delta + __this_cpu_read(*p);
+
+ t = __this_cpu_read(pcp->stat_threshold);
+
+ if (unlikely(x > t || x < -t)) {
+ node_page_state_add(x, pgdat, item);
+ x = 0;
+ }
+ __this_cpu_write(*p, x);
+}
+EXPORT_SYMBOL(__mod_node_page_state);
+
/*
* Optimized increment and decrement functions.
*
@@ -277,12 +303,34 @@ void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
}
}

+void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
+{
+ struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
+ s8 __percpu *p = pcp->vm_node_stat_diff + item;
+ s8 v, t;
+
+ v = __this_cpu_inc_return(*p);
+ t = __this_cpu_read(pcp->stat_threshold);
+ if (unlikely(v > t)) {
+ s8 overstep = t >> 1;
+
+ node_page_state_add(v + overstep, pgdat, item);
+ __this_cpu_write(*p, -overstep);
+ }
+}
+
void __inc_zone_page_state(struct page *page, enum zone_stat_item item)
{
__inc_zone_state(page_zone(page), item);
}
EXPORT_SYMBOL(__inc_zone_page_state);

+void __inc_node_page_state(struct page *page, enum node_stat_item item)
+{
+ __inc_node_state(page_zone(page)->zone_pgdat, item);
+}
+EXPORT_SYMBOL(__inc_node_page_state);
+
void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
{
struct per_cpu_pageset __percpu *pcp = zone->pageset;
@@ -299,12 +347,34 @@ void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
}
}

+void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
+{
+ struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
+ s8 __percpu *p = pcp->vm_node_stat_diff + item;
+ s8 v, t;
+
+ v = __this_cpu_dec_return(*p);
+ t = __this_cpu_read(pcp->stat_threshold);
+ if (unlikely(v < - t)) {
+ s8 overstep = t >> 1;
+
+ node_page_state_add(v - overstep, pgdat, item);
+ __this_cpu_write(*p, overstep);
+ }
+}
+
void __dec_zone_page_state(struct page *page, enum zone_stat_item item)
{
__dec_zone_state(page_zone(page), item);
}
EXPORT_SYMBOL(__dec_zone_page_state);

+void __dec_node_page_state(struct page *page, enum node_stat_item item)
+{
+ __dec_node_state(page_zone(page)->zone_pgdat, item);
+}
+EXPORT_SYMBOL(__dec_node_page_state);
+
#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
/*
* If we have cmpxchg_local support then we do not need to incur the overhead
@@ -318,7 +388,7 @@ EXPORT_SYMBOL(__dec_zone_page_state);
* 1 Overstepping half of threshold
* -1 Overstepping minus half of threshold
*/
-static inline void mod_state(struct zone *zone,
+static inline void mod_zone_state(struct zone *zone,
enum zone_stat_item item, int delta, int overstep_mode)
{
struct per_cpu_pageset __percpu *pcp = zone->pageset;
@@ -359,26 +429,88 @@ static inline void mod_state(struct zone *zone,
void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
int delta)
{
- mod_state(zone, item, delta, 0);
+ mod_zone_state(zone, item, delta, 0);
}
EXPORT_SYMBOL(mod_zone_page_state);

void inc_zone_state(struct zone *zone, enum zone_stat_item item)
{
- mod_state(zone, item, 1, 1);
+ mod_zone_state(zone, item, 1, 1);
}

void inc_zone_page_state(struct page *page, enum zone_stat_item item)
{
- mod_state(page_zone(page), item, 1, 1);
+ mod_zone_state(page_zone(page), item, 1, 1);
}
EXPORT_SYMBOL(inc_zone_page_state);

void dec_zone_page_state(struct page *page, enum zone_stat_item item)
{
- mod_state(page_zone(page), item, -1, -1);
+ mod_zone_state(page_zone(page), item, -1, -1);
}
EXPORT_SYMBOL(dec_zone_page_state);
+
+static inline void mod_node_state(struct pglist_data *pgdat,
+ enum node_stat_item item, int delta, int overstep_mode)
+{
+ struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
+ s8 __percpu *p = pcp->vm_node_stat_diff + item;
+ long o, n, t, z;
+
+ do {
+ z = 0; /* overflow to node counters */
+
+ /*
+ * The fetching of the stat_threshold is racy. We may apply
+ * a counter threshold to the wrong the cpu if we get
+ * rescheduled while executing here. However, the next
+ * counter update will apply the threshold again and
+ * therefore bring the counter under the threshold again.
+ *
+ * Most of the time the thresholds are the same anyways
+ * for all cpus in a zone.
+ */
+ t = this_cpu_read(pcp->stat_threshold);
+
+ o = this_cpu_read(*p);
+ n = delta + o;
+
+ if (n > t || n < -t) {
+ int os = overstep_mode * (t >> 1) ;
+
+ /* Overflow must be added to node counters */
+ z = n + os;
+ n = -os;
+ }
+ } while (this_cpu_cmpxchg(*p, o, n) != o);
+
+ if (z)
+ node_page_state_add(z, pgdat, item);
+}
+
+void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
+ int delta)
+{
+ mod_node_state(pgdat, item, delta, 0);
+}
+EXPORT_SYMBOL(mod_node_page_state);
+
+void inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
+{
+ mod_node_state(pgdat, item, 1, 1);
+}
+
+void inc_node_page_state(struct page *page, enum node_stat_item item)
+{
+ mod_node_state(page_zone(page)->zone_pgdat, item, 1, 1);
+}
+EXPORT_SYMBOL(inc_node_page_state);
+
+void dec_node_page_state(struct page *page, enum node_stat_item item)
+{
+ mod_node_state(page_zone(page)->zone_pgdat, item, -1, -1);
+}
+EXPORT_SYMBOL(dec_node_page_state);
#else
/*
* Use interrupt disable to serialize counter updates
@@ -394,15 +526,6 @@ void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
}
EXPORT_SYMBOL(mod_zone_page_state);

-void inc_zone_state(struct zone *zone, enum zone_stat_item item)
-{
- unsigned long flags;
-
- local_irq_save(flags);
- __inc_zone_state(zone, item);
- local_irq_restore(flags);
-}
-
void inc_zone_page_state(struct page *page, enum zone_stat_item item)
{
unsigned long flags;
@@ -424,8 +547,50 @@ void dec_zone_page_state(struct page *page, enum zone_stat_item item)
local_irq_restore(flags);
}
EXPORT_SYMBOL(dec_zone_page_state);
-#endif

+void inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ __inc_node_state(pgdat, item);
+ local_irq_restore(flags);
+}
+EXPORT_SYMBOL(inc_node_state);
+
+void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
+ int delta)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ __mod_node_page_state(pgdat, item, delta);
+ local_irq_restore(flags);
+}
+EXPORT_SYMBOL(mod_node_page_state);
+
+void inc_node_page_state(struct page *page, enum node_stat_item item)
+{
+ unsigned long flags;
+ struct pglist_data *pgdat;
+
+ pgdat = page_zone(page)->zone_pgdat;
+ local_irq_save(flags);
+ __inc_node_state(pgdat, item);
+ local_irq_restore(flags);
+}
+EXPORT_SYMBOL(inc_node_page_state);
+
+void dec_node_page_state(struct page *page, enum node_stat_item item)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ __dec_node_page_state(page, item);
+ local_irq_restore(flags);
+}
+EXPORT_SYMBOL(dec_node_page_state);
+#endif

/*
* Fold a differential into the global counters.
@@ -438,7 +603,7 @@ static int fold_diff(int *diff)

for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
if (diff[i]) {
- atomic_long_add(diff[i], &vm_stat[i]);
+ atomic_long_add(diff[i], &vm_zone_stat[i]);
changes++;
}
return changes;
@@ -462,6 +627,7 @@ static int fold_diff(int *diff)
*/
static int refresh_cpu_vm_stats(void)
{
+ struct pglist_data *pgdat;
struct zone *zone;
int i;
int global_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
@@ -514,6 +680,21 @@ static int refresh_cpu_vm_stats(void)
}
#endif
}
+
+ for_each_online_pgdat(pgdat) {
+ struct per_cpu_nodestat __percpu *p = pgdat->per_cpu_nodestats;
+
+ for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
+ int v;
+
+ v = this_cpu_xchg(p->vm_node_stat_diff[i], 0);
+ if (v) {
+ atomic_long_add(v, &pgdat->vm_stat[i]);
+ changes += v;
+ }
+ }
+ }
+
changes += fold_diff(global_diff);
return changes;
}
@@ -525,6 +706,7 @@ static int refresh_cpu_vm_stats(void)
*/
void cpu_vm_stats_fold(int cpu)
{
+ struct pglist_data *pgdat;
struct zone *zone;
int i;
int global_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
@@ -545,6 +727,19 @@ void cpu_vm_stats_fold(int cpu)
}
}

+ for_each_online_pgdat(pgdat) {
+ struct per_cpu_nodestat __percpu *p = pgdat->per_cpu_nodestats;
+
+ for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
+ if (p->vm_node_stat_diff[i]) {
+ int v;
+
+ v = p->vm_node_stat_diff[i];
+ p->vm_node_stat_diff[i] = 0;
+ atomic_long_add(v, &pgdat->vm_stat[i]);
+ }
+ }
+
fold_diff(global_diff);
}

@@ -561,7 +756,7 @@ void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset)
int v = pset->vm_stat_diff[i];
pset->vm_stat_diff[i] = 0;
atomic_long_add(v, &zone->vm_stat[i]);
- atomic_long_add(v, &vm_stat[i]);
+ atomic_long_add(v, &vm_zone_stat[i]);
}
}
#endif
@@ -1287,6 +1482,7 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
if (*pos >= ARRAY_SIZE(vmstat_text))
return NULL;
stat_items_size = NR_VM_ZONE_STAT_ITEMS * sizeof(unsigned long) +
+ NR_VM_NODE_STAT_ITEMS * sizeof(unsigned long) +
NR_VM_WRITEBACK_STAT_ITEMS * sizeof(unsigned long);

#ifdef CONFIG_VM_EVENT_COUNTERS
@@ -1301,6 +1497,10 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
v[i] = global_page_state(i);
v += NR_VM_ZONE_STAT_ITEMS;

+ for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
+ v[i] = global_node_page_state(i);
+ v += NR_VM_NODE_STAT_ITEMS;
+
global_dirty_limits(v + NR_DIRTY_BG_THRESHOLD,
v + NR_DIRTY_THRESHOLD);
v += NR_VM_WRITEBACK_STAT_ITEMS;
--
2.3.5

2015-06-08 13:56:56

by Mel Gorman

Subject: [PATCH 02/25] mm, vmscan: Move lru_lock to the node

Node-based reclaim means that the we need node-based LRUs and node-based
locking. This is a preparation patch that just moves the lru_lock to the
node as it makes later patches easier to review. It's a mechanical change
but note this patch makes contention actively worse because now the LRU
lock is hotter and direct reclaim and kswapd can contend on the same lock
even when reclaiming from different zones.
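
The conversion pattern throughout the patch is mechanical; a minimal
before/after sketch of the locking (mirroring the hunks below) is:

	/* Before: the lock is embedded in the zone */
	spin_lock_irq(&zone->lru_lock);
	lruvec = mem_cgroup_page_lruvec(page, zone);
	/* ... manipulate the LRU lists ... */
	spin_unlock_irq(&zone->lru_lock);

	/* After: the lock lives in the node, reached through a helper */
	spin_lock_irq(zone_lru_lock(zone));	/* &zone->zone_pgdat->lru_lock */
	lruvec = mem_cgroup_page_lruvec(page, zone);
	/* ... manipulate the LRU lists ... */
	spin_unlock_irq(zone_lru_lock(zone));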

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 10 ++++++++--
mm/compaction.c | 6 +++---
mm/filemap.c | 4 ++--
mm/huge_memory.c | 4 ++--
mm/memcontrol.c | 4 ++--
mm/mlock.c | 10 +++++-----
mm/page_alloc.c | 4 ++--
mm/rmap.c | 2 +-
mm/swap.c | 30 +++++++++++++++---------------
mm/vmscan.c | 42 +++++++++++++++++++++---------------------
10 files changed, 61 insertions(+), 55 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c52a3e3f178c..4c824d6996eb 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -97,7 +97,7 @@ struct free_area {
struct pglist_data;

/*
- * zone->lock and zone->lru_lock are two of the hottest locks in the kernel.
+ * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
* So add a wild amount of padding here to ensure that they fall into separate
* cachelines. There are very few zone structures in the machine, so space
* consumption is not a concern here.
@@ -497,7 +497,6 @@ struct zone {
/* Write-intensive fields used by page reclaim */

/* Fields commonly accessed by the page reclaim scanner */
- spinlock_t lru_lock;
struct lruvec lruvec;

/* Evictions & activations on the inactive file list */
@@ -771,6 +770,9 @@ typedef struct pglist_data {
/* Number of pages migrated during the rate limiting time interval */
unsigned long numabalancing_migrate_nr_pages;
#endif
+ /* Write-intensive fields used from the page allocator */
+ ZONE_PADDING(_pad1_)
+ spinlock_t lru_lock;

struct per_cpu_nodestat __percpu *per_cpu_nodestats;
atomic_long_t vm_stat[NR_VM_NODE_STAT_ITEMS];
@@ -787,6 +789,10 @@ typedef struct pglist_data {

#define node_start_pfn(nid) (NODE_DATA(nid)->node_start_pfn)
#define node_end_pfn(nid) pgdat_end_pfn(NODE_DATA(nid))
+static inline spinlock_t *zone_lru_lock(struct zone *zone)
+{
+ return &zone->zone_pgdat->lru_lock;
+}

static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat)
{
diff --git a/mm/compaction.c b/mm/compaction.c
index 8c0d9459b54a..e0b547953e36 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -702,7 +702,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
* if contended.
*/
if (!(low_pfn % SWAP_CLUSTER_MAX)
- && compact_unlock_should_abort(&zone->lru_lock, flags,
+ && compact_unlock_should_abort(zone_lru_lock(zone), flags,
&locked, cc))
break;

@@ -780,7 +780,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,

/* If we already hold the lock, we can skip some rechecking */
if (!locked) {
- locked = compact_trylock_irqsave(&zone->lru_lock,
+ locked = compact_trylock_irqsave(zone_lru_lock(zone),
&flags, cc);
if (!locked)
break;
@@ -825,7 +825,7 @@ isolate_success:
low_pfn = end_pfn;

if (locked)
- spin_unlock_irqrestore(&zone->lru_lock, flags);
+ spin_unlock_irqrestore(zone_lru_lock(zone), flags);

/*
* Update the pageblock-skip information and cached scanner pfn,
diff --git a/mm/filemap.c b/mm/filemap.c
index ad7242043bdb..12a47ccd8565 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -95,8 +95,8 @@
* ->swap_lock (try_to_unmap_one)
* ->private_lock (try_to_unmap_one)
* ->tree_lock (try_to_unmap_one)
- * ->zone.lru_lock (follow_page->mark_page_accessed)
- * ->zone.lru_lock (check_pte_range->isolate_lru_page)
+ * ->zone_lru_lock (follow_page->mark_page_accessed)
+ * ->zone_lru_lock (check_pte_range->isolate_lru_page)
* ->private_lock (page_remove_rmap->set_page_dirty)
* ->tree_lock (page_remove_rmap->set_page_dirty)
* bdi.wb->list_lock (page_remove_rmap->set_page_dirty)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6817b0350c71..cdc87f90c4eb 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1634,7 +1634,7 @@ static void __split_huge_page_refcount(struct page *page,
int tail_count = 0;

/* prevent PageLRU to go away from under us, and freeze lru stats */
- spin_lock_irq(&zone->lru_lock);
+ spin_lock_irq(zone_lru_lock(zone));
lruvec = mem_cgroup_page_lruvec(page, zone);

compound_lock(page);
@@ -1723,7 +1723,7 @@ static void __split_huge_page_refcount(struct page *page,

ClearPageCompound(page);
compound_unlock(page);
- spin_unlock_irq(&zone->lru_lock);
+ spin_unlock_irq(zone_lru_lock(zone));

for (i = 1; i < HPAGE_PMD_NR; i++) {
struct page *page_tail = page + i;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b34ef4a32a3b..1e7932a5f921 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2392,7 +2392,7 @@ static void lock_page_lru(struct page *page, int *isolated)
{
struct zone *zone = page_zone(page);

- spin_lock_irq(&zone->lru_lock);
+ spin_lock_irq(zone_lru_lock(zone));
if (PageLRU(page)) {
struct lruvec *lruvec;

@@ -2416,7 +2416,7 @@ static void unlock_page_lru(struct page *page, int isolated)
SetPageLRU(page);
add_page_to_lru_list(page, lruvec, page_lru(page));
}
- spin_unlock_irq(&zone->lru_lock);
+ spin_unlock_irq(zone_lru_lock(zone));
}

static void commit_charge(struct page *page, struct mem_cgroup *memcg,
diff --git a/mm/mlock.c b/mm/mlock.c
index 8a54cd214925..4b5b4a0a2191 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -183,7 +183,7 @@ unsigned int munlock_vma_page(struct page *page)
* might otherwise copy PageMlocked to part of the tail pages before
* we clear it in the head page. It also stabilizes hpage_nr_pages().
*/
- spin_lock_irq(&zone->lru_lock);
+ spin_lock_irq(zone_lru_lock(zone));

nr_pages = hpage_nr_pages(page);
if (!TestClearPageMlocked(page))
@@ -192,14 +192,14 @@ unsigned int munlock_vma_page(struct page *page)
__mod_zone_page_state(zone, NR_MLOCK, -nr_pages);

if (__munlock_isolate_lru_page(page, true)) {
- spin_unlock_irq(&zone->lru_lock);
+ spin_unlock_irq(zone_lru_lock(zone));
__munlock_isolated_page(page);
goto out;
}
__munlock_isolation_failed(page);

unlock_out:
- spin_unlock_irq(&zone->lru_lock);
+ spin_unlock_irq(zone_lru_lock(zone));

out:
return nr_pages - 1;
@@ -340,7 +340,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
pagevec_init(&pvec_putback, 0);

/* Phase 1: page isolation */
- spin_lock_irq(&zone->lru_lock);
+ spin_lock_irq(zone_lru_lock(zone));
for (i = 0; i < nr; i++) {
struct page *page = pvec->pages[i];

@@ -366,7 +366,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
}
delta_munlocked = -nr + pagevec_count(&pvec_putback);
__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
- spin_unlock_irq(&zone->lru_lock);
+ spin_unlock_irq(zone_lru_lock(zone));

/* Now we can release pins of pages that we are not munlocking */
pagevec_release(&pvec_putback);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b58c27b13061..3dde181c3b8b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4933,10 +4933,10 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
zone->min_slab_pages = (freesize * sysctl_min_slab_ratio) / 100;
#endif
zone->name = zone_names[j];
+ zone->zone_pgdat = pgdat;
spin_lock_init(&zone->lock);
- spin_lock_init(&zone->lru_lock);
+ spin_lock_init(zone_lru_lock(zone));
zone_seqlock_init(zone);
- zone->zone_pgdat = pgdat;
zone_pcp_init(zone);

/* For bootup, initialized properly in watermark setup */
diff --git a/mm/rmap.c b/mm/rmap.c
index c161a14b6a8f..75f1e06f3339 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -26,7 +26,7 @@
* mapping->i_mmap_rwsem
* anon_vma->rwsem
* mm->page_table_lock or pte_lock
- * zone->lru_lock (in mark_page_accessed, isolate_lru_page)
+ * zone_lru_lock (in mark_page_accessed, isolate_lru_page)
* swap_lock (in swap_duplicate, swap_info_get)
* mmlist_lock (in mmput, drain_mmlist and others)
* mapping->private_lock (in __set_page_dirty_buffers)
diff --git a/mm/swap.c b/mm/swap.c
index cd3a5e64cea9..e31761ec280c 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -55,12 +55,12 @@ static void __page_cache_release(struct page *page)
struct lruvec *lruvec;
unsigned long flags;

- spin_lock_irqsave(&zone->lru_lock, flags);
+ spin_lock_irqsave(zone_lru_lock(zone), flags);
lruvec = mem_cgroup_page_lruvec(page, zone);
VM_BUG_ON_PAGE(!PageLRU(page), page);
__ClearPageLRU(page);
del_page_from_lru_list(page, lruvec, page_off_lru(page));
- spin_unlock_irqrestore(&zone->lru_lock, flags);
+ spin_unlock_irqrestore(zone_lru_lock(zone), flags);
}
mem_cgroup_uncharge(page);
}
@@ -422,16 +422,16 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,

if (pagezone != zone) {
if (zone)
- spin_unlock_irqrestore(&zone->lru_lock, flags);
+ spin_unlock_irqrestore(zone_lru_lock(zone), flags);
zone = pagezone;
- spin_lock_irqsave(&zone->lru_lock, flags);
+ spin_lock_irqsave(zone_lru_lock(zone), flags);
}

lruvec = mem_cgroup_page_lruvec(page, zone);
(*move_fn)(page, lruvec, arg);
}
if (zone)
- spin_unlock_irqrestore(&zone->lru_lock, flags);
+ spin_unlock_irqrestore(zone_lru_lock(zone), flags);
release_pages(pvec->pages, pvec->nr, pvec->cold);
pagevec_reinit(pvec);
}
@@ -551,9 +551,9 @@ void activate_page(struct page *page)
{
struct zone *zone = page_zone(page);

- spin_lock_irq(&zone->lru_lock);
+ spin_lock_irq(zone_lru_lock(zone));
__activate_page(page, mem_cgroup_page_lruvec(page, zone), NULL);
- spin_unlock_irq(&zone->lru_lock);
+ spin_unlock_irq(zone_lru_lock(zone));
}
#endif

@@ -679,13 +679,13 @@ void add_page_to_unevictable_list(struct page *page)
struct zone *zone = page_zone(page);
struct lruvec *lruvec;

- spin_lock_irq(&zone->lru_lock);
+ spin_lock_irq(zone_lru_lock(zone));
lruvec = mem_cgroup_page_lruvec(page, zone);
ClearPageActive(page);
SetPageUnevictable(page);
SetPageLRU(page);
add_page_to_lru_list(page, lruvec, LRU_UNEVICTABLE);
- spin_unlock_irq(&zone->lru_lock);
+ spin_unlock_irq(zone_lru_lock(zone));
}

/**
@@ -910,7 +910,7 @@ void release_pages(struct page **pages, int nr, bool cold)

if (unlikely(PageCompound(page))) {
if (zone) {
- spin_unlock_irqrestore(&zone->lru_lock, flags);
+ spin_unlock_irqrestore(zone_lru_lock(zone), flags);
zone = NULL;
}
put_compound_page(page);
@@ -923,7 +923,7 @@ void release_pages(struct page **pages, int nr, bool cold)
* same zone. The lock is held only if zone != NULL.
*/
if (zone && ++lock_batch == SWAP_CLUSTER_MAX) {
- spin_unlock_irqrestore(&zone->lru_lock, flags);
+ spin_unlock_irqrestore(zone_lru_lock(zone), flags);
zone = NULL;
}

@@ -935,11 +935,11 @@ void release_pages(struct page **pages, int nr, bool cold)

if (pagezone != zone) {
if (zone)
- spin_unlock_irqrestore(&zone->lru_lock,
+ spin_unlock_irqrestore(zone_lru_lock(zone),
flags);
lock_batch = 0;
zone = pagezone;
- spin_lock_irqsave(&zone->lru_lock, flags);
+ spin_lock_irqsave(zone_lru_lock(zone), flags);
}

lruvec = mem_cgroup_page_lruvec(page, zone);
@@ -954,7 +954,7 @@ void release_pages(struct page **pages, int nr, bool cold)
list_add(&page->lru, &pages_to_free);
}
if (zone)
- spin_unlock_irqrestore(&zone->lru_lock, flags);
+ spin_unlock_irqrestore(zone_lru_lock(zone), flags);

mem_cgroup_uncharge_list(&pages_to_free);
free_hot_cold_page_list(&pages_to_free, cold);
@@ -990,7 +990,7 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
VM_BUG_ON_PAGE(PageCompound(page_tail), page);
VM_BUG_ON_PAGE(PageLRU(page_tail), page);
VM_BUG_ON(NR_CPUS != 1 &&
- !spin_is_locked(&lruvec_zone(lruvec)->lru_lock));
+ !spin_is_locked(zone_lru_lock(lruvec_zone(lruvec))));

if (!list)
SetPageLRU(page_tail);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e8eadd71bac..8ebdc4e5e720 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1387,7 +1387,7 @@ int isolate_lru_page(struct page *page)
struct zone *zone = page_zone(page);
struct lruvec *lruvec;

- spin_lock_irq(&zone->lru_lock);
+ spin_lock_irq(zone_lru_lock(zone));
lruvec = mem_cgroup_page_lruvec(page, zone);
if (PageLRU(page)) {
int lru = page_lru(page);
@@ -1396,7 +1396,7 @@ int isolate_lru_page(struct page *page)
del_page_from_lru_list(page, lruvec, lru);
ret = 0;
}
- spin_unlock_irq(&zone->lru_lock);
+ spin_unlock_irq(zone_lru_lock(zone));
}
return ret;
}
@@ -1455,9 +1455,9 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
VM_BUG_ON_PAGE(PageLRU(page), page);
list_del(&page->lru);
if (unlikely(!page_evictable(page))) {
- spin_unlock_irq(&zone->lru_lock);
+ spin_unlock_irq(zone_lru_lock(zone));
putback_lru_page(page);
- spin_lock_irq(&zone->lru_lock);
+ spin_lock_irq(zone_lru_lock(zone));
continue;
}

@@ -1478,10 +1478,10 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
del_page_from_lru_list(page, lruvec, lru);

if (unlikely(PageCompound(page))) {
- spin_unlock_irq(&zone->lru_lock);
+ spin_unlock_irq(zone_lru_lock(zone));
mem_cgroup_uncharge(page);
(*get_compound_page_dtor(page))(page);
- spin_lock_irq(&zone->lru_lock);
+ spin_lock_irq(zone_lru_lock(zone));
} else
list_add(&page->lru, &pages_to_free);
}
@@ -1543,7 +1543,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
if (!sc->may_writepage)
isolate_mode |= ISOLATE_CLEAN;

- spin_lock_irq(&zone->lru_lock);
+ spin_lock_irq(zone_lru_lock(zone));

nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
&nr_scanned, sc, isolate_mode, lru);
@@ -1558,7 +1558,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
else
__count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scanned);
}
- spin_unlock_irq(&zone->lru_lock);
+ spin_unlock_irq(zone_lru_lock(zone));

if (nr_taken == 0)
return 0;
@@ -1568,7 +1568,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
&nr_writeback, &nr_immediate,
false);

- spin_lock_irq(&zone->lru_lock);
+ spin_lock_irq(zone_lru_lock(zone));

reclaim_stat->recent_scanned[file] += nr_taken;

@@ -1585,7 +1585,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,

__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);

- spin_unlock_irq(&zone->lru_lock);
+ spin_unlock_irq(zone_lru_lock(zone));

mem_cgroup_uncharge_list(&page_list);
free_hot_cold_page_list(&page_list, true);
@@ -1701,10 +1701,10 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
del_page_from_lru_list(page, lruvec, lru);

if (unlikely(PageCompound(page))) {
- spin_unlock_irq(&zone->lru_lock);
+ spin_unlock_irq(zone_lru_lock(zone));
mem_cgroup_uncharge(page);
(*get_compound_page_dtor(page))(page);
- spin_lock_irq(&zone->lru_lock);
+ spin_lock_irq(zone_lru_lock(zone));
} else
list_add(&page->lru, pages_to_free);
}
@@ -1739,7 +1739,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
if (!sc->may_writepage)
isolate_mode |= ISOLATE_CLEAN;

- spin_lock_irq(&zone->lru_lock);
+ spin_lock_irq(zone_lru_lock(zone));

nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
&nr_scanned, sc, isolate_mode, lru);
@@ -1751,7 +1751,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
__count_zone_vm_events(PGREFILL, zone, nr_scanned);
__mod_zone_page_state(zone, NR_LRU_BASE + lru, -nr_taken);
__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
- spin_unlock_irq(&zone->lru_lock);
+ spin_unlock_irq(zone_lru_lock(zone));

while (!list_empty(&l_hold)) {
cond_resched();
@@ -1796,7 +1796,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
/*
* Move pages back to the lru list.
*/
- spin_lock_irq(&zone->lru_lock);
+ spin_lock_irq(zone_lru_lock(zone));
/*
* Count referenced pages from currently used mappings as rotated,
* even though only some of them are actually re-activated. This
@@ -1808,7 +1808,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
- spin_unlock_irq(&zone->lru_lock);
+ spin_unlock_irq(zone_lru_lock(zone));

mem_cgroup_uncharge_list(&l_hold);
free_hot_cold_page_list(&l_hold, true);
@@ -2039,7 +2039,7 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
file = get_lru_size(lruvec, LRU_ACTIVE_FILE) +
get_lru_size(lruvec, LRU_INACTIVE_FILE);

- spin_lock_irq(&zone->lru_lock);
+ spin_lock_irq(zone_lru_lock(zone));
if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
reclaim_stat->recent_scanned[0] /= 2;
reclaim_stat->recent_rotated[0] /= 2;
@@ -2060,7 +2060,7 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,

fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
fp /= reclaim_stat->recent_rotated[1] + 1;
- spin_unlock_irq(&zone->lru_lock);
+ spin_unlock_irq(zone_lru_lock(zone));

fraction[0] = ap;
fraction[1] = fp;
@@ -3799,9 +3799,9 @@ void check_move_unevictable_pages(struct page **pages, int nr_pages)
pagezone = page_zone(page);
if (pagezone != zone) {
if (zone)
- spin_unlock_irq(&zone->lru_lock);
+ spin_unlock_irq(zone_lru_lock(zone));
zone = pagezone;
- spin_lock_irq(&zone->lru_lock);
+ spin_lock_irq(zone_lru_lock(zone));
}
lruvec = mem_cgroup_page_lruvec(page, zone);

@@ -3822,7 +3822,7 @@ void check_move_unevictable_pages(struct page **pages, int nr_pages)
if (zone) {
__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
- spin_unlock_irq(&zone->lru_lock);
+ spin_unlock_irq(zone_lru_lock(zone));
}
}
#endif /* CONFIG_SHMEM */
--
2.3.5

2015-06-08 14:02:45

by Mel Gorman

Subject: [PATCH 03/25] mm, vmscan: Move LRU lists to node

This moves the LRU lists, and all related data such as counters, tracing,
congestion tracking and writeback tracking, from the zone to the node.
This is mostly a mechanical patch, but note that it introduces a number
of anomalies. For example, the scans are per-zone but use per-node
counters, and we mark a node as congested when any of its zones is
congested. This causes odd problems that are fixed later, but splitting
the change this way is easier to review.
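
A minimal sketch of the resulting lookup and accounting pattern
(illustrative only, based on the helpers and hunks below):

	/* Before: the lruvec and the LRU counters hang off the zone */
	lruvec = &zone->lruvec;
	__mod_zone_page_state(zone, NR_LRU_BASE + lru, nr_pages);

	/* After: they hang off the node; zones reach them through helpers */
	lruvec = zone_lruvec(zone);	/* &zone->zone_pgdat->lruvec */
	__mod_node_page_state(page_zone(page)->zone_pgdat,
			      NR_LRU_BASE + lru, nr_pages);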

Signed-off-by: Mel Gorman <[email protected]>
---
arch/tile/mm/pgtable.c | 8 +-
drivers/base/node.c | 19 +--
drivers/staging/android/lowmemorykiller.c | 8 +-
include/linux/backing-dev.h | 2 +-
include/linux/memcontrol.h | 8 +-
include/linux/mm_inline.h | 4 +-
include/linux/mmzone.h | 70 +++++-----
include/linux/vm_event_item.h | 10 +-
include/trace/events/vmscan.h | 10 +-
kernel/power/snapshot.c | 10 +-
mm/backing-dev.c | 14 +-
mm/compaction.c | 19 +--
mm/huge_memory.c | 6 +-
mm/internal.h | 2 +-
mm/memcontrol.c | 18 +--
mm/memory-failure.c | 4 +-
mm/memory_hotplug.c | 2 +-
mm/mempolicy.c | 2 +-
mm/migrate.c | 21 +--
mm/mlock.c | 2 +-
mm/page-writeback.c | 8 +-
mm/page_alloc.c | 103 +++++++++------
mm/swap.c | 56 ++++----
mm/vmscan.c | 208 ++++++++++++++++--------------
mm/vmstat.c | 45 +++----
mm/workingset.c | 8 +-
26 files changed, 354 insertions(+), 313 deletions(-)

diff --git a/arch/tile/mm/pgtable.c b/arch/tile/mm/pgtable.c
index 7bf2491a9c1f..3ed0a666d44a 100644
--- a/arch/tile/mm/pgtable.c
+++ b/arch/tile/mm/pgtable.c
@@ -45,10 +45,10 @@ void show_mem(unsigned int filter)
struct zone *zone;

pr_err("Active:%lu inactive:%lu dirty:%lu writeback:%lu unstable:%lu free:%lu\n slab:%lu mapped:%lu pagetables:%lu bounce:%lu pagecache:%lu swap:%lu\n",
- (global_page_state(NR_ACTIVE_ANON) +
- global_page_state(NR_ACTIVE_FILE)),
- (global_page_state(NR_INACTIVE_ANON) +
- global_page_state(NR_INACTIVE_FILE)),
+ (global_node_page_state(NR_ACTIVE_ANON) +
+ global_node_page_state(NR_ACTIVE_FILE)),
+ (global_node_page_state(NR_INACTIVE_ANON) +
+ global_node_page_state(NR_INACTIVE_FILE)),
global_page_state(NR_FILE_DIRTY),
global_page_state(NR_WRITEBACK),
global_page_state(NR_UNSTABLE_NFS),
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 0b6392789b66..b06ae7bfea63 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -56,6 +56,7 @@ static ssize_t node_read_meminfo(struct device *dev,
{
int n;
int nid = dev->id;
+ struct pglist_data *pgdat = NODE_DATA(nid);
struct sysinfo i;

si_meminfo_node(&i, nid);
@@ -74,15 +75,15 @@ static ssize_t node_read_meminfo(struct device *dev,
nid, K(i.totalram),
nid, K(i.freeram),
nid, K(i.totalram - i.freeram),
- nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_ANON) +
- sum_zone_node_page_state(nid, NR_ACTIVE_FILE)),
- nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_ANON) +
- sum_zone_node_page_state(nid, NR_INACTIVE_FILE)),
- nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_ANON)),
- nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_ANON)),
- nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_FILE)),
- nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_FILE)),
- nid, K(sum_zone_node_page_state(nid, NR_UNEVICTABLE)),
+ nid, K(node_page_state(pgdat, NR_ACTIVE_ANON) +
+ node_page_state(pgdat, NR_ACTIVE_FILE)),
+ nid, K(node_page_state(pgdat, NR_INACTIVE_ANON) +
+ node_page_state(pgdat, NR_INACTIVE_FILE)),
+ nid, K(node_page_state(pgdat, NR_ACTIVE_ANON)),
+ nid, K(node_page_state(pgdat, NR_INACTIVE_ANON)),
+ nid, K(node_page_state(pgdat, NR_ACTIVE_FILE)),
+ nid, K(node_page_state(pgdat, NR_INACTIVE_FILE)),
+ nid, K(node_page_state(pgdat, NR_UNEVICTABLE)),
nid, K(sum_zone_node_page_state(nid, NR_MLOCK)));

#ifdef CONFIG_HIGHMEM
diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
index feafa172b155..6463d9278229 100644
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -69,10 +69,10 @@ static unsigned long lowmem_deathpending_timeout;
static unsigned long lowmem_count(struct shrinker *s,
struct shrink_control *sc)
{
- return global_page_state(NR_ACTIVE_ANON) +
- global_page_state(NR_ACTIVE_FILE) +
- global_page_state(NR_INACTIVE_ANON) +
- global_page_state(NR_INACTIVE_FILE);
+ return global_node_page_state(NR_ACTIVE_ANON) +
+ global_node_page_state(NR_ACTIVE_FILE) +
+ global_node_page_state(NR_INACTIVE_ANON) +
+ global_node_page_state(NR_INACTIVE_FILE);
}

static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index aff923ae8c4b..6ca09adfd55e 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -277,7 +277,7 @@ enum {
void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
void set_bdi_congested(struct backing_dev_info *bdi, int sync);
long congestion_wait(int sync, long timeout);
-long wait_iff_congested(struct zone *zone, int sync, long timeout);
+long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout);
int pdflush_proc_obsolete(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp, loff_t *ppos);

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 72dff5fb0d0c..df225059daf3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -85,7 +85,7 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
bool lrucare);

struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
-struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
+struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);

bool mem_cgroup_is_descendant(struct mem_cgroup *memcg,
struct mem_cgroup *root);
@@ -243,13 +243,13 @@ static inline void mem_cgroup_migrate(struct page *oldpage,
static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
struct mem_cgroup *memcg)
{
- return &zone->lruvec;
+ return zone_lruvec(zone);
}

static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
- struct zone *zone)
+ struct pglist_data *pgdat)
{
- return &zone->lruvec;
+ return &pgdat->lruvec;
}

static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index cf55945c83fb..275b10b2ace4 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -28,7 +28,7 @@ static __always_inline void add_page_to_lru_list(struct page *page,
int nr_pages = hpage_nr_pages(page);
mem_cgroup_update_lru_size(lruvec, lru, nr_pages);
list_add(&page->lru, &lruvec->lists[lru]);
- __mod_zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru, nr_pages);
+ __mod_node_page_state(page_zone(page)->zone_pgdat, NR_LRU_BASE + lru, nr_pages);
}

static __always_inline void del_page_from_lru_list(struct page *page,
@@ -37,7 +37,7 @@ static __always_inline void del_page_from_lru_list(struct page *page,
int nr_pages = hpage_nr_pages(page);
mem_cgroup_update_lru_size(lruvec, lru, -nr_pages);
list_del(&page->lru);
- __mod_zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru, -nr_pages);
+ __mod_node_page_state(page_zone(page)->zone_pgdat, NR_LRU_BASE + lru, -nr_pages);
}

/**
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4c824d6996eb..fab74af19f26 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -115,12 +115,6 @@ enum zone_stat_item {
/* First 128 byte cacheline (assuming 64 bit words) */
NR_FREE_PAGES,
NR_ALLOC_BATCH,
- NR_LRU_BASE,
- NR_INACTIVE_ANON = NR_LRU_BASE, /* must match order of LRU_[IN]ACTIVE */
- NR_ACTIVE_ANON, /* " " " " " */
- NR_INACTIVE_FILE, /* " " " " " */
- NR_ACTIVE_FILE, /* " " " " " */
- NR_UNEVICTABLE, /* " " " " " */
NR_MLOCK, /* mlock()ed pages found and moved off LRU */
NR_ANON_PAGES, /* Mapped anonymous pages */
NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
@@ -138,12 +132,9 @@ enum zone_stat_item {
NR_VMSCAN_WRITE,
NR_VMSCAN_IMMEDIATE, /* Prioritise for reclaim when writeback ends */
NR_WRITEBACK_TEMP, /* Writeback using temporary buffers */
- NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */
- NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */
NR_SHMEM, /* shmem pages (included tmpfs/GEM pages) */
NR_DIRTIED, /* page dirtyings since bootup */
NR_WRITTEN, /* page writings since bootup */
- NR_PAGES_SCANNED, /* pages scanned since last reclaim */
#ifdef CONFIG_NUMA
NUMA_HIT, /* allocated in intended node */
NUMA_MISS, /* allocated in non intended node */
@@ -160,6 +151,15 @@ enum zone_stat_item {
NR_VM_ZONE_STAT_ITEMS };

enum node_stat_item {
+ NR_LRU_BASE,
+ NR_INACTIVE_ANON = NR_LRU_BASE, /* must match order of LRU_[IN]ACTIVE */
+ NR_ACTIVE_ANON, /* " " " " " */
+ NR_INACTIVE_FILE, /* " " " " " */
+ NR_ACTIVE_FILE, /* " " " " " */
+ NR_UNEVICTABLE, /* " " " " " */
+ NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */
+ NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */
+ NR_PAGES_SCANNED, /* pages scanned since last reclaim */
NR_VM_NODE_STAT_ITEMS
};

@@ -221,7 +221,7 @@ struct lruvec {
struct list_head lists[NR_LRU_LISTS];
struct zone_reclaim_stat reclaim_stat;
#ifdef CONFIG_MEMCG
- struct zone *zone;
+ struct pglist_data *pgdat;
#endif
};

@@ -352,13 +352,6 @@ struct zone {
#ifdef CONFIG_NUMA
int node;
#endif
-
- /*
- * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
- * this zone's LRU. Maintained by the pageout code.
- */
- unsigned int inactive_ratio;
-
struct pglist_data *zone_pgdat;
struct per_cpu_pageset __percpu *pageset;

@@ -496,12 +489,6 @@ struct zone {

/* Write-intensive fields used by page reclaim */

- /* Fields commonly accessed by the page reclaim scanner */
- struct lruvec lruvec;
-
- /* Evictions & activations on the inactive file list */
- atomic_long_t inactive_age;
-
/*
* When free pages are below this point, additional steps are taken
* when reading the number of free pages to avoid per-cpu counter
@@ -540,17 +527,20 @@ struct zone {
enum zone_flags {
ZONE_RECLAIM_LOCKED, /* prevents concurrent reclaim */
ZONE_OOM_LOCKED, /* zone is in OOM killer zonelist */
- ZONE_CONGESTED, /* zone has many dirty pages backed by
+ ZONE_FAIR_DEPLETED, /* fair zone policy batch depleted */
+};
+
+enum pgdat_flags {
+ PGDAT_CONGESTED, /* zone has many dirty pages backed by
* a congested BDI
*/
- ZONE_DIRTY, /* reclaim scanning has recently found
+ PGDAT_DIRTY, /* reclaim scanning has recently found
* many dirty file pages at the tail
* of the LRU.
*/
- ZONE_WRITEBACK, /* reclaim scanning has recently found
+ PGDAT_WRITEBACK, /* reclaim scanning has recently found
* many pages under writeback
*/
- ZONE_FAIR_DEPLETED, /* fair zone policy batch depleted */
};

static inline unsigned long zone_end_pfn(const struct zone *zone)
@@ -774,6 +764,21 @@ typedef struct pglist_data {
ZONE_PADDING(_pad1_)
spinlock_t lru_lock;

+ /* Fields commonly accessed by the page reclaim scanner */
+ struct lruvec lruvec;
+
+ /* Evictions & activations on the inactive file list */
+ atomic_long_t inactive_age;
+
+ /*
+ * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
+ * this zone's LRU. Maintained by the pageout code.
+ */
+ unsigned int inactive_ratio;
+
+ unsigned long flags;
+
+ ZONE_PADDING(_pad2_)
struct per_cpu_nodestat __percpu *per_cpu_nodestats;
atomic_long_t vm_stat[NR_VM_NODE_STAT_ITEMS];
} pg_data_t;
@@ -794,6 +799,11 @@ static inline spinlock_t *zone_lru_lock(struct zone *zone)
return &zone->zone_pgdat->lru_lock;
}

+static inline struct lruvec *zone_lruvec(struct zone *zone)
+{
+ return &zone->zone_pgdat->lruvec;
+}
+
static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat)
{
return pgdat->node_start_pfn + pgdat->node_spanned_pages;
@@ -823,12 +833,12 @@ extern int init_currently_empty_zone(struct zone *zone, unsigned long start_pfn,

extern void lruvec_init(struct lruvec *lruvec);

-static inline struct zone *lruvec_zone(struct lruvec *lruvec)
+static inline struct pglist_data *lruvec_pgdat(struct lruvec *lruvec)
{
#ifdef CONFIG_MEMCG
- return lruvec->zone;
+ return lruvec->pgdat;
#else
- return container_of(lruvec, struct zone, lruvec);
+ return container_of(lruvec, struct pglist_data, lruvec);
#endif
}

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 9246d32dc973..4ce4d59d361e 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -25,11 +25,11 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
FOR_ALL_ZONES(PGALLOC),
PGFREE, PGACTIVATE, PGDEACTIVATE,
PGFAULT, PGMAJFAULT,
- FOR_ALL_ZONES(PGREFILL),
- FOR_ALL_ZONES(PGSTEAL_KSWAPD),
- FOR_ALL_ZONES(PGSTEAL_DIRECT),
- FOR_ALL_ZONES(PGSCAN_KSWAPD),
- FOR_ALL_ZONES(PGSCAN_DIRECT),
+ PGREFILL,
+ PGSTEAL_KSWAPD,
+ PGSTEAL_DIRECT,
+ PGSCAN_KSWAPD,
+ PGSCAN_DIRECT,
PGSCAN_DIRECT_THROTTLE,
#ifdef CONFIG_NUMA
PGSCAN_ZONE_RECLAIM_FAILED,
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 69590b6ffc09..bbeaa12ae0c3 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -353,15 +353,14 @@ TRACE_EVENT(mm_vmscan_writepage,

TRACE_EVENT(mm_vmscan_lru_shrink_inactive,

- TP_PROTO(int nid, int zid,
+ TP_PROTO(int nid,
unsigned long nr_scanned, unsigned long nr_reclaimed,
int priority, int reclaim_flags),

- TP_ARGS(nid, zid, nr_scanned, nr_reclaimed, priority, reclaim_flags),
+ TP_ARGS(nid, nr_scanned, nr_reclaimed, priority, reclaim_flags),

TP_STRUCT__entry(
__field(int, nid)
- __field(int, zid)
__field(unsigned long, nr_scanned)
__field(unsigned long, nr_reclaimed)
__field(int, priority)
@@ -370,15 +369,14 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,

TP_fast_assign(
__entry->nid = nid;
- __entry->zid = zid;
__entry->nr_scanned = nr_scanned;
__entry->nr_reclaimed = nr_reclaimed;
__entry->priority = priority;
__entry->reclaim_flags = reclaim_flags;
),

- TP_printk("nid=%d zid=%d nr_scanned=%ld nr_reclaimed=%ld priority=%d flags=%s",
- __entry->nid, __entry->zid,
+ TP_printk("nid=%d nr_scanned=%ld nr_reclaimed=%ld priority=%d flags=%s",
+ __entry->nid,
__entry->nr_scanned, __entry->nr_reclaimed,
__entry->priority,
show_reclaim_flags(__entry->reclaim_flags))
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 5235dd4e1e2f..1012eaf6e4c1 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -1525,11 +1525,11 @@ static unsigned long minimum_image_size(unsigned long saveable)
unsigned long size;

size = global_page_state(NR_SLAB_RECLAIMABLE)
- + global_page_state(NR_ACTIVE_ANON)
- + global_page_state(NR_INACTIVE_ANON)
- + global_page_state(NR_ACTIVE_FILE)
- + global_page_state(NR_INACTIVE_FILE)
- - global_page_state(NR_FILE_MAPPED);
+ + global_node_page_state(NR_ACTIVE_ANON)
+ + global_node_page_state(NR_INACTIVE_ANON)
+ + global_node_page_state(NR_ACTIVE_FILE)
+ + global_node_page_state(NR_INACTIVE_FILE)
+ - global_node_page_state(NR_FILE_MAPPED);

return saveable <= size ? 0 : saveable - size;
}
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 6dc4580df2af..513e15d428e1 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -542,24 +542,24 @@ long congestion_wait(int sync, long timeout)
EXPORT_SYMBOL(congestion_wait);

/**
- * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a zone to complete writes
- * @zone: A zone to check if it is heavily congested
+ * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a pgdat to complete writes
+ * @pgdat: A pgdat to check if it is heavily congested
* @sync: SYNC or ASYNC IO
* @timeout: timeout in jiffies
*
* In the event of a congested backing_dev (any backing_dev) and the given
- * @zone has experienced recent congestion, this waits for up to @timeout
+ * @pgdat has experienced recent congestion, this waits for up to @timeout
* jiffies for either a BDI to exit congestion of the given @sync queue
* or a write to complete.
*
- * In the absence of zone congestion, cond_resched() is called to yield
+ * In the absence of pgdat congestion, cond_resched() is called to yield
* the processor if necessary but otherwise does not sleep.
*
* The return value is 0 if the sleep is for the full timeout. Otherwise,
* it is the number of jiffies that were still remaining when the function
* returned. return_value == timeout implies the function did not sleep.
*/
-long wait_iff_congested(struct zone *zone, int sync, long timeout)
+long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout)
{
long ret;
unsigned long start = jiffies;
@@ -568,11 +568,11 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)

/*
* If there is no congestion, or heavy congestion is not being
- * encountered in the current zone, yield if necessary instead
+ * encountered in the current pgdat, yield if necessary instead
* of sleeping on the congestion queue
*/
if (atomic_read(&nr_bdi_congested[sync]) == 0 ||
- !test_bit(ZONE_CONGESTED, &zone->flags)) {
+ !test_bit(PGDAT_CONGESTED, &pgdat->flags)) {
cond_resched();

/* In case we scheduled, work out time remaining */
diff --git a/mm/compaction.c b/mm/compaction.c
index e0b547953e36..d73c509ea801 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -625,21 +625,22 @@ static void acct_isolated(struct zone *zone, struct compact_control *cc)
list_for_each_entry(page, &cc->migratepages, lru)
count[!!page_is_file_cache(page)]++;

- mod_zone_page_state(zone, NR_ISOLATED_ANON, count[0]);
- mod_zone_page_state(zone, NR_ISOLATED_FILE, count[1]);
+ mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_ANON, count[0]);
+ mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, count[1]);
}

/* Similar to reclaim, but different enough that they don't share logic */
static bool too_many_isolated(struct zone *zone)
{
+ pg_data_t *pgdat = zone->zone_pgdat;
unsigned long active, inactive, isolated;

- inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
- zone_page_state(zone, NR_INACTIVE_ANON);
- active = zone_page_state(zone, NR_ACTIVE_FILE) +
- zone_page_state(zone, NR_ACTIVE_ANON);
- isolated = zone_page_state(zone, NR_ISOLATED_FILE) +
- zone_page_state(zone, NR_ISOLATED_ANON);
+ inactive = node_page_state(pgdat, NR_INACTIVE_FILE) +
+ node_page_state(pgdat, NR_INACTIVE_ANON);
+ active = node_page_state(pgdat, NR_ACTIVE_FILE) +
+ node_page_state(pgdat, NR_ACTIVE_ANON);
+ isolated = node_page_state(pgdat, NR_ISOLATED_FILE) +
+ node_page_state(pgdat, NR_ISOLATED_ANON);

return isolated > (inactive + active) / 2;
}
@@ -794,7 +795,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
}
}

- lruvec = mem_cgroup_page_lruvec(page, zone);
+ lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);

/* Try isolate the page */
if (__isolate_lru_page(page, isolate_mode) != 0)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cdc87f90c4eb..b56c14a41d96 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1635,7 +1635,7 @@ static void __split_huge_page_refcount(struct page *page,

/* prevent PageLRU to go away from under us, and freeze lru stats */
spin_lock_irq(zone_lru_lock(zone));
- lruvec = mem_cgroup_page_lruvec(page, zone);
+ lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);

compound_lock(page);
/* complete memcg works before add pages to LRU */
@@ -2100,7 +2100,7 @@ void __khugepaged_exit(struct mm_struct *mm)
static void release_pte_page(struct page *page)
{
/* 0 stands for page_is_file_cache(page) == false */
- dec_zone_page_state(page, NR_ISOLATED_ANON + 0);
+ dec_node_page_state(page, NR_ISOLATED_ANON + 0);
unlock_page(page);
putback_lru_page(page);
}
@@ -2181,7 +2181,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
goto out;
}
/* 0 stands for page_is_file_cache(page) == false */
- inc_zone_page_state(page, NR_ISOLATED_ANON + 0);
+ inc_node_page_state(page, NR_ISOLATED_ANON + 0);
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(PageLRU(page), page);

diff --git a/mm/internal.h b/mm/internal.h
index a96da5b0029d..2e4cee6a8739 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -98,7 +98,7 @@ extern unsigned long highest_memmap_pfn;
*/
extern int isolate_lru_page(struct page *page);
extern void putback_lru_page(struct page *page);
-extern bool zone_reclaimable(struct zone *zone);
+extern bool pgdat_reclaimable(struct pglist_data *pgdat);

/*
* in mm/rmap.c:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1e7932a5f921..10eed58506a0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1185,7 +1185,7 @@ struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
struct lruvec *lruvec;

if (mem_cgroup_disabled()) {
- lruvec = &zone->lruvec;
+ lruvec = zone_lruvec(zone);
goto out;
}

@@ -1197,8 +1197,8 @@ out:
* we have to be prepared to initialize lruvec->zone here;
* and if offlined then reonlined, we need to reinitialize it.
*/
- if (unlikely(lruvec->zone != zone))
- lruvec->zone = zone;
+ if (unlikely(lruvec->pgdat != zone->zone_pgdat))
+ lruvec->pgdat = zone->zone_pgdat;
return lruvec;
}

@@ -1211,14 +1211,14 @@ out:
* and putback protocol: the LRU lock must be held, and the page must
* either be PageLRU() or the caller must have isolated/allocated it.
*/
-struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
+struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgdat)
{
struct mem_cgroup_per_zone *mz;
struct mem_cgroup *memcg;
struct lruvec *lruvec;

if (mem_cgroup_disabled()) {
- lruvec = &zone->lruvec;
+ lruvec = &pgdat->lruvec;
goto out;
}

@@ -1238,8 +1238,8 @@ out:
* we have to be prepared to initialize lruvec->zone here;
* and if offlined then reonlined, we need to reinitialize it.
*/
- if (unlikely(lruvec->zone != zone))
- lruvec->zone = zone;
+ if (unlikely(lruvec->pgdat != pgdat))
+ lruvec->pgdat = pgdat;
return lruvec;
}

@@ -2396,7 +2396,7 @@ static void lock_page_lru(struct page *page, int *isolated)
if (PageLRU(page)) {
struct lruvec *lruvec;

- lruvec = mem_cgroup_page_lruvec(page, zone);
+ lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
ClearPageLRU(page);
del_page_from_lru_list(page, lruvec, page_lru(page));
*isolated = 1;
@@ -2411,7 +2411,7 @@ static void unlock_page_lru(struct page *page, int isolated)
if (isolated) {
struct lruvec *lruvec;

- lruvec = mem_cgroup_page_lruvec(page, zone);
+ lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
VM_BUG_ON_PAGE(PageLRU(page), page);
SetPageLRU(page);
add_page_to_lru_list(page, lruvec, page_lru(page));
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index d487f8dc6d39..e5415186f48f 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1620,7 +1620,7 @@ static int __soft_offline_page(struct page *page, int flags)
put_page(page);
if (!ret) {
LIST_HEAD(pagelist);
- inc_zone_page_state(page, NR_ISOLATED_ANON +
+ inc_node_page_state(page, NR_ISOLATED_ANON +
page_is_file_cache(page));
list_add(&page->lru, &pagelist);
ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
@@ -1628,7 +1628,7 @@ static int __soft_offline_page(struct page *page, int flags)
if (ret) {
if (!list_empty(&pagelist)) {
list_del(&page->lru);
- dec_zone_page_state(page, NR_ISOLATED_ANON +
+ dec_node_page_state(page, NR_ISOLATED_ANON +
page_is_file_cache(page));
putback_lru_page(page);
}
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 65842d688b7c..b59da0f78415 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1426,7 +1426,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
put_page(page);
list_add_tail(&page->lru, &source);
move_pages--;
- inc_zone_page_state(page, NR_ISOLATED_ANON +
+ inc_node_page_state(page, NR_ISOLATED_ANON +
page_is_file_cache(page));

} else {
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4721046a134a..ea211f16e3b7 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -933,7 +933,7 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1) {
if (!isolate_lru_page(page)) {
list_add_tail(&page->lru, pagelist);
- inc_zone_page_state(page, NR_ISOLATED_ANON +
+ inc_node_page_state(page, NR_ISOLATED_ANON +
page_is_file_cache(page));
}
}
diff --git a/mm/migrate.c b/mm/migrate.c
index 85e042686031..a33e4b4ed60d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -90,7 +90,7 @@ void putback_movable_pages(struct list_head *l)
continue;
}
list_del(&page->lru);
- dec_zone_page_state(page, NR_ISOLATED_ANON +
+ dec_node_page_state(page, NR_ISOLATED_ANON +
page_is_file_cache(page));
if (unlikely(isolated_balloon_page(page)))
balloon_page_putback(page);
@@ -935,7 +935,7 @@ out:
* restored.
*/
list_del(&page->lru);
- dec_zone_page_state(page, NR_ISOLATED_ANON +
+ dec_node_page_state(page, NR_ISOLATED_ANON +
page_is_file_cache(page));
putback_lru_page(page);
}
@@ -1244,7 +1244,7 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
err = isolate_lru_page(page);
if (!err) {
list_add_tail(&page->lru, &pagelist);
- inc_zone_page_state(page, NR_ISOLATED_ANON +
+ inc_node_page_state(page, NR_ISOLATED_ANON +
page_is_file_cache(page));
}
put_and_set:
@@ -1514,15 +1514,16 @@ static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
unsigned long nr_migrate_pages)
{
int z;
+
+ if (!pgdat_reclaimable(pgdat))
+ return false;
+
for (z = pgdat->nr_zones - 1; z >= 0; z--) {
struct zone *zone = pgdat->node_zones + z;

if (!populated_zone(zone))
continue;

- if (!zone_reclaimable(zone))
- continue;
-
/* Avoid waking kswapd by allocating pages_to_migrate pages. */
if (!zone_watermark_ok(zone, 0,
high_wmark_pages(zone) +
@@ -1636,7 +1637,7 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
}

page_lru = page_is_file_cache(page);
- mod_zone_page_state(page_zone(page), NR_ISOLATED_ANON + page_lru,
+ mod_node_page_state(page_zone(page)->zone_pgdat, NR_ISOLATED_ANON + page_lru,
hpage_nr_pages(page));

/*
@@ -1694,7 +1695,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
if (nr_remaining) {
if (!list_empty(&migratepages)) {
list_del(&page->lru);
- dec_zone_page_state(page, NR_ISOLATED_ANON +
+ dec_node_page_state(page, NR_ISOLATED_ANON +
page_is_file_cache(page));
putback_lru_page(page);
}
@@ -1784,7 +1785,7 @@ fail_putback:
/* Retake the callers reference and putback on LRU */
get_page(page);
putback_lru_page(page);
- mod_zone_page_state(page_zone(page),
+ mod_node_page_state(page_zone(page)->zone_pgdat,
NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR);

goto out_unlock;
@@ -1837,7 +1838,7 @@ fail_putback:
count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR);
count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR);

- mod_zone_page_state(page_zone(page),
+ mod_node_page_state(page_zone(page)->zone_pgdat,
NR_ISOLATED_ANON + page_lru,
-HPAGE_PMD_NR);
return isolated;
diff --git a/mm/mlock.c b/mm/mlock.c
index 4b5b4a0a2191..144bd5086260 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -100,7 +100,7 @@ static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
if (PageLRU(page)) {
struct lruvec *lruvec;

- lruvec = mem_cgroup_page_lruvec(page, page_zone(page));
+ lruvec = mem_cgroup_page_lruvec(page, page_zone(page)->zone_pgdat);
if (getpage)
get_page(page);
ClearPageLRU(page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 644bcb665773..9707c450c7c5 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -187,8 +187,8 @@ static unsigned long zone_dirtyable_memory(struct zone *zone)
nr_pages = zone_page_state(zone, NR_FREE_PAGES);
nr_pages -= min(nr_pages, zone->dirty_balance_reserve);

- nr_pages += zone_page_state(zone, NR_INACTIVE_FILE);
- nr_pages += zone_page_state(zone, NR_ACTIVE_FILE);
+ nr_pages += node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE);
+ nr_pages += node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE);

return nr_pages;
}
@@ -241,8 +241,8 @@ static unsigned long global_dirtyable_memory(void)
x = global_page_state(NR_FREE_PAGES);
x -= min(x, dirty_balance_reserve);

- x += global_page_state(NR_INACTIVE_FILE);
- x += global_page_state(NR_ACTIVE_FILE);
+ x += global_node_page_state(NR_INACTIVE_FILE);
+ x += global_node_page_state(NR_ACTIVE_FILE);

if (!vm_highmem_is_dirtyable)
x -= highmem_dirtyable_memory(x);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3dde181c3b8b..49a29e8ae493 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -700,9 +700,9 @@ static void free_pcppages_bulk(struct zone *zone, int count,
unsigned long nr_scanned;

spin_lock(&zone->lock);
- nr_scanned = zone_page_state(zone, NR_PAGES_SCANNED);
+ nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
if (nr_scanned)
- __mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);
+ __mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);

while (to_free) {
struct page *page;
@@ -751,9 +751,9 @@ static void free_one_page(struct zone *zone,
{
unsigned long nr_scanned;
spin_lock(&zone->lock);
- nr_scanned = zone_page_state(zone, NR_PAGES_SCANNED);
+ nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
if (nr_scanned)
- __mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);
+ __mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);

if (unlikely(has_isolate_pageblock(zone) ||
is_migrate_isolate(migratetype))) {
@@ -2527,8 +2527,8 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
ALLOC_NO_WATERMARKS, ac);

if (!page && gfp_mask & __GFP_NOFAIL)
- wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC,
- HZ/50);
+ wait_iff_congested(ac->preferred_zone->zone_pgdat,
+ BLK_RW_ASYNC, HZ/50);
} while (!page && (gfp_mask & __GFP_NOFAIL));

return page;
@@ -2772,7 +2772,7 @@ retry:
goto nopage;
}
/* Wait for some write requests to complete then retry */
- wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
+ wait_iff_congested(ac->preferred_zone->zone_pgdat, BLK_RW_ASYNC, HZ/50);
goto retry;
} else {
/*
@@ -3208,6 +3208,7 @@ void show_free_areas(unsigned int filter)
{
int cpu;
struct zone *zone;
+ pg_data_t *pgdat;

for_each_populated_zone(zone) {
if (skip_free_areas_node(filter, zone_to_nid(zone)))
@@ -3233,13 +3234,13 @@ void show_free_areas(unsigned int filter)
" free:%lu slab_reclaimable:%lu slab_unreclaimable:%lu\n"
" mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n"
" free_cma:%lu\n",
- global_page_state(NR_ACTIVE_ANON),
- global_page_state(NR_INACTIVE_ANON),
- global_page_state(NR_ISOLATED_ANON),
- global_page_state(NR_ACTIVE_FILE),
- global_page_state(NR_INACTIVE_FILE),
- global_page_state(NR_ISOLATED_FILE),
- global_page_state(NR_UNEVICTABLE),
+ global_node_page_state(NR_ACTIVE_ANON),
+ global_node_page_state(NR_INACTIVE_ANON),
+ global_node_page_state(NR_ISOLATED_ANON),
+ global_node_page_state(NR_ACTIVE_FILE),
+ global_node_page_state(NR_INACTIVE_FILE),
+ global_node_page_state(NR_ISOLATED_FILE),
+ global_node_page_state(NR_UNEVICTABLE),
global_page_state(NR_FILE_DIRTY),
global_page_state(NR_WRITEBACK),
global_page_state(NR_UNSTABLE_NFS),
@@ -3252,6 +3253,28 @@ void show_free_areas(unsigned int filter)
global_page_state(NR_BOUNCE),
global_page_state(NR_FREE_CMA_PAGES));

+ for_each_online_pgdat(pgdat) {
+ printk("Node %d"
+ " active_anon:%lukB"
+ " inactive_anon:%lukB"
+ " active_file:%lukB"
+ " inactive_file:%lukB"
+ " unevictable:%lukB"
+ " isolated(anon):%lukB"
+ " isolated(file):%lukB"
+ " all_unreclaimable? %s"
+ "\n",
+ pgdat->node_id,
+ K(node_page_state(pgdat, NR_ACTIVE_ANON)),
+ K(node_page_state(pgdat, NR_INACTIVE_ANON)),
+ K(node_page_state(pgdat, NR_ACTIVE_FILE)),
+ K(node_page_state(pgdat, NR_INACTIVE_FILE)),
+ K(node_page_state(pgdat, NR_UNEVICTABLE)),
+ K(node_page_state(pgdat, NR_ISOLATED_ANON)),
+ K(node_page_state(pgdat, NR_ISOLATED_FILE)),
+ !pgdat_reclaimable(pgdat) ? "yes" : "no");
+ }
+
for_each_populated_zone(zone) {
int i;

@@ -3263,13 +3286,6 @@ void show_free_areas(unsigned int filter)
" min:%lukB"
" low:%lukB"
" high:%lukB"
- " active_anon:%lukB"
- " inactive_anon:%lukB"
- " active_file:%lukB"
- " inactive_file:%lukB"
- " unevictable:%lukB"
- " isolated(anon):%lukB"
- " isolated(file):%lukB"
" present:%lukB"
" managed:%lukB"
" mlocked:%lukB"
@@ -3285,21 +3301,13 @@ void show_free_areas(unsigned int filter)
" bounce:%lukB"
" free_cma:%lukB"
" writeback_tmp:%lukB"
- " pages_scanned:%lu"
- " all_unreclaimable? %s"
+ " node_pages_scanned:%lu"
"\n",
zone->name,
K(zone_page_state(zone, NR_FREE_PAGES)),
K(min_wmark_pages(zone)),
K(low_wmark_pages(zone)),
K(high_wmark_pages(zone)),
- K(zone_page_state(zone, NR_ACTIVE_ANON)),
- K(zone_page_state(zone, NR_INACTIVE_ANON)),
- K(zone_page_state(zone, NR_ACTIVE_FILE)),
- K(zone_page_state(zone, NR_INACTIVE_FILE)),
- K(zone_page_state(zone, NR_UNEVICTABLE)),
- K(zone_page_state(zone, NR_ISOLATED_ANON)),
- K(zone_page_state(zone, NR_ISOLATED_FILE)),
K(zone->present_pages),
K(zone->managed_pages),
K(zone_page_state(zone, NR_MLOCK)),
@@ -3316,9 +3324,7 @@ void show_free_areas(unsigned int filter)
K(zone_page_state(zone, NR_BOUNCE)),
K(zone_page_state(zone, NR_FREE_CMA_PAGES)),
K(zone_page_state(zone, NR_WRITEBACK_TEMP)),
- K(zone_page_state(zone, NR_PAGES_SCANNED)),
- (!zone_reclaimable(zone) ? "yes" : "no")
- );
+ K(node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED)));
printk("lowmem_reserve[]:");
for (i = 0; i < MAX_NR_ZONES; i++)
printk(" %ld", zone->lowmem_reserve[i]);
@@ -4942,7 +4948,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
/* For bootup, initialized properly in watermark setup */
mod_zone_page_state(zone, NR_ALLOC_BATCH, zone->managed_pages);

- lruvec_init(&zone->lruvec);
+ lruvec_init(zone_lruvec(zone));
if (!size)
continue;

@@ -5788,26 +5794,37 @@ void setup_per_zone_wmarks(void)
* 1TB 101 10GB
* 10TB 320 32GB
*/
-static void __meminit calculate_zone_inactive_ratio(struct zone *zone)
+static void __meminit calculate_node_inactive_ratio(struct pglist_data *pgdat)
{
unsigned int gb, ratio;
+ int z;
+ unsigned long managed_pages = 0;
+
+ for (z = 0; z < MAX_NR_ZONES; z++) {
+ struct zone *zone = &pgdat->node_zones[z];

- /* Zone size in gigabytes */
- gb = zone->managed_pages >> (30 - PAGE_SHIFT);
+ if (!populated_zone(zone))
+ continue;
+
+ managed_pages += zone->managed_pages;
+ }
+
+ /* Node size in gigabytes */
+ gb = managed_pages >> (30 - PAGE_SHIFT);
if (gb)
ratio = int_sqrt(10 * gb);
else
ratio = 1;

- zone->inactive_ratio = ratio;
+ pgdat->inactive_ratio = ratio;
}

-static void __meminit setup_per_zone_inactive_ratio(void)
+static void __meminit setup_per_node_inactive_ratio(void)
{
- struct zone *zone;
+ struct pglist_data *pgdat;

- for_each_zone(zone)
- calculate_zone_inactive_ratio(zone);
+ for_each_online_pgdat(pgdat)
+ calculate_node_inactive_ratio(pgdat);
}

/*
@@ -5855,7 +5872,7 @@ int __meminit init_per_zone_wmark_min(void)
setup_per_zone_wmarks();
refresh_zone_stat_thresholds();
setup_per_zone_lowmem_reserve();
- setup_per_zone_inactive_ratio();
+ setup_per_node_inactive_ratio();
return 0;
}
module_init(init_per_zone_wmark_min)
diff --git a/mm/swap.c b/mm/swap.c
index e31761ec280c..cbee80f8d88d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -56,7 +56,7 @@ static void __page_cache_release(struct page *page)
unsigned long flags;

spin_lock_irqsave(zone_lru_lock(zone), flags);
- lruvec = mem_cgroup_page_lruvec(page, zone);
+ lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
VM_BUG_ON_PAGE(!PageLRU(page), page);
__ClearPageLRU(page);
del_page_from_lru_list(page, lruvec, page_off_lru(page));
@@ -427,7 +427,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
spin_lock_irqsave(zone_lru_lock(zone), flags);
}

- lruvec = mem_cgroup_page_lruvec(page, zone);
+ lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
(*move_fn)(page, lruvec, arg);
}
if (zone)
@@ -549,11 +549,11 @@ static bool need_activate_page_drain(int cpu)

void activate_page(struct page *page)
{
- struct zone *zone = page_zone(page);
+ struct pglist_data *pgdat = page_zone(page)->zone_pgdat;

- spin_lock_irq(zone_lru_lock(zone));
- __activate_page(page, mem_cgroup_page_lruvec(page, zone), NULL);
- spin_unlock_irq(zone_lru_lock(zone));
+ spin_lock_irq(&pgdat->lru_lock);
+ __activate_page(page, mem_cgroup_page_lruvec(page, pgdat), NULL);
+ spin_unlock_irq(&pgdat->lru_lock);
}
#endif

@@ -676,16 +676,16 @@ void lru_cache_add(struct page *page)
*/
void add_page_to_unevictable_list(struct page *page)
{
- struct zone *zone = page_zone(page);
+ struct pglist_data *pgdat = page_zone(page)->zone_pgdat;
struct lruvec *lruvec;

- spin_lock_irq(zone_lru_lock(zone));
- lruvec = mem_cgroup_page_lruvec(page, zone);
+ spin_lock_irq(&pgdat->lru_lock);
+ lruvec = mem_cgroup_page_lruvec(page, pgdat);
ClearPageActive(page);
SetPageUnevictable(page);
SetPageLRU(page);
add_page_to_lru_list(page, lruvec, LRU_UNEVICTABLE);
- spin_unlock_irq(zone_lru_lock(zone));
+ spin_unlock_irq(&pgdat->lru_lock);
}

/**
@@ -900,7 +900,7 @@ void release_pages(struct page **pages, int nr, bool cold)
{
int i;
LIST_HEAD(pages_to_free);
- struct zone *zone = NULL;
+ struct pglist_data *pgdat = NULL;
struct lruvec *lruvec;
unsigned long uninitialized_var(flags);
unsigned int uninitialized_var(lock_batch);
@@ -909,9 +909,9 @@ void release_pages(struct page **pages, int nr, bool cold)
struct page *page = pages[i];

if (unlikely(PageCompound(page))) {
- if (zone) {
- spin_unlock_irqrestore(zone_lru_lock(zone), flags);
- zone = NULL;
+ if (pgdat) {
+ spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+ pgdat = NULL;
}
put_compound_page(page);
continue;
@@ -920,29 +920,29 @@ void release_pages(struct page **pages, int nr, bool cold)
/*
* Make sure the IRQ-safe lock-holding time does not get
* excessive with a continuous string of pages from the
- * same zone. The lock is held only if zone != NULL.
+ * same pgdat. The lock is held only if pgdat != NULL.
*/
- if (zone && ++lock_batch == SWAP_CLUSTER_MAX) {
- spin_unlock_irqrestore(zone_lru_lock(zone), flags);
- zone = NULL;
+ if (pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
+ spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+ pgdat = NULL;
}

if (!put_page_testzero(page))
continue;

if (PageLRU(page)) {
- struct zone *pagezone = page_zone(page);
+ struct pglist_data *page_pgdat = page_zone(page)->zone_pgdat;

- if (pagezone != zone) {
- if (zone)
- spin_unlock_irqrestore(zone_lru_lock(zone),
+ if (page_pgdat != pgdat) {
+ if (pgdat)
+ spin_unlock_irqrestore(&pgdat->lru_lock,
flags);
lock_batch = 0;
- zone = pagezone;
- spin_lock_irqsave(zone_lru_lock(zone), flags);
+ pgdat = page_pgdat;
+ spin_lock_irqsave(&pgdat->lru_lock, flags);
}

- lruvec = mem_cgroup_page_lruvec(page, zone);
+ lruvec = mem_cgroup_page_lruvec(page, pgdat);
VM_BUG_ON_PAGE(!PageLRU(page), page);
__ClearPageLRU(page);
del_page_from_lru_list(page, lruvec, page_off_lru(page));
@@ -953,8 +953,8 @@ void release_pages(struct page **pages, int nr, bool cold)

list_add(&page->lru, &pages_to_free);
}
- if (zone)
- spin_unlock_irqrestore(zone_lru_lock(zone), flags);
+ if (pgdat)
+ spin_unlock_irqrestore(&pgdat->lru_lock, flags);

mem_cgroup_uncharge_list(&pages_to_free);
free_hot_cold_page_list(&pages_to_free, cold);
@@ -990,7 +990,7 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
VM_BUG_ON_PAGE(PageCompound(page_tail), page);
VM_BUG_ON_PAGE(PageLRU(page_tail), page);
VM_BUG_ON(NR_CPUS != 1 &&
- !spin_is_locked(zone_lru_lock(lruvec_zone(lruvec))));
+ !spin_is_locked(&lruvec_pgdat(lruvec)->lru_lock));

if (!list)
SetPageLRU(page_tail);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8ebdc4e5e720..a11d7d6d2070 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -161,32 +161,34 @@ static bool global_reclaim(struct scan_control *sc)
}
#endif

-static unsigned long zone_reclaimable_pages(struct zone *zone)
+static unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat)
{
int nr;

- nr = zone_page_state(zone, NR_ACTIVE_FILE) +
- zone_page_state(zone, NR_INACTIVE_FILE);
+ nr = node_page_state(pgdat, NR_ACTIVE_FILE) +
+ node_page_state(pgdat, NR_INACTIVE_FILE);

if (get_nr_swap_pages() > 0)
- nr += zone_page_state(zone, NR_ACTIVE_ANON) +
- zone_page_state(zone, NR_INACTIVE_ANON);
+ nr += node_page_state(pgdat, NR_ACTIVE_ANON) +
+ node_page_state(pgdat, NR_INACTIVE_ANON);

return nr;
}

-bool zone_reclaimable(struct zone *zone)
+bool pgdat_reclaimable(struct pglist_data *pgdat)
{
- return zone_page_state(zone, NR_PAGES_SCANNED) <
- zone_reclaimable_pages(zone) * 6;
+ return node_page_state(pgdat, NR_PAGES_SCANNED) <
+ pgdat_reclaimable_pages(pgdat) * 6;
}

static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru)
{
+ struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+
if (!mem_cgroup_disabled())
return mem_cgroup_get_lru_size(lruvec, lru);

- return zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru);
+ return node_page_state(pgdat, NR_LRU_BASE + lru);
}

/*
@@ -841,7 +843,7 @@ static void page_check_dirty_writeback(struct page *page,
* shrink_page_list() returns the number of reclaimed pages
*/
static unsigned long shrink_page_list(struct list_head *page_list,
- struct zone *zone,
+ struct pglist_data *pgdat,
struct scan_control *sc,
enum ttu_flags ttu_flags,
unsigned long *ret_nr_dirty,
@@ -879,7 +881,6 @@ static unsigned long shrink_page_list(struct list_head *page_list,
goto keep;

VM_BUG_ON_PAGE(PageActive(page), page);
- VM_BUG_ON_PAGE(page_zone(page) != zone, page);

sc->nr_scanned++;

@@ -962,7 +963,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
/* Case 1 above */
if (current_is_kswapd() &&
PageReclaim(page) &&
- test_bit(ZONE_WRITEBACK, &zone->flags)) {
+ test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
nr_immediate++;
goto keep_locked;

@@ -1044,7 +1045,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
*/
if (page_is_file_cache(page) &&
(!current_is_kswapd() ||
- !test_bit(ZONE_DIRTY, &zone->flags))) {
+ !test_bit(PGDAT_DIRTY, &pgdat->flags))) {
/*
* Immediately reclaim when written back.
* Similar in principal to deactivate_page()
@@ -1208,11 +1209,11 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
}
}

- ret = shrink_page_list(&clean_pages, zone, &sc,
+ ret = shrink_page_list(&clean_pages, zone->zone_pgdat, &sc,
TTU_UNMAP|TTU_IGNORE_ACCESS,
&dummy1, &dummy2, &dummy3, &dummy4, &dummy5, true);
list_splice(&clean_pages, page_list);
- mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret);
+ mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, -ret);
return ret;
}

@@ -1388,7 +1389,7 @@ int isolate_lru_page(struct page *page)
struct lruvec *lruvec;

spin_lock_irq(zone_lru_lock(zone));
- lruvec = mem_cgroup_page_lruvec(page, zone);
+ lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
if (PageLRU(page)) {
int lru = page_lru(page);
get_page(page);
@@ -1408,7 +1409,7 @@ int isolate_lru_page(struct page *page)
* the LRU list will go small and be scanned faster than necessary, leading to
* unnecessary swapping, thrashing and OOM.
*/
-static int too_many_isolated(struct zone *zone, int file,
+static int too_many_isolated(struct pglist_data *pgdat, int file,
struct scan_control *sc)
{
unsigned long inactive, isolated;
@@ -1420,11 +1421,11 @@ static int too_many_isolated(struct zone *zone, int file,
return 0;

if (file) {
- inactive = zone_page_state(zone, NR_INACTIVE_FILE);
- isolated = zone_page_state(zone, NR_ISOLATED_FILE);
+ inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
+ isolated = node_page_state(pgdat, NR_ISOLATED_FILE);
} else {
- inactive = zone_page_state(zone, NR_INACTIVE_ANON);
- isolated = zone_page_state(zone, NR_ISOLATED_ANON);
+ inactive = node_page_state(pgdat, NR_INACTIVE_ANON);
+ isolated = node_page_state(pgdat, NR_ISOLATED_ANON);
}

/*
@@ -1442,7 +1443,7 @@ static noinline_for_stack void
putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
{
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
- struct zone *zone = lruvec_zone(lruvec);
+ struct pglist_data *pgdat = lruvec_pgdat(lruvec);
LIST_HEAD(pages_to_free);

/*
@@ -1455,13 +1456,13 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
VM_BUG_ON_PAGE(PageLRU(page), page);
list_del(&page->lru);
if (unlikely(!page_evictable(page))) {
- spin_unlock_irq(zone_lru_lock(zone));
+ spin_unlock_irq(&pgdat->lru_lock);
putback_lru_page(page);
- spin_lock_irq(zone_lru_lock(zone));
+ spin_lock_irq(&pgdat->lru_lock);
continue;
}

- lruvec = mem_cgroup_page_lruvec(page, zone);
+ lruvec = mem_cgroup_page_lruvec(page, pgdat);

SetPageLRU(page);
lru = page_lru(page);
@@ -1478,10 +1479,10 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
del_page_from_lru_list(page, lruvec, lru);

if (unlikely(PageCompound(page))) {
- spin_unlock_irq(zone_lru_lock(zone));
+ spin_unlock_irq(&pgdat->lru_lock);
mem_cgroup_uncharge(page);
(*get_compound_page_dtor(page))(page);
- spin_lock_irq(zone_lru_lock(zone));
+ spin_lock_irq(&pgdat->lru_lock);
} else
list_add(&page->lru, &pages_to_free);
}
@@ -1525,10 +1526,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
unsigned long nr_immediate = 0;
isolate_mode_t isolate_mode = 0;
int file = is_file_lru(lru);
- struct zone *zone = lruvec_zone(lruvec);
+ struct pglist_data *pgdat = lruvec_pgdat(lruvec);
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;

- while (unlikely(too_many_isolated(zone, file, sc))) {
+ while (unlikely(too_many_isolated(pgdat, file, sc))) {
congestion_wait(BLK_RW_ASYNC, HZ/10);

/* We are about to die and free our memory. Return now. */
@@ -1543,49 +1544,47 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
if (!sc->may_writepage)
isolate_mode |= ISOLATE_CLEAN;

- spin_lock_irq(zone_lru_lock(zone));
+ spin_lock_irq(&pgdat->lru_lock);

nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
&nr_scanned, sc, isolate_mode, lru);

- __mod_zone_page_state(zone, NR_LRU_BASE + lru, -nr_taken);
- __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
+ __mod_node_page_state(pgdat, NR_LRU_BASE + lru, -nr_taken);
+ __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);

if (global_reclaim(sc)) {
- __mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
+ __mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
if (current_is_kswapd())
- __count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scanned);
+ __count_vm_events(PGSCAN_KSWAPD, nr_scanned);
else
- __count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scanned);
+ __count_vm_events(PGSCAN_DIRECT, nr_scanned);
}
- spin_unlock_irq(zone_lru_lock(zone));
+ spin_unlock_irq(&pgdat->lru_lock);

if (nr_taken == 0)
return 0;

- nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP,
+ nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, TTU_UNMAP,
&nr_dirty, &nr_unqueued_dirty, &nr_congested,
&nr_writeback, &nr_immediate,
false);

- spin_lock_irq(zone_lru_lock(zone));
+ spin_lock_irq(&pgdat->lru_lock);

reclaim_stat->recent_scanned[file] += nr_taken;

if (global_reclaim(sc)) {
if (current_is_kswapd())
- __count_zone_vm_events(PGSTEAL_KSWAPD, zone,
- nr_reclaimed);
+ __count_vm_events(PGSTEAL_KSWAPD, nr_reclaimed);
else
- __count_zone_vm_events(PGSTEAL_DIRECT, zone,
- nr_reclaimed);
+ __count_vm_events(PGSTEAL_DIRECT, nr_reclaimed);
}

putback_inactive_pages(lruvec, &page_list);

- __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
+ __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);

- spin_unlock_irq(zone_lru_lock(zone));
+ spin_unlock_irq(&pgdat->lru_lock);

mem_cgroup_uncharge_list(&page_list);
free_hot_cold_page_list(&page_list, true);
@@ -1605,7 +1604,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
* are encountered in the nr_immediate check below.
*/
if (nr_writeback && nr_writeback == nr_taken)
- set_bit(ZONE_WRITEBACK, &zone->flags);
+ set_bit(PGDAT_WRITEBACK, &pgdat->flags);

/*
* memcg will stall in page writeback so only consider forcibly
@@ -1617,16 +1616,16 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
* backed by a congested BDI and wait_iff_congested will stall.
*/
if (nr_dirty && nr_dirty == nr_congested)
- set_bit(ZONE_CONGESTED, &zone->flags);
+ set_bit(PGDAT_CONGESTED, &pgdat->flags);

/*
* If dirty pages are scanned that are not queued for IO, it
* implies that flushers are not keeping up. In this case, flag
- * the zone ZONE_DIRTY and kswapd will start writing pages from
+ * the pgdat PGDAT_DIRTY and kswapd will start writing pages from
* reclaim context.
*/
if (nr_unqueued_dirty == nr_taken)
- set_bit(ZONE_DIRTY, &zone->flags);
+ set_bit(PGDAT_DIRTY, &pgdat->flags);

/*
* If kswapd scans pages marked marked for immediate
@@ -1645,10 +1644,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
*/
if (!sc->hibernation_mode && !current_is_kswapd() &&
current_may_throttle())
- wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
+ wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);

- trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
- zone_idx(zone),
+ trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
nr_scanned, nr_reclaimed,
sc->priority,
trace_shrink_flags(file));
@@ -1678,14 +1676,14 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
struct list_head *pages_to_free,
enum lru_list lru)
{
- struct zone *zone = lruvec_zone(lruvec);
+ struct pglist_data *pgdat = lruvec_pgdat(lruvec);
unsigned long pgmoved = 0;
struct page *page;
int nr_pages;

while (!list_empty(list)) {
page = lru_to_page(list);
- lruvec = mem_cgroup_page_lruvec(page, zone);
+ lruvec = mem_cgroup_page_lruvec(page, pgdat);

VM_BUG_ON_PAGE(PageLRU(page), page);
SetPageLRU(page);
@@ -1701,15 +1699,15 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
del_page_from_lru_list(page, lruvec, lru);

if (unlikely(PageCompound(page))) {
- spin_unlock_irq(zone_lru_lock(zone));
+ spin_unlock_irq(&pgdat->lru_lock);
mem_cgroup_uncharge(page);
(*get_compound_page_dtor(page))(page);
- spin_lock_irq(zone_lru_lock(zone));
+ spin_lock_irq(&pgdat->lru_lock);
} else
list_add(&page->lru, pages_to_free);
}
}
- __mod_zone_page_state(zone, NR_LRU_BASE + lru, pgmoved);
+ __mod_node_page_state(pgdat, NR_LRU_BASE + lru, pgmoved);
if (!is_active_lru(lru))
__count_vm_events(PGDEACTIVATE, pgmoved);
}
@@ -1730,7 +1728,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
unsigned long nr_rotated = 0;
isolate_mode_t isolate_mode = 0;
int file = is_file_lru(lru);
- struct zone *zone = lruvec_zone(lruvec);
+ struct pglist_data *pgdat = lruvec_pgdat(lruvec);

lru_add_drain();

@@ -1739,19 +1737,19 @@ static void shrink_active_list(unsigned long nr_to_scan,
if (!sc->may_writepage)
isolate_mode |= ISOLATE_CLEAN;

- spin_lock_irq(zone_lru_lock(zone));
+ spin_lock_irq(&pgdat->lru_lock);

nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
&nr_scanned, sc, isolate_mode, lru);
if (global_reclaim(sc))
- __mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
+ __mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);

reclaim_stat->recent_scanned[file] += nr_taken;

- __count_zone_vm_events(PGREFILL, zone, nr_scanned);
- __mod_zone_page_state(zone, NR_LRU_BASE + lru, -nr_taken);
- __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
- spin_unlock_irq(zone_lru_lock(zone));
+ __count_vm_events(PGREFILL, nr_scanned);
+ __mod_node_page_state(pgdat, NR_LRU_BASE + lru, -nr_taken);
+ __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
+ spin_unlock_irq(&pgdat->lru_lock);

while (!list_empty(&l_hold)) {
cond_resched();
@@ -1796,7 +1794,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
/*
* Move pages back to the lru list.
*/
- spin_lock_irq(zone_lru_lock(zone));
+ spin_lock_irq(&pgdat->lru_lock);
/*
* Count referenced pages from currently used mappings as rotated,
* even though only some of them are actually re-activated. This
@@ -1807,22 +1805,22 @@ static void shrink_active_list(unsigned long nr_to_scan,

move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
- __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
- spin_unlock_irq(zone_lru_lock(zone));
+ __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
+ spin_unlock_irq(&pgdat->lru_lock);

mem_cgroup_uncharge_list(&l_hold);
free_hot_cold_page_list(&l_hold, true);
}

#ifdef CONFIG_SWAP
-static int inactive_anon_is_low_global(struct zone *zone)
+static int inactive_anon_is_low_global(struct pglist_data *pgdat)
{
unsigned long active, inactive;

- active = zone_page_state(zone, NR_ACTIVE_ANON);
- inactive = zone_page_state(zone, NR_INACTIVE_ANON);
+ active = node_page_state(pgdat, NR_ACTIVE_ANON);
+ inactive = node_page_state(pgdat, NR_INACTIVE_ANON);

- if (inactive * zone->inactive_ratio < active)
+ if (inactive * pgdat->inactive_ratio < active)
return 1;

return 0;
@@ -1847,7 +1845,7 @@ static int inactive_anon_is_low(struct lruvec *lruvec)
if (!mem_cgroup_disabled())
return mem_cgroup_inactive_anon_is_low(lruvec);

- return inactive_anon_is_low_global(lruvec_zone(lruvec));
+ return inactive_anon_is_low_global(lruvec_pgdat(lruvec));
}
#else
static inline int inactive_anon_is_low(struct lruvec *lruvec)
@@ -1924,7 +1922,7 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
u64 fraction[2];
u64 denominator = 0; /* gcc */
- struct zone *zone = lruvec_zone(lruvec);
+ struct pglist_data *pgdat = lruvec_pgdat(lruvec);
unsigned long anon_prio, file_prio;
enum scan_balance scan_balance;
unsigned long anon, file;
@@ -1945,7 +1943,7 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
* well.
*/
if (current_is_kswapd()) {
- if (!zone_reclaimable(zone))
+ if (!pgdat_reclaimable(pgdat))
force_scan = true;
if (!mem_cgroup_lruvec_online(lruvec))
force_scan = true;
@@ -1991,14 +1989,24 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
* anon pages. Try to detect this based on file LRU size.
*/
if (global_reclaim(sc)) {
- unsigned long zonefile;
- unsigned long zonefree;
+ unsigned long pgdatfile;
+ unsigned long pgdatfree;
+ int z;
+ unsigned long total_high_wmark = 0;
+
+ pgdatfree = node_page_state(pgdat, NR_FREE_PAGES);
+ pgdatfile = node_page_state(pgdat, NR_ACTIVE_FILE) +
+ node_page_state(pgdat, NR_INACTIVE_FILE);

- zonefree = zone_page_state(zone, NR_FREE_PAGES);
- zonefile = zone_page_state(zone, NR_ACTIVE_FILE) +
- zone_page_state(zone, NR_INACTIVE_FILE);
+ for (z = 0; z < MAX_NR_ZONES; z++) {
+ struct zone *zone = &pgdat->node_zones[z];
+ if (!populated_zone(zone))
+ continue;
+
+ total_high_wmark += high_wmark_pages(zone);
+ }

- if (unlikely(zonefile + zonefree <= high_wmark_pages(zone))) {
+ if (unlikely(pgdatfile + pgdatfree <= total_high_wmark)) {
scan_balance = SCAN_ANON;
goto out;
}
@@ -2039,7 +2047,7 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
file = get_lru_size(lruvec, LRU_ACTIVE_FILE) +
get_lru_size(lruvec, LRU_INACTIVE_FILE);

- spin_lock_irq(zone_lru_lock(zone));
+ spin_lock_irq(&pgdat->lru_lock);
if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
reclaim_stat->recent_scanned[0] /= 2;
reclaim_stat->recent_rotated[0] /= 2;
@@ -2060,7 +2068,7 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,

fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
fp /= reclaim_stat->recent_rotated[1] + 1;
- spin_unlock_irq(zone_lru_lock(zone));
+ spin_unlock_irq(&pgdat->lru_lock);

fraction[0] = ap;
fraction[1] = fp;
@@ -2294,9 +2302,9 @@ static inline bool should_continue_reclaim(struct zone *zone,
* inactive lists are large enough, continue reclaiming
*/
pages_for_compaction = (2UL << sc->order);
- inactive_lru_pages = zone_page_state(zone, NR_INACTIVE_FILE);
+ inactive_lru_pages = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE);
if (get_nr_swap_pages() > 0)
- inactive_lru_pages += zone_page_state(zone, NR_INACTIVE_ANON);
+ inactive_lru_pages += node_page_state(zone->zone_pgdat, NR_INACTIVE_ANON);
if (sc->nr_reclaimed < pages_for_compaction &&
inactive_lru_pages > pages_for_compaction)
return true;
@@ -2495,7 +2503,7 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
continue;

if (sc->priority != DEF_PRIORITY &&
- !zone_reclaimable(zone))
+ !pgdat_reclaimable(zone->zone_pgdat))
continue; /* Let kswapd poll it */

/*
@@ -2536,7 +2544,7 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
reclaimable = true;

if (global_reclaim(sc) &&
- !reclaimable && zone_reclaimable(zone))
+ !reclaimable && pgdat_reclaimable(zone->zone_pgdat))
reclaimable = true;
}

@@ -2951,7 +2959,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
* DEF_PRIORITY. Effectively, it considers them balanced so
* they must be considered balanced here as well!
*/
- if (!zone_reclaimable(zone)) {
+ if (!pgdat_reclaimable(zone->zone_pgdat)) {
balanced_pages += zone->managed_pages;
continue;
}
@@ -3016,6 +3024,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
int testorder = sc->order;
unsigned long balance_gap;
bool lowmem_pressure;
+ struct pglist_data *pgdat = zone->zone_pgdat;

/* Reclaim above the high watermark. */
sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
@@ -3054,7 +3063,8 @@ static bool kswapd_shrink_zone(struct zone *zone,
/* Account for the number of pages attempted to reclaim */
*nr_attempted += sc->nr_to_reclaim;

- clear_bit(ZONE_WRITEBACK, &zone->flags);
+ /* TODO: ANOMALY */
+ clear_bit(PGDAT_WRITEBACK, &pgdat->flags);

/*
* If a zone reaches its high watermark, consider it to be no longer
@@ -3062,10 +3072,10 @@ static bool kswapd_shrink_zone(struct zone *zone,
* BDIs but as pressure is relieved, speculatively avoid congestion
* waits.
*/
- if (zone_reclaimable(zone) &&
+ if (pgdat_reclaimable(zone->zone_pgdat) &&
zone_balanced(zone, testorder, 0, classzone_idx)) {
- clear_bit(ZONE_CONGESTED, &zone->flags);
- clear_bit(ZONE_DIRTY, &zone->flags);
+ clear_bit(PGDAT_CONGESTED, &pgdat->flags);
+ clear_bit(PGDAT_DIRTY, &pgdat->flags);
}

return sc->nr_scanned >= sc->nr_to_reclaim;
@@ -3127,7 +3137,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
continue;

if (sc.priority != DEF_PRIORITY &&
- !zone_reclaimable(zone))
+ !pgdat_reclaimable(zone->zone_pgdat))
continue;

/*
@@ -3154,9 +3164,11 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
/*
* If balanced, clear the dirty and congested
* flags
+ *
+ * TODO: ANOMALY
*/
- clear_bit(ZONE_CONGESTED, &zone->flags);
- clear_bit(ZONE_DIRTY, &zone->flags);
+ clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
+ clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
}
}

@@ -3204,7 +3216,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
continue;

if (sc.priority != DEF_PRIORITY &&
- !zone_reclaimable(zone))
+ !pgdat_reclaimable(zone->zone_pgdat))
continue;

sc.nr_scanned = 0;
@@ -3620,8 +3632,8 @@ int sysctl_min_slab_ratio = 5;
static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
{
unsigned long file_mapped = zone_page_state(zone, NR_FILE_MAPPED);
- unsigned long file_lru = zone_page_state(zone, NR_INACTIVE_FILE) +
- zone_page_state(zone, NR_ACTIVE_FILE);
+ unsigned long file_lru = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE) +
+ node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE);

/*
* It's possible for there to be more file mapped pages than
@@ -3724,7 +3736,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
return ZONE_RECLAIM_FULL;

- if (!zone_reclaimable(zone))
+ if (!pgdat_reclaimable(zone->zone_pgdat))
return ZONE_RECLAIM_FULL;

/*
@@ -3803,7 +3815,7 @@ void check_move_unevictable_pages(struct page **pages, int nr_pages)
zone = pagezone;
spin_lock_irq(zone_lru_lock(zone));
}
- lruvec = mem_cgroup_page_lruvec(page, zone);
+ lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);

if (!PageLRU(page) || !PageUnevictable(page))
continue;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index effafdb80975..36897da22792 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -894,11 +894,6 @@ const char * const vmstat_text[] = {
/* enum zone_stat_item countes */
"nr_free_pages",
"nr_alloc_batch",
- "nr_inactive_anon",
- "nr_active_anon",
- "nr_inactive_file",
- "nr_active_file",
- "nr_unevictable",
"nr_mlock",
"nr_anon_pages",
"nr_mapped",
@@ -914,12 +909,9 @@ const char * const vmstat_text[] = {
"nr_vmscan_write",
"nr_vmscan_immediate_reclaim",
"nr_writeback_temp",
- "nr_isolated_anon",
- "nr_isolated_file",
"nr_shmem",
"nr_dirtied",
"nr_written",
- "nr_pages_scanned",

#ifdef CONFIG_NUMA
"numa_hit",
@@ -935,6 +927,16 @@ const char * const vmstat_text[] = {
"nr_anon_transparent_hugepages",
"nr_free_cma",

+ /* Node-based counters */
+ "nr_inactive_anon",
+ "nr_active_anon",
+ "nr_inactive_file",
+ "nr_active_file",
+ "nr_unevictable",
+ "nr_isolated_anon",
+ "nr_isolated_file",
+ "nr_pages_scanned",
+
/* enum writeback_stat_item counters */
"nr_dirty_threshold",
"nr_dirty_background_threshold",
@@ -955,11 +957,11 @@ const char * const vmstat_text[] = {
"pgfault",
"pgmajfault",

- TEXTS_FOR_ZONES("pgrefill")
- TEXTS_FOR_ZONES("pgsteal_kswapd")
- TEXTS_FOR_ZONES("pgsteal_direct")
- TEXTS_FOR_ZONES("pgscan_kswapd")
- TEXTS_FOR_ZONES("pgscan_direct")
+ "pgrefill",
+ "pgsteal_kswapd",
+ "pgsteal_direct",
+ "pgscan_kswapd",
+ "pgscan_direct",
"pgscan_direct_throttle",

#ifdef CONFIG_NUMA
@@ -1385,7 +1387,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
"\n min %lu"
"\n low %lu"
"\n high %lu"
- "\n scanned %lu"
+ "\n node_scanned %lu"
"\n spanned %lu"
"\n present %lu"
"\n managed %lu",
@@ -1393,13 +1395,13 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
min_wmark_pages(zone),
low_wmark_pages(zone),
high_wmark_pages(zone),
- zone_page_state(zone, NR_PAGES_SCANNED),
+ node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED),
zone->spanned_pages,
zone->present_pages,
zone->managed_pages);

for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
- seq_printf(m, "\n %-12s %lu", vmstat_text[i],
+ seq_printf(m, "\n %-12s %lu", vmstat_text[i],
zone_page_state(zone, i));

seq_printf(m,
@@ -1429,12 +1431,12 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
#endif
}
seq_printf(m,
- "\n all_unreclaimable: %u"
- "\n start_pfn: %lu"
- "\n inactive_ratio: %u",
- !zone_reclaimable(zone),
+ "\n node_unreclaimable: %u"
+ "\n start_pfn: %lu"
+ "\n node_inactive_ratio: %u",
+ !pgdat_reclaimable(zone->zone_pgdat),
zone->zone_start_pfn,
- zone->inactive_ratio);
+ zone->zone_pgdat->inactive_ratio);
seq_putc(m, '\n');
}

@@ -1525,7 +1527,6 @@ static int vmstat_show(struct seq_file *m, void *arg)
{
unsigned long *l = arg;
unsigned long off = l - (unsigned long *)m->private;
-
seq_printf(m, "%s %lu\n", vmstat_text[off], *l);
return 0;
}
diff --git a/mm/workingset.c b/mm/workingset.c
index aa017133744b..ca080cc11797 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -180,7 +180,7 @@ static void unpack_shadow(void *shadow,

*zone = NODE_DATA(nid)->node_zones + zid;

- refault = atomic_long_read(&(*zone)->inactive_age);
+ refault = atomic_long_read(&(*zone)->zone_pgdat->inactive_age);
mask = ~0UL >> (NODES_SHIFT + ZONES_SHIFT +
RADIX_TREE_EXCEPTIONAL_SHIFT);
/*
@@ -215,7 +215,7 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
struct zone *zone = page_zone(page);
unsigned long eviction;

- eviction = atomic_long_inc_return(&zone->inactive_age);
+ eviction = atomic_long_inc_return(&zone->zone_pgdat->inactive_age);
return pack_shadow(eviction, zone);
}

@@ -236,7 +236,7 @@ bool workingset_refault(void *shadow)
unpack_shadow(shadow, &zone, &refault_distance);
inc_zone_state(zone, WORKINGSET_REFAULT);

- if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE)) {
+ if (refault_distance <= node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE)) {
inc_zone_state(zone, WORKINGSET_ACTIVATE);
return true;
}
@@ -249,7 +249,7 @@ bool workingset_refault(void *shadow)
*/
void workingset_activation(struct page *page)
{
- atomic_long_inc(&page_zone(page)->inactive_age);
+ atomic_long_inc(&page_zone(page)->zone_pgdat->inactive_age);
}

/*
--
2.3.5

2015-06-08 14:02:39

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 04/25] mm, vmscan: Begin reclaiming pages on a per-node basis

This patch makes reclaim decisions on a per-node basis. A reclaimer knows
the highest zone usable by the allocation request and skips pages that
belong to zones above it. In many cases this is fine because the request
is a GFP_HIGHMEM allocation of some description, so few pages need to be
skipped. On 64-bit, ZONE_DMA32 requests will cause some problems, but
32-bit devices on 64-bit platforms are becoming rarer. Historically this
would have been a major problem on 32-bit with large Highmem:Lowmem
ratios, but that too is becoming very rare. If it does become a problem,
it will manifest as very low reclaim efficiency.
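
As an illustration only, here is a minimal sketch of the skip-and-splice
idea, not part of the patch itself: pages whose zone index is above the
requested limit are set aside on a local list and spliced back afterwards
so they are not lost from the LRU. The helper name isolate_eligible() is
made up for this sketch; page_zonenum() and the list primitives are the
usual mainline helpers, and the __isolate_lru_page() checks performed by
the real isolate_lru_pages() are omitted for brevity.

/*
 * Illustrative sketch only -- not part of the patch. Isolate up to
 * nr_to_scan pages from the tail of @src that are usable by a request
 * limited to @reclaim_idx; set the rest aside and splice them back so
 * they are not lost from the LRU. page_zonenum() and the list helpers
 * are from <linux/mm.h> and <linux/list.h>.
 */
static unsigned long isolate_eligible(struct list_head *src,
				      struct list_head *dst,
				      enum zone_type reclaim_idx,
				      unsigned long nr_to_scan)
{
	LIST_HEAD(pages_skipped);
	unsigned long nr_taken = 0, scan;

	for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
		struct page *page = list_last_entry(src, struct page, lru);

		if (page_zonenum(page) > reclaim_idx) {
			/* Only a higher zone can use this page; set it aside */
			list_move(&page->lru, &pages_skipped);
			continue;
		}

		list_move(&page->lru, dst);
		nr_taken++;
	}

	/* Return skipped pages to the LRU rather than leaking them */
	if (!list_empty(&pages_skipped))
		list_splice(&pages_skipped, src);

	return nr_taken;
}

Why the skipped pages are spliced back where they are is covered by the
comment added in the hunk below.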

Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 75 ++++++++++++++++++++++++++++++++++++-------------------------
1 file changed, 45 insertions(+), 30 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a11d7d6d2070..acdded211bd8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -83,6 +83,9 @@ struct scan_control {
/* Scan (total_size >> priority) pages at once */
int priority;

+ /* The highest zone to isolate pages for reclaim from */
+ enum zone_type reclaim_idx;
+
unsigned int may_writepage:1;

/* Can mapped pages be reclaimed? */
@@ -1319,6 +1322,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
struct list_head *src = &lruvec->lists[lru];
unsigned long nr_taken = 0;
unsigned long scan;
+ LIST_HEAD(pages_skipped);

for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
struct page *page;
@@ -1329,6 +1333,9 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,

VM_BUG_ON_PAGE(!PageLRU(page), page);

+ if (page_zonenum(page) > sc->reclaim_idx)
+ list_move(&page->lru, &pages_skipped);
+
switch (__isolate_lru_page(page, mode)) {
case 0:
nr_pages = hpage_nr_pages(page);
@@ -1347,6 +1354,15 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
}
}

+ /*
+ * Splice any skipped pages to the start of the LRU list. Note that
+ * this disrupts the LRU order when reclaiming for lower zones but
+ * we cannot splice to the tail. If we did then the SWAP_CLUSTER_MAX
+ * scanning would soon rescan the same pages to skip and put the
+ * system at risk of premature OOM.
+ */
+ if (!list_empty(&pages_skipped))
+ list_splice(&pages_skipped, src);
*nr_scanned = scan;
trace_mm_vmscan_lru_isolate(sc->order, nr_to_scan, scan,
nr_taken, mode, is_file_lru(lru));
@@ -1508,7 +1524,7 @@ static int current_may_throttle(void)
}

/*
- * shrink_inactive_list() is a helper for shrink_zone(). It returns the number
+ * shrink_inactive_list() is a helper for shrink_node(). It returns the number
* of reclaimed pages
*/
static noinline_for_stack unsigned long
@@ -2319,12 +2335,14 @@ static inline bool should_continue_reclaim(struct zone *zone,
}
}

-static bool shrink_zone(struct zone *zone, struct scan_control *sc,
- bool is_classzone)
+static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
+ enum zone_type reclaim_idx,
+ enum zone_type classzone_idx)
{
struct reclaim_state *reclaim_state = current->reclaim_state;
unsigned long nr_reclaimed, nr_scanned;
bool reclaimable = false;
+ struct zone *zone = &pgdat->node_zones[classzone_idx];

do {
struct mem_cgroup *root = sc->target_mem_cgroup;
@@ -2355,10 +2373,11 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
swappiness = mem_cgroup_swappiness(memcg);
scanned = sc->nr_scanned;

+ sc->reclaim_idx = reclaim_idx;
shrink_lruvec(lruvec, swappiness, sc, &lru_pages);
zone_lru_pages += lru_pages;

- if (memcg && is_classzone)
+ if (!global_reclaim(sc) && reclaim_idx == classzone_idx)
shrink_slab(sc->gfp_mask, zone_to_nid(zone),
memcg, sc->nr_scanned - scanned,
lru_pages);
@@ -2384,7 +2403,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
* Shrink the slab caches in the same proportion that
* the eligible LRU pages were scanned.
*/
- if (global_reclaim(sc) && is_classzone)
+ if (global_reclaim(sc) && reclaim_idx == classzone_idx)
shrink_slab(sc->gfp_mask, zone_to_nid(zone), NULL,
sc->nr_scanned - nr_scanned,
zone_lru_pages);
@@ -2462,14 +2481,14 @@ static inline bool compaction_ready(struct zone *zone, int order)
*
* Returns true if a zone was reclaimable.
*/
-static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
+static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc,
+ enum zone_type reclaim_idx, enum zone_type classzone_idx)
{
struct zoneref *z;
struct zone *zone;
unsigned long nr_soft_reclaimed;
unsigned long nr_soft_scanned;
gfp_t orig_mask;
- enum zone_type requested_highidx = gfp_zone(sc->gfp_mask);
bool reclaimable = false;

/*
@@ -2482,16 +2501,12 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
sc->gfp_mask |= __GFP_HIGHMEM;

for_each_zone_zonelist_nodemask(zone, z, zonelist,
- requested_highidx, sc->nodemask) {
- enum zone_type classzone_idx;
-
- if (!populated_zone(zone))
- continue;
-
- classzone_idx = requested_highidx;
- while (!populated_zone(zone->zone_pgdat->node_zones +
- classzone_idx))
+ classzone_idx, sc->nodemask) {
+ if (!populated_zone(zone)) {
+ reclaim_idx--;
classzone_idx--;
+ continue;
+ }

/*
* Take care memory controller reclaiming has small influence
@@ -2517,7 +2532,7 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
*/
if (IS_ENABLED(CONFIG_COMPACTION) &&
sc->order > PAGE_ALLOC_COSTLY_ORDER &&
- zonelist_zone_idx(z) <= requested_highidx &&
+ zonelist_zone_idx(z) <= classzone_idx &&
compaction_ready(zone, sc->order)) {
sc->compaction_ready = true;
continue;
@@ -2537,10 +2552,10 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
sc->nr_scanned += nr_soft_scanned;
if (nr_soft_reclaimed)
reclaimable = true;
- /* need some check for avoid more shrink_zone() */
+ /* need some check for avoid more shrink_node() */
}

- if (shrink_zone(zone, sc, zone_idx(zone) == classzone_idx))
+ if (shrink_node(zone->zone_pgdat, sc, reclaim_idx, classzone_idx))
reclaimable = true;

if (global_reclaim(sc) &&
@@ -2580,6 +2595,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
unsigned long total_scanned = 0;
unsigned long writeback_threshold;
bool zones_reclaimable;
+ enum zone_type classzone_idx = gfp_zone(sc->gfp_mask);
retry:
delayacct_freepages_start();

@@ -2590,7 +2606,7 @@ retry:
vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
sc->priority);
sc->nr_scanned = 0;
- zones_reclaimable = shrink_zones(zonelist, sc);
+ zones_reclaimable = shrink_zones(zonelist, sc, classzone_idx, classzone_idx);

total_scanned += sc->nr_scanned;
if (sc->nr_reclaimed >= sc->nr_to_reclaim)
@@ -3058,7 +3074,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
balance_gap, classzone_idx))
return true;

- shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);
+ shrink_node(zone->zone_pgdat, sc, zone_idx(zone), classzone_idx);

/* Account for the number of pages attempted to reclaim */
*nr_attempted += sc->nr_to_reclaim;
@@ -3201,15 +3217,14 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
sc.may_writepage = 1;

/*
- * Now scan the zone in the dma->highmem direction, stopping
- * at the last zone which needs scanning.
- *
- * We do this because the page allocator works in the opposite
- * direction. This prevents the page allocator from allocating
- * pages behind kswapd's direction of progress, which would
- * cause too much scanning of the lower zones.
+ * Continue scanning in the highmem->dma direction stopping at
+ * the last zone which needs scanning. This may reclaim lowmem
+ * pages that are not necessary for zone balancing but it
+ * preserves LRU ordering. It is assumed that the bulk of
+ * allocation requests can use arbitrary zones with the
+ * possible exception of big highmem:lowmem configurations.
*/
- for (i = 0; i <= end_zone; i++) {
+ for (i = end_zone; i >= end_zone; i--) {
struct zone *zone = pgdat->node_zones + i;

if (!populated_zone(zone))
@@ -3707,7 +3722,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
* priorities until we have enough memory freed.
*/
do {
- shrink_zone(zone, &sc, true);
+ shrink_node(zone->zone_pgdat, &sc, zone_idx(zone), zone_idx(zone));
} while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0);
}

--
2.3.5

2015-06-08 14:02:35

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 05/25] mm, vmscan: Have kswapd only scan based on the highest requested zone

kswapd checks all eligible zones to see if they need balancing even if it was
woken for a lower zone. This made sense when we reclaimed on a per-zone basis
because we wanted to shrink zones fairly to avoid age-inversion problems.
It should be unnecessary when reclaiming on a per-node basis. In theory,
there may still be anomalies when all requests are for lower zones and very
old pages are preserved in higher zones but this should be the exceptional
case.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index acdded211bd8..f0eed2e6883c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3142,11 +3142,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,

sc.nr_reclaimed = 0;

- /*
- * Scan in the highmem->dma direction for the highest
- * zone which needs scanning
- */
- for (i = pgdat->nr_zones - 1; i >= 0; i--) {
+ /* Scan from the highest requested zone to dma */
+ for (i = *classzone_idx; i >= 0; i--) {
struct zone *zone = pgdat->node_zones + i;

if (!populated_zone(zone))
--
2.3.5

2015-06-08 14:02:30

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 06/25] mm, vmscan: Avoid a second search through zones checking if compaction is required

balance_pgdat() currently makes a second pass over the zones just to decide
whether kswapd still needs to call compaction. Fold that check into the
initial scan for an unbalanced zone and remove the now unnecessary second
loop.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 31 +++++++++++++------------------
1 file changed, 13 insertions(+), 18 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f0eed2e6883c..975c315f1bf5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3182,30 +3182,25 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
*/
clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
+
+ /*
+ * If any zone is currently balanced then kswapd will
+ * not call compaction as it is expected that the
+ * necessary pages are already available.
+ */
+ if (pgdat_needs_compaction &&
+ zone_watermark_ok(zone, order,
+ low_wmark_pages(zone),
+ *classzone_idx, 0)) {
+ pgdat_needs_compaction = false;
+ }
+
}
}

if (i < 0)
goto out;

- for (i = 0; i <= end_zone; i++) {
- struct zone *zone = pgdat->node_zones + i;
-
- if (!populated_zone(zone))
- continue;
-
- /*
- * If any zone is currently balanced then kswapd will
- * not call compaction as it is expected that the
- * necessary pages are already available.
- */
- if (pgdat_needs_compaction &&
- zone_watermark_ok(zone, order,
- low_wmark_pages(zone),
- *classzone_idx, 0))
- pgdat_needs_compaction = false;
- }
-
/*
* If we're getting trouble reclaiming, start doing writepage
* even in laptop mode.
--
2.3.5

2015-06-08 14:02:17

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 07/25] mm, vmscan: Make kswapd think of reclaim in terms of nodes

Patch "mm: vmscan: Begin reclaiming pages on a per-node basis" was
the start of thinking of reclaim in terms of nodes but kswapd is still
very zone-centric. This patch gets rid of many of the node-based versus
zone-based decisions.

o A node is considered balanced when any eligible lower zone is balanced.
This eliminates one class of age-inversion problem because we avoid
reclaiming a newer page just because it's in the wrong zone
o pgdat_balanced disappears because we now only care about one zone being
balanced.
o Some anomalies related to writeback and congestion tracking being based on
zones disappear.
o kswapd no longer has to take care to reclaim zones in the reverse order
that the page allocator uses.
o Most importantly of all, reclaim from node 0 which often has multiple
zones will have similar aging and reclaiming characteristics as every
other node.
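
As an illustration of the first point, here is a minimal userspace sketch of
the "any eligible zone balanced" check. The per-zone fields are simplified
stand-ins; the real check goes through zone_watermark_ok_safe() and also
consults compaction.

#include <stdbool.h>
#include <stdio.h>

#define MAX_NR_ZONES 4

/* Toy per-zone state: free pages and the zone's high watermark */
struct fake_zone {
	unsigned long nr_free;
	unsigned long high_wmark;
	bool populated;
};

/*
 * With node-based aging, kswapd can stop as soon as any zone usable by
 * the waker (index <= classzone_idx) meets its high watermark; it no
 * longer requires every zone to be balanced.
 */
static bool node_balanced(struct fake_zone *zones, int classzone_idx)
{
	int i;

	for (i = 0; i <= classzone_idx; i++) {
		if (!zones[i].populated)
			continue;
		if (zones[i].nr_free >= zones[i].high_wmark)
			return true;
	}
	return false;
}

int main(void)
{
	struct fake_zone zones[MAX_NR_ZONES] = {
		{ 100, 500, true },	/* DMA-like zone below its watermark */
		{ 900, 800, true },	/* this zone alone balances the node */
		{ 0, 0, false },
		{ 0, 0, false },
	};

	printf("node balanced: %s\n", node_balanced(zones, 1) ? "yes" : "no");
	return 0;
}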

Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 264 ++++++++++++++++++++----------------------------------------
1 file changed, 87 insertions(+), 177 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 975c315f1bf5..4d7ddaf4f2f4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2900,7 +2900,8 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
}
#endif

-static void age_active_anon(struct zone *zone, struct scan_control *sc)
+static void age_active_anon(struct pglist_data *pgdat,
+ struct zone *zone, struct scan_control *sc)
{
struct mem_cgroup *memcg;

@@ -2934,65 +2935,6 @@ static bool zone_balanced(struct zone *zone, int order,
}

/*
- * pgdat_balanced() is used when checking if a node is balanced.
- *
- * For order-0, all zones must be balanced!
- *
- * For high-order allocations only zones that meet watermarks and are in a
- * zone allowed by the callers classzone_idx are added to balanced_pages. The
- * total of balanced pages must be at least 25% of the zones allowed by
- * classzone_idx for the node to be considered balanced. Forcing all zones to
- * be balanced for high orders can cause excessive reclaim when there are
- * imbalanced zones.
- * The choice of 25% is due to
- * o a 16M DMA zone that is balanced will not balance a zone on any
- * reasonable sized machine
- * o On all other machines, the top zone must be at least a reasonable
- * percentage of the middle zones. For example, on 32-bit x86, highmem
- * would need to be at least 256M for it to be balance a whole node.
- * Similarly, on x86-64 the Normal zone would need to be at least 1G
- * to balance a node on its own. These seemed like reasonable ratios.
- */
-static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
-{
- unsigned long managed_pages = 0;
- unsigned long balanced_pages = 0;
- int i;
-
- /* Check the watermark levels */
- for (i = 0; i <= classzone_idx; i++) {
- struct zone *zone = pgdat->node_zones + i;
-
- if (!populated_zone(zone))
- continue;
-
- managed_pages += zone->managed_pages;
-
- /*
- * A special case here:
- *
- * balance_pgdat() skips over all_unreclaimable after
- * DEF_PRIORITY. Effectively, it considers them balanced so
- * they must be considered balanced here as well!
- */
- if (!pgdat_reclaimable(zone->zone_pgdat)) {
- balanced_pages += zone->managed_pages;
- continue;
- }
-
- if (zone_balanced(zone, order, 0, i))
- balanced_pages += zone->managed_pages;
- else if (!order)
- return false;
- }
-
- if (order)
- return balanced_pages >= (managed_pages >> 2);
- else
- return true;
-}
-
-/*
* Prepare kswapd for sleeping. This verifies that there are no processes
* waiting in throttle_direct_reclaim() and that watermarks have been met.
*
@@ -3001,6 +2943,8 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
int classzone_idx)
{
+ int i;
+
/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
if (remaining)
return false;
@@ -3021,78 +2965,69 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
if (waitqueue_active(&pgdat->pfmemalloc_wait))
wake_up_all(&pgdat->pfmemalloc_wait);

- return pgdat_balanced(pgdat, order, classzone_idx);
+ for (i = 0; i <= classzone_idx; i++) {
+ struct zone *zone = pgdat->node_zones + i;
+
+ if (zone_balanced(zone, order, 0, classzone_idx))
+ return true;
+ }
+
+ return false;
}

/*
- * kswapd shrinks the zone by the number of pages required to reach
- * the high watermark.
+ * kswapd shrinks a node of pages that are at or below the highest usable
+ * zone that is currently unbalanced.
*
* Returns true if kswapd scanned at least the requested number of pages to
* reclaim or if the lack of progress was due to pages under writeback.
* This is used to determine if the scanning priority needs to be raised.
*/
-static bool kswapd_shrink_zone(struct zone *zone,
- int classzone_idx,
- struct scan_control *sc,
+static bool kswapd_shrink_node(pg_data_t *pgdat,
+ int end_zone, struct scan_control *sc,
unsigned long *nr_attempted)
{
- int testorder = sc->order;
- unsigned long balance_gap;
- bool lowmem_pressure;
- struct pglist_data *pgdat = zone->zone_pgdat;
+ struct zone *zone;
+ unsigned long nr_to_reclaim = 0;
+ int z;

- /* Reclaim above the high watermark. */
- sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
+ /* Aim to reclaim above all the zone high watermarks */
+ for (z = 0; z <= end_zone; z++) {
+ zone = pgdat->node_zones + z;
+ nr_to_reclaim += high_wmark_pages(zone);

- /*
- * Kswapd reclaims only single pages with compaction enabled. Trying
- * too hard to reclaim until contiguous free pages have become
- * available can hurt performance by evicting too much useful data
- * from memory. Do not reclaim more than needed for compaction.
- */
- if (IS_ENABLED(CONFIG_COMPACTION) && sc->order &&
- compaction_suitable(zone, sc->order, 0, classzone_idx)
+ /*
+ * Kswapd reclaims only single pages with compaction enabled.
+ * Trying too hard to reclaim until contiguous free pages have
+ * become available can hurt performance by evicting too much
+ * useful data from memory. Do not reclaim more than needed
+ * for compaction.
+ */
+ if (IS_ENABLED(CONFIG_COMPACTION) && sc->order &&
+ compaction_suitable(zone, sc->order, 0, end_zone)
!= COMPACT_SKIPPED)
- testorder = 0;
-
- /*
- * We put equal pressure on every zone, unless one zone has way too
- * many pages free already. The "too many pages" is defined as the
- * high wmark plus a "gap" where the gap is either the low
- * watermark or 1% of the zone, whichever is smaller.
- */
- balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP(
- zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO));
+ sc->order = 0;
+ }
+ sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, nr_to_reclaim);

/*
- * If there is no low memory pressure or the zone is balanced then no
- * reclaim is necessary
+ * Historically care was taken to put equal pressure on all zones but
+ * now pressure is applied based on node LRU order.
*/
- lowmem_pressure = (buffer_heads_over_limit && is_highmem(zone));
- if (!lowmem_pressure && zone_balanced(zone, testorder,
- balance_gap, classzone_idx))
- return true;
-
- shrink_node(zone->zone_pgdat, sc, zone_idx(zone), classzone_idx);
+ shrink_node(zone->zone_pgdat, sc, end_zone, end_zone);

/* Account for the number of pages attempted to reclaim */
*nr_attempted += sc->nr_to_reclaim;

- /* TODO: ANOMALY */
- clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
-
/*
- * If a zone reaches its high watermark, consider it to be no longer
- * congested. It's possible there are dirty pages backed by congested
- * BDIs but as pressure is relieved, speculatively avoid congestion
- * waits.
+ * Fragmentation may mean that the system cannot be rebalanced for
+ * high-order allocations. If twice the allocation size has been
+ * reclaimed then recheck watermarks only at order-0 to prevent
+ * excessive reclaim. Assume that a process requested a high-order
+ * can direct reclaim/compact.
*/
- if (pgdat_reclaimable(zone->zone_pgdat) &&
- zone_balanced(zone, testorder, 0, classzone_idx)) {
- clear_bit(PGDAT_CONGESTED, &pgdat->flags);
- clear_bit(PGDAT_DIRTY, &pgdat->flags);
- }
+ if (sc->order && sc->nr_reclaimed >= 2UL << sc->order)
+ sc->order = 0;

return sc->nr_scanned >= sc->nr_to_reclaim;
}
@@ -3122,9 +3057,10 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
int *classzone_idx)
{
int i;
- int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
+ int end_zone = MAX_NR_ZONES - 1;
unsigned long nr_soft_reclaimed;
unsigned long nr_soft_scanned;
+ struct zone *zone;
struct scan_control sc = {
.gfp_mask = GFP_KERNEL,
.order = order,
@@ -3142,23 +3078,16 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,

sc.nr_reclaimed = 0;

+ /* Allow writeout later if the pgdat appears unreclaimable */
+ if (!pgdat_reclaimable(pgdat))
+ sc.priority = min(sc.priority, DEF_PRIORITY - 3);
+
/* Scan from the highest requested zone to dma */
for (i = *classzone_idx; i >= 0; i--) {
- struct zone *zone = pgdat->node_zones + i;
-
+ zone = pgdat->node_zones + i;
if (!populated_zone(zone))
continue;

- if (sc.priority != DEF_PRIORITY &&
- !pgdat_reclaimable(zone->zone_pgdat))
- continue;
-
- /*
- * Do some background aging of the anon list, to give
- * pages a chance to be referenced before reclaiming.
- */
- age_active_anon(zone, &sc);
-
/*
* If the number of buffer_heads in the machine
* exceeds the maximum allowed level and this node
@@ -3175,10 +3104,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
break;
} else {
/*
- * If balanced, clear the dirty and congested
- * flags
- *
- * TODO: ANOMALY
+ * If any eligible zone is balanced then the
+ * node is not considered congested or dirty.
*/
clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
@@ -3202,51 +3129,32 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
goto out;

/*
+ * Do some background aging of the anon list, to give
+ * pages a chance to be referenced before reclaiming.
+ */
+ age_active_anon(pgdat, &pgdat->node_zones[end_zone], &sc);
+
+ /*
* If we're getting trouble reclaiming, start doing writepage
* even in laptop mode.
*/
if (sc.priority < DEF_PRIORITY - 2)
sc.may_writepage = 1;

+ /* Call soft limit reclaim before calling shrink_node. */
+ sc.nr_scanned = 0;
+ nr_soft_scanned = 0;
+ nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone, order,
+ sc.gfp_mask, &nr_soft_scanned);
+ sc.nr_reclaimed += nr_soft_reclaimed;
+
/*
- * Continue scanning in the highmem->dma direction stopping at
- * the last zone which needs scanning. This may reclaim lowmem
- * pages that are not necessary for zone balancing but it
- * preserves LRU ordering. It is assumed that the bulk of
- * allocation requests can use arbitrary zones with the
- * possible exception of big highmem:lowmem configurations.
+ * There should be no need to raise the scanning priority if
+ * enough pages are already being scanned that that high
+ * watermark would be met at 100% efficiency.
*/
- for (i = end_zone; i >= end_zone; i--) {
- struct zone *zone = pgdat->node_zones + i;
-
- if (!populated_zone(zone))
- continue;
-
- if (sc.priority != DEF_PRIORITY &&
- !pgdat_reclaimable(zone->zone_pgdat))
- continue;
-
- sc.nr_scanned = 0;
-
- nr_soft_scanned = 0;
- /*
- * Call soft limit reclaim before calling shrink_zone.
- */
- nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
- order, sc.gfp_mask,
- &nr_soft_scanned);
- sc.nr_reclaimed += nr_soft_reclaimed;
-
- /*
- * There should be no need to raise the scanning
- * priority if enough pages are already being scanned
- * that that high watermark would be met at 100%
- * efficiency.
- */
- if (kswapd_shrink_zone(zone, end_zone,
- &sc, &nr_attempted))
- raise_priority = false;
- }
+ if (kswapd_shrink_node(pgdat, end_zone, &sc, &nr_attempted))
+ raise_priority = false;

/*
* If the low watermark is met there is no need for processes
@@ -3257,17 +3165,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
pfmemalloc_watermark_ok(pgdat))
wake_up_all(&pgdat->pfmemalloc_wait);

- /*
- * Fragmentation may mean that the system cannot be rebalanced
- * for high-order allocations in all zones. If twice the
- * allocation size has been reclaimed and the zones are still
- * not balanced then recheck the watermarks at order-0 to
- * prevent kswapd reclaiming excessively. Assume that a
- * process requested a high-order can direct reclaim/compact.
- */
- if (order && sc.nr_reclaimed >= 2UL << order)
- order = sc.order = 0;
-
/* Check if kswapd should be suspending */
if (try_to_freeze() || kthread_should_stop())
break;
@@ -3280,13 +3177,26 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
compact_pgdat(pgdat, order);

/*
+ * Stop reclaiming if any eligible zone is balanced and clear
+ * node writeback or congested.
+ */
+ for (i = 0; i <= *classzone_idx; i++) {
+ zone = pgdat->node_zones + i;
+
+ if (zone_balanced(zone, sc.order, 0, *classzone_idx)) {
+ clear_bit(PGDAT_CONGESTED, &pgdat->flags);
+ clear_bit(PGDAT_DIRTY, &pgdat->flags);
+ break;
+ }
+ }
+
+ /*
* Raise priority if scanning rate is too low or there was no
* progress in reclaiming pages
*/
if (raise_priority || !sc.nr_reclaimed)
sc.priority--;
- } while (sc.priority >= 1 &&
- !pgdat_balanced(pgdat, order, *classzone_idx));
+ } while (sc.priority >= 1);

out:
/*
--
2.3.5

2015-06-08 13:57:35

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 08/25] mm, vmscan: By default have direct reclaim only shrink once per node

Direct reclaim iterates over all zones in the zonelist, shrinking each of
them, but this conflicts with node-based reclaim. In the default
(node-ordered) zonelist case, shrink each node only once.
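
For illustration, a small userspace sketch of the last_pgdat idea: while
walking a node-ordered zonelist, each node is shrunk only once. The zonelist
entries and node identifiers are simplified stand-ins.

#include <stdio.h>

/* Toy zonelist entry: each zone records its owning node */
struct fake_zoneref {
	int node_id;
	int zone_idx;
};

/*
 * Walk a zonelist and shrink each node only once by remembering the last
 * node seen, mirroring the last_pgdat check added to shrink_zones().
 * Assumes the default node-ordered zonelist where a node's zones are
 * adjacent.
 */
static void shrink_zonelist(struct fake_zoneref *zl, int nr)
{
	int last_node = -1;
	int i;

	for (i = 0; i < nr; i++) {
		if (zl[i].node_id == last_node)
			continue;	/* this node was already shrunk */
		last_node = zl[i].node_id;
		printf("shrink node %d starting at zone index %d\n",
		       zl[i].node_id, zl[i].zone_idx);
	}
}

int main(void)
{
	struct fake_zoneref zonelist[] = {
		{ 0, 2 }, { 0, 1 }, { 0, 0 },	/* node 0, highest zone first */
		{ 1, 2 }, { 1, 1 }, { 1, 0 },	/* node 1 */
	};

	shrink_zonelist(zonelist, 6);
	return 0;
}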

Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4d7ddaf4f2f4..50aa650ac206 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2468,14 +2468,6 @@ static inline bool compaction_ready(struct zone *zone, int order)
* try to reclaim pages from zones which will satisfy the caller's allocation
* request.
*
- * We reclaim from a zone even if that zone is over high_wmark_pages(zone).
- * Because:
- * a) The caller may be trying to free *extra* pages to satisfy a higher-order
- * allocation or
- * b) The target zone may be at high_wmark_pages(zone) but the lower zones
- * must go *over* high_wmark_pages(zone) to satisfy the `incremental min'
- * zone defense algorithm.
- *
* If a zone is deemed to be full of pinned pages then just give it a light
* scan then give up on it.
*
@@ -2490,6 +2482,7 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc,
unsigned long nr_soft_scanned;
gfp_t orig_mask;
bool reclaimable = false;
+ pg_data_t *last_pgdat = NULL;

/*
* If the number of buffer_heads in the machine exceeds the maximum
@@ -2502,11 +2495,18 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc,

for_each_zone_zonelist_nodemask(zone, z, zonelist,
classzone_idx, sc->nodemask) {
- if (!populated_zone(zone)) {
- reclaim_idx--;
- classzone_idx--;
+ BUG_ON(!populated_zone(zone));
+
+ /*
+ * Shrink each node in the zonelist once. If the zonelist is
+ * ordered by zone (not the default) then a node may be
+ * shrunk multiple times but in that case the user prefers
+ * lower zones being preserved
+ */
+ if (zone->zone_pgdat == last_pgdat)
continue;
- }
+ last_pgdat = zone->zone_pgdat;
+ reclaim_idx = zone_idx(zone);

/*
* Take care memory controller reclaiming has small influence
--
2.3.5

2015-06-08 13:57:27

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 09/25] mm, vmscan: Clear congestion, dirty and need for compaction on a per-node basis

Tracking of whether a node is congested or dirty, and of whether reclaim
should stall, is still based on zone activity. This patch bases those
decisions on node-level reclaim activity instead.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 54 ++++++++++++++++++++++++------------------------------
1 file changed, 24 insertions(+), 30 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 50aa650ac206..e069decbcfa1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2921,16 +2921,30 @@ static void age_active_anon(struct pglist_data *pgdat,
}

static bool zone_balanced(struct zone *zone, int order,
- unsigned long balance_gap, int classzone_idx)
+ unsigned long balance_gap, int classzone_idx,
+ bool *pgdat_needs_compaction)
{
if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone) +
balance_gap, classzone_idx, 0))
return false;

+ /*
+ * If any eligible zone is balanced then the node is not considered
+ * to be congested or dirty
+ */
+ clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
+ clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
+
if (IS_ENABLED(CONFIG_COMPACTION) && order && compaction_suitable(zone,
order, 0, classzone_idx) == COMPACT_SKIPPED)
return false;

+ /*
+ * If a zone is balanced and compaction can start then there is no
+ * need for kswapd to call compact_pgdat
+ */
+ *pgdat_needs_compaction = false;
+
return true;
}

@@ -2944,6 +2958,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
int classzone_idx)
{
int i;
+ bool dummy;

/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
if (remaining)
@@ -2968,7 +2983,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
for (i = 0; i <= classzone_idx; i++) {
struct zone *zone = pgdat->node_zones + i;

- if (zone_balanced(zone, order, 0, classzone_idx))
+ if (zone_balanced(zone, order, 0, classzone_idx, &dummy))
return true;
}

@@ -3099,29 +3114,10 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
break;
}

- if (!zone_balanced(zone, order, 0, 0)) {
+ if (!zone_balanced(zone, order, 0, 0,
+ &pgdat_needs_compaction)) {
end_zone = i;
break;
- } else {
- /*
- * If any eligible zone is balanced then the
- * node is not considered congested or dirty.
- */
- clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
- clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
-
- /*
- * If any zone is currently balanced then kswapd will
- * not call compaction as it is expected that the
- * necessary pages are already available.
- */
- if (pgdat_needs_compaction &&
- zone_watermark_ok(zone, order,
- low_wmark_pages(zone),
- *classzone_idx, 0)) {
- pgdat_needs_compaction = false;
- }
-
}
}

@@ -3182,12 +3178,9 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
*/
for (i = 0; i <= *classzone_idx; i++) {
zone = pgdat->node_zones + i;
-
- if (zone_balanced(zone, sc.order, 0, *classzone_idx)) {
- clear_bit(PGDAT_CONGESTED, &pgdat->flags);
- clear_bit(PGDAT_DIRTY, &pgdat->flags);
- break;
- }
+ if (zone_balanced(zone, sc.order, 0, *classzone_idx,
+ &pgdat_needs_compaction))
+ goto out;
}

/*
@@ -3379,6 +3372,7 @@ static int kswapd(void *p)
void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
{
pg_data_t *pgdat;
+ bool dummy;

if (!populated_zone(zone))
return;
@@ -3392,7 +3386,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
}
if (!waitqueue_active(&pgdat->kswapd_wait))
return;
- if (zone_balanced(zone, order, 0, 0))
+ if (zone_balanced(zone, order, 0, 0, &dummy))
return;

trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
--
2.3.5

2015-06-08 13:57:22

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 10/25] mm, vmscan: Make shrink_node decisions more node-centric

Earlier patches focused on having direct reclaim and kswapd use node-centric
data for reclaiming but shrink_node() itself still uses too much zone
information. This patch removes the unnecessary zone-based information, the
most important change being that the decision on whether to continue reclaim
is now node-based. Some memcg APIs are adjusted as a result even though memcg
itself still uses some zone information.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/memcontrol.h | 9 ++++----
include/linux/mmzone.h | 4 ++--
include/linux/swap.h | 2 +-
mm/memcontrol.c | 17 ++++++++-------
mm/page_alloc.c | 2 +-
mm/vmscan.c | 53 ++++++++++++++++++++++++++--------------------
6 files changed, 48 insertions(+), 39 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index df225059daf3..b1ba7f5b3851 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -84,7 +84,8 @@ void mem_cgroup_uncharge_list(struct list_head *page_list);
void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
bool lrucare);

-struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
+struct lruvec *mem_cgroup_lruvec(struct pglist_data *, struct zone *zone,
+ struct mem_cgroup *);
struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);

bool mem_cgroup_is_descendant(struct mem_cgroup *memcg,
@@ -240,10 +241,10 @@ static inline void mem_cgroup_migrate(struct page *oldpage,
{
}

-static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
- struct mem_cgroup *memcg)
+static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
+ struct zone *zone, struct mem_cgroup *memcg)
{
- return zone_lruvec(zone);
+ return node_lruvec(pgdat);
}

static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fab74af19f26..1830c2180555 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -799,9 +799,9 @@ static inline spinlock_t *zone_lru_lock(struct zone *zone)
return &zone->zone_pgdat->lru_lock;
}

-static inline struct lruvec *zone_lruvec(struct zone *zone)
+static inline struct lruvec *node_lruvec(struct pglist_data *pgdat)
{
- return &zone->zone_pgdat->lruvec;
+ return &pgdat->lruvec;
}

static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7067eca501e2..bb9597213e39 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -323,7 +323,7 @@ extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
unsigned long nr_pages,
gfp_t gfp_mask,
bool may_swap);
-extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
+extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
gfp_t gfp_mask, bool noswap,
struct zone *zone,
unsigned long *nr_scanned);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 10eed58506a0..7c39930c8d86 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1170,22 +1170,23 @@ out:
EXPORT_SYMBOL(__mem_cgroup_count_vm_event);

/**
- * mem_cgroup_zone_lruvec - get the lru list vector for a zone and memcg
+ * mem_cgroup_lruvec - get the lru list vector for a node or a memcg zone
+ * @pgdat: node of the wanted lruvec
* @zone: zone of the wanted lruvec
* @memcg: memcg of the wanted lruvec
*
- * Returns the lru list vector holding pages for the given @zone and
- * @mem. This can be the global zone lruvec, if the memory controller
+ * Returns the lru list vector holding pages for a given @pgdat or a given
+ * @memcg and @zone. This can be the node lruvec, if the memory controller
* is disabled.
*/
-struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
- struct mem_cgroup *memcg)
+struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
+ struct zone *zone, struct mem_cgroup *memcg)
{
struct mem_cgroup_per_zone *mz;
struct lruvec *lruvec;

if (mem_cgroup_disabled()) {
- lruvec = zone_lruvec(zone);
+ lruvec = node_lruvec(pgdat);
goto out;
}

@@ -1721,8 +1722,8 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
}
continue;
}
- total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
- zone, &nr_scanned);
+ total += mem_cgroup_shrink_node(victim, gfp_mask, false,
+ zone, &nr_scanned);
*total_scanned += nr_scanned;
if (!soft_limit_excess(root_memcg))
break;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 49a29e8ae493..34201c141916 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4879,6 +4879,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
init_waitqueue_head(&pgdat->kswapd_wait);
init_waitqueue_head(&pgdat->pfmemalloc_wait);
pgdat_page_ext_init(pgdat);
+ lruvec_init(node_lruvec(pgdat));

for (j = 0; j < MAX_NR_ZONES; j++) {
struct zone *zone = pgdat->node_zones + j;
@@ -4948,7 +4949,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
/* For bootup, initialized properly in watermark setup */
mod_zone_page_state(zone, NR_ALLOC_BATCH, zone->managed_pages);

- lruvec_init(zone_lruvec(zone));
if (!size)
continue;

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e069decbcfa1..3a6a2fac48e5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2278,13 +2278,14 @@ static bool in_reclaim_compaction(struct scan_control *sc)
* calls try_to_compact_zone() that it will have enough free pages to succeed.
* It will give up earlier than that if there is difficulty reclaiming pages.
*/
-static inline bool should_continue_reclaim(struct zone *zone,
+static inline bool should_continue_reclaim(struct pglist_data *pgdat,
unsigned long nr_reclaimed,
unsigned long nr_scanned,
struct scan_control *sc)
{
unsigned long pages_for_compaction;
unsigned long inactive_lru_pages;
+ int z;

/* If not in reclaim/compaction mode, stop */
if (!in_reclaim_compaction(sc))
@@ -2318,21 +2319,27 @@ static inline bool should_continue_reclaim(struct zone *zone,
* inactive lists are large enough, continue reclaiming
*/
pages_for_compaction = (2UL << sc->order);
- inactive_lru_pages = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE);
+ inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
if (get_nr_swap_pages() > 0)
- inactive_lru_pages += node_page_state(zone->zone_pgdat, NR_INACTIVE_ANON);
+ inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
if (sc->nr_reclaimed < pages_for_compaction &&
inactive_lru_pages > pages_for_compaction)
return true;

/* If compaction would go ahead or the allocation would succeed, stop */
- switch (compaction_suitable(zone, sc->order, 0, 0)) {
- case COMPACT_PARTIAL:
- case COMPACT_CONTINUE:
- return false;
- default:
- return true;
+ for (z = 0; z <= sc->reclaim_idx; z++) {
+ struct zone *zone = &pgdat->node_zones[z];
+
+ switch (compaction_suitable(zone, sc->order, 0, 0)) {
+ case COMPACT_PARTIAL:
+ case COMPACT_CONTINUE:
+ return false;
+ default:
+ /* check next zone */
+ ;
+ }
}
+ return true;
}

static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
@@ -2342,15 +2349,14 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
struct reclaim_state *reclaim_state = current->reclaim_state;
unsigned long nr_reclaimed, nr_scanned;
bool reclaimable = false;
- struct zone *zone = &pgdat->node_zones[classzone_idx];

do {
struct mem_cgroup *root = sc->target_mem_cgroup;
struct mem_cgroup_reclaim_cookie reclaim = {
- .zone = zone,
+ .zone = &pgdat->node_zones[classzone_idx],
.priority = sc->priority,
};
- unsigned long zone_lru_pages = 0;
+ unsigned long node_lru_pages = 0;
struct mem_cgroup *memcg;

nr_reclaimed = sc->nr_reclaimed;
@@ -2369,23 +2375,24 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
mem_cgroup_events(memcg, MEMCG_LOW, 1);
}

- lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+ lruvec = mem_cgroup_lruvec(pgdat,
+ &pgdat->node_zones[reclaim_idx], memcg);
swappiness = mem_cgroup_swappiness(memcg);
scanned = sc->nr_scanned;

sc->reclaim_idx = reclaim_idx;
shrink_lruvec(lruvec, swappiness, sc, &lru_pages);
- zone_lru_pages += lru_pages;
+ node_lru_pages += lru_pages;

if (!global_reclaim(sc) && reclaim_idx == classzone_idx)
- shrink_slab(sc->gfp_mask, zone_to_nid(zone),
+ shrink_slab(sc->gfp_mask, pgdat->node_id,
memcg, sc->nr_scanned - scanned,
lru_pages);

/*
* Direct reclaim and kswapd have to scan all memory
* cgroups to fulfill the overall scan target for the
- * zone.
+ * node.
*
* Limit reclaim, on the other hand, only cares about
* nr_to_reclaim pages to be reclaimed and it will
@@ -2404,9 +2411,9 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
* the eligible LRU pages were scanned.
*/
if (global_reclaim(sc) && reclaim_idx == classzone_idx)
- shrink_slab(sc->gfp_mask, zone_to_nid(zone), NULL,
+ shrink_slab(sc->gfp_mask, pgdat->node_id, NULL,
sc->nr_scanned - nr_scanned,
- zone_lru_pages);
+ node_lru_pages);

if (reclaim_state) {
sc->nr_reclaimed += reclaim_state->reclaimed_slab;
@@ -2420,7 +2427,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
if (sc->nr_reclaimed - nr_reclaimed)
reclaimable = true;

- } while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
+ } while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
sc->nr_scanned - nr_scanned, sc));

return reclaimable;
@@ -2822,7 +2829,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,

#ifdef CONFIG_MEMCG

-unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
+unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
gfp_t gfp_mask, bool noswap,
struct zone *zone,
unsigned long *nr_scanned)
@@ -2834,7 +2841,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
.may_unmap = 1,
.may_swap = !noswap,
};
- struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+ struct lruvec *lruvec = mem_cgroup_lruvec(zone->zone_pgdat, zone, memcg);
int swappiness = mem_cgroup_swappiness(memcg);
unsigned long lru_pages;

@@ -2848,7 +2855,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
/*
* NOTE: Although we can get the priority field, using it
* here is not a good idea, since it limits the pages we can scan.
- * if we don't reclaim here, the shrink_zone from balance_pgdat
+ * if we don't reclaim here, the shrink_node from balance_pgdat
* will pick up pages from other mem cgroup's as well. We hack
* the priority and make it zero.
*/
@@ -2910,7 +2917,7 @@ static void age_active_anon(struct pglist_data *pgdat,

memcg = mem_cgroup_iter(NULL, NULL, NULL);
do {
- struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+ struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, zone, memcg);

if (inactive_anon_is_low(lruvec))
shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
--
2.3.5

2015-06-08 14:00:21

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 11/25] mm, workingset: Make working set detection node-aware

Working set and refault detection is still zone-based. Move the inactive_age
counter, the shadow entries and the workingset vmstat counters to the node so
that refault distances are compared against the node-wide active file list.
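
For illustration, a small userspace sketch of the node-based shadow entry
packing and refault distance arithmetic. NODES_SHIFT is fixed arbitrarily
here and the radix tree exceptional bits are omitted; this only models the
counter handling, not the real page cache code.

#include <stdio.h>

#define NODES_SHIFT	10	/* illustration only; really config dependent */

/* Pack the node id and the node's eviction counter into one word. */
static unsigned long pack_shadow(unsigned long eviction, int nid)
{
	return (eviction << NODES_SHIFT) | nid;
}

/*
 * Recover the node id and compute the refault distance with an unsigned
 * subtraction so that inactive_age wraparound is handled, mirroring
 * unpack_shadow() after the zone id is dropped.
 */
static void unpack_shadow(unsigned long shadow, int *nid,
			  unsigned long inactive_age, unsigned long *distance)
{
	unsigned long eviction = shadow >> NODES_SHIFT;
	unsigned long mask = ~0UL >> NODES_SHIFT;

	*nid = shadow & ((1UL << NODES_SHIFT) - 1);
	*distance = (inactive_age - eviction) & mask;
}

int main(void)
{
	unsigned long inactive_age = 1000;	/* node-wide eviction/activation counter */
	unsigned long shadow = pack_shadow(inactive_age, 1);
	unsigned long distance;
	int nid;

	inactive_age += 150;	/* evictions since this page was reclaimed */
	unpack_shadow(shadow, &nid, inactive_age, &distance);
	printf("node %d refault distance %lu\n", nid, distance);
	/* the page would be re-activated if distance <= NR_ACTIVE_FILE on that node */
	return 0;
}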

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 6 +++---
mm/vmstat.c | 6 +++---
mm/workingset.c | 47 +++++++++++++++++++++--------------------------
3 files changed, 27 insertions(+), 32 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 1830c2180555..4c761809d151 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -143,9 +143,6 @@ enum zone_stat_item {
NUMA_LOCAL, /* allocation from local node */
NUMA_OTHER, /* allocation from other node */
#endif
- WORKINGSET_REFAULT,
- WORKINGSET_ACTIVATE,
- WORKINGSET_NODERECLAIM,
NR_ANON_TRANSPARENT_HUGEPAGES,
NR_FREE_CMA_PAGES,
NR_VM_ZONE_STAT_ITEMS };
@@ -160,6 +157,9 @@ enum node_stat_item {
NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */
NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */
NR_PAGES_SCANNED, /* pages scanned since last reclaim */
+ WORKINGSET_REFAULT,
+ WORKINGSET_ACTIVATE,
+ WORKINGSET_NODERECLAIM,
NR_VM_NODE_STAT_ITEMS
};

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 36897da22792..054ee50974c9 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -921,9 +921,6 @@ const char * const vmstat_text[] = {
"numa_local",
"numa_other",
#endif
- "workingset_refault",
- "workingset_activate",
- "workingset_nodereclaim",
"nr_anon_transparent_hugepages",
"nr_free_cma",

@@ -936,6 +933,9 @@ const char * const vmstat_text[] = {
"nr_isolated_anon",
"nr_isolated_file",
"nr_pages_scanned",
+ "workingset_refault",
+ "workingset_activate",
+ "workingset_nodereclaim",

/* enum writeback_stat_item counters */
"nr_dirty_threshold",
diff --git a/mm/workingset.c b/mm/workingset.c
index ca080cc11797..1cc71f1ca7fc 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -16,7 +16,7 @@
/*
* Double CLOCK lists
*
- * Per zone, two clock lists are maintained for file pages: the
+ * Per node, two clock lists are maintained for file pages: the
* inactive and the active list. Freshly faulted pages start out at
* the head of the inactive list and page reclaim scans pages from the
* tail. Pages that are accessed multiple times on the inactive list
@@ -141,48 +141,43 @@
*
* Implementation
*
- * For each zone's file LRU lists, a counter for inactive evictions
- * and activations is maintained (zone->inactive_age).
+ * For each node's file LRU lists, a counter for inactive evictions
+ * and activations is maintained (node->inactive_age).
*
* On eviction, a snapshot of this counter (along with some bits to
- * identify the zone) is stored in the now empty page cache radix tree
+ * identify the node) is stored in the now empty page cache radix tree
* slot of the evicted page. This is called a shadow entry.
*
* On cache misses for which there are shadow entries, an eligible
* refault distance will immediately activate the refaulting page.
*/

-static void *pack_shadow(unsigned long eviction, struct zone *zone)
+static void *pack_shadow(unsigned long eviction, struct pglist_data *pgdat)
{
- eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
- eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
+ eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);

return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
}

static void unpack_shadow(void *shadow,
- struct zone **zone,
+ struct pglist_data **pgdat,
unsigned long *distance)
{
unsigned long entry = (unsigned long)shadow;
unsigned long eviction;
unsigned long refault;
unsigned long mask;
- int zid, nid;
+ int nid;

entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
- zid = entry & ((1UL << ZONES_SHIFT) - 1);
- entry >>= ZONES_SHIFT;
nid = entry & ((1UL << NODES_SHIFT) - 1);
entry >>= NODES_SHIFT;
eviction = entry;

- *zone = NODE_DATA(nid)->node_zones + zid;
-
- refault = atomic_long_read(&(*zone)->zone_pgdat->inactive_age);
- mask = ~0UL >> (NODES_SHIFT + ZONES_SHIFT +
- RADIX_TREE_EXCEPTIONAL_SHIFT);
+ *pgdat = NODE_DATA(nid);
+ refault = atomic_long_read(&(*pgdat)->inactive_age);
+ mask = ~0UL >> (NODES_SHIFT + RADIX_TREE_EXCEPTIONAL_SHIFT);
/*
* The unsigned subtraction here gives an accurate distance
* across inactive_age overflows in most cases.
@@ -212,11 +207,11 @@ static void unpack_shadow(void *shadow,
*/
void *workingset_eviction(struct address_space *mapping, struct page *page)
{
- struct zone *zone = page_zone(page);
+ struct pglist_data *pgdat = page_zone(page)->zone_pgdat;
unsigned long eviction;

- eviction = atomic_long_inc_return(&zone->zone_pgdat->inactive_age);
- return pack_shadow(eviction, zone);
+ eviction = atomic_long_inc_return(&pgdat->inactive_age);
+ return pack_shadow(eviction, pgdat);
}

/**
@@ -224,20 +219,20 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
* @shadow: shadow entry of the evicted page
*
* Calculates and evaluates the refault distance of the previously
- * evicted page in the context of the zone it was allocated in.
+ * evicted page in the context of the node it was allocated in.
*
* Returns %true if the page should be activated, %false otherwise.
*/
bool workingset_refault(void *shadow)
{
unsigned long refault_distance;
- struct zone *zone;
+ struct pglist_data *pgdat;

- unpack_shadow(shadow, &zone, &refault_distance);
- inc_zone_state(zone, WORKINGSET_REFAULT);
+ unpack_shadow(shadow, &pgdat, &refault_distance);
+ inc_node_state(pgdat, WORKINGSET_REFAULT);

- if (refault_distance <= node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE)) {
- inc_zone_state(zone, WORKINGSET_ACTIVATE);
+ if (refault_distance <= node_page_state(pgdat, NR_ACTIVE_FILE)) {
+ inc_node_state(pgdat, WORKINGSET_ACTIVATE);
return true;
}
return false;
@@ -356,7 +351,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
}
}
BUG_ON(node->count);
- inc_zone_state(page_zone(virt_to_page(node)), WORKINGSET_NODERECLAIM);
+ inc_node_state(page_zone(virt_to_page(node))->zone_pgdat, WORKINGSET_NODERECLAIM);
if (!__radix_tree_delete_node(&mapping->page_tree, node))
BUG();

--
2.3.5

2015-06-08 14:00:15

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 12/25] mm, page_alloc: Consider dirtyable memory in terms of nodes

Historically, dirty pages were spread among zones, but now that the LRUs are
per-node it is more appropriate to consider dirtyable memory and the dirty
limits on a per-node basis.
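
For illustration, a small userspace sketch of the per-node dirtyable memory
and dirty limit calculation. The counters are simplified stand-ins and
vm_dirty_ratio is fixed at 20% for the example; vm_dirty_bytes and the
PF_LESS_THROTTLE/rt_task boosts are ignored.

#include <stdio.h>

/* Toy node-wide page counts, all in pages */
struct fake_node {
	unsigned long nr_free;
	unsigned long nr_inactive_file;
	unsigned long nr_active_file;
	unsigned long dirty_balance_reserve;
	unsigned long nr_dirty;		/* dirty + writeback + unstable NFS */
};

/*
 * Mirrors node_dirtyable_memory(): the node's free pages minus the dirty
 * balance reserve, plus the pages on the node's file LRUs.
 */
static unsigned long node_dirtyable_memory(struct fake_node *node)
{
	unsigned long nr = node->nr_free;

	nr -= (node->dirty_balance_reserve < nr) ?
		node->dirty_balance_reserve : nr;
	return nr + node->nr_inactive_file + node->nr_active_file;
}

/* Mirrors node_dirty_ok() with a fixed 20% dirty ratio. */
static int node_dirty_ok(struct fake_node *node)
{
	unsigned long limit = node_dirtyable_memory(node) * 20 / 100;

	return node->nr_dirty <= limit;
}

int main(void)
{
	struct fake_node node0 = {
		.nr_free = 50000,
		.nr_inactive_file = 30000,
		.nr_active_file = 20000,
		.dirty_balance_reserve = 10000,
		.nr_dirty = 15000,
	};

	printf("node 0 within dirty limit: %s\n",
	       node_dirty_ok(&node0) ? "yes" : "no");
	return 0;
}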

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 12 +++----
include/linux/writeback.h | 2 +-
mm/page-writeback.c | 89 ++++++++++++++++++++++++++---------------------
mm/page_alloc.c | 8 +++--
4 files changed, 62 insertions(+), 49 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4c761809d151..22cdd8da6fb0 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -355,12 +355,6 @@ struct zone {
struct pglist_data *zone_pgdat;
struct per_cpu_pageset __percpu *pageset;

- /*
- * This is a per-zone reserve of pages that should not be
- * considered dirtyable memory.
- */
- unsigned long dirty_balance_reserve;
-
#ifndef CONFIG_SPARSEMEM
/*
* Flags for a pageblock_nr_pages block. See pageblock-flags.h.
@@ -760,6 +754,12 @@ typedef struct pglist_data {
/* Number of pages migrated during the rate limiting time interval */
unsigned long numabalancing_migrate_nr_pages;
#endif
+ /*
+ * This is a per-node reserve of pages that should not be
+ * considered dirtyable memory.
+ */
+ unsigned long dirty_balance_reserve;
+
/* Write-intensive fields used from the page allocator */
ZONE_PADDING(_pad1_)
spinlock_t lru_lock;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index b2dd371ec0ca..76602a5d721e 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -119,7 +119,7 @@ void laptop_mode_timer_fn(unsigned long data);
static inline void laptop_sync_completion(void) { }
#endif
void throttle_vm_writeout(gfp_t gfp_mask);
-bool zone_dirty_ok(struct zone *zone);
+bool node_dirty_ok(struct pglist_data *pgdat);

extern unsigned long global_dirty_limit;

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 9707c450c7c5..88e346f36f79 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -174,21 +174,30 @@ static unsigned long writeout_period_time = 0;
*/

/**
- * zone_dirtyable_memory - number of dirtyable pages in a zone
- * @zone: the zone
+ * node_dirtyable_memory - number of dirtyable pages in a node
+ * @pgdat: the node
*
- * Returns the zone's number of pages potentially available for dirty
- * page cache. This is the base value for the per-zone dirty limits.
+ * Returns the node's number of pages potentially available for dirty
+ * page cache. This is the base value for the per-node dirty limits.
*/
-static unsigned long zone_dirtyable_memory(struct zone *zone)
+static unsigned long node_dirtyable_memory(struct pglist_data *pgdat)
{
- unsigned long nr_pages;
+ unsigned long nr_pages = 0;
+ int z;

- nr_pages = zone_page_state(zone, NR_FREE_PAGES);
- nr_pages -= min(nr_pages, zone->dirty_balance_reserve);
+ for (z = 0; z < MAX_NR_ZONES; z++) {
+ struct zone *zone = pgdat->node_zones + z;

- nr_pages += node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE);
- nr_pages += node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE);
+ if (!populated_zone(zone))
+ continue;
+
+ nr_pages += zone_page_state(zone, NR_FREE_PAGES);
+ }
+
+ nr_pages -= min(nr_pages, pgdat->dirty_balance_reserve);
+
+ nr_pages += node_page_state(pgdat, NR_INACTIVE_FILE);
+ nr_pages += node_page_state(pgdat, NR_ACTIVE_FILE);

return nr_pages;
}
@@ -199,22 +208,11 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
int node;
unsigned long x = 0;

- for_each_node_state(node, N_HIGH_MEMORY) {
- struct zone *z = &NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
-
- x += zone_dirtyable_memory(z);
- }
/*
- * Unreclaimable memory (kernel memory or anonymous memory
- * without swap) can bring down the dirtyable pages below
- * the zone's dirty balance reserve and the above calculation
- * will underflow. However we still want to add in nodes
- * which are below threshold (negative values) to get a more
- * accurate calculation but make sure that the total never
- * underflows.
+ * LRU lists are per-node so there is no accurate way of
+ * calculating dirtyable memory for just the highmem zone
*/
- if ((long)x < 0)
- x = 0;
+ x = totalhigh_pages;

/*
* Make sure that the number of highmem pages is never larger
@@ -289,23 +287,23 @@ void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
}

/**
- * zone_dirty_limit - maximum number of dirty pages allowed in a zone
- * @zone: the zone
+ * node_dirty_limit - maximum number of dirty pages allowed in a node
+ * @pgdat: the node
*
- * Returns the maximum number of dirty pages allowed in a zone, based
- * on the zone's dirtyable memory.
+ * Returns the maximum number of dirty pages allowed in a node, based
+ * on the node's dirtyable memory.
*/
-static unsigned long zone_dirty_limit(struct zone *zone)
+static unsigned long node_dirty_limit(struct pglist_data *pgdat)
{
- unsigned long zone_memory = zone_dirtyable_memory(zone);
+ unsigned long node_memory = node_dirtyable_memory(pgdat);
struct task_struct *tsk = current;
unsigned long dirty;

if (vm_dirty_bytes)
dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE) *
- zone_memory / global_dirtyable_memory();
+ node_memory / global_dirtyable_memory();
else
- dirty = vm_dirty_ratio * zone_memory / 100;
+ dirty = vm_dirty_ratio * node_memory / 100;

if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk))
dirty += dirty / 4;
@@ -314,19 +312,30 @@ static unsigned long zone_dirty_limit(struct zone *zone)
}

/**
- * zone_dirty_ok - tells whether a zone is within its dirty limits
- * @zone: the zone to check
+ * node_dirty_ok - tells whether a node is within its dirty limits
+ * @pgdat: the node to check
*
- * Returns %true when the dirty pages in @zone are within the zone's
+ * Returns %true when the dirty pages in @pgdat are within the node's
* dirty limit, %false if the limit is exceeded.
*/
-bool zone_dirty_ok(struct zone *zone)
+bool node_dirty_ok(struct pglist_data *pgdat)
{
- unsigned long limit = zone_dirty_limit(zone);
+ int z;
+ unsigned long limit = node_dirty_limit(pgdat);
+ unsigned long nr_pages = 0;
+
+ for (z = 0; z < MAX_NR_ZONES; z++) {
+ struct zone *zone = pgdat->node_zones + z;
+
+ if (!populated_zone(zone))
+ continue;
+
+ nr_pages += zone_page_state(zone, NR_FILE_DIRTY);
+ nr_pages += zone_page_state(zone, NR_UNSTABLE_NFS);
+ nr_pages += zone_page_state(zone, NR_WRITEBACK);
+ }

- return zone_page_state(zone, NR_FILE_DIRTY) +
- zone_page_state(zone, NR_UNSTABLE_NFS) +
- zone_page_state(zone, NR_WRITEBACK) <= limit;
+ return nr_pages <= limit;
}

int dirty_background_ratio_handler(struct ctl_table *table, int write,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 34201c141916..fa54792f1719 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2042,6 +2042,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
(gfp_mask & __GFP_WRITE);
int nr_fair_skipped = 0;
bool zonelist_rescan;
+ struct pglist_data *last_pgdat = NULL;

zonelist_scan:
zonelist_rescan = false;
@@ -2101,8 +2102,11 @@ zonelist_scan:
* will require awareness of zones in the
* dirty-throttling and the flusher threads.
*/
- if (consider_zone_dirty && !zone_dirty_ok(zone))
+ if (consider_zone_dirty && last_pgdat != zone->zone_pgdat &&
+ !node_dirty_ok(zone->zone_pgdat)) {
continue;
+ }
+ last_pgdat = zone->zone_pgdat;

mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
if (!zone_watermark_ok(zone, order, mark,
@@ -5656,7 +5660,7 @@ static void calculate_totalreserve_pages(void)
* situation where reclaim has to clean pages
* in order to balance the zones.
*/
- zone->dirty_balance_reserve = max;
+ pgdat->dirty_balance_reserve += max;
}
}
dirty_balance_reserve = reserve_pages;
--
2.3.5

2015-06-08 14:00:09

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 13/25] mm: Move NR_FILE_MAPPED accounting to the node

Reclaim makes decisions based on the number of file pages that are mapped,
but the accounting mixes node and zone information. Account NR_FILE_MAPPED
pages on the node instead.

Signed-off-by: Mel Gorman <[email protected]>
---
arch/tile/mm/pgtable.c | 2 +-
drivers/base/node.c | 4 ++--
fs/proc/meminfo.c | 4 ++--
include/linux/mmzone.h | 6 +++---
mm/page_alloc.c | 6 +++---
mm/rmap.c | 12 ++++++------
mm/vmscan.c | 2 +-
mm/vmstat.c | 4 ++--
8 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/arch/tile/mm/pgtable.c b/arch/tile/mm/pgtable.c
index 3ed0a666d44a..2e784e84bd6f 100644
--- a/arch/tile/mm/pgtable.c
+++ b/arch/tile/mm/pgtable.c
@@ -55,7 +55,7 @@ void show_mem(unsigned int filter)
global_page_state(NR_FREE_PAGES),
(global_page_state(NR_SLAB_RECLAIMABLE) +
global_page_state(NR_SLAB_UNRECLAIMABLE)),
- global_page_state(NR_FILE_MAPPED),
+ global_node_page_state(NR_FILE_MAPPED),
global_page_state(NR_PAGETABLE),
global_page_state(NR_BOUNCE),
global_page_state(NR_FILE_PAGES),
diff --git a/drivers/base/node.c b/drivers/base/node.c
index b06ae7bfea63..4a83f3c9891a 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -119,8 +119,8 @@ static ssize_t node_read_meminfo(struct device *dev,
nid, K(sum_zone_node_page_state(nid, NR_FILE_DIRTY)),
nid, K(sum_zone_node_page_state(nid, NR_WRITEBACK)),
nid, K(sum_zone_node_page_state(nid, NR_FILE_PAGES)),
- nid, K(sum_zone_node_page_state(nid, NR_FILE_MAPPED)),
- nid, K(sum_zone_node_page_state(nid, NR_ANON_PAGES)),
+ nid, K(node_page_state(pgdat, NR_FILE_MAPPED)),
+ nid, K(node_page_state(pgdat, NR_ANON_PAGES)),
nid, K(i.sharedram),
nid, sum_zone_node_page_state(nid, NR_KERNEL_STACK) *
THREAD_SIZE / 1024,
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index d3ebf2e61853..8f105c774b2e 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -173,8 +173,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
K(i.freeswap),
K(global_page_state(NR_FILE_DIRTY)),
K(global_page_state(NR_WRITEBACK)),
- K(global_page_state(NR_ANON_PAGES)),
- K(global_page_state(NR_FILE_MAPPED)),
+ K(global_node_page_state(NR_ANON_PAGES)),
+ K(global_node_page_state(NR_FILE_MAPPED)),
K(i.sharedram),
K(global_page_state(NR_SLAB_RECLAIMABLE) +
global_page_state(NR_SLAB_UNRECLAIMABLE)),
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 22cdd8da6fb0..a523e1a30e54 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -116,9 +116,6 @@ enum zone_stat_item {
NR_FREE_PAGES,
NR_ALLOC_BATCH,
NR_MLOCK, /* mlock()ed pages found and moved off LRU */
- NR_ANON_PAGES, /* Mapped anonymous pages */
- NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
- only modified from process context */
NR_FILE_PAGES,
NR_FILE_DIRTY,
NR_WRITEBACK,
@@ -160,6 +157,9 @@ enum node_stat_item {
WORKINGSET_REFAULT,
WORKINGSET_ACTIVATE,
WORKINGSET_NODERECLAIM,
+ NR_ANON_PAGES, /* Mapped anonymous pages */
+ NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
+ only modified from process context */
NR_VM_NODE_STAT_ITEMS
};

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fa54792f1719..f5a376056ece 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3251,7 +3251,7 @@ void show_free_areas(unsigned int filter)
global_page_state(NR_FREE_PAGES),
global_page_state(NR_SLAB_RECLAIMABLE),
global_page_state(NR_SLAB_UNRECLAIMABLE),
- global_page_state(NR_FILE_MAPPED),
+ global_node_page_state(NR_FILE_MAPPED),
global_page_state(NR_SHMEM),
global_page_state(NR_PAGETABLE),
global_page_state(NR_BOUNCE),
@@ -3266,6 +3266,7 @@ void show_free_areas(unsigned int filter)
" unevictable:%lukB"
" isolated(anon):%lukB"
" isolated(file):%lukB"
+ " mapped:%lukB"
" all_unreclaimable? %s"
"\n",
pgdat->node_id,
@@ -3276,6 +3277,7 @@ void show_free_areas(unsigned int filter)
K(node_page_state(pgdat, NR_UNEVICTABLE)),
K(node_page_state(pgdat, NR_ISOLATED_ANON)),
K(node_page_state(pgdat, NR_ISOLATED_FILE)),
+ K(node_page_state(pgdat, NR_FILE_MAPPED)),
!pgdat_reclaimable(pgdat) ? "yes" : "no");
}

@@ -3295,7 +3297,6 @@ void show_free_areas(unsigned int filter)
" mlocked:%lukB"
" dirty:%lukB"
" writeback:%lukB"
- " mapped:%lukB"
" shmem:%lukB"
" slab_reclaimable:%lukB"
" slab_unreclaimable:%lukB"
@@ -3317,7 +3318,6 @@ void show_free_areas(unsigned int filter)
K(zone_page_state(zone, NR_MLOCK)),
K(zone_page_state(zone, NR_FILE_DIRTY)),
K(zone_page_state(zone, NR_WRITEBACK)),
- K(zone_page_state(zone, NR_FILE_MAPPED)),
K(zone_page_state(zone, NR_SHMEM)),
K(zone_page_state(zone, NR_SLAB_RECLAIMABLE)),
K(zone_page_state(zone, NR_SLAB_UNRECLAIMABLE)),
diff --git a/mm/rmap.c b/mm/rmap.c
index 75f1e06f3339..f2ce8d11bed6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1046,8 +1046,8 @@ void do_page_add_anon_rmap(struct page *page,
if (PageTransHuge(page))
__inc_zone_page_state(page,
NR_ANON_TRANSPARENT_HUGEPAGES);
- __mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
- hpage_nr_pages(page));
+ __mod_node_page_state(page_zone(page)->zone_pgdat,
+ NR_ANON_PAGES, hpage_nr_pages(page));
}
if (unlikely(PageKsm(page)))
return;
@@ -1078,7 +1078,7 @@ void page_add_new_anon_rmap(struct page *page,
atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
if (PageTransHuge(page))
__inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
- __mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
+ __mod_node_page_state(page_zone(page)->zone_pgdat, NR_ANON_PAGES,
hpage_nr_pages(page));
__page_set_anon_rmap(page, vma, address, 1);
}
@@ -1095,7 +1095,7 @@ void page_add_file_rmap(struct page *page)

memcg = mem_cgroup_begin_page_stat(page);
if (atomic_inc_and_test(&page->_mapcount)) {
- __inc_zone_page_state(page, NR_FILE_MAPPED);
+ __inc_node_page_state(page, NR_FILE_MAPPED);
mem_cgroup_inc_page_stat(memcg, MEM_CGROUP_STAT_FILE_MAPPED);
}
mem_cgroup_end_page_stat(memcg);
@@ -1120,7 +1120,7 @@ static void page_remove_file_rmap(struct page *page)
* these counters are not modified in interrupt context, and
* pte lock(a spinlock) is held, which implies preemption disabled.
*/
- __dec_zone_page_state(page, NR_FILE_MAPPED);
+ __dec_node_page_state(page, NR_FILE_MAPPED);
mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_FILE_MAPPED);

if (unlikely(PageMlocked(page)))
@@ -1158,7 +1158,7 @@ void page_remove_rmap(struct page *page)
if (PageTransHuge(page))
__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);

- __mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
+ __mod_node_page_state(page_zone(page)->zone_pgdat, NR_ANON_PAGES,
-hpage_nr_pages(page));

if (unlikely(PageMlocked(page)))
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3a6a2fac48e5..1391fd15a7ec 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3549,7 +3549,7 @@ int sysctl_min_slab_ratio = 5;

static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
{
- unsigned long file_mapped = zone_page_state(zone, NR_FILE_MAPPED);
+ unsigned long file_mapped = node_page_state(zone->zone_pgdat, NR_FILE_MAPPED);
unsigned long file_lru = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE) +
node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE);

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 054ee50974c9..4aa4fb09d078 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -895,8 +895,6 @@ const char * const vmstat_text[] = {
"nr_free_pages",
"nr_alloc_batch",
"nr_mlock",
- "nr_anon_pages",
- "nr_mapped",
"nr_file_pages",
"nr_dirty",
"nr_writeback",
@@ -936,6 +934,8 @@ const char * const vmstat_text[] = {
"workingset_refault",
"workingset_activate",
"workingset_nodereclaim",
+ "nr_anon_pages",
+ "nr_mapped",

/* enum writeback_stat_item counters */
"nr_dirty_threshold",
--
2.3.5

2015-06-08 14:00:05

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 14/25] mm: Rename NR_ANON_PAGES to NR_ANON_MAPPED

NR_FILE_PAGES is the number of file pages.
NR_FILE_MAPPED is the number of mapped file pages.
NR_ANON_PAGES is the number of mapped anon pages.

This is unhelpful naming as it is easy to miss that NR_ANON_PAGES counts mapped
anon pages and to confuse it with NR_FILE_MAPPED. This patch renames
NR_ANON_PAGES so we have

NR_FILE_PAGES is the number of file pages.
NR_FILE_MAPPED is the number of mapped file pages.
NR_ANON_MAPPED is the number of mapped anon pages.
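
The fields exported through /proc/meminfo keep their existing names; only
the kernel-internal symbol changes. A minimal userspace sketch (assuming
the usual /proc layout; illustrative only, not part of the patch) that
prints the two counters in question:

/* Illustrative only: print the AnonPages and Mapped fields, which are
 * fed by NR_ANON_MAPPED and NR_FILE_MAPPED respectively after this patch. */
#include <stdio.h>
#include <string.h>

int main(void)
{
        char line[256];
        FILE *fp = fopen("/proc/meminfo", "r");

        if (!fp)
                return 1;
        while (fgets(line, sizeof(line), fp)) {
                if (!strncmp(line, "AnonPages:", 10) ||
                    !strncmp(line, "Mapped:", 7))
                        fputs(line, stdout);
        }
        fclose(fp);
        return 0;
}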

Signed-off-by: Mel Gorman <[email protected]>
---
fs/proc/meminfo.c | 2 +-
include/linux/mmzone.h | 2 +-
mm/migrate.c | 2 +-
mm/rmap.c | 8 ++++----
4 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 8f105c774b2e..2072876cce7c 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -173,7 +173,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
K(i.freeswap),
K(global_page_state(NR_FILE_DIRTY)),
K(global_page_state(NR_WRITEBACK)),
- K(global_node_page_state(NR_ANON_PAGES)),
+ K(global_node_page_state(NR_ANON_MAPPED)),
K(global_node_page_state(NR_FILE_MAPPED)),
K(i.sharedram),
K(global_page_state(NR_SLAB_RECLAIMABLE) +
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a523e1a30e54..4406f855d58e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -157,7 +157,7 @@ enum node_stat_item {
WORKINGSET_REFAULT,
WORKINGSET_ACTIVATE,
WORKINGSET_NODERECLAIM,
- NR_ANON_PAGES, /* Mapped anonymous pages */
+ NR_ANON_MAPPED, /* Mapped anonymous pages */
NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
only modified from process context */
NR_VM_NODE_STAT_ITEMS
diff --git a/mm/migrate.c b/mm/migrate.c
index a33e4b4ed60d..4a50bb7c06a6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -376,7 +376,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
* new page and drop references to the old page.
*
* Note that anonymous pages are accounted for
- * via NR_FILE_PAGES and NR_ANON_PAGES if they
+ * via NR_FILE_PAGES and NR_ANON_MAPPED if they
* are mapped to swap space.
*/
__dec_zone_page_state(page, NR_FILE_PAGES);
diff --git a/mm/rmap.c b/mm/rmap.c
index f2ce8d11bed6..e6bf7a205913 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1047,7 +1047,7 @@ void do_page_add_anon_rmap(struct page *page,
__inc_zone_page_state(page,
NR_ANON_TRANSPARENT_HUGEPAGES);
__mod_node_page_state(page_zone(page)->zone_pgdat,
- NR_ANON_PAGES, hpage_nr_pages(page));
+ NR_ANON_MAPPED, hpage_nr_pages(page));
}
if (unlikely(PageKsm(page)))
return;
@@ -1078,7 +1078,7 @@ void page_add_new_anon_rmap(struct page *page,
atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
if (PageTransHuge(page))
__inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
- __mod_node_page_state(page_zone(page)->zone_pgdat, NR_ANON_PAGES,
+ __mod_node_page_state(page_zone(page)->zone_pgdat, NR_ANON_MAPPED,
hpage_nr_pages(page));
__page_set_anon_rmap(page, vma, address, 1);
}
@@ -1146,7 +1146,7 @@ void page_remove_rmap(struct page *page)
if (!atomic_add_negative(-1, &page->_mapcount))
return;

- /* Hugepages are not counted in NR_ANON_PAGES for now. */
+ /* Hugepages are not counted in NR_ANON_MAPPED for now. */
if (unlikely(PageHuge(page)))
return;

@@ -1158,7 +1158,7 @@ void page_remove_rmap(struct page *page)
if (PageTransHuge(page))
__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);

- __mod_node_page_state(page_zone(page)->zone_pgdat, NR_ANON_PAGES,
+ __mod_node_page_state(page_zone(page)->zone_pgdat, NR_ANON_MAPPED,
-hpage_nr_pages(page));

if (unlikely(PageMlocked(page)))
--
2.3.5

2015-06-08 13:59:53

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 15/25] mm: Move most file-based accounting to the node

There are now a number of accounting oddities such as mapped file pages
being accounted for on the node while the total number of file pages is
accounted on the zone. This can be coped with to some extent but it's
confusing, so this patch moves the relevant file-based accounting to the node.
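
The per-node totals exported through sysfs are now backed by node counters
rather than per-zone sums. A minimal userspace sketch (assuming the usual
sysfs path and field names; illustrative only) that dumps the affected
fields for node 0:

/* Illustrative only: print the file-based fields from node 0's meminfo. */
#include <stdio.h>
#include <string.h>

int main(void)
{
        const char *fields[] = { "FilePages:", "Dirty:", "Writeback:", "Shmem:" };
        char line[256];
        unsigned int i;
        FILE *fp = fopen("/sys/devices/system/node/node0/meminfo", "r");

        if (!fp)
                return 1;
        while (fgets(line, sizeof(line), fp))
                for (i = 0; i < sizeof(fields) / sizeof(fields[0]); i++)
                        if (strstr(line, fields[i]))
                                fputs(line, stdout);
        fclose(fp);
        return 0;
}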

Signed-off-by: Mel Gorman <[email protected]>
---
arch/s390/appldata/appldata_mem.c | 2 +-
arch/tile/mm/pgtable.c | 8 +++----
drivers/base/node.c | 12 +++++-----
drivers/staging/android/lowmemorykiller.c | 4 ++--
fs/fs-writeback.c | 8 +++----
fs/fuse/file.c | 8 +++----
fs/nfs/internal.h | 2 +-
fs/nfs/write.c | 2 +-
fs/proc/meminfo.c | 10 ++++----
include/linux/mmzone.h | 12 +++++-----
include/trace/events/writeback.h | 6 ++---
mm/filemap.c | 12 +++++-----
mm/migrate.c | 8 +++----
mm/mmap.c | 4 ++--
mm/nommu.c | 4 ++--
mm/page-writeback.c | 38 ++++++++++++-------------------
mm/page_alloc.c | 34 +++++++++++++--------------
mm/shmem.c | 12 +++++-----
mm/swap_state.c | 4 ++--
mm/truncate.c | 2 +-
mm/vmscan.c | 16 ++++++-------
mm/vmstat.c | 12 +++++-----
22 files changed, 106 insertions(+), 114 deletions(-)

diff --git a/arch/s390/appldata/appldata_mem.c b/arch/s390/appldata/appldata_mem.c
index edcf2a706942..598df5708501 100644
--- a/arch/s390/appldata/appldata_mem.c
+++ b/arch/s390/appldata/appldata_mem.c
@@ -102,7 +102,7 @@ static void appldata_get_mem_data(void *data)
mem_data->totalhigh = P2K(val.totalhigh);
mem_data->freehigh = P2K(val.freehigh);
mem_data->bufferram = P2K(val.bufferram);
- mem_data->cached = P2K(global_page_state(NR_FILE_PAGES)
+ mem_data->cached = P2K(global_node_page_state(NR_FILE_PAGES)
- val.bufferram);

si_swapinfo(&val);
diff --git a/arch/tile/mm/pgtable.c b/arch/tile/mm/pgtable.c
index 2e784e84bd6f..dad42acd0f84 100644
--- a/arch/tile/mm/pgtable.c
+++ b/arch/tile/mm/pgtable.c
@@ -49,16 +49,16 @@ void show_mem(unsigned int filter)
global_node_page_state(NR_ACTIVE_FILE)),
(global_node_page_state(NR_INACTIVE_ANON) +
global_node_page_state(NR_INACTIVE_FILE)),
- global_page_state(NR_FILE_DIRTY),
- global_page_state(NR_WRITEBACK),
- global_page_state(NR_UNSTABLE_NFS),
+ global_node_page_state(NR_FILE_DIRTY),
+ global_node_page_state(NR_WRITEBACK),
+ global_node_page_state(NR_UNSTABLE_NFS),
global_page_state(NR_FREE_PAGES),
(global_page_state(NR_SLAB_RECLAIMABLE) +
global_page_state(NR_SLAB_UNRECLAIMABLE)),
global_node_page_state(NR_FILE_MAPPED),
global_page_state(NR_PAGETABLE),
global_page_state(NR_BOUNCE),
- global_page_state(NR_FILE_PAGES),
+ global_node_page_state(NR_FILE_PAGES),
get_nr_swap_pages());

for_each_zone(zone) {
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 4a83f3c9891a..552271e46578 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -116,18 +116,18 @@ static ssize_t node_read_meminfo(struct device *dev,
"Node %d AnonHugePages: %8lu kB\n"
#endif
,
- nid, K(sum_zone_node_page_state(nid, NR_FILE_DIRTY)),
- nid, K(sum_zone_node_page_state(nid, NR_WRITEBACK)),
- nid, K(sum_zone_node_page_state(nid, NR_FILE_PAGES)),
+ nid, K(node_page_state(pgdat, NR_FILE_DIRTY)),
+ nid, K(node_page_state(pgdat, NR_WRITEBACK)),
+ nid, K(node_page_state(pgdat, NR_FILE_PAGES)),
nid, K(node_page_state(pgdat, NR_FILE_MAPPED)),
- nid, K(node_page_state(pgdat, NR_ANON_PAGES)),
+ nid, K(node_page_state(pgdat, NR_ANON_MAPPED)),
nid, K(i.sharedram),
nid, sum_zone_node_page_state(nid, NR_KERNEL_STACK) *
THREAD_SIZE / 1024,
nid, K(sum_zone_node_page_state(nid, NR_PAGETABLE)),
- nid, K(sum_zone_node_page_state(nid, NR_UNSTABLE_NFS)),
+ nid, K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
nid, K(sum_zone_node_page_state(nid, NR_BOUNCE)),
- nid, K(sum_zone_node_page_state(nid, NR_WRITEBACK_TEMP)),
+ nid, K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
nid, K(sum_zone_node_page_state(nid, NR_SLAB_RECLAIMABLE) +
sum_zone_node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
nid, K(sum_zone_node_page_state(nid, NR_SLAB_RECLAIMABLE)),
diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
index 6463d9278229..e3aca64b6aca 100644
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -87,8 +87,8 @@ static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
short selected_oom_score_adj;
int array_size = ARRAY_SIZE(lowmem_adj);
int other_free = global_page_state(NR_FREE_PAGES) - totalreserve_pages;
- int other_file = global_page_state(NR_FILE_PAGES) -
- global_page_state(NR_SHMEM) -
+ int other_file = global_node_page_state(NR_FILE_PAGES) -
+ global_node_page_state(NR_SHMEM) -
total_swapcache_pages();

if (lowmem_adj_size < array_size)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 32a8bbd7a9ad..813d4ee67a03 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -836,8 +836,8 @@ static bool over_bground_thresh(struct backing_dev_info *bdi)

global_dirty_limits(&background_thresh, &dirty_thresh);

- if (global_page_state(NR_FILE_DIRTY) +
- global_page_state(NR_UNSTABLE_NFS) > background_thresh)
+ if (global_node_page_state(NR_FILE_DIRTY) +
+ global_node_page_state(NR_UNSTABLE_NFS) > background_thresh)
return true;

if (bdi_stat(bdi, BDI_RECLAIMABLE) >
@@ -991,8 +991,8 @@ get_next_work_item(struct backing_dev_info *bdi)
*/
static unsigned long get_nr_dirty_pages(void)
{
- return global_page_state(NR_FILE_DIRTY) +
- global_page_state(NR_UNSTABLE_NFS) +
+ return global_node_page_state(NR_FILE_DIRTY) +
+ global_node_page_state(NR_UNSTABLE_NFS) +
get_nr_dirty_inodes();
}

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index c01ec3bdcfd8..bcb58dfe2dd3 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1470,7 +1470,7 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
list_del(&req->writepages_entry);
for (i = 0; i < req->num_pages; i++) {
dec_bdi_stat(bdi, BDI_WRITEBACK);
- dec_zone_page_state(req->pages[i], NR_WRITEBACK_TEMP);
+ dec_node_page_state(req->pages[i], NR_WRITEBACK_TEMP);
bdi_writeout_inc(bdi);
}
wake_up(&fi->page_waitq);
@@ -1659,7 +1659,7 @@ static int fuse_writepage_locked(struct page *page)
req->inode = inode;

inc_bdi_stat(inode_to_bdi(inode), BDI_WRITEBACK);
- inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
+ inc_node_page_state(tmp_page, NR_WRITEBACK_TEMP);

spin_lock(&fc->lock);
list_add(&req->writepages_entry, &fi->writepages);
@@ -1774,7 +1774,7 @@ static bool fuse_writepage_in_flight(struct fuse_req *new_req,
spin_unlock(&fc->lock);

dec_bdi_stat(bdi, BDI_WRITEBACK);
- dec_zone_page_state(page, NR_WRITEBACK_TEMP);
+ dec_node_page_state(page, NR_WRITEBACK_TEMP);
bdi_writeout_inc(bdi);
fuse_writepage_free(fc, new_req);
fuse_request_free(new_req);
@@ -1873,7 +1873,7 @@ static int fuse_writepages_fill(struct page *page,
req->page_descs[req->num_pages].length = PAGE_SIZE;

inc_bdi_stat(inode_to_bdi(inode), BDI_WRITEBACK);
- inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
+ inc_node_page_state(tmp_page, NR_WRITEBACK_TEMP);

err = 0;
if (is_writeback && fuse_writepage_in_flight(req, page)) {
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 9e6475bc5ba2..1200f9dba3f8 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -606,7 +606,7 @@ void nfs_mark_page_unstable(struct page *page)
{
struct inode *inode = page_file_mapping(page)->host;

- inc_zone_page_state(page, NR_UNSTABLE_NFS);
+ inc_node_page_state(page, NR_UNSTABLE_NFS);
inc_bdi_stat(inode_to_bdi(inode), BDI_RECLAIMABLE);
__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
}
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 849ed784d6ac..ee1d2a51e86e 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -852,7 +852,7 @@ nfs_mark_request_commit(struct nfs_page *req, struct pnfs_layout_segment *lseg,
static void
nfs_clear_page_commit(struct page *page)
{
- dec_zone_page_state(page, NR_UNSTABLE_NFS);
+ dec_node_page_state(page, NR_UNSTABLE_NFS);
dec_bdi_stat(inode_to_bdi(page_file_mapping(page)->host), BDI_RECLAIMABLE);
}

diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 2072876cce7c..dc9fde883db4 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -44,7 +44,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
si_swapinfo(&i);
committed = percpu_counter_read_positive(&vm_committed_as);

- cached = global_page_state(NR_FILE_PAGES) -
+ cached = global_node_page_state(NR_FILE_PAGES) -
total_swapcache_pages() - i.bufferram;
if (cached < 0)
cached = 0;
@@ -171,8 +171,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
#endif
K(i.totalswap),
K(i.freeswap),
- K(global_page_state(NR_FILE_DIRTY)),
- K(global_page_state(NR_WRITEBACK)),
+ K(global_node_page_state(NR_FILE_DIRTY)),
+ K(global_node_page_state(NR_WRITEBACK)),
K(global_node_page_state(NR_ANON_MAPPED)),
K(global_node_page_state(NR_FILE_MAPPED)),
K(i.sharedram),
@@ -185,9 +185,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
#ifdef CONFIG_QUICKLIST
K(quicklist_total_size()),
#endif
- K(global_page_state(NR_UNSTABLE_NFS)),
+ K(global_node_page_state(NR_UNSTABLE_NFS)),
K(global_page_state(NR_BOUNCE)),
- K(global_page_state(NR_WRITEBACK_TEMP)),
+ K(global_node_page_state(NR_WRITEBACK_TEMP)),
K(vm_commit_limit()),
K(committed),
(unsigned long)VMALLOC_TOTAL >> 10,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4406f855d58e..34050b012409 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -116,20 +116,14 @@ enum zone_stat_item {
NR_FREE_PAGES,
NR_ALLOC_BATCH,
NR_MLOCK, /* mlock()ed pages found and moved off LRU */
- NR_FILE_PAGES,
- NR_FILE_DIRTY,
- NR_WRITEBACK,
NR_SLAB_RECLAIMABLE,
NR_SLAB_UNRECLAIMABLE,
NR_PAGETABLE, /* used for pagetables */
NR_KERNEL_STACK,
/* Second 128 byte cacheline */
- NR_UNSTABLE_NFS, /* NFS unstable pages */
NR_BOUNCE,
NR_VMSCAN_WRITE,
NR_VMSCAN_IMMEDIATE, /* Prioritise for reclaim when writeback ends */
- NR_WRITEBACK_TEMP, /* Writeback using temporary buffers */
- NR_SHMEM, /* shmem pages (included tmpfs/GEM pages) */
NR_DIRTIED, /* page dirtyings since bootup */
NR_WRITTEN, /* page writings since bootup */
#ifdef CONFIG_NUMA
@@ -160,6 +154,12 @@ enum node_stat_item {
NR_ANON_MAPPED, /* Mapped anonymous pages */
NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
only modified from process context */
+ NR_FILE_PAGES,
+ NR_FILE_DIRTY,
+ NR_WRITEBACK,
+ NR_WRITEBACK_TEMP, /* Writeback using temporary buffers */
+ NR_SHMEM, /* shmem pages (included tmpfs/GEM pages) */
+ NR_UNSTABLE_NFS, /* NFS unstable pages */
NR_VM_NODE_STAT_ITEMS
};

diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 5a14ead59696..e1f38ea62129 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -337,9 +337,9 @@ TRACE_EVENT(global_dirty_state,
),

TP_fast_assign(
- __entry->nr_dirty = global_page_state(NR_FILE_DIRTY);
- __entry->nr_writeback = global_page_state(NR_WRITEBACK);
- __entry->nr_unstable = global_page_state(NR_UNSTABLE_NFS);
+ __entry->nr_dirty = global_node_page_state(NR_FILE_DIRTY);
+ __entry->nr_writeback = global_node_page_state(NR_WRITEBACK);
+ __entry->nr_unstable = global_node_page_state(NR_UNSTABLE_NFS);
__entry->nr_dirtied = global_page_state(NR_DIRTIED);
__entry->nr_written = global_page_state(NR_WRITTEN);
__entry->background_thresh = background_thresh;
diff --git a/mm/filemap.c b/mm/filemap.c
index 12a47ccd8565..43cb39b5c24a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -197,9 +197,9 @@ void __delete_from_page_cache(struct page *page, void *shadow)
page->mapping = NULL;
/* Leave page->index set: truncation lookup relies upon it */

- __dec_zone_page_state(page, NR_FILE_PAGES);
+ __dec_node_page_state(page, NR_FILE_PAGES);
if (PageSwapBacked(page))
- __dec_zone_page_state(page, NR_SHMEM);
+ __dec_node_page_state(page, NR_SHMEM);
BUG_ON(page_mapped(page));

/*
@@ -210,7 +210,7 @@ void __delete_from_page_cache(struct page *page, void *shadow)
* having removed the page entirely.
*/
if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
- dec_zone_page_state(page, NR_FILE_DIRTY);
+ dec_node_page_state(page, NR_FILE_DIRTY);
dec_bdi_stat(inode_to_bdi(mapping->host), BDI_RECLAIMABLE);
}
}
@@ -485,9 +485,9 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
error = radix_tree_insert(&mapping->page_tree, offset, new);
BUG_ON(error);
mapping->nrpages++;
- __inc_zone_page_state(new, NR_FILE_PAGES);
+ __inc_node_page_state(new, NR_FILE_PAGES);
if (PageSwapBacked(new))
- __inc_zone_page_state(new, NR_SHMEM);
+ __inc_node_page_state(new, NR_SHMEM);
spin_unlock_irq(&mapping->tree_lock);
mem_cgroup_migrate(old, new, true);
radix_tree_preload_end();
@@ -577,7 +577,7 @@ static int __add_to_page_cache_locked(struct page *page,
radix_tree_preload_end();
if (unlikely(error))
goto err_insert;
- __inc_zone_page_state(page, NR_FILE_PAGES);
+ __inc_node_page_state(page, NR_FILE_PAGES);
spin_unlock_irq(&mapping->tree_lock);
if (!huge)
mem_cgroup_commit_charge(page, memcg, false);
diff --git a/mm/migrate.c b/mm/migrate.c
index 4a50bb7c06a6..ad58e7e33b1f 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -379,11 +379,11 @@ int migrate_page_move_mapping(struct address_space *mapping,
* via NR_FILE_PAGES and NR_ANON_MAPPED if they
* are mapped to swap space.
*/
- __dec_zone_page_state(page, NR_FILE_PAGES);
- __inc_zone_page_state(newpage, NR_FILE_PAGES);
+ __dec_node_page_state(page, NR_FILE_PAGES);
+ __inc_node_page_state(newpage, NR_FILE_PAGES);
if (!PageSwapCache(page) && PageSwapBacked(page)) {
- __dec_zone_page_state(page, NR_SHMEM);
- __inc_zone_page_state(newpage, NR_SHMEM);
+ __dec_node_page_state(page, NR_SHMEM);
+ __inc_node_page_state(newpage, NR_SHMEM);
}
spin_unlock_irq(&mapping->tree_lock);

diff --git a/mm/mmap.c b/mm/mmap.c
index 9ec50a368634..be87d208fd25 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -168,7 +168,7 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)

if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
free = global_page_state(NR_FREE_PAGES);
- free += global_page_state(NR_FILE_PAGES);
+ free += global_node_page_state(NR_FILE_PAGES);

/*
* shmem pages shouldn't be counted as free in this
@@ -176,7 +176,7 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
* that won't affect the overall amount of available
* memory in the system.
*/
- free -= global_page_state(NR_SHMEM);
+ free -= global_node_page_state(NR_SHMEM);

free += get_nr_swap_pages();

diff --git a/mm/nommu.c b/mm/nommu.c
index 3fba2dc97c44..b036f23080e0 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1930,7 +1930,7 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)

if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
free = global_page_state(NR_FREE_PAGES);
- free += global_page_state(NR_FILE_PAGES);
+ free += global_node_page_state(NR_FILE_PAGES);

/*
* shmem pages shouldn't be counted as free in this
@@ -1938,7 +1938,7 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
* that won't affect the overall amount of available
* memory in the system.
*/
- free -= global_page_state(NR_SHMEM);
+ free -= global_node_page_state(NR_SHMEM);

free += get_nr_swap_pages();

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 88e346f36f79..ad1ee405d970 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -320,20 +320,12 @@ static unsigned long node_dirty_limit(struct pglist_data *pgdat)
*/
bool node_dirty_ok(struct pglist_data *pgdat)
{
- int z;
unsigned long limit = node_dirty_limit(pgdat);
unsigned long nr_pages = 0;

- for (z = 0; z < MAX_NR_ZONES; z++) {
- struct zone *zone = pgdat->node_zones + z;
-
- if (!populated_zone(zone))
- continue;
-
- nr_pages += zone_page_state(zone, NR_FILE_DIRTY);
- nr_pages += zone_page_state(zone, NR_UNSTABLE_NFS);
- nr_pages += zone_page_state(zone, NR_WRITEBACK);
- }
+ nr_pages += node_page_state(pgdat, NR_FILE_DIRTY);
+ nr_pages += node_page_state(pgdat, NR_UNSTABLE_NFS);
+ nr_pages += node_page_state(pgdat, NR_WRITEBACK);

return nr_pages <= limit;
}
@@ -1381,9 +1373,9 @@ static void balance_dirty_pages(struct address_space *mapping,
* written to the server's write cache, but has not yet
* been flushed to permanent storage.
*/
- nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
- global_page_state(NR_UNSTABLE_NFS);
- nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
+ nr_reclaimable = global_node_page_state(NR_FILE_DIRTY) +
+ global_node_page_state(NR_UNSTABLE_NFS);
+ nr_dirty = nr_reclaimable + global_node_page_state(NR_WRITEBACK);

global_dirty_limits(&background_thresh, &dirty_thresh);

@@ -1645,8 +1637,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
*/
dirty_thresh += dirty_thresh / 10; /* wheeee... */

- if (global_page_state(NR_UNSTABLE_NFS) +
- global_page_state(NR_WRITEBACK) <= dirty_thresh)
+ if (global_node_page_state(NR_UNSTABLE_NFS) +
+ global_node_page_state(NR_WRITEBACK) <= dirty_thresh)
break;
congestion_wait(BLK_RW_ASYNC, HZ/10);

@@ -1674,8 +1666,8 @@ int dirty_writeback_centisecs_handler(struct ctl_table *table, int write,
void laptop_mode_timer_fn(unsigned long data)
{
struct request_queue *q = (struct request_queue *)data;
- int nr_pages = global_page_state(NR_FILE_DIRTY) +
- global_page_state(NR_UNSTABLE_NFS);
+ int nr_pages = global_node_page_state(NR_FILE_DIRTY) +
+ global_node_page_state(NR_UNSTABLE_NFS);

/*
* We want to write everything out, not just down to the dirty
@@ -2108,8 +2100,8 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
if (mapping_cap_account_dirty(mapping)) {
struct backing_dev_info *bdi = inode_to_bdi(mapping->host);

- __inc_zone_page_state(page, NR_FILE_DIRTY);
- __inc_zone_page_state(page, NR_DIRTIED);
+ __inc_node_page_state(page, NR_FILE_DIRTY);
+ __inc_node_page_state(page, NR_DIRTIED);
__inc_bdi_stat(bdi, BDI_RECLAIMABLE);
__inc_bdi_stat(bdi, BDI_DIRTIED);
task_io_account_write(PAGE_CACHE_SIZE);
@@ -2311,7 +2303,7 @@ int clear_page_dirty_for_io(struct page *page)
* exclusion.
*/
if (TestClearPageDirty(page)) {
- dec_zone_page_state(page, NR_FILE_DIRTY);
+ dec_node_page_state(page, NR_FILE_DIRTY);
dec_bdi_stat(inode_to_bdi(mapping->host),
BDI_RECLAIMABLE);
return 1;
@@ -2350,7 +2342,7 @@ int test_clear_page_writeback(struct page *page)
}
if (ret) {
mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
- dec_zone_page_state(page, NR_WRITEBACK);
+ dec_node_page_state(page, NR_WRITEBACK);
inc_zone_page_state(page, NR_WRITTEN);
}
mem_cgroup_end_page_stat(memcg);
@@ -2391,7 +2383,7 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
}
if (!ret) {
mem_cgroup_inc_page_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
- inc_zone_page_state(page, NR_WRITEBACK);
+ inc_node_page_state(page, NR_WRITEBACK);
}
mem_cgroup_end_page_stat(memcg);
return ret;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f5a376056ece..2ca5da938972 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3118,7 +3118,7 @@ static inline void show_node(struct zone *zone)
void si_meminfo(struct sysinfo *val)
{
val->totalram = totalram_pages;
- val->sharedram = global_page_state(NR_SHMEM);
+ val->sharedram = global_node_page_state(NR_SHMEM);
val->freeram = global_page_state(NR_FREE_PAGES);
val->bufferram = nr_blockdev_pages();
val->totalhigh = totalhigh_pages;
@@ -3138,7 +3138,7 @@ void si_meminfo_node(struct sysinfo *val, int nid)
for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++)
managed_pages += pgdat->node_zones[zone_type].managed_pages;
val->totalram = managed_pages;
- val->sharedram = sum_zone_node_page_state(nid, NR_SHMEM);
+ val->sharedram = node_page_state(pgdat, NR_SHMEM);
val->freeram = sum_zone_node_page_state(nid, NR_FREE_PAGES);
#ifdef CONFIG_HIGHMEM
val->totalhigh = pgdat->node_zones[ZONE_HIGHMEM].managed_pages;
@@ -3245,14 +3245,14 @@ void show_free_areas(unsigned int filter)
global_node_page_state(NR_INACTIVE_FILE),
global_node_page_state(NR_ISOLATED_FILE),
global_node_page_state(NR_UNEVICTABLE),
- global_page_state(NR_FILE_DIRTY),
- global_page_state(NR_WRITEBACK),
- global_page_state(NR_UNSTABLE_NFS),
+ global_node_page_state(NR_FILE_DIRTY),
+ global_node_page_state(NR_WRITEBACK),
+ global_node_page_state(NR_UNSTABLE_NFS),
global_page_state(NR_FREE_PAGES),
global_page_state(NR_SLAB_RECLAIMABLE),
global_page_state(NR_SLAB_UNRECLAIMABLE),
global_node_page_state(NR_FILE_MAPPED),
- global_page_state(NR_SHMEM),
+ global_node_page_state(NR_SHMEM),
global_page_state(NR_PAGETABLE),
global_page_state(NR_BOUNCE),
global_page_state(NR_FREE_CMA_PAGES));
@@ -3267,6 +3267,11 @@ void show_free_areas(unsigned int filter)
" isolated(anon):%lukB"
" isolated(file):%lukB"
" mapped:%lukB"
+ " dirty:%lukB"
+ " writeback:%lukB"
+ " shmem:%lukB"
+ " writeback_tmp:%lukB"
+ " unstable:%lukB"
" all_unreclaimable? %s"
"\n",
pgdat->node_id,
@@ -3278,6 +3283,11 @@ void show_free_areas(unsigned int filter)
K(node_page_state(pgdat, NR_ISOLATED_ANON)),
K(node_page_state(pgdat, NR_ISOLATED_FILE)),
K(node_page_state(pgdat, NR_FILE_MAPPED)),
+ K(node_page_state(pgdat, NR_FILE_DIRTY)),
+ K(node_page_state(pgdat, NR_WRITEBACK)),
+ K(node_page_state(pgdat, NR_SHMEM)),
+ K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
+ K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
!pgdat_reclaimable(pgdat) ? "yes" : "no");
}

@@ -3295,17 +3305,12 @@ void show_free_areas(unsigned int filter)
" present:%lukB"
" managed:%lukB"
" mlocked:%lukB"
- " dirty:%lukB"
- " writeback:%lukB"
- " shmem:%lukB"
" slab_reclaimable:%lukB"
" slab_unreclaimable:%lukB"
" kernel_stack:%lukB"
" pagetables:%lukB"
- " unstable:%lukB"
" bounce:%lukB"
" free_cma:%lukB"
- " writeback_tmp:%lukB"
" node_pages_scanned:%lu"
"\n",
zone->name,
@@ -3316,18 +3321,13 @@ void show_free_areas(unsigned int filter)
K(zone->present_pages),
K(zone->managed_pages),
K(zone_page_state(zone, NR_MLOCK)),
- K(zone_page_state(zone, NR_FILE_DIRTY)),
- K(zone_page_state(zone, NR_WRITEBACK)),
- K(zone_page_state(zone, NR_SHMEM)),
K(zone_page_state(zone, NR_SLAB_RECLAIMABLE)),
K(zone_page_state(zone, NR_SLAB_UNRECLAIMABLE)),
zone_page_state(zone, NR_KERNEL_STACK) *
THREAD_SIZE / 1024,
K(zone_page_state(zone, NR_PAGETABLE)),
- K(zone_page_state(zone, NR_UNSTABLE_NFS)),
K(zone_page_state(zone, NR_BOUNCE)),
K(zone_page_state(zone, NR_FREE_CMA_PAGES)),
- K(zone_page_state(zone, NR_WRITEBACK_TEMP)),
K(node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED)));
printk("lowmem_reserve[]:");
for (i = 0; i < MAX_NR_ZONES; i++)
@@ -3369,7 +3369,7 @@ void show_free_areas(unsigned int filter)

hugetlb_show_meminfo();

- printk("%ld total pagecache pages\n", global_page_state(NR_FILE_PAGES));
+ printk("%ld total pagecache pages\n", global_node_page_state(NR_FILE_PAGES));

show_swap_cache_info();
}
diff --git a/mm/shmem.c b/mm/shmem.c
index cf2d0ca010bc..8f73a97599a6 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -310,8 +310,8 @@ static int shmem_add_to_page_cache(struct page *page,
page);
if (!error) {
mapping->nrpages++;
- __inc_zone_page_state(page, NR_FILE_PAGES);
- __inc_zone_page_state(page, NR_SHMEM);
+ __inc_node_page_state(page, NR_FILE_PAGES);
+ __inc_node_page_state(page, NR_SHMEM);
spin_unlock_irq(&mapping->tree_lock);
} else {
page->mapping = NULL;
@@ -333,8 +333,8 @@ static void shmem_delete_from_page_cache(struct page *page, void *radswap)
error = shmem_radix_tree_replace(mapping, page->index, page, radswap);
page->mapping = NULL;
mapping->nrpages--;
- __dec_zone_page_state(page, NR_FILE_PAGES);
- __dec_zone_page_state(page, NR_SHMEM);
+ __dec_node_page_state(page, NR_FILE_PAGES);
+ __dec_node_page_state(page, NR_SHMEM);
spin_unlock_irq(&mapping->tree_lock);
page_cache_release(page);
BUG_ON(error);
@@ -995,8 +995,8 @@ static int shmem_replace_page(struct page **pagep, gfp_t gfp,
error = shmem_radix_tree_replace(swap_mapping, swap_index, oldpage,
newpage);
if (!error) {
- __inc_zone_page_state(newpage, NR_FILE_PAGES);
- __dec_zone_page_state(oldpage, NR_FILE_PAGES);
+ __inc_node_page_state(newpage, NR_FILE_PAGES);
+ __dec_node_page_state(oldpage, NR_FILE_PAGES);
}
spin_unlock_irq(&swap_mapping->tree_lock);

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 405923f77334..caa8ebca3996 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -95,7 +95,7 @@ int __add_to_swap_cache(struct page *page, swp_entry_t entry)
entry.val, page);
if (likely(!error)) {
address_space->nrpages++;
- __inc_zone_page_state(page, NR_FILE_PAGES);
+ __inc_node_page_state(page, NR_FILE_PAGES);
INC_CACHE_INFO(add_total);
}
spin_unlock_irq(&address_space->tree_lock);
@@ -147,7 +147,7 @@ void __delete_from_swap_cache(struct page *page)
set_page_private(page, 0);
ClearPageSwapCache(page);
address_space->nrpages--;
- __dec_zone_page_state(page, NR_FILE_PAGES);
+ __dec_node_page_state(page, NR_FILE_PAGES);
INC_CACHE_INFO(del_total);
}

diff --git a/mm/truncate.c b/mm/truncate.c
index ddec5a5966d7..77393b97d9ac 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -111,7 +111,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
if (TestClearPageDirty(page)) {
struct address_space *mapping = page->mapping;
if (mapping && mapping_cap_account_dirty(mapping)) {
- dec_zone_page_state(page, NR_FILE_DIRTY);
+ dec_node_page_state(page, NR_FILE_DIRTY);
dec_bdi_stat(inode_to_bdi(mapping->host),
BDI_RECLAIMABLE);
if (account_size)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1391fd15a7ec..2a3050d7dc95 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3547,11 +3547,11 @@ int sysctl_min_unmapped_ratio = 1;
*/
int sysctl_min_slab_ratio = 5;

-static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
+static inline unsigned long node_unmapped_file_pages(struct pglist_data *pgdat)
{
- unsigned long file_mapped = node_page_state(zone->zone_pgdat, NR_FILE_MAPPED);
- unsigned long file_lru = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE) +
- node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE);
+ unsigned long file_mapped = node_page_state(pgdat, NR_FILE_MAPPED);
+ unsigned long file_lru = node_page_state(pgdat, NR_INACTIVE_FILE) +
+ node_page_state(pgdat, NR_ACTIVE_FILE);

/*
* It's possible for there to be more file mapped pages than
@@ -3570,17 +3570,17 @@ static long zone_pagecache_reclaimable(struct zone *zone)
/*
* If RECLAIM_SWAP is set, then all file pages are considered
* potentially reclaimable. Otherwise, we have to worry about
- * pages like swapcache and zone_unmapped_file_pages() provides
+ * pages like swapcache and node_unmapped_file_pages() provides
* a better estimate
*/
if (zone_reclaim_mode & RECLAIM_SWAP)
- nr_pagecache_reclaimable = zone_page_state(zone, NR_FILE_PAGES);
+ nr_pagecache_reclaimable = node_page_state(zone->zone_pgdat, NR_FILE_PAGES);
else
- nr_pagecache_reclaimable = zone_unmapped_file_pages(zone);
+ nr_pagecache_reclaimable = node_unmapped_file_pages(zone->zone_pgdat);

/* If we can't clean pages, remove dirty pages from consideration */
if (!(zone_reclaim_mode & RECLAIM_WRITE))
- delta += zone_page_state(zone, NR_FILE_DIRTY);
+ delta += node_page_state(zone->zone_pgdat, NR_FILE_DIRTY);

/* Watch for any possible underflows due to delta */
if (unlikely(delta > nr_pagecache_reclaimable))
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4aa4fb09d078..4a9f73c4140b 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -895,19 +895,13 @@ const char * const vmstat_text[] = {
"nr_free_pages",
"nr_alloc_batch",
"nr_mlock",
- "nr_file_pages",
- "nr_dirty",
- "nr_writeback",
"nr_slab_reclaimable",
"nr_slab_unreclaimable",
"nr_page_table_pages",
"nr_kernel_stack",
- "nr_unstable",
"nr_bounce",
"nr_vmscan_write",
"nr_vmscan_immediate_reclaim",
- "nr_writeback_temp",
- "nr_shmem",
"nr_dirtied",
"nr_written",

@@ -936,6 +930,12 @@ const char * const vmstat_text[] = {
"workingset_nodereclaim",
"nr_anon_pages",
"nr_mapped",
+ "nr_file_pages",
+ "nr_dirty",
+ "nr_writeback",
+ "nr_writeback_temp",
+ "nr_shmem",
+ "nr_unstable",

/* enum writeback_stat_item counters */
"nr_dirty_threshold",
--
2.3.5

2015-06-08 13:59:41

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 16/25] mm, vmscan: Update classzone_idx if buffer_heads_over_limit

If buffer heads are over the limit then the direct reclaim gfp_mask is
promoted to include __GFP_HIGHMEM so that lowmem is indirectly freed. With
node-based reclaim, the classzone_idx must also be updated or highmem
pages will be skipped.
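
A toy model (names are illustrative, not kernel code) of why the
classzone_idx needs to be recomputed when the gfp_mask is widened:

#include <stdio.h>

enum { TOY_ZONE_DMA, TOY_ZONE_NORMAL, TOY_ZONE_HIGHMEM };
#define TOY_GFP_HIGHMEM 0x1

/* Simplified stand-in for gfp_zone(): the highest zone the mask allows. */
static int toy_gfp_zone(unsigned int gfp_mask)
{
        return (gfp_mask & TOY_GFP_HIGHMEM) ? TOY_ZONE_HIGHMEM : TOY_ZONE_NORMAL;
}

int main(void)
{
        unsigned int gfp_mask = 0;      /* original request wants lowmem */
        int classzone_idx = toy_gfp_zone(gfp_mask);
        int page_zone_idx = TOY_ZONE_HIGHMEM;

        printf("before: skip highmem page? %s\n",
               page_zone_idx > classzone_idx ? "yes" : "no");

        gfp_mask |= TOY_GFP_HIGHMEM;    /* buffer_heads_over_limit case */
        classzone_idx = toy_gfp_zone(gfp_mask);
        printf("after:  skip highmem page? %s\n",
               page_zone_idx > classzone_idx ? "yes" : "no");
        return 0;
}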

Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2a3050d7dc95..140aeefdebe1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2497,8 +2497,10 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc,
* highmem pages could be pinning lowmem pages storing buffer_heads
*/
orig_mask = sc->gfp_mask;
- if (buffer_heads_over_limit)
+ if (buffer_heads_over_limit) {
sc->gfp_mask |= __GFP_HIGHMEM;
+ classzone_idx = gfp_zone(sc->gfp_mask);
+ }

for_each_zone_zonelist_nodemask(zone, z, zonelist,
classzone_idx, sc->nodemask) {
--
2.3.5

2015-06-08 13:59:14

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 17/25] mm, vmscan: Check if cpusets are enabled during direct reclaim

Direct reclaim obeys cpusets but misses the cpusets_enabled() check.
The overhead avoided is unlikely to be measurable in the direct reclaim
path, which is expensive anyway, but there is no harm in doing it.
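
A toy sketch of the short-circuit being added (names are illustrative, not
the kernel API): a cheap "is the feature in use at all" test guards the
comparatively expensive per-zone check, so the latter never runs when
cpusets are unused:

#include <stdbool.h>
#include <stdio.h>

static bool toy_cpusets_enabled(void)
{
        return false;   /* pretend no cpusets are configured */
}

static bool toy_cpuset_zone_allowed(int zone)
{
        printf("expensive check ran for zone %d\n", zone);
        return true;
}

int main(void)
{
        int zone;

        for (zone = 0; zone < 3; zone++) {
                if (toy_cpusets_enabled() && !toy_cpuset_zone_allowed(zone))
                        continue;       /* zone disallowed by cpuset */
                printf("considering zone %d\n", zone);
        }
        return 0;
}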

Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 140aeefdebe1..e1fbd89ab750 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2522,7 +2522,7 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc,
* to global LRU.
*/
if (global_reclaim(sc)) {
- if (!cpuset_zone_allowed(zone,
+ if (cpusets_enabled() && !cpuset_zone_allowed(zone,
GFP_KERNEL | __GFP_HARDWALL))
continue;

--
2.3.5

2015-06-08 13:59:27

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 18/25] mm, vmscan: Only wakeup kswapd once per node for the requested classzone

kswapd is woken when zones are below the low watermark but the wakeup
decision does not take the requested classzone into account. Now that
reclaim is node-based, it is only necessary to wake kswapd once per node
and only if all zones are unbalanced for the requested classzone.

Note that one node might be checked multiple times but there is no cheap
way of tracking which nodes have already been visited for zonelists that
may be ordered by either zone or node.
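
A toy sketch of the new wakeup rule (illustrative only, not kernel code):
kswapd is woken only if every zone up to the requested classzone is
unbalanced; one balanced zone is enough to skip the wakeup:

#include <stdbool.h>
#include <stdio.h>

#define TOY_NR_ZONES 3

static bool toy_zone_balanced[TOY_NR_ZONES] = { false, true, false };

static void toy_wakeup_kswapd(int classzone_idx)
{
        int z;

        for (z = 0; z <= classzone_idx; z++) {
                if (toy_zone_balanced[z]) {
                        printf("zone %d is balanced, not waking kswapd\n", z);
                        return;
                }
        }
        printf("all zones up to %d unbalanced, waking kswapd\n", classzone_idx);
}

int main(void)
{
        toy_wakeup_kswapd(TOY_NR_ZONES - 1);
        return 0;
}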

Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e1fbd89ab750..69916bb9acba 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3381,6 +3381,7 @@ static int kswapd(void *p)
void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
{
pg_data_t *pgdat;
+ int z;
bool dummy;

if (!populated_zone(zone))
@@ -3395,8 +3396,14 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
}
if (!waitqueue_active(&pgdat->kswapd_wait))
return;
- if (zone_balanced(zone, order, 0, 0, &dummy))
- return;
+
+ /* Only wake kswapd if all zones are unbalanced */
+ for (z = 0; z <= zone_idx(zone); z++) {
+ zone = pgdat->node_zones + z;
+
+ if (zone_balanced(zone, order, 0, classzone_idx, &dummy))
+ return;
+ }

trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
wake_up_interruptible(&pgdat->kswapd_wait);
--
2.3.5

2015-06-08 13:59:20

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 19/25] mm, vmscan: Account in vmstat for pages skipped during reclaim

Low reclaim efficiency occurs when many pages are scanned that cannot
be reclaimed. This occurs for example when pages are dirty or under
writeback. Node-based LRU reclaim introduces a new source: reclaim for
allocation requests that require lower zones will skip pages belonging
to higher zones. This patch adds vmstat counters to count pages that
were skipped because the calling context could not use pages from that
zone. It will help distinguish one source of low reclaim efficiency.
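
Once this lands, the skips should be visible from userspace. A minimal
sketch (assuming the counters appear in /proc/vmstat with a "pgskip_"
prefix per zone, as the vmstat_text addition suggests; illustrative only):

#include <stdio.h>
#include <string.h>

int main(void)
{
        char line[128];
        FILE *fp = fopen("/proc/vmstat", "r");

        if (!fp)
                return 1;
        while (fgets(line, sizeof(line), fp))
                if (!strncmp(line, "pgskip_", 7))
                        fputs(line, stdout);
        fclose(fp);
        return 0;
}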

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/vm_event_item.h | 1 +
mm/vmscan.c | 6 +++++-
mm/vmstat.c | 2 ++
3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 4ce4d59d361e..95cdd56c65bf 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -25,6 +25,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
FOR_ALL_ZONES(PGALLOC),
PGFREE, PGACTIVATE, PGDEACTIVATE,
PGFAULT, PGMAJFAULT,
+ FOR_ALL_ZONES(PGSCAN_SKIP),
PGREFILL,
PGSTEAL_KSWAPD,
PGSTEAL_DIRECT,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 69916bb9acba..3cb0cc70ddbd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1326,6 +1326,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,

for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
struct page *page;
+ struct zone *zone;
int nr_pages;

page = lru_to_page(src);
@@ -1333,8 +1334,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,

VM_BUG_ON_PAGE(!PageLRU(page), page);

- if (page_zone_id(page) > sc->reclaim_idx)
+ zone = page_zone(page);
+ if (page_zone_id(page) > sc->reclaim_idx) {
list_move(&page->lru, &pages_skipped);
+ __count_zone_vm_events(PGSCAN_SKIP, page_zone(page), 1);
+ }

switch (__isolate_lru_page(page, mode)) {
case 0:
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4a9f73c4140b..d805df47d3ae 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -957,6 +957,8 @@ const char * const vmstat_text[] = {
"pgfault",
"pgmajfault",

+ TEXTS_FOR_ZONES("pgskip")
+
"pgrefill",
"pgsteal_kswapd",
"pgsteal_direct",
--
2.3.5

2015-06-08 13:58:55

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 20/25] mm, page_alloc: Remove fair zone allocation policy

The fair zone allocation policy interleaves allocation requests between
zones to avoid an age inversion problem whereby new pages are reclaimed
to balance a zone. Reclaim is now node-based so this should no longer be
an issue and, as the fair zone allocation policy is not free, this patch
removes it.
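
A toy model of the policy being removed (illustrative only, not kernel
code): each local zone gets a batch proportional to its size, allocations
skip zones whose batch is depleted, and the batches are reset once all of
them are exhausted:

#include <stdio.h>

#define TOY_NR_ZONES 2

/* Batches proportional to (toy) zone sizes: zone 0 is 3x larger. */
static int toy_batch[TOY_NR_ZONES] = { 3, 1 };

static int toy_fair_alloc(void)
{
        int z;

        for (z = 0; z < TOY_NR_ZONES; z++) {
                if (toy_batch[z] > 0) {
                        toy_batch[z]--;
                        return z;
                }
        }
        /* all batches depleted: reset and start another fair pass */
        toy_batch[0] = 3;
        toy_batch[1] = 1;
        return toy_fair_alloc();
}

int main(void)
{
        int i;

        for (i = 0; i < 8; i++)
                printf("allocation %d served from zone %d\n", i, toy_fair_alloc());
        return 0;
}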

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 2 --
mm/internal.h | 1 -
mm/page_alloc.c | 69 +-------------------------------------------------
mm/vmstat.c | 1 -
4 files changed, 1 insertion(+), 72 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 34050b012409..c551f70951fa 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -114,7 +114,6 @@ struct zone_padding {
enum zone_stat_item {
/* First 128 byte cacheline (assuming 64 bit words) */
NR_FREE_PAGES,
- NR_ALLOC_BATCH,
NR_MLOCK, /* mlock()ed pages found and moved off LRU */
NR_SLAB_RECLAIMABLE,
NR_SLAB_UNRECLAIMABLE,
@@ -521,7 +520,6 @@ struct zone {
enum zone_flags {
ZONE_RECLAIM_LOCKED, /* prevents concurrent reclaim */
ZONE_OOM_LOCKED, /* zone is in OOM killer zonelist */
- ZONE_FAIR_DEPLETED, /* fair zone policy batch depleted */
};

enum pgdat_flags {
diff --git a/mm/internal.h b/mm/internal.h
index 2e4cee6a8739..a24c4a50c33f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -429,6 +429,5 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
#define ALLOC_HIGH 0x20 /* __GFP_HIGH set */
#define ALLOC_CPUSET 0x40 /* check for correct cpuset */
#define ALLOC_CMA 0x80 /* allow allocations from CMA areas */
-#define ALLOC_FAIR 0x100 /* fair zone allocation */

#endif /* __MM_INTERNAL_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2ca5da938972..6b3a78420a5e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1698,11 +1698,6 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
get_freepage_migratetype(page));
}

- __mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
- if (atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]) <= 0 &&
- !test_bit(ZONE_FAIR_DEPLETED, &zone->flags))
- set_bit(ZONE_FAIR_DEPLETED, &zone->flags);
-
__count_zone_vm_events(PGALLOC, zone, 1 << order);
zone_statistics(preferred_zone, zone, gfp_flags);
local_irq_restore(flags);
@@ -1967,11 +1962,6 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
}

-static bool zone_local(struct zone *local_zone, struct zone *zone)
-{
- return local_zone->node == zone->node;
-}
-
static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
{
return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <
@@ -1999,11 +1989,6 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
{
}

-static bool zone_local(struct zone *local_zone, struct zone *zone)
-{
- return true;
-}
-
static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
{
return true;
@@ -2011,18 +1996,6 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)

#endif /* CONFIG_NUMA */

-static void reset_alloc_batches(struct zone *preferred_zone)
-{
- struct zone *zone = preferred_zone->zone_pgdat->node_zones;
-
- do {
- mod_zone_page_state(zone, NR_ALLOC_BATCH,
- high_wmark_pages(zone) - low_wmark_pages(zone) -
- atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
- clear_bit(ZONE_FAIR_DEPLETED, &zone->flags);
- } while (zone++ != preferred_zone);
-}
-
/*
* get_page_from_freelist goes through the zonelist trying to allocate
* a page.
@@ -2040,7 +2013,6 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
int did_zlc_setup = 0; /* just call zlc_setup() one time */
bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) &&
(gfp_mask & __GFP_WRITE);
- int nr_fair_skipped = 0;
bool zonelist_rescan;
struct pglist_data *last_pgdat = NULL;

@@ -2063,20 +2035,6 @@ zonelist_scan:
!cpuset_zone_allowed(zone, gfp_mask))
continue;
/*
- * Distribute pages in proportion to the individual
- * zone size to ensure fair page aging. The zone a
- * page was allocated in should have no effect on the
- * time the page has in memory before being reclaimed.
- */
- if (alloc_flags & ALLOC_FAIR) {
- if (!zone_local(ac->preferred_zone, zone))
- break;
- if (test_bit(ZONE_FAIR_DEPLETED, &zone->flags)) {
- nr_fair_skipped++;
- continue;
- }
- }
- /*
* When allocating a page cache page for writing, we
* want to get it from a zone that is within its dirty
* limit, such that no single zone holds more than its
@@ -2186,24 +2144,6 @@ this_zone_full:
zlc_mark_zone_full(zonelist, z);
}

- /*
- * The first pass makes sure allocations are spread fairly within the
- * local node. However, the local node might have free pages left
- * after the fairness batches are exhausted, and remote zones haven't
- * even been considered yet. Try once more without fairness, and
- * include remote zones now, before entering the slowpath and waking
- * kswapd: prefer spilling to a remote zone over swapping locally.
- */
- if (alloc_flags & ALLOC_FAIR) {
- alloc_flags &= ~ALLOC_FAIR;
- if (nr_fair_skipped) {
- zonelist_rescan = true;
- reset_alloc_batches(ac->preferred_zone);
- }
- if (nr_online_nodes > 1)
- zonelist_rescan = true;
- }
-
if (unlikely(IS_ENABLED(CONFIG_NUMA) && zlc_active)) {
/* Disable zlc cache for second zonelist scan */
zlc_active = 0;
@@ -2808,7 +2748,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
struct zoneref *preferred_zoneref;
struct page *page = NULL;
unsigned int cpuset_mems_cookie;
- int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
+ int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET;
gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
struct alloc_context ac = {
.high_zoneidx = gfp_zone(gfp_mask),
@@ -4950,9 +4890,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
zone_seqlock_init(zone);
zone_pcp_init(zone);

- /* For bootup, initialized properly in watermark setup */
- mod_zone_page_state(zone, NR_ALLOC_BATCH, zone->managed_pages);
-
if (!size)
continue;

@@ -5751,10 +5688,6 @@ static void __setup_per_zone_wmarks(void)
zone->watermark[WMARK_LOW] = min_wmark_pages(zone) + (tmp >> 2);
zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + (tmp >> 1);

- __mod_zone_page_state(zone, NR_ALLOC_BATCH,
- high_wmark_pages(zone) - low_wmark_pages(zone) -
- atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
-
setup_zone_migrate_reserve(zone);
spin_unlock_irqrestore(&zone->lock, flags);
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index d805df47d3ae..c3fdd88961ff 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -893,7 +893,6 @@ int fragmentation_index(struct zone *zone, unsigned int order)
const char * const vmstat_text[] = {
/* enum zone_stat_item countes */
"nr_free_pages",
- "nr_alloc_batch",
"nr_mlock",
"nr_slab_reclaimable",
"nr_slab_unreclaimable",
--
2.3.5

2015-06-08 13:58:35

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 21/25] mm, page_alloc: Defer zlc_setup until it is known it is required

The zonelist cache (zlc) records whether zone_reclaim() is necessary but it
is set up before checking whether zone_reclaim is even enabled. This patch
defers the setup until after zone_reclaim is checked.
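
A toy sketch of the reordering (names are illustrative, not the kernel
API): the cheap bail-out is checked first and the one-time setup is only
paid for when the path that needs it is actually taken:

#include <stdbool.h>
#include <stdio.h>

static bool toy_cache_ready;

static void toy_expensive_setup(void)
{
        printf("setup ran\n");
        toy_cache_ready = true;
}

static void toy_consider_zone(bool reclaim_enabled)
{
        if (!reclaim_enabled)
                return;                 /* cheap check first, no setup cost */
        if (!toy_cache_ready)
                toy_expensive_setup();  /* deferred until actually needed */
        printf("consulting cache\n");
}

int main(void)
{
        toy_consider_zone(false);       /* setup never runs */
        toy_consider_zone(true);        /* setup runs once, on demand */
        toy_consider_zone(true);
        return 0;
}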

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6b3a78420a5e..637b293cd5d1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2076,6 +2076,10 @@ zonelist_scan:
if (alloc_flags & ALLOC_NO_WATERMARKS)
goto try_this_zone;

+ if (zone_reclaim_mode == 0 ||
+ !zone_allows_reclaim(ac->preferred_zone, zone))
+ goto this_zone_full;
+
if (IS_ENABLED(CONFIG_NUMA) &&
!did_zlc_setup && nr_online_nodes > 1) {
/*
@@ -2088,10 +2092,6 @@ zonelist_scan:
did_zlc_setup = 1;
}

- if (zone_reclaim_mode == 0 ||
- !zone_allows_reclaim(ac->preferred_zone, zone))
- goto this_zone_full;
-
/*
* As we may have just activated ZLC, check if the first
* eligible zone has failed zone_reclaim recently.
--
2.3.5

2015-06-08 13:59:07

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 22/25] mm: Convert zone_reclaim to node_reclaim

As reclaim is now node-based, convert zone_reclaim to node_reclaim and
avoid reclaiming a node multiple times just because it has multiple
populated zones. The documentation and interface to userspace are
unchanged; from a configuration and behaviour perspective it will be
similar unless node-local allocation requests were also limited to lower
zones.
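
The sysctl keeps its existing name; only the kernel-internal variable is
renamed. A minimal userspace sketch (assuming the usual procfs path;
illustrative only) showing the interface is unchanged:

#include <stdio.h>

int main(void)
{
        int mode;
        FILE *fp = fopen("/proc/sys/vm/zone_reclaim_mode", "r");

        if (!fp)
                return 1;
        if (fscanf(fp, "%d", &mode) == 1)
                printf("zone_reclaim_mode = %d\n", mode);
        fclose(fp);
        return 0;
}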

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 18 +++++------
include/linux/swap.h | 9 +++---
include/linux/topology.h | 2 +-
kernel/sysctl.c | 4 +--
mm/huge_memory.c | 4 +--
mm/internal.h | 8 ++---
mm/page_alloc.c | 35 +++++++++++++++-------
mm/vmscan.c | 77 ++++++++++++++++++++++++------------------------
8 files changed, 85 insertions(+), 72 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c551f70951fa..84fcb7aafb2b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -362,14 +362,6 @@ struct zone {
unsigned long *pageblock_flags;
#endif /* CONFIG_SPARSEMEM */

-#ifdef CONFIG_NUMA
- /*
- * zone reclaim becomes active if more unmapped pages exist.
- */
- unsigned long min_unmapped_pages;
- unsigned long min_slab_pages;
-#endif /* CONFIG_NUMA */
-
/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
unsigned long zone_start_pfn;

@@ -518,7 +510,6 @@ struct zone {
} ____cacheline_internodealigned_in_smp;

enum zone_flags {
- ZONE_RECLAIM_LOCKED, /* prevents concurrent reclaim */
ZONE_OOM_LOCKED, /* zone is in OOM killer zonelist */
};

@@ -533,6 +524,7 @@ enum pgdat_flags {
PGDAT_WRITEBACK, /* reclaim scanning has recently found
* many pages under writeback
*/
+ PGDAT_RECLAIM_LOCKED, /* prevents concurrent reclaim */
};

static inline unsigned long zone_end_pfn(const struct zone *zone)
@@ -758,6 +750,14 @@ typedef struct pglist_data {
*/
unsigned long dirty_balance_reserve;

+#ifdef CONFIG_NUMA
+ /*
+ * zone reclaim becomes active if more unmapped pages exist.
+ */
+ unsigned long min_unmapped_pages;
+ unsigned long min_slab_pages;
+#endif /* CONFIG_NUMA */
+
/* Write-intensive fields used from the page allocator */
ZONE_PADDING(_pad1_)
spinlock_t lru_lock;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index bb9597213e39..59d70fd04ec8 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -333,13 +333,14 @@ extern int remove_mapping(struct address_space *mapping, struct page *page);
extern unsigned long vm_total_pages;

#ifdef CONFIG_NUMA
-extern int zone_reclaim_mode;
+extern int node_reclaim_mode;
extern int sysctl_min_unmapped_ratio;
extern int sysctl_min_slab_ratio;
-extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
+extern int node_reclaim(struct pglist_data *, gfp_t, unsigned int);
#else
-#define zone_reclaim_mode 0
-static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
+#define node_reclaim_mode 0
+static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask,
+ unsigned int order)
{
return 0;
}
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 909b6e43b694..55a9b2bbb4de 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -58,7 +58,7 @@ int arch_update_cpu_topology(void);
/*
* If the distance between nodes in a system is larger than RECLAIM_DISTANCE
* (in whatever arch specific measurement units returned by node_distance())
- * and zone_reclaim_mode is enabled then the VM will only call zone_reclaim()
+ * and node_reclaim_mode is enabled then the VM will only call node_reclaim()
* on nodes within this distance.
*/
#define RECLAIM_DISTANCE 30
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index ce410bb9f2e1..f80921283f06 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1394,8 +1394,8 @@ static struct ctl_table vm_table[] = {
#ifdef CONFIG_NUMA
{
.procname = "zone_reclaim_mode",
- .data = &zone_reclaim_mode,
- .maxlen = sizeof(zone_reclaim_mode),
+ .data = &node_reclaim_mode,
+ .maxlen = sizeof(node_reclaim_mode),
.mode = 0644,
.proc_handler = proc_dointvec,
.extra1 = &zero,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b56c14a41d96..a5c4e36f200c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2249,10 +2249,10 @@ static bool khugepaged_scan_abort(int nid)
int i;

/*
- * If zone_reclaim_mode is disabled, then no extra effort is made to
+ * If node_reclaim_mode is disabled, then no extra effort is made to
* allocate memory locally.
*/
- if (!zone_reclaim_mode)
+ if (!node_reclaim_mode)
return false;

/* If there is a count for this node already, it must be acceptable */
diff --git a/mm/internal.h b/mm/internal.h
index a24c4a50c33f..a0b0d20ead97 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -395,10 +395,10 @@ static inline void mminit_validate_memmodel_limits(unsigned long *start_pfn,
}
#endif /* CONFIG_SPARSEMEM */

-#define ZONE_RECLAIM_NOSCAN -2
-#define ZONE_RECLAIM_FULL -1
-#define ZONE_RECLAIM_SOME 0
-#define ZONE_RECLAIM_SUCCESS 1
+#define NODE_RECLAIM_NOSCAN -2
+#define NODE_RECLAIM_FULL -1
+#define NODE_RECLAIM_SOME 0
+#define NODE_RECLAIM_SUCCESS 1

extern int hwpoison_filter(struct page *p);

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 637b293cd5d1..47e6332d7566 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2064,7 +2064,6 @@ zonelist_scan:
!node_dirty_ok(zone->zone_pgdat)) {
continue;
}
- last_pgdat = zone->zone_pgdat;

mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
if (!zone_watermark_ok(zone, order, mark,
@@ -2076,7 +2075,7 @@ zonelist_scan:
if (alloc_flags & ALLOC_NO_WATERMARKS)
goto try_this_zone;

- if (zone_reclaim_mode == 0 ||
+ if (node_reclaim_mode == 0 ||
!zone_allows_reclaim(ac->preferred_zone, zone))
goto this_zone_full;

@@ -2094,18 +2093,22 @@ zonelist_scan:

/*
* As we may have just activated ZLC, check if the first
- * eligible zone has failed zone_reclaim recently.
+ * eligible zone has failed node_reclaim recently.
*/
if (IS_ENABLED(CONFIG_NUMA) && zlc_active &&
!zlc_zone_worth_trying(zonelist, z, allowednodes))
continue;

- ret = zone_reclaim(zone, gfp_mask, order);
+ /* Skip if we have already attemped node_reclaim */
+ if (last_pgdat == zone->zone_pgdat)
+ goto try_this_zone;
+
+ ret = node_reclaim(zone->zone_pgdat, gfp_mask, order);
switch (ret) {
- case ZONE_RECLAIM_NOSCAN:
+ case NODE_RECLAIM_NOSCAN:
/* did not scan */
continue;
- case ZONE_RECLAIM_FULL:
+ case NODE_RECLAIM_FULL:
/* scanned but unreclaimable */
continue;
default:
@@ -2124,7 +2127,7 @@ zonelist_scan:
* min watermarks.
*/
if (((alloc_flags & ALLOC_WMARK_MASK) == ALLOC_WMARK_MIN) ||
- ret == ZONE_RECLAIM_SOME)
+ ret == NODE_RECLAIM_SOME)
goto this_zone_full;

continue;
@@ -2132,6 +2135,7 @@ zonelist_scan:
}

try_this_zone:
+ last_pgdat = zone->zone_pgdat;
page = buffered_rmqueue(ac->preferred_zone, zone, order,
gfp_mask, ac->migratetype);
if (page) {
@@ -2140,6 +2144,7 @@ try_this_zone:
return page;
}
this_zone_full:
+ last_pgdat = zone->zone_pgdat;
if (IS_ENABLED(CONFIG_NUMA) && zlc_active)
zlc_mark_zone_full(zonelist, z);
}
@@ -4879,9 +4884,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
zone->managed_pages = is_highmem_idx(j) ? realsize : freesize;
#ifdef CONFIG_NUMA
zone->node = nid;
- zone->min_unmapped_pages = (freesize*sysctl_min_unmapped_ratio)
+ pgdat->min_unmapped_pages += (freesize*sysctl_min_unmapped_ratio)
/ 100;
- zone->min_slab_pages = (freesize * sysctl_min_slab_ratio) / 100;
+ pgdat->min_slab_pages += (freesize * sysctl_min_slab_ratio) / 100;
#endif
zone->name = zone_names[j];
zone->zone_pgdat = pgdat;
@@ -5839,6 +5844,7 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *table, int write,
int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length, loff_t *ppos)
{
+ struct pglist_data *pgdat;
struct zone *zone;
int rc;

@@ -5846,8 +5852,11 @@ int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *table, int write,
if (rc)
return rc;

+ for_each_online_pgdat(pgdat)
+ pgdat->min_slab_pages = 0;
+
for_each_zone(zone)
- zone->min_unmapped_pages = (zone->managed_pages *
+ zone->zone_pgdat->min_unmapped_pages += (zone->managed_pages *
sysctl_min_unmapped_ratio) / 100;
return 0;
}
@@ -5855,6 +5864,7 @@ int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *table, int write,
int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length, loff_t *ppos)
{
+ struct pglist_data *pgdat;
struct zone *zone;
int rc;

@@ -5862,8 +5872,11 @@ int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *table, int write,
if (rc)
return rc;

+ for_each_online_pgdat(pgdat)
+ pgdat->min_slab_pages = 0;
+
for_each_zone(zone)
- zone->min_slab_pages = (zone->managed_pages *
+ zone->zone_pgdat->min_slab_pages += (zone->managed_pages *
sysctl_min_slab_ratio) / 100;
return 0;
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3cb0cc70ddbd..cf9ae51c9a5c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3529,12 +3529,12 @@ module_init(kswapd_init)

#ifdef CONFIG_NUMA
/*
- * Zone reclaim mode
+ * Node reclaim mode
*
- * If non-zero call zone_reclaim when the number of free pages falls below
+ * If non-zero call node_reclaim when the number of free pages falls below
* the watermarks.
*/
-int zone_reclaim_mode __read_mostly;
+int node_reclaim_mode __read_mostly;

#define RECLAIM_OFF 0
#define RECLAIM_ZONE (1<<0) /* Run shrink_inactive_list on the zone */
@@ -3542,14 +3542,14 @@ int zone_reclaim_mode __read_mostly;
#define RECLAIM_SWAP (1<<2) /* Swap pages out during reclaim */

/*
- * Priority for ZONE_RECLAIM. This determines the fraction of pages
+ * Priority for NODE_RECLAIM. This determines the fraction of pages
* of a node considered for each zone_reclaim. 4 scans 1/16th of
* a zone.
*/
-#define ZONE_RECLAIM_PRIORITY 4
+#define NODE_RECLAIM_PRIORITY 4

/*
- * Percentage of pages in a zone that must be unmapped for zone_reclaim to
+ * Percentage of pages in a zone that must be unmapped for node_reclaim to
* occur.
*/
int sysctl_min_unmapped_ratio = 1;
@@ -3575,9 +3575,9 @@ static inline unsigned long node_unmapped_file_pages(struct pglist_data *pgdat)
}

/* Work out how many page cache pages we can reclaim in this reclaim_mode */
-static long zone_pagecache_reclaimable(struct zone *zone)
+static long node_pagecache_reclaimable(struct pglist_data *pgdat)
{
- long nr_pagecache_reclaimable;
+ long nr_pagecache_reclaimable = 0;
long delta = 0;

/*
@@ -3586,14 +3586,14 @@ static long zone_pagecache_reclaimable(struct zone *zone)
* pages like swapcache and node_unmapped_file_pages() provides
* a better estimate
*/
- if (zone_reclaim_mode & RECLAIM_SWAP)
- nr_pagecache_reclaimable = node_page_state(zone->zone_pgdat, NR_FILE_PAGES);
+ if (node_reclaim_mode & RECLAIM_SWAP)
+ nr_pagecache_reclaimable = node_page_state(pgdat, NR_FILE_PAGES);
else
- nr_pagecache_reclaimable = node_unmapped_file_pages(zone->zone_pgdat);
+ nr_pagecache_reclaimable = node_unmapped_file_pages(pgdat);

/* If we can't clean pages, remove dirty pages from consideration */
- if (!(zone_reclaim_mode & RECLAIM_WRITE))
- delta += node_page_state(zone->zone_pgdat, NR_FILE_DIRTY);
+ if (!(node_reclaim_mode & RECLAIM_WRITE))
+ delta += node_page_state(pgdat, NR_FILE_DIRTY);

/* Watch for any possible underflows due to delta */
if (unlikely(delta > nr_pagecache_reclaimable))
@@ -3603,21 +3603,22 @@ static long zone_pagecache_reclaimable(struct zone *zone)
}

/*
- * Try to free up some pages from this zone through reclaim.
+ * Try to free up some pages from this node through reclaim.
*/
-static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
+static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
{
/* Minimum pages needed in order to stay on node */
const unsigned long nr_pages = 1 << order;
struct task_struct *p = current;
struct reclaim_state reclaim_state;
+ int classzone_idx = gfp_zone(gfp_mask);
struct scan_control sc = {
.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
.order = order,
- .priority = ZONE_RECLAIM_PRIORITY,
- .may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
- .may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
+ .priority = NODE_RECLAIM_PRIORITY,
+ .may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
+ .may_unmap = !!(node_reclaim_mode & RECLAIM_SWAP),
.may_swap = 1,
};

@@ -3632,13 +3633,13 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;

- if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {
+ if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages) {
/*
* Free memory by calling shrink zone with increasing
* priorities until we have enough memory freed.
*/
do {
- shrink_node(zone->zone_pgdat, &sc, zone_idx(zone), zone_idx(zone));
+ shrink_node(pgdat, &sc, classzone_idx, classzone_idx);
} while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0);
}

@@ -3648,49 +3649,47 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
return sc.nr_reclaimed >= nr_pages;
}

-int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
+int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
{
- int node_id;
int ret;

/*
- * Zone reclaim reclaims unmapped file backed pages and
+ * Node reclaim reclaims unmapped file backed pages and
* slab pages if we are over the defined limits.
*
* A small portion of unmapped file backed pages is needed for
* file I/O otherwise pages read by file I/O will be immediately
- * thrown out if the zone is overallocated. So we do not reclaim
- * if less than a specified percentage of the zone is used by
+ * thrown out if the node is overallocated. So we do not reclaim
+ * if less than a specified percentage of the node is used by
* unmapped file backed pages.
*/
- if (zone_pagecache_reclaimable(zone) <= zone->min_unmapped_pages &&
- zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
- return ZONE_RECLAIM_FULL;
+ if (node_pagecache_reclaimable(pgdat) <= pgdat->min_unmapped_pages &&
+ sum_zone_node_page_state(pgdat->node_id, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages)
+ return NODE_RECLAIM_FULL;

- if (!pgdat_reclaimable(zone->zone_pgdat))
- return ZONE_RECLAIM_FULL;
+ if (!pgdat_reclaimable(pgdat))
+ return NODE_RECLAIM_FULL;

/*
* Do not scan if the allocation should not be delayed.
*/
if (!(gfp_mask & __GFP_WAIT) || (current->flags & PF_MEMALLOC))
- return ZONE_RECLAIM_NOSCAN;
+ return NODE_RECLAIM_NOSCAN;

/*
- * Only run zone reclaim on the local zone or on zones that do not
+ * Only run node reclaim on the local node or on nodes that do not
* have associated processors. This will favor the local processor
* over remote processors and spread off node memory allocations
* as wide as possible.
*/
- node_id = zone_to_nid(zone);
- if (node_state(node_id, N_CPU) && node_id != numa_node_id())
- return ZONE_RECLAIM_NOSCAN;
+ if (node_state(pgdat->node_id, N_CPU) && pgdat->node_id != numa_node_id())
+ return NODE_RECLAIM_NOSCAN;

- if (test_and_set_bit(ZONE_RECLAIM_LOCKED, &zone->flags))
- return ZONE_RECLAIM_NOSCAN;
+ if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
+ return NODE_RECLAIM_NOSCAN;

- ret = __zone_reclaim(zone, gfp_mask, order);
- clear_bit(ZONE_RECLAIM_LOCKED, &zone->flags);
+ ret = __node_reclaim(pgdat, gfp_mask, order);
+ clear_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags);

if (!ret)
count_vm_event(PGSCAN_ZONE_RECLAIM_FAILED);
--
2.3.5

2015-06-08 13:58:24

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 23/25] mm, page_alloc: Delete the zonelist_cache

The zonelist cache (zlc) was introduced to quickly skip over full zones
when zone_reclaim was active. This made some sense before there were zone
flags and when zone_reclaim was the default. Now reclaim is node-based and
zone_reclaim is disabled by default. If it were being implemented today,
it would use a pgdat flag with a timeout to reset it, but as zone_reclaim
is extraordinarily expensive, the entire concept is of dubious merit. This
patch deletes it entirely.
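
As an aside, a rough sketch of that pgdat-flag-with-a-timeout alternative
might look like the following. It is illustrative only and not part of
this series; PGDAT_TEMP_FULL and pgdat->last_full_zap are names invented
for the sketch.

/*
 * Hypothetical hint (not implemented): mark a node "full" for up to a
 * second so that allocations skip it cheaply, then reconsider it.
 */
static bool pgdat_recently_full(struct pglist_data *pgdat)
{
        if (!test_bit(PGDAT_TEMP_FULL, &pgdat->flags))
                return false;

        /* The hint has expired, reconsider the node */
        if (time_after(jiffies, pgdat->last_full_zap + HZ)) {
                clear_bit(PGDAT_TEMP_FULL, &pgdat->flags);
                return false;
        }

        return true;
}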

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 78 +-----------------
mm/page_alloc.c | 210 +------------------------------------------------
mm/vmstat.c | 8 +-
3 files changed, 8 insertions(+), 288 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 84fcb7aafb2b..89a88ba096f3 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -558,84 +558,14 @@ static inline bool zone_is_empty(struct zone *zone)
#define MAX_ZONES_PER_ZONELIST (MAX_NUMNODES * MAX_NR_ZONES)

#ifdef CONFIG_NUMA
-
/*
* The NUMA zonelists are doubled because we need zonelists that restrict the
* allocations to a single node for __GFP_THISNODE.
*
- * [0] : Zonelist with fallback
- * [1] : No fallback (__GFP_THISNODE)
+ * [0] : Zonelist with fallback
+ * [1] : No fallback (__GFP_THISNODE)
*/
#define MAX_ZONELISTS 2
-
-
-/*
- * We cache key information from each zonelist for smaller cache
- * footprint when scanning for free pages in get_page_from_freelist().
- *
- * 1) The BITMAP fullzones tracks which zones in a zonelist have come
- * up short of free memory since the last time (last_fullzone_zap)
- * we zero'd fullzones.
- * 2) The array z_to_n[] maps each zone in the zonelist to its node
- * id, so that we can efficiently evaluate whether that node is
- * set in the current tasks mems_allowed.
- *
- * Both fullzones and z_to_n[] are one-to-one with the zonelist,
- * indexed by a zones offset in the zonelist zones[] array.
- *
- * The get_page_from_freelist() routine does two scans. During the
- * first scan, we skip zones whose corresponding bit in 'fullzones'
- * is set or whose corresponding node in current->mems_allowed (which
- * comes from cpusets) is not set. During the second scan, we bypass
- * this zonelist_cache, to ensure we look methodically at each zone.
- *
- * Once per second, we zero out (zap) fullzones, forcing us to
- * reconsider nodes that might have regained more free memory.
- * The field last_full_zap is the time we last zapped fullzones.
- *
- * This mechanism reduces the amount of time we waste repeatedly
- * reexaming zones for free memory when they just came up low on
- * memory momentarilly ago.
- *
- * The zonelist_cache struct members logically belong in struct
- * zonelist. However, the mempolicy zonelists constructed for
- * MPOL_BIND are intentionally variable length (and usually much
- * shorter). A general purpose mechanism for handling structs with
- * multiple variable length members is more mechanism than we want
- * here. We resort to some special case hackery instead.
- *
- * The MPOL_BIND zonelists don't need this zonelist_cache (in good
- * part because they are shorter), so we put the fixed length stuff
- * at the front of the zonelist struct, ending in a variable length
- * zones[], as is needed by MPOL_BIND.
- *
- * Then we put the optional zonelist cache on the end of the zonelist
- * struct. This optional stuff is found by a 'zlcache_ptr' pointer in
- * the fixed length portion at the front of the struct. This pointer
- * both enables us to find the zonelist cache, and in the case of
- * MPOL_BIND zonelists, (which will just set the zlcache_ptr to NULL)
- * to know that the zonelist cache is not there.
- *
- * The end result is that struct zonelists come in two flavors:
- * 1) The full, fixed length version, shown below, and
- * 2) The custom zonelists for MPOL_BIND.
- * The custom MPOL_BIND zonelists have a NULL zlcache_ptr and no zlcache.
- *
- * Even though there may be multiple CPU cores on a node modifying
- * fullzones or last_full_zap in the same zonelist_cache at the same
- * time, we don't lock it. This is just hint data - if it is wrong now
- * and then, the allocator will still function, perhaps a bit slower.
- */
-
-
-struct zonelist_cache {
- unsigned short z_to_n[MAX_ZONES_PER_ZONELIST]; /* zone->nid */
- DECLARE_BITMAP(fullzones, MAX_ZONES_PER_ZONELIST); /* zone full? */
- unsigned long last_full_zap; /* when last zap'd (jiffies) */
-};
-#else
-#define MAX_ZONELISTS 1
-struct zonelist_cache;
#endif

/*
@@ -665,11 +595,7 @@ struct zoneref {
* zonelist_node_idx() - Return the index of the node for an entry
*/
struct zonelist {
- struct zonelist_cache *zlcache_ptr; // NULL or &zlcache
struct zoneref _zonerefs[MAX_ZONES_PER_ZONELIST + 1];
-#ifdef CONFIG_NUMA
- struct zonelist_cache zlcache; // optional ...
-#endif
};

#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 47e6332d7566..4108743eb801 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1846,154 +1846,16 @@ bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
}

#ifdef CONFIG_NUMA
-/*
- * zlc_setup - Setup for "zonelist cache". Uses cached zone data to
- * skip over zones that are not allowed by the cpuset, or that have
- * been recently (in last second) found to be nearly full. See further
- * comments in mmzone.h. Reduces cache footprint of zonelist scans
- * that have to skip over a lot of full or unallowed zones.
- *
- * If the zonelist cache is present in the passed zonelist, then
- * returns a pointer to the allowed node mask (either the current
- * tasks mems_allowed, or node_states[N_MEMORY].)
- *
- * If the zonelist cache is not available for this zonelist, does
- * nothing and returns NULL.
- *
- * If the fullzones BITMAP in the zonelist cache is stale (more than
- * a second since last zap'd) then we zap it out (clear its bits.)
- *
- * We hold off even calling zlc_setup, until after we've checked the
- * first zone in the zonelist, on the theory that most allocations will
- * be satisfied from that first zone, so best to examine that zone as
- * quickly as we can.
- */
-static nodemask_t *zlc_setup(struct zonelist *zonelist, int alloc_flags)
-{
- struct zonelist_cache *zlc; /* cached zonelist speedup info */
- nodemask_t *allowednodes; /* zonelist_cache approximation */
-
- zlc = zonelist->zlcache_ptr;
- if (!zlc)
- return NULL;
-
- if (time_after(jiffies, zlc->last_full_zap + HZ)) {
- bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
- zlc->last_full_zap = jiffies;
- }
-
- allowednodes = !in_interrupt() && (alloc_flags & ALLOC_CPUSET) ?
- &cpuset_current_mems_allowed :
- &node_states[N_MEMORY];
- return allowednodes;
-}
-
-/*
- * Given 'z' scanning a zonelist, run a couple of quick checks to see
- * if it is worth looking at further for free memory:
- * 1) Check that the zone isn't thought to be full (doesn't have its
- * bit set in the zonelist_cache fullzones BITMAP).
- * 2) Check that the zones node (obtained from the zonelist_cache
- * z_to_n[] mapping) is allowed in the passed in allowednodes mask.
- * Return true (non-zero) if zone is worth looking at further, or
- * else return false (zero) if it is not.
- *
- * This check -ignores- the distinction between various watermarks,
- * such as GFP_HIGH, GFP_ATOMIC, PF_MEMALLOC, ... If a zone is
- * found to be full for any variation of these watermarks, it will
- * be considered full for up to one second by all requests, unless
- * we are so low on memory on all allowed nodes that we are forced
- * into the second scan of the zonelist.
- *
- * In the second scan we ignore this zonelist cache and exactly
- * apply the watermarks to all zones, even it is slower to do so.
- * We are low on memory in the second scan, and should leave no stone
- * unturned looking for a free page.
- */
-static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zoneref *z,
- nodemask_t *allowednodes)
-{
- struct zonelist_cache *zlc; /* cached zonelist speedup info */
- int i; /* index of *z in zonelist zones */
- int n; /* node that zone *z is on */
-
- zlc = zonelist->zlcache_ptr;
- if (!zlc)
- return 1;
-
- i = z - zonelist->_zonerefs;
- n = zlc->z_to_n[i];
-
- /* This zone is worth trying if it is allowed but not full */
- return node_isset(n, *allowednodes) && !test_bit(i, zlc->fullzones);
-}
-
-/*
- * Given 'z' scanning a zonelist, set the corresponding bit in
- * zlc->fullzones, so that subsequent attempts to allocate a page
- * from that zone don't waste time re-examining it.
- */
-static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
-{
- struct zonelist_cache *zlc; /* cached zonelist speedup info */
- int i; /* index of *z in zonelist zones */
-
- zlc = zonelist->zlcache_ptr;
- if (!zlc)
- return;
-
- i = z - zonelist->_zonerefs;
-
- set_bit(i, zlc->fullzones);
-}
-
-/*
- * clear all zones full, called after direct reclaim makes progress so that
- * a zone that was recently full is not skipped over for up to a second
- */
-static void zlc_clear_zones_full(struct zonelist *zonelist)
-{
- struct zonelist_cache *zlc; /* cached zonelist speedup info */
-
- zlc = zonelist->zlcache_ptr;
- if (!zlc)
- return;
-
- bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
-}
-
static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
{
return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <
RECLAIM_DISTANCE;
}
-
#else /* CONFIG_NUMA */
-
-static nodemask_t *zlc_setup(struct zonelist *zonelist, int alloc_flags)
-{
- return NULL;
-}
-
-static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zoneref *z,
- nodemask_t *allowednodes)
-{
- return 1;
-}
-
-static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
-{
-}
-
-static void zlc_clear_zones_full(struct zonelist *zonelist)
-{
-}
-
static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
{
return true;
}
-
#endif /* CONFIG_NUMA */

/*
@@ -2008,17 +1870,10 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
struct zoneref *z;
struct page *page = NULL;
struct zone *zone;
- nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
- int zlc_active = 0; /* set if using zonelist_cache */
- int did_zlc_setup = 0; /* just call zlc_setup() one time */
bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) &&
(gfp_mask & __GFP_WRITE);
- bool zonelist_rescan;
struct pglist_data *last_pgdat = NULL;

-zonelist_scan:
- zonelist_rescan = false;
-
/*
* Scan zonelist, looking for a zone with enough free.
* See also __cpuset_node_allowed() comment in kernel/cpuset.c.
@@ -2027,9 +1882,6 @@ zonelist_scan:
ac->nodemask) {
unsigned long mark;

- if (IS_ENABLED(CONFIG_NUMA) && zlc_active &&
- !zlc_zone_worth_trying(zonelist, z, allowednodes))
- continue;
if (cpusets_enabled() &&
(alloc_flags & ALLOC_CPUSET) &&
!cpuset_zone_allowed(zone, gfp_mask))
@@ -2079,26 +1931,6 @@ zonelist_scan:
!zone_allows_reclaim(ac->preferred_zone, zone))
goto this_zone_full;

- if (IS_ENABLED(CONFIG_NUMA) &&
- !did_zlc_setup && nr_online_nodes > 1) {
- /*
- * we do zlc_setup if there are multiple nodes
- * and before considering the first zone allowed
- * by the cpuset.
- */
- allowednodes = zlc_setup(zonelist, alloc_flags);
- zlc_active = 1;
- did_zlc_setup = 1;
- }
-
- /*
- * As we may have just activated ZLC, check if the first
- * eligible zone has failed node_reclaim recently.
- */
- if (IS_ENABLED(CONFIG_NUMA) && zlc_active &&
- !zlc_zone_worth_trying(zonelist, z, allowednodes))
- continue;
-
/* Skip if we have already attempted node_reclaim */
if (last_pgdat == zone->zone_pgdat)
goto try_this_zone;
@@ -2145,19 +1977,8 @@ try_this_zone:
}
this_zone_full:
last_pgdat = zone->zone_pgdat;
- if (IS_ENABLED(CONFIG_NUMA) && zlc_active)
- zlc_mark_zone_full(zonelist, z);
}

- if (unlikely(IS_ENABLED(CONFIG_NUMA) && zlc_active)) {
- /* Disable zlc cache for second zonelist scan */
- zlc_active = 0;
- zonelist_rescan = true;
- }
-
- if (zonelist_rescan)
- goto zonelist_scan;
-
return NULL;
}

@@ -2440,10 +2261,6 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
if (unlikely(!(*did_some_progress)))
return NULL;

- /* After successful reclaim, reconsider all zones for allocation */
- if (IS_ENABLED(CONFIG_NUMA))
- zlc_clear_zones_full(ac->zonelist);
-
retry:
page = get_page_from_freelist(gfp_mask, order,
alloc_flags & ~ALLOC_NO_WATERMARKS, ac);
@@ -2757,7 +2574,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
struct alloc_context ac = {
.high_zoneidx = gfp_zone(gfp_mask),
- .nodemask = nodemask,
+ .nodemask = nodemask ? : &cpuset_current_mems_allowed,
.migratetype = gfpflags_to_migratetype(gfp_mask),
};

@@ -2788,8 +2605,7 @@ retry_cpuset:
ac.zonelist = zonelist;
/* The preferred zone is used for statistics later */
preferred_zoneref = first_zones_zonelist(ac.zonelist, ac.high_zoneidx,
- ac.nodemask ? : &cpuset_current_mems_allowed,
- &ac.preferred_zone);
+ ac.nodemask, &ac.preferred_zone);
if (!ac.preferred_zone)
goto out;
ac.classzone_idx = zonelist_zone_idx(preferred_zoneref);
@@ -3674,20 +3490,6 @@ static void build_zonelists(pg_data_t *pgdat)
build_thisnode_zonelists(pgdat);
}

-/* Construct the zonelist performance cache - see further mmzone.h */
-static void build_zonelist_cache(pg_data_t *pgdat)
-{
- struct zonelist *zonelist;
- struct zonelist_cache *zlc;
- struct zoneref *z;
-
- zonelist = &pgdat->node_zonelists[0];
- zonelist->zlcache_ptr = zlc = &zonelist->zlcache;
- bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
- for (z = zonelist->_zonerefs; z->zone; z++)
- zlc->z_to_n[z - zonelist->_zonerefs] = zonelist_node_idx(z);
-}
-
#ifdef CONFIG_HAVE_MEMORYLESS_NODES
/*
* Return node id of node used for "local" allocations.
@@ -3748,12 +3550,6 @@ static void build_zonelists(pg_data_t *pgdat)
zonelist->_zonerefs[j].zone_idx = 0;
}

-/* non-NUMA variant of zonelist performance cache - just NULL zlcache_ptr */
-static void build_zonelist_cache(pg_data_t *pgdat)
-{
- pgdat->node_zonelists[0].zlcache_ptr = NULL;
-}
-
#endif /* CONFIG_NUMA */

/*
@@ -3794,14 +3590,12 @@ static int __build_all_zonelists(void *data)

if (self && !node_online(self->node_id)) {
build_zonelists(self);
- build_zonelist_cache(self);
}

for_each_online_node(nid) {
pg_data_t *pgdat = NODE_DATA(nid);

build_zonelists(pgdat);
- build_zonelist_cache(pgdat);
}

/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c3fdd88961ff..30b17cf07197 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1500,14 +1500,14 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
v[i] = global_page_state(i);
v += NR_VM_ZONE_STAT_ITEMS;

- for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
- v[i] = global_node_page_state(i);
- v += NR_VM_NODE_STAT_ITEMS;
-
global_dirty_limits(v + NR_DIRTY_BG_THRESHOLD,
v + NR_DIRTY_THRESHOLD);
v += NR_VM_WRITEBACK_STAT_ITEMS;

+ for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
+ v[i] = global_node_page_state(i);
+ v += NR_VM_NODE_STAT_ITEMS;
+
#ifdef CONFIG_VM_EVENT_COUNTERS
all_vm_events(v);
v[PGPGIN] /= 2; /* sectors -> kbytes */
--
2.3.5

2015-06-08 13:58:44

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 24/25] mm, page_alloc: Use ac->classzone_idx instead of zone_idx(preferred_zone)

ac->classzone_idx is the index of the preferred zone, cached to avoid
repeated calculation. wake_all_kswapds() should use it instead of calling
zone_idx(ac->preferred_zone) on every iteration of its zone loop.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4108743eb801..886102cc9b09 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2307,7 +2307,7 @@ static void wake_all_kswapds(unsigned int order, const struct alloc_context *ac)

for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
ac->high_zoneidx, ac->nodemask)
- wakeup_kswapd(zone, order, zone_idx(ac->preferred_zone));
+ wakeup_kswapd(zone, order, ac->classzone_idx);
}

static inline int
--
2.3.5

2015-06-08 13:58:13

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 25/25] mm: page_alloc: Take fewer passes when allocating to the low watermark

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 886102cc9b09..58f6330ec3e2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1927,6 +1927,27 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
if (alloc_flags & ALLOC_NO_WATERMARKS)
goto try_this_zone;

+ /*
+ * If checking the low watermark, see if we meet the
+ * min watermark and if so, try the zone and wake
+ * kswapd instead of falling back to a remote zone
+ * or having to take a second pass
+ */
+ if (alloc_flags & ALLOC_WMARK_LOW) {
+ int min_flags = alloc_flags;
+
+ min_flags &= ~ALLOC_WMARK_LOW;
+ min_flags |= ALLOC_WMARK_MIN;
+
+ if (zone_watermark_ok(zone, order,
+ zone->watermark[min_flags & ALLOC_WMARK_MASK],
+ ac->classzone_idx,
+ min_flags)) {
+ wakeup_kswapd(zone, order, ac->classzone_idx);
+ goto try_this_zone;
+ }
+ }
+
if (node_reclaim_mode == 0 ||
!zone_allows_reclaim(ac->preferred_zone, zone))
goto this_zone_full;
--
2.3.5

2015-06-11 07:13:10

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCH 03/25] mm, vmscan: Move LRU lists to node

> @@ -774,6 +764,21 @@ typedef struct pglist_data {
> ZONE_PADDING(_pad1_)
> spinlock_t lru_lock;
>
> + /* Fields commonly accessed by the page reclaim scanner */
> + struct lruvec lruvec;
> +
> + /* Evictions & activations on the inactive file list */
> + atomic_long_t inactive_age;
> +
> + /*
> + * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
> + * this zone's LRU. Maintained by the pageout code.
> + */

The comment has to be updated.

> + unsigned int inactive_ratio;
> +
> + unsigned long flags;
> +
> + ZONE_PADDING(_pad2_)
> struct per_cpu_nodestat __percpu *per_cpu_nodestats;
> atomic_long_t vm_stat[NR_VM_NODE_STAT_ITEMS];
> } pg_data_t;
> @@ -1185,7 +1185,7 @@ struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
> struct lruvec *lruvec;
>
> if (mem_cgroup_disabled()) {
> - lruvec = &zone->lruvec;
> + lruvec = zone_lruvec(zone);
> goto out;
> }
>
> @@ -1197,8 +1197,8 @@ out:
> * we have to be prepared to initialize lruvec->zone here;
> * and if offlined then reonlined, we need to reinitialize it.
> */
> - if (unlikely(lruvec->zone != zone))
> - lruvec->zone = zone;
> + if (unlikely(lruvec->pgdat != zone->zone_pgdat))
> + lruvec->pgdat = zone->zone_pgdat;

See below please.

> return lruvec;
> }
>
> @@ -1211,14 +1211,14 @@ out:
> * and putback protocol: the LRU lock must be held, and the page must
> * either be PageLRU() or the caller must have isolated/allocated it.
> */
> -struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
> +struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgdat)
> {
> struct mem_cgroup_per_zone *mz;
> struct mem_cgroup *memcg;
> struct lruvec *lruvec;
>
> if (mem_cgroup_disabled()) {
> - lruvec = &zone->lruvec;
> + lruvec = &pgdat->lruvec;
> goto out;
> }
>
> @@ -1238,8 +1238,8 @@ out:
> * we have to be prepared to initialize lruvec->zone here;
> * and if offlined then reonlined, we need to reinitialize it.
> */
> - if (unlikely(lruvec->zone != zone))
> - lruvec->zone = zone;
> + if (unlikely(lruvec->pgdat != pgdat))
> + lruvec->pgdat = pgdat;

Given &pgdat->lruvec, we no longer need (or are able) to set lruvec->pgdat.

> return lruvec;
> }

2015-06-11 07:59:15

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCH 04/25] mm, vmscan: Begin reclaiming pages on a per-node basis

> @@ -1319,6 +1322,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> struct list_head *src = &lruvec->lists[lru];
> unsigned long nr_taken = 0;
> unsigned long scan;
> + LIST_HEAD(pages_skipped);
>
> for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
> struct page *page;
> @@ -1329,6 +1333,9 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>
> VM_BUG_ON_PAGE(!PageLRU(page), page);
>
> + if (page_zone_id(page) > sc->reclaim_idx)
> + list_move(&page->lru, &pages_skipped);
> +
> switch (__isolate_lru_page(page, mode)) {
> case 0:
> nr_pages = hpage_nr_pages(page);
> @@ -1347,6 +1354,15 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> }
> }
>
> + /*
> + * Splice any skipped pages to the start of the LRU list. Note that
> + * this disrupts the LRU order when reclaiming for lower zones but
> + * we cannot splice to the tail. If we did then the SWAP_CLUSTER_MAX
> + * scanning would soon rescan the same pages to skip and put the
> + * system at risk of premature OOM.
> + */
> + if (!list_empty(&pages_skipped))
> + list_splice(&pages_skipped, src);
> *nr_scanned = scan;
> trace_mm_vmscan_lru_isolate(sc->order, nr_to_scan, scan,
> nr_taken, mode, is_file_lru(lru));

Can we avoid splicing pages by skipping those pages without incrementing scan?

Hillf

2015-06-12 07:05:57

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCH 07/25] mm, vmscan: Make kswapd think of reclaim in terms of nodes

> - /* Reclaim above the high watermark. */
> - sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
> + /* Aim to reclaim above all the zone high watermarks */
> + for (z = 0; z <= end_zone; z++) {
> + zone = pgdat->node_zones + end_zone;
s/end_zone/z/ ?
> + nr_to_reclaim += high_wmark_pages(zone);
>
[...]
> @@ -3280,13 +3177,26 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> compact_pgdat(pgdat, order);
>
> /*
> + * Stop reclaiming if any eligible zone is balanced and clear
> + * node writeback or congested.
> + */
> + for (i = 0; i <= *classzone_idx; i++) {
> + zone = pgdat->node_zones + i;
> +
> + if (zone_balanced(zone, sc.order, 0, *classzone_idx)) {
> + clear_bit(PGDAT_CONGESTED, &pgdat->flags);
> + clear_bit(PGDAT_DIRTY, &pgdat->flags);
> + break;
s/break/goto out/ ?
> + }
> + }
> +
> + /*
> * Raise priority if scanning rate is too low or there was no
> * progress in reclaiming pages
> */
> if (raise_priority || !sc.nr_reclaimed)
> sc.priority--;
> - } while (sc.priority >= 1 &&
> - !pgdat_balanced(pgdat, order, *classzone_idx));
> + } while (sc.priority >= 1);
>
> out:
> /*
> --
> 2.3.5

2015-06-12 08:56:16

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCH 19/25] mm, vmscan: Account in vmstat for pages skipped during reclaim

> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1326,6 +1326,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>
> for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
> struct page *page;
> + struct zone *zone;
> int nr_pages;
>
> page = lru_to_page(src);
> @@ -1333,8 +1334,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>
> VM_BUG_ON_PAGE(!PageLRU(page), page);
>
> - if (page_zone_id(page) > sc->reclaim_idx)
> + zone = page_zone(page);
> + if (page_zone_id(page) > sc->reclaim_idx) {
> list_move(&page->lru, &pages_skipped);
> + __count_zone_vm_events(PGSCAN_SKIP, page_zone(page), 1);
> + }
The newly added zone is not used.
>
> switch (__isolate_lru_page(page, mode)) {
> case 0:

2015-06-15 08:19:57

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 03/25] mm, vmscan: Move LRU lists to node

On Thu, Jun 11, 2015 at 03:12:12PM +0800, Hillf Danton wrote:
> > @@ -774,6 +764,21 @@ typedef struct pglist_data {
> > ZONE_PADDING(_pad1_)
> > spinlock_t lru_lock;
> >
> > + /* Fields commonly accessed by the page reclaim scanner */
> > + struct lruvec lruvec;
> > +
> > + /* Evictions & activations on the inactive file list */
> > + atomic_long_t inactive_age;
> > +
> > + /*
> > + * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
> > + * this zone's LRU. Maintained by the pageout code.
> > + */
>
> The comment has to be updated.
>

Yes it does. Fixed.

> > + unsigned int inactive_ratio;
> > +
> > + unsigned long flags;
> > +
> > + ZONE_PADDING(_pad2_)
> > struct per_cpu_nodestat __percpu *per_cpu_nodestats;
> > atomic_long_t vm_stat[NR_VM_NODE_STAT_ITEMS];
> > } pg_data_t;
> > @@ -1185,7 +1185,7 @@ struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
> > struct lruvec *lruvec;
> >
> > if (mem_cgroup_disabled()) {
> > - lruvec = &zone->lruvec;
> > + lruvec = zone_lruvec(zone);
> > goto out;
> > }
> >
> > @@ -1197,8 +1197,8 @@ out:
> > * we have to be prepared to initialize lruvec->zone here;
> > * and if offlined then reonlined, we need to reinitialize it.
> > */
> > - if (unlikely(lruvec->zone != zone))
> > - lruvec->zone = zone;
> > + if (unlikely(lruvec->pgdat != zone->zone_pgdat))
> > + lruvec->pgdat = zone->zone_pgdat;
>
> See below please.
>
> > return lruvec;
> > }
> >
> > @@ -1211,14 +1211,14 @@ out:
> > * and putback protocol: the LRU lock must be held, and the page must
> > * either be PageLRU() or the caller must have isolated/allocated it.
> > */
> > -struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
> > +struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgdat)
> > {
> > struct mem_cgroup_per_zone *mz;
> > struct mem_cgroup *memcg;
> > struct lruvec *lruvec;
> >
> > if (mem_cgroup_disabled()) {
> > - lruvec = &zone->lruvec;
> > + lruvec = &pgdat->lruvec;
> > goto out;
> > }
> >
> > @@ -1238,8 +1238,8 @@ out:
> > * we have to be prepared to initialize lruvec->zone here;
> > * and if offlined then reonlined, we need to reinitialize it.
> > */
> > - if (unlikely(lruvec->zone != zone))
> > - lruvec->zone = zone;
> > + if (unlikely(lruvec->pgdat != pgdat))
> > + lruvec->pgdat = pgdat;
>
> Given &pgdat->lruvec, we no longer need (or are able) to set lruvec->pgdat.
>

I do not understand your comment. This is setting a mapping between lruvec
and pgdat, not the other way around. It's a straightforward conversion
of zone to pgdat.

--
Mel Gorman
SUSE Labs

2015-06-15 08:23:10

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 04/25] mm, vmscan: Begin reclaiming pages on a per-node basis

On Thu, Jun 11, 2015 at 03:58:14PM +0800, Hillf Danton wrote:
> > @@ -1319,6 +1322,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > struct list_head *src = &lruvec->lists[lru];
> > unsigned long nr_taken = 0;
> > unsigned long scan;
> > + LIST_HEAD(pages_skipped);
> >
> > for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
> > struct page *page;
> > @@ -1329,6 +1333,9 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> >
> > VM_BUG_ON_PAGE(!PageLRU(page), page);
> >
> > + if (page_zone_id(page) > sc->reclaim_idx)
> > + list_move(&page->lru, &pages_skipped);
> > +
> > switch (__isolate_lru_page(page, mode)) {
> > case 0:
> > nr_pages = hpage_nr_pages(page);
> > @@ -1347,6 +1354,15 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > }
> > }
> >
> > + /*
> > + * Splice any skipped pages to the start of the LRU list. Note that
> > + * this disrupts the LRU order when reclaiming for lower zones but
> > + * we cannot splice to the tail. If we did then the SWAP_CLUSTER_MAX
> > + * scanning would soon rescan the same pages to skip and put the
> > + * system at risk of premature OOM.
> > + */
> > + if (!list_empty(&pages_skipped))
> > + list_splice(&pages_skipped, src);
> > *nr_scanned = scan;
> > trace_mm_vmscan_lru_isolate(sc->order, nr_to_scan, scan,
> > nr_taken, mode, is_file_lru(lru));
>
> Can we avoid splicing pages by skipping those pages without incrementing scan?
>

The reclaimers would still have to do the work of examining those pages
and ignoring them even if the counters are not updated. It'll look like
high CPU usage for no obvious reason.

--
Mel Gorman
SUSE Labs

2015-06-15 08:27:34

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 07/25] mm, vmscan: Make kswapd think of reclaim in terms of nodes

On Fri, Jun 12, 2015 at 03:05:00PM +0800, Hillf Danton wrote:
> > - /* Reclaim above the high watermark. */
> > - sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
> > + /* Aim to reclaim above all the zone high watermarks */
> > + for (z = 0; z <= end_zone; z++) {
> > + zone = pgdat->node_zones + end_zone;
> s/end_zone/z/ ?

Ouch, thanks!

With this bug, kswapd would set its reclaim target to a multiple of the
highest zone's high watermark. Whether that results in under- or
over-reclaim would depend on the size of that zone relative to the lower
zones.

> > + nr_to_reclaim += high_wmark_pages(zone);
> >
> [...]
> > @@ -3280,13 +3177,26 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> > compact_pgdat(pgdat, order);
> >
> > /*
> > + * Stop reclaiming if any eligible zone is balanced and clear
> > + * node writeback or congested.
> > + */
> > + for (i = 0; i <= *classzone_idx; i++) {
> > + zone = pgdat->node_zones + i;
> > +
> > + if (zone_balanced(zone, sc.order, 0, *classzone_idx)) {
> > + clear_bit(PGDAT_CONGESTED, &pgdat->flags);
> > + clear_bit(PGDAT_DIRTY, &pgdat->flags);
> > + break;
> s/break/goto out/ ?

Yes. It'd actually be ok because it'll detect the same condition and
exit in the next outer loop, but goto out is better.


--
Mel Gorman
SUSE Labs

2015-06-19 17:02:12

by Johannes Weiner

[permalink] [raw]
Subject: Re: [RFC PATCH 00/25] Move LRU page reclaim from zones to nodes

Hi Mel,

these are cool patches, I very much like the direction this is headed.

On Mon, Jun 08, 2015 at 02:56:06PM +0100, Mel Gorman wrote:
> This is an RFC series against 4.0 that moves LRUs from the zones to the
> node. In concept, this is straight forward but there are a lot of details
> so I'm posting it early to see what people think. The motivations are;
>
> 1. Currently, reclaim on node 0 behaves differently to node 1 with subtly different
> aging rules. Workloads may exhibit different behaviour depending on what node
> it was scheduled on as a result.

How so? Don't we ultimately age pages in proportion to node size,
regardless of how many zones they are broken into?

> 2. The residency of a page partially depends on what zone the page was
> allocated from. This is partially combatted by the fair zone allocation
> policy but that is a partial solution that introduces overhead in the
> page allocator paths.

Yeah, it's ugly and I'm happy you're getting rid of this again. That
being said, in my tests it seemed like a complete solution to remove
any influence from allocation placement on aging behavior. Where do
you still see aging artifacts?

> 3. kswapd and the page allocator play special games with the order they scan zones
> to avoid interfering with each other but it's unpredictable.

It would be good to recall these interference issues here, how they
are currently coped with, and how your patches address them.

> 4. The different scan activity and ordering for zone reclaim is very difficult
> to predict.

I'm not sure what this means.

> 5. slab shrinkers are node-based which makes relating page reclaim to
> slab reclaim harder than it should be.

Agreed. And I'm sure dchinner also much prefers moving the VM towards
a node model over moving the shrinkers towards the zone model.

> The reason we have zone-based reclaim is that we used to have
> large highmem zones in common configurations and it was necessary
> to quickly find ZONE_NORMAL pages for reclaim. Today, this is much
> less of a concern as machines with lots of memory will (or should) use
> 64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are
> rare. Machines that do use highmem should have relatively low highmem:lowmem
> ratios than we worried about in the past.
>
> Conceptually, moving to node LRUs should be easier to understand. The
> page allocator plays fewer tricks to game reclaim and reclaim behaves
> similarly on all nodes.

Do you think it's feasible to serve the occasional address-restricted
request from CMA, or a similar mechanism based on PFN ranges? In the
long term, it would be great to eradicate struct zone entirely, and
have the page allocator and reclaim talk about the same thing again
without having to translate back and forth between zones and nodes.

It would also be much better for DMA allocations that don't align with
the zone model, such as 31-bit address requests, which currently have
to play the lottery with GFP_DMA32 and fall back to GFP_DMA.

> It was tested on a UMA (8 cores single socket) and a NUMA machine (48 cores,
> 4 sockets). The page allocator tests showed marginal differences in aim9,
> page fault microbenchmark, page allocator micro-benchmark and ebizzy. This
> was expected as the affected paths are small in comparison to the overall
> workloads.
>
> I also tested using fstest on zero-length files to stress slab reclaim. It
> showed no major differences in performance or stats.
>
> A THP-based test case that stresses compaction was inconclusive. It showed
> differences in the THP allocation success rate and both gains and losses in
> the time it takes to allocate THP depending on the number of threads running.

It would be useful to include a "reasonable" highmem test here as well.

> Tests did show there were differences in the pages allocated from each zone.
> This is due to the fact the fair zone allocation policy is removed as with
> node-based LRU reclaim, it *should* not be necessary. It would be preferable
> if the original database workload that motivated the introduction of that
> policy was retested with this series though.

It's as simple as repeatedly reading a file that is ever-so-slightly
bigger than the available memory. The result should be a perfect
tail-chasing scenario, with the entire file being served from disk
every single time. If parts of it get activated, that is a problem,
because it means that some pages get aged differently than others.

When I worked on the fair zone allocator, I hacked mincore() to report
PG_active, to be extra sure about where the pages of interest are, but
monitoring pgactivate during the test, or comparing its deltas between
kernels, should be good enough.

2015-06-21 14:04:45

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 00/25] Move LRU page reclaim from zones to nodes

On Fri, Jun 19, 2015 at 01:01:39PM -0400, Johannes Weiner wrote:
> Hi Mel,
>
> these are cool patches, I very much like the direction this is headed.
>
> On Mon, Jun 08, 2015 at 02:56:06PM +0100, Mel Gorman wrote:
> > This is an RFC series against 4.0 that moves LRUs from the zones to the
> > node. In concept, this is straight forward but there are a lot of details
> > so I'm posting it early to see what people think. The motivations are;
> >
> > 1. Currently, reclaim on node 0 behaves differently to node 1 with subtly different
> > aging rules. Workloads may exhibit different behaviour depending on what node
> > it was scheduled on as a result.
>
> How so? Don't we ultimately age pages in proportion to node size,
> regardless of how many zones they are broken into?
>

For example, direct reclaim scans in zonelist order (highest eligible zone
first) and stops when SWAP_CLUSTER_MAX pages are reclaimed, which could be
entirely from one zone. The fair zone policy limits age inversion problems,
but it's still different to what happens when direct reclaim starts on a
node with only one populated zone.

kswapd will not reclaim from the highest zone if it's already balanced. If
the distribution of anon/file pages differs between zones due to when they
were allocated, then the behaviour is slightly different.

I expect in most cases that it makes little difference, but moving to
node-based reclaim gets rid of some of these differences.

> > 2. The residency of a page partially depends on what zone the page was
> > allocated from. This is partially combatted by the fair zone allocation
> > policy but that is a partial solution that introduces overhead in the
> > page allocator paths.
>
> Yeah, it's ugly and I'm happy you're getting rid of this again. That
> being said, in my tests it seemed like a complete solution to remove
> any influence from allocation placement on aging behavior. Where do
> you still see aging artifacts?
>

I actually have not created an artificial test case that games the fair
zone allocation policy, and I expect in almost all cases that the fair zone
allocation policy is adequate. It's just not necessary if we reclaim on
a per-node basis.

> > 3. kswapd and the page allocator play special games with the order they scan zones
> > to avoid interfering with each other but it's unpredictable.
>
> It would be good to recall here these interference issues, how they
> are currently coped with, and how your patches address them.
>

It is coped with by having kswapd reclaim in the opposite order to the one
the page allocator prefers. This prevents the allocator from always reusing
pages that kswapd reclaimed recently. It's not handled at all for direct
reclaim as it reclaims in zonelist order. With the patches it should simply
be unnecessary to avoid this problem as pages are reclaimed in order of
their age unless the caller requires a page from a low zone.

> > 4. The different scan activity and ordering for zone reclaim is very difficult
> > to predict.
>
> I'm not sure what this means.
>

The order that pages get reclaimed in partially depends on when allocations
push a zone below a watermark. The timing of when kswapd examines a zone
versus when the page allocator does matters more than it should.

> > 5. slab shrinkers are node-based which makes relating page reclaim to
> > slab reclaim harder than it should be.
>
> Agreed. And I'm sure dchinner also much prefers moving the VM towards
> a node model over moving the shrinkers towards the zone model.
>
> > The reason we have zone-based reclaim is that we used to have
> > large highmem zones in common configurations and it was necessary
> > to quickly find ZONE_NORMAL pages for reclaim. Today, this is much
> > less of a concern as machines with lots of memory will (or should) use
> > 64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are
> > rare. Machines that do use highmem should have relatively low highmem:lowmem
> > ratios than we worried about in the past.
> >
> > Conceptually, moving to node LRUs should be easier to understand. The
> > page allocator plays fewer tricks to game reclaim and reclaim behaves
> > similarly on all nodes.
>
> Do you think it's feasible to serve the occasional address-restricted
> request from CMA, or a similar mechanism based on PFN ranges?

I worried that it would get very expensive as we would have to search the
free lists to find a page in the required address range. If one is not
found, we would have to reclaim based on PFN ranges and then retry. Each
address-restricted allocation would have to repeat the search. Potentially
we could use migrate types to prevent a percentage of the lower zones being
used for unmovable allocations, but we'd still have to do the search.
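
As a toy userspace illustration (not kernel code) of that search problem,
the sketch below walks a free list looking for the first page below a PFN
limit; the types and names are invented for the example, and the real free
lists are per-order, per-migratetype buddy lists.

#include <stdio.h>

struct free_page {
        unsigned long pfn;
        struct free_page *next;
};

/* Walk the list for the first page whose PFN is below 'limit' */
static struct free_page *find_low_page(struct free_page *head,
                                       unsigned long limit)
{
        struct free_page *p;

        for (p = head; p; p = p->next)
                if (p->pfn < limit)
                        return p;

        /* Caller would have to reclaim by PFN range and retry */
        return NULL;
}

int main(void)
{
        struct free_page pages[3] = {
                { .pfn = 5000000 }, { .pfn = 42 }, { .pfn = 900000 }
        };
        struct free_page *found;

        pages[0].next = &pages[1];
        pages[1].next = &pages[2];
        pages[2].next = NULL;

        /* e.g. a 32-bit DMA-style limit expressed as a PFN */
        found = find_low_page(&pages[0], 1UL << 20);
        printf("found pfn %lu\n", found ? found->pfn : 0UL);
        return 0;
}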

> In the
> long term, it would be great to eradicate struct zone entirely, and
> have the page allocator and reclaim talk about the same thing again
> without having to translate back and forth between zones and nodes.
>

I agree it would be great in the long term. I don't see how it could be
done now but that is partially because node-based reclaim alone is a big
set of changes.


> It would also be much better for DMA allocations that don't align with
> the zone model, such as 31-bit address requests, which currently have
> to play the lottery with GFP_DMA32 and fall back to GFP_DMA.
>

Also true. Maybe it would be a reserve-based mechanism where subsystems
register their requirements and then use a mempool-like mechanism to limit
searching. It certainly would be worth examining if/when node-based reclaim
gets ironed out and we are confident that there are no regressions.

> > It was tested on a UMA (8 cores single socket) and a NUMA machine (48 cores,
> > 4 sockets). The page allocator tests showed marginal differences in aim9,
> > page fault microbenchmark, page allocator micro-benchmark and ebizzy. This
> > was expected as the affected paths are small in comparison to the overall
> > workloads.
> >
> > I also tested using fstest on zero-length files to stress slab reclaim. It
> > showed no major differences in performance or stats.
> >
> > A THP-based test case that stresses compaction was inconclusive. It showed
> > differences in the THP allocation success rate and both gains and losses in
> > the time it takes to allocate THP depending on the number of threads running.
>
> It would be useful to include a "reasonable" highmem test here as well.
>

I think something IO-intensive with a large highmem zone would do the job.

> > Tests did show there were differences in the pages allocated from each zone.
> > This is due to the fact the fair zone allocation policy is removed as with
> > node-based LRU reclaim, it *should* not be necessary. It would be preferable
> > if the original database workload that motivated the introduction of that
> > policy was retested with this series though.
>
> It's as simple as repeatedly reading a file that is ever-so-slightly
> bigger than the available memory. The result should be a perfect
> tail-chasing scenario, with the entire file being served from disk
> every single time. If parts of it get activated, that is a problem,
> because it means that some pages get aged differently than others.
>

Ok, that is trivial to put together.
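
For reference, a minimal sketch of such a test might look like the
following; the file path and iteration count are placeholders, the file
would need to be slightly larger than RAM, and pgactivate would be
monitored via /proc/vmstat while it runs.

#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        static char buf[1 << 20];       /* 1MB read buffer */
        int iter;

        for (iter = 0; iter < 10; iter++) {
                int fd = open("/mnt/testfile", O_RDONLY);
                ssize_t ret;

                if (fd < 0) {
                        perror("open");
                        return 1;
                }

                /* Stream the whole file; ideally every pass is served from disk */
                while ((ret = read(fd, buf, sizeof(buf))) > 0)
                        ;

                close(fd);
        }

        return 0;
}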

> When I worked on the fair zone allocator, I hacked mincore() to report
> PG_active, to be extra sure about where the pages of interest are, but
> monitoring pgactivate during the test, or comparing its deltas between
> kernels, should be good enough.

Thanks.

--
Mel Gorman
SUSE Labs