LinuxLists.cc - [PATCH 0/11 v3] Use local_lock for pcp protection and reduce stat overhead

2021-04-15 00:27:25

Subject: [PATCH 0/11 v3] Use local_lock for pcp protection and reduce stat overhead

Changelog since v2
o Fix zonestats initialisation
o Merged memory hotplug fix separately
o Embed local_lock within per_cpu_pages

This series requires patches in Andrew's tree so for convenience, it's
also available at

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-percpu-local_lock-v3r6

The PCP (per-cpu page allocator in page_alloc.c) shares locking
requirements with vmstat and the zone lock which is inconvenient and
causes some issues. For example, the PCP list and vmstat share the
same per-cpu space meaning that it's possible that vmstat updates dirty
cache lines holding per-cpu lists across CPUs unless padding is used.
Second, PREEMPT_RT does not want to disable IRQs for too long in the
page allocator.

This series splits the locking requirements and uses locks types more
suitable for PREEMPT_RT, reduces the time when special locking is required
for stats and reduces the time when IRQs need to be disabled on !PREEMPT_RT
kernels.

Why local_lock? PREEMPT_RT considers the following sequence to be unsafe
as documented in Documentation/locking/locktypes.rst

local_irq_disable();
spin_lock(&lock);

The pcp allocator has this sequence for rmqueue_pcplist (local_irq_save)
-> __rmqueue_pcplist -> rmqueue_bulk (spin_lock). While it's possible to
separate this out, it generally means there are points where we enable
IRQs and reenable them again immediately. To prevent a migration and the
per-cpu pointer going stale, migrate_disable is also needed. That is a
custom lock that is similar, but worse, than local_lock. Furthermore,
on PREEMPT_RT, it's undesirable to leave IRQs disabled for too long.
By converting to local_lock which disables migration on PREEMPT_RT, the
locking requirements can be separated and start moving the protections
for PCP, stats and the zone lock to PREEMPT_RT-safe equivalent locking. As
a bonus, local_lock also means that PROVE_LOCKING does something useful.

After that, it's obvious that zone_statistics incurs too much overhead
and leaves IRQs disabled for longer than necessary on !PREEMPT_RT
kernels. zone_statistics uses perfectly accurate counters requiring IRQs
be disabled for parallel RMW sequences when inaccurate ones like vm_events
would do. The series makes the NUMA statistics (NUMA_HIT and friends)
inaccurate counters that then require no special protection on !PREEMPT_RT.

The bulk page allocator can then do stat updates in bulk with IRQs enabled
which should improve the efficiency. Technically, this could have been
done without the local_lock and vmstat conversion work and the order
simply reflects the timing of when different series were implemented.

Finally, there are places where we conflate IRQs being disabled for the
PCP with the IRQ-safe zone spinlock. The remainder of the series reduces
the scope of what is protected by disabled IRQs on !PREEMPT_RT kernels.
By the end of the series, page_alloc.c does not call local_irq_save so
the locking scope is a bit clearer. The one exception is that modifying
NR_FREE_PAGES still happens in places where it's known the IRQs are
disabled as it's harmless for PREEMPT_RT and would be expensive to split
the locking there.

No performance data is included because despite the overhead of the stats,
it's within the noise for most workloads on !PREEMPT_RT. However, Jesper
Dangaard Brouer ran a page allocation microbenchmark on a E5-1650 v4 @
3.60GHz CPU on the first version of this series. Focusing on the array
variant of the bulk page allocator reveals the following.

(CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz)
ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size

Baseline Patched
1 56.383 54.225 (+3.83%)
2 40.047 35.492 (+11.38%)
3 37.339 32.643 (+12.58%)
4 35.578 30.992 (+12.89%)
8 33.592 29.606 (+11.87%)
16 32.362 28.532 (+11.85%)
32 31.476 27.728 (+11.91%)
64 30.633 27.252 (+11.04%)
128 30.596 27.090 (+11.46%)

While this is a positive outcome, the series is more likely to be
interesting to the RT people in terms of getting parts of the PREEMPT_RT
tree into mainline.

drivers/base/node.c | 18 +--
include/linux/mmzone.h | 58 ++++++--
include/linux/vmstat.h | 65 +++++----
mm/mempolicy.c | 2 +-
mm/page_alloc.c | 302 +++++++++++++++++++++++++----------------
mm/vmstat.c | 250 ++++++++++++----------------------
6 files changed, 370 insertions(+), 325 deletions(-)

--
2.26.2

Mel Gorman (11):
mm/page_alloc: Split per cpu page lists and zone stats
mm/page_alloc: Convert per-cpu list protection to local_lock
mm/vmstat: Convert NUMA statistics to basic NUMA counters
mm/vmstat: Inline NUMA event counter updates
mm/page_alloc: Batch the accounting updates in the bulk allocator
mm/page_alloc: Reduce duration that IRQs are disabled for VM counters
mm/page_alloc: Remove duplicate checks if migratetype should be
isolated
mm/page_alloc: Explicitly acquire the zone lock in __free_pages_ok
mm/page_alloc: Avoid conflating IRQs disabled with zone->lock
mm/page_alloc: Update PGFREE outside the zone lock in __free_pages_ok
mm/page_alloc: Embed per_cpu_pages locking within the per-cpu
structure

drivers/base/node.c | 18 +--
include/linux/mmzone.h | 58 ++++++--
include/linux/vmstat.h | 65 +++++----
mm/mempolicy.c | 2 +-
mm/page_alloc.c | 302 +++++++++++++++++++++++++----------------
mm/vmstat.c | 250 ++++++++++++----------------------
6 files changed, 370 insertions(+), 325 deletions(-)

--
2.26.2

2021-04-15 00:27:51

by Mel Gorman

[permalink] [raw]

Subject: [PATCH 01/11] mm/page_alloc: Split per cpu page lists and zone stats

The per-cpu page allocator lists and the per-cpu vmstat deltas are stored
in the same struct per_cpu_pages even though vmstats have no direct impact
on the per-cpu page lists. This is inconsistent because the vmstats for a
node are stored on a dedicated structure. The bigger issue is that the
per_cpu_pages structure is not cache-aligned and stat updates either
cache conflict with adjacent per-cpu lists incurring a runtime cost or
padding is required incurring a memory cost.

This patch splits the per-cpu pagelists and the vmstat deltas into separate
structures. It's mostly a mechanical conversion but some variable renaming
is done to clearly distinguish the per-cpu pages structure (pcp) from
the vmstats (pzstats).

Superficially, this appears to increase the size of the per_cpu_pages
structure but the movement of expire fills a structure hole so there is
no impact overall.

[[email protected]: Check struct per_cpu_zonestat has a non-zero size]
[[email protected]: Init zone->per_cpu_zonestats properly]
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 18 ++++----
include/linux/vmstat.h | 8 ++--
mm/page_alloc.c | 85 ++++++++++++++++++++-----------------
mm/vmstat.c | 96 ++++++++++++++++++++++--------------------
4 files changed, 111 insertions(+), 96 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 47946cec7584..a4393ac27336 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -341,20 +341,21 @@ struct per_cpu_pages {
int count; /* number of pages in the list */
int high; /* high watermark, emptying needed */
int batch; /* chunk size for buddy add/remove */
+#ifdef CONFIG_NUMA
+ int expire; /* When 0, remote pagesets are drained */
+#endif

/* Lists of pages, one per migrate type stored on the pcp-lists */
struct list_head lists[MIGRATE_PCPTYPES];
};

-struct per_cpu_pageset {
- struct per_cpu_pages pcp;
-#ifdef CONFIG_NUMA
- s8 expire;
- u16 vm_numa_stat_diff[NR_VM_NUMA_STAT_ITEMS];
-#endif
+struct per_cpu_zonestat {
#ifdef CONFIG_SMP
- s8 stat_threshold;
s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
+ s8 stat_threshold;
+#endif
+#ifdef CONFIG_NUMA
+ u16 vm_numa_stat_diff[NR_VM_NUMA_STAT_ITEMS];
#endif
};

@@ -470,7 +471,8 @@ struct zone {
int node;
#endif
struct pglist_data *zone_pgdat;
- struct per_cpu_pageset __percpu *pageset;
+ struct per_cpu_pages __percpu *per_cpu_pageset;
+ struct per_cpu_zonestat __percpu *per_cpu_zonestats;
/*
* the high and batch values are copied to individual pagesets for
* faster access
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 506d625163a1..1736ea9d24a7 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -163,7 +163,7 @@ static inline unsigned long zone_numa_state_snapshot(struct zone *zone,
int cpu;

for_each_online_cpu(cpu)
- x += per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item];
+ x += per_cpu_ptr(zone->per_cpu_zonestats, cpu)->vm_numa_stat_diff[item];

return x;
}
@@ -236,7 +236,7 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
#ifdef CONFIG_SMP
int cpu;
for_each_online_cpu(cpu)
- x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];
+ x += per_cpu_ptr(zone->per_cpu_zonestats, cpu)->vm_stat_diff[item];

if (x < 0)
x = 0;
@@ -291,7 +291,7 @@ struct ctl_table;
int vmstat_refresh(struct ctl_table *, int write, void *buffer, size_t *lenp,
loff_t *ppos);

-void drain_zonestat(struct zone *zone, struct per_cpu_pageset *);
+void drain_zonestat(struct zone *zone, struct per_cpu_zonestat *);

int calculate_pressure_threshold(struct zone *zone);
int calculate_normal_threshold(struct zone *zone);
@@ -399,7 +399,7 @@ static inline void cpu_vm_stats_fold(int cpu) { }
static inline void quiet_vmstat(void) { }

static inline void drain_zonestat(struct zone *zone,
- struct per_cpu_pageset *pset) { }
+ struct per_cpu_zonestat *pzstats) { }
#endif /* CONFIG_SMP */

static inline void __mod_zone_freepage_state(struct zone *zone, int nr_pages,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9bf0db982f14..2d6283cab22d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2981,15 +2981,14 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
static void drain_pages_zone(unsigned int cpu, struct zone *zone)
{
unsigned long flags;
- struct per_cpu_pageset *pset;
struct per_cpu_pages *pcp;

local_irq_save(flags);
- pset = per_cpu_ptr(zone->pageset, cpu);

- pcp = &pset->pcp;
+ pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
if (pcp->count)
free_pcppages_bulk(zone, pcp->count, pcp);
+
local_irq_restore(flags);
}

@@ -3088,7 +3087,7 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
* disables preemption as part of its processing
*/
for_each_online_cpu(cpu) {
- struct per_cpu_pageset *pcp;
+ struct per_cpu_pages *pcp;
struct zone *z;
bool has_pcps = false;

@@ -3099,13 +3098,13 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
*/
has_pcps = true;
} else if (zone) {
- pcp = per_cpu_ptr(zone->pageset, cpu);
- if (pcp->pcp.count)
+ pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+ if (pcp->count)
has_pcps = true;
} else {
for_each_populated_zone(z) {
- pcp = per_cpu_ptr(z->pageset, cpu);
- if (pcp->pcp.count) {
+ pcp = per_cpu_ptr(z->per_cpu_pageset, cpu);
+ if (pcp->count) {
has_pcps = true;
break;
}
@@ -3235,7 +3234,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
migratetype = MIGRATE_MOVABLE;
}

- pcp = &this_cpu_ptr(zone->pageset)->pcp;
+ pcp = this_cpu_ptr(zone->per_cpu_pageset);
list_add(&page->lru, &pcp->lists[migratetype]);
pcp->count++;
if (pcp->count >= READ_ONCE(pcp->high))
@@ -3451,7 +3450,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
unsigned long flags;

local_irq_save(flags);
- pcp = &this_cpu_ptr(zone->pageset)->pcp;
+ pcp = this_cpu_ptr(zone->per_cpu_pageset);
list = &pcp->lists[migratetype];
page = __rmqueue_pcplist(zone, migratetype, alloc_flags, pcp, list);
if (page) {
@@ -5054,7 +5053,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,

/* Attempt the batch allocation */
local_irq_save(flags);
- pcp = &this_cpu_ptr(zone->pageset)->pcp;
+ pcp = this_cpu_ptr(zone->per_cpu_pageset);
pcp_list = &pcp->lists[ac.migratetype];

while (nr_populated < nr_pages) {
@@ -5667,7 +5666,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
continue;

for_each_online_cpu(cpu)
- free_pcp += per_cpu_ptr(zone->pageset, cpu)->pcp.count;
+ free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
}

printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n"
@@ -5759,7 +5758,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)

free_pcp = 0;
for_each_online_cpu(cpu)
- free_pcp += per_cpu_ptr(zone->pageset, cpu)->pcp.count;
+ free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;

show_node(zone);
printk(KERN_CONT
@@ -5800,7 +5799,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
K(zone_page_state(zone, NR_MLOCK)),
K(zone_page_state(zone, NR_BOUNCE)),
K(free_pcp),
- K(this_cpu_read(zone->pageset->pcp.count)),
+ K(this_cpu_read(zone->per_cpu_pageset->count)),
K(zone_page_state(zone, NR_FREE_CMA_PAGES)));
printk("lowmem_reserve[]:");
for (i = 0; i < MAX_NR_ZONES; i++)
@@ -6127,11 +6126,12 @@ static void build_zonelists(pg_data_t *pgdat)
* not check if the processor is online before following the pageset pointer.
* Other parts of the kernel may not check if the zone is available.
*/
-static void pageset_init(struct per_cpu_pageset *p);
+static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonestat *pzstats);
/* These effectively disable the pcplists in the boot pageset completely */
#define BOOT_PAGESET_HIGH 0
#define BOOT_PAGESET_BATCH 1
-static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
+static DEFINE_PER_CPU(struct per_cpu_pages, boot_pageset);
+static DEFINE_PER_CPU(struct per_cpu_zonestat, boot_zonestats);
static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);

static void __build_all_zonelists(void *data)
@@ -6198,7 +6198,7 @@ build_all_zonelists_init(void)
* (a chicken-egg dilemma).
*/
for_each_possible_cpu(cpu)
- pageset_init(&per_cpu(boot_pageset, cpu));
+ per_cpu_pages_init(&per_cpu(boot_pageset, cpu), &per_cpu(boot_zonestats, cpu));

mminit_verify_zonelist();
cpuset_init_current_mems_allowed();
@@ -6576,14 +6576,13 @@ static void pageset_update(struct per_cpu_pages *pcp, unsigned long high,
WRITE_ONCE(pcp->high, high);
}

-static void pageset_init(struct per_cpu_pageset *p)
+static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonestat *pzstats)
{
- struct per_cpu_pages *pcp;
int migratetype;

- memset(p, 0, sizeof(*p));
+ memset(pcp, 0, sizeof(*pcp));
+ memset(pzstats, 0, sizeof(*pzstats));

- pcp = &p->pcp;
for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++)
INIT_LIST_HEAD(&pcp->lists[migratetype]);

@@ -6600,12 +6599,12 @@ static void pageset_init(struct per_cpu_pageset *p)
static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high,
unsigned long batch)
{
- struct per_cpu_pageset *p;
+ struct per_cpu_pages *pcp;
int cpu;

for_each_possible_cpu(cpu) {
- p = per_cpu_ptr(zone->pageset, cpu);
- pageset_update(&p->pcp, high, batch);
+ pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+ pageset_update(pcp, high, batch);
}
}

@@ -6640,13 +6639,20 @@ static void zone_set_pageset_high_and_batch(struct zone *zone)

void __meminit setup_zone_pageset(struct zone *zone)
{
- struct per_cpu_pageset *p;
int cpu;

- zone->pageset = alloc_percpu(struct per_cpu_pageset);
+ /* Size may be 0 on !SMP && !NUMA */
+ if (sizeof(struct per_cpu_zonestat) > 0)
+ zone->per_cpu_zonestats = alloc_percpu(struct per_cpu_zonestat);
+
+ zone->per_cpu_pageset = alloc_percpu(struct per_cpu_pages);
for_each_possible_cpu(cpu) {
- p = per_cpu_ptr(zone->pageset, cpu);
- pageset_init(p);
+ struct per_cpu_pages *pcp;
+ struct per_cpu_zonestat *pzstats;
+
+ pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+ pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
+ per_cpu_pages_init(pcp, pzstats);
}

zone_set_pageset_high_and_batch(zone);
@@ -6673,9 +6679,9 @@ void __init setup_per_cpu_pageset(void)
* the nodes these zones are associated with.
*/
for_each_possible_cpu(cpu) {
- struct per_cpu_pageset *pcp = &per_cpu(boot_pageset, cpu);
- memset(pcp->vm_numa_stat_diff, 0,
- sizeof(pcp->vm_numa_stat_diff));
+ struct per_cpu_zonestat *pzstats = &per_cpu(boot_zonestats, cpu);
+ memset(pzstats->vm_numa_stat_diff, 0,
+ sizeof(pzstats->vm_numa_stat_diff));
}
#endif

@@ -6691,7 +6697,8 @@ static __meminit void zone_pcp_init(struct zone *zone)
* relies on the ability of the linker to provide the
* offset of a (static) per cpu variable into the per cpu area.
*/
- zone->pageset = &boot_pageset;
+ zone->per_cpu_pageset = &boot_pageset;
+ zone->per_cpu_zonestats = &boot_zonestats;
zone->pageset_high = BOOT_PAGESET_HIGH;
zone->pageset_batch = BOOT_PAGESET_BATCH;

@@ -8953,15 +8960,17 @@ void zone_pcp_enable(struct zone *zone)
void zone_pcp_reset(struct zone *zone)
{
int cpu;
- struct per_cpu_pageset *pset;
+ struct per_cpu_zonestat *pzstats;

- if (zone->pageset != &boot_pageset) {
+ if (zone->per_cpu_pageset != &boot_pageset) {
for_each_online_cpu(cpu) {
- pset = per_cpu_ptr(zone->pageset, cpu);
- drain_zonestat(zone, pset);
+ pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
+ drain_zonestat(zone, pzstats);
}
- free_percpu(zone->pageset);
- zone->pageset = &boot_pageset;
+ free_percpu(zone->per_cpu_pageset);
+ free_percpu(zone->per_cpu_zonestats);
+ zone->per_cpu_pageset = &boot_pageset;
+ zone->per_cpu_zonestats = &boot_zonestats;
}
}

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 74b2c374b86c..8a8f1a26b231 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -44,7 +44,7 @@ static void zero_zone_numa_counters(struct zone *zone)
for (item = 0; item < NR_VM_NUMA_STAT_ITEMS; item++) {
atomic_long_set(&zone->vm_numa_stat[item], 0);
for_each_online_cpu(cpu)
- per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item]
+ per_cpu_ptr(zone->per_cpu_zonestats, cpu)->vm_numa_stat_diff[item]
= 0;
}
}
@@ -266,7 +266,7 @@ void refresh_zone_stat_thresholds(void)
for_each_online_cpu(cpu) {
int pgdat_threshold;

- per_cpu_ptr(zone->pageset, cpu)->stat_threshold
+ per_cpu_ptr(zone->per_cpu_zonestats, cpu)->stat_threshold
= threshold;

/* Base nodestat threshold on the largest populated zone. */
@@ -303,7 +303,7 @@ void set_pgdat_percpu_threshold(pg_data_t *pgdat,

threshold = (*calculate_pressure)(zone);
for_each_online_cpu(cpu)
- per_cpu_ptr(zone->pageset, cpu)->stat_threshold
+ per_cpu_ptr(zone->per_cpu_zonestats, cpu)->stat_threshold
= threshold;
}
}
@@ -316,7 +316,7 @@ void set_pgdat_percpu_threshold(pg_data_t *pgdat,
void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
long delta)
{
- struct per_cpu_pageset __percpu *pcp = zone->pageset;
+ struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
s8 __percpu *p = pcp->vm_stat_diff + item;
long x;
long t;
@@ -389,7 +389,7 @@ EXPORT_SYMBOL(__mod_node_page_state);
*/
void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
{
- struct per_cpu_pageset __percpu *pcp = zone->pageset;
+ struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
s8 __percpu *p = pcp->vm_stat_diff + item;
s8 v, t;

@@ -435,7 +435,7 @@ EXPORT_SYMBOL(__inc_node_page_state);

void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
{
- struct per_cpu_pageset __percpu *pcp = zone->pageset;
+ struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
s8 __percpu *p = pcp->vm_stat_diff + item;
s8 v, t;

@@ -495,7 +495,7 @@ EXPORT_SYMBOL(__dec_node_page_state);
static inline void mod_zone_state(struct zone *zone,
enum zone_stat_item item, long delta, int overstep_mode)
{
- struct per_cpu_pageset __percpu *pcp = zone->pageset;
+ struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
s8 __percpu *p = pcp->vm_stat_diff + item;
long o, n, t, z;

@@ -781,19 +781,20 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
int changes = 0;

for_each_populated_zone(zone) {
- struct per_cpu_pageset __percpu *p = zone->pageset;
+ struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
+ struct per_cpu_pages __percpu *pcp = zone->per_cpu_pageset;

for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
int v;

- v = this_cpu_xchg(p->vm_stat_diff[i], 0);
+ v = this_cpu_xchg(pzstats->vm_stat_diff[i], 0);
if (v) {

atomic_long_add(v, &zone->vm_stat[i]);
global_zone_diff[i] += v;
#ifdef CONFIG_NUMA
/* 3 seconds idle till flush */
- __this_cpu_write(p->expire, 3);
+ __this_cpu_write(pcp->expire, 3);
#endif
}
}
@@ -801,12 +802,12 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) {
int v;

- v = this_cpu_xchg(p->vm_numa_stat_diff[i], 0);
+ v = this_cpu_xchg(pzstats->vm_numa_stat_diff[i], 0);
if (v) {

atomic_long_add(v, &zone->vm_numa_stat[i]);
global_numa_diff[i] += v;
- __this_cpu_write(p->expire, 3);
+ __this_cpu_write(pcp->expire, 3);
}
}

@@ -819,23 +820,23 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
* Check if there are pages remaining in this pageset
* if not then there is nothing to expire.
*/
- if (!__this_cpu_read(p->expire) ||
- !__this_cpu_read(p->pcp.count))
+ if (!__this_cpu_read(pcp->expire) ||
+ !__this_cpu_read(pcp->count))
continue;

/*
* We never drain zones local to this processor.
*/
if (zone_to_nid(zone) == numa_node_id()) {
- __this_cpu_write(p->expire, 0);
+ __this_cpu_write(pcp->expire, 0);
continue;
}

- if (__this_cpu_dec_return(p->expire))
+ if (__this_cpu_dec_return(pcp->expire))
continue;

- if (__this_cpu_read(p->pcp.count)) {
- drain_zone_pages(zone, this_cpu_ptr(&p->pcp));
+ if (__this_cpu_read(pcp->count)) {
+ drain_zone_pages(zone, this_cpu_ptr(pcp));
changes++;
}
}
@@ -882,27 +883,27 @@ void cpu_vm_stats_fold(int cpu)
int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };

for_each_populated_zone(zone) {
- struct per_cpu_pageset *p;
+ struct per_cpu_zonestat *pzstats;

- p = per_cpu_ptr(zone->pageset, cpu);
+ pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);

for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
- if (p->vm_stat_diff[i]) {
+ if (pzstats->vm_stat_diff[i]) {
int v;

- v = p->vm_stat_diff[i];
- p->vm_stat_diff[i] = 0;
+ v = pzstats->vm_stat_diff[i];
+ pzstats->vm_stat_diff[i] = 0;
atomic_long_add(v, &zone->vm_stat[i]);
global_zone_diff[i] += v;
}

#ifdef CONFIG_NUMA
for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
- if (p->vm_numa_stat_diff[i]) {
+ if (pzstats->vm_numa_stat_diff[i]) {
int v;

- v = p->vm_numa_stat_diff[i];
- p->vm_numa_stat_diff[i] = 0;
+ v = pzstats->vm_numa_stat_diff[i];
+ pzstats->vm_numa_stat_diff[i] = 0;
atomic_long_add(v, &zone->vm_numa_stat[i]);
global_numa_diff[i] += v;
}
@@ -936,24 +937,24 @@ void cpu_vm_stats_fold(int cpu)
* this is only called if !populated_zone(zone), which implies no other users of
* pset->vm_stat_diff[] exsist.
*/
-void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset)
+void drain_zonestat(struct zone *zone, struct per_cpu_zonestat *pzstats)
{
int i;

for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
- if (pset->vm_stat_diff[i]) {
- int v = pset->vm_stat_diff[i];
- pset->vm_stat_diff[i] = 0;
+ if (pzstats->vm_stat_diff[i]) {
+ int v = pzstats->vm_stat_diff[i];
+ pzstats->vm_stat_diff[i] = 0;
atomic_long_add(v, &zone->vm_stat[i]);
atomic_long_add(v, &vm_zone_stat[i]);
}

#ifdef CONFIG_NUMA
for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
- if (pset->vm_numa_stat_diff[i]) {
- int v = pset->vm_numa_stat_diff[i];
+ if (pzstats->vm_numa_stat_diff[i]) {
+ int v = pzstats->vm_numa_stat_diff[i];

- pset->vm_numa_stat_diff[i] = 0;
+ pzstats->vm_numa_stat_diff[i] = 0;
atomic_long_add(v, &zone->vm_numa_stat[i]);
atomic_long_add(v, &vm_numa_stat[i]);
}
@@ -965,8 +966,8 @@ void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset)
void __inc_numa_state(struct zone *zone,
enum numa_stat_item item)
{
- struct per_cpu_pageset __percpu *pcp = zone->pageset;
- u16 __percpu *p = pcp->vm_numa_stat_diff + item;
+ struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
+ u16 __percpu *p = pzstats->vm_numa_stat_diff + item;
u16 v;

v = __this_cpu_inc_return(*p);
@@ -1685,21 +1686,23 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,

seq_printf(m, "\n pagesets");
for_each_online_cpu(i) {
- struct per_cpu_pageset *pageset;
+ struct per_cpu_pages *pcp;
+ struct per_cpu_zonestat *pzstats;

- pageset = per_cpu_ptr(zone->pageset, i);
+ pcp = per_cpu_ptr(zone->per_cpu_pageset, i);
+ pzstats = per_cpu_ptr(zone->per_cpu_zonestats, i);
seq_printf(m,
"\n cpu: %i"
"\n count: %i"
"\n high: %i"
"\n batch: %i",
i,
- pageset->pcp.count,
- pageset->pcp.high,
- pageset->pcp.batch);
+ pcp->count,
+ pcp->high,
+ pcp->batch);
#ifdef CONFIG_SMP
seq_printf(m, "\n vm stats threshold: %d",
- pageset->stat_threshold);
+ pzstats->stat_threshold);
#endif
}
seq_printf(m,
@@ -1910,17 +1913,18 @@ static bool need_update(int cpu)
struct zone *zone;

for_each_populated_zone(zone) {
- struct per_cpu_pageset *p = per_cpu_ptr(zone->pageset, cpu);
+ struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
struct per_cpu_nodestat *n;
+
/*
* The fast way of checking if there are any vmstat diffs.
*/
- if (memchr_inv(p->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS *
- sizeof(p->vm_stat_diff[0])))
+ if (memchr_inv(pzstats->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS *
+ sizeof(pzstats->vm_stat_diff[0])))
return true;
#ifdef CONFIG_NUMA
- if (memchr_inv(p->vm_numa_stat_diff, 0, NR_VM_NUMA_STAT_ITEMS *
- sizeof(p->vm_numa_stat_diff[0])))
+ if (memchr_inv(pzstats->vm_numa_stat_diff, 0, NR_VM_NUMA_STAT_ITEMS *
+ sizeof(pzstats->vm_numa_stat_diff[0])))
return true;
#endif
if (last_pgdat == zone->zone_pgdat)
--
2.26.2

2021-04-15 00:27:51

by Mel Gorman

[permalink] [raw]

Subject: [PATCH 02/11] mm/page_alloc: Convert per-cpu list protection to local_lock

There is a lack of clarity of what exactly local_irq_save/local_irq_restore
protects in page_alloc.c . It conflates the protection of per-cpu page
allocation structures with per-cpu vmstat deltas.

This patch protects the PCP structure using local_lock which for most
configurations is identical to IRQ enabling/disabling. The scope of the
lock is still wider than it should be but this is decreased later.

It is possible for the local_lock to be embedded safely within struct
per_cpu_pages but it adds complexity to free_unref_page_list so it is
implemented as a separate patch later in the series.

[[email protected]: Make pagesets static]
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 2 ++
mm/page_alloc.c | 50 +++++++++++++++++++++++++++++-------------
2 files changed, 37 insertions(+), 15 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a4393ac27336..106da8fbc72a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -20,6 +20,7 @@
#include <linux/atomic.h>
#include <linux/mm_types.h>
#include <linux/page-flags.h>
+#include <linux/local_lock.h>
#include <asm/page.h>

/* Free memory management - zoned buddy allocator. */
@@ -337,6 +338,7 @@ enum zone_watermarks {
#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost)
#define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost)

+/* Fields and list protected by pagesets local_lock in page_alloc.c */
struct per_cpu_pages {
int count; /* number of pages in the list */
int high; /* high watermark, emptying needed */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2d6283cab22d..4e92d43c25f6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -112,6 +112,13 @@ typedef int __bitwise fpi_t;
static DEFINE_MUTEX(pcp_batch_high_lock);
#define MIN_PERCPU_PAGELIST_FRACTION (8)

+struct pagesets {
+ local_lock_t lock;
+};
+static DEFINE_PER_CPU(struct pagesets, pagesets) = {
+ .lock = INIT_LOCAL_LOCK(lock),
+};
+
#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
DEFINE_PER_CPU(int, numa_node);
EXPORT_PER_CPU_SYMBOL(numa_node);
@@ -1421,6 +1428,10 @@ static void free_pcppages_bulk(struct zone *zone, int count,
} while (--count && --batch_free && !list_empty(list));
}

+ /*
+ * local_lock_irq held so equivalent to spin_lock_irqsave for
+ * both PREEMPT_RT and non-PREEMPT_RT configurations.
+ */
spin_lock(&zone->lock);
isolated_pageblocks = has_isolate_pageblock(zone);

@@ -1541,6 +1552,11 @@ static void __free_pages_ok(struct page *page, unsigned int order,
return;

migratetype = get_pfnblock_migratetype(page, pfn);
+
+ /*
+ * TODO FIX: Disable IRQs before acquiring IRQ-safe zone->lock
+ * and protect vmstat updates.
+ */
local_irq_save(flags);
__count_vm_events(PGFREE, 1 << order);
free_one_page(page_zone(page), page, pfn, order, migratetype,
@@ -2910,6 +2926,10 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
{
int i, allocated = 0;

+ /*
+ * local_lock_irq held so equivalent to spin_lock_irqsave for
+ * both PREEMPT_RT and non-PREEMPT_RT configurations.
+ */
spin_lock(&zone->lock);
for (i = 0; i < count; ++i) {
struct page *page = __rmqueue(zone, order, migratetype,
@@ -2962,12 +2982,12 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
unsigned long flags;
int to_drain, batch;

- local_irq_save(flags);
+ local_lock_irqsave(&pagesets.lock, flags);
batch = READ_ONCE(pcp->batch);
to_drain = min(pcp->count, batch);
if (to_drain > 0)
free_pcppages_bulk(zone, to_drain, pcp);
- local_irq_restore(flags);
+ local_unlock_irqrestore(&pagesets.lock, flags);
}
#endif

@@ -2983,13 +3003,13 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
unsigned long flags;
struct per_cpu_pages *pcp;

- local_irq_save(flags);
+ local_lock_irqsave(&pagesets.lock, flags);

pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
if (pcp->count)
free_pcppages_bulk(zone, pcp->count, pcp);

- local_irq_restore(flags);
+ local_unlock_irqrestore(&pagesets.lock, flags);
}

/*
@@ -3252,9 +3272,9 @@ void free_unref_page(struct page *page)
if (!free_unref_page_prepare(page, pfn))
return;

- local_irq_save(flags);
+ local_lock_irqsave(&pagesets.lock, flags);
free_unref_page_commit(page, pfn);
- local_irq_restore(flags);
+ local_unlock_irqrestore(&pagesets.lock, flags);
}

/*
@@ -3274,7 +3294,7 @@ void free_unref_page_list(struct list_head *list)
set_page_private(page, pfn);
}

- local_irq_save(flags);
+ local_lock_irqsave(&pagesets.lock, flags);
list_for_each_entry_safe(page, next, list, lru) {
unsigned long pfn = page_private(page);

@@ -3287,12 +3307,12 @@ void free_unref_page_list(struct list_head *list)
* a large list of pages to free.
*/
if (++batch_count == SWAP_CLUSTER_MAX) {
- local_irq_restore(flags);
+ local_unlock_irqrestore(&pagesets.lock, flags);
batch_count = 0;
- local_irq_save(flags);
+ local_lock_irqsave(&pagesets.lock, flags);
}
}
- local_irq_restore(flags);
+ local_unlock_irqrestore(&pagesets.lock, flags);
}

/*
@@ -3449,7 +3469,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
struct page *page;
unsigned long flags;

- local_irq_save(flags);
+ local_lock_irqsave(&pagesets.lock, flags);
pcp = this_cpu_ptr(zone->per_cpu_pageset);
list = &pcp->lists[migratetype];
page = __rmqueue_pcplist(zone, migratetype, alloc_flags, pcp, list);
@@ -3457,7 +3477,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1);
zone_statistics(preferred_zone, zone);
}
- local_irq_restore(flags);
+ local_unlock_irqrestore(&pagesets.lock, flags);
return page;
}

@@ -5052,7 +5072,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
goto failed;

/* Attempt the batch allocation */
- local_irq_save(flags);
+ local_lock_irqsave(&pagesets.lock, flags);
pcp = this_cpu_ptr(zone->per_cpu_pageset);
pcp_list = &pcp->lists[ac.migratetype];

@@ -5090,12 +5110,12 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
nr_populated++;
}

- local_irq_restore(flags);
+ local_unlock_irqrestore(&pagesets.lock, flags);

return nr_populated;

failed_irq:
- local_irq_restore(flags);
+ local_unlock_irqrestore(&pagesets.lock, flags);

failed:
page = __alloc_pages(gfp, 0, preferred_nid, nodemask);
--
2.26.2

2021-04-15 00:27:53

by Mel Gorman

[permalink] [raw]

Subject: [PATCH 03/11] mm/vmstat: Convert NUMA statistics to basic NUMA counters

NUMA statistics are maintained on the zone level for hits, misses, foreign
etc but nothing relies on them being perfectly accurate for functional
correctness. The counters are used by userspace to get a general overview
of a workloads NUMA behaviour but the page allocator incurs a high cost to
maintain perfect accuracy similar to what is required for a vmstat like
NR_FREE_PAGES. There even is a sysctl vm.numa_stat to allow userspace to
turn off the collection of NUMA statistics like NUMA_HIT.

This patch converts NUMA_HIT and friends to be NUMA events with similar
accuracy to VM events. There is a possibility that slight errors will be
introduced but the overall trend as seen by userspace will be similar.
Note that while these counters could be maintained at the node level that
it would have a user-visible impact.

Signed-off-by: Mel Gorman <[email protected]>
---
drivers/base/node.c | 18 +++--
include/linux/mmzone.h | 11 ++-
include/linux/vmstat.h | 42 +++++-----
mm/mempolicy.c | 2 +-
mm/page_alloc.c | 12 +--
mm/vmstat.c | 175 ++++++++++++-----------------------------
6 files changed, 93 insertions(+), 167 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index f449dbb2c746..443a609db428 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -484,6 +484,7 @@ static DEVICE_ATTR(meminfo, 0444, node_read_meminfo, NULL);
static ssize_t node_read_numastat(struct device *dev,
struct device_attribute *attr, char *buf)
{
+ fold_vm_numa_events();
return sysfs_emit(buf,
"numa_hit %lu\n"
"numa_miss %lu\n"
@@ -491,12 +492,12 @@ static ssize_t node_read_numastat(struct device *dev,
"interleave_hit %lu\n"
"local_node %lu\n"
"other_node %lu\n",
- sum_zone_numa_state(dev->id, NUMA_HIT),
- sum_zone_numa_state(dev->id, NUMA_MISS),
- sum_zone_numa_state(dev->id, NUMA_FOREIGN),
- sum_zone_numa_state(dev->id, NUMA_INTERLEAVE_HIT),
- sum_zone_numa_state(dev->id, NUMA_LOCAL),
- sum_zone_numa_state(dev->id, NUMA_OTHER));
+ sum_zone_numa_event_state(dev->id, NUMA_HIT),
+ sum_zone_numa_event_state(dev->id, NUMA_MISS),
+ sum_zone_numa_event_state(dev->id, NUMA_FOREIGN),
+ sum_zone_numa_event_state(dev->id, NUMA_INTERLEAVE_HIT),
+ sum_zone_numa_event_state(dev->id, NUMA_LOCAL),
+ sum_zone_numa_event_state(dev->id, NUMA_OTHER));
}
static DEVICE_ATTR(numastat, 0444, node_read_numastat, NULL);

@@ -514,10 +515,11 @@ static ssize_t node_read_vmstat(struct device *dev,
sum_zone_node_page_state(nid, i));

#ifdef CONFIG_NUMA
- for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
+ fold_vm_numa_events();
+ for (i = 0; i < NR_VM_NUMA_EVENT_ITEMS; i++)
len += sysfs_emit_at(buf, len, "%s %lu\n",
numa_stat_name(i),
- sum_zone_numa_state(nid, i));
+ sum_zone_numa_event_state(nid, i));

#endif
for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 106da8fbc72a..693cd5f24f7d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -135,10 +135,10 @@ enum numa_stat_item {
NUMA_INTERLEAVE_HIT, /* interleaver preferred this zone */
NUMA_LOCAL, /* allocation from local node */
NUMA_OTHER, /* allocation from other node */
- NR_VM_NUMA_STAT_ITEMS
+ NR_VM_NUMA_EVENT_ITEMS
};
#else
-#define NR_VM_NUMA_STAT_ITEMS 0
+#define NR_VM_NUMA_EVENT_ITEMS 0
#endif

enum zone_stat_item {
@@ -357,7 +357,10 @@ struct per_cpu_zonestat {
s8 stat_threshold;
#endif
#ifdef CONFIG_NUMA
- u16 vm_numa_stat_diff[NR_VM_NUMA_STAT_ITEMS];
+ u16 vm_numa_stat_diff[NR_VM_NUMA_EVENT_ITEMS];
+#endif
+#ifdef CONFIG_NUMA
+ unsigned long vm_numa_event[NR_VM_NUMA_EVENT_ITEMS];
#endif
};

@@ -609,7 +612,7 @@ struct zone {
ZONE_PADDING(_pad3_)
/* Zone statistics */
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
- atomic_long_t vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
+ atomic_long_t vm_numa_events[NR_VM_NUMA_EVENT_ITEMS];
} ____cacheline_internodealigned_in_smp;

enum pgdat_flags {
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 1736ea9d24a7..fc14415223c5 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -138,35 +138,27 @@ static inline void vm_events_fold_cpu(int cpu)
* Zone and node-based page accounting with per cpu differentials.
*/
extern atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS];
-extern atomic_long_t vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
extern atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS];

#ifdef CONFIG_NUMA
-static inline void zone_numa_state_add(long x, struct zone *zone,
- enum numa_stat_item item)
-{
- atomic_long_add(x, &zone->vm_numa_stat[item]);
- atomic_long_add(x, &vm_numa_stat[item]);
-}
-
-static inline unsigned long global_numa_state(enum numa_stat_item item)
+static inline unsigned long zone_numa_event_state(struct zone *zone,
+ enum numa_stat_item item)
{
- long x = atomic_long_read(&vm_numa_stat[item]);
-
- return x;
+ return atomic_long_read(&zone->vm_numa_events[item]);
}

-static inline unsigned long zone_numa_state_snapshot(struct zone *zone,
- enum numa_stat_item item)
+static inline unsigned long
+global_numa_event_state(enum numa_stat_item item)
{
- long x = atomic_long_read(&zone->vm_numa_stat[item]);
- int cpu;
+ struct zone *zone;
+ unsigned long x = 0;

- for_each_online_cpu(cpu)
- x += per_cpu_ptr(zone->per_cpu_zonestats, cpu)->vm_numa_stat_diff[item];
+ for_each_populated_zone(zone)
+ x += zone_numa_event_state(zone, item);

return x;
}
+
#endif /* CONFIG_NUMA */

static inline void zone_page_state_add(long x, struct zone *zone,
@@ -245,18 +237,22 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
}

#ifdef CONFIG_NUMA
-extern void __inc_numa_state(struct zone *zone, enum numa_stat_item item);
+extern void __count_numa_event(struct zone *zone, enum numa_stat_item item);
extern unsigned long sum_zone_node_page_state(int node,
enum zone_stat_item item);
-extern unsigned long sum_zone_numa_state(int node, enum numa_stat_item item);
+extern unsigned long sum_zone_numa_event_state(int node, enum numa_stat_item item);
extern unsigned long node_page_state(struct pglist_data *pgdat,
enum node_stat_item item);
extern unsigned long node_page_state_pages(struct pglist_data *pgdat,
enum node_stat_item item);
+extern void fold_vm_numa_events(void);
#else
#define sum_zone_node_page_state(node, item) global_zone_page_state(item)
#define node_page_state(node, item) global_node_page_state(item)
#define node_page_state_pages(node, item) global_node_page_state_pages(item)
+static inline void fold_vm_numa_events(void)
+{
+}
#endif /* CONFIG_NUMA */

#ifdef CONFIG_SMP
@@ -428,7 +424,7 @@ static inline const char *numa_stat_name(enum numa_stat_item item)
static inline const char *node_stat_name(enum node_stat_item item)
{
return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
- NR_VM_NUMA_STAT_ITEMS +
+ NR_VM_NUMA_EVENT_ITEMS +
item];
}

@@ -440,7 +436,7 @@ static inline const char *lru_list_name(enum lru_list lru)
static inline const char *writeback_stat_name(enum writeback_stat_item item)
{
return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
- NR_VM_NUMA_STAT_ITEMS +
+ NR_VM_NUMA_EVENT_ITEMS +
NR_VM_NODE_STAT_ITEMS +
item];
}
@@ -449,7 +445,7 @@ static inline const char *writeback_stat_name(enum writeback_stat_item item)
static inline const char *vm_event_name(enum vm_event_item item)
{
return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
- NR_VM_NUMA_STAT_ITEMS +
+ NR_VM_NUMA_EVENT_ITEMS +
NR_VM_NODE_STAT_ITEMS +
NR_VM_WRITEBACK_STAT_ITEMS +
item];
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index cd0295567a04..99c06a9ae7ee 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2146,7 +2146,7 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
return page;
if (page && page_to_nid(page) == nid) {
preempt_disable();
- __inc_numa_state(page_zone(page), NUMA_INTERLEAVE_HIT);
+ __count_numa_event(page_zone(page), NUMA_INTERLEAVE_HIT);
preempt_enable();
}
return page;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4e92d43c25f6..9d0f047647e3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3424,12 +3424,12 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
local_stat = NUMA_OTHER;

if (zone_to_nid(z) == zone_to_nid(preferred_zone))
- __inc_numa_state(z, NUMA_HIT);
+ __count_numa_event(z, NUMA_HIT);
else {
- __inc_numa_state(z, NUMA_MISS);
- __inc_numa_state(preferred_zone, NUMA_FOREIGN);
+ __count_numa_event(z, NUMA_MISS);
+ __count_numa_event(preferred_zone, NUMA_FOREIGN);
}
- __inc_numa_state(z, local_stat);
+ __count_numa_event(z, local_stat);
#endif
}

@@ -6700,8 +6700,8 @@ void __init setup_per_cpu_pageset(void)
*/
for_each_possible_cpu(cpu) {
struct per_cpu_zonestat *pzstats = &per_cpu(boot_zonestats, cpu);
- memset(pzstats->vm_numa_stat_diff, 0,
- sizeof(pzstats->vm_numa_stat_diff));
+ memset(pzstats->vm_numa_event, 0,
+ sizeof(pzstats->vm_numa_event));
}
#endif

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 8a8f1a26b231..63bd84d122c0 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -41,38 +41,24 @@ static void zero_zone_numa_counters(struct zone *zone)
{
int item, cpu;

- for (item = 0; item < NR_VM_NUMA_STAT_ITEMS; item++) {
- atomic_long_set(&zone->vm_numa_stat[item], 0);
- for_each_online_cpu(cpu)
- per_cpu_ptr(zone->per_cpu_zonestats, cpu)->vm_numa_stat_diff[item]
+ for (item = 0; item < NR_VM_NUMA_EVENT_ITEMS; item++) {
+ atomic_long_set(&zone->vm_numa_events[item], 0);
+ for_each_online_cpu(cpu) {
+ per_cpu_ptr(zone->per_cpu_zonestats, cpu)->vm_numa_event[item]
= 0;
+ }
}
}

-/* zero numa counters of all the populated zones */
-static void zero_zones_numa_counters(void)
+static void invalidate_numa_statistics(void)
{
struct zone *zone;

+ /* zero numa counters of all the populated zones */
for_each_populated_zone(zone)
zero_zone_numa_counters(zone);
}

-/* zero global numa counters */
-static void zero_global_numa_counters(void)
-{
- int item;
-
- for (item = 0; item < NR_VM_NUMA_STAT_ITEMS; item++)
- atomic_long_set(&vm_numa_stat[item], 0);
-}
-
-static void invalid_numa_statistics(void)
-{
- zero_zones_numa_counters();
- zero_global_numa_counters();
-}
-
static DEFINE_MUTEX(vm_numa_stat_lock);

int sysctl_vm_numa_stat_handler(struct ctl_table *table, int write,
@@ -94,7 +80,7 @@ int sysctl_vm_numa_stat_handler(struct ctl_table *table, int write,
pr_info("enable numa statistics\n");
} else {
static_branch_disable(&vm_numa_stat_key);
- invalid_numa_statistics();
+ invalidate_numa_statistics();
pr_info("disable numa statistics, and clear numa counters\n");
}

@@ -161,10 +147,8 @@ void vm_events_fold_cpu(int cpu)
* vm_stat contains the global counters
*/
atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS] __cacheline_aligned_in_smp;
-atomic_long_t vm_numa_stat[NR_VM_NUMA_STAT_ITEMS] __cacheline_aligned_in_smp;
atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS] __cacheline_aligned_in_smp;
EXPORT_SYMBOL(vm_zone_stat);
-EXPORT_SYMBOL(vm_numa_stat);
EXPORT_SYMBOL(vm_node_stat);

#ifdef CONFIG_SMP
@@ -706,8 +690,7 @@ EXPORT_SYMBOL(dec_node_page_state);
* Fold a differential into the global counters.
* Returns the number of counters updated.
*/
-#ifdef CONFIG_NUMA
-static int fold_diff(int *zone_diff, int *numa_diff, int *node_diff)
+static int fold_diff(int *zone_diff, int *node_diff)
{
int i;
int changes = 0;
@@ -718,12 +701,6 @@ static int fold_diff(int *zone_diff, int *numa_diff, int *node_diff)
changes++;
}

- for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
- if (numa_diff[i]) {
- atomic_long_add(numa_diff[i], &vm_numa_stat[i]);
- changes++;
- }
-
for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
if (node_diff[i]) {
atomic_long_add(node_diff[i], &vm_node_stat[i]);
@@ -731,26 +708,36 @@ static int fold_diff(int *zone_diff, int *numa_diff, int *node_diff)
}
return changes;
}
-#else
-static int fold_diff(int *zone_diff, int *node_diff)
+
+#ifdef CONFIG_NUMA
+static void fold_vm_zone_numa_events(struct zone *zone)
{
- int i;
- int changes = 0;
+ int zone_numa_events[NR_VM_NUMA_EVENT_ITEMS] = { 0, };
+ int cpu;
+ enum numa_stat_item item;

- for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
- if (zone_diff[i]) {
- atomic_long_add(zone_diff[i], &vm_zone_stat[i]);
- changes++;
+ for_each_online_cpu(cpu) {
+ struct per_cpu_zonestat *pzstats;
+
+ pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
+ for (item = 0; item < NR_VM_NUMA_EVENT_ITEMS; item++) {
+ zone_numa_events[item] += pzstats->vm_numa_event[item];
+ }
}

- for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
- if (node_diff[i]) {
- atomic_long_add(node_diff[i], &vm_node_stat[i]);
- changes++;
+ for (item = 0; item < NR_VM_NUMA_EVENT_ITEMS; item++) {
+ atomic_long_set(&zone->vm_numa_events[item], zone_numa_events[item]);
}
- return changes;
}
-#endif /* CONFIG_NUMA */
+
+void fold_vm_numa_events(void)
+{
+ struct zone *zone;
+
+ for_each_populated_zone(zone)
+ fold_vm_zone_numa_events(zone);
+}
+#endif

/*
* Update the zone counters for the current cpu.
@@ -774,9 +761,6 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
struct zone *zone;
int i;
int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
-#ifdef CONFIG_NUMA
- int global_numa_diff[NR_VM_NUMA_STAT_ITEMS] = { 0, };
-#endif
int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
int changes = 0;

@@ -799,17 +783,6 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
}
}
#ifdef CONFIG_NUMA
- for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) {
- int v;
-
- v = this_cpu_xchg(pzstats->vm_numa_stat_diff[i], 0);
- if (v) {
-
- atomic_long_add(v, &zone->vm_numa_stat[i]);
- global_numa_diff[i] += v;
- __this_cpu_write(pcp->expire, 3);
- }
- }

if (do_pagesets) {
cond_resched();
@@ -857,12 +830,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
}
}

-#ifdef CONFIG_NUMA
- changes += fold_diff(global_zone_diff, global_numa_diff,
- global_node_diff);
-#else
changes += fold_diff(global_zone_diff, global_node_diff);
-#endif
return changes;
}

@@ -877,9 +845,6 @@ void cpu_vm_stats_fold(int cpu)
struct zone *zone;
int i;
int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
-#ifdef CONFIG_NUMA
- int global_numa_diff[NR_VM_NUMA_STAT_ITEMS] = { 0, };
-#endif
int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };

for_each_populated_zone(zone) {
@@ -887,7 +852,7 @@ void cpu_vm_stats_fold(int cpu)

pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);

- for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
+ for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
if (pzstats->vm_stat_diff[i]) {
int v;

@@ -896,18 +861,7 @@ void cpu_vm_stats_fold(int cpu)
atomic_long_add(v, &zone->vm_stat[i]);
global_zone_diff[i] += v;
}
-
-#ifdef CONFIG_NUMA
- for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
- if (pzstats->vm_numa_stat_diff[i]) {
- int v;
-
- v = pzstats->vm_numa_stat_diff[i];
- pzstats->vm_numa_stat_diff[i] = 0;
- atomic_long_add(v, &zone->vm_numa_stat[i]);
- global_numa_diff[i] += v;
- }
-#endif
+ }
}

for_each_online_pgdat(pgdat) {
@@ -926,11 +880,7 @@ void cpu_vm_stats_fold(int cpu)
}
}

-#ifdef CONFIG_NUMA
- fold_diff(global_zone_diff, global_numa_diff, global_node_diff);
-#else
fold_diff(global_zone_diff, global_node_diff);
-#endif
}

/*
@@ -948,34 +898,17 @@ void drain_zonestat(struct zone *zone, struct per_cpu_zonestat *pzstats)
atomic_long_add(v, &zone->vm_stat[i]);
atomic_long_add(v, &vm_zone_stat[i]);
}
-
-#ifdef CONFIG_NUMA
- for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
- if (pzstats->vm_numa_stat_diff[i]) {
- int v = pzstats->vm_numa_stat_diff[i];
-
- pzstats->vm_numa_stat_diff[i] = 0;
- atomic_long_add(v, &zone->vm_numa_stat[i]);
- atomic_long_add(v, &vm_numa_stat[i]);
- }
-#endif
}
#endif

#ifdef CONFIG_NUMA
-void __inc_numa_state(struct zone *zone,
+/* See __count_vm_event comment on why raw_cpu_inc is used. */
+void __count_numa_event(struct zone *zone,
enum numa_stat_item item)
{
struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
- u16 __percpu *p = pzstats->vm_numa_stat_diff + item;
- u16 v;

- v = __this_cpu_inc_return(*p);
-
- if (unlikely(v > NUMA_STATS_THRESHOLD)) {
- zone_numa_state_add(v, zone, item);
- __this_cpu_write(*p, 0);
- }
+ raw_cpu_inc(pzstats->vm_numa_event[item]);
}

/*
@@ -1000,15 +933,15 @@ unsigned long sum_zone_node_page_state(int node,
* Determine the per node value of a numa stat item. To avoid deviation,
* the per cpu stat number in vm_numa_stat_diff[] is also included.
*/
-unsigned long sum_zone_numa_state(int node,
+unsigned long sum_zone_numa_event_state(int node,
enum numa_stat_item item)
{
struct zone *zones = NODE_DATA(node)->node_zones;
- int i;
unsigned long count = 0;
+ int i;

for (i = 0; i < MAX_NR_ZONES; i++)
- count += zone_numa_state_snapshot(zones + i, item);
+ count += zone_numa_event_state(zones + i, item);

return count;
}
@@ -1679,9 +1612,9 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
zone_page_state(zone, i));

#ifdef CONFIG_NUMA
- for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
+ for (i = 0; i < NR_VM_NUMA_EVENT_ITEMS; i++)
seq_printf(m, "\n %-12s %lu", numa_stat_name(i),
- zone_numa_state_snapshot(zone, i));
+ zone_numa_event_state(zone, i));
#endif

seq_printf(m, "\n pagesets");
@@ -1735,7 +1668,7 @@ static const struct seq_operations zoneinfo_op = {
};

#define NR_VMSTAT_ITEMS (NR_VM_ZONE_STAT_ITEMS + \
- NR_VM_NUMA_STAT_ITEMS + \
+ NR_VM_NUMA_EVENT_ITEMS + \
NR_VM_NODE_STAT_ITEMS + \
NR_VM_WRITEBACK_STAT_ITEMS + \
(IS_ENABLED(CONFIG_VM_EVENT_COUNTERS) ? \
@@ -1750,6 +1683,7 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
return NULL;

BUILD_BUG_ON(ARRAY_SIZE(vmstat_text) < NR_VMSTAT_ITEMS);
+ fold_vm_numa_events();
v = kmalloc_array(NR_VMSTAT_ITEMS, sizeof(unsigned long), GFP_KERNEL);
m->private = v;
if (!v)
@@ -1759,9 +1693,9 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
v += NR_VM_ZONE_STAT_ITEMS;

#ifdef CONFIG_NUMA
- for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
- v[i] = global_numa_state(i);
- v += NR_VM_NUMA_STAT_ITEMS;
+ for (i = 0; i < NR_VM_NUMA_EVENT_ITEMS; i++)
+ v[i] = global_numa_event_state(i);
+ v += NR_VM_NUMA_EVENT_ITEMS;
#endif

for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
@@ -1864,16 +1798,6 @@ int vmstat_refresh(struct ctl_table *table, int write,
err = -EINVAL;
}
}
-#ifdef CONFIG_NUMA
- for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) {
- val = atomic_long_read(&vm_numa_stat[i]);
- if (val < 0) {
- pr_warn("%s: %s %ld\n",
- __func__, numa_stat_name(i), val);
- err = -EINVAL;
- }
- }
-#endif
if (err)
return err;
if (write)
@@ -1922,8 +1846,9 @@ static bool need_update(int cpu)
if (memchr_inv(pzstats->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS *
sizeof(pzstats->vm_stat_diff[0])))
return true;
+
#ifdef CONFIG_NUMA
- if (memchr_inv(pzstats->vm_numa_stat_diff, 0, NR_VM_NUMA_STAT_ITEMS *
+ if (memchr_inv(pzstats->vm_numa_stat_diff, 0, NR_VM_NUMA_EVENT_ITEMS *
sizeof(pzstats->vm_numa_stat_diff[0])))
return true;
#endif
--
2.26.2

2021-04-15 00:28:44

by Mel Gorman

[permalink] [raw]

Subject: [PATCH 10/11] mm/page_alloc: Update PGFREE outside the zone lock in __free_pages_ok

VM events do not need explicit protection by disabling IRQs so
update the counter with IRQs enabled in __free_pages_ok.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a0b210077178..8a94fe77bef7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1569,10 +1569,11 @@ static void __free_pages_ok(struct page *page, unsigned int order,
migratetype = get_pfnblock_migratetype(page, pfn);

spin_lock_irqsave(&zone->lock, flags);
- __count_vm_events(PGFREE, 1 << order);
migratetype = check_migratetype_isolated(zone, page, pfn, migratetype);
__free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
spin_unlock_irqrestore(&zone->lock, flags);
+
+ __count_vm_events(PGFREE, 1 << order);
}

void __free_pages_core(struct page *page, unsigned int order)
--
2.26.2

2021-04-15 00:29:40

by Mel Gorman

[permalink] [raw]

Subject: [PATCH 09/11] mm/page_alloc: Avoid conflating IRQs disabled with zone->lock

Historically when freeing pages, free_one_page() assumed that callers
had IRQs disabled and the zone->lock could be acquired with spin_lock().
This confuses the scope of what local_lock_irq is protecting and what
zone->lock is protecting in free_unref_page_list in particular.

This patch uses spin_lock_irqsave() for the zone->lock in
free_one_page() instead of relying on callers to have disabled
IRQs. free_unref_page_commit() is changed to only deal with PCP pages
protected by the local lock. free_unref_page_list() then first frees
isolated pages to the buddy lists with free_one_page() and frees the rest
of the pages to the PCP via free_unref_page_commit(). The end result
is that free_one_page() is no longer depending on side-effects of
local_lock to be correct.

Note that this may incur a performance penalty while memory hot-remove
is running but that is not a common operation.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 67 ++++++++++++++++++++++++++++++-------------------
1 file changed, 41 insertions(+), 26 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6791e9361076..a0b210077178 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1473,10 +1473,12 @@ static void free_one_page(struct zone *zone,
unsigned int order,
int migratetype, fpi_t fpi_flags)
{
- spin_lock(&zone->lock);
+ unsigned long flags;
+
+ spin_lock_irqsave(&zone->lock, flags);
migratetype = check_migratetype_isolated(zone, page, pfn, migratetype);
__free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
- spin_unlock(&zone->lock);
+ spin_unlock_irqrestore(&zone->lock, flags);
}

static void __meminit __init_single_page(struct page *page, unsigned long pfn,
@@ -3238,31 +3240,13 @@ static bool free_unref_page_prepare(struct page *page, unsigned long pfn)
return true;
}

-static void free_unref_page_commit(struct page *page, unsigned long pfn)
+static void free_unref_page_commit(struct page *page, unsigned long pfn,
+ int migratetype)
{
struct zone *zone = page_zone(page);
struct per_cpu_pages *pcp;
- int migratetype;

- migratetype = get_pcppage_migratetype(page);
__count_vm_event(PGFREE);
-
- /*
- * We only track unmovable, reclaimable and movable on pcp lists.
- * Free ISOLATE pages back to the allocator because they are being
- * offlined but treat HIGHATOMIC as movable pages so we can get those
- * areas back if necessary. Otherwise, we may have to free
- * excessively into the page allocator
- */
- if (migratetype >= MIGRATE_PCPTYPES) {
- if (unlikely(is_migrate_isolate(migratetype))) {
- free_one_page(zone, page, pfn, 0, migratetype,
- FPI_NONE);
- return;
- }
- migratetype = MIGRATE_MOVABLE;
- }
-
pcp = this_cpu_ptr(zone->per_cpu_pageset);
list_add(&page->lru, &pcp->lists[migratetype]);
pcp->count++;
@@ -3277,12 +3261,29 @@ void free_unref_page(struct page *page)
{
unsigned long flags;
unsigned long pfn = page_to_pfn(page);
+ int migratetype;

if (!free_unref_page_prepare(page, pfn))
return;

+ /*
+ * We only track unmovable, reclaimable and movable on pcp lists.
+ * Place ISOLATE pages on the isolated list because they are being
+ * offlined but treat HIGHATOMIC as movable pages so we can get those
+ * areas back if necessary. Otherwise, we may have to free
+ * excessively into the page allocator
+ */
+ migratetype = get_pcppage_migratetype(page);
+ if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
+ if (unlikely(is_migrate_isolate(migratetype))) {
+ free_one_page(page_zone(page), page, pfn, 0, migratetype, FPI_NONE);
+ return;
+ }
+ migratetype = MIGRATE_MOVABLE;
+ }
+
local_lock_irqsave(&pagesets.lock, flags);
- free_unref_page_commit(page, pfn);
+ free_unref_page_commit(page, pfn, migratetype);
local_unlock_irqrestore(&pagesets.lock, flags);
}

@@ -3294,6 +3295,7 @@ void free_unref_page_list(struct list_head *list)
struct page *page, *next;
unsigned long flags, pfn;
int batch_count = 0;
+ int migratetype;

/* Prepare pages for freeing */
list_for_each_entry_safe(page, next, list, lru) {
@@ -3301,15 +3303,28 @@ void free_unref_page_list(struct list_head *list)
if (!free_unref_page_prepare(page, pfn))
list_del(&page->lru);
set_page_private(page, pfn);
+
+ /*
+ * Free isolated pages directly to the allocator, see
+ * comment in free_unref_page.
+ */
+ migratetype = get_pcppage_migratetype(page);
+ if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
+ if (unlikely(is_migrate_isolate(migratetype))) {
+ free_one_page(page_zone(page), page, pfn, 0,
+ migratetype, FPI_NONE);
+ list_del(&page->lru);
+ }
+ }
}

local_lock_irqsave(&pagesets.lock, flags);
list_for_each_entry_safe(page, next, list, lru) {
- unsigned long pfn = page_private(page);
-
+ pfn = page_private(page);
set_page_private(page, 0);
+ migratetype = get_pcppage_migratetype(page);
trace_mm_page_free_batched(page);
- free_unref_page_commit(page, pfn);
+ free_unref_page_commit(page, pfn, migratetype);

/*
* Guard against excessive IRQ disabled times when we get
--
2.26.2

2021-04-15 12:26:38

by Vlastimil Babka

[permalink] [raw]

Subject: Re: [PATCH 09/11] mm/page_alloc: Avoid conflating IRQs disabled with zone->lock

On 4/14/21 3:39 PM, Mel Gorman wrote:
> Historically when freeing pages, free_one_page() assumed that callers
> had IRQs disabled and the zone->lock could be acquired with spin_lock().
> This confuses the scope of what local_lock_irq is protecting and what
> zone->lock is protecting in free_unref_page_list in particular.
>
> This patch uses spin_lock_irqsave() for the zone->lock in
> free_one_page() instead of relying on callers to have disabled
> IRQs. free_unref_page_commit() is changed to only deal with PCP pages
> protected by the local lock. free_unref_page_list() then first frees
> isolated pages to the buddy lists with free_one_page() and frees the rest
> of the pages to the PCP via free_unref_page_commit(). The end result
> is that free_one_page() is no longer depending on side-effects of
> local_lock to be correct.
>
> Note that this may incur a performance penalty while memory hot-remove
> is running but that is not a common operation.
>
> Signed-off-by: Mel Gorman <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>

A nit below:

> @@ -3294,6 +3295,7 @@ void free_unref_page_list(struct list_head *list)
> struct page *page, *next;
> unsigned long flags, pfn;
> int batch_count = 0;
> + int migratetype;
>
> /* Prepare pages for freeing */
> list_for_each_entry_safe(page, next, list, lru) {
> @@ -3301,15 +3303,28 @@ void free_unref_page_list(struct list_head *list)
> if (!free_unref_page_prepare(page, pfn))
> list_del(&page->lru);
> set_page_private(page, pfn);

Should probably move this below so we don't set private for pages that then go
through free_one_page()? Doesn't seem to be a bug, just unneccessary.

> +
> + /*
> + * Free isolated pages directly to the allocator, see
> + * comment in free_unref_page.
> + */
> + migratetype = get_pcppage_migratetype(page);
> + if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
> + if (unlikely(is_migrate_isolate(migratetype))) {
> + free_one_page(page_zone(page), page, pfn, 0,
> + migratetype, FPI_NONE);
> + list_del(&page->lru);
> + }
> + }
> }
>
> local_lock_irqsave(&pagesets.lock, flags);
> list_for_each_entry_safe(page, next, list, lru) {
> - unsigned long pfn = page_private(page);
> -
> + pfn = page_private(page);
> set_page_private(page, 0);
> + migratetype = get_pcppage_migratetype(page);
> trace_mm_page_free_batched(page);
> - free_unref_page_commit(page, pfn);
> + free_unref_page_commit(page, pfn, migratetype);
>
> /*
> * Guard against excessive IRQ disabled times when we get
>

2021-04-15 13:06:01

by Vlastimil Babka

[permalink] [raw]

Subject: Re: [PATCH 10/11] mm/page_alloc: Update PGFREE outside the zone lock in __free_pages_ok

On 4/14/21 3:39 PM, Mel Gorman wrote:
> VM events do not need explicit protection by disabling IRQs so
> update the counter with IRQs enabled in __free_pages_ok.
>
> Signed-off-by: Mel Gorman <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>

> ---
> mm/page_alloc.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a0b210077178..8a94fe77bef7 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1569,10 +1569,11 @@ static void __free_pages_ok(struct page *page, unsigned int order,
> migratetype = get_pfnblock_migratetype(page, pfn);
>
> spin_lock_irqsave(&zone->lock, flags);
> - __count_vm_events(PGFREE, 1 << order);
> migratetype = check_migratetype_isolated(zone, page, pfn, migratetype);
> __free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
> spin_unlock_irqrestore(&zone->lock, flags);
> +
> + __count_vm_events(PGFREE, 1 << order);
> }
>
> void __free_pages_core(struct page *page, unsigned int order)
>

2021-04-15 14:14:40

by Mel Gorman

[permalink] [raw]

Subject: Re: [PATCH 09/11] mm/page_alloc: Avoid conflating IRQs disabled with zone->lock

On Thu, Apr 15, 2021 at 02:25:36PM +0200, Vlastimil Babka wrote:
> > @@ -3294,6 +3295,7 @@ void free_unref_page_list(struct list_head *list)
> > struct page *page, *next;
> > unsigned long flags, pfn;
> > int batch_count = 0;
> > + int migratetype;
> >
> > /* Prepare pages for freeing */
> > list_for_each_entry_safe(page, next, list, lru) {
> > @@ -3301,15 +3303,28 @@ void free_unref_page_list(struct list_head *list)
> > if (!free_unref_page_prepare(page, pfn))
> > list_del(&page->lru);
> > set_page_private(page, pfn);
>
> Should probably move this below so we don't set private for pages that then go
> through free_one_page()? Doesn't seem to be a bug, just unneccessary.
>

Sure.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1d87ca364680..a9c1282d9c7b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3293,7 +3293,6 @@ void free_unref_page_list(struct list_head *list)
pfn = page_to_pfn(page);
if (!free_unref_page_prepare(page, pfn))
list_del(&page->lru);
- set_page_private(page, pfn);

/*
* Free isolated pages directly to the allocator, see
@@ -3307,6 +3306,8 @@ void free_unref_page_list(struct list_head *list)
list_del(&page->lru);
}
}
+
+ set_page_private(page, pfn);
}

local_lock_irqsave(&pagesets.lock, flags);

--
Mel Gorman
SUSE Labs