2021-04-19 18:37:49

by Mel Gorman

Subject: [PATCH 00/10 v4] Use local_lock for pcp protection and reduce stat overhead

Some Acks from RT people are still missing that I'd like to have before
trying to merge this via Andrew's tree, and there is an open question of
whether the last patch in this series is worthwhile. It embeds the
local_lock within the per_cpu_pages structure to clarify the scope but
it increases complexity and storage costs that may not be worthwhile. I
don't think it is, but it was asked whether the lock could be safely
embedded or not.

Changelog since v3
o Preserve NUMA_* counters after CPU hotplug
o Drop "mm/page_alloc: Remove duplicate checks if migratetype should be isolated"
o Add micro-optimisation tracking PFN during free_unref_page_list
o Add Acks

Changelog since v2
o Fix zonestats initialisation
o Merged memory hotplug fix separately
o Embed local_lock within per_cpu_pages

This series requires patches in Andrew's tree so for convenience, it's
also available at

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-percpu-local_lock-v4r4

The PCP (per-cpu page allocator in page_alloc.c) shares locking
requirements with vmstat and the zone lock, which is inconvenient and
causes some issues. First, the PCP lists and vmstat share the same
per-cpu space, meaning that vmstat updates can dirty cache lines holding
the per-cpu lists across CPUs unless padding is used. Second, PREEMPT_RT
does not want to disable IRQs for too long in the page allocator.

This series splits the locking requirements, uses lock types more
suitable for PREEMPT_RT, reduces the time when special locking is
required for stats and reduces the time when IRQs need to be disabled on
!PREEMPT_RT kernels.

Why local_lock? PREEMPT_RT considers the following sequence to be
unsafe, as documented in Documentation/locking/locktypes.rst:

local_irq_disable();
spin_lock(&lock);

The PCP allocator has this sequence for rmqueue_pcplist (local_irq_save)
-> __rmqueue_pcplist -> rmqueue_bulk (spin_lock). While it's possible to
separate this out, it generally means there are points where IRQs are
enabled and then immediately disabled again. To prevent migration and
the per-cpu pointer going stale, migrate_disable is also needed, which
amounts to a custom lock that is similar to, but worse than, local_lock.
Furthermore, on PREEMPT_RT, it's undesirable to leave IRQs disabled for
too long. By converting to local_lock, which disables migration on
PREEMPT_RT, the locking requirements can be separated and the
protections for PCP, stats and the zone lock can start moving to
PREEMPT_RT-safe equivalents. As a bonus, local_lock also means that
PROVE_LOCKING does something useful.
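
As a concrete illustration, this is the conversion pattern the series
applies (a minimal sketch; patch 02 below does exactly this for the PCP
structure, and the pagesets.lock name is taken from that patch):

	/* Before: what local_irq_save protects is implicit */
	local_irq_save(flags);
	pcp = this_cpu_ptr(zone->per_cpu_pageset);
	/* ... manipulate pcp lists ... */
	local_irq_restore(flags);

	/* After: the lock names what is protected. On !PREEMPT_RT
	 * this still disables IRQs; on PREEMPT_RT it becomes a
	 * per-CPU lock that disables migration instead of IRQs.
	 */
	local_lock_irqsave(&pagesets.lock, flags);
	pcp = this_cpu_ptr(zone->per_cpu_pageset);
	/* ... manipulate pcp lists ... */
	local_unlock_irqrestore(&pagesets.lock, flags);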

After that, it's obvious that zone_statistics incurs too much overhead
and leaves IRQs disabled for longer than necessary on !PREEMPT_RT
kernels. zone_statistics uses perfectly accurate counters, requiring
that IRQs be disabled for the parallel RMW sequences, when inaccurate
ones like vm_events would do. The series makes the NUMA statistics
(NUMA_HIT and friends) inaccurate counters that then require no special
protection on !PREEMPT_RT.
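
Concretely, an "inaccurate" counter update is a plain per-cpu add with
no protection against preemption or IRQs, so a rare lost update is
tolerated and readers sum the per-cpu values for an approximate total.
A sketch matching the helper added by patches 4 and 5:

	/* See __count_vm_event comment on why raw_cpu_add is used. */
	static inline void
	__count_numa_events(struct zone *zone, enum numa_stat_item item,
			    long delta)
	{
		struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;

		/* May race with preemption/IRQs; an occasionally lost
		 * update is acceptable for these statistics.
		 */
		raw_cpu_add(pzstats->vm_numa_event[item], delta);
	}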

The bulk page allocator can then do stat updates in bulk with IRQs
enabled, which should improve efficiency. Technically, this could have
been done without the local_lock and vmstat conversion work; the order
simply reflects the timing of when the different series were implemented.
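
The shape of that batching, roughly as implemented by patches 5 and 6
below (a sketch, not the literal diff), is to count inside the locked
section and issue a single stats update after the lock is dropped:

	nr_account = 0;
	local_lock_irqsave(&pagesets.lock, flags);
	while (nr_populated < nr_pages) {
		page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags,
					 pcp, pcp_list);
		if (!page)
			break;
		nr_account++;
		/* ... prep page and store it in page_list/page_array ... */
		nr_populated++;
	}
	local_unlock_irqrestore(&pagesets.lock, flags);

	/* Single batched update with IRQs enabled */
	__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
	zone_statistics(ac.preferred_zoneref->zone, zone, nr_account);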

Finally, there are places where we conflate IRQs being disabled for the
PCP with the IRQ-safe zone spinlock. The remainder of the series reduces
the scope of what is protected by disabled IRQs on !PREEMPT_RT kernels.
By the end of the series, page_alloc.c does not call local_irq_save, so
the locking scope is a bit clearer. The one exception is that modifying
NR_FREE_PAGES still happens in places where IRQs are known to be
disabled, as it's harmless for PREEMPT_RT and it would be expensive to
split the locking there.

No performance data is included because, despite the overhead of the
stats, it's within the noise for most workloads on !PREEMPT_RT. However,
Jesper Dangaard Brouer ran a page allocation microbenchmark on an
E5-1650 v4 @ 3.60GHz CPU on the first version of this series. Focusing
on the array variant of the bulk page allocator reveals the following.

(CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz)
ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size

Bulk step    Baseline    Patched
        1      56.383     54.225 (+3.83%)
        2      40.047     35.492 (+11.38%)
        3      37.339     32.643 (+12.58%)
        4      35.578     30.992 (+12.89%)
        8      33.592     29.606 (+11.87%)
       16      32.362     28.532 (+11.85%)
       32      31.476     27.728 (+11.91%)
       64      30.633     27.252 (+11.04%)
      128      30.596     27.090 (+11.46%)

While this is a positive outcome, the series is more likely to be
interesting to the RT people in terms of getting parts of the PREEMPT_RT
tree into mainline.

drivers/base/node.c | 18 +--
include/linux/mmzone.h | 60 +++++++--
include/linux/vmstat.h | 65 ++++++----
mm/mempolicy.c | 2 +-
mm/page_alloc.c | 274 +++++++++++++++++++++++++----------------
mm/vmstat.c | 255 +++++++++++++++-----------------------
6 files changed, 369 insertions(+), 305 deletions(-)

--
2.26.2


2021-04-19 18:39:20

by Mel Gorman

Subject: [PATCH 06/10] mm/page_alloc: Reduce duration that IRQs are disabled for VM counters

IRQs are left disabled for the zone and node VM event counters. This is
unnecessary as the affected counters are allowed to race with preemption
and IRQs.

This patch reduces the scope of IRQs being disabled via
local_[lock|unlock]_irq on !PREEMPT_RT kernels. One call to
__mod_zone_freepage_state is still made with IRQs disabled. While this
could be moved out, it's not free on all architectures as some require
IRQs to be disabled for mod_zone_page_state on !PREEMPT_RT kernels.

Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
mm/page_alloc.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cff0f1c98b28..295624fe293b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3474,11 +3474,11 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
pcp = this_cpu_ptr(zone->per_cpu_pageset);
list = &pcp->lists[migratetype];
page = __rmqueue_pcplist(zone, migratetype, alloc_flags, pcp, list);
+ local_unlock_irqrestore(&pagesets.lock, flags);
if (page) {
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1);
zone_statistics(preferred_zone, zone, 1);
}
- local_unlock_irqrestore(&pagesets.lock, flags);
return page;
}

@@ -3530,15 +3530,15 @@ struct page *rmqueue(struct zone *preferred_zone,
if (!page)
page = __rmqueue(zone, order, migratetype, alloc_flags);
} while (page && check_new_pages(page, order));
- spin_unlock(&zone->lock);
if (!page)
goto failed;
+
__mod_zone_freepage_state(zone, -(1 << order),
get_pcppage_migratetype(page));
+ spin_unlock_irqrestore(&zone->lock, flags);

__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
zone_statistics(preferred_zone, zone, 1);
- local_irq_restore(flags);

out:
/* Separate test+clear to avoid unnecessary atomics */
@@ -3551,7 +3551,7 @@ struct page *rmqueue(struct zone *preferred_zone,
return page;

failed:
- local_irq_restore(flags);
+ spin_unlock_irqrestore(&zone->lock, flags);
return NULL;
}

@@ -5103,11 +5103,11 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
nr_populated++;
}

+ local_unlock_irqrestore(&pagesets.lock, flags);
+
__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
zone_statistics(ac.preferred_zoneref->zone, zone, nr_account);

- local_unlock_irqrestore(&pagesets.lock, flags);
-
return nr_populated;

failed_irq:
--
2.26.2

2021-04-19 18:53:19

by Mel Gorman

Subject: [PATCH 02/10] mm/page_alloc: Convert per-cpu list protection to local_lock

There is a lack of clarity about what exactly
local_irq_save/local_irq_restore protects in page_alloc.c. It conflates
the protection of per-cpu page allocation structures with per-cpu
vmstat deltas.

This patch protects the PCP structure using local_lock, which for most
configurations is identical to IRQ enabling/disabling. The scope of the
lock is still wider than it should be, but it is reduced later in the
series.

It is possible for the local_lock to be embedded safely within struct
per_cpu_pages but it adds complexity to free_unref_page_list so it is
implemented as a separate patch later in the series.
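
For reference, the embedded variant under discussion would look roughly
like this (a sketch only; the actual layout is in that later patch):

	/* Fields and lists protected by the embedded local_lock */
	struct per_cpu_pages {
		local_lock_t lock;	/* protects everything below */
		int count;		/* number of pages in the list */
		int high;		/* high watermark, emptying needed */
		/* ... remaining fields and the pcp lists ... */
	};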

[[email protected]: Make pagesets static]
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 2 ++
mm/page_alloc.c | 50 +++++++++++++++++++++++++++++-------------
2 files changed, 37 insertions(+), 15 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a4393ac27336..106da8fbc72a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -20,6 +20,7 @@
#include <linux/atomic.h>
#include <linux/mm_types.h>
#include <linux/page-flags.h>
+#include <linux/local_lock.h>
#include <asm/page.h>

/* Free memory management - zoned buddy allocator. */
@@ -337,6 +338,7 @@ enum zone_watermarks {
#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost)
#define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost)

+/* Fields and list protected by pagesets local_lock in page_alloc.c */
struct per_cpu_pages {
int count; /* number of pages in the list */
int high; /* high watermark, emptying needed */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2d6283cab22d..4e92d43c25f6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -112,6 +112,13 @@ typedef int __bitwise fpi_t;
static DEFINE_MUTEX(pcp_batch_high_lock);
#define MIN_PERCPU_PAGELIST_FRACTION (8)

+struct pagesets {
+ local_lock_t lock;
+};
+static DEFINE_PER_CPU(struct pagesets, pagesets) = {
+ .lock = INIT_LOCAL_LOCK(lock),
+};
+
#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
DEFINE_PER_CPU(int, numa_node);
EXPORT_PER_CPU_SYMBOL(numa_node);
@@ -1421,6 +1428,10 @@ static void free_pcppages_bulk(struct zone *zone, int count,
} while (--count && --batch_free && !list_empty(list));
}

+ /*
+ * local_lock_irq held so equivalent to spin_lock_irqsave for
+ * both PREEMPT_RT and non-PREEMPT_RT configurations.
+ */
spin_lock(&zone->lock);
isolated_pageblocks = has_isolate_pageblock(zone);

@@ -1541,6 +1552,11 @@ static void __free_pages_ok(struct page *page, unsigned int order,
return;

migratetype = get_pfnblock_migratetype(page, pfn);
+
+ /*
+ * TODO FIX: Disable IRQs before acquiring IRQ-safe zone->lock
+ * and protect vmstat updates.
+ */
local_irq_save(flags);
__count_vm_events(PGFREE, 1 << order);
free_one_page(page_zone(page), page, pfn, order, migratetype,
@@ -2910,6 +2926,10 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
{
int i, allocated = 0;

+ /*
+ * local_lock_irq held so equivalent to spin_lock_irqsave for
+ * both PREEMPT_RT and non-PREEMPT_RT configurations.
+ */
spin_lock(&zone->lock);
for (i = 0; i < count; ++i) {
struct page *page = __rmqueue(zone, order, migratetype,
@@ -2962,12 +2982,12 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
unsigned long flags;
int to_drain, batch;

- local_irq_save(flags);
+ local_lock_irqsave(&pagesets.lock, flags);
batch = READ_ONCE(pcp->batch);
to_drain = min(pcp->count, batch);
if (to_drain > 0)
free_pcppages_bulk(zone, to_drain, pcp);
- local_irq_restore(flags);
+ local_unlock_irqrestore(&pagesets.lock, flags);
}
#endif

@@ -2983,13 +3003,13 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
unsigned long flags;
struct per_cpu_pages *pcp;

- local_irq_save(flags);
+ local_lock_irqsave(&pagesets.lock, flags);

pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
if (pcp->count)
free_pcppages_bulk(zone, pcp->count, pcp);

- local_irq_restore(flags);
+ local_unlock_irqrestore(&pagesets.lock, flags);
}

/*
@@ -3252,9 +3272,9 @@ void free_unref_page(struct page *page)
if (!free_unref_page_prepare(page, pfn))
return;

- local_irq_save(flags);
+ local_lock_irqsave(&pagesets.lock, flags);
free_unref_page_commit(page, pfn);
- local_irq_restore(flags);
+ local_unlock_irqrestore(&pagesets.lock, flags);
}

/*
@@ -3274,7 +3294,7 @@ void free_unref_page_list(struct list_head *list)
set_page_private(page, pfn);
}

- local_irq_save(flags);
+ local_lock_irqsave(&pagesets.lock, flags);
list_for_each_entry_safe(page, next, list, lru) {
unsigned long pfn = page_private(page);

@@ -3287,12 +3307,12 @@ void free_unref_page_list(struct list_head *list)
* a large list of pages to free.
*/
if (++batch_count == SWAP_CLUSTER_MAX) {
- local_irq_restore(flags);
+ local_unlock_irqrestore(&pagesets.lock, flags);
batch_count = 0;
- local_irq_save(flags);
+ local_lock_irqsave(&pagesets.lock, flags);
}
}
- local_irq_restore(flags);
+ local_unlock_irqrestore(&pagesets.lock, flags);
}

/*
@@ -3449,7 +3469,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
struct page *page;
unsigned long flags;

- local_irq_save(flags);
+ local_lock_irqsave(&pagesets.lock, flags);
pcp = this_cpu_ptr(zone->per_cpu_pageset);
list = &pcp->lists[migratetype];
page = __rmqueue_pcplist(zone, migratetype, alloc_flags, pcp, list);
@@ -3457,7 +3477,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1);
zone_statistics(preferred_zone, zone);
}
- local_irq_restore(flags);
+ local_unlock_irqrestore(&pagesets.lock, flags);
return page;
}

@@ -5052,7 +5072,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
goto failed;

/* Attempt the batch allocation */
- local_irq_save(flags);
+ local_lock_irqsave(&pagesets.lock, flags);
pcp = this_cpu_ptr(zone->per_cpu_pageset);
pcp_list = &pcp->lists[ac.migratetype];

@@ -5090,12 +5110,12 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
nr_populated++;
}

- local_irq_restore(flags);
+ local_unlock_irqrestore(&pagesets.lock, flags);

return nr_populated;

failed_irq:
- local_irq_restore(flags);
+ local_unlock_irqrestore(&pagesets.lock, flags);

failed:
page = __alloc_pages(gfp, 0, preferred_nid, nodemask);
--
2.26.2

2021-04-19 18:53:28

by Mel Gorman

Subject: [PATCH 05/10] mm/page_alloc: Batch the accounting updates in the bulk allocator

Now that the zone_statistics are simple counters that do not require
special protection, the bulk allocator accounting updates can be batched
without the complexity of protected RMW updates or xchg.

Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
include/linux/vmstat.h | 8 ++++++++
mm/page_alloc.c | 30 +++++++++++++-----------------
2 files changed, 21 insertions(+), 17 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index c1d2c316ce7d..9bf194d507e7 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -247,6 +247,14 @@ __count_numa_event(struct zone *zone, enum numa_stat_item item)
raw_cpu_inc(pzstats->vm_numa_event[item]);
}

+static inline void
+__count_numa_events(struct zone *zone, enum numa_stat_item item, long delta)
+{
+ struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
+
+ raw_cpu_add(pzstats->vm_numa_event[item], delta);
+}
+
extern unsigned long sum_zone_node_page_state(int node,
enum zone_stat_item item);
extern unsigned long sum_zone_numa_event_state(int node, enum numa_stat_item item);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9d0f047647e3..cff0f1c98b28 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3411,7 +3411,8 @@ void __putback_isolated_page(struct page *page, unsigned int order, int mt)
*
* Must be called with interrupts disabled.
*/
-static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
+static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
+ long nr_account)
{
#ifdef CONFIG_NUMA
enum numa_stat_item local_stat = NUMA_LOCAL;
@@ -3424,12 +3425,12 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
local_stat = NUMA_OTHER;

if (zone_to_nid(z) == zone_to_nid(preferred_zone))
- __count_numa_event(z, NUMA_HIT);
+ __count_numa_events(z, NUMA_HIT, nr_account);
else {
- __count_numa_event(z, NUMA_MISS);
- __count_numa_event(preferred_zone, NUMA_FOREIGN);
+ __count_numa_events(z, NUMA_MISS, nr_account);
+ __count_numa_events(preferred_zone, NUMA_FOREIGN, nr_account);
}
- __count_numa_event(z, local_stat);
+ __count_numa_events(z, local_stat, nr_account);
#endif
}

@@ -3475,7 +3476,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
page = __rmqueue_pcplist(zone, migratetype, alloc_flags, pcp, list);
if (page) {
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1);
- zone_statistics(preferred_zone, zone);
+ zone_statistics(preferred_zone, zone, 1);
}
local_unlock_irqrestore(&pagesets.lock, flags);
return page;
@@ -3536,7 +3537,7 @@ struct page *rmqueue(struct zone *preferred_zone,
get_pcppage_migratetype(page));

__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
- zone_statistics(preferred_zone, zone);
+ zone_statistics(preferred_zone, zone, 1);
local_irq_restore(flags);

out:
@@ -5019,7 +5020,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
struct alloc_context ac;
gfp_t alloc_gfp;
unsigned int alloc_flags = ALLOC_WMARK_LOW;
- int nr_populated = 0;
+ int nr_populated = 0, nr_account = 0;

if (unlikely(nr_pages <= 0))
return 0;
@@ -5092,15 +5093,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
goto failed_irq;
break;
}
-
- /*
- * Ideally this would be batched but the best way to do
- * that cheaply is to first convert zone_statistics to
- * be inaccurate per-cpu counter like vm_events to avoid
- * a RMW cycle then do the accounting with IRQs enabled.
- */
- __count_zid_vm_events(PGALLOC, zone_idx(zone), 1);
- zone_statistics(ac.preferred_zoneref->zone, zone);
+ nr_account++;

prep_new_page(page, 0, gfp, 0);
if (page_list)
@@ -5110,6 +5103,9 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
nr_populated++;
}

+ __count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
+ zone_statistics(ac.preferred_zoneref->zone, zone, nr_account);
+
local_unlock_irqrestore(&pagesets.lock, flags);

return nr_populated;
--
2.26.2

2021-04-19 18:53:28

by Mel Gorman

Subject: [PATCH 04/10] mm/vmstat: Inline NUMA event counter updates

__count_numa_event is small enough to be treated similarly to
__count_vm_event, so inline it.

Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
include/linux/vmstat.h | 10 +++++++++-
mm/vmstat.c | 9 ---------
2 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index b120c58cae82..c1d2c316ce7d 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -238,7 +238,15 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
}

#ifdef CONFIG_NUMA
-extern void __count_numa_event(struct zone *zone, enum numa_stat_item item);
+/* See __count_vm_event comment on why raw_cpu_inc is used. */
+static inline void
+__count_numa_event(struct zone *zone, enum numa_stat_item item)
+{
+ struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
+
+ raw_cpu_inc(pzstats->vm_numa_event[item]);
+}
+
extern unsigned long sum_zone_node_page_state(int node,
enum zone_stat_item item);
extern unsigned long sum_zone_numa_event_state(int node, enum numa_stat_item item);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 97558cb607ac..4e6c474a3cd6 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -935,15 +935,6 @@ void drain_zonestat(struct zone *zone, struct per_cpu_zonestat *pzstats)
#endif

#ifdef CONFIG_NUMA
-/* See __count_vm_event comment on why raw_cpu_inc is used. */
-void __count_numa_event(struct zone *zone,
- enum numa_stat_item item)
-{
- struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
-
- raw_cpu_inc(pzstats->vm_numa_event[item]);
-}
-
/*
* Determine the per node value of a stat item. This function
* is called frequently in a NUMA machine, so try to be as
--
2.26.2