2021-04-07 22:03:00

by Mel Gorman

Subject: [PATCH 0/11 v2] Use local_lock for pcp protection and reduce stat overhead

For MM people, the whole series is relevant but patch 3 needs particular
attention for memory hotremove as I had problems testing it because full
zone removal always failed for me. For RT people, the most interesting
patches are 2, 9 and 10 with 2 being the most important.

This series requires patches in Andrew's tree so for convenience, it's also available at

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-percpu-local_lock-v2r10

The PCP (per-cpu page allocator in page_alloc.c) shares locking
requirements with vmstat and the zone lock, which is inconvenient and
causes some issues. First, the PCP list and vmstat share the same
per-cpu space, meaning that vmstat updates can dirty the cache lines
holding the per-cpu lists across CPUs unless padding is used. Second,
PREEMPT_RT does not want IRQs disabled in the page allocator because it
leaves IRQs disabled for an unnecessarily long time.

This series splits the locking requirements and uses lock types more
suitable for PREEMPT_RT, reduces the time when special locking is required
for stats, and reduces the time when IRQs need to be disabled on
!PREEMPT_RT kernels.

Why local_lock? PREEMPT_RT considers the following sequence to be unsafe
as documented in Documentation/locking/locktypes.rst

local_irq_disable();
raw_spin_lock(&lock);

The page allocator does not use raw_spin_lock, but using local_irq_save
is undesirable on PREEMPT_RT as it leaves IRQs disabled for an excessive
length of time. By converting to local_lock, which only disables migration
on PREEMPT_RT, the locking requirements can be separated and the
protections for the PCP, stats and the zone lock can start moving to
PREEMPT_RT-safe equivalents. As a bonus, local_lock also means that
PROVE_LOCKING does something useful.
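
As a rough sketch of the direction this takes (the structure and field
names mirror what the series adds to mm/page_alloc.c and the diffs below,
but treat this as illustrative rather than the final code; the function
name pcp_fastpath_sketch is made up for the example), the PCP fast path
ends up guarded by a per-CPU local_lock instead of a bare local_irq_save():

#include <linux/local_lock.h>
#include <linux/percpu.h>

/* Illustrative sketch only, not the final code in the series. */
struct pagesets {
        local_lock_t lock;
};
static DEFINE_PER_CPU(struct pagesets, pagesets) = {
        .lock = INIT_LOCAL_LOCK(lock),
};

static void pcp_fastpath_sketch(void)
{
        unsigned long flags;

        /*
         * On !PREEMPT_RT this behaves like local_irq_save(). On
         * PREEMPT_RT it takes a per-CPU spinlock and disables
         * migration, leaving IRQs enabled.
         */
        local_lock_irqsave(&pagesets.lock, flags);
        /* ... manipulate this CPU's per-cpu page lists ... */
        local_unlock_irqrestore(&pagesets.lock, flags);
}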

After that, it was very obvious that zone_statistics in particular has
far too much overhead and leaves IRQs disabled for longer than necessary
on !PREEMPT_RT kernels. zone_statistics uses perfectly accurate counters
that require IRQs to be disabled to protect their parallel RMW sequences,
when inaccurate counters like vm_events would do. The series turns the
NUMA statistics (NUMA_HIT and friends) into inaccurate counters that then
require no special protection on !PREEMPT_RT.
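
As a simplified illustration of what "inaccurate" means here (the counter
and function names below are hypothetical examples, not the kernel's NUMA
stat code), writers increment a raw per-CPU counter with no protection
against a lost update and readers sum the per-CPU values lazily:

#include <linux/percpu.h>
#include <linux/cpumask.h>

/* Hypothetical example counter, not the real vm_numa_event array. */
static DEFINE_PER_CPU(unsigned long, numa_hit_events);

static inline void count_numa_hit_sketch(void)
{
        /*
         * raw_cpu_inc() does not protect the read-modify-write against
         * an interrupt or preemption on this CPU; a rare lost increment
         * is the price for never disabling IRQs around the update.
         */
        raw_cpu_inc(numa_hit_events);
}

static unsigned long sum_numa_hit_sketch(void)
{
        unsigned long sum = 0;
        int cpu;

        /* Readers sum the per-CPU counts; the total is approximate. */
        for_each_online_cpu(cpu)
                sum += per_cpu(numa_hit_events, cpu);
        return sum;
}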

The bulk page allocator can then do its stat updates in bulk with IRQs
enabled, which should improve efficiency. Technically, this could have
been done without the local_lock and vmstat conversion work; the ordering
simply reflects when the different series were implemented.

Finally, there are places where we conflate IRQs being disabled for the
PCP with the IRQ-safe zone spinlock. The remainder of the series reduces
the scope of what is protected by disabled IRQs on !PREEMPT_RT kernels.
By the end of the series, page_alloc.c does not call local_irq_save, so
the locking scope is clearer. The one exception is that NR_FREE_PAGES is
still modified in places where it is known that IRQs are disabled, as
that is harmless for PREEMPT_RT and splitting the locking there would be
expensive.
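
Roughly, the locking split at the end of the series looks like the
following (an abbreviated sketch of the rmqueue fast paths with error
handling and most surrounding code elided; the function names exist in
mm/page_alloc.c but the fragments are illustrative, not the exact code):

/* Order-0 allocations: per-cpu lists, covered by the local_lock only. */
local_lock_irqsave(&pagesets.lock, flags);
page = __rmqueue_pcplist(zone, migratetype, alloc_flags, pcp, list);
local_unlock_irqrestore(&pagesets.lock, flags);

/* Higher-order allocations: buddy lists, covered by the zone lock. */
spin_lock_irqsave(&zone->lock, flags);
page = __rmqueue(zone, order, migratetype, alloc_flags);
spin_unlock_irqrestore(&zone->lock, flags);

/* Stats such as PGALLOC and NUMA_HIT need no special protection. */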

No performance data is included because, despite the overhead of the
stats, it is within the noise for most workloads on !PREEMPT_RT. However,
Jesper Dangaard Brouer ran a page allocation microbenchmark on an E5-1650
v4 @ 3.60GHz CPU on the first version of this series. Focusing on the
array variant of the bulk page allocator reveals the following.

(CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz)
ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size

Step   Baseline   Patched
   1     56.383    54.225 (+3.83%)
   2     40.047    35.492 (+11.38%)
   3     37.339    32.643 (+12.58%)
   4     35.578    30.992 (+12.89%)
   8     33.592    29.606 (+11.87%)
  16     32.362    28.532 (+11.85%)
  32     31.476    27.728 (+11.91%)
  64     30.633    27.252 (+11.04%)
 128     30.596    27.090 (+11.46%)

While this is a positive outcome, the series is more likely to be
interesting to the RT people in terms of getting parts of the PREEMPT_RT
tree into mainline.

drivers/base/node.c | 18 +--
include/linux/mmzone.h | 29 ++--
include/linux/vmstat.h | 65 +++++----
mm/internal.h | 2 +-
mm/memory_hotplug.c | 10 +-
mm/mempolicy.c | 2 +-
mm/page_alloc.c | 297 ++++++++++++++++++++++++-----------------
mm/vmstat.c | 250 ++++++++++++----------------------
8 files changed, 339 insertions(+), 334 deletions(-)

--
2.26.2


2021-04-07 22:03:42

by Mel Gorman

Subject: [PATCH 06/11] mm/page_alloc: Batch the accounting updates in the bulk allocator

Now that the zone_statistics are simple counters that do not require
special protection, the bulk allocator accounting updates can be batched
without the complexity of protected RMW updates or xchg.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/vmstat.h | 8 ++++++++
mm/page_alloc.c | 30 +++++++++++++-----------------
2 files changed, 21 insertions(+), 17 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index dde4dec4e7dd..8473b8fa9756 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -246,6 +246,14 @@ __count_numa_event(struct zone *zone, enum numa_stat_item item)
raw_cpu_inc(pzstats->vm_numa_event[item]);
}

+static inline void
+__count_numa_events(struct zone *zone, enum numa_stat_item item, long delta)
+{
+ struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
+
+ raw_cpu_add(pzstats->vm_numa_event[item], delta);
+}
+
extern void __count_numa_event(struct zone *zone, enum numa_stat_item item);
extern unsigned long sum_zone_node_page_state(int node,
enum zone_stat_item item);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 73e618d06315..defb0e436fac 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3411,7 +3411,8 @@ void __putback_isolated_page(struct page *page, unsigned int order, int mt)
*
* Must be called with interrupts disabled.
*/
-static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
+static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
+ long nr_account)
{
#ifdef CONFIG_NUMA
enum numa_stat_item local_stat = NUMA_LOCAL;
@@ -3424,12 +3425,12 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
local_stat = NUMA_OTHER;

if (zone_to_nid(z) == zone_to_nid(preferred_zone))
- __count_numa_event(z, NUMA_HIT);
+ __count_numa_events(z, NUMA_HIT, nr_account);
else {
- __count_numa_event(z, NUMA_MISS);
- __count_numa_event(preferred_zone, NUMA_FOREIGN);
+ __count_numa_events(z, NUMA_MISS, nr_account);
+ __count_numa_events(preferred_zone, NUMA_FOREIGN, nr_account);
}
- __count_numa_event(z, local_stat);
+ __count_numa_events(z, local_stat, nr_account);
#endif
}

@@ -3475,7 +3476,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
page = __rmqueue_pcplist(zone, migratetype, alloc_flags, pcp, list);
if (page) {
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1);
- zone_statistics(preferred_zone, zone);
+ zone_statistics(preferred_zone, zone, 1);
}
local_unlock_irqrestore(&pagesets.lock, flags);
return page;
@@ -3536,7 +3537,7 @@ struct page *rmqueue(struct zone *preferred_zone,
get_pcppage_migratetype(page));

__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
- zone_statistics(preferred_zone, zone);
+ zone_statistics(preferred_zone, zone, 1);
local_irq_restore(flags);

out:
@@ -5019,7 +5020,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
struct alloc_context ac;
gfp_t alloc_gfp;
unsigned int alloc_flags = ALLOC_WMARK_LOW;
- int nr_populated = 0;
+ int nr_populated = 0, nr_account = 0;

if (unlikely(nr_pages <= 0))
return 0;
@@ -5092,15 +5093,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
goto failed_irq;
break;
}
-
- /*
- * Ideally this would be batched but the best way to do
- * that cheaply is to first convert zone_statistics to
- * be inaccurate per-cpu counter like vm_events to avoid
- * a RMW cycle then do the accounting with IRQs enabled.
- */
- __count_zid_vm_events(PGALLOC, zone_idx(zone), 1);
- zone_statistics(ac.preferred_zoneref->zone, zone);
+ nr_account++;

prep_new_page(page, 0, gfp, 0);
if (page_list)
@@ -5110,6 +5103,9 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
nr_populated++;
}

+ __count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
+ zone_statistics(ac.preferred_zoneref->zone, zone, nr_account);
+
local_unlock_irqrestore(&pagesets.lock, flags);

return nr_populated;
--
2.26.2

2021-04-07 22:03:50

by Mel Gorman

Subject: [PATCH 03/11] mm/memory_hotplug: Make unpopulated zones PCP structures unreachable during hot remove

zone_pcp_reset allegedly protects against a race with drain_pages
using local_irq_save but this is bogus. local_irq_save only operates
on the local CPU. If memory hotplug is running on CPU A and drain_pages
is running on CPU B, disabling IRQs on CPU A does not affect CPU B and
offers no protection.

This patch reorders memory hotremove such that the PCP structures
relevant to the zone are no longer reachable by the time the structures
are freed. With this reordering, no protection is required to prevent
a use-after-free and the IRQs can be left enabled. zone_pcp_reset is
renamed to zone_pcp_destroy to make it clear that the per-cpu structures
are deleted when the function returns.
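
For illustration (a simplified fragment, not the literal function bodies;
drain_pages_zone and boot_pageset are the real names from page_alloc.c
but the sequence is abbreviated), the problem with the old approach is
that disabling IRQs is a purely CPU-local operation:

/* CPU A: memory hot-remove path (old zone_pcp_reset) */
local_irq_save(flags);                  /* masks IRQs on CPU A only */
zone->per_cpu_pageset = &boot_pageset;  /* switches structures that ... */
local_irq_restore(flags);

/* CPU B: running concurrently, unaffected by CPU A's IRQ state */
drain_pages_zone(cpu, zone);            /* ... CPU B may still be using */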

Signed-off-by: Mel Gorman <[email protected]>
---
mm/internal.h | 2 +-
mm/memory_hotplug.c | 10 +++++++---
mm/page_alloc.c | 22 ++++++++++++++++------
3 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 09adf152a10b..cc34ce4461b7 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -203,7 +203,7 @@ extern void free_unref_page(struct page *page);
extern void free_unref_page_list(struct list_head *list);

extern void zone_pcp_update(struct zone *zone);
-extern void zone_pcp_reset(struct zone *zone);
+extern void zone_pcp_destroy(struct zone *zone);
extern void zone_pcp_disable(struct zone *zone);
extern void zone_pcp_enable(struct zone *zone);

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0cdbbfbc5757..3d059c9f9c2d 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1687,12 +1687,16 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
zone->nr_isolate_pageblock -= nr_pages / pageblock_nr_pages;
spin_unlock_irqrestore(&zone->lock, flags);

- zone_pcp_enable(zone);
-
/* removal success */
adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
zone->present_pages -= nr_pages;

+ /*
+ * Restore the PCP after managed pages have been updated. An
+ * unpopulated zone's PCP structures will remain unusable.
+ */
+ zone_pcp_enable(zone);
+
pgdat_resize_lock(zone->zone_pgdat, &flags);
zone->zone_pgdat->node_present_pages -= nr_pages;
pgdat_resize_unlock(zone->zone_pgdat, &flags);
@@ -1700,8 +1704,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
init_per_zone_wmark_min();

if (!populated_zone(zone)) {
- zone_pcp_reset(zone);
build_all_zonelists(NULL);
+ zone_pcp_destroy(zone);
} else
zone_pcp_update(zone);

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e9e60d1a85d4..a8630003612b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8972,18 +8972,29 @@ void zone_pcp_disable(struct zone *zone)

void zone_pcp_enable(struct zone *zone)
{
- __zone_set_pageset_high_and_batch(zone, zone->pageset_high, zone->pageset_batch);
+ /*
+ * If the zone is populated, restore the high and batch counts.
+ * If unpopulated, leave the high and batch count as 0 and 1
+ * respectively as done by zone_pcp_disable. The per-cpu
+ * structures will later be freed by zone_pcp_destroy.
+ */
+ if (populated_zone(zone))
+ __zone_set_pageset_high_and_batch(zone, zone->pageset_high, zone->pageset_batch);
+
mutex_unlock(&pcp_batch_high_lock);
}

-void zone_pcp_reset(struct zone *zone)
+/*
+ * Called when a zone has been hot-removed. At this point, the PCP has been
+ * drained, disabled and the zone is removed from the zonelists so the
+ * structures are no longer in use. PCP was disabled/drained by
+ * zone_pcp_disable. This function will drain any remaining vmstat deltas.
+ */
+void zone_pcp_destroy(struct zone *zone)
{
- unsigned long flags;
int cpu;
struct per_cpu_zonestat *pzstats;

- /* avoid races with drain_pages() */
- local_irq_save(flags);
if (zone->per_cpu_pageset != &boot_pageset) {
for_each_online_cpu(cpu) {
pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
@@ -8994,7 +9005,6 @@ void zone_pcp_reset(struct zone *zone)
zone->per_cpu_pageset = &boot_pageset;
zone->per_cpu_zonestats = &boot_zonestats;
}
- local_irq_restore(flags);
}

#ifdef CONFIG_MEMORY_HOTREMOVE
--
2.26.2

2021-04-08 16:52:16

by Peter Zijlstra

Subject: Re: [PATCH 0/11 v2] Use local_lock for pcp protection and reduce stat overhead

On Wed, Apr 07, 2021 at 09:24:12PM +0100, Mel Gorman wrote:
> Why local_lock? PREEMPT_RT considers the following sequence to be unsafe
> as documented in Documentation/locking/locktypes.rst
>
> local_irq_disable();
> raw_spin_lock(&lock);

Almost, the above is actually OK on RT. The problematic one is:

local_irq_disable();
spin_lock(&lock);

That doesn't work on RT since spin_lock() turns into a PI-mutex which
then obviously explodes if it tries to block with IRQs disabled.

And it so happens, that's exactly the one at hand.
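
In other words, as a simplified contrast (not the real lock definitions;
the substitution is done by the PREEMPT_RT spinlock implementation):

local_irq_disable();
raw_spin_lock(&lock);   /* still a true spinning lock on RT: acceptable */

local_irq_disable();
spin_lock(&lock);       /* an rtmutex-based sleeping lock on RT: must not
                           block with IRQs hard-disabled */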

2021-04-08 17:49:36

by Mel Gorman

Subject: Re: [PATCH 0/11 v2] Use local_lock for pcp protection and reduce stat overhead

On Thu, Apr 08, 2021 at 12:56:01PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 07, 2021 at 09:24:12PM +0100, Mel Gorman wrote:
> > Why local_lock? PREEMPT_RT considers the following sequence to be unsafe
> > as documented in Documentation/locking/locktypes.rst
> >
> > local_irq_disable();
> > raw_spin_lock(&lock);
>
> Almost, the above is actually OK on RT. The problematic one is:
>
> local_irq_disable();
> spin_lock(&lock);
>
> That doesn't work on RT since spin_lock() turns into a PI-mutex which
> then obviously explodes if it tries to block with IRQs disabled.
>
> And it so happens, that's exactly the one at hand.

Ok, I completely messed up the leader because it was local_irq_disable()
+ spin_lock() that I was worried about. Once the series is complete,
it is replaced with

local_lock_irq(&lock_lock)
spin_lock(&lock);

According to Documentation/locking/locktypes.rst, that should be safe.
I'll rephrase the justification.

--
Mel Gorman
SUSE Labs