2022-05-12 13:46:52

by Mel Gorman

Subject: [PATCH 0/6] Drain remote per-cpu directly v3

Changelog since v2
o More conversions from page->lru to page->[pcp_list|buddy_list]
o Additional test results in changelogs

Changelog since v1
o Fix unsafe RT locking scheme
o Use spin_trylock on UP PREEMPT_RT

This series has the same intent as Nicolas' series "mm/page_alloc: Remote
per-cpu lists drain support" -- avoid interference of a high priority
task due to a workqueue item draining per-cpu page lists. While many
workloads can tolerate a brief interruption, it may cause a real-time
task running on a NOHZ_FULL CPU to miss a deadline and, at minimum,
the draining is non-deterministic.

Currently an IRQ-safe local_lock protects the page allocator per-cpu lists.
The local_lock on its own prevents migration and the IRQ disabling protects
from corruption due to an interrupt arriving while a page allocation is
in progress. The locking is inherently unsafe for remote access unless
the remote CPU's lists are known to be unused, for example because that
CPU has been hot-removed.

This series adjusts the locking. A spinlock is added to struct
per_cpu_pages to protect the list contents while local_lock_irq continues
to prevent migration and IRQ reentry. This allows a remote CPU to safely
drain a remote per-cpu list.
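
As a simplified sketch of the resulting scheme (not a patch; details
such as the pcp->count checks, statistics and the UP/PREEMPT_RT special
cases are omitted), the two access paths look roughly like this:

  /* Local allocation/free: local_lock pins the task and disables
   * IRQs, the new per-cpu spinlock protects the list itself. */
  local_lock_irqsave(&pagesets.lock, flags);
  pcp = this_cpu_ptr(zone->per_cpu_pageset);
  spin_lock(&pcp->lock);
  /* ... add to or remove from pcp->lists ... */
  spin_unlock(&pcp->lock);
  local_unlock_irqrestore(&pagesets.lock, flags);

  /* Remote drain: only the per-cpu spinlock is required. IRQs are
   * disabled because free_pcppages_bulk() takes zone->lock. */
  pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
  spin_lock_irqsave(&pcp->lock, flags);
  free_pcppages_bulk(zone, pcp->count, pcp, 0);
  spin_unlock_irqrestore(&pcp->lock, flags);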

This series is a partial series. Follow-on work should allow the
local_irq_save to be converted to a local_irq to avoid IRQs being
disabled/enabled in most cases. Consequently, there are some TODO comments
highlighting the places that would change if local_irq was used. However,
there are enough corner cases that it deserves a series on its own
separated by one kernel release and the priority right now is to avoid
interference of high priority tasks.

Patch 1 is a cosmetic patch to clarify when page->lru is storing buddy pages
and when it is storing per-cpu pages.

Patch 2 shrinks per_cpu_pages to make room for a spin lock. Strictly speaking
this is not necessary but it avoids per_cpu_pages consuming another
cache line.

Patch 3 is a preparation patch to avoid code duplication.

Patch 4 is a simple micro-optimisation that improves code flow necessary for
a later patch to avoid code duplication.

Patch 5 uses a spin_lock to protect the per_cpu_pages contents while still
relying on local_lock to prevent migration, stabilise the pcp
lookup and prevent IRQ reentrancy.

Patch 6 drains remote per-cpu pages directly instead of using a workqueue.

include/linux/mm_types.h | 5 +
include/linux/mmzone.h | 12 +-
mm/page_alloc.c | 348 +++++++++++++++++++++++++--------------
3 files changed, 233 insertions(+), 132 deletions(-)

--
2.34.1



2022-05-12 16:10:22

by Mel Gorman

Subject: [PATCH 6/6] mm/page_alloc: Remotely drain per-cpu lists

From: Nicolas Saenz Julienne <[email protected]>

Some setups, notably NOHZ_FULL CPUs, are too busy to handle the per-cpu
drain work queued by __drain_all_pages(). So introduce a new mechanism to
remotely drain the per-cpu lists. It is made possible by remotely locking
'struct per_cpu_pages' new per-cpu spinlocks. A benefit of this new scheme
is that drain operations are now migration safe.
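
In effect, __drain_all_pages() can drain each CPU directly instead of
queueing and flushing work items. Roughly (simplified from the diff
below):

  for_each_cpu(cpu, &cpus_with_pcps) {
          if (zone)
                  drain_pages_zone(cpu, zone); /* takes the remote pcp->lock */
          else
                  drain_pages(cpu);
  }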

There was no observed performance degradation vs. the previous scheme.
Both netperf and hackbench were run in parallel with triggering the
__drain_all_pages(NULL, true) code path around 100 times per second.
The new scheme performs a bit better (~5%), although the important point
here is that there are no performance regressions vs. the previous
mechanism. Per-cpu lists draining happens only in slow paths.

Minchan Kim tested this independently and reported;

My workload is not NOHZ CPUs but runs apps under heavy memory
pressure, so they go into direct reclaim and get stuck on
drain_all_pages until the work on the workqueue runs.

unit: nanosecond
max(dur) avg(dur) count(dur)
166713013 487511.77786438033 1283

From traces, the system encountered drain_all_pages 1283 times and the
worst case was 166ms and the avg was 487us.

The other problem was alloc_contig_range in CMA. The PCP draining
sometimes takes several hundred milliseconds even though there is no
memory pressure or only a few pages to be migrated out, but the CPUs
were fully booked.

Your patch perfectly removed that wasted time.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Nicolas Saenz Julienne <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Tested-by: Minchan Kim <[email protected]>
Acked-by: Minchan Kim <[email protected]>
---
mm/page_alloc.c | 59 +++++--------------------------------------------
1 file changed, 5 insertions(+), 54 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ce4d3002b8a3..0f5a6a5b0302 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -164,13 +164,7 @@ DEFINE_PER_CPU(int, _numa_mem_); /* Kernel "local memory" node */
EXPORT_PER_CPU_SYMBOL(_numa_mem_);
#endif

-/* work_structs for global per-cpu drains */
-struct pcpu_drain {
- struct zone *zone;
- struct work_struct work;
-};
static DEFINE_MUTEX(pcpu_drain_mutex);
-static DEFINE_PER_CPU(struct pcpu_drain, pcpu_drain);

#ifdef CONFIG_GCC_PLUGIN_LATENT_ENTROPY
volatile unsigned long latent_entropy __latent_entropy;
@@ -3090,9 +3084,6 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
* Called from the vmstat counter updater to drain pagesets of this
* currently executing processor on remote nodes after they have
* expired.
- *
- * Note that this function must be called with the thread pinned to
- * a single processor.
*/
void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
{
@@ -3117,10 +3108,6 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)

/*
* Drain pcplists of the indicated processor and zone.
- *
- * The processor must either be the current processor and the
- * thread pinned to the current processor or a processor that
- * is not online.
*/
static void drain_pages_zone(unsigned int cpu, struct zone *zone)
{
@@ -3139,10 +3126,6 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)

/*
* Drain pcplists of all zones on the indicated processor.
- *
- * The processor must either be the current processor and the
- * thread pinned to the current processor or a processor that
- * is not online.
*/
static void drain_pages(unsigned int cpu)
{
@@ -3155,9 +3138,6 @@ static void drain_pages(unsigned int cpu)

/*
* Spill all of this CPU's per-cpu pages back into the buddy allocator.
- *
- * The CPU has to be pinned. When zone parameter is non-NULL, spill just
- * the single zone's pages.
*/
void drain_local_pages(struct zone *zone)
{
@@ -3169,24 +3149,6 @@ void drain_local_pages(struct zone *zone)
drain_pages(cpu);
}

-static void drain_local_pages_wq(struct work_struct *work)
-{
- struct pcpu_drain *drain;
-
- drain = container_of(work, struct pcpu_drain, work);
-
- /*
- * drain_all_pages doesn't use proper cpu hotplug protection so
- * we can race with cpu offline when the WQ can move this from
- * a cpu pinned worker to an unbound one. We can operate on a different
- * cpu which is alright but we also have to make sure to not move to
- * a different one.
- */
- migrate_disable();
- drain_local_pages(drain->zone);
- migrate_enable();
-}
-
/*
* The implementation of drain_all_pages(), exposing an extra parameter to
* drain on all cpus.
@@ -3207,13 +3169,6 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
*/
static cpumask_t cpus_with_pcps;

- /*
- * Make sure nobody triggers this path before mm_percpu_wq is fully
- * initialized.
- */
- if (WARN_ON_ONCE(!mm_percpu_wq))
- return;
-
/*
* Do not drain if one is already in progress unless it's specific to
* a zone. Such callers are primarily CMA and memory hotplug and need
@@ -3263,14 +3218,12 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
}

for_each_cpu(cpu, &cpus_with_pcps) {
- struct pcpu_drain *drain = per_cpu_ptr(&pcpu_drain, cpu);
-
- drain->zone = zone;
- INIT_WORK(&drain->work, drain_local_pages_wq);
- queue_work_on(cpu, mm_percpu_wq, &drain->work);
+ if (zone) {
+ drain_pages_zone(cpu, zone);
+ } else {
+ drain_pages(cpu);
+ }
}
- for_each_cpu(cpu, &cpus_with_pcps)
- flush_work(&per_cpu_ptr(&pcpu_drain, cpu)->work);

mutex_unlock(&pcpu_drain_mutex);
}
@@ -3279,8 +3232,6 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
* Spill all the per-cpu pages from all CPUs back into the buddy allocator.
*
* When zone parameter is non-NULL, spill just the single zone's pages.
- *
- * Note that this can be extremely slow as the draining happens in a workqueue.
*/
void drain_all_pages(struct zone *zone)
{
--
2.34.1


2022-05-12 18:52:53

by Mel Gorman

Subject: [PATCH 1/6] mm/page_alloc: Add page->buddy_list and page->pcp_list

The page allocator uses page->lru for storing pages on either buddy or
PCP lists. Create page->buddy_list and page->pcp_list as a union with
page->lru. This is simply to clarify what type of list a page is on
in the page allocator.

No functional change intended.

[minchan: Fix page lru fields in macros]
Signed-off-by: Mel Gorman <[email protected]>
Tested-by: Minchan Kim <[email protected]>
Acked-by: Minchan Kim <[email protected]>
---
include/linux/mm_types.h | 5 +++++
mm/page_alloc.c | 24 ++++++++++++------------
2 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8834e38c06a4..a2782e8af307 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -87,6 +87,7 @@ struct page {
*/
union {
struct list_head lru;
+
/* Or, for the Unevictable "LRU list" slot */
struct {
/* Always even, to negate PageTail */
@@ -94,6 +95,10 @@ struct page {
/* Count page's or folio's mlocks */
unsigned int mlock_count;
};
+
+ /* Or, free page */
+ struct list_head buddy_list;
+ struct list_head pcp_list;
};
/* See page-flags.h for PAGE_MAPPING_FLAGS */
struct address_space *mapping;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2db95780e003..f58f85fdb05f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -780,7 +780,7 @@ static inline bool set_page_guard(struct zone *zone, struct page *page,
return false;

__SetPageGuard(page);
- INIT_LIST_HEAD(&page->lru);
+ INIT_LIST_HEAD(&page->buddy_list);
set_page_private(page, order);
/* Guard pages are not available for any usage */
__mod_zone_freepage_state(zone, -(1 << order), migratetype);
@@ -957,7 +957,7 @@ static inline void add_to_free_list(struct page *page, struct zone *zone,
{
struct free_area *area = &zone->free_area[order];

- list_add(&page->lru, &area->free_list[migratetype]);
+ list_add(&page->buddy_list, &area->free_list[migratetype]);
area->nr_free++;
}

@@ -967,7 +967,7 @@ static inline void add_to_free_list_tail(struct page *page, struct zone *zone,
{
struct free_area *area = &zone->free_area[order];

- list_add_tail(&page->lru, &area->free_list[migratetype]);
+ list_add_tail(&page->buddy_list, &area->free_list[migratetype]);
area->nr_free++;
}

@@ -981,7 +981,7 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
{
struct free_area *area = &zone->free_area[order];

- list_move_tail(&page->lru, &area->free_list[migratetype]);
+ list_move_tail(&page->buddy_list, &area->free_list[migratetype]);
}

static inline void del_page_from_free_list(struct page *page, struct zone *zone,
@@ -991,7 +991,7 @@ static inline void del_page_from_free_list(struct page *page, struct zone *zone,
if (page_reported(page))
__ClearPageReported(page);

- list_del(&page->lru);
+ list_del(&page->buddy_list);
__ClearPageBuddy(page);
set_page_private(page, 0);
zone->free_area[order].nr_free--;
@@ -1489,11 +1489,11 @@ static void free_pcppages_bulk(struct zone *zone, int count,
do {
int mt;

- page = list_last_entry(list, struct page, lru);
+ page = list_last_entry(list, struct page, pcp_list);
mt = get_pcppage_migratetype(page);

/* must delete to avoid corrupting pcp list */
- list_del(&page->lru);
+ list_del(&page->pcp_list);
count -= nr_pages;
pcp->count -= nr_pages;

@@ -3053,7 +3053,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
* for IO devices that can merge IO requests if the physical
* pages are ordered properly.
*/
- list_add_tail(&page->lru, list);
+ list_add_tail(&page->pcp_list, list);
allocated++;
if (is_migrate_cma(get_pcppage_migratetype(page)))
__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
@@ -3303,7 +3303,7 @@ void mark_free_pages(struct zone *zone)

for_each_migratetype_order(order, t) {
list_for_each_entry(page,
- &zone->free_area[order].free_list[t], lru) {
+ &zone->free_area[order].free_list[t], buddy_list) {
unsigned long i;

pfn = page_to_pfn(page);
@@ -3392,7 +3392,7 @@ static void free_unref_page_commit(struct page *page, int migratetype,
__count_vm_event(PGFREE);
pcp = this_cpu_ptr(zone->per_cpu_pageset);
pindex = order_to_pindex(migratetype, order);
- list_add(&page->lru, &pcp->lists[pindex]);
+ list_add(&page->pcp_list, &pcp->lists[pindex]);
pcp->count += 1 << order;

/*
@@ -3655,8 +3655,8 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
return NULL;
}

- page = list_first_entry(list, struct page, lru);
- list_del(&page->lru);
+ page = list_first_entry(list, struct page, pcp_list);
+ list_del(&page->pcp_list);
pcp->count -= 1 << order;
} while (check_new_pcp(page, order));

--
2.34.1


2022-05-13 09:55:01

by Mel Gorman

Subject: [PATCH 5/6] mm/page_alloc: Protect PCP lists with a spinlock

Currently the PCP lists are protected by using local_lock_irqsave to
prevent migration and IRQ reentrancy, but this is inconvenient. Remote
draining of the lists is impossible, so a workqueue is required, and
every allocation/free must disable and then re-enable interrupts, which
is expensive.

As preparation for dealing with both of those problems, protect the
lists with a spinlock. The IRQ-unsafe version of the lock is used
because IRQs are already disabled by local_lock_irqsave. spin_trylock
is used in preparation for a time when local_lock could be used instead
of local_lock_irqsave.
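
As a simplified sketch of the free-path pattern this introduces (the
real code in the diff below returns false to the caller, which then
frees the page to the buddy allocator, and the UP-only
pcp_trylock_prepare/finish wrappers are omitted):

  /* Try to put the page on the per-cpu list. If the lock is held,
   * e.g. by a remote drain, fall back to the buddy allocator rather
   * than spinning. */
  if (!spin_trylock(&pcp->lock)) {
          free_one_page(zone, page, pfn, order, migratetype, FPI_NONE);
          return;
  }
  list_add(&page->pcp_list, &pcp->lists[pindex]);
  pcp->count += 1 << order;
  spin_unlock(&pcp->lock);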

The per_cpu_pages still fits within the same number of cache lines after
this patch relative to before the series.

struct per_cpu_pages {
        spinlock_t                 lock;                 /*     0     4 */
        int                        count;                /*     4     4 */
        int                        high;                 /*     8     4 */
        int                        batch;                /*    12     4 */
        short int                  free_factor;          /*    16     2 */
        short int                  expire;               /*    18     2 */

        /* XXX 4 bytes hole, try to pack */

        struct list_head           lists[13];            /*    24   208 */

        /* size: 256, cachelines: 4, members: 7 */
        /* sum members: 228, holes: 1, sum holes: 4 */
        /* padding: 24 */
} __attribute__((__aligned__(64)));

There is overhead in the fast path due to acquiring the spinlock even
though the spinlock is per-cpu and uncontended in the common case. Page
Fault Test (PFT) reported the following results on a 1-socket machine.

                                  5.18.0-rc1             5.18.0-rc1
                                     vanilla       mm-pcpdrain-v2r1
Hmean     faults/sec-1    886331.5718 (   0.00%)   885462.7479 (  -0.10%)
Hmean     faults/sec-3   2337706.1583 (   0.00%)  2332130.4909 *  -0.24%*
Hmean     faults/sec-5   2851594.2897 (   0.00%)  2844123.9307 (  -0.26%)
Hmean     faults/sec-7   3543251.5507 (   0.00%)  3516889.0442 *  -0.74%*
Hmean     faults/sec-8   3947098.0024 (   0.00%)  3916162.8476 *  -0.78%*
Stddev    faults/sec-1      2302.9105 (   0.00%)     2065.0845 (  10.33%)
Stddev    faults/sec-3      7275.2442 (   0.00%)     6033.2620 (  17.07%)
Stddev    faults/sec-5     24726.0328 (   0.00%)    12525.1026 (  49.34%)
Stddev    faults/sec-7      9974.2542 (   0.00%)     9543.9627 (   4.31%)
Stddev    faults/sec-8      9468.0191 (   0.00%)     7958.2607 (  15.95%)
CoeffVar  faults/sec-1         0.2598 (   0.00%)        0.2332 (  10.24%)
CoeffVar  faults/sec-3         0.3112 (   0.00%)        0.2587 (  16.87%)
CoeffVar  faults/sec-5         0.8670 (   0.00%)        0.4404 (  49.21%)
CoeffVar  faults/sec-7         0.2815 (   0.00%)        0.2714 (   3.60%)
CoeffVar  faults/sec-8         0.2399 (   0.00%)        0.2032 (  15.28%)

There is a small hit in the number of faults per second but given that
the results are more stable, it's borderline noise.

Signed-off-by: Mel Gorman <[email protected]>
Tested-by: Minchan Kim <[email protected]>
Acked-by: Minchan Kim <[email protected]>
---
include/linux/mmzone.h | 1 +
mm/page_alloc.c | 169 ++++++++++++++++++++++++++++++++++++-----
2 files changed, 149 insertions(+), 21 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index abe530748de6..8b5757735428 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -385,6 +385,7 @@ enum zone_watermarks {

/* Fields and list protected by pagesets local_lock in page_alloc.c */
struct per_cpu_pages {
+ spinlock_t lock; /* Protects lists field */
int count; /* number of pages in the list */
int high; /* high watermark, emptying needed */
int batch; /* chunk size for buddy add/remove */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b543333dce8f..ce4d3002b8a3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -132,6 +132,20 @@ static DEFINE_PER_CPU(struct pagesets, pagesets) __maybe_unused = {
.lock = INIT_LOCAL_LOCK(lock),
};

+#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT)
+/*
+ * On SMP, spin_trylock is sufficient protection.
+ * On PREEMPT_RT, spin_trylock is equivalent on both SMP and UP.
+ */
+#define pcp_trylock_prepare(flags) do { } while (0)
+#define pcp_trylock_finish(flag) do { } while (0)
+#else
+
+/* UP spin_trylock always succeeds so disable IRQs to prevent re-entrancy. */
+#define pcp_trylock_prepare(flags) local_irq_save(flags)
+#define pcp_trylock_finish(flags) local_irq_restore(flags)
+#endif
+
#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
DEFINE_PER_CPU(int, numa_node);
EXPORT_PER_CPU_SYMBOL(numa_node);
@@ -3082,15 +3096,22 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
*/
void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
{
- unsigned long flags;
int to_drain, batch;

- local_lock_irqsave(&pagesets.lock, flags);
batch = READ_ONCE(pcp->batch);
to_drain = min(pcp->count, batch);
- if (to_drain > 0)
+ if (to_drain > 0) {
+ unsigned long flags;
+
+ /*
+ * free_pcppages_bulk expects IRQs disabled for zone->lock
+ * so even though pcp->lock is not intended to be IRQ-safe,
+ * it's needed in this context.
+ */
+ spin_lock_irqsave(&pcp->lock, flags);
free_pcppages_bulk(zone, to_drain, pcp, 0);
- local_unlock_irqrestore(&pagesets.lock, flags);
+ spin_unlock_irqrestore(&pcp->lock, flags);
+ }
}
#endif

@@ -3103,16 +3124,17 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
*/
static void drain_pages_zone(unsigned int cpu, struct zone *zone)
{
- unsigned long flags;
struct per_cpu_pages *pcp;

- local_lock_irqsave(&pagesets.lock, flags);
-
pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
- if (pcp->count)
- free_pcppages_bulk(zone, pcp->count, pcp, 0);
+ if (pcp->count) {
+ unsigned long flags;

- local_unlock_irqrestore(&pagesets.lock, flags);
+ /* See drain_zone_pages on why this is disabling IRQs */
+ spin_lock_irqsave(&pcp->lock, flags);
+ free_pcppages_bulk(zone, pcp->count, pcp, 0);
+ spin_unlock_irqrestore(&pcp->lock, flags);
+ }
}

/*
@@ -3380,18 +3402,30 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
return min(READ_ONCE(pcp->batch) << 2, high);
}

-static void free_unref_page_commit(struct page *page, int migratetype,
- unsigned int order)
+/* Returns true if the page was committed to the per-cpu list. */
+static bool free_unref_page_commit(struct page *page, int migratetype,
+ unsigned int order, bool locked)
{
struct zone *zone = page_zone(page);
struct per_cpu_pages *pcp;
int high;
int pindex;
bool free_high;
+ unsigned long __maybe_unused UP_flags;

__count_vm_event(PGFREE);
pcp = this_cpu_ptr(zone->per_cpu_pageset);
pindex = order_to_pindex(migratetype, order);
+
+ if (!locked) {
+ /* Protect against a parallel drain. */
+ pcp_trylock_prepare(UP_flags);
+ if (!spin_trylock(&pcp->lock)) {
+ pcp_trylock_finish(UP_flags);
+ return false;
+ }
+ }
+
list_add(&page->pcp_list, &pcp->lists[pindex]);
pcp->count += 1 << order;

@@ -3409,6 +3443,13 @@ static void free_unref_page_commit(struct page *page, int migratetype,

free_pcppages_bulk(zone, nr_pcp_free(pcp, high, batch, free_high), pcp, pindex);
}
+
+ if (!locked) {
+ spin_unlock(&pcp->lock);
+ pcp_trylock_finish(UP_flags);
+ }
+
+ return true;
}

/*
@@ -3419,6 +3460,7 @@ void free_unref_page(struct page *page, unsigned int order)
unsigned long flags;
unsigned long pfn = page_to_pfn(page);
int migratetype;
+ bool freed_pcp = false;

if (!free_unref_page_prepare(page, pfn, order))
return;
@@ -3440,8 +3482,11 @@ void free_unref_page(struct page *page, unsigned int order)
}

local_lock_irqsave(&pagesets.lock, flags);
- free_unref_page_commit(page, migratetype, order);
+ freed_pcp = free_unref_page_commit(page, migratetype, order, false);
local_unlock_irqrestore(&pagesets.lock, flags);
+
+ if (unlikely(!freed_pcp))
+ free_one_page(page_zone(page), page, pfn, order, migratetype, FPI_NONE);
}

/*
@@ -3450,10 +3495,19 @@ void free_unref_page(struct page *page, unsigned int order)
void free_unref_page_list(struct list_head *list)
{
struct page *page, *next;
+ struct per_cpu_pages *pcp;
+ struct zone *locked_zone;
unsigned long flags;
int batch_count = 0;
int migratetype;

+ /*
+ * An empty list is possible. Check early so that the later
+ * lru_to_page() does not potentially read garbage.
+ */
+ if (list_empty(list))
+ return;
+
/* Prepare pages for freeing */
list_for_each_entry_safe(page, next, list, lru) {
unsigned long pfn = page_to_pfn(page);
@@ -3474,8 +3528,33 @@ void free_unref_page_list(struct list_head *list)
}
}

+ /*
+ * Preparation could have drained the list due to failing to prepare
+ * or all pages are being isolated.
+ */
+ if (list_empty(list))
+ return;
+
+ VM_BUG_ON(in_hardirq());
+
local_lock_irqsave(&pagesets.lock, flags);
+
+ page = lru_to_page(list);
+ locked_zone = page_zone(page);
+ pcp = this_cpu_ptr(locked_zone->per_cpu_pageset);
+ spin_lock(&pcp->lock);
+
list_for_each_entry_safe(page, next, list, lru) {
+ struct zone *zone = page_zone(page);
+
+ /* Different zone, different pcp lock. */
+ if (zone != locked_zone) {
+ spin_unlock(&pcp->lock);
+ locked_zone = zone;
+ pcp = this_cpu_ptr(zone->per_cpu_pageset);
+ spin_lock(&pcp->lock);
+ }
+
/*
* Non-isolated types over MIGRATE_PCPTYPES get added
* to the MIGRATE_MOVABLE pcp list.
@@ -3485,18 +3564,33 @@ void free_unref_page_list(struct list_head *list)
migratetype = MIGRATE_MOVABLE;

trace_mm_page_free_batched(page);
- free_unref_page_commit(page, migratetype, 0);
+
+ /*
+ * If there is a parallel drain in progress, free to the buddy
+ * allocator directly. This is expensive as the zone lock will
+ * be acquired multiple times but if a drain is in progress
+ * then an expensive operation is already taking place.
+ *
+ * TODO: Always false at the moment due to local_lock_irqsave
+ * and is preparation for converting to local_lock.
+ */
+ if (unlikely(!free_unref_page_commit(page, migratetype, 0, true)))
+ free_one_page(page_zone(page), page, page_to_pfn(page), 0, migratetype, FPI_NONE);

/*
* Guard against excessive IRQ disabled times when we get
* a large list of pages to free.
*/
if (++batch_count == SWAP_CLUSTER_MAX) {
+ spin_unlock(&pcp->lock);
local_unlock_irqrestore(&pagesets.lock, flags);
batch_count = 0;
local_lock_irqsave(&pagesets.lock, flags);
+ pcp = this_cpu_ptr(locked_zone->per_cpu_pageset);
+ spin_lock(&pcp->lock);
}
}
+ spin_unlock(&pcp->lock);
local_unlock_irqrestore(&pagesets.lock, flags);
}

@@ -3668,9 +3762,28 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
int migratetype,
unsigned int alloc_flags,
struct per_cpu_pages *pcp,
- struct list_head *list)
+ struct list_head *list,
+ bool locked)
{
struct page *page;
+ unsigned long __maybe_unused UP_flags;
+
+ /*
+ * spin_trylock is not necessary right now due to
+ * local_lock_irqsave and is a preparation step for
+ * a conversion to local_lock using the trylock to prevent
+ * IRQ re-entrancy. If pcp->lock cannot be acquired, the caller
+ * uses rmqueue_buddy.
+ *
+ * TODO: Convert local_lock_irqsave to local_lock.
+ */
+ if (unlikely(!locked)) {
+ pcp_trylock_prepare(UP_flags);
+ if (!spin_trylock(&pcp->lock)) {
+ pcp_trylock_finish(UP_flags);
+ return NULL;
+ }
+ }

do {
if (list_empty(list)) {
@@ -3691,8 +3804,10 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
migratetype, alloc_flags);

pcp->count += alloced << order;
- if (unlikely(list_empty(list)))
- return NULL;
+ if (unlikely(list_empty(list))) {
+ page = NULL;
+ goto out;
+ }
}

page = list_first_entry(list, struct page, pcp_list);
@@ -3700,6 +3815,12 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
pcp->count -= 1 << order;
} while (check_new_pcp(page, order));

+out:
+ if (!locked) {
+ spin_unlock(&pcp->lock);
+ pcp_trylock_finish(UP_flags);
+ }
+
return page;
}

@@ -3724,7 +3845,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
pcp = this_cpu_ptr(zone->per_cpu_pageset);
pcp->free_factor >>= 1;
list = &pcp->lists[order_to_pindex(migratetype, order)];
- page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
+ page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list, false);
local_unlock_irqrestore(&pagesets.lock, flags);
if (page) {
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1);
@@ -3759,7 +3880,8 @@ struct page *rmqueue(struct zone *preferred_zone,
migratetype != MIGRATE_MOVABLE) {
page = rmqueue_pcplist(preferred_zone, zone, order,
gfp_flags, migratetype, alloc_flags);
- goto out;
+ if (likely(page))
+ goto out;
}
}

@@ -5326,6 +5448,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
local_lock_irqsave(&pagesets.lock, flags);
pcp = this_cpu_ptr(zone->per_cpu_pageset);
pcp_list = &pcp->lists[order_to_pindex(ac.migratetype, 0)];
+ spin_lock(&pcp->lock);

while (nr_populated < nr_pages) {

@@ -5336,11 +5459,13 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
}

page = __rmqueue_pcplist(zone, 0, ac.migratetype, alloc_flags,
- pcp, pcp_list);
+ pcp, pcp_list, true);
if (unlikely(!page)) {
/* Try and get at least one page */
- if (!nr_populated)
+ if (!nr_populated) {
+ spin_unlock(&pcp->lock);
goto failed_irq;
+ }
break;
}
nr_account++;
@@ -5353,6 +5478,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
nr_populated++;
}

+ spin_unlock(&pcp->lock);
local_unlock_irqrestore(&pagesets.lock, flags);

__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
@@ -6992,6 +7118,7 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta
memset(pcp, 0, sizeof(*pcp));
memset(pzstats, 0, sizeof(*pzstats));

+ spin_lock_init(&pcp->lock);
for (pindex = 0; pindex < NR_PCP_LISTS; pindex++)
INIT_LIST_HEAD(&pcp->lists[pindex]);

--
2.34.1


2022-05-13 16:34:33

by Mel Gorman

Subject: [PATCH 4/6] mm/page_alloc: Remove unnecessary page == NULL check in rmqueue

The VM_BUG_ON check for a valid page can be avoided with a simple
change in the flow. The ZONE_BOOSTED_WATERMARK is unlikely in general
and even more unlikely if the page allocation failed so mark the
branch unlikely.

Signed-off-by: Mel Gorman <[email protected]>
Tested-by: Minchan Kim <[email protected]>
Acked-by: Minchan Kim <[email protected]>
---
mm/page_alloc.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1c4c54503a5d..b543333dce8f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3765,17 +3765,18 @@ struct page *rmqueue(struct zone *preferred_zone,

page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
migratetype);
- if (unlikely(!page))
- return NULL;

out:
/* Separate test+clear to avoid unnecessary atomics */
- if (test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags)) {
+ if (unlikely(test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags))) {
clear_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
wakeup_kswapd(zone, 0, 0, zone_idx(zone));
}

- VM_BUG_ON_PAGE(page && bad_range(zone, page), page);
+ if (unlikely(!page))
+ return NULL;
+
+ VM_BUG_ON_PAGE(bad_range(zone, page), page);
return page;
}

--
2.34.1


2022-05-14 00:14:28

by Nicolas Saenz Julienne

Subject: Re: [PATCH 6/6] mm/page_alloc: Remotely drain per-cpu lists

On Fri, 2022-05-13 at 16:04 +0100, Mel Gorman wrote:
> On Thu, May 12, 2022 at 12:37:43PM -0700, Andrew Morton wrote:
> > On Thu, 12 May 2022 09:50:43 +0100 Mel Gorman <[email protected]> wrote:
> >
> > > From: Nicolas Saenz Julienne <[email protected]>
> > >
> > > Some setups, notably NOHZ_FULL CPUs, are too busy to handle the per-cpu
> > > drain work queued by __drain_all_pages(). So introduce a new mechanism to
> > > remotely drain the per-cpu lists. It is made possible by remotely locking
> > > 'struct per_cpu_pages' new per-cpu spinlocks. A benefit of this new scheme
> > > is that drain operations are now migration safe.
> > >
> > > There was no observed performance degradation vs. the previous scheme.
> > > Both netperf and hackbench were run in parallel to triggering the
> > > __drain_all_pages(NULL, true) code path around ~100 times per second.
> > > The new scheme performs a bit better (~5%), although the important point
> > > here is there are no performance regressions vs. the previous mechanism.
> > > Per-cpu lists draining happens only in slow paths.
> > >
> > > Minchan Kim tested this independently and reported;
> > >
> > > My workload is not NOHZ CPUs but run apps under heavy memory
> > > pressure so they goes to direct reclaim and be stuck on
> > > drain_all_pages until work on workqueue run.
> > >
> > > unit: nanosecond
> > > max(dur) avg(dur) count(dur)
> > > 166713013 487511.77786438033 1283
> > >
> > > From traces, system encountered the drain_all_pages 1283 times and
> > > worst case was 166ms and avg was 487us.
> > >
> > > The other problem was alloc_contig_range in CMA. The PCP draining
> > > takes several hundred millisecond sometimes though there is no
> > > memory pressure or a few of pages to be migrated out but CPU were
> > > fully booked.
> > >
> > > Your patch perfectly removed those wasted time.
> >
> > I'm not getting a sense here of the overall effect upon userspace
> > performance. As Thomas said last year in
> > https://lkml.kernel.org/r/87v92sgt3n.ffs@tglx
> >
> > : The changelogs and the cover letter have a distinct void vs. that which
> > : means this is just another example of 'scratch my itch' changes w/o
> > : proper justification.
> >
> > Is there more to all of this than itchiness and if so, well, you know
> > the rest ;)
> >
>
> I think Minchan's example is clear-cut. The draining operation can take
> an arbitrary amount of time waiting for the workqueue to run on each CPU
> and can cause severe delays under reclaim or CMA and the patch fixes
> it. Maybe most users won't even notice but I bet phone users do if a
> camera app takes too long to open.
>
> The first paragraphs was written by Nicolas and I did not want to modify
> it heavily and still put his Signed-off-by on it. Maybe it could have
> been clearer though because "too busy" is vague when the actual intent
> is to avoid interfering with RT tasks. Does this sound better to you?
>
> Some setups, notably NOHZ_FULL CPUs, may be running realtime or
> latency-sensitive applications that cannot tolerate interference
> due to per-cpu drain work queued by __drain_all_pages(). Introduce
> a new mechanism to remotely drain the per-cpu lists. It is made
> possible by remotely locking 'struct per_cpu_pages' new per-cpu
> spinlocks. This has two advantages, the time to drain is more
> predictable and other unrelated tasks are not interrupted.
>
> You raise a very valid point with Thomas' mail and it is a concern that
> the local_lock is no longer strictly local. We still need preemption to
> be disabled between the percpu lookup and the lock acquisition but that
> can be done with get_cpu_var() to make the scope clear.

This isn't going to work in RT :(

get_cpu_var() disables preemption, which hampers RT spinlock use. There is
more on this in Documentation/locking/locktypes.rst.

Regards,

--
Nicolás Sáenz


2022-05-14 00:19:57

by Nicolas Saenz Julienne

Subject: Re: [PATCH 0/6] Drain remote per-cpu directly v3

Hi Mel,

On Thu, 2022-05-12 at 09:50 +0100, Mel Gorman wrote:
> Changelog since v2
> o More conversions from page->lru to page->[pcp_list|buddy_list]
> o Additional test results in changelogs
>
> Changelog since v1
> o Fix unsafe RT locking scheme
> o Use spin_trylock on UP PREEMPT_RT
>
> This series has the same intent as Nicolas' series "mm/page_alloc:
> Remote
> per-cpu lists drain support" -- avoid interference of a high priority
> task due to a workqueue item draining per-cpu page lists. While many
> workloads can tolerate a brief interruption, it may be cause a real-
> time
> task runnning on a NOHZ_FULL CPU to miss a deadline and at minimum,
> the draining in non-deterministic.
>
> Currently an IRQ-safe local_lock protects the page allocator per-cpu
> lists.
> The local_lock on its own prevents migration and the IRQ disabling
> protects
> from corruption due to an interrupt arriving while a page allocation
> is
> in progress. The locking is inherently unsafe for remote access
> unless
> the CPU is hot-removed.
>
> This series adjusts the locking. A spinlock is added to struct
> per_cpu_pages to protect the list contents while local_lock_irq
> continues
> to prevent migration and IRQ reentry. This allows a remote CPU to
> safely
> drain a remote per-cpu list.
>
> This series is a partial series. Follow-on work should allow the
> local_irq_save to be converted to a local_irq to avoid IRQs being
> disabled/enabled in most cases. Consequently, there are some TODO
> comments
> highlighting the places that would change if local_irq was used.
> However,
> there are enough corner cases that it deserves a series on its own
> separated by one kernel release and the priority right now is to
> avoid
> interference of high priority tasks.
>
> Patch 1 is a cosmetic patch to clarify when page->lru is storing
> buddy pages
>         and when it is storing per-cpu pages.
>
> Patch 2 shrinks per_cpu_pages to make room for a spin lock. Strictly
> speaking
>         this is not necessary but it avoids per_cpu_pages consuming
> another
>         cache line.
>
> Patch 3 is a preparation patch to avoid code duplication.
>
> Patch 4 is a simple micro-optimisation that improves code flow
> necessary for
>         a later patch to avoid code duplication.
>
> Patch 5 uses a spin_lock to protect the per_cpu_pages contents while
> still
>         relying on local_lock to prevent migration, stabilise the pcp
>         lookup and prevent IRQ reentrancy.
>
> Patch 6 remote drains per-cpu pages directly instead of using a
> workqueue.
>
>  include/linux/mm_types.h |   5 +
>  include/linux/mmzone.h   |  12 +-
>  mm/page_alloc.c          | 348 +++++++++++++++++++++++++------------
> --

Thanks for the series!

I'm testing the series ATM, both on vanilla and RT kernels. I'll have the
results by Monday.

Regards,

--
Nicolás Sáenz


2022-05-14 00:33:47

by Mel Gorman

Subject: Re: [PATCH 0/6] Drain remote per-cpu directly v3

On Thu, May 12, 2022 at 12:43:25PM -0700, Andrew Morton wrote:
> On Thu, 12 May 2022 09:50:37 +0100 Mel Gorman <[email protected]> wrote:
>
> > Changelog since v2
> > o More conversions from page->lru to page->[pcp_list|buddy_list]
> > o Additional test results in changelogs
> >
> > Changelog since v1
> > o Fix unsafe RT locking scheme
> > o Use spin_trylock on UP PREEMPT_RT
> >
> > This series has the same intent as Nicolas' series "mm/page_alloc: Remote
> > per-cpu lists drain support" -- avoid interference of a high priority
> > task due to a workqueue item draining per-cpu page lists. While many
> > workloads can tolerate a brief interruption, it may be cause a real-time
>
> s/may be/may/
>
> > task runnning on a NOHZ_FULL CPU to miss a deadline and at minimum,
>
> s/nnn/nn/
>

Correct.

> > the draining in non-deterministic.
>
> s/n/s/;)
>

Think that one is ok. At least spell check did not complain.

> > Currently an IRQ-safe local_lock protects the page allocator per-cpu lists.
> > The local_lock on its own prevents migration and the IRQ disabling protects
> > from corruption due to an interrupt arriving while a page allocation is
> > in progress. The locking is inherently unsafe for remote access unless
> > the CPU is hot-removed.
>
> I don't understand the final sentence here. Which CPU and why does
> hot-removing it make the locking safe?
>

The sentence can be dropped because it adds little and is potentially
confusing. The PCP being safe to access remotely is specific to the
context of the CPU being hot-removed and there are other special corner
cases like zone_pcp_disable that modifies a per-cpu structure remotely
but not in a way that causes corruption.

--
Mel Gorman
SUSE Labs

2022-05-14 00:49:07

by Mel Gorman

Subject: Re: [PATCH 6/6] mm/page_alloc: Remotely drain per-cpu lists

On Fri, May 13, 2022 at 05:19:18PM +0200, Nicolas Saenz Julienne wrote:
> On Fri, 2022-05-13 at 16:04 +0100, Mel Gorman wrote:
> > On Thu, May 12, 2022 at 12:37:43PM -0700, Andrew Morton wrote:
> > > On Thu, 12 May 2022 09:50:43 +0100 Mel Gorman <[email protected]> wrote:
> > >
> > > > From: Nicolas Saenz Julienne <[email protected]>
> > > >
> > > > Some setups, notably NOHZ_FULL CPUs, are too busy to handle the per-cpu
> > > > drain work queued by __drain_all_pages(). So introduce a new mechanism to
> > > > remotely drain the per-cpu lists. It is made possible by remotely locking
> > > > 'struct per_cpu_pages' new per-cpu spinlocks. A benefit of this new scheme
> > > > is that drain operations are now migration safe.
> > > >
> > > > There was no observed performance degradation vs. the previous scheme.
> > > > Both netperf and hackbench were run in parallel to triggering the
> > > > __drain_all_pages(NULL, true) code path around ~100 times per second.
> > > > The new scheme performs a bit better (~5%), although the important point
> > > > here is there are no performance regressions vs. the previous mechanism.
> > > > Per-cpu lists draining happens only in slow paths.
> > > >
> > > > Minchan Kim tested this independently and reported;
> > > >
> > > > My workload is not NOHZ CPUs but run apps under heavy memory
> > > > pressure so they goes to direct reclaim and be stuck on
> > > > drain_all_pages until work on workqueue run.
> > > >
> > > > unit: nanosecond
> > > > max(dur) avg(dur) count(dur)
> > > > 166713013 487511.77786438033 1283
> > > >
> > > > From traces, system encountered the drain_all_pages 1283 times and
> > > > worst case was 166ms and avg was 487us.
> > > >
> > > > The other problem was alloc_contig_range in CMA. The PCP draining
> > > > takes several hundred millisecond sometimes though there is no
> > > > memory pressure or a few of pages to be migrated out but CPU were
> > > > fully booked.
> > > >
> > > > Your patch perfectly removed those wasted time.
> > >
> > > I'm not getting a sense here of the overall effect upon userspace
> > > performance. As Thomas said last year in
> > > https://lkml.kernel.org/r/87v92sgt3n.ffs@tglx
> > >
> > > : The changelogs and the cover letter have a distinct void vs. that which
> > > : means this is just another example of 'scratch my itch' changes w/o
> > > : proper justification.
> > >
> > > Is there more to all of this than itchiness and if so, well, you know
> > > the rest ;)
> > >
> >
> > I think Minchan's example is clear-cut. The draining operation can take
> > an arbitrary amount of time waiting for the workqueue to run on each CPU
> > and can cause severe delays under reclaim or CMA and the patch fixes
> > it. Maybe most users won't even notice but I bet phone users do if a
> > camera app takes too long to open.
> >
> > The first paragraphs was written by Nicolas and I did not want to modify
> > it heavily and still put his Signed-off-by on it. Maybe it could have
> > been clearer though because "too busy" is vague when the actual intent
> > is to avoid interfering with RT tasks. Does this sound better to you?
> >
> > Some setups, notably NOHZ_FULL CPUs, may be running realtime or
> > latency-sensitive applications that cannot tolerate interference
> > due to per-cpu drain work queued by __drain_all_pages(). Introduce
> > a new mechanism to remotely drain the per-cpu lists. It is made
> > possible by remotely locking 'struct per_cpu_pages' new per-cpu
> > spinlocks. This has two advantages, the time to drain is more
> > predictable and other unrelated tasks are not interrupted.
> >
> > You raise a very valid point with Thomas' mail and it is a concern that
> > the local_lock is no longer strictly local. We still need preemption to
> > be disabled between the percpu lookup and the lock acquisition but that
> > can be done with get_cpu_var() to make the scope clear.
>
> This isn't going to work in RT :(
>
> get_cpu_var() disables preemption hampering RT spinlock use. There is more to
> it in Documentation/locking/locktypes.rst.
>

Bah, you're right. A helper that called preempt_disable() on !RT
and migrate_disable() on RT would work, although it would be similar to
local_lock with a different name. I'll look on Monday to see how the code
could be restructured to always have the get_cpu_var() call immediately
before the lock acquisition. Once that is done, I'll look at what sort of
helper would "disable preempt/migration, look up the pcp structure,
acquire the lock, enable preempt/migration". It's effectively the magic
trick that local_lock uses to always lock the right pcpu lock, but we
want spinlock semantics for the remote drain.
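
Something along these lines, purely as a sketch and with made-up names
(pcpu_task_pin, pcpu_spin_lock_irqsave), is the sort of thing I mean:

  #ifdef CONFIG_PREEMPT_RT
  #define pcpu_task_pin()         migrate_disable()
  #define pcpu_task_unpin()       migrate_enable()
  #else
  #define pcpu_task_pin()         preempt_disable()
  #define pcpu_task_unpin()       preempt_enable()
  #endif

  /* Pin the task, look up this CPU's pcp and take its lock. */
  #define pcpu_spin_lock_irqsave(ptr, flags)                      \
  ({                                                              \
          struct per_cpu_pages *_ret;                             \
          pcpu_task_pin();                                        \
          _ret = this_cpu_ptr(ptr);                               \
          spin_lock_irqsave(&_ret->lock, flags);                  \
          _ret;                                                   \
  })

  #define pcpu_spin_unlock_irqrestore(ptr, flags)                 \
  do {                                                            \
          spin_unlock_irqrestore(&(ptr)->lock, flags);            \
          pcpu_task_unpin();                                      \
  } while (0)

The allocation/free paths would then do
pcp = pcpu_spin_lock_irqsave(zone->per_cpu_pageset, flags) and
pcpu_spin_unlock_irqrestore(pcp, flags), while the remote drain keeps
using spin_lock_irqsave(&pcp->lock, flags) on its own.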

--
Mel Gorman
SUSE Labs

2022-05-14 01:00:50

by Andrew Morton

Subject: Re: [PATCH 0/6] Drain remote per-cpu directly v3

On Thu, 12 May 2022 09:50:37 +0100 Mel Gorman <[email protected]> wrote:

> Changelog since v2
> o More conversions from page->lru to page->[pcp_list|buddy_list]
> o Additional test results in changelogs
>
> Changelog since v1
> o Fix unsafe RT locking scheme
> o Use spin_trylock on UP PREEMPT_RT
>
> This series has the same intent as Nicolas' series "mm/page_alloc: Remote
> per-cpu lists drain support" -- avoid interference of a high priority
> task due to a workqueue item draining per-cpu page lists. While many
> workloads can tolerate a brief interruption, it may be cause a real-time

s/may be/may/

> task runnning on a NOHZ_FULL CPU to miss a deadline and at minimum,

s/nnn/nn/

> the draining in non-deterministic.

s/n/s/;)

> Currently an IRQ-safe local_lock protects the page allocator per-cpu lists.
> The local_lock on its own prevents migration and the IRQ disabling protects
> from corruption due to an interrupt arriving while a page allocation is
> in progress. The locking is inherently unsafe for remote access unless
> the CPU is hot-removed.

I don't understand the final sentence here. Which CPU and why does
hot-removing it make the locking safe?

> This series adjusts the locking. A spinlock is added to struct
> per_cpu_pages to protect the list contents while local_lock_irq continues
> to prevent migration and IRQ reentry. This allows a remote CPU to safely
> drain a remote per-cpu list.
>
> This series is a partial series. Follow-on work should allow the
> local_irq_save to be converted to a local_irq to avoid IRQs being
> disabled/enabled in most cases. Consequently, there are some TODO comments
> highlighting the places that would change if local_irq was used. However,
> there are enough corner cases that it deserves a series on its own
> separated by one kernel release and the priority right now is to avoid
> interference of high priority tasks.


2022-05-14 01:24:50

by Mel Gorman

Subject: Re: [PATCH 6/6] mm/page_alloc: Remotely drain per-cpu lists

On Thu, May 12, 2022 at 12:37:43PM -0700, Andrew Morton wrote:
> On Thu, 12 May 2022 09:50:43 +0100 Mel Gorman <[email protected]> wrote:
>
> > From: Nicolas Saenz Julienne <[email protected]>
> >
> > Some setups, notably NOHZ_FULL CPUs, are too busy to handle the per-cpu
> > drain work queued by __drain_all_pages(). So introduce a new mechanism to
> > remotely drain the per-cpu lists. It is made possible by remotely locking
> > 'struct per_cpu_pages' new per-cpu spinlocks. A benefit of this new scheme
> > is that drain operations are now migration safe.
> >
> > There was no observed performance degradation vs. the previous scheme.
> > Both netperf and hackbench were run in parallel to triggering the
> > __drain_all_pages(NULL, true) code path around ~100 times per second.
> > The new scheme performs a bit better (~5%), although the important point
> > here is there are no performance regressions vs. the previous mechanism.
> > Per-cpu lists draining happens only in slow paths.
> >
> > Minchan Kim tested this independently and reported;
> >
> > My workload is not NOHZ CPUs but run apps under heavy memory
> > pressure so they goes to direct reclaim and be stuck on
> > drain_all_pages until work on workqueue run.
> >
> > unit: nanosecond
> > max(dur) avg(dur) count(dur)
> > 166713013 487511.77786438033 1283
> >
> > From traces, system encountered the drain_all_pages 1283 times and
> > worst case was 166ms and avg was 487us.
> >
> > The other problem was alloc_contig_range in CMA. The PCP draining
> > takes several hundred millisecond sometimes though there is no
> > memory pressure or a few of pages to be migrated out but CPU were
> > fully booked.
> >
> > Your patch perfectly removed those wasted time.
>
> I'm not getting a sense here of the overall effect upon userspace
> performance. As Thomas said last year in
> https://lkml.kernel.org/r/87v92sgt3n.ffs@tglx
>
> : The changelogs and the cover letter have a distinct void vs. that which
> : means this is just another example of 'scratch my itch' changes w/o
> : proper justification.
>
> Is there more to all of this than itchiness and if so, well, you know
> the rest ;)
>

I think Minchan's example is clear-cut. The draining operation can take
an arbitrary amount of time waiting for the workqueue to run on each CPU
and can cause severe delays under reclaim or CMA and the patch fixes
it. Maybe most users won't even notice but I bet phone users do if a
camera app takes too long to open.

The first paragraph was written by Nicolas and I did not want to modify
it heavily while still keeping his Signed-off-by on it. Maybe it could
have been clearer though, because "too busy" is vague when the actual
intent is to avoid interfering with RT tasks. Does this sound better to you?

Some setups, notably NOHZ_FULL CPUs, may be running realtime or
latency-sensitive applications that cannot tolerate interference
due to per-cpu drain work queued by __drain_all_pages(). Introduce
a new mechanism to remotely drain the per-cpu lists. It is made
possible by remotely locking 'struct per_cpu_pages' new per-cpu
spinlocks. This has two advantages, the time to drain is more
predictable and other unrelated tasks are not interrupted.

You raise a very valid point with Thomas' mail and it is a concern that
the local_lock is no longer strictly local. We still need preemption to
be disabled between the percpu lookup and the lock acquisition but that
can be done with get_cpu_var() to make the scope clear.

Assuming this passes testing and review, would something like this be
preferable to you? It removes pcp_trylock_* because spin_trylock_irqsave
does not have the same problems on !SMP as spin_trylock but something like
it would come back if spin_lock_irqsave(pcp) was converted to spin_lock().

--8<--
mm/page_alloc: Replace local_lock with get_cpu_var

struct per_cpu_pages is no longer strictly local as PCP lists can be
drained remotely using a lock for protection. While the use of local_lock
works, it goes against the intent of local_lock which is for "pure
CPU local concurrency control mechanisms and not suited for inter-CPU
concurrency control" (Documentation/locking/locktypes.rst)

local_lock protects against migration between when the percpu pointer is
accessed and the pcp->lock acquired. The lock acquisition is a preemption
point so in the worst case, a task could migrate to another NUMA node
and accidentally allocate remote memory.

Replace local_lock with get_cpu_var to make it clear what disabling
preemption is protecting.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 144 +++++++++++++++++++-------------------------------------
1 file changed, 48 insertions(+), 96 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0f5a6a5b0302..5c06139d8c5a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -125,27 +125,6 @@ typedef int __bitwise fpi_t;
static DEFINE_MUTEX(pcp_batch_high_lock);
#define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)

-struct pagesets {
- local_lock_t lock;
-};
-static DEFINE_PER_CPU(struct pagesets, pagesets) __maybe_unused = {
- .lock = INIT_LOCAL_LOCK(lock),
-};
-
-#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT)
-/*
- * On SMP, spin_trylock is sufficient protection.
- * On PREEMPT_RT, spin_trylock is equivalent on both SMP and UP.
- */
-#define pcp_trylock_prepare(flags) do { } while (0)
-#define pcp_trylock_finish(flag) do { } while (0)
-#else
-
-/* UP spin_trylock always succeeds so disable IRQs to prevent re-entrancy. */
-#define pcp_trylock_prepare(flags) local_irq_save(flags)
-#define pcp_trylock_finish(flags) local_irq_restore(flags)
-#endif
-
#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
DEFINE_PER_CPU(int, numa_node);
EXPORT_PER_CPU_SYMBOL(numa_node);
@@ -1466,10 +1445,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
/* Ensure requested pindex is drained first. */
pindex = pindex - 1;

- /*
- * local_lock_irq held so equivalent to spin_lock_irqsave for
- * both PREEMPT_RT and non-PREEMPT_RT configurations.
- */
+ /* Caller must hold IRQ-safe pcp->lock so IRQs are disabled. */
spin_lock(&zone->lock);
isolated_pageblocks = has_isolate_pageblock(zone);

@@ -3037,10 +3013,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
{
int i, allocated = 0;

- /*
- * local_lock_irq held so equivalent to spin_lock_irqsave for
- * both PREEMPT_RT and non-PREEMPT_RT configurations.
- */
+ /* Caller must hold IRQ-safe pcp->lock so IRQs are disabled. */
spin_lock(&zone->lock);
for (i = 0; i < count; ++i) {
struct page *page = __rmqueue(zone, order, migratetype,
@@ -3354,28 +3327,20 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
}

/* Returns true if the page was committed to the per-cpu list. */
-static bool free_unref_page_commit(struct page *page, int migratetype,
+static bool free_unref_page_commit(struct per_cpu_pages *pcp, struct zone *zone,
+ struct page *page, int migratetype,
unsigned int order, bool locked)
{
- struct zone *zone = page_zone(page);
- struct per_cpu_pages *pcp;
int high;
int pindex;
bool free_high;
- unsigned long __maybe_unused UP_flags;
+ unsigned long flags;

__count_vm_event(PGFREE);
- pcp = this_cpu_ptr(zone->per_cpu_pageset);
pindex = order_to_pindex(migratetype, order);

- if (!locked) {
- /* Protect against a parallel drain. */
- pcp_trylock_prepare(UP_flags);
- if (!spin_trylock(&pcp->lock)) {
- pcp_trylock_finish(UP_flags);
- return false;
- }
- }
+ if (!locked && !spin_trylock_irqsave(&pcp->lock, flags))
+ return false;

list_add(&page->pcp_list, &pcp->lists[pindex]);
pcp->count += 1 << order;
@@ -3395,10 +3360,8 @@ static bool free_unref_page_commit(struct page *page, int migratetype,
free_pcppages_bulk(zone, nr_pcp_free(pcp, high, batch, free_high), pcp, pindex);
}

- if (!locked) {
- spin_unlock(&pcp->lock);
- pcp_trylock_finish(UP_flags);
- }
+ if (!locked)
+ spin_unlock_irqrestore(&pcp->lock, flags);

return true;
}
@@ -3408,7 +3371,8 @@ static bool free_unref_page_commit(struct page *page, int migratetype,
*/
void free_unref_page(struct page *page, unsigned int order)
{
- unsigned long flags;
+ struct per_cpu_pages *pcp;
+ struct zone *zone;
unsigned long pfn = page_to_pfn(page);
int migratetype;
bool freed_pcp = false;
@@ -3432,9 +3396,10 @@ void free_unref_page(struct page *page, unsigned int order)
migratetype = MIGRATE_MOVABLE;
}

- local_lock_irqsave(&pagesets.lock, flags);
- freed_pcp = free_unref_page_commit(page, migratetype, order, false);
- local_unlock_irqrestore(&pagesets.lock, flags);
+ zone = page_zone(page);
+ pcp = &get_cpu_var(*zone->per_cpu_pageset);
+ freed_pcp = free_unref_page_commit(pcp, zone, page, migratetype, order, false);
+ put_cpu_var(*zone->per_cpu_pageset);

if (unlikely(!freed_pcp))
free_one_page(page_zone(page), page, pfn, order, migratetype, FPI_NONE);
@@ -3488,20 +3453,21 @@ void free_unref_page_list(struct list_head *list)

VM_BUG_ON(in_hardirq());

- local_lock_irqsave(&pagesets.lock, flags);
-
page = lru_to_page(list);
locked_zone = page_zone(page);
- pcp = this_cpu_ptr(locked_zone->per_cpu_pageset);
- spin_lock(&pcp->lock);
+ pcp = &get_cpu_var(*locked_zone->per_cpu_pageset);
+ spin_lock_irqsave(&pcp->lock, flags);

list_for_each_entry_safe(page, next, list, lru) {
struct zone *zone = page_zone(page);

/* Different zone, different pcp lock. */
if (zone != locked_zone) {
+ /* Leave IRQs disabled as the new pcp lock is acquired. */
spin_unlock(&pcp->lock);
locked_zone = zone;
+
+ /* Preemption already disabled by get_cpu_var. */
pcp = this_cpu_ptr(zone->per_cpu_pageset);
spin_lock(&pcp->lock);
}
@@ -3522,27 +3488,26 @@ void free_unref_page_list(struct list_head *list)
* be acquired multiple times but if a drain is in progress
* then an expensive operation is already taking place.
*
- * TODO: Always false at the moment due to local_lock_irqsave
- * and is preparation for converting to local_lock.
+ * TODO: Always false at the moment due to spin_lock_irqsave
+ * and is preparation for converting to spin_lock.
*/
- if (unlikely(!free_unref_page_commit(page, migratetype, 0, true)))
- free_one_page(page_zone(page), page, page_to_pfn(page), 0, migratetype, FPI_NONE);
+ if (unlikely(!free_unref_page_commit(pcp, zone, page, migratetype, 0, true)))
+ free_one_page(zone, page, page_to_pfn(page), 0, migratetype, FPI_NONE);

/*
* Guard against excessive IRQ disabled times when we get
* a large list of pages to free.
*/
if (++batch_count == SWAP_CLUSTER_MAX) {
- spin_unlock(&pcp->lock);
- local_unlock_irqrestore(&pagesets.lock, flags);
+ spin_unlock_irqrestore(&pcp->lock, flags);
+ put_cpu_var(*locked_zone->per_cpu_pageset);
batch_count = 0;
- local_lock_irqsave(&pagesets.lock, flags);
- pcp = this_cpu_ptr(locked_zone->per_cpu_pageset);
- spin_lock(&pcp->lock);
+ pcp = &get_cpu_var(*locked_zone->per_cpu_pageset);
+ spin_lock_irqsave(&pcp->lock, flags);
}
}
- spin_unlock(&pcp->lock);
- local_unlock_irqrestore(&pagesets.lock, flags);
+ spin_unlock_irqrestore(&pcp->lock, flags);
+ put_cpu_var(*locked_zone->per_cpu_pageset);
}

/*
@@ -3717,24 +3682,18 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
bool locked)
{
struct page *page;
- unsigned long __maybe_unused UP_flags;
+ unsigned long flags;

/*
* spin_trylock is not necessary right now due to
- * local_lock_irqsave and is a preparation step for
- * a conversion to local_lock using the trylock to prevent
- * IRQ re-entrancy. If pcp->lock cannot be acquired, the caller
- * uses rmqueue_buddy.
+ * IRQ-safe pcp->lock and is a preparation step for a conversion to
+ * spin_lock using the trylock to prevent IRQ re-entrancy. If
+ * pcp->lock cannot be acquired, the caller uses rmqueue_buddy.
*
- * TODO: Convert local_lock_irqsave to local_lock.
+ * TODO: Convert pcp spin_lock_irqsave to spin_lock.
*/
- if (unlikely(!locked)) {
- pcp_trylock_prepare(UP_flags);
- if (!spin_trylock(&pcp->lock)) {
- pcp_trylock_finish(UP_flags);
- return NULL;
- }
- }
+ if (!locked && !spin_trylock_irqsave(&pcp->lock, flags))
+ return NULL;

do {
if (list_empty(list)) {
@@ -3767,10 +3726,8 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
} while (check_new_pcp(page, order));

out:
- if (!locked) {
- spin_unlock(&pcp->lock);
- pcp_trylock_finish(UP_flags);
- }
+ if (!locked)
+ spin_unlock_irqrestore(&pcp->lock, flags);

return page;
}
@@ -3784,20 +3741,17 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
struct per_cpu_pages *pcp;
struct list_head *list;
struct page *page;
- unsigned long flags;
-
- local_lock_irqsave(&pagesets.lock, flags);

/*
* On allocation, reduce the number of pages that are batch freed.
* See nr_pcp_free() where free_factor is increased for subsequent
* frees.
*/
- pcp = this_cpu_ptr(zone->per_cpu_pageset);
+ pcp = &get_cpu_var(*zone->per_cpu_pageset);
pcp->free_factor >>= 1;
list = &pcp->lists[order_to_pindex(migratetype, order)];
page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list, false);
- local_unlock_irqrestore(&pagesets.lock, flags);
+ put_cpu_var(*zone->per_cpu_pageset);
if (page) {
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1);
zone_statistics(preferred_zone, zone, 1);
@@ -5396,10 +5350,9 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
goto failed;

/* Attempt the batch allocation */
- local_lock_irqsave(&pagesets.lock, flags);
- pcp = this_cpu_ptr(zone->per_cpu_pageset);
+ pcp = &get_cpu_var(*zone->per_cpu_pageset);
pcp_list = &pcp->lists[order_to_pindex(ac.migratetype, 0)];
- spin_lock(&pcp->lock);
+ spin_lock_irqsave(&pcp->lock, flags);

while (nr_populated < nr_pages) {

@@ -5413,10 +5366,8 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
pcp, pcp_list, true);
if (unlikely(!page)) {
/* Try and get at least one page */
- if (!nr_populated) {
- spin_unlock(&pcp->lock);
+ if (!nr_populated)
goto failed_irq;
- }
break;
}
nr_account++;
@@ -5429,8 +5380,8 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
nr_populated++;
}

- spin_unlock(&pcp->lock);
- local_unlock_irqrestore(&pagesets.lock, flags);
+ spin_unlock_irqrestore(&pcp->lock, flags);
+ put_cpu_var(*zone->per_cpu_pageset);

__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
zone_statistics(ac.preferred_zoneref->zone, zone, nr_account);
@@ -5439,7 +5390,8 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
return nr_populated;

failed_irq:
- local_unlock_irqrestore(&pagesets.lock, flags);
+ spin_unlock_irqrestore(&pcp->lock, flags);
+ put_cpu_var(*zone->per_cpu_pageset);

failed:
page = __alloc_pages(gfp, 0, preferred_nid, nodemask);


--
Mel Gorman
SUSE Labs

2022-05-14 01:28:55

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 6/6] mm/page_alloc: Remotely drain per-cpu lists

On Thu, 12 May 2022 09:50:43 +0100 Mel Gorman <[email protected]> wrote:

> From: Nicolas Saenz Julienne <[email protected]>
>
> Some setups, notably NOHZ_FULL CPUs, are too busy to handle the per-cpu
> drain work queued by __drain_all_pages(). So introduce a new mechanism to
> remotely drain the per-cpu lists. It is made possible by remotely locking
> 'struct per_cpu_pages' new per-cpu spinlocks. A benefit of this new scheme
> is that drain operations are now migration safe.
>
> There was no observed performance degradation vs. the previous scheme.
> Both netperf and hackbench were run in parallel to triggering the
> __drain_all_pages(NULL, true) code path around ~100 times per second.
> The new scheme performs a bit better (~5%), although the important point
> here is there are no performance regressions vs. the previous mechanism.
> Per-cpu lists draining happens only in slow paths.
>
> Minchan Kim tested this independently and reported;
>
> My workload is not NOHZ CPUs but run apps under heavy memory
> pressure so they goes to direct reclaim and be stuck on
> drain_all_pages until work on workqueue run.
>
> unit: nanosecond
> max(dur) avg(dur) count(dur)
> 166713013 487511.77786438033 1283
>
> From traces, system encountered the drain_all_pages 1283 times and
> worst case was 166ms and avg was 487us.
>
> The other problem was alloc_contig_range in CMA. The PCP draining
> takes several hundred millisecond sometimes though there is no
> memory pressure or a few of pages to be migrated out but CPU were
> fully booked.
>
> Your patch perfectly removed those wasted time.

I'm not getting a sense here of the overall effect upon userspace
performance. As Thomas said last year in
https://lkml.kernel.org/r/87v92sgt3n.ffs@tglx

: The changelogs and the cover letter have a distinct void vs. that which
: means this is just another example of 'scratch my itch' changes w/o
: proper justification.

Is there more to all of this than itchiness and if so, well, you know
the rest ;)


2022-05-14 01:37:23

by Nicolas Saenz Julienne

[permalink] [raw]
Subject: Re: [PATCH 4/6] mm/page_alloc: Remove unnecessary page == NULL check in rmqueue

On Thu, 2022-05-12 at 09:50 +0100, Mel Gorman wrote:
> The VM_BUG_ON check for a valid page can be avoided with a simple
> change in the flow. The ZONE_BOOSTED_WATERMARK is unlikely in general
> and even more unlikely if the page allocation failed so mark the
> branch unlikely.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Tested-by: Minchan Kim <[email protected]>
> Acked-by: Minchan Kim <[email protected]>
> ---

Reviewed-by: Nicolas Saenz Julienne <[email protected]>

Thanks,

--
Nicolás Sáenz


2022-05-14 02:07:32

by Nicolas Saenz Julienne

[permalink] [raw]
Subject: Re: [PATCH 1/6] mm/page_alloc: Add page->buddy_list and page->pcp_list

On Thu, 2022-05-12 at 09:50 +0100, Mel Gorman wrote:
> The page allocator uses page->lru for storing pages on either buddy
> or
> PCP lists. Create page->buddy_list and page->pcp_list as a union with
> page->lru. This is simply to clarify what type of list a page is on
> in the page allocator.
>
> No functional change intended.
>
> [minchan: Fix page lru fields in macros]
> Signed-off-by: Mel Gorman <[email protected]>
> Tested-by: Minchan Kim <[email protected]>
> Acked-by: Minchan Kim <[email protected]>
> ---

Reviewed-by: Nicolas Saenz Julienne <[email protected]>

Thanks,

--
Nicolás Sáenz


2022-05-14 02:35:00

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 0/6] Drain remote per-cpu directly v3

On Fri, 13 May 2022 15:23:30 +0100 Mel Gorman <[email protected]> wrote:

> Correct.
>
> > > the draining in non-deterministic.
> >
> > s/n/s/;)
> >
>
> Think that one is ok. At least spell check did not complain.

s/in/is/

> > > Currently an IRQ-safe local_lock protects the page allocator per-cpu lists.
> > > The local_lock on its own prevents migration and the IRQ disabling protects
> > > from corruption due to an interrupt arriving while a page allocation is
> > > in progress. The locking is inherently unsafe for remote access unless
> > > the CPU is hot-removed.
> >
> > I don't understand the final sentence here. Which CPU and why does
> > hot-removing it make the locking safe?
> >
>
> The sentence can be dropped because it adds little and is potentially
> confusing. The PCP being safe to access remotely is specific to the
> context of the CPU being hot-removed and there are other special corner
> cases like zone_pcp_disable that modifies a per-cpu structure remotely
> but not in a way that causes corruption.

OK. I pasted in your para from the other email. Current 0/n blurb:

Some setups, notably NOHZ_FULL CPUs, may be running realtime or
latency-sensitive applications that cannot tolerate interference due to
per-cpu drain work queued by __drain_all_pages(). Introduce a new
mechanism to remotely drain the per-cpu lists. It is made possible by
remotely locking 'struct per_cpu_pages' new per-cpu spinlocks. This has
two advantages, the time to drain is more predictable and other unrelated
tasks are not interrupted.

This series has the same intent as Nicolas' series "mm/page_alloc: Remote
per-cpu lists drain support" -- avoid interference of a high priority task
due to a workqueue item draining per-cpu page lists. While many workloads
can tolerate a brief interruption, it may cause a real-time task running
on a NOHZ_FULL CPU to miss a deadline and at minimum, the draining is
non-deterministic.

Currently an IRQ-safe local_lock protects the page allocator per-cpu
lists. The local_lock on its own prevents migration and the IRQ disabling
protects from corruption due to an interrupt arriving while a page
allocation is in progress.

This series adjusts the locking. A spinlock is added to struct
per_cpu_pages to protect the list contents while local_lock_irq continues
to prevent migration and IRQ reentry. This allows a remote CPU to safely
drain a remote per-cpu list.

This series is a partial series. Follow-on work should allow the
local_irq_save to be converted to a local_irq to avoid IRQs being
disabled/enabled in most cases. Consequently, there are some TODO
comments highlighting the places that would change if local_irq was used.
However, there are enough corner cases that it deserves a series on its
own separated by one kernel release and the priority right now is to avoid
interference of high priority tasks.

Patch 1 is a cosmetic patch to clarify when page->lru is storing buddy pages
and when it is storing per-cpu pages.

Patch 2 shrinks per_cpu_pages to make room for a spin lock. Strictly speaking
this is not necessary but it avoids per_cpu_pages consuming another
cache line.

Patch 3 is a preparation patch to avoid code duplication.

Patch 4 is a simple micro-optimisation that improves code flow necessary for
a later patch to avoid code duplication.

Patch 5 uses a spin_lock to protect the per_cpu_pages contents while still
relying on local_lock to prevent migration, stabilise the pcp
lookup and prevent IRQ reentrancy.

Patch 6 remote drains per-cpu pages directly instead of using a workqueue.


2022-05-14 02:45:43

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 3/6] mm/page_alloc: Split out buddy removal code from rmqueue into separate helper

This is a preparation patch to allow the buddy removal code to be reused
in a later patch.

No functional change.

Signed-off-by: Mel Gorman <[email protected]>
Tested-by: Minchan Kim <[email protected]>
Acked-by: Minchan Kim <[email protected]>
---
mm/page_alloc.c | 87 ++++++++++++++++++++++++++++---------------------
1 file changed, 50 insertions(+), 37 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5851ee88a89c..1c4c54503a5d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3622,6 +3622,46 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
#endif
}

+static __always_inline
+struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
+ unsigned int order, unsigned int alloc_flags,
+ int migratetype)
+{
+ struct page *page;
+ unsigned long flags;
+
+ do {
+ page = NULL;
+ spin_lock_irqsave(&zone->lock, flags);
+ /*
+ * order-0 request can reach here when the pcplist is skipped
+ * due to non-CMA allocation context. HIGHATOMIC area is
+ * reserved for high-order atomic allocation, so order-0
+ * request should skip it.
+ */
+ if (order > 0 && alloc_flags & ALLOC_HARDER) {
+ page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
+ if (page)
+ trace_mm_page_alloc_zone_locked(page, order, migratetype);
+ }
+ if (!page) {
+ page = __rmqueue(zone, order, migratetype, alloc_flags);
+ if (!page) {
+ spin_unlock_irqrestore(&zone->lock, flags);
+ return NULL;
+ }
+ }
+ __mod_zone_freepage_state(zone, -(1 << order),
+ get_pcppage_migratetype(page));
+ spin_unlock_irqrestore(&zone->lock, flags);
+ } while (check_new_pages(page, order));
+
+ __count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
+ zone_statistics(preferred_zone, zone, 1);
+
+ return page;
+}
+
/* Remove page from the per-cpu list, caller must protect the list */
static inline
struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
@@ -3702,9 +3742,14 @@ struct page *rmqueue(struct zone *preferred_zone,
gfp_t gfp_flags, unsigned int alloc_flags,
int migratetype)
{
- unsigned long flags;
struct page *page;

+ /*
+ * We most definitely don't want callers attempting to
+ * allocate greater than order-1 page units with __GFP_NOFAIL.
+ */
+ WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
+
if (likely(pcp_allowed_order(order))) {
/*
* MIGRATE_MOVABLE pcplist could have the pages on CMA area and
@@ -3718,38 +3763,10 @@ struct page *rmqueue(struct zone *preferred_zone,
}
}

- /*
- * We most definitely don't want callers attempting to
- * allocate greater than order-1 page units with __GFP_NOFAIL.
- */
- WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
-
- do {
- page = NULL;
- spin_lock_irqsave(&zone->lock, flags);
- /*
- * order-0 request can reach here when the pcplist is skipped
- * due to non-CMA allocation context. HIGHATOMIC area is
- * reserved for high-order atomic allocation, so order-0
- * request should skip it.
- */
- if (order > 0 && alloc_flags & ALLOC_HARDER) {
- page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
- if (page)
- trace_mm_page_alloc_zone_locked(page, order, migratetype);
- }
- if (!page) {
- page = __rmqueue(zone, order, migratetype, alloc_flags);
- if (!page)
- goto failed;
- }
- __mod_zone_freepage_state(zone, -(1 << order),
- get_pcppage_migratetype(page));
- spin_unlock_irqrestore(&zone->lock, flags);
- } while (check_new_pages(page, order));
-
- __count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
- zone_statistics(preferred_zone, zone, 1);
+ page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
+ migratetype);
+ if (unlikely(!page))
+ return NULL;

out:
/* Separate test+clear to avoid unnecessary atomics */
@@ -3760,10 +3777,6 @@ struct page *rmqueue(struct zone *preferred_zone,

VM_BUG_ON_PAGE(page && bad_range(zone, page), page);
return page;
-
-failed:
- spin_unlock_irqrestore(&zone->lock, flags);
- return NULL;
}

#ifdef CONFIG_FAIL_PAGE_ALLOC
--
2.34.1


2022-05-14 04:08:08

by Nicolas Saenz Julienne

[permalink] [raw]
Subject: Re: [PATCH 5/6] mm/page_alloc: Protect PCP lists with a spinlock

On Thu, 2022-05-12 at 09:50 +0100, Mel Gorman wrote:
> Currently the PCP lists are protected by using local_lock_irqsave to
> prevent migration and IRQ reentrancy but this is inconvenient. Remote
> draining of the lists is impossible and a workqueue is required and
> every task allocation/free must disable then enable interrupts which is
> expensive.
>
> As preparation for dealing with both of those problems, protect the
> lists with a spinlock. The IRQ-unsafe version of the lock is used
> because IRQs are already disabled by local_lock_irqsave. spin_trylock
> is used in preparation for a time when local_lock could be used instead
> of local_lock_irqsave.
>
> The per_cpu_pages still fits within the same number of cache lines after
> this patch relative to before the series.
>
> struct per_cpu_pages {
>         spinlock_t                 lock;                 /*     0     4 */
>         int                        count;                /*     4     4 */
>         int                        high;                 /*     8     4 */
>         int                        batch;                /*    12     4 */
>         short int                  free_factor;          /*    16     2 */
>         short int                  expire;               /*    18     2 */
>
>         /* XXX 4 bytes hole, try to pack */
>
>         struct list_head           lists[13];            /*    24   208 */
>
>         /* size: 256, cachelines: 4, members: 7 */
>         /* sum members: 228, holes: 1, sum holes: 4 */
>         /* padding: 24 */
> } __attribute__((__aligned__(64)));
>
> There is overhead in the fast path due to acquiring the spinlock even
> though the spinlock is per-cpu and uncontended in the common case. Page
> Fault Test (PFT) running on a 1-socket machine reported the following
> results.
>
>                                      5.18.0-rc1               5.18.0-rc1
>                                         vanilla         mm-pcpdrain-v2r1
> Hmean     faults/sec-1   886331.5718 (   0.00%)   885462.7479 (  -0.10%)
> Hmean     faults/sec-3  2337706.1583 (   0.00%)  2332130.4909 *  -0.24%*
> Hmean     faults/sec-5  2851594.2897 (   0.00%)  2844123.9307 (  -0.26%)
> Hmean     faults/sec-7  3543251.5507 (   0.00%)  3516889.0442 *  -0.74%*
> Hmean     faults/sec-8  3947098.0024 (   0.00%)  3916162.8476 *  -0.78%*
> Stddev    faults/sec-1     2302.9105 (   0.00%)     2065.0845 (  10.33%)
> Stddev    faults/sec-3     7275.2442 (   0.00%)     6033.2620 (  17.07%)
> Stddev    faults/sec-5    24726.0328 (   0.00%)    12525.1026 (  49.34%)
> Stddev    faults/sec-7     9974.2542 (   0.00%)     9543.9627 (   4.31%)
> Stddev    faults/sec-8     9468.0191 (   0.00%)     7958.2607 (  15.95%)
> CoeffVar  faults/sec-1        0.2598 (   0.00%)        0.2332 (  10.24%)
> CoeffVar  faults/sec-3        0.3112 (   0.00%)        0.2587 (  16.87%)
> CoeffVar  faults/sec-5        0.8670 (   0.00%)        0.4404 (  49.21%)
> CoeffVar  faults/sec-7        0.2815 (   0.00%)        0.2714 (   3.60%)
> CoeffVar  faults/sec-8        0.2399 (   0.00%)        0.2032 (  15.28%)
>
> There is a small hit in the number of faults per second but given that
> the results are more stable, it's borderline noise.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Tested-by: Minchan Kim <[email protected]>
> Acked-by: Minchan Kim <[email protected]>
> ---

Reviewed-by: Nicolas Saenz Julienne <[email protected]>

Thanks,

--
Nicolás Sáenz

2022-05-17 02:58:55

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 0/6] Drain remote per-cpu directly v3

On Fri, May 13, 2022 at 12:38:05PM -0700, Andrew Morton wrote:
> > The sentence can be dropped because it adds little and is potentially
> > confusing. The PCP being safe to access remotely is specific to the
> > context of the CPU being hot-removed and there are other special corner
> > cases like zone_pcp_disable that modifies a per-cpu structure remotely
> > but not in a way that causes corruption.
>
> OK. I pasted in your para from the other email. Current 0/n blurb:
>
> Some setups, notably NOHZ_FULL CPUs, may be running realtime or
> latency-sensitive applications that cannot tolerate interference due to
> per-cpu drain work queued by __drain_all_pages(). Introduce a new
> mechanism to remotely drain the per-cpu lists. It is made possible by
> remotely locking 'struct per_cpu_pages' new per-cpu spinlocks. This has
> two advantages, the time to drain is more predictable and other unrelated
> tasks are not interrupted.
>
> This series has the same intent as Nicolas' series "mm/page_alloc: Remote
> per-cpu lists drain support" -- avoid interference of a high priority task
> due to a workqueue item draining per-cpu page lists. While many workloads
> can tolerate a brief interruption, it may cause a real-time task running
> on a NOHZ_FULL CPU to miss a deadline and at minimum, the draining is
> non-deterministic.
>
> Currently an IRQ-safe local_lock protects the page allocator per-cpu
> lists. The local_lock on its own prevents migration and the IRQ disabling
> protects from corruption due to an interrupt arriving while a page
> allocation is in progress.
>
> This series adjusts the locking. A spinlock is added to struct
> per_cpu_pages to protect the list contents while local_lock_irq continues
> to prevent migration and IRQ reentry. This allows a remote CPU to safely
> drain a remote per-cpu list.
>
> This series is a partial series. Follow-on work should allow the
> local_irq_save to be converted to a local_irq to avoid IRQs being
> disabled/enabled in most cases. Consequently, there are some TODO
> comments highlighting the places that would change if local_irq was used.
> However, there are enough corner cases that it deserves a series on its
> own separated by one kernel release and the priority right now is to avoid
> interference of high priority tasks.
>

Looks good, thanks!

--
Mel Gorman
SUSE Labs

2022-05-17 22:29:53

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 6/6] mm/page_alloc: Remotely drain per-cpu lists

On Fri, May 13, 2022 at 07:23:01PM +0100, Mel Gorman wrote:
> > > You raise a very valid point with Thomas' mail and it is a concern that
> > > the local_lock is no longer strictly local. We still need preemption to
> > > be disabled between the percpu lookup and the lock acquisition but that
> > > can be done with get_cpu_var() to make the scope clear.
> >
> > This isn't going to work in RT :(
> >
> > get_cpu_var() disables preemption hampering RT spinlock use. There is more to
> > it in Documentation/locking/locktypes.rst.
> >
>
> Bah, you're right. A helper that called preempt_disable() on !RT
> and migrate_disable() on RT would work although similar to local_lock
> with a different name. I'll look on Monday to see how the code could be
> restructured to always have the get_cpu_var() call immediately before the
> lock acquisition. Once that is done, I'll look what sort of helper that
> "disables preempt/migration, lookup pcp structure, acquire lock, enable
> preempt/migration". It's effectively the magic trick that local_lock uses
> to always lock the right pcpu lock but we want the spinlock semantics
> for remote drain.
>

Monday was busier than I expected. The alternative to local_lock currently
looks like this but still needs testing. There is some churn because it
was no longer possible to have the CPU pinning separate from the spinlock
acquisition. It should still be possible to make pcp->lock a normal
spinlock but I haven't confirmed that yet.
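For reference, the pattern that pcp_spin_lock_irqsave() and
pcp_spin_unlock_irqrestore() in the patch below collapse into looks roughly
like the sketch here (the zone pointer is assumed to be in scope; the
!PREEMPT_RT case is shown, PREEMPT_RT swaps in migrate_disable() and
migrate_enable()):

	unsigned long flags;
	struct per_cpu_pages *pcp;

	preempt_disable();	/* pin: lookup and lock cannot be split by migration */
	pcp = this_cpu_ptr(zone->per_cpu_pageset);
	spin_lock_irqsave(&pcp->lock, flags);

	/* ... allocate from or free to the pcp lists ... */

	spin_unlock_irqrestore(&pcp->lock, flags);
	preempt_enable();	/* unpin */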

---8<---
mm/page_alloc: Replace local_lock with normal spinlock

struct per_cpu_pages is no longer strictly local as PCP lists can be
drained remotely using a lock for protection. While the use of local_lock
works, it goes against the intent of local_lock which is for "pure
CPU local concurrency control mechanisms and not suited for inter-CPU
concurrency control" (Documentation/locking/locktypes.rst)

local_lock protects against migration between when the percpu pointer is
accessed and pcp->lock is acquired. The lock acquisition is a preemption
point so in the worst case, a task could migrate to another NUMA node
and accidentally allocate remote memory. The main requirement is to pin
the task to a CPU in a manner that is suitable for both PREEMPT_RT and
!PREEMPT_RT.

Replace local_lock with helpers that pin a task to a CPU, lookup the
per-cpu structure and acquire the embedded lock. It's similar to local_lock
without breaking the intent behind the API.

---
mm/page_alloc.c | 225 ++++++++++++++++++++++++++++++--------------------------
1 file changed, 120 insertions(+), 105 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0f5a6a5b0302..d9c186bf498d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -125,13 +125,6 @@ typedef int __bitwise fpi_t;
static DEFINE_MUTEX(pcp_batch_high_lock);
#define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)

-struct pagesets {
- local_lock_t lock;
-};
-static DEFINE_PER_CPU(struct pagesets, pagesets) __maybe_unused = {
- .lock = INIT_LOCAL_LOCK(lock),
-};
-
#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT)
/*
* On SMP, spin_trylock is sufficient protection.
@@ -146,6 +139,80 @@ static DEFINE_PER_CPU(struct pagesets, pagesets) __maybe_unused = {
#define pcp_trylock_finish(flags) local_irq_restore(flags)
#endif

+/*
+ * Locking a pcp requires a PCP lookup followed by a spinlock. To avoid
+ * a migration causing the wrong PCP to be locked and remote memory being
+ * potentially allocated, pin the task to the CPU for the lookup+lock.
+ * preempt_disable is used on !RT because it is faster than migrate_disable.
+ * migrate_disable is used on RT because otherwise RT spinlock usage is
+ * interfered with and a high priority task cannot preempt the allocator.
+ */
+#ifndef CONFIG_PREEMPT_RT
+#define pcpu_task_pin() preempt_disable()
+#define pcpu_task_unpin() preempt_enable()
+#else
+#define pcpu_task_pin() migrate_disable()
+#define pcpu_task_unpin() migrate_enable()
+#endif
+
+/* Generic helper to look up a per-cpu variable with an embedded spinlock.
+ * Return value should be used with equivalent unlock helper.
+ */
+#define pcpu_spin_lock(type, member, ptr) \
+({ \
+ type *_ret; \
+ pcpu_task_pin(); \
+ _ret = this_cpu_ptr(ptr); \
+ spin_lock(&_ret->member); \
+ _ret; \
+})
+
+#define pcpu_spin_lock_irqsave(type, member, ptr, flags) \
+({ \
+ type *_ret; \
+ pcpu_task_pin(); \
+ _ret = this_cpu_ptr(ptr); \
+ spin_lock_irqsave(&_ret->member, flags); \
+ _ret; \
+})
+
+#define pcpu_spin_trylock_irqsave(type, member, ptr, flags) \
+({ \
+ type *_ret; \
+ pcpu_task_pin(); \
+ _ret = this_cpu_ptr(ptr); \
+ if (!spin_trylock_irqsave(&_ret->member, flags)) \
+ _ret = NULL; \
+ _ret; \
+})
+
+#define pcpu_spin_unlock(member, ptr) \
+({ \
+ spin_unlock(&ptr->member); \
+ pcpu_task_unpin(); \
+})
+
+#define pcpu_spin_unlock_irqrestore(member, ptr, flags) \
+({ \
+ spin_unlock_irqrestore(&ptr->member, flags); \
+ pcpu_task_unpin(); \
+})
+
+/* struct per_cpu_pages specific helpers. */
+#define pcp_spin_lock(ptr) \
+ pcpu_spin_lock(struct per_cpu_pages, lock, ptr)
+
+#define pcp_spin_lock_irqsave(ptr, flags) \
+ pcpu_spin_lock_irqsave(struct per_cpu_pages, lock, ptr, flags)
+
+#define pcp_spin_trylock_irqsave(ptr, flags) \
+ pcpu_spin_trylock_irqsave(struct per_cpu_pages, lock, ptr, flags)
+
+#define pcp_spin_unlock(ptr) \
+ pcpu_spin_unlock(lock, ptr)
+
+#define pcp_spin_unlock_irqrestore(ptr, flags) \
+ pcpu_spin_unlock_irqrestore(lock, ptr, flags)
#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
DEFINE_PER_CPU(int, numa_node);
EXPORT_PER_CPU_SYMBOL(numa_node);
@@ -1466,10 +1533,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
/* Ensure requested pindex is drained first. */
pindex = pindex - 1;

- /*
- * local_lock_irq held so equivalent to spin_lock_irqsave for
- * both PREEMPT_RT and non-PREEMPT_RT configurations.
- */
+ /* Caller must hold IRQ-safe pcp->lock so IRQs are disabled. */
spin_lock(&zone->lock);
isolated_pageblocks = has_isolate_pageblock(zone);

@@ -3037,10 +3101,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
{
int i, allocated = 0;

- /*
- * local_lock_irq held so equivalent to spin_lock_irqsave for
- * both PREEMPT_RT and non-PREEMPT_RT configurations.
- */
+ /* Caller must hold IRQ-safe pcp->lock so IRQs are disabled. */
spin_lock(&zone->lock);
for (i = 0; i < count; ++i) {
struct page *page = __rmqueue(zone, order, migratetype,
@@ -3353,30 +3414,17 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
return min(READ_ONCE(pcp->batch) << 2, high);
}

-/* Returns true if the page was committed to the per-cpu list. */
-static bool free_unref_page_commit(struct page *page, int migratetype,
- unsigned int order, bool locked)
+static void free_unref_page_commit(struct per_cpu_pages *pcp, struct zone *zone,
+ struct page *page, int migratetype,
+ unsigned int order)
{
- struct zone *zone = page_zone(page);
- struct per_cpu_pages *pcp;
int high;
int pindex;
bool free_high;
- unsigned long __maybe_unused UP_flags;

__count_vm_event(PGFREE);
- pcp = this_cpu_ptr(zone->per_cpu_pageset);
pindex = order_to_pindex(migratetype, order);

- if (!locked) {
- /* Protect against a parallel drain. */
- pcp_trylock_prepare(UP_flags);
- if (!spin_trylock(&pcp->lock)) {
- pcp_trylock_finish(UP_flags);
- return false;
- }
- }
-
list_add(&page->pcp_list, &pcp->lists[pindex]);
pcp->count += 1 << order;

@@ -3394,13 +3442,6 @@ static bool free_unref_page_commit(struct page *page, int migratetype,

free_pcppages_bulk(zone, nr_pcp_free(pcp, high, batch, free_high), pcp, pindex);
}
-
- if (!locked) {
- spin_unlock(&pcp->lock);
- pcp_trylock_finish(UP_flags);
- }
-
- return true;
}

/*
@@ -3408,10 +3449,12 @@ static bool free_unref_page_commit(struct page *page, int migratetype,
*/
void free_unref_page(struct page *page, unsigned int order)
{
- unsigned long flags;
+ struct per_cpu_pages *pcp;
+ struct zone *zone;
unsigned long pfn = page_to_pfn(page);
int migratetype;
- bool freed_pcp = false;
+ unsigned long flags;
+ unsigned long __maybe_unused UP_flags;

if (!free_unref_page_prepare(page, pfn, order))
return;
@@ -3432,12 +3475,16 @@ void free_unref_page(struct page *page, unsigned int order)
migratetype = MIGRATE_MOVABLE;
}

- local_lock_irqsave(&pagesets.lock, flags);
- freed_pcp = free_unref_page_commit(page, migratetype, order, false);
- local_unlock_irqrestore(&pagesets.lock, flags);
-
- if (unlikely(!freed_pcp))
+ zone = page_zone(page);
+ pcp_trylock_prepare(UP_flags);
+ pcp = pcpu_spin_trylock_irqsave(struct per_cpu_pages, lock, zone->per_cpu_pageset, flags);
+ if (pcp) {
+ free_unref_page_commit(pcp, zone, page, migratetype, order);
+ pcp_spin_unlock_irqrestore(pcp, flags);
+ } else {
free_one_page(page_zone(page), page, pfn, order, migratetype, FPI_NONE);
+ }
+ pcp_trylock_finish(UP_flags);
}

/*
@@ -3488,20 +3535,20 @@ void free_unref_page_list(struct list_head *list)

VM_BUG_ON(in_hardirq());

- local_lock_irqsave(&pagesets.lock, flags);
-
page = lru_to_page(list);
locked_zone = page_zone(page);
- pcp = this_cpu_ptr(locked_zone->per_cpu_pageset);
- spin_lock(&pcp->lock);
+ pcp = pcp_spin_lock_irqsave(locked_zone->per_cpu_pageset, flags);

list_for_each_entry_safe(page, next, list, lru) {
struct zone *zone = page_zone(page);

/* Different zone, different pcp lock. */
if (zone != locked_zone) {
+ /* Leave IRQs enabled as a new lock is acquired. */
spin_unlock(&pcp->lock);
locked_zone = zone;
+
+ /* Preemption disabled by pcp_spin_lock_irqsave. */
pcp = this_cpu_ptr(zone->per_cpu_pageset);
spin_lock(&pcp->lock);
}
@@ -3516,33 +3563,19 @@ void free_unref_page_list(struct list_head *list)

trace_mm_page_free_batched(page);

- /*
- * If there is a parallel drain in progress, free to the buddy
- * allocator directly. This is expensive as the zone lock will
- * be acquired multiple times but if a drain is in progress
- * then an expensive operation is already taking place.
- *
- * TODO: Always false at the moment due to local_lock_irqsave
- * and is preparation for converting to local_lock.
- */
- if (unlikely(!free_unref_page_commit(page, migratetype, 0, true)))
- free_one_page(page_zone(page), page, page_to_pfn(page), 0, migratetype, FPI_NONE);
+ free_unref_page_commit(pcp, zone, page, migratetype, 0);

/*
* Guard against excessive IRQ disabled times when we get
* a large list of pages to free.
*/
if (++batch_count == SWAP_CLUSTER_MAX) {
- spin_unlock(&pcp->lock);
- local_unlock_irqrestore(&pagesets.lock, flags);
+ pcp_spin_unlock_irqrestore(pcp, flags);
batch_count = 0;
- local_lock_irqsave(&pagesets.lock, flags);
- pcp = this_cpu_ptr(locked_zone->per_cpu_pageset);
- spin_lock(&pcp->lock);
+ pcp = pcp_spin_lock_irqsave(locked_zone->per_cpu_pageset, flags);
}
}
- spin_unlock(&pcp->lock);
- local_unlock_irqrestore(&pagesets.lock, flags);
+ pcp_spin_unlock_irqrestore(pcp, flags);
}

/*
@@ -3713,28 +3746,9 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
int migratetype,
unsigned int alloc_flags,
struct per_cpu_pages *pcp,
- struct list_head *list,
- bool locked)
+ struct list_head *list)
{
struct page *page;
- unsigned long __maybe_unused UP_flags;
-
- /*
- * spin_trylock is not necessary right now due to due to
- * local_lock_irqsave and is a preparation step for
- * a conversion to local_lock using the trylock to prevent
- * IRQ re-entrancy. If pcp->lock cannot be acquired, the caller
- * uses rmqueue_buddy.
- *
- * TODO: Convert local_lock_irqsave to local_lock.
- */
- if (unlikely(!locked)) {
- pcp_trylock_prepare(UP_flags);
- if (!spin_trylock(&pcp->lock)) {
- pcp_trylock_finish(UP_flags);
- return NULL;
- }
- }

do {
if (list_empty(list)) {
@@ -3767,10 +3781,6 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
} while (check_new_pcp(page, order));

out:
- if (!locked) {
- spin_unlock(&pcp->lock);
- pcp_trylock_finish(UP_flags);
- }

return page;
}
@@ -3785,19 +3795,29 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
struct list_head *list;
struct page *page;
unsigned long flags;
+ unsigned long __maybe_unused UP_flags;

- local_lock_irqsave(&pagesets.lock, flags);
+ /*
+ * spin_trylock_irqsave is not necessary right now as it'll only be
+ * true when contending with a remote drain. It's in place as a
+ * preparation step before converting pcp locking to spin_trylock
+ * to protect against IRQ reentry.
+ */
+ pcp_trylock_prepare(UP_flags);
+ pcp = pcp_spin_trylock_irqsave(zone->per_cpu_pageset, flags);
+ if (!pcp)
+ return NULL;

/*
* On allocation, reduce the number of pages that are batch freed.
* See nr_pcp_free() where free_factor is increased for subsequent
* frees.
*/
- pcp = this_cpu_ptr(zone->per_cpu_pageset);
pcp->free_factor >>= 1;
list = &pcp->lists[order_to_pindex(migratetype, order)];
- page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list, false);
- local_unlock_irqrestore(&pagesets.lock, flags);
+ page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
+ pcp_spin_unlock_irqrestore(pcp, flags);
+ pcp_trylock_finish(UP_flags);
if (page) {
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1);
zone_statistics(preferred_zone, zone, 1);
@@ -5396,10 +5416,8 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
goto failed;

/* Attempt the batch allocation */
- local_lock_irqsave(&pagesets.lock, flags);
- pcp = this_cpu_ptr(zone->per_cpu_pageset);
+ pcp = pcp_spin_lock_irqsave(zone->per_cpu_pageset, flags);
pcp_list = &pcp->lists[order_to_pindex(ac.migratetype, 0)];
- spin_lock(&pcp->lock);

while (nr_populated < nr_pages) {

@@ -5410,13 +5428,11 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
}

page = __rmqueue_pcplist(zone, 0, ac.migratetype, alloc_flags,
- pcp, pcp_list, true);
+ pcp, pcp_list);
if (unlikely(!page)) {
/* Try and get at least one page */
- if (!nr_populated) {
- spin_unlock(&pcp->lock);
+ if (!nr_populated)
goto failed_irq;
- }
break;
}
nr_account++;
@@ -5429,8 +5445,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
nr_populated++;
}

- spin_unlock(&pcp->lock);
- local_unlock_irqrestore(&pagesets.lock, flags);
+ pcp_spin_unlock_irqrestore(pcp, flags);

__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
zone_statistics(ac.preferred_zoneref->zone, zone, nr_account);
@@ -5439,7 +5454,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
return nr_populated;

failed_irq:
- local_unlock_irqrestore(&pagesets.lock, flags);
+ pcp_spin_unlock_irqrestore(pcp, flags);

failed:
page = __alloc_pages(gfp, 0, preferred_nid, nodemask);

2022-05-18 03:34:17

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH 0/6] Drain remote per-cpu directly v3

On Thu, May 12, 2022 at 09:50:37AM +0100, Mel Gorman wrote:
> Changelog since v2
> o More conversions from page->lru to page->[pcp_list|buddy_list]
> o Additional test results in changelogs
>
> Changelog since v1
> o Fix unsafe RT locking scheme
> o Use spin_trylock on UP PREEMPT_RT
>
> This series has the same intent as Nicolas' series "mm/page_alloc: Remote
> per-cpu lists drain support" -- avoid interference of a high priority
> task due to a workqueue item draining per-cpu page lists. While many
> workloads can tolerate a brief interruption, it may be cause a real-time
> task runnning on a NOHZ_FULL CPU to miss a deadline and at minimum,
> the draining in non-deterministic.
>
> Currently an IRQ-safe local_lock protects the page allocator per-cpu lists.
> The local_lock on its own prevents migration and the IRQ disabling protects
> from corruption due to an interrupt arriving while a page allocation is
> in progress. The locking is inherently unsafe for remote access unless
> the CPU is hot-removed.
>
> This series adjusts the locking. A spinlock is added to struct
> per_cpu_pages to protect the list contents while local_lock_irq continues
> to prevent migration and IRQ reentry. This allows a remote CPU to safely
> drain a remote per-cpu list.
>
> This series is a partial series. Follow-on work should allow the
> local_irq_save to be converted to a local_irq to avoid IRQs being
> disabled/enabled in most cases. Consequently, there are some TODO comments
> highlighting the places that would change if local_irq was used. However,
> there are enough corner cases that it deserves a series on its own
> separated by one kernel release and the priority right now is to avoid
> interference of high priority tasks.

Reverting the whole series fixed an issue where offlining a memory
section blocks for hours on today's linux-next tree.

__wait_rcu_gp
synchronize_rcu at kernel/rcu/tree.c:3915
lru_cache_disable at mm/swap.c:886
__alloc_contig_migrate_range at mm/page_alloc.c:9078
isolate_single_pageblock at mm/page_isolation.c:405
start_isolate_page_range
offline_pages
memory_subsys_offline
device_offline
online_store
dev_attr_store
sysfs_kf_write
kernfs_fop_write_iter
new_sync_write
vfs_write
ksys_write
__arm64_sys_write
invoke_syscall
el0_svc_common.constprop.0
do_el0_svc
el0_svc
el0t_64_sync_handler
el0t_64_sync

For full disclosure, I have also reverted the commit 0d523026abd4
("mm/page_alloc: fix tracepoint mm_page_alloc_zone_locked()"), so the
series can be reverted cleanly. But, I can't see how the commit
0d523026abd4 could cause this issue at all.

2022-05-18 12:57:04

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 0/6] Drain remote per-cpu directly v3

On Tue, May 17, 2022 at 07:35:07PM -0400, Qian Cai wrote:
> On Thu, May 12, 2022 at 09:50:37AM +0100, Mel Gorman wrote:
> > Changelog since v2
> > o More conversions from page->lru to page->[pcp_list|buddy_list]
> > o Additional test results in changelogs
> >
> > Changelog since v1
> > o Fix unsafe RT locking scheme
> > o Use spin_trylock on UP PREEMPT_RT
> >
> > This series has the same intent as Nicolas' series "mm/page_alloc: Remote
> > per-cpu lists drain support" -- avoid interference of a high priority
> > task due to a workqueue item draining per-cpu page lists. While many
> > workloads can tolerate a brief interruption, it may be cause a real-time
> > task runnning on a NOHZ_FULL CPU to miss a deadline and at minimum,
> > the draining in non-deterministic.
> >
> > Currently an IRQ-safe local_lock protects the page allocator per-cpu lists.
> > The local_lock on its own prevents migration and the IRQ disabling protects
> > from corruption due to an interrupt arriving while a page allocation is
> > in progress. The locking is inherently unsafe for remote access unless
> > the CPU is hot-removed.
> >
> > This series adjusts the locking. A spinlock is added to struct
> > per_cpu_pages to protect the list contents while local_lock_irq continues
> > to prevent migration and IRQ reentry. This allows a remote CPU to safely
> > drain a remote per-cpu list.
> >
> > This series is a partial series. Follow-on work should allow the
> > local_irq_save to be converted to a local_irq to avoid IRQs being
> > disabled/enabled in most cases. Consequently, there are some TODO comments
> > highlighting the places that would change if local_irq was used. However,
> > there are enough corner cases that it deserves a series on its own
> > separated by one kernel release and the priority right now is to avoid
> > interference of high priority tasks.
>
> Reverting the whole series fixed an issue that offlining a memory
> section blocking for hours on today's linux-next tree.
>
> __wait_rcu_gp
> synchronize_rcu at kernel/rcu/tree.c:3915
> lru_cache_disable at mm/swap.c:886
> __alloc_contig_migrate_range at mm/page_alloc.c:9078
> isolate_single_pageblock at mm/page_isolation.c:405
> start_isolate_page_range
> offline_pages
> memory_subsys_offline
> device_offline
> online_store
> dev_attr_store
> sysfs_kf_write
> kernfs_fop_write_iter
> new_sync_write
> vfs_write
> ksys_write
> __arm64_sys_write
> invoke_syscall
> el0_svc_common.constprop.0
> do_el0_svc
> el0_svc
> el0t_64_sync_handler
> el0t_64_sync
>
> For full disclosure, I have also reverted the commit 0d523026abd4
> ("mm/page_alloc: fix tracepoint mm_page_alloc_zone_locked()"), so the
> series can be reverted cleanly. But, I can't see how the commit
> 0d523026abd4 could cause this issue at all.

This is halting in __lru_add_drain_all where it calls synchronize_rcu
before the drain even happens. It's also an LRU drain, not a PCP drain,
which is what the series affects, and the allocator doesn't use RCU.
In a KVM machine, I can do

$ for BANK in `(for i in {1..20}; do echo $((RANDOM%416)); done) | sort -n | uniq`; do BEFORE=`cat /sys/devices/system/memory/memory$BANK/online`; echo 0 > /sys/devices/system/memory/memory$BANK/online; AFTER=`cat /sys/devices/system/memory/memory$BANK/online`; printf "%4d %d -> %d\n" $BANK $BEFORE $AFTER; done
3 1 -> 0
57 1 -> 0
74 1 -> 0
93 1 -> 0
101 1 -> 0
128 1 -> 0
133 1 -> 0
199 1 -> 0
223 1 -> 0
225 1 -> 0
229 1 -> 0
243 1 -> 0
263 1 -> 0
300 1 -> 0
309 1 -> 0
329 1 -> 0
355 1 -> 0
365 1 -> 0
372 1 -> 0
383 1 -> 0

It offlines 20 sections although after several attempts free -m starts
reporting negative used memory so there is a bug of some description.
How are you testing this exactly? Is it every time or intermittent? Are
you confident that reverting the series makes the problem go away?

--
Mel Gorman
SUSE Labs

2022-05-18 16:31:36

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH 0/6] Drain remote per-cpu directly v3

On Wed, May 18, 2022 at 01:51:52PM +0100, Mel Gorman wrote:
> On Tue, May 17, 2022 at 07:35:07PM -0400, Qian Cai wrote:
> > On Thu, May 12, 2022 at 09:50:37AM +0100, Mel Gorman wrote:
> > > Changelog since v2
> > > o More conversions from page->lru to page->[pcp_list|buddy_list]
> > > o Additional test results in changelogs
> > >
> > > Changelog since v1
> > > o Fix unsafe RT locking scheme
> > > o Use spin_trylock on UP PREEMPT_RT
> > >
> > > This series has the same intent as Nicolas' series "mm/page_alloc: Remote
> > > per-cpu lists drain support" -- avoid interference of a high priority
> > > task due to a workqueue item draining per-cpu page lists. While many
> > > workloads can tolerate a brief interruption, it may be cause a real-time
> > > task runnning on a NOHZ_FULL CPU to miss a deadline and at minimum,
> > > the draining in non-deterministic.
> > >
> > > Currently an IRQ-safe local_lock protects the page allocator per-cpu lists.
> > > The local_lock on its own prevents migration and the IRQ disabling protects
> > > from corruption due to an interrupt arriving while a page allocation is
> > > in progress. The locking is inherently unsafe for remote access unless
> > > the CPU is hot-removed.
> > >
> > > This series adjusts the locking. A spinlock is added to struct
> > > per_cpu_pages to protect the list contents while local_lock_irq continues
> > > to prevent migration and IRQ reentry. This allows a remote CPU to safely
> > > drain a remote per-cpu list.
> > >
> > > This series is a partial series. Follow-on work should allow the
> > > local_irq_save to be converted to a local_irq to avoid IRQs being
> > > disabled/enabled in most cases. Consequently, there are some TODO comments
> > > highlighting the places that would change if local_irq was used. However,
> > > there are enough corner cases that it deserves a series on its own
> > > separated by one kernel release and the priority right now is to avoid
> > > interference of high priority tasks.
> >
> > Reverting the whole series fixed an issue that offlining a memory
> > section blocking for hours on today's linux-next tree.
> >
> > __wait_rcu_gp
> > synchronize_rcu at kernel/rcu/tree.c:3915
> > lru_cache_disable at mm/swap.c:886
> > __alloc_contig_migrate_range at mm/page_alloc.c:9078
> > isolate_single_pageblock at mm/page_isolation.c:405
> > start_isolate_page_range
> > offline_pages
> > memory_subsys_offline
> > device_offline
> > online_store
> > dev_attr_store
> > sysfs_kf_write
> > kernfs_fop_write_iter
> > new_sync_write
> > vfs_write
> > ksys_write
> > __arm64_sys_write
> > invoke_syscall
> > el0_svc_common.constprop.0
> > do_el0_svc
> > el0_svc
> > el0t_64_sync_handler
> > el0t_64_sync
> >
> > For full disclosure, I have also reverted the commit 0d523026abd4
> > ("mm/page_alloc: fix tracepoint mm_page_alloc_zone_locked()"), so the
> > series can be reverted cleanly. But, I can't see how the commit
> > 0d523026abd4 could cause this issue at all.
>
> This is halting in __lru_add_drain_all where it calls synchronize_rcu
> before the drain even happens. It's also an LRU drain and not PCP which
> is what the series affects and the allocator doesn't use rcu. In a KVM
> machine, I can do
>
> $ for BANK in `(for i in {1..20}; do echo $((RANDOM%416)); done) | sort -n | uniq`; do BEFORE=`cat /sys/devices/system/memory/memory$BANK/online`; echo 0 > /sys/devices/system/memory/memory$BANK/online; AFTER=`cat /sys/devices/system/memory/memory$BANK/online`; printf "%4d %d -> %d\n" $BANK $BEFORE $AFTER; done
> 3 1 -> 0
> 57 1 -> 0
> 74 1 -> 0
> 93 1 -> 0
> 101 1 -> 0
> 128 1 -> 0
> 133 1 -> 0
> 199 1 -> 0
> 223 1 -> 0
> 225 1 -> 0
> 229 1 -> 0
> 243 1 -> 0
> 263 1 -> 0
> 300 1 -> 0
> 309 1 -> 0
> 329 1 -> 0
> 355 1 -> 0
> 365 1 -> 0
> 372 1 -> 0
> 383 1 -> 0
>
> It offlines 20 sections although after several attempts free -m starts
> reporting negative used memory so there is a bug of some description.
> How are you testing this exactly? Is it every time or intermittent? Are
> you confident that reverting the series makes the problem go away?

Cc'ing Paul. Either reverting this series or Paul's 3 patches below from
today's linux-next tree fixed the issue.

ca52639daa5b rcu-tasks: Drive synchronous grace periods from calling task
89ad98e93ce8 rcu-tasks: Move synchronize_rcu_tasks_generic() down
0d90e7225fb1 rcu-tasks: Split rcu_tasks_one_gp() from rcu_tasks_kthread()

It was reproduced by running this script below on an arm64 server. I can
reproduce it every time within 5 attempts. I noticed that when it happens,
we have a few rcu kthreads all are stuck in this line,

rcuwait_wait_event(&rtp->cbs_wait,
(needgpcb = rcu_tasks_need_gpcb(rtp)),
TASK_IDLE);

rcu_tasks_kthread
rcu_tasks_rude_kthread
[rcu_tasks_trace_kthread


#!/usr/bin/env python3
# SPDX-License-Identifier: GPL-2.0

import os
import re
import subprocess


def mem_iter():
    base_dir = '/sys/devices/system/memory/'
    for curr_dir in os.listdir(base_dir):
        if re.match(r'memory\d+', curr_dir):
            yield base_dir + curr_dir


if __name__ == '__main__':
    print('- Try to remove each memory section and then add it back.')
    for mem_dir in mem_iter():
        status = f'{mem_dir}/online'
        if open(status).read().rstrip() == '1':
            # This could expectedly fail due to many reasons.
            section = os.path.basename(mem_dir)
            print(f'- Try to remove {section}.')
            proc = subprocess.run([f'echo 0 | sudo tee {status}'], shell=True)
            if proc.returncode == 0:
                print(f'- Try to add {section}.')
                subprocess.check_call([f'echo 1 | sudo tee {status}'], shell=True)


2022-05-18 17:15:29

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 0/6] Drain remote per-cpu directly v3

On Wed, May 18, 2022 at 12:27:22PM -0400, Qian Cai wrote:
> On Wed, May 18, 2022 at 01:51:52PM +0100, Mel Gorman wrote:
> > On Tue, May 17, 2022 at 07:35:07PM -0400, Qian Cai wrote:
> > > On Thu, May 12, 2022 at 09:50:37AM +0100, Mel Gorman wrote:
> > > > Changelog since v2
> > > > o More conversions from page->lru to page->[pcp_list|buddy_list]
> > > > o Additional test results in changelogs
> > > >
> > > > Changelog since v1
> > > > o Fix unsafe RT locking scheme
> > > > o Use spin_trylock on UP PREEMPT_RT
> > > >
> > > > This series has the same intent as Nicolas' series "mm/page_alloc: Remote
> > > > per-cpu lists drain support" -- avoid interference of a high priority
> > > > task due to a workqueue item draining per-cpu page lists. While many
> > > > workloads can tolerate a brief interruption, it may be cause a real-time
> > > > task runnning on a NOHZ_FULL CPU to miss a deadline and at minimum,
> > > > the draining in non-deterministic.
> > > >
> > > > Currently an IRQ-safe local_lock protects the page allocator per-cpu lists.
> > > > The local_lock on its own prevents migration and the IRQ disabling protects
> > > > from corruption due to an interrupt arriving while a page allocation is
> > > > in progress. The locking is inherently unsafe for remote access unless
> > > > the CPU is hot-removed.
> > > >
> > > > This series adjusts the locking. A spinlock is added to struct
> > > > per_cpu_pages to protect the list contents while local_lock_irq continues
> > > > to prevent migration and IRQ reentry. This allows a remote CPU to safely
> > > > drain a remote per-cpu list.
> > > >
> > > > This series is a partial series. Follow-on work should allow the
> > > > local_irq_save to be converted to a local_irq to avoid IRQs being
> > > > disabled/enabled in most cases. Consequently, there are some TODO comments
> > > > highlighting the places that would change if local_irq was used. However,
> > > > there are enough corner cases that it deserves a series on its own
> > > > separated by one kernel release and the priority right now is to avoid
> > > > interference of high priority tasks.
> > >
> > > Reverting the whole series fixed an issue that offlining a memory
> > > section blocking for hours on today's linux-next tree.
> > >
> > > __wait_rcu_gp
> > > synchronize_rcu at kernel/rcu/tree.c:3915
> > > lru_cache_disable at mm/swap.c:886
> > > __alloc_contig_migrate_range at mm/page_alloc.c:9078
> > > isolate_single_pageblock at mm/page_isolation.c:405
> > > start_isolate_page_range
> > > offline_pages
> > > memory_subsys_offline
> > > device_offline
> > > online_store
> > > dev_attr_store
> > > sysfs_kf_write
> > > kernfs_fop_write_iter
> > > new_sync_write
> > > vfs_write
> > > ksys_write
> > > __arm64_sys_write
> > > invoke_syscall
> > > el0_svc_common.constprop.0
> > > do_el0_svc
> > > el0_svc
> > > el0t_64_sync_handler
> > > el0t_64_sync
> > >
> > > For full disclosure, I have also reverted the commit 0d523026abd4
> > > ("mm/page_alloc: fix tracepoint mm_page_alloc_zone_locked()"), so the
> > > series can be reverted cleanly. But, I can't see how the commit
> > > 0d523026abd4 could cause this issue at all.
> >
> > This is halting in __lru_add_drain_all where it calls synchronize_rcu
> > before the drain even happens. It's also an LRU drain and not PCP which
> > is what the series affects and the allocator doesn't use rcu. In a KVM
> > machine, I can do
> >
> > $ for BANK in `(for i in {1..20}; do echo $((RANDOM%416)); done) | sort -n | uniq`; do BEFORE=`cat /sys/devices/system/memory/memory$BANK/online`; echo 0 > /sys/devices/system/memory/memory$BANK/online; AFTER=`cat /sys/devices/system/memory/memory$BANK/online`; printf "%4d %d -> %d\n" $BANK $BEFORE $AFTER; done
> > 3 1 -> 0
> > 57 1 -> 0
> > 74 1 -> 0
> > 93 1 -> 0
> > 101 1 -> 0
> > 128 1 -> 0
> > 133 1 -> 0
> > 199 1 -> 0
> > 223 1 -> 0
> > 225 1 -> 0
> > 229 1 -> 0
> > 243 1 -> 0
> > 263 1 -> 0
> > 300 1 -> 0
> > 309 1 -> 0
> > 329 1 -> 0
> > 355 1 -> 0
> > 365 1 -> 0
> > 372 1 -> 0
> > 383 1 -> 0
> >
> > It offlines 20 sections although after several attempts free -m starts
> > reporting negative used memory so there is a bug of some description.
> > How are you testing this exactly? Is it every time or intermittent? Are
> > you confident that reverting the series makes the problem go away?
>
> Cc'ing Paul. Either reverting this series or Paul's 3 patches below from
> today's linux-next tree fixed the issue.
>
> ca52639daa5b rcu-tasks: Drive synchronous grace periods from calling task
> 89ad98e93ce8 rcu-tasks: Move synchronize_rcu_tasks_generic() down
> 0d90e7225fb1 rcu-tasks: Split rcu_tasks_one_gp() from rcu_tasks_kthread()
>
> It was reproduced by running this script below on an arm64 server. I can
> reproduce it every time within 5 attempts. I noticed that when it happens,
> we have a few rcu kthreads all are stuck in this line,
>
> rcuwait_wait_event(&rtp->cbs_wait,
> (needgpcb = rcu_tasks_need_gpcb(rtp)),
> TASK_IDLE);
>
> rcu_tasks_kthread
> rcu_tasks_rude_kthread
> [rcu_tasks_trace_kthread

This is the normal state of these kthreads when there is nothing for
them to do.

And unless you are removing tracing trampolines (kprobes, ftrace, BPF),
there should be nothing for them to do.

So does this python script somehow change the tracing state? (It does
not look to me like it does, but I could easily be missing something.)

Either way, is there something else waiting for these RCU flavors?
(There should not be.) Nevertheless, if so, there should be
a synchronize_rcu_tasks(), synchronize_rcu_tasks_rude(), or
synchronize_rcu_tasks_trace() on some other blocked task's stack
somewhere.

Or maybe something sleeps waiting for an RCU Tasks * callback to
be invoked. In that case (and in the above case, for that matter),
at least one of these pointers would be non-NULL on some CPU:

1. rcu_tasks__percpu.cblist.head
2. rcu_tasks_rude__percpu.cblist.head
3. rcu_tasks_trace__percpu.cblist.head

The ->func field of the pointed-to structure contains a pointer to
the callback function, which will help work out what is going on.
(Most likely a wakeup being lost or not provided.)
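For illustration only, a hypothetical sketch of that kind of check (none of
this exists in any tree; it assumes it would be compiled inside
kernel/rcu/tasks.h where the static rcu_tasks__percpu structures and the
struct rcu_tasks_percpu type are visible):

	/* Hypothetical: report CPUs with a pending RCU Tasks callback. */
	static void rcu_tasks_debug_dump_pending(void)
	{
		int cpu;

		for_each_possible_cpu(cpu) {
			struct rcu_tasks_percpu *rtpcp =
				per_cpu_ptr(&rcu_tasks__percpu, cpu);
			struct rcu_head *rhp = READ_ONCE(rtpcp->cblist.head);

			if (rhp)
				pr_info("rcu_tasks: CPU %d pending cb %ps\n",
					cpu, rhp->func);
		}
	}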

Alternatively, if your system has hundreds of thousands of tasks and
you have attached BPF programs to short-lived socket structures and you
don't yet have the workaround, then you can see hangs. (I am working on a
longer-term fix.) In the short term, applying the workaround is the right
thing to do. (Adding a couple of the BPF guys on CC for their thoughts.)

Does any of that help?

Thanx, Paul

> #!/usr/bin/env python3
> # SPDX-License-Identifier: GPL-2.0
>
> import os
> import re
> import subprocess
>
>
> def mem_iter():
>     base_dir = '/sys/devices/system/memory/'
>     for curr_dir in os.listdir(base_dir):
>         if re.match(r'memory\d+', curr_dir):
>             yield base_dir + curr_dir
>
>
> if __name__ == '__main__':
>     print('- Try to remove each memory section and then add it back.')
>     for mem_dir in mem_iter():
>         status = f'{mem_dir}/online'
>         if open(status).read().rstrip() == '1':
>             # This could expectedly fail due to many reasons.
>             section = os.path.basename(mem_dir)
>             print(f'- Try to remove {section}.')
>             proc = subprocess.run([f'echo 0 | sudo tee {status}'], shell=True)
>             if proc.returncode == 0:
>                 print(f'- Try to add {section}.')
>                 subprocess.check_call([f'echo 1 | sudo tee {status}'], shell=True)
>

2022-05-18 18:05:34

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH 0/6] Drain remote per-cpu directly v3

On Wed, May 18, 2022 at 02:26:08PM -0300, Marcelo Tosatti wrote:
> On Tue, May 17, 2022 at 07:35:07PM -0400, Qian Cai wrote:
> > On Thu, May 12, 2022 at 09:50:37AM +0100, Mel Gorman wrote:
> > > Changelog since v2
> > > o More conversions from page->lru to page->[pcp_list|buddy_list]
> > > o Additional test results in changelogs
> > >
> > > Changelog since v1
> > > o Fix unsafe RT locking scheme
> > > o Use spin_trylock on UP PREEMPT_RT
> > >
> > > This series has the same intent as Nicolas' series "mm/page_alloc: Remote
> > > per-cpu lists drain support" -- avoid interference of a high priority
> > > task due to a workqueue item draining per-cpu page lists. While many
> > > workloads can tolerate a brief interruption, it may be cause a real-time
> > > task runnning on a NOHZ_FULL CPU to miss a deadline and at minimum,
> > > the draining in non-deterministic.
> > >
> > > Currently an IRQ-safe local_lock protects the page allocator per-cpu lists.
> > > The local_lock on its own prevents migration and the IRQ disabling protects
> > > from corruption due to an interrupt arriving while a page allocation is
> > > in progress. The locking is inherently unsafe for remote access unless
> > > the CPU is hot-removed.
> > >
> > > This series adjusts the locking. A spinlock is added to struct
> > > per_cpu_pages to protect the list contents while local_lock_irq continues
> > > to prevent migration and IRQ reentry. This allows a remote CPU to safely
> > > drain a remote per-cpu list.
> > >
> > > This series is a partial series. Follow-on work should allow the
> > > local_irq_save to be converted to a local_irq to avoid IRQs being
> > > disabled/enabled in most cases. Consequently, there are some TODO comments
> > > highlighting the places that would change if local_irq was used. However,
> > > there are enough corner cases that it deserves a series on its own
> > > separated by one kernel release and the priority right now is to avoid
> > > interference of high priority tasks.
> >
> > Reverting the whole series fixed an issue that offlining a memory
> > section blocking for hours on today's linux-next tree.
> >
> > __wait_rcu_gp
> > synchronize_rcu at kernel/rcu/tree.c:3915
> > lru_cache_disable at mm/swap.c:886
> > __alloc_contig_migrate_range at mm/page_alloc.c:9078
> > isolate_single_pageblock at mm/page_isolation.c:405
> > start_isolate_page_range
> > offline_pages
> > memory_subsys_offline
> > device_offline
> > online_store
> > dev_attr_store
> > sysfs_kf_write
> > kernfs_fop_write_iter
> > new_sync_write
> > vfs_write
> > ksys_write
> > __arm64_sys_write
> > invoke_syscall
> > el0_svc_common.constprop.0
> > do_el0_svc
> > el0_svc
> > el0t_64_sync_handler
> > el0t_64_sync
> >
> > For full disclosure, I have also reverted the commit 0d523026abd4
> > ("mm/page_alloc: fix tracepoint mm_page_alloc_zone_locked()"), so the
> > series can be reverted cleanly. But, I can't see how the commit
> > 0d523026abd4 could cause this issue at all.
>
> Hi Qian,
>
> The issue is probably due to lack of the following:
>
> https://lore.kernel.org/linux-mm/YmrWK%[email protected]/
>
> Can you please give the patch on the URL a try?
>
> Thanks!

Oops, sorry, I don't think the above URL has anything to do with this
problem.



2022-05-18 18:05:36

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH 0/6] Drain remote per-cpu directly v3

On Tue, May 17, 2022 at 07:35:07PM -0400, Qian Cai wrote:
> On Thu, May 12, 2022 at 09:50:37AM +0100, Mel Gorman wrote:
> > Changelog since v2
> > o More conversions from page->lru to page->[pcp_list|buddy_list]
> > o Additional test results in changelogs
> >
> > Changelog since v1
> > o Fix unsafe RT locking scheme
> > o Use spin_trylock on UP PREEMPT_RT
> >
> > This series has the same intent as Nicolas' series "mm/page_alloc: Remote
> > per-cpu lists drain support" -- avoid interference of a high priority
> > task due to a workqueue item draining per-cpu page lists. While many
> > workloads can tolerate a brief interruption, it may cause a real-time
> > task running on a NOHZ_FULL CPU to miss a deadline and, at minimum,
> > the draining is non-deterministic.
> >
> > Currently an IRQ-safe local_lock protects the page allocator per-cpu lists.
> > The local_lock on its own prevents migration and the IRQ disabling protects
> > from corruption due to an interrupt arriving while a page allocation is
> > in progress. The locking is inherently unsafe for remote access unless
> > the CPU is hot-removed.
> >
> > This series adjusts the locking. A spinlock is added to struct
> > per_cpu_pages to protect the list contents while local_lock_irq continues
> > to prevent migration and IRQ reentry. This allows a remote CPU to safely
> > drain a remote per-cpu list.
> >
> > This series is a partial series. Follow-on work should allow the
> > local_irq_save to be converted to a local_irq to avoid IRQs being
> > disabled/enabled in most cases. Consequently, there are some TODO comments
> > highlighting the places that would change if local_irq was used. However,
> > there are enough corner cases that it deserves a series on its own
> > separated by one kernel release and the priority right now is to avoid
> > interference of high priority tasks.
>
> Reverting the whole series fixed an issue where offlining a memory
> section blocks for hours on today's linux-next tree.
>
> __wait_rcu_gp
> synchronize_rcu at kernel/rcu/tree.c:3915
> lru_cache_disable at mm/swap.c:886
> __alloc_contig_migrate_range at mm/page_alloc.c:9078
> isolate_single_pageblock at mm/page_isolation.c:405
> start_isolate_page_range
> offline_pages
> memory_subsys_offline
> device_offline
> online_store
> dev_attr_store
> sysfs_kf_write
> kernfs_fop_write_iter
> new_sync_write
> vfs_write
> ksys_write
> __arm64_sys_write
> invoke_syscall
> el0_svc_common.constprop.0
> do_el0_svc
> el0_svc
> el0t_64_sync_handler
> el0t_64_sync
>
> For full disclosure, I have also reverted the commit 0d523026abd4
> ("mm/page_alloc: fix tracepoint mm_page_alloc_zone_locked()"), so the
> series can be reverted cleanly. But, I can't see how the commit
> 0d523026abd4 could cause this issue at all.

Hi Qian,

The issue is probably due to lack of the following:

https://lore.kernel.org/linux-mm/YmrWK%[email protected]/

Can you please give the patch on the URL a try?

Thanks!



2022-05-18 18:05:39

by Nicolas Saenz Julienne

[permalink] [raw]
Subject: Re: [PATCH 0/6] Drain remote per-cpu directly v3

On Thu, 2022-05-12 at 09:50 +0100, Mel Gorman wrote:
> Changelog since v2
> o More conversions from page->lru to page->[pcp_list|buddy_list]
> o Additional test results in changelogs
>
> Changelog since v1
> o Fix unsafe RT locking scheme
> o Use spin_trylock on UP PREEMPT_RT
>
> This series has the same intent as Nicolas' series "mm/page_alloc: Remote
> per-cpu lists drain support" -- avoid interference of a high priority
> task due to a workqueue item draining per-cpu page lists. While many
> workloads can tolerate a brief interruption, it may cause a real-time
> task running on a NOHZ_FULL CPU to miss a deadline and, at minimum,
> the draining is non-deterministic.
>
> Currently an IRQ-safe local_lock protects the page allocator per-cpu lists.
> The local_lock on its own prevents migration and the IRQ disabling protects
> from corruption due to an interrupt arriving while a page allocation is
> in progress. The locking is inherently unsafe for remote access unless
> the CPU is hot-removed.
>
> This series adjusts the locking. A spinlock is added to struct
> per_cpu_pages to protect the list contents while local_lock_irq continues
> to prevent migration and IRQ reentry. This allows a remote CPU to safely
> drain a remote per-cpu list.
>
> This series is a partial series. Follow-on work should allow the
> local_irq_save to be converted to a local_irq to avoid IRQs being
> disabled/enabled in most cases. Consequently, there are some TODO comments
> highlighting the places that would change if local_irq was used. However,
> there are enough corner cases that it deserves a series on its own
> separated by one kernel release and the priority right now is to avoid
> interference of high priority tasks.

FWIW tested this against our RT+nohz_full workloads. I can have another go if
the locking scheme changes.

Tested-by: Nicolas Saenz Julienne <[email protected]>

Thanks,

--
Nicolás Sáenz


2022-05-19 15:00:55

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 1/6] mm/page_alloc: Add page->buddy_list and page->pcp_list

On 5/12/22 10:50, Mel Gorman wrote:
> The page allocator uses page->lru for storing pages on either buddy or
> PCP lists. Create page->buddy_list and page->pcp_list as a union with
> page->lru. This is simply to clarify what type of list a page is on
> in the page allocator.
>
> No functional change intended.
>
> [minchan: Fix page lru fields in macros]
> Signed-off-by: Mel Gorman <[email protected]>
> Tested-by: Minchan Kim <[email protected]>
> Acked-by: Minchan Kim <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>

2022-05-19 23:43:52

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH 0/6] Drain remote per-cpu directly v3

On Thu, May 19, 2022 at 12:15:24PM -0700, Paul E. McKenney wrote:
> Is the task doing offline_pages()->synchronize_rcu() doing this
> repeatedly? Or is there a stalled RCU grace period? (From what
> I can see, offline_pages() is not doing huge numbers of calls to
> synchronize_rcu() in any of its loops, but I freely admit that I do not
> know this code.)

Yes, we are running into an endless loop in isolate_single_pageblock().
A similar issue happened not long ago, so I am wondering if we
did not solve it entirely then. Anyway, I will continue the thread over
there.

https://lore.kernel.org/all/YoavU%2F+NfQIzQiDF@qian/

> Or is it possible that reverting those three patches simply decreases
> the probability of failure, rather than eliminating the failure?
> Such a decrease could be due to many things, for example, changes to
> offsets and sizes of data structures.

Entirely possible. Sorry for the false alarm.

> Do you ever see RCU CPU stall warnings?

No.

2022-05-20 00:39:17

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 4/6] mm/page_alloc: Remove unnecessary page == NULL check in rmqueue

On Thu, May 19, 2022 at 12:57:01PM +0200, Vlastimil Babka wrote:
> On 5/12/22 10:50, Mel Gorman wrote:
> > The VM_BUG_ON check for a valid page can be avoided with a simple
> > change in the flow. The ZONE_BOOSTED_WATERMARK is unlikely in general
> > and even more unlikely if the page allocation failed so mark the
> > branch unlikely.
>
> Hm, so that makes a DEBUG_VM config avoid the check. On the other hand,
> it puts it on the path returning from rmqueue_pcplist() for all configs,
> and that should be the fast path. So unless things further change in the
> following patches, it doesn't seem that useful?
>

You're right -- the fast path ends up with both an if (page) and an
if (!page) check. Andrew, can you drop the patch
mm-page_alloc-remove-unnecessary-page-==-null-check-in-rmqueue.patch from
your tree please?

Originally the flow was important when I was writing the patch and later
became unnecessary. However, it reminded me of another problem I thought
of when writing this and then forgot to note it in the changelog. If
the page allocation fails then ZONE_BOOSTED_WATERMARK should still be
tested and cleared before waking kswapd. It could happen if an allocation
attempt tried to fall back to another migratetype and still failed to find
a suitable page. This is true whether going through the PCP lists or not.

So what do you think of me adding this patch to a follow-up series?

--8<--
mm/page_alloc: Remove mistaken page == NULL check in rmqueue

If a page allocation fails, the ZONE_BOOSTED_WATERMARK should be tested,
cleared and kswapd woken whether the allocation attempt was via the PCP
or directly via the buddy list.

Remove the page == NULL check so the ZONE_BOOSTED_WATERMARK bit is checked
unconditionally. As it is unlikely that ZONE_BOOSTED_WATERMARK is set,
mark the branch accordingly.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1c4c54503a5d..61d5bc2efffe 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3765,12 +3765,10 @@ struct page *rmqueue(struct zone *preferred_zone,

page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
migratetype);
- if (unlikely(!page))
- return NULL;

out:
/* Separate test+clear to avoid unnecessary atomics */
- if (test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags)) {
+ if (unlikely(test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags))) {
clear_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
wakeup_kswapd(zone, 0, 0, zone_idx(zone));
}

2022-05-20 03:52:39

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 0/6] Drain remote per-cpu directly v3

On Thu, May 19, 2022 at 09:29:45AM -0400, Qian Cai wrote:
> On Wed, May 18, 2022 at 10:15:03AM -0700, Paul E. McKenney wrote:
> > So does this python script somehow change the tracing state? (It does
> > not look to me like it does, but I could easily be missing something.)
>
> No, I don't think so either. It pretty much just offlines memory sections
> one at a time.

No idea.

> > Either way, is there something else waiting for these RCU flavors?
> > (There should not be.) Nevertheless, if so, there should be
> > a synchronize_rcu_tasks(), synchronize_rcu_tasks_rude(), or
> > synchronize_rcu_tasks_trace() on some other blocked task's stack
> > somewhere.
>
> There are only three blocked tasks when this happens. The kmemleak_scan()
> is just the victim waiting for the locks taken by the stucking
> offline_pages()->synchronize_rcu() task.

OK, then I believe that the RCU Tasks flavors were innocent bystanders.

Is the task doing offline_pages()->synchronize_rcu() doing this
repeatedly? Or is there a stalled RCU grace period? (From what
I can see, offline_pages() is not doing huge numbers of calls to
synchronize_rcu() in any of its loops, but I freely admit that I do not
know this code.)

If repeatedly, one workaround is to use synchronize_rcu_expedited()
instead of synchronize_rcu(). A better fix might be to batch the
grace periods, so that one RCU grace period serves several page
offline operations. An alternative better fix might be to use
call_rcu() instead of synchronize_rcu().
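
For illustration only, a minimal sketch of the call_rcu() approach -- the
struct and function names below are made up and are not the actual
lru_cache_disable() code:

#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct deferred_drain {
	struct rcu_head rcu;
	/* state needed to finish the operation after the grace period */
};

static void deferred_drain_cb(struct rcu_head *rcu)
{
	struct deferred_drain *dd = container_of(rcu, struct deferred_drain, rcu);

	/* invoked after a full grace period has elapsed */
	kfree(dd);
}

static void queue_deferred_drain(struct deferred_drain *dd)
{
	/* returns immediately instead of blocking like synchronize_rcu() */
	call_rcu(&dd->rcu, deferred_drain_cb);
}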

> task:kmemleak state:D stack:25824 pid: 1033 ppid: 2 flags:0x00000008
> Call trace:
> __switch_to
> __schedule
> schedule
> percpu_rwsem_wait
> __percpu_down_read
> percpu_down_read.constprop.0
> get_online_mems

This is read-acquiring the mem_hotplug_lock. It looks like offline_pages()
write-acquires this same lock.

> kmemleak_scan
> kmemleak_scan_thread
> kthread
> ret_from_fork
>
> task:cppc_fie state:D stack:23472 pid: 1848 ppid: 2 flags:0x00000008
> Call trace:
> __switch_to
> __schedule
> lockdep_recursion
>
> task:tee state:D stack:24816 pid:16733 ppid: 16732 flags:0x0000020c
> Call trace:
> __switch_to
> __schedule
> schedule
> schedule_timeout
> __wait_for_common
> wait_for_completion
> __wait_rcu_gp
> synchronize_rcu

So, yes, this is sleeping holding the lock that kmemleak_scan wants to
acquire.

> lru_cache_disable
> __alloc_contig_migrate_range
> isolate_single_pageblock
> start_isolate_page_range
> offline_pages
> memory_subsys_offline
> device_offline
> online_store
> dev_attr_store
> sysfs_kf_write
> kernfs_fop_write_iter
> new_sync_write
> vfs_write
> ksys_write
> __arm64_sys_write
> invoke_syscall
> el0_svc_common.constprop.0
> do_el0_svc
> el0_svc
> el0t_64_sync_handler
> el0t_64_sync
>
> > Or maybe something sleeps waiting for an RCU Tasks * callback to
> > be invoked. In that case (and in the above case, for that matter),
> > at least one of these pointers would be non-NULL on some CPU:
> >
> > 1. rcu_tasks__percpu.cblist.head
> > 2. rcu_tasks_rude__percpu.cblist.head
> > 3. rcu_tasks_trace__percpu.cblist.head
> >
> > The ->func field of the pointed-to structure contains a pointer to
> > the callback function, which will help work out what is going on.
> > (Most likely a wakeup being lost or not provided.)
>
> What would be some easy ways to find those out? I can't see anything
> interesting in the output of sysrq-t.

Again, I believe that these are victims of circumstance. Though that does
not explain why reverting those three patches makes things work better.

Or is it possible that reverting those three patches simply decreases
the probability of failure, rather than eliminating the failure?
Such a decrease could be due to many things, for example, changes to
offsets and sizes of data structures.

> > Alternatively, if your system has hundreds of thousands of tasks and
> > you have attached BPF programs to short-lived socket structures and you
> > don't yet have the workaround, then you can see hangs. (I am working on a
> > longer-term fix.) In the short term, applying the workaround is the right
> > thing to do. (Adding a couple of the BPF guys on CC for their thoughts.)
>
> The system is pretty much idle after a fresh reboot. The only workload is
> to run the script.

Do you ever see RCU CPU stall warnings?

Could you please trace the offline_pages() function? Is it really stuck,
or is it being invoked periodically during the hang?

Thanx, Paul

2022-05-20 12:42:38

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 0/6] Drain remote per-cpu directly v3

On Thu, May 19, 2022 at 05:05:04PM -0400, Qian Cai wrote:
> On Thu, May 19, 2022 at 12:15:24PM -0700, Paul E. McKenney wrote:
> > Is the task doing offline_pages()->synchronize_rcu() doing this
> > repeatedly? Or is there a stalled RCU grace period? (From what
> > I can see, offline_pages() is not doing huge numbers of calls to
> > synchronize_rcu() in any of its loops, but I freely admit that I do not
> > know this code.)
>
> Yes, we are running into an endless loop in isolate_single_pageblock().
> A similar issue happened not long ago, so I am wondering if we
> did not solve it entirely then. Anyway, I will continue the thread over
> there.
>
> https://lore.kernel.org/all/YoavU%2F+NfQIzQiDF@qian/

I do know that feeling.

> > Or is it possible that reverting those three patches simply decreases
> > the probability of failure, rather than eliminating the failure?
> > Such a decrease could be due to many things, for example, changes to
> > offsets and sizes of data structures.
>
> Entirely possible. Sorry for the false alarm.

Not a problem!

> > Do you ever see RCU CPU stall warnings?
>
> No.

OK, then perhaps a sequence of offline_pages() calls.

Hmmm... The percpu_up_write() function sets ->block to zero before
awakening waiters. Given wakeup latencies, might this allow an only
somewhat unfortunate sequence of events to allow offline_pages() to
starve readers? Or is there something I am missing that prevents this
from happening?
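
For reference, a rough sketch of the reader/writer pairing in question --
the names are illustrative only, not the actual mem_hotplug_lock code:

#include <linux/percpu-rwsem.h>

static DEFINE_STATIC_PERCPU_RWSEM(example_hotplug_sem);

static void example_reader(void)	/* e.g. the kmemleak scan side */
{
	percpu_down_read(&example_hotplug_sem);
	/* ... walk memory while hotplug is excluded ... */
	percpu_up_read(&example_hotplug_sem);
}

static void example_writer(void)	/* e.g. the offline_pages() side */
{
	percpu_down_write(&example_hotplug_sem);
	/* ... offline one section; may sleep, e.g. in synchronize_rcu() ... */
	percpu_up_write(&example_hotplug_sem);
}

A tight sequence of writer sections like example_writer() is the pattern
that could keep readers waiting if the wakeup after percpu_up_write()
loses the race with the next percpu_down_write().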

Thanx, Paul

2022-05-20 16:04:26

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 4/6] mm/page_alloc: Remove unnecessary page == NULL check in rmqueue

On 5/19/22 14:13, Mel Gorman wrote:
> On Thu, May 19, 2022 at 12:57:01PM +0200, Vlastimil Babka wrote:
>> On 5/12/22 10:50, Mel Gorman wrote:
>>> The VM_BUG_ON check for a valid page can be avoided with a simple
>>> change in the flow. The ZONE_BOOSTED_WATERMARK is unlikely in general
>>> and even more unlikely if the page allocation failed so mark the
>>> branch unlikely.
>>
>> Hm, so that makes a DEBUG_VM config avoid the check. On the other hand,
>> it puts it on the path returning from rmqueue_pcplist() for all configs,
>> and that should be the fast path. So unless things further change in the
>> following patches, it doesn't seem that useful?
>>
>
> You're right -- the fast path ends up with both an if (page) and an
> if (!page) check. Andrew, can you drop the patch
> mm-page_alloc-remove-unnecessary-page-==-null-check-in-rmqueue.patch from
> your tree please?
>
> Originally the flow was important when I was writing the patch and later
> became unnecessary. However, it reminded me of another problem I thought
> of when writing this and then forgot to note it in the changelog. If
> the page allocation fails then ZONE_BOOSTED_WATERMARK should still be
> tested and cleared before waking kswapd. It could happen if an allocation
> attempt tried to fall back to another migratetype and still failed to find
> a suitable page. This is true whether going through the PCP lists or not.
>
> So what do you think of me adding this patch to a follow-up series?

LGTM.

>
> --8<--
> mm/page_alloc: Remove mistaken page == NULL check in rmqueue
>
> If a page allocation fails, the ZONE_BOOSTED_WATERMARK should be tested,
> cleared and kswapd woken whether the allocation attempt was via the PCP
> or directly via the buddy list.
>
> Remove the page == NULL check so the ZONE_BOOSTED_WATERMARK bit is checked
> unconditionally. As it is unlikely that ZONE_BOOSTED_WATERMARK is set,
> mark the branch accordingly.
>
> Signed-off-by: Mel Gorman <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>

> ---
> mm/page_alloc.c | 4 +---
> 1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1c4c54503a5d..61d5bc2efffe 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3765,12 +3765,10 @@ struct page *rmqueue(struct zone *preferred_zone,
>
> page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
> migratetype);
> - if (unlikely(!page))
> - return NULL;
>
> out:
> /* Separate test+clear to avoid unnecessary atomics */
> - if (test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags)) {
> + if (unlikely(test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags))) {
> clear_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
> wakeup_kswapd(zone, 0, 0, zone_idx(zone));
> }
>


2022-05-20 23:19:13

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 3/6] mm/page_alloc: Split out buddy removal code from rmqueue into separate helper

On 5/12/22 10:50, Mel Gorman wrote:
> This is a preparation patch to allow the buddy removal code to be reused
> in a later patch.
>
> No functional change.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Tested-by: Minchan Kim <[email protected]>
> Acked-by: Minchan Kim <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>

2022-05-21 13:45:55

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH 0/6] Drain remote per-cpu directly v3

On Wed, May 18, 2022 at 10:15:03AM -0700, Paul E. McKenney wrote:
> So does this python script somehow change the tracing state? (It does
> not look to me like it does, but I could easily be missing something.)

No, I don't think so either. It pretty much just offlines memory sections
one at a time.

> Either way, is there something else waiting for these RCU flavors?
> (There should not be.) Nevertheless, if so, there should be
> a synchronize_rcu_tasks(), synchronize_rcu_tasks_rude(), or
> synchronize_rcu_tasks_trace() on some other blocked task's stack
> somewhere.

There are only three blocked tasks when this happens. The kmemleak_scan()
is just the victim waiting for the locks taken by the stuck
offline_pages()->synchronize_rcu() task.

task:kmemleak state:D stack:25824 pid: 1033 ppid: 2 flags:0x00000008
Call trace:
__switch_to
__schedule
schedule
percpu_rwsem_wait
__percpu_down_read
percpu_down_read.constprop.0
get_online_mems
kmemleak_scan
kmemleak_scan_thread
kthread
ret_from_fork

task:cppc_fie state:D stack:23472 pid: 1848 ppid: 2 flags:0x00000008
Call trace:
__switch_to
__schedule
lockdep_recursion

task:tee state:D stack:24816 pid:16733 ppid: 16732 flags:0x0000020c
Call trace:
__switch_to
__schedule
schedule
schedule_timeout
__wait_for_common
wait_for_completion
__wait_rcu_gp
synchronize_rcu
lru_cache_disable
__alloc_contig_migrate_range
isolate_single_pageblock
start_isolate_page_range
offline_pages
memory_subsys_offline
device_offline
online_store
dev_attr_store
sysfs_kf_write
kernfs_fop_write_iter
new_sync_write
vfs_write
ksys_write
__arm64_sys_write
invoke_syscall
el0_svc_common.constprop.0
do_el0_svc
el0_svc
el0t_64_sync_handler
el0t_64_sync

> Or maybe something sleeps waiting for an RCU Tasks * callback to
> be invoked. In that case (and in the above case, for that matter),
> at least one of these pointers would be non-NULL on some CPU:
>
> 1. rcu_tasks__percpu.cblist.head
> 2. rcu_tasks_rude__percpu.cblist.head
> 3. rcu_tasks_trace__percpu.cblist.head
>
> The ->func field of the pointed-to structure contains a pointer to
> the callback function, which will help work out what is going on.
> (Most likely a wakeup being lost or not provided.)

What would be some easy ways to find those out? I can't see anything
interesting in the output of sysrq-t.

> Alternatively, if your system has hundreds of thousands of tasks and
> you have attached BPF programs to short-lived socket structures and you
> don't yet have the workaround, then you can see hangs. (I am working on a
> longer-term fix.) In the short term, applying the workaround is the right
> thing to do. (Adding a couple of the BPF guys on CC for their thoughts.)

The system is pretty much idle after a fresh reboot. The only workload is
to run the script.

2022-05-23 01:59:38

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 4/6] mm/page_alloc: Remove unnecessary page == NULL check in rmqueue

On 5/12/22 10:50, Mel Gorman wrote:
> The VM_BUG_ON check for a valid page can be avoided with a simple
> change in the flow. The ZONE_BOOSTED_WATERMARK is unlikely in general
> and even more unlikely if the page allocation failed so mark the
> branch unlikely.

Hm, so that makes a DEBUG_VM config avoid the check. On the other hand,
it puts it on the path returning from rmqueue_pcplist() for all configs,
and that should be the fast path. So unless things further change in the
following patches, it doesn't seem that useful?

> Signed-off-by: Mel Gorman <[email protected]>
> Tested-by: Minchan Kim <[email protected]>
> Acked-by: Minchan Kim <[email protected]>
> ---
> mm/page_alloc.c | 9 +++++----
> 1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1c4c54503a5d..b543333dce8f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3765,17 +3765,18 @@ struct page *rmqueue(struct zone *preferred_zone,
>
> page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
> migratetype);
> - if (unlikely(!page))
> - return NULL;
>
> out:
> /* Separate test+clear to avoid unnecessary atomics */
> - if (test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags)) {
> + if (unlikely(test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags))) {
> clear_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
> wakeup_kswapd(zone, 0, 0, zone_idx(zone));
> }
>
> - VM_BUG_ON_PAGE(page && bad_range(zone, page), page);
> + if (unlikely(!page))
> + return NULL;
> +
> + VM_BUG_ON_PAGE(bad_range(zone, page), page);
> return page;
> }
>


2022-05-23 16:09:47

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH 3/6] mm/page_alloc: Split out buddy removal code from rmqueue into separate helper

On 05/12/22 09:50, Mel Gorman wrote:
> This is a preparation patch to allow the buddy removal code to be reused
> in a later patch.
>
> No functional change.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Tested-by: Minchan Kim <[email protected]>
> Acked-by: Minchan Kim <[email protected]>
> ---

I see this splat when this patch is applied on 5.10.107 kernel:

[ 132.779332] CPU: 1 PID: 203 Comm: klogd Not tainted 5.10.107-00039-g83962808e276 #28
[ 132.782470] BUG: using __this_cpu_add_return() in preemptible [00000000] code: udhcpc/229
[ 132.787809] Hardware name: ARM Juno development board (r2) (DT)
[ 132.787841] Call trace:
[ 132.787881] dump_backtrace+0x0/0x2c0
[ 132.787921] show_stack+0x18/0x28
[ 132.787963] dump_stack_lvl+0x108/0x150
[ 132.788003] dump_stack+0x1c/0x58
[ 132.788049] check_preemption_disabled+0xf4/0x108
[ 132.788095] __this_cpu_preempt_check+0x20/0x2c
[ 132.788135] __inc_numa_state+0x3c/0x120
[ 132.788177] get_page_from_freelist+0xd6c/0x1ac8
[ 132.788220] __alloc_pages_nodemask+0x224/0x1780
[ 132.797359] caller is __this_cpu_preempt_check+0x20/0x2c
[ 132.803579] alloc_pages_current+0xb0/0x150
[ 132.803621] allocate_slab+0x2d0/0x408
[ 132.803662] ___slab_alloc+0x43c/0x640
[ 132.803704] __slab_alloc.isra.0+0x70/0xc8
[ 132.803747] __kmalloc_node_track_caller+0x10c/0x2d8
[ 132.803792] __kmalloc_reserve.isra.0+0x80/0x160
[ 132.803835] __alloc_skb+0xd0/0x2a8
[ 132.883893] alloc_skb_with_frags+0x64/0x2a0
[ 132.888632] sock_alloc_send_pskb+0x420/0x438
[ 132.893465] unix_dgram_sendmsg+0x1d4/0x930
[ 132.898112] __sys_sendto+0x16c/0x230
[ 132.902198] __arm64_sys_sendto+0x78/0x98
[ 132.906654] el0_svc_common.constprop.0+0xac/0x278

I could resolve it by applying this patch:

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 80c1e0a0f094e..92fb0c08296ef 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -957,11 +957,11 @@ void __inc_numa_state(struct zone *zone,
u16 __percpu *p = pcp->vm_numa_stat_diff + item;
u16 v;

- v = __this_cpu_inc_return(*p);
+ v = this_cpu_inc_return(*p);

if (unlikely(v > NUMA_STATS_THRESHOLD)) {
zone_numa_state_add(v, zone, item);
- __this_cpu_write(*p, 0);
+ this_cpu_write(*p, 0);
}
}

AFAICT zone_statistics() is no longer protected by the spin_lock_irqsave(),
so preemption is no longer disabled.

You need to have CONFIG_NUMA and CONFIG_DEBUG_PREEMPT enabled to reproduce
this.
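
For illustration, the difference at play -- example names only, not the
vmstat code itself. The __this_cpu_*() ops assume the caller has already
disabled preemption (for instance under an IRQ-disabled spinlock), while
the this_cpu_*() variants are safe from preemptible context:

#include <linux/percpu-defs.h>
#include <linux/preempt.h>

static DEFINE_PER_CPU(u16, example_counter);

static void update_preempt_safe(void)
{
	/* handles preemption/interrupts internally */
	this_cpu_inc(example_counter);
}

static void update_caller_protected(void)
{
	preempt_disable();
	/* raw variant: caller must prevent migration */
	__this_cpu_inc(example_counter);
	preempt_enable();
}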

HTH

Thanks!

--
Qais Yousef

2022-05-24 16:24:26

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 3/6] mm/page_alloc: Split out buddy removal code from rmqueue into separate helper

On Mon, May 23, 2022 at 05:09:21PM +0100, Qais Yousef wrote:
> On 05/12/22 09:50, Mel Gorman wrote:
> > This is a preparation patch to allow the buddy removal code to be reused
> > in a later patch.
> >
> > No functional change.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > Tested-by: Minchan Kim <[email protected]>
> > Acked-by: Minchan Kim <[email protected]>
> > ---
>
> I see this splat when this patch is applied on 5.10.107 kernel:
>

<SNIP>

> I could resolve it by applying this patch:
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 80c1e0a0f094e..92fb0c08296ef 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -957,11 +957,11 @@ void __inc_numa_state(struct zone *zone,
> u16 __percpu *p = pcp->vm_numa_stat_diff + item;
> u16 v;
>
> - v = __this_cpu_inc_return(*p);
> + v = this_cpu_inc_return(*p);
>
> if (unlikely(v > NUMA_STATS_THRESHOLD)) {
> zone_numa_state_add(v, zone, item);
> - __this_cpu_write(*p, 0);
> + this_cpu_write(*p, 0);
> }
> }
>

5.18 does not have __inc_numa_state() so it's likely you are missing
backports, probably f19298b9516c1a031b34b4147773457e3efe743b at minimum.

--
Mel Gorman
SUSE Labs

2022-05-25 17:57:16

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH 3/6] mm/page_alloc: Split out buddy removal code from rmqueue into separate helper

On 05/24/22 12:55, Mel Gorman wrote:
> On Mon, May 23, 2022 at 05:09:21PM +0100, Qais Yousef wrote:
> > On 05/12/22 09:50, Mel Gorman wrote:
> > > This is a preparation patch to allow the buddy removal code to be reused
> > > in a later patch.
> > >
> > > No functional change.
> > >
> > > Signed-off-by: Mel Gorman <[email protected]>
> > > Tested-by: Minchan Kim <[email protected]>
> > > Acked-by: Minchan Kim <[email protected]>
> > > ---
> >
> > I see this splat when this patch is applied on 5.10.107 kernel:
> >
>
> <SNIP>
>
> > I could resolve it by applying this patch:
> >
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 80c1e0a0f094e..92fb0c08296ef 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -957,11 +957,11 @@ void __inc_numa_state(struct zone *zone,
> > u16 __percpu *p = pcp->vm_numa_stat_diff + item;
> > u16 v;
> >
> > - v = __this_cpu_inc_return(*p);
> > + v = this_cpu_inc_return(*p);
> >
> > if (unlikely(v > NUMA_STATS_THRESHOLD)) {
> > zone_numa_state_add(v, zone, item);
> > - __this_cpu_write(*p, 0);
> > + this_cpu_write(*p, 0);
> > }
> > }
> >
>
> 5.18 does not have __inc_numa_state() so it's likely you are missing
> backports, probably f19298b9516c1a031b34b4147773457e3efe743b at minimum.

Thanks Mel. Sorry, it seems I stopped my analysis a tad too soon. The
dependency chain involves more than that commit; I couldn't backport it on its own.

I just happened to be an accidental user of that kernel (it's an Android 5.10
tree). I did report it there too, but I thought (wrongly) this could benefit
this upstream discussion.

I'll take it there.

Thanks!

--
Qais Yousef

2022-05-27 09:27:26

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 0/6] Drain remote per-cpu directly v3

On Thu, May 26, 2022 at 01:19:38PM -0400, Qian Cai wrote:
> On Thu, May 12, 2022 at 09:50:37AM +0100, Mel Gorman wrote:
> > Changelog since v2
> > o More conversions from page->lru to page->[pcp_list|buddy_list]
> > o Additional test results in changelogs
> >
> > Changelog since v1
> > o Fix unsafe RT locking scheme
> > o Use spin_trylock on UP PREEMPT_RT
> >
> > This series has the same intent as Nicolas' series "mm/page_alloc: Remote
> > per-cpu lists drain support" -- avoid interference of a high priority
> > task due to a workqueue item draining per-cpu page lists. While many
> > workloads can tolerate a brief interruption, it may cause a real-time
> > task running on a NOHZ_FULL CPU to miss a deadline and, at minimum,
> > the draining is non-deterministic.
> >
> > Currently an IRQ-safe local_lock protects the page allocator per-cpu lists.
> > The local_lock on its own prevents migration and the IRQ disabling protects
> > from corruption due to an interrupt arriving while a page allocation is
> > in progress. The locking is inherently unsafe for remote access unless
> > the CPU is hot-removed.
> >
> > This series adjusts the locking. A spinlock is added to struct
> > per_cpu_pages to protect the list contents while local_lock_irq continues
> > to prevent migration and IRQ reentry. This allows a remote CPU to safely
> > drain a remote per-cpu list.
> >
> > This series is a partial series. Follow-on work should allow the
> > local_irq_save to be converted to a local_irq to avoid IRQs being
> > disabled/enabled in most cases. Consequently, there are some TODO comments
> > highlighting the places that would change if local_irq was used. However,
> > there are enough corner cases that it deserves a series on its own
> > separated by one kernel release and the priority right now is to avoid
> > interference of high priority tasks.
> >
> > Patch 1 is a cosmetic patch to clarify when page->lru is storing buddy pages
> > and when it is storing per-cpu pages.
> >
> > Patch 2 shrinks per_cpu_pages to make room for a spin lock. Strictly speaking
> > this is not necessary but it avoids per_cpu_pages consuming another
> > cache line.
> >
> > Patch 3 is a preparation patch to avoid code duplication.
> >
> > Patch 4 is a simple micro-optimisation that improves code flow necessary for
> > a later patch to avoid code duplication.
> >
> > Patch 5 uses a spin_lock to protect the per_cpu_pages contents while still
> > relying on local_lock to prevent migration, stabilise the pcp
> > lookup and prevent IRQ reentrancy.
> >
> > Patch 6 remote drains per-cpu pages directly instead of using a workqueue.
>
> Mel, we saw a spontaneous "mm_percpu_wq" crash on today's linux-next tree
> while running CPU offlining/onlining, and were wondering if you have any
> thoughts?
>

Do you think it's related to the series and, if so, why? From the warning,
it's not obvious to me why it would be, given that it's a warning about a
task not being inactive when it is expected to be.

--
Mel Gorman
SUSE Labs

2022-05-27 14:54:23

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH 0/6] Drain remote per-cpu directly v3

On Thu, May 12, 2022 at 09:50:37AM +0100, Mel Gorman wrote:
> Changelog since v2
> o More conversions from page->lru to page->[pcp_list|buddy_list]
> o Additional test results in changelogs
>
> Changelog since v1
> o Fix unsafe RT locking scheme
> o Use spin_trylock on UP PREEMPT_RT
>
> This series has the same intent as Nicolas' series "mm/page_alloc: Remote
> per-cpu lists drain support" -- avoid interference of a high priority
> task due to a workqueue item draining per-cpu page lists. While many
> workloads can tolerate a brief interruption, it may cause a real-time
> task running on a NOHZ_FULL CPU to miss a deadline and, at minimum,
> the draining is non-deterministic.
>
> Currently an IRQ-safe local_lock protects the page allocator per-cpu lists.
> The local_lock on its own prevents migration and the IRQ disabling protects
> from corruption due to an interrupt arriving while a page allocation is
> in progress. The locking is inherently unsafe for remote access unless
> the CPU is hot-removed.
>
> This series adjusts the locking. A spinlock is added to struct
> per_cpu_pages to protect the list contents while local_lock_irq continues
> to prevent migration and IRQ reentry. This allows a remote CPU to safely
> drain a remote per-cpu list.
>
> This series is a partial series. Follow-on work should allow the
> local_irq_save to be converted to a local_irq to avoid IRQs being
> disabled/enabled in most cases. Consequently, there are some TODO comments
> highlighting the places that would change if local_irq was used. However,
> there are enough corner cases that it deserves a series on its own
> separated by one kernel release and the priority right now is to avoid
> interference of high priority tasks.
>
> Patch 1 is a cosmetic patch to clarify when page->lru is storing buddy pages
> and when it is storing per-cpu pages.
>
> Patch 2 shrinks per_cpu_pages to make room for a spin lock. Strictly speaking
> this is not necessary but it avoids per_cpu_pages consuming another
> cache line.
>
> Patch 3 is a preparation patch to avoid code duplication.
>
> Patch 4 is a simple micro-optimisation that improves code flow necessary for
> a later patch to avoid code duplication.
>
> Patch 5 uses a spin_lock to protect the per_cpu_pages contents while still
> relying on local_lock to prevent migration, stabilise the pcp
> lookup and prevent IRQ reentrancy.
>
> Patch 6 remote drains per-cpu pages directly instead of using a workqueue.

Mel, we saw a spontaneous "mm_percpu_wq" crash on today's linux-next tree
while running CPU offlining/onlining, and were wondering if you have any
thoughts?

WARNING: CPU: 31 PID: 173 at kernel/kthread.c:524 __kthread_bind_mask
CPU: 31 PID: 173 Comm: kworker/31:0 Not tainted 5.18.0-next-20220526-dirty #127
Workqueue: 0x0
(mm_percpu_wq)

pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : __kthread_bind_mask
lr : __kthread_bind_mask
sp : ffff800018667c50
x29: ffff800018667c50
x28: ffff800018667d20
x27: ffff083678bc2458
x26: 1fffe1002f5b17a8
x25: ffff08017ad8bd40
x24: 1fffe106cf17848b
x23: 1ffff000030ccfa0 x22: ffff0801de2d1ac0
x21: ffff0801de2d1ac0
x20: ffff07ff80286f08 x19: ffff0801de2d1ac0
x18: ffffd6056a577d1c
x17: ffffffffffffffff
x16: 1fffe0fff158eb18
x15: 1fffe106cf176138
x14: 000000000000f1f1
x13: 00000000f3f3f3f3 x12: ffff7000030ccf3b
x11: 1ffff000030ccf3a x10: ffff7000030ccf3a x9 : dfff800000000000
x8 : ffff8000186679d7
x7 : 0000000000000001
x6 : ffff7000030ccf3a
x5 : 1ffff000030ccf39 x4 : 1ffff000030ccf4e x3 : 0000000000000000
x2 : 0000000000000000
x1 : ffff07ff8ac74fc0 x0 : 0000000000000000
Call trace:
__kthread_bind_mask
kthread_bind_mask
create_worker
worker_thread
kthread
ret_from_fork
irq event stamp: 146
hardirqs last enabled at (145): _raw_spin_unlock_irqrestore
hardirqs last disabled at (146): el1_dbg
softirqs last enabled at (0): copy_process
softirqs last disabled at (0): 0x0

WARNING: CPU: 31 PID: 173 at kernel/kthread.c:593 kthread_set_per_cpu
CPU: 31 PID: 173 Comm: kworker/31:0 Tainted: G W 5.18.0-next-20220526-dirty #127
Workqueue: 0x0 (mm_percpu_wq)
pstate: 10400009 (nzcV daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : kthread_set_per_cpu
lr : worker_attach_to_pool
sp : ffff800018667be0
x29: ffff800018667be0 x28: ffff800018667d20 x27: ffff083678bc2458
x26: 1fffe1002f5b17a8 x25: ffff08017ad8bd40 x24: 1fffe106cf17848b
x23: 1fffe1003bc5a35d x22: ffff0801de2d1aec x21: 0000000000000007
x20: ffff4026d8adae00 x19: ffff0801de2d1ac0 x18: ffffd6056a577d1c
x17: ffffffffffffffff x16: 1fffe0fff158eb18 x15: 1fffe106cf176138
x14: 000000000000f1f1 x13: 00000000f3f3f3f3 x12: ffff7000030ccf53
x11: 1ffff000030ccf52 x10: ffff7000030ccf52 x9 : ffffd60563f9a038
x8 : ffff800018667a97 x7 : 0000000000000001 x6 : ffff7000030ccf52
x5 : ffff800018667a90 x4 : ffff7000030ccf53 x3 : 1fffe1003bc5a408
x2 : 0000000000000000 x1 : 000000000000001f x0 : 0000000000208060
Call trace:
kthread_set_per_cpu
worker_attach_to_pool at kernel/workqueue.c:1873
create_worker
worker_thread
kthread
ret_from_fork
irq event stamp: 146
hardirqs last enabled at (145): _raw_spin_unlock_irqrestore
hardirqs last disabled at (146): el1_dbg
softirqs last enabled at (0): copy_process
softirqs last disabled at (0): 0x0

Unable to handle kernel paging request at virtual address dfff800000000003
KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
Mem abort info:
ESR = 0x0000000096000004
EC = 0x25: DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
FSC = 0x04: level 0 translation fault
Data abort info:
ISV = 0, ISS = 0x00000004
CM = 0, WnR = 0
[dfff800000000003] address between user and kernel address ranges
Internal error: Oops: 96000004 [#1] PREEMPT SMP
CPU: 83 PID: 23994 Comm: kworker/31:2 Not tainted 5.18.0-next-20220526-dirty #127
pstate: 104000c9 (nzcV daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : __lock_acquire
lr : lock_acquire.part.0
sp : ffff800071777ac0
x29: ffff800071777ac0 x28: ffffd60563fa6380
x27: 0000000000000018
x26: 0000000000000080 x25: 0000000000000018 x24: 0000000000000000
x23: ffff0801de2d1ac0 x22: ffffd6056a66a7e0
x21: 0000000000000000
x20: 0000000000000000 x19: 0000000000000000
x18: 0000000000000767
x17: 0000000000000000 x16: 1fffe1003bc5a473
x15: 1fffe806c88e9338
x14: 000000000000f1f1
x13: 00000000f3f3f3f3
x12: ffff0801de2d1ac8
x11: 1ffffac0ad4aefa3
x10: ffffd6056a577d18 x9 : 0000000000000000
x8 : 0000000000000003
x7 : ffffd60563fa6380
x6 : 0000000000000000
x5 : 0000000000000080
x4 : 0000000000000001
x3 : 0000000000000000
x2 : 0000000000000000 x1 : 0000000000000003
x0 : dfff800000000000

Call trace:
__lock_acquire at kernel/locking/lockdep.c:4923
lock_acquire
_raw_spin_lock_irq
worker_thread at kernel/workqueue.c:2389
kthread
ret_from_fork
Code: d65f03c0 d343ff61 d2d00000 f2fbffe0 (38e06820)
---[ end trace 0000000000000000 ]---
1424.464630][T23994] Kernel panic - not syncing: Oops: Fatal exception
SMP: stopping secondary CPUs
Kernel Offset: 0x56055bdf0000 from 0xffff800008000000
PHYS_OFFSET: 0x80000000
CPU features: 0x000,0042e015,19801c82
Memory Limit: none

2022-05-28 19:12:42

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH 0/6] Drain remote per-cpu directly v3

On Fri, May 27, 2022 at 09:39:46AM +0100, Mel Gorman wrote:
> Do you think it's related to the series and, if so, why? From the warning,
> it's not obvious to me why it would be, given that it's a warning about a
> task not being inactive when it is expected to be.

I don't know. I just saw that the 6/6 patch touched mm_percpu_wq. Anyway,
I'll spend more time trying to reproduce it and see if we are lucky.