2021-11-03 17:09:13

by Nicolas Saenz Julienne

Subject: [PATCH v2 0/3] mm/page_alloc: Remote per-cpu page list drain support

This series introduces a new locking scheme around mm/page_alloc.c's per-cpu
page lists which will allow for remote CPUs to drain them. Currently, only a
local CPU is permitted to change its per-cpu lists, and it's expected to do so,
on demand, whenever a process requires it (by means of queueing a drain task on
the local CPU). Most systems will handle this promptly, but it'll cause
problems for NOHZ_FULL CPUs, which can't take any sort of interruption without
breaking their functional guarantees (latency, bandwidth, etc.).
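
For reference, the mechanism being replaced looks roughly like the sketch
below. This is only an illustration, not the exact mainline code (the work
item, handler and function names are made up, and mm_percpu_wq is assumed to
be available); the point is that every CPU, NOHZ_FULL ones included, gets
interrupted in order to drain its own lists.

#include <linux/workqueue.h>
#include <linux/percpu.h>
#include <linux/gfp.h>

static DEFINE_PER_CPU(struct work_struct, pcpu_drain_work);

/* Runs on the target CPU itself, draining its local pcplists. */
static void pcpu_drain_fn(struct work_struct *work)
{
	drain_local_pages(NULL);
}

/* Today's scheme (simplified): interrupt every CPU with a work item. */
static void drain_all_pages_via_work(void)
{
	unsigned int cpu;

	for_each_online_cpu(cpu) {
		struct work_struct *work = &per_cpu(pcpu_drain_work, cpu);

		INIT_WORK(work, pcpu_drain_fn);
		queue_work_on(cpu, mm_percpu_wq, work);
	}

	for_each_online_cpu(cpu)
		flush_work(&per_cpu(pcpu_drain_work, cpu));
}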

This new locking scheme, based on per-cpu spinlocks, is the simpler and more
maintainable approach so far[1], although it also has a drawback: it comes
with a small performance cost. Depending on the page allocation code path
micro-benchmark we can expect 0% to 0.6% degradation on x86_64, and 0% to 2% on
arm64[2].
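
In short, the series makes something like the following possible (a
simplified sketch with a made-up function name, see patches #2 and #3 for the
real code): another CPU can take a pcplist's spinlock and free the pages
directly, with no work item or IPI hitting the remote CPU.

static void drain_remote_pages_zone(struct zone *zone, unsigned int cpu)
{
	struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
	unsigned long flags;

	/* Any CPU may take this lock, not just 'cpu' itself. */
	spin_lock_irqsave(&pcp->lock, flags);
	if (pcp->count)
		free_pcppages_bulk(zone, pcp->count, pcp);
	spin_unlock_irqrestore(&pcp->lock, flags);
}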

Assuming there is nothing too horrible in the patches themselves, I believe it
all comes down to whether we prefer to take the small performance hit vs. the
maintenance burden of a more complex solution[1]. I don't have enough
experience with performance tuning, nor with maintenance, to have an
authoritative opinion here, so I'll defer to whatever consensus is hopefully
reached here. Also, I'll be happy to run any extra tests that I might have missed.

Patch #1 could be taken regardless of the rest of the series as it removes dead
code.

The series is based on today's linux-next.

Changes since v1:
- Provide performance numbers
- Uniformly use per-cpu spinlocks

[1] Other approaches can be found here:

- Static branch conditional on nohz_full, no performance loss, the extra
config option makes it painful to maintain (v1):
https://lore.kernel.org/linux-mm/[email protected]/

- RCU based approach, complex, yet a bit less taxing performance wise
(RFC):
https://lore.kernel.org/linux-mm/[email protected]/

[2] See individual patches for in-depth results

---

Nicolas Saenz Julienne (3):
mm/page_alloc: Don't pass pfn to free_unref_page_commit()
mm/page_alloc: Convert per-cpu lists' local locks to per-cpu spin
locks
mm/page_alloc: Remotely drain per-cpu lists

include/linux/mmzone.h | 1 +
mm/page_alloc.c | 151 ++++++++++++++---------------------------
2 files changed, 52 insertions(+), 100 deletions(-)

--
2.33.1


2021-11-03 17:10:26

by Nicolas Saenz Julienne

Subject: [PATCH v2 1/3] mm/page_alloc: Don't pass pfn to free_unref_page_commit()

free_unref_page_commit() doesn't make use of its pfn argument, so get
rid of it.

Signed-off-by: Nicolas Saenz Julienne <[email protected]>
---
mm/page_alloc.c | 17 ++++++-----------
1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c5952749ad40..9ef03dfb8f95 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3355,8 +3355,8 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone)
return min(READ_ONCE(pcp->batch) << 2, high);
}

-static void free_unref_page_commit(struct page *page, unsigned long pfn,
- int migratetype, unsigned int order)
+static void free_unref_page_commit(struct page *page, int migratetype,
+ unsigned int order)
{
struct zone *zone = page_zone(page);
struct per_cpu_pages *pcp;
@@ -3405,7 +3405,7 @@ void free_unref_page(struct page *page, unsigned int order)
}

local_lock_irqsave(&pagesets.lock, flags);
- free_unref_page_commit(page, pfn, migratetype, order);
+ free_unref_page_commit(page, migratetype, order);
local_unlock_irqrestore(&pagesets.lock, flags);
}

@@ -3415,13 +3415,13 @@ void free_unref_page(struct page *page, unsigned int order)
void free_unref_page_list(struct list_head *list)
{
struct page *page, *next;
- unsigned long flags, pfn;
+ unsigned long flags;
int batch_count = 0;
int migratetype;

/* Prepare pages for freeing */
list_for_each_entry_safe(page, next, list, lru) {
- pfn = page_to_pfn(page);
+ unsigned long pfn = page_to_pfn(page);
if (!free_unref_page_prepare(page, pfn, 0)) {
list_del(&page->lru);
continue;
@@ -3437,15 +3437,10 @@ void free_unref_page_list(struct list_head *list)
free_one_page(page_zone(page), page, pfn, 0, migratetype, FPI_NONE);
continue;
}
-
- set_page_private(page, pfn);
}

local_lock_irqsave(&pagesets.lock, flags);
list_for_each_entry_safe(page, next, list, lru) {
- pfn = page_private(page);
- set_page_private(page, 0);
-
/*
* Non-isolated types over MIGRATE_PCPTYPES get added
* to the MIGRATE_MOVABLE pcp list.
@@ -3455,7 +3450,7 @@ void free_unref_page_list(struct list_head *list)
migratetype = MIGRATE_MOVABLE;

trace_mm_page_free_batched(page);
- free_unref_page_commit(page, pfn, migratetype, 0);
+ free_unref_page_commit(page, migratetype, 0);

/*
* Guard against excessive IRQ disabled times when we get
--
2.33.1

2021-11-03 17:10:27

by Nicolas Saenz Julienne

Subject: [PATCH v2 2/3] mm/page_alloc: Convert per-cpu lists' local locks to per-cpu spin locks

page_alloc's per-cpu page lists are currently protected by local locks.
This forces any remote operation dependent on draining them to schedule
drain work on all CPUs. This doesn't play well with NOHZ_FULL CPUs,
which can't be bothered to run housekeeping tasks.

As a first step to mitigate this, convert the current locking scheme to
per-cpu spinlocks. The conversion also moves the actual lock into
'struct per_cpu_pages', which is not only nicer code, but also essential
in order to couple access to the lock and the lists. One side effect of
this is a more complex free_unref_page_list(), as the patch tries to
maintain the function's previous optimizations[1]. Other than that, the
conversion itself is mostly trivial.

The performance difference between local locks and uncontended per-cpu
spinlocks (which is what they are during normal operation) is pretty
small.

On an Intel Xeon E5-2640 (x86_64) with 32GB of memory (mean
variation vs. vanilla runs, higher is worse):
- netperf: -0.5% to 0.5% (no difference)
- hackbench: -0.3% to 0.7% (almost no difference)
- mmtests/sparsetruncate-tiny: -0.1% to 0.6%

On a Cavium ThunderX2 (arm64) with 64GB of memory:
- netperf: 1.0% to 1.7%
- hackbench: 0.8% to 1.5%
- mmtests/sparsetruncate-tiny: 1.6% to 2.1%

arm64 is a bit more sensitive to the change, probably due to the effect
of the spinlock's memory barriers.

Note that the aim9 test suite was also run (through
mmtests/pagealloc-performance) but the test's own variance distorts the
results too much.

[1] See:
- 9cca35d42eb61 ("mm, page_alloc: enable/disable IRQs once when
freeing a list of pages")
- c24ad77d962c3 ("mm/page_alloc.c: avoid excessive IRQ disabled
times in free_unref_page_list()")

Signed-off-by: Nicolas Saenz Julienne <[email protected]>
---
include/linux/mmzone.h | 1 +
mm/page_alloc.c | 87 ++++++++++++++++++++++--------------------
2 files changed, 47 insertions(+), 41 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 58e744b78c2c..83c51036c756 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -376,6 +376,7 @@ struct per_cpu_pages {

/* Lists of pages, one per migrate type stored on the pcp-lists */
struct list_head lists[NR_PCP_LISTS];
+ spinlock_t lock;
};

struct per_cpu_zonestat {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9ef03dfb8f95..b332d5cc40f1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -122,13 +122,6 @@ typedef int __bitwise fpi_t;
static DEFINE_MUTEX(pcp_batch_high_lock);
#define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)

-struct pagesets {
- local_lock_t lock;
-};
-static DEFINE_PER_CPU(struct pagesets, pagesets) = {
- .lock = INIT_LOCAL_LOCK(lock),
-};
-
#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
DEFINE_PER_CPU(int, numa_node);
EXPORT_PER_CPU_SYMBOL(numa_node);
@@ -1505,8 +1498,8 @@ static void free_pcppages_bulk(struct zone *zone, int count,
pcp->count -= nr_freed;

/*
- * local_lock_irq held so equivalent to spin_lock_irqsave for
- * both PREEMPT_RT and non-PREEMPT_RT configurations.
+ * spin_lock_irqsave(&pcp->lock) held so equivalent to
+ * spin_lock_irqsave().
*/
spin_lock(&zone->lock);
isolated_pageblocks = has_isolate_pageblock(zone);
@@ -3011,8 +3004,8 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
int i, allocated = 0;

/*
- * local_lock_irq held so equivalent to spin_lock_irqsave for
- * both PREEMPT_RT and non-PREEMPT_RT configurations.
+ * spin_lock_irqsave(&pcp->lock) held so equivalent to
+ * spin_lock_irqsave().
*/
spin_lock(&zone->lock);
for (i = 0; i < count; ++i) {
@@ -3066,12 +3059,12 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
unsigned long flags;
int to_drain, batch;

- local_lock_irqsave(&pagesets.lock, flags);
+ spin_lock_irqsave(&pcp->lock, flags);
batch = READ_ONCE(pcp->batch);
to_drain = min(pcp->count, batch);
if (to_drain > 0)
free_pcppages_bulk(zone, to_drain, pcp);
- local_unlock_irqrestore(&pagesets.lock, flags);
+ spin_unlock_irqrestore(&pcp->lock, flags);
}
#endif

@@ -3087,13 +3080,11 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
unsigned long flags;
struct per_cpu_pages *pcp;

- local_lock_irqsave(&pagesets.lock, flags);
-
pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+ spin_lock_irqsave(&pcp->lock, flags);
if (pcp->count)
free_pcppages_bulk(zone, pcp->count, pcp);
-
- local_unlock_irqrestore(&pagesets.lock, flags);
+ spin_unlock_irqrestore(&pcp->lock, flags);
}

/*
@@ -3355,16 +3346,14 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone)
return min(READ_ONCE(pcp->batch) << 2, high);
}

-static void free_unref_page_commit(struct page *page, int migratetype,
- unsigned int order)
+static void free_unref_page_commit(struct page *page, struct per_cpu_pages *pcp,
+ int migratetype, unsigned int order)
{
struct zone *zone = page_zone(page);
- struct per_cpu_pages *pcp;
int high;
int pindex;

__count_vm_event(PGFREE);
- pcp = this_cpu_ptr(zone->per_cpu_pageset);
pindex = order_to_pindex(migratetype, order);
list_add(&page->lru, &pcp->lists[pindex]);
pcp->count += 1 << order;
@@ -3383,6 +3372,7 @@ void free_unref_page(struct page *page, unsigned int order)
{
unsigned long flags;
unsigned long pfn = page_to_pfn(page);
+ struct per_cpu_pages *pcp;
int migratetype;

if (!free_unref_page_prepare(page, pfn, order))
@@ -3404,9 +3394,10 @@ void free_unref_page(struct page *page, unsigned int order)
migratetype = MIGRATE_MOVABLE;
}

- local_lock_irqsave(&pagesets.lock, flags);
- free_unref_page_commit(page, migratetype, order);
- local_unlock_irqrestore(&pagesets.lock, flags);
+ pcp = this_cpu_ptr(page_zone(page)->per_cpu_pageset);
+ spin_lock_irqsave(&pcp->lock, flags);
+ free_unref_page_commit(page, pcp, migratetype, order);
+ spin_unlock_irqrestore(&pcp->lock, flags);
}

/*
@@ -3415,6 +3406,7 @@ void free_unref_page(struct page *page, unsigned int order)
void free_unref_page_list(struct list_head *list)
{
struct page *page, *next;
+ spinlock_t *lock = NULL;
unsigned long flags;
int batch_count = 0;
int migratetype;
@@ -3422,6 +3414,7 @@ void free_unref_page_list(struct list_head *list)
/* Prepare pages for freeing */
list_for_each_entry_safe(page, next, list, lru) {
unsigned long pfn = page_to_pfn(page);
+
if (!free_unref_page_prepare(page, pfn, 0)) {
list_del(&page->lru);
continue;
@@ -3439,8 +3432,22 @@ void free_unref_page_list(struct list_head *list)
}
}

- local_lock_irqsave(&pagesets.lock, flags);
list_for_each_entry_safe(page, next, list, lru) {
+ struct per_cpu_pages *pcp = this_cpu_ptr(page_zone(page)->per_cpu_pageset);
+
+ /*
+ * As an optimization, release the previously held lock only if
+ * the page belongs to a different zone. But also, guard
+ * against excessive IRQ disabled times when we get a large
+ * list of pages to free.
+ */
+ if (++batch_count == SWAP_CLUSTER_MAX ||
+ (lock != &pcp->lock && lock)) {
+ spin_unlock_irqrestore(lock, flags);
+ batch_count = 0;
+ lock = NULL;
+ }
+
/*
* Non-isolated types over MIGRATE_PCPTYPES get added
* to the MIGRATE_MOVABLE pcp list.
@@ -3450,19 +3457,17 @@ void free_unref_page_list(struct list_head *list)
migratetype = MIGRATE_MOVABLE;

trace_mm_page_free_batched(page);
- free_unref_page_commit(page, migratetype, 0);

- /*
- * Guard against excessive IRQ disabled times when we get
- * a large list of pages to free.
- */
- if (++batch_count == SWAP_CLUSTER_MAX) {
- local_unlock_irqrestore(&pagesets.lock, flags);
- batch_count = 0;
- local_lock_irqsave(&pagesets.lock, flags);
+ if (!lock) {
+ spin_lock_irqsave(&pcp->lock, flags);
+ lock = &pcp->lock;
}
+
+ free_unref_page_commit(page, pcp, migratetype, 0);
}
- local_unlock_irqrestore(&pagesets.lock, flags);
+
+ if (lock)
+ spin_unlock_irqrestore(lock, flags);
}

/*
@@ -3636,18 +3641,17 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
struct page *page;
unsigned long flags;

- local_lock_irqsave(&pagesets.lock, flags);
-
/*
* On allocation, reduce the number of pages that are batch freed.
* See nr_pcp_free() where free_factor is increased for subsequent
* frees.
*/
pcp = this_cpu_ptr(zone->per_cpu_pageset);
+ spin_lock_irqsave(&pcp->lock, flags);
pcp->free_factor >>= 1;
list = &pcp->lists[order_to_pindex(migratetype, order)];
page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
- local_unlock_irqrestore(&pagesets.lock, flags);
+ spin_unlock_irqrestore(&pcp->lock, flags);
if (page) {
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1);
zone_statistics(preferred_zone, zone, 1);
@@ -5265,8 +5269,8 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
goto failed;

/* Attempt the batch allocation */
- local_lock_irqsave(&pagesets.lock, flags);
pcp = this_cpu_ptr(zone->per_cpu_pageset);
+ spin_lock_irqsave(&pcp->lock, flags);
pcp_list = &pcp->lists[order_to_pindex(ac.migratetype, 0)];

while (nr_populated < nr_pages) {
@@ -5295,7 +5299,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
nr_populated++;
}

- local_unlock_irqrestore(&pagesets.lock, flags);
+ spin_unlock_irqrestore(&pcp->lock, flags);

__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
zone_statistics(ac.preferred_zoneref->zone, zone, nr_account);
@@ -5304,7 +5308,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
return nr_populated;

failed_irq:
- local_unlock_irqrestore(&pagesets.lock, flags);
+ spin_unlock_irqrestore(&pcp->lock, flags);

failed:
page = __alloc_pages(gfp, 0, preferred_nid, nodemask);
@@ -6947,6 +6951,7 @@ void __meminit setup_zone_pageset(struct zone *zone)
struct per_cpu_zonestat *pzstats;

pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+ spin_lock_init(&pcp->lock);
pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
per_cpu_pages_init(pcp, pzstats);
}
--
2.33.1

2021-11-04 14:40:35

by Oliver Sang

Subject: [mm/page_alloc] 5541e53659: BUG:spinlock_bad_magic_on_CPU



Greeting,

FYI, we noticed the following commit (built with gcc-9):

commit: 5541e5365954069e4c7b649831c0e41bc9e5e081 ("[PATCH v2 2/3] mm/page_alloc: Convert per-cpu lists' local locks to per-cpu spin locks")
url: https://github.com/0day-ci/linux/commits/Nicolas-Saenz-Julienne/mm-page_alloc-Remote-per-cpu-page-list-drain-support/20211104-010825
base: https://github.com/hnaz/linux-mm master
patch link: https://lore.kernel.org/lkml/[email protected]

in testcase: boot

on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


+--------------------------------------------+------------+------------+
|                                            | 69c421f2b4 | 5541e53659 |
+--------------------------------------------+------------+------------+
| boot_successes                             | 11         | 0          |
| boot_failures                              | 0          | 11         |
| BUG:spinlock_bad_magic_on_CPU              | 0          | 11         |
| BUG:using_smp_processor_id()in_preemptible | 0          | 11         |
+--------------------------------------------+------------+------------+


If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>


[ 0.161872][ T0] BUG: spinlock bad magic on CPU#0, swapper/0
[ 0.162248][ T0] lock: 0xeb24bef0, .magic: 00000000, .owner: swapper/0, .owner_cpu: 0
[ 0.162767][ T0] CPU: 0 PID: 0 Comm: swapper Not tainted 5.15.0-rc7-mm1-00437-g5541e5365954 #1
[ 0.163325][ T0] Call Trace:
[ 0.163524][ T0] dump_stack_lvl (lib/dump_stack.c:107 (discriminator 4))
[ 0.163802][ T0] dump_stack (lib/dump_stack.c:114)
[ 0.164050][ T0] spin_bug (kernel/locking/spinlock_debug.c:70 kernel/locking/spinlock_debug.c:77)
[ 0.164296][ T0] do_raw_spin_unlock (arch/x86/include/asm/atomic.h:29 include/linux/atomic/atomic-instrumented.h:28 include/asm-generic/qspinlock.h:28 kernel/locking/spinlock_debug.c:100 kernel/locking/spinlock_debug.c:140)
[ 0.164624][ T0] _raw_spin_unlock_irqrestore (include/linux/spinlock_api_smp.h:160 kernel/locking/spinlock.c:194)
[ 0.164971][ T0] free_unref_page (include/linux/spinlock.h:423 mm/page_alloc.c:3400)
[ 0.165253][ T0] free_the_page (mm/page_alloc.c:699)
[ 0.165521][ T0] __free_pages (mm/page_alloc.c:5453)
[ 0.165785][ T0] add_highpages_with_active_regions (include/linux/mm.h:2511 arch/x86/mm/init_32.c:416)
[ 0.166179][ T0] set_highmem_pages_init (arch/x86/mm/highmem_32.c:30)
[ 0.166501][ T0] mem_init (arch/x86/mm/init_32.c:749 (discriminator 2))
[ 0.166749][ T0] start_kernel (init/main.c:842 init/main.c:988)
[ 0.167026][ T0] ? early_idt_handler_common (arch/x86/kernel/head_32.S:417)
[ 0.167369][ T0] i386_start_kernel (arch/x86/kernel/head32.c:57)
[ 0.167662][ T0] startup_32_smp (arch/x86/kernel/head_32.S:328)
[ 1.272601][ T0] Initializing Movable for node 0 (00000000:00000000)
[ 1.563704][ T0] Checking if this processor honours the WP bit even in supervisor mode...Ok.
[ 1.564312][ T0] Memory: 2895940K/3145208K available (8028K kernel code, 4874K rwdata, 4088K rodata, 2016K init, 8452K bss, 249268K reserved, 0K cma-reserved, 2252680K highmem)
[ 1.565333][ T0] random: get_random_u32 called from __kmem_cache_create+0x13/0x3b8 with crng_init=0
[ 1.565532][ T0] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=2, Nodes=1
[ 1.566585][ T0] Stack Depot allocating hash table with kvmalloc
[ 1.567566][ T0] Node 0, zone DMA: page owner found early allocated 0 pages
[ 1.569258][ T0] Node 0, zone Normal: page owner found early allocated 1057 pages
[ 1.574666][ T0] Node 0, zone HighMem: page owner found early allocated 0 pages
[ 1.575421][ T0] ODEBUG: selftest passed
[ 1.575737][ T0] trace event string verifier disabled
[ 1.576126][ T0] Dynamic Preempt: none
[ 1.576450][ T0] Running RCU self tests
[ 1.576727][ T0] rcu: Preemptible hierarchical RCU implementation.
[ 1.577149][ T0] rcu: RCU event tracing is enabled.
[ 1.577483][ T0] rcu: RCU lockdep checking is enabled.
[ 1.577833][ T0] rcu: RCU restricting CPUs from NR_CPUS=8 to nr_cpu_ids=2.
[ 1.578294][ T0] rcu: RCU callback double-/use-after-free debug enabled.
[ 1.578744][ T0] rcu: RCU debug extended QS entry/exit.
[ 1.579100][ T0] RCU CPU stall warnings timeout set to 100 (rcu_cpu_stall_timeout).
[ 1.579611][ T0] Trampoline variant of Tasks RCU enabled.
[ 1.579978][ T0] Rude variant of Tasks RCU enabled.
[ 1.580311][ T0] Tracing variant of Tasks RCU enabled.
[ 1.580666][ T0] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
[ 1.581199][ T0] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=2
[ 1.583788][ T0] NR_IRQS: 2304, nr_irqs: 56, preallocated irqs: 16
[ 1.584542][ T0] kfence: initialized - using 2097152 bytes for 255 objects at 0x(ptrval)-0x(ptrval)
[ 1.585258][ T0] printk: console [ttyS0] enabled
[ 1.585891][ T0] printk: bootconsole [earlyser0] disabled
[ 1.586621][ T0] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
[ 1.587173][ T0] ... MAX_LOCKDEP_SUBCLASSES: 8
[ 1.587491][ T0] ... MAX_LOCK_DEPTH: 48
[ 1.587814][ T0] ... MAX_LOCKDEP_KEYS: 8192
[ 1.588146][ T0] ... CLASSHASH_SIZE: 4096
[ 1.588477][ T0] ... MAX_LOCKDEP_ENTRIES: 32768
[ 1.588821][ T0] ... MAX_LOCKDEP_CHAINS: 65536
[ 1.589167][ T0] ... CHAINHASH_SIZE: 32768
[ 1.589504][ T0] memory used by lock dependency info: 4061 kB
[ 1.589904][ T0] memory used for stack traces: 2112 kB
[ 1.590264][ T0] per task-struct memory footprint: 2112 bytes
[ 1.590671][ T0] ACPI: Core revision 20210930
[ 1.591135][ T0] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604467 ns
[ 1.591886][ T0] APIC: Switch to symmetric I/O mode setup
[ 1.592284][ T0] Enabling APIC mode: Flat. Using 1 I/O APICs
[ 1.592710][ T0] masked ExtINT on CPU#0
[ 1.593593][ T0] ENABLING IO-APIC IRQs
[ 1.593859][ T0] init IO_APIC IRQs
[ 1.594227][ T0] apic 0 pin 0 not connected
[ 1.594688][ T0] IOAPIC[0]: Preconfigured routing entry (0-1 -> IRQ 1 Level:0 ActiveLow:0)
[ 1.595534][ T0] IOAPIC[0]: Preconfigured routing entry (0-2 -> IRQ 0 Level:0 ActiveLow:0)
[ 1.596221][ T0] IOAPIC[0]: Preconfigured routing entry (0-3 -> IRQ 3 Level:0 ActiveLow:0)
[ 1.596774][ T0] IOAPIC[0]: Preconfigured routing entry (0-4 -> IRQ 4 Level:0 ActiveLow:0)
[ 1.597320][ T0] IOAPIC[0]: Preconfigured routing entry (0-5 -> IRQ 5 Level:1 ActiveLow:0)
[ 1.597865][ T0] IOAPIC[0]: Preconfigured routing entry (0-6 -> IRQ 6 Level:0 ActiveLow:0)
[ 1.598409][ T0] IOAPIC[0]: Preconfigured routing entry (0-7 -> IRQ 7 Level:0 ActiveLow:0)
[ 1.598954][ T0] IOAPIC[0]: Preconfigured routing entry (0-8 -> IRQ 8 Level:0 ActiveLow:0)
[ 1.599498][ T0] IOAPIC[0]: Preconfigured routing entry (0-9 -> IRQ 9 Level:1 ActiveLow:0)
[ 1.600052][ T0] IOAPIC[0]: Preconfigured routing entry (0-10 -> IRQ 10 Level:1 ActiveLow:0)
[ 1.600618][ T0] IOAPIC[0]: Preconfigured routing entry (0-11 -> IRQ 11 Level:1 ActiveLow:0)
[ 1.601195][ T0] IOAPIC[0]: Preconfigured routing entry (0-12 -> IRQ 12 Level:0 ActiveLow:0)
[ 1.601760][ T0] IOAPIC[0]: Preconfigured routing entry (0-13 -> IRQ 13 Level:0 ActiveLow:0)
[ 1.602312][ T0] IOAPIC[0]: Preconfigured routing entry (0-14 -> IRQ 14 Level:0 ActiveLow:0)
[ 1.602865][ T0] IOAPIC[0]: Preconfigured routing entry (0-15 -> IRQ 15 Level:0 ActiveLow:0)
[ 1.603415][ T0] apic 0 pin 16 not connected
[ 1.603712][ T0] apic 0 pin 17 not connected
[ 1.604020][ T0] apic 0 pin 18 not connected
[ 1.604316][ T0] apic 0 pin 19 not connected
[ 1.604617][ T0] apic 0 pin 20 not connected
[ 1.604922][ T0] apic 0 pin 21 not connected
[ 1.605219][ T0] apic 0 pin 22 not connected
[ 1.605516][ T0] apic 0 pin 23 not connected
[ 1.605913][ T0] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[ 1.606361][ T0] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1fa3704c1a9, max_idle_ns: 440795296692 ns
[ 1.607077][ T0] Calibrating delay loop (skipped) preset value.. 4389.83 BogoMIPS (lpj=8779664)
[ 1.607658][ T0] pid_max: default: 4096 minimum: 301
[ 1.611130][ T0] LSM: Security Framework initializing
[ 1.611497][ T0] Mount-cache hash table entries: 2048 (order: 1, 8192 bytes, linear)
[ 1.613263][ T0] Mountpoint-cache hash table entries: 2048 (order: 1, 8192 bytes, linear)
[ 1.615945][ T0] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
[ 1.616341][ T0] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
[ 1.616763][ T0] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
[ 1.617348][ T0] Spectre V2 : Spectre mitigation: kernel not compiled with retpoline; no mitigation available!
[ 1.617350][ T0] Speculative Store Bypass: Vulnerable
[ 1.618335][ T0] L1TF: Kernel not compiled for PAE. No mitigation for L1TF
[ 1.618788][ T0] MDS: Vulnerable: Clear CPU buffers attempted, no microcode
[ 1.619423][ T0] Freeing SMP alternatives memory: 16K
[ 1.619943][ T1] smpboot: CPU0: Intel Xeon E312xx (Sandy Bridge) (family: 0x6, model: 0x2a, stepping: 0x1)


To reproduce:

# build kernel
cd linux
cp config-5.15.0-rc7-mm1-00437-g5541e5365954 .config
make HOSTCC=gcc-9 CC=gcc-9 ARCH=i386 olddefconfig prepare modules_prepare bzImage

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email

# if come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.



---
0DAY/LKP+ Test Infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation

Thanks,
Oliver Sang


Attachments:
(No filename) (10.19 kB)
config-5.15.0-rc7-mm1-00437-g5541e5365954 (130.13 kB)
job-script (4.91 kB)
dmesg.xz (22.52 kB)

2021-11-04 16:42:16

by Nicolas Saenz Julienne

Subject: Re: [mm/page_alloc] 5541e53659: BUG:spinlock_bad_magic_on_CPU

On Thu, 2021-11-04 at 22:38 +0800, kernel test robot wrote:
>
> Greeting,
>
> FYI, we noticed the following commit (built with gcc-9):
>
> commit: 5541e5365954069e4c7b649831c0e41bc9e5e081 ("[PATCH v2 2/3] mm/page_alloc: Convert per-cpu lists' local locks to per-cpu spin locks")
> url: https://github.com/0day-ci/linux/commits/Nicolas-Saenz-Julienne/mm-page_alloc-Remote-per-cpu-page-list-drain-support/20211104-010825
> base: https://github.com/hnaz/linux-mm master
> patch link: https://lore.kernel.org/lkml/[email protected]
>
> in testcase: boot
>
> on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
>
> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
>
>
> +--------------------------------------------+------------+------------+
> |                                            | 69c421f2b4 | 5541e53659 |
> +--------------------------------------------+------------+------------+
> | boot_successes                             | 11         | 0          |
> | boot_failures                              | 0          | 11         |
> | BUG:spinlock_bad_magic_on_CPU              | 0          | 11         |
> | BUG:using_smp_processor_id()in_preemptible | 0          | 11         |
> +--------------------------------------------+------------+------------+
>
>
> If you fix the issue, kindly add following tag
> Reported-by: kernel test robot <[email protected]>
>
>
> [ 0.161872][ T0] BUG: spinlock bad magic on CPU#0, swapper/0
> [ 0.162248][ T0] lock: 0xeb24bef0, .magic: 00000000, .owner: swapper/0, .owner_cpu: 0
> [ 0.162767][ T0] CPU: 0 PID: 0 Comm: swapper Not tainted 5.15.0-rc7-mm1-00437-g5541e5365954 #1
> [ 0.163325][ T0] Call Trace:
> [ 0.163524][ T0] dump_stack_lvl (lib/dump_stack.c:107 (discriminator 4))
> [ 0.163802][ T0] dump_stack (lib/dump_stack.c:114)
> [ 0.164050][ T0] spin_bug (kernel/locking/spinlock_debug.c:70 kernel/locking/spinlock_debug.c:77)
> [ 0.164296][ T0] do_raw_spin_unlock (arch/x86/include/asm/atomic.h:29 include/linux/atomic/atomic-instrumented.h:28 include/asm-generic/qspinlock.h:28 kernel/locking/spinlock_debug.c:100 kernel/locking/spinlock_debug.c:140)
> [ 0.164624][ T0] _raw_spin_unlock_irqrestore (include/linux/spinlock_api_smp.h:160 kernel/locking/spinlock.c:194)
> [ 0.164971][ T0] free_unref_page (include/linux/spinlock.h:423 mm/page_alloc.c:3400)
> [ 0.165253][ T0] free_the_page (mm/page_alloc.c:699)
> [ 0.165521][ T0] __free_pages (mm/page_alloc.c:5453)
> [ 0.165785][ T0] add_highpages_with_active_regions (include/linux/mm.h:2511 arch/x86/mm/init_32.c:416)
> [ 0.166179][ T0] set_highmem_pages_init (arch/x86/mm/highmem_32.c:30)
> [ 0.166501][ T0] mem_init (arch/x86/mm/init_32.c:749 (discriminator 2))
> [ 0.166749][ T0] start_kernel (init/main.c:842 init/main.c:988)
> [ 0.167026][ T0] ? early_idt_handler_common (arch/x86/kernel/head_32.S:417)
> [ 0.167369][ T0] i386_start_kernel (arch/x86/kernel/head32.c:57)
> [ 0.167662][ T0] startup_32_smp (arch/x86/kernel/head_32.S:328)

I did test this with lock debugging enabled, but I somehow missed this stack
trace. Here's the fix:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7dbdab100461..c8964e28aa59 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6853,6 +6853,7 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta
pcp->high = BOOT_PAGESET_HIGH;
pcp->batch = BOOT_PAGESET_BATCH;
pcp->free_factor = 0;
+ spin_lock_init(&pcp->lock);
}

static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high,
@@ -6902,7 +6903,6 @@ void __meminit setup_zone_pageset(struct zone *zone)
struct per_cpu_zonestat *pzstats;

pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
- spin_lock_init(&pcp->lock);
pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
per_cpu_pages_init(pcp, pzstats);
}

--
Nicolás Sáenz

2021-11-23 14:41:41

by Vlastimil Babka

Subject: Re: [PATCH v2 1/3] mm/page_alloc: Don't pass pfn to free_unref_page_commit()

On 11/3/21 18:05, Nicolas Saenz Julienne wrote:
> free_unref_page_commit() doesn't make use of its pfn argument, so get
> rid of it.

Yeah, looks like since df1acc856923 ("mm/page_alloc: avoid conflating IRQs
disabled with zone->lock")

> Signed-off-by: Nicolas Saenz Julienne <[email protected]>

Reviewed-by: Vlastimil Babka <[email protected]>

> ---
> mm/page_alloc.c | 17 ++++++-----------
> 1 file changed, 6 insertions(+), 11 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c5952749ad40..9ef03dfb8f95 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3355,8 +3355,8 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone)
> return min(READ_ONCE(pcp->batch) << 2, high);
> }
>
> -static void free_unref_page_commit(struct page *page, unsigned long pfn,
> - int migratetype, unsigned int order)
> +static void free_unref_page_commit(struct page *page, int migratetype,
> + unsigned int order)
> {
> struct zone *zone = page_zone(page);
> struct per_cpu_pages *pcp;
> @@ -3405,7 +3405,7 @@ void free_unref_page(struct page *page, unsigned int order)
> }
>
> local_lock_irqsave(&pagesets.lock, flags);
> - free_unref_page_commit(page, pfn, migratetype, order);
> + free_unref_page_commit(page, migratetype, order);
> local_unlock_irqrestore(&pagesets.lock, flags);
> }
>
> @@ -3415,13 +3415,13 @@ void free_unref_page(struct page *page, unsigned int order)
> void free_unref_page_list(struct list_head *list)
> {
> struct page *page, *next;
> - unsigned long flags, pfn;
> + unsigned long flags;
> int batch_count = 0;
> int migratetype;
>
> /* Prepare pages for freeing */
> list_for_each_entry_safe(page, next, list, lru) {
> - pfn = page_to_pfn(page);
> + unsigned long pfn = page_to_pfn(page);
> if (!free_unref_page_prepare(page, pfn, 0)) {
> list_del(&page->lru);
> continue;
> @@ -3437,15 +3437,10 @@ void free_unref_page_list(struct list_head *list)
> free_one_page(page_zone(page), page, pfn, 0, migratetype, FPI_NONE);
> continue;
> }
> -
> - set_page_private(page, pfn);
> }
>
> local_lock_irqsave(&pagesets.lock, flags);
> list_for_each_entry_safe(page, next, list, lru) {
> - pfn = page_private(page);
> - set_page_private(page, 0);
> -
> /*
> * Non-isolated types over MIGRATE_PCPTYPES get added
> * to the MIGRATE_MOVABLE pcp list.
> @@ -3455,7 +3450,7 @@ void free_unref_page_list(struct list_head *list)
> migratetype = MIGRATE_MOVABLE;
>
> trace_mm_page_free_batched(page);
> - free_unref_page_commit(page, pfn, migratetype, 0);
> + free_unref_page_commit(page, migratetype, 0);
>
> /*
> * Guard against excessive IRQ disabled times when we get
>


2021-11-23 14:58:07

by Vlastimil Babka

Subject: Re: [PATCH v2 0/3] mm/page_alloc: Remote per-cpu page list drain support

On 11/3/21 18:05, Nicolas Saenz Julienne wrote:
> This series introduces a new locking scheme around mm/page_alloc.c's per-cpu
> page lists which will allow for remote CPUs to drain them. Currently, only a
> local CPU is permitted to change its per-cpu lists, and it's expected to do so,
> on demand, whenever a process requires it (by means of queueing a drain task on
> the local CPU). Most systems will handle this promptly, but it'll cause
> problems for NOHZ_FULL CPUs, which can't take any sort of interruption without
> breaking their functional guarantees (latency, bandwidth, etc.).
>
> This new locking scheme, based on per-cpu spinlocks, is the simpler and more
> maintainable approach so far[1], although it also has a drawback: it comes
> with a small performance cost. Depending on the page allocation code path
> micro-benchmark we can expect 0% to 0.6% degradation on x86_64, and 0% to 2% on
> arm64[2].
>
> Assuming there is nothing too horrible in the patches themselves, I believe it
> all comes down to whether we prefer to take the small performance hit vs. the
> maintenance burden of a more complex solution[1]. I don't have enough

I'd be for the small performance hit over a more complex solution, if possible.

> experience with performance tuning, nor with maintenance, to have an
> authoritative opinion here, so I'll defer to whatever consensus is hopefully
> reached here. Also, I'll be happy to run any extra tests that I might have missed.

I think Mel has done most of the page allocator optimizations recently, so he
would be the most authoritative voice on what is or isn't acceptable.

> Patch #1 could be taken regardless of the rest of the series as it removes dead
> code.
>
> The series is based on today's linux-next.
>
> Changes since v1:
> - Provide performance numbers
> - Uniformly use per-cpu spinlocks
>
> [1] Other approaches can be found here:
>
> - Static branch conditional on nohz_full, no performance loss, the extra
> config option makes it painful to maintain (v1):
> https://lore.kernel.org/linux-mm/[email protected]/
>
> - RCU based approach, complex, yet a bit less taxing performance wise
> (RFC):
> https://lore.kernel.org/linux-mm/[email protected]/

Hm I wonder if there might still be another alternative possible. IIRC I did
propose at some point a local drain on the NOHZ cpu before returning to
userspace, and then avoiding that cpu in remote drains, but tglx didn't like
the idea of making entering the NOHZ full mode more expensive [1].

But what if we instead set pcp->high = 0 for these cpus so they would avoid
populating the pcplists in the first place? Then there wouldn't have to be a
drain at all. On the other hand, page allocator operations would not benefit
from zone lock batching on those cpus. But perhaps that would be an acceptable
tradeoff, as a nohz cpu is expected to run in userspace most of the time,
and page allocator operations are rare except maybe some initial page
faults? (I assume those kind of workloads pre-populate and/or mlock their
address space anyway).

[1] https://lore.kernel.org/all/878rznh93e.ffs@tglx/
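
A very rough sketch of that idea (the function name, the tick_nohz_full_cpu()
check and where the update would hook in are all assumptions for illustration,
not something this series implements):

#include <linux/tick.h>

/*
 * Force pcp->high to 0 on NOHZ_FULL CPUs so their pcplists never hold
 * pages and there is nothing to drain remotely. Such CPUs would then
 * free/allocate straight to/from the buddy lists, taking zone->lock on
 * every operation.
 */
static void zone_update_pcp_high_cpu(struct zone *zone, unsigned int cpu,
				     int high, int batch)
{
	struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);

	if (tick_nohz_full_cpu(cpu))
		high = 0;

	WRITE_ONCE(pcp->high, high);
	WRITE_ONCE(pcp->batch, batch);
}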

> [2] See individual patches for in-depth results
>
> ---
>
> Nicolas Saenz Julienne (3):
> mm/page_alloc: Don't pass pfn to free_unref_page_commit()
> mm/page_alloc: Convert per-cpu lists' local locks to per-cpu spin
> locks
> mm/page_alloc: Remotely drain per-cpu lists
>
> include/linux/mmzone.h | 1 +
> mm/page_alloc.c | 151 ++++++++++++++---------------------------
> 2 files changed, 52 insertions(+), 100 deletions(-)
>


2021-11-30 18:09:35

by Nicolas Saenz Julienne

Subject: Re: [PATCH v2 0/3] mm/page_alloc: Remote per-cpu page list drain support

Hi Vlastimil, sorry for the late reply and thanks for your feedback. :)

On Tue, 2021-11-23 at 15:58 +0100, Vlastimil Babka wrote:
> > [1] Other approaches can be found here:
> >
> > - Static branch conditional on nohz_full, no performance loss, the extra
> > config option makes it painful to maintain (v1):
> > https://lore.kernel.org/linux-mm/[email protected]/
> >
> > - RCU based approach, complex, yet a bit less taxing performance wise
> > (RFC):
> > https://lore.kernel.org/linux-mm/[email protected]/
>
> Hm I wonder if there might still be another alternative possible. IIRC I did
> propose at some point a local drain on the NOHZ cpu before returning to
> userspace, and then avoiding that cpu in remote drains, but tglx didn't like
> the idea of making entering the NOHZ full mode more expensive [1].
>
> But what if we instead set pcp->high = 0 for these cpus so they would avoid
> populating the pcplists in the first place? Then there wouldn't have to be a
> drain at all. On the other hand, page allocator operations would not benefit
> from zone lock batching on those cpus. But perhaps that would be an acceptable
> tradeoff, as a nohz cpu is expected to run in userspace most of the time,
> and page allocator operations are rare except maybe some initial page
> faults? (I assume those kind of workloads pre-populate and/or mlock their
> address space anyway).

I've looked a bit into this and it seems straightforward. Our workloads
pre-populate everything, and a slight startup performance hit is not that tragic
(I'll measure it nonetheless). The per-cpu nohz_full state at some point will
be dynamic, but the feature seems simple to disable/enable. I'll have to teach
__drain_all_pages(zone, force_all_cpus=true) to bypass this special case,
but that's all. I might have a go at this.
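
Something along these lines, maybe (a rough sketch with made-up names and
simplified cpumask handling, not an actual patch):

#include <linux/tick.h>
#include <linux/cpumask.h>

static void __drain_all_pages_sketch(struct zone *zone, bool force_all_cpus)
{
	static struct cpumask cpus_with_pcps;
	unsigned int cpu;

	cpumask_clear(&cpus_with_pcps);
	for_each_online_cpu(cpu) {
		struct per_cpu_pages *pcp =
			per_cpu_ptr(zone->per_cpu_pageset, cpu);

		/* Nothing ever accumulates on nohz_full CPUs... */
		if (!force_all_cpus && tick_nohz_full_cpu(cpu))
			continue;

		/* ...but force_all_cpus must still cover them. */
		if (force_all_cpus || pcp->count)
			cpumask_set_cpu(cpu, &cpus_with_pcps);
	}

	for_each_cpu(cpu, &cpus_with_pcps)
		drain_pages_zone(cpu, zone);
}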

Thanks!

--
Nicolás Sáenz


2021-12-02 16:28:33

by Marcelo Tosatti

Subject: Re: [PATCH v2 0/3] mm/page_alloc: Remote per-cpu page list drain support

On Tue, Nov 30, 2021 at 07:09:23PM +0100, Nicolas Saenz Julienne wrote:
> Hi Vlastimil, sorry for the late reply and thanks for your feedback. :)
>
> On Tue, 2021-11-23 at 15:58 +0100, Vlastimil Babka wrote:
> > > [1] Other approaches can be found here:
> > >
> > > - Static branch conditional on nohz_full, no performance loss, the extra
> > > config option makes it painful to maintain (v1):
> > > https://lore.kernel.org/linux-mm/[email protected]/
> > >
> > > - RCU based approach, complex, yet a bit less taxing performance wise
> > > (RFC):
> > > https://lore.kernel.org/linux-mm/[email protected]/
> >
> > Hm I wonder if there might still be another alternative possible. IIRC I did
> > propose at some point a local drain on the NOHZ cpu before returning to
> > userspace, and then avoiding that cpu in remote drains, but tglx didn't like
> > the idea of making entering the NOHZ full mode more expensive [1].
> >
> > But what if we instead set pcp->high = 0 for these cpus so they would avoid
> > populating the pcplists in the first place? Then there wouldn't have to be a
> > drain at all. On the other hand, page allocator operations would not benefit
> > from zone lock batching on those cpus. But perhaps that would be an acceptable
> > tradeoff, as a nohz cpu is expected to run in userspace most of the time,
> > and page allocator operations are rare except maybe some initial page
> > faults? (I assume those kind of workloads pre-populate and/or mlock their
> > address space anyway).
>
> I've looked a bit into this and it seems straightforward. Our workloads
> pre-populate everything, and a slight startup performance hit is not that tragic
> (I'll measure it nonetheless). The per-cpu nohz_full state at some point will
> be dynamic, but the feature seems simple to disable/enable. I'll have to teach
> __drain_all_pages(zone, force_all_cpus=true) to bypass this special case,
> but that's all. I might have a go at this.
>
> Thanks!
>
> --
> Nicolás Sáenz

True, but a nohz cpu does not necessarily have to run in userspace most
of the time. For example, an application can enter nohz full mode,
go back to userspace, idle, and return from idle, all without leaving
nohz_full mode.

So it's not clear that nohz_full is an appropriate trigger for setting
pcp->high = 0. Perhaps a task isolation feature would be an appropriate
location.