2021-06-02 15:22:20

by Oscar Salvador

Subject: [PATCH v2 0/3] Memory hotplug locking cleanup

Hi all,

I decided to go one step further and completely rip out zone's span_seqlock
and all related functions, since we should be fine using {get,put}_online_mems()
on the reader side, given that memory-hotplug is the only user fiddling with
those values.
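
For reference, the serialization this relies on boils down to the
following pattern (a simplified sketch of mm/memory_hotplug.c):

        /* writer side: every hotplug operation runs inside this */
        void mem_hotplug_begin(void)
        {
                cpus_read_lock();
                percpu_down_write(&mem_hotplug_lock);
        }

        /* reader side: what this series switches the readers to */
        void get_online_mems(void)
        {
                percpu_down_read(&mem_hotplug_lock);
        }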

Patch#1 and patch#2 could probably be squashed, but I decided to keep them
separate so the intention becomes clearer.
Patch#3 only removes declarations that never seem to be used.

Given that this is a much bigger surgery, I decided to drop any Acked-by/
Reviewed-by.

Oscar Salvador (3):
mm,page_alloc: Use {get,put}_online_mems() to get stable zone's values
mm,memory_hotplug: Drop unneeded locking
mm,memory_hotplug: Remove unneeded declarations

include/linux/memory_hotplug.h | 38 --------------------------------------
include/linux/mmzone.h | 23 +++++------------------
mm/memory_hotplug.c | 16 +---------------
mm/page_alloc.c | 15 ++++++---------
4 files changed, 12 insertions(+), 80 deletions(-)

--
2.16.3


2021-06-02 15:23:43

by Oscar Salvador

Subject: [PATCH v2 1/3] mm,page_alloc: Use {get,put}_online_mems() to get stable zone's values

Currently, page_outside_zone_boundaries() takes zone's span_seqlock
when reading zone_start_pfn and spanned_pages so those values are
stable vs. memory hotplug operations.
move_pfn_range_to_zone() and remove_pfn_range_from_zone(), which are
the functions that can change those zone values, are serialized by
mem_hotplug_lock via mem_hotplug_{begin,done}(), so we can just use
{get,put}_online_mems() on the reader side.

This will allow us to completely kill the span_seqlock, as no users
will remain after this series.

Signed-off-by: Oscar Salvador <[email protected]>
---
mm/page_alloc.c | 14 ++++++--------
1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index aaa1655cf682..296cb00802b4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -582,17 +582,15 @@ void set_pageblock_migratetype(struct page *page, int migratetype)
static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
{
int ret = 0;
- unsigned seq;
unsigned long pfn = page_to_pfn(page);
unsigned long sp, start_pfn;

- do {
- seq = zone_span_seqbegin(zone);
- start_pfn = zone->zone_start_pfn;
- sp = zone->spanned_pages;
- if (!zone_spans_pfn(zone, pfn))
- ret = 1;
- } while (zone_span_seqretry(zone, seq));
+ get_online_mems();
+ start_pfn = zone->zone_start_pfn;
+ sp = zone->spanned_pages;
+ if (!zone_spans_pfn(zone, pfn))
+ ret = 1;
+ put_online_mems();

if (ret)
pr_err("page 0x%lx outside node %d zone %s [ 0x%lx - 0x%lx ]\n",
--
2.16.3

2021-06-02 15:23:47

by Oscar Salvador

Subject: [PATCH v2 2/3] mm,memory_hotplug: Drop unneeded locking

Currently, memory-hotplug code takes zone's span_writelock
and pgdat's resize_lock when resizing the node/zone's spanned
pages via {move_pfn_range_to_zone(),remove_pfn_range_from_zone()}
and when resizing node and zone's present pages via
adjust_present_page_count().

These locks are also taken during the initialization of the system
at boot time, where they protect parallel struct page initialization,
but they should not really be needed in memory-hotplug, where all
operations are a) synchronized on the device level and b) serialized by
mem_hotplug_lock.

Given that there are no users of span_seqlock, rip out all related
functions.

Signed-off-by: Oscar Salvador <[email protected]>
---
include/linux/memory_hotplug.h | 35 -----------------------------------
include/linux/mmzone.h | 23 +++++------------------
mm/memory_hotplug.c | 16 +---------------
mm/page_alloc.c | 1 -
4 files changed, 6 insertions(+), 69 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 28f32fd00fe9..0d837ce6ec11 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -79,31 +79,7 @@ struct range mhp_get_pluggable_range(bool need_mapping);

/*
* Zone resizing functions
- *
- * Note: any attempt to resize a zone should has pgdat_resize_lock()
- * zone_span_writelock() both held. This ensure the size of a zone
- * can't be changed while pgdat_resize_lock() held.
*/
-static inline unsigned zone_span_seqbegin(struct zone *zone)
-{
- return read_seqbegin(&zone->span_seqlock);
-}
-static inline int zone_span_seqretry(struct zone *zone, unsigned iv)
-{
- return read_seqretry(&zone->span_seqlock, iv);
-}
-static inline void zone_span_writelock(struct zone *zone)
-{
- write_seqlock(&zone->span_seqlock);
-}
-static inline void zone_span_writeunlock(struct zone *zone)
-{
- write_sequnlock(&zone->span_seqlock);
-}
-static inline void zone_seqlock_init(struct zone *zone)
-{
- seqlock_init(&zone->span_seqlock);
-}
extern int zone_grow_free_lists(struct zone *zone, unsigned long new_nr_pages);
extern int zone_grow_waitqueues(struct zone *zone, unsigned long nr_pages);
extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
@@ -248,17 +224,6 @@ void mem_hotplug_done(void);
___page; \
})

-static inline unsigned zone_span_seqbegin(struct zone *zone)
-{
- return 0;
-}
-static inline int zone_span_seqretry(struct zone *zone, unsigned iv)
-{
- return 0;
-}
-static inline void zone_span_writelock(struct zone *zone) {}
-static inline void zone_span_writeunlock(struct zone *zone) {}
-static inline void zone_seqlock_init(struct zone *zone) {}

static inline void register_page_bootmem_info_node(struct pglist_data *pgdat)
{
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0d53eba1c383..29cd230a383c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -13,7 +13,6 @@
#include <linux/threads.h>
#include <linux/numa.h>
#include <linux/init.h>
-#include <linux/seqlock.h>
#include <linux/nodemask.h>
#include <linux/pageblock-flags.h>
#include <linux/page-flags-layout.h>
@@ -528,18 +527,11 @@ struct zone {
*
* Locking rules:
*
- * zone_start_pfn and spanned_pages are protected by span_seqlock.
- * It is a seqlock because it has to be read outside of zone->lock,
- * and it is done in the main allocator path. But, it is written
- * quite infrequently.
- *
- * The span_seq lock is declared along with zone->lock because it is
- * frequently read in proximity to zone->lock. It's good to
- * give them a chance of being in the same cacheline.
- *
- * Write access to present_pages at runtime should be protected by
- * mem_hotplug_begin/end(). Any reader who can't tolerant drift of
- * present_pages should get_online_mems() to get a stable value.
+ * Besides system initialization functions, memory-hotplug is the only
+ * user that can change zone's {spanned,present} pages at runtime, and
+ * it does so by holding the mem_hotplug_lock lock. Any readers who
+ * can't tolerate drift values should use {get,put}_online_mems to get
+ * a stable value.
*/
atomic_long_t managed_pages;
unsigned long spanned_pages;
@@ -559,11 +551,6 @@ struct zone {
unsigned long nr_isolate_pageblock;
#endif

-#ifdef CONFIG_MEMORY_HOTPLUG
- /* see spanned/present_pages for more description */
- seqlock_t span_seqlock;
-#endif
-
int initialized;

/* Write-intensive fields used from the page allocator */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 70620d0dd923..62d5dc2c01de 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -445,7 +445,6 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
unsigned long pfn;
int nid = zone_to_nid(zone);

- zone_span_writelock(zone);
if (zone->zone_start_pfn == start_pfn) {
/*
* If the section is smallest section in the zone, it need
@@ -478,7 +477,6 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
zone->spanned_pages = 0;
}
}
- zone_span_writeunlock(zone);
}

static void update_pgdat_span(struct pglist_data *pgdat)
@@ -515,7 +513,7 @@ void __ref remove_pfn_range_from_zone(struct zone *zone,
{
const unsigned long end_pfn = start_pfn + nr_pages;
struct pglist_data *pgdat = zone->zone_pgdat;
- unsigned long pfn, cur_nr_pages, flags;
+ unsigned long pfn, cur_nr_pages;

/* Poison struct pages because they are now uninitialized again. */
for (pfn = start_pfn; pfn < end_pfn; pfn += cur_nr_pages) {
@@ -540,10 +538,8 @@ void __ref remove_pfn_range_from_zone(struct zone *zone,

clear_zone_contiguous(zone);

- pgdat_resize_lock(zone->zone_pgdat, &flags);
shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
update_pgdat_span(pgdat);
- pgdat_resize_unlock(zone->zone_pgdat, &flags);

set_zone_contiguous(zone);
}
@@ -750,19 +746,13 @@ void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
{
struct pglist_data *pgdat = zone->zone_pgdat;
int nid = pgdat->node_id;
- unsigned long flags;

clear_zone_contiguous(zone);

- /* TODO Huh pgdat is irqsave while zone is not. It used to be like that before */
- pgdat_resize_lock(pgdat, &flags);
- zone_span_writelock(zone);
if (zone_is_empty(zone))
init_currently_empty_zone(zone, start_pfn, nr_pages);
resize_zone_range(zone, start_pfn, nr_pages);
- zone_span_writeunlock(zone);
resize_pgdat_range(pgdat, start_pfn, nr_pages);
- pgdat_resize_unlock(pgdat, &flags);

/*
* Subsection population requires care in pfn_to_online_page().
@@ -852,12 +842,8 @@ struct zone *zone_for_pfn_range(int online_type, int nid, unsigned start_pfn,
*/
void adjust_present_page_count(struct zone *zone, long nr_pages)
{
- unsigned long flags;
-
zone->present_pages += nr_pages;
- pgdat_resize_lock(zone->zone_pgdat, &flags);
zone->zone_pgdat->node_present_pages += nr_pages;
- pgdat_resize_unlock(zone->zone_pgdat, &flags);
}

int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 296cb00802b4..27483245384c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7156,7 +7156,6 @@ static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx,
zone->name = zone_names[idx];
zone->zone_pgdat = NODE_DATA(nid);
spin_lock_init(&zone->lock);
- zone_seqlock_init(zone);
zone_pcp_init(zone);
}

--
2.16.3

2021-06-02 18:39:45

by David Hildenbrand

Subject: Re: [PATCH v2 1/3] mm,page_alloc: Use {get,put}_online_mems() to get stable zone's values

On 02.06.21 11:14, Oscar Salvador wrote:
> Currently, page_outside_zone_boundaries() takes zone's span_seqlock
> when reading zone_start_pfn and spanned_pages so those values are
> stable vs. memory hotplug operations.
> move_pfn_range_to_zone() and remove_pfn_range_from_zone(), which are
> the functions that can change those zone values, are serialized by
> mem_hotplug_lock via mem_hotplug_{begin,done}(), so we can just use
> {get,put}_online_mems() on the reader side.
>
> This will allow us to completely kill the span_seqlock, as no users
> will remain after this series.
>
> Signed-off-by: Oscar Salvador <[email protected]>
> ---
> mm/page_alloc.c | 14 ++++++--------
> 1 file changed, 6 insertions(+), 8 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index aaa1655cf682..296cb00802b4 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -582,17 +582,15 @@ void set_pageblock_migratetype(struct page *page, int migratetype)
> static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
> {
> int ret = 0;
> - unsigned seq;
> unsigned long pfn = page_to_pfn(page);
> unsigned long sp, start_pfn;
>
> - do {
> - seq = zone_span_seqbegin(zone);
> - start_pfn = zone->zone_start_pfn;
> - sp = zone->spanned_pages;
> - if (!zone_spans_pfn(zone, pfn))
> - ret = 1;
> - } while (zone_span_seqretry(zone, seq));
> + get_online_mems();
> + start_pfn = zone->zone_start_pfn;
> + sp = zone->spanned_pages;
> + if (!zone_spans_pfn(zone, pfn))
> + ret = 1;
> + put_online_mems();
>
> if (ret)
> pr_err("page 0x%lx outside node %d zone %s [ 0x%lx - 0x%lx ]\n",
>

It's worth noting that memory offlining might hold the memory hotplug
lock for quite some time. It's not a lightweight lock, compared to the
seqlock we have here.

I can see that page_outside_zone_boundaries() is only called from
bad_range(), and bad_range() is only called under VM_BUG_ON_PAGE(). Still,
are you sure that it's even valid to block e.g., __free_one_page() and
others for what could be an eternity? And I think that we might just call
it from atomic context, where we cannot even sleep.
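
Roughly, the problematic nesting would be something like this
(an illustrative sketch, not an exact call chain):

        /* e.g. __free_one_page() runs with zone->lock held, a spinlock */
        spin_lock_irqsave(&zone->lock, flags);
        VM_BUG_ON_PAGE(bad_range(zone, page), page);
                /* -> page_outside_zone_boundaries()
                 *    -> get_online_mems()
                 *       -> percpu_down_read(&mem_hotplug_lock), may sleep
                 */
        spin_unlock_irqrestore(&zone->lock, flags);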

Long story short, using get_online_mems() looks wrong.

Maybe the current lightweight reader/writer protection does serve a purpose?

--
Thanks,

David / dhildenb

2021-06-02 19:49:31

by Oscar Salvador

Subject: Re: [PATCH v2 1/3] mm,page_alloc: Use {get,put}_online_mems() to get stable zone's values

On 2021-06-02 20:37, David Hildenbrand wrote:
> Long story short, using get_online_mems() looks wrong.
>
> Maybe the current lightweight reader/writer protection does serve a
> purpose?

It was too nice and easy to be true I guess.
Yeah, I missed the point that blocking might be an issue here, and
hotplug operations can take really long, so not an option.
I must have switched my brain off back there, because now it is just too
obvious.

Then I guess that the seqlock must stay and the only thing that can go is
the pgdat resize lock from the hotplug code.

Meh.

--
Oscar Salvador
SUSE L3

2021-06-03 02:20:17

by kernel test robot

Subject: [mm,page_alloc] acb5758bf4: BUG:sleeping_function_called_from_invalid_context_at_include/linux/percpu-rwsem.h



Greeting,

FYI, we noticed the following commit (built with gcc-9):

commit: acb5758bf41a116c65f7b3fdfd3d79b54ebfad9f ("[PATCH v2 1/3] mm,page_alloc: Use {get,put}_online_mems() to get stable zone's values")
url: https://github.com/0day-ci/linux/commits/Oscar-Salvador/Memory-hotplug-locking-cleanup/20210602-232017


in testcase: boot

on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


+-----------------------------------------------------------------------------------+-----------+------------+
| | v5.13-rc4 | acb5758bf4 |
+-----------------------------------------------------------------------------------+-----------+------------+
| boot_successes | 167 | 0 |
| boot_failures | 0 | 6 |
| BUG:sleeping_function_called_from_invalid_context_at_include/linux/percpu-rwsem.h | 0 | 6 |
| WARNING:at_kernel/sched/core.c:#__might_sleep | 0 | 1 |
| RIP:__might_sleep | 0 | 1 |
| RIP:smpboot_thread_fn | 0 | 1 |
+-----------------------------------------------------------------------------------+-----------+------------+


If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>


[ 0.951423] BUG: sleeping function called from invalid context at include/linux/percpu-rwsem.h:49
[ 0.953745] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 2, name: kthreadd
[ 0.953745] CPU: 0 PID: 2 Comm: kthreadd Not tainted 5.13.0-rc4-00001-gacb5758bf41a #1
[ 0.953745] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 0.953745] Call Trace:
[ 0.953745] dump_stack (kbuild/src/consumer/lib/dump_stack.c:122)
[ 0.953745] ___might_sleep (kbuild/src/consumer/kernel/sched/core.c:8339 kbuild/src/consumer/kernel/sched/core.c:8296)
[ 0.953745] get_online_mems (kbuild/src/consumer/include/linux/percpu-rwsem.h:49 kbuild/src/consumer/mm/memory_hotplug.c:69)
[ 0.953745] bad_range (kbuild/src/consumer/mm/page_alloc.c:589 kbuild/src/consumer/mm/page_alloc.c:617)
[ 0.953745] expand (kbuild/src/consumer/mm/page_alloc.c:2231)
[ 0.953745] rmqueue_bulk+0x161/0x359
[ 0.953745] __rmqueue_pcplist (kbuild/src/consumer/mm/page_alloc.c:3471 (discriminator 3))
[ 0.953745] rmqueue (kbuild/src/consumer/mm/page_alloc.c:3499 kbuild/src/consumer/mm/page_alloc.c:3527)
[ 0.953745] ? __alloc_pages (kbuild/src/consumer/mm/page_alloc.c:5198)
[ 0.953745] ? __cond_resched (kbuild/src/consumer/kernel/sched/core.c:6994)
[ 0.953745] ? post_alloc_hook (kbuild/src/consumer/arch/x86/include/asm/atomic.h:29 kbuild/src/consumer/include/asm-generic/atomic-instrumented.h:28 kbuild/src/consumer/include/linux/jump_label.h:254 kbuild/src/consumer/include/linux/mm.h:2959 kbuild/src/consumer/mm/page_alloc.c:2345)
[ 0.953745] get_page_from_freelist (kbuild/src/consumer/mm/page_alloc.c:3989)
[ 0.953745] __alloc_pages (kbuild/src/consumer/mm/page_alloc.c:5198)
[ 0.953745] __vmalloc_area_node+0x133/0x1fd
[ 0.953745] __vmalloc_node_range (kbuild/src/consumer/mm/vmalloc.c:2916)
[ 0.953745] copy_process (kbuild/src/consumer/kernel/fork.c:245 kbuild/src/consumer/kernel/fork.c:869 kbuild/src/consumer/kernel/fork.c:1947)
[ 0.953745] ? kernel_clone (kbuild/src/consumer/kernel/fork.c:2503)
[ 0.953745] kernel_clone (kbuild/src/consumer/kernel/fork.c:2503)
[ 0.953745] kernel_thread (kbuild/src/consumer/kernel/fork.c:2556)
[ 0.953745] ? kthread_flush_worker (kbuild/src/consumer/kernel/kthread.c:266)
[ 0.953745] kthreadd (kbuild/src/consumer/kernel/kthread.c:337 kbuild/src/consumer/kernel/kthread.c:679)
[ 0.953745] ? kthread_is_per_cpu (kbuild/src/consumer/kernel/kthread.c:652)
[ 0.953745] ret_from_fork (kbuild/src/consumer/arch/x86/entry/entry_64.S:300)


To reproduce:

# build kernel
cd linux
cp config-5.13.0-rc4-00001-gacb5758bf41a .config
make HOSTCC=gcc-9 CC=gcc-9 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email



---
0DAY/LKP+ Test Infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation

Thanks,
Oliver Sang


Attachments:
(No filename) (12.82 kB)
config-5.13.0-rc4-00001-gacb5758bf41a (121.88 kB)
job-script (4.43 kB)
dmesg.xz (16.06 kB)

2021-06-03 08:41:58

by Oscar Salvador

Subject: Re: [PATCH v2 1/3] mm,page_alloc: Use {get,put}_online_mems() to get stable zone's values

On Wed, Jun 02, 2021 at 09:45:58PM +0200, Oscar Salvador wrote:
> It was too nice and easy to be true I guess.
> Yeah, I missed the point that blocking might be an issue here, and hotplug
> operations can take really long, so not an option.
> I must have switched my brain off back there, because now it is just too
> obvious.
>
> Then I guess that the seqlock must stay and the only thing that can go is the
> pgdat resize lock from the hotplug code.

So, I have been looking into this again.
Of course, the approach taken here is outrageously wrong, but there are
some other things that are a bit confusing.

As pointed out, bad_range() (the function that ends up calling
page_outside_zone_boundaries) is called from different functions via VM_BUG_ON_*.
page_outside_zone_boundaries() takes care of taking the seqlock to avoid
reading stale values if we race with memory-hotplug
operations.
page_outside_zone_boundaries() calls zone_spans_pfn() to check that.

Now for the funny part.

We do have several places happily calling zone_spans_pfn() without
holding zone's seqlock, e.g.:

set_pageblock_migratetype
set_pfnblock_flags_mask
zone_spans_pfn

move_freepages_block
zone_spans_pfn

alloc_contig_pages
zone_spans_last_pfn
zone_spans_pfn

Those places hold zone->lock, while move_pfn_range_to_zone() and
remove_pfn_range_from_zone() hold zone->seqlock, so AFAICS, those places
could read a stale value and proceed thinking the range is within the
zone while it is not.

So I guess my question is: should we force those places to take the
seqlock read side as we do in page_outside_zone_boundaries() (or maybe
just move the seqlock handling into zone_spans_pfn())?

Because it does not make much sense to take it in a VM_DEBUG context and
not in "real life".

Thoughts?

--
Oscar Salvador
SUSE L3

2021-06-03 12:47:36

by Michal Hocko

Subject: Re: [PATCH v2 1/3] mm,page_alloc: Use {get,put}_online_mems() to get stable zone's values

On Thu 03-06-21 10:38:24, Oscar Salvador wrote:
> On Wed, Jun 02, 2021 at 09:45:58PM +0200, Oscar Salvador wrote:
> > It was too nice and easy to be true I guess.
> > Yeah, I missed the point that blocking might be an issue here, and hotplug
> > operations can take really long, so not an option.
> > I must have switched my brain off back there, because now it is just too
> > obvious.
> >
> > Then I guess that the seqlock must stay and the only thing that can go is the
> > pgdat resize lock from the hotplug code.
>
> So, I have been looking into this again.
> Of course, the approach taken here is outrageously wrong, but there are
> some other things that are a bit confusing.
>
> As pointed out, bad_range() (the function that ends up calling
> page_outside_zone_boundaries) is called from different functions via VM_BUG_ON_*.
> page_outside_zone_boundaries() takes care of taking the seqlock to avoid
> reading stale values if we race with memory-hotplug
> operations.
> page_outside_zone_boundaries() calls zone_spans_pfn() to check that.
>
> Now for the funny part.
>
> We do have several places happily calling zone_spans_pfn() without
> holding zone's seqlock, e.g.:
>
> set_pageblock_migratetype
> set_pfnblock_flags_mask
> zone_spans_pfn
>
> move_freepages_block
> zone_spans_pfn
>
> alloc_contig_pages
> zone_spans_last_pfn
> zone_spans_pfn
>
> Those places hold zone->lock, while move_pfn_range_to_zone() and
> remove_pfn_range_from_zone() hold zone->seqlock, so AFAICS, those places
> could read a stale value and proceed thinking the range is within the
> zone while it is not.
>
> So I guess my question is: should we force those places to take the
> seqlock read side as we do in page_outside_zone_boundaries() (or maybe
> just move the seqlock handling into zone_spans_pfn())?

I believe we need to define the purpose of the locking first. The
existing locking doesn't serve much purpose, does it? The state might
change right after the lock is released and the caller cannot really
rely on the result. So aside from the current implementation, I would
argue that any locking has to be done on the caller layer.

But the primary question is whether anybody actually cares about
potential races in the first place.
--
Michal Hocko
SUSE Labs

2021-06-03 12:54:32

by Michal Hocko

Subject: Re: [PATCH v2 2/3] mm,memory_hotplug: Drop unneeded locking

On Wed 02-06-21 11:14:56, Oscar Salvador wrote:
> Currently, memory-hotplug code takes zone's span_writelock
> and pgdat's resize_lock when resizing the node/zone's spanned
> pages via {move_pfn_range_to_zone(),remove_pfn_range_from_zone()}
> and when resizing node and zone's present pages via
> adjust_present_page_count().
>
> These locks are also taken during the initialization of the system
> at boot time, where they protect parallel struct page initialization,
> but they should not really be needed in memory-hotplug, where all
> operations are a) synchronized on the device level and b) serialized by
> mem_hotplug_lock.
>
> Given that there are no users of span_seqlock, rip out all related
> functions.
>
> Signed-off-by: Oscar Salvador <[email protected]>

Yes, I like this much! The sequence lock is just dubious unless I am
missing something and the resize lock doesn't seem to be serving
any purpose anymore. I haven't checked whether it used to at some point
in time.

Acked-by: Michal Hocko <[email protected]>

> ---
> include/linux/memory_hotplug.h | 35 -----------------------------------
> include/linux/mmzone.h | 23 +++++------------------
> mm/memory_hotplug.c | 16 +---------------
> mm/page_alloc.c | 1 -
> 4 files changed, 6 insertions(+), 69 deletions(-)
>
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 28f32fd00fe9..0d837ce6ec11 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -79,31 +79,7 @@ struct range mhp_get_pluggable_range(bool need_mapping);
>
> /*
> * Zone resizing functions
> - *
> - * Note: any attempt to resize a zone should has pgdat_resize_lock()
> - * zone_span_writelock() both held. This ensure the size of a zone
> - * can't be changed while pgdat_resize_lock() held.
> */
> -static inline unsigned zone_span_seqbegin(struct zone *zone)
> -{
> - return read_seqbegin(&zone->span_seqlock);
> -}
> -static inline int zone_span_seqretry(struct zone *zone, unsigned iv)
> -{
> - return read_seqretry(&zone->span_seqlock, iv);
> -}
> -static inline void zone_span_writelock(struct zone *zone)
> -{
> - write_seqlock(&zone->span_seqlock);
> -}
> -static inline void zone_span_writeunlock(struct zone *zone)
> -{
> - write_sequnlock(&zone->span_seqlock);
> -}
> -static inline void zone_seqlock_init(struct zone *zone)
> -{
> - seqlock_init(&zone->span_seqlock);
> -}
> extern int zone_grow_free_lists(struct zone *zone, unsigned long new_nr_pages);
> extern int zone_grow_waitqueues(struct zone *zone, unsigned long nr_pages);
> extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
> @@ -248,17 +224,6 @@ void mem_hotplug_done(void);
> ___page; \
> })
>
> -static inline unsigned zone_span_seqbegin(struct zone *zone)
> -{
> - return 0;
> -}
> -static inline int zone_span_seqretry(struct zone *zone, unsigned iv)
> -{
> - return 0;
> -}
> -static inline void zone_span_writelock(struct zone *zone) {}
> -static inline void zone_span_writeunlock(struct zone *zone) {}
> -static inline void zone_seqlock_init(struct zone *zone) {}
>
> static inline void register_page_bootmem_info_node(struct pglist_data *pgdat)
> {
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 0d53eba1c383..29cd230a383c 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -13,7 +13,6 @@
> #include <linux/threads.h>
> #include <linux/numa.h>
> #include <linux/init.h>
> -#include <linux/seqlock.h>
> #include <linux/nodemask.h>
> #include <linux/pageblock-flags.h>
> #include <linux/page-flags-layout.h>
> @@ -528,18 +527,11 @@ struct zone {
> *
> * Locking rules:
> *
> - * zone_start_pfn and spanned_pages are protected by span_seqlock.
> - * It is a seqlock because it has to be read outside of zone->lock,
> - * and it is done in the main allocator path. But, it is written
> - * quite infrequently.
> - *
> - * The span_seq lock is declared along with zone->lock because it is
> - * frequently read in proximity to zone->lock. It's good to
> - * give them a chance of being in the same cacheline.
> - *
> - * Write access to present_pages at runtime should be protected by
> - * mem_hotplug_begin/end(). Any reader who can't tolerant drift of
> - * present_pages should get_online_mems() to get a stable value.
> + * Besides system initialization functions, memory-hotplug is the only
> + * user that can change zone's {spanned,present} pages at runtime, and
> + * it does so by holding the mem_hotplug_lock lock. Any readers who
> + * can't tolerate drift values should use {get,put}_online_mems to get
> + * a stable value.
> */
> atomic_long_t managed_pages;
> unsigned long spanned_pages;
> @@ -559,11 +551,6 @@ struct zone {
> unsigned long nr_isolate_pageblock;
> #endif
>
> -#ifdef CONFIG_MEMORY_HOTPLUG
> - /* see spanned/present_pages for more description */
> - seqlock_t span_seqlock;
> -#endif
> -
> int initialized;
>
> /* Write-intensive fields used from the page allocator */
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 70620d0dd923..62d5dc2c01de 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -445,7 +445,6 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> unsigned long pfn;
> int nid = zone_to_nid(zone);
>
> - zone_span_writelock(zone);
> if (zone->zone_start_pfn == start_pfn) {
> /*
> * If the section is smallest section in the zone, it need
> @@ -478,7 +477,6 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> zone->spanned_pages = 0;
> }
> }
> - zone_span_writeunlock(zone);
> }
>
> static void update_pgdat_span(struct pglist_data *pgdat)
> @@ -515,7 +513,7 @@ void __ref remove_pfn_range_from_zone(struct zone *zone,
> {
> const unsigned long end_pfn = start_pfn + nr_pages;
> struct pglist_data *pgdat = zone->zone_pgdat;
> - unsigned long pfn, cur_nr_pages, flags;
> + unsigned long pfn, cur_nr_pages;
>
> /* Poison struct pages because they are now uninitialized again. */
> for (pfn = start_pfn; pfn < end_pfn; pfn += cur_nr_pages) {
> @@ -540,10 +538,8 @@ void __ref remove_pfn_range_from_zone(struct zone *zone,
>
> clear_zone_contiguous(zone);
>
> - pgdat_resize_lock(zone->zone_pgdat, &flags);
> shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
> update_pgdat_span(pgdat);
> - pgdat_resize_unlock(zone->zone_pgdat, &flags);
>
> set_zone_contiguous(zone);
> }
> @@ -750,19 +746,13 @@ void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
> {
> struct pglist_data *pgdat = zone->zone_pgdat;
> int nid = pgdat->node_id;
> - unsigned long flags;
>
> clear_zone_contiguous(zone);
>
> - /* TODO Huh pgdat is irqsave while zone is not. It used to be like that before */
> - pgdat_resize_lock(pgdat, &flags);
> - zone_span_writelock(zone);
> if (zone_is_empty(zone))
> init_currently_empty_zone(zone, start_pfn, nr_pages);
> resize_zone_range(zone, start_pfn, nr_pages);
> - zone_span_writeunlock(zone);
> resize_pgdat_range(pgdat, start_pfn, nr_pages);
> - pgdat_resize_unlock(pgdat, &flags);
>
> /*
> * Subsection population requires care in pfn_to_online_page().
> @@ -852,12 +842,8 @@ struct zone *zone_for_pfn_range(int online_type, int nid, unsigned start_pfn,
> */
> void adjust_present_page_count(struct zone *zone, long nr_pages)
> {
> - unsigned long flags;
> -
> zone->present_pages += nr_pages;
> - pgdat_resize_lock(zone->zone_pgdat, &flags);
> zone->zone_pgdat->node_present_pages += nr_pages;
> - pgdat_resize_unlock(zone->zone_pgdat, &flags);
> }
>
> int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 296cb00802b4..27483245384c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7156,7 +7156,6 @@ static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx,
> zone->name = zone_names[idx];
> zone->zone_pgdat = NODE_DATA(nid);
> spin_lock_init(&zone->lock);
> - zone_seqlock_init(zone);
> zone_pcp_init(zone);
> }
>
> --
> 2.16.3

--
Michal Hocko
SUSE Labs

2021-06-04 07:43:46

by Oscar Salvador

Subject: Re: [PATCH v2 1/3] mm,page_alloc: Use {get,put}_online_mems() to get stable zone's values

On Thu, Jun 03, 2021 at 02:45:13PM +0200, Michal Hocko wrote:
> I believe we need to define the purpose of the locking first. The

If you ask me, this locking would be meant to make sure zone's zone_start_pfn
or spanned_pages do not change under us, in case we __need__ the value to be
stable.

> existing locking doesn't serve much purpose, does it? The state might

Well, half-way. Currently, the locking is taken in write mode whenever
the zone is expanded or shrunk, and in read mode when called from
bad_range()->page_outside_zone_boundaries() (only on VM_DEBUG).

But as you pointed out, such state might change right after the locking is
released and all the work would be for nothing.
So indeed, the __whole__ operation should be enclosed by the lock in the
caller. The way it stands right now is not optimal.

> change right after the lock is released and the caller cannot really
> rely on the result. So aside from the current implementation, I would
> argue that any locking has to be done on the caller layer.
>
> But the primary question is whether anybody actually cares about
> potential races in the first place.

I have been checking move_freepages_block() and alloc_contig_pages(), which
are two of the functions that call zone_spans_pfn().

move_freepages_block() uses it in a way to align the given pfn to pageblock
top and bottom, and then check that aligned pfns are still within the same zone.
From a memory-hotplug perspective that's ok as we know that we are offlining
PAGES_PER_SECTION (which implies whole pageblocks).
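
For reference, the check in move_freepages_block() is roughly (paraphrased):

        start_pfn = pfn & ~(pageblock_nr_pages - 1);
        end_pfn = start_pfn + pageblock_nr_pages - 1;

        /* Do not cross zone boundaries */
        if (!zone_spans_pfn(zone, start_pfn))
                start_pfn = pfn;
        if (!zone_spans_pfn(zone, end_pfn))
                return 0;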

alloc_contig_pages() (used by the hugetlb gigantic allocator) runs through a
node's zonelist and checks whether zone->zone_start_pfn + nr_pages stays within
the same zone.
IMHO, the race with zone_spans_last_pfn() vs mem-hotplug would not be that bad,
as it will be caught afterwards by e.g. __alloc_contig_pages() when pages
cannot be isolated because they are offline, etc.
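
zone_spans_last_pfn() being just a thin wrapper, roughly:

        static bool zone_spans_last_pfn(const struct zone *zone,
                                        unsigned long start_pfn,
                                        unsigned long nr_pages)
        {
                unsigned long last_pfn = start_pfn + nr_pages - 1;

                return zone_spans_pfn(zone, last_pfn);
        }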

So, I would say we do not really need the lock, but I might be missing something.
But if we choose to care about this, then the locking should be done right, not
half-way as it is right now.


--
Oscar Salvador
SUSE L3

2021-06-07 07:53:52

by Oscar Salvador

Subject: Re: [PATCH v2 1/3] mm,page_alloc: Use {get,put}_online_mems() to get stable zone's values

On Fri, Jun 04, 2021 at 09:41:45AM +0200, Oscar Salvador wrote:
> On Thu, Jun 03, 2021 at 02:45:13PM +0200, Michal Hocko wrote:
> > I believe we need to define the purpose of the locking first. The
>
> If you ask me, this locking would be meant to make sure zone's zone_start_pfn
> or spanned_pages do not change under us, in case we __need__ the value to be
> stable.
>
> > existing locking doesn't serve much purpose, does it? The state might
>
> Well, half-way. Currently, the locking is taken in write mode whenever
> the zone is expanded or shrunk, and in read mode when called from
> bad_range()->page_outside_zone_boundaries() (only on VM_DEBUG).
>
> But as you pointed out, such state might change right after the locking is
> released and all the work would be for nothing.
> So indeed, the __whole__ operation should be enclosed by the lock in the
> caller. The way it stands right now is not optimal.
>
> > change right after the lock is released and the caller cannot really
> > rely on the result. So aside from the current implementation, I would
> > argue that any locking has to be done on the caller layer.
> >
> > But the primary question is whether anybody actually cares about
> > potential races in the first place.
>
> I have been checking move_freepages_block() and alloc_contig_pages(), which
> are two of the functions that call zone_spans_pfn().
>
> move_freepages_block() uses it in a way to align the given pfn to pageblock
> top and bottom, and then check that aligned pfns are still within the same zone.
> From a memory-hotplug perspective that's ok as we know that we are offlining
> PAGES_PER_SECTION (which implies whole pageblocks).
>
> alloc_contig_pages() (used by the hugetlb gigantic allocator) runs through a
> node's zonelist and checks whether zone->zone_start_pfn + nr_pages stays within
> the same zone.
> IMHO, the race with zone_spans_last_pfn() vs mem-hotplug would not be that bad,
> as it will be caught afterwards by e.g. __alloc_contig_pages() when pages
> cannot be isolated because they are offline, etc.
>
> So, I would say we do not really need the lock, but I might be missing something.
> But if we choose to care about this, then the locking should be done right, not
> half-way as it is right now.


Any thoughts on this? :-)



--
Oscar Salvador
SUSE L3

2021-06-07 08:45:46

by Michal Hocko

Subject: Re: [PATCH v2 1/3] mm,page_alloc: Use {get,put}_online_mems() to get stable zone's values

On Fri 04-06-21 09:41:45, Oscar Salvador wrote:
> On Thu, Jun 03, 2021 at 02:45:13PM +0200, Michal Hocko wrote:
[...]
> > But the primary question is whether anybody actually cares about
> > potential races in the first place.
>
> I have been checking move_freepages_block() and alloc_contig_pages(), which
> are two of the functions that call zone_spans_pfn().
>
> move_freepages_block() uses it in a way to align the given pfn to pageblock
> top and bottom, and then check that aligned pfns are still within the same zone.
> From a memory-hotplug perspective that's ok as we know that we are offlining
> PAGES_PER_SECTION (which implies whole pageblocks).
>
> alloc_contig_pages() (used by the hugetlb gigantic allocator) runs through a
> node's zonelist and checks whether zone->zone_start_pfn + nr_pages stays within
> the same zone.
> IMHO, the race with zone_spans_last_pfn() vs mem-hotplug would not be that bad,
> as it will be caught afterwards by e.g. __alloc_contig_pages() when pages
> cannot be isolated because they are offline, etc.
>
> So, I would say we do not really need the lock, but I might be missing something.
> But if we choose to care about this, then the locking should be done right, not
> half-way as it is right now.

That is my understanding as well.
--
Michal Hocko
SUSE Labs

2021-06-07 08:51:00

by David Hildenbrand

Subject: Re: [PATCH v2 1/3] mm,page_alloc: Use {get,put}_online_mems() to get stable zone's values

On 07.06.21 09:52, Oscar Salvador wrote:
> On Fri, Jun 04, 2021 at 09:41:45AM +0200, Oscar Salvador wrote:
>> On Thu, Jun 03, 2021 at 02:45:13PM +0200, Michal Hocko wrote:
>>> I believe we need to define the purpose of the locking first. The
>>
>> If you ask me, this locking would be meant to make sure zone's zone_start_pfn
>> or spanned_pages do not change under us, in case we __need__ the value to be
>> stable.
>>
>>> existing locking doesn't serve much purpose, does it? The state might
>>
>> Well, half-way. Currently, the locking is taken in write mode whenever
>> the zone is expanded or shrunk, and in read mode when called from
>> bad_range()->page_outside_zone_boundaries() (only on VM_DEBUG).
>>
>> But as you pointed out, such state might change right after the locking is
>> released and all the work would be for nothing.
>> So indeed, the __whole__ operation should be enclosed by the lock in the
>> caller. The way it stands right now is not optimal.
>>
>>> change right after the lock is released and the caller cannot really
>>> rely on the result. So aside from the current implementation, I would
>>> argue that any locking has to be done on the caller layer.
>>>
>>> But the primary question is whether anybody actually cares about
>>> potential races in the first place.
>>
>> I have been checking move_freepages_block() and alloc_contig_pages(), which
>> are two of the functions that call zone_spans_pfn().
>>
>> move_freepages_block() uses it in a way to align the given pfn to pageblock
>> top and bottom, and then check that aligned pfns are still within the same zone.
>> From a memory-hotplug perspective that's ok as we know that we are offlining
>> PAGES_PER_SECTION (which implies whole pageblocks).
>>
>> alloc_contig_pages() (used by the hugetlb gigantic allocator) runs through a
>> node's zonelist and checks whether zone->zone_start_pfn + nr_pages stays within
>> the same zone.
>> IMHO, the race with zone_spans_last_pfn() vs mem-hotplug would not be that bad,
>> as it will be caught afterwards by e.g. __alloc_contig_pages() when pages
>> cannot be isolated because they are offline, etc.
>>
>> So, I would say we do not really need the lock, but I might be missing something.
>> But if we choose to care about this, then the locking should be done right, not
>> half-way as it is right now.
>
>
> Any thoughts on this? :-)

I'd like to point out that I think the seqlock is not in place to
synchronize with actual growing/shrinking but to get consistent zone
ranges -- like using atomics, but we have two inter-dependent values here.

If you obtain the zone ranges that way and properly use
pfn_to_online_page(), there is hardly anything that can go wrong in
practice. If the zone grew in the meantime, most probably you can just
live with not processing that part for now. If the zone shrunk in the
meantime, pfn_to_online_page() will make you skip that part (it was
offlined either way, so you most probably don't really care about that
part).

[pfn_to_online_page() is racy as well, but the race window is very small
and we never saw a problem in practice really]
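
The reader pattern referred to above, sketched (illustrative only):

        for (pfn = start_pfn; pfn < end_pfn; pfn++) {
                struct page *page = pfn_to_online_page(pfn);

                if (!page)
                        continue;       /* offline (or never onlined): skip */
                /* ... process page ... */
        }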

Without the seqlock, you might just get a garbage zone range and have
false positives/negatives when just testing for a simple range not
in a hot(un)plugged range [which is the usual case when talking about
compaction etc.].

--
Thanks,

David / dhildenb

2021-06-07 10:25:14

by Oscar Salvador

Subject: Re: [PATCH v2 1/3] mm,page_alloc: Use {get,put}_online_mems() to get stable zone's values

On Mon, Jun 07, 2021 at 10:49:01AM +0200, David Hildenbrand wrote:
> I'd like to point out that I think the seqlock is not in place to
> synchronize with actual growing/shrinking but to get consistent zone ranges
> -- like using atomics, but we have two inter-dependent values here.

I guess so, at least that's what it should do.
But the way it is placed right now is misleading.

If we really want to get consistent zone ranges, we should start using
zone's seqlock where it matters and that is pretty much all those
places that use zone_spans_pfn().
Otherwise there is no way you can be sure the pfn you're checking is
within the limits. Moreover, as Michal pointed out earlier, if we really
want to go down that road the locking should be done in the caller
enclosing the whole operation, otherwise things might change once the lock
is dropped and you're working with a wrong assumption.

I can see arguments for both ripping it out and doing it right (but none for
the way it is right now).
For ripping it out, one could say that those races might not be fatal,
as usually the pfn you're working with (the one you want to check falls
within a certain range) is known to be valid, so the worst that can happen is
you get false positives/negatives, and those might or might not be detected
further down. How bad false positives/negatives are depends on the
situation, I guess, but we already live with that right now.
The zone_spans_pfn() from page_outside_zone_boundaries() is the only one using
locking right now, so, well, if we survived this long without locks in the other
places using zone_spans_pfn(), it makes one wonder if it is that bad.

On the other hand, one could argue that for correctness sake, we should be holding
zone's seqlock whenever checking for zone_spans_pfn() to avoid any inconsistency.


--
Oscar Salvador
SUSE L3

2021-06-08 23:42:02

by Oscar Salvador

Subject: Re: [PATCH v2 1/3] mm,page_alloc: Use {get,put}_online_mems() to get stable zone's values

On Mon, Jun 07, 2021 at 12:23:25PM +0200, Oscar Salvador wrote:
> I can see arguments for both ripping it out and doing it right (but none for
> the way it is right now).
> For ripping it out, one could say that those races might not be fatal,
> as usually the pfn you're working with (the one you want to check falls
> within a certain range) is known to be valid, so the worst that can happen is
> you get false positives/negatives, and those might or might not be detected
> further down. How bad false positives/negatives are depends on the
> situation, I guess, but we already live with that right now.
> The zone_spans_pfn() from page_outside_zone_boundaries() is the only one using
> locking right now, so, well, if we survived this long without locks in the other
> places using zone_spans_pfn(), it makes one wonder if it is that bad.

Given that

a) all current users of bad_range() are coming from VM_BUG_ON* callers
b) we only care when removing memory as the page would not lie in the
zone anymore. But for that to happen the whole offline_pages() operation
needs to succeed. bad_range() is called from rmqueue(), __free_one_page()
and expand(). If offline_pages() succeeds for the range our page lies on,
we would not be doing those operations on that page anyway?

So, I cannot find any strong reason to keep the seqlock (maybe in the future
we will need to re-add it because of some use case).

Any objection to removing it?


--
Oscar Salvador
SUSE L3

2021-06-09 02:16:06

by David Hildenbrand

Subject: Re: [PATCH v2 1/3] mm,page_alloc: Use {get,put}_online_mems() to get stable zone's values

On 07.06.21 12:23, Oscar Salvador wrote:
> On Mon, Jun 07, 2021 at 10:49:01AM +0200, David Hildenbrand wrote:
>> I'd like to point out that I think the seqlock is not in place to
>> synchronize with actual growing/shrinking but to get consistent zone ranges
>> -- like using atomics, but we have two inter-dependent values here.
>
> I guess so, at least that's what it should do.
> But the way it is placed right now is misleading.
>
> If we really want to get consistent zone ranges, we should start using
> zone's seqlock where it matters and that is pretty much all those
> places that use zone_spans_pfn().

Right, or even only zone_end_pfn() to get a consistent value.

> Otherwise there is no way you can be sure the pfn you're checking is
> within the limits. Moreover, as Michal pointed out earlier, if we really
> want to go down that road the locking should be done in the caller
> enclosing the whole operation, otherwise things might change once the lock
> is dropped and you're working with a wrong assumption.
>
> I can see arguments for both ripping it out and doing it right (but none for
> the way it is right now).
> For ripping it out, one could say that those races might not be fatal,
> as usually the pfn you're working with (the one you want to check falls
> within a certain range) is known to be valid, so the worst that can happen is
> you get false positives/negatives, and those might or might not be detected
> further down. How bad false positives/negatives are depends on the
> situation, I guess, but we already live with that right now.
> The zone_spans_pfn() from page_outside_zone_boundaries() is the only one using
> locking right now, so, well, if we survived this long without locks in the other
> places using zone_spans_pfn(), it makes one wonder if it is that bad.
>
> On the other hand, one could argue that for correctness sake, we should be holding
> zone's seqlock whenever checking for zone_spans_pfn() to avoid any inconsistency.
>
>

IMHO, as we know the race exists and we have a tool to handle it in
place, we should maybe fix the obvious cases if possible.

Code that uses zone->zone_start_pfn directly is unlikely to be broken on
most architectures. We will usually read/write via a single instruction
and won't get inconsistencies, for example, when shrinking or growing
the zone. We most probably don't want to use an atomic for that right now.

Code that uses zone->spanned_pages to detect the zone end, however, is
more likely to be broken. I don't think we have any relevant ones around
anymore. Everything was converted to zone_end_pfn().

I feel like we should just make zone_end_pfn() take the seqlock in read.
Then, we at least get a consistent value, for example, while growing a zone.

Just imagine the following case when we grow a section to the front when
onlining memory:

zone->zone_start_pfn -= new_pages;
zone->spanned_pages += new_pages;

Note that compilers/CPUs might reshuffle as they like. If someone (e.g.,
zone_spans_pfn()) races with that code, it might get new
zone->zone_start_pfn but old zone->spanned_pages. zone_end_pfn() will
report a "too small zone" and trigger false negatives in zone_spans_pfn().
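
A sketch of what I have in mind (untested; today zone_end_pfn() does the
addition with no locking at all):

        static inline unsigned long zone_end_pfn(const struct zone *zone)
        {
                unsigned seq;
                unsigned long end_pfn;

                do {
                        seq = zone_span_seqbegin(zone);
                        end_pfn = zone->zone_start_pfn + zone->spanned_pages;
                } while (zone_span_seqretry(zone, seq));

                return end_pfn;
        }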

--
Thanks,

David / dhildenb

2021-06-09 17:14:53

by David Hildenbrand

Subject: Re: [PATCH v2 1/3] mm,page_alloc: Use {get,put}_online_mems() to get stable zone's values

On 08.06.21 17:00, David Hildenbrand wrote:
> On 07.06.21 12:23, Oscar Salvador wrote:
>> On Mon, Jun 07, 2021 at 10:49:01AM +0200, David Hildenbrand wrote:
>>> I'd like to point out that I think the seqlock is not in place to
>>> synchronize with actual growing/shrinking but to get consistent zone ranges
>>> -- like using atomics, but we have two inter-dependent values here.
>>
>> I guess so, at least that's what it should do.
>> But the way it is placed right now is misleading.
>>
>> If we really want to get consistent zone ranges, we should start using
>> zone's seqlock where it matters and that is pretty much all those
>> places that use zone_spans_pfn().
>
> Right, or even only zone_end_pfn() to get a consistent value.
>
>> Otherwise there is no way you can be sure the pfn you're checking is
>> within the limits. Moreover, as Michal pointed out earlier, if we really
>> want to go down that road the locking should be done in the caller
>> enclosing the whole operation, otherwise things might change once the lock
>> is dropped and you're working with a wrong assumption.
>>
>> I can see arguments for both ripping it out and doing it right (but none for
>> the way it is right now).
>> For ripping it out, one could say that those races might not be fatal,
>> as usually the pfn you're working with (the one you want to check falls
>> within a certain range) is known to be valid, so the worst that can happen is
>> you get false positives/negatives, and those might or might not be detected
>> further down. How bad false positives/negatives are depends on the
>> situation, I guess, but we already live with that right now.
>> The zone_spans_pfn() from page_outside_zone_boundaries() is the only one using
>> locking right now, so, well, if we survived this long without locks in the other
>> places using zone_spans_pfn(), it makes one wonder if it is that bad.
>>
>> On the other hand, one could argue that for correctness sake, we should be holding
>> zone's seqlock whenever checking for zone_spans_pfn() to avoid any inconsistency.
>>
>>
>
> IMHO, as we know the race exists and we have a tool to handle it in
> place, we should maybe fix the obvious cases if possible.
>
> Code that uses zone->zone_start_pfn directly is unlikely to be broken on
> most architectures. We will usually read/write via a single instruction
> and won't get inconsistencies, for example, when shrinking or growing
> the zone. We most probably don't want to use an atomic for that right now.
>
> Code that uses zone->spanned_pages to detect the zone end, however, is
> more likely to be broken. I don't think we have any relevant ones around
> anymore. Everything was converted to zone_end_pfn().
>
> I feel like we should just make zone_end_pfn() take the seqlock in read.
> Then, we at least get a consistent value, for example, while growing a zone.
>
> Just imagine the following case when we grow a section to the front when
> onlining memory:
>
> zone->zone_start_pfn -= new_pages;
> zone->spanned_pages += new_pages;
>
> Note that compilers/CPUs might reshuffle as they like. If someone (e.g.,
> zone_spans_pfn()) races with that code, it might get new
> zone->zone_start_pfn but old zone->spanned_pages. zone_end_pfn() will
> report a "too small zone" and trigger false negatives in zone_spans_pfn().
>

Thinking again, we could of course also simply convert to storing
zone->zone_start_pfn + zone->zone_end_pfn. Places that need
spanned_pages() would have the same issue, but I think they are rather a
corner case.
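
I.e. something like this hypothetical sketch (field and helper names made
up for illustration):

        struct zone {
                ...
                unsigned long   zone_start_pfn;
                unsigned long   zone_end_pfn;   /* replaces spanned_pages */
                ...
        };

        /* the span becomes derived state */
        static inline unsigned long zone_spanned_pages(const struct zone *zone)
        {
                return zone->zone_end_pfn - zone->zone_start_pfn;
        }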

--
Thanks,

David / dhildenb