2021-10-04 10:09:44

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v2 0/5] mm/memory_hotplug: full support for add_memory_driver_managed() with CONFIG_ARCH_KEEP_MEMBLOCK

Architectures that require CONFIG_ARCH_KEEP_MEMBLOCK=y, such as arm64,
don't cleanly support add_memory_driver_managed() yet. Most prominently,
kexec_file can still end up placing kexec images on such driver-managed
memory, resulting in undesired behavior, for example, having kexec images
located on memory not part of the firmware-provided memory map.

Teaching kexec to not place images on driver-managed memory is especially
relevant for virtio-mem. Details can be found in commit 7b7b27214bba
("mm/memory_hotplug: introduce add_memory_driver_managed()").

Extend memblock with a new flag and set it from memory hotplug code
when applicable. This is required to fully support virtio-mem on
arm64, making also kexec_file behave like on x86-64.

v1 -> v2:
- "memblock: improve MEMBLOCK_HOTPLUG documentation"
-- Added
- "memblock: add MEMBLOCK_DRIVER_MANAGED to mimic
IORESOURCE_SYSRAM_DRIVER_MANAGED"
-- Improve documentation of MEMBLOCK_DRIVER_MANAGED
- Refine patch descriptions

Cc: Andrew Morton <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Jianyong Wu <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]

David Hildenbrand (5):
mm/memory_hotplug: handle memblock_add_node() failures in
add_memory_resource()
memblock: improve MEMBLOCK_HOTPLUG documentation
memblock: allow to specify flags with memblock_add_node()
memblock: add MEMBLOCK_DRIVER_MANAGED to mimic
IORESOURCE_SYSRAM_DRIVER_MANAGED
mm/memory_hotplug: indicate MEMBLOCK_DRIVER_MANAGED with
IORESOURCE_SYSRAM_DRIVER_MANAGED

arch/arc/mm/init.c | 4 ++--
arch/ia64/mm/contig.c | 2 +-
arch/ia64/mm/init.c | 2 +-
arch/m68k/mm/mcfmmu.c | 3 ++-
arch/m68k/mm/motorola.c | 6 ++++--
arch/mips/loongson64/init.c | 4 +++-
arch/mips/sgi-ip27/ip27-memory.c | 3 ++-
arch/s390/kernel/setup.c | 3 ++-
include/linux/memblock.h | 25 +++++++++++++++++++++----
include/linux/mm.h | 2 +-
kernel/kexec_file.c | 5 +++++
mm/memblock.c | 13 +++++++++----
mm/memory_hotplug.c | 11 +++++++++--
13 files changed, 62 insertions(+), 21 deletions(-)


base-commit: 9e1ff307c779ce1f0f810c7ecce3d95bbae40896
--
2.31.1


2021-10-04 10:09:53

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v2 3/5] memblock: allow to specify flags with memblock_add_node()

We want to specify flags when hotplugging memory. Let's prepare to pass
flags to memblock_add_node() by adjusting all existing users.

Note that when hotplugging memory the system is already up and running
and we might have concurrent memblock users: for example, while we're
hotplugging memory, kexec_file code might search for suitable memory
regions to place kexec images. It's important to add the memory directly
to memblock via a single call with the right flags, instead of adding the
memory first and apply flags later: otherwise, concurrent memblock users
might temporarily stumble over memblocks with wrong flags, which will be
important in a follow-up patch that introduces a new flag to properly
handle add_memory_driver_managed().

Acked-by: Geert Uytterhoeven <[email protected]>
Acked-by: Heiko Carstens <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
arch/arc/mm/init.c | 4 ++--
arch/ia64/mm/contig.c | 2 +-
arch/ia64/mm/init.c | 2 +-
arch/m68k/mm/mcfmmu.c | 3 ++-
arch/m68k/mm/motorola.c | 6 ++++--
arch/mips/loongson64/init.c | 4 +++-
arch/mips/sgi-ip27/ip27-memory.c | 3 ++-
arch/s390/kernel/setup.c | 3 ++-
include/linux/memblock.h | 3 ++-
include/linux/mm.h | 2 +-
mm/memblock.c | 9 +++++----
mm/memory_hotplug.c | 2 +-
12 files changed, 26 insertions(+), 17 deletions(-)

diff --git a/arch/arc/mm/init.c b/arch/arc/mm/init.c
index 699ecf119641..110eb69e9bee 100644
--- a/arch/arc/mm/init.c
+++ b/arch/arc/mm/init.c
@@ -59,13 +59,13 @@ void __init early_init_dt_add_memory_arch(u64 base, u64 size)

low_mem_sz = size;
in_use = 1;
- memblock_add_node(base, size, 0);
+ memblock_add_node(base, size, 0, MEMBLOCK_NONE);
} else {
#ifdef CONFIG_HIGHMEM
high_mem_start = base;
high_mem_sz = size;
in_use = 1;
- memblock_add_node(base, size, 1);
+ memblock_add_node(base, size, 1, MEMBLOCK_NONE);
memblock_reserve(base, size);
#endif
}
diff --git a/arch/ia64/mm/contig.c b/arch/ia64/mm/contig.c
index 42e025cfbd08..24901d809301 100644
--- a/arch/ia64/mm/contig.c
+++ b/arch/ia64/mm/contig.c
@@ -153,7 +153,7 @@ find_memory (void)
efi_memmap_walk(find_max_min_low_pfn, NULL);
max_pfn = max_low_pfn;

- memblock_add_node(0, PFN_PHYS(max_low_pfn), 0);
+ memblock_add_node(0, PFN_PHYS(max_low_pfn), 0, MEMBLOCK_NONE);

find_initrd();

diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index 5c6da8d83c1a..5d165607bf35 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -378,7 +378,7 @@ int __init register_active_ranges(u64 start, u64 len, int nid)
#endif

if (start < end)
- memblock_add_node(__pa(start), end - start, nid);
+ memblock_add_node(__pa(start), end - start, nid, MEMBLOCK_NONE);
return 0;
}

diff --git a/arch/m68k/mm/mcfmmu.c b/arch/m68k/mm/mcfmmu.c
index eac9dde65193..6f1f25125294 100644
--- a/arch/m68k/mm/mcfmmu.c
+++ b/arch/m68k/mm/mcfmmu.c
@@ -174,7 +174,8 @@ void __init cf_bootmem_alloc(void)
m68k_memory[0].addr = _rambase;
m68k_memory[0].size = _ramend - _rambase;

- memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0);
+ memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0,
+ MEMBLOCK_NONE);

/* compute total pages in system */
num_pages = PFN_DOWN(_ramend - _rambase);
diff --git a/arch/m68k/mm/motorola.c b/arch/m68k/mm/motorola.c
index 9f3f77785aa7..2b05bb2bac00 100644
--- a/arch/m68k/mm/motorola.c
+++ b/arch/m68k/mm/motorola.c
@@ -410,7 +410,8 @@ void __init paging_init(void)

min_addr = m68k_memory[0].addr;
max_addr = min_addr + m68k_memory[0].size;
- memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0);
+ memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0,
+ MEMBLOCK_NONE);
for (i = 1; i < m68k_num_memory;) {
if (m68k_memory[i].addr < min_addr) {
printk("Ignoring memory chunk at 0x%lx:0x%lx before the first chunk\n",
@@ -421,7 +422,8 @@ void __init paging_init(void)
(m68k_num_memory - i) * sizeof(struct m68k_mem_info));
continue;
}
- memblock_add_node(m68k_memory[i].addr, m68k_memory[i].size, i);
+ memblock_add_node(m68k_memory[i].addr, m68k_memory[i].size, i,
+ MEMBLOCK_NONE);
addr = m68k_memory[i].addr + m68k_memory[i].size;
if (addr > max_addr)
max_addr = addr;
diff --git a/arch/mips/loongson64/init.c b/arch/mips/loongson64/init.c
index 76e0a9636a0e..4ac5ba80bbf6 100644
--- a/arch/mips/loongson64/init.c
+++ b/arch/mips/loongson64/init.c
@@ -77,7 +77,9 @@ void __init szmem(unsigned int node)
(u32)node_id, mem_type, mem_start, mem_size);
pr_info(" start_pfn:0x%llx, end_pfn:0x%llx, num_physpages:0x%lx\n",
start_pfn, end_pfn, num_physpages);
- memblock_add_node(PFN_PHYS(start_pfn), PFN_PHYS(node_psize), node);
+ memblock_add_node(PFN_PHYS(start_pfn),
+ PFN_PHYS(node_psize), node,
+ MEMBLOCK_NONE);
break;
case SYSTEM_RAM_RESERVED:
pr_info("Node%d: mem_type:%d, mem_start:0x%llx, mem_size:0x%llx MB\n",
diff --git a/arch/mips/sgi-ip27/ip27-memory.c b/arch/mips/sgi-ip27/ip27-memory.c
index 6173684b5aaa..adc2faeecf7c 100644
--- a/arch/mips/sgi-ip27/ip27-memory.c
+++ b/arch/mips/sgi-ip27/ip27-memory.c
@@ -341,7 +341,8 @@ static void __init szmem(void)
continue;
}
memblock_add_node(PFN_PHYS(slot_getbasepfn(node, slot)),
- PFN_PHYS(slot_psize), node);
+ PFN_PHYS(slot_psize), node,
+ MEMBLOCK_NONE);
}
}
}
diff --git a/arch/s390/kernel/setup.c b/arch/s390/kernel/setup.c
index 67e5fff96ee0..f3943f15af6e 100644
--- a/arch/s390/kernel/setup.c
+++ b/arch/s390/kernel/setup.c
@@ -593,7 +593,8 @@ static void __init setup_resources(void)
* part of the System RAM resource.
*/
if (crashk_res.end) {
- memblock_add_node(crashk_res.start, resource_size(&crashk_res), 0);
+ memblock_add_node(crashk_res.start, resource_size(&crashk_res),
+ 0, MEMBLOCK_NONE);
memblock_reserve(crashk_res.start, resource_size(&crashk_res));
insert_resource(&iomem_resource, &crashk_res);
}
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 4ee8dd2d63a7..2bc726e43a1b 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -104,7 +104,8 @@ static inline void memblock_discard(void) {}
#endif

void memblock_allow_resize(void);
-int memblock_add_node(phys_addr_t base, phys_addr_t size, int nid);
+int memblock_add_node(phys_addr_t base, phys_addr_t size, int nid,
+ enum memblock_flags flags);
int memblock_add(phys_addr_t base, phys_addr_t size);
int memblock_remove(phys_addr_t base, phys_addr_t size);
int memblock_free(phys_addr_t base, phys_addr_t size);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 73a52aba448f..0117cb35b212 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2447,7 +2447,7 @@ static inline unsigned long get_num_physpages(void)
* unsigned long max_zone_pfns[MAX_NR_ZONES] = {max_dma, max_normal_pfn,
* max_highmem_pfn};
* for_each_valid_physical_page_range()
- * memblock_add_node(base, size, nid)
+ * memblock_add_node(base, size, nid, MEMBLOCK_NONE)
* free_area_init(max_zone_pfns);
*/
void free_area_init(unsigned long *max_zone_pfn);
diff --git a/mm/memblock.c b/mm/memblock.c
index 184dcd2e5d99..47a56b223141 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -655,6 +655,7 @@ static int __init_memblock memblock_add_range(struct memblock_type *type,
* @base: base address of the new region
* @size: size of the new region
* @nid: nid of the new region
+ * @flags: flags of the new region
*
* Add new memblock region [@base, @base + @size) to the "memory"
* type. See memblock_add_range() description for mode details
@@ -663,14 +664,14 @@ static int __init_memblock memblock_add_range(struct memblock_type *type,
* 0 on success, -errno on failure.
*/
int __init_memblock memblock_add_node(phys_addr_t base, phys_addr_t size,
- int nid)
+ int nid, enum memblock_flags flags)
{
phys_addr_t end = base + size - 1;

- memblock_dbg("%s: [%pa-%pa] nid=%d %pS\n", __func__,
- &base, &end, nid, (void *)_RET_IP_);
+ memblock_dbg("%s: [%pa-%pa] nid=%d flags=%x %pS\n", __func__,
+ &base, &end, nid, flags, (void *)_RET_IP_);

- return memblock_add_range(&memblock.memory, base, size, nid, 0);
+ return memblock_add_range(&memblock.memory, base, size, nid, flags);
}

/**
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 917b3528636d..5f873e7f5b29 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1385,7 +1385,7 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
mem_hotplug_begin();

if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
- ret = memblock_add_node(start, size, nid);
+ ret = memblock_add_node(start, size, nid, MEMBLOCK_NONE);
if (ret)
goto error_mem_hotplug_end;
}
--
2.31.1

2021-10-04 10:10:49

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v2 4/5] memblock: add MEMBLOCK_DRIVER_MANAGED to mimic IORESOURCE_SYSRAM_DRIVER_MANAGED

Let's add a flag that corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED,
indicating that we're dealing with a memory region that is never
indicated in the firmware-provided memory map, but always detected and
added by a driver.

Similar to MEMBLOCK_HOTPLUG, most infrastructure has to treat such memory
regions like ordinary MEMBLOCK_NONE memory regions -- for example, when
selecting memory regions to add to the vmcore for dumping in the
crashkernel via for_each_mem_range().

However, especially kexec_file is not supposed to select such memblocks via
for_each_free_mem_range() / for_each_free_mem_range_reverse() to place
kexec images, similar to how we handle IORESOURCE_SYSRAM_DRIVER_MANAGED
without CONFIG_ARCH_KEEP_MEMBLOCK.

We'll make sure that memory hotplug code sets the flag where applicable
(IORESOURCE_SYSRAM_DRIVER_MANAGED) next. This prepares architectures
that need CONFIG_ARCH_KEEP_MEMBLOCK, such as arm64, for virtio-mem
support.

Note that kexec *must not* indicate this memory to the second kernel
and *must not* place kexec-images on this memory. Let's add a comment to
kexec_walk_memblock(), documenting how we handle MEMBLOCK_DRIVER_MANAGED
now just like using IORESOURCE_SYSRAM_DRIVER_MANAGED in
locate_mem_hole_callback() for kexec_walk_resources().

Also note that MEMBLOCK_HOTPLUG cannot be reused due to different
semantics:
MEMBLOCK_HOTPLUG: memory is indicated as "System RAM" in the
firmware-provided memory map and added to the system early during
boot; kexec *has to* indicate this memory to the second kernel and
can place kexec-images on this memory. After memory hotunplug,
kexec has to be re-armed. We mostly ignore this flag when
"movable_node" is not set on the kernel command line, because
then we're told to not care about hotunpluggability of such
memory regions.

MEMBLOCK_DRIVER_MANAGED: memory is not indicated as "System RAM" in
the firmware-provided memory map; this memory is always detected
and added to the system by a driver; memory might not actually be
physically hotunpluggable. kexec *must not* indicate this memory to
the second kernel and *must not* place kexec-images on this memory.

Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/memblock.h | 16 ++++++++++++++--
kernel/kexec_file.c | 5 +++++
mm/memblock.c | 4 ++++
3 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 2bc726e43a1b..b3b29ccf91f3 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -37,12 +37,17 @@ extern unsigned long long max_possible_pfn;
* @MEMBLOCK_NOMAP: don't add to kernel direct mapping and treat as
* reserved in the memory map; refer to memblock_mark_nomap() description
* for further details
+ * @MEMBLOCK_DRIVER_MANAGED: memory region that is always detected and added
+ * via a driver, and never indicated in the firmware-provided memory map as
+ * system RAM. This corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED in the
+ * kernel resource tree.
*/
enum memblock_flags {
MEMBLOCK_NONE = 0x0, /* No special request */
MEMBLOCK_HOTPLUG = 0x1, /* hotpluggable region */
MEMBLOCK_MIRROR = 0x2, /* mirrored region */
MEMBLOCK_NOMAP = 0x4, /* don't add to kernel direct mapping */
+ MEMBLOCK_DRIVER_MANAGED = 0x8, /* always detected via a driver */
};

/**
@@ -213,7 +218,8 @@ static inline void __next_physmem_range(u64 *idx, struct memblock_type *type,
*/
#define for_each_mem_range(i, p_start, p_end) \
__for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE, \
- MEMBLOCK_HOTPLUG, p_start, p_end, NULL)
+ MEMBLOCK_HOTPLUG | MEMBLOCK_DRIVER_MANAGED, \
+ p_start, p_end, NULL)

/**
* for_each_mem_range_rev - reverse iterate through memblock areas from
@@ -224,7 +230,8 @@ static inline void __next_physmem_range(u64 *idx, struct memblock_type *type,
*/
#define for_each_mem_range_rev(i, p_start, p_end) \
__for_each_mem_range_rev(i, &memblock.memory, NULL, NUMA_NO_NODE, \
- MEMBLOCK_HOTPLUG, p_start, p_end, NULL)
+ MEMBLOCK_HOTPLUG | MEMBLOCK_DRIVER_MANAGED,\
+ p_start, p_end, NULL)

/**
* for_each_reserved_mem_range - iterate over all reserved memblock areas
@@ -254,6 +261,11 @@ static inline bool memblock_is_nomap(struct memblock_region *m)
return m->flags & MEMBLOCK_NOMAP;
}

+static inline bool memblock_is_driver_managed(struct memblock_region *m)
+{
+ return m->flags & MEMBLOCK_DRIVER_MANAGED;
+}
+
int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
unsigned long *end_pfn);
void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 33400ff051a8..8347fc158d2b 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -556,6 +556,11 @@ static int kexec_walk_memblock(struct kexec_buf *kbuf,
if (kbuf->image->type == KEXEC_TYPE_CRASH)
return func(&crashk_res, kbuf);

+ /*
+ * Using MEMBLOCK_NONE will properly skip MEMBLOCK_DRIVER_MANAGED. See
+ * IORESOURCE_SYSRAM_DRIVER_MANAGED handling in
+ * locate_mem_hole_callback().
+ */
if (kbuf->top_down) {
for_each_free_mem_range_reverse(i, NUMA_NO_NODE, MEMBLOCK_NONE,
&mstart, &mend, NULL) {
diff --git a/mm/memblock.c b/mm/memblock.c
index 47a56b223141..540a35317fb0 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -979,6 +979,10 @@ static bool should_skip_region(struct memblock_type *type,
if (!(flags & MEMBLOCK_NOMAP) && memblock_is_nomap(m))
return true;

+ /* skip driver-managed memory unless we were asked for it explicitly */
+ if (!(flags & MEMBLOCK_DRIVER_MANAGED) && memblock_is_driver_managed(m))
+ return true;
+
return false;
}

--
2.31.1

2021-10-04 10:28:09

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v2 2/5] memblock: improve MEMBLOCK_HOTPLUG documentation

The description of MEMBLOCK_HOTPLUG is currently short and consequently
misleading: we're actually dealing with a memory region that might get
hotunplugged later (i.e., the platform+firmware supports it), yet it is
indicated in the firmware-provided memory map as system ram that will just
get used by the system for any purpose when not taking special care. The
firmware marked this memory region as a hot(un)plugged (e.g., hotplugged
before reboot), implying that it might get hotunplugged again later.

Whether we consider this information depends on the "movable_node" kernel
commandline parameter: only with "movable_node" set, we'll try keeping
this memory hotunpluggable, for example, by not serving early allocations
from this memory region and by letting the buddy manage it using the
ZONE_MOVABLE.

Let's make this clearer by extending the documentation.

Note: kexec *has to* indicate this memory to the second kernel. With
"movable_node" set, we don't want to place kexec-images on this memory.
Without "movable_node" set, we don't care and can place kexec-images on
this memory. In both cases, after successful memory hotunplug, kexec has to
be re-armed to update the memory map for the second kernel and to place the
kexec-images somewhere else.

Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/memblock.h | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 34de69b3b8ba..4ee8dd2d63a7 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -28,7 +28,11 @@ extern unsigned long long max_possible_pfn;
/**
* enum memblock_flags - definition of memory region attributes
* @MEMBLOCK_NONE: no special request
- * @MEMBLOCK_HOTPLUG: hotpluggable region
+ * @MEMBLOCK_HOTPLUG: memory region indicated in the firmware-provided memory
+ * map during early boot as hot(un)pluggable system RAM (e.g., memory range
+ * that might get hotunplugged later). With "movable_node" set on the kernel
+ * commandline, try keeping this memory region hotunpluggable. Does not apply
+ * to memblocks added ("hotplugged") after early boot.
* @MEMBLOCK_MIRROR: mirrored region
* @MEMBLOCK_NOMAP: don't add to kernel direct mapping and treat as
* reserved in the memory map; refer to memblock_mark_nomap() description
--
2.31.1

2021-10-04 19:33:09

by Shahab Vahedi

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] memblock: allow to specify flags with memblock_add_node()

On 10/4/21 11:36 AM, David Hildenbrand wrote:
> We want to specify flags when hotplugging memory. Let's prepare to pass
> flags to memblock_add_node() by adjusting all existing users.
>
> Note that when hotplugging memory the system is already up and running
> and we might have concurrent memblock users: for example, while we're
> hotplugging memory, kexec_file code might search for suitable memory
> regions to place kexec images. It's important to add the memory directly
> to memblock via a single call with the right flags, instead of adding the
> memory first and apply flags later: otherwise, concurrent memblock users
> might temporarily stumble over memblocks with wrong flags, which will be
> important in a follow-up patch that introduces a new flag to properly
> handle add_memory_driver_managed().
>
> Acked-by: Geert Uytterhoeven <[email protected]>
> Acked-by: Heiko Carstens <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> arch/arc/mm/init.c | 4 ++--
> arch/ia64/mm/contig.c | 2 +-
> arch/ia64/mm/init.c | 2 +-
> arch/m68k/mm/mcfmmu.c | 3 ++-
> arch/m68k/mm/motorola.c | 6 ++++--
> arch/mips/loongson64/init.c | 4 +++-
> arch/mips/sgi-ip27/ip27-memory.c | 3 ++-
> arch/s390/kernel/setup.c | 3 ++-
> include/linux/memblock.h | 3 ++-
> include/linux/mm.h | 2 +-
> mm/memblock.c | 9 +++++----
> mm/memory_hotplug.c | 2 +-
> 12 files changed, 26 insertions(+), 17 deletions(-)
>
> diff --git a/arch/arc/mm/init.c b/arch/arc/mm/init.c
> index 699ecf119641..110eb69e9bee 100644
> --- a/arch/arc/mm/init.c
> +++ b/arch/arc/mm/init.c
> @@ -59,13 +59,13 @@ void __init early_init_dt_add_memory_arch(u64 base, u64 size)
>
> low_mem_sz = size;
> in_use = 1;
> - memblock_add_node(base, size, 0);
> + memblock_add_node(base, size, 0, MEMBLOCK_NONE);
> } else {
> #ifdef CONFIG_HIGHMEM
> high_mem_start = base;
> high_mem_sz = size;
> in_use = 1;
> - memblock_add_node(base, size, 1);
> + memblock_add_node(base, size, 1, MEMBLOCK_NONE);
> memblock_reserve(base, size);
> #endif

arch/arc part: Acked-by: Shahab Vahedi <[email protected]>

--
Shahab

2021-10-06 00:36:31

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v2 4/5] memblock: add MEMBLOCK_DRIVER_MANAGED to mimic IORESOURCE_SYSRAM_DRIVER_MANAGED

On Mon, Oct 04, 2021 at 11:36:04AM +0200, David Hildenbrand wrote:
> Let's add a flag that corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED,
> indicating that we're dealing with a memory region that is never
> indicated in the firmware-provided memory map, but always detected and
> added by a driver.
>
> Similar to MEMBLOCK_HOTPLUG, most infrastructure has to treat such memory
> regions like ordinary MEMBLOCK_NONE memory regions -- for example, when
> selecting memory regions to add to the vmcore for dumping in the
> crashkernel via for_each_mem_range().
>
> However, especially kexec_file is not supposed to select such memblocks via
> for_each_free_mem_range() / for_each_free_mem_range_reverse() to place
> kexec images, similar to how we handle IORESOURCE_SYSRAM_DRIVER_MANAGED
> without CONFIG_ARCH_KEEP_MEMBLOCK.
>
> We'll make sure that memory hotplug code sets the flag where applicable
> (IORESOURCE_SYSRAM_DRIVER_MANAGED) next. This prepares architectures
> that need CONFIG_ARCH_KEEP_MEMBLOCK, such as arm64, for virtio-mem
> support.
>
> Note that kexec *must not* indicate this memory to the second kernel
> and *must not* place kexec-images on this memory. Let's add a comment to
> kexec_walk_memblock(), documenting how we handle MEMBLOCK_DRIVER_MANAGED
> now just like using IORESOURCE_SYSRAM_DRIVER_MANAGED in
> locate_mem_hole_callback() for kexec_walk_resources().
>
> Also note that MEMBLOCK_HOTPLUG cannot be reused due to different
> semantics:
> MEMBLOCK_HOTPLUG: memory is indicated as "System RAM" in the
> firmware-provided memory map and added to the system early during
> boot; kexec *has to* indicate this memory to the second kernel and
> can place kexec-images on this memory. After memory hotunplug,
> kexec has to be re-armed. We mostly ignore this flag when
> "movable_node" is not set on the kernel command line, because
> then we're told to not care about hotunpluggability of such
> memory regions.
>
> MEMBLOCK_DRIVER_MANAGED: memory is not indicated as "System RAM" in
> the firmware-provided memory map; this memory is always detected
> and added to the system by a driver; memory might not actually be
> physically hotunpluggable. kexec *must not* indicate this memory to
> the second kernel and *must not* place kexec-images on this memory.
>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Mike Rapoport <[email protected]>

> ---
> include/linux/memblock.h | 16 ++++++++++++++--
> kernel/kexec_file.c | 5 +++++
> mm/memblock.c | 4 ++++
> 3 files changed, 23 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index 2bc726e43a1b..b3b29ccf91f3 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -37,12 +37,17 @@ extern unsigned long long max_possible_pfn;
> * @MEMBLOCK_NOMAP: don't add to kernel direct mapping and treat as
> * reserved in the memory map; refer to memblock_mark_nomap() description
> * for further details
> + * @MEMBLOCK_DRIVER_MANAGED: memory region that is always detected and added
> + * via a driver, and never indicated in the firmware-provided memory map as
> + * system RAM. This corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED in the
> + * kernel resource tree.
> */
> enum memblock_flags {
> MEMBLOCK_NONE = 0x0, /* No special request */
> MEMBLOCK_HOTPLUG = 0x1, /* hotpluggable region */
> MEMBLOCK_MIRROR = 0x2, /* mirrored region */
> MEMBLOCK_NOMAP = 0x4, /* don't add to kernel direct mapping */
> + MEMBLOCK_DRIVER_MANAGED = 0x8, /* always detected via a driver */
> };
>
> /**
> @@ -213,7 +218,8 @@ static inline void __next_physmem_range(u64 *idx, struct memblock_type *type,
> */
> #define for_each_mem_range(i, p_start, p_end) \
> __for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE, \
> - MEMBLOCK_HOTPLUG, p_start, p_end, NULL)
> + MEMBLOCK_HOTPLUG | MEMBLOCK_DRIVER_MANAGED, \
> + p_start, p_end, NULL)
>
> /**
> * for_each_mem_range_rev - reverse iterate through memblock areas from
> @@ -224,7 +230,8 @@ static inline void __next_physmem_range(u64 *idx, struct memblock_type *type,
> */
> #define for_each_mem_range_rev(i, p_start, p_end) \
> __for_each_mem_range_rev(i, &memblock.memory, NULL, NUMA_NO_NODE, \
> - MEMBLOCK_HOTPLUG, p_start, p_end, NULL)
> + MEMBLOCK_HOTPLUG | MEMBLOCK_DRIVER_MANAGED,\
> + p_start, p_end, NULL)
>
> /**
> * for_each_reserved_mem_range - iterate over all reserved memblock areas
> @@ -254,6 +261,11 @@ static inline bool memblock_is_nomap(struct memblock_region *m)
> return m->flags & MEMBLOCK_NOMAP;
> }
>
> +static inline bool memblock_is_driver_managed(struct memblock_region *m)
> +{
> + return m->flags & MEMBLOCK_DRIVER_MANAGED;
> +}
> +
> int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
> unsigned long *end_pfn);
> void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index 33400ff051a8..8347fc158d2b 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -556,6 +556,11 @@ static int kexec_walk_memblock(struct kexec_buf *kbuf,
> if (kbuf->image->type == KEXEC_TYPE_CRASH)
> return func(&crashk_res, kbuf);
>
> + /*
> + * Using MEMBLOCK_NONE will properly skip MEMBLOCK_DRIVER_MANAGED. See
> + * IORESOURCE_SYSRAM_DRIVER_MANAGED handling in
> + * locate_mem_hole_callback().
> + */
> if (kbuf->top_down) {
> for_each_free_mem_range_reverse(i, NUMA_NO_NODE, MEMBLOCK_NONE,
> &mstart, &mend, NULL) {
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 47a56b223141..540a35317fb0 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -979,6 +979,10 @@ static bool should_skip_region(struct memblock_type *type,
> if (!(flags & MEMBLOCK_NOMAP) && memblock_is_nomap(m))
> return true;
>
> + /* skip driver-managed memory unless we were asked for it explicitly */
> + if (!(flags & MEMBLOCK_DRIVER_MANAGED) && memblock_is_driver_managed(m))
> + return true;
> +
> return false;
> }
>
> --
> 2.31.1
>

--
Sincerely yours,
Mike.

2021-10-06 00:38:37

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] memblock: improve MEMBLOCK_HOTPLUG documentation

On Mon, Oct 04, 2021 at 11:36:02AM +0200, David Hildenbrand wrote:
> The description of MEMBLOCK_HOTPLUG is currently short and consequently
> misleading: we're actually dealing with a memory region that might get
> hotunplugged later (i.e., the platform+firmware supports it), yet it is
> indicated in the firmware-provided memory map as system ram that will just
> get used by the system for any purpose when not taking special care. The
> firmware marked this memory region as a hot(un)plugged (e.g., hotplugged
> before reboot), implying that it might get hotunplugged again later.
>
> Whether we consider this information depends on the "movable_node" kernel
> commandline parameter: only with "movable_node" set, we'll try keeping
> this memory hotunpluggable, for example, by not serving early allocations
> from this memory region and by letting the buddy manage it using the
> ZONE_MOVABLE.
>
> Let's make this clearer by extending the documentation.
>
> Note: kexec *has to* indicate this memory to the second kernel. With
> "movable_node" set, we don't want to place kexec-images on this memory.
> Without "movable_node" set, we don't care and can place kexec-images on
> this memory. In both cases, after successful memory hotunplug, kexec has to
> be re-armed to update the memory map for the second kernel and to place the
> kexec-images somewhere else.
>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Mike Rapoport <[email protected]>

> ---
> include/linux/memblock.h | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index 34de69b3b8ba..4ee8dd2d63a7 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -28,7 +28,11 @@ extern unsigned long long max_possible_pfn;
> /**
> * enum memblock_flags - definition of memory region attributes
> * @MEMBLOCK_NONE: no special request
> - * @MEMBLOCK_HOTPLUG: hotpluggable region
> + * @MEMBLOCK_HOTPLUG: memory region indicated in the firmware-provided memory
> + * map during early boot as hot(un)pluggable system RAM (e.g., memory range
> + * that might get hotunplugged later). With "movable_node" set on the kernel
> + * commandline, try keeping this memory region hotunpluggable. Does not apply
> + * to memblocks added ("hotplugged") after early boot.
> * @MEMBLOCK_MIRROR: mirrored region
> * @MEMBLOCK_NOMAP: don't add to kernel direct mapping and treat as
> * reserved in the memory map; refer to memblock_mark_nomap() description
> --
> 2.31.1
>

--
Sincerely yours,
Mike.

2021-10-06 00:38:38

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] memblock: allow to specify flags with memblock_add_node()

On Mon, Oct 04, 2021 at 11:36:03AM +0200, David Hildenbrand wrote:
> We want to specify flags when hotplugging memory. Let's prepare to pass
> flags to memblock_add_node() by adjusting all existing users.
>
> Note that when hotplugging memory the system is already up and running
> and we might have concurrent memblock users: for example, while we're
> hotplugging memory, kexec_file code might search for suitable memory
> regions to place kexec images. It's important to add the memory directly
> to memblock via a single call with the right flags, instead of adding the
> memory first and apply flags later: otherwise, concurrent memblock users
> might temporarily stumble over memblocks with wrong flags, which will be
> important in a follow-up patch that introduces a new flag to properly
> handle add_memory_driver_managed().
>
> Acked-by: Geert Uytterhoeven <[email protected]>
> Acked-by: Heiko Carstens <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Mike Rapoport <[email protected]>

> ---
> arch/arc/mm/init.c | 4 ++--
> arch/ia64/mm/contig.c | 2 +-
> arch/ia64/mm/init.c | 2 +-
> arch/m68k/mm/mcfmmu.c | 3 ++-
> arch/m68k/mm/motorola.c | 6 ++++--
> arch/mips/loongson64/init.c | 4 +++-
> arch/mips/sgi-ip27/ip27-memory.c | 3 ++-
> arch/s390/kernel/setup.c | 3 ++-
> include/linux/memblock.h | 3 ++-
> include/linux/mm.h | 2 +-
> mm/memblock.c | 9 +++++----
> mm/memory_hotplug.c | 2 +-
> 12 files changed, 26 insertions(+), 17 deletions(-)
>
> diff --git a/arch/arc/mm/init.c b/arch/arc/mm/init.c
> index 699ecf119641..110eb69e9bee 100644
> --- a/arch/arc/mm/init.c
> +++ b/arch/arc/mm/init.c
> @@ -59,13 +59,13 @@ void __init early_init_dt_add_memory_arch(u64 base, u64 size)
>
> low_mem_sz = size;
> in_use = 1;
> - memblock_add_node(base, size, 0);
> + memblock_add_node(base, size, 0, MEMBLOCK_NONE);
> } else {
> #ifdef CONFIG_HIGHMEM
> high_mem_start = base;
> high_mem_sz = size;
> in_use = 1;
> - memblock_add_node(base, size, 1);
> + memblock_add_node(base, size, 1, MEMBLOCK_NONE);
> memblock_reserve(base, size);
> #endif
> }
> diff --git a/arch/ia64/mm/contig.c b/arch/ia64/mm/contig.c
> index 42e025cfbd08..24901d809301 100644
> --- a/arch/ia64/mm/contig.c
> +++ b/arch/ia64/mm/contig.c
> @@ -153,7 +153,7 @@ find_memory (void)
> efi_memmap_walk(find_max_min_low_pfn, NULL);
> max_pfn = max_low_pfn;
>
> - memblock_add_node(0, PFN_PHYS(max_low_pfn), 0);
> + memblock_add_node(0, PFN_PHYS(max_low_pfn), 0, MEMBLOCK_NONE);
>
> find_initrd();
>
> diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
> index 5c6da8d83c1a..5d165607bf35 100644
> --- a/arch/ia64/mm/init.c
> +++ b/arch/ia64/mm/init.c
> @@ -378,7 +378,7 @@ int __init register_active_ranges(u64 start, u64 len, int nid)
> #endif
>
> if (start < end)
> - memblock_add_node(__pa(start), end - start, nid);
> + memblock_add_node(__pa(start), end - start, nid, MEMBLOCK_NONE);
> return 0;
> }
>
> diff --git a/arch/m68k/mm/mcfmmu.c b/arch/m68k/mm/mcfmmu.c
> index eac9dde65193..6f1f25125294 100644
> --- a/arch/m68k/mm/mcfmmu.c
> +++ b/arch/m68k/mm/mcfmmu.c
> @@ -174,7 +174,8 @@ void __init cf_bootmem_alloc(void)
> m68k_memory[0].addr = _rambase;
> m68k_memory[0].size = _ramend - _rambase;
>
> - memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0);
> + memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0,
> + MEMBLOCK_NONE);
>
> /* compute total pages in system */
> num_pages = PFN_DOWN(_ramend - _rambase);
> diff --git a/arch/m68k/mm/motorola.c b/arch/m68k/mm/motorola.c
> index 9f3f77785aa7..2b05bb2bac00 100644
> --- a/arch/m68k/mm/motorola.c
> +++ b/arch/m68k/mm/motorola.c
> @@ -410,7 +410,8 @@ void __init paging_init(void)
>
> min_addr = m68k_memory[0].addr;
> max_addr = min_addr + m68k_memory[0].size;
> - memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0);
> + memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0,
> + MEMBLOCK_NONE);
> for (i = 1; i < m68k_num_memory;) {
> if (m68k_memory[i].addr < min_addr) {
> printk("Ignoring memory chunk at 0x%lx:0x%lx before the first chunk\n",
> @@ -421,7 +422,8 @@ void __init paging_init(void)
> (m68k_num_memory - i) * sizeof(struct m68k_mem_info));
> continue;
> }
> - memblock_add_node(m68k_memory[i].addr, m68k_memory[i].size, i);
> + memblock_add_node(m68k_memory[i].addr, m68k_memory[i].size, i,
> + MEMBLOCK_NONE);
> addr = m68k_memory[i].addr + m68k_memory[i].size;
> if (addr > max_addr)
> max_addr = addr;
> diff --git a/arch/mips/loongson64/init.c b/arch/mips/loongson64/init.c
> index 76e0a9636a0e..4ac5ba80bbf6 100644
> --- a/arch/mips/loongson64/init.c
> +++ b/arch/mips/loongson64/init.c
> @@ -77,7 +77,9 @@ void __init szmem(unsigned int node)
> (u32)node_id, mem_type, mem_start, mem_size);
> pr_info(" start_pfn:0x%llx, end_pfn:0x%llx, num_physpages:0x%lx\n",
> start_pfn, end_pfn, num_physpages);
> - memblock_add_node(PFN_PHYS(start_pfn), PFN_PHYS(node_psize), node);
> + memblock_add_node(PFN_PHYS(start_pfn),
> + PFN_PHYS(node_psize), node,
> + MEMBLOCK_NONE);
> break;
> case SYSTEM_RAM_RESERVED:
> pr_info("Node%d: mem_type:%d, mem_start:0x%llx, mem_size:0x%llx MB\n",
> diff --git a/arch/mips/sgi-ip27/ip27-memory.c b/arch/mips/sgi-ip27/ip27-memory.c
> index 6173684b5aaa..adc2faeecf7c 100644
> --- a/arch/mips/sgi-ip27/ip27-memory.c
> +++ b/arch/mips/sgi-ip27/ip27-memory.c
> @@ -341,7 +341,8 @@ static void __init szmem(void)
> continue;
> }
> memblock_add_node(PFN_PHYS(slot_getbasepfn(node, slot)),
> - PFN_PHYS(slot_psize), node);
> + PFN_PHYS(slot_psize), node,
> + MEMBLOCK_NONE);
> }
> }
> }
> diff --git a/arch/s390/kernel/setup.c b/arch/s390/kernel/setup.c
> index 67e5fff96ee0..f3943f15af6e 100644
> --- a/arch/s390/kernel/setup.c
> +++ b/arch/s390/kernel/setup.c
> @@ -593,7 +593,8 @@ static void __init setup_resources(void)
> * part of the System RAM resource.
> */
> if (crashk_res.end) {
> - memblock_add_node(crashk_res.start, resource_size(&crashk_res), 0);
> + memblock_add_node(crashk_res.start, resource_size(&crashk_res),
> + 0, MEMBLOCK_NONE);
> memblock_reserve(crashk_res.start, resource_size(&crashk_res));
> insert_resource(&iomem_resource, &crashk_res);
> }
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index 4ee8dd2d63a7..2bc726e43a1b 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -104,7 +104,8 @@ static inline void memblock_discard(void) {}
> #endif
>
> void memblock_allow_resize(void);
> -int memblock_add_node(phys_addr_t base, phys_addr_t size, int nid);
> +int memblock_add_node(phys_addr_t base, phys_addr_t size, int nid,
> + enum memblock_flags flags);
> int memblock_add(phys_addr_t base, phys_addr_t size);
> int memblock_remove(phys_addr_t base, phys_addr_t size);
> int memblock_free(phys_addr_t base, phys_addr_t size);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 73a52aba448f..0117cb35b212 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2447,7 +2447,7 @@ static inline unsigned long get_num_physpages(void)
> * unsigned long max_zone_pfns[MAX_NR_ZONES] = {max_dma, max_normal_pfn,
> * max_highmem_pfn};
> * for_each_valid_physical_page_range()
> - * memblock_add_node(base, size, nid)
> + * memblock_add_node(base, size, nid, MEMBLOCK_NONE)
> * free_area_init(max_zone_pfns);
> */
> void free_area_init(unsigned long *max_zone_pfn);
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 184dcd2e5d99..47a56b223141 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -655,6 +655,7 @@ static int __init_memblock memblock_add_range(struct memblock_type *type,
> * @base: base address of the new region
> * @size: size of the new region
> * @nid: nid of the new region
> + * @flags: flags of the new region
> *
> * Add new memblock region [@base, @base + @size) to the "memory"
> * type. See memblock_add_range() description for mode details
> @@ -663,14 +664,14 @@ static int __init_memblock memblock_add_range(struct memblock_type *type,
> * 0 on success, -errno on failure.
> */
> int __init_memblock memblock_add_node(phys_addr_t base, phys_addr_t size,
> - int nid)
> + int nid, enum memblock_flags flags)
> {
> phys_addr_t end = base + size - 1;
>
> - memblock_dbg("%s: [%pa-%pa] nid=%d %pS\n", __func__,
> - &base, &end, nid, (void *)_RET_IP_);
> + memblock_dbg("%s: [%pa-%pa] nid=%d flags=%x %pS\n", __func__,
> + &base, &end, nid, flags, (void *)_RET_IP_);
>
> - return memblock_add_range(&memblock.memory, base, size, nid, 0);
> + return memblock_add_range(&memblock.memory, base, size, nid, flags);
> }
>
> /**
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 917b3528636d..5f873e7f5b29 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1385,7 +1385,7 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
> mem_hotplug_begin();
>
> if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
> - ret = memblock_add_node(start, size, nid);
> + ret = memblock_add_node(start, size, nid, MEMBLOCK_NONE);
> if (ret)
> goto error_mem_hotplug_end;
> }
> --
> 2.31.1
>

--
Sincerely yours,
Mike.