Hi,
this series aims at improving in-kernel auto-online support. It tackles the
fundamental problems that:
1) We can create zone imbalances when onlining all memory blindly to
ZONE_MOVABLE, in the worst case crashing the system. We have to know
upfront how much memory we are going to hotplug such that we can
safely enable auto-onlining of all hotplugged memory to ZONE_MOVABLE
via "online_movable". This is far from practical and only applicable in
limited setups -- like inside VMs under the RHV/oVirt hypervisor which
will never hotplug more than 3 times the boot memory (and the
limitation is only in place due to the Linux limitation).
2) We see more setups that implement dynamic VM resizing, hot(un)plugging
memory to resize VM memory. In these setups, we might hotplug a lot of
memory, but it might happen in various small steps in both directions
(e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...). virtio-mem is the
primary driver of this upstream right now, performing such dynamic
resizing NUMA-aware via multiple virtio-mem devices.
Onlining all hotplugged memory to ZONE_NORMAL means we basically have
no hotunplug guarantees. Onlining all to ZONE_MOVABLE means we can
easily run into zone imbalances when growing a VM. We want a mixture,
and we want as much memory as reasonable/configured in ZONE_MOVABLE.
3) Memory devices consist of 1..X memory block devices, however, the
kernel doesn't really track the relationship. Consequently, also user
space has no idea. We want to make per-device decisions. As one
example, for memory hotunplug it doesn't make sense to use a mixture of
zones within a single DIMM: we want all MOVABLE if possible, otherwise
all !MOVABLE, because any !MOVABLE part will easily block the DIMM from
getting hotunplugged. As another example, virtio-mem operates on
individual units that span 1..X memory blocks. Similar to a DIMM, we
want a unit to either be all MOVABLE or !MOVABLE. Further, we want
as much memory of a virtio-mem device to be MOVABLE as possible.
4) We want memory onlining to be done right from the kernel while adding
memory; for example, this is reqired for fast memory hotplug for
drivers that add individual memory blocks, like virito-mem. We want a
way to configure a policy in the kernel and avoid implementing advanced
policies in user space.
The auto-onlining support we have in the kernel is not sufficient. All we
have is a) online everything movable (online_movable) b) online everything
!movable (online_kernel) c) keep zones contiguous (online). This series
allows configuring c) to mean instead "online movable if possible according
to the coniguration, driven by a maximum MOVABLE:KERNEL ratio" -- a new
onlining policy.
This series does 3 things:
1) Introduces the "auto-movable" online policy that initially operates on
individual memory blocks only. It uses a maximum MOVABLE:KERNEL ratio
to make a decision whether a memory block will be onlined to
ZONE_MOVABLE or not. However, in the basic form, hotplugged KERNEL
memory does not allow for more MOVABLE memory (details in the
patches). CMA memory is treated like MOVABLE memory.
2) Introduces static (e.g., DIMM) and dynamic (e.g., virtio-mem) memory
groups and uses group information to make decisions in the
"auto-movable" online policy accross memory blocks of a single memory
device (modeled as memory group).
3) Maximizes ZONE_MOVABLE memory within dynamic memory groups, by
allowing ZONE_NORMAL memory within a dynamic memory group to allow for
more ZONE_MOVABLE memory within the same memory group. The target use
case is dynamic VM resizing using virtio-mem.
I remember that the basic idea of using a ratio to implement a policy in
the kernel was once mentioned by Vitaly Kuznetsov, but I might be wrong
(I lost the pointer to that discussion).
For me, the main use case is using it along with virtio-mem (and
DIMMs / ppc64 dlpar where necessary) for dynamic resizing of VMs,
increasing the amount of memory we can hotunplug reliably again if we
might eventually hotplug a lot of memory to a VM.
The target usage will be:
1) Linux boots with "mhp_default_online_type=offline"
2) User space (e.g., systemd unit) configures memory onlining (according
to a config file and system properties), for example:
* Setting memory_hotplug.online_policy=auto-movable
* Setting memory_hotplug.auto_movable_ratio=301
* Setting memory_hotplug.auto_movable_numa_aware=true
3) User space enabled auto onlining via "echo online >
/sys/devices/system/memory/auto_online_blocks"
4) User space triggers manual onlining of all already-offline memory
blocks (go over offline memory blocks and set them to "online")
For DIMMs, hotplugging 4 GiB DIMMs to a 4 GiB VM with a configured ratio of
301% results in the following layout:
Memory block 1-15: DMA32 (early)
Memory block 32-47: Normal (early)
Memory block 48-79: Movable (DIMM 0)
Memory block 80-111: Movable (DIMM 1)
Memory block 112-143: Movable (DIMM 2)
Memory block 144-275: Normal (DIMM 3)
Memory block 176-207: Normal (DIMM 4)
... all Normal
(-> hotplugged Normal memory does not allow for more Movable memory)
For virtio-mem, using a simple, single virtio-mem device with a 4 GiB VM
will result in the following layout:
Memory block 1-15: DMA32 (early)
Memory block 32-47: Normal (early)
Memory block 48-143: Movable (virtio-mem, first 12 GiB)
Memory block 144: Normal (virtio-mem, next 128 MiB)
Memory block 145-147: Movable (virtio-mem, next 384 MiB)
Memory block 148: Normal (virtio-mem, next 128 MiB)
Memory block 149-151: Movable (virtio-mem, next 384 MiB)
... Normal/Movable mixture as above
(-> hotplugged Normal memory allows for more Movable memory within
the same device)
Which gives us maximum flexibility when dynamically growing/shrinking a
VM in smaller steps. When shrinking, virtio-mem will prioritize unplug of
MOVABLE memory with [1] sent last week, such that we won't accidentially
trigger zone imbalances in more complicated setups that involve multiple
virtio-mem devices.
I'll update the memory-hotplug.rst documentation separately, once the
overhaul is done. For now I decided to use module parameters that can be
changed at runtime and not add new sysfs files for configuration. Easier
and cleaner IMHO -- especially, temporarily overwritable via the cmdline.
Future work:
- Use a single static memory group for dax/kmem
- Sense upfront for DIMMs (and dax/kmem) which pieces can actually be added
and won't actually contribute to the static memory group size
- Use memory groups for ppc64 dlpar
- More tunables. For example, a way to configure to keep some memory
default offline (s390x standby memory, dax/kmem)
- Indicate to user space that MOVABLE might be a bad idea -- especially
relevant when memory ballooning without support for balloon compaction
is active.
Cc: Andrew Morton <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Marek Kedzierski <[email protected]>
Cc: Hui Zhu <[email protected]>
Cc: Pankaj Gupta <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
[1] https://lkml.kernel.org/r/[email protected]
David Hildenbrand (12):
mm/memory_hotplug: use "unsigned long" for PFN in zone_for_pfn_range()
mm: track present early pages per zone
mm/memory_hotplug: introduce "auto-movable" online policy
mm/memory_hotplug: remove nid parameter from arch_remove_memory()
mm/memory_hotplug: remove nid parameter from remove_memory() and
friends
drivers/base/memory: "memory groups" to logically group memory blocks
mm/memory_hotplug: track present pages in memory groups
ACPI: memhotplug: memory resources cannot be enabled yet
ACPI: memhotplug: use a single static memory group for a single memory
device
virtio-mem: use a single dynamic memory group for a single virtio-mem
device
mm/memory_hotplug: memory group aware "auto-movable" online policy
mm/memory_hotplug: improved dynamic memory group aware "auto-movable"
online policy
arch/arm64/mm/mmu.c | 3 +-
arch/ia64/mm/init.c | 3 +-
arch/powerpc/mm/mem.c | 3 +-
.../platforms/pseries/hotplug-memory.c | 9 +-
arch/s390/mm/init.c | 3 +-
arch/sh/mm/init.c | 3 +-
arch/x86/mm/init_32.c | 3 +-
arch/x86/mm/init_64.c | 3 +-
drivers/acpi/acpi_memhotplug.c | 46 ++-
drivers/base/memory.c | 163 +++++++--
drivers/dax/kmem.c | 3 +-
drivers/virtio/virtio_mem.c | 26 +-
include/linux/memory.h | 53 ++-
include/linux/memory_hotplug.h | 35 +-
include/linux/mmzone.h | 7 +
mm/memory_hotplug.c | 346 +++++++++++++++++-
mm/memremap.c | 5 +-
mm/page_alloc.c | 3 +
18 files changed, 618 insertions(+), 99 deletions(-)
base-commit: 614124bea77e452aa6df7a8714e8bc820b489922
--
2.31.1
Checkpatch complained on a follow-up patch that we are using "unsigned"
here, which defaults to "unsigned int" and checkpatch is correct.
Use "unsigned long" instead, just as we do in other places when handling
PFNs. This can bite us once we have physical addresses in the range of
multiple TB.
Fixes: e5e689302633 ("mm, memory_hotplug: display allowed zones in the preferred ordering")
Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/memory_hotplug.h | 4 ++--
mm/memory_hotplug.c | 4 ++--
2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 28f32fd00fe9..57a8aa463ccb 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -366,8 +366,8 @@ extern void sparse_remove_section(struct mem_section *ms,
unsigned long map_offset, struct vmem_altmap *altmap);
extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
unsigned long pnum);
-extern struct zone *zone_for_pfn_range(int online_type, int nid, unsigned start_pfn,
- unsigned long nr_pages);
+extern struct zone *zone_for_pfn_range(int online_type, int nid,
+ unsigned long start_pfn, unsigned long nr_pages);
extern int arch_create_linear_mapping(int nid, u64 start, u64 size,
struct mhp_params *params);
void arch_remove_linear_mapping(u64 start, u64 size);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 70620d0dd923..4969614d2140 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -834,8 +834,8 @@ static inline struct zone *default_zone_for_pfn(int nid, unsigned long start_pfn
return movable_node_enabled ? movable_zone : kernel_zone;
}
-struct zone *zone_for_pfn_range(int online_type, int nid, unsigned start_pfn,
- unsigned long nr_pages)
+struct zone *zone_for_pfn_range(int online_type, int nid,
+ unsigned long start_pfn, unsigned long nr_pages)
{
if (online_type == MMOP_ONLINE_KERNEL)
return default_kernel_zone_for_pfn(nid, start_pfn, nr_pages);
--
2.31.1
For implementing a new memory onlining policy, which determines when to
online memory blocks to ZONE_MOVABLE semi-automatically, we need the number
of present early (boot) pages -- present pages excluding hotplugged pages.
Let's track these pages per zone.
Pass a page instead of the zone to adjust_present_page_count(), similar
as adjust_managed_page_count() and derive the zone from the page.
It's worth noting that a memory block to be offlined/onlined is either
completely "early" or "not early". add_memory() and friends can only add
complete memory blocks and we only online/offline complete (individual)
memory blocks.
Signed-off-by: David Hildenbrand <[email protected]>
---
drivers/base/memory.c | 14 +++++++-------
include/linux/memory_hotplug.h | 2 +-
include/linux/mmzone.h | 7 +++++++
mm/memory_hotplug.c | 13 ++++++++++---
mm/page_alloc.c | 3 +++
5 files changed, 28 insertions(+), 11 deletions(-)
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index d5ffaab3cb61..427323620ce8 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -205,7 +205,8 @@ static int memory_block_online(struct memory_block *mem)
* now already properly populated.
*/
if (nr_vmemmap_pages)
- adjust_present_page_count(zone, nr_vmemmap_pages);
+ adjust_present_page_count(pfn_to_page(start_pfn),
+ nr_vmemmap_pages);
return ret;
}
@@ -215,24 +216,23 @@ static int memory_block_offline(struct memory_block *mem)
unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
- struct zone *zone;
int ret;
/*
* Unaccount before offlining, such that unpopulated zone and kthreads
* can properly be torn down in offline_pages().
*/
- if (nr_vmemmap_pages) {
- zone = page_zone(pfn_to_page(start_pfn));
- adjust_present_page_count(zone, -nr_vmemmap_pages);
- }
+ if (nr_vmemmap_pages)
+ adjust_present_page_count(pfn_to_page(start_pfn),
+ -nr_vmemmap_pages);
ret = offline_pages(start_pfn + nr_vmemmap_pages,
nr_pages - nr_vmemmap_pages);
if (ret) {
/* offline_pages() failed. Account back. */
if (nr_vmemmap_pages)
- adjust_present_page_count(zone, nr_vmemmap_pages);
+ adjust_present_page_count(pfn_to_page(start_pfn),
+ nr_vmemmap_pages);
return ret;
}
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 57a8aa463ccb..571734fd95bd 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -107,7 +107,7 @@ static inline void zone_seqlock_init(struct zone *zone)
extern int zone_grow_free_lists(struct zone *zone, unsigned long new_nr_pages);
extern int zone_grow_waitqueues(struct zone *zone, unsigned long nr_pages);
extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
-extern void adjust_present_page_count(struct zone *zone, long nr_pages);
+extern void adjust_present_page_count(struct page *page, long nr_pages);
/* VM interface that may be used by firmware interface */
extern int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
struct zone *zone);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0d53eba1c383..71621f96077e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -512,6 +512,10 @@ struct zone {
* is calculated as:
* present_pages = spanned_pages - absent_pages(pages in holes);
*
+ * present_early_pages is present pages existing within the zone
+ * located on memory available since early boot, excluding hotplugged
+ * memory.
+ *
* managed_pages is present pages managed by the buddy system, which
* is calculated as (reserved_pages includes pages allocated by the
* bootmem allocator):
@@ -544,6 +548,9 @@ struct zone {
atomic_long_t managed_pages;
unsigned long spanned_pages;
unsigned long present_pages;
+#if defined(CONFIG_MEMORY_HOTPLUG)
+ unsigned long present_early_pages;
+#endif
#ifdef CONFIG_CMA
unsigned long cma_pages;
#endif
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 4969614d2140..f2646db86190 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -850,10 +850,17 @@ struct zone *zone_for_pfn_range(int online_type, int nid,
* This function should only be called by memory_block_{online,offline},
* and {online,offline}_pages.
*/
-void adjust_present_page_count(struct zone *zone, long nr_pages)
+void adjust_present_page_count(struct page *page, long nr_pages)
{
+ struct zone *zone = page_zone(page);
unsigned long flags;
+ /*
+ * We only support onlining/offlining/adding/removing of complete
+ * memory blocks; therefore, either all is early or hotplugged.
+ */
+ if (early_section(__pfn_to_section(page_to_pfn(page))))
+ zone->present_early_pages += nr_pages;
zone->present_pages += nr_pages;
pgdat_resize_lock(zone->zone_pgdat, &flags);
zone->zone_pgdat->node_present_pages += nr_pages;
@@ -956,7 +963,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, struct zone *z
}
online_pages_range(pfn, nr_pages);
- adjust_present_page_count(zone, nr_pages);
+ adjust_present_page_count(pfn_to_page(pfn), nr_pages);
node_states_set_node(nid, &arg);
if (need_zonelists_rebuild)
@@ -1827,7 +1834,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
/* removal success */
adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
- adjust_present_page_count(zone, -nr_pages);
+ adjust_present_page_count(pfn_to_page(start_pfn), -nr_pages);
init_per_zone_wmark_min();
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d1f5de1c1283..08f68238882e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7007,6 +7007,9 @@ static void __init calculate_node_totalpages(struct pglist_data *pgdat,
zone->zone_start_pfn = 0;
zone->spanned_pages = size;
zone->present_pages = real_size;
+#if defined(CONFIG_MEMORY_HOTPLUG)
+ zone->present_early_pages = real_size;
+#endif
totalpages += size;
realtotalpages += real_size;
--
2.31.1
The parameter is unused, let's remove it.
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: [email protected]
Cc: "H. Peter Anvin" <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Laurent Dufour <[email protected]>
Cc: Sergei Trofimovich <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Thiago Jung Bauermann <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Pierre Morel <[email protected]>
Cc: Jia He <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: David Hildenbrand <[email protected]>
---
arch/arm64/mm/mmu.c | 3 +--
arch/ia64/mm/init.c | 3 +--
arch/powerpc/mm/mem.c | 3 +--
arch/s390/mm/init.c | 3 +--
arch/sh/mm/init.c | 3 +--
arch/x86/mm/init_32.c | 3 +--
arch/x86/mm/init_64.c | 3 +--
include/linux/memory_hotplug.h | 3 +--
mm/memory_hotplug.c | 4 ++--
mm/memremap.c | 5 +----
10 files changed, 11 insertions(+), 22 deletions(-)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 89b66ef43a0f..c7821013f551 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1502,8 +1502,7 @@ int arch_add_memory(int nid, u64 start, u64 size,
return ret;
}
-void arch_remove_memory(int nid, u64 start, u64 size,
- struct vmem_altmap *altmap)
+void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index 064a967a7b6e..5c6da8d83c1a 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -484,8 +484,7 @@ int arch_add_memory(int nid, u64 start, u64 size,
return ret;
}
-void arch_remove_memory(int nid, u64 start, u64 size,
- struct vmem_altmap *altmap)
+void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 043bbeaf407c..fc5c36189c26 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -115,8 +115,7 @@ int __ref arch_add_memory(int nid, u64 start, u64 size,
return rc;
}
-void __ref arch_remove_memory(int nid, u64 start, u64 size,
- struct vmem_altmap *altmap)
+void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index 8ac710de1ab1..d85bd7f5d8dc 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -306,8 +306,7 @@ int arch_add_memory(int nid, u64 start, u64 size,
return rc;
}
-void arch_remove_memory(int nid, u64 start, u64 size,
- struct vmem_altmap *altmap)
+void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index 168d7d4dd735..d74daf68e59e 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -414,8 +414,7 @@ int arch_add_memory(int nid, u64 start, u64 size,
return ret;
}
-void arch_remove_memory(int nid, u64 start, u64 size,
- struct vmem_altmap *altmap)
+void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
{
unsigned long start_pfn = PFN_DOWN(start);
unsigned long nr_pages = size >> PAGE_SHIFT;
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 21ffb03f6c72..5e82aafb5b49 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -801,8 +801,7 @@ int arch_add_memory(int nid, u64 start, u64 size,
return __add_pages(nid, start_pfn, nr_pages, params);
}
-void arch_remove_memory(int nid, u64 start, u64 size,
- struct vmem_altmap *altmap)
+void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index e527d829e1ed..d0296c081607 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1254,8 +1254,7 @@ kernel_physical_mapping_remove(unsigned long start, unsigned long end)
remove_pagetable(start, end, true, NULL);
}
-void __ref arch_remove_memory(int nid, u64 start, u64 size,
- struct vmem_altmap *altmap)
+void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 571734fd95bd..1d8d09c029c9 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -142,8 +142,7 @@ static inline bool movable_node_is_enabled(void)
return movable_node_enabled;
}
-extern void arch_remove_memory(int nid, u64 start, u64 size,
- struct vmem_altmap *altmap);
+extern void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap);
extern void __remove_pages(unsigned long start_pfn, unsigned long nr_pages,
struct vmem_altmap *altmap);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 7206787ac5a9..f9be66bbd847 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1431,7 +1431,7 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
/* create memory block devices after memory was added */
ret = create_memory_block_devices(start, size, mhp_altmap.alloc);
if (ret) {
- arch_remove_memory(nid, start, size, NULL);
+ arch_remove_memory(start, size, NULL);
goto error;
}
@@ -2211,7 +2211,7 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
mem_hotplug_begin();
- arch_remove_memory(nid, start, size, altmap);
+ arch_remove_memory(start, size, altmap);
if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
memblock_free(start, size);
diff --git a/mm/memremap.c b/mm/memremap.c
index 15a074ffb8d7..ed593bf87109 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -140,14 +140,11 @@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)
{
struct range *range = &pgmap->ranges[range_id];
struct page *first_page;
- int nid;
/* make sure to access a memmap that was actually initialized */
first_page = pfn_to_page(pfn_first(pgmap, range_id));
/* pages are dead and unused, undo the arch mapping */
- nid = page_to_nid(first_page);
-
mem_hotplug_begin();
remove_pfn_range_from_zone(page_zone(first_page), PHYS_PFN(range->start),
PHYS_PFN(range_len(range)));
@@ -155,7 +152,7 @@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)
__remove_pages(PHYS_PFN(range->start),
PHYS_PFN(range_len(range)), NULL);
} else {
- arch_remove_memory(nid, range->start, range_len(range),
+ arch_remove_memory(range->start, range_len(range),
pgmap_altmap(pgmap));
kasan_remove_zero_shadow(__va(range->start), range_len(range));
}
--
2.31.1
When onlining without specifying a zone (using "online" instead of
"online_kernel" or "online_movable"), we currently select a zone such that
existing zones are kept contiguous. This online policy made sense in the
past, where contiguous zones where required.
We'd like to implement smarter policies, however:
* User space has little insight. As one example, it has no idea which
memory blocks logically belong together (e.g., to a DIMM or to a
virtio-mem device).
* Drivers that add memory in separate memory blocks, especially
virtio-mem, want memory to get onlined right from the kernel when
adding.
So we really want to have onlining to differing zones managed in the
kernel, configured by user space.
We see more and more cases where we might eventually hotplug a lot of
memory in the future (e.g., eventually grow a 2 GiB VM to 64 GiB), however:
* Resizing happens dynamically, in smaller steps in both directions
(e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...)
* We still want as much flexibility as possible, especially,
hotunplugging as much memory as possible later.
We can really only use "online_movable" if we know that the amount of
memory we are going to hotplug upfront, and we know that it won't result
in a zone imbalance. So in our example, a 2 GiB VM that could grow to 64
GiB could currently not use "online_movable", and instead,
"online_kernel" would have to be used, resulting in worse (no) memory
hotunplug reliability.
Let's add a new "auto-movable" online policy that considers the current
zone ratios (global, per-node) to determine, whether we a memory block
can be onlined to ZONE_MOVABLE:
MOVABLE : KERNEL
However, internally we'll only consider the following ratio for now:
MOVABLE : KERNEL_EARLY
For now, we don't allow for hotplugged KERNEL memory to allow for more
MOVABLE memory, because there is no coordination accross memory devices.
In follow-up patches, we will allow for more KERNEL memory within a memory
device to allow for more MOVABLE memory within the same memory device --
which only makes sense for special memory device types.
We base our calculation on "present pages", see the code comments for
details. Hotplugged memory will get online to ZONE_MOVABLE if the
configured ratio allows for it. Depending on the setup, this can result
in fragmented zones, which can make compaction slower and dynamic
allocation of gigantic pages when not using CMA less reliable
(... which is already pretty unreliable).
The old policy will be the default and called "contig-zones". In follow-up
patches, our new policy will use additional information, such as memory
groups, to make even smarter decisions across memory blocks.
Configuration:
* memory_hotplug.online_policy is used to switch between both polices and
defaults to "contig-zones".
* memory_hotplug.auto_movable_ratio defines the maximum ratio is in percent
and defaults to "301" -- allowing e.g., most 8 GiB machines to grow to 32
GiB and have all hotplugged memory in ZONE_MOVABLE. The additional
percent accounts for a handful of lost present pages (e.g., firmware
allocations). User space is expected to adjust this ratio when enabling
the new "auto-movable" policy, though.
* memory_hotplug.auto_movable_numa_aware considers numa node stats in
addition to global stats, and defaults to "true".
Note: just like the old policy, the new policy won't take things like
unmovable huge pages or memory ballooning that doesn't support balloon
compaction into account. User space has to configure onlining
accordingly.
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/memory_hotplug.c | 189 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 189 insertions(+)
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index f2646db86190..7206787ac5a9 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -52,6 +52,73 @@ module_param(memmap_on_memory, bool, 0444);
MODULE_PARM_DESC(memmap_on_memory, "Enable memmap on memory for memory hotplug");
#endif
+enum {
+ ONLINE_POLICY_CONTIG_ZONES = 0,
+ ONLINE_POLICY_AUTO_MOVABLE,
+};
+
+const char *online_policy_to_str[] = {
+ [ONLINE_POLICY_CONTIG_ZONES] = "contig-zones",
+ [ONLINE_POLICY_AUTO_MOVABLE] = "auto-movable",
+};
+
+static int set_online_policy(const char *val, const struct kernel_param *kp)
+{
+ int ret = sysfs_match_string(online_policy_to_str, val);
+
+ if (ret < 0)
+ return ret;
+ *((int *)kp->arg) = ret;
+ return 0;
+}
+
+static int get_online_policy(char *buffer, const struct kernel_param *kp)
+{
+ return sprintf(buffer, "%s\n", online_policy_to_str[*((int *)kp->arg)]);
+}
+
+/*
+ * memory_hotplug.online_policy: configure online behavior when onlining without
+ * specifying a zone (MMOP_ONLINE)
+ *
+ * "contig-zones": keep zone contiguous
+ * "auto-movable": online memory to ZONE_MOVABLE if the configuration
+ * (auto_movable_ratio, auto_movable_numa_aware) allows for it
+ */
+static int online_policy __read_mostly = ONLINE_POLICY_CONTIG_ZONES;
+static const struct kernel_param_ops online_policy_ops = {
+ .set = set_online_policy,
+ .get = get_online_policy,
+};
+module_param_cb(online_policy, &online_policy_ops, &online_policy, 0644);
+MODULE_PARM_DESC(online_policy,
+ "Set the online policy (\"contig-zones\", \"auto-movable\") "
+ "Default: \"contig-zones\"");
+
+/*
+ * memory_hotplug.auto_movable_ratio: specify maximum MOVABLE:KERNEL ratio
+ *
+ * The ratio represent an upper limit and the kernel might decide to not
+ * online some memory to ZONE_MOVABLE -- e.g., because hotplugged KERNEL memory
+ * doesn't allow for more MOVABLE memory.
+ */
+static unsigned int auto_movable_ratio __read_mostly = 301;
+module_param(auto_movable_ratio, uint, 0644);
+MODULE_PARM_DESC(auto_movable_ratio,
+ "Set the maximum ratio of MOVABLE:KERNEL memory in the system "
+ "in percent for \"auto-movable\" online policy. Default: 301");
+
+/*
+ * memory_hotplug.auto_movable_numa_aware: consider numa node stats
+ */
+#ifdef CONFIG_NUMA
+static bool auto_movable_numa_aware __read_mostly = true;
+module_param(auto_movable_numa_aware, bool, 0644);
+MODULE_PARM_DESC(auto_movable_numa_aware,
+ "Consider numa node stats in addition to global stats in "
+ "\"auto-movable\" online policy. Default: true");
+#endif /* CONFIG_NUMA */
+
/*
* online_page_callback contains pointer to current page onlining function.
* Initially it is generic_online_page(). If it is required it could be
@@ -789,6 +856,59 @@ void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
set_zone_contiguous(zone);
}
+struct auto_movable_stats {
+ unsigned long kernel_early_pages;
+ unsigned long movable_pages;
+};
+
+static void auto_movable_stats_account_zone(struct auto_movable_stats *stats,
+ struct zone *zone)
+{
+ if (zone_idx(zone) == ZONE_MOVABLE) {
+ stats->movable_pages += zone->present_pages;
+ } else {
+ /*
+ * CMA pages (never on hotplugged memory) behave like
+ * ZONE_MOVABLE.
+ */
+ stats->movable_pages += zone->cma_pages;
+ stats->kernel_early_pages += zone->present_early_pages;
+ stats->kernel_early_pages -= zone->cma_pages;
+ }
+}
+
+static bool auto_movable_can_online_movable(int nid, unsigned long nr_pages)
+{
+ struct auto_movable_stats stats = {};
+ unsigned long kernel_early_pages, movable_pages;
+ pg_data_t *pgdat = NODE_DATA(nid);
+ struct zone *zone;
+ int i;
+
+ /* Walk all relevant zones and collect MOVABLE vs. KERNEL stats. */
+ if (nid == NUMA_NO_NODE) {
+ /* TODO: cache values */
+ for_each_populated_zone(zone)
+ auto_movable_stats_account_zone(&stats, zone);
+ } else {
+ for (i = 0; i < MAX_NR_ZONES; i++) {
+ zone = pgdat->node_zones + i;
+ if (populated_zone(zone))
+ auto_movable_stats_account_zone(&stats, zone);
+ }
+ }
+
+ kernel_early_pages = stats.kernel_early_pages;
+ movable_pages = stats.movable_pages;
+
+ /*
+ * Test if we could online the given number of pages to ZONE_MOVABLE
+ * and still stay in the configured ratio.
+ */
+ movable_pages += nr_pages;
+ return movable_pages <= (auto_movable_ratio * kernel_early_pages) / 100;
+}
+
/*
* Returns a default kernel memory zone for the given pfn range.
* If no kernel zone covers this pfn range it will automatically go
@@ -810,6 +930,72 @@ static struct zone *default_kernel_zone_for_pfn(int nid, unsigned long start_pfn
return &pgdat->node_zones[ZONE_NORMAL];
}
+/*
+ * Determine to which zone to online memory dynamically based on user
+ * configuration and system stats. We care about the following ratio:
+ *
+ * MOVABLE : KERNEL
+ *
+ * Whereby MOVABLE is memory in ZONE_MOVABLE and KERNEL is memory in
+ * one of the kernel zones. CMA pages inside one of the kernel zones really
+ * behaves like ZONE_MOVABLE, so we treat them accordingly.
+ *
+ * We don't allow for hotplugged memory in a KERNEL zone to increase the
+ * amount of MOVABLE memory we can have, so we end up with:
+ *
+ * MOVABLE : KERNEL_EARLY
+ *
+ * Whereby KERNEL_EARLY is memory in one of the kernel zones, available sinze
+ * boot. We base our calculation on KERNEL_EARLY internally, because:
+ *
+ * a) Hotplugged memory in one of the kernel zones can sometimes still get
+ * hotunplugged, especially when hot(un)plugging individual memory blocks.
+ * There is no coordination across memory devices, therefore "automatic"
+ * hotunplugging, as implemented in hypervisors, could result in zone
+ * imbalances.
+ * b) Early/boot memory in one of the kernel zones can usually not get
+ * hotunplugged again (e.g., no firmware interface to unplug, fragmented
+ * with unmovable allocations). While there are corner cases where it might
+ * still work, it is barely relevant in practice.
+ *
+ * We rely on "present pages" instead of "managed pages", as the latter is
+ * highly unreliable and dynamic in virtualized environments, and does not
+ * consider boot time allocations. For example, memory ballooning adjusts the
+ * managed pages when inflating/deflating the balloon, and balloon compaction
+ * can even migrate inflated pages between zones.
+ *
+ * Using "present pages" is better but some things to keep in mind are:
+ *
+ * a) Some memblock allocations, such as for the crashkernel area, are
+ * effectively unused by the kernel, yet they account to "present pages".
+ * Fortunately, these allocations are comparatively small in relevant setups
+ * (e.g., fraction of system memory).
+ * b) Some hotplugged memory blocks in virtualized environments, esecially
+ * hotplugged by virtio-mem, look like they are completely present, however,
+ * only parts of the memory block are actually currently usable.
+ * "present pages" is an upper limit that can get reached at runtime. As
+ * we base our calculations on KERNEL_EARLY, this is not an issue.
+ */
+static struct zone *auto_movable_zone_for_pfn(int nid, unsigned long pfn,
+ unsigned long nr_pages)
+{
+ if (!auto_movable_ratio)
+ goto kernel_zone;
+
+ if (!auto_movable_can_online_movable(NUMA_NO_NODE, nr_pages))
+ goto kernel_zone;
+
+#ifdef CONFIG_NUMA
+ if (auto_movable_numa_aware &&
+ !auto_movable_can_online_movable(nid, nr_pages))
+ goto kernel_zone;
+#endif /* CONFIG_NUMA */
+
+ return &NODE_DATA(nid)->node_zones[ZONE_MOVABLE];
+kernel_zone:
+ return default_kernel_zone_for_pfn(nid, pfn, nr_pages);
+}
+
static inline struct zone *default_zone_for_pfn(int nid, unsigned long start_pfn,
unsigned long nr_pages)
{
@@ -843,6 +1029,9 @@ struct zone *zone_for_pfn_range(int online_type, int nid,
if (online_type == MMOP_ONLINE_MOVABLE)
return &NODE_DATA(nid)->node_zones[ZONE_MOVABLE];
+ if (online_policy == ONLINE_POLICY_AUTO_MOVABLE)
+ return auto_movable_zone_for_pfn(nid, start_pfn, nr_pages);
+
return default_zone_for_pfn(nid, start_pfn, nr_pages);
}
--
2.31.1
There is only a single user remaining. We can simply try to offline all
online nodes - which is fast, because we usually span pages and can skip
such nodes right away.
Cc: Michael Ellerman <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Nathan Lynch <[email protected]>
Cc: Laurent Dufour <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Scott Cheloha <[email protected]>
Cc: Anton Blanchard <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: David Hildenbrand <[email protected]>
---
.../platforms/pseries/hotplug-memory.c | 9 ++++-----
drivers/acpi/acpi_memhotplug.c | 7 +------
drivers/dax/kmem.c | 3 +--
drivers/virtio/virtio_mem.c | 4 ++--
include/linux/memory_hotplug.h | 10 +++++-----
mm/memory_hotplug.c | 20 +++++++++----------
6 files changed, 23 insertions(+), 30 deletions(-)
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 8377f1f7c78e..4a9232ddbefe 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -286,7 +286,7 @@ static int pseries_remove_memblock(unsigned long base, unsigned long memblock_si
{
unsigned long block_sz, start_pfn;
int sections_per_block;
- int i, nid;
+ int i;
start_pfn = base >> PAGE_SHIFT;
@@ -297,10 +297,9 @@ static int pseries_remove_memblock(unsigned long base, unsigned long memblock_si
block_sz = pseries_memory_block_size();
sections_per_block = block_sz / MIN_MEMORY_BLOCK_SIZE;
- nid = memory_add_physaddr_to_nid(base);
for (i = 0; i < sections_per_block; i++) {
- __remove_memory(nid, base, MIN_MEMORY_BLOCK_SIZE);
+ __remove_memory(base, MIN_MEMORY_BLOCK_SIZE);
base += MIN_MEMORY_BLOCK_SIZE;
}
@@ -386,7 +385,7 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
block_sz = pseries_memory_block_size();
- __remove_memory(mem_block->nid, lmb->base_addr, block_sz);
+ __remove_memory(lmb->base_addr, block_sz);
put_device(&mem_block->dev);
/* Update memory regions for memory remove */
@@ -638,7 +637,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
rc = dlpar_online_lmb(lmb);
if (rc) {
- __remove_memory(nid, lmb->base_addr, block_sz);
+ __remove_memory(lmb->base_addr, block_sz);
invalidate_lmb_associativity_index(lmb);
} else {
lmb->flags |= DRCONF_MEM_ASSIGNED;
diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index 8cc195c4c861..1d01d9414c40 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -239,19 +239,14 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
static void acpi_memory_remove_memory(struct acpi_memory_device *mem_device)
{
- acpi_handle handle = mem_device->device->handle;
struct acpi_memory_info *info, *n;
- int nid = acpi_get_node(handle);
list_for_each_entry_safe(info, n, &mem_device->res_list, list) {
if (!info->enabled)
continue;
- if (nid == NUMA_NO_NODE)
- nid = memory_add_physaddr_to_nid(info->start_addr);
-
acpi_unbind_memory_blocks(info);
- __remove_memory(nid, info->start_addr, info->length);
+ __remove_memory(info->start_addr, info->length);
list_del(&info->list);
kfree(info);
}
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index ac231cc36359..99e0f60c4c26 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -156,8 +156,7 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
if (rc)
continue;
- rc = remove_memory(dev_dax->target_node, range.start,
- range_len(&range));
+ rc = remove_memory(range.start, range_len(&range));
if (rc == 0) {
release_resource(data->res[i]);
kfree(data->res[i]);
diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 10ec60d81e84..e327fb878143 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -673,7 +673,7 @@ static int virtio_mem_remove_memory(struct virtio_mem *vm, uint64_t addr,
dev_dbg(&vm->vdev->dev, "removing memory: 0x%llx - 0x%llx\n", addr,
addr + size - 1);
- rc = remove_memory(vm->nid, addr, size);
+ rc = remove_memory(addr, size);
if (!rc) {
atomic64_sub(size, &vm->offline_size);
/*
@@ -728,7 +728,7 @@ static int virtio_mem_offline_and_remove_memory(struct virtio_mem *vm,
"offlining and removing memory: 0x%llx - 0x%llx\n", addr,
addr + size - 1);
- rc = offline_and_remove_memory(vm->nid, addr, size);
+ rc = offline_and_remove_memory(addr, size);
if (!rc) {
atomic64_sub(size, &vm->offline_size);
/*
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 1d8d09c029c9..84f05435e2ae 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -319,9 +319,9 @@ static inline void pgdat_resize_init(struct pglist_data *pgdat) {}
extern void try_offline_node(int nid);
extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
-extern int remove_memory(int nid, u64 start, u64 size);
-extern void __remove_memory(int nid, u64 start, u64 size);
-extern int offline_and_remove_memory(int nid, u64 start, u64 size);
+extern int remove_memory(u64 start, u64 size);
+extern void __remove_memory(u64 start, u64 size);
+extern int offline_and_remove_memory(u64 start, u64 size);
#else
static inline void try_offline_node(int nid) {}
@@ -331,12 +331,12 @@ static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
return -EINVAL;
}
-static inline int remove_memory(int nid, u64 start, u64 size)
+static inline int remove_memory(u64 start, u64 size)
{
return -EBUSY;
}
-static inline void __remove_memory(int nid, u64 start, u64 size) {}
+static inline void __remove_memory(u64 start, u64 size) {}
#endif /* CONFIG_MEMORY_HOTREMOVE */
extern void set_zone_contiguous(struct zone *zone);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index f9be66bbd847..9cae42636f3e 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -2157,9 +2157,9 @@ void try_offline_node(int nid)
}
EXPORT_SYMBOL(try_offline_node);
-static int __ref try_remove_memory(int nid, u64 start, u64 size)
+static int __ref try_remove_memory(u64 start, u64 size)
{
- int rc = 0;
+ int rc = 0, nid;
struct vmem_altmap mhp_altmap = {};
struct vmem_altmap *altmap = NULL;
unsigned long nr_vmemmap_pages;
@@ -2220,7 +2220,8 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
release_mem_region_adjustable(start, size);
- try_offline_node(nid);
+ for_each_online_node(nid)
+ try_offline_node(nid);
mem_hotplug_done();
return 0;
@@ -2228,7 +2229,6 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
/**
* remove_memory
- * @nid: the node ID
* @start: physical address of the region to remove
* @size: size of the region to remove
*
@@ -2236,14 +2236,14 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
* and online/offline operations before this call, as required by
* try_offline_node().
*/
-void __remove_memory(int nid, u64 start, u64 size)
+void __remove_memory(u64 start, u64 size)
{
/*
* trigger BUG() if some memory is not offlined prior to calling this
* function
*/
- if (try_remove_memory(nid, start, size))
+ if (try_remove_memory(start, size))
BUG();
}
@@ -2251,12 +2251,12 @@ void __remove_memory(int nid, u64 start, u64 size)
* Remove memory if every memory block is offline, otherwise return -EBUSY is
* some memory is not offline
*/
-int remove_memory(int nid, u64 start, u64 size)
+int remove_memory(u64 start, u64 size)
{
int rc;
lock_device_hotplug();
- rc = try_remove_memory(nid, start, size);
+ rc = try_remove_memory(start, size);
unlock_device_hotplug();
return rc;
@@ -2316,7 +2316,7 @@ static int try_reonline_memory_block(struct memory_block *mem, void *arg)
* unplugged all memory (so it's no longer in use) and want to offline + remove
* that memory.
*/
-int offline_and_remove_memory(int nid, u64 start, u64 size)
+int offline_and_remove_memory(u64 start, u64 size)
{
const unsigned long mb_count = size / memory_block_size_bytes();
uint8_t *online_types, *tmp;
@@ -2352,7 +2352,7 @@ int offline_and_remove_memory(int nid, u64 start, u64 size)
* This cannot fail as it cannot get onlined in the meantime.
*/
if (!rc) {
- rc = try_remove_memory(nid, start, size);
+ rc = try_remove_memory(start, size);
if (rc)
pr_err("%s: Failed to remove memory: %d", __func__, rc);
}
--
2.31.1
We allocate + initialize everything from scratch. In case enabling the
device fails, we free all memory resourcs.
Signed-off-by: David Hildenbrand <[email protected]>
---
drivers/acpi/acpi_memhotplug.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index 1d01d9414c40..eb4faf7c5cad 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -182,10 +182,6 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
* (i.e. memory-hot-remove function)
*/
list_for_each_entry(info, &mem_device->res_list, list) {
- if (info->enabled) { /* just sanity check...*/
- num_enabled++;
- continue;
- }
/*
* If the memory block size is zero, please ignore it.
* Don't try to do the following memory hotplug flowchart.
--
2.31.1
Let's group all memory we add for a single memory device - we want a
single node for that (which also seems to be the sane thing to do).
We won't care for now about memory that was already added to the system
(e.g., via e820) -- usually *all* memory of a memory device was already
added and we'll fail acpi_memory_enable_device().
Signed-off-by: David Hildenbrand <[email protected]>
---
drivers/acpi/acpi_memhotplug.c | 35 +++++++++++++++++++++++++++++-----
1 file changed, 30 insertions(+), 5 deletions(-)
diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index eb4faf7c5cad..0c7b062c0e5d 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -54,6 +54,7 @@ struct acpi_memory_info {
struct acpi_memory_device {
struct acpi_device *device;
struct list_head res_list;
+ int mgid;
};
static acpi_status
@@ -171,10 +172,31 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
acpi_handle handle = mem_device->device->handle;
int result, num_enabled = 0;
struct acpi_memory_info *info;
- mhp_t mhp_flags = MHP_NONE;
- int node;
+ mhp_t mhp_flags = MHP_NID_IS_MGID;
+ u64 total_length = 0;
+ int node, mgid;
node = acpi_get_node(handle);
+
+ list_for_each_entry(info, &mem_device->res_list, list) {
+ if (!info->length)
+ continue;
+ /* We want a single node for the whole memory group */
+ if (node < 0)
+ node = memory_add_physaddr_to_nid(info->start_addr);
+ total_length += info->length;
+ }
+
+ if (!total_length) {
+ dev_err(&mem_device->device->dev, "device is empty\n");
+ return -EINVAL;
+ }
+
+ mgid = register_static_memory_group(node, PFN_UP(total_length));
+ if (mgid < 0)
+ return mgid;
+ mem_device->mgid = mgid;
+
/*
* Tell the VM there is more memory here...
* Note: Assume that this function returns zero on success
@@ -188,12 +210,10 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
*/
if (!info->length)
continue;
- if (node < 0)
- node = memory_add_physaddr_to_nid(info->start_addr);
if (mhp_supports_memmap_on_memory(info->length))
mhp_flags |= MHP_MEMMAP_ON_MEMORY;
- result = __add_memory(node, info->start_addr, info->length,
+ result = __add_memory(mgid, info->start_addr, info->length,
mhp_flags);
/*
@@ -253,6 +273,10 @@ static void acpi_memory_device_free(struct acpi_memory_device *mem_device)
if (!mem_device)
return;
+ /* In case we succeeded adding *some* memory, unregistering fails. */
+ if (mem_device->mgid >= 0)
+ unregister_memory_group(mem_device->mgid);
+
acpi_memory_free_device_resources(mem_device);
mem_device->device->driver_data = NULL;
kfree(mem_device);
@@ -273,6 +297,7 @@ static int acpi_memory_device_add(struct acpi_device *device,
INIT_LIST_HEAD(&mem_device->res_list);
mem_device->device = device;
+ mem_device->mgid = -1;
sprintf(acpi_device_name(device), "%s", ACPI_MEMORY_DEVICE_NAME);
sprintf(acpi_device_class(device), "%s", ACPI_MEMORY_DEVICE_CLASS);
device->driver_data = mem_device;
--
2.31.1
In our "auto-movable" memory onlining policy, we want to make decisions
across memory blocks of a single memory device. Examples of memory devices
include ACPI memory devices (in the simplest case a single DIMM) a
virtio-mem. For now, we don't have a connection between a single memory
block device and the real memory device. Each memory device consists of
1..X memory block devices.
Let's logically group memory blocks belonging to the same memory device
in "memory groups". Memory groups can span multiple physical ranges and a
memory group itself does not contain any information regarding physical
ranges, only properties (e.g., "max_pages") necessary for improved memory
onlining.
Introduce two memory group types:
1) Static memory group: E.g., a single ACPI memory device, consisting of
1..X memory resources. A memory group consists of 1..Y memory blocks.
The whole group is added/removed in one go. If any part cannot get
offlined, the whole group cannot be removed.
2) Dynamic memory group: E.g., a single virtio-mem device. Memory is
dynamically added/removed in a fixed granularity, called a "unit",
consisting of 1..X memory blocks. A unit is added/removed in one go.
If any part of a unit cannot get offlined, the whole unit cannot be
removed.
In case of 1) we usually want either all memory managed by ZONE_MOVABLE
or none. In case of 2) we usually want to have as many units as possible
managed by ZONE_MOVABLE. We want a single unit to be of the same type.
For now, memory groups are an internal concept that is not exposed to
user space; we might want to change that in the future, though.
add_memory() users can specify a mgid instead of a nid when passing
the MHP_NID_IS_MGID flag.
Signed-off-by: David Hildenbrand <[email protected]>
---
drivers/base/memory.c | 102 +++++++++++++++++++++++++++++++--
include/linux/memory.h | 46 ++++++++++++++-
include/linux/memory_hotplug.h | 6 +-
mm/memory_hotplug.c | 11 +++-
4 files changed, 158 insertions(+), 7 deletions(-)
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 427323620ce8..00c58a6632a6 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -82,6 +82,11 @@ static struct bus_type memory_subsys = {
*/
static DEFINE_XARRAY(memory_blocks);
+/*
+ * Memory groups, indexed by memory group identification (mgid).
+ */
+static DEFINE_XARRAY_FLAGS(memory_groups, XA_FLAGS_ALLOC);
+
static BLOCKING_NOTIFIER_HEAD(memory_chain);
int register_memory_notifier(struct notifier_block *nb)
@@ -634,7 +639,8 @@ int register_memory(struct memory_block *memory)
}
static int init_memory_block(unsigned long block_id, unsigned long state,
- unsigned long nr_vmemmap_pages)
+ unsigned long nr_vmemmap_pages,
+ struct memory_group *group)
{
struct memory_block *mem;
int ret = 0;
@@ -653,6 +659,11 @@ static int init_memory_block(unsigned long block_id, unsigned long state,
mem->nid = NUMA_NO_NODE;
mem->nr_vmemmap_pages = nr_vmemmap_pages;
+ if (group) {
+ mem->group = group;
+ refcount_inc(&group->refcount);
+ }
+
ret = register_memory(mem);
return ret;
@@ -671,7 +682,7 @@ static int add_memory_block(unsigned long base_section_nr)
if (section_count == 0)
return 0;
return init_memory_block(memory_block_id(base_section_nr),
- MEM_ONLINE, 0);
+ MEM_ONLINE, 0, NULL);
}
static void unregister_memory(struct memory_block *memory)
@@ -681,6 +692,11 @@ static void unregister_memory(struct memory_block *memory)
WARN_ON(xa_erase(&memory_blocks, memory->dev.id) == NULL);
+ if (memory->group) {
+ refcount_dec(&memory->group->refcount);
+ memory->group = NULL;
+ }
+
/* drop the ref. we got via find_memory_block() */
put_device(&memory->dev);
device_unregister(&memory->dev);
@@ -694,7 +710,8 @@ static void unregister_memory(struct memory_block *memory)
* Called under device_hotplug_lock.
*/
int create_memory_block_devices(unsigned long start, unsigned long size,
- unsigned long vmemmap_pages)
+ unsigned long vmemmap_pages,
+ struct memory_group *group)
{
const unsigned long start_block_id = pfn_to_block_id(PFN_DOWN(start));
unsigned long end_block_id = pfn_to_block_id(PFN_DOWN(start + size));
@@ -707,7 +724,8 @@ int create_memory_block_devices(unsigned long start, unsigned long size,
return -EINVAL;
for (block_id = start_block_id; block_id != end_block_id; block_id++) {
- ret = init_memory_block(block_id, MEM_OFFLINE, vmemmap_pages);
+ ret = init_memory_block(block_id, MEM_OFFLINE, vmemmap_pages,
+ group);
if (ret)
break;
}
@@ -891,3 +909,79 @@ int for_each_memory_block(void *arg, walk_memory_blocks_func_t func)
return bus_for_each_dev(&memory_subsys, NULL, &cb_data,
for_each_memory_block_cb);
}
+
+static int register_memory_group(struct memory_group group)
+{
+ struct memory_group *new_group;
+ uint32_t mgid;
+ int ret;
+
+ if (!node_possible(group.nid))
+ return -EINVAL;
+
+ new_group = kzalloc(sizeof(group), GFP_KERNEL);
+ if (!new_group)
+ return -ENOMEM;
+ *new_group = group;
+ refcount_set(&new_group->refcount, 1);
+
+ ret = xa_alloc(&memory_groups, &mgid, new_group, xa_limit_31b,
+ GFP_KERNEL);
+ if (ret)
+ kfree(new_group);
+ return ret ? ret : mgid;
+}
+
+int register_static_memory_group(int nid, unsigned long max_pages)
+{
+ struct memory_group group = {
+ .nid = nid,
+ .s = {
+ .max_pages = max_pages,
+ },
+ };
+
+ if (!max_pages)
+ return -EINVAL;
+ return register_memory_group(group);
+}
+EXPORT_SYMBOL_GPL(register_static_memory_group);
+
+int register_dynamic_memory_group(int nid, unsigned long unit_pages)
+{
+ struct memory_group group = {
+ .nid = nid,
+ .is_dynamic = true,
+ .d = {
+ .unit_pages = unit_pages,
+ },
+ };
+
+ if (!unit_pages || !is_power_of_2(unit_pages) ||
+ unit_pages < PHYS_PFN(memory_block_size_bytes()))
+ return -EINVAL;
+ return register_memory_group(group);
+}
+EXPORT_SYMBOL_GPL(register_dynamic_memory_group);
+
+int unregister_memory_group(int mgid)
+{
+ struct memory_group *group;
+
+ if (mgid < 0)
+ return -EINVAL;
+
+ group = xa_load(&memory_groups, mgid);
+ if (!group || refcount_read(&group->refcount) > 1)
+ return -EINVAL;
+
+ xa_erase(&memory_groups, mgid);
+ kfree(group);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(unregister_memory_group);
+
+struct memory_group *get_memory_group(int mgid)
+{
+ return xa_load(&memory_groups, mgid);
+}
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 97e92e8b556a..6e20a6174fe5 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -23,6 +23,42 @@
#define MIN_MEMORY_BLOCK_SIZE (1UL << SECTION_SIZE_BITS)
+struct memory_group {
+ /* Nid the whole group belongs to. */
+ int nid;
+ /* References from memory blocks + 1. */
+ refcount_t refcount;
+ /*
+ * Memory group type: static vs. dynamic.
+ *
+ * Static: All memory in the group belongs to a single unit, such as,
+ * a DIMM. All memory belonging to the group will be added in
+ * one go and removed in one go -- it's static.
+ *
+ * Dynamic: Memory within the group is added/removed dynamically in
+ * units of the specified granularity of at least one memory block.
+ */
+ bool is_dynamic;
+
+ union {
+ struct {
+ /*
+ * Maximum number of pages we'll have in this static
+ * memory group.
+ */
+ unsigned long max_pages;
+ } s;
+ struct {
+ /*
+ * Unit in pages in which memory is added/removed in
+ * this dynamic memory group. This granularity defines
+ * the alignment of a unit in physical address space.
+ */
+ unsigned long unit_pages;
+ } d;
+ };
+};
+
struct memory_block {
unsigned long start_section_nr;
unsigned long state; /* serialized by the dev->lock */
@@ -34,6 +70,7 @@ struct memory_block {
* lay at the beginning of the memory block.
*/
unsigned long nr_vmemmap_pages;
+ struct memory_group *group; /* group (if any) for this block */
};
int arch_get_memory_phys_device(unsigned long start_pfn);
@@ -86,7 +123,8 @@ static inline int memory_notify(unsigned long val, void *v)
extern int register_memory_notifier(struct notifier_block *nb);
extern void unregister_memory_notifier(struct notifier_block *nb);
int create_memory_block_devices(unsigned long start, unsigned long size,
- unsigned long vmemmap_pages);
+ unsigned long vmemmap_pages,
+ struct memory_group *group);
void remove_memory_block_devices(unsigned long start, unsigned long size);
extern void memory_dev_init(void);
extern int memory_notify(unsigned long val, void *v);
@@ -95,6 +133,12 @@ typedef int (*walk_memory_blocks_func_t)(struct memory_block *, void *);
extern int walk_memory_blocks(unsigned long start, unsigned long size,
void *arg, walk_memory_blocks_func_t func);
extern int for_each_memory_block(void *arg, walk_memory_blocks_func_t func);
+
+extern int register_static_memory_group(int nid, unsigned long max_pages);
+extern int register_dynamic_memory_group(int nid, unsigned long unit_pages);
+extern int unregister_memory_group(int mgid);
+struct memory_group *get_memory_group(int mgid);
+
#define CONFIG_MEM_BLOCK_SIZE (PAGES_PER_SECTION<<PAGE_SHIFT)
#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 84f05435e2ae..5c910dc2526a 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -54,7 +54,6 @@ typedef int __bitwise mhp_t;
* might be stale, or the resource might have changed.
*/
#define MHP_MERGE_RESOURCE ((__force mhp_t)BIT(0))
-
/*
* We want memmap (struct page array) to be self contained.
* To do so, we will use the beginning of the hot-added range to build
@@ -62,6 +61,11 @@ typedef int __bitwise mhp_t;
* Only selected architectures support it with SPARSE_VMEMMAP.
*/
#define MHP_MEMMAP_ON_MEMORY ((__force mhp_t)BIT(1))
+/*
+ * The nid field specifies a memory group identifier (mgid) instead. The memory
+ * group implies the nid.
+ */
+#define MHP_NID_IS_MGID ((__force mhp_t)BIT(2))
/*
* Extended parameters for memory hotplug:
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9cae42636f3e..4e039c82e7b6 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1384,6 +1384,7 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
{
struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
struct vmem_altmap mhp_altmap = {};
+ struct memory_group *group = NULL;
u64 start, size;
bool new_node = false;
int ret;
@@ -1395,6 +1396,13 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
if (ret)
return ret;
+ if (mhp_flags & MHP_NID_IS_MGID) {
+ group = get_memory_group(nid);
+ if (!group)
+ return -EINVAL;
+ nid = group->nid;
+ }
+
if (!node_possible(nid)) {
WARN(1, "node %d was absent from the node_possible_map\n", nid);
return -EINVAL;
@@ -1429,7 +1437,8 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
goto error;
/* create memory block devices after memory was added */
- ret = create_memory_block_devices(start, size, mhp_altmap.alloc);
+ ret = create_memory_block_devices(start, size, mhp_altmap.alloc,
+ group);
if (ret) {
arch_remove_memory(start, size, NULL);
goto error;
--
2.31.1
Let's use a single dynamic memory group.
Signed-off-by: David Hildenbrand <[email protected]>
---
drivers/virtio/virtio_mem.c | 22 +++++++++++++++++++---
1 file changed, 19 insertions(+), 3 deletions(-)
diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index e327fb878143..6b9b8b7bf89d 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -139,6 +139,8 @@ struct virtio_mem {
* add_memory_driver_managed().
*/
const char *resource_name;
+ /* Memory group identification. */
+ int mgid;
/*
* We don't want to add too much memory if it's not getting onlined,
@@ -622,8 +624,8 @@ static int virtio_mem_add_memory(struct virtio_mem *vm, uint64_t addr,
addr + size - 1);
/* Memory might get onlined immediately. */
atomic64_add(size, &vm->offline_size);
- rc = add_memory_driver_managed(vm->nid, addr, size, vm->resource_name,
- MHP_MERGE_RESOURCE);
+ rc = add_memory_driver_managed(vm->mgid, addr, size, vm->resource_name,
+ MHP_MERGE_RESOURCE | MHP_NID_IS_MGID);
if (rc) {
atomic64_sub(size, &vm->offline_size);
dev_warn(&vm->vdev->dev, "adding memory failed: %d\n", rc);
@@ -2550,6 +2552,7 @@ static bool virtio_mem_has_memory_added(struct virtio_mem *vm)
static int virtio_mem_probe(struct virtio_device *vdev)
{
struct virtio_mem *vm;
+ uint64_t unit_pages;
int rc;
BUILD_BUG_ON(sizeof(struct virtio_mem_req) != 24);
@@ -2584,6 +2587,16 @@ static int virtio_mem_probe(struct virtio_device *vdev)
if (rc)
goto out_del_vq;
+ /* use a single dynamic memory group to cover the whole memory device */
+ if (vm->in_sbm)
+ unit_pages = PHYS_PFN(memory_block_size_bytes());
+ else
+ unit_pages = PHYS_PFN(vm->bbm.bb_size);
+ rc = register_dynamic_memory_group(vm->nid, unit_pages);
+ if (rc < 0)
+ goto out_del_resource;
+ vm->mgid = rc;
+
/*
* If we still have memory plugged, we have to unplug all memory first.
* Registering our parent resource makes sure that this memory isn't
@@ -2598,7 +2611,7 @@ static int virtio_mem_probe(struct virtio_device *vdev)
vm->memory_notifier.notifier_call = virtio_mem_memory_notifier_cb;
rc = register_memory_notifier(&vm->memory_notifier);
if (rc)
- goto out_del_resource;
+ goto out_unreg_group;
rc = register_virtio_mem_device(vm);
if (rc)
goto out_unreg_mem;
@@ -2612,6 +2625,8 @@ static int virtio_mem_probe(struct virtio_device *vdev)
return 0;
out_unreg_mem:
unregister_memory_notifier(&vm->memory_notifier);
+out_unreg_group:
+ unregister_memory_group(vm->mgid);
out_del_resource:
virtio_mem_delete_resource(vm);
out_del_vq:
@@ -2676,6 +2691,7 @@ static void virtio_mem_remove(struct virtio_device *vdev)
} else {
virtio_mem_delete_resource(vm);
kfree_const(vm->resource_name);
+ unregister_memory_group(vm->mgid);
}
/* remove all tracking data - no locking needed */
--
2.31.1
Use memory groups to improve our "auto-movable" onlining policy:
1. For static memory groups (e.g., a DIMM), online a memory block MOVABLE
only if all other memory blocks in the group are either MOVABLE or could
be onlined MOVABLE. A DIMM will either be MOVABLE or not, not a mixture.
2. For dynamic memory groups (e.g., a virtio-mem device), online a
memory block MOVABLE only if all other memory blocks inside the
current unit are either MOVABLE or could be onlined MOVABLE. For a
virtio-mem device with a device block size with 512 MiB, all 128 MiB
memory blocks wihin a 512 MiB unit will either be MOVABLE or not, not
a mixture.
We have to pass the memory group to zone_for_pfn_range() to take the
memory group into account.
Note: for now, there seems to be no compelling reason to make this
behavior configurable.
Signed-off-by: David Hildenbrand <[email protected]>
---
drivers/base/memory.c | 18 +++++++------
include/linux/memory_hotplug.h | 3 ++-
mm/memory_hotplug.c | 48 +++++++++++++++++++++++++++++++---
3 files changed, 57 insertions(+), 12 deletions(-)
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index d8ea448e5fb8..ae70d4005fe2 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -182,7 +182,8 @@ static int memory_block_online(struct memory_block *mem)
struct zone *zone;
int ret;
- zone = zone_for_pfn_range(mem->online_type, mem->nid, start_pfn, nr_pages);
+ zone = zone_for_pfn_range(mem->online_type, mem->nid, mem->group,
+ start_pfn, nr_pages);
/*
* Although vmemmap pages have a different lifecycle than the pages
@@ -379,12 +380,13 @@ static ssize_t phys_device_show(struct device *dev,
#ifdef CONFIG_MEMORY_HOTREMOVE
static int print_allowed_zone(char *buf, int len, int nid,
+ struct memory_group *group,
unsigned long start_pfn, unsigned long nr_pages,
int online_type, struct zone *default_zone)
{
struct zone *zone;
- zone = zone_for_pfn_range(online_type, nid, start_pfn, nr_pages);
+ zone = zone_for_pfn_range(online_type, nid, group, start_pfn, nr_pages);
if (zone == default_zone)
return 0;
@@ -397,9 +399,10 @@ static ssize_t valid_zones_show(struct device *dev,
struct memory_block *mem = to_memory_block(dev);
unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
+ struct memory_group *group = mem->group;
struct zone *default_zone;
+ int nid = mem->nid;
int len = 0;
- int nid;
/*
* Check the existing zone. Make sure that we do that only on the
@@ -418,14 +421,13 @@ static ssize_t valid_zones_show(struct device *dev,
goto out;
}
- nid = mem->nid;
- default_zone = zone_for_pfn_range(MMOP_ONLINE, nid, start_pfn,
- nr_pages);
+ default_zone = zone_for_pfn_range(MMOP_ONLINE, nid, group,
+ start_pfn, nr_pages);
len += sysfs_emit_at(buf, len, "%s", default_zone->name);
- len += print_allowed_zone(buf, len, nid, start_pfn, nr_pages,
+ len += print_allowed_zone(buf, len, nid, group, start_pfn, nr_pages,
MMOP_ONLINE_KERNEL, default_zone);
- len += print_allowed_zone(buf, len, nid, start_pfn, nr_pages,
+ len += print_allowed_zone(buf, len, nid, group, start_pfn, nr_pages,
MMOP_ONLINE_MOVABLE, default_zone);
out:
len += sysfs_emit_at(buf, len, "\n");
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index f607d6677873..73d5aead39fc 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -375,7 +375,8 @@ extern void sparse_remove_section(struct mem_section *ms,
extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
unsigned long pnum);
extern struct zone *zone_for_pfn_range(int online_type, int nid,
- unsigned long start_pfn, unsigned long nr_pages);
+ struct memory_group *group, unsigned long start_pfn,
+ unsigned long nr_pages);
extern int arch_create_linear_mapping(int nid, u64 start, u64 size,
struct mhp_params *params);
void arch_remove_linear_mapping(u64 start, u64 size);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 5dacb0ed2997..5a3ad9cb48a3 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -976,12 +976,53 @@ static struct zone *default_kernel_zone_for_pfn(int nid, unsigned long start_pfn
* "present pages" is an upper limit that can get reached at runtime. As
* we base our calculations on KERNEL_EARLY, this is not an issue.
*/
-static struct zone *auto_movable_zone_for_pfn(int nid, unsigned long pfn,
+static struct zone *auto_movable_zone_for_pfn(int nid,
+ struct memory_group *group,
+ unsigned long pfn,
unsigned long nr_pages)
{
+ unsigned long online_pages = 0, max_pages, end_pfn;
+ struct page *page;
+
if (!auto_movable_ratio)
goto kernel_zone;
+ if (group && !group->is_dynamic) {
+ max_pages = group->s.max_pages;
+ online_pages = group->present_movable_pages;
+
+ /* If anything is !MOVABLE online the rest !MOVABLE. */
+ if (group->present_kernel_pages)
+ goto kernel_zone;
+ } else if (!group || group->d.unit_pages == nr_pages) {
+ max_pages = nr_pages;
+ } else {
+ max_pages = group->d.unit_pages;
+ /*
+ * Take a look at all online sections in the current unit.
+ * We can safely assume that all pages within a section belong
+ * to the same zone, because dynamic memory groups only deal
+ * with hotplugged memory.
+ */
+ pfn = ALIGN_DOWN(pfn, group->d.unit_pages);
+ end_pfn = pfn + group->d.unit_pages;
+ for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
+ page = pfn_to_online_page(pfn);
+ if (!page)
+ continue;
+ /* If anything is !MOVABLE online the rest !MOVABLE. */
+ if (page_zonenum(page) != ZONE_MOVABLE)
+ goto kernel_zone;
+ online_pages += PAGES_PER_SECTION;
+ }
+ }
+
+ /*
+ * Online MOVABLE if we could *currently* online all remaining parts
+ * MOVABLE. We expect to (add+) online them immediately next, so if
+ * nobody interferes, all will be MOVABLE if possible.
+ */
+ nr_pages = max_pages - online_pages;
if (!auto_movable_can_online_movable(NUMA_NO_NODE, nr_pages))
goto kernel_zone;
@@ -1021,7 +1062,8 @@ static inline struct zone *default_zone_for_pfn(int nid, unsigned long start_pfn
}
struct zone *zone_for_pfn_range(int online_type, int nid,
- unsigned long start_pfn, unsigned long nr_pages)
+ struct memory_group *group, unsigned long start_pfn,
+ unsigned long nr_pages)
{
if (online_type == MMOP_ONLINE_KERNEL)
return default_kernel_zone_for_pfn(nid, start_pfn, nr_pages);
@@ -1030,7 +1072,7 @@ struct zone *zone_for_pfn_range(int online_type, int nid,
return &NODE_DATA(nid)->node_zones[ZONE_MOVABLE];
if (online_policy == ONLINE_POLICY_AUTO_MOVABLE)
- return auto_movable_zone_for_pfn(nid, start_pfn, nr_pages);
+ return auto_movable_zone_for_pfn(nid, group, start_pfn, nr_pages);
return default_zone_for_pfn(nid, start_pfn, nr_pages);
}
--
2.31.1
Currently, the "auto-movable" online policy does not allow for hotplugged
KERNEL (ZONE_NORMAL) memory to increase the amount of MOVABLE memory we can
have, primarily, because there is no coordiantion across memory devices and
we don't want to create zone-imbalances accidentially when unplugging
memory.
However, within a single memory device it's different. Let's allow for
KERNEL memory within a dynamic memory group to allow for more MOVABLE
within the same memory group. The only thing we have to take care of is
that the managing driver avoids zone imbalances by unplugging MOVABLE
memory first, otherwise there can be corner cases where unplug of memory
could result in (accidential) zone imbalances.
virtio-mem is the only user of dynamic memory groups and recently added
support for prioritizing unplug of ZONE_MOVABLE over ZONE_NORMAL, so we
don't need a new toggle to enable it for dynamic memory groups.
We limit this handling to dynamic memory groups, because:
* We want to keep the runtime overhead for collecting stats when onlining
a single memory block small. We tend to have only a handful of dynamic
memory groups, but we can have quite some static memory groups (e.g., 256
DIMMs).
* It doesn't make too much sense for static memory groups, as we try
onlining all applicable memory blocks either completely to ZONE_MOVABLE
or not. In ordinary operation, we won't have a mixture of zones
within a static memory group.
When adding memory to a dynamic memory group, we'll first online memory to
ZONE_MOVABLE as long as early KERNEL memory allows for it. Then, we'll
online the next unit(s) to ZONE_NORMAL, until we can online the next
unit(s) to ZONE_MOVABLE.
For a simple virtio-mem device with a MOVABLE:KERNEL ratio of 3:1, it
will result in a layout like:
[M][M][M][M][M][M][M][M][N][M][M][M][N][M][M][M]...
^ movable memory due to early kernel memory
^ allows for more movable memory ...
^-----^ ... here
^ allows for more movable memory ...
^-----^ ... here
While the created layout is sub-optimal when it comes to contiguous zones,
it gives us the maximum flexibility when dynamically growing/shrinking a
device; we can grow small VMs really big in small steps, and still
shrink reliably to e.g., 1/4 of the maximum VM size in this example,
removing full memory blocks along with meta data more reliably.
Mark dynamic memory groups in the xarray such that we can efficiently
iterate over them when collecting stats. In usual setups, we have one
virtio-mem device per NUMA node, and usually only a small number of NUMA
nodes.
Note: for now, there seems to be no compelling reason to make this
behavior configurable.
Signed-off-by: David Hildenbrand <[email protected]>
---
drivers/base/memory.c | 25 ++++++++++++++++++
include/linux/memory.h | 3 +++
mm/memory_hotplug.c | 60 +++++++++++++++++++++++++++++++++++++++---
3 files changed, 84 insertions(+), 4 deletions(-)
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index ae70d4005fe2..b826cf96ffcd 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -86,6 +86,7 @@ static DEFINE_XARRAY(memory_blocks);
* Memory groups, indexed by memory group identification (mgid).
*/
static DEFINE_XARRAY_FLAGS(memory_groups, XA_FLAGS_ALLOC);
+#define MEMORY_GROUP_MARK_DYNAMIC XA_MARK_1
static BLOCKING_NOTIFIER_HEAD(memory_chain);
@@ -931,6 +932,8 @@ static int register_memory_group(struct memory_group group)
GFP_KERNEL);
if (ret)
kfree(new_group);
+ else if (group.is_dynamic)
+ xa_set_mark(&memory_groups, mgid, MEMORY_GROUP_MARK_DYNAMIC);
return ret ? ret : mgid;
}
@@ -987,3 +990,25 @@ struct memory_group *get_memory_group(int mgid)
{
return xa_load(&memory_groups, mgid);
}
+
+int walk_dynamic_memory_groups(int nid, walk_memory_groups_func_t func,
+ struct memory_group *excluded, void *arg)
+{
+ struct memory_group *group;
+ unsigned long index;
+ int ret = 0;
+
+ xa_for_each_marked(&memory_groups, index, group,
+ MEMORY_GROUP_MARK_DYNAMIC) {
+ if (group == excluded)
+ continue;
+#ifdef CONFIG_NUMA
+ if (nid != NUMA_NO_NODE && group->nid != nid)
+ continue;
+#endif /* CONFIG_NUMA */
+ ret = func(group, arg);
+ if (ret)
+ break;
+ }
+ return ret;
+}
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 0eceb8467d9a..c56260853caf 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -142,6 +142,9 @@ extern int register_static_memory_group(int nid, unsigned long max_pages);
extern int register_dynamic_memory_group(int nid, unsigned long unit_pages);
extern int unregister_memory_group(int mgid);
struct memory_group *get_memory_group(int mgid);
+typedef int (*walk_memory_groups_func_t)(struct memory_group *, void *);
+int walk_dynamic_memory_groups(int nid, walk_memory_groups_func_t func,
+ struct memory_group *excluded, void *arg);
#define CONFIG_MEM_BLOCK_SIZE (PAGES_PER_SECTION<<PAGE_SHIFT)
#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 5a3ad9cb48a3..61bff8f3bfb1 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -876,11 +876,44 @@ static void auto_movable_stats_account_zone(struct auto_movable_stats *stats,
stats->kernel_early_pages -= zone->cma_pages;
}
}
+struct auto_movable_group_stats {
+ unsigned long movable_pages;
+ unsigned long req_kernel_early_pages;
+};
-static bool auto_movable_can_online_movable(int nid, unsigned long nr_pages)
+static int auto_movable_stats_account_group(struct memory_group *group,
+ void *arg)
+{
+ const int ratio = READ_ONCE(auto_movable_ratio);
+ struct auto_movable_group_stats *stats = arg;
+ long pages;
+
+ /*
+ * We don't support modifying the config while the auto-movable online
+ * mode is already enabled. Just avoid the division by zero below.
+ */
+ if (!ratio)
+ return 0;
+
+ /*
+ * Calculate how many early kernel pages this group requires to
+ * satisfy the configured zone ratio.
+ */
+ pages = group->present_movable_pages * 100 / ratio;
+ pages -= group->present_kernel_pages;
+
+ if (pages > 0)
+ stats->req_kernel_early_pages += pages;
+ stats->movable_pages += group->present_movable_pages;
+ return 0;
+}
+
+static bool auto_movable_can_online_movable(int nid, struct memory_group *group,
+ unsigned long nr_pages)
{
- struct auto_movable_stats stats = {};
unsigned long kernel_early_pages, movable_pages;
+ struct auto_movable_group_stats group_stats = {};
+ struct auto_movable_stats stats = {};
pg_data_t *pgdat = NODE_DATA(nid);
struct zone *zone;
int i;
@@ -901,6 +934,21 @@ static bool auto_movable_can_online_movable(int nid, unsigned long nr_pages)
kernel_early_pages = stats.kernel_early_pages;
movable_pages = stats.movable_pages;
+ /*
+ * Kernel memory inside dynamic memory group allows for more MOVABLE
+ * memory within the same group. Remove the effect of all but the
+ * current group from the stats.
+ */
+ walk_dynamic_memory_groups(nid, auto_movable_stats_account_group,
+ group, &group_stats);
+ if (kernel_early_pages <= group_stats.req_kernel_early_pages)
+ return false;
+ kernel_early_pages -= group_stats.req_kernel_early_pages;
+ movable_pages -= group_stats.movable_pages;
+
+ if (group && group->is_dynamic)
+ kernel_early_pages += group->present_kernel_pages;
+
/*
* Test if we could online the given number of pages to ZONE_MOVABLE
* and still stay in the configured ratio.
@@ -958,6 +1006,10 @@ static struct zone *default_kernel_zone_for_pfn(int nid, unsigned long start_pfn
* with unmovable allocations). While there are corner cases where it might
* still work, it is barely relevant in practice.
*
+ * Exceptions are dynamic memory groups, which allow for more MOVABLE
+ * memory within the same memory group -- because in that case, there is
+ * coordination within the single memory device managed by a single driver.
+ *
* We rely on "present pages" instead of "managed pages", as the latter is
* highly unreliable and dynamic in virtualized environments, and does not
* consider boot time allocations. For example, memory ballooning adjusts the
@@ -1023,12 +1075,12 @@ static struct zone *auto_movable_zone_for_pfn(int nid,
* nobody interferes, all will be MOVABLE if possible.
*/
nr_pages = max_pages - online_pages;
- if (!auto_movable_can_online_movable(NUMA_NO_NODE, nr_pages))
+ if (!auto_movable_can_online_movable(NUMA_NO_NODE, group, nr_pages))
goto kernel_zone;
#ifdef CONFIG_NUMA
if (auto_movable_numa_aware &&
- !auto_movable_can_online_movable(nid, nr_pages))
+ !auto_movable_can_online_movable(nid, group, nr_pages))
goto kernel_zone;
#endif /* CONFIG_NUMA */
--
2.31.1
Let's track all present pages in each memory group. Especially, track
memory present in ZONE_MOVABLE and memory present in one of the kernel
zones (which really only is ZONE_NORMAL right now as memory groups only
apply to hotplugged memory) separate;y within a memory group, to prepare
for making smart auto-online decision for individualmemory blocks within a
memory group based on group statistics.
Signed-off-by: David Hildenbrand <[email protected]>
---
drivers/base/memory.c | 10 +++++-----
include/linux/memory.h | 4 ++++
include/linux/memory_hotplug.h | 13 +++++++++----
mm/memory_hotplug.c | 19 ++++++++++++++-----
4 files changed, 32 insertions(+), 14 deletions(-)
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 00c58a6632a6..d8ea448e5fb8 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -198,7 +198,7 @@ static int memory_block_online(struct memory_block *mem)
}
ret = online_pages(start_pfn + nr_vmemmap_pages,
- nr_pages - nr_vmemmap_pages, zone);
+ nr_pages - nr_vmemmap_pages, zone, mem->group);
if (ret) {
if (nr_vmemmap_pages)
mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
@@ -210,7 +210,7 @@ static int memory_block_online(struct memory_block *mem)
* now already properly populated.
*/
if (nr_vmemmap_pages)
- adjust_present_page_count(pfn_to_page(start_pfn),
+ adjust_present_page_count(pfn_to_page(start_pfn), mem->group,
nr_vmemmap_pages);
return ret;
@@ -228,16 +228,16 @@ static int memory_block_offline(struct memory_block *mem)
* can properly be torn down in offline_pages().
*/
if (nr_vmemmap_pages)
- adjust_present_page_count(pfn_to_page(start_pfn),
+ adjust_present_page_count(pfn_to_page(start_pfn), mem->group,
-nr_vmemmap_pages);
ret = offline_pages(start_pfn + nr_vmemmap_pages,
- nr_pages - nr_vmemmap_pages);
+ nr_pages - nr_vmemmap_pages, mem->group);
if (ret) {
/* offline_pages() failed. Account back. */
if (nr_vmemmap_pages)
adjust_present_page_count(pfn_to_page(start_pfn),
- nr_vmemmap_pages);
+ mem->group, nr_vmemmap_pages);
return ret;
}
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 6e20a6174fe5..0eceb8467d9a 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -28,6 +28,10 @@ struct memory_group {
int nid;
/* References from memory blocks + 1. */
refcount_t refcount;
+ /* Present (online) memory outside ZONE_MOVABLE of this memory group. */
+ unsigned long present_kernel_pages;
+ /* Present (online) memory in ZONE_MOVABLE of this memory group. */
+ unsigned long present_movable_pages;
/*
* Memory group type: static vs. dynamic.
*
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 5c910dc2526a..f607d6677873 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -12,6 +12,7 @@ struct zone;
struct pglist_data;
struct mem_section;
struct memory_block;
+struct memory_group;
struct resource;
struct vmem_altmap;
@@ -111,13 +112,15 @@ static inline void zone_seqlock_init(struct zone *zone)
extern int zone_grow_free_lists(struct zone *zone, unsigned long new_nr_pages);
extern int zone_grow_waitqueues(struct zone *zone, unsigned long nr_pages);
extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
-extern void adjust_present_page_count(struct page *page, long nr_pages);
+extern void adjust_present_page_count(struct page *page,
+ struct memory_group *group,
+ long nr_pages);
/* VM interface that may be used by firmware interface */
extern int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
struct zone *zone);
extern void mhp_deinit_memmap_on_memory(unsigned long pfn, unsigned long nr_pages);
extern int online_pages(unsigned long pfn, unsigned long nr_pages,
- struct zone *zone);
+ struct zone *zone, struct memory_group *group);
extern struct zone *test_pages_in_a_zone(unsigned long start_pfn,
unsigned long end_pfn);
extern void __offline_isolated_pages(unsigned long start_pfn,
@@ -322,7 +325,8 @@ static inline void pgdat_resize_init(struct pglist_data *pgdat) {}
#ifdef CONFIG_MEMORY_HOTREMOVE
extern void try_offline_node(int nid);
-extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
+extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
+ struct memory_group *group);
extern int remove_memory(u64 start, u64 size);
extern void __remove_memory(u64 start, u64 size);
extern int offline_and_remove_memory(u64 start, u64 size);
@@ -330,7 +334,8 @@ extern int offline_and_remove_memory(u64 start, u64 size);
#else
static inline void try_offline_node(int nid) {}
-static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
+static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
+ struct memory_group *group)
{
return -EINVAL;
}
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 4e039c82e7b6..5dacb0ed2997 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1039,9 +1039,11 @@ struct zone *zone_for_pfn_range(int online_type, int nid,
* This function should only be called by memory_block_{online,offline},
* and {online,offline}_pages.
*/
-void adjust_present_page_count(struct page *page, long nr_pages)
+void adjust_present_page_count(struct page *page, struct memory_group *group,
+ long nr_pages)
{
struct zone *zone = page_zone(page);
+ const bool movable = zone_idx(zone) == ZONE_MOVABLE;
unsigned long flags;
/*
@@ -1054,6 +1056,11 @@ void adjust_present_page_count(struct page *page, long nr_pages)
pgdat_resize_lock(zone->zone_pgdat, &flags);
zone->zone_pgdat->node_present_pages += nr_pages;
pgdat_resize_unlock(zone->zone_pgdat, &flags);
+
+ if (group && movable)
+ group->present_movable_pages += nr_pages;
+ else if (group && !movable)
+ group->present_kernel_pages += nr_pages;
}
int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
@@ -1099,7 +1106,8 @@ void mhp_deinit_memmap_on_memory(unsigned long pfn, unsigned long nr_pages)
kasan_remove_zero_shadow(__va(PFN_PHYS(pfn)), PFN_PHYS(nr_pages));
}
-int __ref online_pages(unsigned long pfn, unsigned long nr_pages, struct zone *zone)
+int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
+ struct zone *zone, struct memory_group *group)
{
unsigned long flags;
int need_zonelists_rebuild = 0;
@@ -1152,7 +1160,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, struct zone *z
}
online_pages_range(pfn, nr_pages);
- adjust_present_page_count(pfn_to_page(pfn), nr_pages);
+ adjust_present_page_count(pfn_to_page(pfn), group, nr_pages);
node_states_set_node(nid, &arg);
if (need_zonelists_rebuild)
@@ -1896,7 +1904,8 @@ static int count_system_ram_pages_cb(unsigned long start_pfn,
return 0;
}
-int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
+int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
+ struct memory_group *group)
{
const unsigned long end_pfn = start_pfn + nr_pages;
unsigned long pfn, system_ram_pages = 0;
@@ -2032,7 +2041,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
/* removal success */
adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
- adjust_present_page_count(pfn_to_page(start_pfn), -nr_pages);
+ adjust_present_page_count(pfn_to_page(start_pfn), group, -nr_pages);
init_per_zone_wmark_min();
--
2.31.1
On Mon, Jun 07, 2021 at 09:54:22PM +0200, David Hildenbrand wrote:
> The parameter is unused, let's remove it.
>
> Cc: Catalin Marinas <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Michael Ellerman <[email protected]>
> Cc: Benjamin Herrenschmidt <[email protected]>
> Cc: Paul Mackerras <[email protected]>
> Cc: Heiko Carstens <[email protected]>
> Cc: Vasily Gorbik <[email protected]>
> Cc: Christian Borntraeger <[email protected]>
> Cc: Yoshinori Sato <[email protected]>
> Cc: Rich Felker <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: [email protected]
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Anshuman Khandual <[email protected]>
> Cc: Ard Biesheuvel <[email protected]>
> Cc: Mike Rapoport <[email protected]>
> Cc: Nicholas Piggin <[email protected]>
> Cc: Pavel Tatashin <[email protected]>
> Cc: Baoquan He <[email protected]>
> Cc: Laurent Dufour <[email protected]>
> Cc: Sergei Trofimovich <[email protected]>
> Cc: Kefeng Wang <[email protected]>
> Cc: Michel Lespinasse <[email protected]>
> Cc: Christophe Leroy <[email protected]>
> Cc: "Aneesh Kumar K.V" <[email protected]>
> Cc: Thiago Jung Bauermann <[email protected]>
> Cc: Joe Perches <[email protected]>
> Cc: Pierre Morel <[email protected]>
> Cc: Jia He <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> arch/arm64/mm/mmu.c | 3 +--
Acked-by: Catalin Marinas <[email protected]>
On Mon, Jun 07, 2021 at 09:54:18PM +0200, David Hildenbrand wrote:
> Hi,
>
> this series aims at improving in-kernel auto-online support. It tackles the
> fundamental problems that:
Hi David,
the idea sounds good to me, and I like that this series takes away part of the
responsability from the user to know where the memory should go.
I think the kernel is a much better fit for that as it has all the required
information to balance things.
I also glanced over the series and besides some things here and there the
whole approach looks sane.
I plan to have a look into it in a few days, just have some high level questions
for the time being:
> 1) We can create zone imbalances when onlining all memory blindly to
> ZONE_MOVABLE, in the worst case crashing the system. We have to know
> upfront how much memory we are going to hotplug such that we can
> safely enable auto-onlining of all hotplugged memory to ZONE_MOVABLE
> via "online_movable". This is far from practical and only applicable in
> limited setups -- like inside VMs under the RHV/oVirt hypervisor which
> will never hotplug more than 3 times the boot memory (and the
> limitation is only in place due to the Linux limitation).
Could you give more insight about the problems created by zone imbalances (e.g:
a lot of movable memory and little kernel memory).
> 2) We see more setups that implement dynamic VM resizing, hot(un)plugging
> memory to resize VM memory. In these setups, we might hotplug a lot of
> memory, but it might happen in various small steps in both directions
> (e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...). virtio-mem is the
> primary driver of this upstream right now, performing such dynamic
> resizing NUMA-aware via multiple virtio-mem devices.
>
> Onlining all hotplugged memory to ZONE_NORMAL means we basically have
> no hotunplug guarantees. Onlining all to ZONE_MOVABLE means we can
> easily run into zone imbalances when growing a VM. We want a mixture,
> and we want as much memory as reasonable/configured in ZONE_MOVABLE.
>
> 3) Memory devices consist of 1..X memory block devices, however, the
> kernel doesn't really track the relationship. Consequently, also user
> space has no idea. We want to make per-device decisions. As one
> example, for memory hotunplug it doesn't make sense to use a mixture of
> zones within a single DIMM: we want all MOVABLE if possible, otherwise
> all !MOVABLE, because any !MOVABLE part will easily block the DIMM from
> getting hotunplugged. As another example, virtio-mem operates on
> individual units that span 1..X memory blocks. Similar to a DIMM, we
> want a unit to either be all MOVABLE or !MOVABLE. Further, we want
> as much memory of a virtio-mem device to be MOVABLE as possible.
So, a virtio-mem unit could be seen as DIMM right?
> 4) We want memory onlining to be done right from the kernel while adding
> memory; for example, this is reqired for fast memory hotplug for
> drivers that add individual memory blocks, like virito-mem. We want a
> way to configure a policy in the kernel and avoid implementing advanced
> policies in user space.
"we want memory onlining to be done right from the kernel while adding memory"
is not that always the case when a driver adds memory? User has no interaction
with that right?
> The auto-onlining support we have in the kernel is not sufficient. All we
> have is a) online everything movable (online_movable) b) online everything
> !movable (online_kernel) c) keep zones contiguous (online). This series
> allows configuring c) to mean instead "online movable if possible according
> to the coniguration, driven by a maximum MOVABLE:KERNEL ratio" -- a new
> onlining policy.
>
> This series does 3 things:
>
> 1) Introduces the "auto-movable" online policy that initially operates on
> individual memory blocks only. It uses a maximum MOVABLE:KERNEL ratio
> to make a decision whether a memory block will be onlined to
> ZONE_MOVABLE or not. However, in the basic form, hotplugged KERNEL
> memory does not allow for more MOVABLE memory (details in the
> patches). CMA memory is treated like MOVABLE memory.
How a user would know which ratio is sane? Could we add some info in the
Docu part that kinda sets some "basic" rules?
> 2) Introduces static (e.g., DIMM) and dynamic (e.g., virtio-mem) memory
> groups and uses group information to make decisions in the
> "auto-movable" online policy accross memory blocks of a single memory
> device (modeled as memory group).
So, the distinction being that a DIMM cannot grow larger but we can add more
memory to a virtio-mem unit? I feel I am missing some insight here.
> 3) Maximizes ZONE_MOVABLE memory within dynamic memory groups, by
> allowing ZONE_NORMAL memory within a dynamic memory group to allow for
> more ZONE_MOVABLE memory within the same memory group. The target use
> case is dynamic VM resizing using virtio-mem.
Sorry, I got lost in this one. Care to explain a bit more?
> The target usage will be:
>
> 1) Linux boots with "mhp_default_online_type=offline"
>
> 2) User space (e.g., systemd unit) configures memory onlining (according
> to a config file and system properties), for example:
> * Setting memory_hotplug.online_policy=auto-movable
> * Setting memory_hotplug.auto_movable_ratio=301
> * Setting memory_hotplug.auto_movable_numa_aware=true
I think we would need to document those in order to let the user know what
it is best for them. e.g: when do we want to enable auto_movable_numa_aware etc.
> For DIMMs, hotplugging 4 GiB DIMMs to a 4 GiB VM with a configured ratio of
> 301% results in the following layout:
> Memory block 1-15: DMA32 (early)
> Memory block 32-47: Normal (early)
> Memory block 48-79: Movable (DIMM 0)
> Memory block 80-111: Movable (DIMM 1)
> Memory block 112-143: Movable (DIMM 2)
> Memory block 144-275: Normal (DIMM 3)
> Memory block 176-207: Normal (DIMM 4)
> ... all Normal
> (-> hotplugged Normal memory does not allow for more Movable memory)
Uhm, I am sorry for being dense here:
On x86_64, 4GB = 32 sections (of 128MB each). Why the memblock span from #1 to #47?
> For virtio-mem, using a simple, single virtio-mem device with a 4 GiB VM
> will result in the following layout:
> Memory block 1-15: DMA32 (early)
> Memory block 32-47: Normal (early)
> Memory block 48-143: Movable (virtio-mem, first 12 GiB)
> Memory block 144: Normal (virtio-mem, next 128 MiB)
> Memory block 145-147: Movable (virtio-mem, next 384 MiB)
> Memory block 148: Normal (virtio-mem, next 128 MiB)
> Memory block 149-151: Movable (virtio-mem, next 384 MiB)
> ... Normal/Movable mixture as above
> (-> hotplugged Normal memory allows for more Movable memory within
> the same device)
>
> Which gives us maximum flexibility when dynamically growing/shrinking a
> VM in smaller steps. When shrinking, virtio-mem will prioritize unplug of
> MOVABLE memory with [1] sent last week, such that we won't accidentially
> trigger zone imbalances in more complicated setups that involve multiple
> virtio-mem devices.
--
Oscar Salvador
SUSE L3
On 08.06.21 11:42, Oscar Salvador wrote:
> On Mon, Jun 07, 2021 at 09:54:18PM +0200, David Hildenbrand wrote:
>> Hi,
>>
>> this series aims at improving in-kernel auto-online support. It tackles the
>> fundamental problems that:
>
> Hi David,
>
> the idea sounds good to me, and I like that this series takes away part of the
> responsability from the user to know where the memory should go.
> I think the kernel is a much better fit for that as it has all the required
> information to balance things.
>
> I also glanced over the series and besides some things here and there the
> whole approach looks sane.
> I plan to have a look into it in a few days, just have some high level questions
> for the time being:
Hi Oscar,
>
>> 1) We can create zone imbalances when onlining all memory blindly to
>> ZONE_MOVABLE, in the worst case crashing the system. We have to know
>> upfront how much memory we are going to hotplug such that we can
>> safely enable auto-onlining of all hotplugged memory to ZONE_MOVABLE
>> via "online_movable". This is far from practical and only applicable in
>> limited setups -- like inside VMs under the RHV/oVirt hypervisor which
>> will never hotplug more than 3 times the boot memory (and the
>> limitation is only in place due to the Linux limitation).
>
> Could you give more insight about the problems created by zone imbalances (e.g:
> a lot of movable memory and little kernel memory).
I just updated memory-hotplug.rst exactly for that purpose :)
https://lkml.kernel.org/r/[email protected]
There, also safe zone ratios and "usually well known values" are given.
I can link it in the next cover letter.
>
>> 2) We see more setups that implement dynamic VM resizing, hot(un)plugging
>> memory to resize VM memory. In these setups, we might hotplug a lot of
>> memory, but it might happen in various small steps in both directions
>> (e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...). virtio-mem is the
>> primary driver of this upstream right now, performing such dynamic
>> resizing NUMA-aware via multiple virtio-mem devices.
>>
>> Onlining all hotplugged memory to ZONE_NORMAL means we basically have
>> no hotunplug guarantees. Onlining all to ZONE_MOVABLE means we can
>> easily run into zone imbalances when growing a VM. We want a mixture,
>> and we want as much memory as reasonable/configured in ZONE_MOVABLE.
>>
>> 3) Memory devices consist of 1..X memory block devices, however, the
>> kernel doesn't really track the relationship. Consequently, also user
>> space has no idea. We want to make per-device decisions. As one
>> example, for memory hotunplug it doesn't make sense to use a mixture of
>> zones within a single DIMM: we want all MOVABLE if possible, otherwise
>> all !MOVABLE, because any !MOVABLE part will easily block the DIMM from
>> getting hotunplugged. As another example, virtio-mem operates on
>> individual units that span 1..X memory blocks. Similar to a DIMM, we
>> want a unit to either be all MOVABLE or !MOVABLE. Further, we want
>> as much memory of a virtio-mem device to be MOVABLE as possible.
>
> So, a virtio-mem unit could be seen as DIMM right?
It's a bit more complicated. Each individual unit (e.g., a 128 MiB
memory block) is the smallest granularity we can add/remove of that
device. So such a unit is somewhat like a DIMM. However, all "units" of
the device can interact -- it's a single memory device.
>
>> 4) We want memory onlining to be done right from the kernel while adding
>> memory; for example, this is reqired for fast memory hotplug for
>> drivers that add individual memory blocks, like virito-mem. We want a
>> way to configure a policy in the kernel and avoid implementing advanced
>> policies in user space.
>
> "we want memory onlining to be done right from the kernel while adding memory"
>
> is not that always the case when a driver adds memory? User has no interaction
> with that right?
Well, with auto-onlining in the kernel disabled, user space has to do
the onlining -- for example via udev rules right now in major distributions.
But there are also users that always want to online manually in user
space to select a zone. Most prominently standby memory on s390x, but
also in some cases dax/kmem memory. But these two are really corner
cases. In general, we want hotplugged memory to be onlined immediately.
>
>> The auto-onlining support we have in the kernel is not sufficient. All we
>> have is a) online everything movable (online_movable) b) online everything
>> !movable (online_kernel) c) keep zones contiguous (online). This series
>> allows configuring c) to mean instead "online movable if possible according
>> to the coniguration, driven by a maximum MOVABLE:KERNEL ratio" -- a new
>> onlining policy.
>>
>> This series does 3 things:
>>
>> 1) Introduces the "auto-movable" online policy that initially operates on
>> individual memory blocks only. It uses a maximum MOVABLE:KERNEL ratio
>> to make a decision whether a memory block will be onlined to
>> ZONE_MOVABLE or not. However, in the basic form, hotplugged KERNEL
>> memory does not allow for more MOVABLE memory (details in the
>> patches). CMA memory is treated like MOVABLE memory.
>
> How a user would know which ratio is sane? Could we add some info in the
> Docu part that kinda sets some "basic" rules?
Again, currently resides in the memory-hotplug.rst overhaul.
>
>> 2) Introduces static (e.g., DIMM) and dynamic (e.g., virtio-mem) memory
>> groups and uses group information to make decisions in the
>> "auto-movable" online policy accross memory blocks of a single memory
>> device (modeled as memory group).
>
> So, the distinction being that a DIMM cannot grow larger but we can add more
> memory to a virtio-mem unit? I feel I am missing some insight here.
Right, the relevant patch contains more info.
You either plug or unplug a DIMM (or a NUMA node which spans multiple
DIMMS) -- both are ACPI memory devices that span multiple physical
regions. You cannot unplug parts of a DIMM or grow it. "static" as also
expressed by ACPI code ("adds" and "removes" all memory device memory in
one go).
virtio-mem behaves differently, as it's a single physical memory region
in which we dynamically add or remove memory. The granularity in which
we add/remove memory from Linux is a "unit". In the simplest case, it's
just a single memory block (e.g., 128 MiB). So it's a memory device that
can grow/shrink in the given unit -- "dynamic".
>
>> 3) Maximizes ZONE_MOVABLE memory within dynamic memory groups, by
>> allowing ZONE_NORMAL memory within a dynamic memory group to allow for
>> more ZONE_MOVABLE memory within the same memory group. The target use
>> case is dynamic VM resizing using virtio-mem.
>
> Sorry, I got lost in this one. Care to explain a bit more?
The virtio-mem example below should make this a bit more clearer (in
addition to the relevant patch), especially in contrast to static memory
devices like DIMMs. Key is that a single virtio-mem device is a "dynamic
memory group" in which memory can get added/removed dynamically in a
given unit granularity. And we want to special case that type of device
to have as much memory of a virtio-mem device being MOVABLE as possible
(and configured).
>
>> The target usage will be:
>>
>> 1) Linux boots with "mhp_default_online_type=offline"
>>
>> 2) User space (e.g., systemd unit) configures memory onlining (according
>> to a config file and system properties), for example:
>> * Setting memory_hotplug.online_policy=auto-movable
>> * Setting memory_hotplug.auto_movable_ratio=301
>> * Setting memory_hotplug.auto_movable_numa_aware=true
>
> I think we would need to document those in order to let the user know what
> it is best for them. e.g: when do we want to enable auto_movable_numa_aware etc.
Yes, as mentioned below, an memory-hotplug.rst update will follow once
the overhaul is done. The respective patch contains more information.
>
>> For DIMMs, hotplugging 4 GiB DIMMs to a 4 GiB VM with a configured ratio of
>> 301% results in the following layout:
>> Memory block 1-15: DMA32 (early)
>> Memory block 32-47: Normal (early)
>> Memory block 48-79: Movable (DIMM 0)
>> Memory block 80-111: Movable (DIMM 1)
>> Memory block 112-143: Movable (DIMM 2)
>> Memory block 144-275: Normal (DIMM 3)
>> Memory block 176-207: Normal (DIMM 4)
>> ... all Normal
>> (-> hotplugged Normal memory does not allow for more Movable memory)
>
> Uhm, I am sorry for being dense here:
>
> On x86_64, 4GB = 32 sections (of 128MB each). Why the memblock span from #1 to #47?
Sorry, it's actually "Memory block 0-15", which gives us 0-15 and 32-47
== 32 memory blocks corresponding to boot memory. Note that the absent
memory blocks 16-31 should correspond to the PCI hole.
Thanks Oscar!
--
Thanks,
David / dhildenb
David Hildenbrand <[email protected]> writes:
> The parameter is unused, let's remove it.
>
> Cc: Catalin Marinas <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Michael Ellerman <[email protected]>
> Cc: Benjamin Herrenschmidt <[email protected]>
> Cc: Paul Mackerras <[email protected]>
> Cc: Heiko Carstens <[email protected]>
> Cc: Vasily Gorbik <[email protected]>
> Cc: Christian Borntraeger <[email protected]>
> Cc: Yoshinori Sato <[email protected]>
> Cc: Rich Felker <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: [email protected]
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Anshuman Khandual <[email protected]>
> Cc: Ard Biesheuvel <[email protected]>
> Cc: Mike Rapoport <[email protected]>
> Cc: Nicholas Piggin <[email protected]>
> Cc: Pavel Tatashin <[email protected]>
> Cc: Baoquan He <[email protected]>
> Cc: Laurent Dufour <[email protected]>
> Cc: Sergei Trofimovich <[email protected]>
> Cc: Kefeng Wang <[email protected]>
> Cc: Michel Lespinasse <[email protected]>
> Cc: Christophe Leroy <[email protected]>
> Cc: "Aneesh Kumar K.V" <[email protected]>
> Cc: Thiago Jung Bauermann <[email protected]>
> Cc: Joe Perches <[email protected]>
> Cc: Pierre Morel <[email protected]>
> Cc: Jia He <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> arch/arm64/mm/mmu.c | 3 +--
> arch/ia64/mm/init.c | 3 +--
> arch/powerpc/mm/mem.c | 3 +--
> arch/s390/mm/init.c | 3 +--
> arch/sh/mm/init.c | 3 +--
> arch/x86/mm/init_32.c | 3 +--
> arch/x86/mm/init_64.c | 3 +--
> include/linux/memory_hotplug.h | 3 +--
> mm/memory_hotplug.c | 4 ++--
> mm/memremap.c | 5 +----
> 10 files changed, 11 insertions(+), 22 deletions(-)
>
...
> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
> index 043bbeaf407c..fc5c36189c26 100644
> --- a/arch/powerpc/mm/mem.c
> +++ b/arch/powerpc/mm/mem.c
> @@ -115,8 +115,7 @@ int __ref arch_add_memory(int nid, u64 start, u64 size,
> return rc;
> }
>
> -void __ref arch_remove_memory(int nid, u64 start, u64 size,
> - struct vmem_altmap *altmap)
> +void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
> {
> unsigned long start_pfn = start >> PAGE_SHIFT;
> unsigned long nr_pages = size >> PAGE_SHIFT;
Acked-by: Michael Ellerman <[email protected]> (powerpc)
cheers
On 08.06.21 13:11, Michael Ellerman wrote:
> David Hildenbrand <[email protected]> writes:
>> There is only a single user remaining. We can simply try to offline all
>> online nodes - which is fast, because we usually span pages and can skip
>> such nodes right away.
>
> That makes me slightly nervous, because our big powerpc boxes tend to
> trip on these scaling issues before others.
>
> But the spanned pages check is just:
>
> void try_offline_node(int nid)
> {
> pg_data_t *pgdat = NODE_DATA(nid);
> ...
> if (pgdat->node_spanned_pages)
> return;
>
> So I guess that's pretty cheap, and it's only O(nodes), which should
> never get that big.
Exactly. And if it does turn out to be a problem, we can walk all memory
blocks before removing them, collecting the nid(s).
--
Thanks,
David / dhildenb
On Mon, Jun 7, 2021 at 9:55 PM David Hildenbrand <[email protected]> wrote:
>
> Let's group all memory we add for a single memory device - we want a
> single node for that (which also seems to be the sane thing to do).
>
> We won't care for now about memory that was already added to the system
> (e.g., via e820) -- usually *all* memory of a memory device was already
> added and we'll fail acpi_memory_enable_device().
>
> Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
> ---
> drivers/acpi/acpi_memhotplug.c | 35 +++++++++++++++++++++++++++++-----
> 1 file changed, 30 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> index eb4faf7c5cad..0c7b062c0e5d 100644
> --- a/drivers/acpi/acpi_memhotplug.c
> +++ b/drivers/acpi/acpi_memhotplug.c
> @@ -54,6 +54,7 @@ struct acpi_memory_info {
> struct acpi_memory_device {
> struct acpi_device *device;
> struct list_head res_list;
> + int mgid;
> };
>
> static acpi_status
> @@ -171,10 +172,31 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
> acpi_handle handle = mem_device->device->handle;
> int result, num_enabled = 0;
> struct acpi_memory_info *info;
> - mhp_t mhp_flags = MHP_NONE;
> - int node;
> + mhp_t mhp_flags = MHP_NID_IS_MGID;
> + u64 total_length = 0;
> + int node, mgid;
>
> node = acpi_get_node(handle);
> +
> + list_for_each_entry(info, &mem_device->res_list, list) {
> + if (!info->length)
> + continue;
> + /* We want a single node for the whole memory group */
> + if (node < 0)
> + node = memory_add_physaddr_to_nid(info->start_addr);
> + total_length += info->length;
> + }
> +
> + if (!total_length) {
> + dev_err(&mem_device->device->dev, "device is empty\n");
> + return -EINVAL;
> + }
> +
> + mgid = register_static_memory_group(node, PFN_UP(total_length));
> + if (mgid < 0)
> + return mgid;
> + mem_device->mgid = mgid;
> +
> /*
> * Tell the VM there is more memory here...
> * Note: Assume that this function returns zero on success
> @@ -188,12 +210,10 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
> */
> if (!info->length)
> continue;
> - if (node < 0)
> - node = memory_add_physaddr_to_nid(info->start_addr);
>
> if (mhp_supports_memmap_on_memory(info->length))
> mhp_flags |= MHP_MEMMAP_ON_MEMORY;
> - result = __add_memory(node, info->start_addr, info->length,
> + result = __add_memory(mgid, info->start_addr, info->length,
> mhp_flags);
>
> /*
> @@ -253,6 +273,10 @@ static void acpi_memory_device_free(struct acpi_memory_device *mem_device)
> if (!mem_device)
> return;
>
> + /* In case we succeeded adding *some* memory, unregistering fails. */
> + if (mem_device->mgid >= 0)
> + unregister_memory_group(mem_device->mgid);
> +
> acpi_memory_free_device_resources(mem_device);
> mem_device->device->driver_data = NULL;
> kfree(mem_device);
> @@ -273,6 +297,7 @@ static int acpi_memory_device_add(struct acpi_device *device,
>
> INIT_LIST_HEAD(&mem_device->res_list);
> mem_device->device = device;
> + mem_device->mgid = -1;
> sprintf(acpi_device_name(device), "%s", ACPI_MEMORY_DEVICE_NAME);
> sprintf(acpi_device_class(device), "%s", ACPI_MEMORY_DEVICE_CLASS);
> device->driver_data = mem_device;
> --
> 2.31.1
>
On Mon, Jun 7, 2021 at 9:55 PM David Hildenbrand <[email protected]> wrote:
>
> We allocate + initialize everything from scratch. In case enabling the
> device fails, we free all memory resourcs.
>
> Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
> ---
> drivers/acpi/acpi_memhotplug.c | 4 ----
> 1 file changed, 4 deletions(-)
>
> diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> index 1d01d9414c40..eb4faf7c5cad 100644
> --- a/drivers/acpi/acpi_memhotplug.c
> +++ b/drivers/acpi/acpi_memhotplug.c
> @@ -182,10 +182,6 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
> * (i.e. memory-hot-remove function)
> */
> list_for_each_entry(info, &mem_device->res_list, list) {
> - if (info->enabled) { /* just sanity check...*/
> - num_enabled++;
> - continue;
> - }
> /*
> * If the memory block size is zero, please ignore it.
> * Don't try to do the following memory hotplug flowchart.
> --
> 2.31.1
>
David Hildenbrand <[email protected]> writes:
> There is only a single user remaining. We can simply try to offline all
> online nodes - which is fast, because we usually span pages and can skip
> such nodes right away.
That makes me slightly nervous, because our big powerpc boxes tend to
trip on these scaling issues before others.
But the spanned pages check is just:
void try_offline_node(int nid)
{
pg_data_t *pgdat = NODE_DATA(nid);
...
if (pgdat->node_spanned_pages)
return;
So I guess that's pretty cheap, and it's only O(nodes), which should
never get that big.
> Cc: Michael Ellerman <[email protected]>
> Cc: Benjamin Herrenschmidt <[email protected]>
> Cc: Paul Mackerras <[email protected]>
> Cc: "Rafael J. Wysocki" <[email protected]>
> Cc: Len Brown <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Vishal Verma <[email protected]>
> Cc: Dave Jiang <[email protected]>
> Cc: "Michael S. Tsirkin" <[email protected]>
> Cc: Jason Wang <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Nathan Lynch <[email protected]>
> Cc: Laurent Dufour <[email protected]>
> Cc: "Aneesh Kumar K.V" <[email protected]>
> Cc: Scott Cheloha <[email protected]>
> Cc: Anton Blanchard <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> .../platforms/pseries/hotplug-memory.c | 9 ++++-----
> drivers/acpi/acpi_memhotplug.c | 7 +------
> drivers/dax/kmem.c | 3 +--
> drivers/virtio/virtio_mem.c | 4 ++--
> include/linux/memory_hotplug.h | 10 +++++-----
> mm/memory_hotplug.c | 20 +++++++++----------
> 6 files changed, 23 insertions(+), 30 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
> index 8377f1f7c78e..4a9232ddbefe 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> @@ -286,7 +286,7 @@ static int pseries_remove_memblock(unsigned long base, unsigned long memblock_si
> {
> unsigned long block_sz, start_pfn;
> int sections_per_block;
> - int i, nid;
> + int i;
>
> start_pfn = base >> PAGE_SHIFT;
>
> @@ -297,10 +297,9 @@ static int pseries_remove_memblock(unsigned long base, unsigned long memblock_si
>
> block_sz = pseries_memory_block_size();
> sections_per_block = block_sz / MIN_MEMORY_BLOCK_SIZE;
> - nid = memory_add_physaddr_to_nid(base);
>
> for (i = 0; i < sections_per_block; i++) {
> - __remove_memory(nid, base, MIN_MEMORY_BLOCK_SIZE);
> + __remove_memory(base, MIN_MEMORY_BLOCK_SIZE);
> base += MIN_MEMORY_BLOCK_SIZE;
> }
>
> @@ -386,7 +385,7 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
>
> block_sz = pseries_memory_block_size();
>
> - __remove_memory(mem_block->nid, lmb->base_addr, block_sz);
> + __remove_memory(lmb->base_addr, block_sz);
> put_device(&mem_block->dev);
>
> /* Update memory regions for memory remove */
> @@ -638,7 +637,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
>
> rc = dlpar_online_lmb(lmb);
> if (rc) {
> - __remove_memory(nid, lmb->base_addr, block_sz);
> + __remove_memory(lmb->base_addr, block_sz);
> invalidate_lmb_associativity_index(lmb);
> } else {
> lmb->flags |= DRCONF_MEM_ASSIGNED;
Acked-by: Michael Ellerman <[email protected]> (powerpc)
cheers
On Mon, Jun 07, 2021 at 09:54:22PM +0200, David Hildenbrand wrote:
> The parameter is unused, let's remove it.
>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> arch/arm64/mm/mmu.c | 3 +--
> arch/ia64/mm/init.c | 3 +--
> arch/powerpc/mm/mem.c | 3 +--
> arch/s390/mm/init.c | 3 +--
> arch/sh/mm/init.c | 3 +--
> arch/x86/mm/init_32.c | 3 +--
> arch/x86/mm/init_64.c | 3 +--
> include/linux/memory_hotplug.h | 3 +--
> mm/memory_hotplug.c | 4 ++--
> mm/memremap.c | 5 +----
> 10 files changed, 11 insertions(+), 22 deletions(-)
For s390:
Acked-by: Heiko Carstens <[email protected]>
On 08.06.21 13:18, David Hildenbrand wrote:
> On 08.06.21 13:11, Michael Ellerman wrote:
>> David Hildenbrand <[email protected]> writes:
>>> There is only a single user remaining. We can simply try to offline all
>>> online nodes - which is fast, because we usually span pages and can skip
>>> such nodes right away.
>>
>> That makes me slightly nervous, because our big powerpc boxes tend to
>> trip on these scaling issues before others.
>>
>> But the spanned pages check is just:
>>
>> void try_offline_node(int nid)
>> {
>> pg_data_t *pgdat = NODE_DATA(nid);
>> ...
>> if (pgdat->node_spanned_pages)
>> return;
>>
>> So I guess that's pretty cheap, and it's only O(nodes), which should
>> never get that big.
>
> Exactly. And if it does turn out to be a problem, we can walk all memory
> blocks before removing them, collecting the nid(s).
>
I might just do the following on top:
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 61bff8f3bfb1..bbc26fdac364 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -2176,7 +2176,9 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
{
int ret = !is_memblock_offlined(mem);
+ int *nid = arg;
+ *nid = mem->nid;
if (unlikely(ret)) {
phys_addr_t beginpa, endpa;
@@ -2271,10 +2273,10 @@ EXPORT_SYMBOL(try_offline_node);
static int __ref try_remove_memory(u64 start, u64 size)
{
- int rc = 0, nid;
struct vmem_altmap mhp_altmap = {};
struct vmem_altmap *altmap = NULL;
unsigned long nr_vmemmap_pages;
+ int rc = 0, nid = NUMA_NO_NODE;
BUG_ON(check_hotplug_memory_range(start, size));
@@ -2282,8 +2284,12 @@ static int __ref try_remove_memory(u64 start, u64 size)
* All memory blocks must be offlined before removing memory. Check
* whether all memory blocks in question are offline and return error
* if this is not the case.
+ *
+ * While at it, determine the nid. Note that if we'd have mixed nodes,
+ * we'd only try to offline the last determined one -- which is good
+ * enough for the cases we care about.
*/
- rc = walk_memory_blocks(start, size, NULL, check_memblock_offlined_cb);
+ rc = walk_memory_blocks(start, size, &nid, check_memblock_offlined_cb);
if (rc)
return rc;
@@ -2332,7 +2338,7 @@ static int __ref try_remove_memory(u64 start, u64 size)
release_mem_region_adjustable(start, size);
- for_each_online_node(nid)
+ if (nid != NUMA_NO_NODE)
try_offline_node(nid);
mem_hotplug_done();
--
Thanks,
David / dhildenb