Hi Andrew,
Please consider dropping the patch that currently sits in linux-mm and
taking this one instead.
I would still like to hear Michal's opinion but it should be safe
enough to let it sit in the mmotm/linux-next for a while.
Thanks
Changes from v8 -> v9:
- Change order of kasan calls and offline_mem_sections
- Collect Reviewed-by
Changes from v7 -> v8:
- Addressed feedback from David for patch#4
Changes from v6 -> v7:
- Fix check from "mm,memory_hotplug: Relax fully spanned sections check"
- Add fixup from "mm,memory_hotplug: Allocate memmap from the added memory range"
- Add Reviewed-by from David for patch#2
- Fix changelog in "mm,memory_hotplug: Factor out adjusting present pages into
adjust_present_page_count()"
Changes from v5 -> v6:
- Create memory_block_{online,offline} functions
- Create vmemmap_* functions to deal with vmemmap stuff, so
{online,offline}_pages remain untouched
- Add adjust_present_page_count's patch from David
- Relax check in {offline,online}_pages
- Rework changelogs
Changes from v4 -> v5:
- Addressed feedback from David (patch#1)
- Tested on x86_64 with different struct page sizes and on large/small memory
blocks
- Tested on arm64 with 4K, 64K (with and without THP) and with different struct
page sizes
NOTE: We might need to make this feature and hugetlb-vmemmap feature [1] mutually
exclusive. I raised an issue I see in [2].
The hugetlb-vmemmap feature has been withdrawn for the time being due to the
need for further changes wrt. locking/freeing context.
I will keep an eye on it, and when it comes back I will see how the two
features play together and how either one can be disabled when needed.
Changes from v3 -> v4:
- Addressed feedback from David
- Wrap memmap_on_memory module thingy with #ifdef
on MHP_MEMMAP_ON_MEMORY
- Move "depend on MEMORY_HOTPLUG" to MHP_MEMMAP_ON_MEMORY
in generic mm/Kconfig
- Collect David's Reviewed-bys
Changes from v2 -> v3:
- Addressed feedback from David
- Squash former patch#4 and patch#5 into patch#1
- Fix config dependency CONFIG_SPARSE_VMEMMAP vs CONFIG_SPARSE_VMEMMAP_ENABLE
- Simplify module parameter functions
Changes from v1 -> v2:
- Addressed feedback from David
- Fence off the feature in case struct page size is not a
multiple of PMD size or pageblock alignment cannot be guaranteed
- Tested on x86_64 small and large memory_blocks
- Tested on arm64 4KB and 64KB page sizes (for some reason I cannot boot
my VM with 16KB page size).
Arm64 with 4KB page size behaves like x86_64 after [1], which made section
size smaller.
With 64KB, the feature gets fenced off due to pageblock alignment.
Changes from RFCv3 -> v1:
- Addressed feedback from David
- Re-order patches
Changes from v2 -> v3 (RFC):
- Re-order patches (Michal)
- Fold "mm,memory_hotplug: Introduce MHP_MEMMAP_ON_MEMORY" in patch#1
- Add kernel boot option to enable this feature (Michal)
Changes from v1 -> v2 (RFC):
- Addressed feedback provided by David
- Add an arch_support_memmap_on_memory to be called
from mhp_supports_memmap_on_memory, as atm,
only ARM, powerpc and x86_64 have altmap support.
[1] https://lore.kernel.org/lkml/cover.1611206601.git.sudaraj...
Original cover letter:
The primary goal of this patchset is to reduce memory overhead of the
hot-added memory (at least for SPARSEMEM_VMEMMAP memory model).
The current way we populate the memmap (struct page array) has three main drawbacks:
a) it consumes additional memory until the hot-added memory itself is
onlined,
b) the memmap might end up on a different NUMA node, which is especially true
for the movable_node configuration, and
c) due to fragmentation we might end up populating the memmap with base
pages.
One way to mitigate all these issues is to simply allocate memmap array
(which is the largest memory footprint of the physical memory hotplug)
from the hot-added memory itself. SPARSEMEM_VMEMMAP memory model allows
us to map any pfn range so the memory doesn't need to be online to be
usable for the array. See patch 4 for more details.
This feature is only usable when CONFIG_SPARSEMEM_VMEMMAP is set.
[Overall design]:
Implementation wise we reuse vmem_altmap infrastructure to override
the default allocator used by vmemmap_populate.
memory_block structure gains a new field called nr_vmemmap_pages,
which accounts for the number of vmemmap pages used by that memory_block.
E.g: On x86_64, that is 512 vmemmap pages on small memory blocks and 4096
on large memory blocks (1GB)
We also introduce two new functions: memory_block_{online,offline}.
These functions take care of initializing/uninitializing vmemmap pages
prior to calling {online,offline}_pages, so the latter functions can
remain totally untouched.
More details can be found in the respective changelogs.
David Hildenbrand (1):
mm,memory_hotplug: Factor out adjusting present pages into
adjust_present_page_count()
Oscar Salvador (7):
drivers/base/memory: Introduce memory_block_{online,offline}
mm,memory_hotplug: Relax fully spanned sections check
mm,memory_hotplug: Allocate memmap from the added memory range
acpi,memhotplug: Enable MHP_MEMMAP_ON_MEMORY when supported
mm,memory_hotplug: Add kernel boot option to enable memmap_on_memory
x86/Kconfig: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
arm64/Kconfig: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
Documentation/admin-guide/kernel-parameters.txt | 17 ++
arch/arm64/Kconfig | 3 +
arch/x86/Kconfig | 3 +
drivers/acpi/acpi_memhotplug.c | 5 +-
drivers/base/memory.c | 100 ++++++++++--
include/linux/memory.h | 8 +-
include/linux/memory_hotplug.h | 15 +-
include/linux/memremap.h | 2 +-
include/linux/mmzone.h | 7 +-
mm/Kconfig | 5 +
mm/Makefile | 5 +-
mm/memory_hotplug.c | 205 +++++++++++++++++++++---
mm/sparse.c | 2 -
13 files changed, 330 insertions(+), 47 deletions(-)
--
2.16.3
Physical memory hotadd has to allocate a memmap (struct page array) for
the newly added memory section. Currently, alloc_pages_node() is used
for those allocations.
This has some disadvantages:
a) existing memory is consumed for that purpose
(eg: ~2MB per 128MB memory section on x86_64)
b) if the whole node is movable then we have off-node struct pages
which has performance drawbacks.
c) There might be no PMD_ALIGNED chunks available, so the memmap array gets
populated with base pages.
This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.
Vmemmap page tables can map arbitrary memory.
That means that we can simply use the beginning of each memory section and
map struct pages there.
struct pages which back the allocated space then just need to be treated
carefully.
Implementation wise we will reuse vmem_altmap infrastructure to override
the default allocator used by __populate_section_memmap.
Part of the implementation also relies on memory_block structure gaining
a new field which specifies the number of vmemmap_pages at the beginning.
This patch also introduces the following functions:
- mhp_init_memmap_on_memory:
Initializes vmemmap pages by calling move_pfn_range_to_zone(),
calls kasan_add_zero_shadow(), and onlines as many sections
as vmemmap pages fully span.
- mhp_deinit_memmap_on_memory:
Undoes what mhp_init_memmap_on_memory() does.
The new function memory_block_online() calls mhp_init_memmap_on_memory() before
doing the actual online_pages(). Should online_pages() fail, we clean up
by calling mhp_deinit_memmap_on_memory().
Adjusting of present_pages is done at the end, once we know that online_pages()
succeeded.
On offline, memory_block_offline() needs to unaccount vmemmap pages from
present_pages before calling offline_pages().
This is necessary because offline_pages() tears down some structures based
on whether the node or the zone becomes empty.
If offline_pages() fails, we account back vmemmap pages.
If it succeeds, we call mhp_deinit_memmap_on_memory().
Hot-remove:
We need to be careful when removing memory, as adding and
removing memory needs to be done with the same granularity.
To check that this assumption is not violated, we check the
memory range we want to remove and if a) any memory block has
vmemmap pages and b) the range spans more than a single memory
block, we scream out loud and refuse to proceed.
If all is good and the range was using memmap on memory (aka vmemmap pages),
we construct an altmap structure so free_hugepage_table does the right
thing and calls vmem_altmap_free instead of free_pagetable.
Signed-off-by: Oscar Salvador <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
---
drivers/base/memory.c | 71 ++++++++++++++++--
include/linux/memory.h | 8 ++-
include/linux/memory_hotplug.h | 15 +++-
include/linux/memremap.h | 2 +-
include/linux/mmzone.h | 7 +-
mm/Kconfig | 5 ++
mm/memory_hotplug.c | 159 ++++++++++++++++++++++++++++++++++++++---
mm/sparse.c | 2 -
8 files changed, 247 insertions(+), 22 deletions(-)
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index f209925a5d4e..2e2b2f654f0a 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -173,16 +173,72 @@ static int memory_block_online(struct memory_block *mem)
{
unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
+ unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
+ struct zone *zone;
+ int ret;
+
+ zone = zone_for_pfn_range(mem->online_type, mem->nid, start_pfn, nr_pages);
+
+ /*
+ * Although vmemmap pages have a different lifecycle than the pages
+ * they describe (they remain until the memory is unplugged), doing
+ * their initialization and accounting at memory onlining/offlining
+ * stage simplifies things a lot.
+ */
+ if (nr_vmemmap_pages) {
+ ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages, zone);
+ if (ret)
+ return ret;
+ }
+
+ ret = online_pages(start_pfn + nr_vmemmap_pages,
+ nr_pages - nr_vmemmap_pages, zone);
+ if (ret) {
+ if (nr_vmemmap_pages)
+ mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
+ return ret;
+ }
+
+ /*
+ * Account once onlining succeeded. If the zone was unpopulated, it is
+ * now already properly populated.
+ */
+ if (nr_vmemmap_pages)
+ adjust_present_page_count(zone, nr_vmemmap_pages);
- return online_pages(start_pfn, nr_pages, mem->online_type, mem->nid);
+ return ret;
}
static int memory_block_offline(struct memory_block *mem)
{
unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
+ unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
+ struct zone *zone;
+ int ret;
+
+ zone = page_zone(pfn_to_page(start_pfn));
+
+ /*
+ * Unaccount before offlining, such that unpopulated zone and kthreads
+ * can properly be torn down in offline_pages().
+ */
+ if (nr_vmemmap_pages)
+ adjust_present_page_count(zone, -nr_vmemmap_pages);
- return offline_pages(start_pfn, nr_pages);
+ ret = offline_pages(start_pfn + nr_vmemmap_pages,
+ nr_pages - nr_vmemmap_pages);
+ if (ret) {
+ /* offline_pages() failed. Account back. */
+ if (nr_vmemmap_pages)
+ adjust_present_page_count(zone, nr_vmemmap_pages);
+ return ret;
+ }
+
+ if (nr_vmemmap_pages)
+ mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
+
+ return ret;
}
/*
@@ -576,7 +632,8 @@ int register_memory(struct memory_block *memory)
return ret;
}
-static int init_memory_block(unsigned long block_id, unsigned long state)
+static int init_memory_block(unsigned long block_id, unsigned long state,
+ unsigned long nr_vmemmap_pages)
{
struct memory_block *mem;
int ret = 0;
@@ -593,6 +650,7 @@ static int init_memory_block(unsigned long block_id, unsigned long state)
mem->start_section_nr = block_id * sections_per_block;
mem->state = state;
mem->nid = NUMA_NO_NODE;
+ mem->nr_vmemmap_pages = nr_vmemmap_pages;
ret = register_memory(mem);
@@ -612,7 +670,7 @@ static int add_memory_block(unsigned long base_section_nr)
if (section_count == 0)
return 0;
return init_memory_block(memory_block_id(base_section_nr),
- MEM_ONLINE);
+ MEM_ONLINE, 0);
}
static void unregister_memory(struct memory_block *memory)
@@ -634,7 +692,8 @@ static void unregister_memory(struct memory_block *memory)
*
* Called under device_hotplug_lock.
*/
-int create_memory_block_devices(unsigned long start, unsigned long size)
+int create_memory_block_devices(unsigned long start, unsigned long size,
+ unsigned long vmemmap_pages)
{
const unsigned long start_block_id = pfn_to_block_id(PFN_DOWN(start));
unsigned long end_block_id = pfn_to_block_id(PFN_DOWN(start + size));
@@ -647,7 +706,7 @@ int create_memory_block_devices(unsigned long start, unsigned long size)
return -EINVAL;
for (block_id = start_block_id; block_id != end_block_id; block_id++) {
- ret = init_memory_block(block_id, MEM_OFFLINE);
+ ret = init_memory_block(block_id, MEM_OFFLINE, vmemmap_pages);
if (ret)
break;
}
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 4da95e684e20..97e92e8b556a 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -29,6 +29,11 @@ struct memory_block {
int online_type; /* for passing data to online routine */
int nid; /* NID for this memory block */
struct device dev;
+ /*
+ * Number of vmemmap pages. These pages
+ * lay at the beginning of the memory block.
+ */
+ unsigned long nr_vmemmap_pages;
};
int arch_get_memory_phys_device(unsigned long start_pfn);
@@ -80,7 +85,8 @@ static inline int memory_notify(unsigned long val, void *v)
#else
extern int register_memory_notifier(struct notifier_block *nb);
extern void unregister_memory_notifier(struct notifier_block *nb);
-int create_memory_block_devices(unsigned long start, unsigned long size);
+int create_memory_block_devices(unsigned long start, unsigned long size,
+ unsigned long vmemmap_pages);
void remove_memory_block_devices(unsigned long start, unsigned long size);
extern void memory_dev_init(void);
extern int memory_notify(unsigned long val, void *v);
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 7288aa5ef73b..28f32fd00fe9 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -55,6 +55,14 @@ typedef int __bitwise mhp_t;
*/
#define MHP_MERGE_RESOURCE ((__force mhp_t)BIT(0))
+/*
+ * We want memmap (struct page array) to be self contained.
+ * To do so, we will use the beginning of the hot-added range to build
+ * the page tables for the memmap array that describes the entire range.
+ * Only selected architectures support it with SPARSE_VMEMMAP.
+ */
+#define MHP_MEMMAP_ON_MEMORY ((__force mhp_t)BIT(1))
+
/*
* Extended parameters for memory hotplug:
* altmap: alternative allocator for memmap array (optional)
@@ -99,9 +107,13 @@ static inline void zone_seqlock_init(struct zone *zone)
extern int zone_grow_free_lists(struct zone *zone, unsigned long new_nr_pages);
extern int zone_grow_waitqueues(struct zone *zone, unsigned long nr_pages);
extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
+extern void adjust_present_page_count(struct zone *zone, long nr_pages);
/* VM interface that may be used by firmware interface */
+extern int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
+ struct zone *zone);
+extern void mhp_deinit_memmap_on_memory(unsigned long pfn, unsigned long nr_pages);
extern int online_pages(unsigned long pfn, unsigned long nr_pages,
- int online_type, int nid);
+ struct zone *zone);
extern struct zone *test_pages_in_a_zone(unsigned long start_pfn,
unsigned long end_pfn);
extern void __offline_isolated_pages(unsigned long start_pfn,
@@ -359,6 +371,7 @@ extern struct zone *zone_for_pfn_range(int online_type, int nid, unsigned start_
extern int arch_create_linear_mapping(int nid, u64 start, u64 size,
struct mhp_params *params);
void arch_remove_linear_mapping(u64 start, u64 size);
+extern bool mhp_supports_memmap_on_memory(unsigned long size);
#endif /* CONFIG_MEMORY_HOTPLUG */
#endif /* __LINUX_MEMORY_HOTPLUG_H */
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index f5b464daeeca..45a79da89c5f 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -17,7 +17,7 @@ struct device;
* @alloc: track pages consumed, private to vmemmap_populate()
*/
struct vmem_altmap {
- const unsigned long base_pfn;
+ unsigned long base_pfn;
const unsigned long end_pfn;
const unsigned long reserve;
unsigned long free;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 47946cec7584..76f4ca5ed230 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -427,6 +427,11 @@ enum zone_type {
* techniques might use alloc_contig_range() to hide previously
* exposed pages from the buddy again (e.g., to implement some sort
* of memory unplug in virtio-mem).
+ * 6. Memory-hotplug: when using memmap_on_memory and onlining the memory
+ * to the MOVABLE zone, the vmemmap pages are also placed in such
+ * zone. Such pages cannot be really moved around as they are
+ * self-stored in the range, but they are treated as movable when
+ * the range they describe is about to be offlined.
*
* In general, no unmovable allocations that degrade memory offlining
* should end up in ZONE_MOVABLE. Allocators (like alloc_contig_range())
@@ -1378,10 +1383,8 @@ static inline int online_section_nr(unsigned long nr)
#ifdef CONFIG_MEMORY_HOTPLUG
void online_mem_sections(unsigned long start_pfn, unsigned long end_pfn);
-#ifdef CONFIG_MEMORY_HOTREMOVE
void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn);
#endif
-#endif
static inline struct mem_section *__pfn_to_section(unsigned long pfn)
{
diff --git a/mm/Kconfig b/mm/Kconfig
index 24c045b24b95..febf805000f8 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -183,6 +183,11 @@ config MEMORY_HOTREMOVE
depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
depends on MIGRATION
+config MHP_MEMMAP_ON_MEMORY
+ def_bool y
+ depends on MEMORY_HOTPLUG && SPARSEMEM_VMEMMAP
+ depends on ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
+
# Heavily threaded applications may benefit from splitting the mm-wide
# page_table_lock, so that faults on different parts of the user address
# space can be handled with less contention: split it at this NR_CPUS.
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d05056b3c173..5ef626926449 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -42,6 +42,8 @@
#include "internal.h"
#include "shuffle.h"
+static bool memmap_on_memory;
+
/*
* online_page_callback contains pointer to current page onlining function.
* Initially it is generic_online_page(). If it is required it could be
@@ -641,7 +643,12 @@ EXPORT_SYMBOL_GPL(generic_online_page);
static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
{
const unsigned long end_pfn = start_pfn + nr_pages;
- unsigned long pfn;
+ unsigned long pfn = start_pfn;
+
+ while (!IS_ALIGNED(pfn, MAX_ORDER_NR_PAGES)) {
+ (*online_page_callback)(pfn_to_page(pfn), pageblock_order);
+ pfn += pageblock_nr_pages;
+ }
/*
* Online the pages in MAX_ORDER - 1 aligned chunks. The callback might
@@ -649,7 +656,7 @@ static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
* later). We account all pages as being online and belonging to this
* zone ("present").
*/
- for (pfn = start_pfn; pfn < end_pfn; pfn += MAX_ORDER_NR_PAGES)
+ for (; pfn < end_pfn; pfn += MAX_ORDER_NR_PAGES)
(*online_page_callback)(pfn_to_page(pfn), MAX_ORDER - 1);
/* mark all involved sections as online */
@@ -829,7 +836,11 @@ struct zone * zone_for_pfn_range(int online_type, int nid, unsigned start_pfn,
return default_zone_for_pfn(nid, start_pfn, nr_pages);
}
-static void adjust_present_page_count(struct zone *zone, long nr_pages)
+/*
+ * This function should only be called by memory_block_{online,offline},
+ * and {online,offline}_pages.
+ */
+void adjust_present_page_count(struct zone *zone, long nr_pages)
{
unsigned long flags;
@@ -839,12 +850,54 @@ static void adjust_present_page_count(struct zone *zone, long nr_pages)
pgdat_resize_unlock(zone->zone_pgdat, &flags);
}
-int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
- int online_type, int nid)
+int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
+ struct zone *zone)
+{
+ unsigned long end_pfn = pfn + nr_pages;
+ int ret;
+
+ ret = kasan_add_zero_shadow(__va(PFN_PHYS(pfn)), PFN_PHYS(nr_pages));
+ if (ret)
+ return ret;
+
+ move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);
+
+ /*
+ * It might be that the vmemmap_pages fully span sections. If that is
+ * the case, mark those sections online here as otherwise they will be
+ * left offline.
+ */
+ if (nr_pages >= PAGES_PER_SECTION)
+ online_mem_sections(pfn, ALIGN_DOWN(end_pfn, PAGES_PER_SECTION));
+
+ return ret;
+}
+
+void mhp_deinit_memmap_on_memory(unsigned long pfn, unsigned long nr_pages)
+{
+ unsigned long end_pfn = pfn + nr_pages;
+
+ /*
+ * It might be that the vmemmap_pages fully span sections. If that is
+ * the case, mark those sections offline here as otherwise they will be
+ * left online.
+ */
+ if (nr_pages >= PAGES_PER_SECTION)
+ offline_mem_sections(pfn, ALIGN_DOWN(end_pfn, PAGES_PER_SECTION));
+
+ /*
+ * The pages associated with this vmemmap have been offlined, so
+ * we can reset its state here.
+ */
+ remove_pfn_range_from_zone(page_zone(pfn_to_page(pfn)), pfn, nr_pages);
+ kasan_remove_zero_shadow(__va(PFN_PHYS(pfn)), PFN_PHYS(nr_pages));
+}
+
+int __ref online_pages(unsigned long pfn, unsigned long nr_pages, struct zone *zone)
{
unsigned long flags;
- struct zone *zone;
int need_zonelists_rebuild = 0;
+ const int nid = zone_to_nid(zone);
int ret;
struct memory_notify arg;
@@ -861,7 +914,6 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
mem_hotplug_begin();
/* associate pfn range with the zone */
- zone = zone_for_pfn_range(online_type, nid, pfn, nr_pages);
move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_ISOLATE);
arg.start_pfn = pfn;
@@ -1075,6 +1127,45 @@ static int online_memory_block(struct memory_block *mem, void *arg)
return device_online(&mem->dev);
}
+bool mhp_supports_memmap_on_memory(unsigned long size)
+{
+ unsigned long nr_vmemmap_pages = size / PAGE_SIZE;
+ unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);
+ unsigned long remaining_size = size - vmemmap_size;
+
+ /*
+ * Besides having arch support and the feature enabled at runtime, we
+ * need a few more assumptions to hold true:
+ *
+ * a) We span a single memory block: memory onlining/offlining happens
+ * in memory block granularity. We don't want the vmemmap of online
+ * memory blocks to reside on offline memory blocks. In the future,
+ * we might want to support variable-sized memory blocks to make the
+ * feature more versatile.
+ *
+ * b) The vmemmap pages span complete PMDs: We don't want vmemmap code
+ * to populate memory from the altmap for unrelated parts (i.e.,
+ * other memory blocks)
+ *
+ * c) The vmemmap pages (and thereby the pages that will be exposed to
+ * the buddy) have to cover full pageblocks: memory onlining/offlining
+ * code requires applicable ranges to be page-aligned, for example, to
+ * set the migratetypes properly.
+ *
+ * TODO: Although we have a check here to make sure that vmemmap pages
+ * fully populate a PMD, it is not the right place to check for
+ * this. A much better solution involves improving vmemmap code
+ * to fallback to base pages when trying to populate vmemmap using
+ * altmap as an alternative source of memory, and we do not exactly
+ * populate a single PMD.
+ */
+ return memmap_on_memory &&
+ IS_ENABLED(CONFIG_MHP_MEMMAP_ON_MEMORY) &&
+ size == memory_block_size_bytes() &&
+ IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
+ IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
+}
+
/*
* NOTE: The caller must call lock_device_hotplug() to serialize hotplug
* and online/offline operations (triggered e.g. by sysfs).
@@ -1084,6 +1175,7 @@ static int online_memory_block(struct memory_block *mem, void *arg)
int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
{
struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
+ struct vmem_altmap mhp_altmap = {};
u64 start, size;
bool new_node = false;
int ret;
@@ -1110,13 +1202,26 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
goto error;
new_node = ret;
+ /*
+ * Self hosted memmap array
+ */
+ if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
+ if (!mhp_supports_memmap_on_memory(size)) {
+ ret = -EINVAL;
+ goto error;
+ }
+ mhp_altmap.free = PHYS_PFN(size);
+ mhp_altmap.base_pfn = PHYS_PFN(start);
+ params.altmap = &mhp_altmap;
+ }
+
/* call arch's memory hotadd */
 ret = arch_add_memory(nid, start, size, &params);
if (ret < 0)
goto error;
/* create memory block devices after memory was added */
- ret = create_memory_block_devices(start, size);
+ ret = create_memory_block_devices(start, size, mhp_altmap.alloc);
if (ret) {
arch_remove_memory(nid, start, size, NULL);
goto error;
@@ -1762,6 +1867,14 @@ static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
return 0;
}
+static int get_nr_vmemmap_pages_cb(struct memory_block *mem, void *arg)
+{
+ /*
+ * If not set, continue with the next block.
+ */
+ return mem->nr_vmemmap_pages;
+}
+
static int check_cpu_on_node(pg_data_t *pgdat)
{
int cpu;
@@ -1836,6 +1949,9 @@ EXPORT_SYMBOL(try_offline_node);
static int __ref try_remove_memory(int nid, u64 start, u64 size)
{
int rc = 0;
+ struct vmem_altmap mhp_altmap = {};
+ struct vmem_altmap *altmap = NULL;
+ unsigned long nr_vmemmap_pages;
BUG_ON(check_hotplug_memory_range(start, size));
@@ -1848,6 +1964,31 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
if (rc)
return rc;
+ /*
+ * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
+ * the same granularity it was added - a single memory block.
+ */
+ if (memmap_on_memory) {
+ nr_vmemmap_pages = walk_memory_blocks(start, size, NULL,
+ get_nr_vmemmap_pages_cb);
+ if (nr_vmemmap_pages) {
+ if (size != memory_block_size_bytes()) {
+ pr_warn("Refuse to remove %#llx - %#llx, "
+ "wrong granularity\n",
+ start, start + size);
+ return -EINVAL;
+ }
+
+ /*
+ * Let remove_pmd_table->free_hugepage_table do the
+ * right thing if we used vmem_altmap when hot-adding
+ * the range.
+ */
+ mhp_altmap.alloc = nr_vmemmap_pages;
+ altmap = &mhp_altmap;
+ }
+ }
+
/* remove memmap entry */
firmware_map_remove(start, start + size, "System RAM");
@@ -1859,7 +2000,7 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
mem_hotplug_begin();
- arch_remove_memory(nid, start, size, NULL);
+ arch_remove_memory(nid, start, size, altmap);
if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
memblock_free(start, size);
diff --git a/mm/sparse.c b/mm/sparse.c
index 7bd23f9d6cef..8e96cf00536b 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -623,7 +623,6 @@ void online_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
}
}
-#ifdef CONFIG_MEMORY_HOTREMOVE
/* Mark all memory sections within the pfn range as offline */
void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
{
@@ -644,7 +643,6 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
ms->section_mem_map &= ~SECTION_IS_ONLINE;
}
}
-#endif
#ifdef CONFIG_SPARSEMEM_VMEMMAP
static struct page * __meminit populate_section_memmap(unsigned long pfn,
--
2.16.3
From: David Hildenbrand <[email protected]>
Let's have a single place (inspired by adjust_managed_page_count()) where
we adjust present pages.
In contrast to adjust_managed_page_count(), only memory onlining/offlining
is allowed to modify the number of present pages.
Signed-off-by: David Hildenbrand <[email protected]>
Signed-off-by: Oscar Salvador <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
---
mm/memory_hotplug.c | 22 ++++++++++++----------
1 file changed, 12 insertions(+), 10 deletions(-)
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 25e59d5dc13c..d05056b3c173 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -829,6 +829,16 @@ struct zone * zone_for_pfn_range(int online_type, int nid, unsigned start_pfn,
return default_zone_for_pfn(nid, start_pfn, nr_pages);
}
+static void adjust_present_page_count(struct zone *zone, long nr_pages)
+{
+ unsigned long flags;
+
+ zone->present_pages += nr_pages;
+ pgdat_resize_lock(zone->zone_pgdat, &flags);
+ zone->zone_pgdat->node_present_pages += nr_pages;
+ pgdat_resize_unlock(zone->zone_pgdat, &flags);
+}
+
int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
int online_type, int nid)
{
@@ -882,11 +892,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
}
online_pages_range(pfn, nr_pages);
- zone->present_pages += nr_pages;
-
- pgdat_resize_lock(zone->zone_pgdat, &flags);
- zone->zone_pgdat->node_present_pages += nr_pages;
- pgdat_resize_unlock(zone->zone_pgdat, &flags);
+ adjust_present_page_count(zone, nr_pages);
node_states_set_node(nid, &arg);
if (need_zonelists_rebuild)
@@ -1701,11 +1707,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
/* removal success */
adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
- zone->present_pages -= nr_pages;
-
- pgdat_resize_lock(zone->zone_pgdat, &flags);
- zone->zone_pgdat->node_present_pages -= nr_pages;
- pgdat_resize_unlock(zone->zone_pgdat, &flags);
+ adjust_present_page_count(zone, -nr_pages);
init_per_zone_wmark_min();
--
2.16.3
Enable arm64 platform to use the MHP_MEMMAP_ON_MEMORY feature.
Signed-off-by: Oscar Salvador <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
---
arch/arm64/Kconfig | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index e4e1b6550115..68735831b236 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -309,6 +309,9 @@ config ARCH_ENABLE_MEMORY_HOTPLUG
config ARCH_ENABLE_MEMORY_HOTREMOVE
def_bool y
+config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
+ def_bool y
+
config SMP
def_bool y
--
2.16.3
Self stored memmap leads to a sparse memory situation which is unsuitable
for workloads that require large contiguous memory chunks, so make this
an opt-in feature which needs to be explicitly enabled.
To control this, let memory_hotplug have its own module-parameter namespace,
as suggested by David, so we can add the memory_hotplug.memmap_on_memory
parameter.
Signed-off-by: Oscar Salvador <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 17 +++++++++++++++++
mm/Makefile | 5 ++++-
mm/memory_hotplug.c | 10 +++++++++-
3 files changed, 30 insertions(+), 2 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 04545725f187..af32c17cd4eb 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2794,6 +2794,23 @@
seconds. Use this parameter to check at some
other rate. 0 disables periodic checking.
+ memory_hotplug.memmap_on_memory
+ [KNL,X86,ARM] Boolean flag to enable this feature.
+ Format: {on | off (default)}
+ When enabled, runtime hotplugged memory will
+ allocate its internal metadata (struct pages)
+ from the hotadded memory which will allow to
+ hotadd a lot of memory without requiring
+ additional memory to do so.
+ This feature is disabled by default because it
+ has some implication on large (e.g. GB)
+ allocations in some configurations (e.g. small
+ memory blocks).
+ The state of the flag can be read in
+ /sys/module/memory_hotplug/parameters/memmap_on_memory.
+ Note that even when enabled, there are a few cases where
+ the feature is not effective.
+
memtest= [KNL,X86,ARM,PPC] Enable memtest
Format: <integer>
default : 0 <disable>
diff --git a/mm/Makefile b/mm/Makefile
index 72227b24a616..82ae9482f5e3 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -58,9 +58,13 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
page-alloc-y := page_alloc.o
page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
+# Give 'memory_hotplug' its own module-parameter namespace
+memory-hotplug-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
+
obj-y += page-alloc.o
obj-y += init-mm.o
obj-y += memblock.o
+obj-y += $(memory-hotplug-y)
ifdef CONFIG_MMU
obj-$(CONFIG_ADVISE_SYSCALLS) += madvise.o
@@ -83,7 +87,6 @@ obj-$(CONFIG_SLUB) += slub.o
obj-$(CONFIG_KASAN) += kasan/
obj-$(CONFIG_KFENCE) += kfence/
obj-$(CONFIG_FAILSLAB) += failslab.o
-obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
obj-$(CONFIG_MEMTEST) += memtest.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 5ef626926449..b93949a84d4a 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -42,7 +42,15 @@
#include "internal.h"
#include "shuffle.h"
-static bool memmap_on_memory;
+
+/*
+ * memory_hotplug.memmap_on_memory parameter
+ */
+static bool memmap_on_memory __ro_after_init;
+#ifdef CONFIG_MHP_MEMMAP_ON_MEMORY
+module_param(memmap_on_memory, bool, 0444);
+MODULE_PARM_DESC(memmap_on_memory, "Enable memmap on memory for memory hotplug");
+#endif
/*
* online_page_callback contains pointer to current page onlining function.
--
2.16.3
This is a preparatory patch that introduces two new functions:
memory_block_online() and memory_block_offline().
For now, these functions will only call online_pages() and offline_pages()
respectively, but they will be later in charge of preparing the vmemmap
pages, carrying out the initialization and proper accounting of such
pages.
Since memory_block struct contains all the information, pass this struct
down the chain till the end functions.
Signed-off-by: Oscar Salvador <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
---
drivers/base/memory.c | 33 +++++++++++++++++++++------------
1 file changed, 21 insertions(+), 12 deletions(-)
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index f35298425575..f209925a5d4e 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -169,30 +169,41 @@ int memory_notify(unsigned long val, void *v)
return blocking_notifier_call_chain(&memory_chain, val, v);
}
+static int memory_block_online(struct memory_block *mem)
+{
+ unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
+ unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
+
+ return online_pages(start_pfn, nr_pages, mem->online_type, mem->nid);
+}
+
+static int memory_block_offline(struct memory_block *mem)
+{
+ unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
+ unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
+
+ return offline_pages(start_pfn, nr_pages);
+}
+
/*
* MEMORY_HOTPLUG depends on SPARSEMEM in mm/Kconfig, so it is
* OK to have direct references to sparsemem variables in here.
*/
static int
-memory_block_action(unsigned long start_section_nr, unsigned long action,
- int online_type, int nid)
+memory_block_action(struct memory_block *mem, unsigned long action)
{
- unsigned long start_pfn;
- unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
int ret;
- start_pfn = section_nr_to_pfn(start_section_nr);
-
switch (action) {
case MEM_ONLINE:
- ret = online_pages(start_pfn, nr_pages, online_type, nid);
+ ret = memory_block_online(mem);
break;
case MEM_OFFLINE:
- ret = offline_pages(start_pfn, nr_pages);
+ ret = memory_block_offline(mem);
break;
default:
WARN(1, KERN_WARNING "%s(%ld, %ld) unknown action: "
- "%ld\n", __func__, start_section_nr, action, action);
+ "%ld\n", __func__, mem->start_section_nr, action, action);
ret = -EINVAL;
}
@@ -210,9 +221,7 @@ static int memory_block_change_state(struct memory_block *mem,
if (to_state == MEM_OFFLINE)
mem->state = MEM_GOING_OFFLINE;
- ret = memory_block_action(mem->start_section_nr, to_state,
- mem->online_type, mem->nid);
-
+ ret = memory_block_action(mem, to_state);
mem->state = ret ? from_state_req : to_state;
return ret;
--
2.16.3
When using self-hosted vmemmap pages, the number of pages passed to
{online,offline}_pages might not fully span sections, but they always
fully span pageblocks.
Relax the check to account for that case.
Signed-off-by: Oscar Salvador <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
---
mm/memory_hotplug.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0cdbbfbc5757..25e59d5dc13c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -838,9 +838,14 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
int ret;
struct memory_notify arg;
- /* We can only online full sections (e.g., SECTION_IS_ONLINE) */
+ /* We can only online full sections (e.g., SECTION_IS_ONLINE).
+ * However, when using e.g: memmap_on_memory, some pages are initialized
+ * prior to calling in here. The remaining amount of pages must be
+ * pageblock aligned.
+ */
if (WARN_ON_ONCE(!nr_pages ||
- !IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION)))
+ !IS_ALIGNED(pfn, pageblock_nr_pages) ||
+ !IS_ALIGNED(pfn + nr_pages, PAGES_PER_SECTION)))
return -EINVAL;
mem_hotplug_begin();
@@ -1573,9 +1578,14 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
int ret, node;
char *reason;
- /* We can only offline full sections (e.g., SECTION_IS_ONLINE) */
+ /* We can only offline full sections (e.g., SECTION_IS_ONLINE).
+ * However, when using e.g: memmap_on_memory, some pages are initialized
+ * prior to calling in here. The remaining amount of pages must be
+ * pageblock aligned.
+ */
if (WARN_ON_ONCE(!nr_pages ||
- !IS_ALIGNED(start_pfn | nr_pages, PAGES_PER_SECTION)))
+ !IS_ALIGNED(start_pfn, pageblock_nr_pages) ||
+ !IS_ALIGNED(start_pfn + nr_pages, PAGES_PER_SECTION)))
return -EINVAL;
mem_hotplug_begin();
--
2.16.3
Let the caller check whether it can pass MHP_MEMMAP_ON_MEMORY by
checking mhp_supports_memmap_on_memory().
MHP_MEMMAP_ON_MEMORY can only be set in case
ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE is enabled, the architecture supports
altmap, and the range to be added spans a single memory block.
Signed-off-by: Oscar Salvador <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
drivers/acpi/acpi_memhotplug.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index b02fd51e5589..8cc195c4c861 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -171,6 +171,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
acpi_handle handle = mem_device->device->handle;
int result, num_enabled = 0;
struct acpi_memory_info *info;
+ mhp_t mhp_flags = MHP_NONE;
int node;
node = acpi_get_node(handle);
@@ -194,8 +195,10 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
if (node < 0)
node = memory_add_physaddr_to_nid(info->start_addr);
+ if (mhp_supports_memmap_on_memory(info->length))
+ mhp_flags |= MHP_MEMMAP_ON_MEMORY;
result = __add_memory(node, info->start_addr, info->length,
- MHP_NONE);
+ mhp_flags);
/*
* If the memory block has been used by the kernel, add_memory()
--
2.16.3
Enable x86_64 platform to use the MHP_MEMMAP_ON_MEMORY feature.
Signed-off-by: Oscar Salvador <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
---
arch/x86/Kconfig | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2792879d398e..9f0211df1746 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2433,6 +2433,9 @@ config ARCH_ENABLE_MEMORY_HOTREMOVE
def_bool y
depends on MEMORY_HOTPLUG
+config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
+ def_bool y
+
config USE_PERCPU_NUMA_NODE_ID
def_bool y
depends on NUMA
--
2.16.3
On Fri 16-04-21 13:24:04, Oscar Salvador wrote:
> This is a preparatory patch that introduces two new functions:
> memory_block_online() and memory_block_offline().
>
> For now, these functions will only call online_pages() and offline_pages()
> respectively, but they will be later in charge of preparing the vmemmap
> pages, carrying out the initialization and proper accounting of such
> pages.
>
> Since memory_block struct contains all the information, pass this struct
> down the chain till the end functions.
>
> Signed-off-by: Oscar Salvador <[email protected]>
> Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Michal Hocko <[email protected]>
> ---
> drivers/base/memory.c | 33 +++++++++++++++++++++------------
> 1 file changed, 21 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index f35298425575..f209925a5d4e 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -169,30 +169,41 @@ int memory_notify(unsigned long val, void *v)
> return blocking_notifier_call_chain(&memory_chain, val, v);
> }
>
> +static int memory_block_online(struct memory_block *mem)
> +{
> + unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
> + unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
> +
> + return online_pages(start_pfn, nr_pages, mem->online_type, mem->nid);
> +}
> +
> +static int memory_block_offline(struct memory_block *mem)
> +{
> + unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
> + unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
> +
> + return offline_pages(start_pfn, nr_pages);
> +}
> +
> /*
> * MEMORY_HOTPLUG depends on SPARSEMEM in mm/Kconfig, so it is
> * OK to have direct references to sparsemem variables in here.
> */
> static int
> -memory_block_action(unsigned long start_section_nr, unsigned long action,
> - int online_type, int nid)
> +memory_block_action(struct memory_block *mem, unsigned long action)
> {
> - unsigned long start_pfn;
> - unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
> int ret;
>
> - start_pfn = section_nr_to_pfn(start_section_nr);
> -
> switch (action) {
> case MEM_ONLINE:
> - ret = online_pages(start_pfn, nr_pages, online_type, nid);
> + ret = memory_block_online(mem);
> break;
> case MEM_OFFLINE:
> - ret = offline_pages(start_pfn, nr_pages);
> + ret = memory_block_offline(mem);
> break;
> default:
> WARN(1, KERN_WARNING "%s(%ld, %ld) unknown action: "
> - "%ld\n", __func__, start_section_nr, action, action);
> + "%ld\n", __func__, mem->start_section_nr, action, action);
> ret = -EINVAL;
> }
>
> @@ -210,9 +221,7 @@ static int memory_block_change_state(struct memory_block *mem,
> if (to_state == MEM_OFFLINE)
> mem->state = MEM_GOING_OFFLINE;
>
> - ret = memory_block_action(mem->start_section_nr, to_state,
> - mem->online_type, mem->nid);
> -
> + ret = memory_block_action(mem, to_state);
> mem->state = ret ? from_state_req : to_state;
>
> return ret;
> --
> 2.16.3
--
Michal Hocko
SUSE Labs
On Fri 16-04-21 13:24:05, Oscar Salvador wrote:
> When using self-hosted vmemmap pages, the number of pages passed to
> {online,offline}_pages might not fully span sections, but they always
> fully span pageblocks.
> Relax the check to account for that case.
It would be good to call those out explicitly. It would also be
great to explain why pageblock_nr_pages is an actual constraint. There
shouldn't be any real reason for that except for "we want online_pages
to operate on whole memblocks and memmap_on_memory will poke
pageblock_nr_pages aligned holes in the beginning which is a special
case we want to allow."
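For illustration only, a minimal userspace sketch of the relaxed check. The
constants are assumptions (x86_64 with 4K pages: 128MB sections and order-9
pageblocks), and the SKETCH_* names and range_is_valid() are hypothetical,
not kernel symbols:

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values: 128MB sections and 2MB pageblocks with 4K pages. */
#define SKETCH_PAGES_PER_SECTION  32768UL
#define SKETCH_PAGEBLOCK_NR_PAGES 512UL

/* Like IS_ALIGNED(): only meaningful for power-of-two alignments. */
static bool sketch_is_aligned(unsigned long x, unsigned long a)
{
	assert(a && !(a & (a - 1)));
	return (x & (a - 1)) == 0;
}

/*
 * Relaxed check: the start only has to be pageblock aligned (it may sit
 * right after the self-hosted memmap), but the end must still fall on a
 * section boundary.
 */
static bool range_is_valid(unsigned long pfn, unsigned long nr_pages)
{
	return nr_pages &&
	       sketch_is_aligned(pfn, SKETCH_PAGEBLOCK_NR_PAGES) &&
	       sketch_is_aligned(pfn + nr_pages, SKETCH_PAGES_PER_SECTION);
}
```

With these assumed values, a block whose first 512 pages hold the memmap is
passed as (512, 32256) and is accepted, whereas the previous
"pfn | nr_pages" section-alignment check would have rejected it.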
> Signed-off-by: Oscar Salvador <[email protected]>
> Reviewed-by: David Hildenbrand <[email protected]>
With the changelog extended and the comment clarification (see below)
feel free to add
Acked-by: Michal Hocko <[email protected]>
> ---
> mm/memory_hotplug.c | 18 ++++++++++++++----
> 1 file changed, 14 insertions(+), 4 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 0cdbbfbc5757..25e59d5dc13c 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -838,9 +838,14 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
> int ret;
> struct memory_notify arg;
>
> - /* We can only online full sections (e.g., SECTION_IS_ONLINE) */
> + /* We can only online full sections (e.g., SECTION_IS_ONLINE).
> + * However, when using e.g: memmap_on_memory, some pages are initialized
> + * prior to calling in here. The remaining amount of pages must be
> + * pageblock aligned.
I would rephrase (and also note that multi-line comments usually have a
leading line without any content - not that I care much though).
/*
* {on,off}lining is constrained to full memory sections (or
* more precisely to memory blocks from the user space POV).
* memmap_on_memory is an exception because it reserves initial
* part of the physical memory space for vmemmaps. That space is
* pageblock aligned.
> + */
Same comment would apply to offline_pages.
> if (WARN_ON_ONCE(!nr_pages ||
> - !IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION)))
> + !IS_ALIGNED(pfn, pageblock_nr_pages) ||
> + !IS_ALIGNED(pfn + nr_pages, PAGES_PER_SECTION)))
> return -EINVAL;
>
> mem_hotplug_begin();
> @@ -1573,9 +1578,14 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
> int ret, node;
> char *reason;
>
> - /* We can only offline full sections (e.g., SECTION_IS_ONLINE) */
> + /* We can only offline full sections (e.g., SECTION_IS_ONLINE).
> + * However, when using e.g: memmap_on_memory, some pages are initialized
> + * prior to calling in here. The remaining amount of pages must be
> + * pageblock aligned.
> + */
> if (WARN_ON_ONCE(!nr_pages ||
> - !IS_ALIGNED(start_pfn | nr_pages, PAGES_PER_SECTION)))
> + !IS_ALIGNED(start_pfn, pageblock_nr_pages) ||
> + !IS_ALIGNED(start_pfn + nr_pages, PAGES_PER_SECTION)))
> return -EINVAL;
>
> mem_hotplug_begin();
> --
> 2.16.3
--
Michal Hocko
SUSE Labs
On Fri 16-04-21 13:24:06, Oscar Salvador wrote:
> From: David Hildenbrand <[email protected]>
>
> Let's have a single place (inspired by adjust_managed_page_count()) where
> we adjust present pages.
> In contrast to adjust_managed_page_count(), only memory onlining/offlining
> is allowed to modify the number of present pages.
>
> Signed-off-by: David Hildenbrand <[email protected]>
> Signed-off-by: Oscar Salvador <[email protected]>
> Reviewed-by: Oscar Salvador <[email protected]>
Not sure self review counts ;)
Acked-by: Michal Hocko <[email protected]>
Btw. I strongly suspect the resize lock is quite pointless here.
Something for a follow up patch.
> ---
> mm/memory_hotplug.c | 22 ++++++++++++----------
> 1 file changed, 12 insertions(+), 10 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 25e59d5dc13c..d05056b3c173 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -829,6 +829,16 @@ struct zone * zone_for_pfn_range(int online_type, int nid, unsigned start_pfn,
> return default_zone_for_pfn(nid, start_pfn, nr_pages);
> }
>
> +static void adjust_present_page_count(struct zone *zone, long nr_pages)
> +{
> + unsigned long flags;
> +
> + zone->present_pages += nr_pages;
> + pgdat_resize_lock(zone->zone_pgdat, &flags);
> + zone->zone_pgdat->node_present_pages += nr_pages;
> + pgdat_resize_unlock(zone->zone_pgdat, &flags);
> +}
> +
> int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
> int online_type, int nid)
> {
> @@ -882,11 +892,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
> }
>
> online_pages_range(pfn, nr_pages);
> - zone->present_pages += nr_pages;
> -
> - pgdat_resize_lock(zone->zone_pgdat, &flags);
> - zone->zone_pgdat->node_present_pages += nr_pages;
> - pgdat_resize_unlock(zone->zone_pgdat, &flags);
> + adjust_present_page_count(zone, nr_pages);
>
> node_states_set_node(nid, &arg);
> if (need_zonelists_rebuild)
> @@ -1701,11 +1707,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>
> /* removal success */
> adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
> - zone->present_pages -= nr_pages;
> -
> - pgdat_resize_lock(zone->zone_pgdat, &flags);
> - zone->zone_pgdat->node_present_pages -= nr_pages;
> - pgdat_resize_unlock(zone->zone_pgdat, &flags);
> + adjust_present_page_count(zone, -nr_pages);
>
> init_per_zone_wmark_min();
>
> --
> 2.16.3
--
Michal Hocko
SUSE Labs
On Fri 16-04-21 13:24:07, Oscar Salvador wrote:
> Physical memory hotadd has to allocate a memmap (struct page array) for
> the newly added memory section. Currently, alloc_pages_node() is used
> for those allocations.
>
> This has some disadvantages:
> a) an existing memory is consumed for that purpose
> (eg: ~2MB per 128MB memory section on x86_64)
I would extend this slightly. This can even lead to extreme cases where
system goes OOM because the physically hotplugged memory depletes the
available memory before it is onlined.
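To put numbers on the overhead, here is a small userspace sketch of the
arithmetic behind the "~2MB per 128MB" figure. It assumes 4K base pages and a
64-byte struct page (both vary with architecture and config); the SKETCH_*
constants and helper names are made up for this example:

```c
#include <assert.h>

/* Assumed: 4K base pages and a 64-byte struct page. */
#define SKETCH_PAGE_SIZE        4096ULL
#define SKETCH_STRUCT_PAGE_SIZE 64ULL

/* Bytes of memmap needed to describe range_bytes of memory. */
static unsigned long long memmap_bytes(unsigned long long range_bytes)
{
	assert(range_bytes % SKETCH_PAGE_SIZE == 0);
	return range_bytes / SKETCH_PAGE_SIZE * SKETCH_STRUCT_PAGE_SIZE;
}

/* Same quantity in base pages, rounded up to whole pages. */
static unsigned long long memmap_pages(unsigned long long range_bytes)
{
	return (memmap_bytes(range_bytes) + SKETCH_PAGE_SIZE - 1) /
	       SKETCH_PAGE_SIZE;
}
```

Under these assumptions a 128MB range needs 2MB (512 pages) of memmap, and
hot-adding 1TB needs 16GB that has to come from already-present memory unless
the memmap is placed on the hot-added range itself.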
> b) if the whole node is movable then we have off-node struct pages
> which has performance drawbacks.
> c) It might be there are no PMD_ALIGNED chunks so memmap array gets
> populated with base pages.
>
> This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.
>
> Vmemmap page tables can map arbitrary memory.
> That means that we can simply use the beginning of each memory section and
> map struct pages there.
Again this can be confusing because this is not what is really happening
in practice because we are going to have a multisection memory block
where all sections will be backed by a common reserved space rather than
per section sparse space. I would go with
"
Vmemmap page tables can map arbitrary memory. That means that we can
reserve a part of the physically hotadded memory to back vmemmap page
tables. This implementation uses the beginning of the hotplugged memory
for that purpose.
"
> struct pages which back the allocated space then just need to be treated
> carefully.
>
> Implementation wise we will reuse vmem_altmap infrastructure to override
> the default allocator used by __populate_section_memmap.
> Part of the implementation also relies on memory_block structure gaining
> a new field which specifies the number of vmemmap_pages at the beginning.
> This patch also introduces the following functions:
There is quite a large leap from __populate_section_memmap to the
memory_block that deserves explaining to not lose all the subtle things
discussed in the past. I think it should be made clear why all the fuss.
I would structure it as follows:
"
There are some non-obvious things to consider though. Vmemmap
pages are allocated/freed during the memory hotplug events
(add_memory_resource, try_remove_memory) when the memory is
added/removed. This means that the reserved physical range is not online,
yet it is in use. The most obvious side effect is that pfn_to_online_page
returns NULL for those pfns. The current design expects that this
should be OK as the hotplugged memory is considered garbage until it
is onlined. For example, hibernation wouldn't save the content of those
vmemmaps into the image, so it wouldn't be restored on resume, but this
should be OK as there is no real content to recover anyway, while metadata
is reachable from other data structures (e.g. vmemmap page tables).
The reserved space is therefore (de)initialized during the {on,off}line
events (mhp_{de}init_memmap_on_memory). That is done by extracting page
allocator independent initialization from the regular onlining path.
The primary reason to handle the reserved space outside of {on,off}line_pages
is to make each initialization specific to the purpose rather than
special case them in a single function.
"
> Adjusting of present_pages is done at the end once we know that online_pages()
> succeeded.
>
> On offline, memory_block_offline() needs to unaccount vmemmap pages from
> present_pages() before calling offline_pages().
> This is necessary because offline_pages() tears down some structures based
> on the fact whether the node or the zone become empty.
> If offline_pages() fails, we account back vmemmap pages.
> If it succeeds, we call mhp_deinit_memmap_on_memory().
>
> Hot-remove:
>
> We need to be careful when removing memory, as adding and
> removing memory needs to be done with the same granularity.
> To check that this assumption is not violated, we check the
> memory range we want to remove and if a) any memory block has
> vmemmap pages and b) the range spans more than a single memory
> block, we scream out loud and refuse to proceed.
>
> If all is good and the range was using memmap on memory (aka vmemmap pages),
> we construct an altmap structure so free_hugepage_table does the right
> thing and calls vmem_altmap_free instead of free_pagetable.
>
> Signed-off-by: Oscar Salvador <[email protected]>
> Reviewed-by: David Hildenbrand <[email protected]>
> ---
> drivers/base/memory.c | 71 ++++++++++++++++--
> include/linux/memory.h | 8 ++-
> include/linux/memory_hotplug.h | 15 +++-
> include/linux/memremap.h | 2 +-
> include/linux/mmzone.h | 7 +-
> mm/Kconfig | 5 ++
> mm/memory_hotplug.c | 159 ++++++++++++++++++++++++++++++++++++++---
> mm/sparse.c | 2 -
> 8 files changed, 247 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index f209925a5d4e..2e2b2f654f0a 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -173,16 +173,72 @@ static int memory_block_online(struct memory_block *mem)
> {
> unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
> unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
> + unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
> + struct zone *zone;
> + int ret;
> +
> + zone = zone_for_pfn_range(mem->online_type, mem->nid, start_pfn, nr_pages);
> +
> + /*
> + * Although vmemmap pages have a different lifecycle than the pages
> + * they describe (they remain until the memory is unplugged), doing
> + * their initialization and accounting at memory onlining/offlining
> + * stage simplifies things a lot.
"simplify things a lot" is not really helpful to people reading the
code. It would be much better to state reasons here. I would go with
* stage helps to keep accounting easier to follow - e.g.
* vmemmaps belong to the same zone as the onlined memory.
> + */
> + if (nr_vmemmap_pages) {
> + ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages, zone);
> + if (ret)
> + return ret;
> + }
> +
> + ret = online_pages(start_pfn + nr_vmemmap_pages,
> + nr_pages - nr_vmemmap_pages, zone);
> + if (ret) {
> + if (nr_vmemmap_pages)
> + mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
> + return ret;
> + }
> +
> + /*
> + * Account once onlining succeeded. If the zone was unpopulated, it is
> + * now already properly populated.
> + */
> + if (nr_vmemmap_pages)
> + adjust_present_page_count(zone, nr_vmemmap_pages);
>
> - return online_pages(start_pfn, nr_pages, mem->online_type, mem->nid);
> + return ret;
> }
[...]
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index d05056b3c173..5ef626926449 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -42,6 +42,8 @@
> #include "internal.h"
> #include "shuffle.h"
>
> +static bool memmap_on_memory;
> +
> /*
> * online_page_callback contains pointer to current page onlining function.
> * Initially it is generic_online_page(). If it is required it could be
> @@ -641,7 +643,12 @@ EXPORT_SYMBOL_GPL(generic_online_page);
> static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
> {
> const unsigned long end_pfn = start_pfn + nr_pages;
> - unsigned long pfn;
> + unsigned long pfn = start_pfn;
> +
> + while (!IS_ALIGNED(pfn, MAX_ORDER_NR_PAGES)) {
> + (*online_page_callback)(pfn_to_page(pfn), pageblock_order);
> + pfn += pageblock_nr_pages;
> + }
I believe we do not need to check for nr_pages as the actual operation
will never run out of range in practice but the code is more subtle than
necessary. Using two different iteration styles is also hurting the code
readability. I would go with the following
for (pfn = start_pfn; pfn < end_pfn; ) {
	unsigned long order = min(MAX_ORDER - 1UL, __ffs(pfn));

	/* Shrink the chunk until it fits within the range. */
	while (pfn + (1UL << order) > end_pfn)
		order--;
	(*online_page_callback)(pfn_to_page(pfn), order);
	pfn += 1UL << order;
}
which is what __free_pages_memory does already.
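For readers who want to try the iteration outside the kernel, a userspace
sketch of the same idea follows: at each step, take the largest naturally
aligned power-of-two chunk that still fits in the range. This is an
illustration under assumptions, not the kernel code: SKETCH_MAX_ORDER,
lowest_set_bit() and walk_range() are hypothetical names, and
__builtin_ctzl() (a GCC/Clang builtin) stands in for __ffs().

```c
#include <assert.h>
#include <stddef.h>

/* Assumed: the largest chunk is order 10 (MAX_ORDER - 1 with MAX_ORDER 11). */
#define SKETCH_MAX_ORDER 11UL

/* Stand-in for __ffs(); pfn 0 is aligned to everything, so use the cap. */
static unsigned long lowest_set_bit(unsigned long x)
{
	return x ? (unsigned long)__builtin_ctzl(x) : SKETCH_MAX_ORDER - 1;
}

/*
 * Split [start_pfn, end_pfn) into aligned power-of-two chunks and store
 * the order of each chunk in orders[] (up to max_chunks entries).
 * Returns the total number of chunks.
 */
static size_t walk_range(unsigned long start_pfn, unsigned long end_pfn,
			 unsigned int *orders, size_t max_chunks)
{
	unsigned long pfn = start_pfn;
	size_t n = 0;

	assert(start_pfn <= end_pfn);
	while (pfn < end_pfn) {
		unsigned long order = lowest_set_bit(pfn);

		if (order > SKETCH_MAX_ORDER - 1)
			order = SKETCH_MAX_ORDER - 1;
		/* Shrink the chunk until it no longer overshoots the end. */
		while (pfn + (1UL << order) > end_pfn)
			order--;
		if (n < max_chunks)
			orders[n] = (unsigned int)order;
		n++;
		pfn += 1UL << order;
	}
	return n;
}
```

For a range starting one pageblock into a section, e.g. walk_range(512, 32768,
...), this yields one order-9 chunk followed by order-10 chunks, which is the
case the two-loop version in the patch handles with its separate leading loop.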
>
> /*
> * Online the pages in MAX_ORDER - 1 aligned chunks. The callback might
> @@ -649,7 +656,7 @@ static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
> * later). We account all pages as being online and belonging to this
> * zone ("present").
> */
> - for (pfn = start_pfn; pfn < end_pfn; pfn += MAX_ORDER_NR_PAGES)
> + for (; pfn < end_pfn; pfn += MAX_ORDER_NR_PAGES)
> (*online_page_callback)(pfn_to_page(pfn), MAX_ORDER - 1);
>
> /* mark all involved sections as online */
[...]
> @@ -1848,6 +1964,31 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
> if (rc)
> return rc;
>
> + /*
> + * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
> + * the same granularity it was added - a single memory block.
> + */
> + if (memmap_on_memory) {
> + nr_vmemmap_pages = walk_memory_blocks(start, size, NULL,
> + get_nr_vmemmap_pages_cb);
> + if (nr_vmemmap_pages) {
> + if (size != memory_block_size_bytes()) {
> + pr_warn("Refuse to remove %#llx - %#llx,"
> + "wrong granularity\n",
> + start, start + size);
> + return -EINVAL;
> + }
> +
> + /*
> + * Let remove_pmd_table->free_hugepage_table do the
> + * right thing if we used vmem_altmap when hot-adding
> + * the range.
> + */
> + mhp_altmap.alloc = nr_vmemmap_pages;
> + altmap = &mhp_altmap;
> + }
> + }
> +
> /* remove memmap entry */
> firmware_map_remove(start, start + size, "System RAM");
I have to say I still dislike this and I would just wrap it inside out
and do the operation from within walk_memory_blocks but I will not
insist.
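For reference, the granularity rule in the quoted hunk can be modelled in
userspace roughly as below. This is only a sketch: struct sketch_block,
first_vmemmap_count() and check_remove_range() are hypothetical, the range is
represented as an array of blocks rather than start/size, and it assumes the
block walk stops at the first callback that returns non-zero.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the one memory_block field used here. */
struct sketch_block {
	unsigned long nr_vmemmap_pages;
};

/* Model of the walk: report the first non-zero vmemmap page count. */
static unsigned long first_vmemmap_count(const struct sketch_block *blocks,
					 size_t nr_blocks)
{
	for (size_t i = 0; i < nr_blocks; i++)
		if (blocks[i].nr_vmemmap_pages)
			return blocks[i].nr_vmemmap_pages;
	return 0;
}

/*
 * Refuse removal when any block in the range hosts its own memmap and
 * the range covers more than that single block. On success,
 * *altmap_pages is the number of pages an altmap would have to free.
 */
static int check_remove_range(const struct sketch_block *blocks,
			      size_t nr_blocks, unsigned long *altmap_pages)
{
	unsigned long nr;

	assert(blocks && altmap_pages);
	nr = first_vmemmap_count(blocks, nr_blocks);
	*altmap_pages = 0;
	if (nr && nr_blocks != 1)
		return -EINVAL;
	*altmap_pages = nr;
	return 0;
}
```

Doing the check from within the walk itself, as suggested above, would fold
first_vmemmap_count() and the size test into a single callback.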
--
Michal Hocko
SUSE Labs
On Fri 16-04-21 13:24:10, Oscar Salvador wrote:
> Enable x86_64 platform to use the MHP_MEMMAP_ON_MEMORY feature.
>
> Signed-off-by: Oscar Salvador <[email protected]>
> Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Michal Hocko <[email protected]>
> ---
> arch/x86/Kconfig | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 2792879d398e..9f0211df1746 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -2433,6 +2433,9 @@ config ARCH_ENABLE_MEMORY_HOTREMOVE
> def_bool y
> depends on MEMORY_HOTPLUG
>
> +config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
> + def_bool y
> +
> config USE_PERCPU_NUMA_NODE_ID
> def_bool y
> depends on NUMA
> --
> 2.16.3
--
Michal Hocko
SUSE Labs
On Tue, Apr 20, 2021 at 11:45:55AM +0200, Michal Hocko wrote:
> On Fri 16-04-21 13:24:06, Oscar Salvador wrote:
> > From: David Hildenbrand <[email protected]>
> >
> > Let's have a single place (inspired by adjust_managed_page_count()) where
> > we adjust present pages.
> > In contrast to adjust_managed_page_count(), only memory onlining/offlining
> > is allowed to modify the number of present pages.
> >
> > Signed-off-by: David Hildenbrand <[email protected]>
> > Signed-off-by: Oscar Salvador <[email protected]>
> > Reviewed-by: Oscar Salvador <[email protected]>
>
> Not sure self review counts ;)
Uhm, the original author is David, I just added my signed-off-by as a deliverer.
I thought that in that case it was ok to stick my Reviewed-by.
Or maybe my signed-off-by carries that implicitly.
> Acked-by: Michal Hocko <[email protected]>
>
> Btw. I strongly suspect the resize lock is quite pointless here.
> Something for a follow up patch.
What makes you think that?
I have been thinking about this, let us ignore this patch for a moment.
If I poked the code correctly, node_size_lock is taken in:
remove_pfn_range_from_zone()
move_pfn_range_to_zone()
both of them handling {zone,node}->spanned_pages
Then we take it in {offline,online}_pages() for {zone,node}->present_pages.
The other places where we take it are __init functions, so not of interest.
Given that {offline,online}_pages() is serialized by the memory_hotplug lock,
I would say that {node,zone}->{spanned,present}_pages is, at any time, stable?
So, no need for the lock even without considering this patch?
Now, getting back to this patch.
adjust_present_page_count() will be called from memory_block_online(), which
is not holding the memory_hotplug lock yet.
But, we only fiddle with present pages out of {online,offline}_pages() if
we have vmemmap pages, and since that operates on the same memory block,
its lock should serialize that.
I think I went down a rabbit hole, I am slightly confused now.
--
Oscar Salvador
SUSE L3
On 21.04.21 10:00, Oscar Salvador wrote:
> On Tue, Apr 20, 2021 at 11:45:55AM +0200, Michal Hocko wrote:
>> On Fri 16-04-21 13:24:06, Oscar Salvador wrote:
>>> From: David Hildenbrand <[email protected]>
>>>
>>> Let's have a single place (inspired by adjust_managed_page_count()) where
>>> we adjust present pages.
>>> In contrast to adjust_managed_page_count(), only memory onlining/offlining
>>> is allowed to modify the number of present pages.
>>>
>>> Signed-off-by: David Hildenbrand <[email protected]>
>>> Signed-off-by: Oscar Salvador <[email protected]>
>>> Reviewed-by: Oscar Salvador <[email protected]>
>>
>> Not sure self review counts ;)
>
> Uhm, the original author is David, I just added my signed-off-by as a deliverer.
> I thought that in that case was ok to stick my Reviewed-by.
> Or maybe my signed-off-by carries that implicitly.
>
>> Acked-by: Michal Hocko <[email protected]>
>>
>> Btw. I strongly suspect the resize lock is quite pointless here.
>> Something for a follow up patch.
>
> What makes you think that?
> I have been thinking about this, let us ignore this patch for a moment.
>
> If I poked the code correctly, node_size_lock is taken in:
>
> remove_pfn_range_from_zone()
> move_pfn_range_to_zone()
>
> both of them handling {zone,node}->spanned_pages
>
> Then we take it in {offline,online}_pages() for {zone,node}->present_pages.
>
> The other places where we take it are __init functions, so not of interest.
>
> Given that {offline,online}_pages() is serialized by the memory_hotplug lock,
> I would say that {node,zone}->{spanned,present}_pages is, at any time, stable?
> So, no need for the lock even without considering this patch?
>
> Now, getting back to this patch.
> adjust_present_page_count() will be called from memory_block_online(), which
> is not holding the memory_hotplug lock yet.
> But, we only fiddle with present pages out of {online,offline}_pages() if
> we have vmemmap pages, and since that operates on the same memory block,
> its lock should serialize that.
>
> I think I went down a rabbit hole, I am slightly confused now.
We always hold the device hotplug lock when onlining/offlining memory.
I agree that the lock might be unnecessary (had the same thoughts a
while ago), we can look into that in the future.
--
Thanks,
David / dhildenb
On Wed 21-04-21 10:00:36, Oscar Salvador wrote:
> On Tue, Apr 20, 2021 at 11:45:55AM +0200, Michal Hocko wrote:
> > On Fri 16-04-21 13:24:06, Oscar Salvador wrote:
> > > From: David Hildenbrand <[email protected]>
> > >
> > > Let's have a single place (inspired by adjust_managed_page_count()) where
> > > we adjust present pages.
> > > In contrast to adjust_managed_page_count(), only memory onlining/offlining
> > > is allowed to modify the number of present pages.
> > >
> > > Signed-off-by: David Hildenbrand <[email protected]>
> > > Signed-off-by: Oscar Salvador <[email protected]>
> > > Reviewed-by: Oscar Salvador <[email protected]>
> >
> > Not sure self review counts ;)
>
> Uhm, the original author is David, I just added my signed-off-by as a deliverer.
> I thought that in that case was ok to stick my Reviewed-by.
> Or maybe my signed-off-by carries that implicitly.
Yeah I do expect that one should review one's own changes but this is not
really anything to lose sleep over.
> > Acked-by: Michal Hocko <[email protected]>
> >
> > Btw. I strongly suspect the resize lock is quite pointless here.
> > Something for a follow up patch.
>
> What makes you think that?
* Write access to present_pages at runtime should be protected by
* mem_hotplug_begin/end(). Any reader who can't tolerant drift of
* present_pages should get_online_mems() to get a stable value.
> I have been thinking about this, let us ignore this patch for a moment.
>
> If I poked the code correctly, node_size_lock is taken in:
>
> remove_pfn_range_from_zone()
> move_pfn_range_to_zone()
>
> both of them handling {zone,node}->spanned_pages
>
> Then we take it in {offline,online}_pages() for {zone,node}->present_pages.
>
> The other places where we take it are __init functions, so not of interest.
>
> Given that {offline,online}_pages() is serialized by the memory_hotplug lock,
> I would say that {node,zone}->{spanned,present}_pages is, at any time, stable?
> So, no need for the lock even without considering this patch?
Yes. The resize lock is really only relevant to the parallel struct page
initialization during early boot. The hotplug usage seems to be just a
leftover from the past, or maybe it has never been really relevant in that
context.
> Now, getting back to this patch.
> adjust_present_page_count() will be called from memory_block_online(), which
> is not holding the memory_hotplug lock yet.
> But, we only fiddle with present pages out of {online,offline}_pages() if
> we have vmemmap pages, and since that operates on the same memory block,
> its lock should serialize that.
Memory hotplug is always synchronized on the device level.
--
Michal Hocko
SUSE Labs
On Wed, Apr 21, 2021 at 10:31:30AM +0200, Michal Hocko wrote:
> > Given that {offline,online}_pages() is serialized by the memory_hotplug lock,
> > I would say that {node,zone}->{spanned,present}_pages is, at any time, stable?
> > So, no need for the lock even without considering this patch?
>
> Yes. The resize lock is really only relevant to the parallel struct page
> initialization during early boot. The hotplug usage seems just a left
> over from the past or maybe it has never been really relevant in that
> context.
Ok, I will prepare a follow-up patch to remove the lock in such situations
when this work goes in.
--
Oscar Salvador
SUSE L3
On Wed 21-04-21 10:15:46, Oscar Salvador wrote:
> On Tue, Apr 20, 2021 at 12:56:03PM +0200, Michal Hocko wrote:
[...]
> > necessary. Using two different iteration styles is also hurting the code
> > readability. I would go with the following
> > 	for (pfn = start_pfn; pfn < end_pfn; ) {
> > 		unsigned long order = min(MAX_ORDER - 1UL, __ffs(pfn));
> >
> > 		while (pfn + (1UL << order) > end_pfn)
> > 			order--;
> > 		(*online_page_callback)(pfn_to_page(pfn), order);
> > 		pfn += 1UL << order;
> > 	}
> >
> > which is what __free_pages_memory does already.
>
> this is kinda what I used to have in the early versions, but it was agreed
> with David to split it in two loops to make it explicit.
> I can go back to that if it is preferred.
Not that I would insist but I find it better to use common constructs
when it doesn't hurt readability. The order evaluation can be even done
in a trivial helper.
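
For illustration only, the single-loop variant with the order evaluation
pulled into a helper could look like the user-space sketch below. The
sketch_* names, the SKETCH_MAX_ORDER value of 11 and the counting callback
are assumptions made for this example and are not taken from the patch; the
callback merely stands in for (*online_page_callback)().

```c
#include <assert.h>

/* Assumption: stands in for MAX_ORDER; 11 is only an example value. */
#define SKETCH_MAX_ORDER 11UL

/* Index of the least significant set bit; word must be non-zero. */
static unsigned long sketch_ffs(unsigned long word)
{
	unsigned long bit = 0;

	assert(word != 0);
	while (!(word & 1UL)) {
		word >>= 1;
		bit++;
	}
	return bit;
}

/*
 * Largest order such that the chunk starting at pfn is naturally aligned,
 * does not exceed SKETCH_MAX_ORDER - 1 and does not cross end_pfn.
 */
static unsigned long sketch_online_order(unsigned long pfn,
					 unsigned long end_pfn)
{
	unsigned long order = SKETCH_MAX_ORDER - 1UL;

	assert(pfn < end_pfn);
	/* pfn 0 is aligned to every order, so only cap for non-zero pfns. */
	if (pfn != 0 && sketch_ffs(pfn) < order)
		order = sketch_ffs(pfn);
	/* Order 0 always fits because pfn < end_pfn, so this terminates. */
	while (pfn + (1UL << order) > end_pfn)
		order--;
	return order;
}

typedef void (*sketch_online_cb)(unsigned long pfn, unsigned long order,
				 void *ctx);

/* Hand the range to the callback in the largest possible aligned chunks. */
static void sketch_online_pages_range(unsigned long start_pfn,
				      unsigned long nr_pages,
				      sketch_online_cb cb, void *ctx)
{
	const unsigned long end_pfn = start_pfn + nr_pages;
	unsigned long pfn;

	for (pfn = start_pfn; pfn < end_pfn; ) {
		unsigned long order = sketch_online_order(pfn, end_pfn);

		cb(pfn, order, ctx);
		pfn += 1UL << order;
	}
}

/* Example callback: only counts what it is given. */
struct sketch_stats {
	unsigned long chunks;
	unsigned long pages;
	unsigned long max_order;
};

static void sketch_count_cb(unsigned long pfn, unsigned long order, void *ctx)
{
	struct sketch_stats *st = ctx;

	(void)pfn;
	st->chunks++;
	st->pages += 1UL << order;
	if (order > st->max_order)
		st->max_order = order;
}
```

With a range that starts one 512-page hole into a 32768-page block, the
first chunk comes out at order 9 and the remaining ones at the maximum
order, which is the case the separate first loop handles today.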
> > > + if (memmap_on_memory) {
> > > + nr_vmemmap_pages = walk_memory_blocks(start, size, NULL,
> > > + get_nr_vmemmap_pages_cb);
> > > + if (nr_vmemmap_pages) {
> > > + if (size != memory_block_size_bytes()) {
> > > + pr_warn("Refuse to remove %#llx - %#llx,"
> > > + "wrong granularity\n",
> > > + start, start + size);
> > > + return -EINVAL;
> > > + }
> > > +
> > > + /*
> > > + * Let remove_pmd_table->free_hugepage_table do the
> > > + * right thing if we used vmem_altmap when hot-adding
> > > + * the range.
> > > + */
> > > + mhp_altmap.alloc = nr_vmemmap_pages;
> > > + altmap = &mhp_altmap;
> > > + }
> > > + }
> > > +
> > > /* remove memmap entry */
> > > firmware_map_remove(start, start + size, "System RAM");
> >
> > I have to say I still dislike this and I would just wrap it inside out
> > and do the operation from within walk_memory_blocks but I will not
> > insist.
>
> I have to confess I forgot about the details of that discussion, as we were
> quite focused on decoupling vmemmap pages from {online,offline} interface.
> Would you mind elaborating a bit more?
As I've said I will not insist and this can be done in the follow up.
You are iterating over memory blocks just to refuse to do an operation
which can be split to several memory blocks. See
http://lkml.kernel.org/r/[email protected] and follow
walk_memory_blocks(start, size, NULL, remove_memory_block_cb)
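
To make the idea a bit more concrete, here is a rough user-space sketch of
doing the work per memory block from within the walk. Everything named
sketch_* is hypothetical, including the fixed 128 MiB block size and the
simplified walker; it is not the kernel's walk_memory_blocks() and not what
the patch does.

```c
#include <assert.h>

/* Assumption: fixed block size, picked only for the example. */
#define SKETCH_BLOCK_SIZE (128ULL << 20)

typedef int (*sketch_block_cb)(unsigned long long block_start, void *arg);

/*
 * Simplified stand-in for a block walker: call cb once per block in
 * [start, start + size) and stop at the first non-zero return value.
 */
static int sketch_walk_blocks(unsigned long long start,
			      unsigned long long size, void *arg,
			      sketch_block_cb cb)
{
	unsigned long long addr;
	int ret = 0;

	assert(start % SKETCH_BLOCK_SIZE == 0);
	assert(size % SKETCH_BLOCK_SIZE == 0);

	for (addr = start; addr < start + size; addr += SKETCH_BLOCK_SIZE) {
		ret = cb(addr, arg);
		if (ret)
			break;
	}
	return ret;
}

/* Hypothetical per-block removal: fails once fail_at blocks are gone. */
struct sketch_remove_state {
	unsigned int removed;
	unsigned int fail_at;
};

static int sketch_remove_block_cb(unsigned long long block_start, void *arg)
{
	struct sketch_remove_state *st = arg;

	(void)block_start;
	if (st->removed == st->fail_at)
		return -1;
	st->removed++;
	return 0;
}
```

Note that in this sketch a failure in the middle leaves the earlier blocks
already processed, so a real implementation would have to decide how to
handle such partial results.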
--
Michal Hocko
SUSE Labs
On Wed, Apr 21, 2021 at 10:39:16AM +0200, Michal Hocko wrote:
> Not that I would insist but I find it better to use common constructs
> when it doesn't hurt readability. The order evaluation can be even done
> in a trivial helper.
Uhm, I will have a look at how it turns out.
Maybe with a nice comment explaining what is going on, it can make it in.
If not, I can always keep what we have atm.
> As I've said I will not insist and this can be done in the follow up.
> You are iterating over memory blocks just to refuse to do an operation
> which can be split to several memory blocks. See
> http://lkml.kernel.org/r/[email protected] and follow
> walk_memory_blocks(start, size, NULL, remove_memory_block_cb)
Ok, thanks for the link.
I will have a look, but I would rather do it as a follow-up.
--
Oscar Salvador
SUSE L3
On Tue, Apr 20, 2021 at 11:40:50AM +0200, Michal Hocko wrote:
> On Fri 16-04-21 13:24:05, Oscar Salvador wrote:
> > When using self-hosted vmemmap pages, the number of pages passed to
> > {online,offline}_pages might not fully span sections, but they always
> > fully span pageblocks.
> > Relax the check to account for that case.
>
> It would be good to call those out explicitly. It would be also
> great to explain why pageblock_nr_pages is an actual constraint. There
> shouldn't be any real reason for that except for "we want online_pages
> to operate on whole memblocks and memmap_on_memory will poke
> pageblock_nr_pages aligned holes in the beginning which is a special
> case we want to allow."
Sounds good.
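
Just to spell out what the relaxed check amounts to, a small user-space
sketch follows. The sketch_* names and the two constants (512 pages per
pageblock, 32768 pages per section) are assumptions for illustration, and
requiring the end of the range to be section aligned is my reading of the
special case, not a quote from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumptions: example values only, not read from any kernel config. */
#define SKETCH_PAGEBLOCK_NR_PAGES 512UL
#define SKETCH_PAGES_PER_SECTION  32768UL

/* True if x is a multiple of a; a must be a non-zero power of two. */
static bool sketch_is_aligned(unsigned long x, unsigned long a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (x & (a - 1)) == 0;
}

/*
 * Relaxed check: the start only has to be pageblock aligned (vmemmap pages
 * may poke a hole at the beginning), while the end still has to fall on a
 * section boundary.
 */
static bool sketch_range_ok(unsigned long pfn, unsigned long nr_pages)
{
	return nr_pages != 0 &&
	       sketch_is_aligned(pfn, SKETCH_PAGEBLOCK_NR_PAGES) &&
	       sketch_is_aligned(pfn + nr_pages, SKETCH_PAGES_PER_SECTION);
}
```

A full section passes as before, a section minus a leading 512-page hole
passes as well, and a range starting at an arbitrary offset is refused.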
>
> > Signed-off-by: Oscar Salvador <[email protected]>
> > Reviewed-by: David Hildenbrand <[email protected]>
>
> With the changelog extended and the comment clarification (see below)
> feel free to add
Ok, thanks for the suggestion Michal.
> Acked-by: Michal Hocko <[email protected]>
--
Oscar Salvador
SUSE L3
On Tue, Apr 20, 2021 at 12:56:03PM +0200, Michal Hocko wrote:
> On Fri 16-04-21 13:24:07, Oscar Salvador wrote:
> > Physical memory hotadd has to allocate a memmap (struct page array) for
> > the newly added memory section. Currently, alloc_pages_node() is used
> > for those allocations.
> >
> > This has some disadvantages:
> > a) an existing memory is consumed for that purpose
> > (eg: ~2MB per 128MB memory section on x86_64)
>
> I would extend this slightly. This can even lead to extreme cases where
> system goes OOM because the physically hotplugged memory depletes the
> available memory before it is onlined.
Ok.
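
For reference, the ~2MB figure can be reproduced with the small user-space
sketch below. The 4 KiB page size and the 64-byte struct page are
assumptions (other configurations give other numbers), and the sketch_*
helpers are hypothetical names used only here.

```c
#include <assert.h>

/* Bytes of memmap needed to describe range_bytes of memory. */
static unsigned long long sketch_memmap_bytes(unsigned long long range_bytes,
					      unsigned long long page_size,
					      unsigned long long struct_page_size)
{
	assert(page_size != 0 && range_bytes % page_size == 0);
	return (range_bytes / page_size) * struct_page_size;
}

/* Number of pages that memmap occupies, rounded up to whole pages. */
static unsigned long long sketch_nr_vmemmap_pages(unsigned long long range_bytes,
						  unsigned long long page_size,
						  unsigned long long struct_page_size)
{
	unsigned long long bytes =
		sketch_memmap_bytes(range_bytes, page_size, struct_page_size);

	return (bytes + page_size - 1) / page_size;
}
```

For 128 MiB with 4 KiB pages and a 64-byte struct page this gives 32768
struct pages, i.e. 2 MiB or 512 pages of memmap.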
> > Vmemmap page tables can map arbitrary memory.
> > That means that we can simply use the beginning of each memory section and
> > map struct pages there.
>
> Again this can be confusing because this is not what is really happening
> in practice because we are going to have a multisection memory block
> where all sections will be backed by a common reserved space rather than
> per section sparse space. I would go with
>
> "
> Vmemmap page tables can map arbitrary memory. That means that we can
> reserve a part of the physically hotadded memory to back vmemmap page
> tables. This implementation uses the beginning of the hotplugged memory
> for that purpose.
> "
Yeah, I thought I fixed that, it should have been "That means that we can simply
use the beginning of each memory block...", but I am ok with your rewording.
> There is quite a large leap from __populate_section_memmap to the
> memory_block that deserves explaining to not lose all the subtle things
> discussed in the past. I think it should be made clear why all the fuss.
> I would structure it as follows:
> "
> There are some non-obvious things to consider though. Vmemmap
> pages are allocated/freed during the memory hotplug events
> (add_memory_resource, try_remove_memory) when the memory is
> added/removed. This means that the reserved physical range is not online
> yet it is used. The most obvious side effect is that pfn_to_online_page
> returns NULL for those pfns. The current design expects that this
> should be OK as the hotplugged memory is considered garbage until it
> is onlined. For example hibernation wouldn't save the content of those
> vmemmaps into the image so it wouldn't be restored on resume but this
> should be OK as there is no real content to recover anyway while metadata
> is reachable from other data structures (e.g. vmemmap page tables).
>
> The reserved space is therefore (de)initialized during the {on,off}line
> events (mhp_{de}init_memmap_on_memory). That is done by extracting page
> allocator independent initialization from the regular onlining path.
> The primary reason to handle the reserved space outside of {on,off}line_pages
> is to make each initialization specific to the purpose rather than
> special case them in a single function.
Ok, that definitely adds valuable information.
> > diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> > index f209925a5d4e..2e2b2f654f0a 100644
> > --- a/drivers/base/memory.c
> > +++ b/drivers/base/memory.c
> > @@ -173,16 +173,72 @@ static int memory_block_online(struct memory_block *mem)
> > {
> > unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
> > unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
> > + unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
> > + struct zone *zone;
> > + int ret;
> > +
> > + zone = zone_for_pfn_range(mem->online_type, mem->nid, start_pfn, nr_pages);
> > +
> > + /*
> > + * Although vmemmap pages have a different lifecycle than the pages
> > + * they describe (they remain until the memory is unplugged), doing
> > + * their initialization and accounting at memory onlining/offlining
> > + * stage simplifies things a lot.
>
> "simplify things a lot" is not really helpful to people reading the
> code. It would be much better to state reasons here. I would go with
> * stage helps to keep accounting easier to follow - e.g.
> * vmemmaps belong to the same zone as the onlined memory.
Ok
> > static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
> > {
> > const unsigned long end_pfn = start_pfn + nr_pages;
> > - unsigned long pfn;
> > + unsigned long pfn = start_pfn;
> > +
> > + while (!IS_ALIGNED(pfn, MAX_ORDER_NR_PAGES)) {
> > + (*online_page_callback)(pfn_to_page(pfn), pageblock_order);
> > + pfn += pageblock_nr_pages;
> > + }
>
> I believe we do not need to check for nr_pages as the actual operation
> will never run out of range in practice but the code is more subtle than
If you mean that IS_ALIGNED(pfn, MAX_ORDER_NR_PAGES) can go, that is not
right.
Of course, with your changes below it would not be necessary.
> necessary. Using two different iteration styles is also hurting the code
> readability. I would go with the following
> 	for (pfn = start_pfn; pfn < end_pfn; ) {
> 		unsigned long order = min(MAX_ORDER - 1UL, __ffs(pfn));
>
> 		while (pfn + (1UL << order) > end_pfn)
> 			order--;
> 		(*online_page_callback)(pfn_to_page(pfn), order);
> 		pfn += 1UL << order;
> 	}
>
> which is what __free_pages_memory does already.
This is kinda what I used to have in the early versions, but it was agreed
with David to split it into two loops to make it explicit.
I can go back to that if it is preferred.
> > + if (memmap_on_memory) {
> > + nr_vmemmap_pages = walk_memory_blocks(start, size, NULL,
> > + get_nr_vmemmap_pages_cb);
> > + if (nr_vmemmap_pages) {
> > + if (size != memory_block_size_bytes()) {
> > + pr_warn("Refuse to remove %#llx - %#llx,"
> > + "wrong granularity\n",
> > + start, start + size);
> > + return -EINVAL;
> > + }
> > +
> > + /*
> > + * Let remove_pmd_table->free_hugepage_table do the
> > + * right thing if we used vmem_altmap when hot-adding
> > + * the range.
> > + */
> > + mhp_altmap.alloc = nr_vmemmap_pages;
> > + altmap = &mhp_altmap;
> > + }
> > + }
> > +
> > /* remove memmap entry */
> > firmware_map_remove(start, start + size, "System RAM");
>
> I have to say I still dislike this and I would just wrap it inside out
> and do the operation from within walk_memory_blocks but I will not
> insist.
I have to confess I forgot about the details of that discussion, as we were
quite focused on decoupling vmemmap pages from {online,offline} interface.
Would you mind elaborating a bit more?
--
Oscar Salvador
SUSE L3
On 21.04.21 10:39, Michal Hocko wrote:
> On Wed 21-04-21 10:15:46, Oscar Salvador wrote:
>> On Tue, Apr 20, 2021 at 12:56:03PM +0200, Michal Hocko wrote:
> [...]
>>> necessary. Using two different iteration styles is also hurting the code
>>> readability. I would go with the following
>>> 	for (pfn = start_pfn; pfn < end_pfn; ) {
>>> 		unsigned long order = min(MAX_ORDER - 1UL, __ffs(pfn));
>>>
>>> 		while (pfn + (1UL << order) > end_pfn)
>>> 			order--;
>>> 		(*online_page_callback)(pfn_to_page(pfn), order);
>>> 		pfn += 1UL << order;
>>> 	}
>>>
>>> which is what __free_pages_memory does already.
>>
>> this is kinda what I used to have in the early versions, but it was agreed
>> with David to split it in two loops to make it explicit.
>> I can go back to that if it is preferred.
>
> Not that I would insist but I find it better to use common constructs
> when it doesn't hurt readability. The order evaluation can be even done
> in a trivial helper.
>
>>>> + if (memmap_on_memory) {
>>>> + nr_vmemmap_pages = walk_memory_blocks(start, size, NULL,
>>>> + get_nr_vmemmap_pages_cb);
>>>> + if (nr_vmemmap_pages) {
>>>> + if (size != memory_block_size_bytes()) {
>>>> + pr_warn("Refuse to remove %#llx - %#llx,"
>>>> + "wrong granularity\n",
>>>> + start, start + size);
>>>> + return -EINVAL;
>>>> + }
>>>> +
>>>> + /*
>>>> + * Let remove_pmd_table->free_hugepage_table do the
>>>> + * right thing if we used vmem_altmap when hot-adding
>>>> + * the range.
>>>> + */
>>>> + mhp_altmap.alloc = nr_vmemmap_pages;
>>>> + altmap = &mhp_altmap;
>>>> + }
>>>> + }
>>>> +
>>>> /* remove memmap entry */
>>>> firmware_map_remove(start, start + size, "System RAM");
>>>
>>> I have to say I still dislike this and I would just wrap it inside out
>>> and do the operation from within walk_memory_blocks but I will not
>>> insist.
>>
>> I have to confess I forgot about the details of that discussion, as we were
>> quite focused on decoupling vmemmap pages from {online,offline} interface.
>> Would you mind elaborating a bit more?
>
> As I've said I will not insist and this can be done in the follow up.
> You are iterating over memory blocks just to refuse to do an operation
> which can be split to several memory blocks. See
> http://lkml.kernel.org/r/[email protected] and follow
> walk_memory_blocks(start, size, NULL, remove_memory_block_cb)
>
We'll have to be careful in general when removing memory in a different
granularity than it was added, especially when calling arch_remove_memory()
in a smaller granularity than it was added via arch_add_memory(). We might
fail to tear down the direct map: imagine having mapped a 1GiB page but
then deciding to remove individual 128 MiB chunks -- that won't work and the
direct map would currently remain.
So this should be handled separately in the future.
--
Thanks,
David / dhildenb
On Wed 21-04-21 10:44:38, David Hildenbrand wrote:
> On 21.04.21 10:39, Michal Hocko wrote:
> > On Wed 21-04-21 10:15:46, Oscar Salvador wrote:
> > > On Tue, Apr 20, 2021 at 12:56:03PM +0200, Michal Hocko wrote:
> > [...]
> > > > necessary. Using two different iteration styles is also hurting the code
> > > > readability. I would go with the following
> > > > 	for (pfn = start_pfn; pfn < end_pfn; ) {
> > > > 		unsigned long order = min(MAX_ORDER - 1UL, __ffs(pfn));
> > > >
> > > > 		while (pfn + (1UL << order) > end_pfn)
> > > > 			order--;
> > > > 		(*online_page_callback)(pfn_to_page(pfn), order);
> > > > 		pfn += 1UL << order;
> > > > 	}
> > > >
> > > > which is what __free_pages_memory does already.
> > >
> > > this is kinda what I used to have in the early versions, but it was agreed
> > > with David to split it in two loops to make it explicit.
> > > I can go back to that if it is preferred.
> >
> > Not that I would insist but I find it better to use common constructs
> > when it doesn't hurt readability. The order evaluation can be even done
> > in a trivial helper.
> >
> > > > > + if (memmap_on_memory) {
> > > > > + nr_vmemmap_pages = walk_memory_blocks(start, size, NULL,
> > > > > + get_nr_vmemmap_pages_cb);
> > > > > + if (nr_vmemmap_pages) {
> > > > > + if (size != memory_block_size_bytes()) {
> > > > > + pr_warn("Refuse to remove %#llx - %#llx,"
> > > > > + "wrong granularity\n",
> > > > > + start, start + size);
> > > > > + return -EINVAL;
> > > > > + }
> > > > > +
> > > > > + /*
> > > > > + * Let remove_pmd_table->free_hugepage_table do the
> > > > > + * right thing if we used vmem_altmap when hot-adding
> > > > > + * the range.
> > > > > + */
> > > > > + mhp_altmap.alloc = nr_vmemmap_pages;
> > > > > + altmap = &mhp_altmap;
> > > > > + }
> > > > > + }
> > > > > +
> > > > > /* remove memmap entry */
> > > > > firmware_map_remove(start, start + size, "System RAM");
> > > >
> > > > I have to say I still dislike this and I would just wrap it inside out
> > > > and do the operation from within walk_memory_blocks but I will not
> > > > insist.
> > >
> > > I have to confess I forgot about the details of that discussion, as we were
> > > quite focused on decoupling vmemmap pages from {online,offline} interface.
> > > Would you mind elaborating a bit more?
> >
> > As I've said I will not insist and this can be done in the follow up.
> > You are iterating over memory blocks just to refuse to do an operation
> > which can be split to several memory blocks. See
> > http://lkml.kernel.org/r/[email protected] and follow
> > walk_memory_blocks(start, size, NULL, remove_memory_block_cb)
> >
>
> We'll have to be careful in general when removing memory in different
> granularity than it was added, especially calling arch_remove_memory() in
> smaller granularity than it was added via arch_add_memory(). We might fail
> to tear down the direct map, imagine having mapped a 1GiB page but decide to
> remove individual 128 MiB chunks -- that won't work and the direct map would
> currently remain.
Agreed, but I am not referring to an arbitrary hotremove path. All I am
pointing at is to split the range up into memory blocks and do the same kind
of work on each separately. Partial failures might turn out to be more tricky
and, as I've said, I do not insist on doing that right now, but it is a bit
weird to outright fail the operation even when in fact there are more
blocks to be hot removed at once.
--
Michal Hocko
SUSE Labs
On 21.04.21 10:49, Michal Hocko wrote:
> On Wed 21-04-21 10:44:38, David Hildenbrand wrote:
>> On 21.04.21 10:39, Michal Hocko wrote:
>>> On Wed 21-04-21 10:15:46, Oscar Salvador wrote:
>>>> On Tue, Apr 20, 2021 at 12:56:03PM +0200, Michal Hocko wrote:
>>> [...]
>>>>> necessary. Using two different iteration styles is also hurting the code
>>>>> readability. I would go with the following
>>>>> 	for (pfn = start_pfn; pfn < end_pfn; ) {
>>>>> 		unsigned long order = min(MAX_ORDER - 1UL, __ffs(pfn));
>>>>>
>>>>> 		while (pfn + (1UL << order) > end_pfn)
>>>>> 			order--;
>>>>> 		(*online_page_callback)(pfn_to_page(pfn), order);
>>>>> 		pfn += 1UL << order;
>>>>> 	}
>>>>>
>>>>> which is what __free_pages_memory does already.
>>>>
>>>> this is kinda what I used to have in the early versions, but it was agreed
>>>> with David to split it in two loops to make it explicit.
>>>> I can go back to that if it is preferred.
>>>
>>> Not that I would insist but I find it better to use common constructs
>>> when it doesn't hurt readability. The order evaluation can be even done
>>> in a trivial helper.
>>>
>>>>>> + if (memmap_on_memory) {
>>>>>> + nr_vmemmap_pages = walk_memory_blocks(start, size, NULL,
>>>>>> + get_nr_vmemmap_pages_cb);
>>>>>> + if (nr_vmemmap_pages) {
>>>>>> + if (size != memory_block_size_bytes()) {
>>>>>> + pr_warn("Refuse to remove %#llx - %#llx,"
>>>>>> + "wrong granularity\n",
>>>>>> + start, start + size);
>>>>>> + return -EINVAL;
>>>>>> + }
>>>>>> +
>>>>>> + /*
>>>>>> + * Let remove_pmd_table->free_hugepage_table do the
>>>>>> + * right thing if we used vmem_altmap when hot-adding
>>>>>> + * the range.
>>>>>> + */
>>>>>> + mhp_altmap.alloc = nr_vmemmap_pages;
>>>>>> + altmap = &mhp_altmap;
>>>>>> + }
>>>>>> + }
>>>>>> +
>>>>>> /* remove memmap entry */
>>>>>> firmware_map_remove(start, start + size, "System RAM");
>>>>>
>>>>> I have to say I still dislike this and I would just wrap it inside out
>>>>> and do the operation from within walk_memory_blocks but I will not
>>>>> insist.
>>>>
>>>> I have to confess I forgot about the details of that discussion, as we were
>>>> quite focused on decoupling vmemmap pages from {online,offline} interface.
>>>> Would you mind elaborating a bit more?
>>>
>>> As I've said I will not insist and this can be done in the follow up.
>>> You are iterating over memory blocks just to refuse to do an operation
>>> which can be split to several memory blocks. See
>>> http://lkml.kernel.org/r/[email protected] and follow
>>> walk_memory_blocks(start, size, NULL, remove_memory_block_cb)
>>>
>>
>> We'll have to be careful in general when removing memory in different
>> granularity than it was added, especially calling arch_remove_memory() in
>> smaller granularity than it was added via arch_add_memory(). We might fail
>> to tear down the direct map, imagine having mapped a 1GiB page but decide to
>> remove individual 128 MiB chunks -- that won't work and the direct map would
>> currently remain.
>
> Agreed, but I am not referring to an arbitrary hotremove path. All I am
> pointing at is to split the range up into memory blocks and do the same kind
> of work on each separately. Partial failures might turn out to be more tricky
> and, as I've said, I do not insist on doing that right now, but it is a bit
> weird to outright fail the operation even when in fact there are more
> blocks to be hot removed at once.
Agreed. But we should also focus on what actual users need, to see if
it's worth the trouble (I know of none that will be using memmap_on_memory).
--
Thanks,
David / dhildenb