2021-05-06 15:30:35

by Zi Yan

Subject: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

From: Zi Yan <[email protected]>

Hi all,

This patchset tries to remove the restriction on memory hotplug/hotremove
granularity, which is always greater than or equal to the memory section size[1].
With the patchset, the kernel is able to online/offline memory at a size independent
of the memory section size, as small as 2MB (the subsection size).

The motivation is to increase MAX_ORDER of the buddy allocator and the pageblock
size without increasing the memory hotplug/hotremove granularity at the same time,
so that the kernel can allocate 1GB pages using the buddy allocator and utilize the
existing pageblock-based anti-fragmentation, paving the road for 1GB THP
support[2].

The patchset utilizes the existing subsection support[3] and changes the
section size alignment checks to subsection size alignment checks. There are
also changes to the pageblock code to support partial pageblocks, for when the
pageblock size is increased along with MAX_ORDER. Increasing the pageblock size
enables the kernel to utilize the existing anti-fragmentation mechanism for
gigantic page allocations.
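
For illustration, the core change is to relax the section-size alignment check
in online_pages()/offline_pages() to a subsection-size check (simplified from
patch 5):

    /* before: only whole sections (with pageblock-aligned start) can be onlined */
    if (WARN_ON_ONCE(!nr_pages ||
                     !IS_ALIGNED(pfn, pageblock_nr_pages) ||
                     !IS_ALIGNED(pfn + nr_pages, PAGES_PER_SECTION)))
            return -EINVAL;

    /* after: any subsection-aligned range (2MB with 4KB pages) can be onlined */
    if (WARN_ON_ONCE(!nr_pages ||
                     !IS_ALIGNED(pfn + nr_pages, PAGES_PER_SUBSECTION)))
            return -EINVAL;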

The last patch increases SECTION_SIZE_BITS to demonstrate the use of memory
hotplug/hotremove subsection, but is not intended to be merged as is. It is
there in case one wants to try this out and will be removed during the final
submission.

Feel free to give suggestions and comments. I am looking forward to your
feedback.

Thanks.

Zi Yan (7):
mm: sparse: set/clear subsection bitmap when pages are
onlined/offlined.
mm: set pageblock_order to the max of HUGETLB_PAGE_ORDER and
MAX_ORDER-1
mm: memory_hotplug: decouple memory_block size with section size.
mm: pageblock: allow set/unset migratetype for partial pageblock
mm: memory_hotplug, sparse: enable memory hotplug/hotremove
subsections
arch: x86: no MAX_ORDER exceeds SECTION_SIZE check for 32bit vdso.
[not for merge] mm: increase SECTION_SIZE_BITS to 31

arch/ia64/Kconfig | 1 -
arch/powerpc/Kconfig | 1 -
arch/x86/Kconfig | 15 +++
arch/x86/entry/vdso/Makefile | 1 +
arch/x86/include/asm/sparsemem.h | 2 +-
drivers/base/memory.c | 176 +++++++++++++++----------------
drivers/base/node.c | 2 +-
include/linux/memory.h | 8 +-
include/linux/mmzone.h | 2 +
include/linux/page-isolation.h | 8 +-
include/linux/pageblock-flags.h | 9 --
mm/Kconfig | 7 --
mm/memory_hotplug.c | 22 ++--
mm/page_alloc.c | 40 ++++---
mm/page_isolation.c | 30 +++---
mm/sparse.c | 55 ++++++++--
16 files changed, 219 insertions(+), 160 deletions(-)

--
2.30.2


2021-05-06 15:30:36

by Zi Yan

Subject: [RFC PATCH 4/7] mm: pageblock: allow set/unset migratetype for partial pageblock

From: Zi Yan <[email protected]>

When we online subsections and the pageblock size is larger than
a subsection, we should be able to change the migratetype for partial
pageblocks. We also drop the assumption that pageblocks are always
handled as a whole.

Signed-off-by: Zi Yan <[email protected]>
---
include/linux/page-isolation.h | 8 ++++++--
mm/page_alloc.c | 27 ++++++++++++++++++---------
mm/page_isolation.c | 26 ++++++++++++++------------
3 files changed, 38 insertions(+), 23 deletions(-)

diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index 572458016331..308b540865b7 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -33,11 +33,15 @@ static inline bool is_migrate_isolate(int migratetype)
#define MEMORY_OFFLINE 0x1
#define REPORT_FAILURE 0x2

-struct page *has_unmovable_pages(struct zone *zone, struct page *page,
- int migratetype, int flags);
+struct page *has_unmovable_pages(struct zone *zone, unsigned long start_pfn,
+ unsigned long end_pfn, int migratetype,
+ int flags);
void set_pageblock_migratetype(struct page *page, int migratetype);
int move_freepages_block(struct zone *zone, struct page *page,
int migratetype, int *num_movable);
+int move_freepages(struct zone *zone,
+ unsigned long start_pfn, unsigned long end_pfn,
+ int migratetype, int *num_movable);

/*
* Changes migrate type in [start_pfn, end_pfn) to be MIGRATE_ISOLATE.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 72bb4a300e03..bc410f45c355 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2433,7 +2433,7 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
* Note that start_page and end_pages are not aligned on a pageblock
* boundary. If alignment is required, use move_freepages_block()
*/
-static int move_freepages(struct zone *zone,
+int move_freepages(struct zone *zone,
unsigned long start_pfn, unsigned long end_pfn,
int migratetype, int *num_movable)
{
@@ -6328,6 +6328,7 @@ void __meminit memmap_init_range(unsigned long size, int nid, unsigned long zone
{
unsigned long pfn, end_pfn = start_pfn + size;
struct page *page;
+ bool set_migratetype = false;

if (highest_memmap_pfn < end_pfn - 1)
highest_memmap_pfn = end_pfn - 1;
@@ -6374,10 +6375,16 @@ void __meminit memmap_init_range(unsigned long size, int nid, unsigned long zone
*/
if (IS_ALIGNED(pfn, pageblock_nr_pages)) {
set_pageblock_migratetype(page, migratetype);
+ set_migratetype = true;
cond_resched();
}
pfn++;
}
+ /* in case the range is smaller than a pageblock */
+ if (!set_migratetype && context == MEMINIT_HOTPLUG) {
+ page = pfn_to_page(start_pfn);
+ set_pageblock_migratetype(page, migratetype);
+ }
}

#ifdef CONFIG_ZONE_DEVICE
@@ -8524,12 +8531,14 @@ void *__init alloc_large_system_hash(const char *tablename,
* cannot get removed (e.g., via memory unplug) concurrently.
*
*/
-struct page *has_unmovable_pages(struct zone *zone, struct page *page,
- int migratetype, int flags)
+struct page *has_unmovable_pages(struct zone *zone, unsigned long start_pfn,
+ unsigned long end_pfn, int migratetype,
+ int flags)
{
unsigned long iter = 0;
- unsigned long pfn = page_to_pfn(page);
- unsigned long offset = pfn % pageblock_nr_pages;
+ unsigned long pfn = start_pfn;
+ struct page *page = pfn_to_page(pfn);
+

if (is_migrate_cma_page(page)) {
/*
@@ -8543,11 +8552,11 @@ struct page *has_unmovable_pages(struct zone *zone, struct page *page,
return page;
}

- for (; iter < pageblock_nr_pages - offset; iter++) {
- if (!pfn_valid_within(pfn + iter))
+ for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+ if (!pfn_valid_within(pfn))
continue;

- page = pfn_to_page(pfn + iter);
+ page = pfn_to_page(pfn);

/*
* Both, bootmem allocations and memory holes are marked
@@ -8596,7 +8605,7 @@ struct page *has_unmovable_pages(struct zone *zone, struct page *page,
*/
if (!page_ref_count(page)) {
if (PageBuddy(page))
- iter += (1 << buddy_order(page)) - 1;
+ pfn += (1 << buddy_order(page)) - 1;
continue;
}

diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index bddf788f45bf..c1b9b8848382 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -15,8 +15,10 @@
#define CREATE_TRACE_POINTS
#include <trace/events/page_isolation.h>

-static int set_migratetype_isolate(struct page *page, int migratetype, int isol_flags)
+static int set_migratetype_isolate(unsigned long start_pfn, unsigned long end_pfn,
+ int migratetype, int isol_flags)
{
+ struct page *page = pfn_to_page(start_pfn);
struct zone *zone = page_zone(page);
struct page *unmovable;
unsigned long flags;
@@ -37,15 +39,14 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
* FIXME: Now, memory hotplug doesn't call shrink_slab() by itself.
* We just check MOVABLE pages.
*/
- unmovable = has_unmovable_pages(zone, page, migratetype, isol_flags);
+ unmovable = has_unmovable_pages(zone, start_pfn, end_pfn, migratetype, isol_flags);
if (!unmovable) {
unsigned long nr_pages;
int mt = get_pageblock_migratetype(page);

set_pageblock_migratetype(page, MIGRATE_ISOLATE);
zone->nr_isolate_pageblock++;
- nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE,
- NULL);
+ nr_pages = move_freepages(zone, start_pfn, end_pfn, MIGRATE_ISOLATE, NULL);

__mod_zone_freepage_state(zone, -nr_pages, mt);
spin_unlock_irqrestore(&zone->lock, flags);
@@ -64,7 +65,8 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
return -EBUSY;
}

-static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
+static void unset_migratetype_isolate(unsigned long start_pfn, unsigned long end_pfn,
+ unsigned migratetype)
{
struct zone *zone;
unsigned long flags, nr_pages;
@@ -72,6 +74,7 @@ static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
unsigned int order;
unsigned long pfn, buddy_pfn;
struct page *buddy;
+ struct page *page = pfn_to_page(start_pfn);

zone = page_zone(page);
spin_lock_irqsave(&zone->lock, flags);
@@ -112,7 +115,7 @@ static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
* allocation.
*/
if (!isolated_page) {
- nr_pages = move_freepages_block(zone, page, migratetype, NULL);
+ nr_pages = move_freepages(zone, start_pfn, end_pfn, migratetype, NULL);
__mod_zone_freepage_state(zone, nr_pages, migratetype);
}
set_pageblock_migratetype(page, migratetype);
@@ -195,7 +198,8 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
pfn += pageblock_nr_pages) {
page = __first_valid_page(pfn, pageblock_nr_pages);
if (page) {
- if (set_migratetype_isolate(page, migratetype, flags)) {
+ if (set_migratetype_isolate(pfn, min(end_pfn, pfn + pageblock_nr_pages),
+ migratetype, flags)) {
undo_pfn = pfn;
goto undo;
}
@@ -209,7 +213,8 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
struct page *page = pfn_to_online_page(pfn);
if (!page)
continue;
- unset_migratetype_isolate(page, migratetype);
+ unset_migratetype_isolate(pfn, min(pfn + pageblock_nr_pages, undo_pfn),
+ migratetype);
}

return -EBUSY;
@@ -224,16 +229,13 @@ void undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
unsigned long pfn;
struct page *page;

- BUG_ON(!IS_ALIGNED(start_pfn, pageblock_nr_pages));
- BUG_ON(!IS_ALIGNED(end_pfn, pageblock_nr_pages));
-
for (pfn = start_pfn;
pfn < end_pfn;
pfn += pageblock_nr_pages) {
page = __first_valid_page(pfn, pageblock_nr_pages);
if (!page || !is_migrate_isolate_page(page))
continue;
- unset_migratetype_isolate(page, migratetype);
+ unset_migratetype_isolate(pfn, min(pfn + pageblock_nr_pages, end_pfn), migratetype);
}
}
/*
--
2.30.2

2021-05-06 15:30:42

by Zi Yan

Subject: [RFC PATCH 6/7] arch: x86: no MAX_ORDER exceeds SECTION_SIZE check for 32bit vdso.

From: Zi Yan <[email protected]>

For x86_64, the 32bit vdso is compiled for compatibility reasons, and the 32bit
SECTION_SIZE_BITS value is used during that compilation. This causes a
compile-time error when MAX_ORDER is increased in the 64bit
environment, even though the 64bit SECTION_SIZE_BITS is large enough. Remove the
check during the 32bit vdso compilation; it is still checked when other
kernel components are compiled.

Signed-off-by: Zi Yan <[email protected]>
---
arch/x86/entry/vdso/Makefile | 1 +
include/linux/mmzone.h | 2 ++
2 files changed, 3 insertions(+)

diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index 05c4abc2fdfd..cad339136ed1 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -156,6 +156,7 @@ KBUILD_CFLAGS_32 += -fno-stack-protector
KBUILD_CFLAGS_32 += $(call cc-option, -foptimize-sibling-calls)
KBUILD_CFLAGS_32 += -fno-omit-frame-pointer
KBUILD_CFLAGS_32 += -DDISABLE_BRANCH_PROFILING
+KBUILD_CFLAGS_32 += -DNO_MAX_ORDER_CHECK

ifdef CONFIG_RETPOLINE
ifneq ($(RETPOLINE_VDSO_CFLAGS),)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 965a0cd5eac1..fb5a0c2ab528 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1211,9 +1211,11 @@ static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
#define SECTION_BLOCKFLAGS_BITS \
((1UL << (PFN_SECTION_SHIFT - pageblock_order)) * NR_PAGEBLOCK_BITS)

+#ifndef NO_MAX_ORDER_CHECK
#if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
#error Allocator MAX_ORDER exceeds SECTION_SIZE
#endif
+#endif

static inline unsigned long pfn_to_section_nr(unsigned long pfn)
{
--
2.30.2

2021-05-06 15:31:06

by Zi Yan

Subject: [RFC PATCH 1/7] mm: sparse: set/clear subsection bitmap when pages are onlined/offlined.

From: Zi Yan <[email protected]>

The subsection bitmap was set/cleared when a section was added/removed, but
pfn_to_online_page() uses the subsection bitmap to check whether a page is
online, which is not accurate. This worked as long as whole sections were
added/removed during memory hotplug and hotremove. When the following
patches enable memory hotplug and hotremove for subsections, the
subsection bitmap needs to be updated at page online/offline time;
otherwise, pfn_to_online_page() will not give the right answer. Move the
subsection bitmap manipulation code from section_activate() to
online_mem_sections() and from section_deactivate() to
offline_mem_sections(), respectively.

Signed-off-by: Zi Yan <[email protected]>
---
mm/sparse.c | 36 +++++++++++++++++++++++++++++++++---
1 file changed, 33 insertions(+), 3 deletions(-)

diff --git a/mm/sparse.c b/mm/sparse.c
index b2ada9dc00cb..7637208b8874 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -606,6 +606,7 @@ void __init sparse_init(void)

#ifdef CONFIG_MEMORY_HOTPLUG

+static int fill_subsection_map(unsigned long pfn, unsigned long nr_pages);
/* Mark all memory sections within the pfn range as online */
void online_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
{
@@ -621,9 +622,12 @@ void online_mem_sections(unsigned long start_pfn, unsigned long end_pfn)

ms = __nr_to_section(section_nr);
ms->section_mem_map |= SECTION_IS_ONLINE;
+ fill_subsection_map(pfn, min(end_pfn, pfn + PAGES_PER_SECTION) - pfn);
}
}

+static int clear_subsection_map(unsigned long pfn, unsigned long nr_pages);
+static bool is_subsection_map_empty(struct mem_section *ms);
/* Mark all memory sections within the pfn range as offline */
void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
{
@@ -641,7 +645,13 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
continue;

ms = __nr_to_section(section_nr);
- ms->section_mem_map &= ~SECTION_IS_ONLINE;
+
+ if (end_pfn < pfn + PAGES_PER_SECTION) {
+ clear_subsection_map(pfn, end_pfn - pfn);
+ if (is_subsection_map_empty(ms))
+ ms->section_mem_map &= ~SECTION_IS_ONLINE;
+ } else
+ ms->section_mem_map &= ~SECTION_IS_ONLINE;
}
}

@@ -668,6 +678,17 @@ static void free_map_bootmem(struct page *memmap)
vmemmap_free(start, end, NULL);
}

+static int subsection_map_intersects(struct mem_section *ms, unsigned long pfn,
+ unsigned long nr_pages)
+{
+ DECLARE_BITMAP(map, SUBSECTIONS_PER_SECTION) = { 0 };
+ unsigned long *subsection_map = &ms->usage->subsection_map[0];
+
+ subsection_mask_set(map, pfn, nr_pages);
+
+ return bitmap_intersects(map, subsection_map, SUBSECTIONS_PER_SECTION);
+}
+
static int clear_subsection_map(unsigned long pfn, unsigned long nr_pages)
{
DECLARE_BITMAP(map, SUBSECTIONS_PER_SECTION) = { 0 };
@@ -760,6 +781,12 @@ static void free_map_bootmem(struct page *memmap)
}
}

+static int subsection_map_intersects(struct mem_section *ms, unsigned long pfn,
+ unsigned long nr_pages)
+{
+ return 0;
+}
+
static int clear_subsection_map(unsigned long pfn, unsigned long nr_pages)
{
return 0;
@@ -800,7 +827,10 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
struct page *memmap = NULL;
bool empty;

- if (clear_subsection_map(pfn, nr_pages))
+ if (WARN((IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP) && !ms->usage) ||
+ subsection_map_intersects(ms, pfn, nr_pages),
+ "section already deactivated (%#lx + %ld)\n",
+ pfn, nr_pages))
return;

empty = is_subsection_map_empty(ms);
@@ -855,7 +885,7 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
ms->usage = usage;
}

- rc = fill_subsection_map(pfn, nr_pages);
+ rc = !nr_pages || subsection_map_intersects(ms, pfn, nr_pages);
if (rc) {
if (usage)
ms->usage = NULL;
--
2.30.2

2021-05-06 15:31:48

by Zi Yan

Subject: [RFC PATCH 5/7] mm: memory_hotplug, sparse: enable memory hotplug/hotremove subsections

From: Zi Yan <[email protected]>

Remove the section size alignment checks for memory hotplug/hotremove,
so that we can online/offline subsection memory.

Signed-off-by: Zi Yan <[email protected]>
---
mm/memory_hotplug.c | 16 +++++++++-------
mm/page_isolation.c | 4 ----
mm/sparse.c | 17 ++++++++++++++---
3 files changed, 23 insertions(+), 14 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 6e93b0ecc5cb..5384bb62ac10 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -661,12 +661,15 @@ static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
* When using memmap_on_memory, the range might not be aligned to
* MAX_ORDER_NR_PAGES - 1, but pageblock aligned. __ffs() will detect
* this and the first chunk to online will be pageblock_nr_pages.
+ * When onlining subsections, the range might be smaller than MAX_ORDER
+ * - 1, use __ffs(nr_pages) to get the right size.
*/
for (pfn = start_pfn; pfn < end_pfn;) {
- int order = min(MAX_ORDER - 1UL, __ffs(pfn));
+ int order = min3(MAX_ORDER - 1UL, __ffs(pfn), __ffs(nr_pages));

(*online_page_callback)(pfn_to_page(pfn), order);
pfn += (1UL << order);
+ nr_pages -= (1UL << order);
}

/* mark all involved sections as online */
@@ -912,16 +915,16 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, struct zone *z
struct memory_notify arg;

/*
- * {on,off}lining is constrained to full memory sections (or more
+ * {on,off}lining is constrained to memory subsections (or more
* precisly to memory blocks from the user space POV).
* memmap_on_memory is an exception because it reserves initial part
* of the physical memory space for vmemmaps. That space is pageblock
* aligned.
*/
if (WARN_ON_ONCE(!nr_pages ||
- !IS_ALIGNED(pfn, pageblock_nr_pages) ||
- !IS_ALIGNED(pfn + nr_pages, PAGES_PER_SECTION)))
+ !IS_ALIGNED(pfn + nr_pages, PAGES_PER_SUBSECTION))) {
return -EINVAL;
+ }

mem_hotplug_begin();

@@ -1702,15 +1705,14 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
char *reason;

/*
- * {on,off}lining is constrained to full memory sections (or more
+ * {on,off}lining is constrained to memory subsections (or more
* precisly to memory blocks from the user space POV).
* memmap_on_memory is an exception because it reserves initial part
* of the physical memory space for vmemmaps. That space is pageblock
* aligned.
*/
if (WARN_ON_ONCE(!nr_pages ||
- !IS_ALIGNED(start_pfn, pageblock_nr_pages) ||
- !IS_ALIGNED(start_pfn + nr_pages, PAGES_PER_SECTION)))
+ !IS_ALIGNED(start_pfn + nr_pages, PAGES_PER_SUBSECTION)))
return -EINVAL;

mem_hotplug_begin();
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index c1b9b8848382..7f1791faf03f 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -147,7 +147,6 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
* be MIGRATE_ISOLATE.
* @start_pfn: The lower PFN of the range to be isolated.
* @end_pfn: The upper PFN of the range to be isolated.
- * start_pfn/end_pfn must be aligned to pageblock_order.
* @migratetype: Migrate type to set in error recovery.
* @flags: The following flags are allowed (they can be combined in
* a bit mask)
@@ -190,9 +189,6 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
unsigned long undo_pfn;
struct page *page;

- BUG_ON(!IS_ALIGNED(start_pfn, pageblock_nr_pages));
- BUG_ON(!IS_ALIGNED(end_pfn, pageblock_nr_pages));
-
for (pfn = start_pfn;
pfn < end_pfn;
pfn += pageblock_nr_pages) {
diff --git a/mm/sparse.c b/mm/sparse.c
index 1c2957807882..09b5e6978ab0 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -251,7 +251,8 @@ void __init subsection_map_init(unsigned long pfn, unsigned long nr_pages)
/* Record a memory area against a node. */
static void __init memory_present(int nid, unsigned long start, unsigned long end)
{
- unsigned long pfn;
+ unsigned long pfn, nr_pages;
+ unsigned long section, end_sec, start_sec;

#ifdef CONFIG_SPARSEMEM_EXTREME
if (unlikely(!mem_section)) {
@@ -268,9 +269,17 @@ static void __init memory_present(int nid, unsigned long start, unsigned long en

start &= PAGE_SECTION_MASK;
mminit_validate_memmodel_limits(&start, &end);
- for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
- unsigned long section = pfn_to_section_nr(pfn);
+ start_sec = pfn_to_section_nr(start);
+ end_sec = pfn_to_section_nr(end - 1);
+ pfn = start;
+ nr_pages = end - start;
+
+ for (section = start_sec; section <= end_sec; section++) {
struct mem_section *ms;
+ unsigned long pfns;
+
+ pfns = min(nr_pages, PAGES_PER_SECTION
+ - (pfn & ~PAGE_SECTION_MASK));

sparse_index_init(section, nid);
set_section_nid(section, nid);
@@ -281,6 +290,8 @@ static void __init memory_present(int nid, unsigned long start, unsigned long en
SECTION_IS_ONLINE;
section_mark_present(ms);
}
+ pfn += pfns;
+ nr_pages -= pfns;
}
}

--
2.30.2

2021-05-06 15:32:06

by Zi Yan

Subject: [RFC PATCH 7/7] [not for merge] mm: increase SECTION_SIZE_BITS to 31

From: Zi Yan <[email protected]>

This is only used to test onlining/offlining subsection memory on
an x86_64 system, by increasing the section size to 2GB and the pageblock size
to 1GB when MAX_ORDER is set to 20.

Signed-off-by: Zi Yan <[email protected]>
---
arch/x86/Kconfig | 15 +++++++++++++++
arch/x86/include/asm/sparsemem.h | 2 +-
2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0045e1b44190..d8faf59fa5ff 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1654,6 +1654,21 @@ config X86_PMEM_LEGACY

Say Y if unsure.

+config FORCE_MAX_ZONEORDER
+ int "Maximum zone order"
+ range 11 20
+ default "20"
+ help
+ The kernel memory allocator divides physically contiguous memory
+ blocks into "zones", where each zone is a power of two number of
+ pages. This option selects the largest power of two that the kernel
+ keeps in the memory allocator. If you need to allocate very large
+ blocks of physically contiguous memory, then you may need to
+ increase this value.
+
+ This config option is actually maximum order plus one. For example,
+ a value of 11 means that the largest free memory block is 2^10 pages.
+
config HIGHPTE
bool "Allocate 3rd-level pagetables from highmem"
depends on HIGHMEM
diff --git a/arch/x86/include/asm/sparsemem.h b/arch/x86/include/asm/sparsemem.h
index 6a9ccc1b2be5..c5a9d498a7e7 100644
--- a/arch/x86/include/asm/sparsemem.h
+++ b/arch/x86/include/asm/sparsemem.h
@@ -23,7 +23,7 @@
# define MAX_PHYSMEM_BITS 32
# endif
#else /* CONFIG_X86_32 */
-# define SECTION_SIZE_BITS 27 /* matt - 128 is convenient right now */
+# define SECTION_SIZE_BITS 31 /* matt - 128 is convenient right now */
# define MAX_PHYSMEM_BITS (pgtable_l5_enabled() ? 52 : 46)
#endif

--
2.30.2

2021-05-06 15:34:51

by David Hildenbrand

Subject: Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

On 06.05.21 17:26, Zi Yan wrote:
> From: Zi Yan <[email protected]>
>
> Hi all,
>
> This patchset tries to remove the restriction on memory hotplug/hotremove
> granularity, which is always greater or equal to memory section size[1].
> With the patchset, kernel is able to online/offline memory at a size independent
> of memory section size, as small as 2MB (the subsection size).

... which doesn't make any sense as we can only online/offline whole
memory block devices.

>
> The motivation is to increase MAX_ORDER of the buddy allocator and pageblock
> size without increasing memory hotplug/hotremove granularity at the same time,

Gah, no. Please no. No.

> so that the kernel can allocator 1GB pages using buddy allocator and utilizes
> existing pageblock based anti-fragmentation, paving the road for 1GB THP
> support[2].

Not like this, please no.

>
> The patchset utilizes the existing subsection support[3] and changes the
> section size alignment checks to subsection size alignment checks. There are
> also changes to pageblock code to support partial pageblocks, when pageblock
> size is increased along with MAX_ORDER. Increasing pageblock size can enable
> kernel to utilize existing anti-fragmentation mechanism for gigantic page
> allocations.

Please not like this.

>
> The last patch increases SECTION_SIZE_BITS to demonstrate the use of memory
> hotplug/hotremove subsection, but is not intended to be merged as is. It is
> there in case one wants to try this out and will be removed during the final
> submission.
>
> Feel free to give suggestions and comments. I am looking forward to your
> feedback.

Please not like this.

--
Thanks,

David / dhildenb

2021-05-06 15:43:58

by Zi Yan

Subject: Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

On 6 May 2021, at 11:26, Zi Yan wrote:

> From: Zi Yan <[email protected]>
>
> Hi all,
>
> This patchset tries to remove the restriction on memory hotplug/hotremove
> granularity, which is always greater or equal to memory section size[1].
> With the patchset, kernel is able to online/offline memory at a size independent
> of memory section size, as small as 2MB (the subsection size).
>
> The motivation is to increase MAX_ORDER of the buddy allocator and pageblock
> size without increasing memory hotplug/hotremove granularity at the same time,
> so that the kernel can allocator 1GB pages using buddy allocator and utilizes
> existing pageblock based anti-fragmentation, paving the road for 1GB THP
> support[2].
>
> The patchset utilizes the existing subsection support[3] and changes the
> section size alignment checks to subsection size alignment checks. There are
> also changes to pageblock code to support partial pageblocks, when pageblock
> size is increased along with MAX_ORDER. Increasing pageblock size can enable
> kernel to utilize existing anti-fragmentation mechanism for gigantic page
> allocations.
>
> The last patch increases SECTION_SIZE_BITS to demonstrate the use of memory
> hotplug/hotremove subsection, but is not intended to be merged as is. It is
> there in case one wants to try this out and will be removed during the final
> submission.
>
> Feel free to give suggestions and comments. I am looking forward to your
> feedback.
>
> Thanks.

Added the missing references.

[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/[email protected]/
[3] https://patchwork.kernel.org/project/linux-nvdimm/cover/156092349300.979959.17603710711957735135.stgit@dwillia2-desk3.amr.corp.intel.com/

>
> Zi Yan (7):
> mm: sparse: set/clear subsection bitmap when pages are
> onlined/offlined.
> mm: set pageblock_order to the max of HUGETLB_PAGE_ORDER and
> MAX_ORDER-1
> mm: memory_hotplug: decouple memory_block size with section size.
> mm: pageblock: allow set/unset migratetype for partial pageblock
> mm: memory_hotplug, sparse: enable memory hotplug/hotremove
> subsections
> arch: x86: no MAX_ORDER exceeds SECTION_SIZE check for 32bit vdso.
> [not for merge] mm: increase SECTION_SIZE_BITS to 31
>
> arch/ia64/Kconfig | 1 -
> arch/powerpc/Kconfig | 1 -
> arch/x86/Kconfig | 15 +++
> arch/x86/entry/vdso/Makefile | 1 +
> arch/x86/include/asm/sparsemem.h | 2 +-
> drivers/base/memory.c | 176 +++++++++++++++----------------
> drivers/base/node.c | 2 +-
> include/linux/memory.h | 8 +-
> include/linux/mmzone.h | 2 +
> include/linux/page-isolation.h | 8 +-
> include/linux/pageblock-flags.h | 9 --
> mm/Kconfig | 7 --
> mm/memory_hotplug.c | 22 ++--
> mm/page_alloc.c | 40 ++++---
> mm/page_isolation.c | 30 +++---
> mm/sparse.c | 55 ++++++++--
> 16 files changed, 219 insertions(+), 160 deletions(-)
>
> --
> 2.30.2



Best Regards,
Yan Zi



2021-05-06 15:52:27

by Zi Yan

Subject: Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

On 6 May 2021, at 11:40, David Hildenbrand wrote:

>>>> The last patch increases SECTION_SIZE_BITS to demonstrate the use of memory
>>>> hotplug/hotremove subsection, but is not intended to be merged as is. It is
>>>> there in case one wants to try this out and will be removed during the final
>>>> submission.
>>>>
>>>> Feel free to give suggestions and comments. I am looking forward to your
>>>> feedback.
>>>
>>> Please not like this.
>>
>> Do you mind sharing more useful feedback instead of just saying a lot of No?
>
> I remember reasoning about this already in another thread, no? Either you're ignoring my previous feedback or my mind is messing with me.

I definitely remember all your suggestions:

1. do not use CMA allocation for 1GB THP.
2. section size defines the minimum size in which we can add_memory(), so we cannot increase it.

I am trying an alternative here. I am not using CMA allocation, and I am not increasing the minimum size of add_memory(); instead, I decouple the memory block size from the section size, so that add_memory() can add a memory block smaller than a section (as small as 2MB, the subsection size). In this way, the section size can be increased freely. I do not see a strong tie between add_memory() and section size, especially since we have subsection bitmap support.
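
Concretely, patch 3 describes a memory block by a (start_pfn, nr_pages) pair
instead of a section number, roughly along these lines (a simplified excerpt,
not the full patch):

    struct memory_block {
            unsigned long start_pfn;  /* was: start_section_nr */
            unsigned long nr_pages;   /* a block may now be smaller than a section */
            ...                       /* remaining fields unchanged */
    };

    /* block id is derived from the physical address, not the section number */
    static inline unsigned long phys_to_block_id(unsigned long phys)
    {
            return phys / memory_block_size_bytes();
    }

so the minimum add_memory() granularity is bounded by memory_block_size_bytes()
rather than by SECTION_SIZE_BITS.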



Best Regards,
Yan Zi



2021-05-06 17:50:10

by David Hildenbrand

Subject: Re: [RFC PATCH 1/7] mm: sparse: set/clear subsection bitmap when pages are onlined/offlined.

On 06.05.21 17:26, Zi Yan wrote:
> From: Zi Yan <[email protected]>
>
> subsection bitmap was set/cleared when a section is added/removed, but
> pfn_to_online_page() uses subsection bitmap to check if the page is
> online, which is not accurate. It was working when a whole section is
> added/removed during memory hotplug and hotremove. When the following
> patches enable memory hotplug and hotremove for subsections,
> subsection bitmap needs to be changed during page online/offline time,
> otherwise, pfn_to_online_page() will not give right answers. Move the
> subsection bitmap manipulation code from section_activate() to
> online_mem_sections() and section_deactivate() to
> offline_mem_sections(), respectively.
>
> Signed-off-by: Zi Yan <[email protected]>
> ---
> mm/sparse.c | 36 +++++++++++++++++++++++++++++++++---
> 1 file changed, 33 insertions(+), 3 deletions(-)
>
> diff --git a/mm/sparse.c b/mm/sparse.c
> index b2ada9dc00cb..7637208b8874 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -606,6 +606,7 @@ void __init sparse_init(void)
>
> #ifdef CONFIG_MEMORY_HOTPLUG
>
> +static int fill_subsection_map(unsigned long pfn, unsigned long nr_pages);
> /* Mark all memory sections within the pfn range as online */
> void online_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
> {
> @@ -621,9 +622,12 @@ void online_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
>
> ms = __nr_to_section(section_nr);
> ms->section_mem_map |= SECTION_IS_ONLINE;
> + fill_subsection_map(pfn, min(end_pfn, pfn + PAGES_PER_SECTION) - pfn);
> }
> }
>
> +static int clear_subsection_map(unsigned long pfn, unsigned long nr_pages);
> +static bool is_subsection_map_empty(struct mem_section *ms);
> /* Mark all memory sections within the pfn range as offline */
> void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
> {
> @@ -641,7 +645,13 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
> continue;
>
> ms = __nr_to_section(section_nr);
> - ms->section_mem_map &= ~SECTION_IS_ONLINE;
> +
> + if (end_pfn < pfn + PAGES_PER_SECTION) {
> + clear_subsection_map(pfn, end_pfn - pfn);
> + if (is_subsection_map_empty(ms))
> + ms->section_mem_map &= ~SECTION_IS_ONLINE;
> + } else
> + ms->section_mem_map &= ~SECTION_IS_ONLINE;
> }
> }
>
> @@ -668,6 +678,17 @@ static void free_map_bootmem(struct page *memmap)
> vmemmap_free(start, end, NULL);
> }
>
> +static int subsection_map_intersects(struct mem_section *ms, unsigned long pfn,
> + unsigned long nr_pages)
> +{
> + DECLARE_BITMAP(map, SUBSECTIONS_PER_SECTION) = { 0 };
> + unsigned long *subsection_map = &ms->usage->subsection_map[0];
> +
> + subsection_mask_set(map, pfn, nr_pages);
> +
> + return bitmap_intersects(map, subsection_map, SUBSECTIONS_PER_SECTION);
> +}
> +
> static int clear_subsection_map(unsigned long pfn, unsigned long nr_pages)
> {
> DECLARE_BITMAP(map, SUBSECTIONS_PER_SECTION) = { 0 };
> @@ -760,6 +781,12 @@ static void free_map_bootmem(struct page *memmap)
> }
> }
>
> +static int subsection_map_intersects(struct mem_section *ms, unsigned long pfn,
> + unsigned long nr_pages)
> +{
> + return 0;
> +}
> +
> static int clear_subsection_map(unsigned long pfn, unsigned long nr_pages)
> {
> return 0;
> @@ -800,7 +827,10 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
> struct page *memmap = NULL;
> bool empty;
>
> - if (clear_subsection_map(pfn, nr_pages))
> + if (WARN((IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP) && !ms->usage) ||
> + subsection_map_intersects(ms, pfn, nr_pages),
> + "section already deactivated (%#lx + %ld)\n",
> + pfn, nr_pages))
> return;
>
> empty = is_subsection_map_empty(ms);
> @@ -855,7 +885,7 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
> ms->usage = usage;
> }
>
> - rc = fill_subsection_map(pfn, nr_pages);
> + rc = !nr_pages || subsection_map_intersects(ms, pfn, nr_pages);
> if (rc) {
> if (usage)
> ms->usage = NULL;
>

If I am not missing something, this is completely broken for
devmem/ZONE_DEVICE, which never onlines pages. But also when memory blocks
are never onlined, this would be just wrong. The least you would need
is a sub-section online map.

But glimpsing at patch #2, I'd rather stop right away digging deeper
into this series :)

I think what would really help is drafting a design of how it all could
look and then first discussing the high-level design, investigating
how it could play along with all existing users, existing workloads, and
existing use cases. Proposing such changes without a clear picture in
mind and a high-level overview might give you some unpleasant reactions
from some of the developers around here ;)

--
Thanks,

David / dhildenb

2021-05-06 19:19:57

by Zi Yan

Subject: [RFC PATCH 3/7] mm: memory_hotplug: decouple memory_block size with section size.

From: Zi Yan <[email protected]>

To enable subsection memory online/offline, we need to remove the
assumption that the memory_block size is greater than or equal to the section size.

The following changes are made:
1. use a (start_pfn, nr_pages) pair to specify the memory_block size instead of
start_section_nr.
2. calculate the memory_block id using phys / memory_block_size_bytes()
instead of the section number.

The minimum memory_block size is instead set to the smaller of 128MB (the old
x86_64 section size) and the section size.

Signed-off-by: Zi Yan <[email protected]>
---
drivers/base/memory.c | 176 ++++++++++++++++++++---------------------
drivers/base/node.c | 2 +-
include/linux/memory.h | 8 +-
mm/memory_hotplug.c | 6 +-
4 files changed, 98 insertions(+), 94 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index b31b3af5c490..141431eb64a4 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -50,19 +50,15 @@ int mhp_online_type_from_str(const char *str)

static int sections_per_block;

-static inline unsigned long memory_block_id(unsigned long section_nr)
+static inline unsigned long phys_to_block_id(unsigned long phys)
{
- return section_nr / sections_per_block;
+ return phys / memory_block_size_bytes();
}

static inline unsigned long pfn_to_block_id(unsigned long pfn)
{
- return memory_block_id(pfn_to_section_nr(pfn));
-}
-
-static inline unsigned long phys_to_block_id(unsigned long phys)
-{
- return pfn_to_block_id(PFN_DOWN(phys));
+ /* calculate using memory_block_size_bytes() */
+ return phys_to_block_id(PFN_PHYS(pfn));
}

static int memory_subsys_online(struct device *dev);
@@ -118,7 +114,7 @@ static ssize_t phys_index_show(struct device *dev,
struct memory_block *mem = to_memory_block(dev);
unsigned long phys_index;

- phys_index = mem->start_section_nr / sections_per_block;
+ phys_index = pfn_to_section_nr(mem->start_pfn);

return sysfs_emit(buf, "%08lx\n", phys_index);
}
@@ -171,8 +167,8 @@ int memory_notify(unsigned long val, void *v)

static int memory_block_online(struct memory_block *mem)
{
- unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
- unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
+ unsigned long start_pfn = mem->start_pfn;
+ unsigned long nr_pages = mem->nr_pages;
unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
struct zone *zone;
int ret;
@@ -212,8 +208,8 @@ static int memory_block_online(struct memory_block *mem)

static int memory_block_offline(struct memory_block *mem)
{
- unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
- unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
+ unsigned long start_pfn = mem->start_pfn;
+ unsigned long nr_pages = mem->nr_pages;
unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
struct zone *zone;
int ret;
@@ -260,7 +256,7 @@ memory_block_action(struct memory_block *mem, unsigned long action)
break;
default:
WARN(1, KERN_WARNING "%s(%ld, %ld) unknown action: "
- "%ld\n", __func__, mem->start_section_nr, action, action);
+ "%ld\n", __func__, mem->start_pfn, mem->nr_pages, action);
ret = -EINVAL;
}

@@ -366,7 +362,7 @@ static ssize_t phys_device_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
struct memory_block *mem = to_memory_block(dev);
- unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
+ unsigned long start_pfn = mem->start_pfn;

return sysfs_emit(buf, "%d\n",
arch_get_memory_phys_device(start_pfn));
@@ -390,8 +386,8 @@ static ssize_t valid_zones_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
struct memory_block *mem = to_memory_block(dev);
- unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
- unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
+ unsigned long start_pfn = mem->start_pfn;
+ unsigned long nr_pages = mem->nr_pages;
struct zone *default_zone;
int len = 0;
int nid;
@@ -575,16 +571,6 @@ static struct memory_block *find_memory_block_by_id(unsigned long block_id)
return mem;
}

-/*
- * Called under device_hotplug_lock.
- */
-struct memory_block *find_memory_block(struct mem_section *section)
-{
- unsigned long block_id = memory_block_id(__section_nr(section));
-
- return find_memory_block_by_id(block_id);
-}
-
static struct attribute *memory_memblk_attrs[] = {
&dev_attr_phys_index.attr,
&dev_attr_state.attr,
@@ -614,7 +600,7 @@ int register_memory(struct memory_block *memory)
int ret;

memory->dev.bus = &memory_subsys;
- memory->dev.id = memory->start_section_nr / sections_per_block;
+ memory->dev.id = memory->start_pfn / (memory_block_size_bytes() >> PAGE_SHIFT);
memory->dev.release = memory_block_release;
memory->dev.groups = memory_memblk_attr_groups;
memory->dev.offline = memory->state == MEM_OFFLINE;
@@ -633,57 +619,89 @@ int register_memory(struct memory_block *memory)
return ret;
}

-static int init_memory_block(unsigned long block_id, unsigned long state,
+static void unregister_memory(struct memory_block *memory)
+{
+ if (WARN_ON_ONCE(memory->dev.bus != &memory_subsys))
+ return;
+
+ WARN_ON(xa_erase(&memory_blocks, memory->dev.id) == NULL);
+
+ /* drop the ref. we got via find_memory_block() */
+ put_device(&memory->dev);
+ device_unregister(&memory->dev);
+}
+
+static int init_memory_blocks(unsigned long start_pfn, unsigned long num_pages, unsigned long state,
unsigned long nr_vmemmap_pages)
{
struct memory_block *mem;
int ret = 0;
+ unsigned long block_nr_pages = memory_block_size_bytes() / PAGE_SIZE;
+ unsigned long block_start_pfn;

- mem = find_memory_block_by_id(block_id);
- if (mem) {
- put_device(&mem->dev);
- return -EEXIST;
- }
- mem = kzalloc(sizeof(*mem), GFP_KERNEL);
- if (!mem)
- return -ENOMEM;
-
- mem->start_section_nr = block_id * sections_per_block;
- mem->state = state;
- mem->nid = NUMA_NO_NODE;
- mem->nr_vmemmap_pages = nr_vmemmap_pages;
+ for (block_start_pfn = start_pfn; num_pages != 0; block_start_pfn += block_nr_pages) {
+ unsigned long block_id = pfn_to_block_id(block_start_pfn);

- ret = register_memory(mem);
-
- return ret;
+ mem = find_memory_block_by_id(block_id);
+ if (mem) {
+ put_device(&mem->dev);
+ return -EEXIST;
+ }
+ mem = kzalloc(sizeof(*mem), GFP_KERNEL);
+ if (!mem)
+ return -ENOMEM;
+
+ mem->start_pfn = block_start_pfn;
+ mem->nr_pages = min(num_pages, block_nr_pages);
+ mem->state = state;
+ mem->nid = NUMA_NO_NODE;
+ mem->nr_vmemmap_pages = nr_vmemmap_pages;
+
+ ret = register_memory(mem);
+
+ if (ret) {
+ unsigned long unregister_block_pfn;
+
+ for (unregister_block_pfn = start_pfn;
+ unregister_block_pfn < block_start_pfn;
+ unregister_block_pfn -= block_nr_pages) {
+ block_id = pfn_to_block_id(unregister_block_pfn);
+ mem = find_memory_block_by_id(block_id);
+ if (WARN_ON_ONCE(!mem))
+ continue;
+ unregister_memory(mem);
+ }
+ return -EINVAL;
+ }
+ if (num_pages > block_nr_pages)
+ num_pages -= block_nr_pages;
+ else
+ num_pages = 0;
+ }
+ return 0;
}

-static int add_memory_block(unsigned long base_section_nr)
+static void add_whole_section_memory_block(unsigned long base_section_nr)
{
- int section_count = 0;
- unsigned long nr;
+ int ret;
+ unsigned long start_pfn = section_nr_to_pfn(base_section_nr);
+ unsigned long nr_pages = 0;
+ struct mem_section *ms = __nr_to_section(base_section_nr);

- for (nr = base_section_nr; nr < base_section_nr + sections_per_block;
- nr++)
- if (present_section_nr(nr))
- section_count++;
+ if (bitmap_full(ms->usage->subsection_map, SUBSECTIONS_PER_SECTION))
+ nr_pages = PAGES_PER_SECTION;
+ else
+ nr_pages = PAGES_PER_SUBSECTION *
+ bitmap_weight(ms->usage->subsection_map, SUBSECTIONS_PER_SECTION);

- if (section_count == 0)
- return 0;
- return init_memory_block(memory_block_id(base_section_nr),
- MEM_ONLINE, 0);
-}

-static void unregister_memory(struct memory_block *memory)
-{
- if (WARN_ON_ONCE(memory->dev.bus != &memory_subsys))
+ if (!nr_pages)
return;

- WARN_ON(xa_erase(&memory_blocks, memory->dev.id) == NULL);
-
- /* drop the ref. we got via find_memory_block() */
- put_device(&memory->dev);
- device_unregister(&memory->dev);
+ ret = init_memory_blocks(start_pfn, nr_pages, MEM_ONLINE, 0);
+ if (ret)
+ panic("%s() failed to add memory block: %d\n", __func__,
+ ret);
}

/*
@@ -696,31 +714,16 @@ static void unregister_memory(struct memory_block *memory)
int create_memory_block_devices(unsigned long start, unsigned long size,
unsigned long vmemmap_pages)
{
- const unsigned long start_block_id = pfn_to_block_id(PFN_DOWN(start));
- unsigned long end_block_id = pfn_to_block_id(PFN_DOWN(start + size));
- struct memory_block *mem;
- unsigned long block_id;
+ unsigned long start_pfn = PFN_DOWN(start);
+ unsigned long end_pfn = PFN_DOWN(start + size);
int ret = 0;

if (WARN_ON_ONCE(!IS_ALIGNED(start, memory_block_size_bytes()) ||
!IS_ALIGNED(size, memory_block_size_bytes())))
return -EINVAL;

- for (block_id = start_block_id; block_id != end_block_id; block_id++) {
- ret = init_memory_block(block_id, MEM_OFFLINE, vmemmap_pages);
- if (ret)
- break;
- }
- if (ret) {
- end_block_id = block_id;
- for (block_id = start_block_id; block_id != end_block_id;
- block_id++) {
- mem = find_memory_block_by_id(block_id);
- if (WARN_ON_ONCE(!mem))
- continue;
- unregister_memory(mem);
- }
- }
+ ret = init_memory_blocks(start_pfn, end_pfn - start_pfn, MEM_OFFLINE, vmemmap_pages);
+
return ret;
}

@@ -807,10 +810,7 @@ void __init memory_dev_init(void)
*/
for (nr = 0; nr <= __highest_present_section_nr;
nr += sections_per_block) {
- ret = add_memory_block(nr);
- if (ret)
- panic("%s() failed to add memory block: %d\n", __func__,
- ret);
+ add_whole_section_memory_block(nr);
}
}

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 2c36f61d30bc..76d67b8ddf1b 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -809,7 +809,7 @@ static int register_mem_block_under_node_early(struct memory_block *mem_blk,
void *arg)
{
unsigned long memory_block_pfns = memory_block_size_bytes() / PAGE_SIZE;
- unsigned long start_pfn = section_nr_to_pfn(mem_blk->start_section_nr);
+ unsigned long start_pfn = mem_blk->start_pfn;
unsigned long end_pfn = start_pfn + memory_block_pfns - 1;
int nid = *(int *)arg;
unsigned long pfn;
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 97e92e8b556a..e9590c7c6a9e 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -21,10 +21,15 @@
#include <linux/mutex.h>
#include <linux/notifier.h>

+#if SECTION_SIZE_BITS > 27 /* 128MB */
+#define MIN_MEMORY_BLOCK_SIZE (1UL << 27)
+#else
#define MIN_MEMORY_BLOCK_SIZE (1UL << SECTION_SIZE_BITS)
+#endif

struct memory_block {
- unsigned long start_section_nr;
+ unsigned long start_pfn;
+ unsigned long nr_pages;
unsigned long state; /* serialized by the dev->lock */
int online_type; /* for passing data to online routine */
int nid; /* NID for this memory block */
@@ -90,7 +95,6 @@ int create_memory_block_devices(unsigned long start, unsigned long size,
void remove_memory_block_devices(unsigned long start, unsigned long size);
extern void memory_dev_init(void);
extern int memory_notify(unsigned long val, void *v);
-extern struct memory_block *find_memory_block(struct mem_section *);
typedef int (*walk_memory_blocks_func_t)(struct memory_block *, void *);
extern int walk_memory_blocks(unsigned long start, unsigned long size,
void *arg, walk_memory_blocks_func_t func);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 70620d0dd923..6e93b0ecc5cb 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1872,8 +1872,8 @@ static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
if (unlikely(ret)) {
phys_addr_t beginpa, endpa;

- beginpa = PFN_PHYS(section_nr_to_pfn(mem->start_section_nr));
- endpa = beginpa + memory_block_size_bytes() - 1;
+ beginpa = PFN_PHYS(mem->start_pfn);
+ endpa = beginpa + mem->nr_pages * PAGE_SIZE - 1;
pr_warn("removing memory fails, because memory [%pa-%pa] is onlined\n",
&beginpa, &endpa);

@@ -2079,7 +2079,7 @@ static int try_offline_memory_block(struct memory_block *mem, void *arg)
* with multiple zones within one memory block will be rejected
* by offlining code ... so we don't care about that.
*/
- page = pfn_to_online_page(section_nr_to_pfn(mem->start_section_nr));
+ page = pfn_to_online_page(mem->start_pfn);
if (page && zone_idx(page_zone(page)) == ZONE_MOVABLE)
online_type = MMOP_ONLINE_MOVABLE;

--
2.30.2

2021-05-06 19:20:39

by Zi Yan

Subject: [RFC PATCH 2/7] mm: set pageblock_order to the max of HUGETLB_PAGE_ORDER and MAX_ORDER-1

From: Zi Yan <[email protected]>

As MAX_ORDER can be increased beyond the hugetlb page orders,
pageblock_order would otherwise remain capped at HUGETLB_PAGE_ORDER. That is
not ideal for anti-fragmentation when allocating pages from the buddy allocator
with MAX_ORDER > HUGETLB_PAGE_ORDER. Make pageblock_order a variable and
set it to the max of HUGETLB_PAGE_ORDER and MAX_ORDER - 1.

Remove HUGETLB_PAGE_SIZE_VARIABLE as HUGETLB_PAGE does the same thing
now.

Export pageblock_order since virtio-mem module needs it.

Signed-off-by: Zi Yan <[email protected]>
---
arch/ia64/Kconfig | 1 -
arch/powerpc/Kconfig | 1 -
include/linux/pageblock-flags.h | 9 ---------
mm/Kconfig | 7 -------
mm/page_alloc.c | 13 +++++++------
mm/sparse.c | 2 +-
6 files changed, 8 insertions(+), 25 deletions(-)

diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index 279252e3e0f7..c6a150cde668 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -34,7 +34,6 @@ config IA64
select TTY
select HAVE_ARCH_TRACEHOOK
select HAVE_VIRT_CPU_ACCOUNTING
- select HUGETLB_PAGE_SIZE_VARIABLE if HUGETLB_PAGE
select VIRT_TO_BUS
select GENERIC_IRQ_PROBE
select GENERIC_PENDING_IRQ if SMP
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 8926461d25b2..9f27c131ed45 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -238,7 +238,6 @@ config PPC
select HAVE_HARDLOCKUP_DETECTOR_PERF if PERF_EVENTS && HAVE_PERF_EVENTS_NMI && !HAVE_HARDLOCKUP_DETECTOR_ARCH
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
- select HUGETLB_PAGE_SIZE_VARIABLE if PPC_BOOK3S_64 && HUGETLB_PAGE
select MMU_GATHER_RCU_TABLE_FREE
select MMU_GATHER_PAGE_SIZE
select HAVE_REGS_AND_STACK_ACCESS_API
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index fff52ad370c1..efab85e2ae1c 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -30,18 +30,9 @@ enum pageblock_bits {

#ifdef CONFIG_HUGETLB_PAGE

-#ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
-
/* Huge page sizes are variable */
extern unsigned int pageblock_order;

-#else /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
-
-/* Huge pages are a constant size */
-#define pageblock_order HUGETLB_PAGE_ORDER
-
-#endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
-
#else /* CONFIG_HUGETLB_PAGE */

/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */
diff --git a/mm/Kconfig b/mm/Kconfig
index a8a367c30053..3466c3d09295 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -284,13 +284,6 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
config ARCH_ENABLE_THP_MIGRATION
bool

-config HUGETLB_PAGE_SIZE_VARIABLE
- def_bool n
- help
- Allows the pageblock_order value to be dynamic instead of just standard
- HUGETLB_PAGE_ORDER when there are multiple HugeTLB page sizes available
- on a platform.
-
config CONTIG_ALLOC
def_bool (MEMORY_ISOLATION && COMPACTION) || CMA

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a2fe714aed93..72bb4a300e03 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -256,8 +256,9 @@ bool pm_suspended_storage(void)
}
#endif /* CONFIG_PM_SLEEP */

-#ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
+#ifdef CONFIG_HUGETLB_PAGE
unsigned int pageblock_order __read_mostly;
+EXPORT_SYMBOL_GPL(pageblock_order);
#endif

static void __free_pages_ok(struct page *page, unsigned int order,
@@ -7057,7 +7058,7 @@ static void __ref setup_usemap(struct zone *zone)
static inline void setup_usemap(struct zone *zone) {}
#endif /* CONFIG_SPARSEMEM */

-#ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
+#ifdef CONFIG_HUGETLB_PAGE

/* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */
void __init set_pageblock_order(void)
@@ -7068,7 +7069,7 @@ void __init set_pageblock_order(void)
if (pageblock_order)
return;

- if (HPAGE_SHIFT > PAGE_SHIFT)
+ if ((HPAGE_SHIFT > PAGE_SHIFT) && (HUGETLB_PAGE_ORDER > (MAX_ORDER - 1)))
order = HUGETLB_PAGE_ORDER;
else
order = MAX_ORDER - 1;
@@ -7080,10 +7081,10 @@ void __init set_pageblock_order(void)
*/
pageblock_order = order;
}
-#else /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
+#else /* CONFIG_HUGETLB_PAGE */

/*
- * When CONFIG_HUGETLB_PAGE_SIZE_VARIABLE is not set, set_pageblock_order()
+ * When CONFIG_HUGETLB_PAGE is not set, set_pageblock_order()
* is unused as pageblock_order is set at compile-time. See
* include/linux/pageblock-flags.h for the values of pageblock_order based on
* the kernel config
@@ -7092,7 +7093,7 @@ void __init set_pageblock_order(void)
{
}

-#endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
+#endif /* CONFIG_HUGETLB_PAGE */

static unsigned long __init calc_memmap_size(unsigned long spanned_pages,
unsigned long present_pages)
diff --git a/mm/sparse.c b/mm/sparse.c
index 7637208b8874..1c2957807882 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -583,7 +583,7 @@ void __init sparse_init(void)
pnum_begin = first_present_section_nr();
nid_begin = sparse_early_nid(__nr_to_section(pnum_begin));

- /* Setup pageblock_order for HUGETLB_PAGE_SIZE_VARIABLE */
+ /* Setup pageblock_order for HUGETLB_PAGE */
set_pageblock_order();

for_each_present_section_nr(pnum_begin + 1, pnum_end) {
--
2.30.2

2021-05-06 19:25:07

by Zi Yan

Subject: Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

On 6 May 2021, at 11:31, David Hildenbrand wrote:

> On 06.05.21 17:26, Zi Yan wrote:
>> From: Zi Yan <[email protected]>
>>
>> Hi all,
>>
>> This patchset tries to remove the restriction on memory hotplug/hotremove
>> granularity, which is always greater or equal to memory section size[1].
>> With the patchset, kernel is able to online/offline memory at a size independent
>> of memory section size, as small as 2MB (the subsection size).
>
> ... which doesn't make any sense as we can only online/offline whole memory block devices.

Why limit the memory block size to the section size? Patch 3 removes the restriction
by using (start_pfn, nr_pages) to allow the memory block size to go below the section size.
Also, we have the subsection bitmap available to tell us which subsections are online,
so there is no reason to force the memory block size to match the section size.

>
>>
>> The motivation is to increase MAX_ORDER of the buddy allocator and pageblock
>> size without increasing memory hotplug/hotremove granularity at the same time,
>
> Gah, no. Please no. No.
>
>> so that the kernel can allocator 1GB pages using buddy allocator and utilizes
>> existing pageblock based anti-fragmentation, paving the road for 1GB THP
>> support[2].
>
> Not like this, please no.
>
>>
>> The patchset utilizes the existing subsection support[3] and changes the
>> section size alignment checks to subsection size alignment checks. There are
>> also changes to pageblock code to support partial pageblocks, when pageblock
>> size is increased along with MAX_ORDER. Increasing pageblock size can enable
>> kernel to utilize existing anti-fragmentation mechanism for gigantic page
>> allocations.
>
> Please not like this.
>
>>
>> The last patch increases SECTION_SIZE_BITS to demonstrate the use of memory
>> hotplug/hotremove subsection, but is not intended to be merged as is. It is
>> there in case one wants to try this out and will be removed during the final
>> submission.
>>
>> Feel free to give suggestions and comments. I am looking forward to your
>> feedback.
>
> Please not like this.

Do you mind sharing more useful feedback instead of just saying a lot of No?

Thanks.



Best Regards,
Yan Zi



2021-05-06 19:27:01

by David Hildenbrand

Subject: Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

On 06.05.21 17:31, David Hildenbrand wrote:
> On 06.05.21 17:26, Zi Yan wrote:
>> From: Zi Yan <[email protected]>
>>
>> Hi all,
>>
>> This patchset tries to remove the restriction on memory hotplug/hotremove
>> granularity, which is always greater or equal to memory section size[1].
>> With the patchset, kernel is able to online/offline memory at a size independent
>> of memory section size, as small as 2MB (the subsection size).
>
> ... which doesn't make any sense as we can only online/offline whole
> memory block devices.
>
>>
>> The motivation is to increase MAX_ORDER of the buddy allocator and pageblock
>> size without increasing memory hotplug/hotremove granularity at the same time,
>
> Gah, no. Please no. No.
>
>> so that the kernel can allocator 1GB pages using buddy allocator and utilizes
>> existing pageblock based anti-fragmentation, paving the road for 1GB THP
>> support[2].
>
> Not like this, please no.
>
>>
>> The patchset utilizes the existing subsection support[3] and changes the
>> section size alignment checks to subsection size alignment checks. There are
>> also changes to pageblock code to support partial pageblocks, when pageblock
>> size is increased along with MAX_ORDER. Increasing pageblock size can enable
>> kernel to utilize existing anti-fragmentation mechanism for gigantic page
>> allocations.
>
> Please not like this.
>
>>
>> The last patch increases SECTION_SIZE_BITS to demonstrate the use of memory
>> hotplug/hotremove subsection, but is not intended to be merged as is. It is
>> there in case one wants to try this out and will be removed during the final
>> submission.
>>
>> Feel free to give suggestions and comments. I am looking forward to your
>> feedback.
>
> Please not like this.
>

And just to be clear (I think I mentioned this already to you?): Nack to
increasing the section size. Nack to increasing the pageblock order.
Please find different ways to group at the gigantic-page level. There are
alternative ideas floating around.

Semi-nack to increasing MAX_ORDER. I first want to see
alloc_contig_range() be able to fully and cleanly handle allocations <
MAX_ORDER in all cases (especially !CMA and !ZONE_MOVABLE) before we go
down that path.

--
Thanks,

David / dhildenb

2021-05-06 19:35:48

by David Hildenbrand

Subject: Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

On 06.05.21 17:50, Zi Yan wrote:
> On 6 May 2021, at 11:40, David Hildenbrand wrote:
>
>>>>> The last patch increases SECTION_SIZE_BITS to demonstrate the use of memory
>>>>> hotplug/hotremove subsection, but is not intended to be merged as is. It is
>>>>> there in case one wants to try this out and will be removed during the final
>>>>> submission.
>>>>>
>>>>> Feel free to give suggestions and comments. I am looking forward to your
>>>>> feedback.
>>>>
>>>> Please not like this.
>>>
>>> Do you mind sharing more useful feedback instead of just saying a lot of No?
>>
>> I remember reasoning about this already in another thread, no? Either you're ignoring my previous feedback or my mind is messing with me.
>
> I definitely remember all your suggestions:
>
> 1. do not use CMA allocation for 1GB THP.
> 2. section size defines the minimum size in which we can add_memory(), so we cannot increase it.
>
> I am trying an alternative here. I am not using CMA allocation and not increasing the minimum size of add_memory() by decoupling the memory block size from section size, so that add_memory() can add a memory block smaller (as small as 2MB, the subsection size) than section size. In this way, section size can be increased freely. I do not see the strong tie between add_memory() and section size, especially we have subsection bitmap support.

Okay, let me express my thoughts, I could have sworn I explained back
then why I am not a friend of messing with the existing pageblock size:

1. Pageblock size

There are a couple of features that rely on the pageblock size to be
reasonably small to work as expected. One example is virtio-balloon free
page reporting, then there is virtio-mem (still also glued to MAX_ORDER)
and we have CMA (still also glued to MAX_ORDER). Most probably there are
more. We track movability / page isolation per pageblock; it's the
smallest granularity you can effectively isolate pages or mark them as
CMA (MIGRATE_ISOLATE, MIGRATE_CMA). Well, and there are "ordinary" THP /
huge pages most of our applications use and will use, especially on
smallish systems.

Assume you bump up the pageblock order to 1G. Small VMs won't be able to
report any free pages to the hypervisor. You'll take the "fine-grained"
out of virtio-mem. Each CMA area will have to be at least 1G big, which
turns CMA essentially useless on smallish systems (like we have on arm64
with 64k base pages -- pageblock_size is 512MB and I hate it).
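
Just to put numbers on it, a tiny standalone sketch (orders 9 and 13 are
the usual HUGETLB_PAGE_ORDER values for 4K and 64K base pages; 18 is the
hypothetical 1G case):

#include <stdio.h>

static unsigned long long pageblock_bytes(unsigned int page_shift,
					  unsigned int pageblock_order)
{
	return 1ULL << (page_shift + pageblock_order);
}

int main(void)
{
	/* 4K base pages, pageblock_order = HUGETLB_PAGE_ORDER = 9 -> 2 MiB */
	printf("4K base,  order  9: %4llu MiB\n", pageblock_bytes(12, 9) >> 20);
	/* 64K base pages (arm64), pageblock_order = 13 -> 512 MiB */
	printf("64K base, order 13: %4llu MiB\n", pageblock_bytes(16, 13) >> 20);
	/* a 1 GiB pageblock with 4K base pages would need order 18 */
	printf("4K base,  order 18: %4llu MiB\n", pageblock_bytes(12, 18) >> 20);
	return 0;
}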

Then, imagine systems that have like 4G of main memory. By stopping
grouping at 2M and instead grouping at 1G you can very easily find
yourself in the system where all your 4 pageblocks are unmovable and you
essentially don't optimize for huge pages in that environment any more.

Long story short: we need a different mechanism on top and shall leave
the pageblock size untouched, it's too tightly integrated with page
isolation, ordinary THP, and CMA.

2. Section size

I assume the only reason you want to touch that is because
pageblock_size <= section_size, and I guess that's one of the reasons I
dislike it so much. Messing with the section size really only makes
sense when we want to manage metadata for larger granularity within a
section.

We allocate metadata per section. We mark whole sections
early/online/present/.... Yes, in case of vmemmap, we manage the memmap
in smaller granularity using the sub-section map, some kind of hack to
support some ZONE_DEVICE cases better.
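
To put numbers on the granularities involved, a tiny standalone sketch
(assuming the current x86_64 defaults: SECTION_SIZE_BITS = 27,
SUBSECTION_SHIFT = 21, 4K base pages):

#include <stdio.h>

int main(void)
{
	const unsigned int page_shift = 12;		/* 4K base pages */
	const unsigned int section_size_bits = 27;	/* 128 MiB sections on x86_64 */
	const unsigned int subsection_shift = 21;	/* 2 MiB sub-sections */

	printf("section size:        %llu MiB\n", (1ULL << section_size_bits) >> 20);
	printf("subsection size:     %llu MiB\n", (1ULL << subsection_shift) >> 20);
	printf("subsections/section: %u\n", 1U << (section_size_bits - subsection_shift));
	printf("pages/subsection:    %u\n", 1U << (subsection_shift - page_shift));
	return 0;
}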

Let's assume we introduce something new "gigapage_order", corresponding
to 1G. We could either decide to squeeze the metadata into sections,
having to increase the section size, or manage that metadata differently.

Managing it differently certainly makes the necessary changes easier.
Instead of adding more hacks into sections, rather manage that metadata
at a different place / in a different way.

See [1] for an alternative. Not necessarily what I would dream of, but
just to showcase that there might be alternative to group pages.

3. Grouping pages > pageblock_order

There are other approaches that would benefit from grouping at >
pageblock_order and having bigger MAX_ORDER. And that doesn't
necessarily mean to form gigantic pages only, we might want to group in
multiple granularity on a single system. Memory hot(un)plug is one
example, but also optimizing memory consumption by powering down DIMM
banks. Also, some architectures support differing huge page sizes
(aarch64) that could be improved without CMA. Why not have more than 2
THP sizes on these systems?

Ideally, we'd have a mechanism that tries grouping on different
granularity, like for every order in pageblock_order ...
max_pageblock_order (e.g., 1 GiB), and not only add one new level of
grouping (or increase the single grouping size).

[1] https://lkml.kernel.org/r/[email protected]

--
Thanks,

David / dhildenb

2021-05-06 19:55:27

by Zi Yan

[permalink] [raw]
Subject: Re: [RFC PATCH 1/7] mm: sparse: set/clear subsection bitmap when pages are onlined/offlined.

On 6 May 2021, at 13:48, David Hildenbrand wrote:

> On 06.05.21 17:26, Zi Yan wrote:
>> From: Zi Yan <[email protected]>
>>
>> subsection bitmap was set/cleared when a section is added/removed, but
>> pfn_to_online_page() uses subsection bitmap to check if the page is
>> online, which is not accurate. It was working when a whole section is
>> added/removed during memory hotplug and hotremove. When the following
>> patches enable memory hotplug and hotremove for subsections,
>> subsection bitmap needs to be changed during page online/offline time,
>> otherwise, pfn_to_online_page() will not give right answers. Move the
>> subsection bitmap manipulation code from section_activate() to
>> online_mem_sections() and section_deactivate() to
>> offline_mem_sections(), respectively.
>>
>> Signed-off-by: Zi Yan <[email protected]>
>> ---
>> mm/sparse.c | 36 +++++++++++++++++++++++++++++++++---
>> 1 file changed, 33 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/sparse.c b/mm/sparse.c
>> index b2ada9dc00cb..7637208b8874 100644
>> --- a/mm/sparse.c
>> +++ b/mm/sparse.c
>> @@ -606,6 +606,7 @@ void __init sparse_init(void)
>> #ifdef CONFIG_MEMORY_HOTPLUG
>> +static int fill_subsection_map(unsigned long pfn, unsigned long nr_pages);
>> /* Mark all memory sections within the pfn range as online */
>> void online_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
>> {
>> @@ -621,9 +622,12 @@ void online_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
>> ms = __nr_to_section(section_nr);
>> ms->section_mem_map |= SECTION_IS_ONLINE;
>> + fill_subsection_map(pfn, min(end_pfn, pfn + PAGES_PER_SECTION) - pfn);
>> }
>> }
>> +static int clear_subsection_map(unsigned long pfn, unsigned long nr_pages);
>> +static bool is_subsection_map_empty(struct mem_section *ms);
>> /* Mark all memory sections within the pfn range as offline */
>> void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
>> {
>> @@ -641,7 +645,13 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
>> continue;
>> ms = __nr_to_section(section_nr);
>> - ms->section_mem_map &= ~SECTION_IS_ONLINE;
>> +
>> + if (end_pfn < pfn + PAGES_PER_SECTION) {
>> + clear_subsection_map(pfn, end_pfn - pfn);
>> + if (is_subsection_map_empty(ms))
>> + ms->section_mem_map &= ~SECTION_IS_ONLINE;
>> + } else
>> + ms->section_mem_map &= ~SECTION_IS_ONLINE;
>> }
>> }
>> @@ -668,6 +678,17 @@ static void free_map_bootmem(struct page *memmap)
>> vmemmap_free(start, end, NULL);
>> }
>> +static int subsection_map_intersects(struct mem_section *ms, unsigned long pfn,
>> + unsigned long nr_pages)
>> +{
>> + DECLARE_BITMAP(map, SUBSECTIONS_PER_SECTION) = { 0 };
>> + unsigned long *subsection_map = &ms->usage->subsection_map[0];
>> +
>> + subsection_mask_set(map, pfn, nr_pages);
>> +
>> + return bitmap_intersects(map, subsection_map, SUBSECTIONS_PER_SECTION);
>> +}
>> +
>> static int clear_subsection_map(unsigned long pfn, unsigned long nr_pages)
>> {
>> DECLARE_BITMAP(map, SUBSECTIONS_PER_SECTION) = { 0 };
>> @@ -760,6 +781,12 @@ static void free_map_bootmem(struct page *memmap)
>> }
>> }
>> +static int subsection_map_intersects(struct mem_section *ms, unsigned long pfn,
>> + unsigned long nr_pages)
>> +{
>> + return 0;
>> +}
>> +
>> static int clear_subsection_map(unsigned long pfn, unsigned long nr_pages)
>> {
>> return 0;
>> @@ -800,7 +827,10 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
>> struct page *memmap = NULL;
>> bool empty;
>> - if (clear_subsection_map(pfn, nr_pages))
>> + if (WARN((IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP) && !ms->usage) ||
>> + subsection_map_intersects(ms, pfn, nr_pages),
>> + "section already deactivated (%#lx + %ld)\n",
>> + pfn, nr_pages))
>> return;
>> empty = is_subsection_map_empty(ms);
>> @@ -855,7 +885,7 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
>> ms->usage = usage;
>> }
>> - rc = fill_subsection_map(pfn, nr_pages);
>> + rc = !nr_pages || subsection_map_intersects(ms, pfn, nr_pages);
>> if (rc) {
>> if (usage)
>> ms->usage = NULL;
>>
>
> If I am not missing something, this is completely broken for devmem/ZONE_DEVICE that never onlines pages. But also when memory blocks are never onlined, this would be just wrong. Least thing you would need is a sub-section online map.

Thanks for pointing this out. I did not know that devmem/ZONE_DEVICE never onlines pages.

>
> But glimpsing at patch #2, I'd rather stop right away digging deeper into this series :)

What is the issue with patch 2, which makes pageblock_order a variable all the time? BTW, patch 2 fixes a bug by exporting pageblock_order, since when HUGETLB_PAGE_SIZE_VARIABLE is set, virtio-mem will not see pageblock_order as a variable, which could happen for PPC_BOOK3S_64 with virtio-mem enabled, right? Or is this an invalid combination?

>
> I think what would really help is drafting a design of how it all could look like and then first discussing the high-level design, investigating how it could play along with all existing users, existing workloads, and existing use cases. Proposing such changes without a clear picture in mind and a high-level overview might give you some unpleasant reactions from some of the developers around here ;)

Please see my other email for a high-level design. Also I sent the patchset as a RFC to gather information on users, workloads, use cases I did not know about and I learnt a lot from your replies. :) Feedback is always welcome, but I am not sure why it needs to make people unpleasant. ;)



Best Regards,
Yan Zi


Attachments:
signature.asc (871.00 B)
OpenPGP digital signature

2021-05-06 19:56:22

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

>>
>> 1. Pageblock size
>>
>> There are a couple of features that rely on the pageblock size to be reasonably small to work as expected. One example is virtio-balloon free page reporting, then there is virtio-mem (still also glued MAX_ORDER) and we have CMA (still also glued to MAX_ORDER). Most probably there are more. We track movability/ page isolation per pageblock; it's the smallest granularity you can effectively isolate pages or mark them as CMA (MIGRATE_ISOLATE, MIGRATE_CMA). Well, and there are "ordinary" THP / huge pages most of our applications use and will use, especially on smallish systems.
>>
>> Assume you bump up the pageblock order to 1G. Small VMs won't be able to report any free pages to the hypervisor. You'll take the "fine-grained" out of virtio-mem. Each CMA area will have to be at least 1G big, which turns CMA essentially useless on smallish systems (like we have on arm64 with 64k base pages -- pageblock_size is 512MB and I hate it).
>
> I understand the issue of having large pageblock in small systems. My plan for this issue is to make MAX_ORDER a variable (pageblock size would be set according to MAX_ORDER) that can be adjusted based on total memory and via boot time parameter. My apology since I did not state this clearly in my cover letter and it confused you. When we have a boot time adjustable MAX_ORDER, large pageblock like 1GB would only appear for systems with large memory. For small VMs, pageblock size would stay at 2MB, so all your concerns on smallish systems should go away.

I have to admit that I am not really a friend of that. I still think our
target goal should be to have gigantic THP *in addition to* ordinary
THP. Use gigantic THP where enabled and possible, and just use ordinary
THP everywhere else. Having one pageblock granularity is a real
limitation IMHO and requires us to hack the system to support it to some
degree.

>
>>
>> Then, imagine systems that have like 4G of main memory. By stopping grouping at 2M and instead grouping at 1G you can very easily find yourself in the system where all your 4 pageblocks are unmovable and you essentially don't optimize for huge pages in that environment any more.
>>
>> Long story short: we need a different mechanism on top and shall leave the pageblock size untouched, it's too tightly integrated with page isolation, ordinary THP, and CMA.
>
> I think it is better to make pageblock size adjustable based on total memory of a system. It is not reasonable to have the same pageblock size across systems with memory sizes from <1GB to several TBs. Do you agree?
>

I suggest an additional mechanism on top. Please bear in mind that
ordinary THP will most probably be still the default for 99.9% of all
application/library cases, even when you have gigantic THP around.

>>
>> 2. Section size
>>
>> I assume the only reason you want to touch that is because pageblock_size <= section_size, and I guess that's one of the reasons I dislike it so much. Messing with the section size really only makes sense when we want to manage metadata for larger granularity within a section.
>
> Perhaps it is worth checking if it is feasible to make pageblock_size > section_size, so we can still have small sections when pageblock_size are large. One potential issue for that is when PFN discontinues at section boundary, we might have partial pageblock when pageblock_size is big. I guess supporting partial pageblock (or different pageblock sizes like you mentioned below ) would be the right solution.
>
>>
>> We allocate metadata per section. We mark whole sections early/online/present/.... Yes, in case of vmemmap, we manage the memmap in smaller granularity using the sub-section map, some kind of hack to support some ZONE_DEVICE cases better.
>>
>> Let's assume we introduce something new "gigapage_order", corresponding to 1G. We could either decide to squeeze the metadata into sections, having to increase the section size, or manage that metadata differently.
>>
>> Managing it differently certainly makes the necessary changes easier. Instead of adding more hacks into sections, rather manage that metadata at differently place / in a different way.
>
> Can you elaborate on managing it differently?

Let's keep it simple. Assume you track, per 1G gigapageblock, MOVABLE vs.
!movable in addition to existing pageblocks. A 64 TB system would have
64*1024 gigapageblocks. One bit per gigapageblock would require 8k a.k.a.
2 pages. If you need more states, it would maybe double. No need to
manage that using sparse memory sections IMHO. Just allocate 2/4 pages
during boot for the bitmap.
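
Spelled out (a minimal standalone sketch of the arithmetic above,
assuming 4K pages and one bit per 1 GiB block):

#include <stdio.h>

int main(void)
{
	unsigned long long mem_bytes = 64ULL << 40;	/* 64 TiB of RAM */
	unsigned long long blocks = mem_bytes >> 30;	/* 1 GiB gigapageblocks */
	unsigned long long bitmap_bytes = blocks / 8;	/* one bit per block */

	printf("%llu blocks -> %llu KiB of bitmap -> %llu 4K pages\n",
	       blocks, bitmap_bytes >> 10, bitmap_bytes >> 12);
	return 0;
}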

>
>>
>> See [1] for an alternative. Not necessarily what I would dream off, but just to showcase that there might be alternative to group pages.
>
> I saw this patch too. It is an interesting idea to separate different allocation orders into different regions, but it would not work for gigantic page allocations unless we have large pageblock size to utilize existing anti-fragmentation mechanisms.

Right, any anti-fragmentation mechanism on top.

>> 3. Grouping pages > pageblock_order
>>
>> There are other approaches that would benefit from grouping at > pageblock_order and having bigger MAX_ORDER. And that doesn't necessarily mean to form gigantic pages only, we might want to group in multiple granularity on a single system. Memory hot(un)plug is one example, but also optimizing memory consumption by powering down DIMM banks. Also, some architectures support differing huge page sizes (aarch64) that could be improved without CMA. Why not have more than 2 THP sizes on these systems?
>>
>> Ideally, we'd have a mechanism that tries grouping on different granularity, like for every order in pageblock_order ... max_pageblock_order (e.g., 1 GiB), and not only add one new level of grouping (or increase the single grouping size).
>
> I agree. In some sense, supporting partial pageblock and increasing pageblock size (e.g., to 1GB) is, at the high level, quite similar to having multiple pageblock sizes. But I am not sure if we really want to support multiple pageblock sizes, since it creates pageblock fragmentation when we want to change migratetype for part of a pageblock. This means we would break a large pageblock into small ones if we just want to steal a subset of pages from MOVEABLE for UNMOVABLE allocations. Then pageblock loses its most useful anti-fragmentation feature. Also it seems to be a replication of buddy allocator functionalities when it comes to pageblock split and merge.

Let's assume for simplicity that you have a 4G machine, maximum 4
gigantic pages. The first gigantic page will be impossible either way
due to the kernel, boot time allocations etc. So you're left with 3
gigantic pages you could use at best.

Obviously, you want to make sure that the remaining parts of the first
gigantic page are used as best as possible for ordinary huge pages, so
you would actually want to group them in 2 MiB chunks and avoid
fragmentation there.

Obviously, supporting two pageblock types would require core
modifications to support it natively. (not pushing for the idea of two
pageblock orders, just motivating why we actually want to keep grouping
for ordinary THP).

>
>
> The above is really a nice discussion with you on pageblock, section, memory hotplug/hotremove, which also helps me understand more on the issues with increasing MAX_ORDER to enable 1GB page allocation.
>
> In sum, if I get it correctly, the issues I need to address are:
>
> 1. large pageblock size (which is needed when we bump MAX_ORDER for gigantic page allocation from buddy allocator) is not good for machines with small memory.
>
> 2. pageblock size is currently tied with section size (which made me want to bump section size).
>
>
> For 1, I think making MAX_ORDER a variable that can be set based on total memory size and adjustable via boot time parameter should solve the problem. For small machines, we will keep MAX_ORDER as small as we have now like 4MB, whereas for large machines, we can increase MAX_ORDER to utilize gigantic pages.
>
> For 2, supporting partial pageblock and allow a pageblock to cross multiple sections would break the tie between pageblock size and section to solve the issue.
>
> I am going to look into them. What do you think?

I am not sure that's really the right direction as stated above.

--
Thanks,

David / dhildenb

2021-05-06 19:59:32

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH 1/7] mm: sparse: set/clear subsection bitmap when pages are onlined/offlined.

>> But glimpsing at patch #2, I'd rather stop right away digging deeper into this series :)
>
> What is the issue of patch 2, which makes pageblock_order a variable all the time? BTW, patch 2 fixes a bug by exporting pageblock_order, since when HUGETLB_PAGE_SIZE_VARIABLE is set, virtio-mem will not see pageblock_order as a variable, which could happen for PPC_BOOK2S_64 with virtio-men enabled, right? Or is this an invalid combination?

virtio_mem is x86_64 only. aarch64 and s390x prototypes are available.

If I understood "Make pageblock_order a variable and
set it to the max of HUGETLB_PAGE_ORDER, MAX_ORDER - 1" correctly,
you're setting the pageblock_order on x86_64 to 4M. That means you're
no longer grouping for THP but MAX_ORDER - 1, which is not what we want.
We want to optimize for THP.

Also, that would affect virtio-balloon with free page reporting (report
only 4 MiB chunks not 2 MiB chunks).

>
>>
>> I think what would really help is drafting a design of how it all could look like and then first discussing the high-level design, investigating how it could play along with all existing users, existing workloads, and existing use cases. Proposing such changes without a clear picture in mind and a high-level overview might give you some unpleasant reactions from some of the developers around here ;)
>
> Please see my other email for a high-level design. Also I sent the patchset as a RFC to gather information on users, workloads, use cases I did not know about and I learnt a lot from your replies. :) Feedback is always welcome, but I am not sure why it needs to make people unpleasant. ;)

Rather the replies might be unpleasant ;)

--
Thanks,

David / dhildenb

2021-05-06 20:03:11

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

On Thu, May 06, 2021 at 09:10:52PM +0200, David Hildenbrand wrote:
> I have to admit that I am not really a friend of that. I still think our
> target goal should be to have gigantic THP *in addition to* ordinary THP.
> Use gigantic THP where enabled and possible, and just use ordinary THP
> everywhere else. Having one pageblock granularity is a real limitation IMHO
> and requires us to hack the system to support it to some degree.

You're thinking too small with only two THP sizes ;-) I'm aiming to
support arbitrary power-of-two memory allocations. I think there's a
fruitful discussion to be had about how that works for anonymous memory --
with page cache, we have readahead to tell us when our predictions of use
are actually fulfilled. It doesn't tell us what percentage of the pages
allocated were actually used, but it's a hint. It's a big lift to go from
2MB all the way to 1GB ... if you can look back to see that the previous
1GB was basically fully populated, then maybe jump up from allocating
2MB folios to allocating a 1GB folio, but wow, that's a big step.

This goal really does mean that we want to allocate from the page
allocator, and so we do want to grow MAX_ORDER. I suppose we could
do something ugly like

	if (order <= MAX_ORDER)
		alloc_page()
	else
		alloc_really_big_page()

but that feels like unnecessary hardship to place on the user.

I know that for the initial implementation, we're going to rely on hints
from the user to use 1GB pages, but it'd be nice to not do that.

2021-05-06 20:05:21

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

On 06.05.21 21:30, Matthew Wilcox wrote:
> On Thu, May 06, 2021 at 09:10:52PM +0200, David Hildenbrand wrote:
>> I have to admit that I am not really a friend of that. I still think our
>> target goal should be to have gigantic THP *in addition to* ordinary THP.
>> Use gigantic THP where enabled and possible, and just use ordinary THP
>> everywhere else. Having one pageblock granularity is a real limitation IMHO
>> and requires us to hack the system to support it to some degree.
>
> You're thinking too small with only two THP sizes ;-) I'm aiming to

Well, I raised in my other mail that we will have multiple different use
cases, including multiple different THP sizes, e.g., on aarch64 ;)

> support arbitrary power-of-two memory allocations. I think there's a
> fruitful discussion to be had about how that works for anonymous memory --
> with page cache, we have readahead to tell us when our predictions of use
> are actually fulfilled. It doesn't tell us what percentage of the pages

Right, and I think we have to think about a better approach than just
increasing the pageblock_order.

> allocated were actually used, but it's a hint. It's a big lift to go from
> 2MB all the way to 1GB ... if you can look back to see that the previous
> 1GB was basically fully populated, then maybe jump up from allocating
> 2MB folios to allocating a 1GB folio, but wow, that's a big step.
>
> This goal really does mean that we want to allocate from the page
> allocator, and so we do want to grow MAX_ORDER. I suppose we could
> do somethig ugly like
>
> if (order <= MAX_ORDER)
> alloc_page()
> else
> alloc_really_big_page()
>
> but that feels like unnecessary hardship to place on the user.

I had something similar for the short term in mind, relying on
alloc_contig_pages() (and maybe ZONE_MOVABLE to make allocations more
likely to succeed). Devil's in the details (page migration, ...).


--
Thanks,

David / dhildenb

2021-05-06 23:17:52

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

>>> The last patch increases SECTION_SIZE_BITS to demonstrate the use of memory
>>> hotplug/hotremove subsection, but is not intended to be merged as is. It is
>>> there in case one wants to try this out and will be removed during the final
>>> submission.
>>>
>>> Feel free to give suggestions and comments. I am looking forward to your
>>> feedback.
>>
>> Please not like this.
>
> Do you mind sharing more useful feedback instead of just saying a lot of No?

I remember reasoning about this already in another thread, no? Either
you're ignoring my previous feedback or my mind is messing with me.

--
Thanks,

David / dhildenb

2021-05-07 00:01:17

by Zi Yan

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

On 6 May 2021, at 12:28, David Hildenbrand wrote:

> On 06.05.21 17:50, Zi Yan wrote:
>> On 6 May 2021, at 11:40, David Hildenbrand wrote:
>>
>>>>>> The last patch increases SECTION_SIZE_BITS to demonstrate the use of memory
>>>>>> hotplug/hotremove subsection, but is not intended to be merged as is. It is
>>>>>> there in case one wants to try this out and will be removed during the final
>>>>>> submission.
>>>>>>
>>>>>> Feel free to give suggestions and comments. I am looking forward to your
>>>>>> feedback.
>>>>>
>>>>> Please not like this.
>>>>
>>>> Do you mind sharing more useful feedback instead of just saying a lot of No?
>>>
>>> I remember reasoning about this already in another thread, no? Either you're ignoring my previous feedback or my mind is messing with me.
>>
>> I definitely remember all your suggestions:
>>
>> 1. do not use CMA allocation for 1GB THP.
>> 2. section size defines the minimum size in which we can add_memory(), so we cannot increase it.
>>
>> I am trying an alternative here. I am not using CMA allocation and not increasing the minimum size of add_memory() by decoupling the memory block size from section size, so that add_memory() can add a memory block smaller (as small as 2MB, the subsection size) than section size. In this way, section size can be increased freely. I do not see the strong tie between add_memory() and section size, especially we have subsection bitmap support.
>
> Okay, let me express my thoughts, I could have sworn I explained back then why I am not a friend of messing with the existing pageblock size:

Thanks for writing down your thoughts in detail. I will clarify my high-level plan below too.

>
> 1. Pageblock size
>
> There are a couple of features that rely on the pageblock size to be reasonably small to work as expected. One example is virtio-balloon free page reporting, then there is virtio-mem (still also glued MAX_ORDER) and we have CMA (still also glued to MAX_ORDER). Most probably there are more. We track movability/ page isolation per pageblock; it's the smallest granularity you can effectively isolate pages or mark them as CMA (MIGRATE_ISOLATE, MIGRATE_CMA). Well, and there are "ordinary" THP / huge pages most of our applications use and will use, especially on smallish systems.
>
> Assume you bump up the pageblock order to 1G. Small VMs won't be able to report any free pages to the hypervisor. You'll take the "fine-grained" out of virtio-mem. Each CMA area will have to be at least 1G big, which turns CMA essentially useless on smallish systems (like we have on arm64 with 64k base pages -- pageblock_size is 512MB and I hate it).

I understand the issue of having large pageblocks in small systems. My plan for this issue is to make MAX_ORDER a variable (pageblock size would be set according to MAX_ORDER) that can be adjusted based on total memory and via a boot time parameter. My apologies, since I did not state this clearly in my cover letter and it confused you. When we have a boot time adjustable MAX_ORDER, a large pageblock like 1GB would only appear for systems with large memory. For small VMs, pageblock size would stay at 2MB, so all your concerns on smallish systems should go away.
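
To illustrate the kind of boot time knob I have in mind (a rough sketch
only -- the parameter name "buddy_max_order=" and the clamping policy are
made up here, and actually making MAX_ORDER runtime-variable is of course
much more involved, since it sizes things like zone->free_area[]):

/* Hypothetical sketch only -- not an existing kernel parameter. */
static unsigned int buddy_max_order __read_mostly = 11;	/* today's default */

static int __init set_buddy_max_order(char *str)
{
	unsigned int order;

	if (kstrtouint(str, 0, &order))
		return -EINVAL;

	/* MAX_ORDER = 19 already allows 1 GiB buddy blocks with 4K base pages */
	buddy_max_order = clamp(order, 11U, 20U);
	return 0;
}
early_param("buddy_max_order", set_buddy_max_order);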

>
> Then, imagine systems that have like 4G of main memory. By stopping grouping at 2M and instead grouping at 1G you can very easily find yourself in the system where all your 4 pageblocks are unmovable and you essentially don't optimize for huge pages in that environment any more.
>
> Long story short: we need a different mechanism on top and shall leave the pageblock size untouched, it's too tightly integrated with page isolation, ordinary THP, and CMA.

I think it is better to make pageblock size adjustable based on the total memory of a system. It is not reasonable to have the same pageblock size across systems with memory sizes from <1GB to several TBs. Do you agree?

>
> 2. Section size
>
> I assume the only reason you want to touch that is because pageblock_size <= section_size, and I guess that's one of the reasons I dislike it so much. Messing with the section size really only makes sense when we want to manage metadata for larger granularity within a section.

Perhaps it is worth checking if it is feasible to make pageblock_size > section_size, so we can still have small sections when pageblock_size is large. One potential issue is that when PFNs are discontiguous at a section boundary, we might end up with a partial pageblock when pageblock_size is big. I guess supporting partial pageblocks (or different pageblock sizes like you mentioned below) would be the right solution.

>
> We allocate metadata per section. We mark whole sections early/online/present/.... Yes, in case of vmemmap, we manage the memmap in smaller granularity using the sub-section map, some kind of hack to support some ZONE_DEVICE cases better.
>
> Let's assume we introduce something new "gigapage_order", corresponding to 1G. We could either decide to squeeze the metadata into sections, having to increase the section size, or manage that metadata differently.
>
> Managing it differently certainly makes the necessary changes easier. Instead of adding more hacks into sections, rather manage that metadata at differently place / in a different way.

Can you elaborate on managing it differently?

>
> See [1] for an alternative. Not necessarily what I would dream off, but just to showcase that there might be alternative to group pages.

I saw this patch too. It is an interesting idea to separate different allocation orders into different regions, but it would not work for gigantic page allocations unless we have a large pageblock size to utilize existing anti-fragmentation mechanisms.


>
> 3. Grouping pages > pageblock_order
>
> There are other approaches that would benefit from grouping at > pageblock_order and having bigger MAX_ORDER. And that doesn't necessarily mean to form gigantic pages only, we might want to group in multiple granularity on a single system. Memory hot(un)plug is one example, but also optimizing memory consumption by powering down DIMM banks. Also, some architectures support differing huge page sizes (aarch64) that could be improved without CMA. Why not have more than 2 THP sizes on these systems?
>
> Ideally, we'd have a mechanism that tries grouping on different granularity, like for every order in pageblock_order ... max_pageblock_order (e.g., 1 GiB), and not only add one new level of grouping (or increase the single grouping size).

I agree. In some sense, supporting partial pageblocks and increasing pageblock size (e.g., to 1GB) is, at the high level, quite similar to having multiple pageblock sizes. But I am not sure if we really want to support multiple pageblock sizes, since it creates pageblock fragmentation when we want to change the migratetype for part of a pageblock. This means we would break a large pageblock into small ones if we just want to steal a subset of pages from MOVABLE for UNMOVABLE allocations. Then pageblock loses its most useful anti-fragmentation feature. Also it seems to be a replication of buddy allocator functionality when it comes to pageblock split and merge.


The above is really a nice discussion with you on pageblock, section, memory hotplug/hotremove, which also helps me understand more on the issues with increasing MAX_ORDER to enable 1GB page allocation.

In sum, if I get it correctly, the issues I need to address are:

1. large pageblock size (which is needed when we bump MAX_ORDER for gigantic page allocation from buddy allocator) is not good for machines with small memory.

2. pageblock size is currently tied with section size (which made me want to bump section size).


For 1, I think making MAX_ORDER a variable that can be set based on total memory size and adjustable via a boot time parameter should solve the problem. For small machines, we will keep MAX_ORDER as small as it is now (a 4MB maximum buddy allocation), whereas for large machines, we can increase MAX_ORDER to utilize gigantic pages.

For 2, supporting partial pageblocks and allowing a pageblock to cross multiple sections would break the tie between pageblock size and section size, solving the issue.

I am going to look into them. What do you think?


Best Regards,
Yan Zi


Attachments:
signature.asc (871.00 B)
OpenPGP digital signature

2021-05-07 15:12:01

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

[I haven't read through respective patches due to lack of time but let
me comment on the general idea and the underlying justification]

On Thu 06-05-21 17:31:09, David Hildenbrand wrote:
> On 06.05.21 17:26, Zi Yan wrote:
> > From: Zi Yan <[email protected]>
> >
> > Hi all,
> >
> > This patchset tries to remove the restriction on memory hotplug/hotremove
> > granularity, which is always greater or equal to memory section size[1].
> > With the patchset, kernel is able to online/offline memory at a size independent
> > of memory section size, as small as 2MB (the subsection size).
>
> ... which doesn't make any sense as we can only online/offline whole memory
> block devices.

Agreed. The subsection thingy is just a hack to work around pmem
alignment problems. For the real memory hotplug it is quite hard to
argue for reasonable hotplug scenarios for very small physical memory
ranges wrt. the existing sparsemem memory model.

> > The motivation is to increase MAX_ORDER of the buddy allocator and pageblock
> > size without increasing memory hotplug/hotremove granularity at the same time,
>
> Gah, no. Please no. No.

Agreed. Those are completely independent concepts. MAX_ORDER can be
really arbitrary irrespective of the section size with the vmemmap sparse
model. The existing restriction is due to the old sparse model not being
able to do page pointer arithmetic across memory sections. Is there any
reason to stick with that memory model for an advanced feature you are
working on?
--
Michal Hocko
SUSE Labs

2021-05-07 17:16:46

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

On 07.05.21 13:55, Michal Hocko wrote:
> [I haven't read through respective patches due to lack of time but let
> me comment on the general idea and the underlying justification]
>
> On Thu 06-05-21 17:31:09, David Hildenbrand wrote:
>> On 06.05.21 17:26, Zi Yan wrote:
>>> From: Zi Yan <[email protected]>
>>>
>>> Hi all,
>>>
>>> This patchset tries to remove the restriction on memory hotplug/hotremove
>>> granularity, which is always greater or equal to memory section size[1].
>>> With the patchset, kernel is able to online/offline memory at a size independent
>>> of memory section size, as small as 2MB (the subsection size).
>>
>> ... which doesn't make any sense as we can only online/offline whole memory
>> block devices.
>
> Agreed. The subsection thingy is just a hack to workaround pmem
> alignement problems. For the real memory hotplug it is quite hard to
> argue for reasonable hotplug scenarios for very small physical memory
> ranges wrt. to the existing sparsemem memory model.
>
>>> The motivation is to increase MAX_ORDER of the buddy allocator and pageblock
>>> size without increasing memory hotplug/hotremove granularity at the same time,
>>
>> Gah, no. Please no. No.
>
> Agreed. Those are completely independent concepts. MAX_ORDER is can be
> really arbitrary irrespective of the section size with vmemmap sparse
> model. The existing restriction is due to old sparse model not being
> able to do page pointer arithmetic across memory sections. Is there any
> reason to stick with that memory model for an advance feature you are
> working on?
>

I gave it some more thought yesterday. I guess the first thing we should
look into is increasing MAX_ORDER and leaving pageblock_order and
section size as is -- finding out what we have to tweak to get that up
and running. Once we have that in place, we can actually look into
better fragmentation avoidance etc. One step at a time.

Because that change itself might require some thought. Requiring that
a bigger MAX_ORDER depends on SPARSE_VMEMMAP is something reasonable to do.

As stated somewhere here already, we'll have to look into making
alloc_contig_range() (and main users CMA and virtio-mem) independent of
MAX_ORDER and mainly rely on pageblock_order. The current handling in
alloc_contig_range() is far from optimal as we have to isolate a whole
MAX_ORDER - 1 page -- and on ZONE_NORMAL we'll fail easily if any part
contains something unmovable although we don't even want to allocate
that part. I actually have that on my list (to be able to fully support
pageblock_order instead of MAX_ORDER -1 chunks in virtio-mem), however
didn't have time to look into it.

Further, page onlining / offlining code and early init code most
probably also needs care if MAX_ORDER - 1 crosses sections. Memory holes
we might suddenly have in MAX_ORDER - 1 pages might become a problem and
will have to be handled. Not sure which other code has to be tweaked
(compaction? page isolation?).

Figuring out what needs care itself might take quite some effort.


One thing I was thinking about as well: The bigger our MAX_ORDER, the
slower it could be to allocate smaller pages. If we have 1G pages,
splitting them down to 4k then takes 8 additional steps if I'm not
wrong. Of course, that's the worst case. Would be interesting to evaluate.
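
The worst case, spelled out (4K base pages; a trivial standalone sketch):

#include <stdio.h>

int main(void)
{
	const int page_shift = 12;		/* 4K base pages */
	const int cur_top = 22 - page_shift;	/* order 10 = 4 MiB, today's MAX_ORDER - 1 */
	const int gig_top = 30 - page_shift;	/* order 18 = 1 GiB */

	/* splitting one order-N buddy block down to a 4K page takes N halvings */
	printf("4M -> 4K: %d splits; 1G -> 4K: %d splits; extra: %d\n",
	       cur_top, gig_top, gig_top - cur_top);
	return 0;
}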

--
Thanks,

David / dhildenb

2021-05-10 14:42:04

by Zi Yan

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

On 7 May 2021, at 10:00, David Hildenbrand wrote:

> On 07.05.21 13:55, Michal Hocko wrote:
>> [I haven't read through respective patches due to lack of time but let
>> me comment on the general idea and the underlying justification]
>>
>> On Thu 06-05-21 17:31:09, David Hildenbrand wrote:
>>> On 06.05.21 17:26, Zi Yan wrote:
>>>> From: Zi Yan <[email protected]>
>>>>
>>>> Hi all,
>>>>
>>>> This patchset tries to remove the restriction on memory hotplug/hotremove
>>>> granularity, which is always greater or equal to memory section size[1].
>>>> With the patchset, kernel is able to online/offline memory at a size independent
>>>> of memory section size, as small as 2MB (the subsection size).
>>>
>>> ... which doesn't make any sense as we can only online/offline whole memory
>>> block devices.
>>
>> Agreed. The subsection thingy is just a hack to workaround pmem
>> alignement problems. For the real memory hotplug it is quite hard to
>> argue for reasonable hotplug scenarios for very small physical memory
>> ranges wrt. to the existing sparsemem memory model.
>>
>>>> The motivation is to increase MAX_ORDER of the buddy allocator and pageblock
>>>> size without increasing memory hotplug/hotremove granularity at the same time,
>>>
>>> Gah, no. Please no. No.
>>
>> Agreed. Those are completely independent concepts. MAX_ORDER is can be
>> really arbitrary irrespective of the section size with vmemmap sparse
>> model. The existing restriction is due to old sparse model not being
>> able to do page pointer arithmetic across memory sections. Is there any
>> reason to stick with that memory model for an advance feature you are
>> working on?

No. I just want to increase MAX_ORDER. If the existing restriction can
be removed, that will be great.

>
> I gave it some more thought yesterday. I guess the first thing we should look into is increasing MAX_ORDER and leaving pageblock_order and section size as is -- finding out what we have to tweak to get that up and running. Once we have that in place, we can actually look into better fragmentation avoidance etc. One step at a time.

It makes sense to me.

>
> Because that change itself might require some thought. Requiring that bigger MAX_ORDER depends on SPARSE_VMEMMAP is something reasonable to do.

OK, if with SPARSE_VMEMMAP MAX_ORDER can be set to be bigger than the
section size, that is perfectly OK with me. Since 1GB THP support, which I
want to add ultimately, will require SPARSE_VMEMMAP too (otherwise,
all page++ would need to be changed to nth_page(page, 1)).
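
(For context, nth_page() in include/linux/mm.h is roughly the following,
which is why plain page++ is only safe when the memmap is virtually
contiguous:)

#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
#define nth_page(page, n) pfn_to_page(page_to_pfn((page)) + (n))
#else
#define nth_page(page, n) ((page) + (n))	/* memmap is virtually contiguous */
#endif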

>
> As stated somewhere here already, we'll have to look into making alloc_contig_range() (and main users CMA and virtio-mem) independent of MAX_ORDER and mainly rely on pageblock_order. The current handling in alloc_contig_range() is far from optimal as we have to isolate a whole MAX_ORDER - 1 page -- and on ZONE_NORMAL we'll fail easily if any part contains something unmovable although we don't even want to allocate that part. I actually have that on my list (to be able to fully support pageblock_order instead of MAX_ORDER -1 chunks in virtio-mem), however didn't have time to look into it.

So in your mind, for gigantic page allocation (> MAX_ORDER), alloc_contig_range()
should be used instead of the buddy allocator while pageblock_order is kept at a small
granularity like 2MB. Is that the case? Isn’t it going to have a high fail rate
when any of the pageblocks within a gigantic page range (like 1GB) becomes unmovable?
Are you thinking of an additional mechanism/policy to prevent such a thing from happening as
an additional step for gigantic page allocation? Like your ZONE_PREFER_MOVABLE idea?

>
> Further, page onlining / offlining code and early init code most probably also needs care if MAX_ORDER - 1 crosses sections. Memory holes we might suddenly have in MAX_ORDER - 1 pages might become a problem and will have to be handled. Not sure which other code has to be tweaked (compaction? page isolation?).

Can you elaborate a little more? From what I understand, memory holes mean valid
PFNs are not contiguous before and after a hole, so pfn++ will not work, but
struct pages are still virtually contiguous assuming SPARSE_VMEMMAP, meaning page++
would still work. So when MAX_ORDER - 1 crosses sections, additional code would be
needed instead of simple pfn++. Is there anything I am missing?

BTW, to test a system with memory holes, do you know if there is an easy way of adding
random memory holes to an x86_64 VM, which can help reveal potential missing pieces
in the code? Changing the BIOS-e820 table might be one way, but I have no idea on
how to do it in QEMU.

>
> Figuring out what needs care itself might take quite some effort.
>
> One thing I was thinking about as well: The bigger our MAX_ORDER, the slower it could be to allocate smaller pages. If we have 1G pages, splitting them down to 4k then takes 8 additional steps if I'm, not wrong. Of course, that's the worst case. Would be interesting to evaluate.

Sure. I am planning to check it too. As a simple start, I am going to run will-it-scale
benchmarks to see if there is any performance difference between different MAX_ORDERs.

Thank you for all these valuable inputs. They are very helpful. I appreciate them.



Best Regards,
Yan Zi


Attachments:
signature.asc (871.00 B)
OpenPGP digital signature

2021-05-12 17:54:32

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

>>
>> As stated somewhere here already, we'll have to look into making alloc_contig_range() (and main users CMA and virtio-mem) independent of MAX_ORDER and mainly rely on pageblock_order. The current handling in alloc_contig_range() is far from optimal as we have to isolate a whole MAX_ORDER - 1 page -- and on ZONE_NORMAL we'll fail easily if any part contains something unmovable although we don't even want to allocate that part. I actually have that on my list (to be able to fully support pageblock_order instead of MAX_ORDER -1 chunks in virtio-mem), however didn't have time to look into it.
>
> So in your mind, for gigantic page allocation (> MAX_ORDER), alloc_contig_range()
> should be used instead of buddy allocator while pageblock_order is kept at a small
> granularity like 2MB. Is that the case? Isn’t it going to have high fail rate
> when any of the pageblocks within a gigantic page range (like 1GB) becomes unmovable?
> Are you thinking additional mechanism/policy to prevent such thing happening as
> an additional step for gigantic page allocation? Like your ZONE_PREFER_MOVABLE idea?
>

I am not fully sure yet where the journey will go, I guess nobody
knows. Ultimately, having buddy support for >= current MAX_ORDER (IOW,
increasing MAX_ORDER) will most probably happen, so it would be worth
investigating what has to be done to get that running as a first step.

Of course, we could temporarily think about wiring it up in the buddy like

	if (order < MAX_ORDER)
		__alloc_pages()...
	else
		alloc_contig_pages()

but it doesn't really improve the situation IMHO, just an API change.
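
Spelled out a bit more, such a wrapper would only look like this (a
sketch; the helper name is made up, and alloc_contig_pages() requires
CONFIG_CONTIG_ALLOC and may sleep):

static struct page *alloc_pages_any_order(int nid, gfp_t gfp, unsigned int order)
{
	if (order < MAX_ORDER)
		return alloc_pages_node(nid, gfp, order);

	/* >= MAX_ORDER: fall back to pfn range isolation + migration */
	return alloc_contig_pages(1UL << order, gfp, nid, NULL);
}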

So I think we should look into increasing MAX_ORDER, seeing what needs
to be done to have that part running while keeping the section size and
the pageblock order as is. I know that at least memory
onlining/offlining, cma, alloc_contig_range(), ... needs tweaking,
especially when we don't increase the section size (but also if we would
due to the way page isolation is currently handled). Having a MAX_ORDER
-1 page being partially in different nodes might be another thing to
look into (I heard that it can already happen right now, but I don't
remember the details).

The next step after that would then be better fragmentation avoidance
for larger granularity like 1G THP.

>>
>> Further, page onlining / offlining code and early init code most probably also needs care if MAX_ORDER - 1 crosses sections. Memory holes we might suddenly have in MAX_ORDER - 1 pages might become a problem and will have to be handled. Not sure which other code has to be tweaked (compaction? page isolation?).
>
> Can you elaborate it a little more? From what I understand, memory holes mean valid
> PFNs are not contiguous before and after a hole, so pfn++ will not work, but
> struct pages are still virtually contiguous assuming SPARSE_VMEMMAP, meaning page++
> would still work. So when MAX_ORDER - 1 crosses sections, additional code would be
> needed instead of simple pfn++. Is there anything I am missing?

I think there are two cases when talking about MAX_ORDER and memory holes:

1. Hole with a valid memmap: the memmap is initialized to PageReserved()
and the pages are not given to the buddy. pfn_valid() and
pfn_to_page() works as expected.
2. Hole without a valid memmap: we have that CONFIG_HOLES_IN_ZONE thing
already, see include/linux/mmzone.h. pfn_valid_within() checks are
required. Doesn't win a beauty contest, but gets the job done in
existing setups that seem to care.

"If it is possible to have holes within a MAX_ORDER_NR_PAGES, then we
need to check pfn validity within that MAX_ORDER_NR_PAGES block.
pfn_valid_within() should be used in this case; we optimise this away
when we have no holes within a MAX_ORDER_NR_PAGES block."

CONFIG_HOLES_IN_ZONE is just a bad name for this.

(increasing the section size implies that we waste more memory for the
memmap in holes. increasing MAX_ORDER means that we might have to deal
with holes within MAX_ORDER chunks)

We don't have too many pfn_valid_within() checks. I wonder if we could
add something that is optimized for "holes are a power of two and
properly aligned", because pfn_valid_within() right not deals with holes
of any kind which makes it somewhat inefficient IIRC.
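
For context, the usual shape of such a walker today (pfn_valid_within()
compiles to 1 unless CONFIG_HOLES_IN_ZONE is set; this is not a literal
copy of any one function):

	unsigned long pfn;

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		struct page *page;

		if (!pfn_valid_within(pfn))
			continue;	/* hole inside the MAX_ORDER block */

		page = pfn_to_page(pfn);
		/* ... inspect / isolate / migrate the page ... */
	}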

>
> BTW, to test a system with memory holes, do you know is there an easy of adding
> random memory holes to an x86_64 VM, which can help reveal potential missing pieces
> in the code? Changing BIOS-e820 table might be one way, but I have no idea on
> how to do it on QEMU.

It might not be very easy that way. But I heard that some arm64 systems
have crazy memory layouts -- maybe there, it's easier to get something
nasty running? :)

https://lkml.kernel.org/r/[email protected]

I remember there was a way to define the e820 completely on kernel
cmdline, but I might be wrong ...

--
Thanks,

David / dhildenb

2021-06-02 15:58:16

by Zi Yan

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

On 10 May 2021, at 10:36, Zi Yan wrote:

> On 7 May 2021, at 10:00, David Hildenbrand wrote:
>
>> On 07.05.21 13:55, Michal Hocko wrote:
>>> [I haven't read through respective patches due to lack of time but let
>>> me comment on the general idea and the underlying justification]
>>>
>>> On Thu 06-05-21 17:31:09, David Hildenbrand wrote:
>>>> On 06.05.21 17:26, Zi Yan wrote:
>>>>> From: Zi Yan <[email protected]>
>>>>>
>>>>> Hi all,
>>>>>
>>>>> This patchset tries to remove the restriction on memory hotplug/hotremove
>>>>> granularity, which is always greater or equal to memory section size[1].
>>>>> With the patchset, kernel is able to online/offline memory at a size independent
>>>>> of memory section size, as small as 2MB (the subsection size).
>>>>
>>>> ... which doesn't make any sense as we can only online/offline whole memory
>>>> block devices.
>>>
>>> Agreed. The subsection thingy is just a hack to workaround pmem
>>> alignement problems. For the real memory hotplug it is quite hard to
>>> argue for reasonable hotplug scenarios for very small physical memory
>>> ranges wrt. to the existing sparsemem memory model.
>>>
>>>>> The motivation is to increase MAX_ORDER of the buddy allocator and pageblock
>>>>> size without increasing memory hotplug/hotremove granularity at the same time,
>>>>
>>>> Gah, no. Please no. No.
>>>
>>> Agreed. Those are completely independent concepts. MAX_ORDER is can be
>>> really arbitrary irrespective of the section size with vmemmap sparse
>>> model. The existing restriction is due to old sparse model not being
>>> able to do page pointer arithmetic across memory sections. Is there any
>>> reason to stick with that memory model for an advance feature you are
>>> working on?
>
> No. I just want to increase MAX_ORDER. If the existing restriction can
> be removed, that will be great.
>
>>
>> I gave it some more thought yesterday. I guess the first thing we should look into is increasing MAX_ORDER and leaving pageblock_order and section size as is -- finding out what we have to tweak to get that up and running. Once we have that in place, we can actually look into better fragmentation avoidance etc. One step at a time.
>
> It makes sense to me.
>
>>
>> Because that change itself might require some thought. Requiring that bigger MAX_ORDER depends on SPARSE_VMEMMAP is something reasonable to do.
>
> OK, if with SPARSE_VMEMMAP MAX_ORDER can be set to be bigger than
> SECTION_SIZE, it is perfectly OK to me. Since 1GB THP support, which I
> want to add ultimately, will require SPARSE_VMEMMAP too (otherwise,
> all page++ will need to be changed to nth_page(page,1)).
>
>>
>> As stated somewhere here already, we'll have to look into making alloc_contig_range() (and main users CMA and virtio-mem) independent of MAX_ORDER and mainly rely on pageblock_order. The current handling in alloc_contig_range() is far from optimal as we have to isolate a whole MAX_ORDER - 1 page -- and on ZONE_NORMAL we'll fail easily if any part contains something unmovable although we don't even want to allocate that part. I actually have that on my list (to be able to fully support pageblock_order instead of MAX_ORDER -1 chunks in virtio-mem), however didn't have time to look into it.
>
> So in your mind, for gigantic page allocation (> MAX_ORDER), alloc_contig_range()
> should be used instead of buddy allocator while pageblock_order is kept at a small
> granularity like 2MB. Is that the case? Isn’t it going to have high fail rate
> when any of the pageblocks within a gigantic page range (like 1GB) becomes unmovable?
> Are you thinking additional mechanism/policy to prevent such thing happening as
> an additional step for gigantic page allocation? Like your ZONE_PREFER_MOVABLE idea?
>
>>
>> Further, page onlining / offlining code and early init code most probably also needs care if MAX_ORDER - 1 crosses sections. Memory holes we might suddenly have in MAX_ORDER - 1 pages might become a problem and will have to be handled. Not sure which other code has to be tweaked (compaction? page isolation?).
>
> Can you elaborate it a little more? From what I understand, memory holes mean valid
> PFNs are not contiguous before and after a hole, so pfn++ will not work, but
> struct pages are still virtually contiguous assuming SPARSE_VMEMMAP, meaning page++
> would still work. So when MAX_ORDER - 1 crosses sections, additional code would be
> needed instead of simple pfn++. Is there anything I am missing?
>
> BTW, to test a system with memory holes, do you know is there an easy of adding
> random memory holes to an x86_64 VM, which can help reveal potential missing pieces
> in the code? Changing BIOS-e820 table might be one way, but I have no idea on
> how to do it on QEMU.
>
>>
>> Figuring out what needs care itself might take quite some effort.
>>
>> One thing I was thinking about as well: The bigger our MAX_ORDER, the slower it could be to allocate smaller pages. If we have 1G pages, splitting them down to 4k then takes 8 additional steps if I'm, not wrong. Of course, that's the worst case. Would be interesting to evaluate.
>
> Sure. I am planning to check it too. As a simple start, I am going to run will it scale
> benchmarks to see if there is any performance difference between different MAX_ORDERs.

I ran vm-scalability and memory-related will-it-scale on a server with 256GB memory to
see the impact of increasing MAX_ORDER and didn’t see much difference for most of
the workloads like page_fault1, page_fault2, and page_fault3 from will-it-scale.
But feel free to check the attached complete results and let me know what should be
looked into. Thanks.

# Environment
Dell R630 with 2x 12-core E5-2650 v4 and 256GB memory.


# Kernel changes
On top of v5.13-rc1-mmotm-2021-05-13-17-18, with SECTION_SIZE_BITS set to 31 and
MAX_ORDER set to 11 and 20 respectively.

# Results of page_fault1, page_fault2, and page_fault3


compiler/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-10/defconfig/thread/50%/debian/dellr630/page_fault3/will-it-scale

commit:
5.13.0-rc1-mm1-max-order-11+
5.13.0-rc1-mm1-max-order-20+

5.13.0-rc1-mm1-m 5.13.0-rc1-mm1-max-order-20
---------------- ---------------------------
%stddev %change %stddev
\ | \
3199850 ± 2% +6.0% 3390866 ± 3% will-it-scale.24.threads
54.94 +1.7% 55.85 will-it-scale.24.threads_idle
133326 ± 2% +6.0% 141285 ± 3% will-it-scale.per_thread_ops
3199850 ± 2% +6.0% 3390866 ± 3% will-it-scale.workload

=========================================================================================
compiler/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-10/defconfig/thread/50%/debian/dellr630/page_fault2/will-it-scale

commit:
5.13.0-rc1-mm1-max-order-11+
5.13.0-rc1-mm1-max-order-20+

5.13.0-rc1-mm1-m 5.13.0-rc1-mm1-max-order-20
---------------- ---------------------------
%stddev %change %stddev
\ | \
2016984 -6.6% 1883075 ± 2% will-it-scale.24.threads
69.69 -4.4% 66.64 will-it-scale.24.threads_idle
84040 -6.6% 78461 ± 2% will-it-scale.per_thread_ops
2016984 -6.6% 1883075 ± 2% will-it-scale.workload

=========================================================================================
compiler/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-10/defconfig/thread/50%/debian/dellr630/page_fault1/will-it-scale

commit:
5.13.0-rc1-mm1-max-order-11+
5.13.0-rc1-mm1-max-order-20+

5.13.0-rc1-mm1-m 5.13.0-rc1-mm1-max-order-20
---------------- ---------------------------
%stddev %change %stddev
\ | \
2138067 -1.3% 2109865 will-it-scale.24.threads
63.34 +1.1% 64.06 will-it-scale.24.threads_idle
89085 -1.3% 87910 will-it-scale.per_thread_ops
2138067 -1.3% 2109865 will-it-scale.workload

=========================================================================================
compiler/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-10/defconfig/thread/16/debian/dellr630/page_fault3/will-it-scale

commit:
5.13.0-rc1-mm1-max-order-11+
5.13.0-rc1-mm1-max-order-20+

5.13.0-rc1-mm1-m 5.13.0-rc1-mm1-max-order-20
---------------- ---------------------------
%stddev %change %stddev
\ | \
3216287 ± 3% +4.8% 3369356 ± 10% will-it-scale.16.threads
69.18 +0.5% 69.51 will-it-scale.16.threads_idle
201017 ± 3% +4.8% 210584 ± 10% will-it-scale.per_thread_ops
3216287 ± 3% +4.8% 3369356 ± 10% will-it-scale.workload

=========================================================================================
compiler/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-10/defconfig/thread/16/debian/dellr630/page_fault2/will-it-scale

commit:
5.13.0-rc1-mm1-max-order-11+
5.13.0-rc1-mm1-max-order-20+

5.13.0-rc1-mm1-m 5.13.0-rc1-mm1-max-order-20
---------------- ---------------------------
%stddev %change %stddev
\ | \
2005510 -2.7% 1950620 ± 2% will-it-scale.16.threads
78.77 -0.2% 78.64 will-it-scale.16.threads_idle
125344 -2.7% 121913 ± 2% will-it-scale.per_thread_ops
2005510 -2.7% 1950620 ± 2% will-it-scale.workload

=========================================================================================
compiler/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-10/defconfig/thread/16/debian/dellr630/page_fault1/will-it-scale

commit:
5.13.0-rc1-mm1-max-order-11+
5.13.0-rc1-mm1-max-order-20+

5.13.0-rc1-mm1-m 5.13.0-rc1-mm1-max-order-20
---------------- ---------------------------
%stddev %change %stddev
\ | \
2332446 -6.5% 2179823 ± 2% will-it-scale.16.threads
77.57 -2.0% 76.03 will-it-scale.16.threads_idle
145777 -6.5% 136238 ± 2% will-it-scale.per_thread_ops
2332446 -6.5% 2179823 ± 2% will-it-scale.workload

=========================================================================================
compiler/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-10/defconfig/thread/100%/debian/dellr630/page_fault3/will-it-scale

commit:
5.13.0-rc1-mm1-max-order-11+
5.13.0-rc1-mm1-max-order-20+

5.13.0-rc1-mm1-m 5.13.0-rc1-mm1-max-order-20
---------------- ---------------------------
%stddev %change %stddev
\ | \
3236057 ± 2% -4.5% 3089222 ± 4% will-it-scale.48.threads
24.64 ± 7% -3.3% 23.83 ± 2% will-it-scale.48.threads_idle
67417 ± 2% -4.5% 64358 ± 4% will-it-scale.per_thread_ops
3236057 ± 2% -4.5% 3089222 ± 4% will-it-scale.workload

=========================================================================================
compiler/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-10/defconfig/thread/100%/debian/dellr630/page_fault2/will-it-scale

commit:
5.13.0-rc1-mm1-max-order-11+
5.13.0-rc1-mm1-max-order-20+

5.13.0-rc1-mm1-m 5.13.0-rc1-mm1-max-order-20
---------------- ---------------------------
%stddev %change %stddev
\ | \
1611363 -0.1% 1609891 will-it-scale.48.threads
47.42 ± 2% +1.2% 48.01 will-it-scale.48.threads_idle
33569 -0.1% 33539 will-it-scale.per_thread_ops
1611363 -0.1% 1609891 will-it-scale.workload

=========================================================================================
compiler/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-10/defconfig/thread/100%/debian/dellr630/page_fault1/will-it-scale

commit:
5.13.0-rc1-mm1-max-order-11+
5.13.0-rc1-mm1-max-order-20+

5.13.0-rc1-mm1-m 5.13.0-rc1-mm1-max-order-20
---------------- ---------------------------
%stddev %change %stddev
\ | \
1776494 ± 3% -2.6% 1730693 will-it-scale.48.threads
43.36 ± 4% +0.5% 43.57 ± 2% will-it-scale.48.threads_idle
37010 ± 3% -2.6% 36055 will-it-scale.per_thread_ops
1776494 ± 3% -2.6% 1730693 will-it-scale.workload

=========================================================================================
compiler/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-10/defconfig/process/50%/debian/dellr630/page_fault3/will-it-scale

commit:
5.13.0-rc1-mm1-max-order-11+
5.13.0-rc1-mm1-max-order-20+

5.13.0-rc1-mm1-m 5.13.0-rc1-mm1-max-order-20
---------------- ---------------------------
%stddev %change %stddev
\ | \
15235214 -0.3% 15185167 will-it-scale.24.processes
49.63 -0.4% 49.45 will-it-scale.24.processes_idle
634800 -0.3% 632715 will-it-scale.per_process_ops
15235214 -0.3% 15185167 will-it-scale.workload

=========================================================================================
compiler/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-10/defconfig/process/50%/debian/dellr630/page_fault2/will-it-scale

commit:
5.13.0-rc1-mm1-max-order-11+
5.13.0-rc1-mm1-max-order-20+

5.13.0-rc1-mm1-m 5.13.0-rc1-mm1-max-order-20
---------------- ---------------------------
%stddev %change %stddev
\ | \
6700813 -0.6% 6662570 will-it-scale.24.processes
49.17 +0.0% 49.18 will-it-scale.24.processes_idle
279200 -0.6% 277606 will-it-scale.per_process_ops
6700813 -0.6% 6662570 will-it-scale.workload

=========================================================================================
compiler/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-10/defconfig/process/50%/debian/dellr630/page_fault1/will-it-scale

commit:
5.13.0-rc1-mm1-max-order-11+
5.13.0-rc1-mm1-max-order-20+

5.13.0-rc1-mm1-m 5.13.0-rc1-mm1-max-order-20
---------------- ---------------------------
%stddev %change %stddev
\ | \
8052059 -1.2% 7952172 will-it-scale.24.processes
49.48 -0.4% 49.29 will-it-scale.24.processes_idle
335502 -1.2% 331340 will-it-scale.per_process_ops
8052059 -1.2% 7952172 will-it-scale.workload

=========================================================================================
compiler/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-10/defconfig/process/16/debian/dellr630/page_fault3/will-it-scale

commit:
5.13.0-rc1-mm1-max-order-11+
5.13.0-rc1-mm1-max-order-20+

5.13.0-rc1-mm1-m 5.13.0-rc1-mm1-max-order-20
---------------- ---------------------------
%stddev %change %stddev
\ | \
10152559 +0.7% 10221240 will-it-scale.16.processes
66.10 -0.0% 66.09 will-it-scale.16.processes_idle
634534 +0.7% 638827 will-it-scale.per_process_ops
10152559 +0.7% 10221240 will-it-scale.workload

=========================================================================================
compiler/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-10/defconfig/process/16/debian/dellr630/page_fault2/will-it-scale

commit:
5.13.0-rc1-mm1-max-order-11+
5.13.0-rc1-mm1-max-order-20+

5.13.0-rc1-mm1-m 5.13.0-rc1-mm1-max-order-20
---------------- ---------------------------
%stddev %change %stddev
\ | \
4621434 -1.0% 4576959 will-it-scale.16.processes
66.14 -0.2% 65.98 will-it-scale.16.processes_idle
288839 -1.0% 286059 will-it-scale.per_process_ops
4621434 -1.0% 4576959 will-it-scale.workload

=========================================================================================
compiler/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-10/defconfig/process/16/debian/dellr630/page_fault1/will-it-scale

commit:
5.13.0-rc1-mm1-max-order-11+
5.13.0-rc1-mm1-max-order-20+

5.13.0-rc1-mm1-m 5.13.0-rc1-mm1-max-order-20
---------------- ---------------------------
%stddev %change %stddev
\ | \
5546153 -1.3% 5474778 will-it-scale.16.processes
66.02 -0.1% 65.98 will-it-scale.16.processes_idle
346634 -1.3% 342173 will-it-scale.per_process_ops
5546153 -1.3% 5474778 will-it-scale.workload

=========================================================================================
compiler/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-10/defconfig/process/100%/debian/dellr630/page_fault3/will-it-scale

commit:
5.13.0-rc1-mm1-max-order-11+
5.13.0-rc1-mm1-max-order-20+

5.13.0-rc1-mm1-m 5.13.0-rc1-mm1-max-order-20
---------------- ---------------------------
%stddev %change %stddev
\ | \
20575719 +0.4% 20651992 will-it-scale.48.processes
0.06 +5.6% 0.06 ± 7% will-it-scale.48.processes_idle
428660 +0.4% 430249 will-it-scale.per_process_ops
20575719 +0.4% 20651992 will-it-scale.workload

=========================================================================================
compiler/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-10/defconfig/process/100%/debian/dellr630/page_fault2/will-it-scale

commit:
5.13.0-rc1-mm1-max-order-11+
5.13.0-rc1-mm1-max-order-20+

5.13.0-rc1-mm1-m 5.13.0-rc1-mm1-max-order-20
---------------- ---------------------------
%stddev %change %stddev
\ | \
6984071 -1.1% 6906022 will-it-scale.48.processes
0.07 +4.8% 0.07 ± 6% will-it-scale.48.processes_idle
145501 -1.1% 143875 will-it-scale.per_process_ops
6984071 -1.1% 6906022 will-it-scale.workload

=========================================================================================
compiler/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-10/defconfig/process/100%/debian/dellr630/page_fault1/will-it-scale

commit:
5.13.0-rc1-mm1-max-order-11+
5.13.0-rc1-mm1-max-order-20+

5.13.0-rc1-mm1-m 5.13.0-rc1-mm1-max-order-20
---------------- ---------------------------
%stddev %change %stddev
\ | \
7527654 -1.7% 7399284 will-it-scale.48.processes
0.07 +0.0% 0.07 will-it-scale.48.processes_idle
156826 -1.7% 154151 will-it-scale.per_process_ops
7527654 -1.7% 7399284 will-it-scale.workload





Best Regards,
Yan, Zi


Attachments:
dellr630_2NUMA_24core_256GB_full_results.txt (68.74 kB)
signature.asc (871.00 B)
OpenPGP digital signature

2021-06-14 11:43:37

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

On 02.06.21 17:56, Zi Yan wrote:
> On 10 May 2021, at 10:36, Zi Yan wrote:
>
>> On 7 May 2021, at 10:00, David Hildenbrand wrote:
>>
>>> On 07.05.21 13:55, Michal Hocko wrote:
>>>> [I haven't read through respective patches due to lack of time but let
>>>> me comment on the general idea and the underlying justification]
>>>>
>>>> On Thu 06-05-21 17:31:09, David Hildenbrand wrote:
>>>>> On 06.05.21 17:26, Zi Yan wrote:
>>>>>> From: Zi Yan <[email protected]>
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> This patchset tries to remove the restriction on memory hotplug/hotremove
>>>>>> granularity, which is always greater or equal to memory section size[1].
>>>>>> With the patchset, kernel is able to online/offline memory at a size independent
>>>>>> of memory section size, as small as 2MB (the subsection size).
>>>>>
>>>>> ... which doesn't make any sense as we can only online/offline whole memory
>>>>> block devices.
>>>>
>>>> Agreed. The subsection thingy is just a hack to workaround pmem
>>>> alignement problems. For the real memory hotplug it is quite hard to
>>>> argue for reasonable hotplug scenarios for very small physical memory
>>>> ranges wrt. to the existing sparsemem memory model.
>>>>
>>>>>> The motivation is to increase MAX_ORDER of the buddy allocator and pageblock
>>>>>> size without increasing memory hotplug/hotremove granularity at the same time,
>>>>>
>>>>> Gah, no. Please no. No.
>>>>
>>>> Agreed. Those are completely independent concepts. MAX_ORDER can be
>>>> really arbitrary irrespective of the section size with the vmemmap sparse
>>>> model. The existing restriction is due to the old sparse model not being
>>>> able to do page pointer arithmetic across memory sections. Is there any
>>>> reason to stick with that memory model for an advanced feature you are
>>>> working on?
>>
>> No. I just want to increase MAX_ORDER. If the existing restriction can
>> be removed, that will be great.
>>
>>>
>>> I gave it some more thought yesterday. I guess the first thing we should look into is increasing MAX_ORDER and leaving pageblock_order and section size as is -- finding out what we have to tweak to get that up and running. Once we have that in place, we can actually look into better fragmentation avoidance etc. One step at a time.
>>
>> It makes sense to me.
>>
>>>
>>> Because that change itself might require some thought. Requiring that bigger MAX_ORDER depends on SPARSE_VMEMMAP is something reasonable to do.
>>
>> OK, if with SPARSE_VMEMMAP MAX_ORDER can be set to exceed the section size,
>> that is perfectly OK with me. 1GB THP support, which I want to add
>> ultimately, will require SPARSE_VMEMMAP anyway (otherwise,
>> every page++ would need to be changed to nth_page(page, 1)).
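
For reference, the distinction behind that parenthetical is roughly the
following; nth_page_sketch() is only an illustrative name mirroring what the
existing nth_page() helper does, not a definition copied from the headers.

/*
 * With a virtually contiguous vmemmap, plain struct page pointer
 * arithmetic is safe across section borders; with classic SPARSEMEM it
 * has to go back through the PFN.
 */
#ifdef CONFIG_SPARSEMEM_VMEMMAP
#define nth_page_sketch(page, n)	((page) + (n))
#else
#define nth_page_sketch(page, n)	pfn_to_page(page_to_pfn(page) + (n))
#endif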
>>
>>>
>>> As stated somewhere here already, we'll have to look into making alloc_contig_range() (and main users CMA and virtio-mem) independent of MAX_ORDER and mainly rely on pageblock_order. The current handling in alloc_contig_range() is far from optimal as we have to isolate a whole MAX_ORDER - 1 page -- and on ZONE_NORMAL we'll fail easily if any part contains something unmovable although we don't even want to allocate that part. I actually have that on my list (to be able to fully support pageblock_order instead of MAX_ORDER -1 chunks in virtio-mem), however didn't have time to look into it.
>>
>> So in your mind, for gigantic page allocation (> MAX_ORDER), alloc_contig_range()
>> should be used instead of the buddy allocator while pageblock_order is kept at a small
>> granularity like 2MB. Is that the case? Isn’t it going to have a high failure rate
>> when any of the pageblocks within a gigantic page range (like 1GB) becomes unmovable?
>> Are you thinking of an additional mechanism/policy to prevent such a thing from
>> happening as an additional step for gigantic page allocation? Like your ZONE_PREFER_MOVABLE idea?
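
For reference, a rough sketch of the alloc_contig_range() path being discussed;
try_alloc_gigantic() is only an illustrative wrapper, not part of this patchset
or of the existing hugetlb allocator.

#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * alloc_contig_range() internally widens isolation to MAX_ORDER - 1 /
 * pageblock granularity today, which is the limitation discussed above.
 */
static struct page *try_alloc_gigantic(unsigned long start_pfn,
				       unsigned long nr_pages)
{
	if (alloc_contig_range(start_pfn, start_pfn + nr_pages,
			       MIGRATE_MOVABLE, GFP_KERNEL | __GFP_NOWARN))
		return NULL;

	return pfn_to_page(start_pfn);
}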
>>
>>>
>>> Further, page onlining / offlining code and early init code most probably also needs care if MAX_ORDER - 1 crosses sections. Memory holes we might suddenly have in MAX_ORDER - 1 pages might become a problem and will have to be handled. Not sure which other code has to be tweaked (compaction? page isolation?).
>>
>> Can you elaborate it a little more? From what I understand, memory holes mean valid
>> PFNs are not contiguous before and after a hole, so pfn++ will not work, but
>> struct pages are still virtually contiguous assuming SPARSE_VMEMMAP, meaning page++
>> would still work. So when MAX_ORDER - 1 crosses sections, additional code would be
>> needed instead of simple pfn++. Is there anything I am missing?
>>
>> BTW, to test a system with memory holes, do you know if there is an easy way of
>> adding random memory holes to an x86_64 VM, which can help reveal potential
>> missing pieces in the code? Changing the BIOS-e820 table might be one way, but I
>> have no idea how to do it on QEMU.
>>
>>>
>>> Figuring out what needs care itself might take quite some effort.
>>>
>>> One thing I was thinking about as well: The bigger our MAX_ORDER, the slower it could be to allocate smaller pages. If we have 1G pages, splitting them down to 4k then takes 8 additional steps if I'm not wrong. Of course, that's the worst case. Would be interesting to evaluate.
>>
>> Sure. I am planning to check it too. As a simple start, I am going to run will-it-scale
>> benchmarks to see if there is any performance difference between different MAX_ORDERs.
>
> I ran vm-scalability and memory-related will-it-scale on a server with 256GB memory to
> see the impact of increasing MAX_ORDER and didn’t see much difference for most of
> the workloads like page_fault1, page_fault2, and page_fault3 from will-it-scale.
> But feel free to check the attached complete results and let me know what should be
> looked into. Thanks.

Right, for will-it-scale it looks like there are mostly minor
differences, although I am not sure if the results are really stable
(ranging from -6% to +6%). For vm-scalability the numbers seem to vary
even more (e.g., a stddev of ± 63%), so I have no idea how meaningful they
are. But I guess for these benchmarks, the net change won't really be
significant.

Thanks!

--
Thanks,

David / dhildenb