2018-11-16 10:14:07

by Oscar Salvador

Subject: [RFC PATCH 0/4] mm, memory_hotplug: allocate memmap from hotadded memory

Hi,

this patchset is based on Michal's patchset [1].
Patch#1, patch#2 and patch#4 are mostly the same.
They only needed small changes to adapt them to the current code stream,
so it seemed fair to keep them.

---------
Original cover:

This is another step to make memory hotplug more usable. The primary
goal of this patchset is to reduce the memory overhead of hot added
memory (at least for the SPARSE_VMEMMAP memory model). Currently we use
kmalloc to populate the memmap (struct page array), which has two main
drawbacks: a) it consumes additional memory until the hotadded memory
itself is onlined, and b) the memmap might end up on a different NUMA node,
which is especially true for the movable_node configuration.

a) is a problem especially for memory hotplug based memory "ballooning"
solutions, where the delay between physical memory hotplug and
onlining can lead to OOM; that led to the introduction of hacks like auto
onlining (see 31bc3858ea3e ("memory-hotplug: add automatic onlining
policy for the newly added memory")).
b) can have performance drawbacks.

One way to mitigate both issues is to simply allocate the memmap array
(which is the largest memory footprint of the physical memory hotplug)
from the hotadded memory itself. The VMEMMAP memory model allows us to map
any pfn range, so the memory doesn't need to be online to be usable
for the array. See patch 3 for more details. In short, I am reusing the
existing vmem_altmap which achieves the same thing for nvdimm
device memory.
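
Conceptually, this boils down to something like the following (a simplified
sketch only; variable names are illustrative and the real code lives in
patch 3):

        /* back the memmap of [start_pfn, start_pfn + nr_pages) from the range itself */
        struct vmem_altmap altmap = {
                .base_pfn = start_pfn,
                .free     = nr_pages,
        };

        /*
         * sparse_mem_map_populate()/vmemmap_populate() will then hand out
         * pages from the altmap instead of the page allocator, so no memory
         * outside the hot-added range is consumed for the struct page array.
         */
        memmap = sparse_mem_map_populate(pnum, nid, &altmap);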

There is also one potential drawback, though. If somebody uses memory
hotplug for 1G (gigantic) hugetlb pages then this scheme will obviously not
work for them, because each memory block will contain a reserved
area. Large x86 machines will use 2G memblocks, so at least one 1G page
will be available, but this is still not 2G...

I am not really sure anybody does that, or how reliably it can work
in practice. Nevertheless, I _believe_ that onlining more memory into
virtual machines is a much more common usecase. Anyway, if there ever is a
strong demand for such a usecase we have basically 3 options: a) enlarge
memory blocks even more, b) enhance the altmap allocation strategy and reuse
low memory sections to host memmaps of other sections on the same NUMA
node, or c) make the memmap allocation strategy configurable to fall back to
the current allocation.

---------

An old version of this patchset would blow up because we were clearing the
PMDs while we still had to reference pages backed by that memory.
I picked another approach which does not force us to touch arch-specific code
in that regard.

Overall design:

With the preface of:

1) Whenever you hot-add a range, that same range will later be hot-removed.
This is just because you can't remove half of a DIMM, in the same way you can't
remove half of a device in qemu.
A device/DIMM is added/removed as a whole.

2) Every add_memory()->add_memory_resource()->arch_add_memory()->__add_pages()
will use a new altmap because it is a different hot-added range.

3) When you hot-remove a range, the sections will be removed sequentially,
starting from the first section of the range and ending with the last one.

4) Hot-remove operations are protected by the hotplug lock, so no parallel
operations can take place.

The current design is as follows:

hot-remove operation)

- __kfree_section_memmap will be called for every section to be removed.
- We catch the first vmemmap page and pin it in a global variable.
- Further calls to __kfree_section_memmap decrease the refcount of
that vmemmap page without calling vmemmap_free().
We defer the call to vmemmap_free() until all sections are removed.
- If the refcount drops to 0, we know that we hit the last section.
- We clear the global variable.
- We call vmemmap_free() for [last_section, current_vmemmap_page).

In case we are hot-removing a range that used altmap, the call to
vmemmap_free must be done backwards, because the beginning of memory
is used for the pagetables.
Doing it this way, we ensure that by the time we remove the pagetables,
those pages will not have to be referenced anymore.
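
In code, the scheme looks roughly like this (simplified from the mm/sparse.c
changes in patch#3; the real version goes through a small helper instead of
calling page_ref_dec_and_test() directly):

        static struct page *current_vmemmap_page;

        static void __kfree_section_memmap(struct page *memmap,
                                           struct vmem_altmap *altmap)
        {
                unsigned long start = (unsigned long)memmap;
                unsigned long end = (unsigned long)(memmap + PAGES_PER_SECTION);

                /* first section of the range: pin its vmemmap page */
                if (PageVmemmap(memmap) && !current_vmemmap_page)
                        current_vmemmap_page = memmap;

                if (current_vmemmap_page) {
                        /* last section: free everything backwards in one go */
                        if (page_ref_dec_and_test(current_vmemmap_page)) {
                                free_vmemmap_range((unsigned long)current_vmemmap_page,
                                                   start, end);
                                current_vmemmap_page = NULL;
                        }
                        return;
                }

                /* range without vmemmap pages: regular path */
                vmemmap_free(start, end, altmap);
        }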

An example:

(qemu) object_add memory-backend-ram,id=ram0,size=10G
(qemu) device_add pc-dimm,id=dimm0,memdev=ram0,node=1

- This has added: ffffea0004000000 - ffffea000427ffc0 (refcount: 80)

When the refcount of ffffea0004000000 drops to 0, vmemmap_free()
will be called in this way:

vmemmap_free: start/end: ffffea000de00000 - ffffea000e000000
vmemmap_free: start/end: ffffea000dc00000 - ffffea000de00000
vmemmap_free: start/end: ffffea000da00000 - ffffea000dc00000
vmemmap_free: start/end: ffffea000d800000 - ffffea000da00000
vmemmap_free: start/end: ffffea000d600000 - ffffea000d800000
vmemmap_free: start/end: ffffea000d400000 - ffffea000d600000
vmemmap_free: start/end: ffffea000d200000 - ffffea000d400000
vmemmap_free: start/end: ffffea000d000000 - ffffea000d200000
vmemmap_free: start/end: ffffea000ce00000 - ffffea000d000000
vmemmap_free: start/end: ffffea000cc00000 - ffffea000ce00000
vmemmap_free: start/end: ffffea000ca00000 - ffffea000cc00000
vmemmap_free: start/end: ffffea000c800000 - ffffea000ca00000
vmemmap_free: start/end: ffffea000c600000 - ffffea000c800000
vmemmap_free: start/end: ffffea000c400000 - ffffea000c600000
vmemmap_free: start/end: ffffea000c200000 - ffffea000c400000
vmemmap_free: start/end: ffffea000c000000 - ffffea000c200000
vmemmap_free: start/end: ffffea000be00000 - ffffea000c000000
...
...
vmemmap_free: start/end: ffffea0004000000 - ffffea0004200000


[Testing]

- Tested only on x86_64.
- Several tests were carried out with memblocks of different sizes.
- Tests were performed adding memory ranges of different sizes,
from 512M to 60GB.

[Todo]
- Look into hotplug gigantic pages case

Before investing more effort, I would like to hear some opinions/thoughts/ideas.

[1] https://lore.kernel.org/lkml/[email protected]/

Michal Hocko (3):
mm, memory_hotplug: cleanup memory offline path
mm, memory_hotplug: provide a more generic restrictions for memory
hotplug
mm, sparse: rename kmalloc_section_memmap, __kfree_section_memmap

Oscar Salvador (1):
mm, memory_hotplug: allocate memmap from the added memory range for
sparse-vmemmap

arch/arm64/mm/mmu.c | 5 +-
arch/ia64/mm/init.c | 5 +-
arch/powerpc/mm/init_64.c | 2 +
arch/powerpc/mm/mem.c | 6 +-
arch/s390/mm/init.c | 12 +++-
arch/sh/mm/init.c | 6 +-
arch/x86/mm/init_32.c | 6 +-
arch/x86/mm/init_64.c | 17 ++++--
include/linux/memory_hotplug.h | 35 ++++++++---
include/linux/memremap.h | 65 +++++++++++++++++++-
include/linux/page-flags.h | 18 ++++++
kernel/memremap.c | 12 ++--
mm/compaction.c | 3 +
mm/hmm.c | 6 +-
mm/memory_hotplug.c | 133 ++++++++++++++++++++++++++++-------------
mm/page_alloc.c | 33 ++++++++--
mm/page_isolation.c | 13 +++-
mm/sparse.c | 62 ++++++++++++++++---
18 files changed, 345 insertions(+), 94 deletions(-)

--
2.13.6



2018-11-16 10:14:07

by Oscar Salvador

Subject: [RFC PATCH 4/4] mm, sparse: rename kmalloc_section_memmap, __kfree_section_memmap

From: Michal Hocko <[email protected]>

Both functions will use the altmap rather than kmalloc for sparsemem-vmemmap,
so rename them to alloc_section_memmap/free_section_memmap, which better
reflects the functionality.

Signed-off-by: Michal Hocko <[email protected]>
Signed-off-by: Oscar Salvador <[email protected]>
---
mm/sparse.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/sparse.c b/mm/sparse.c
index 29cbaa0e46c3..719853ef2e55 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -589,13 +589,13 @@ static void free_vmemmap_range(unsigned long limit, unsigned long start, unsigne
}
}

-static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid,
+static inline struct page *alloc_section_memmap(unsigned long pnum, int nid,
struct vmem_altmap *altmap)
{
/* This will make the necessary allocations eventually. */
return sparse_mem_map_populate(pnum, nid, altmap);
}
-static void __kfree_section_memmap(struct page *memmap,
+static void free_section_memmap(struct page *memmap,
struct vmem_altmap *altmap)
{
unsigned long start = (unsigned long)memmap;
@@ -646,13 +646,13 @@ static struct page *__kmalloc_section_memmap(void)
return ret;
}

-static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid,
+static inline struct page *alloc_section_memmap(unsigned long pnum, int nid,
struct vmem_altmap *altmap)
{
return __kmalloc_section_memmap();
}

-static void __kfree_section_memmap(struct page *memmap,
+static void free_section_memmap(struct page *memmap,
struct vmem_altmap *altmap)
{
if (is_vmalloc_addr(memmap))
@@ -718,12 +718,12 @@ int __meminit sparse_add_one_section(struct pglist_data *pgdat,
if (ret < 0 && ret != -EEXIST)
return ret;
ret = 0;
- memmap = kmalloc_section_memmap(section_nr, pgdat->node_id, altmap);
+ memmap = alloc_section_memmap(section_nr, pgdat->node_id, altmap);
if (!memmap)
return -ENOMEM;
usemap = __kmalloc_section_usemap();
if (!usemap) {
- __kfree_section_memmap(memmap, altmap);
+ free_section_memmap(memmap, altmap);
return -ENOMEM;
}

@@ -756,7 +756,7 @@ int __meminit sparse_add_one_section(struct pglist_data *pgdat,
pgdat_resize_unlock(pgdat, &flags);
if (ret < 0) {
kfree(usemap);
- __kfree_section_memmap(memmap, altmap);
+ free_section_memmap(memmap, altmap);
}
return ret;
}
@@ -798,7 +798,7 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap,
if (PageSlab(usemap_page) || PageCompound(usemap_page)) {
kfree(usemap);
if (memmap)
- __kfree_section_memmap(memmap, altmap);
+ free_section_memmap(memmap, altmap);
return;
}

--
2.13.6


2018-11-16 10:14:09

by Oscar Salvador

Subject: [RFC PATCH 2/4] mm, memory_hotplug: provide a more generic restrictions for memory hotplug

From: Michal Hocko <[email protected]>

arch_add_memory and __add_pages take a want_memblock parameter which controls
whether the newly added memory should get the sysfs memblock user API (e.g.
ZONE_DEVICE users do not want/need this interface). Some callers also
want to control where the memmap is allocated from, by configuring an
altmap. This is currently done quite clumsily by searching for the altmap
down in the memory hotplug code (to_vmem_altmap). It should be the caller
that provides the altmap down the call chain.

Add a more generic hotplug context for arch_add_memory and __add_pages.
struct mhp_restrictions contains flags which enable additional features
for the memory hotplug (currently only MHP_MEMBLOCK_API) and an altmap
for the alternative memmap allocator.

Please note that the complete altmap propagation down to vmemmap code
is still not done in this patch. It will be done in the follow up to
reduce the churn here.

This patch shouldn't introduce any functional change.
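
For illustration, callers then do something along these lines (a sketch based
on the hunks below; error handling omitted):

        struct mhp_restrictions restrictions = {};

        /* regular memory hotplug: we want the sysfs memblock interface */
        restrictions.flags = MHP_MEMBLOCK_API;
        ret = arch_add_memory(nid, start, size, &restrictions);

        /* ZONE_DEVICE (devm_memremap_pages): no memblock API, just our altmap */
        restrictions.flags = 0;
        restrictions.altmap = altmap;
        ret = arch_add_memory(nid, align_start, align_size, &restrictions);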

Signed-off-by: Michal Hocko <[email protected]>
Signed-off-by: Oscar Salvador <[email protected]>
---
arch/ia64/mm/init.c | 5 ++---
arch/powerpc/mm/mem.c | 6 +++---
arch/s390/mm/init.c | 6 +++---
arch/sh/mm/init.c | 6 +++---
arch/x86/mm/init_32.c | 6 +++---
arch/x86/mm/init_64.c | 10 +++++-----
include/linux/memory_hotplug.h | 27 ++++++++++++++++++++-------
kernel/memremap.c | 6 +++++-
mm/memory_hotplug.c | 10 ++++++----
9 files changed, 50 insertions(+), 32 deletions(-)

diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index d5e12ff1d73c..c0c6da053db8 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -645,14 +645,13 @@ mem_init (void)
}

#ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
- bool want_memblock)
+int arch_add_memory(int nid, u64 start, u64 size, struct mhp_restrictions *restrictions)
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
int ret;

- ret = __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+ ret = __add_pages(nid, start_pfn, nr_pages, restrictions);
if (ret)
printk("%s: Problem encountered in __add_pages() as ret=%d\n",
__func__, ret);
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 0a64fffabee1..2535471daad7 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -117,8 +117,8 @@ int __weak remove_section_mapping(unsigned long start, unsigned long end)
return -ENODEV;
}

-int __meminit arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
- bool want_memblock)
+int __meminit arch_add_memory(int nid, u64 start, u64 size,
+ struct mhp_restrictions *restrictions)
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
@@ -135,7 +135,7 @@ int __meminit arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *
}
flush_inval_dcache_range(start, start + size);

- return __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+ return __add_pages(nid, start_pfn, nr_pages, restrictions);
}

#ifdef CONFIG_MEMORY_HOTREMOVE
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index 76d0708438e9..4139affd6157 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -224,8 +224,8 @@ device_initcall(s390_cma_mem_init);

#endif /* CONFIG_CMA */

-int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
- bool want_memblock)
+int arch_add_memory(int nid, u64 start, u64 size,
+ struct mhp_restrictions *restrictions)
{
unsigned long start_pfn = PFN_DOWN(start);
unsigned long size_pages = PFN_DOWN(size);
@@ -235,7 +235,7 @@ int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
if (rc)
return rc;

- rc = __add_pages(nid, start_pfn, size_pages, altmap, want_memblock);
+ rc = __add_pages(nid, start_pfn, size_pages, restrictions);
if (rc)
vmem_remove_mapping(start, size);
return rc;
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index c8c13c777162..f0b489d6f73b 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -418,15 +418,15 @@ void free_initrd_mem(unsigned long start, unsigned long end)
#endif

#ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
- bool want_memblock)
+int arch_add_memory(int nid, u64 start, u64 size,
+ struct mhp_restrictions *restrictions)
{
unsigned long start_pfn = PFN_DOWN(start);
unsigned long nr_pages = size >> PAGE_SHIFT;
int ret;

/* We only have ZONE_NORMAL, so this is easy.. */
- ret = __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+ ret = __add_pages(nid, start_pfn, nr_pages, restrictions);
if (unlikely(ret))
printk("%s: Failed, __add_pages() == %d\n", __func__, ret);

diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 49ecf5ecf6d3..c47b33019dbc 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -850,13 +850,13 @@ void __init mem_init(void)
}

#ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
- bool want_memblock)
+int arch_add_memory(int nid, u64 start, u64 size,
+ struct mhp_restrictions *restrictions)
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;

- return __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+ return __add_pages(nid, start_pfn, nr_pages, restrictions);
}

#ifdef CONFIG_MEMORY_HOTREMOVE
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 5fab264948c2..fd06bcbd9535 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -783,11 +783,11 @@ static void update_end_of_memory_vars(u64 start, u64 size)
}

int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
- struct vmem_altmap *altmap, bool want_memblock)
+ struct mhp_restrictions *restrictions)
{
int ret;

- ret = __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+ ret = __add_pages(nid, start_pfn, nr_pages, restrictions);
WARN_ON_ONCE(ret);

/* update max_pfn, max_low_pfn and high_memory */
@@ -797,15 +797,15 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
return ret;
}

-int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
- bool want_memblock)
+int arch_add_memory(int nid, u64 start, u64 size,
+ struct mhp_restrictions *restrictions)
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;

init_memory_mapping(start, start + size);

- return add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+ return add_pages(nid, start_pfn, nr_pages, restrictions);
}

#define PAGE_INUSE 0xFD
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 1cb7054cdc03..7249dab00ac9 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -96,7 +96,7 @@ extern void __online_page_set_limits(struct page *page);
extern void __online_page_increment_counters(struct page *page);
extern void __online_page_free(struct page *page);

-extern int try_online_node(int nid);
+ extern int try_online_node(int nid);

extern bool memhp_auto_online;
/* If movable_node boot option specified */
@@ -113,20 +113,33 @@ extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
unsigned long nr_pages, struct vmem_altmap *altmap);
#endif /* CONFIG_MEMORY_HOTREMOVE */

+/*
+ * Do we want sysfs memblock files created. This will allow userspace to online
+ * and offline memory explicitly. Lack of this bit means that the caller has to
+ * call move_pfn_range_to_zone to finish the initialization.
+ */
+
+#define MHP_MEMBLOCK_API 1<<0
+
+/* Restrictions for the memory hotplug */
+struct mhp_restrictions {
+ unsigned long flags; /* MHP_ flags */
+ struct vmem_altmap *altmap; /* use this alternative allocatro for memmaps */
+};
+
/* reasonably generic interface to expand the physical pages */
extern int __add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
- struct vmem_altmap *altmap, bool want_memblock);
+ struct mhp_restrictions *restrictions);

#ifndef CONFIG_ARCH_HAS_ADD_PAGES
static inline int add_pages(int nid, unsigned long start_pfn,
- unsigned long nr_pages, struct vmem_altmap *altmap,
- bool want_memblock)
+ unsigned long nr_pages, struct mhp_restrictions *restrictions)
{
- return __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+ return __add_pages(nid, start_pfn, nr_pages, restrictions);
}
#else /* ARCH_HAS_ADD_PAGES */
int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
- struct vmem_altmap *altmap, bool want_memblock);
+ struct mhp_restrictions *restrictions);
#endif /* ARCH_HAS_ADD_PAGES */

#ifdef CONFIG_NUMA
@@ -328,7 +341,7 @@ extern int __add_memory(int nid, u64 start, u64 size);
extern int add_memory(int nid, u64 start, u64 size);
extern int add_memory_resource(int nid, struct resource *resource, bool online);
extern int arch_add_memory(int nid, u64 start, u64 size,
- struct vmem_altmap *altmap, bool want_memblock);
+ struct mhp_restrictions *restrictions);
extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
unsigned long nr_pages, struct vmem_altmap *altmap);
extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 9eced2cc9f94..248082bfea5c 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -143,6 +143,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
struct resource *res = &pgmap->res;
struct dev_pagemap *conflict_pgmap;
pgprot_t pgprot = PAGE_KERNEL;
+ struct mhp_restrictions restrictions = {};
int error, nid, is_ram;

align_start = res->start & ~(SECTION_SIZE - 1);
@@ -195,6 +196,9 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
if (error)
goto err_pfn_remap;

+ /* We do not want any optional features only our own memmap */
+ restrictions.altmap = altmap;
+
mem_hotplug_begin();
error = kasan_add_zero_shadow(__va(align_start), align_size);
if (error) {
@@ -202,7 +206,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
goto err_kasan;
}

- error = arch_add_memory(nid, align_start, align_size, altmap, false);
+ error = arch_add_memory(nid, align_start, align_size, &restrictions);
if (!error)
move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
align_start >> PAGE_SHIFT,
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d19e5f33e33b..8a97bda770c1 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -270,12 +270,12 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
* add the new pages.
*/
int __ref __add_pages(int nid, unsigned long phys_start_pfn,
- unsigned long nr_pages, struct vmem_altmap *altmap,
- bool want_memblock)
+ unsigned long nr_pages, struct mhp_restrictions *restrictions)
{
unsigned long i;
int err = 0;
int start_sec, end_sec;
+ struct vmem_altmap *altmap = restrictions->altmap;

/* during initialize mem_map, align hot-added range to section */
start_sec = pfn_to_section_nr(phys_start_pfn);
@@ -296,7 +296,7 @@ int __ref __add_pages(int nid, unsigned long phys_start_pfn,

for (i = start_sec; i <= end_sec; i++) {
err = __add_section(nid, section_nr_to_pfn(i), altmap,
- want_memblock);
+ restrictions->flags & MHP_MEMBLOCK_API);

/*
* EEXIST is finally dealt with by ioresource collision
@@ -1100,6 +1100,7 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
u64 start, size;
bool new_node = false;
int ret;
+ struct mhp_restrictions restrictions = {};

start = res->start;
size = resource_size(res);
@@ -1124,7 +1125,8 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
new_node = ret;

/* call arch's memory hotadd */
- ret = arch_add_memory(nid, start, size, NULL, true);
+ restrictions.flags = MHP_MEMBLOCK_API;
+ ret = arch_add_memory(nid, start, size, &restrictions);
if (ret < 0)
goto error;

--
2.13.6


2018-11-16 10:14:11

by Oscar Salvador

Subject: [RFC PATCH 3/4] mm, memory_hotplug: allocate memmap from the added memory range for sparse-vmemmap

From: Oscar Salvador <[email protected]>

Physical memory hotadd has to allocate a memmap (struct page array) for
the newly added memory section. Currently, kmalloc is used for those
allocations.

This has some disadvantages:
a) existing memory is consumed for that purpose (~2MB per 128MB memory section)
b) if the whole node is movable then we have off-node struct pages
which have performance drawbacks.

a) has turned out to be a problem for memory hotplug based ballooning
because userspace might not react in time to online memory while
the memmap allocated during physical hotadd consumes enough memory to push
the system to OOM. 31bc3858ea3e ("memory-hotplug: add automatic onlining
policy for the newly added memory") has been added to work around that
problem.

We can do much better when CONFIG_SPARSEMEM_VMEMMAP=y because vmemmap
page tables can map arbitrary memory. That means that we can simply
use the beginning of each memory section and map struct pages there.
The struct pages which back the allocated space then just need to be
treated carefully.

Add {__Set,__Clear}PageVmemmap helpers to distinguish those pages in pfn
walkers. We do not have any spare page flag for this purpose, so we use the
combination of the PageReserved bit, which already tells the core mm code
that the page should be ignored, and storing VMEMMAP_PAGE (which sets all
bits but PAGE_MAPPING_FLAGS) into page->mapping.

On the memory hotplug front, add a new MHP_MEMMAP_FROM_RANGE restriction
flag. The user is supposed to set this flag if the memmap should be allocated
from the hotadded range. Please note that this is just a hint and
architecture code can veto it if it cannot be supported. E.g. s390
cannot support this currently because the physical memory range is made
accessible only during memory online.
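
For example (taken from the hunks below):

        /* the generic hotplug path (add_memory_resource) asks for it ... */
        restrictions.flags = MHP_MEMBLOCK_API | MHP_MEMMAP_FROM_RANGE;

        /*
         * ... while s390's arch_add_memory() vetoes the hint, because the
         * range is not accessible before memory online.
         */
        restrictions->flags &= ~MHP_MEMMAP_FROM_RANGE;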

Implementation wise, we reuse the vmem_altmap infrastructure to override
the default allocator used by __vmemmap_populate. Once the memmap is
allocated, we need a way to mark the altmap pfns used for the allocation,
and this is done by a new vmem_altmap::flush_alloc_pfns callback.
The mark_vmemmap_pages implementation then simply marks all struct pages
backing those pfns with __SetPageVmemmap.
The callback is called from sparse_add_one_section.

mark_vmemmap_pages takes care of marking the pages as PageVmemmap and of
increasing the refcount of the first vmemmap page.
This is how we know how long we have to defer the call to vmemmap_free().
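On x86_64, for example, a memory section is 128MB, so hot-adding 10GB means
10240MB / 128MB = 80 sections; hence the first vmemmap page ends up with a
refcount of 80 (the "refcount: 80" shown in the cover letter example).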

We also have to be careful about those pages during online and offline
operations. They are simply skipped now, so online will keep them
reserved and thus unusable for any other purpose, and offline ignores them
so they do not block the offline operation.

When hot-removing the range, since sections are removed sequentially
starting from the first one and moving on, __kfree_section_memmap will
catch the first vmemmap page and take its reference count.
In this way, __kfree_section_memmap knows how long it has to defer
the call to vmemmap_free().

In case we are hot-removing a range that used an altmap, the call to
vmemmap_free must be done backwards, because the beginning of the memory
is used for the pagetables.
Doing it this way, we ensure that by the time we remove the pagetables,
those pages will no longer have to be referenced.

Please note that only the memory hotplug is currently using this
allocation scheme. The boot time memmap allocation could use the same
trick as well but this is not done yet.

Signed-off-by: Oscar Salvador <[email protected]>
---
arch/arm64/mm/mmu.c | 5 ++-
arch/powerpc/mm/init_64.c | 2 ++
arch/s390/mm/init.c | 6 ++++
arch/x86/mm/init_64.c | 7 ++++
include/linux/memory_hotplug.h | 8 ++++-
include/linux/memremap.h | 65 +++++++++++++++++++++++++++++++--
include/linux/page-flags.h | 18 ++++++++++
kernel/memremap.c | 6 ----
mm/compaction.c | 3 ++
mm/hmm.c | 6 ++--
mm/memory_hotplug.c | 81 +++++++++++++++++++++++++++++++++++++++---
mm/page_alloc.c | 22 ++++++++++--
mm/page_isolation.c | 13 ++++++-
mm/sparse.c | 46 ++++++++++++++++++++++++
14 files changed, 268 insertions(+), 20 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 394b8d554def..8fa6e2ade5be 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -733,7 +733,10 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
if (pmd_none(READ_ONCE(*pmdp))) {
void *p = NULL;

- p = vmemmap_alloc_block_buf(PMD_SIZE, node);
+ if (altmap)
+ p = altmap_alloc_block_buf(PMD_SIZE, altmap);
+ else
+ p = vmemmap_alloc_block_buf(PMD_SIZE, node);
if (!p)
return -ENOMEM;

diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 7a9886f98b0c..03f014abd4eb 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -278,6 +278,8 @@ void __ref vmemmap_free(unsigned long start, unsigned long end,
continue;

page = pfn_to_page(addr >> PAGE_SHIFT);
+ if (PageVmemmap(page))
+ continue;
section_base = pfn_to_page(vmemmap_section_start(start));
nr_pages = 1 << page_order;

diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index 4139affd6157..bc1523bcb09d 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -231,6 +231,12 @@ int arch_add_memory(int nid, u64 start, u64 size,
unsigned long size_pages = PFN_DOWN(size);
int rc;

+ /*
+ * Physical memory is added only later during the memory online so we
+ * cannot use the added range at this stage unfortunatelly.
+ */
+ restrictions->flags &= ~MHP_MEMMAP_FROM_RANGE;
+
rc = vmem_add_mapping(start, size);
if (rc)
return rc;
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index fd06bcbd9535..d5234ca5c483 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -815,6 +815,13 @@ static void __meminit free_pagetable(struct page *page, int order)
unsigned long magic;
unsigned int nr_pages = 1 << order;

+ /*
+ * runtime vmemmap pages are residing inside the memory section so
+ * they do not have to be freed anywhere.
+ */
+ if (PageVmemmap(page))
+ return;
+
/* bootmem page has reserved flag */
if (PageReserved(page)) {
__ClearPageReserved(page);
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 7249dab00ac9..244e0e2c030c 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -118,9 +118,15 @@ extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
* and offline memory explicitly. Lack of this bit means that the caller has to
* call move_pfn_range_to_zone to finish the initialization.
*/
-
#define MHP_MEMBLOCK_API 1<<0

+/*
+ * Do we want memmap (struct page array) allocated from the hotadded range.
+ * Please note that only SPARSE_VMEMMAP implements this feauture and some
+ * architectures might not support it even for that memory model (e.g. s390)
+ */
+#define MHP_MEMMAP_FROM_RANGE 1<<1
+
/* Restrictions for the memory hotplug */
struct mhp_restrictions {
unsigned long flags; /* MHP_ flags */
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 0ac69ddf5fc4..863f339224e6 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -5,6 +5,8 @@
#include <linux/percpu-refcount.h>

#include <asm/pgtable.h>
+#include <linux/page-flags.h>
+#include <linux/page_ref.h>

struct resource;
struct device;
@@ -16,13 +18,18 @@ struct device;
* @free: free pages set aside in the mapping for memmap storage
* @align: pages reserved to meet allocation alignments
* @alloc: track pages consumed, private to vmemmap_populate()
+ * @flush_alloc_pfns: callback to be called on the allocated range after it
+ * @nr_sects: nr of sects filled with memmap allocations
+ * is mapped to the vmemmap - see mark_vmemmap_pages
*/
struct vmem_altmap {
- const unsigned long base_pfn;
+ unsigned long base_pfn;
const unsigned long reserve;
unsigned long free;
unsigned long align;
unsigned long alloc;
+ int nr_sects;
+ void (*flush_alloc_pfns)(struct vmem_altmap *self);
};

/*
@@ -133,8 +140,62 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
struct dev_pagemap *pgmap);

-unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
+static inline unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
+{
+ /* number of pfns from base where pfn_to_page() is valid */
+ return altmap->reserve + altmap->free;
+}
+
void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
+
+static inline void mark_vmemmap_pages(struct vmem_altmap *self)
+{
+ unsigned long pfn = self->base_pfn + self->reserve;
+ unsigned long nr_pages = self->alloc;
+ unsigned long align = PAGES_PER_SECTION * sizeof(struct page);
+ struct page *head;
+ unsigned long i;
+
+ pr_debug("%s: marking %px - %px as Vmemmap\n", __func__,
+ pfn_to_page(pfn),
+ pfn_to_page(pfn + nr_pages - 1));
+ /*
+ * We keep track of the sections using this altmap by means
+ * of a refcount, so we know how much do we have to defer
+ * the call to vmemmap_free for this memory range.
+ * The refcount is kept in the first vmemmap page.
+ * For example:
+ * We add 10GB: (ffffea0004000000 - ffffea000427ffc0)
+ * ffffea0004000000 will have a refcount of 80.
+ */
+ head = (struct page *)ALIGN_DOWN((unsigned long)pfn_to_page(pfn), align);
+ head = (struct page *)((unsigned long)head - (align * self->nr_sects));
+ page_ref_inc(head);
+
+ /*
+ * We have used a/another section only with memmap allocations.
+ * We need to keep track of it in order to get the first head page
+ * to increase its refcount.
+ * This makes it easier to compute.
+ */
+ if (!((page_ref_count(head) * PAGES_PER_SECTION) % align))
+ self->nr_sects++;
+
+ /*
+ * All allocations for the memory hotplug are the same sized so align
+ * should be 0
+ */
+ WARN_ON(self->align);
+ for (i = 0; i < nr_pages; i++, pfn++) {
+ struct page *page = pfn_to_page(pfn);
+ __SetPageVmemmap(page);
+ init_page_count(page);
+ }
+
+ self->alloc = 0;
+ self->base_pfn += nr_pages + self->reserve;
+ self->free -= nr_pages;
+}
#else
static inline void *devm_memremap_pages(struct device *dev,
struct dev_pagemap *pgmap)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 50ce1bddaf56..e79054fcc96e 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -437,6 +437,24 @@ static __always_inline int __PageMovable(struct page *page)
PAGE_MAPPING_MOVABLE;
}

+#define VMEMMAP_PAGE ~PAGE_MAPPING_FLAGS
+static __always_inline int PageVmemmap(struct page *page)
+{
+ return PageReserved(page) && (unsigned long)page->mapping == VMEMMAP_PAGE;
+}
+
+static __always_inline void __ClearPageVmemmap(struct page *page)
+{
+ ClearPageReserved(page);
+ page->mapping = NULL;
+}
+
+static __always_inline void __SetPageVmemmap(struct page *page)
+{
+ SetPageReserved(page);
+ page->mapping = (void *)VMEMMAP_PAGE;
+}
+
#ifdef CONFIG_KSM
/*
* A KSM page is one of those write-protected "shared pages" or "merged pages"
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 248082bfea5c..10f5bd912780 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -239,12 +239,6 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
}
EXPORT_SYMBOL(devm_memremap_pages);

-unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
-{
- /* number of pfns from base where pfn_to_page() is valid */
- return altmap->reserve + altmap->free;
-}
-
void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns)
{
altmap->alloc -= nr_pfns;
diff --git a/mm/compaction.c b/mm/compaction.c
index 7c607479de4a..c94a480e01b5 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -768,6 +768,9 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,

page = pfn_to_page(low_pfn);

+ if (PageVmemmap(page))
+ goto isolate_fail;
+
if (!valid_page)
valid_page = page;

diff --git a/mm/hmm.c b/mm/hmm.c
index 90c34f3d1243..ec1e8d97a3ce 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -1072,6 +1072,7 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
resource_size_t key, align_start, align_size, align_end;
struct device *device = devmem->device;
int ret, nid, is_ram;
+ struct mhp_restrictions restrictions = {};

align_start = devmem->resource->start & ~(PA_SECTION_SIZE - 1);
align_size = ALIGN(devmem->resource->start +
@@ -1142,11 +1143,10 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
* want the linear mapping and thus use arch_add_memory().
*/
if (devmem->pagemap.type == MEMORY_DEVICE_PUBLIC)
- ret = arch_add_memory(nid, align_start, align_size, NULL,
- false);
+ ret = arch_add_memory(nid, align_start, align_size, &restrictions);
else
ret = add_pages(nid, align_start >> PAGE_SHIFT,
- align_size >> PAGE_SHIFT, NULL, false);
+ align_size >> PAGE_SHIFT, &restrictions);
if (ret) {
mem_hotplug_done();
goto error_add_memory;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 8a97bda770c1..5a53d29b4101 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -276,11 +276,22 @@ int __ref __add_pages(int nid, unsigned long phys_start_pfn,
int err = 0;
int start_sec, end_sec;
struct vmem_altmap *altmap = restrictions->altmap;
+ struct vmem_altmap __memblk_altmap;

/* during initialize mem_map, align hot-added range to section */
start_sec = pfn_to_section_nr(phys_start_pfn);
end_sec = pfn_to_section_nr(phys_start_pfn + nr_pages - 1);

+ if(!altmap && (restrictions->flags & MHP_MEMMAP_FROM_RANGE)) {
+ __memblk_altmap.base_pfn = phys_start_pfn;
+ __memblk_altmap.alloc = 0;
+ __memblk_altmap.align = 0;
+ __memblk_altmap.free = nr_pages;
+ __memblk_altmap.nr_sects = 0;
+ __memblk_altmap.flush_alloc_pfns = mark_vmemmap_pages;
+ altmap = &__memblk_altmap;
+ }
+
if (altmap) {
/*
* Validate altmap is within bounds of the total request
@@ -685,13 +696,72 @@ static int online_pages_blocks(unsigned long start, unsigned long nr_pages)
return onlined_pages;
}

+static int __online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
+{
+ if (PageReserved(pfn_to_page(start_pfn)))
+ return online_pages_blocks(start_pfn, nr_pages);
+
+ return 0;
+}
+
+
static int online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
- void *arg)
+ void *arg)
{
unsigned long onlined_pages = *(unsigned long *)arg;
+ unsigned long pfn = start_pfn;
+ unsigned long end_pfn = start_pfn + nr_pages;
+ bool vmemmap_page = false;

- if (PageReserved(pfn_to_page(start_pfn)))
- onlined_pages = online_pages_blocks(start_pfn, nr_pages);
+ for (; pfn < end_pfn; pfn++) {
+ struct page *p = pfn_to_page(pfn);
+
+ /*
+ * Let us check if we got vmemmap pages.
+ */
+ if (PageVmemmap(p)) {
+ vmemmap_page = true;
+ break;
+ }
+ }
+
+ if (!vmemmap_page) {
+ /*
+ * We can send the whole range to __online_pages_range,
+ * as we know for sure that there are not vmemmap pages.
+ */
+ onlined_pages += __online_pages_range(start_pfn, nr_pages);
+ } else {
+ /*
+ * We need to strip the vmemmap pages here,
+ * as we do not want to send them to the buddy allocator.
+ */
+ unsigned long sections = nr_pages / PAGES_PER_SECTION;
+ unsigned long sect_nr = 0;
+
+ for (; sect_nr < sections; sect_nr++) {
+ unsigned pfn_end_section;
+ unsigned long memmap_pages = 0;
+
+ pfn = start_pfn + (PAGES_PER_SECTION * sect_nr);
+ pfn_end_section = pfn + PAGES_PER_SECTION;
+
+ while (pfn < pfn_end_section) {
+ struct page *p = pfn_to_page(pfn);
+
+ if (PageVmemmap(p))
+ memmap_pages++;
+ pfn++;
+ }
+ pfn = start_pfn + (PAGES_PER_SECTION * sect_nr);
+ if (!memmap_pages) {
+ onlined_pages += __online_pages_range(pfn, PAGES_PER_SECTION);
+ } else if (memmap_pages && memmap_pages < PAGES_PER_SECTION) {
+ pfn += memmap_pages;
+ onlined_pages += __online_pages_range(pfn, PAGES_PER_SECTION - memmap_pages);
+ }
+ }
+ }

online_mem_sections(start_pfn, start_pfn + nr_pages);

@@ -1125,7 +1195,7 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
new_node = ret;

/* call arch's memory hotadd */
- restrictions.flags = MHP_MEMBLOCK_API;
+ restrictions.flags = MHP_MEMBLOCK_API|MHP_MEMMAP_FROM_RANGE;
ret = arch_add_memory(nid, start, size, &restrictions);
if (ret < 0)
goto error;
@@ -1374,6 +1444,9 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
continue;
page = pfn_to_page(pfn);

+ if (PageVmemmap(page))
+ continue;
+
if (PageHuge(page)) {
struct page *head = compound_head(page);
pfn = page_to_pfn(head) + (1<<compound_order(head)) - 1;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3d417c724551..2b236f4ab0fe 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -926,6 +926,9 @@ static void free_pages_check_bad(struct page *page)

static inline int free_pages_check(struct page *page)
{
+ if(PageVmemmap(page))
+ return 0;
+
if (likely(page_expected_state(page, PAGE_FLAGS_CHECK_AT_FREE)))
return 0;

@@ -1178,9 +1181,11 @@ static void free_one_page(struct zone *zone,
static void __meminit __init_single_page(struct page *page, unsigned long pfn,
unsigned long zone, int nid)
{
- mm_zero_struct_page(page);
+ if (!PageVmemmap(page))
+ mm_zero_struct_page(page);
set_page_links(page, zone, nid, pfn);
- init_page_count(page);
+ if (!PageVmemmap(page))
+ init_page_count(page);
page_mapcount_reset(page);
page_cpupid_reset_last(page);

@@ -7781,6 +7786,9 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,

page = pfn_to_page(check);

+ if(PageVmemmap(page))
+ continue;
+
if (PageReserved(page))
goto unmovable;

@@ -8138,6 +8146,16 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
continue;
}
page = pfn_to_page(pfn);
+
+ /*
+ * vmemmap pages are residing inside the memory section so
+ * they do not have to be freed anywhere.
+ */
+ if (PageVmemmap(page)) {
+ pfn++;
+ continue;
+ }
+
/*
* The HWPoisoned page may be not in buddy system, and
* page_count() is not 0.
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 43e085608846..0991548b7ab5 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -259,6 +259,11 @@ __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn,
continue;
}
page = pfn_to_page(pfn);
+ if (PageVmemmap(page)) {
+ pfn++;
+ continue;
+ }
+
if (PageBuddy(page))
/*
* If the page is on a free list, it has to be on
@@ -289,10 +294,16 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
* are not aligned to pageblock_nr_pages.
* Then we just check migratetype first.
*/
- for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
+ for (pfn = start_pfn; pfn < end_pfn;) {
page = __first_valid_page(pfn, pageblock_nr_pages);
+ if (PageVmemmap(page)) {
+ pfn++;
+ continue;
+ }
+
if (page && !is_migrate_isolate_page(page))
break;
+ pfn += pageblock_nr_pages;
}
page = __first_valid_page(start_pfn, end_pfn - start_pfn);
if ((pfn < end_pfn) || !page)
diff --git a/mm/sparse.c b/mm/sparse.c
index 33307fc05c4d..29cbaa0e46c3 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -563,6 +563,32 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
#endif

#ifdef CONFIG_SPARSEMEM_VMEMMAP
+
+static struct page *current_vmemmap_page = NULL;
+static bool vmemmap_dec_and_test(void)
+{
+ bool ret = false;
+
+ if (page_ref_dec_and_test(current_vmemmap_page))
+ ret = true;
+ return ret;
+}
+
+static void free_vmemmap_range(unsigned long limit, unsigned long start, unsigned long end)
+{
+ unsigned long range_start;
+ unsigned long range_end;
+ unsigned long align = sizeof(struct page) * PAGES_PER_SECTION;
+
+ range_end = end;
+ range_start = end - align;
+ while (range_start >= limit) {
+ vmemmap_free(range_start, range_end, NULL);
+ range_end = range_start;
+ range_start -= align;
+ }
+}
+
static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid,
struct vmem_altmap *altmap)
{
@@ -575,6 +601,18 @@ static void __kfree_section_memmap(struct page *memmap,
unsigned long start = (unsigned long)memmap;
unsigned long end = (unsigned long)(memmap + PAGES_PER_SECTION);

+ if (PageVmemmap(memmap) && !current_vmemmap_page)
+ current_vmemmap_page = memmap;
+
+ if (current_vmemmap_page) {
+ if (vmemmap_dec_and_test()) {
+ unsigned long start_vmemmap = (unsigned long)current_vmemmap_page;
+ free_vmemmap_range(start_vmemmap, start, end);
+ current_vmemmap_page = NULL;
+ }
+ return;
+ }
+
vmemmap_free(start, end, altmap);
}
#ifdef CONFIG_MEMORY_HOTREMOVE
@@ -703,6 +741,14 @@ int __meminit sparse_add_one_section(struct pglist_data *pgdat,
*/
page_init_poison(memmap, sizeof(struct page) * PAGES_PER_SECTION);

+ /*
+ * now that we have a valid vmemmap mapping we can use
+ * pfn_to_page and flush struct pages which back the
+ * memmap
+ */
+ if (altmap && altmap->flush_alloc_pfns)
+ altmap->flush_alloc_pfns(altmap);
+
section_mark_present(ms);
sparse_init_one_section(ms, section_nr, memmap, usemap);

--
2.13.6


2018-11-16 10:16:13

by Oscar Salvador

Subject: [RFC PATCH 1/4] mm, memory_hotplug: cleanup memory offline path

From: Michal Hocko <[email protected]>

check_pages_isolated_cb currently accounts the whole pfn range as being
offlined if test_pages_isolated succeeds on the range. This is based on
the assumption that all pages in the range are freed, which is currently
the case in most cases, but it won't be with later changes. I haven't
double checked, but if the range contains invalid pfns we could
theoretically over-account and underflow the zone's managed pages.

Move the offlined pages counting to offline_isolated_pages_cb and
rely on __offline_isolated_pages to return the correct value.
check_pages_isolated_cb will still do its primary job and check the pfn
range.

While we are at it, remove check_pages_isolated and offline_isolated_pages
and use walk_system_ram_range directly, as is done in online_pages.
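
The resulting flow in __offline_pages then looks roughly like this (a sketch
of the hunks below):

        unsigned long offlined_pages = 0;

        /* check again: if anything is still not isolated, try again */
        if (walk_system_ram_range(start_pfn, end_pfn - start_pfn, NULL,
                                  check_pages_isolated_cb))
                goto repeat;

        /* each callback adds __offline_isolated_pages()'s return value */
        walk_system_ram_range(start_pfn, end_pfn - start_pfn, &offlined_pages,
                              offline_isolated_pages_cb);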

Signed-off-by: Michal Hocko <[email protected]>
Signed-off-by: Oscar Salvador <[email protected]>
---
include/linux/memory_hotplug.h | 2 +-
mm/memory_hotplug.c | 44 +++++++++++-------------------------------
mm/page_alloc.c | 11 +++++++++--
3 files changed, 21 insertions(+), 36 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 84e9ae205930..1cb7054cdc03 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -85,7 +85,7 @@ extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
extern int online_pages(unsigned long, unsigned long, int);
extern int test_pages_in_a_zone(unsigned long start_pfn, unsigned long end_pfn,
unsigned long *valid_start, unsigned long *valid_end);
-extern void __offline_isolated_pages(unsigned long, unsigned long);
+extern unsigned long __offline_isolated_pages(unsigned long, unsigned long);

typedef int (*online_page_callback_t)(struct page *page, unsigned int order);

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index dbbb94547ad0..d19e5f33e33b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1442,17 +1442,12 @@ static int
offline_isolated_pages_cb(unsigned long start, unsigned long nr_pages,
void *data)
{
- __offline_isolated_pages(start, start + nr_pages);
+ unsigned long offlined_pages;
+ offlined_pages = __offline_isolated_pages(start, start + nr_pages);
+ *(unsigned long *)data += offlined_pages;
return 0;
}

-static void
-offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
-{
- walk_system_ram_range(start_pfn, end_pfn - start_pfn, NULL,
- offline_isolated_pages_cb);
-}
-
/*
* Check all pages in range, recoreded as memory resource, are isolated.
*/
@@ -1460,26 +1455,7 @@ static int
check_pages_isolated_cb(unsigned long start_pfn, unsigned long nr_pages,
void *data)
{
- int ret;
- long offlined = *(long *)data;
- ret = test_pages_isolated(start_pfn, start_pfn + nr_pages, true);
- offlined = nr_pages;
- if (!ret)
- *(long *)data += offlined;
- return ret;
-}
-
-static long
-check_pages_isolated(unsigned long start_pfn, unsigned long end_pfn)
-{
- long offlined = 0;
- int ret;
-
- ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn, &offlined,
- check_pages_isolated_cb);
- if (ret < 0)
- offlined = (long)ret;
- return offlined;
+ return test_pages_isolated(start_pfn, start_pfn + nr_pages, true);
}

static int __init cmdline_parse_movable_node(char *p)
@@ -1564,7 +1540,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
unsigned long end_pfn)
{
unsigned long pfn, nr_pages;
- long offlined_pages;
+ unsigned long offlined_pages = 0;
int ret, node;
unsigned long flags;
unsigned long valid_start, valid_end;
@@ -1633,13 +1609,15 @@ static int __ref __offline_pages(unsigned long start_pfn,
if (ret)
goto failed_removal;
/* check again */
- offlined_pages = check_pages_isolated(start_pfn, end_pfn);
- if (offlined_pages < 0)
+ if (walk_system_ram_range(start_pfn, end_pfn - start_pfn, NULL,
+ check_pages_isolated_cb)) {
goto repeat;
- pr_info("Offlined Pages %ld\n", offlined_pages);
+ }
/* Ok, all of our target is isolated.
We cannot do rollback at this point. */
- offline_isolated_pages(start_pfn, end_pfn);
+ walk_system_ram_range(start_pfn, end_pfn - start_pfn, &offlined_pages,
+ offline_isolated_pages_cb);
+ pr_info("Offlined Pages %ld\n", offlined_pages);
/* reset pagetype flags and makes migrate type to be MOVABLE */
undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
/* removal success */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ae31839874b8..3d417c724551 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8111,7 +8111,7 @@ void zone_pcp_reset(struct zone *zone)
* All pages in the range must be in a single zone and isolated
* before calling this.
*/
-void
+unsigned long
__offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
{
struct page *page;
@@ -8119,12 +8119,15 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
unsigned int order, i;
unsigned long pfn;
unsigned long flags;
+ unsigned long offlined_pages = 0;
+
/* find the first valid pfn */
for (pfn = start_pfn; pfn < end_pfn; pfn++)
if (pfn_valid(pfn))
break;
if (pfn == end_pfn)
- return;
+ return offlined_pages;
+
offline_mem_sections(pfn, end_pfn);
zone = page_zone(pfn_to_page(pfn));
spin_lock_irqsave(&zone->lock, flags);
@@ -8142,12 +8145,14 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
if (unlikely(!PageBuddy(page) && PageHWPoison(page))) {
pfn++;
SetPageReserved(page);
+ offlined_pages++;
continue;
}

BUG_ON(page_count(page));
BUG_ON(!PageBuddy(page));
order = page_order(page);
+ offlined_pages += 1 << order;
#ifdef CONFIG_DEBUG_VM
pr_info("remove from free list %lx %d %lx\n",
pfn, 1 << order, end_pfn);
@@ -8160,6 +8165,8 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
pfn += (1 << order);
}
spin_unlock_irqrestore(&zone->lock, flags);
+
+ return offlined_pages;
}
#endif

--
2.13.6


2018-11-16 21:41:33

by Dave Hansen

Subject: Re: [RFC PATCH 2/4] mm, memory_hotplug: provide a more generic restrictions for memory hotplug

On 11/16/18 2:12 AM, Oscar Salvador wrote:
> +/*
> + * Do we want sysfs memblock files created. This will allow userspace to online
> + * and offline memory explicitly. Lack of this bit means that the caller has to
> + * call move_pfn_range_to_zone to finish the initialization.
> + */
> +
> +#define MHP_MEMBLOCK_API 1<<0
> +
> +/* Restrictions for the memory hotplug */
> +struct mhp_restrictions {
> + unsigned long flags; /* MHP_ flags */
> + struct vmem_altmap *altmap; /* use this alternative allocatro for memmaps */

"allocatro" -> "allocator"

2018-11-16 22:43:38

by Dave Hansen

Subject: Re: [RFC PATCH 3/4] mm, memory_hotplug: allocate memmap from the added memory range for sparse-vmemmap

On 11/16/18 2:12 AM, Oscar Salvador wrote:
> Physical memory hotadd has to allocate a memmap (struct page array) for
> the newly added memory section. Currently, kmalloc is used for those
> allocations.

Did you literally mean kmalloc? I thought we had a bunch of ways of
allocating memmaps, but I didn't think kmalloc() was actually used.

Like vmemmap_alloc_block(), for instance, uses alloc_pages_node().

So, can the ZONE_DEVICE altmaps move over to this infrastructure?
Doesn't this effectively duplicate that code?

...
> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> index 7a9886f98b0c..03f014abd4eb 100644
> --- a/arch/powerpc/mm/init_64.c
> +++ b/arch/powerpc/mm/init_64.c
> @@ -278,6 +278,8 @@ void __ref vmemmap_free(unsigned long start, unsigned long end,
> continue;
>
> page = pfn_to_page(addr >> PAGE_SHIFT);
> + if (PageVmemmap(page))
> + continue;
> section_base = pfn_to_page(vmemmap_section_start(start));
> nr_pages = 1 << page_order;

Reading this, I'm wondering if PageVmemmap() could be named better.
From this it reads like "skip PageVmemmap() pages if freeing vmemmap",
which does not make much sense.

This probably at _least_ needs a comment to explain why the pages are
being skipped.

> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> index 4139affd6157..bc1523bcb09d 100644
> --- a/arch/s390/mm/init.c
> +++ b/arch/s390/mm/init.c
> @@ -231,6 +231,12 @@ int arch_add_memory(int nid, u64 start, u64 size,
> unsigned long size_pages = PFN_DOWN(size);
> int rc;
>
> + /*
> + * Physical memory is added only later during the memory online so we
> + * cannot use the added range at this stage unfortunatelly.

unfortunately ^

> + */
> + restrictions->flags &= ~MHP_MEMMAP_FROM_RANGE;

Could you also add to the comment about this being specific to s390?

> rc = vmem_add_mapping(start, size);
> if (rc)
> return rc;
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index fd06bcbd9535..d5234ca5c483 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -815,6 +815,13 @@ static void __meminit free_pagetable(struct page *page, int order)
> unsigned long magic;
> unsigned int nr_pages = 1 << order;
>
> + /*
> + * runtime vmemmap pages are residing inside the memory section so
> + * they do not have to be freed anywhere.
> + */
> + if (PageVmemmap(page))
> + return;

Thanks for the comment on this one, this one is right on.

> @@ -16,13 +18,18 @@ struct device;
> * @free: free pages set aside in the mapping for memmap storage
> * @align: pages reserved to meet allocation alignments
> * @alloc: track pages consumed, private to vmemmap_populate()
> + * @flush_alloc_pfns: callback to be called on the allocated range after it
> + * @nr_sects: nr of sects filled with memmap allocations
> + * is mapped to the vmemmap - see mark_vmemmap_pages
> */

I think you split up the "@flush_alloc_pfns" comment accidentally.

> struct vmem_altmap {
> - const unsigned long base_pfn;
> + unsigned long base_pfn;
> const unsigned long reserve;
> unsigned long free;
> unsigned long align;
> unsigned long alloc;
> + int nr_sects;
> + void (*flush_alloc_pfns)(struct vmem_altmap *self);
> };
>
> /*
> @@ -133,8 +140,62 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
> struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
> struct dev_pagemap *pgmap);
>
> -unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
> +static inline unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
> +{
> + /* number of pfns from base where pfn_to_page() is valid */
> + return altmap->reserve + altmap->free;
> +}
> void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
> +
> +static inline void mark_vmemmap_pages(struct vmem_altmap *self)
> +{
> + unsigned long pfn = self->base_pfn + self->reserve;
> + unsigned long nr_pages = self->alloc;
> + unsigned long align = PAGES_PER_SECTION * sizeof(struct page);
> + struct page *head;
> + unsigned long i;
> +
> + pr_debug("%s: marking %px - %px as Vmemmap\n", __func__,
> + pfn_to_page(pfn),
> + pfn_to_page(pfn + nr_pages - 1));
> + /*
> + * We keep track of the sections using this altmap by means
> + * of a refcount, so we know how much do we have to defer
> + * the call to vmemmap_free for this memory range.
> + * The refcount is kept in the first vmemmap page.
> + * For example:
> + * We add 10GB: (ffffea0004000000 - ffffea000427ffc0)
> + * ffffea0004000000 will have a refcount of 80.
> + */

The example is good, but it took me a minute to realize that 80 is
because 10GB is roughly 80 sections.

> + head = (struct page *)ALIGN_DOWN((unsigned long)pfn_to_page(pfn), align);

Is this ALIGN_DOWN() OK? It seems like it might be aligning 'pfn' down
into the reserved area that lies before it.

> + head = (struct page *)((unsigned long)head - (align * self->nr_sects));
> + page_ref_inc(head);
> +
> + /*
> + * We have used a/another section only with memmap allocations.
> + * We need to keep track of it in order to get the first head page
> + * to increase its refcount.
> + * This makes it easier to compute.
> + */
> + if (!((page_ref_count(head) * PAGES_PER_SECTION) % align))
> + self->nr_sects++;
> +
> + /*
> + * All allocations for the memory hotplug are the same sized so align
> + * should be 0
> + */
> + WARN_ON(self->align);
> + for (i = 0; i < nr_pages; i++, pfn++) {
> + struct page *page = pfn_to_page(pfn);
> + __SetPageVmemmap(page);
> + init_page_count(page);
> + }

Looks like some tabs vs. space problems.

> + self->alloc = 0;
> + self->base_pfn += nr_pages + self->reserve;
> + self->free -= nr_pages;
> +}
> #else
> static inline void *devm_memremap_pages(struct device *dev,
> struct dev_pagemap *pgmap)
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 50ce1bddaf56..e79054fcc96e 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -437,6 +437,24 @@ static __always_inline int __PageMovable(struct page *page)
> PAGE_MAPPING_MOVABLE;
> }
>
> +#define VMEMMAP_PAGE ~PAGE_MAPPING_FLAGS
> +static __always_inline int PageVmemmap(struct page *page)
> +{
> + return PageReserved(page) && (unsigned long)page->mapping == VMEMMAP_PAGE;
> +}
> +
> +static __always_inline void __ClearPageVmemmap(struct page *page)
> +{
> + ClearPageReserved(page);
> + page->mapping = NULL;
> +}
> +
> +static __always_inline void __SetPageVmemmap(struct page *page)
> +{
> + SetPageReserved(page);
> + page->mapping = (void *)VMEMMAP_PAGE;
> +}

Just curious, but why are these __ versions? I thought we used that for
non-atomic bit operations, but this uses the atomic SetPageReserved().

> diff --git a/mm/compaction.c b/mm/compaction.c
> index 7c607479de4a..c94a480e01b5 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -768,6 +768,9 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>
> page = pfn_to_page(low_pfn);
>
> + if (PageVmemmap(page))
> + goto isolate_fail;

Comments, please.

...
> +static int __online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
> +{
> + if (PageReserved(pfn_to_page(start_pfn)))
> + return online_pages_blocks(start_pfn, nr_pages);
> +
> + return 0;
> +}

Why is it important that 'start_pfn' is PageReserved()?

> static int online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
> - void *arg)
> + void *arg)
> {
> unsigned long onlined_pages = *(unsigned long *)arg;
> + unsigned long pfn = start_pfn;
> + unsigned long end_pfn = start_pfn + nr_pages;
> + bool vmemmap_page = false;
>
> - if (PageReserved(pfn_to_page(start_pfn)))
> - onlined_pages = online_pages_blocks(start_pfn, nr_pages);
> + for (; pfn < end_pfn; pfn++) {
> + struct page *p = pfn_to_page(pfn);
> +
> + /*
> + * Let us check if we got vmemmap pages.
> + */
> + if (PageVmemmap(p)) {
> + vmemmap_page = true;
> + break;
> + }
> + }

OK, so we did the hot-add, and allocated some of the memmap[] inside the
area being hot-added. Now, we're onlining the page. We search every
page in the *entire* area being onlined to try to find a PageVmemmap()?
That seems a bit inefficient, especially for sections where we don't
have a PageVmemmap().

> + if (!vmemmap_page) {
> + /*
> + * We can send the whole range to __online_pages_range,
> + * as we know for sure that there are not vmemmap pages.
> + */
> + onlined_pages += __online_pages_range(start_pfn, nr_pages);
> + } else {
> + /*
> + * We need to strip the vmemmap pages here,
> + * as we do not want to send them to the buddy allocator.
> + */
> + unsigned long sections = nr_pages / PAGES_PER_SECTION;
> + unsigned long sect_nr = 0;
> +
> + for (; sect_nr < sections; sect_nr++) {
> + unsigned pfn_end_section;
> + unsigned long memmap_pages = 0;
> +
> + pfn = start_pfn + (PAGES_PER_SECTION * sect_nr);
> + pfn_end_section = pfn + PAGES_PER_SECTION;
> +
> + while (pfn < pfn_end_section) {
> + struct page *p = pfn_to_page(pfn);
> +
> + if (PageVmemmap(p))
> + memmap_pages++;
> + pfn++;
> + }

... and another linear search through the entire area being added.

> + pfn = start_pfn + (PAGES_PER_SECTION * sect_nr);
> + if (!memmap_pages) {
> + onlined_pages += __online_pages_range(pfn, PAGES_PER_SECTION);

If I read this right, this if() and the first block are unneeded. The
second block is functionally identical if memmap_pages==0. Seems like
we can simplify the code. Also, is this _really_ under 80 columns?
Seems kinda long.

> + if (PageVmemmap(page))
> + continue;

FWIW, all these random-looking PageVmemmap() calls are a little
worrying. What happens when we get one wrong? Seems like we're kinda
bringing back all the PageReserved() checks we used to have scattered
all over.

> @@ -8138,6 +8146,16 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
> continue;
> }
> page = pfn_to_page(pfn);
> +
> + /*
> + * vmemmap pages are residing inside the memory section so
> + * they do not have to be freed anywhere.
> + */
> + if (PageVmemmap(page)) {
> + pfn++;
> + continue;
> + }


> +static struct page *current_vmemmap_page = NULL;
> +static bool vmemmap_dec_and_test(void)
> +{
> + bool ret = false;
> +
> + if (page_ref_dec_and_test(current_vmemmap_page))
> + ret = true;
> + return ret;
> +}

That's a bit of an obfuscated way to do:

	return page_ref_dec_and_test(current_vmemmap_page);

:)

But, I also see a global variable, and this immediately makes me think
about locking and who "owns" this. Comments would help.

> +static void free_vmemmap_range(unsigned long limit, unsigned long start, unsigned long end)
> +{
> + unsigned long range_start;
> + unsigned long range_end;
> + unsigned long align = sizeof(struct page) * PAGES_PER_SECTION;
> +
> + range_end = end;
> + range_start = end - align;
> + while (range_start >= limit) {
> + vmemmap_free(range_start, range_end, NULL);
> + range_end = range_start;
> + range_start -= align;
> + }
> +}

This loop looks like it's working from the end of the range back to the
beginning. I guess that it works, but it's a bit unconventional to go
backwards. Was there a reason?

Overall, there's a lot of complexity here. This certainly doesn't make
the memory hotplug code simpler.

2018-11-18 23:30:28

by Oscar Salvador

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] mm, memory_hotplug: allocate memmap from the added memory range for sparse-vmemmap

On Fri, 2018-11-16 at 14:41 -0800, Dave Hansen wrote:
> On 11/16/18 2:12 AM, Oscar Salvador wrote:
> > Physical memory hotadd has to allocate a memmap (struct page array)
> > for
> > the newly added memory section. Currently, kmalloc is used for
> > those
> > allocations.
>
> Did you literally mean kmalloc? I thought we had a bunch of ways of
> allocating memmaps, but I didn't think kmalloc() was actually used.

No, sorry.
The names of the functions used for allocating a memmap contain the word
kmalloc, so I got confused.
Indeed, vmemmap_alloc_block() ends up calling alloc_pages_node().
__kmalloc_section_usemap() is the one that calls kmalloc.

>
> So, can the ZONE_DEVICE altmaps move over to this infrastructure?
> Doesn't this effectively duplicate that code?

Actually, we are recycling/reusing part of the ZONE_DEVICE altmap code,
and the "struct vmem_altmap" itself.

The only thing we added in that regard is the callback function
mark_vmemmap_pages(), that controls the refcount and marks the pages as
Vmemmap.


> ...
> > diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> > index 7a9886f98b0c..03f014abd4eb 100644
> > --- a/arch/powerpc/mm/init_64.c
> > +++ b/arch/powerpc/mm/init_64.c
> > @@ -278,6 +278,8 @@ void __ref vmemmap_free(unsigned long start,
> > unsigned long end,
> > continue;
> >
> > page = pfn_to_page(addr >> PAGE_SHIFT);
> > + if (PageVmemmap(page))
> > + continue;
> > section_base =
> > pfn_to_page(vmemmap_section_start(start));
> > nr_pages = 1 << page_order;
>
> Reading this, I'm wondering if PageVmemmap() could be named better.
> From this is reads like "skip PageVmemmap() pages if freeing
> vmemmap",
> which does not make much sense.
>
> This probably at _least_ needs a comment to explain why the pages are
> being skipped.

The thing is that we do not need to send Vmemmap pages to the buddy
system by means of free_pages/free_page_reserved, as those pages reside
within the memory section.
The only thing we need is to clear the mapping (page tables).

I just realized that that piece of code is wrong, as it does not allow
the mapping to be cleared.
That is one of the consequences of having tested this only on x86_64.

>
> > diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> > index 4139affd6157..bc1523bcb09d 100644
> > --- a/arch/s390/mm/init.c
> > +++ b/arch/s390/mm/init.c
> > @@ -231,6 +231,12 @@ int arch_add_memory(int nid, u64 start, u64
> > size,
> > unsigned long size_pages = PFN_DOWN(size);
> > int rc;
> >
> > + /*
> > + * Physical memory is added only later during the memory
> > online so we
> > + * cannot use the added range at this stage
> > unfortunatelly.
>
> unfortunately ^
>
> > + */
> > + restrictions->flags &= ~MHP_MEMMAP_FROM_RANGE;
>
> Could you also add to the comment about this being specific to s390?

Sure, will do.

> > rc = vmem_add_mapping(start, size);
> > if (rc)
> > return rc;
> > diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> > index fd06bcbd9535..d5234ca5c483 100644
> > --- a/arch/x86/mm/init_64.c
> > +++ b/arch/x86/mm/init_64.c
> > @@ -815,6 +815,13 @@ static void __meminit free_pagetable(struct
> > page *page, int order)
> > unsigned long magic;
> > unsigned int nr_pages = 1 << order;
> >
> > + /*
> > + * runtime vmemmap pages are residing inside the memory
> > section so
> > + * they do not have to be freed anywhere.
> > + */
> > + if (PageVmemmap(page))
> > + return;
>
> Thanks for the comment on this one, this one is right on.
>
> > @@ -16,13 +18,18 @@ struct device;
> > * @free: free pages set aside in the mapping for memmap storage
> > * @align: pages reserved to meet allocation alignments
> > * @alloc: track pages consumed, private to vmemmap_populate()
> > + * @flush_alloc_pfns: callback to be called on the allocated range
> > after it
> > + * @nr_sects: nr of sects filled with memmap allocations
> > + * is mapped to the vmemmap - see mark_vmemmap_pages
> > */
>
> I think you split up the "@flush_alloc_pfns" comment accidentally.

Indeed, looks "broken".
I will fix it.


> > + /*
> > + * We keep track of the sections using this altmap by
> > means
> > + * of a refcount, so we know how much do we have to defer
> > + * the call to vmemmap_free for this memory range.
> > + * The refcount is kept in the first vmemmap page.
> > + * For example:
> > + * We add 10GB: (ffffea0004000000 - ffffea000427ffc0)
> > + * ffffea0004000000 will have a refcount of 80.
> > + */
>
> The example is good, but it took me a minute to realize that 80 is
> because 10GB is roughly 80 sections.

I will try to make it clearer.
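
To make the numbers explicit: with 128MB sections, 10GB / 128MB = 80
sections, and since mark_vmemmap_pages() bumps the refcount of the first
vmemmap page once per hot-added section, ffffea0004000000 ends up with a
refcount of 80.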

>
> > + head = (struct page *)ALIGN_DOWN((unsigned
> > long)pfn_to_page(pfn), align);
>
> Is this ALIGN_DOWN() OK? It seems like it might be aligning 'pfn'
> down
> into the reserved are that lies before it.

This aligns down to the section.
It makes sure that, given any page within a section, it returns the
first page of that section.
For example, for pages from ffffea0004000000 to ffffea00041f8000,
it will always return ffffea0004000000 (which is the first vmemmap page).

Then, we have another computation:

head = (struct page *)((unsigned long)head - (align * self->nr_sects));

In case we filled up a complete section with memmap allocations,
self->nr_sects gets increased.
So, when we cross sections, we know how far we have to go backwards
to get the first vmemmap page of the first section.

So, in case we got the page 0xffffea0004208000, the first ALIGN_DOWN
would give us 0xffffea0004200000, and then the second computation will
give us ffffea0004000000.
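
In code terms, reusing the numbers from the example above (on x86_64,
align = PAGES_PER_SECTION * sizeof(struct page) = 2MB of memmap per
128MB section):

	/* pfn_to_page(pfn) == 0xffffea0004208000, i.e. in the second section */
	head = (struct page *)ALIGN_DOWN((unsigned long)pfn_to_page(pfn), align);
	/* head == 0xffffea0004200000, first memmap page of *this* section */
	head = (struct page *)((unsigned long)head - (align * self->nr_sects));
	/* nr_sects == 1, so head == 0xffffea0004000000, the first vmemmap page */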

> > + WARN_ON(self->align);
> > + for (i = 0; i < nr_pages; i++, pfn++) {
> > + struct page *page = pfn_to_page(pfn);
> > + __SetPageVmemmap(page);
> > + init_page_count(page);
> > + }
>
> Looks like some tabs vs. space problems.

Sorry, I fixed it.


> > +static __always_inline void __ClearPageVmemmap(struct page *page)
> > +{
> > + ClearPageReserved(page);
> > + page->mapping = NULL;
> > +}
> > +
> > +static __always_inline void __SetPageVmemmap(struct page *page)
> > +{
> > + SetPageReserved(page);
> > + page->mapping = (void *)VMEMMAP_PAGE;
> > +}
>
> Just curious, but why are these __ versions? I thought we used that
> for
> non-atomic bit operations, but this uses the atomic
> SetPageReserved().

I think that we can use __SetPageReserved and __ClearPageReserved here
(see the sketch right after this list), as:

a) these pages are not initialized yet
b) hot-add operations are serialized
c) we should be the only ones making use of this mem range
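
Something along these lines, i.e. the same helpers but using the
non-atomic flavour (only a sketch):

static __always_inline void __SetPageVmemmap(struct page *page)
{
	/* Non-atomic variant is fine: pages are not yet visible to anyone else */
	__SetPageReserved(page);
	page->mapping = (void *)VMEMMAP_PAGE;
}

static __always_inline void __ClearPageVmemmap(struct page *page)
{
	__ClearPageReserved(page);
	page->mapping = NULL;
}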

>
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index 7c607479de4a..c94a480e01b5 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -768,6 +768,9 @@ isolate_migratepages_block(struct
> > compact_control *cc, unsigned long low_pfn,
> >
> > page = pfn_to_page(low_pfn);
> >
> > + if (PageVmemmap(page))
> > + goto isolate_fail;
>
> Comments, please.

Will do.


> ...
> > +static int __online_pages_range(unsigned long start_pfn, unsigned
> > long nr_pages)
> > +{
> > + if (PageReserved(pfn_to_page(start_pfn)))
> > + return online_pages_blocks(start_pfn, nr_pages);
> > +
> > + return 0;
> > +}
>
> Why is it important that 'start_pfn' is PageReserved()?

This check on PageReserved was already there before my patch.
I think that this is done because we set all the pages within the
range as PageReserved by means of:

online_pages()->move_pfn_range()->move_pfn_range_to_zone()->memmap_init_zone()

memmap_init_zone() marks the pages as reserved.
I guess that this is later checked when onlining the pages to make sure
that no one touched those pages in the meantime.

>
> > static int online_pages_range(unsigned long start_pfn, unsigned
> > long nr_pages,
> > - void *arg)
> > + void *arg)
> > {
> > unsigned long onlined_pages = *(unsigned long *)arg;
> > + unsigned long pfn = start_pfn;
> > + unsigned long end_pfn = start_pfn + nr_pages;
> > + bool vmemmap_page = false;
> >
> > - if (PageReserved(pfn_to_page(start_pfn)))
> > - onlined_pages = online_pages_blocks(start_pfn,
> > nr_pages);
> > + for (; pfn < end_pfn; pfn++) {
> > + struct page *p = pfn_to_page(pfn);
> > +
> > + /*
> > + * Let us check if we got vmemmap pages.
> > + */
> > + if (PageVmemmap(p)) {
> > + vmemmap_page = true;
> > + break;
> > + }
> > + }
>
> OK, so we did the hot-add, and allocated some of the memmap[] inside
> the
> area being hot-added. Now, we're onlining the page. We search every
> page in the *entire* area being onlined to try to find a
> PageVmemmap()?
> That seems a bit inefficient, especially for sections where we don't
> have a PageVmemmap().
>
[...]
>
> If I read this right, this if() and the first block are
> unneeded. The
> second block is funcationally identical if memmap_pages==0. Seems
> like
> we can simplify the code. Also, is this _really_ under 80 columns?
> Seems kinda long.

Yeah, before the optimization for freeing pages in higher orders, this
would have been much easier.
I just wanted to be very careful and strip the vmemmap pages there
before sending the range to the buddy allocator.

But this can be done better, you are right.

I will think more about this.
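
Since the vmemmap pages are handed out from the beginning of the
hot-added range, they can only sit at the front of the range that
online_pages_range() sees, so something along these lines might be
enough (only a rough, untested sketch):

	unsigned long pfn = start_pfn;
	unsigned long end_pfn = start_pfn + nr_pages;

	/* Skip the leading vmemmap pages; they never go to the buddy allocator */
	while (pfn < end_pfn && PageVmemmap(pfn_to_page(pfn)))
		pfn++;

	if (pfn < end_pfn)
		onlined_pages += __online_pages_range(pfn, end_pfn - pfn);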

>
> > + if (PageVmemmap(page))
> > + continue;
>
> FWIW, all these random-looking PageVmemmap() calls are a little
> worrying. What happens when we get one wrong? Seems like we're
> kinda
> bringing back all the PageReserved() checks we used to have scattered
> all over.

Well, we check for Vmemmap pages in:

* has_unmovable_pages
* __offline_isolated_pages
* free_pages_check
* test_pages_isolated
* __test_page_isolated_in_pageblock
* arch-specific code to not free them

The check in free_pages_check can go away, as a vmemmap page should
never reach that.

The rest of the checks are there either because we have to skip those
pages to make forward progress, as is the case in
has_unmovable_pages/__test_page_isolated_in_pageblock, or because we do
not need to perform any action on them.

I will try to see if I can get rid of some, and I will improve the
commenting.


>
> > +static struct page *current_vmemmap_page = NULL;
> > +static bool vmemmap_dec_and_test(void)
> > +{
> > + bool ret = false;
> > +
> > + if (page_ref_dec_and_test(current_vmemmap_page))
> > + ret = true;
> > + return ret;
> > +}
>
> That's a bit of an obfuscated way to do:
>
> return page_ref_dec_and_test(current_vmemmap_page));
>
> :)

Yes, much easier and better.

>
> But, I also see a global variable, and this immediately makes me
> think
> about locking and who "owns" this. Comments would help.

The thing is that __kfree_section_memmap is only called from
sparse_remove_one_section by means of free_section_usemap.
sparse_remove_one_section is called from the hot-remove operation,
which is serialized by the hotplug lock.

But I agree that comments would help to understand this much better.
I will add them.
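
For instance, something along these lines:

	/*
	 * Only touched from the hot-remove path: __kfree_section_memmap() is
	 * reached via sparse_remove_one_section(), and hot-remove operations
	 * are serialized by the hotplug lock, so no extra locking is needed.
	 */
	static struct page *current_vmemmap_page = NULL;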

>
> > +static void free_vmemmap_range(unsigned long limit, unsigned long
> > start, unsigned long end)
> > +{
> > + unsigned long range_start;
> > + unsigned long range_end;
> > + unsigned long align = sizeof(struct page) *
> > PAGES_PER_SECTION;
> > +
> > + range_end = end;
> > + range_start = end - align;
> > + while (range_start >= limit) {
> > + vmemmap_free(range_start, range_end, NULL);
> > + range_end = range_start;
> > + range_start -= align;
> > + }
> > +}
>
> This loop looks like it's working from the end of the range back to
> the
> beginning. I guess that it works, but it's a bit unconventional to
> go
> backwards. Was there a reason?

Yes, it is intended to be that way.
The thing is that the memory for the page tables for the memmap array
comes from the beginning of the hot-added section(s).
If we called vmemmap_free() starting from the beginning, those PMDs
would be cleared first, and the later vmemmap_free() calls for the
remaining sections would blow up when referencing them.

I know it is not very elegant to call it backwards, but the alternative
was to fiddle with arch specifics to hold back the PMD clearing phase
until all sections that were using vmemmap pages were torn down.

So I picked this approach, which does not involve any arch specifics and
just works quite well.

>
> Overall, there's a lot of complexity here. This certainly doesn't
> make
> the memory hotplug code simpler.

I will try to make it more understandable and simple.

Thanks a lot for your review, Dave!


2018-11-23 07:16:17

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] mm, memory_hotplug: allocate memmap from hotadded memory

On 16.11.18 11:12, Oscar Salvador wrote:
> Hi,
>
> this patchset is based on Michal's patchset [1].
> Patch#1, patch#2 and patch#4 are quite the same.
> They just needed little changes to adapt it to current codestream,
> so it seemed fair to leave them.
>
> ---------
> Original cover:
>
> This is another step to make the memory hotplug more usable. The primary
> goal of this patchset is to reduce memory overhead of the hot added
> memory (at least for SPARSE_VMEMMAP memory model). Currently we use
> kmalloc to poppulate memmap (struct page array) which has two main
> drawbacks a) it consumes an additional memory until the hotadded memory
> itslef is onlined and b) memmap might end up on a different numa node
> which is especially true for movable_node configuration.

I haven't looked at the patches but have some questions.

1. How are we going to present such memory to the system statistics?

In my opinion, this vmemmap memory should
a) still account to total memory
b) show up as allocated

So just like before.


2. Is this optional, in other words, can a device driver decide not to
do it like that?

You mention ballooning. Now, both XEN and Hyper-V (the only balloon
drivers that add new memory as of now) usually add e.g. a 128MB segment
but only actually online some part of it (e.g. 64MB, but it could vary).
Now, going ahead and assuming that all memory of a section can be
read/written is wrong. A device driver will indicate which pages may
actually be used via set_online_page_callback() when new memory is added.
But at that point you have already happily accessed some memory for the
vmemmap - which might lead to crashes.

For now the rule was: Memory that was not onlined will not be
read/written, that's why it works for XEN and Hyper-V.

It *could* work for them if they could know and communicate to
add_memory() which part of a newly added memory block is definitely usable.

So, especially for the case of balloning that you describe, things are
more tricky than a simple "let's just use some memory of the memory
block we're adding" unfortunately. For DIMMs it can work.

>
> a) is problem especially for memory hotplug based memory "ballooning"
> solutions when the delay between physical memory hotplug and the
> onlining can lead to OOM and that led to introduction of hacks like auto
> onlining (see 31bc3858ea3e ("memory-hotplug: add automatic onlining
> policy for the newly added memory")).
> b) can have performance drawbacks.
>
> One way to mitigate both issues is to simply allocate memmap array
> (which is the largest memory footprint of the physical memory hotplug)
> from the hotadded memory itself. VMEMMAP memory model allows us to map
> any pfn range so the memory doesn't need to be online to be usable
> for the array. See patch 3 for more details. In short I am reusing an
> existing vmem_altmap which wants to achieve the same thing for nvdim
> device memory.
>
> There is also one potential drawback, though. If somebody uses memory
> hotplug for 1G (gigantic) hugetlb pages then this scheme will not work
> for them obviously because each memory block will contain reserved
> area. Large x86 machines will use 2G memblocks so at least one 1G page
> will be available but this is still not 2G...

Yes, I think this is a possible use case. So it would have to be
configurable somewhere - opt-in most probably. But related to
ballooning, they will usually add the minimum possible granularity (e.g.
128MB) and that seems to work for these setups. DIMMs are probably
different.

>
> I am not really sure somebody does that and how reliable that can work
> actually. Nevertheless, I _believe_ that onlining more memory into
> virtual machines is much more common usecase. Anyway if there ever is a
> strong demand for such a usecase we have basically 3 options a) enlarge
> memory blocks even more b) enhance altmap allocation strategy and reuse
> low memory sections to host memmaps of other sections on the same NUMA
> node c) have the memmap allocation strategy configurable to fallback to
> the current allocation.
>
> ---------
>
> Old version of this patchset would blow up because we were clearing the
> pmds while we still had to reference pages backed by that memory.
> I picked another approach which does not force us to touch arch specific code
> in that regard.
>
> Overall design:
>
> With the preface of:
>
> 1) Whenever you hot-add a range, this is the same range that will be hot-removed.
> This is just because you can't remove half of a DIMM, in the same way you can't
> remove half of a device in qemu.
> A device/DIMM are added/removed as a whole.
>
> 2) Every add_memory()->add_memory_resource()->arch_add_memory()->__add_pages()
> will use a new altmap because it is a different hot-added range.
>
> 3) When you hot-remove a range, the sections will be removed sequantially
> starting from the first section of the range and ending with the last one.
>
> 4) hot-remove operations are protected by hotplug lock, so no parallel operations
> can take place.
>
> The current design is as follows:
>
> hot-remove operation)
>
> - __kfree_section_memmap will be called for every section to be removed.
> - We catch the first vmemmap_page and we pin it to a global variable.
> - Further calls to __kfree_section_memmap will decrease refcount of
> the vmemmap page without calling vmemmap_free().
> We defer the call to vmemmap_free() untill all sections are removed
> - If the refcount drops to 0, we know that we hit the last section.
> - We clear the global variable.
> - We call vmemmap_free for [last_section, current_vmemmap_page)
>
> In case we are hot-removing a range that used altmap, the call to
> vmemmap_free must be done backwards, because the beginning of memory
> is used for the pagetables.
> Doing it this way, we ensure that by the time we remove the pagetables,
> those pages will not have to be referenced anymore.
>
> An example:
>
> (qemu) object_add memory-backend-ram,id=ram0,size=10G
> (qemu) device_add pc-dimm,id=dimm0,memdev=ram0,node=1
>
> - This has added: ffffea0004000000 - ffffea000427ffc0 (refcount: 80)
>
> When refcount of ffffea0004000000 drops to 0, vmemmap_free()
> will be called in this way:
>
> vmemmap_free: start/end: ffffea000de00000 - ffffea000e000000
> vmemmap_free: start/end: ffffea000dc00000 - ffffea000de00000
> vmemmap_free: start/end: ffffea000da00000 - ffffea000dc00000
> vmemmap_free: start/end: ffffea000d800000 - ffffea000da00000
> vmemmap_free: start/end: ffffea000d600000 - ffffea000d800000
> vmemmap_free: start/end: ffffea000d400000 - ffffea000d600000
> vmemmap_free: start/end: ffffea000d200000 - ffffea000d400000
> vmemmap_free: start/end: ffffea000d000000 - ffffea000d200000
> vmemmap_free: start/end: ffffea000ce00000 - ffffea000d000000
> vmemmap_free: start/end: ffffea000cc00000 - ffffea000ce00000
> vmemmap_free: start/end: ffffea000ca00000 - ffffea000cc00000
> vmemmap_free: start/end: ffffea000c800000 - ffffea000ca00000
> vmemmap_free: start/end: ffffea000c600000 - ffffea000c800000
> vmemmap_free: start/end: ffffea000c400000 - ffffea000c600000
> vmemmap_free: start/end: ffffea000c200000 - ffffea000c400000
> vmemmap_free: start/end: ffffea000c000000 - ffffea000c200000
> vmemmap_free: start/end: ffffea000be00000 - ffffea000c000000
> ...
> ...
> vmemmap_free: start/end: ffffea0004000000 - ffffea0004200000
>
>
> [Testing]
>
> - Tested ony on x86_64
> - Several tests were carried out with memblocks of different sizes.
> - Tests were performed adding different memory-range sizes
> from 512M to 60GB.
>
> [Todo]
> - Look into hotplug gigantic pages case
>
> Before investing more effort, I would like to hear some opinions/thoughts/ideas.
>
> [1] https://lore.kernel.org/lkml/[email protected]/
>
> Michal Hocko (3):
> mm, memory_hotplug: cleanup memory offline path
> mm, memory_hotplug: provide a more generic restrictions for memory
> hotplug
> mm, sparse: rename kmalloc_section_memmap, __kfree_section_memmap
>
> Oscar Salvador (1):
> mm, memory_hotplug: allocate memmap from the added memory range for
> sparse-vmemmap
>
> arch/arm64/mm/mmu.c | 5 +-
> arch/ia64/mm/init.c | 5 +-
> arch/powerpc/mm/init_64.c | 2 +
> arch/powerpc/mm/mem.c | 6 +-
> arch/s390/mm/init.c | 12 +++-
> arch/sh/mm/init.c | 6 +-
> arch/x86/mm/init_32.c | 6 +-
> arch/x86/mm/init_64.c | 17 ++++--
> include/linux/memory_hotplug.h | 35 ++++++++---
> include/linux/memremap.h | 65 +++++++++++++++++++-
> include/linux/page-flags.h | 18 ++++++
> kernel/memremap.c | 12 ++--
> mm/compaction.c | 3 +
> mm/hmm.c | 6 +-
> mm/memory_hotplug.c | 133 ++++++++++++++++++++++++++++-------------
> mm/page_alloc.c | 33 ++++++++--
> mm/page_isolation.c | 13 +++-
> mm/sparse.c | 62 ++++++++++++++++---
> 18 files changed, 345 insertions(+), 94 deletions(-)
>


--

Thanks,

David / dhildenb

2018-11-24 08:30:58

by Oscar Salvador

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] mm, memory_hotplug: allocate memmap from hotadded memory

On Thu, Nov 22, 2018 at 10:21:24AM +0100, David Hildenbrand wrote:
> 1. How are we going to present such memory to the system statistics?
>
> In my opinion, this vmemmap memory should
> a) still account to total memory
> b) show up as allocated
>
> So just like before.

No, it shows up neither under total memory nor as allocated memory.
This memory is not usable for anything but creating the pagetables
for the memmap array for the section(s).

It is not memory that the system can use.

I also guess that if there is a strong opinion on this, we could create
a counter, something like NR_VMEMMAP_PAGES, and show it under /proc/meminfo.

> 2. Is this optional, in other words, can a device driver decide to not
> to it like that?

Right now, it is a per-arch setup.
For example, x86_64/powerpc/arm64 will do it unconditionally.

If we want to make this a per-device-driver thing, I guess that we could
allow passing a flag to add_memory()->add_memory_resource(), and there
unset MHP_MEMMAP_FROM_RANGE in case that flag is set.

> You mention ballooning. Now, both XEN and Hyper-V (the only balloon
> drivers that add new memory as of now), usually add e.g. a 128MB segment
> to only actually some part of it (e.g. 64MB, but could vary). Now, going
> ahead and assuming that all memory of a section can be read/written is
> wrong. A device driver will indicate which pages may actually be used
> via set_online_page_callback() when new memory is added. But at that
> point you already happily accessed some memory for vmmap - which might
> lead to crashes.
>
> For now the rule was: Memory that was not onlined will not be
> read/written, that's why it works for XEN and Hyper-V.

We do not write all memory of the hot-added section; we just write the
first 2MB (the first 512 pages), and the other 126MB are left untouched.
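
(For reference: a 128MB section is 32768 4K pages, and with a 64-byte
struct page that is 32768 * 64 bytes = 2MB of memmap, i.e. 512 pages.)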

Assuming that you add a section-aligned memory chunk (128MB) but only present
the first 64MB or 32MB to the guest as onlined, we still need to allocate the memmap
for the whole section.

I do not really know the tricks behind Hyper-V/Xen, could you expand on that?

So far I have only tested this with qemu simulating large machines, but I plan
to try the ballooning thing on Xen.

At this moment I am working on a second version of this patchset
to address Dave's feedback.

----
Oscar Salvador
SUSE L3

2018-11-24 08:32:24

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] mm, memory_hotplug: allocate memmap from hotadded memory

On 23.11.18 12:55, Oscar Salvador wrote:
> On Thu, Nov 22, 2018 at 10:21:24AM +0100, David Hildenbrand wrote:
>> 1. How are we going to present such memory to the system statistics?
>>
>> In my opinion, this vmemmap memory should
>> a) still account to total memory
>> b) show up as allocated
>>
>> So just like before.
>
> No, it does not show up under total memory and neither as allocated memory.
> This memory is not for use for anything but for creating the pagetables
> for the memmap array for the section/s.
>
> It is not memory that the system can use.
>
> I also guess that if there is a strong opinion on this, we could create
> a counter, something like NR_VMEMMAP_PAGES, and show it under /proc/meminfo.

It's a change if we "hide" such memory. E.g. in a cloud environment you
request to add XGB to your system. You will not see XGB, and that can be
"problematic" with some customers :) - "But I am paying for additional
XGB". (Showing XGB but YMB as allocated is easier to argue with - "your
OS is using it").

>
>> 2. Is this optional, in other words, can a device driver decide to not
>> to it like that?
>
> Right now, is a per arch setup.
> For example, x86_64/powerpc/arm64 will do it inconditionally.

That could indeed break Hyper-V/XEN (if the granularity in which you can
add memory can be smaller than 2MB). Or you have bigger memory blocks.

>
> If we want to restrict this a per device-driver thing, I guess that we could
> allow to pass a flag to add_memory()->add_memory_resource(), and there
> unset MHP_MEMMAP_FROM_RANGE in case that flag is enabled.
>
>> You mention ballooning. Now, both XEN and Hyper-V (the only balloon
>> drivers that add new memory as of now), usually add e.g. a 128MB segment
>> to only actually some part of it (e.g. 64MB, but could vary). Now, going
>> ahead and assuming that all memory of a section can be read/written is
>> wrong. A device driver will indicate which pages may actually be used
>> via set_online_page_callback() when new memory is added. But at that
>> point you already happily accessed some memory for vmmap - which might
>> lead to crashes.
>>
>> For now the rule was: Memory that was not onlined will not be
>> read/written, that's why it works for XEN and Hyper-V.
>
> We do not write all memory of the hot-added section, we just write the
> first 2MB (first 512 pages), the other 126MB are left untouched.

Then that has to be made a rule and we have to make sure that all users
(Hyper-V/XEN) can cope with that.

But it is more problematic because we could have 2GB memory blocks. Then
the 2MB rule no longer holds. Other archs have other sizes (e.g.
s390x 256MB).

>
> Assuming that you add a memory-chunk section aligned (128MB), but you only present
> the first 64MB or 32MB to the guest as onlined, we still need to allocate the memmap
> for the whole section.

Yes, that's the right thing to do. (the section will be online but some
parts "fake offline")

>
> I do not really know the tricks behind Hyper-V/Xen, could you expand on that?

Let's say you want to add 64MB on Hyper-V. What Linux will do is add a
new section (128MB) but only actually online, say the first 64MB (I have
no idea if it has to be the first 64MB actually!).

It will keep the other pages "fake-offline" and online them later on
when e.g. adding another 64MB.

See drivers/hv/hv_balloon.c:
- set_online_page_callback(&hv_online_page);
- hv_bring_pgs_online() -> hv_page_online_one() -> has_pfn_is_backed()

The other 64MB must not be written (otherwise GP!) but may eventually be
read for e.g. dumping (although that is also shaky and I am fixing that
right now to make it more reliable).

Long story short: It is better to allow device drivers to make use of
the old behavior until they eventually can make sure that the "altmap?"
can be read/written when adding memory.

It presents a major change in the add_memory() interface.

>
> So far I only tested this with qemu simulating large machines, but I plan
> to try the balloning thing on Xen.
>
> At this moment I am working on a second version of this patchset
> to address Dave's feedback.

Cool, keep me tuned :)

>
> ----
> Oscar Salvador
> SUSE L3
>


--

Thanks,

David / dhildenb

2018-11-24 08:33:07

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] mm, memory_hotplug: allocate memmap from hotadded memory

On 23.11.18 13:42, Michal Hocko wrote:
> On Fri 23-11-18 12:55:41, Oscar Salvador wrote:
>> On Thu, Nov 22, 2018 at 10:21:24AM +0100, David Hildenbrand wrote:
>>> 1. How are we going to present such memory to the system statistics?
>>>
>>> In my opinion, this vmemmap memory should
>>> a) still account to total memory
>>> b) show up as allocated
>>>
>>> So just like before.
>>
>> No, it does not show up under total memory and neither as allocated memory.
>> This memory is not for use for anything but for creating the pagetables
>> for the memmap array for the section/s.
>
> I haven't read through your patches yet but wanted to clarfify few
> points here.
>
> This should essentially follow the bootmem allocated memory pattern. So
> it is present and accounted to spanned pages but it is not managed.
>
>> It is not memory that the system can use.
>
> same as bootmem ;)

Fair enough, just saying that it represents a change :)

(but people also already complained if their VM has XGB but they don't
see actual XGB as total memory e.g. due to the crash kernel size)

>
>> I also guess that if there is a strong opinion on this, we could create
>> a counter, something like NR_VMEMMAP_PAGES, and show it under /proc/meminfo.
>
> Do we really have to? Isn't the number quite obvious from the size of
> the hotpluged memory?

At least the size of the vmemmaps cannot reliably be calculated from "MemTotal".
But maybe based on something else. (there, it is indeed obvious)

>
>>
>>> 2. Is this optional, in other words, can a device driver decide to not
>>> to it like that?
>>
>> Right now, is a per arch setup.
>> For example, x86_64/powerpc/arm64 will do it inconditionally.
>>
>> If we want to restrict this a per device-driver thing, I guess that we could
>> allow to pass a flag to add_memory()->add_memory_resource(), and there
>> unset MHP_MEMMAP_FROM_RANGE in case that flag is enabled.
>
> I believe we will need to make this opt-in. There are some usecases
> which hotplug an expensive (per size) memory via hotplug and it would be
> too wasteful to use it for struct pages. I haven't bothered to address
> that with my previous patches because I just wanted to make the damn
> thing work first.
>

Good point.


--

Thanks,

David / dhildenb

2018-11-24 08:33:10

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] mm, memory_hotplug: allocate memmap from hotadded memory

On Fri 23-11-18 12:55:41, Oscar Salvador wrote:
> On Thu, Nov 22, 2018 at 10:21:24AM +0100, David Hildenbrand wrote:
> > 1. How are we going to present such memory to the system statistics?
> >
> > In my opinion, this vmemmap memory should
> > a) still account to total memory
> > b) show up as allocated
> >
> > So just like before.
>
> No, it does not show up under total memory and neither as allocated memory.
> This memory is not for use for anything but for creating the pagetables
> for the memmap array for the section/s.

I haven't read through your patches yet but wanted to clarify a few
points here.

This should essentially follow the bootmem allocated memory pattern. So
it is present and accounted to spanned pages but it is not managed.

> It is not memory that the system can use.

same as bootmem ;)

> I also guess that if there is a strong opinion on this, we could create
> a counter, something like NR_VMEMMAP_PAGES, and show it under /proc/meminfo.

Do we really have to? Isn't the number quite obvious from the size of
the hotplugged memory?

>
> > 2. Is this optional, in other words, can a device driver decide to not
> > to it like that?
>
> Right now, is a per arch setup.
> For example, x86_64/powerpc/arm64 will do it inconditionally.
>
> If we want to restrict this a per device-driver thing, I guess that we could
> allow to pass a flag to add_memory()->add_memory_resource(), and there
> unset MHP_MEMMAP_FROM_RANGE in case that flag is enabled.

I believe we will need to make this opt-in. There are some usecases
which hotplug an expensive (per size) memory via hotplug and it would be
too wasteful to use it for struct pages. I haven't bothered to address
that with my previous patches because I just wanted to make the damn
thing work first.
--
Michal Hocko
SUSE Labs

2018-11-24 08:33:40

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH 2/4] mm, memory_hotplug: provide a more generic restrictions for memory hotplug

[Cc Alexander - email thread starts http://lkml.kernel.org/r/[email protected]]

On Fri 16-11-18 11:12:20, Oscar Salvador wrote:
> From: Michal Hocko <[email protected]>
>
> arch_add_memory, __add_pages take a want_memblock which controls whether
> the newly added memory should get the sysfs memblock user API (e.g.
> ZONE_DEVICE users do not want/need this interface). Some callers even
> want to control where do we allocate the memmap from by configuring
> altmap. This is currently done quite ugly by searching for altmap down
> in memory hotplug (to_vmem_altmap). It should be the caller to provide
> the altmap down the call chain.
>
> Add a more generic hotplug context for arch_add_memory and __add_pages.
> struct mhp_restrictions contains flags which contains additional
> features to be enabled by the memory hotplug (MHP_MEMBLOCK_API
> currently) and altmap for alternative memmap allocator.

One note here as well. In retrospect, the API I have come up
with here is quite hackish. Considering the recent discussion about
special needs ZONE_DEVICE has for both initialization and struct page
allocations with Alexander Duyck I believe we wanted a more abstracted
API with allocator and constructor callbacks. This would allow different
usecases to fine tune their needs without specialcasing deep in the core
hotplug code paths.
--
Michal Hocko
SUSE Labs

2018-11-24 08:34:29

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] mm, memory_hotplug: allocate memmap from hotadded memory

On Fri 23-11-18 13:51:57, David Hildenbrand wrote:
> On 23.11.18 13:42, Michal Hocko wrote:
> > On Fri 23-11-18 12:55:41, Oscar Salvador wrote:
[...]
> >> It is not memory that the system can use.
> >
> > same as bootmem ;)
>
> Fair enough, just saying that it represents a change :)
>
> (but people also already complained if their VM has XGB but they don't
> see actual XGB as total memory e.g. due to the crash kernel size)

I can imagine. I have seen many "where's my memory dude" questions... We
have so many unaccounted usages that it is simply impossible to see the
full picture of where the memory is consumed. The current implementation
would account memmaps in unreclaimable slabs but you still do not know
how much was spent for it...

> >> I also guess that if there is a strong opinion on this, we could create
> >> a counter, something like NR_VMEMMAP_PAGES, and show it under /proc/meminfo.
> >
> > Do we really have to? Isn't the number quite obvious from the size of
> > the hotpluged memory?
>
> At least the size of vmmaps cannot reliably calculated from "MemTotal" .
> But maybe based on something else. (there, it is indeed obvious)

Everybody knows the struct page size obviously :p and the rest is a
simple exercise. But more seriously, I see what you are saying. We do
not have a good counter now and the patch doesn't improve that. But I
guess this is a separate discussion.

--
Michal Hocko
SUSE Labs

2018-11-26 13:07:43

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] mm, memory_hotplug: allocate memmap from hotadded memory

On 23.11.18 12:55, Oscar Salvador wrote:
> On Thu, Nov 22, 2018 at 10:21:24AM +0100, David Hildenbrand wrote:
>> 1. How are we going to present such memory to the system statistics?
>>
>> In my opinion, this vmemmap memory should
>> a) still account to total memory
>> b) show up as allocated
>>
>> So just like before.
>
> No, it does not show up under total memory and neither as allocated memory.
> This memory is not for use for anything but for creating the pagetables
> for the memmap array for the section/s.
>
> It is not memory that the system can use.
>
> I also guess that if there is a strong opinion on this, we could create
> a counter, something like NR_VMEMMAP_PAGES, and show it under /proc/meminfo.
>
>> 2. Is this optional, in other words, can a device driver decide to not
>> to it like that?
>
> Right now, is a per arch setup.
> For example, x86_64/powerpc/arm64 will do it inconditionally.
>

Just FYI another special case is s390x right now when it comes to adding
standby memory: (linux/drivers/s390/char/sclp_cmd.c)

There are two issues:

a) Storage keys

On s390x, storage keys have to be initialized before memory can be
used (think of them as 7-bit page status/protection bits for each 4k page,
managed and stored by the HW separately).

Storage keys are initialized in sclp_assign_storage(), when the memory
is going online (MEM_GOING_ONLINE).

b) Hypervisor making memory accessible

Only when onlining memory is the memory actually made accessible in the
hypervisor (sclp_assign_storage()). Touching it before that is bad and
will fail.

You can think of standby memory on s390x like memory that is only
onlined on request by an administrator. Once onlined, the hypervisor
will allocate memory for it.


However, once we have other ways of adding memory to a s390x guest (e.g.
virtio-mem) at least b) is not an issue anymore. a) would require manual
tweaking (e.g. initializing storage keys of the memory used for vmemmaps early).


So in summary as of now your approach will not work on s390x, but with
e.g. virtio-mem it could. We would need some interface to specify how to
add memory. (To somehow allow a driver to specify it - e.g. SCLP vs.
virtio-mem)

Cheers!

--

Thanks,

David / dhildenb

2019-01-14 13:19:56

by Oscar Salvador

[permalink] [raw]
Subject: Re: [RFC PATCH 2/4] mm, memory_hotplug: provide a more generic restrictions for memory hotplug

On Fri, Nov 23, 2018 at 02:00:43PM +0100, Michal Hocko wrote:
> One note here as well. In the retrospect the API I have come up
> with here is quite hackish. Considering the recent discussion about
> special needs ZONE_DEVICE has for both initialization and struct page
> allocations with Alexander Duyck I believe we wanted a more abstracted
> API with allocator and constructor callbacks. This would allow different
> usecases to fine tune their needs without specialcasing deep in the core
> hotplug code paths.

Hi all,

so, now that vacation is gone, I wanted to come back to this.
I kind of get what you mean with this more abstracted API, but I am not really
sure how we could benefit from it (or maybe I am just short-sighted here).

Right now, struct mhp_restrictions would look like:

struct mhp_restrictions {
unsigned long flags;
struct vmem_altmap *altmap;
};

where flags tell us whether we want a memblock device and whether we should
allocate the memmap array from the hot-added range.
And altmap is the altmap we would use for it.

Indeed, we could add two callbacks, set_up() and construct() (random naming).

When talking about memmap-from-hot_added-range, set_up() could be called
to construct the altmap, i.e:

<--
struct vmem_altmap __memblk_altmap;

__memblk_altmap.base_pfn = phys_start_pfn;
__memblk_altmap.alloc = 0;
__memblk_altmap.align = 0;
__memblk_altmap.free = nr_pages;
-->

and construct() would be called at the very end of __add_pages(), which
basically would be mark_vmemmap_pages().
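
Just to make the idea a bit more concrete, the two hooks could hang off
struct mhp_restrictions, something like this (naming and signatures
completely made up):

struct mhp_restrictions {
	unsigned long flags;
	struct vmem_altmap *altmap;
	/* build up the altmap before __add_pages() starts using the range */
	void (*set_up)(struct vmem_altmap *altmap,
		       unsigned long phys_start_pfn, unsigned long nr_pages);
	/* called at the very end of __add_pages(), e.g. mark_vmemmap_pages() */
	void (*construct)(struct vmem_altmap *altmap,
			  unsigned long phys_start_pfn, unsigned long nr_pages);
};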

Now, looking at devm_memremap_pages(ZONE_DEVICE stuff), it does:

hotplug_lock();
arch_add_memory
add_pages
move_pfn_range_to_zone
hotplug_unlock();
memmap_init_zone_device

For the ZONE_DEVICE case, move_pfn_range_to_zone() only initializes the pages
containing the memory mapping, while all the remaining pages are initialized later on
in memmap_init_zone_device().
Besides initializing pages, memmap_init_zone_device() also sets page->pgmap field.
So you could say that memmap_init_zone_device would be the construct part.

Anyway, I am currently working on patch 3 of this series to improve it and make it less
complex, but it would be great to sort out this API thing.

Maybe Alexander or you can provide some suggestions/ideas here.

Thanks

Oscar Salvador
--
Oscar Salvador
SUSE L3