Hi,
It has been a while since I sent the previous version [1].
This version incorporates some of the feedback I got back then, such as
letting the caller decide whether to allocate per memory block or per
memory range (patch#2), and allowing vmemmap to be disabled when users
want to expose all hotpluggable memory to userspace (patch#5).
[Testing]
While I was able to test the last version on powerpc, and colleagues from
Huawei helped me test it on arm64, this time I could only test it on x86_64.
The codebase is largely the same, so I would not expect surprises.
- x86_64: small and large memblocks (128MB, 1G and 2G)
- Kernel module that adds memory spanning multiple memblocks
  and removes that memory at a different granularity.
So far, only acpi memory hotplug uses the new flag.
The other callers can be changed depending on their needs.
Of course, more testing and feedback is appreciated.
[Coverletter]
This is another step to make memory hotplug more usable. The primary
goal of this patchset is to reduce memory overhead of the hot-added
memory (at least for the SPARSEMEM_VMEMMAP memory model). The way we
currently populate the memmap (struct page array) has two main drawbacks:
a) it consumes additional memory until the hotadded memory itself is
onlined, and
b) the memmap might end up on a different NUMA node, which is especially
true for the movable_node configuration.
a) is especially a problem for memory hotplug based memory "ballooning"
solutions, where the delay between physical memory hotplug and
onlining can lead to OOM; that led to the introduction of hacks like auto
onlining (see 31bc3858ea3e ("memory-hotplug: add automatic onlining
policy for the newly added memory")).
b) can have performance drawbacks.
Another minor case is that I have seen hot-add operations failing on some
architectures because they ran out of order-x pages.
E.g. on powerpc, in certain configurations, we use order-8 pages,
and given the 64KB base page size, that is 16MB per allocation.
If we run out of those, we simply fail the operation and cannot add
more memory.
We could fall back to base pages as x86_64 does, but we can do better.
One way to mitigate all these issues is to simply allocate memmap array
(which is the largest memory footprint of the physical memory hotplug)
from the hot-added memory itself. SPARSEMEM_VMEMMAP memory model allows
us to map any pfn range so the memory doesn't need to be online to be
usable for the array. See patch 3 for more details.
This feature is only usable when CONFIG_SPARSEMEM_VMEMMAP is set.
[Overall design]:
Implementation-wise, we reuse the vmem_altmap infrastructure to override
the default allocator used by vmemmap_populate. Once the memmap is
allocated, we need a way to mark the altmap pfns used for the allocation.
If MHP_MEMMAP_{DEVICE,MEMBLOCK} flag was passed, we set up the layout of the
altmap structure at the beginning of __add_pages(), and then we call
mark_vmemmap_pages().
The flags are either MHP_MEMMAP_DEVICE or MHP_MEMMAP_MEMBLOCK, and only differ
in the way they allocate vmemmap pages within the memory blocks.
MHP_MEMMAP_MEMBLOCK:
- With this flag, we will allocate vmemmap pages in each memory block.
This means that if we hot-add a range that spans multiple memory blocks,
we will use the beginning of each memory block for the vmemmap pages.
This strategy is good for cases where the caller wants the flexibility
to hot-remove memory at a different granularity than it was added.
MHP_MEMMAP_DEVICE:
- With this flag, we will store all vmemmap pages at the beginning of
hot-added memory.
So it is a tradeoff of flexibility vs contiguous memory.
More info on the above can be found in patch#2.
Depending on which flag is passed (MHP_MEMMAP_DEVICE or MHP_MEMMAP_MEMBLOCK),
mark_vmemmap_pages() gets called at a different stage.
With MHP_MEMMAP_MEMBLOCK, we call it once we have populated the sections
fitting in a single memblock, while with MHP_MEMMAP_DEVICE we wait until all
sections have been populated.
mark_vmemmap_pages() marks the pages as vmemmap and sets some metadata.
The current layout of the vmemmap pages is:
[Head->refcount] : Nr of sections used by this altmap
[Head->private]  : Nr of vmemmap pages
[Tail->freelist] : Pointer to the head page
This is done to ease the computations we need in some places.
E.g:
Example 1)
We hot-add 1GB on x86_64 (memory block 128MB) using
MHP_MEMMAP_DEVICE:
head->_refcount = 8 sections
head->private = 4096 vmemmap pages
tail->freelist = head
Example 2)
We hot-add 1GB on x86_64 using MHP_MEMMAP_MEMBLOCK:
[at the beginning of each memblock]
head->_refcount = 1 section
head->private = 512 vmemmap pages
tail->freelist = head
We have the refcount because when using MHP_MEMMAP_DEVICE, we need to know
how long we have to defer the call to vmemmap_free().
The first pages of the hot-added range are used to create the memmap
mapping, so we cannot remove those first; otherwise we would blow up
when accessing the other pages.
Since sections are removed sequentially when we hot-remove a memory range,
we wait until we hit the last section, and then we free the whole range
via vmemmap_free() backwards.
We know that it is the last section because we decrease head->_refcount on
every pass, and once it reaches 0, we have hit the last section.
We also have to be careful with those pages during online and offline
operations. They are simply skipped: online keeps them reserved, and thus
unusable for any other purpose, while offline ignores them so they do not
block the offline operation.
One thing worth mentioning is that vmemmap pages residing in movable memory
are not a show-stopper for that memory to be offlined/migrated away.
Vmemmap pages are simply ignored in that case, and they stick around until
the sections referred to by those vmemmap pages are hot-removed.
[1] https://patchwork.kernel.org/cover/10875017/
Oscar Salvador (5):
drivers/base/memory: Remove unneeded check in
remove_memory_block_devices
mm,memory_hotplug: Introduce MHP_VMEMMAP_FLAGS
mm,memory_hotplug: Introduce Vmemmap page helpers
mm,memory_hotplug: allocate memmap from the added memory range for
sparse-vmemmap
mm,memory_hotplug: Allow userspace to enable/disable vmemmap
arch/arm64/mm/mmu.c | 5 +-
arch/powerpc/mm/init_64.c | 7 ++
arch/s390/mm/init.c | 6 ++
arch/x86/mm/init_64.c | 10 +++
drivers/acpi/acpi_memhotplug.c | 2 +-
drivers/base/memory.c | 41 +++++++++--
drivers/dax/kmem.c | 2 +-
drivers/hv/hv_balloon.c | 2 +-
drivers/s390/char/sclp_cmd.c | 2 +-
drivers/xen/balloon.c | 2 +-
include/linux/memory_hotplug.h | 31 ++++++++-
include/linux/memremap.h | 2 +-
include/linux/page-flags.h | 34 +++++++++
mm/compaction.c | 7 ++
mm/memory_hotplug.c | 152 ++++++++++++++++++++++++++++++++++-------
mm/page_alloc.c | 22 +++++-
mm/page_isolation.c | 14 +++-
mm/sparse.c | 93 +++++++++++++++++++++++++
mm/util.c | 2 +
19 files changed, 394 insertions(+), 42 deletions(-)
--
2.12.3
This patch introduces MHP_MEMMAP_DEVICE and MHP_MEMMAP_MEMBLOCK flags,
and prepares the callers that add memory to take a "flags" parameter.
This "flags" parameter will be evaluated later on in Patch#3
to init mhp_restrictions struct.
The callers are:
add_memory
__add_memory
add_memory_resource
Unfortunately, we do not have a single entry point to add memory: depending
on their requirements, callers want to hook in at different places
(e.g. Xen's reserve_additional_memory()), so we have to spread the parameter
across the three callers.
The flags are either MHP_MEMMAP_DEVICE or MHP_MEMMAP_MEMBLOCK, and only differ
in the way they allocate vmemmap pages within the memory blocks.
MHP_MEMMAP_MEMBLOCK:
- With this flag, we will allocate vmemmap pages in each memory block.
This means that if we hot-add a range that spans multiple memory blocks,
we will use the beginning of each memory block for the vmemmap pages.
This strategy is good for cases where the caller wants the flexibility
to hot-remove memory at a different granularity than it was added.
E.g:
We allocate a range (x, y] that spans 3 memory blocks, given a
memory block size of 128MB.
[memblock#0          ]
[0 - 511 pfns      ] - vmemmap for section#0
[512 - 32767 pfns  ] - normal memory
[memblock#1          ]
[32768 - 33279 pfns] - vmemmap for section#1
[33280 - 65535 pfns] - normal memory
[memblock#2          ]
[65536 - 66047 pfns] - vmemmap for section#2
[66048 - 98303 pfns] - normal memory
MHP_MEMMAP_DEVICE:
- With this flag, we will store all vmemmap pages at the beginning of
hot-added memory.
E.g:
We allocate a range (x, y] that spans 3 memory blocks, given a
memory block size of 128MB.
[memblock #0         ]
[0 - 1535 pfns     ] - vmemmap for section#{0-2}
[1536 - 98303 pfns ] - normal memory
When using larger memory blocks (1GB or 2GB), the principle is the same.
Of course, MHP_MEMMAP_DEVICE is nicer when it comes to having a large
contiguous area, while MHP_MEMMAP_MEMBLOCK gives us flexibility when
removing the memory.
Signed-off-by: Oscar Salvador <[email protected]>
---
drivers/acpi/acpi_memhotplug.c | 2 +-
drivers/base/memory.c | 2 +-
drivers/dax/kmem.c | 2 +-
drivers/hv/hv_balloon.c | 2 +-
drivers/s390/char/sclp_cmd.c | 2 +-
drivers/xen/balloon.c | 2 +-
include/linux/memory_hotplug.h | 22 +++++++++++++++++++---
mm/memory_hotplug.c | 10 +++++-----
8 files changed, 30 insertions(+), 14 deletions(-)
diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index db013dc21c02..860f84e82dd0 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -218,7 +218,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
if (node < 0)
node = memory_add_physaddr_to_nid(info->start_addr);
- result = __add_memory(node, info->start_addr, info->length);
+ result = __add_memory(node, info->start_addr, info->length, 0);
/*
* If the memory block has been used by the kernel, add_memory()
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 07ba731beb42..ad9834b8b7f7 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -516,7 +516,7 @@ static ssize_t probe_store(struct device *dev, struct device_attribute *attr,
nid = memory_add_physaddr_to_nid(phys_addr);
ret = __add_memory(nid, phys_addr,
- MIN_MEMORY_BLOCK_SIZE * sections_per_block);
+ MIN_MEMORY_BLOCK_SIZE * sections_per_block, 0);
if (ret)
goto out;
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 3d0a7e702c94..e159184e0ba0 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -65,7 +65,7 @@ int dev_dax_kmem_probe(struct device *dev)
new_res->flags = IORESOURCE_SYSTEM_RAM;
new_res->name = dev_name(dev);
- rc = add_memory(numa_node, new_res->start, resource_size(new_res));
+ rc = add_memory(numa_node, new_res->start, resource_size(new_res), 0);
if (rc) {
release_resource(new_res);
kfree(new_res);
diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index 6fb4ea5f0304..beb92bc56186 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -731,7 +731,7 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size,
nid = memory_add_physaddr_to_nid(PFN_PHYS(start_pfn));
ret = add_memory(nid, PFN_PHYS((start_pfn)),
- (HA_CHUNK << PAGE_SHIFT));
+ (HA_CHUNK << PAGE_SHIFT), 0);
if (ret) {
pr_err("hot_add memory failed error is %d\n", ret);
diff --git a/drivers/s390/char/sclp_cmd.c b/drivers/s390/char/sclp_cmd.c
index 37d42de06079..f61026c7db7e 100644
--- a/drivers/s390/char/sclp_cmd.c
+++ b/drivers/s390/char/sclp_cmd.c
@@ -406,7 +406,7 @@ static void __init add_memory_merged(u16 rn)
if (!size)
goto skip_add;
for (addr = start; addr < start + size; addr += block_size)
- add_memory(numa_pfn_to_nid(PFN_DOWN(addr)), addr, block_size);
+ add_memory(numa_pfn_to_nid(PFN_DOWN(addr)), addr, block_size, 0);
skip_add:
first_rn = rn;
num = 1;
diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 37a36c6b9f93..33814b3513ca 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -349,7 +349,7 @@ static enum bp_state reserve_additional_memory(void)
mutex_unlock(&balloon_mutex);
/* add_memory_resource() requires the device_hotplug lock */
lock_device_hotplug();
- rc = add_memory_resource(nid, resource);
+ rc = add_memory_resource(nid, resource, 0);
unlock_device_hotplug();
mutex_lock(&balloon_mutex);
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 0b8a5e5ef2da..6fdbce9d04f9 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -54,6 +54,22 @@ enum {
};
/*
+ * We want memmap (struct page array) to be allocated from the hotadded range.
+ * To do so, there are two possible ways depending on what the caller wants.
+ * 1) Allocate memmap pages per device (whole hot-added range)
+ * 2) Allocate memmap pages per memblock
+ * The former implies that we will use the beginning of the hot-added range
+ * to store the memmap pages of the whole range, while the latter implies
+ * that we will use the beginning of each memblock to store its own memmap
+ * pages.
+ * Please note that only SPARSEMEM_VMEMMAP implements this feature, and some
+ * architectures might not support it even for that memory model (e.g. s390).
+ */
+#define MHP_MEMMAP_DEVICE (1UL<<0)
+#define MHP_MEMMAP_MEMBLOCK (1UL<<1)
+#define MHP_VMEMMAP_FLAGS (MHP_MEMMAP_DEVICE|MHP_MEMMAP_MEMBLOCK)
+
+/*
* Restrictions for the memory hotplug:
* flags: MHP_ flags
* altmap: alternative allocator for memmap array
@@ -342,9 +358,9 @@ static inline void __remove_memory(int nid, u64 start, u64 size) {}
extern void __ref free_area_init_core_hotplug(int nid);
extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
void *arg, int (*func)(struct memory_block *, void *));
-extern int __add_memory(int nid, u64 start, u64 size);
-extern int add_memory(int nid, u64 start, u64 size);
-extern int add_memory_resource(int nid, struct resource *resource);
+extern int __add_memory(int nid, u64 start, u64 size, unsigned long flags);
+extern int add_memory(int nid, u64 start, u64 size, unsigned long flags);
+extern int add_memory_resource(int nid, struct resource *resource, unsigned long flags);
extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
unsigned long nr_pages, struct vmem_altmap *altmap);
extern bool is_memblock_offlined(struct memory_block *mem);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 4e8e65954f31..e4e3baa6eaa7 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1057,7 +1057,7 @@ static int online_memory_block(struct memory_block *mem, void *arg)
*
* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG
*/
-int __ref add_memory_resource(int nid, struct resource *res)
+int __ref add_memory_resource(int nid, struct resource *res, unsigned long flags)
{
struct mhp_restrictions restrictions = {};
u64 start, size;
@@ -1135,7 +1135,7 @@ int __ref add_memory_resource(int nid, struct resource *res)
}
/* requires device_hotplug_lock, see add_memory_resource() */
-int __ref __add_memory(int nid, u64 start, u64 size)
+int __ref __add_memory(int nid, u64 start, u64 size, unsigned long flags)
{
struct resource *res;
int ret;
@@ -1144,18 +1144,18 @@ int __ref __add_memory(int nid, u64 start, u64 size)
if (IS_ERR(res))
return PTR_ERR(res);
- ret = add_memory_resource(nid, res);
+ ret = add_memory_resource(nid, res, flags);
if (ret < 0)
release_memory_resource(res);
return ret;
}
-int add_memory(int nid, u64 start, u64 size)
+int add_memory(int nid, u64 start, u64 size, unsigned long flags)
{
int rc;
lock_device_hotplug();
- rc = __add_memory(nid, start, size);
+ rc = __add_memory(nid, start, size, flags);
unlock_device_hotplug();
return rc;
--
2.12.3
Introduce a set of helper functions for vmemmap pages:
- {Set,Clear,Check} the Vmemmap flag
- Given a vmemmap page, get its vmemmap head
- Get the number of vmemmap pages, taking into account the current
  position of the page
These functions will be used by the code that handles vmemmap pages.
Signed-off-by: Oscar Salvador <[email protected]>
---
include/linux/page-flags.h | 34 ++++++++++++++++++++++++++++++++++
mm/util.c | 2 ++
2 files changed, 36 insertions(+)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index b848517da64c..a8b9b57162b3 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -466,6 +466,40 @@ static __always_inline int __PageMovable(struct page *page)
PAGE_MAPPING_MOVABLE;
}
+#define VMEMMAP_PAGE (~PAGE_MAPPING_FLAGS)
+static __always_inline int PageVmemmap(struct page *page)
+{
+ return PageReserved(page) && (unsigned long)page->mapping == VMEMMAP_PAGE;
+}
+
+static __always_inline int __PageVmemmap(struct page *page)
+{
+ return (unsigned long)page->mapping == VMEMMAP_PAGE;
+}
+
+static __always_inline void __ClearPageVmemmap(struct page *page)
+{
+ __ClearPageReserved(page);
+ page->mapping = NULL;
+}
+
+static __always_inline void __SetPageVmemmap(struct page *page)
+{
+ __SetPageReserved(page);
+ page->mapping = (void *)VMEMMAP_PAGE;
+}
+
+static __always_inline struct page *vmemmap_get_head(struct page *page)
+{
+ return (struct page *)page->freelist;
+}
+
+static __always_inline unsigned long get_nr_vmemmap_pages(struct page *page)
+{
+ struct page *head = vmemmap_get_head(page);
+ return head->private - (page - head);
+}
+
#ifdef CONFIG_KSM
/*
* A KSM page is one of those write-protected "shared pages" or "merged pages"
diff --git a/mm/util.c b/mm/util.c
index 021648a8a3a3..5e20563cdef6 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -607,6 +607,8 @@ struct address_space *page_mapping(struct page *page)
mapping = page->mapping;
if ((unsigned long)mapping & PAGE_MAPPING_ANON)
return NULL;
+ if ((unsigned long)mapping == VMEMMAP_PAGE)
+ return NULL;
return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS);
}
--
2.12.3
Physical memory hotadd has to allocate a memmap (struct page array) for
the newly added memory section. Currently, alloc_pages_node() is used
for those allocations.
This has some disadvantages:
a) existing memory is consumed for that purpose
(~2MB per 128MB memory section on x86_64)
b) if the whole node is movable then we have off-node struct pages,
which have performance drawbacks.
a) has turned out to be a problem for memory hotplug based ballooning
because the userspace might not react in time to online memory while
the memory consumed during physical hotadd consumes enough memory to
push system to OOM. 31bc3858ea3e ("memory-hotplug: add automatic onlining
policy for the newly added memory") has been added to work around that
problem.
I have also seen hot-add operations failing on powerpc because
we try to use order-8 pages. If the base page size is 64KB, that
gives us 16MB, and if we run out of those, we simply fail.
One could argue that we can fall back to base pages as we do on x86_64,
but we can do better when CONFIG_SPARSEMEM_VMEMMAP is enabled.
Vmemmap page tables can map arbitrary memory.
That means that we can simply use the beginning of each memory section and
map struct pages there.
struct pages which back the allocated space then just need to be treated
carefully.
Implementation-wise, we reuse the vmem_altmap infrastructure to override
the default allocator used by __vmemmap_populate. Once the memmap is
allocated, we need a way to mark the altmap pfns used for the allocation.
If MHP_MEMMAP_{DEVICE,MEMBLOCK} flag was passed, we set up the layout of the
altmap structure at the beginning of __add_pages(), and then we call
mark_vmemmap_pages().
Depending on which flag is passed (MHP_MEMMAP_DEVICE or MHP_MEMMAP_MEMBLOCK),
mark_vmemmap_pages() gets called at a different stage.
With MHP_MEMMAP_MEMBLOCK, we call it once we have populated the sections
fitting in a single memblock, while with MHP_MEMMAP_DEVICE we wait until all
sections have been populated.
mark_vmemmap_pages() marks the pages as vmemmap and sets some metadata.
The current layout of the vmemmap pages is:
[Head->refcount] : Nr of sections used by this altmap
[Head->private]  : Nr of vmemmap pages
[Tail->freelist] : Pointer to the head page
This is done to ease the computations we need in some places.
E.g:
Example 1)
We hot-add 1GB on x86_64 (memory block 128MB) using
MHP_MEMMAP_DEVICE:
head->_refcount = 8 sections
head->private = 4096 vmemmap pages
tail->freelist = head
Example 2)
We hot-add 1GB on x86_64 using MHP_MEMMAP_MEMBLOCK:
[at the beginning of each memblock]
head->_refcount = 1 section
head->private = 512 vmemmap pages
tail->freelist = head
We have the refcount because when using MHP_MEMMAP_DEVICE, we need to know
how long we have to defer the call to vmemmap_free().
The first pages of the hot-added range are used to create the memmap
mapping, so we cannot remove those first; otherwise we would blow up
when accessing the other pages.
Since sections are removed sequentially when we hot-remove a memory range,
we wait until we hit the last section, and then we free the whole range
via vmemmap_free() backwards.
We know that it is the last section because we decrease head->_refcount on
every pass, and once it reaches 0, we have hit the last section.
We also have to be careful with those pages during online and offline
operations. They are simply skipped: online keeps them reserved, and thus
unusable for any other purpose, while offline ignores them so they do not
block the offline operation.
In the offline operation we only have to check for one particularity.
Depending on how large the hot-added range was, when using MHP_MEMMAP_DEVICE
it can be that one or more memory blocks are filled with only vmemmap pages.
We just need to check for this case and skip 1) isolation and 2) migration,
because those pages do not need to be migrated anywhere; they are
self-hosted.
Signed-off-by: Oscar Salvador <[email protected]>
---
arch/arm64/mm/mmu.c | 5 +-
arch/powerpc/mm/init_64.c | 7 +++
arch/s390/mm/init.c | 6 ++
arch/x86/mm/init_64.c | 10 +++
drivers/acpi/acpi_memhotplug.c | 2 +-
drivers/base/memory.c | 2 +-
include/linux/memory_hotplug.h | 6 ++
include/linux/memremap.h | 2 +-
mm/compaction.c | 7 +++
mm/memory_hotplug.c | 138 +++++++++++++++++++++++++++++++++++------
mm/page_alloc.c | 22 ++++++-
mm/page_isolation.c | 14 ++++-
mm/sparse.c | 93 +++++++++++++++++++++++++++
13 files changed, 289 insertions(+), 25 deletions(-)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 93ed0df4df79..d4b5661fa6b6 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -765,7 +765,10 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
if (pmd_none(READ_ONCE(*pmdp))) {
void *p = NULL;
- p = vmemmap_alloc_block_buf(PMD_SIZE, node);
+ if (altmap)
+ p = altmap_alloc_block_buf(PMD_SIZE, altmap);
+ else
+ p = vmemmap_alloc_block_buf(PMD_SIZE, node);
if (!p)
return -ENOMEM;
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index a4e17a979e45..ff9d2c245321 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -289,6 +289,13 @@ void __ref vmemmap_free(unsigned long start, unsigned long end,
if (base_pfn >= alt_start && base_pfn < alt_end) {
vmem_altmap_free(altmap, nr_pages);
+ } else if (PageVmemmap(page)) {
+ /*
+ * runtime vmemmap pages are residing inside the memory
+ * section so they do not have to be freed anywhere.
+ */
+ while (PageVmemmap(page))
+ __ClearPageVmemmap(page++);
} else if (PageReserved(page)) {
/* allocated from bootmem */
if (page_size < PAGE_SIZE) {
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index ffb81fe95c77..c045411552a3 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -226,6 +226,12 @@ int arch_add_memory(int nid, u64 start, u64 size,
unsigned long size_pages = PFN_DOWN(size);
int rc;
+ /*
+ * Physical memory is added only later during the memory online so we
+ * cannot use the added range at this stage unfortunately.
+ */
+ restrictions->flags &= ~restrictions->flags;
+
if (WARN_ON_ONCE(restrictions->altmap))
return -EINVAL;
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 688fb0687e55..00d17b666337 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -874,6 +874,16 @@ static void __meminit free_pagetable(struct page *page, int order)
unsigned long magic;
unsigned int nr_pages = 1 << order;
+ /*
+ * Runtime vmemmap pages are residing inside the memory section so
+ * they do not have to be freed anywhere.
+ */
+ if (PageVmemmap(page)) {
+ while (nr_pages--)
+ __ClearPageVmemmap(page++);
+ return;
+ }
+
/* bootmem page has reserved flag */
if (PageReserved(page)) {
__ClearPageReserved(page);
diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index 860f84e82dd0..3257edb98d90 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -218,7 +218,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
if (node < 0)
node = memory_add_physaddr_to_nid(info->start_addr);
- result = __add_memory(node, info->start_addr, info->length, 0);
+ result = __add_memory(node, info->start_addr, info->length, MHP_MEMMAP_DEVICE);
/*
* If the memory block has been used by the kernel, add_memory()
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index ad9834b8b7f7..e0ac9a3b66f8 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -32,7 +32,7 @@ static DEFINE_MUTEX(mem_sysfs_mutex);
#define to_memory_block(dev) container_of(dev, struct memory_block, dev)
-static int sections_per_block;
+int sections_per_block;
static inline int base_memory_block_id(int section_nr)
{
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 6fdbce9d04f9..e28e226c9a20 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -375,4 +375,10 @@ extern bool allow_online_pfn_range(int nid, unsigned long pfn, unsigned long nr_
int online_type);
extern struct zone *zone_for_pfn_range(int online_type, int nid, unsigned start_pfn,
unsigned long nr_pages);
+
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+extern void mark_vmemmap_pages(struct vmem_altmap *self);
+#else
+static inline void mark_vmemmap_pages(struct vmem_altmap *self) {}
+#endif
#endif /* __LINUX_MEMORY_HOTPLUG_H */
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 1732dea030b2..6de37e168f57 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -16,7 +16,7 @@ struct device;
* @alloc: track pages consumed, private to vmemmap_populate()
*/
struct vmem_altmap {
- const unsigned long base_pfn;
+ unsigned long base_pfn;
const unsigned long reserve;
unsigned long free;
unsigned long align;
diff --git a/mm/compaction.c b/mm/compaction.c
index 9e1b9acb116b..40697f74b8b4 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -855,6 +855,13 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
nr_scanned++;
page = pfn_to_page(low_pfn);
+ /*
+ * Vmemmap pages do not need to be isolated.
+ */
+ if (PageVmemmap(page)) {
+ low_pfn += get_nr_vmemmap_pages(page) - 1;
+ continue;
+ }
/*
* Check if the pageblock has already been marked skipped.
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e4e3baa6eaa7..b5106cb75795 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -42,6 +42,8 @@
#include "internal.h"
#include "shuffle.h"
+extern int sections_per_block;
+
/*
* online_page_callback contains pointer to current page onlining function.
* Initially it is generic_online_page(). If it is required it could be
@@ -279,6 +281,24 @@ static int check_pfn_span(unsigned long pfn, unsigned long nr_pages,
return 0;
}
+static void mhp_reset_altmap(unsigned long next_pfn,
+ struct vmem_altmap *altmap)
+{
+ altmap->base_pfn = next_pfn;
+ altmap->alloc = 0;
+}
+
+static void mhp_init_altmap(unsigned long pfn, unsigned long nr_pages,
+ unsigned long mhp_flags,
+ struct vmem_altmap *altmap)
+{
+ if (mhp_flags & MHP_MEMMAP_DEVICE)
+ altmap->free = nr_pages;
+ else
+ altmap->free = PAGES_PER_SECTION * sections_per_block;
+ altmap->base_pfn = pfn;
+}
+
/*
* Reasonably generic function for adding memory. It is
* expected that archs that support memory hotplug will
@@ -290,8 +310,17 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
{
unsigned long i;
int start_sec, end_sec, err;
- struct vmem_altmap *altmap = restrictions->altmap;
+ struct vmem_altmap *altmap;
+ struct vmem_altmap __memblk_altmap = {};
+ unsigned long mhp_flags = restrictions->flags;
+ unsigned long sections_added;
+
+ if (mhp_flags & MHP_VMEMMAP_FLAGS) {
+ mhp_init_altmap(pfn, nr_pages, mhp_flags, &__memblk_altmap);
+ restrictions->altmap = &__memblk_altmap;
+ }
+ altmap = restrictions->altmap;
if (altmap) {
/*
* Validate altmap is within bounds of the total request
@@ -308,9 +337,10 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
if (err)
return err;
+ sections_added = 1;
start_sec = pfn_to_section_nr(pfn);
end_sec = pfn_to_section_nr(pfn + nr_pages - 1);
- for (i = start_sec; i <= end_sec; i++) {
+ for (i = start_sec; i <= end_sec; i++, sections_added++) {
unsigned long pfns;
pfns = min(nr_pages, PAGES_PER_SECTION
@@ -320,9 +350,19 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
break;
pfn += pfns;
nr_pages -= pfns;
+
+ if (mhp_flags & MHP_MEMMAP_MEMBLOCK &&
+ !(sections_added % sections_per_block)) {
+ mark_vmemmap_pages(altmap);
+ mhp_reset_altmap(pfn, altmap);
+ }
cond_resched();
}
vmemmap_populate_print_last();
+
+ if (mhp_flags & MHP_MEMMAP_DEVICE)
+ mark_vmemmap_pages(altmap);
+
return err;
}
@@ -642,6 +682,14 @@ static int online_pages_blocks(unsigned long start, unsigned long nr_pages)
while (start < end) {
order = min(MAX_ORDER - 1,
get_order(PFN_PHYS(end) - PFN_PHYS(start)));
+ /*
+ * Check if the pfn is aligned to its order.
+ * If not, we decrement the order until it is,
+ * otherwise __free_one_page will bug us.
+ */
+ while (start & ((1 << order) - 1))
+ order--;
+
(*online_page_callback)(pfn_to_page(start), order);
onlined_pages += (1UL << order);
@@ -654,13 +702,30 @@ static int online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
void *arg)
{
unsigned long onlined_pages = *(unsigned long *)arg;
+ unsigned long pfn = start_pfn;
+ unsigned long nr_vmemmap_pages = 0;
- if (PageReserved(pfn_to_page(start_pfn)))
- onlined_pages += online_pages_blocks(start_pfn, nr_pages);
+ if (PageVmemmap(pfn_to_page(pfn))) {
+ /*
+ * Do not send vmemmap pages to the page allocator.
+ */
+ nr_vmemmap_pages = get_nr_vmemmap_pages(pfn_to_page(start_pfn));
+ nr_vmemmap_pages = min(nr_vmemmap_pages, nr_pages);
+ pfn += nr_vmemmap_pages;
+ if (nr_vmemmap_pages == nr_pages)
+ /*
+ * If the entire range contains only vmemmap pages,
+ * there are no pages left for the page allocator.
+ */
+ goto skip_online;
+ }
+ if (PageReserved(pfn_to_page(pfn)))
+ onlined_pages += online_pages_blocks(pfn, nr_pages - nr_vmemmap_pages);
+skip_online:
online_mem_sections(start_pfn, start_pfn + nr_pages);
- *(unsigned long *)arg = onlined_pages;
+ *(unsigned long *)arg = onlined_pages + nr_vmemmap_pages;
return 0;
}
@@ -1051,6 +1116,23 @@ static int online_memory_block(struct memory_block *mem, void *arg)
return device_online(&mem->dev);
}
+static bool mhp_check_correct_flags(unsigned long flags)
+{
+ if (flags & MHP_VMEMMAP_FLAGS) {
+ if (!IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP)) {
+ WARN(1, "Vmemmap capability can only be used on"
+ "CONFIG_SPARSEMEM_VMEMMAP. Ignoring flags.\n");
+ return false;
+ }
+ if ((flags & MHP_VMEMMAP_FLAGS) == MHP_VMEMMAP_FLAGS) {
+ WARN(1, "Both MHP_MEMMAP_DEVICE and MHP_MEMMAP_MEMBLOCK "
+ "were passed. Ignoring flags.\n");
+ return false;
+ }
+ }
+ return true;
+}
+
/*
* NOTE: The caller must call lock_device_hotplug() to serialize hotplug
* and online/offline operations (triggered e.g. by sysfs).
@@ -1086,6 +1168,9 @@ int __ref add_memory_resource(int nid, struct resource *res, unsigned long flags
goto error;
new_node = ret;
+ if (mhp_check_correct_flags(flags))
+ restrictions.flags = flags;
+
/* call arch's memory hotadd */
ret = arch_add_memory(nid, start, size, &restrictions);
if (ret < 0)
@@ -1518,12 +1603,14 @@ static int __ref __offline_pages(unsigned long start_pfn,
{
unsigned long pfn, nr_pages;
unsigned long offlined_pages = 0;
+ unsigned long nr_vmemmap_pages = 0;
int ret, node, nr_isolate_pageblock;
unsigned long flags;
unsigned long valid_start, valid_end;
struct zone *zone;
struct memory_notify arg;
char *reason;
+ bool skip = false;
mem_hotplug_begin();
@@ -1540,15 +1627,24 @@ static int __ref __offline_pages(unsigned long start_pfn,
node = zone_to_nid(zone);
nr_pages = end_pfn - start_pfn;
- /* set above range as isolated */
- ret = start_isolate_page_range(start_pfn, end_pfn,
- MIGRATE_MOVABLE,
- SKIP_HWPOISON | REPORT_FAILURE);
- if (ret < 0) {
- reason = "failure to isolate range";
- goto failed_removal;
+ if (PageVmemmap(pfn_to_page(start_pfn))) {
+ nr_vmemmap_pages = get_nr_vmemmap_pages(pfn_to_page(start_pfn));
+ nr_vmemmap_pages = min(nr_vmemmap_pages, nr_pages);
+ if (nr_vmemmap_pages == nr_pages)
+ skip = true;
+ }
+
+ if (!skip) {
+ /* set above range as isolated */
+ ret = start_isolate_page_range(start_pfn, end_pfn,
+ MIGRATE_MOVABLE,
+ SKIP_HWPOISON | REPORT_FAILURE);
+ if (ret < 0) {
+ reason = "failure to isolate range";
+ goto failed_removal;
+ }
+ nr_isolate_pageblock = ret;
}
- nr_isolate_pageblock = ret;
arg.start_pfn = start_pfn;
arg.nr_pages = nr_pages;
@@ -1561,6 +1657,9 @@ static int __ref __offline_pages(unsigned long start_pfn,
goto failed_removal_isolated;
}
+ if (skip)
+ goto skip_migration;
+
do {
for (pfn = start_pfn; pfn;) {
if (signal_pending(current)) {
@@ -1601,7 +1700,9 @@ static int __ref __offline_pages(unsigned long start_pfn,
We cannot do rollback at this point. */
walk_system_ram_range(start_pfn, end_pfn - start_pfn,
&offlined_pages, offline_isolated_pages_cb);
- pr_info("Offlined Pages %ld\n", offlined_pages);
+
+skip_migration:
+ pr_info("Offlined Pages %ld\n", offlined_pages + nr_vmemmap_pages);
/*
* Onlining will reset pagetype flags and makes migrate type
* MOVABLE, so just need to decrease the number of isolated
@@ -1612,11 +1713,12 @@ static int __ref __offline_pages(unsigned long start_pfn,
spin_unlock_irqrestore(&zone->lock, flags);
/* removal success */
- adjust_managed_page_count(pfn_to_page(start_pfn), -offlined_pages);
- zone->present_pages -= offlined_pages;
+ if (offlined_pages)
+ adjust_managed_page_count(pfn_to_page(start_pfn), -offlined_pages);
+ zone->present_pages -= offlined_pages + nr_vmemmap_pages;
pgdat_resize_lock(zone->zone_pgdat, &flags);
- zone->zone_pgdat->node_present_pages -= offlined_pages;
+ zone->zone_pgdat->node_present_pages -= offlined_pages + nr_vmemmap_pages;
pgdat_resize_unlock(zone->zone_pgdat, &flags);
init_per_zone_wmark_min();
@@ -1645,7 +1747,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
memory_notify(MEM_CANCEL_OFFLINE, &arg);
failed_removal:
pr_debug("memory offlining [mem %#010llx-%#010llx] failed due to %s\n",
- (unsigned long long) start_pfn << PAGE_SHIFT,
+ (unsigned long long) (start_pfn - nr_vmemmap_pages) << PAGE_SHIFT,
((unsigned long long) end_pfn << PAGE_SHIFT) - 1,
reason);
/* pushback to free area */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5b3266d63521..7a73a06c5730 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1282,9 +1282,14 @@ static void free_one_page(struct zone *zone,
static void __meminit __init_single_page(struct page *page, unsigned long pfn,
unsigned long zone, int nid)
{
- mm_zero_struct_page(page);
+ if (!__PageVmemmap(page)) {
+ /*
+ * Vmemmap pages need to preserve their state.
+ */
+ mm_zero_struct_page(page);
+ init_page_count(page);
+ }
set_page_links(page, zone, nid, pfn);
- init_page_count(page);
page_mapcount_reset(page);
page_cpupid_reset_last(page);
page_kasan_tag_reset(page);
@@ -8143,6 +8148,14 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
page = pfn_to_page(check);
+ /*
+ * Vmemmap pages do not need to be moved around.
+ */
+ if (PageVmemmap(page)) {
+ iter += get_nr_vmemmap_pages(page) - 1;
+ continue;
+ }
+
if (PageReserved(page))
goto unmovable;
@@ -8510,6 +8523,11 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
continue;
}
page = pfn_to_page(pfn);
+
+ if (PageVmemmap(page)) {
+ pfn += get_nr_vmemmap_pages(page);
+ continue;
+ }
/*
* The HWPoisoned page may be not in buddy system, and
* page_count() is not 0.
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index e3638a5bafff..128c47a27925 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -146,7 +146,7 @@ static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
static inline struct page *
__first_valid_page(unsigned long pfn, unsigned long nr_pages)
{
- int i;
+ unsigned long i;
for (i = 0; i < nr_pages; i++) {
struct page *page;
@@ -154,6 +154,10 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
page = pfn_to_online_page(pfn + i);
if (!page)
continue;
+ if (PageVmemmap(page)) {
+ i += get_nr_vmemmap_pages(page) - 1;
+ continue;
+ }
return page;
}
return NULL;
@@ -268,6 +272,14 @@ __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn,
continue;
}
page = pfn_to_page(pfn);
+ /*
+ * Vmemmap pages are not isolated. Skip them.
+ */
+ if (PageVmemmap(page)) {
+ pfn += get_nr_vmemmap_pages(page);
+ continue;
+ }
+
if (PageBuddy(page))
/*
* If the page is on a free list, it has to be on
diff --git a/mm/sparse.c b/mm/sparse.c
index b77ca21a27a4..04b395fb4463 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -635,6 +635,94 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
#endif
#ifdef CONFIG_SPARSEMEM_VMEMMAP
+void mark_vmemmap_pages(struct vmem_altmap *self)
+{
+ unsigned long pfn = self->base_pfn + self->reserve;
+ unsigned long nr_pages = self->alloc;
+ unsigned long nr_sects = self->free / PAGES_PER_SECTION;
+ unsigned long i;
+ struct page *head;
+
+ if (!nr_pages)
+ return;
+
+ pr_debug("%s: marking %px - %px as Vmemmap (%lu pages)\n",
+ __func__,
+ pfn_to_page(pfn),
+ pfn_to_page(pfn + nr_pages - 1),
+ nr_pages);
+
+ /*
+ * All allocations for the memory hotplug are the same size, so align
+ * should be 0.
+ */
+ WARN_ON(self->align);
+
+ /*
+ * Layout of vmemmap pages:
+ * [Head->refcount] : Nr sections used by this altmap
+ * [Head->private] : Nr of vmemmap pages
+ * [Tail->freelist] : Pointer to the head page
+ */
+
+ /*
+ * Head, first vmemmap page
+ */
+ head = pfn_to_page(pfn);
+ for (i = 0; i < nr_pages; i++, pfn++) {
+ struct page *page = pfn_to_page(pfn);
+
+ mm_zero_struct_page(page);
+ __SetPageVmemmap(page);
+ page->freelist = head;
+ init_page_count(page);
+ }
+ set_page_count(head, (int)nr_sects);
+ set_page_private(head, nr_pages);
+}
+/*
+ * If the range we are trying to remove was hot-added with vmemmap pages
+ * using MHP_MEMMAP_DEVICE, we need to keep track of it to know how long
+ * we have to defer the freeing.
+ * Since sections are removed sequentially in __remove_pages()->
+ * __remove_section(), we just wait until we hit the last section.
+ * Once that happens, we can trigger free_deferred_vmemmap_range() to
+ * actually free the whole memory range.
+ */
+static struct page *head_vmemmap_page = NULL;
+static bool freeing_vmemmap_range = false;
+
+static inline bool vmemmap_dec_and_test(void)
+{
+ return page_ref_dec_and_test(head_vmemmap_page);
+}
+
+static void free_deferred_vmemmap_range(unsigned long start,
+ unsigned long end)
+{
+ unsigned long nr_pages = end - start;
+ unsigned long first_section = (unsigned long)head_vmemmap_page;
+
+ while (start >= first_section) {
+ vmemmap_free(start, end, NULL);
+ end = start;
+ start -= nr_pages;
+ }
+ head_vmemmap_page = NULL;
+ freeing_vmemmap_range = false;
+}
+
+static void deferred_vmemmap_free(unsigned long start, unsigned long end)
+{
+ if (!freeing_vmemmap_range) {
+ freeing_vmemmap_range = true;
+ head_vmemmap_page = (struct page *)start;
+ }
+
+ if (vmemmap_dec_and_test())
+ free_deferred_vmemmap_range(start, end);
+}
+
static struct page *populate_section_memmap(unsigned long pfn,
unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
{
@@ -647,6 +735,11 @@ static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
unsigned long start = (unsigned long) pfn_to_page(pfn);
unsigned long end = start + nr_pages * sizeof(struct page);
+ if (PageVmemmap((struct page *)start) || freeing_vmemmap_range) {
+ deferred_vmemmap_free(start, end);
+ return;
+ }
+
vmemmap_free(start, end, altmap);
}
static void free_map_bootmem(struct page *memmap)
--
2.12.3
On 25.06.19 09:52, Oscar Salvador wrote:
> Hi,
>
> It has been while since I sent previous version [1].
>
> In this version I added some feedback I got back then, like letting
> the caller decide whether he wants allocating per memory block or
> per memory range (patch#2), and having the chance to disable vmemmap when
> users want to expose all hotpluggable memory to userspace (patch#5).
>
> [Testing]
>
> While I could test last version on powerpc, and Huawei's fellows helped me out
> testing it on arm64, this time I could only test it on x86_64.
> The codebase is quite the same, so I would not expect surprises.
>
> - x86_64: small and large memblocks (128MB, 1G and 2G)
> - Kernel module that adds memory spanning multiple memblocks
> and remove that memory in a different granularity.
>
> So far, only acpi memory hotplug uses the new flag.
> The other callers can be changed depending on their needs.
>
> Of course, more testing and feedback is appreciated.
>
> [Coverletter]
>
> This is another step to make memory hotplug more usable. The primary
> goal of this patchset is to reduce memory overhead of the hot-added
> memory (at least for SPARSEMEM_VMEMMAP memory model). The current way we use
> to populate memmap (struct page array) has two main drawbacks:
Mental note: How will it be handled if a caller specifies "Allocate
memmap from hotadded memory", but we are running under SPARSEMEM where
we can't do this?
>
> a) it consumes an additional memory until the hotadded memory itself is
> onlined and
> b) memmap might end up on a different numa node which is especially true
> for movable_node configuration.
>
> a) it is a problem especially for memory hotplug based memory "ballooning"
> solutions when the delay between physical memory hotplug and the
> onlining can lead to OOM and that led to introduction of hacks like auto
> onlining (see 31bc3858ea3e ("memory-hotplug: add automatic onlining
> policy for the newly added memory")).
>
> b) can have performance drawbacks.
>
> Another minor case is that I have seen hot-add operations failing on archs
> because they were running out of order-x pages.
> E.g On powerpc, in certain configurations, we use order-8 pages,
> and given 64KB base pagesize, that is 16MB.
> If we run out of those, we just fail the operation and we cannot add
> more memory.
At least for SPARSEMEM, we fallback to vmalloc() to work around this
issue. I haven't looked into the populate_section_memmap() internals
yet. Can you point me at the code that performs this allocation?
> We could fallback to base pages as x86_64 does, but we can do better.
>
> One way to mitigate all these issues is to simply allocate memmap array
> (which is the largest memory footprint of the physical memory hotplug)
> from the hot-added memory itself. SPARSEMEM_VMEMMAP memory model allows
> us to map any pfn range so the memory doesn't need to be online to be
> usable for the array. See patch 3 for more details.
> This feature is only usable when CONFIG_SPARSEMEM_VMEMMAP is set.
>
> [Overall design]:
>
> Implementation wise we reuse vmem_altmap infrastructure to override
> the default allocator used by vmemap_populate. Once the memmap is
> allocated we need a way to mark altmap pfns used for the allocation.
> If MHP_MEMMAP_{DEVICE,MEMBLOCK} flag was passed, we set up the layout of the
> altmap structure at the beginning of __add_pages(), and then we call
> mark_vmemmap_pages().
>
> The flags are either MHP_MEMMAP_DEVICE or MHP_MEMMAP_MEMBLOCK, and only differ
> in the way they allocate vmemmap pages within the memory blocks.
>
> MHP_MEMMAP_MEMBLOCK:
> - With this flag, we will allocate vmemmap pages in each memory block.
> This means that if we hot-add a range that spans multiple memory blocks,
> we will use the beginning of each memory block for the vmemmap pages.
> This strategy is good for cases where the caller wants the flexiblity
> to hot-remove memory in a different granularity than when it was added.
>
> MHP_MEMMAP_DEVICE:
> - With this flag, we will store all vmemmap pages at the beginning of
> hot-added memory.
>
> So it is a tradeoff of flexiblity vs contigous memory.
> More info on the above can be found in patch#2.
>
> Depending on which flag is passed (MHP_MEMMAP_DEVICE or MHP_MEMMAP_MEMBLOCK),
> mark_vmemmap_pages() gets called at a different stage.
> With MHP_MEMMAP_MEMBLOCK, we call it once we have populated the sections
> fitting in a single memblock, while with MHP_MEMMAP_DEVICE we wait until all
> sections have been populated.
>
> mark_vmemmap_pages() marks the pages as vmemmap and sets some metadata:
>
> The current layout of the Vmemmap pages are:
>
> [Head->refcount] : Nr sections used by this altmap
> [Head->private] : Nr of vmemmap pages
> [Tail->freelist] : Pointer to the head page
>
> This is done to easy the computation we need in some places.
> E.g:
>
> Example 1)
> We hot-add 1GB on x86_64 (memory block 128MB) using
> MHP_MEMMAP_DEVICE:
>
> head->_refcount = 8 sections
> head->private = 4096 vmemmap pages
> tail's->freelist = head
>
> Example 2)
> We hot-add 1GB on x86_64 using MHP_MEMMAP_MEMBLOCK:
>
> [at the beginning of each memblock]
> head->_refcount = 1 section
> head->private = 512 vmemmap pages
> tail's->freelist = head
>
> We have the refcount because when using MHP_MEMMAP_DEVICE, we need to know
> how much do we have to defer the call to vmemmap_free().
> The thing is that the first pages of the hot-added range are used to create
> the memmap mapping, so we cannot remove those first, otherwise we would blow up
> when accessing the other pages.
So, assuming we add_memory(1GB, MHP_MEMMAP_DEVICE) and then
remove_memory(128MB) of the added memory, this will work?
add_memory(8GB, MHP_MEMMAP_DEVICE)
For 8GB, we will need exactly 128MB of memmap if I did the math right.
So exactly one section. This section will still be marked as being
online (although no pages in it are actually online)?
>
> What we do is that since when we hot-remove a memory-range, sections are being
> removed sequentially, we wait until we hit the last section, and then we free
> the hole range to vmemmap_free backwards.
> We know that it is the last section because in every pass we
> decrease head->_refcount, and when it reaches 0, we got our last section.
>
> We also have to be careful about those pages during online and offline
> operations. They are simply skipped, so online will keep them
> reserved and so unusable for any other purpose and offline ignores them
> so they do not block the offline operation.
I assume that they will still be dumped normally by user space. (as they
are described by a "memory resource" and not PG_Offline)
>
> One thing worth mention is that vmemmap pages residing in movable memory is not a
> show-stopper for that memory to be offlined/migrated away.
> Vmemmap pages are just ignored in that case and they stick around until sections
> referred by those vmemmap pages are hot-removed.
>
> [1] https://patchwork.kernel.org/cover/10875017/
>
> Oscar Salvador (5):
> drivers/base/memory: Remove unneeded check in
> remove_memory_block_devices
> mm,memory_hotplug: Introduce MHP_VMEMMAP_FLAGS
> mm,memory_hotplug: Introduce Vmemmap page helpers
> mm,memory_hotplug: allocate memmap from the added memory range for
> sparse-vmemmap
> mm,memory_hotplug: Allow userspace to enable/disable vmemmap
>
> arch/arm64/mm/mmu.c | 5 +-
> arch/powerpc/mm/init_64.c | 7 ++
> arch/s390/mm/init.c | 6 ++
> arch/x86/mm/init_64.c | 10 +++
> drivers/acpi/acpi_memhotplug.c | 2 +-
> drivers/base/memory.c | 41 +++++++++--
> drivers/dax/kmem.c | 2 +-
> drivers/hv/hv_balloon.c | 2 +-
> drivers/s390/char/sclp_cmd.c | 2 +-
> drivers/xen/balloon.c | 2 +-
> include/linux/memory_hotplug.h | 31 ++++++++-
> include/linux/memremap.h | 2 +-
> include/linux/page-flags.h | 34 +++++++++
> mm/compaction.c | 7 ++
> mm/memory_hotplug.c | 152 ++++++++++++++++++++++++++++++++++-------
> mm/page_alloc.c | 22 +++++-
> mm/page_isolation.c | 14 +++-
> mm/sparse.c | 93 +++++++++++++++++++++++++
> mm/util.c | 2 +
> 19 files changed, 394 insertions(+), 42 deletions(-)
>
Thanks for doing this, this will be very helpful :)
--
Thanks,
David / dhildenb
On 25.06.19 09:52, Oscar Salvador wrote:
> This patch introduces MHP_MEMMAP_DEVICE and MHP_MEMMAP_MEMBLOCK flags,
> and prepares the callers that add memory to take a "flags" parameter.
> This "flags" parameter will be evaluated later on in Patch#3
> to init mhp_restrictions struct.
>
> The callers are:
>
> add_memory
> __add_memory
> add_memory_resource
>
> Unfortunately, we do not have a single entry point to add memory, as depending
> on the requisites of the caller, they want to hook up in different places,
> (e.g: Xen reserve_additional_memory()), so we have to spread the parameter
> in the three callers.
>
> The flags are either MHP_MEMMAP_DEVICE or MHP_MEMMAP_MEMBLOCK, and only differ
> in the way they allocate vmemmap pages within the memory blocks.
>
> MHP_MEMMAP_MEMBLOCK:
> - With this flag, we will allocate vmemmap pages in each memory block.
> This means that if we hot-add a range that spans multiple memory blocks,
> we will use the beginning of each memory block for the vmemmap pages.
> This strategy is good for cases where the caller wants the flexiblity
> to hot-remove memory in a different granularity than when it was added.
>
> E.g:
> We allocate a range (x,y], that spans 3 memory blocks, and given
> memory block size = 128MB.
> [memblock#0 ]
> [0 - 511 pfns ] - vmemmaps for section#0
> [512 - 32767 pfns ] - normal memory
>
> [memblock#1 ]
> [32768 - 33279 pfns] - vmemmaps for section#1
> [33280 - 65535 pfns] - normal memory
>
> [memblock#2 ]
> [65536 - 66047 pfns] - vmemmap for section#2
> [66048 - 98304 pfns] - normal memory
>
> MHP_MEMMAP_DEVICE:
> - With this flag, we will store all vmemmap pages at the beginning of
> hot-added memory.
>
> E.g:
> We allocate a range (x,y], that spans 3 memory blocks, and given
> memory block size = 128MB.
> [memblock #0 ]
> [0 - 1533 pfns ] - vmemmap for section#{0-2}
> [1534 - 98304 pfns] - normal memory
>
> When using larger memory blocks (1GB or 2GB), the principle is the same.
>
> Of course, MHP_MEMMAP_DEVICE is nicer when it comes to have a large contigous
> area, while MHP_MEMMAP_MEMBLOCK allows us to have flexibility when removing the
> memory.
>
> Signed-off-by: Oscar Salvador <[email protected]>
> ---
> drivers/acpi/acpi_memhotplug.c | 2 +-
> drivers/base/memory.c | 2 +-
> drivers/dax/kmem.c | 2 +-
> drivers/hv/hv_balloon.c | 2 +-
> drivers/s390/char/sclp_cmd.c | 2 +-
> drivers/xen/balloon.c | 2 +-
> include/linux/memory_hotplug.h | 22 +++++++++++++++++++---
> mm/memory_hotplug.c | 10 +++++-----
> 8 files changed, 30 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> index db013dc21c02..860f84e82dd0 100644
> --- a/drivers/acpi/acpi_memhotplug.c
> +++ b/drivers/acpi/acpi_memhotplug.c
> @@ -218,7 +218,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
> if (node < 0)
> node = memory_add_physaddr_to_nid(info->start_addr);
>
> - result = __add_memory(node, info->start_addr, info->length);
> + result = __add_memory(node, info->start_addr, info->length, 0);
>
> /*
> * If the memory block has been used by the kernel, add_memory()
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 07ba731beb42..ad9834b8b7f7 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -516,7 +516,7 @@ static ssize_t probe_store(struct device *dev, struct device_attribute *attr,
>
> nid = memory_add_physaddr_to_nid(phys_addr);
> ret = __add_memory(nid, phys_addr,
> - MIN_MEMORY_BLOCK_SIZE * sections_per_block);
> + MIN_MEMORY_BLOCK_SIZE * sections_per_block, 0);
>
> if (ret)
> goto out;
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 3d0a7e702c94..e159184e0ba0 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -65,7 +65,7 @@ int dev_dax_kmem_probe(struct device *dev)
> new_res->flags = IORESOURCE_SYSTEM_RAM;
> new_res->name = dev_name(dev);
>
> - rc = add_memory(numa_node, new_res->start, resource_size(new_res));
> + rc = add_memory(numa_node, new_res->start, resource_size(new_res), 0);
> if (rc) {
> release_resource(new_res);
> kfree(new_res);
> diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
> index 6fb4ea5f0304..beb92bc56186 100644
> --- a/drivers/hv/hv_balloon.c
> +++ b/drivers/hv/hv_balloon.c
> @@ -731,7 +731,7 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size,
>
> nid = memory_add_physaddr_to_nid(PFN_PHYS(start_pfn));
> ret = add_memory(nid, PFN_PHYS((start_pfn)),
> - (HA_CHUNK << PAGE_SHIFT));
> + (HA_CHUNK << PAGE_SHIFT), 0);
>
> if (ret) {
> pr_err("hot_add memory failed error is %d\n", ret);
> diff --git a/drivers/s390/char/sclp_cmd.c b/drivers/s390/char/sclp_cmd.c
> index 37d42de06079..f61026c7db7e 100644
> --- a/drivers/s390/char/sclp_cmd.c
> +++ b/drivers/s390/char/sclp_cmd.c
> @@ -406,7 +406,7 @@ static void __init add_memory_merged(u16 rn)
> if (!size)
> goto skip_add;
> for (addr = start; addr < start + size; addr += block_size)
> - add_memory(numa_pfn_to_nid(PFN_DOWN(addr)), addr, block_size);
> + add_memory(numa_pfn_to_nid(PFN_DOWN(addr)), addr, block_size, 0);
> skip_add:
> first_rn = rn;
> num = 1;
> diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
> index 37a36c6b9f93..33814b3513ca 100644
> --- a/drivers/xen/balloon.c
> +++ b/drivers/xen/balloon.c
> @@ -349,7 +349,7 @@ static enum bp_state reserve_additional_memory(void)
> mutex_unlock(&balloon_mutex);
> /* add_memory_resource() requires the device_hotplug lock */
> lock_device_hotplug();
> - rc = add_memory_resource(nid, resource);
> + rc = add_memory_resource(nid, resource, 0);
> unlock_device_hotplug();
> mutex_lock(&balloon_mutex);
>
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 0b8a5e5ef2da..6fdbce9d04f9 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -54,6 +54,22 @@ enum {
> };
>
> /*
> + * We want memmap (struct page array) to be allocated from the hotadded range.
> + * To do so, there are two possible ways depending on what the caller wants.
> + * 1) Allocate memmap pages per device (whole hot-added range)
> + * 2) Allocate memmap pages per memblock
> + * The former implies that we wil use the beginning of the hot-added range
s/wil/will/
> + * to store the memmap pages of the whole range, while the latter implies
> + * that we will use the beginning of each memblock to store its own memmap
> + * pages.
> + * Please note that only SPARSE_VMEMMAP implements this feature and some
> + * architectures might not support it even for that memory model (e.g. s390)
Probably rephrase to "This is only a hint, not a guarantee. Only
selected architectures support it with SPARSE_VMEMMAP."
> + */
> +#define MHP_MEMMAP_DEVICE (1UL<<0)
> +#define MHP_MEMMAP_MEMBLOCK (1UL<<1)
> +#define MHP_VMEMMAP_FLAGS (MHP_MEMMAP_DEVICE|MHP_MEMMAP_MEMBLOCK)
> +
> +/*
> * Restrictions for the memory hotplug:
> * flags: MHP_ flags
> * altmap: alternative allocator for memmap array
> @@ -342,9 +358,9 @@ static inline void __remove_memory(int nid, u64 start, u64 size) {}
> extern void __ref free_area_init_core_hotplug(int nid);
> extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
> void *arg, int (*func)(struct memory_block *, void *));
> -extern int __add_memory(int nid, u64 start, u64 size);
> -extern int add_memory(int nid, u64 start, u64 size);
> -extern int add_memory_resource(int nid, struct resource *resource);
> +extern int __add_memory(int nid, u64 start, u64 size, unsigned long flags);
> +extern int add_memory(int nid, u64 start, u64 size, unsigned long flags);
> +extern int add_memory_resource(int nid, struct resource *resource, unsigned long flags);
> extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
> unsigned long nr_pages, struct vmem_altmap *altmap);
> extern bool is_memblock_offlined(struct memory_block *mem);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 4e8e65954f31..e4e3baa6eaa7 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1057,7 +1057,7 @@ static int online_memory_block(struct memory_block *mem, void *arg)
> *
> * we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG
> */
> -int __ref add_memory_resource(int nid, struct resource *res)
> +int __ref add_memory_resource(int nid, struct resource *res, unsigned long flags)
> {
> struct mhp_restrictions restrictions = {};
> u64 start, size;
> @@ -1135,7 +1135,7 @@ int __ref add_memory_resource(int nid, struct resource *res)
> }
>
> /* requires device_hotplug_lock, see add_memory_resource() */
> -int __ref __add_memory(int nid, u64 start, u64 size)
> +int __ref __add_memory(int nid, u64 start, u64 size, unsigned long flags)
> {
> struct resource *res;
> int ret;
> @@ -1144,18 +1144,18 @@ int __ref __add_memory(int nid, u64 start, u64 size)
> if (IS_ERR(res))
> return PTR_ERR(res);
>
> - ret = add_memory_resource(nid, res);
> + ret = add_memory_resource(nid, res, flags);
> if (ret < 0)
> release_memory_resource(res);
> return ret;
> }
>
> -int add_memory(int nid, u64 start, u64 size)
> +int add_memory(int nid, u64 start, u64 size, unsigned long flags)
> {
> int rc;
>
> lock_device_hotplug();
> - rc = __add_memory(nid, start, size);
> + rc = __add_memory(nid, start, size, flags);
> unlock_device_hotplug();
>
> return rc;
>
Apart from that, looks good to me.
Reviewed-by: David Hildenbrand <[email protected]>
--
Thanks,
David / dhildenb
On 25.06.19 10:25, David Hildenbrand wrote:
> On 25.06.19 09:52, Oscar Salvador wrote:
>> Hi,
>>
>> It has been while since I sent previous version [1].
>>
>> In this version I added some feedback I got back then, like letting
>> the caller decide whether he wants allocating per memory block or
>> per memory range (patch#2), and having the chance to disable vmemmap when
>> users want to expose all hotpluggable memory to userspace (patch#5).
>>
>> [Testing]
>>
>> While I could test last version on powerpc, and Huawei's fellows helped me out
>> testing it on arm64, this time I could only test it on x86_64.
>> The codebase is quite the same, so I would not expect surprises.
>>
>> - x86_64: small and large memblocks (128MB, 1G and 2G)
>> - Kernel module that adds memory spanning multiple memblocks
>> and remove that memory in a different granularity.
>>
>> So far, only acpi memory hotplug uses the new flag.
>> The other callers can be changed depending on their needs.
>>
>> Of course, more testing and feedback is appreciated.
>>
>> [Coverletter]
>>
>> This is another step to make memory hotplug more usable. The primary
>> goal of this patchset is to reduce memory overhead of the hot-added
>> memory (at least for SPARSEMEM_VMEMMAP memory model). The current way we use
>> to populate memmap (struct page array) has two main drawbacks:
>
> Mental note: How will it be handled if a caller specifies "Allocate
> memmap from hotadded memory", but we are running under SPARSEMEM where
> we can't do this.
>
>>
>> a) it consumes an additional memory until the hotadded memory itself is
>> onlined and
>> b) memmap might end up on a different numa node which is especially true
>> for movable_node configuration.
>>
>> a) it is a problem especially for memory hotplug based memory "ballooning"
>> solutions when the delay between physical memory hotplug and the
>> onlining can lead to OOM and that led to introduction of hacks like auto
>> onlining (see 31bc3858ea3e ("memory-hotplug: add automatic onlining
>> policy for the newly added memory")).
>>
>> b) can have performance drawbacks.
>>
>> Another minor case is that I have seen hot-add operations failing on archs
>> because they were running out of order-x pages.
>> E.g On powerpc, in certain configurations, we use order-8 pages,
>> and given 64KB base pagesize, that is 16MB.
>> If we run out of those, we just fail the operation and we cannot add
>> more memory.
>
> At least for SPARSEMEM, we fallback to vmalloc() to work around this
> issue. I haven't looked into the populate_section_memmap() internals
> yet. Can you point me at the code that performs this allocation?
>
>> We could fallback to base pages as x86_64 does, but we can do better.
>>
>> One way to mitigate all these issues is to simply allocate memmap array
>> (which is the largest memory footprint of the physical memory hotplug)
>> from the hot-added memory itself. SPARSEMEM_VMEMMAP memory model allows
>> us to map any pfn range so the memory doesn't need to be online to be
>> usable for the array. See patch 3 for more details.
>> This feature is only usable when CONFIG_SPARSEMEM_VMEMMAP is set.
>>
>> [Overall design]:
>>
>> Implementation wise we reuse vmem_altmap infrastructure to override
>> the default allocator used by vmemap_populate. Once the memmap is
>> allocated we need a way to mark altmap pfns used for the allocation.
>> If MHP_MEMMAP_{DEVICE,MEMBLOCK} flag was passed, we set up the layout of the
>> altmap structure at the beginning of __add_pages(), and then we call
>> mark_vmemmap_pages().
>>
>> The flags are either MHP_MEMMAP_DEVICE or MHP_MEMMAP_MEMBLOCK, and only differ
>> in the way they allocate vmemmap pages within the memory blocks.
>>
>> MHP_MEMMAP_MEMBLOCK:
>> - With this flag, we will allocate vmemmap pages in each memory block.
>> This means that if we hot-add a range that spans multiple memory blocks,
>> we will use the beginning of each memory block for the vmemmap pages.
>> This strategy is good for cases where the caller wants the flexiblity
>> to hot-remove memory in a different granularity than when it was added.
>>
>> MHP_MEMMAP_DEVICE:
>> - With this flag, we will store all vmemmap pages at the beginning of
>> hot-added memory.
>>
>> So it is a tradeoff of flexiblity vs contigous memory.
>> More info on the above can be found in patch#2.
>>
>> Depending on which flag is passed (MHP_MEMMAP_DEVICE or MHP_MEMMAP_MEMBLOCK),
>> mark_vmemmap_pages() gets called at a different stage.
>> With MHP_MEMMAP_MEMBLOCK, we call it once we have populated the sections
>> fitting in a single memblock, while with MHP_MEMMAP_DEVICE we wait until all
>> sections have been populated.
>>
>> mark_vmemmap_pages() marks the pages as vmemmap and sets some metadata:
>>
>> The current layout of the Vmemmap pages are:
>>
>> [Head->refcount] : Nr sections used by this altmap
>> [Head->private] : Nr of vmemmap pages
>> [Tail->freelist] : Pointer to the head page
>>
>> This is done to easy the computation we need in some places.
>> E.g:
>>
>> Example 1)
>> We hot-add 1GB on x86_64 (memory block 128MB) using
>> MHP_MEMMAP_DEVICE:
>>
>> head->_refcount = 8 sections
>> head->private = 4096 vmemmap pages
>> tail's->freelist = head
>>
>> Example 2)
>> We hot-add 1GB on x86_64 using MHP_MEMMAP_MEMBLOCK:
>>
>> [at the beginning of each memblock]
>> head->_refcount = 1 section
>> head->private = 512 vmemmap pages
>> tail's->freelist = head
>>
>> We have the refcount because when using MHP_MEMMAP_DEVICE, we need to know
>> how much do we have to defer the call to vmemmap_free().
>> The thing is that the first pages of the hot-added range are used to create
>> the memmap mapping, so we cannot remove those first, otherwise we would blow up
>> when accessing the other pages.
>
> So, assuming we add_memory(1GB, MHP_MEMMAP_DEVICE) and then
> remove_memory(128MB) of the added memory, this will work?
Hmm, I guess this won't work - especially when removing the first 128MB
first, where the memmap resides.
Do we need MHP_MEMMAP_DEVICE at this point or could we start with
MHP_MEMMAP_MEMBLOCK? That "smells" like the easier case.
--
Thanks,
David / dhildenb
On 25.06.19 09:52, Oscar Salvador wrote:
> Physical memory hotadd has to allocate a memmap (struct page array) for
> the newly added memory section. Currently, alloc_pages_node() is used
> for those allocations.
>
> This has some disadvantages:
> a) existing memory is consumed for that purpose
> (~2MB per 128MB memory section on x86_64)
> b) if the whole node is movable then we have off-node struct pages
> which have performance drawbacks.
>
> a) has turned out to be a problem for memory hotplug based ballooning
> because the userspace might not react in time to online memory while
> the memory consumed during physical hotadd consumes enough memory to
> push system to OOM. 31bc3858ea3e ("memory-hotplug: add automatic onlining
> policy for the newly added memory") has been added to workaround that
> problem.
>
> I have also seen hot-add operations failing on powerpc due to the fact
> that we try to use order-8 pages. If the base page size is 64KB, this
> gives us 16MB, and if we run out of those, we simply fail.
> One could argue that we can fall back to base pages as we do on x86_64, but
> we can do better when CONFIG_SPARSEMEM_VMEMMAP is enabled.
>
> Vmemmap page tables can map arbitrary memory.
> That means that we can simply use the beginning of each memory section and
> map struct pages there.
> struct pages which back the allocated space then just need to be treated
> carefully.
>
> Implementation wise we reuse vmem_altmap infrastructure to override
> the default allocator used by __vmemmap_populate. Once the memmap is
> allocated we need a way to mark altmap pfns used for the allocation.
> If MHP_MEMMAP_{DEVICE,MEMBLOCK} flag was passed, we set up the layout of the
> altmap structure at the beginning of __add_pages(), and then we call
> mark_vmemmap_pages().
>
> Depending on which flag is passed (MHP_MEMMAP_DEVICE or MHP_MEMMAP_MEMBLOCK),
> mark_vmemmap_pages() gets called at a different stage.
> With MHP_MEMMAP_MEMBLOCK, we call it once we have populated the sections
> fitting in a single memblock, while with MHP_MEMMAP_DEVICE we wait until all
> sections have been populated.
So, only MHP_MEMMAP_DEVICE will be used. Would it make sense to only
implement one for now (after we decide which one to use), to make things
simpler?
Or do you have a real user in mind for the other?
>
> mark_vmemmap_pages() marks the pages as vmemmap and sets some metadata:
>
> The current layout of the vmemmap pages is:
>
> [Head->refcount] : Nr sections used by this altmap
> [Head->private] : Nr of vmemmap pages
> [Tail->freelist] : Pointer to the head page
>
> This is done to ease the computations we need in some places.
> E.g:
>
> Example 1)
> We hot-add 1GB on x86_64 (memory block 128MB) using
> MHP_MEMMAP_DEVICE:
>
> head->_refcount = 8 sections
> head->private = 4096 vmemmap pages
> tail's->freelist = head
>
> Example 2)
> We hot-add 1GB on x86_64 using MHP_MEMMAP_MEMBLOCK:
>
> [at the beginning of each memblock]
> head->_refcount = 1 section
> head->private = 512 vmemmap pages
> tail's->freelist = head
>
> We have the refcount because when using MHP_MEMMAP_DEVICE, we need to know
> how long we have to defer the call to vmemmap_free().
> The thing is that the first pages of the hot-added range are used to create
> the memmap mapping, so we cannot remove those first, otherwise we would blow up
> when accessing the other pages.
>
> Since sections are removed sequentially when we hot-remove a memory range,
> we wait until we hit the last section, and then we hand the whole range to
> vmemmap_free() backwards.
> We know that it is the last section because in every pass we
> decrease head->_refcount, and when it reaches 0, we have hit our last section.
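That refcount-driven backwards free can be modelled in user space roughly like this; the names, the logging scheme, and the section layout are invented for illustration:

```c
#include <assert.h>

/* User-space model of the deferred, backwards free: every section
 * removal drops the head's refcount, and only the pass that drops it
 * to zero walks back and "frees" every section range. */
struct altmap_head { int refcount; };

/* Simulate one removal call per section; records the start of each
 * range handed to the (simulated) vmemmap_free, and returns how many
 * ranges were freed. */
static int remove_sections(struct altmap_head *head, unsigned long base,
			   unsigned long sect_size, int nr_sects,
			   unsigned long *freed_starts)
{
	int freed = 0;

	for (int i = 0; i < nr_sects; i++) {
		unsigned long start = base + (unsigned long)i * sect_size;

		if (--head->refcount)
			continue;	/* defer: not the last section yet */

		/* Last section reached: free the whole range backwards,
		 * ending with the section that holds the memmap itself. */
		for (;;) {
			freed_starts[freed++] = start;
			if (start == base)
				break;
			start -= sect_size;
		}
	}
	return freed;
}
```

The key property the sketch shows is ordering: nothing is freed until the final section, and the head section (which backs the memmap) goes last.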
>
> We also have to be careful about those pages during online and offline
> operations. They are simply skipped, so online will keep them
> reserved and so unusable for any other purpose and offline ignores them
> so they do not block the offline operation.
>
> In the offline operation we only have to check for one particularity.
> Depending on how large the hot-added range was, when using MHP_MEMMAP_DEVICE
> one or more memory blocks can be filled with only vmemmap pages.
> We just need to check for this case and skip 1) isolating 2) migrating,
> because those pages do not need to be migrated anywhere, they are self-hosted.
>
> Signed-off-by: Oscar Salvador <[email protected]>
> ---
> arch/arm64/mm/mmu.c | 5 +-
> arch/powerpc/mm/init_64.c | 7 +++
> arch/s390/mm/init.c | 6 ++
> arch/x86/mm/init_64.c | 10 +++
> drivers/acpi/acpi_memhotplug.c | 2 +-
> drivers/base/memory.c | 2 +-
> include/linux/memory_hotplug.h | 6 ++
> include/linux/memremap.h | 2 +-
> mm/compaction.c | 7 +++
> mm/memory_hotplug.c | 138 +++++++++++++++++++++++++++++++++++------
> mm/page_alloc.c | 22 ++++++-
> mm/page_isolation.c | 14 ++++-
> mm/sparse.c | 93 +++++++++++++++++++++++++++
> 13 files changed, 289 insertions(+), 25 deletions(-)
>
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 93ed0df4df79..d4b5661fa6b6 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -765,7 +765,10 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
> if (pmd_none(READ_ONCE(*pmdp))) {
> void *p = NULL;
>
> - p = vmemmap_alloc_block_buf(PMD_SIZE, node);
> + if (altmap)
> + p = altmap_alloc_block_buf(PMD_SIZE, altmap);
> + else
> + p = vmemmap_alloc_block_buf(PMD_SIZE, node);
> if (!p)
> return -ENOMEM;
>
> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> index a4e17a979e45..ff9d2c245321 100644
> --- a/arch/powerpc/mm/init_64.c
> +++ b/arch/powerpc/mm/init_64.c
> @@ -289,6 +289,13 @@ void __ref vmemmap_free(unsigned long start, unsigned long end,
>
> if (base_pfn >= alt_start && base_pfn < alt_end) {
> vmem_altmap_free(altmap, nr_pages);
> + } else if (PageVmemmap(page)) {
> + /*
> + * runtime vmemmap pages are residing inside the memory
> + * section so they do not have to be freed anywhere.
> + */
> + while (PageVmemmap(page))
> + __ClearPageVmemmap(page++);
> } else if (PageReserved(page)) {
> /* allocated from bootmem */
> if (page_size < PAGE_SIZE) {
> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> index ffb81fe95c77..c045411552a3 100644
> --- a/arch/s390/mm/init.c
> +++ b/arch/s390/mm/init.c
> @@ -226,6 +226,12 @@ int arch_add_memory(int nid, u64 start, u64 size,
> unsigned long size_pages = PFN_DOWN(size);
> int rc;
>
> + /*
> + * Physical memory is added only later during the memory online so we
> + * cannot use the added range at this stage unfortunately.
> + */
> + restrictions->flags &= ~restrictions->flags;
> +
> if (WARN_ON_ONCE(restrictions->altmap))
> return -EINVAL;
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 688fb0687e55..00d17b666337 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -874,6 +874,16 @@ static void __meminit free_pagetable(struct page *page, int order)
> unsigned long magic;
> unsigned int nr_pages = 1 << order;
>
> + /*
> + * Runtime vmemmap pages are residing inside the memory section so
> + * they do not have to be freed anywhere.
> + */
> + if (PageVmemmap(page)) {
> + while (nr_pages--)
> + __ClearPageVmemmap(page++);
> + return;
> + }
> +
> /* bootmem page has reserved flag */
> if (PageReserved(page)) {
> __ClearPageReserved(page);
> diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> index 860f84e82dd0..3257edb98d90 100644
> --- a/drivers/acpi/acpi_memhotplug.c
> +++ b/drivers/acpi/acpi_memhotplug.c
> @@ -218,7 +218,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
> if (node < 0)
> node = memory_add_physaddr_to_nid(info->start_addr);
>
> - result = __add_memory(node, info->start_addr, info->length, 0);
> + result = __add_memory(node, info->start_addr, info->length, MHP_MEMMAP_DEVICE);
>
> /*
> * If the memory block has been used by the kernel, add_memory()
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index ad9834b8b7f7..e0ac9a3b66f8 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -32,7 +32,7 @@ static DEFINE_MUTEX(mem_sysfs_mutex);
>
> #define to_memory_block(dev) container_of(dev, struct memory_block, dev)
>
> -static int sections_per_block;
> +int sections_per_block;
>
> static inline int base_memory_block_id(int section_nr)
> {
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 6fdbce9d04f9..e28e226c9a20 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -375,4 +375,10 @@ extern bool allow_online_pfn_range(int nid, unsigned long pfn, unsigned long nr_
> int online_type);
> extern struct zone *zone_for_pfn_range(int online_type, int nid, unsigned start_pfn,
> unsigned long nr_pages);
> +
> +#ifdef CONFIG_SPARSEMEM_VMEMMAP
> +extern void mark_vmemmap_pages(struct vmem_altmap *self);
> +#else
> +static inline void mark_vmemmap_pages(struct vmem_altmap *self) {}
> +#endif
> #endif /* __LINUX_MEMORY_HOTPLUG_H */
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 1732dea030b2..6de37e168f57 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -16,7 +16,7 @@ struct device;
> * @alloc: track pages consumed, private to vmemmap_populate()
> */
> struct vmem_altmap {
> - const unsigned long base_pfn;
> + unsigned long base_pfn;
> const unsigned long reserve;
> unsigned long free;
> unsigned long align;
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 9e1b9acb116b..40697f74b8b4 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -855,6 +855,13 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> nr_scanned++;
>
> page = pfn_to_page(low_pfn);
> + /*
> + * Vmemmap pages do not need to be isolated.
> + */
> + if (PageVmemmap(page)) {
> + low_pfn += get_nr_vmemmap_pages(page) - 1;
> + continue;
> + }
>
> /*
> * Check if the pageblock has already been marked skipped.
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index e4e3baa6eaa7..b5106cb75795 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -42,6 +42,8 @@
> #include "internal.h"
> #include "shuffle.h"
>
> +extern int sections_per_block;
> +
> /*
> * online_page_callback contains pointer to current page onlining function.
> * Initially it is generic_online_page(). If it is required it could be
> @@ -279,6 +281,24 @@ static int check_pfn_span(unsigned long pfn, unsigned long nr_pages,
> return 0;
> }
>
> +static void mhp_reset_altmap(unsigned long next_pfn,
> + struct vmem_altmap *altmap)
> +{
> + altmap->base_pfn = next_pfn;
> + altmap->alloc = 0;
> +}
> +
> +static void mhp_init_altmap(unsigned long pfn, unsigned long nr_pages,
> + unsigned long mhp_flags,
> + struct vmem_altmap *altmap)
> +{
> + if (mhp_flags & MHP_MEMMAP_DEVICE)
> + altmap->free = nr_pages;
> + else
> + altmap->free = PAGES_PER_SECTION * sections_per_block;
> + altmap->base_pfn = pfn;
> +}
> +
> /*
> * Reasonably generic function for adding memory. It is
> * expected that archs that support memory hotplug will
> @@ -290,8 +310,17 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
> {
> unsigned long i;
> int start_sec, end_sec, err;
> - struct vmem_altmap *altmap = restrictions->altmap;
> + struct vmem_altmap *altmap;
> + struct vmem_altmap __memblk_altmap = {};
> + unsigned long mhp_flags = restrictions->flags;
> + unsigned long sections_added;
> +
> + if (mhp_flags & MHP_VMEMMAP_FLAGS) {
> + mhp_init_altmap(pfn, nr_pages, mhp_flags, &__memblk_altmap);
> + restrictions->altmap = &__memblk_altmap;
> + }
>
> + altmap = restrictions->altmap;
> if (altmap) {
> /*
> * Validate altmap is within bounds of the total request
> @@ -308,9 +337,10 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
> if (err)
> return err;
>
> + sections_added = 1;
> start_sec = pfn_to_section_nr(pfn);
> end_sec = pfn_to_section_nr(pfn + nr_pages - 1);
> - for (i = start_sec; i <= end_sec; i++) {
> + for (i = start_sec; i <= end_sec; i++, sections_added++) {
> unsigned long pfns;
>
> pfns = min(nr_pages, PAGES_PER_SECTION
> @@ -320,9 +350,19 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
> break;
> pfn += pfns;
> nr_pages -= pfns;
> +
> + if (mhp_flags & MHP_MEMMAP_MEMBLOCK &&
> + !(sections_added % sections_per_block)) {
> + mark_vmemmap_pages(altmap);
> + mhp_reset_altmap(pfn, altmap);
> + }
> cond_resched();
> }
> vmemmap_populate_print_last();
> +
> + if (mhp_flags & MHP_MEMMAP_DEVICE)
> + mark_vmemmap_pages(altmap);
> +
> return err;
> }
>
> @@ -642,6 +682,14 @@ static int online_pages_blocks(unsigned long start, unsigned long nr_pages)
> while (start < end) {
> order = min(MAX_ORDER - 1,
> get_order(PFN_PHYS(end) - PFN_PHYS(start)));
> + /*
> + * Check if the pfn is aligned to its order.
> + * If not, we decrement the order until it is,
> + * otherwise __free_one_page will bug us.
> + */
> + while (start & ((1 << order) - 1))
> + order--;
> +
> (*online_page_callback)(pfn_to_page(start), order);
>
> onlined_pages += (1UL << order);
> @@ -654,13 +702,30 @@ static int online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
> void *arg)
> {
> unsigned long onlined_pages = *(unsigned long *)arg;
> + unsigned long pfn = start_pfn;
> + unsigned long nr_vmemmap_pages = 0;
>
> - if (PageReserved(pfn_to_page(start_pfn)))
> - onlined_pages += online_pages_blocks(start_pfn, nr_pages);
> + if (PageVmemmap(pfn_to_page(pfn))) {
> + /*
> + * Do not send vmemmap pages to the page allocator.
> + */
> + nr_vmemmap_pages = get_nr_vmemmap_pages(pfn_to_page(start_pfn));
> + nr_vmemmap_pages = min(nr_vmemmap_pages, nr_pages);
> + pfn += nr_vmemmap_pages;
> + if (nr_vmemmap_pages == nr_pages)
> + /*
> + * If the entire range contains only vmemmap pages,
> + * there are no pages left for the page allocator.
> + */
> + goto skip_online;
> + }
>
> + if (PageReserved(pfn_to_page(pfn)))
> + onlined_pages += online_pages_blocks(pfn, nr_pages - nr_vmemmap_pages);
> +skip_online:
> online_mem_sections(start_pfn, start_pfn + nr_pages);
>
> - *(unsigned long *)arg = onlined_pages;
> + *(unsigned long *)arg = onlined_pages + nr_vmemmap_pages;
> return 0;
> }
>
> @@ -1051,6 +1116,23 @@ static int online_memory_block(struct memory_block *mem, void *arg)
> return device_online(&mem->dev);
> }
>
> +static bool mhp_check_correct_flags(unsigned long flags)
> +{
> + if (flags & MHP_VMEMMAP_FLAGS) {
> + if (!IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP)) {
> + WARN(1, "Vmemmap capability can only be used on "
> + "CONFIG_SPARSEMEM_VMEMMAP. Ignoring flags.\n");
> + return false;
> + }
> + if ((flags & MHP_VMEMMAP_FLAGS) == MHP_VMEMMAP_FLAGS) {
> + WARN(1, "Both MHP_MEMMAP_DEVICE and MHP_MEMMAP_MEMBLOCK "
> + "were passed. Ignoring flags.\n");
> + return false;
> + }
> + }
> + return true;
> +}
> +
> /*
> * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
> * and online/offline operations (triggered e.g. by sysfs).
> @@ -1086,6 +1168,9 @@ int __ref add_memory_resource(int nid, struct resource *res, unsigned long flags
> goto error;
> new_node = ret;
>
> + if (mhp_check_correct_flags(flags))
> + restrictions.flags = flags;
> +
> /* call arch's memory hotadd */
> ret = arch_add_memory(nid, start, size, &restrictions);
> if (ret < 0)
> @@ -1518,12 +1603,14 @@ static int __ref __offline_pages(unsigned long start_pfn,
> {
> unsigned long pfn, nr_pages;
> unsigned long offlined_pages = 0;
> + unsigned long nr_vmemmap_pages = 0;
> int ret, node, nr_isolate_pageblock;
> unsigned long flags;
> unsigned long valid_start, valid_end;
> struct zone *zone;
> struct memory_notify arg;
> char *reason;
> + bool skip = false;
>
> mem_hotplug_begin();
>
> @@ -1540,15 +1627,24 @@ static int __ref __offline_pages(unsigned long start_pfn,
> node = zone_to_nid(zone);
> nr_pages = end_pfn - start_pfn;
>
> - /* set above range as isolated */
> - ret = start_isolate_page_range(start_pfn, end_pfn,
> - MIGRATE_MOVABLE,
> - SKIP_HWPOISON | REPORT_FAILURE);
> - if (ret < 0) {
> - reason = "failure to isolate range";
> - goto failed_removal;
> + if (PageVmemmap(pfn_to_page(start_pfn))) {
> + nr_vmemmap_pages = get_nr_vmemmap_pages(pfn_to_page(start_pfn));
> + nr_vmemmap_pages = min(nr_vmemmap_pages, nr_pages);
> + if (nr_vmemmap_pages == nr_pages)
> + skip = true;
> + }
> +
> + if (!skip) {
> + /* set above range as isolated */
> + ret = start_isolate_page_range(start_pfn, end_pfn,
> + MIGRATE_MOVABLE,
> + SKIP_HWPOISON | REPORT_FAILURE);
> + if (ret < 0) {
> + reason = "failure to isolate range";
> + goto failed_removal;
> + }
> + nr_isolate_pageblock = ret;
> }
> - nr_isolate_pageblock = ret;
>
> arg.start_pfn = start_pfn;
> arg.nr_pages = nr_pages;
> @@ -1561,6 +1657,9 @@ static int __ref __offline_pages(unsigned long start_pfn,
> goto failed_removal_isolated;
> }
>
> + if (skip)
> + goto skip_migration;
> +
> do {
> for (pfn = start_pfn; pfn;) {
> if (signal_pending(current)) {
> @@ -1601,7 +1700,9 @@ static int __ref __offline_pages(unsigned long start_pfn,
> We cannot do rollback at this point. */
> walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> &offlined_pages, offline_isolated_pages_cb);
> - pr_info("Offlined Pages %ld\n", offlined_pages);
> +
> +skip_migration:
> + pr_info("Offlined Pages %ld\n", offlined_pages + nr_vmemmap_pages);
> /*
> * Onlining will reset pagetype flags and makes migrate type
> * MOVABLE, so just need to decrease the number of isolated
> @@ -1612,11 +1713,12 @@ static int __ref __offline_pages(unsigned long start_pfn,
> spin_unlock_irqrestore(&zone->lock, flags);
>
> /* removal success */
> - adjust_managed_page_count(pfn_to_page(start_pfn), -offlined_pages);
> - zone->present_pages -= offlined_pages;
> + if (offlined_pages)
> + adjust_managed_page_count(pfn_to_page(start_pfn), -offlined_pages);
> + zone->present_pages -= offlined_pages + nr_vmemmap_pages;
>
> pgdat_resize_lock(zone->zone_pgdat, &flags);
> - zone->zone_pgdat->node_present_pages -= offlined_pages;
> + zone->zone_pgdat->node_present_pages -= offlined_pages + nr_vmemmap_pages;
> pgdat_resize_unlock(zone->zone_pgdat, &flags);
>
> init_per_zone_wmark_min();
> @@ -1645,7 +1747,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
> memory_notify(MEM_CANCEL_OFFLINE, &arg);
> failed_removal:
> pr_debug("memory offlining [mem %#010llx-%#010llx] failed due to %s\n",
> - (unsigned long long) start_pfn << PAGE_SHIFT,
> + (unsigned long long) (start_pfn - nr_vmemmap_pages) << PAGE_SHIFT,
> ((unsigned long long) end_pfn << PAGE_SHIFT) - 1,
> reason);
> /* pushback to free area */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5b3266d63521..7a73a06c5730 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1282,9 +1282,14 @@ static void free_one_page(struct zone *zone,
> static void __meminit __init_single_page(struct page *page, unsigned long pfn,
> unsigned long zone, int nid)
> {
> - mm_zero_struct_page(page);
> + if (!__PageVmemmap(page)) {
> + /*
> + * Vmemmap pages need to preserve their state.
> + */
> + mm_zero_struct_page(page);
> + init_page_count(page);
> + }
> set_page_links(page, zone, nid, pfn);
> - init_page_count(page);
> page_mapcount_reset(page);
> page_cpupid_reset_last(page);
> page_kasan_tag_reset(page);
> @@ -8143,6 +8148,14 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
>
> page = pfn_to_page(check);
>
> + /*
> + * Vmemmap pages do not need to be moved around.
> + */
> + if (PageVmemmap(page)) {
> + iter += get_nr_vmemmap_pages(page) - 1;
> + continue;
> + }
> +
> if (PageReserved(page))
> goto unmovable;
>
> @@ -8510,6 +8523,11 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
> continue;
> }
> page = pfn_to_page(pfn);
> +
> + if (PageVmemmap(page)) {
> + pfn += get_nr_vmemmap_pages(page);
> + continue;
> + }
> /*
> * The HWPoisoned page may be not in buddy system, and
> * page_count() is not 0.
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index e3638a5bafff..128c47a27925 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -146,7 +146,7 @@ static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
> static inline struct page *
> __first_valid_page(unsigned long pfn, unsigned long nr_pages)
> {
> - int i;
> + unsigned long i;
>
> for (i = 0; i < nr_pages; i++) {
> struct page *page;
> @@ -154,6 +154,10 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
> page = pfn_to_online_page(pfn + i);
> if (!page)
> continue;
> + if (PageVmemmap(page)) {
> + i += get_nr_vmemmap_pages(page) - 1;
> + continue;
> + }
> return page;
> }
> return NULL;
> @@ -268,6 +272,14 @@ __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn,
> continue;
> }
> page = pfn_to_page(pfn);
> + /*
> + * Vmemmap pages are not isolated. Skip them.
> + */
> + if (PageVmemmap(page)) {
> + pfn += get_nr_vmemmap_pages(page);
> + continue;
> + }
> +
> if (PageBuddy(page))
> /*
> * If the page is on a free list, it has to be on
> diff --git a/mm/sparse.c b/mm/sparse.c
> index b77ca21a27a4..04b395fb4463 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -635,6 +635,94 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
> #endif
>
> #ifdef CONFIG_SPARSEMEM_VMEMMAP
> +void mark_vmemmap_pages(struct vmem_altmap *self)
> +{
> + unsigned long pfn = self->base_pfn + self->reserve;
> + unsigned long nr_pages = self->alloc;
> + unsigned long nr_sects = self->free / PAGES_PER_SECTION;
> + unsigned long i;
> + struct page *head;
> +
> + if (!nr_pages)
> + return;
> +
> + pr_debug("%s: marking %px - %px as Vmemmap (%ld pages)\n",
> + __func__,
> + pfn_to_page(pfn),
> + pfn_to_page(pfn + nr_pages - 1),
> + nr_pages);
> +
> + /*
> + * All allocations for the memory hotplug are the same size, so align
> + * should be 0.
> + */
> + WARN_ON(self->align);
> +
> + /*
> + * Layout of vmemmap pages:
> + * [Head->refcount] : Nr sections used by this altmap
> + * [Head->private] : Nr of vmemmap pages
> + * [Tail->freelist] : Pointer to the head page
> + */
> +
> + /*
> + * Head, first vmemmap page
> + */
> + head = pfn_to_page(pfn);
> + for (i = 0; i < nr_pages; i++, pfn++) {
> + struct page *page = pfn_to_page(pfn);
> +
> + mm_zero_struct_page(page);
> + __SetPageVmemmap(page);
> + page->freelist = head;
> + init_page_count(page);
> + }
> + set_page_count(head, (int)nr_sects);
> + set_page_private(head, nr_pages);
> +}
> +/*
> + * If the range we are trying to remove was hot-added with vmemmap pages
> + * using MHP_MEMMAP_DEVICE, we need to keep track of it to know how long
> + * we have to defer the freeing.
> + * Since sections are removed sequentially in __remove_pages()->
> + * __remove_section(), we just wait until we hit the last section.
> + * Once that happens, we can trigger free_deferred_vmemmap_range to actually
> + * free the whole memory-range.
> + */
> +static struct page *head_vmemmap_page = NULL;
> +static bool freeing_vmemmap_range = false;
> +
> +static inline bool vmemmap_dec_and_test(void)
> +{
> + return page_ref_dec_and_test(head_vmemmap_page);
> +}
> +
> +static void free_deferred_vmemmap_range(unsigned long start,
> + unsigned long end)
> +{
> + unsigned long nr_pages = end - start;
> + unsigned long first_section = (unsigned long)head_vmemmap_page;
> +
> + while (start >= first_section) {
> + vmemmap_free(start, end, NULL);
> + end = start;
> + start -= nr_pages;
> + }
> + head_vmemmap_page = NULL;
> + freeing_vmemmap_range = false;
> +}
> +
> +static void deferred_vmemmap_free(unsigned long start, unsigned long end)
> +{
> + if (!freeing_vmemmap_range) {
> + freeing_vmemmap_range = true;
> + head_vmemmap_page = (struct page *)start;
> + }
> +
> + if (vmemmap_dec_and_test())
> + free_deferred_vmemmap_range(start, end);
> +}
> +
> static struct page *populate_section_memmap(unsigned long pfn,
> unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
> {
> @@ -647,6 +735,11 @@ static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
> unsigned long start = (unsigned long) pfn_to_page(pfn);
> unsigned long end = start + nr_pages * sizeof(struct page);
>
> + if (PageVmemmap((struct page *)start) || freeing_vmemmap_range) {
> + deferred_vmemmap_free(start, end);
> + return;
> + }
> +
> vmemmap_free(start, end, altmap);
> }
> static void free_map_bootmem(struct page *memmap)
>
--
Thanks,
David / dhildenb
On Tue, Jun 25, 2019 at 10:25:48AM +0200, David Hildenbrand wrote:
> > [Coverletter]
> >
> > This is another step to make memory hotplug more usable. The primary
> > goal of this patchset is to reduce memory overhead of the hot-added
> > memory (at least for SPARSEMEM_VMEMMAP memory model). The current way we use
> > to populate memmap (struct page array) has two main drawbacks:
First off, thanks for looking into this :-)
>
> Mental note: How will it be handled if a caller specifies "Allocate
> memmap from hotadded memory", but we are running under SPARSEMEM where
> we can't do this.
In add_memory_resource(), we have a call to mhp_check_correct_flags(), which is
in charge of checking if the flags passed are compliant with our configuration
among other things.
It also checks if both flags were passed (_MEMBLOCK|_DEVICE).
If a) any of the flags were specified and we are not on CONFIG_SPARSEMEM_VMEMMAP,
b) the flags are colliding with each other or c) the flags just do not make sense,
we print out a warning and drop the flags to 0, so we just ignore them.
I just realized that I can adjust the check even more (something for the next
version).
But to answer your question: the flags are ignored under !CONFIG_SPARSEMEM_VMEMMAP.
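A stand-alone sketch of that sanity check might look like the following; the flag values and helper name are made up here (the kernel side uses IS_ENABLED() and WARN(), and drops the flags rather than returning them):

```c
#include <stdbool.h>

/* Hypothetical flag values; the real definitions live in the patch. */
#define MHP_MEMMAP_DEVICE   (1UL << 0)
#define MHP_MEMMAP_MEMBLOCK (1UL << 1)
#define MHP_VMEMMAP_FLAGS   (MHP_MEMMAP_DEVICE | MHP_MEMMAP_MEMBLOCK)

/* Stand-in for IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP). */
static bool config_sparsemem_vmemmap = true;

/* Returns the flags to actually use: unchanged when the combination is
 * valid, 0 (i.e. "ignore them") when it cannot be honoured. */
static unsigned long sanitize_mhp_flags(unsigned long flags)
{
	if (!(flags & MHP_VMEMMAP_FLAGS))
		return flags;
	if (!config_sparsemem_vmemmap)
		return 0;	/* vmemmap flags need SPARSEMEM_VMEMMAP */
	if ((flags & MHP_VMEMMAP_FLAGS) == MHP_VMEMMAP_FLAGS)
		return 0;	/* _DEVICE and _MEMBLOCK are exclusive */
	return flags;
}
```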
>
> >
> > a) it consumes an additional memory until the hotadded memory itself is
> > onlined and
> > b) memmap might end up on a different numa node which is especially true
> > for movable_node configuration.
> >
> > a) it is a problem especially for memory hotplug based memory "ballooning"
> > solutions when the delay between physical memory hotplug and the
> > onlining can lead to OOM and that led to introduction of hacks like auto
> > onlining (see 31bc3858ea3e ("memory-hotplug: add automatic onlining
> > policy for the newly added memory")).
> >
> > b) can have performance drawbacks.
> >
> > Another minor case is that I have seen hot-add operations failing on archs
> > because they were running out of order-x pages.
> > E.g On powerpc, in certain configurations, we use order-8 pages,
> > and given 64KB base pagesize, that is 16MB.
> > If we run out of those, we just fail the operation and we cannot add
> > more memory.
>
> At least for SPARSEMEM, we fallback to vmalloc() to work around this
> issue. I haven't looked into the populate_section_memmap() internals
> yet. Can you point me at the code that performs this allocation?
Yes, on SPARSEMEM we first try to allocate the pages physically contiguous,
and then fall back to vmalloc.
This is because with the CONFIG_SPARSEMEM memory model, the pfn_to_page/
page_to_pfn translations do not expect the memory to be contiguous.
But that is not the case with CONFIG_SPARSEMEM_VMEMMAP.
There we do expect the memory to be physically contiguous, which is why a
simple pfn_to_page/page_to_pfn is just a matter of adding/subtracting
vmemmap/pfn.
Powerpc code is at:
https://elixir.bootlin.com/linux/v5.2-rc6/source/arch/powerpc/mm/init_64.c#L175
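A toy user-space model of why this matters; purely illustrative, not kernel code:

```c
#include <assert.h>
#include <stdlib.h>

/* With SPARSEMEM_VMEMMAP the memmap is virtually contiguous, so both
 * translations collapse to pointer arithmetic against one base. */
struct page { unsigned long flags; };

static struct page *vmemmap;	/* virtually contiguous struct page array */

static struct page *pfn_to_page(unsigned long pfn)
{
	return vmemmap + pfn;			/* one addition */
}

static unsigned long page_to_pfn(struct page *page)
{
	return (unsigned long)(page - vmemmap);	/* one subtraction */
}
```

This is the property the series relies on: because the translation is pure arithmetic, the memmap backing a range must itself be mapped contiguously in the vmemmap virtual area.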
> So, assuming we add_memory(1GB, MHP_MEMMAP_DEVICE) and then
> remove_memory(128MB) of the added memory, this will work?
No, MHP_MEMMAP_DEVICE is meant to be used when hot-adding and hot-removing work
in the same granularity.
This is because all memmap pages will be stored at the beginning of the memory
range.
Allowing hot-removal in a different granularity with MHP_MEMMAP_DEVICE would
imply a lot of extra work.
For example, we would have to parse the vmemmap-head of the hot-removed range,
punch a hole in there to clear the vmemmap pages, and then be very careful
when deleting those page tables.
So I followed Michal's advice, and I decided to let the caller specify whether
they want to allocate per memory block or per hot-added range (device).
Allocating per memory block allows us to do:
add_memory(1GB, MHP_MEMMAP_MEMBLOCK)
remove_memory(128MB)
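The difference in granularity can be sketched like this; the constants and names are assumptions (128MB blocks, 4K pages), not taken from the patch:

```c
#include <assert.h>

/* 128MB memory blocks with 4K pages (assumed). */
#define PAGES_PER_BLOCK 32768UL

/* With the per-memblock flag every block gets its own vmemmap head,
 * so blocks can later be removed independently; the per-device flag
 * produces a single head for the whole hot-added range, so the range
 * must be removed as one unit. */
static int count_vmemmap_heads(unsigned long nr_pages, int per_memblock)
{
	if (!per_memblock)
		return 1;	/* MHP_MEMMAP_DEVICE: one head up front */

	int heads = 0;
	/* MHP_MEMMAP_MEMBLOCK: one mark_vmemmap_pages() pass per block */
	for (unsigned long pfn = 0; pfn < nr_pages; pfn += PAGES_PER_BLOCK)
		heads++;
	return heads;
}
```

Under these assumptions a 1GB hot-add ends up with 8 independent heads in the MEMBLOCK case (one per 128MB block, hence removable per block) and a single head in the DEVICE case.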
> add_memory(8GB, MHP_MEMMAP_DEVICE)
>
> For 8GB, we will need exactly 128MB of memmap if I did the math right.
> So exactly one section. This section will still be marked as being
> online (although no pages on it are actually online)?
Yap, 8GB will fill the first section with vmemmap pages.
It will be marked as online, yes.
This is so we do not diverge too much from what we have right now, and do not
start treating some sections differently than others.
E.g.: early sections that are used for memmap pages during the early boot stage.
> >
> > Since sections are removed sequentially when we hot-remove a memory range,
> > we wait until we hit the last section, and then we hand the whole range to
> > vmemmap_free() backwards.
> > We know that it is the last section because in every pass we
> > decrease head->_refcount, and when it reaches 0, we have hit our last section.
> >
> > We also have to be careful about those pages during online and offline
> > operations. They are simply skipped, so online will keep them
> > reserved and so unusable for any other purpose and offline ignores them
> > so they do not block the offline operation.
>
> I assume that they will still be dumped normally by user space. (as they
> are described by a "memory resource" and not PG_Offline)
They are PG_Reserved.
Anyway, do you mean by the crash tool?
--
Oscar Salvador
SUSE L3
On 26.06.19 10:03, Oscar Salvador wrote:
> On Tue, Jun 25, 2019 at 10:25:48AM +0200, David Hildenbrand wrote:
>>> [Coverletter]
>>>
>>> This is another step to make memory hotplug more usable. The primary
>>> goal of this patchset is to reduce memory overhead of the hot-added
>>> memory (at least for SPARSEMEM_VMEMMAP memory model). The current way we use
>>> to populate memmap (struct page array) has two main drawbacks:
>
> First off, thanks for looking into this :-)
Thanks for working on this ;)
>
>>
>> Mental note: How will it be handled if a caller specifies "Allocate
>> memmap from hotadded memory", but we are running under SPARSEMEM where
>> we can't do this.
>
> In add_memory_resource(), we have a call to mhp_check_correct_flags(), which is
> in charge of checking if the flags passed are compliant with our configuration
> among other things.
> It also checks if both flags were passed (_MEMBLOCK|_DEVICE).
>
> If a) any of the flags were specified and we are not on CONFIG_SPARSEMEM_VMEMMAP,
> b) the flags are colliding with each other or c) the flags just do not make sense,
> we print out a warning and drop the flags to 0, so we just ignore them.
>
> I just realized that I can adjust the check even more (something for the next
> version).
>
> But to answer your question, flags are ignored under !CONFIG_SPARSEMEM_VMEMMAP.
So it is indeed a hint only.
>
>>
>>>
>>> a) it consumes an additional memory until the hotadded memory itself is
>>> onlined and
>>> b) memmap might end up on a different numa node which is especially true
>>> for movable_node configuration.
>>>
>>> a) it is a problem especially for memory hotplug based memory "ballooning"
>>> solutions when the delay between physical memory hotplug and the
>>> onlining can lead to OOM and that led to introduction of hacks like auto
>>> onlining (see 31bc3858ea3e ("memory-hotplug: add automatic onlining
>>> policy for the newly added memory")).
>>>
>>> b) can have performance drawbacks.
>>>
>>> Another minor case is that I have seen hot-add operations failing on archs
>>> because they were running out of order-x pages.
>>> E.g On powerpc, in certain configurations, we use order-8 pages,
>>> and given 64KB base pagesize, that is 16MB.
>>> If we run out of those, we just fail the operation and we cannot add
>>> more memory.
>>
>> At least for SPARSEMEM, we fallback to vmalloc() to work around this
>> issue. I haven't looked into the populate_section_memmap() internals
>> yet. Can you point me at the code that performs this allocation?
>
> Yes, on SPARSEMEM we first try to allocate the pages physically contiguous, and
> then fall back to vmalloc.
> This is because on CONFIG_SPARSEMEM memory model, the translations pfn_to_page/
> page_to_pfn do not expect the memory to be contiguous.
>
> But that is not the case on CONFIG_SPARSEMEM_VMEMMAP.
> We do expect the memory to be physically contiguous there; that is why a simple
> pfn_to_page/page_to_pfn is just a matter of adding/subtracting vmemmap/pfn.
Yes, I explored that last week but didn't figure out where the actual
vmemmap population code resided - thanks :)
>
> Powerpc code is at:
>
> https://elixir.bootlin.com/linux/v5.2-rc6/source/arch/powerpc/mm/init_64.c#L175
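The contiguous-vmemmap translation described above can be modelled as plain offset arithmetic (a toy model with a made-up backing array, not the kernel's actual macros):

```c
#include <assert.h>

/* Toy struct page; the real one is ~64 bytes on x86_64. */
struct page { unsigned long flags; };

/* Pretend this is the virtually contiguous vmemmap backing pfns 0..1023. */
static struct page vmemmap[1024];

/*
 * With CONFIG_SPARSEMEM_VMEMMAP the memmap is virtually contiguous,
 * so both translations reduce to adding/subtracting the vmemmap base:
 */
static struct page *pfn_to_page(unsigned long pfn)
{
	return vmemmap + pfn;
}

static unsigned long page_to_pfn(struct page *page)
{
	return (unsigned long)(page - vmemmap);
}
```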
>
>
>
>> So, assuming we add_memory(1GB, MHP_MEMMAP_DEVICE) and then
>> remove_memory(128MB) of the added memory, this will work?
>
> No, MHP_MEMMAP_DEVICE is meant to be used when hot-adding and hot-removing work
> in the same granularity.
> This is because all memmap pages will be stored at the beginning of the memory
> range.
> Allowing hot-removing in a different granularity with MHP_MEMMAP_DEVICE would imply
> a lot of extra work.
> For example, we would have to parse the vmemmap-head of the hot-removed range,
> punch a hole in there to clear the vmemmap pages, and then be very careful
> when deleting those page tables.
>
> So I followed Michal's advice, and I decided to let the caller specify whether
> they want to allocate per memory block or per hot-added range (device).
> Where per memory block, allows us to do:
>
> add_memory(1GB, MHP_MEMMAP_MEMBLOCK)
> remove_memory(128MB)
Back then, I already mentioned that we might have some users that
remove_memory() memory in a granularity it was never added in. My
concerns back then were never fully sorted out.
arch/powerpc/platforms/powernv/memtrace.c
- Will remove memory in memory block size chunks it never added
- What if that memory resides on a DIMM added via MHP_MEMMAP_DEVICE?
Will it at least bail out? Or simply break?
IOW: I am not yet 100% convinced that MHP_MEMMAP_DEVICE is safe to be
introduced.
--
Thanks,
David / dhildenb
On Tue, Jun 25, 2019 at 10:49:10AM +0200, David Hildenbrand wrote:
> On 25.06.19 09:52, Oscar Salvador wrote:
> > Physical memory hotadd has to allocate a memmap (struct page array) for
> > the newly added memory section. Currently, alloc_pages_node() is used
> > for those allocations.
> >
> > This has some disadvantages:
> > a) existing memory is consumed for that purpose
> > (~2MB per 128MB memory section on x86_64)
> > b) if the whole node is movable then we have off-node struct pages
> > which has performance drawbacks.
> >
> > a) has turned out to be a problem for memory hotplug based ballooning
> > because the userspace might not react in time to online memory while
> > the memory consumed during physical hotadd consumes enough memory to
> > push system to OOM. 31bc3858ea3e ("memory-hotplug: add automatic onlining
> > policy for the newly added memory") has been added to work around that
> > problem.
> >
> > I have also seen hot-add operations failing on powerpc due to the fact
> > that we try to use order-8 pages. If the base page size is 64KB, this
> > gives us 16MB, and if we run out of those, we simply fail.
> > One could argue that we can fall back to base pages as we do on x86_64, but
> > we can do better when CONFIG_SPARSEMEM_VMEMMAP is enabled.
> >
> > Vmemmap page tables can map arbitrary memory.
> > That means that we can simply use the beginning of each memory section and
> > map struct pages there.
> > struct pages which back the allocated space then just need to be treated
> > carefully.
> >
> > Implementation-wise, we reuse the vmem_altmap infrastructure to override
> > the default allocator used by __vmemmap_populate. Once the memmap is
> > allocated we need a way to mark altmap pfns used for the allocation.
> > If MHP_MEMMAP_{DEVICE,MEMBLOCK} flag was passed, we set up the layout of the
> > altmap structure at the beginning of __add_pages(), and then we call
> > mark_vmemmap_pages().
> >
> > Depending on which flag is passed (MHP_MEMMAP_DEVICE or MHP_MEMMAP_MEMBLOCK),
> > mark_vmemmap_pages() gets called at a different stage.
> > With MHP_MEMMAP_MEMBLOCK, we call it once we have populated the sections
> > fitting in a single memblock, while with MHP_MEMMAP_DEVICE we wait until all
> > sections have been populated.
>
> So, only MHP_MEMMAP_DEVICE will be used. Would it make sense to only
> implement one for now (after we decide which one to use), to make things
> simpler?
>
> Or do you have a real user in mind for the other?
Currently, only MHP_MEMMAP_DEVICE will be used, as we only pass flags from
the acpi memory-hotplug path.
All the others (Hyper-V, Xen, ...) will have to be evaluated to see which one
they want to use.
Although MHP_MEMMAP_DEVICE is the only one used right now, I introduced
MHP_MEMMAP_MEMBLOCK to give the callers the choice of using MHP_MEMMAP_MEMBLOCK
if they think that a strategy where hot-removing works in a different granularity
makes sense.
Moreover, since they both use the same API, there is no extra code needed to
handle it. (Just two lines in __add_pages())
This arose here [1].
[1] https://patchwork.kernel.org/project/linux-mm/list/?submitter=137061
--
Oscar Salvador
SUSE L3
On Wed, Jun 26, 2019 at 10:11:06AM +0200, David Hildenbrand wrote:
> Back then, I already mentioned that we might have some users that
> remove_memory() they never added in a granularity it wasn't added. My
> concerns back then were never fully sorted out.
>
> arch/powerpc/platforms/powernv/memtrace.c
>
> - Will remove memory in memory block size chunks it never added
> - What if that memory resides on a DIMM added via MHP_MEMMAP_DEVICE?
>
> Will it at least bail out? Or simply break?
>
> IOW: I am not yet 100% convinced that MHP_MEMMAP_DEVICE is save to be
> introduced.
Uhm, I will take a closer look and see if I can clear your concerns.
TBH, I did not try to use arch/powerpc/platforms/powernv/memtrace.c
yet.
I will get back to you once I tried it out.
--
Oscar Salvador
SUSE L3
On 26.06.19 10:13, Oscar Salvador wrote:
> On Tue, Jun 25, 2019 at 10:49:10AM +0200, David Hildenbrand wrote:
>> On 25.06.19 09:52, Oscar Salvador wrote:
>>> Physical memory hotadd has to allocate a memmap (struct page array) for
>>> the newly added memory section. Currently, alloc_pages_node() is used
>>> for those allocations.
>>>
>>> This has some disadvantages:
>>> a) an existing memory is consumed for that purpose
>>> (~2MB per 128MB memory section on x86_64)
>>> b) if the whole node is movable then we have off-node struct pages
>>> which has performance drawbacks.
>>>
>>> a) has turned out to be a problem for memory hotplug based ballooning
>>> because the userspace might not react in time to online memory while
>>> the memory consumed during physical hotadd consumes enough memory to
>>> push system to OOM. 31bc3858ea3e ("memory-hotplug: add automatic onlining
>>> policy for the newly added memory") has been added to workaround that
>>> problem.
>>>
>>> I have also seen hot-add operations failing on powerpc due to the fact
>>> that we try to use order-8 pages. If the base page size is 64KB, this
>>> gives us 16MB, and if we run out of those, we simply fail.
>>> One could arge that we can fall back to basepages as we do in x86_64, but
>>> we can do better when CONFIG_SPARSEMEM_VMEMMAP is enabled.
>>>
>>> Vmemap page tables can map arbitrary memory.
>>> That means that we can simply use the beginning of each memory section and
>>> map struct pages there.
>>> struct pages which back the allocated space then just need to be treated
>>> carefully.
>>>
>>> Implementation wise we reuse vmem_altmap infrastructure to override
>>> the default allocator used by __vmemap_populate. Once the memmap is
>>> allocated we need a way to mark altmap pfns used for the allocation.
>>> If MHP_MEMMAP_{DEVICE,MEMBLOCK} flag was passed, we set up the layout of the
>>> altmap structure at the beginning of __add_pages(), and then we call
>>> mark_vmemmap_pages().
>>>
>>> Depending on which flag is passed (MHP_MEMMAP_DEVICE or MHP_MEMMAP_MEMBLOCK),
>>> mark_vmemmap_pages() gets called at a different stage.
>>> With MHP_MEMMAP_MEMBLOCK, we call it once we have populated the sections
>>> fitting in a single memblock, while with MHP_MEMMAP_DEVICE we wait until all
>>> sections have been populated.
>>
>> So, only MHP_MEMMAP_DEVICE will be used. Would it make sense to only
>> implement one for now (after we decide which one to use), to make things
>> simpler?
>>
>> Or do you have a real user in mind for the other?
>
> Currently, only MHP_MEMMAP_DEVICE will be used, as we only pass flags from
> acpi memory-hotplug path.
>
> All the others: hyper-v, Xen,... will have to be evaluated to see which one
> do they want to use.
>
> Although MHP_MEMMAP_DEVICE is the only one used right now, I introduced
> MHP_MEMMAP_MEMBLOCK to give the callers the choice of using MHP_MEMMAP_MEMBLOCK
> if they think that a strategy where hot-removing works in a different granularity
> makes sense.
>
> Moreover, since they both use the same API, there is no extra code needed to
> handle it. (Just two lines in __add_pages())
>
> This arose here [1].
>
> [1] https://patchwork.kernel.org/project/linux-mm/list/?submitter=137061
>
Just noting that you can emulate MHP_MEMMAP_MEMBLOCK via
MHP_MEMMAP_DEVICE by adding memory in memory block granularity (which is
what hyper-v and xen do if I am not wrong!).
Not yet convinced that both, MHP_MEMMAP_MEMBLOCK and MHP_MEMMAP_DEVICE
are needed. But we can sort that out later.
--
Thanks,
David / dhildenb
Hello Oscar,
On 06/25/2019 01:22 PM, Oscar Salvador wrote:
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 93ed0df4df79..d4b5661fa6b6 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -765,7 +765,10 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
> if (pmd_none(READ_ONCE(*pmdp))) {
> void *p = NULL;
>
> - p = vmemmap_alloc_block_buf(PMD_SIZE, node);
> + if (altmap)
> + p = altmap_alloc_block_buf(PMD_SIZE, altmap);
> + else
> + p = vmemmap_alloc_block_buf(PMD_SIZE, node);
> if (!p)
> return -ENOMEM;
Is this really required to be part of this series? I have ongoing work
(reworked https://patchwork.kernel.org/patch/10882781/) enabling altmap
support on arm64 during the memory hot-add and hot-remove paths, which is
waiting on arm64 memory hot-remove to be merged first.
- Anshuman
On Wed, Jun 26, 2019 at 10:15:16AM +0200, Oscar Salvador wrote:
> On Wed, Jun 26, 2019 at 10:11:06AM +0200, David Hildenbrand wrote:
> > Back then, I already mentioned that we might have some users that
> > remove_memory() they never added in a granularity it wasn't added. My
> > concerns back then were never fully sorted out.
> >
> > arch/powerpc/platforms/powernv/memtrace.c
> >
> > - Will remove memory in memory block size chunks it never added
> > - What if that memory resides on a DIMM added via MHP_MEMMAP_DEVICE?
> >
> > Will it at least bail out? Or simply break?
> >
> > IOW: I am not yet 100% convinced that MHP_MEMMAP_DEVICE is save to be
> > introduced.
>
> Uhm, I will take a closer look and see if I can clear your concerns.
> TBH, I did not try to use arch/powerpc/platforms/powernv/memtrace.c
> yet.
>
> I will get back to you once I tried it out.
On a second thought, it would be quite trivial to implement a check in
remove_memory() that does not allow to remove memory used with MHP_MEMMAP_DEVICE
in a different granularity:
+static int check_vmemmap_granularity(u64 start, u64 size)
+{
+	unsigned long pfn;
+	unsigned long nr_pages;
+	struct page *p;
+
+	pfn = PHYS_PFN(start);
+	p = pfn_to_page(pfn);
+	nr_pages = size >> PAGE_SHIFT;
+
+	if (PageVmemmap(p)) {
+		struct page *h = vmemmap_get_head(p);
+		unsigned long sections = (unsigned long)h->private;
+
+		if (sections * PAGES_PER_SECTION > nr_pages)
+			return -EINVAL;
+	}
+
+	return 0;
+}
+
+
static int __ref try_remove_memory(int nid, u64 start, u64 size)
{
int rc = 0;
BUG_ON(check_hotplug_memory_range(start, size));
mem_hotplug_begin();
+ rc = check_vmemmap_granularity(start, size);
+ if (rc)
+ goto done;
The above is quite hacky, but it gives an idea.
I will try the code from arch/powerpc/platforms/powernv/memtrace.c and see how
I can implement a check.
--
Oscar Salvador
SUSE L3
On Wed, Jun 26, 2019 at 01:47:32PM +0530, Anshuman Khandual wrote:
> Hello Oscar,
>
> On 06/25/2019 01:22 PM, Oscar Salvador wrote:
> > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > index 93ed0df4df79..d4b5661fa6b6 100644
> > --- a/arch/arm64/mm/mmu.c
> > +++ b/arch/arm64/mm/mmu.c
> > @@ -765,7 +765,10 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
> > if (pmd_none(READ_ONCE(*pmdp))) {
> > void *p = NULL;
> >
> > - p = vmemmap_alloc_block_buf(PMD_SIZE, node);
> > + if (altmap)
> > + p = altmap_alloc_block_buf(PMD_SIZE, altmap);
> > + else
> > + p = vmemmap_alloc_block_buf(PMD_SIZE, node);
> > if (!p)
> > return -ENOMEM;
>
> Is this really required to be part of this series ? I have an ongoing work
> (reworked https://patchwork.kernel.org/patch/10882781/) enabling altmap
> support on arm64 during memory hot add and remove path which is waiting on
> arm64 memory-hot remove to be merged first.
Hi Anshuman,
I can drop this chunk in the next version.
No problem.
--
Oscar Salvador
SUSE L3
On 26.06.19 10:15, Oscar Salvador wrote:
> On Wed, Jun 26, 2019 at 10:11:06AM +0200, David Hildenbrand wrote:
>> Back then, I already mentioned that we might have some users that
>> remove_memory() they never added in a granularity it wasn't added. My
>> concerns back then were never fully sorted out.
>>
>> arch/powerpc/platforms/powernv/memtrace.c
>>
>> - Will remove memory in memory block size chunks it never added
>> - What if that memory resides on a DIMM added via MHP_MEMMAP_DEVICE?
>>
>> Will it at least bail out? Or simply break?
>>
>> IOW: I am not yet 100% convinced that MHP_MEMMAP_DEVICE is save to be
>> introduced.
>
> Uhm, I will take a closer look and see if I can clear your concerns.
> TBH, I did not try to use arch/powerpc/platforms/powernv/memtrace.c
> yet.
>
> I will get back to you once I tried it out.
>
BTW, I consider the code in arch/powerpc/platforms/powernv/memtrace.c
very ugly and dangerous. We should never allow to manually
offline/online pages / hack into memory block states.
What I would want to see here is rather:
1. User space offlines the blocks to be used
2. memtrace installs a hotplug notifier and hinders the blocks it wants
to use from getting onlined.
3. memory is not added/removed/onlined/offlined in memtrace code.
CCing the DEVs.
--
Thanks,
David / dhildenb
On 26.06.19 10:27, Oscar Salvador wrote:
> On Wed, Jun 26, 2019 at 10:15:16AM +0200, Oscar Salvador wrote:
>> On Wed, Jun 26, 2019 at 10:11:06AM +0200, David Hildenbrand wrote:
>>> Back then, I already mentioned that we might have some users that
>>> remove_memory() they never added in a granularity it wasn't added. My
>>> concerns back then were never fully sorted out.
>>>
>>> arch/powerpc/platforms/powernv/memtrace.c
>>>
>>> - Will remove memory in memory block size chunks it never added
>>> - What if that memory resides on a DIMM added via MHP_MEMMAP_DEVICE?
>>>
>>> Will it at least bail out? Or simply break?
>>>
>>> IOW: I am not yet 100% convinced that MHP_MEMMAP_DEVICE is save to be
>>> introduced.
>>
>> Uhm, I will take a closer look and see if I can clear your concerns.
>> TBH, I did not try to use arch/powerpc/platforms/powernv/memtrace.c
>> yet.
>>
>> I will get back to you once I tried it out.
>
> On a second though, it would be quite trivial to implement a check in
> remove_memory() that does not allow to remove memory used with MHP_MEMMAP_DEVICE
> in a different granularity:
>
> +static int check_vmemmap_granularity(u64 start, u64 size)
> +{
> +	unsigned long pfn;
> +	unsigned long nr_pages;
> +	struct page *p;
> +
> +	pfn = PHYS_PFN(start);
> +	p = pfn_to_page(pfn);
> +	nr_pages = size >> PAGE_SHIFT;
> +
> +	if (PageVmemmap(p)) {
> +		struct page *h = vmemmap_get_head(p);
> +		unsigned long sections = (unsigned long)h->private;
> +
> +		if (sections * PAGES_PER_SECTION > nr_pages)
> +			return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +
> static int __ref try_remove_memory(int nid, u64 start, u64 size)
> {
> int rc = 0;
>
> BUG_ON(check_hotplug_memory_range(start, size));
>
> mem_hotplug_begin();
>
> + rc = check_vmemmap_granularity(start, size);
> + if (rc)
> + goto done;
>
>
> The above is quite hacky, but it gives an idea.
> I will try the code from arch/powerpc/platforms/powernv/memtrace.c and see how
> can I implement a check.
>
Yeah, I would consider such a safety check mandatory for MHP_MEMMAP_DEVICE.
--
Thanks,
David / dhildenb
Hi David,
Sorry for the late reply.
On Wed, 2019-06-26 at 10:28 +0200, David Hildenbrand wrote:
> On 26.06.19 10:15, Oscar Salvador wrote:
> > On Wed, Jun 26, 2019 at 10:11:06AM +0200, David Hildenbrand wrote:
> > > Back then, I already mentioned that we might have some users that
> > > remove_memory() they never added in a granularity it wasn't
> > > added. My
> > > concerns back then were never fully sorted out.
> > >
> > > arch/powerpc/platforms/powernv/memtrace.c
> > >
> > > - Will remove memory in memory block size chunks it never added
> > > - What if that memory resides on a DIMM added via
> > > MHP_MEMMAP_DEVICE?
> > >
> > > Will it at least bail out? Or simply break?
> > >
> > > IOW: I am not yet 100% convinced that MHP_MEMMAP_DEVICE is save
> > > to be
> > > introduced.
> >
> > Uhm, I will take a closer look and see if I can clear your
> > concerns.
> > TBH, I did not try to use arch/powerpc/platforms/powernv/memtrace.c
> > yet.
> >
> > I will get back to you once I tried it out.
> >
>
> BTW, I consider the code in arch/powerpc/platforms/powernv/memtrace.c
> very ugly and dangerous.
Yes it would be nice to clean this up.
> We should never allow to manually
> offline/online pages / hack into memory block states.
>
> What I would want to see here is rather:
>
> 1. User space offlines the blocks to be used
> 2. memtrace installs a hotplug notifier and hinders the blocks it
> wants
> to use from getting onlined.
> 3. memory is not added/removed/onlined/offlined in memtrace code.
>
I remember looking into doing it a similar way. I can't recall the
details but my issue was probably 'how does userspace indicate to
the kernel that this memory being offlined should be removed'?
I don't know the mm code nor how the notifiers work very well so I
can't quite see how the above would work. I'm assuming memtrace would
register a hotplug notifier and when memory is offlined from userspace,
the callback func in memtrace would be called if the priority was high
enough? But how do we know that the memory being offlined is intended
for us to touch? Is there a way to offline memory from userspace not
using sysfs, or have I missed something in the sysfs interface?
On a second read, perhaps you are assuming that memtrace is used after
adding new memory at runtime? If so, that is not the case. If not, then
would you be able to clarify what I'm not seeing?
Thanks.
> CCing the DEVs.
>
On Tue, Jul 02, 2019 at 04:42:34PM +1000, Rashmica Gupta wrote:
> Hi David,
>
> Sorry for the late reply.
>
> On Wed, 2019-06-26 at 10:28 +0200, David Hildenbrand wrote:
> > On 26.06.19 10:15, Oscar Salvador wrote:
> > > On Wed, Jun 26, 2019 at 10:11:06AM +0200, David Hildenbrand wrote:
> > > > Back then, I already mentioned that we might have some users that
> > > > remove_memory() they never added in a granularity it wasn't
> > > > added. My
> > > > concerns back then were never fully sorted out.
> > > >
> > > > arch/powerpc/platforms/powernv/memtrace.c
> > > >
> > > > - Will remove memory in memory block size chunks it never added
> > > > - What if that memory resides on a DIMM added via
> > > > MHP_MEMMAP_DEVICE?
> > > >
> > > > Will it at least bail out? Or simply break?
> > > >
> > > > IOW: I am not yet 100% convinced that MHP_MEMMAP_DEVICE is save
> > > > to be
> > > > introduced.
> > >
> > > Uhm, I will take a closer look and see if I can clear your
> > > concerns.
> > > TBH, I did not try to use arch/powerpc/platforms/powernv/memtrace.c
> > > yet.
> > >
> > > I will get back to you once I tried it out.
> > >
> >
> > BTW, I consider the code in arch/powerpc/platforms/powernv/memtrace.c
> > very ugly and dangerous.
>
> Yes it would be nice to clean this up.
>
> > We should never allow to manually
> > offline/online pages / hack into memory block states.
> >
> > What I would want to see here is rather:
> >
> > 1. User space offlines the blocks to be used
> > 2. memtrace installs a hotplug notifier and hinders the blocks it
> > wants
> > to use from getting onlined.
> > 3. memory is not added/removed/onlined/offlined in memtrace code.
> >
>
> I remember looking into doing it a similar way. I can't recall the
> details but my issue was probably 'how does userspace indicate to
> the kernel that this memory being offlined should be removed'?
>
> I don't know the mm code nor how the notifiers work very well so I
> can't quite see how the above would work. I'm assuming memtrace would
> register a hotplug notifier and when memory is offlined from userspace,
> the callback func in memtrace would be called if the priority was high
> enough? But how do we know that the memory being offlined is intended
> for usto touch? Is there a way to offline memory from userspace not
> using sysfs or have I missed something in the sysfs interface?
>
> On a second read, perhaps you are assuming that memtrace is used after
> adding new memory at runtime? If so, that is not the case. If not, then
> would you be able to clarify what I'm not seeing?
Hi Rashmica,
let us go the easy way here.
Could you please explain:
1) How memtrace works
2) Why it was designed, what is the goal of the interface?
3) When it is supposed to be used?
I have seen a couple of reports in the past from people running memtrace
and failing to do so sometimes, and back then I could not grasp why people
were using it, or under which circumstances it was nice to have.
So it would be nice to have a detailed explanation from the person who wrote
it.
Thanks
--
Oscar Salvador
SUSE L3
Woops, looks like my phone doesn't send plain text emails :/
On Tue, Jul 2, 2019 at 6:52 PM Rashmica Gupta <[email protected]> wrote:
>
> On Tue, Jul 2, 2019 at 5:48 PM Oscar Salvador <[email protected]> wrote:
>>
>> On Tue, Jul 02, 2019 at 04:42:34PM +1000, Rashmica Gupta wrote:
>> > Hi David,
>> >
>> > Sorry for the late reply.
>> >
>> > On Wed, 2019-06-26 at 10:28 +0200, David Hildenbrand wrote:
>> > > On 26.06.19 10:15, Oscar Salvador wrote:
>> > > > On Wed, Jun 26, 2019 at 10:11:06AM +0200, David Hildenbrand wrote:
>> > > > > Back then, I already mentioned that we might have some users that
>> > > > > remove_memory() they never added in a granularity it wasn't
>> > > > > added. My
>> > > > > concerns back then were never fully sorted out.
>> > > > >
>> > > > > arch/powerpc/platforms/powernv/memtrace.c
>> > > > >
>> > > > > - Will remove memory in memory block size chunks it never added
>> > > > > - What if that memory resides on a DIMM added via
>> > > > > MHP_MEMMAP_DEVICE?
>> > > > >
>> > > > > Will it at least bail out? Or simply break?
>> > > > >
>> > > > > IOW: I am not yet 100% convinced that MHP_MEMMAP_DEVICE is save
>> > > > > to be
>> > > > > introduced.
>> > > >
>> > > > Uhm, I will take a closer look and see if I can clear your
>> > > > concerns.
>> > > > TBH, I did not try to use arch/powerpc/platforms/powernv/memtrace.c
>> > > > yet.
>> > > >
>> > > > I will get back to you once I tried it out.
>> > > >
>> > >
>> > > BTW, I consider the code in arch/powerpc/platforms/powernv/memtrace.c
>> > > very ugly and dangerous.
>> >
>> > Yes it would be nice to clean this up.
>> >
>> > > We should never allow to manually
>> > > offline/online pages / hack into memory block states.
>> > >
>> > > What I would want to see here is rather:
>> > >
>> > > 1. User space offlines the blocks to be used
>> > > 2. memtrace installs a hotplug notifier and hinders the blocks it
>> > > wants
>> > > to use from getting onlined.
>> > > 3. memory is not added/removed/onlined/offlined in memtrace code.
>> > >
>> >
>> > I remember looking into doing it a similar way. I can't recall the
>> > details but my issue was probably 'how does userspace indicate to
>> > the kernel that this memory being offlined should be removed'?
>> >
>> > I don't know the mm code nor how the notifiers work very well so I
>> > can't quite see how the above would work. I'm assuming memtrace would
>> > register a hotplug notifier and when memory is offlined from userspace,
>> > the callback func in memtrace would be called if the priority was high
>> > enough? But how do we know that the memory being offlined is intended
>> > for usto touch? Is there a way to offline memory from userspace not
>> > using sysfs or have I missed something in the sysfs interface?
>> >
>> > On a second read, perhaps you are assuming that memtrace is used after
>> > adding new memory at runtime? If so, that is not the case. If not, then
>> > would you be able to clarify what I'm not seeing?
>>
>> Hi Rashmica,
>>
>> let us go the easy way here.
>> Could you please explain:
>>
>
> Sure!
>
>>
>> 1) How memtrace works
>
>
> You write the size of the chunk of memory you want into the debugfs file
> and memtrace will attempt to find a contiguous section of memory of that size
> that can be offlined. If it finds that, then the memory is removed from the
> kernel's mappings. If you want a different size, then you write that to the
> debugfs file and memtrace will re-add the memory it first removed and then
> try to offline and remove a chunk of the new size.
>
>
>>
>> 2) Why it was designed, what is the goal of the interface?
>> 3) When it is supposed to be used?
>>
>
> There is a hardware debugging facility (htm) on some power chips. To use
> this you need a contiguous portion of memory for the output to be dumped
> to - and we obviously don't want this memory to be simultaneously used by
> the kernel.
>
> At boot time we can portion off a section of memory for this (and not tell the
> kernel about it), but sometimes you want to be able to use the hardware
> debugging facilities and you haven't done this and you don't want to reboot
> your machine - and memtrace is the solution for this.
>
> If you're curious one tool that uses this debugging facility is here:
> https://github.com/open-power/pdbg. Relevant files are libpdbg/htm.c and src/htm.c.
>
>
>> I have seen a couple of reports in the past from people running memtrace
>> and failing to do so sometimes, and back then I could not grasp why people
>> was using it, or under which circumstances was nice to have.
>> So it would be nice to have a detailed explanation from the person who wrote
>> it.
>>
>
> Is that enough detail?
>
>>
>> Thanks
>>
>> --
>> Oscar Salvador
>> SUSE L3
On 02.07.19 08:42, Rashmica Gupta wrote:
> Hi David,
>
> Sorry for the late reply.
Hi,
sorry I was on PTO :)
>
> On Wed, 2019-06-26 at 10:28 +0200, David Hildenbrand wrote:
>> On 26.06.19 10:15, Oscar Salvador wrote:
>>> On Wed, Jun 26, 2019 at 10:11:06AM +0200, David Hildenbrand wrote:
>>>> Back then, I already mentioned that we might have some users that
>>>> remove_memory() they never added in a granularity it wasn't
>>>> added. My
>>>> concerns back then were never fully sorted out.
>>>>
>>>> arch/powerpc/platforms/powernv/memtrace.c
>>>>
>>>> - Will remove memory in memory block size chunks it never added
>>>> - What if that memory resides on a DIMM added via
>>>> MHP_MEMMAP_DEVICE?
>>>>
>>>> Will it at least bail out? Or simply break?
>>>>
>>>> IOW: I am not yet 100% convinced that MHP_MEMMAP_DEVICE is save
>>>> to be
>>>> introduced.
>>>
>>> Uhm, I will take a closer look and see if I can clear your
>>> concerns.
>>> TBH, I did not try to use arch/powerpc/platforms/powernv/memtrace.c
>>> yet.
>>>
>>> I will get back to you once I tried it out.
>>>
>>
>> BTW, I consider the code in arch/powerpc/platforms/powernv/memtrace.c
>> very ugly and dangerous.
>
> Yes it would be nice to clean this up.
>
>> We should never allow to manually
>> offline/online pages / hack into memory block states.
>>
>> What I would want to see here is rather:
>>
>> 1. User space offlines the blocks to be used
>> 2. memtrace installs a hotplug notifier and hinders the blocks it
>> wants
>> to use from getting onlined.
>> 3. memory is not added/removed/onlined/offlined in memtrace code.
>>
>
> I remember looking into doing it a similar way. I can't recall the
> details but my issue was probably 'how does userspace indicate to
> the kernel that this memory being offlined should be removed'?
Instead of indicating a "size", indicate the offline memory blocks that
the driver should use. E.g. by memory block id for each node
0:20-24,1:30-32
Of course, other interfaces might make sense.
You can then start using these memory blocks and hinder them from
getting onlined (as a safety net) via memory notifiers.
That would at least avoid you having to call
add_memory/remove_memory/offline_pages/device_online/modifying memblock
states manually.
(binding the memory block devices to a driver would be nicer, but the
infrastructure is not really there yet - we have no such drivers in
place yet)
>
> I don't know the mm code nor how the notifiers work very well so I
> can't quite see how the above would work. I'm assuming memtrace would
> register a hotplug notifier and when memory is offlined from userspace,
> the callback func in memtrace would be called if the priority was high
> enough? But how do we know that the memory being offlined is intended
> for usto touch? Is there a way to offline memory from userspace not
> using sysfs or have I missed something in the sysfs interface?
The notifier would really only be used to hinder onlining as a safety
net. User space prepares (offlines) the memory blocks and then tells the
drivers which memory blocks to use.
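The safety-net notifier suggested above could look roughly like this (an untested sketch against the memory hotplug notifier API; the memtrace_* names and the claimed-range bookkeeping are made up):

```c
#include <linux/memory.h>
#include <linux/notifier.h>

/* Hypothetical range claimed by memtrace, set when userspace hands
 * the already-offlined blocks to the driver. */
static unsigned long trace_start_pfn, trace_nr_pages;

static int memtrace_online_veto(struct notifier_block *nb,
				unsigned long action, void *data)
{
	struct memory_notify *mn = data;

	/* Veto onlining of any range overlapping the claimed blocks. */
	if (action == MEM_GOING_ONLINE &&
	    mn->start_pfn < trace_start_pfn + trace_nr_pages &&
	    trace_start_pfn < mn->start_pfn + mn->nr_pages)
		return notifier_from_errno(-EBUSY);

	return NOTIFY_OK;
}

static struct notifier_block memtrace_nb = {
	.notifier_call = memtrace_online_veto,
};

/* Somewhere in init: register_memory_notifier(&memtrace_nb); */
```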
>
> On a second read, perhaps you are assuming that memtrace is used after
> adding new memory at runtime? If so, that is not the case. If not, then
> would you be able to clarify what I'm not seeing?
The main problem I see is that you are calling
add_memory/remove_memory() on memory your device driver doesn't own. It
could reside on a DIMM if I am not mistaken (or later on
paravirtualized memory devices like virtio-mem if I ever get to
implement them ;) ).
How is it guaranteed that the memory you are allocating does not reside
on a DIMM for example added via add_memory() by the ACPI driver?
--
Thanks,
David / dhildenb
On Tue, Jun 25, 2019 at 12:53 AM Oscar Salvador <[email protected]> wrote:
>
> This patch introduces MHP_MEMMAP_DEVICE and MHP_MEMMAP_MEMBLOCK flags,
> and prepares the callers that add memory to take a "flags" parameter.
> This "flags" parameter will be evaluated later on in Patch#3
> to init mhp_restrictions struct.
>
> The callers are:
>
> add_memory
> __add_memory
> add_memory_resource
>
> Unfortunately, we do not have a single entry point to add memory; depending
> on the requirements of the caller, they want to hook in at different places
> (e.g: Xen reserve_additional_memory()), so we have to spread the parameter
> across the three callers.
>
> The flags are either MHP_MEMMAP_DEVICE or MHP_MEMMAP_MEMBLOCK, and only differ
> in the way they allocate vmemmap pages within the memory blocks.
>
> MHP_MEMMAP_MEMBLOCK:
> - With this flag, we will allocate vmemmap pages in each memory block.
> This means that if we hot-add a range that spans multiple memory blocks,
> we will use the beginning of each memory block for the vmemmap pages.
> This strategy is good for cases where the caller wants the flexibility
> to hot-remove memory in a different granularity than when it was added.
>
> E.g:
> We allocate a range (x,y], that spans 3 memory blocks, and given
> memory block size = 128MB.
> [memblock#0 ]
> [0 - 511 pfns ] - vmemmaps for section#0
> [512 - 32767 pfns ] - normal memory
>
> [memblock#1 ]
> [32768 - 33279 pfns] - vmemmaps for section#1
> [33280 - 65535 pfns] - normal memory
>
> [memblock#2 ]
> [65536 - 66047 pfns] - vmemmap for section#2
> [66048 - 98303 pfns] - normal memory
>
> MHP_MEMMAP_DEVICE:
> - With this flag, we will store all vmemmap pages at the beginning of
> hot-added memory.
>
> E.g:
> We allocate a range (x,y], that spans 3 memory blocks, and given
> memory block size = 128MB.
> [memblock #0 ]
> [0 - 1535 pfns ] - vmemmap for section#{0-2}
> [1536 - 98303 pfns] - normal memory
>
> When using larger memory blocks (1GB or 2GB), the principle is the same.
>
> Of course, MHP_MEMMAP_DEVICE is nicer when it comes to having a large contiguous
> area, while MHP_MEMMAP_MEMBLOCK allows us to have flexibility when removing the
> memory.
Concept and patch looks good to me, but I don't quite like the
proliferation of the _DEVICE naming, in theory it need not necessarily
be ZONE_DEVICE that is the only user of that flag. I also think it
might be useful to assign a flag for the default 'allocate from RAM'
case, just so the code is explicit. So, how about:
MHP_MEMMAP_PAGE_ALLOC
MHP_MEMMAP_MEMBLOCK
MHP_MEMMAP_RESERVED
...for the 3 cases?
Other than that, feel free to add:
Reviewed-by: Dan Williams <[email protected]>
On Tue, Jun 25, 2019 at 12:53 AM Oscar Salvador <[email protected]> wrote:
>
> Physical memory hotadd has to allocate a memmap (struct page array) for
> the newly added memory section. Currently, alloc_pages_node() is used
> for those allocations.
>
> This has some disadvantages:
> a) existing memory is consumed for that purpose
> (~2MB per 128MB memory section on x86_64)
> b) if the whole node is movable then we have off-node struct pages
> which has performance drawbacks.
>
> a) has turned out to be a problem for memory hotplug based ballooning
> because userspace might not react in time to online memory, while the
> memory consumed during the physical hotadd is enough to push the
> system to OOM. 31bc3858ea3e ("memory-hotplug: add automatic onlining
> policy for the newly added memory") has been added to work around that
> problem.
>
> I have also seen hot-add operations failing on powerpc due to the fact
> that we try to use order-8 pages. If the base page size is 64KB, this
> gives us 16MB, and if we run out of those, we simply fail.
> One could argue that we can fall back to base pages as we do on x86_64, but
> we can do better when CONFIG_SPARSEMEM_VMEMMAP is enabled.
>
> Vmemmap page tables can map arbitrary memory.
> That means that we can simply use the beginning of each memory section and
> map struct pages there.
> struct pages which back the allocated space then just need to be treated
> carefully.
>
> Implementation wise we reuse vmem_altmap infrastructure to override
> the default allocator used by __vmemmap_populate. Once the memmap is
> allocated we need a way to mark altmap pfns used for the allocation.
> If MHP_MEMMAP_{DEVICE,MEMBLOCK} flag was passed, we set up the layout of the
> altmap structure at the beginning of __add_pages(), and then we call
> mark_vmemmap_pages().
>
> Depending on which flag is passed (MHP_MEMMAP_DEVICE or MHP_MEMMAP_MEMBLOCK),
> mark_vmemmap_pages() gets called at a different stage.
> With MHP_MEMMAP_MEMBLOCK, we call it once we have populated the sections
> fitting in a single memblock, while with MHP_MEMMAP_DEVICE we wait until all
> sections have been populated.
>
> mark_vmemmap_pages() marks the pages as vmemmap and sets some metadata:
>
> The current layout of the vmemmap pages is:
>
> [Head->refcount] : Nr sections used by this altmap
> [Head->private] : Nr of vmemmap pages
> [Tail->freelist] : Pointer to the head page
>
> This is done to ease the computations we need in some places.
> E.g:
>
> Example 1)
> We hot-add 1GB on x86_64 (memory block 128MB) using
> MHP_MEMMAP_DEVICE:
>
> head->_refcount = 8 sections
> head->private = 4096 vmemmap pages
> tail's->freelist = head
>
> Example 2)
> We hot-add 1GB on x86_64 using MHP_MEMMAP_MEMBLOCK:
>
> [at the beginning of each memblock]
> head->_refcount = 1 section
> head->private = 512 vmemmap pages
> tail's->freelist = head
>
> We have the refcount because when using MHP_MEMMAP_DEVICE, we need to know
> how long we have to defer the call to vmemmap_free().
> The thing is that the first pages of the hot-added range are used to create
> the memmap mapping, so we cannot remove those first, otherwise we would blow up
> when accessing the other pages.
>
> Since sections are removed sequentially when we hot-remove a memory range,
> we wait until we hit the last section, and then we free the whole range
> backwards via vmemmap_free().
> We know that it is the last section because in every pass we
> decrease head->_refcount, and when it reaches 0, we got our last section.
>
> We also have to be careful about those pages during online and offline
> operations. They are simply skipped, so online will keep them
> reserved and so unusable for any other purpose and offline ignores them
> so they do not block the offline operation.
>
> The offline operation only has to check for one particularity.
> Depending on how large the hot-added range was, when using MHP_MEMMAP_DEVICE
> it can happen that one or more memory blocks are filled with only vmemmap pages.
> We just need to check for this case and skip 1) isolating and 2) migrating,
> because those pages do not need to be migrated anywhere; they are self-hosted.
Can you rewrite the changelog without using the word 'we'? I get
confused when it seems to reference the 'we' of the current implementation
vs the 'we' of the new implementation.
>
> Signed-off-by: Oscar Salvador <[email protected]>
> ---
> arch/arm64/mm/mmu.c | 5 +-
> arch/powerpc/mm/init_64.c | 7 +++
> arch/s390/mm/init.c | 6 ++
> arch/x86/mm/init_64.c | 10 +++
> drivers/acpi/acpi_memhotplug.c | 2 +-
> drivers/base/memory.c | 2 +-
> include/linux/memory_hotplug.h | 6 ++
> include/linux/memremap.h | 2 +-
> mm/compaction.c | 7 +++
> mm/memory_hotplug.c | 138 +++++++++++++++++++++++++++++++++++------
> mm/page_alloc.c | 22 ++++++-
> mm/page_isolation.c | 14 ++++-
> mm/sparse.c | 93 +++++++++++++++++++++++++++
> 13 files changed, 289 insertions(+), 25 deletions(-)
>
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 93ed0df4df79..d4b5661fa6b6 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -765,7 +765,10 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
> if (pmd_none(READ_ONCE(*pmdp))) {
> void *p = NULL;
>
> - p = vmemmap_alloc_block_buf(PMD_SIZE, node);
> + if (altmap)
> + p = altmap_alloc_block_buf(PMD_SIZE, altmap);
> + else
> + p = vmemmap_alloc_block_buf(PMD_SIZE, node);
> if (!p)
> return -ENOMEM;
>
> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> index a4e17a979e45..ff9d2c245321 100644
> --- a/arch/powerpc/mm/init_64.c
> +++ b/arch/powerpc/mm/init_64.c
> @@ -289,6 +289,13 @@ void __ref vmemmap_free(unsigned long start, unsigned long end,
>
> if (base_pfn >= alt_start && base_pfn < alt_end) {
> vmem_altmap_free(altmap, nr_pages);
> + } else if (PageVmemmap(page)) {
> + /*
> + * runtime vmemmap pages are residing inside the memory
> + * section so they do not have to be freed anywhere.
> + */
> + while (PageVmemmap(page))
> + __ClearPageVmemmap(page++);
> } else if (PageReserved(page)) {
> /* allocated from bootmem */
> if (page_size < PAGE_SIZE) {
> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> index ffb81fe95c77..c045411552a3 100644
> --- a/arch/s390/mm/init.c
> +++ b/arch/s390/mm/init.c
> @@ -226,6 +226,12 @@ int arch_add_memory(int nid, u64 start, u64 size,
> unsigned long size_pages = PFN_DOWN(size);
> int rc;
>
> + /*
> + * Physical memory is added only later during the memory online so we
> + * cannot use the added range at this stage unfortunately.
> + */
> + restrictions->flags &= ~restrictions->flags;
> +
> if (WARN_ON_ONCE(restrictions->altmap))
> return -EINVAL;
Perhaps these per-arch changes should be pulled out into separate prep patches?
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 688fb0687e55..00d17b666337 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -874,6 +874,16 @@ static void __meminit free_pagetable(struct page *page, int order)
> unsigned long magic;
> unsigned int nr_pages = 1 << order;
>
> + /*
> + * Runtime vmemmap pages are residing inside the memory section so
> + * they do not have to be freed anywhere.
> + */
> + if (PageVmemmap(page)) {
> + while (nr_pages--)
> + __ClearPageVmemmap(page++);
> + return;
> + }
If there is nothing to do and these pages are just going to be
released, why spend any effort clearing the vmemmap state?
> +
> /* bootmem page has reserved flag */
> if (PageReserved(page)) {
> __ClearPageReserved(page);
> diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> index 860f84e82dd0..3257edb98d90 100644
> --- a/drivers/acpi/acpi_memhotplug.c
> +++ b/drivers/acpi/acpi_memhotplug.c
> @@ -218,7 +218,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
> if (node < 0)
> node = memory_add_physaddr_to_nid(info->start_addr);
>
> - result = __add_memory(node, info->start_addr, info->length, 0);
> + result = __add_memory(node, info->start_addr, info->length, MHP_MEMMAP_DEVICE);
Why is this changed to MHP_MEMMAP_DEVICE? Where does it get the altmap?
>
> /*
> * If the memory block has been used by the kernel, add_memory()
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index ad9834b8b7f7..e0ac9a3b66f8 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -32,7 +32,7 @@ static DEFINE_MUTEX(mem_sysfs_mutex);
>
> #define to_memory_block(dev) container_of(dev, struct memory_block, dev)
>
> -static int sections_per_block;
> +int sections_per_block;
>
> static inline int base_memory_block_id(int section_nr)
> {
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 6fdbce9d04f9..e28e226c9a20 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -375,4 +375,10 @@ extern bool allow_online_pfn_range(int nid, unsigned long pfn, unsigned long nr_
> int online_type);
> extern struct zone *zone_for_pfn_range(int online_type, int nid, unsigned start_pfn,
> unsigned long nr_pages);
> +
> +#ifdef CONFIG_SPARSEMEM_VMEMMAP
> +extern void mark_vmemmap_pages(struct vmem_altmap *self);
> +#else
> +static inline void mark_vmemmap_pages(struct vmem_altmap *self) {}
> +#endif
> #endif /* __LINUX_MEMORY_HOTPLUG_H */
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 1732dea030b2..6de37e168f57 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -16,7 +16,7 @@ struct device;
> * @alloc: track pages consumed, private to vmemmap_populate()
> */
> struct vmem_altmap {
> - const unsigned long base_pfn;
> + unsigned long base_pfn;
> const unsigned long reserve;
> unsigned long free;
> unsigned long align;
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 9e1b9acb116b..40697f74b8b4 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -855,6 +855,13 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> nr_scanned++;
>
> page = pfn_to_page(low_pfn);
> + /*
> + * Vmemmap pages do not need to be isolated.
> + */
> + if (PageVmemmap(page)) {
> + low_pfn += get_nr_vmemmap_pages(page) - 1;
I'm failing to grok the get_nr_vmemmap_pages() api. It seems this is
more of a get_next_mapped_page() and perhaps it should VM_BUG_ON if it
is not passed a Vmemmap page.
> + continue;
> + }
>
> /*
> * Check if the pageblock has already been marked skipped.
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index e4e3baa6eaa7..b5106cb75795 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -42,6 +42,8 @@
> #include "internal.h"
> #include "shuffle.h"
>
> +extern int sections_per_block;
> +
> /*
> * online_page_callback contains pointer to current page onlining function.
> * Initially it is generic_online_page(). If it is required it could be
> @@ -279,6 +281,24 @@ static int check_pfn_span(unsigned long pfn, unsigned long nr_pages,
> return 0;
> }
>
> +static void mhp_reset_altmap(unsigned long next_pfn,
> + struct vmem_altmap *altmap)
> +{
> + altmap->base_pfn = next_pfn;
> + altmap->alloc = 0;
> +}
> +
> +static void mhp_init_altmap(unsigned long pfn, unsigned long nr_pages,
> + unsigned long mhp_flags,
> + struct vmem_altmap *altmap)
> +{
> + if (mhp_flags & MHP_MEMMAP_DEVICE)
> + altmap->free = nr_pages;
> + else
> + altmap->free = PAGES_PER_SECTION * sections_per_block;
> + altmap->base_pfn = pfn;
The ->free member is meant to be the number of free pages in the
altmap, but this seems to be set to the number of pages being mapped.
Am I misreading?
> +}
> +
> /*
> * Reasonably generic function for adding memory. It is
> * expected that archs that support memory hotplug will
> @@ -290,8 +310,17 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
> {
> unsigned long i;
> int start_sec, end_sec, err;
> - struct vmem_altmap *altmap = restrictions->altmap;
> + struct vmem_altmap *altmap;
> + struct vmem_altmap __memblk_altmap = {};
> + unsigned long mhp_flags = restrictions->flags;
> + unsigned long sections_added;
> +
> + if (mhp_flags & MHP_VMEMMAP_FLAGS) {
> + mhp_init_altmap(pfn, nr_pages, mhp_flags, &__memblk_altmap);
> + restrictions->altmap = &__memblk_altmap;
> + }
So this silently overrides a passed in altmap if a flag is set? The
NVDIMM use case can't necessarily trust __memblk_altmap to be
consistent with what the nvdimm namespace has reserved.
>
> + altmap = restrictions->altmap;
> if (altmap) {
> /*
> * Validate altmap is within bounds of the total request
> @@ -308,9 +337,10 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
> if (err)
> return err;
>
> + sections_added = 1;
> start_sec = pfn_to_section_nr(pfn);
> end_sec = pfn_to_section_nr(pfn + nr_pages - 1);
> - for (i = start_sec; i <= end_sec; i++) {
> + for (i = start_sec; i <= end_sec; i++, sections_added++) {
> unsigned long pfns;
>
> pfns = min(nr_pages, PAGES_PER_SECTION
> @@ -320,9 +350,19 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
> break;
> pfn += pfns;
> nr_pages -= pfns;
> +
> + if (mhp_flags & MHP_MEMMAP_MEMBLOCK &&
> + !(sections_added % sections_per_block)) {
> + mark_vmemmap_pages(altmap);
> + mhp_reset_altmap(pfn, altmap);
> + }
> cond_resched();
> }
> vmemmap_populate_print_last();
> +
> + if (mhp_flags & MHP_MEMMAP_DEVICE)
> + mark_vmemmap_pages(altmap);
> +
> return err;
> }
>
> @@ -642,6 +682,14 @@ static int online_pages_blocks(unsigned long start, unsigned long nr_pages)
> while (start < end) {
> order = min(MAX_ORDER - 1,
> get_order(PFN_PHYS(end) - PFN_PHYS(start)));
> + /*
> + * Check if the pfn is aligned to its order.
> + * If not, we decrement the order until it is,
> + * otherwise __free_one_page will bug us.
> + */
> + while (start & ((1 << order) - 1))
> + order--;
> +
Is this a candidate for a standalone patch? It seems out of place for
this patch.
> (*online_page_callback)(pfn_to_page(start), order);
>
> onlined_pages += (1UL << order);
> @@ -654,13 +702,30 @@ static int online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
> void *arg)
> {
> unsigned long onlined_pages = *(unsigned long *)arg;
> + unsigned long pfn = start_pfn;
> + unsigned long nr_vmemmap_pages = 0;
>
> - if (PageReserved(pfn_to_page(start_pfn)))
> - onlined_pages += online_pages_blocks(start_pfn, nr_pages);
> + if (PageVmemmap(pfn_to_page(pfn))) {
> + /*
> + * Do not send vmemmap pages to the page allocator.
> + */
> + nr_vmemmap_pages = get_nr_vmemmap_pages(pfn_to_page(start_pfn));
> + nr_vmemmap_pages = min(nr_vmemmap_pages, nr_pages);
> + pfn += nr_vmemmap_pages;
> + if (nr_vmemmap_pages == nr_pages)
> + /*
> + * If the entire range contains only vmemmap pages,
> + * there are no pages left for the page allocator.
> + */
> + goto skip_online;
> + }
Seems this should be the caller's (online_pages()) responsibility rather
than making this fixup internal to the helper... and if it's moved up,
can it be pushed one more level up so even online_pages() need not
worry about this fixup? It just does not seem to be an operation that
belongs in the online path. Might that eliminate the need for tracking
altmap parameters in struct page?
On 2019-07-24 22:11, Dan Williams wrote:
> On Tue, Jun 25, 2019 at 12:53 AM Oscar Salvador <[email protected]>
> wrote:
>>
>> [...]
>
> Concept and patch looks good to me, but I don't quite like the
> proliferation of the _DEVICE naming, in theory it need not necessarily
> be ZONE_DEVICE that is the only user of that flag. I also think it
> might be useful to assign a flag for the default 'allocate from RAM'
> case, just so the code is explicit. So, how about:
>
> MHP_MEMMAP_PAGE_ALLOC
> MHP_MEMMAP_MEMBLOCK
> MHP_MEMMAP_RESERVED
>
> ...for the 3 cases?
>
> Other than that, feel free to add:
>
> Reviewed-by: Dan Williams <[email protected]>
HI Dan,
I'll be sending V3 tomorrow, with some major rewrites (more simplified).
Thanks
On 25.07.19 11:27, Oscar Salvador wrote:
> On Wed, Jul 24, 2019 at 01:11:52PM -0700, Dan Williams wrote:
>> On Tue, Jun 25, 2019 at 12:53 AM Oscar Salvador <[email protected]> wrote:
>>>
>>> [...]
>>
>> Concept and patch looks good to me, but I don't quite like the
>> proliferation of the _DEVICE naming, in theory it need not necessarily
>> be ZONE_DEVICE that is the only user of that flag. I also think it
>> might be useful to assign a flag for the default 'allocate from RAM'
>> case, just so the code is explicit. So, how about:
>
> Well, MHP_MEMMAP_DEVICE is not tied to ZONE_DEVICE.
> MHP_MEMMAP_DEVICE was chosen to make a difference between:
>
> * allocate memmap pages for the whole memory-device
> * allocate memmap pages on each memoryblock that this memory-device spans
I agree that DEVICE is misleading here, you are assuming a one-to-one
mapping between a device and add_memory(). You are actually talking
about "allocate a single chunk of memmap pages for the whole memory range
that is added - which could consist of multiple memory blocks".
--
Thanks,
David / dhildenb
On Wed, Jul 24, 2019 at 01:11:52PM -0700, Dan Williams wrote:
> On Tue, Jun 25, 2019 at 12:53 AM Oscar Salvador <[email protected]> wrote:
> >
> > [...]
>
> Concept and patch looks good to me, but I don't quite like the
> proliferation of the _DEVICE naming, in theory it need not necessarily
> be ZONE_DEVICE that is the only user of that flag. I also think it
> might be useful to assign a flag for the default 'allocate from RAM'
> case, just so the code is explicit. So, how about:
Well, MHP_MEMMAP_DEVICE is not tied to ZONE_DEVICE.
MHP_MEMMAP_DEVICE was chosen to make a difference between:
* allocate memmap pages for the whole memory-device
* allocate memmap pages on each memoryblock that this memory-device spans
>
> MHP_MEMMAP_PAGE_ALLOC
> MHP_MEMMAP_MEMBLOCK
> MHP_MEMMAP_RESERVED
>
> ...for the 3 cases?
>
> Other than that, feel free to add:
>
> Reviewed-by: Dan Williams <[email protected]>
--
Oscar Salvador
SUSE L3
On 25.07.19 11:40, Oscar Salvador wrote:
> On Thu, Jul 25, 2019 at 11:30:23AM +0200, David Hildenbrand wrote:
>> On 25.07.19 11:27, Oscar Salvador wrote:
>>> On Wed, Jul 24, 2019 at 01:11:52PM -0700, Dan Williams wrote:
>>>> On Tue, Jun 25, 2019 at 12:53 AM Oscar Salvador <[email protected]> wrote:
>>>>>
>>>>> [...]
>>>>
>>>> Concept and patch looks good to me, but I don't quite like the
>>>> proliferation of the _DEVICE naming, in theory it need not necessarily
>>>> be ZONE_DEVICE that is the only user of that flag. I also think it
>>>> might be useful to assign a flag for the default 'allocate from RAM'
>>>> case, just so the code is explicit. So, how about:
>>>
>>> Well, MHP_MEMMAP_DEVICE is not tied to ZONE_DEVICE.
>>> MHP_MEMMAP_DEVICE was chosen to make a difference between:
>>>
>>> * allocate memmap pages for the whole memory-device
>>> * allocate memmap pages on each memoryblock that this memory-device spans
>>
>> I agree that DEVICE is misleading here, you are assuming a one-to-one
>> mapping between a device and add_memory(). You are actually talking
>> about "allocate a single chunk of memmap pages for the whole memory range
>> that is added - which could consist of multiple memory blocks".
>
> Well, I could not come up with a better name.
>
> MHP_MEMMAP_ALL?
> MHP_MEMMAP_WHOLE?
As I said somewhere already (as far as I recall), one mode would be
sufficient. If you want per memblock, add the memory in memblock
granularity.
So having a MHP_MEMMAP_ON_MEMORY that allocates it in one chunk would be
sufficient for the current use cases (DIMMs, Hyper-V).
MHP_MEMMAP_ON_MEMORY: Allocate the memmap for the added memory in one
chunk from the beginning of the added memory. This piece of memory will
be accessed and used even before the memory is onlined.
Of course, if we want to make it configurable (e.g., for ACPI) it would
be a different story. But for now this isn't really needed as far as I
can tell.
--
Thanks,
David / dhildenb
On 25.07.19 12:13, Oscar Salvador wrote:
> On Thu, Jul 25, 2019 at 12:04:08PM +0200, David Hildenbrand wrote:
>> As I said somewhere already (as far as I recall), one mode would be
>> sufficient. If you want per memblock, add the memory in memblock
>> granularity.
>>
>> So having a MHP_MEMMAP_ON_MEMORY that allocates it in one chunk would be
>> sufficient for the current use cases (DIMMs, Hyper-V).
>>
>> MHP_MEMMAP_ON_MEMORY: Allocate the memmap for the added memory in one
>> chunk from the beginning of the added memory. This piece of memory will
>> be accessed and used even before the memory is onlined.
>
> This is what I had in my early versions of the patchset, but I do remember
> that Michal suggested to let the caller specify if it wants the memmaps
> to be allocated per memblock, or per whole-range.
>
> I still think it makes some sense: you can just pass a large chunk
> (spanning multiple memory blocks) at once and yet specify to allocate
> it per memory block.
>
> Of course, I also agree that having only one mode would ease things
> (not that much, as v3 does not expose that difference wrt. range vs
> memory-block).
I prefer simplicity. No user, no implementation. We can easily add this
later on if there is a good reason/user.
--
Thanks,
David / dhildenb
On Thu, Jul 25, 2019 at 11:30:23AM +0200, David Hildenbrand wrote:
> On 25.07.19 11:27, Oscar Salvador wrote:
> > On Wed, Jul 24, 2019 at 01:11:52PM -0700, Dan Williams wrote:
> >> On Tue, Jun 25, 2019 at 12:53 AM Oscar Salvador <[email protected]> wrote:
> >>>
> >>> This patch introduces MHP_MEMMAP_DEVICE and MHP_MEMMAP_MEMBLOCK flags,
> >>> and prepares the callers that add memory to take a "flags" parameter.
> >>> This "flags" parameter will be evaluated later on in Patch#3
> >>> to init mhp_restrictions struct.
> >>>
> >>> The callers are:
> >>>
> >>> add_memory
> >>> __add_memory
> >>> add_memory_resource
> >>>
> >>> Unfortunately, we do not have a single entry point to add memory, as depending
> >>> on the requisites of the caller, they want to hook up in different places,
> >>> (e.g: Xen reserve_additional_memory()), so we have to spread the parameter
> >>> in the three callers.
> >>>
> >>> The flags are either MHP_MEMMAP_DEVICE or MHP_MEMMAP_MEMBLOCK, and only differ
> >>> in the way they allocate vmemmap pages within the memory blocks.
> >>>
> >>> MHP_MEMMAP_MEMBLOCK:
> >>> - With this flag, we will allocate vmemmap pages in each memory block.
> >>> This means that if we hot-add a range that spans multiple memory blocks,
> >>> we will use the beginning of each memory block for the vmemmap pages.
> >>> This strategy is good for cases where the caller wants the flexibility
> >>> to hot-remove memory in a different granularity than when it was added.
> >>>
> >>> E.g:
> >>> We allocate a range (x,y], that spans 3 memory blocks, and given
> >>> memory block size = 128MB.
> >>> [memblock#0 ]
> >>> [0 - 511 pfns ] - vmemmaps for section#0
> >>> [512 - 32767 pfns ] - normal memory
> >>>
> >>> [memblock#1 ]
> >>> [32768 - 33279 pfns] - vmemmaps for section#1
> >>> [33280 - 65535 pfns] - normal memory
> >>>
> >>> [memblock#2 ]
> >>> [65536 - 66047 pfns] - vmemmap for section#2
> >>> [66048 - 98303 pfns] - normal memory
> >>>
> >>> MHP_MEMMAP_DEVICE:
> >>> - With this flag, we will store all vmemmap pages at the beginning of
> >>> hot-added memory.
> >>>
> >>> E.g:
> >>> We allocate a range (x,y], that spans 3 memory blocks, and given
> >>> memory block size = 128MB.
> >>> [memblock #0 ]
> >>> [0 - 1535 pfns ] - vmemmap for section#{0-2}
> >>> [1536 - 98303 pfns] - normal memory
> >>>
> >>> When using larger memory blocks (1GB or 2GB), the principle is the same.
> >>>
> >>> Of course, MHP_MEMMAP_DEVICE is nicer when it comes to having a large
> >>> contiguous area, while MHP_MEMMAP_MEMBLOCK gives us flexibility when
> >>> removing the memory.
> >>
> >> Concept and patch looks good to me, but I don't quite like the
> >> proliferation of the _DEVICE naming, in theory it need not necessarily
> >> be ZONE_DEVICE that is the only user of that flag. I also think it
> >> might be useful to assign a flag for the default 'allocate from RAM'
> >> case, just so the code is explicit. So, how about:
> >
> > Well, MHP_MEMMAP_DEVICE is not tied to ZONE_DEVICE.
> > MHP_MEMMAP_DEVICE was chosen to make a difference between:
> >
> > * allocate memmap pages for the whole memory-device
> > * allocate memmap pages on each memoryblock that this memory-device spans
>
> I agree that DEVICE is misleading here, you are assuming a one-to-one
> mapping between a device and add_memory(). You are actually talking
> about "allocate a single chunk of memmap pages for the whole memory range
> that is added - which could consist of multiple memory blocks".
Well, I could not come up with a better name.
MHP_MEMMAP_ALL?
MHP_MEMMAP_WHOLE?
I will send v3 shortly and then we can think of a better name.
>
> --
>
> Thanks,
>
> David / dhildenb
>
--
Oscar Salvador
SUSE L3
On Thu, Jul 25, 2019 at 12:04:08PM +0200, David Hildenbrand wrote:
> As I said somewhere already (as far as I recall), one mode would be
> sufficient. If you want per memblock, add the memory in memblock
> granularity.
>
> So having a MHP_MEMMAP_ON_MEMORY that allocates it in one chunk would be
> sufficient for the current use cases (DIMMs, Hyper-V).
>
> MHP_MEMMAP_ON_MEMORY: Allocate the memmap for the added memory in one
> chunk from the beginning of the added memory. This piece of memory will
> be accessed and used even before the memory is onlined.
This is what I had in my early versions of the patchset, but I do remember
that Michal suggested to let the caller specify if it wants the memmaps
to be allocated per memblock, or per whole-range.
I still think it makes some sense: you can just pass a large chunk
(spanning multiple memory-blocks) at once and yet specify to allocate
it per memory-blocks.
Of course, I also agree that having only one mode would ease things
(not that much, as v3 does not make that distinction wrt. range vs
memory-block).
--
Oscar Salvador
SUSE L3
On Tue, 2019-07-16 at 14:28 +0200, David Hildenbrand wrote:
> On 02.07.19 08:42, Rashmica Gupta wrote:
> > Hi David,
> >
> > Sorry for the late reply.
>
> Hi,
>
> sorry I was on PTO :)
>
> > On Wed, 2019-06-26 at 10:28 +0200, David Hildenbrand wrote:
> > > On 26.06.19 10:15, Oscar Salvador wrote:
> > > > On Wed, Jun 26, 2019 at 10:11:06AM +0200, David Hildenbrand
> > > > wrote:
> > > > > Back then, I already mentioned that we might have some users
> > > > > that
> > > > > remove_memory() they never added in a granularity it wasn't
> > > > > added. My
> > > > > concerns back then were never fully sorted out.
> > > > >
> > > > > arch/powerpc/platforms/powernv/memtrace.c
> > > > >
> > > > > - Will remove memory in memory block size chunks it never
> > > > > added
> > > > > - What if that memory resides on a DIMM added via
> > > > > MHP_MEMMAP_DEVICE?
> > > > >
> > > > > Will it at least bail out? Or simply break?
> > > > >
> > > > > IOW: I am not yet 100% convinced that MHP_MEMMAP_DEVICE is safe
> > > > > to be introduced.
> > > >
> > > > Uhm, I will take a closer look and see if I can clear your
> > > > concerns.
> > > > TBH, I did not try to use
> > > > arch/powerpc/platforms/powernv/memtrace.c
> > > > yet.
> > > >
> > > > I will get back to you once I tried it out.
> > > >
> > >
> > > BTW, I consider the code in
> > > arch/powerpc/platforms/powernv/memtrace.c
> > > very ugly and dangerous.
> >
> > Yes it would be nice to clean this up.
> >
> > > We should never allow to manually
> > > offline/online pages / hack into memory block states.
> > >
> > > What I would want to see here is rather:
> > >
> > > 1. User space offlines the blocks to be used
> > > 2. memtrace installs a hotplug notifier and hinders the blocks it
> > > wants
> > > to use from getting onlined.
> > > 3. memory is not added/removed/onlined/offlined in memtrace code.
> > >
> >
> > I remember looking into doing it a similar way. I can't recall the
> > details but my issue was probably 'how does userspace indicate to
> > the kernel that this memory being offlined should be removed'?
>
> Instead of indicating a "size", indicate the offline memory blocks that
> the driver should use. E.g. by memory block id for each node
>
> 0:20-24,1:30-32
>
> Of course, other interfaces might make sense.
>
> You can then start using these memory blocks and hinder them from
> getting onlined (as a safety net) via memory notifiers.
>
> That would at least avoid you having to call
> add_memory/remove_memory/offline_pages/device_online/modifying
> memblock
> states manually.
I see what you're saying and that definitely sounds safer.
We would still need to call remove_memory and add_memory from memtrace,
as just offlining memory doesn't remove it from the linear page tables
(if it's still in the page tables then hardware can prefetch it, and if
hardware tracing is using it then the box checkstops).
>
> (binding the memory block devices to a driver would be nicer, but the
> infrastructure is not really there yet - we have no such drivers in
> place yet)
>
> > I don't know the mm code nor how the notifiers work very well so I
> > can't quite see how the above would work. I'm assuming memtrace
> > would
> > register a hotplug notifier and when memory is offlined from
> > userspace,
> > the callback func in memtrace would be called if the priority was
> > high
> > enough? But how do we know that the memory being offlined is
> > intended
> > for us to touch? Is there a way to offline memory from userspace not
> > using sysfs or have I missed something in the sysfs interface?
>
> The notifier would really only be used to hinder onlining as a safety
> net. User space prepares (offlines) the memory blocks and then tells
> the
> drivers which memory blocks to use.
>
> > On a second read, perhaps you are assuming that memtrace is used
> > after
> > adding new memory at runtime? If so, that is not the case. If not,
> > then
> > would you be able to clarify what I'm not seeing?
>
> The main problem I see is that you are calling
> add_memory/remove_memory() on memory your device driver doesn't own.
> It
> could reside on a DIMM if I am not mistaking (or later on
> paravirtualized memory devices like virtio-mem if I ever get to
> implement them ;) ).
This is just for baremetal/powernv so shouldn't affect virtual memory
devices.
>
> How is it guaranteed that the memory you are allocating does not
> reside
> on a DIMM for example added via add_memory() by the ACPI driver?
Good point. We don't have ACPI on powernv but currently this would try
to remove memory from any online memory node, not just the ones that
are backed by RAM. oops.
>> Of course, other interfaces might make sense.
>>
>> You can then start using these memory blocks and hinder them from
>> getting onlined (as a safety net) via memory notifiers.
>>
>> That would at least avoid you having to call
>> add_memory/remove_memory/offline_pages/device_online/modifying
>> memblock
>> states manually.
>
> I see what you're saying and that definitely sounds safer.
>
> We would still need to call remove_memory and add_memory from memtrace
> as
> just offlining memory doesn't remove it from the linear page tables
> (if
> it's still in the page tables then hardware can prefetch it and if
> hardware tracing is using it then the box checkstops).
That prefetching part is interesting (and nasty as well). If we could at
least get rid of the manual onlining/offlining, I would be able to sleep
better at night ;) One step at a time.
>
>>
>> (binding the memory block devices to a driver would be nicer, but the
>> infrastructure is not really there yet - we have no such drivers in
>> place yet)
>>
>>> I don't know the mm code nor how the notifiers work very well so I
>>> can't quite see how the above would work. I'm assuming memtrace
>>> would
>>> register a hotplug notifier and when memory is offlined from
>>> userspace,
>>> the callback func in memtrace would be called if the priority was
>>> high
>>> enough? But how do we know that the memory being offlined is
>>> intended
>>> for us to touch? Is there a way to offline memory from userspace not
>>> using sysfs or have I missed something in the sysfs interface?
>>
>> The notifier would really only be used to hinder onlining as a safety
>> net. User space prepares (offlines) the memory blocks and then tells
>> the
>> drivers which memory blocks to use.
>>
>>> On a second read, perhaps you are assuming that memtrace is used
>>> after
>>> adding new memory at runtime? If so, that is not the case. If not,
>>> then
>>> would you be able to clarify what I'm not seeing?
>>
>> The main problem I see is that you are calling
>> add_memory/remove_memory() on memory your device driver doesn't own.
>> It
>> could reside on a DIMM if I am not mistaking (or later on
>> paravirtualized memory devices like virtio-mem if I ever get to
>> implement them ;) ).
>
> This is just for baremetal/powernv so shouldn't affect virtual memory
> devices.
Good to know.
>
>>
>> How is it guaranteed that the memory you are allocating does not
>> reside
>> on a DIMM for example added via add_memory() by the ACPI driver?
>
> Good point. We don't have ACPI on powernv but currently this would try
> to remove memory from any online memory node, not just the ones that
> are backed by RAM. oops.
Okay, so essentially no memory hotplug/unplug along with memtrace. (can
we document that somewhere?). I think add_memory()/try_remove_memory()
could be tolerable in these environments (as it's only boot memory).
--
Thanks,
David / dhildenb
On Mon, 2019-07-29 at 10:06 +0200, David Hildenbrand wrote:
> > > Of course, other interfaces might make sense.
> > >
> > > You can then start using these memory blocks and hinder them from
> > > getting onlined (as a safety net) via memory notifiers.
> > >
> > > That would at least avoid you having to call
> > > add_memory/remove_memory/offline_pages/device_online/modifying
> > > memblock
> > > states manually.
> >
> > I see what you're saying and that definitely sounds safer.
> >
> > We would still need to call remove_memory and add_memory from
> > memtrace
> > as
> > just offlining memory doesn't remove it from the linear page tables
> > (if
> > it's still in the page tables then hardware can prefetch it and if
> > hardware tracing is using it then the box checkstops).
>
> That prefetching part is interesting (and nasty as well). If we could
> at
> least get rid of the manual onlining/offlining, I would be able to
> sleep
> better at night ;) One step at a time.
>
Ok, I'll get to that soon :)
> > > (binding the memory block devices to a driver would be nicer, but
> > > the
> > > infrastructure is not really there yet - we have no such drivers
> > > in
> > > place yet)
> > >
> > > > I don't know the mm code nor how the notifiers work very well
> > > > so I
> > > > can't quite see how the above would work. I'm assuming memtrace
> > > > would
> > > > register a hotplug notifier and when memory is offlined from
> > > > userspace,
> > > > the callback func in memtrace would be called if the priority
> > > > was
> > > > high
> > > > enough? But how do we know that the memory being offlined is
> > > > intended
> > > > for us to touch? Is there a way to offline memory from userspace
> > > > not
> > > > using sysfs or have I missed something in the sysfs interface?
> > >
> > > The notifier would really only be used to hinder onlining as a
> > > safety
> > > net. User space prepares (offlines) the memory blocks and then
> > > tells
> > > the
> > > drivers which memory blocks to use.
> > >
> > > > On a second read, perhaps you are assuming that memtrace is
> > > > used
> > > > after
> > > > adding new memory at runtime? If so, that is not the case. If
> > > > not,
> > > > then
> > > > would you be able to clarify what I'm not seeing?
> > >
> > > The main problem I see is that you are calling
> > > add_memory/remove_memory() on memory your device driver doesn't
> > > own.
> > > It
> > > could reside on a DIMM if I am not mistaking (or later on
> > > paravirtualized memory devices like virtio-mem if I ever get to
> > > implement them ;) ).
> >
> > This is just for baremetal/powernv so shouldn't affect virtual
> > memory
> > devices.
>
> Good to know.
>
> > > How is it guaranteed that the memory you are allocating does not
> > > reside
> > > on a DIMM for example added via add_memory() by the ACPI driver?
> >
> > Good point. We don't have ACPI on powernv but currently this would
> > try
> > to remove memory from any online memory node, not just the ones
> > that
> > are backed by RAM. oops.
>
> Okay, so essentially no memory hotplug/unplug along with memtrace.
> (can
> we document that somewhere?). I think
> add_memory()/try_remove_memory()
> could be tolerable in these environments (as it's only boot memory).
>
Sure thing.
On Mon, 2019-07-29 at 10:06 +0200, David Hildenbrand wrote:
> > > Of course, other interfaces might make sense.
> > >
> > > You can then start using these memory blocks and hinder them from
> > > getting onlined (as a safety net) via memory notifiers.
> > >
> > > That would at least avoid you having to call
> > > add_memory/remove_memory/offline_pages/device_online/modifying
> > > memblock
> > > states manually.
> >
> > I see what you're saying and that definitely sounds safer.
> >
> > We would still need to call remove_memory and add_memory from
> > memtrace
> > as
> > just offlining memory doesn't remove it from the linear page tables
> > (if
> > it's still in the page tables then hardware can prefetch it and if
> > hardware tracing is using it then the box checkstops).
>
> That prefetching part is interesting (and nasty as well). If we could
> at
> least get rid of the manual onlining/offlining, I would be able to
> sleep
> better at night ;) One step at a time.
>
What are your thoughts on adding remove to state_store in
drivers/base/memory.c? And an accompanying add? So then userspace could
do "echo remove > memory34/state"?
Then most of the memtrace code could be moved to a userspace tool. The
only bit that we would need to keep in the kernel is setting up debugfs
files in memtrace_init_debugfs.
On 31.07.19 04:21, Rashmica Gupta wrote:
> On Mon, 2019-07-29 at 10:06 +0200, David Hildenbrand wrote:
>>>> Of course, other interfaces might make sense.
>>>>
>>>> You can then start using these memory blocks and hinder them from
>>>> getting onlined (as a safety net) via memory notifiers.
>>>>
>>>> That would at least avoid you having to call
>>>> add_memory/remove_memory/offline_pages/device_online/modifying
>>>> memblock
>>>> states manually.
>>>
>>> I see what you're saying and that definitely sounds safer.
>>>
>>> We would still need to call remove_memory and add_memory from
>>> memtrace
>>> as
>>> just offlining memory doesn't remove it from the linear page tables
>>> (if
>>> it's still in the page tables then hardware can prefetch it and if
>>> hardware tracing is using it then the box checkstops).
>>
>> That prefetching part is interesting (and nasty as well). If we could
>> at
>> least get rid of the manual onlining/offlining, I would be able to
>> sleep
>> better at night ;) One step at a time.
>>
>
> What are your thoughts on adding remove to state_store in
> drivers/base/memory.c? And an accompanying add? So then userspace could
> do "echo remove > memory34/state"?
I consider such an interface dangerous (removing random memory you don't
own, especially in the context of Oscars work some blocks would suddenly
not be removable again) and of only limited use. The requirements of
memtrace are really special, and it only works somewhat reliably with
boot memory.
FWIW, we do have a "probe" interface on some architectures but decided
that a "remove" interface belongs into a debug/test kernel module. The
memory block device API is already messed up enough (e.g., "removable"
property which doesn't indicate at all if memory can be removed, only if
it could be offlined) - we decided to keep changes at a minimum for now
- rather clean up than add new stuff.
> Then most of the memtrace code could be moved to a userspace tool. The
> only bit that we would need to keep in the kernel is setting up debugfs
> files in memtrace_init_debugfs.
The nice thing is, with remove_memory() you will now get an error if
not all memory blocks are offline yet. So it only boils down to calling
add_memory()/remove_memory() without even checking the state of the
blocks (only the nids have to be handled correctly manually).
--
Thanks,
David / dhildenb
On Tue 02-07-19 18:52:01, Rashmica Gupta wrote:
[...]
> > 2) Why it was designed, what is the goal of the interface?
> > 3) When it is supposed to be used?
> >
> >
> There is a hardware debugging facility (htm) on some power chips. To use
> this you need a contiguous portion of memory for the output to be dumped
> to - and we obviously don't want this memory to be simultaneously used by
> the kernel.
How much memory are we talking about here? Just curious.
--
Michal Hocko
SUSE Labs
On Wed, 2019-07-31 at 14:08 +0200, Michal Hocko wrote:
> On Tue 02-07-19 18:52:01, Rashmica Gupta wrote:
> [...]
> > > 2) Why it was designed, what is the goal of the interface?
> > > 3) When it is supposed to be used?
> > >
> > >
> > There is a hardware debugging facility (htm) on some power chips.
> > To use
> > this you need a contiguous portion of memory for the output to be
> > dumped
> > to - and we obviously don't want this memory to be simultaneously
> > used by
> > the kernel.
>
> How much memory are we talking about here? Just curious.
From what I've seen a couple of GB per node, so maybe 2-10GB total.
On 01.08.19 09:24, Michal Hocko wrote:
> On Thu 01-08-19 09:18:47, David Hildenbrand wrote:
>> On 01.08.19 09:17, Michal Hocko wrote:
>>> On Thu 01-08-19 09:06:40, Rashmica Gupta wrote:
>>>> On Wed, 2019-07-31 at 14:08 +0200, Michal Hocko wrote:
>>>>> On Tue 02-07-19 18:52:01, Rashmica Gupta wrote:
>>>>> [...]
>>>>>>> 2) Why it was designed, what is the goal of the interface?
>>>>>>> 3) When it is supposed to be used?
>>>>>>>
>>>>>>>
>>>>>> There is a hardware debugging facility (htm) on some power chips.
>>>>>> To use
>>>>>> this you need a contiguous portion of memory for the output to be
>>>>>> dumped
>>>>>> to - and we obviously don't want this memory to be simultaneously
>>>>>> used by
>>>>>> the kernel.
>>>>>
>>>>> How much memory are we talking about here? Just curious.
>>>>
>>>> From what I've seen a couple of GB per node, so maybe 2-10GB total.
>>>
>>> OK, that is really a lot to keep around unused just in case the
>>> debugging is going to be used.
>>>
>>> I am still not sure the current approach of (ab)using memory hotplug is
>>> ideal. Sure there is some overlap but you shouldn't really need to
>>> offline the required memory range at all. All you need is to isolate the
>>> memory from any existing user and the page allocator. Have you checked
>>> alloc_contig_range?
>>>
>>
>> Rashmica mentioned somewhere in this thread that the virtual mapping
>> must not be in place, otherwise the HW might prefetch some of this
>> memory, leading to errors with memtrace (which checks that in HW).
>
> Does anything prevent from unmapping the pfn range from the direct
> mapping?
I am not sure about the implications of having
pfn_valid()/pfn_present()/pfn_online() return true while accessing the
memory results in crashes (suspend, kdump, whatever other technology
touches online memory).
(sounds more like a hack to me than just going ahead and
removing/readding the memory via a clean interface we have)
--
Thanks,
David / dhildenb
On Thu 01-08-19 09:26:35, David Hildenbrand wrote:
> On 01.08.19 09:24, Michal Hocko wrote:
> > On Thu 01-08-19 09:18:47, David Hildenbrand wrote:
> >> On 01.08.19 09:17, Michal Hocko wrote:
> >>> On Thu 01-08-19 09:06:40, Rashmica Gupta wrote:
> >>>> On Wed, 2019-07-31 at 14:08 +0200, Michal Hocko wrote:
> >>>>> On Tue 02-07-19 18:52:01, Rashmica Gupta wrote:
> >>>>> [...]
> >>>>>>> 2) Why it was designed, what is the goal of the interface?
> >>>>>>> 3) When it is supposed to be used?
> >>>>>>>
> >>>>>>>
> >>>>>> There is a hardware debugging facility (htm) on some power chips.
> >>>>>> To use
> >>>>>> this you need a contiguous portion of memory for the output to be
> >>>>>> dumped
> >>>>>> to - and we obviously don't want this memory to be simultaneously
> >>>>>> used by
> >>>>>> the kernel.
> >>>>>
> >>>>> How much memory are we talking about here? Just curious.
> >>>>
> >>>> From what I've seen a couple of GB per node, so maybe 2-10GB total.
> >>>
> >>> OK, that is really a lot to keep around unused just in case the
> >>> debugging is going to be used.
> >>>
> >>> I am still not sure the current approach of (ab)using memory hotplug is
> >>> ideal. Sure there is some overlap but you shouldn't really need to
> >>> offline the required memory range at all. All you need is to isolate the
> >>> memory from any existing user and the page allocator. Have you checked
> >>> alloc_contig_range?
> >>>
> >>
> >> Rashmica mentioned somewhere in this thread that the virtual mapping
> >> must not be in place, otherwise the HW might prefetch some of this
> >> memory, leading to errors with memtrace (which checks that in HW).
> >
> > Does anything prevent from unmapping the pfn range from the direct
> > mapping?
>
> I am not sure about the implications of having
> pfn_valid()/pfn_present()/pfn_online() return true but accessing it
> results in crashes. (suspend, kdump, whatever other technology touches
> online memory)
If those pages are marked as Reserved then nobody should be touching
them anyway.
> (sounds more like a hack to me than just going ahead and
> removing/readding the memory via a clean interface we have)
Right, but the interface that we have is quite restricted in what it can
really offline.
--
Michal Hocko
SUSE Labs
On Thu 01-08-19 09:31:09, David Hildenbrand wrote:
> On 01.08.19 09:26, David Hildenbrand wrote:
[...]
> > I am not sure about the implications of having
> > pfn_valid()/pfn_present()/pfn_online() return true but accessing it
> > results in crashes. (suspend, kdump, whatever other technology touches
> > online memory)
>
> (one idea: we could of course go ahead and mark the pages PG_offline
> before unmapping the pfn range to work around these issues)
PG_reserved and an elevated reference count should be enough to drive
any pfn walker out. Pfn walkers shouldn't touch any page unless they
know and recognize their type.
--
Michal Hocko
SUSE Labs
On Thu 01-08-19 09:39:57, Michal Hocko wrote:
> On Thu 01-08-19 09:31:09, David Hildenbrand wrote:
> > On 01.08.19 09:26, David Hildenbrand wrote:
> [...]
> > > I am not sure about the implications of having
> > > pfn_valid()/pfn_present()/pfn_online() return true but accessing it
> > > results in crashes. (suspend, kdump, whatever other technology touches
> > > online memory)
> >
> > (one idea: we could of course go ahead and mark the pages PG_offline
> > before unmapping the pfn range to work around these issues)
>
> PG_reserved and an elevated reference count should be enough to drive
> any pfn walker out. Pfn walkers shouldn't touch any page unless they
> know and recognize their type.
Btw. this shouldn't be much different from DEBUG_PAGEALLOC in
principle. The memory is valid, but not mapped to the kernel virtual
space. Nobody should be really touching it anyway.
--
Michal Hocko
SUSE Labs
On 01.08.19 09:17, Michal Hocko wrote:
> On Thu 01-08-19 09:06:40, Rashmica Gupta wrote:
>> On Wed, 2019-07-31 at 14:08 +0200, Michal Hocko wrote:
>>> On Tue 02-07-19 18:52:01, Rashmica Gupta wrote:
>>> [...]
>>>>> 2) Why it was designed, what is the goal of the interface?
>>>>> 3) When it is supposed to be used?
>>>>>
>>>>>
>>>> There is a hardware debugging facility (htm) on some power chips.
>>>> To use
>>>> this you need a contiguous portion of memory for the output to be
>>>> dumped
>>>> to - and we obviously don't want this memory to be simultaneously
>>>> used by
>>>> the kernel.
>>>
>>> How much memory are we talking about here? Just curious.
>>
>> From what I've seen a couple of GB per node, so maybe 2-10GB total.
>
> OK, that is really a lot to keep around unused just in case the
> debugging is going to be used.
>
> I am still not sure the current approach of (ab)using memory hotplug is
> ideal. Sure there is some overlap but you shouldn't really need to
> offline the required memory range at all. All you need is to isolate the
> memory from any existing user and the page allocator. Have you checked
> alloc_contig_range?
>
Rashmica mentioned somewhere in this thread that the virtual mapping
must not be in place, otherwise the HW might prefetch some of this
memory, leading to errors with memtrace (which checks that in HW).
--
Thanks,
David / dhildenb
On Thu 01-08-19 09:06:40, Rashmica Gupta wrote:
> On Wed, 2019-07-31 at 14:08 +0200, Michal Hocko wrote:
> > On Tue 02-07-19 18:52:01, Rashmica Gupta wrote:
> > [...]
> > > > 2) Why it was designed, what is the goal of the interface?
> > > > 3) When it is supposed to be used?
> > > >
> > > >
> > > There is a hardware debugging facility (htm) on some power chips.
> > > To use
> > > this you need a contiguous portion of memory for the output to be
> > > dumped
> > > to - and we obviously don't want this memory to be simultaneously
> > > used by
> > > the kernel.
> >
> > How much memory are we talking about here? Just curious.
>
> From what I've seen a couple of GB per node, so maybe 2-10GB total.
OK, that is really a lot to keep around unused just in case the
debugging is going to be used.
I am still not sure the current approach of (ab)using memory hotplug is
ideal. Sure there is some overlap but you shouldn't really need to
offline the required memory range at all. All you need is to isolate the
memory from any existing user and the page allocator. Have you checked
alloc_contig_range?
--
Michal Hocko
SUSE Labs
On Thu 01-08-19 09:18:47, David Hildenbrand wrote:
> On 01.08.19 09:17, Michal Hocko wrote:
> > On Thu 01-08-19 09:06:40, Rashmica Gupta wrote:
> >> On Wed, 2019-07-31 at 14:08 +0200, Michal Hocko wrote:
> >>> On Tue 02-07-19 18:52:01, Rashmica Gupta wrote:
> >>> [...]
> >>>>> 2) Why it was designed, what is the goal of the interface?
> >>>>> 3) When it is supposed to be used?
> >>>>>
> >>>>>
> >>>> There is a hardware debugging facility (htm) on some power chips.
> >>>> To use
> >>>> this you need a contiguous portion of memory for the output to be
> >>>> dumped
> >>>> to - and we obviously don't want this memory to be simultaneously
> >>>> used by
> >>>> the kernel.
> >>>
> >>> How much memory are we talking about here? Just curious.
> >>
> >> From what I've seen a couple of GB per node, so maybe 2-10GB total.
> >
> > OK, that is really a lot to keep around unused just in case the
> > debugging is going to be used.
> >
> > I am still not sure the current approach of (ab)using memory hotplug is
> > ideal. Sure there is some overlap but you shouldn't really need to
> > offline the required memory range at all. All you need is to isolate the
> > memory from any existing user and the page allocator. Have you checked
> > alloc_contig_range?
> >
>
> Rashmica mentioned somewhere in this thread that the virtual mapping
> must not be in place, otherwise the HW might prefetch some of this
> memory, leading to errors with memtrace (which checks that in HW).
Does anything prevent from unmapping the pfn range from the direct
mapping?
--
Michal Hocko
SUSE Labs
On 01.08.19 09:26, David Hildenbrand wrote:
> On 01.08.19 09:24, Michal Hocko wrote:
>> On Thu 01-08-19 09:18:47, David Hildenbrand wrote:
>>> On 01.08.19 09:17, Michal Hocko wrote:
>>>> On Thu 01-08-19 09:06:40, Rashmica Gupta wrote:
>>>>> On Wed, 2019-07-31 at 14:08 +0200, Michal Hocko wrote:
>>>>>> On Tue 02-07-19 18:52:01, Rashmica Gupta wrote:
>>>>>> [...]
>>>>>>>> 2) Why it was designed, what is the goal of the interface?
>>>>>>>> 3) When it is supposed to be used?
>>>>>>>>
>>>>>>>>
>>>>>>> There is a hardware debugging facility (htm) on some power chips.
>>>>>>> To use
>>>>>>> this you need a contiguous portion of memory for the output to be
>>>>>>> dumped
>>>>>>> to - and we obviously don't want this memory to be simultaneously
>>>>>>> used by
>>>>>>> the kernel.
>>>>>>
>>>>>> How much memory are we talking about here? Just curious.
>>>>>
>>>>> From what I've seen a couple of GB per node, so maybe 2-10GB total.
>>>>
>>>> OK, that is really a lot to keep around unused just in case the
>>>> debugging is going to be used.
>>>>
>>>> I am still not sure the current approach of (ab)using memory hotplug is
>>>> ideal. Sure there is some overlap but you shouldn't really need to
>>>> offline the required memory range at all. All you need is to isolate the
>>>> memory from any existing user and the page allocator. Have you checked
>>>> alloc_contig_range?
>>>>
>>>
>>> Rashmica mentioned somewhere in this thread that the virtual mapping
>>> must not be in place, otherwise the HW might prefetch some of this
>>> memory, leading to errors with memtrace (which checks that in HW).
>>
>> Does anything prevent from unmapping the pfn range from the direct
>> mapping?
>
> I am not sure about the implications of having
> pfn_valid()/pfn_present()/pfn_online() return true but accessing it
> results in crashes. (suspend, kdump, whatever other technology touches
> online memory)
(one idea: we could of course go ahead and mark the pages PG_offline
before unmapping the pfn range to work around these issues)
>
> (sounds more like a hack to me than just going ahead and
> removing/readding the memory via a clean interface we have)
>
--
Thanks,
David / dhildenb
On 01.08.19 09:34, Michal Hocko wrote:
> On Thu 01-08-19 09:26:35, David Hildenbrand wrote:
>> On 01.08.19 09:24, Michal Hocko wrote:
>>> On Thu 01-08-19 09:18:47, David Hildenbrand wrote:
>>>> [...]
>>>> Rashmica mentioned somewhere in this thread that the virtual mapping
>>>> must not be in place, otherwise the HW might prefetch some of this
>>>> memory, leading to errors with memtrace (which checks that in HW).
>>>
>>> Does anything prevent from unmapping the pfn range from the direct
>>> mapping?
>>
>> I am not sure about the implications of having
>> pfn_valid()/pfn_present()/pfn_online() return true but accessing it
>> results in crashes. (suspend, kdump, whatever other technology touches
>> online memory)
>
> If those pages are marked as Reserved then nobody should be touching
> them anyway.
Which is not true, as we already discussed - after that discussion I
even documented in page-flags.h what PG_reserved can mean (e.g., the
memmap of boot memory). That's why we introduced PG_offline in the
first place.
--
Thanks,
David / dhildenb
On Thu 01-08-19 09:50:29, David Hildenbrand wrote:
> On 01.08.19 09:34, Michal Hocko wrote:
> > On Thu 01-08-19 09:26:35, David Hildenbrand wrote:
> >> On 01.08.19 09:24, Michal Hocko wrote:
> >>> On Thu 01-08-19 09:18:47, David Hildenbrand wrote:
> >>>> On 01.08.19 09:17, Michal Hocko wrote:
> >>>>> On Thu 01-08-19 09:06:40, Rashmica Gupta wrote:
> >>>>>> On Wed, 2019-07-31 at 14:08 +0200, Michal Hocko wrote:
> >>>>>>> On Tue 02-07-19 18:52:01, Rashmica Gupta wrote:
> >>>>>>> [...]
> >>>>>>>>> 2) Why it was designed, what is the goal of the interface?
> >>>>>>>>> 3) When it is supposed to be used?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>> There is a hardware debugging facility (htm) on some power chips.
> >>>>>>>> To use
> >>>>>>>> this you need a contiguous portion of memory for the output to be
> >>>>>>>> dumped
> >>>>>>>> to - and we obviously don't want this memory to be simultaneously
> >>>>>>>> used by
> >>>>>>>> the kernel.
> >>>>>>>
> >>>>>>> How much memory are we talking about here? Just curious.
> >>>>>>
> >>>>>> From what I've seen a couple of GB per node, so maybe 2-10GB total.
> >>>>>
> >>>>> OK, that is really a lot to keep around unused just in case the
> >>>>> debugging is going to be used.
> >>>>>
> >>>>> I am still not sure the current approach of (ab)using memory hotplug is
> >>>>> ideal. Sure there is some overlap but you shouldn't really need to
> >>>>> offline the required memory range at all. All you need is to isolate the
> >>>>> memory from any existing user and the page allocator. Have you checked
> >>>>> alloc_contig_range?
> >>>>>
> >>>>
> >>>> Rashmica mentioned somewhere in this thread that the virtual mapping
> >>>> must not be in place, otherwise the HW might prefetch some of this
> >>>> memory, leading to errors with memtrace (which checks that in HW).
> >>>
> >>> Does anything prevent from unmapping the pfn range from the direct
> >>> mapping?
> >>
> >> I am not sure about the implications of having
> >> pfn_valid()/pfn_present()/pfn_online() return true but accessing it
> >> results in crashes. (suspend, kdump, whatever other technology touches
> >> online memory)
> >
> > If those pages are marked as Reserved then nobody should be touching
> > them anyway.
>
> Which is not true as I remember we already discussed - I even documented
> what PG_reserved can mean after that discussion in page-flags.h (e.g.,
> memmap of boot memory) - that's why we introduced PG_offline after all.
Sorry, my statement was imprecise. What I meant is what we have
documented:
* PG_reserved is set for special pages. The "struct page" of such a page
* should in general not be touched (e.g. set dirty) except by its owner.
The "except by its owner" part is the important bit here.
--
Michal Hocko
SUSE Labs
On 01.08.19 09:48, Michal Hocko wrote:
> On Thu 01-08-19 09:39:57, Michal Hocko wrote:
>> On Thu 01-08-19 09:31:09, David Hildenbrand wrote:
>>> On 01.08.19 09:26, David Hildenbrand wrote:
>> [...]
>>>> I am not sure about the implications of having
>>>> pfn_valid()/pfn_present()/pfn_online() return true but accessing it
>>>> results in crashes. (suspend, kdump, whatever other technology touches
>>>> online memory)
>>>
>>> (oneidea: we could of course go ahead and mark the pages PG_offline
>>> before unmapping the pfn range to work around these issues)
>>
>> PG_reserved and an elevated reference count should be enough to drive
>> any pfn walker out. Pfn walkers shouldn't touch any page unless they
>> know and recognize their type.
>
> Btw. this shouldn't be much different from DEBUG_PAGEALLOC in
> principle. The memory is valid, but not mapped to the kernel virtual
> space. Nobody should be really touching it anyway.
>
I guess that could work (I am happy with anything that gets rid of
offline_pages()/device_online() here :D ).
So, for each node: call alloc_contig_range() (if I remember correctly,
all pages in the range have to be in the same zone), mark the pages
PG_reserved (+ maybe something else, we'll have to see), and then
unmap them from the direct mapping. Do the reverse when freeing the
memory. I guess this should leave the current user space interface
unmodified.
I can see that guard pages use a special page type (PageGuard); not
sure if something like that is really required here. We'll have to
see.
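Roughly, the flow I have in mind would look like this (a minimal
sketch only, not an actual patch: memtrace_hide_range() is a
hypothetical stand-in for the arch-specific code that drops/restores
the range in the direct mapping, and error unwinding is trimmed):

```c
#include <linux/gfp.h>
#include <linux/mm.h>

/* Hypothetical arch hook: unmap/remap the range in the direct mapping. */
extern int memtrace_hide_range(unsigned long start_pfn,
			       unsigned long nr_pages, bool hide);

/*
 * Assumes the caller picked a candidate range on the node and that all
 * pages in [start_pfn, start_pfn + nr_pages) are in the same zone, as
 * alloc_contig_range() requires.
 */
static int memtrace_grab_range(unsigned long start_pfn,
			       unsigned long nr_pages)
{
	unsigned long pfn;
	int ret;

	/* Isolate the range from the page allocator and existing users. */
	ret = alloc_contig_range(start_pfn, start_pfn + nr_pages,
				 MIGRATE_MOVABLE, GFP_KERNEL);
	if (ret)
		return ret;

	/* Mark the pages so pfn walkers (hibernation, kdump, ...) skip them. */
	for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++)
		SetPageReserved(pfn_to_page(pfn));

	/* Drop the range from the linear mapping so HW prefetch cannot
	 * touch it - similar in spirit to DEBUG_PAGEALLOC. */
	return memtrace_hide_range(start_pfn, nr_pages, true);
}

static void memtrace_release_range(unsigned long start_pfn,
				   unsigned long nr_pages)
{
	unsigned long pfn;

	/* The reverse: remap, clear the flag, give the pages back. */
	memtrace_hide_range(start_pfn, nr_pages, false);
	for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++)
		ClearPageReserved(pfn_to_page(pfn));
	free_contig_range(start_pfn, nr_pages);
}
```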
--
Thanks,
David / dhildenb