2024-01-08 13:28:33

by Sumanth Korikkar

[permalink] [raw]
Subject: [PATCH v6 0/5] implement "memmap on memory" feature on s390

Hi All,

This series provides "memmap on memory" support on s390 platform.
"memmap on memory" allows struct pages array to be allocated from the
hotplugged memory range instead of allocating it from main system
memory.

s390 currently preallocates struct pages array for all potentially
possible memory, which ensures memory onlining always succeeds, but with
the cost of significant memory consumption from the available system
memory during boottime. In certain extreme configuration, this could
lead to ipl failure.

"memmap on memory" ensures struct pages array are populated from self
contained hotplugged memory range instead of depleting the available
system memory and this could eliminate ipl failure on s390 platform.

On other platforms, system might go OOM when the physically hotplugged
memory depletes the available memory before it is onlined. Hence,
"memmap on memory" feature was introduced as described in commit
a08a2ae34613 ("mm,memory_hotplug: allocate memmap from the added memory
range").

Unlike other architectures, s390 memory blocks are not physically accessible
until it is online. To make it physically accessible two new memory
notifiers MEM_PREPARE_ONLINE / MEM_FINISH_OFFLINE are added and this
notifier lets the hypervisor inform that the memory should be made
physically accessible. This allows for "memmap on memory" initialization
during memory hotplug onlining phase, which is performed before calling
MEM_GOING_ONLINE notifier.

Patch 1 introduces MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE memory
notifiers to prepare the transition of memory to and from a physically
accessible state. New mhp_flag MHP_OFFLINE_INACCESSIBLE is introduced to
ensure altmap cannot be written when adding memory - before it is set
online. This enhancement is crucial for implementing the "memmap on
memory" feature for s390 in a subsequent patch.

Patches 2 allocates vmemmap pages from self-contained memory range for
s390. It allocates memory map (struct pages array) from the hotplugged
memory range, rather than using system memory by passing altmap to
vmemmap functions.

Patch 3 removes unhandled memory notifier types on s390.

Patch 4 implements MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE memory
notifiers on s390. MEM_PREPARE_ONLINE memory notifier makes memory block
physical accessible via sclp assign command. The notifier ensures
self-contained memory maps are accessible and hence enabling the "memmap
on memory" on s390. MEM_FINISH_OFFLINE memory notifier shifts the memory
block to an inaccessible state via sclp unassign command.

Patch 5 finally enables MHP_MEMMAP_ON_MEMORY on s390.

v6:
* Added usecase description in cover letter.
* Rebased against mm branch. Added mhp_flag parameter to
create_altmaps_and_memory_blocks() inorder to rebase patch1.

v5:
* Added reviewed-by
* Removed variables altmap_start, altmap_size in sclp_cmd.c
* Used PFN_PHYS macro.

Thanks for the valualble feedback.

v4:
* Introduced two new fields, altmap_start_pfn and altmap_nr_pages, in
the memory_notify structure and document it that it is used only in
MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE callbacks.
* Incorporated the newly added fields into s390's
MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifier callbacks.
* Prevent access to memblock->altmap->free in the s390 notifier callback.
* page_init_poison() could be performed similar to when adding new
memory in sparse_add_section(). Perform it without cond_resched().

v3:
* added comments to MHP_OFFLINE_ACCESSIBLE as suggested by David.
* Squashed three commits related to new memory notifier.

v2:
* Fixes are integrated and hence removed from this patch series
Suggestions from David:
* Add new flag MHP_OFFLINE_INACCESSIBLE to avoid accessing memory
during memory hotplug addition phase.
* Avoid page_init_poison() on memmap during mhp addition phase, when
MHP_OFFLINE_INACCESSIBLE mhp_flag is passed in add_memory().
* Do not skip add_pages() in arch_add_memory(). Similarly, remove
similar hacks in arch_remove_memory().
* Use MHP_PREPARE_ONLINE/MHP_FINISH_OFFLINE naming convention for
new memory notifiers.
* Rearrange removal of unused s390 memory notifier.
* Necessary commit messages changes.

Thank you

Sumanth Korikkar (5):
mm/memory_hotplug: introduce MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE
notifiers
s390/mm: allocate vmemmap pages from self-contained memory range
s390/sclp: remove unhandled memory notifier type
s390/mm: implement MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers
s390: enable MHP_MEMMAP_ON_MEMORY

arch/s390/Kconfig | 1 +
arch/s390/mm/init.c | 3 --
arch/s390/mm/vmem.c | 62 +++++++++++++++++++---------------
drivers/base/memory.c | 23 ++++++++++++-
drivers/s390/char/sclp_cmd.c | 44 +++++++++++++++++++-----
include/linux/memory.h | 9 +++++
include/linux/memory_hotplug.h | 18 +++++++++-
include/linux/memremap.h | 1 +
mm/memory_hotplug.c | 17 ++++++++--
mm/sparse.c | 3 +-
10 files changed, 136 insertions(+), 45 deletions(-)

--
2.40.1



2024-01-08 13:28:42

by Sumanth Korikkar

[permalink] [raw]
Subject: [PATCH v6 5/5] s390: enable MHP_MEMMAP_ON_MEMORY

Enable MHP_MEMMAP_ON_MEMORY to support "memmap on memory".
memory_hotplug.memmap_on_memory=true kernel parameter should be set in
kernel boot option to enable the feature.

Reviewed-by: Gerald Schaefer <[email protected]>
Signed-off-by: Sumanth Korikkar <[email protected]>
---
arch/s390/Kconfig | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index d5d8f99d1f25..747d994c6dfa 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -113,6 +113,7 @@ config S390
select ARCH_INLINE_WRITE_UNLOCK_BH
select ARCH_INLINE_WRITE_UNLOCK_IRQ
select ARCH_INLINE_WRITE_UNLOCK_IRQRESTORE
+ select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
select ARCH_STACKWALK
select ARCH_SUPPORTS_ATOMIC_RMW
select ARCH_SUPPORTS_DEBUG_PAGEALLOC
--
2.40.1


2024-01-08 13:29:02

by Sumanth Korikkar

[permalink] [raw]
Subject: [PATCH v6 2/5] s390/mm: allocate vmemmap pages from self-contained memory range

Allocate memory map (struct pages array) from the hotplugged memory
range, rather than using system memory. The change addresses the issue
where standby memory, when configured to be much larger than online
memory, could potentially lead to ipl failure due to memory map
allocation from online memory. For example, 16MB of memory map
allocation is needed for a memory block size of 1GB and when standby
memory is configured much larger than online memory, this could lead to
ipl failure.

To address this issue, the solution involves introducing "memmap on
memory" using the vmem_altmap structure on s390. Architectures that
want to implement it should pass the altmap to the vmemmap_populate()
function and its associated callchain. This enhancement is discussed in
commit 4b94ffdc4163 ("x86, mm: introduce vmem_altmap to augment
vmemmap_populate()")

Provide "memmap on memory" support for s390 by passing the altmap in
vmemmap_populate() and its callchain. The allocation path is described
as follows:
* When altmap is NULL in vmemmap_populate(), memory map allocation
occurs using the existing vmemmap_alloc_block_buf().
* When altmap is not NULL in vmemmap_populate(), memory map allocation
still uses vmemmap_alloc_block_buf(), but this function internally
calls altmap_alloc_block_buf().

For deallocation, the process is outlined as follows:
* When altmap is NULL in vmemmap_free(), memory map deallocation happens
through free_pages().
* When altmap is not NULL in vmemmap_free(), memory map deallocation
occurs via vmem_altmap_free().

While memory map allocation is primarily handled through the
self-contained memory map range, there might still be a small amount of
system memory allocation required for vmemmap pagetables. To mitigate
this impact, this feature will be limited to machines with EDAT1
support.

Reviewed-by: Gerald Schaefer <[email protected]>
Signed-off-by: Sumanth Korikkar <[email protected]>
---
arch/s390/mm/init.c | 3 ---
arch/s390/mm/vmem.c | 62 +++++++++++++++++++++++++--------------------
2 files changed, 35 insertions(+), 30 deletions(-)

diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index 43e612bc2bcd..8d9a60ccb777 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -281,9 +281,6 @@ int arch_add_memory(int nid, u64 start, u64 size,
unsigned long size_pages = PFN_DOWN(size);
int rc;

- if (WARN_ON_ONCE(params->altmap))
- return -EINVAL;
-
if (WARN_ON_ONCE(params->pgprot.pgprot != PAGE_KERNEL.pgprot))
return -EINVAL;

diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index 186a020857cf..eb100479f7be 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -33,8 +33,12 @@ static void __ref *vmem_alloc_pages(unsigned int order)
return memblock_alloc(size, size);
}

-static void vmem_free_pages(unsigned long addr, int order)
+static void vmem_free_pages(unsigned long addr, int order, struct vmem_altmap *altmap)
{
+ if (altmap) {
+ vmem_altmap_free(altmap, 1 << order);
+ return;
+ }
/* We don't expect boot memory to be removed ever. */
if (!slab_is_available() ||
WARN_ON_ONCE(PageReserved(virt_to_page((void *)addr))))
@@ -156,7 +160,8 @@ static bool vmemmap_unuse_sub_pmd(unsigned long start, unsigned long end)

/* __ref: we'll only call vmemmap_alloc_block() via vmemmap_populate() */
static int __ref modify_pte_table(pmd_t *pmd, unsigned long addr,
- unsigned long end, bool add, bool direct)
+ unsigned long end, bool add, bool direct,
+ struct vmem_altmap *altmap)
{
unsigned long prot, pages = 0;
int ret = -ENOMEM;
@@ -172,11 +177,11 @@ static int __ref modify_pte_table(pmd_t *pmd, unsigned long addr,
if (pte_none(*pte))
continue;
if (!direct)
- vmem_free_pages((unsigned long) pfn_to_virt(pte_pfn(*pte)), 0);
+ vmem_free_pages((unsigned long)pfn_to_virt(pte_pfn(*pte)), get_order(PAGE_SIZE), altmap);
pte_clear(&init_mm, addr, pte);
} else if (pte_none(*pte)) {
if (!direct) {
- void *new_page = vmemmap_alloc_block(PAGE_SIZE, NUMA_NO_NODE);
+ void *new_page = vmemmap_alloc_block_buf(PAGE_SIZE, NUMA_NO_NODE, altmap);

if (!new_page)
goto out;
@@ -213,7 +218,8 @@ static void try_free_pte_table(pmd_t *pmd, unsigned long start)

/* __ref: we'll only call vmemmap_alloc_block() via vmemmap_populate() */
static int __ref modify_pmd_table(pud_t *pud, unsigned long addr,
- unsigned long end, bool add, bool direct)
+ unsigned long end, bool add, bool direct,
+ struct vmem_altmap *altmap)
{
unsigned long next, prot, pages = 0;
int ret = -ENOMEM;
@@ -234,11 +240,11 @@ static int __ref modify_pmd_table(pud_t *pud, unsigned long addr,
if (IS_ALIGNED(addr, PMD_SIZE) &&
IS_ALIGNED(next, PMD_SIZE)) {
if (!direct)
- vmem_free_pages(pmd_deref(*pmd), get_order(PMD_SIZE));
+ vmem_free_pages(pmd_deref(*pmd), get_order(PMD_SIZE), altmap);
pmd_clear(pmd);
pages++;
} else if (!direct && vmemmap_unuse_sub_pmd(addr, next)) {
- vmem_free_pages(pmd_deref(*pmd), get_order(PMD_SIZE));
+ vmem_free_pages(pmd_deref(*pmd), get_order(PMD_SIZE), altmap);
pmd_clear(pmd);
}
continue;
@@ -261,7 +267,7 @@ static int __ref modify_pmd_table(pud_t *pud, unsigned long addr,
* page tables since vmemmap_populate gets
* called for each section separately.
*/
- new_page = vmemmap_alloc_block(PMD_SIZE, NUMA_NO_NODE);
+ new_page = vmemmap_alloc_block_buf(PMD_SIZE, NUMA_NO_NODE, altmap);
if (new_page) {
set_pmd(pmd, __pmd(__pa(new_page) | prot));
if (!IS_ALIGNED(addr, PMD_SIZE) ||
@@ -280,7 +286,7 @@ static int __ref modify_pmd_table(pud_t *pud, unsigned long addr,
vmemmap_use_sub_pmd(addr, next);
continue;
}
- ret = modify_pte_table(pmd, addr, next, add, direct);
+ ret = modify_pte_table(pmd, addr, next, add, direct, altmap);
if (ret)
goto out;
if (!add)
@@ -302,12 +308,12 @@ static void try_free_pmd_table(pud_t *pud, unsigned long start)
for (i = 0; i < PTRS_PER_PMD; i++, pmd++)
if (!pmd_none(*pmd))
return;
- vmem_free_pages(pud_deref(*pud), CRST_ALLOC_ORDER);
+ vmem_free_pages(pud_deref(*pud), CRST_ALLOC_ORDER, NULL);
pud_clear(pud);
}

static int modify_pud_table(p4d_t *p4d, unsigned long addr, unsigned long end,
- bool add, bool direct)
+ bool add, bool direct, struct vmem_altmap *altmap)
{
unsigned long next, prot, pages = 0;
int ret = -ENOMEM;
@@ -347,7 +353,7 @@ static int modify_pud_table(p4d_t *p4d, unsigned long addr, unsigned long end,
} else if (pud_large(*pud)) {
continue;
}
- ret = modify_pmd_table(pud, addr, next, add, direct);
+ ret = modify_pmd_table(pud, addr, next, add, direct, altmap);
if (ret)
goto out;
if (!add)
@@ -370,12 +376,12 @@ static void try_free_pud_table(p4d_t *p4d, unsigned long start)
if (!pud_none(*pud))
return;
}
- vmem_free_pages(p4d_deref(*p4d), CRST_ALLOC_ORDER);
+ vmem_free_pages(p4d_deref(*p4d), CRST_ALLOC_ORDER, NULL);
p4d_clear(p4d);
}

static int modify_p4d_table(pgd_t *pgd, unsigned long addr, unsigned long end,
- bool add, bool direct)
+ bool add, bool direct, struct vmem_altmap *altmap)
{
unsigned long next;
int ret = -ENOMEM;
@@ -394,7 +400,7 @@ static int modify_p4d_table(pgd_t *pgd, unsigned long addr, unsigned long end,
goto out;
p4d_populate(&init_mm, p4d, pud);
}
- ret = modify_pud_table(p4d, addr, next, add, direct);
+ ret = modify_pud_table(p4d, addr, next, add, direct, altmap);
if (ret)
goto out;
if (!add)
@@ -415,12 +421,12 @@ static void try_free_p4d_table(pgd_t *pgd, unsigned long start)
if (!p4d_none(*p4d))
return;
}
- vmem_free_pages(pgd_deref(*pgd), CRST_ALLOC_ORDER);
+ vmem_free_pages(pgd_deref(*pgd), CRST_ALLOC_ORDER, NULL);
pgd_clear(pgd);
}

static int modify_pagetable(unsigned long start, unsigned long end, bool add,
- bool direct)
+ bool direct, struct vmem_altmap *altmap)
{
unsigned long addr, next;
int ret = -ENOMEM;
@@ -445,7 +451,7 @@ static int modify_pagetable(unsigned long start, unsigned long end, bool add,
goto out;
pgd_populate(&init_mm, pgd, p4d);
}
- ret = modify_p4d_table(pgd, addr, next, add, direct);
+ ret = modify_p4d_table(pgd, addr, next, add, direct, altmap);
if (ret)
goto out;
if (!add)
@@ -458,14 +464,16 @@ static int modify_pagetable(unsigned long start, unsigned long end, bool add,
return ret;
}

-static int add_pagetable(unsigned long start, unsigned long end, bool direct)
+static int add_pagetable(unsigned long start, unsigned long end, bool direct,
+ struct vmem_altmap *altmap)
{
- return modify_pagetable(start, end, true, direct);
+ return modify_pagetable(start, end, true, direct, altmap);
}

-static int remove_pagetable(unsigned long start, unsigned long end, bool direct)
+static int remove_pagetable(unsigned long start, unsigned long end, bool direct,
+ struct vmem_altmap *altmap)
{
- return modify_pagetable(start, end, false, direct);
+ return modify_pagetable(start, end, false, direct, altmap);
}

/*
@@ -474,7 +482,7 @@ static int remove_pagetable(unsigned long start, unsigned long end, bool direct)
static int vmem_add_range(unsigned long start, unsigned long size)
{
start = (unsigned long)__va(start);
- return add_pagetable(start, start + size, true);
+ return add_pagetable(start, start + size, true, NULL);
}

/*
@@ -483,7 +491,7 @@ static int vmem_add_range(unsigned long start, unsigned long size)
static void vmem_remove_range(unsigned long start, unsigned long size)
{
start = (unsigned long)__va(start);
- remove_pagetable(start, start + size, true);
+ remove_pagetable(start, start + size, true, NULL);
}

/*
@@ -496,9 +504,9 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,

mutex_lock(&vmem_mutex);
/* We don't care about the node, just use NUMA_NO_NODE on allocations */
- ret = add_pagetable(start, end, false);
+ ret = add_pagetable(start, end, false, altmap);
if (ret)
- remove_pagetable(start, end, false);
+ remove_pagetable(start, end, false, altmap);
mutex_unlock(&vmem_mutex);
return ret;
}
@@ -509,7 +517,7 @@ void vmemmap_free(unsigned long start, unsigned long end,
struct vmem_altmap *altmap)
{
mutex_lock(&vmem_mutex);
- remove_pagetable(start, end, false);
+ remove_pagetable(start, end, false, altmap);
mutex_unlock(&vmem_mutex);
}

--
2.40.1


2024-01-08 13:29:14

by Sumanth Korikkar

[permalink] [raw]
Subject: [PATCH v6 4/5] s390/mm: implement MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers

MEM_PREPARE_ONLINE memory notifier makes memory block physical
accessible via sclp assign command. The notifier ensures self-contained
memory maps are accessible and hence enabling the "memmap on memory" on
s390.

MEM_FINISH_OFFLINE memory notifier shifts the memory block to an
inaccessible state via sclp unassign command.

Implementation considerations:
* When MHP_MEMMAP_ON_MEMORY is disabled, the system retains the old
behavior. This means the memory map is allocated from default memory.
* If MACHINE_HAS_EDAT1 is unavailable, MHP_MEMMAP_ON_MEMORY is
automatically disabled. This ensures that vmemmap pagetables do not
consume additional memory from the default memory allocator.
* The MEM_GOING_ONLINE notifier has been modified to perform no
operation, as MEM_PREPARE_ONLINE already executes the sclp assign
command.
* The MEM_CANCEL_ONLINE/MEM_OFFLINE notifier now performs no operation, as
MEM_FINISH_OFFLINE already executes the sclp unassign command.

Reviewed-by: Gerald Schaefer <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Sumanth Korikkar <[email protected]>
---
drivers/s390/char/sclp_cmd.c | 41 ++++++++++++++++++++++++++++++------
1 file changed, 35 insertions(+), 6 deletions(-)

diff --git a/drivers/s390/char/sclp_cmd.c b/drivers/s390/char/sclp_cmd.c
index 355e63e44e95..7815e9bea69a 100644
--- a/drivers/s390/char/sclp_cmd.c
+++ b/drivers/s390/char/sclp_cmd.c
@@ -18,6 +18,7 @@
#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/memory.h>
+#include <linux/memory_hotplug.h>
#include <linux/module.h>
#include <asm/ctlreg.h>
#include <asm/chpid.h>
@@ -26,6 +27,7 @@
#include <asm/sclp.h>
#include <asm/numa.h>
#include <asm/facility.h>
+#include <asm/page-states.h>

#include "sclp.h"

@@ -340,13 +342,38 @@ static int sclp_mem_notifier(struct notifier_block *nb,
if (contains_standby_increment(start, start + size))
rc = -EPERM;
break;
- case MEM_GOING_ONLINE:
+ case MEM_PREPARE_ONLINE:
+ /*
+ * Access the altmap_start_pfn and altmap_nr_pages fields
+ * within the struct memory_notify specifically when dealing
+ * with only MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers.
+ *
+ * When altmap is in use, take the specified memory range
+ * online, which includes the altmap.
+ */
+ if (arg->altmap_nr_pages) {
+ start = PFN_PHYS(arg->altmap_start_pfn);
+ size += PFN_PHYS(arg->altmap_nr_pages);
+ }
rc = sclp_mem_change_state(start, size, 1);
+ if (rc || !arg->altmap_nr_pages)
+ break;
+ /*
+ * Set CMMA state to nodat here, since the struct page memory
+ * at the beginning of the memory block will not go through the
+ * buddy allocator later.
+ */
+ __arch_set_page_nodat((void *)__va(start), arg->altmap_nr_pages);
break;
- case MEM_CANCEL_ONLINE:
- sclp_mem_change_state(start, size, 0);
- break;
- case MEM_OFFLINE:
+ case MEM_FINISH_OFFLINE:
+ /*
+ * When altmap is in use, take the specified memory range
+ * offline, which includes the altmap.
+ */
+ if (arg->altmap_nr_pages) {
+ start = PFN_PHYS(arg->altmap_start_pfn);
+ size += PFN_PHYS(arg->altmap_nr_pages);
+ }
sclp_mem_change_state(start, size, 0);
break;
default:
@@ -397,7 +424,9 @@ static void __init add_memory_merged(u16 rn)
if (!size)
goto skip_add;
for (addr = start; addr < start + size; addr += block_size)
- add_memory(0, addr, block_size, MHP_NONE);
+ add_memory(0, addr, block_size,
+ MACHINE_HAS_EDAT1 ?
+ MHP_MEMMAP_ON_MEMORY | MHP_OFFLINE_INACCESSIBLE : MHP_NONE);
skip_add:
first_rn = rn;
num = 1;
--
2.40.1


2024-01-08 13:29:30

by Sumanth Korikkar

[permalink] [raw]
Subject: [PATCH v6 3/5] s390/sclp: remove unhandled memory notifier type

Remove memory notifier types which are unhandled by s390. Unhandled
memory notifier types are covered by default case.

Suggested-by: Alexander Gordeev <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Sumanth Korikkar <[email protected]>
---
drivers/s390/char/sclp_cmd.c | 3 ---
1 file changed, 3 deletions(-)

diff --git a/drivers/s390/char/sclp_cmd.c b/drivers/s390/char/sclp_cmd.c
index 11c428f4c7cf..355e63e44e95 100644
--- a/drivers/s390/char/sclp_cmd.c
+++ b/drivers/s390/char/sclp_cmd.c
@@ -340,9 +340,6 @@ static int sclp_mem_notifier(struct notifier_block *nb,
if (contains_standby_increment(start, start + size))
rc = -EPERM;
break;
- case MEM_ONLINE:
- case MEM_CANCEL_OFFLINE:
- break;
case MEM_GOING_ONLINE:
rc = sclp_mem_change_state(start, size, 1);
break;
--
2.40.1


2024-01-08 13:29:39

by Sumanth Korikkar

[permalink] [raw]
Subject: [PATCH v6 1/5] mm/memory_hotplug: introduce MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers

Introduce MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE memory notifiers to
prepare the transition of memory to and from a physically accessible
state. This enhancement is crucial for implementing the "memmap on
memory" feature for s390 in a subsequent patch.

Platforms such as x86 can support physical memory hotplug via ACPI. When
there is physical memory hotplug, ACPI event leads to the memory
addition with the following callchain:
acpi_memory_device_add()
-> acpi_memory_enable_device()
-> __add_memory()

After this, the hotplugged memory is physically accessible, and altmap
support prepared, before the "memmap on memory" initialization in
memory_block_online() is called.

On s390, memory hotplug works in a different way. The available hotplug
memory has to be defined upfront in the hypervisor, but it is made
physically accessible only when the user sets it online via sysfs,
currently in the MEM_GOING_ONLINE notifier. This is too late and "memmap
on memory" initialization is performed before calling MEM_GOING_ONLINE
notifier.

During the memory hotplug addition phase, altmap support is prepared and
during the memory onlining phase s390 requires memory to be physically
accessible and then subsequently initiate the "memmap on memory"
initialization process.

The memory provider will handle new MEM_PREPARE_ONLINE /
MEM_FINISH_OFFLINE notifications and make the memory accessible.

The mhp_flag MHP_OFFLINE_INACCESSIBLE is introduced and is relevant when
used along with MHP_MEMMAP_ON_MEMORY, because the altmap cannot be
written (e.g., poisoned) when adding memory -- before it is set online.
This allows for adding memory with an altmap that is not currently made
available by a hypervisor. When onlining that memory, the hypervisor can
be instructed to make that memory accessible via the new notifiers and
the onlining phase will not require any memory allocations, which is
helpful in low-memory situations.

All architectures ignore unknown memory notifiers. Therefore, the
introduction of these new notifiers does not result in any functional
modifications across architectures.

Suggested-by: Gerald Schaefer <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Signed-off-by: Sumanth Korikkar <[email protected]>
---
drivers/base/memory.c | 23 ++++++++++++++++++++++-
include/linux/memory.h | 9 +++++++++
include/linux/memory_hotplug.h | 18 +++++++++++++++++-
include/linux/memremap.h | 1 +
mm/memory_hotplug.c | 17 ++++++++++++++---
mm/sparse.c | 3 ++-
6 files changed, 65 insertions(+), 6 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 8a13babd826c..b99bcc70d6e5 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -188,6 +188,7 @@ static int memory_block_online(struct memory_block *mem)
unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
unsigned long nr_vmemmap_pages = 0;
+ struct memory_notify arg;
struct zone *zone;
int ret;

@@ -207,9 +208,19 @@ static int memory_block_online(struct memory_block *mem)
if (mem->altmap)
nr_vmemmap_pages = mem->altmap->free;

+ arg.altmap_start_pfn = start_pfn;
+ arg.altmap_nr_pages = nr_vmemmap_pages;
+ arg.start_pfn = start_pfn + nr_vmemmap_pages;
+ arg.nr_pages = nr_pages - nr_vmemmap_pages;
mem_hotplug_begin();
+ ret = memory_notify(MEM_PREPARE_ONLINE, &arg);
+ ret = notifier_to_errno(ret);
+ if (ret)
+ goto out_notifier;
+
if (nr_vmemmap_pages) {
- ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages, zone);
+ ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages,
+ zone, mem->altmap->inaccessible);
if (ret)
goto out;
}
@@ -231,7 +242,11 @@ static int memory_block_online(struct memory_block *mem)
nr_vmemmap_pages);

mem->zone = zone;
+ mem_hotplug_done();
+ return ret;
out:
+ memory_notify(MEM_FINISH_OFFLINE, &arg);
+out_notifier:
mem_hotplug_done();
return ret;
}
@@ -244,6 +259,7 @@ static int memory_block_offline(struct memory_block *mem)
unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
unsigned long nr_vmemmap_pages = 0;
+ struct memory_notify arg;
int ret;

if (!mem->zone)
@@ -275,6 +291,11 @@ static int memory_block_offline(struct memory_block *mem)
mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);

mem->zone = NULL;
+ arg.altmap_start_pfn = start_pfn;
+ arg.altmap_nr_pages = nr_vmemmap_pages;
+ arg.start_pfn = start_pfn + nr_vmemmap_pages;
+ arg.nr_pages = nr_pages - nr_vmemmap_pages;
+ memory_notify(MEM_FINISH_OFFLINE, &arg);
out:
mem_hotplug_done();
return ret;
diff --git a/include/linux/memory.h b/include/linux/memory.h
index f53cfdaaaa41..939a16bd5cea 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -96,8 +96,17 @@ int set_memory_block_size_order(unsigned int order);
#define MEM_GOING_ONLINE (1<<3)
#define MEM_CANCEL_ONLINE (1<<4)
#define MEM_CANCEL_OFFLINE (1<<5)
+#define MEM_PREPARE_ONLINE (1<<6)
+#define MEM_FINISH_OFFLINE (1<<7)

struct memory_notify {
+ /*
+ * The altmap_start_pfn and altmap_nr_pages fields are designated for
+ * specifying the altmap range and are exclusively intended for use in
+ * MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers.
+ */
+ unsigned long altmap_start_pfn;
+ unsigned long altmap_nr_pages;
unsigned long start_pfn;
unsigned long nr_pages;
int status_change_nid_normal;
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 7d2076583494..ee00015575aa 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -106,6 +106,22 @@ typedef int __bitwise mhp_t;
* implies the node id (nid).
*/
#define MHP_NID_IS_MGID ((__force mhp_t)BIT(2))
+/*
+ * The hotplugged memory is completely inaccessible while the memory is
+ * offline. The memory provider will handle MEM_PREPARE_ONLINE /
+ * MEM_FINISH_OFFLINE notifications and make the memory accessible.
+ *
+ * This flag is only relevant when used along with MHP_MEMMAP_ON_MEMORY,
+ * because the altmap cannot be written (e.g., poisoned) when adding
+ * memory -- before it is set online.
+ *
+ * This allows for adding memory with an altmap that is not currently
+ * made available by a hypervisor. When onlining that memory, the
+ * hypervisor can be instructed to make that memory available, and
+ * the onlining phase will not require any memory allocations, which is
+ * helpful in low-memory situations.
+ */
+#define MHP_OFFLINE_INACCESSIBLE ((__force mhp_t)BIT(3))

/*
* Extended parameters for memory hotplug:
@@ -154,7 +170,7 @@ extern void adjust_present_page_count(struct page *page,
long nr_pages);
/* VM interface that may be used by firmware interface */
extern int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
- struct zone *zone);
+ struct zone *zone, bool mhp_off_inaccessible);
extern void mhp_deinit_memmap_on_memory(unsigned long pfn, unsigned long nr_pages);
extern int online_pages(unsigned long pfn, unsigned long nr_pages,
struct zone *zone, struct memory_group *group);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 744c830f4b13..9837f3e6fb95 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -25,6 +25,7 @@ struct vmem_altmap {
unsigned long free;
unsigned long align;
unsigned long alloc;
+ bool inaccessible;
};

/*
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 926e1cfb10e9..b8aa42af93dc 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1084,7 +1084,7 @@ void adjust_present_page_count(struct page *page, struct memory_group *group,
}

int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
- struct zone *zone)
+ struct zone *zone, bool mhp_off_inaccessible)
{
unsigned long end_pfn = pfn + nr_pages;
int ret, i;
@@ -1093,6 +1093,15 @@ int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
if (ret)
return ret;

+ /*
+ * Memory block is accessible at this stage and hence poison the struct
+ * pages now. If the memory block is accessible during memory hotplug
+ * addition phase, then page poisining is already performed in
+ * sparse_add_section().
+ */
+ if (mhp_off_inaccessible)
+ page_init_poison(pfn_to_page(pfn), sizeof(struct page) * nr_pages);
+
move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);

for (i = 0; i < nr_pages; i++)
@@ -1412,7 +1421,7 @@ static void __ref remove_memory_blocks_and_altmaps(u64 start, u64 size)
}

static int create_altmaps_and_memory_blocks(int nid, struct memory_group *group,
- u64 start, u64 size)
+ u64 start, u64 size, mhp_t mhp_flags)
{
unsigned long memblock_size = memory_block_size_bytes();
u64 cur_start;
@@ -1428,6 +1437,8 @@ static int create_altmaps_and_memory_blocks(int nid, struct memory_group *group,
};

mhp_altmap.free = memory_block_memmap_on_memory_pages();
+ if (mhp_flags & MHP_OFFLINE_INACCESSIBLE)
+ mhp_altmap.inaccessible = true;
params.altmap = kmemdup(&mhp_altmap, sizeof(struct vmem_altmap),
GFP_KERNEL);
if (!params.altmap) {
@@ -1513,7 +1524,7 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
*/
if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) &&
mhp_supports_memmap_on_memory(memory_block_size_bytes())) {
- ret = create_altmaps_and_memory_blocks(nid, group, start, size);
+ ret = create_altmaps_and_memory_blocks(nid, group, start, size, mhp_flags);
if (ret)
goto error;
} else {
diff --git a/mm/sparse.c b/mm/sparse.c
index 338cf946dee8..aed0951b87fa 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -908,7 +908,8 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
* Poison uninitialized struct pages in order to catch invalid flags
* combinations.
*/
- page_init_poison(memmap, sizeof(struct page) * nr_pages);
+ if (!altmap || !altmap->inaccessible)
+ page_init_poison(memmap, sizeof(struct page) * nr_pages);

ms = __nr_to_section(section_nr);
set_section_nid(section_nr, nid);
--
2.40.1