2020-01-10 03:09:41

by Anshuman Khandual

[permalink] [raw]
Subject: [PATCH V11 0/5] arm64/mm: Enable memory hot remove

This series enables memory hot remove functionality on the arm64 platform.
It is based on Linux 5.5-rc5 and deals in particular with a problem caused
when an attempt is made to remove boot memory. It introduces a couple of new
generic constructs while solving this boot memory problem.

On the arm64 platform, it is essential to ensure that memory discovered at
boot time cannot be hot-removed, so that:

1. FW data structures used across kexec are idempotent
e.g. the EFI memory map.

2. linear map or vmemmap would not have to be dynamically split, and can
map boot memory at a large granularity

3. Avoid penalizing paths that have to walk page tables, where we can be
certain that the memory is not hot-removable

This problem was discussed extensively during the V10 review, which can be
found here (https://lkml.org/lkml/2019/10/11/233). Nevertheless, this series
adds a new memblock flag MEMBLOCK_BOOT in order to track boot memory at
runtime, along with an arch specific callback arch_memory_removable() in
try_remove_memory() which can be overridden where required to reject a given
memory removal request, such as ranges overlapping boot memory on arm64. It
also fixes a potential race between dumping kernel page table entries and a
concurrent memory hot remove operation.
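
For context, a rough sketch of how the new pieces fit together (simplified
for illustration only; the exact call sites are in the individual patches):

	/* At boot, while processing the FDT, mark firmware provided memory */
	memblock_add(base, size);
	memblock_mark_boot(base, size);		/* new MEMBLOCK_BOOT flag */

	/* At runtime, before mem_hotplug_begin() in try_remove_memory() */
	if (!arch_memory_removable(start, size))	/* overridden on arm64 */
		return -EINVAL;			/* range overlaps boot memory */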

Concurrent vmalloc() and hot-remove conflict:

As pointed out earlier on the V5 thread [2], there can be a conflict between
a concurrent vmalloc() and a memory hot-remove operation. The problem is
caused by the limited locking in vmalloc(), which protects the installation
of a page table page but not the page table walk or the modification of leaf
entries.

free_empty_tables() and its child functions now take into account the
maximum possible range on which they operate, as a floor/ceiling boundary.
This ensures that no page table page is freed unless it lies fully within
the maximum possible range decided by the caller.
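
For reference, the helper that implements this floor/ceiling check in patch
5/5 looks as follows (reproduced here with extra comments):

	/*
	 * A page table page is freed only if the caller supplied
	 * [floor, ceiling] limits cover the entire address span mapped by
	 * that page: start rounded down to the page's span (mask) must not
	 * fall below floor, and end must not go beyond the rounded ceiling.
	 */
	static bool pgtable_range_aligned(unsigned long start, unsigned long end,
					  unsigned long floor, unsigned long ceiling,
					  unsigned long mask)
	{
		start &= mask;
		if (start < floor)
			return false;

		if (ceiling) {
			ceiling &= mask;
			if (!ceiling)
				return false;
		}

		/* ceiling == 0 means no upper limit (relies on unsigned wrap) */
		if (end - 1 > ceiling - 1)
			return false;
		return true;
	}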

Testing:

Memory hot remove has been tested on arm64 with the 4K, 16K and 64K page
size configurations and all possible CONFIG_ARM64_VA_BITS and
CONFIG_PGTABLE_LEVELS combinations.

Changes in V11:

- Bifurcated check_hotplug_memory_range() and carved out check_hotremove_memory_range()
- Introduced the arch_memory_removable() callback for validating the hot remove range
- Introduced memblock flag MEMBLOCK_BOOT in order to track boot memory at runtime
- Marked all boot memory ranges on arm64 with MEMBLOCK_BOOT flag while processing FDT
- Overridden arch_memory_removable() on arm64 to reject boot memory removal requests
- Added a WARN_ON() in arch_remove_memory() when it receives a boot memory removal request
- Added arch_memory_removable() related updates in the commit message for core hot remove

Changes in V10: (https://lkml.org/lkml/2019/10/11/233)

- Perform just single TLBI invalidation for PMD or PUD block mappings per Catalin
- Added comment in free_empty_pte_table() while validating PTE level clears per Catalin
- Added comments in free_empty_pxx_table() while checking for non-clear entries per Catalin

Changes in V9: (https://lkml.org/lkml/2019/10/9/131)

- Dropped ACK tags from Steve and David as this series has changed since
- Dropped WARN(!page) in free_hotplug_page_range() per Matthew Wilcox
- Replaced pxx_page() with virt_to_page() in free_pxx_table() per Catalin
- Dropped page and call virt_to_page() in free_hotplug_pgtable_page()
- Replaced sparse_vmap with free_mapped per Catalin
- Dropped ternary operators in all unmap_hotplug_pxx_range() per Catalin
- Collapsed all free_pxx_table() into free_empty_pxx_table() per Catalin

Changes in V8: (https://lkml.org/lkml/2019/9/23/22)

- Dropped the first patch (memblock_[free|remove] reorder) from the series which
is no longer needed for arm64 hot-remove enablement and was posted separately
as (https://patchwork.kernel.org/patch/11146361/)
- Dropped vmalloc-vmemmap detection and subsequent skipping of free_empty_tables()
- Changed free_empty_[pxx]_tables() functions which now accepts a possible maximum
floor-ceiling address range on which it operates. Also changed free_pxx_table()
functions to check against required alignment as well as maximum floor-ceiling
range as another prerequisite before freeing the page table page.
- Dropped remove_pagetable(), instead call its constituent functions directly

Changes in V7: (https://lkml.org/lkml/2019/9/3/326)

- vmalloc_vmemmap_overlap gets evaluated early during boot for a given config
- free_empty_tables() gets conditionally called based on vmalloc_vmemmap_overlap

Changes in V6: (https://lkml.org/lkml/2019/7/15/36)

- Implemented most of the suggestions from Mark Rutland
- Added <linux/memory_hotplug.h> in ptdump
- remove_pagetable() now has two distinct passes over the kernel page table
- First pass unmap_hotplug_range() removes leaf level entries at all levels
- Second pass free_empty_tables() removes empty page table pages
- Kernel page table lock has been dropped completely
- vmemmap_free() does not call free_empty_tables() to avoid conflict with vmalloc()
- All address range scanning is converted to do {} while() loops
- Added 'unsigned long end' in __remove_pgd_mapping()
- Callers need not provide starting pointer argument to free_[pte|pmd|pud]_table()
- Drop the starting pointer argument from free_[pte|pmd|pud]_table() functions
- Fetching pxxp[i] in free_[pte|pmd|pud]_table() is wrapped around in READ_ONCE()
- free_[pte|pmd|pud]_table() now computes starting pointer inside the function
- Fixed TLB handling while freeing huge page section mappings at PMD or PUD level
- Added WARN_ON(!page) in free_hotplug_page_range()
- Added WARN_ON(![pmd|pud]_table(pmd|pud)) when there is no section mapping

- [PATCH 1/3] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory()
- Requested earlier for separate merge (https://patchwork.kernel.org/patch/10986599/)
- s/__remove_memory/try_remove_memory in the subject line
- s/arch_remove_memory/memblock_[free|remove] in the subject line
- A small change in the commit message as re-order happens now for memblock remove
functions not for arch_remove_memory()

Changes in V5: (https://lkml.org/lkml/2019/5/29/218)

- Have some agreement [1] over using memory_hotplug_lock for arm64 ptdump
- Change 7ba36eccb3f8 ("arm64/mm: Inhibit huge-vmap with ptdump") already merged
- Dropped the above patch from this series
- Fixed indentation problem in arch_[add|remove]_memory() as per David
- Collected all new Acked-by tags

Changes in V4: (https://lkml.org/lkml/2019/5/20/19)

- Implemented most of the suggestions from Mark Rutland
- Interchanged patch [PATCH 2/4] <---> [PATCH 3/4] and updated commit message
- Moved CONFIG_PGTABLE_LEVELS inside free_[pud|pmd]_table()
- Used READ_ONCE() in missing instances while accessing page table entries
- s/p???_present()/p???_none() for checking valid kernel page table entries
- WARN_ON() when an entry is !p???_none() and !p???_present() at the same time
- Updated memory hot-remove commit message with additional details as suggested
- Rebased the series on 5.2-rc1 with hotplug changes from David and Michal Hocko
- Collected all new Acked-by tags

Changes in V3: (https://lkml.org/lkml/2019/5/14/197)

- Implemented most of the suggestions from Mark Rutland for remove_pagetable()
- Fixed applicable PGTABLE_LEVEL wrappers around pgtable page freeing functions
- Replaced 'direct' with 'sparse_vmap' in remove_pagetable() with inverted polarity
- Changed pointer names ('p' at end) and removed tmp from iterations
- Perform intermediate TLB invalidation while clearing pgtable entries
- Dropped flush_tlb_kernel_range() in remove_pagetable()
- Added flush_tlb_kernel_range() in remove_pte_table() instead
- Renamed page freeing functions for pgtable page and mapped pages
- Used page range size instead of order while freeing mapped or pgtable pages
- Removed all PageReserved() handling while freeing mapped or pgtable pages
- Replaced XXX_index() with XXX_offset() while walking the kernel page table
- Used READ_ONCE() while fetching individual pgtable entries
- Taken overall init_mm.page_table_lock instead of just while changing an entry
- Dropped previously added [pmd|pud]_index() which are not required anymore
- Added a new patch to protect kernel page table race condition for ptdump
- Added a new patch from Mark Rutland to prevent huge-vmap with ptdump

Changes in V2: (https://lkml.org/lkml/2019/4/14/5)

- Added all received review and ack tags
- Split the series from ZONE_DEVICE enablement for better review
- Moved memblock re-order patch to the front as per Robin Murphy
- Updated commit message on memblock re-order patch per Michal Hocko
- Dropped [pmd|pud]_large() definitions
- Used existing [pmd|pud]_sect() instead of earlier [pmd|pud]_large()
- Removed __meminit and __ref tags as per Oscar Salvador
- Dropped unnecessary 'ret' init in arch_add_memory() per Robin Murphy
- Skipped calling into pgtable_page_dtor() for linear mapping page table
pages and updated all relevant functions

Changes in V1: (https://lkml.org/lkml/2019/4/3/28)

References:

[1] https://lkml.org/lkml/2019/5/28/584
[2] https://lkml.org/lkml/2019/6/11/709

Anshuman Khandual (5):
mm/hotplug: Introduce arch callback validating the hot remove range
mm/memblock: Introduce MEMBLOCK_BOOT flag
of/fdt: Mark boot memory with MEMBLOCK_BOOT
arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
arm64/mm: Enable memory hot remove

arch/arm64/Kconfig | 3 +
arch/arm64/include/asm/memory.h | 6 +
arch/arm64/mm/mmu.c | 334 ++++++++++++++++++++++++++++++++++++++--
arch/arm64/mm/ptdump_debugfs.c | 4 +
drivers/of/fdt.c | 1 +
include/linux/memblock.h | 10 ++
include/linux/memory_hotplug.h | 7 +
mm/memblock.c | 37 +++++
mm/memory_hotplug.c | 21 ++-
9 files changed, 413 insertions(+), 10 deletions(-)

--
2.7.4


2020-01-10 03:10:04

by Anshuman Khandual

[permalink] [raw]
Subject: [PATCH V11 1/5] mm/hotplug: Introduce arch callback validating the hot remove range

Currently there are two interfaces to initiate memory range hot removal, i.e.
remove_memory() and __remove_memory(), both of which then call
try_remove_memory(). The platform is called via arch_remove_memory() to tear
down the required kernel page tables and perform other arch specific
procedures. But there are platforms like arm64 which might want to prevent
removal of certain specific memory ranges irrespective of their present usage
or movability properties.

The current arch callback arch_remove_memory() is too late in the process to
abort memory hot removal, as memory block devices and firmware memory map
entries will already have been removed. Platforms should be able to abort
the process before taking the mem_hotplug_lock with mem_hotplug_begin().
This essentially requires a new arch callback for memory range validation.

This patch differentiates memory range validation between the memory hot add
and hot remove paths by carving out a new helper,
check_hotremove_memory_range(), which incorporates a new arch callback. This
callback gives platforms an opportunity to refuse memory removal at the very
outset. In the future the same principle can be extended to the memory hot
add path if required.

Platforms can choose to override this callback in order to reject removal of
specific memory ranges, or can simply fall back to the default
implementation, which allows removal of all memory ranges.
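
As an illustration, a platform could wire up the override roughly as follows.
This is only a sketch: platform_range_is_pinned() is a hypothetical policy
helper, and the actual arm64 override added in patch 5/5 rejects ranges that
overlap boot memory.

	/* arch/<arch>/include/asm/memory.h */
	#define arch_memory_removable arch_memory_removable
	extern bool arch_memory_removable(u64 base, u64 size);

	/* arch/<arch>/mm/init.c */
	bool arch_memory_removable(u64 base, u64 size)
	{
		/* platform_range_is_pinned() is hypothetical, for illustration */
		return !platform_range_is_pinned(base, size);
	}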

Cc: Andrew Morton <[email protected]>
Signed-off-by: Anshuman Khandual <[email protected]>
---
include/linux/memory_hotplug.h | 7 +++++++
mm/memory_hotplug.c | 21 ++++++++++++++++++++-
2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index ba0dca6..f661bd5 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -305,6 +305,13 @@ static inline void pgdat_resize_init(struct pglist_data *pgdat) {}

#ifdef CONFIG_MEMORY_HOTREMOVE

+#ifndef arch_memory_removable
+static inline bool arch_memory_removable(u64 base, u64 size)
+{
+ return true;
+}
+#endif
+
extern bool is_mem_section_removable(unsigned long pfn, unsigned long nr_pages);
extern void try_offline_node(int nid);
extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index a91a072..7cdf800 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1014,6 +1014,23 @@ static int check_hotplug_memory_range(u64 start, u64 size)
return 0;
}

+static int check_hotremove_memory_range(u64 start, u64 size)
+{
+ int rc;
+
+ BUG_ON(check_hotplug_memory_range(start, size));
+
+ /*
+ * First check if the platform is willing to have this
+ * memory range removed else just abort.
+ */
+ rc = arch_memory_removable(start, size);
+ if (!rc)
+ return -EINVAL;
+
+ return 0;
+}
+
static int online_memory_block(struct memory_block *mem, void *arg)
{
return device_online(&mem->dev);
@@ -1762,7 +1779,9 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
{
int rc = 0;

- BUG_ON(check_hotplug_memory_range(start, size));
+ rc = check_hotremove_memory_range(start, size);
+ if (rc)
+ return rc;

mem_hotplug_begin();

--
2.7.4

2020-01-10 03:10:13

by Anshuman Khandual

[permalink] [raw]
Subject: [PATCH V11 2/5] mm/memblock: Introduce MEMBLOCK_BOOT flag

On the arm64 platform boot memory should never be hot removed, due to certain
platform specific constraints. Hence the platform would like to override the
earlier added arch callback arch_memory_removable() for this purpose. In
order to reject boot memory hot removal requests, it first needs to track
boot memory at runtime. In the future there might be other platforms that
require runtime boot memory enumeration, so let's extend the existing generic
memblock framework for this purpose rather than creating one just for the
arm64 platform.

This introduces a new memblock flag, MEMBLOCK_BOOT, along with helpers, which
a given platform can use to mark all memory regions discovered during boot.
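
For example, patch 5/5 consults the new interface in memory block sized
steps when deciding whether a removal request overlaps boot memory, roughly
as follows (simplified from that patch):

	static bool range_overlaps_bootmem(u64 base, u64 size)
	{
		u64 addr, end = base + size;
		unsigned long mem_block_size = memory_block_size_bytes();

		/* Memory hot add and remove operate in memory block units */
		for (addr = base; addr < end; addr += mem_block_size) {
			if (memblock_is_boot_memory(addr))
				return true;
		}
		return false;
	}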

Cc: Mike Rapoport <[email protected]>
Cc: Andrew Morton <[email protected]>
Signed-off-by: Anshuman Khandual <[email protected]>
---
include/linux/memblock.h | 10 ++++++++++
mm/memblock.c | 37 +++++++++++++++++++++++++++++++++++++
2 files changed, 47 insertions(+)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index b38bbef..fb04c87 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -31,12 +31,14 @@ extern unsigned long long max_possible_pfn;
* @MEMBLOCK_HOTPLUG: hotpluggable region
* @MEMBLOCK_MIRROR: mirrored region
* @MEMBLOCK_NOMAP: don't add to kernel direct mapping
+ * @MEMBLOCK_BOOT: memory received from firmware during boot
*/
enum memblock_flags {
MEMBLOCK_NONE = 0x0, /* No special request */
MEMBLOCK_HOTPLUG = 0x1, /* hotpluggable region */
MEMBLOCK_MIRROR = 0x2, /* mirrored region */
MEMBLOCK_NOMAP = 0x4, /* don't add to kernel direct mapping */
+ MEMBLOCK_BOOT = 0x8, /* memory received from firmware during boot */
};

/**
@@ -116,6 +118,8 @@ int memblock_reserve(phys_addr_t base, phys_addr_t size);
void memblock_trim_memory(phys_addr_t align);
bool memblock_overlaps_region(struct memblock_type *type,
phys_addr_t base, phys_addr_t size);
+int memblock_mark_boot(phys_addr_t base, phys_addr_t size);
+int memblock_clear_boot(phys_addr_t base, phys_addr_t size);
int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
@@ -216,6 +220,11 @@ static inline bool memblock_is_nomap(struct memblock_region *m)
return m->flags & MEMBLOCK_NOMAP;
}

+static inline bool memblock_is_boot(struct memblock_region *m)
+{
+ return m->flags & MEMBLOCK_BOOT;
+}
+
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
unsigned long *end_pfn);
@@ -449,6 +458,7 @@ void memblock_cap_memory_range(phys_addr_t base, phys_addr_t size);
void memblock_mem_limit_remove_map(phys_addr_t limit);
bool memblock_is_memory(phys_addr_t addr);
bool memblock_is_map_memory(phys_addr_t addr);
+bool memblock_is_boot_memory(phys_addr_t addr);
bool memblock_is_region_memory(phys_addr_t base, phys_addr_t size);
bool memblock_is_reserved(phys_addr_t addr);
bool memblock_is_region_reserved(phys_addr_t base, phys_addr_t size);
diff --git a/mm/memblock.c b/mm/memblock.c
index 4bc2c7d..e10207f 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -865,6 +865,30 @@ static int __init_memblock memblock_setclr_flag(phys_addr_t base,
}

/**
+ * memblock_mark_boot - Mark boot memory with flag MEMBLOCK_BOOT.
+ * @base: the base phys addr of the region
+ * @size: the size of the region
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+int __init_memblock memblock_mark_boot(phys_addr_t base, phys_addr_t size)
+{
+ return memblock_setclr_flag(base, size, 1, MEMBLOCK_BOOT);
+}
+
+/**
+ * memblock_clear_boot - Clear flag MEMBLOCK_BOOT for a specified region.
+ * @base: the base phys addr of the region
+ * @size: the size of the region
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+int __init_memblock memblock_clear_boot(phys_addr_t base, phys_addr_t size)
+{
+ return memblock_setclr_flag(base, size, 0, MEMBLOCK_BOOT);
+}
+
+/**
* memblock_mark_hotplug - Mark hotpluggable memory with flag MEMBLOCK_HOTPLUG.
* @base: the base phys addr of the region
* @size: the size of the region
@@ -974,6 +998,10 @@ static bool should_skip_region(struct memblock_region *m, int nid, int flags)
if ((flags & MEMBLOCK_MIRROR) && !memblock_is_mirror(m))
return true;

+ /* if we want boot memory skip non-boot memory regions */
+ if ((flags & MEMBLOCK_BOOT) && !memblock_is_boot(m))
+ return true;
+
/* skip nomap memory unless we were asked for it explicitly */
if (!(flags & MEMBLOCK_NOMAP) && memblock_is_nomap(m))
return true;
@@ -1785,6 +1813,15 @@ bool __init_memblock memblock_is_map_memory(phys_addr_t addr)
return !memblock_is_nomap(&memblock.memory.regions[i]);
}

+bool __init_memblock memblock_is_boot_memory(phys_addr_t addr)
+{
+ int i = memblock_search(&memblock.memory, addr);
+
+ if (i == -1)
+ return false;
+ return memblock_is_boot(&memblock.memory.regions[i]);
+}
+
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
int __init_memblock memblock_search_pfn_nid(unsigned long pfn,
unsigned long *start_pfn, unsigned long *end_pfn)
--
2.7.4

2020-01-10 03:10:34

by Anshuman Khandual

[permalink] [raw]
Subject: [PATCH V11 3/5] of/fdt: Mark boot memory with MEMBLOCK_BOOT

early_init_dt_add_memory_arch() adds memory into memblock on both UEFI and
DT based arm64 systems. Let's mark these regions as boot memory right after
they are added to memblock. All other platforms using this default
implementation of early_init_dt_add_memory_arch() will also have this
memblock flag set on their boot memory ranges, but it is up to those
platforms whether they use it or not. On arm64 this flag will be used to
identify boot memory at runtime and reject any attempt to remove it.

Cc: Rob Herring <[email protected]>
Cc: Frank Rowand <[email protected]>
Cc: [email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
---
drivers/of/fdt.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
index 2cdf64d..a2ae2c88 100644
--- a/drivers/of/fdt.c
+++ b/drivers/of/fdt.c
@@ -1143,6 +1143,7 @@ void __init __weak early_init_dt_add_memory_arch(u64 base, u64 size)
base = phys_offset;
}
memblock_add(base, size);
+ memblock_mark_boot(base, size);
}

int __init __weak early_init_dt_mark_hotplug_memory_arch(u64 base, u64 size)
--
2.7.4

2020-01-10 03:10:54

by Anshuman Khandual

[permalink] [raw]
Subject: [PATCH V11 4/5] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump

The arm64 page table dump code can race with concurrent modification of the
kernel page tables. When leaf entries are modified concurrently, the dump
code may log stale or inconsistent information for a VA range, but this is
otherwise not harmful.

When intermediate levels of table are freed, the dump code will continue to
use memory which has been freed and potentially reallocated for another
purpose. In such cases, the dump code may dereference bogus addresses,
leading to a number of potential problems.

Intermediate levels of table may be freed during memory hot-remove,
which will be enabled by a subsequent patch. To avoid racing with
this, take the memory hotplug lock when walking the kernel page table.

Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Acked-by: Mark Rutland <[email protected]>
Signed-off-by: Anshuman Khandual <[email protected]>
---
arch/arm64/mm/ptdump_debugfs.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c
index 064163f..b5eebc8 100644
--- a/arch/arm64/mm/ptdump_debugfs.c
+++ b/arch/arm64/mm/ptdump_debugfs.c
@@ -1,5 +1,6 @@
// SPDX-License-Identifier: GPL-2.0
#include <linux/debugfs.h>
+#include <linux/memory_hotplug.h>
#include <linux/seq_file.h>

#include <asm/ptdump.h>
@@ -7,7 +8,10 @@
static int ptdump_show(struct seq_file *m, void *v)
{
struct ptdump_info *info = m->private;
+
+ get_online_mems();
ptdump_walk_pgd(m, info);
+ put_online_mems();
return 0;
}
DEFINE_SHOW_ATTRIBUTE(ptdump);
--
2.7.4

2020-01-10 03:10:57

by Anshuman Khandual

[permalink] [raw]
Subject: [PATCH V11 5/5] arm64/mm: Enable memory hot remove

The arch code for hot-remove must tear down portions of the linear map and
vmemmap corresponding to memory being removed. In both cases the page
tables mapping these regions must be freed, and when sparse vmemmap is in
use the memory backing the vmemmap must also be freed.

This patch adds the unmap_hotplug_range() and free_empty_tables() helpers,
which can be used to tear down either region, and calls them from
vmemmap_free() and __remove_pgd_mapping(). The free_mapped argument
determines whether the backing memory will be freed.

The teardown makes two distinct passes over the kernel page table. In the
first pass, unmap_hotplug_range() unmaps each mapped leaf entry, invalidates
the applicable TLB entries and, if required (vmemmap), frees the backing
memory. In the second pass, free_empty_tables() looks for empty page table
sections whose page table pages can be unmapped, TLB invalidated and freed.
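
The two passes are invoked back to back by both callers; schematically
(simplified from the hunks below):

	/* vmemmap_free(): also free the pages backing the struct page array */
	unmap_hotplug_range(start, end, true);
	free_empty_tables(start, end, VMEMMAP_START, VMEMMAP_END);

	/* __remove_pgd_mapping(): linear map teardown, backing pages stay */
	unmap_hotplug_range(start, end, false);
	free_empty_tables(start, end, PAGE_OFFSET, PAGE_END);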

While freeing an intermediate level page table page, bail out if any of its
entries are still valid. This can happen with a partially filled kernel page
table, either from a previously failed memory hot add attempt or when
removing an address range which does not span the entire page table page
range.

The vmemmap region may share levels of table with the vmalloc region. There
can be a conflict between hot remove freeing page table pages and a
concurrent vmalloc() walking the kernel page table. This conflict cannot
simply be solved by taking the init_mm ptl because of the existing locking
scheme in vmalloc(). So free_empty_tables() implements a floor and ceiling
method, borrowed from the user page table tear-down in free_pgd_range(),
which skips freeing a page table page if the intermediate address range is
not aligned or the maximum floor/ceiling range might not own the entire page
table page.

Boot memory on arm64 cannot be removed. Hence, hook into the earlier added
platform callback mechanism arch_memory_removable() and reject any boot
memory removal requests.

While here, update arch_add_memory() to handle __add_pages() failures by
unmapping the recently added kernel linear mapping. Finally, enable memory
hot remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE.

This implementation is overall inspired by the kernel page table tear-down
procedure on the x86 architecture and the user page table tear-down method.

Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Steve Capper <[email protected]>
Cc: Mark Rutland <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]>
Signed-off-by: Anshuman Khandual <[email protected]>
---
arch/arm64/Kconfig | 3 +
arch/arm64/include/asm/memory.h | 6 +
arch/arm64/mm/mmu.c | 334 ++++++++++++++++++++++++++++++++++++++--
3 files changed, 334 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index b1b4476..402a114 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -277,6 +277,9 @@ config ZONE_DMA32
config ARCH_ENABLE_MEMORY_HOTPLUG
def_bool y

+config ARCH_ENABLE_MEMORY_HOTREMOVE
+ def_bool y
+
config SMP
def_bool y

diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index a4f9ca5..045a512 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -54,6 +54,7 @@
#define MODULES_VADDR (BPF_JIT_REGION_END)
#define MODULES_VSIZE (SZ_128M)
#define VMEMMAP_START (-VMEMMAP_SIZE - SZ_2M)
+#define VMEMMAP_END (VMEMMAP_START + VMEMMAP_SIZE)
#define PCI_IO_END (VMEMMAP_START - SZ_2M)
#define PCI_IO_START (PCI_IO_END - PCI_IO_SIZE)
#define FIXADDR_TOP (PCI_IO_START - SZ_2M)
@@ -292,6 +293,11 @@ static inline void *phys_to_virt(phys_addr_t x)
return (void *)(__phys_to_virt(x));
}

+#ifdef CONFIG_MEMORY_HOTREMOVE
+#define arch_memory_removable arch_memory_removable
+extern bool arch_memory_removable(u64 base, u64 size);
+#endif
+
/*
* Drivers should NOT use these either.
*/
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 40797cb..2cb1b2e 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -17,6 +17,7 @@
#include <linux/mman.h>
#include <linux/nodemask.h>
#include <linux/memblock.h>
+#include <linux/memory.h>
#include <linux/fs.h>
#include <linux/io.h>
#include <linux/mm.h>
@@ -724,6 +725,275 @@ int kern_addr_valid(unsigned long addr)

return pfn_valid(pte_pfn(pte));
}
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static void free_hotplug_page_range(struct page *page, size_t size)
+{
+ WARN_ON(PageReserved(page));
+ free_pages((unsigned long)page_address(page), get_order(size));
+}
+
+static void free_hotplug_pgtable_page(struct page *page)
+{
+ free_hotplug_page_range(page, PAGE_SIZE);
+}
+
+static bool pgtable_range_aligned(unsigned long start, unsigned long end,
+ unsigned long floor, unsigned long ceiling,
+ unsigned long mask)
+{
+ start &= mask;
+ if (start < floor)
+ return false;
+
+ if (ceiling) {
+ ceiling &= mask;
+ if (!ceiling)
+ return false;
+ }
+
+ if (end - 1 > ceiling - 1)
+ return false;
+ return true;
+}
+
+static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr,
+ unsigned long end, bool free_mapped)
+{
+ pte_t *ptep, pte;
+
+ do {
+ ptep = pte_offset_kernel(pmdp, addr);
+ pte = READ_ONCE(*ptep);
+ if (pte_none(pte))
+ continue;
+
+ WARN_ON(!pte_present(pte));
+ pte_clear(&init_mm, addr, ptep);
+ flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+ if (free_mapped)
+ free_hotplug_page_range(pte_page(pte), PAGE_SIZE);
+ } while (addr += PAGE_SIZE, addr < end);
+}
+
+static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr,
+ unsigned long end, bool free_mapped)
+{
+ unsigned long next;
+ pmd_t *pmdp, pmd;
+
+ do {
+ next = pmd_addr_end(addr, end);
+ pmdp = pmd_offset(pudp, addr);
+ pmd = READ_ONCE(*pmdp);
+ if (pmd_none(pmd))
+ continue;
+
+ WARN_ON(!pmd_present(pmd));
+ if (pmd_sect(pmd)) {
+ pmd_clear(pmdp);
+
+ /*
+ * One TLBI should be sufficient here as the PMD_SIZE
+ * range is mapped with a single block entry.
+ */
+ flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+ if (free_mapped)
+ free_hotplug_page_range(pmd_page(pmd),
+ PMD_SIZE);
+ continue;
+ }
+ WARN_ON(!pmd_table(pmd));
+ unmap_hotplug_pte_range(pmdp, addr, next, free_mapped);
+ } while (addr = next, addr < end);
+}
+
+static void unmap_hotplug_pud_range(pgd_t *pgdp, unsigned long addr,
+ unsigned long end, bool free_mapped)
+{
+ unsigned long next;
+ pud_t *pudp, pud;
+
+ do {
+ next = pud_addr_end(addr, end);
+ pudp = pud_offset(pgdp, addr);
+ pud = READ_ONCE(*pudp);
+ if (pud_none(pud))
+ continue;
+
+ WARN_ON(!pud_present(pud));
+ if (pud_sect(pud)) {
+ pud_clear(pudp);
+
+ /*
+ * One TLBI should be sufficient here as the PUD_SIZE
+ * range is mapped with a single block entry.
+ */
+ flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+ if (free_mapped)
+ free_hotplug_page_range(pud_page(pud),
+ PUD_SIZE);
+ continue;
+ }
+ WARN_ON(!pud_table(pud));
+ unmap_hotplug_pmd_range(pudp, addr, next, free_mapped);
+ } while (addr = next, addr < end);
+}
+
+static void unmap_hotplug_range(unsigned long addr, unsigned long end,
+ bool free_mapped)
+{
+ unsigned long next;
+ pgd_t *pgdp, pgd;
+
+ do {
+ next = pgd_addr_end(addr, end);
+ pgdp = pgd_offset_k(addr);
+ pgd = READ_ONCE(*pgdp);
+ if (pgd_none(pgd))
+ continue;
+
+ WARN_ON(!pgd_present(pgd));
+ unmap_hotplug_pud_range(pgdp, addr, next, free_mapped);
+ } while (addr = next, addr < end);
+}
+
+static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr,
+ unsigned long end, unsigned long floor,
+ unsigned long ceiling)
+{
+ pte_t *ptep, pte;
+ unsigned long i, start = addr;
+
+ do {
+ ptep = pte_offset_kernel(pmdp, addr);
+ pte = READ_ONCE(*ptep);
+
+ /*
+ * This is just a sanity check here which verifies that
+ * pte clearing has been done by earlier unmap loops.
+ */
+ WARN_ON(!pte_none(pte));
+ } while (addr += PAGE_SIZE, addr < end);
+
+ if (!pgtable_range_aligned(start, end, floor, ceiling, PMD_MASK))
+ return;
+
+ /*
+ * Check whether we can free the pte page if the rest of the
+ * entries are empty. Overlap with other regions has been
+ * handled by the floor/ceiling check.
+ */
+ ptep = pte_offset_kernel(pmdp, 0UL);
+ for (i = 0; i < PTRS_PER_PTE; i++) {
+ if (!pte_none(READ_ONCE(ptep[i])))
+ return;
+ }
+
+ pmd_clear(pmdp);
+ __flush_tlb_kernel_pgtable(start);
+ free_hotplug_pgtable_page(virt_to_page(ptep));
+}
+
+static void free_empty_pmd_table(pud_t *pudp, unsigned long addr,
+ unsigned long end, unsigned long floor,
+ unsigned long ceiling)
+{
+ pmd_t *pmdp, pmd;
+ unsigned long i, next, start = addr;
+
+ do {
+ next = pmd_addr_end(addr, end);
+ pmdp = pmd_offset(pudp, addr);
+ pmd = READ_ONCE(*pmdp);
+ if (pmd_none(pmd))
+ continue;
+
+ WARN_ON(!pmd_present(pmd) || !pmd_table(pmd) || pmd_sect(pmd));
+ free_empty_pte_table(pmdp, addr, next, floor, ceiling);
+ } while (addr = next, addr < end);
+
+ if (CONFIG_PGTABLE_LEVELS <= 2)
+ return;
+
+ if (!pgtable_range_aligned(start, end, floor, ceiling, PUD_MASK))
+ return;
+
+ /*
+ * Check whether we can free the pmd page if the rest of the
+ * entries are empty. Overlap with other regions has been
+ * handled by the floor/ceiling check.
+ */
+ pmdp = pmd_offset(pudp, 0UL);
+ for (i = 0; i < PTRS_PER_PMD; i++) {
+ if (!pmd_none(READ_ONCE(pmdp[i])))
+ return;
+ }
+
+ pud_clear(pudp);
+ __flush_tlb_kernel_pgtable(start);
+ free_hotplug_pgtable_page(virt_to_page(pmdp));
+}
+
+static void free_empty_pud_table(pgd_t *pgdp, unsigned long addr,
+ unsigned long end, unsigned long floor,
+ unsigned long ceiling)
+{
+ pud_t *pudp, pud;
+ unsigned long i, next, start = addr;
+
+ do {
+ next = pud_addr_end(addr, end);
+ pudp = pud_offset(pgdp, addr);
+ pud = READ_ONCE(*pudp);
+ if (pud_none(pud))
+ continue;
+
+ WARN_ON(!pud_present(pud) || !pud_table(pud) || pud_sect(pud));
+ free_empty_pmd_table(pudp, addr, next, floor, ceiling);
+ } while (addr = next, addr < end);
+
+ if (CONFIG_PGTABLE_LEVELS <= 3)
+ return;
+
+ if (!pgtable_range_aligned(start, end, floor, ceiling, PGDIR_MASK))
+ return;
+
+ /*
+ * Check whether we can free the pud page if the rest of the
+ * entries are empty. Overlap with other regions has been
+ * handled by the floor/ceiling check.
+ */
+ pudp = pud_offset(pgdp, 0UL);
+ for (i = 0; i < PTRS_PER_PUD; i++) {
+ if (!pud_none(READ_ONCE(pudp[i])))
+ return;
+ }
+
+ pgd_clear(pgdp);
+ __flush_tlb_kernel_pgtable(start);
+ free_hotplug_pgtable_page(virt_to_page(pudp));
+}
+
+static void free_empty_tables(unsigned long addr, unsigned long end,
+ unsigned long floor, unsigned long ceiling)
+{
+ unsigned long next;
+ pgd_t *pgdp, pgd;
+
+ do {
+ next = pgd_addr_end(addr, end);
+ pgdp = pgd_offset_k(addr);
+ pgd = READ_ONCE(*pgdp);
+ if (pgd_none(pgd))
+ continue;
+
+ WARN_ON(!pgd_present(pgd));
+ free_empty_pud_table(pgdp, addr, next, floor, ceiling);
+ } while (addr = next, addr < end);
+}
+#endif
+
#ifdef CONFIG_SPARSEMEM_VMEMMAP
#if !ARM64_SWAPPER_USES_SECTION_MAPS
int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
@@ -771,6 +1041,12 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
void vmemmap_free(unsigned long start, unsigned long end,
struct vmem_altmap *altmap)
{
+#ifdef CONFIG_MEMORY_HOTPLUG
+ WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
+
+ unmap_hotplug_range(start, end, true);
+ free_empty_tables(start, end, VMEMMAP_START, VMEMMAP_END);
+#endif
}
#endif /* CONFIG_SPARSEMEM_VMEMMAP */

@@ -1049,10 +1325,41 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
}

#ifdef CONFIG_MEMORY_HOTPLUG
+static bool range_overlaps_bootmem(u64 base, u64 size)
+{
+ unsigned long addr, end = base + size;
+ unsigned long mem_block_size = memory_block_size_bytes();
+
+ WARN_ON(!IS_ALIGNED(base, mem_block_size));
+ WARN_ON(!IS_ALIGNED(size, mem_block_size));
+
+ /*
+ * Both memory hot add and remove happen in memory block
+ * units. Any given memory block on the system was either
+ * added during boot or at runtime via hotplug.
+ */
+ for (addr = base; addr < end; addr += mem_block_size) {
+ if (memblock_is_boot_memory(addr))
+ return true;
+ }
+ return false;
+}
+
+static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size)
+{
+ unsigned long end = start + size;
+
+ WARN_ON(pgdir != init_mm.pgd);
+ WARN_ON((start < PAGE_OFFSET) || (end > PAGE_END));
+
+ unmap_hotplug_range(start, end, false);
+ free_empty_tables(start, end, PAGE_OFFSET, PAGE_END);
+}
+
int arch_add_memory(int nid, u64 start, u64 size,
struct mhp_restrictions *restrictions)
{
- int flags = 0;
+ int ret, flags = 0;

if (rodata_full || debug_pagealloc_enabled())
flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
@@ -1062,8 +1369,13 @@ int arch_add_memory(int nid, u64 start, u64 size,

memblock_clear_nomap(start, size);

- return __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
+ ret = __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
restrictions);
+ if (ret)
+ __remove_pgd_mapping(swapper_pg_dir,
+ __phys_to_virt(start), size);
+ return ret;
+
}
void arch_remove_memory(int nid, u64 start, u64 size,
struct vmem_altmap *altmap)
@@ -1071,13 +1383,17 @@ void arch_remove_memory(int nid, u64 start, u64 size,
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;

- /*
- * FIXME: Cleanup page tables (also in arch_add_memory() in case
- * adding fails). Until then, this function should only be used
- * during memory hotplug (adding memory), not for memory
- * unplug. ARCH_ENABLE_MEMORY_HOTREMOVE must not be
- * unlocked yet.
- */
+ WARN_ON(range_overlaps_bootmem(start, size));
__remove_pages(start_pfn, nr_pages, altmap);
+ __remove_pgd_mapping(swapper_pg_dir, __phys_to_virt(start), size);
+}
+#endif
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+bool arch_memory_removable(u64 base, u64 size)
+{
+ if (range_overlaps_bootmem(base, size))
+ return false;
+ return true;
}
#endif
--
2.7.4

2020-01-10 08:43:45

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH V11 1/5] mm/hotplug: Introduce arch callback validating the hot remove range

On 10.01.20 04:09, Anshuman Khandual wrote:
> Currently there are two interfaces to initiate memory range hot removal i.e
> remove_memory() and __remove_memory() which then calls try_remove_memory().
> Platform gets called with arch_remove_memory() to tear down required kernel
> page tables and other arch specific procedures. But there are platforms
> like arm64 which might want to prevent removal of certain specific memory
> ranges irrespective of their present usage or movability properties.

Why? Is this only relevant for boot memory? I hope so, otherwise the
arch code needs fixing IMHO.

If it's only boot memory, we should disallow offlining instead via a
memory notifier - much cleaner.

>
> Current arch call back arch_remove_memory() is too late in the process to
> abort memory hot removal as memory block devices and firmware memory map
> entries would have already been removed. Platforms should be able to abort
> the process before taking the mem_hotplug_lock with mem_hotplug_begin().
> This essentially requires a new arch callback for memory range validation.

I somewhat dislike this very much. Memory removal should never fail if
used sanely. See e.g., __remove_memory(), it will BUG() whenever
something like that would strike.

>
> This differentiates memory range validation between memory hot add and hot
> remove paths before carving out a new helper check_hotremove_memory_range()
> which incorporates a new arch callback. This call back provides platforms
> an opportunity to refuse memory removal at the very onset. In future the
> same principle can be extended for memory hot add path if required.
>
> Platforms can choose to override this callback in order to reject specific
> memory ranges from removal or can just fallback to a default implementation
> which allows removal of all memory ranges.

I suspect we really want to disallow offlining instead. E.g., I
remember s390x does that with certain areas needed for dumping/kexec.

Somebody who added memory via add_memory() should always be able to
remove the memory via remove_memory() again. Only boot memory can be
treated in a special way, but boot memory is initially always online.

--
Thanks,

David / dhildenb

2020-01-11 14:13:36

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH V11 1/5] mm/hotplug: Introduce arch callback validating the hot remove range

Hi Anshuman,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v5.5-rc5 next-20200110]
[cannot apply to arm64/for-next/core robh/for-next linux/master]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url: https://github.com/0day-ci/linux/commits/Anshuman-Khandual/arm64-mm-Enable-memory-hot-remove/20200111-003854
base: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 4a3033ef6e6bb4c566bd1d556de69b494d76976c
config: arm64-randconfig-a001-20200109 (attached as .config)
compiler: aarch64-linux-gcc (GCC) 7.5.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=7.5.0 make.cross ARCH=arm64

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <[email protected]>

All errors (new ones prefixed by >>):

mm/memory_hotplug.c: In function 'check_hotremove_memory_range':
>> mm/memory_hotplug.c:1027:7: error: implicit declaration of function 'arch_memory_removable'; did you mean 'add_memory_resource'? [-Werror=implicit-function-declaration]
rc = arch_memory_removable(start, size);
^~~~~~~~~~~~~~~~~~~~~
add_memory_resource
At top level:
mm/memory_hotplug.c:1017:12: warning: 'check_hotremove_memory_range' defined but not used [-Wunused-function]
static int check_hotremove_memory_range(u64 start, u64 size)
^~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors

vim +1027 mm/memory_hotplug.c

1016
1017 static int check_hotremove_memory_range(u64 start, u64 size)
1018 {
1019 int rc;
1020
1021 BUG_ON(check_hotplug_memory_range(start, size));
1022
1023 /*
1024 * First check if the platform is willing to have this
1025 * memory range removed else just abort.
1026 */
> 1027 rc = arch_memory_removable(start, size);
1028 if (!rc)
1029 return -EINVAL;
1030
1031 return 0;
1032 }
1033

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation


Attachments:
(No filename) (2.46 kB)
.config.gz (31.53 kB)

2020-01-11 19:53:08

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH V11 1/5] mm/hotplug: Introduce arch callback validating the hot remove range

Hi Anshuman,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v5.5-rc5 next-20200109]
[cannot apply to arm64/for-next/core robh/for-next linux/master]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url: https://github.com/0day-ci/linux/commits/Anshuman-Khandual/arm64-mm-Enable-memory-hot-remove/20200111-003854
base: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 4a3033ef6e6bb4c566bd1d556de69b494d76976c
config: x86_64-randconfig-s1-20200111 (attached as .config)
compiler: gcc-4.9 (Debian 4.9.2-10+deb8u1) 4.9.2
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <[email protected]>

All errors (new ones prefixed by >>):

mm/memory_hotplug.c: In function 'check_hotremove_memory_range':
>> mm/memory_hotplug.c:1027:2: error: implicit declaration of function 'arch_memory_removable' [-Werror=implicit-function-declaration]
rc = arch_memory_removable(start, size);
^
mm/memory_hotplug.c: At top level:
mm/memory_hotplug.c:1017:12: warning: 'check_hotremove_memory_range' defined but not used [-Wunused-function]
static int check_hotremove_memory_range(u64 start, u64 size)
^
Cyclomatic Complexity 5 include/linux/compiler.h:__read_once_size
Cyclomatic Complexity 5 include/linux/compiler.h:__write_once_size
Cyclomatic Complexity 2 arch/x86/include/asm/bitops.h:arch_set_bit
Cyclomatic Complexity 2 arch/x86/include/asm/bitops.h:arch_clear_bit
Cyclomatic Complexity 1 arch/x86/include/asm/bitops.h:fls64
Cyclomatic Complexity 1 include/linux/log2.h:__ilog2_u64
Cyclomatic Complexity 1 include/linux/list.h:INIT_LIST_HEAD
Cyclomatic Complexity 1 include/linux/percpu-defs.h:__this_cpu_preempt_check
Cyclomatic Complexity 1 arch/x86/include/asm/atomic.h:arch_atomic_read
Cyclomatic Complexity 1 arch/x86/include/asm/atomic.h:arch_atomic_set
Cyclomatic Complexity 1 arch/x86/include/asm/atomic.h:arch_atomic_inc
Cyclomatic Complexity 1 arch/x86/include/asm/atomic.h:arch_atomic_add_return
Cyclomatic Complexity 1 arch/x86/include/asm/atomic.h:arch_atomic_sub_return
Cyclomatic Complexity 1 arch/x86/include/asm/atomic64_64.h:arch_atomic64_add
Cyclomatic Complexity 2 include/linux/jump_label.h:static_key_false
Cyclomatic Complexity 3 include/linux/string.h:memset
Cyclomatic Complexity 1 include/linux/err.h:ERR_PTR
Cyclomatic Complexity 1 include/linux/err.h:PTR_ERR
Cyclomatic Complexity 5 arch/x86/include/asm/preempt.h:__preempt_count_add
Cyclomatic Complexity 5 arch/x86/include/asm/preempt.h:__preempt_count_sub
Cyclomatic Complexity 1 include/linux/spinlock.h:spinlock_check
Cyclomatic Complexity 1 include/linux/spinlock.h:spin_lock
Cyclomatic Complexity 1 include/linux/spinlock.h:spin_unlock
Cyclomatic Complexity 1 include/linux/spinlock.h:spin_unlock_irqrestore
Cyclomatic Complexity 1 include/linux/seqlock.h:raw_write_seqcount_begin
Cyclomatic Complexity 1 include/linux/seqlock.h:raw_write_seqcount_end
Cyclomatic Complexity 1 include/linux/nodemask.h:node_state
Cyclomatic Complexity 1 include/linux/nodemask.h:node_set_state
Cyclomatic Complexity 2 include/linux/notifier.h:notifier_to_errno
Cyclomatic Complexity 1 include/linux/page-flags.h:ClearPageReserved
Cyclomatic Complexity 1 include/linux/page-flags.h:SetPagePrivate
Cyclomatic Complexity 1 include/linux/page-flags.h:ClearPagePrivate
Cyclomatic Complexity 1 include/linux/mmzone.h:zone_end_pfn
Cyclomatic Complexity 1 include/linux/mmzone.h:zone_is_empty
Cyclomatic Complexity 4 include/linux/mmzone.h:zone_intersects
Cyclomatic Complexity 1 include/linux/mmzone.h:pgdat_end_pfn
Cyclomatic Complexity 1 include/linux/memory_hotplug.h:generic_free_nodedata
Cyclomatic Complexity 1 include/linux/memory_hotplug.h:arch_refresh_nodedata
Cyclomatic Complexity 1 include/linux/memory_hotplug.h:pgdat_resize_unlock
Cyclomatic Complexity 1 include/linux/mmzone.h:populated_zone
Cyclomatic Complexity 1 include/linux/mmzone.h:zone_to_nid
Cyclomatic Complexity 1 include/linux/mmzone.h:pfn_to_section_nr
Cyclomatic Complexity 3 include/linux/mmzone.h:__nr_to_section
Cyclomatic Complexity 1 include/linux/mmzone.h:__section_mem_map_addr
Cyclomatic Complexity 3 include/linux/mmzone.h:valid_section
Cyclomatic Complexity 3 include/linux/mmzone.h:online_section
Cyclomatic Complexity 1 include/linux/mmzone.h:online_section_nr
Cyclomatic Complexity 1 include/linux/mmzone.h:__pfn_to_section
Cyclomatic Complexity 1 include/linux/ioport.h:resource_size
Cyclomatic Complexity 1 include/linux/memremap.h:vmem_altmap_offset
Cyclomatic Complexity 1 include/linux/mm.h:page_zonenum
Cyclomatic Complexity 1 include/linux/mm.h:page_zone
Cyclomatic Complexity 1 include/linux/node.h:link_mem_sections
Cyclomatic Complexity 1 include/linux/node.h:__register_one_node
Cyclomatic Complexity 1 include/linux/node.h:register_one_node
Cyclomatic Complexity 1 include/linux/cpu.h:cpus_read_lock
Cyclomatic Complexity 1 include/linux/cpu.h:cpus_read_unlock
Cyclomatic Complexity 1 include/linux/compaction.h:kcompactd_run
Cyclomatic Complexity 6 mm/memory_hotplug.c:update_pgdat_span
Cyclomatic Complexity 4 mm/memory_hotplug.c:node_states_check_changes_online
Cyclomatic Complexity 4 mm/memory_hotplug.c:node_states_set_node
Cyclomatic Complexity 3 mm/memory_hotplug.c:resize_zone_range
Cyclomatic Complexity 3 mm/memory_hotplug.c:resize_pgdat_range
Cyclomatic Complexity 3 mm/memory_hotplug.c:default_kernel_zone_for_pfn
Cyclomatic Complexity 4 mm/memory_hotplug.c:default_zone_for_pfn
Cyclomatic Complexity 2 mm/memory_hotplug.c:reset_node_present_pages
Cyclomatic Complexity 1 include/linux/seqlock.h:write_seqcount_begin_nested
Cyclomatic Complexity 1 include/linux/seqlock.h:write_seqcount_begin
Cyclomatic Complexity 1 include/linux/seqlock.h:write_seqlock
Cyclomatic Complexity 1 include/linux/memory_hotplug.h:zone_span_writelock
Cyclomatic Complexity 8 mm/memory_hotplug.c:find_smallest_section_pfn
Cyclomatic Complexity 8 mm/memory_hotplug.c:find_biggest_section_pfn
Cyclomatic Complexity 1 include/linux/err.h:IS_ERR
Cyclomatic Complexity 69 include/asm-generic/getorder.h:get_order
Cyclomatic Complexity 4 include/linux/rcu_sync.h:rcu_sync_is_idle
Cyclomatic Complexity 2 include/linux/percpu-rwsem.h:percpu_down_read
Cyclomatic Complexity 2 include/linux/percpu-rwsem.h:percpu_up_read
Cyclomatic Complexity 1 include/linux/seqlock.h:write_seqcount_end
Cyclomatic Complexity 1 include/linux/seqlock.h:write_sequnlock
Cyclomatic Complexity 1 include/linux/memory_hotplug.h:zone_span_writeunlock
Cyclomatic Complexity 11 mm/memory_hotplug.c:shrink_zone_span
Cyclomatic Complexity 3 mm/memory_hotplug.c:setup_memhp_default_state
Cyclomatic Complexity 1 include/asm-generic/bitops/instrumented-atomic.h:set_bit
Cyclomatic Complexity 1 include/asm-generic/atomic-instrumented.h:atomic_inc
Cyclomatic Complexity 1 include/asm-generic/atomic-instrumented.h:atomic_sub_return
Cyclomatic Complexity 1 include/linux/atomic-fallback.h:atomic_dec_return
Cyclomatic Complexity 1 include/asm-generic/bitops/instrumented-atomic.h:clear_bit
Cyclomatic Complexity 1 include/asm-generic/atomic-instrumented.h:atomic_set
Cyclomatic Complexity 1 include/asm-generic/atomic-instrumented.h:atomic64_add
Cyclomatic Complexity 1 include/asm-generic/atomic-long.h:atomic_long_add
Cyclomatic Complexity 1 include/linux/mm.h:totalram_pages_add
Cyclomatic Complexity 2 include/linux/page_ref.h:page_ref_inc
Cyclomatic Complexity 1 include/asm-generic/atomic-instrumented.h:atomic_read
Cyclomatic Complexity 1 include/linux/jump_label.h:static_key_count

vim +/arch_memory_removable +1027 mm/memory_hotplug.c

1016
1017 static int check_hotremove_memory_range(u64 start, u64 size)
1018 {
1019 int rc;
1020
1021 BUG_ON(check_hotplug_memory_range(start, size));
1022
1023 /*
1024 * First check if the platform is willing to have this
1025 * memory range removed else just abort.
1026 */
> 1027 rc = arch_memory_removable(start, size);
1028 if (!rc)
1029 return -EINVAL;
1030
1031 return 0;
1032 }
1033

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation


Attachments:
(No filename) (8.71 kB)
.config.gz (30.49 kB)

2020-01-13 04:07:40

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH V11 1/5] mm/hotplug: Introduce arch callback validating the hot remove range



On 01/11/2020 07:41 PM, kbuild test robot wrote:
> mm/memory_hotplug.c: In function 'check_hotremove_memory_range':
>>> mm/memory_hotplug.c:1027:7: error: implicit declaration of function 'arch_memory_removable'; did you mean 'add_memory_resource'? [-Werror=implicit-function-declaration]
> rc = arch_memory_removable(start, size);
> ^~~~~~~~~~~~~~~~~~~~~
> add_memory_resource
> At top level:
> mm/memory_hotplug.c:1017:12: warning: 'check_hotremove_memory_range' defined but not used [-Wunused-function]
> static int check_hotremove_memory_range(u64 start, u64 size)
> ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
> cc1: some warnings being treated as errors
>
> vim +1027 mm/memory_hotplug.c
>
> 1016
> 1017 static int check_hotremove_memory_range(u64 start, u64 size)
> 1018 {
> 1019 int rc;
> 1020
> 1021 BUG_ON(check_hotplug_memory_range(start, size));
> 1022
> 1023 /*
> 1024 * First check if the platform is willing to have this
> 1025 * memory range removed else just abort.
> 1026 */
>> 1027 rc = arch_memory_removable(start, size);
> 1028 if (!rc)
> 1029 return -EINVAL;
> 1030
> 1031 return 0;
> 1032 }
> 1033


Both of the build failures reported here can be solved by moving
check_hotremove_memory_range() inside CONFIG_MEMORY_HOTREMOVE wrappers;
will fix it.
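
A minimal sketch of that fix (only the wrapping changes; the body stays as
posted):

	#ifdef CONFIG_MEMORY_HOTREMOVE
	static int check_hotremove_memory_range(u64 start, u64 size)
	{
		int rc;

		BUG_ON(check_hotplug_memory_range(start, size));

		/* Let the platform veto removal of this range up front */
		rc = arch_memory_removable(start, size);
		if (!rc)
			return -EINVAL;

		return 0;
	}
	#endif /* CONFIG_MEMORY_HOTREMOVE */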

- Anshuman

2020-01-13 07:39:40

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH V11 2/5] mm/memblock: Introduce MEMBLOCK_BOOT flag

On Fri, Jan 10, 2020 at 08:39:12AM +0530, Anshuman Khandual wrote:
> On arm64 platform boot memory should never be hot removed due to certain
> platform specific constraints. Hence the platform would like to override
> earlier added arch call back arch_memory_removable() for this purpose. In
> order to reject boot memory hot removal request, it needs to first track
> them at runtime. In the future, there might be other platforms requiring
> runtime boot memory enumeration. Hence lets expand the existing generic
> memblock framework for this purpose rather then creating one just for
> arm64 platforms.
>
> This introduces a new memblock flag MEMBLOCK_BOOT along with helpers which
> can be marked by given platform on all memory regions discovered during
> boot.

We already have MEMBLOCK_HOTPLUG to mark hotpluggable region. Can't we use
it for your use-case?

> Cc: Mike Rapoport <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Signed-off-by: Anshuman Khandual <[email protected]>
> ---
> include/linux/memblock.h | 10 ++++++++++
> mm/memblock.c | 37 +++++++++++++++++++++++++++++++++++++
> 2 files changed, 47 insertions(+)
>
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index b38bbef..fb04c87 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -31,12 +31,14 @@ extern unsigned long long max_possible_pfn;
> * @MEMBLOCK_HOTPLUG: hotpluggable region
> * @MEMBLOCK_MIRROR: mirrored region
> * @MEMBLOCK_NOMAP: don't add to kernel direct mapping
> + * @MEMBLOCK_BOOT: memory received from firmware during boot
> */
> enum memblock_flags {
> MEMBLOCK_NONE = 0x0, /* No special request */
> MEMBLOCK_HOTPLUG = 0x1, /* hotpluggable region */
> MEMBLOCK_MIRROR = 0x2, /* mirrored region */
> MEMBLOCK_NOMAP = 0x4, /* don't add to kernel direct mapping */
> + MEMBLOCK_BOOT = 0x8, /* memory received from firmware during boot */
> };
>
> /**
> @@ -116,6 +118,8 @@ int memblock_reserve(phys_addr_t base, phys_addr_t size);
> void memblock_trim_memory(phys_addr_t align);
> bool memblock_overlaps_region(struct memblock_type *type,
> phys_addr_t base, phys_addr_t size);
> +int memblock_mark_boot(phys_addr_t base, phys_addr_t size);
> +int memblock_clear_boot(phys_addr_t base, phys_addr_t size);
> int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
> int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
> int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
> @@ -216,6 +220,11 @@ static inline bool memblock_is_nomap(struct memblock_region *m)
> return m->flags & MEMBLOCK_NOMAP;
> }
>
> +static inline bool memblock_is_boot(struct memblock_region *m)
> +{
> + return m->flags & MEMBLOCK_BOOT;
> +}
> +
> #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
> unsigned long *end_pfn);
> @@ -449,6 +458,7 @@ void memblock_cap_memory_range(phys_addr_t base, phys_addr_t size);
> void memblock_mem_limit_remove_map(phys_addr_t limit);
> bool memblock_is_memory(phys_addr_t addr);
> bool memblock_is_map_memory(phys_addr_t addr);
> +bool memblock_is_boot_memory(phys_addr_t addr);
> bool memblock_is_region_memory(phys_addr_t base, phys_addr_t size);
> bool memblock_is_reserved(phys_addr_t addr);
> bool memblock_is_region_reserved(phys_addr_t base, phys_addr_t size);
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 4bc2c7d..e10207f 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -865,6 +865,30 @@ static int __init_memblock memblock_setclr_flag(phys_addr_t base,
> }
>
> /**
> + * memblock_mark_bootmem - Mark boot memory with flag MEMBLOCK_BOOT.
> + * @base: the base phys addr of the region
> + * @size: the size of the region
> + *
> + * Return: 0 on success, -errno on failure.
> + */
> +int __init_memblock memblock_mark_boot(phys_addr_t base, phys_addr_t size)
> +{
> + return memblock_setclr_flag(base, size, 1, MEMBLOCK_BOOT);
> +}
> +
> +/**
> + * memblock_clear_bootmem - Clear flag MEMBLOCK_BOOT for a specified region.
> + * @base: the base phys addr of the region
> + * @size: the size of the region
> + *
> + * Return: 0 on success, -errno on failure.
> + */
> +int __init_memblock memblock_clear_boot(phys_addr_t base, phys_addr_t size)
> +{
> + return memblock_setclr_flag(base, size, 0, MEMBLOCK_BOOT);
> +}
> +
> +/**
> * memblock_mark_hotplug - Mark hotpluggable memory with flag MEMBLOCK_HOTPLUG.
> * @base: the base phys addr of the region
> * @size: the size of the region
> @@ -974,6 +998,10 @@ static bool should_skip_region(struct memblock_region *m, int nid, int flags)
> if ((flags & MEMBLOCK_MIRROR) && !memblock_is_mirror(m))
> return true;
>
> + /* if we want boot memory skip non-boot memory regions */
> + if ((flags & MEMBLOCK_BOOT) && !memblock_is_boot(m))
> + return true;
> +
> /* skip nomap memory unless we were asked for it explicitly */
> if (!(flags & MEMBLOCK_NOMAP) && memblock_is_nomap(m))
> return true;
> @@ -1785,6 +1813,15 @@ bool __init_memblock memblock_is_map_memory(phys_addr_t addr)
> return !memblock_is_nomap(&memblock.memory.regions[i]);
> }
>
> +bool __init_memblock memblock_is_boot_memory(phys_addr_t addr)
> +{
> + int i = memblock_search(&memblock.memory, addr);
> +
> + if (i == -1)
> + return false;
> + return memblock_is_boot(&memblock.memory.regions[i]);
> +}
> +
> #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> int __init_memblock memblock_search_pfn_nid(unsigned long pfn,
> unsigned long *start_pfn, unsigned long *end_pfn)
> --
> 2.7.4
>

--
Sincerely yours,
Mike.

2020-01-13 08:43:58

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH V11 2/5] mm/memblock: Introduce MEMBLOCK_BOOT flag



On 01/13/2020 01:07 PM, Mike Rapoport wrote:
> On Fri, Jan 10, 2020 at 08:39:12AM +0530, Anshuman Khandual wrote:
>> On arm64 platform boot memory should never be hot removed due to certain
>> platform specific constraints. Hence the platform would like to override
>> earlier added arch call back arch_memory_removable() for this purpose. In
>> order to reject boot memory hot removal request, it needs to first track
>> them at runtime. In the future, there might be other platforms requiring
>> runtime boot memory enumeration. Hence lets expand the existing generic
>> memblock framework for this purpose rather then creating one just for
>> arm64 platforms.
>>
>> This introduces a new memblock flag MEMBLOCK_BOOT along with helpers which
>> can be marked by given platform on all memory regions discovered during
>> boot.
>
> We already have MEMBLOCK_HOTPLUG to mark hotpluggable region. Can't we use
> it for your use-case?

At present the MEMBLOCK_HOTPLUG flag helps identify the parts of boot memory
that the firmware has indicated as hotpluggable. This information is then used
to avoid those regions during standard memblock_alloc_*() API requests and
later to mark them as ZONE_MOVABLE when the buddy allocator gets initialized.

Memory hot remove does not check for the MEMBLOCK_HOTPLUG flag before
initiating the process. We could probably use this flag if generic hot remove
were changed to check for MEMBLOCK_HOTPLUG as a prerequisite, but that would
require changes to memblock handling (boot and runtime) on all existing
platforms that currently support hot remove. And what about the movable boot
memory created with the movablecore/kernelcore command line options; should
generic MM update those memblock regions with MEMBLOCK_HOTPLUG as well?
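
To make the intent concrete, the rough producer/consumer flow for the new flag
would be something along the lines of the sketch below. This is purely
illustrative: memblock_mark_boot() and memblock_is_boot_memory() are the
helpers added by this patch, while early_mark_boot_memory() and the exact
arch_memory_removable() signature are assumptions about how a platform such as
arm64 might wire them up.

static void __init early_mark_boot_memory(void)
{
	struct memblock_region *region;

	/* Tag every region known to memblock at boot as boot memory. */
	for_each_memblock(memory, region)
		memblock_mark_boot(region->base, region->size);
}

bool arch_memory_removable(u64 start, u64 size)
{
	u64 addr;

	/*
	 * Walk the range page by page (simple, not optimal) and refuse
	 * hot removal if any part of it overlaps boot memory.
	 */
	for (addr = start; addr < start + size; addr += PAGE_SIZE)
		if (memblock_is_boot_memory(addr))
			return false;

	return true;
}

In the actual series the marking would happen while the FDT memory nodes are
processed; the helper above just compresses that idea into one place.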

>
>> Cc: Mike Rapoport <[email protected]>
>> Cc: Andrew Morton <[email protected]>
>> Signed-off-by: Anshuman Khandual <[email protected]>
>> ---
>> include/linux/memblock.h | 10 ++++++++++
>> mm/memblock.c | 37 +++++++++++++++++++++++++++++++++++++
>> 2 files changed, 47 insertions(+)
>>
>> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
>> index b38bbef..fb04c87 100644
>> --- a/include/linux/memblock.h
>> +++ b/include/linux/memblock.h
>> @@ -31,12 +31,14 @@ extern unsigned long long max_possible_pfn;
>> * @MEMBLOCK_HOTPLUG: hotpluggable region
>> * @MEMBLOCK_MIRROR: mirrored region
>> * @MEMBLOCK_NOMAP: don't add to kernel direct mapping
>> + * @MEMBLOCK_BOOT: memory received from firmware during boot
>> */
>> enum memblock_flags {
>> MEMBLOCK_NONE = 0x0, /* No special request */
>> MEMBLOCK_HOTPLUG = 0x1, /* hotpluggable region */
>> MEMBLOCK_MIRROR = 0x2, /* mirrored region */
>> MEMBLOCK_NOMAP = 0x4, /* don't add to kernel direct mapping */
>> + MEMBLOCK_BOOT = 0x8, /* memory received from firmware during boot */
>> };
>>
>> /**
>> @@ -116,6 +118,8 @@ int memblock_reserve(phys_addr_t base, phys_addr_t size);
>> void memblock_trim_memory(phys_addr_t align);
>> bool memblock_overlaps_region(struct memblock_type *type,
>> phys_addr_t base, phys_addr_t size);
>> +int memblock_mark_boot(phys_addr_t base, phys_addr_t size);
>> +int memblock_clear_boot(phys_addr_t base, phys_addr_t size);
>> int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
>> int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
>> int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
>> @@ -216,6 +220,11 @@ static inline bool memblock_is_nomap(struct memblock_region *m)
>> return m->flags & MEMBLOCK_NOMAP;
>> }
>>
>> +static inline bool memblock_is_boot(struct memblock_region *m)
>> +{
>> + return m->flags & MEMBLOCK_BOOT;
>> +}
>> +
>> #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>> int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
>> unsigned long *end_pfn);
>> @@ -449,6 +458,7 @@ void memblock_cap_memory_range(phys_addr_t base, phys_addr_t size);
>> void memblock_mem_limit_remove_map(phys_addr_t limit);
>> bool memblock_is_memory(phys_addr_t addr);
>> bool memblock_is_map_memory(phys_addr_t addr);
>> +bool memblock_is_boot_memory(phys_addr_t addr);
>> bool memblock_is_region_memory(phys_addr_t base, phys_addr_t size);
>> bool memblock_is_reserved(phys_addr_t addr);
>> bool memblock_is_region_reserved(phys_addr_t base, phys_addr_t size);
>> diff --git a/mm/memblock.c b/mm/memblock.c
>> index 4bc2c7d..e10207f 100644
>> --- a/mm/memblock.c
>> +++ b/mm/memblock.c
>> @@ -865,6 +865,30 @@ static int __init_memblock memblock_setclr_flag(phys_addr_t base,
>> }
>>
>> /**
>> + * memblock_mark_bootmem - Mark boot memory with flag MEMBLOCK_BOOT.
>> + * @base: the base phys addr of the region
>> + * @size: the size of the region
>> + *
>> + * Return: 0 on success, -errno on failure.
>> + */
>> +int __init_memblock memblock_mark_boot(phys_addr_t base, phys_addr_t size)
>> +{
>> + return memblock_setclr_flag(base, size, 1, MEMBLOCK_BOOT);
>> +}
>> +
>> +/**
>> + * memblock_clear_bootmem - Clear flag MEMBLOCK_BOOT for a specified region.
>> + * @base: the base phys addr of the region
>> + * @size: the size of the region
>> + *
>> + * Return: 0 on success, -errno on failure.
>> + */
>> +int __init_memblock memblock_clear_boot(phys_addr_t base, phys_addr_t size)
>> +{
>> + return memblock_setclr_flag(base, size, 0, MEMBLOCK_BOOT);
>> +}
>> +
>> +/**
>> * memblock_mark_hotplug - Mark hotpluggable memory with flag MEMBLOCK_HOTPLUG.
>> * @base: the base phys addr of the region
>> * @size: the size of the region
>> @@ -974,6 +998,10 @@ static bool should_skip_region(struct memblock_region *m, int nid, int flags)
>> if ((flags & MEMBLOCK_MIRROR) && !memblock_is_mirror(m))
>> return true;
>>
>> + /* if we want boot memory skip non-boot memory regions */
>> + if ((flags & MEMBLOCK_BOOT) && !memblock_is_boot(m))
>> + return true;
>> +
>> /* skip nomap memory unless we were asked for it explicitly */
>> if (!(flags & MEMBLOCK_NOMAP) && memblock_is_nomap(m))
>> return true;
>> @@ -1785,6 +1813,15 @@ bool __init_memblock memblock_is_map_memory(phys_addr_t addr)
>> return !memblock_is_nomap(&memblock.memory.regions[i]);
>> }
>>
>> +bool __init_memblock memblock_is_boot_memory(phys_addr_t addr)
>> +{
>> + int i = memblock_search(&memblock.memory, addr);
>> +
>> + if (i == -1)
>> + return false;
>> + return memblock_is_boot(&memblock.memory.regions[i]);
>> +}
>> +
>> #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>> int __init_memblock memblock_search_pfn_nid(unsigned long pfn,
>> unsigned long *start_pfn, unsigned long *end_pfn)
>> --
>> 2.7.4
>>
>

2020-01-13 08:58:35

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH V11 2/5] mm/memblock: Introduce MEMBLOCK_BOOT flag



> Am 13.01.2020 um 09:41 schrieb Anshuman Khandual <[email protected]>:
>
> 
>
>> On 01/13/2020 01:07 PM, Mike Rapoport wrote:
>>> On Fri, Jan 10, 2020 at 08:39:12AM +0530, Anshuman Khandual wrote:
>>> On arm64 platform boot memory should never be hot removed due to certain
>>> platform specific constraints. Hence the platform would like to override
>>> earlier added arch call back arch_memory_removable() for this purpose. In
>>> order to reject boot memory hot removal request, it needs to first track
>>> them at runtime. In the future, there might be other platforms requiring
>>> runtime boot memory enumeration. Hence lets expand the existing generic
>>> memblock framework for this purpose rather then creating one just for
>>> arm64 platforms.
>>>
>>> This introduces a new memblock flag MEMBLOCK_BOOT along with helpers which
>>> can be marked by given platform on all memory regions discovered during
>>> boot.
>>
>> We already have MEMBLOCK_HOTPLUG to mark hotpluggable region. Can't we use
>> it for your use-case?
>
> At present MEMBLOCK_HOTPLUG flag helps in identifying parts of boot memory
> as hotpluggable as indicated by the firmware. This information is then used
> to avoid those regions during standard memblock_alloc_*() API requests and
> later marking them as ZONE_MOVABLE when buddy gets initialized.
>
> Memory hot remove does not check for MEMBLOCK_HOTPLUG flag as a requirement
> before initiating the process. We could probably use this flag if generic
> hot remove can be changed to check for MEMBLOCK_HOTPLUG as a prerequisite
> which will require changes to memblock handling (boot and runtime) on all
> existing platforms currently supporting hot remove. But what about handling
> the movable boot memory created with movablecore/kernelcore command line,
> should generic MM update their memblock regions with MEMBLOCK_HOTPLUG ?

As I said in my other mail, just disallow offlining of the affected (boot) memory blocks using a memory notifier and you should be good to go. No changes in memory unplug code required.

>
>>
>>> Cc: Mike Rapoport <[email protected]>
>>> Cc: Andrew Morton <[email protected]>
>>> Signed-off-by: Anshuman Khandual <[email protected]>
>>> ---
>>> include/linux/memblock.h | 10 ++++++++++
>>> mm/memblock.c | 37 +++++++++++++++++++++++++++++++++++++
>>> 2 files changed, 47 insertions(+)
>>>
>>> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
>>> index b38bbef..fb04c87 100644
>>> --- a/include/linux/memblock.h
>>> +++ b/include/linux/memblock.h
>>> @@ -31,12 +31,14 @@ extern unsigned long long max_possible_pfn;
>>> * @MEMBLOCK_HOTPLUG: hotpluggable region
>>> * @MEMBLOCK_MIRROR: mirrored region
>>> * @MEMBLOCK_NOMAP: don't add to kernel direct mapping
>>> + * @MEMBLOCK_BOOT: memory received from firmware during boot
>>> */
>>> enum memblock_flags {
>>> MEMBLOCK_NONE = 0x0, /* No special request */
>>> MEMBLOCK_HOTPLUG = 0x1, /* hotpluggable region */
>>> MEMBLOCK_MIRROR = 0x2, /* mirrored region */
>>> MEMBLOCK_NOMAP = 0x4, /* don't add to kernel direct mapping */
>>> + MEMBLOCK_BOOT = 0x8, /* memory received from firmware during boot */
>>> };
>>>
>>> /**
>>> @@ -116,6 +118,8 @@ int memblock_reserve(phys_addr_t base, phys_addr_t size);
>>> void memblock_trim_memory(phys_addr_t align);
>>> bool memblock_overlaps_region(struct memblock_type *type,
>>> phys_addr_t base, phys_addr_t size);
>>> +int memblock_mark_boot(phys_addr_t base, phys_addr_t size);
>>> +int memblock_clear_boot(phys_addr_t base, phys_addr_t size);
>>> int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
>>> int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
>>> int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
>>> @@ -216,6 +220,11 @@ static inline bool memblock_is_nomap(struct memblock_region *m)
>>> return m->flags & MEMBLOCK_NOMAP;
>>> }
>>>
>>> +static inline bool memblock_is_boot(struct memblock_region *m)
>>> +{
>>> + return m->flags & MEMBLOCK_BOOT;
>>> +}
>>> +
>>> #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>>> int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
>>> unsigned long *end_pfn);
>>> @@ -449,6 +458,7 @@ void memblock_cap_memory_range(phys_addr_t base, phys_addr_t size);
>>> void memblock_mem_limit_remove_map(phys_addr_t limit);
>>> bool memblock_is_memory(phys_addr_t addr);
>>> bool memblock_is_map_memory(phys_addr_t addr);
>>> +bool memblock_is_boot_memory(phys_addr_t addr);
>>> bool memblock_is_region_memory(phys_addr_t base, phys_addr_t size);
>>> bool memblock_is_reserved(phys_addr_t addr);
>>> bool memblock_is_region_reserved(phys_addr_t base, phys_addr_t size);
>>> diff --git a/mm/memblock.c b/mm/memblock.c
>>> index 4bc2c7d..e10207f 100644
>>> --- a/mm/memblock.c
>>> +++ b/mm/memblock.c
>>> @@ -865,6 +865,30 @@ static int __init_memblock memblock_setclr_flag(phys_addr_t base,
>>> }
>>>
>>> /**
>>> + * memblock_mark_bootmem - Mark boot memory with flag MEMBLOCK_BOOT.
>>> + * @base: the base phys addr of the region
>>> + * @size: the size of the region
>>> + *
>>> + * Return: 0 on success, -errno on failure.
>>> + */
>>> +int __init_memblock memblock_mark_boot(phys_addr_t base, phys_addr_t size)
>>> +{
>>> + return memblock_setclr_flag(base, size, 1, MEMBLOCK_BOOT);
>>> +}
>>> +
>>> +/**
>>> + * memblock_clear_bootmem - Clear flag MEMBLOCK_BOOT for a specified region.
>>> + * @base: the base phys addr of the region
>>> + * @size: the size of the region
>>> + *
>>> + * Return: 0 on success, -errno on failure.
>>> + */
>>> +int __init_memblock memblock_clear_boot(phys_addr_t base, phys_addr_t size)
>>> +{
>>> + return memblock_setclr_flag(base, size, 0, MEMBLOCK_BOOT);
>>> +}
>>> +
>>> +/**
>>> * memblock_mark_hotplug - Mark hotpluggable memory with flag MEMBLOCK_HOTPLUG.
>>> * @base: the base phys addr of the region
>>> * @size: the size of the region
>>> @@ -974,6 +998,10 @@ static bool should_skip_region(struct memblock_region *m, int nid, int flags)
>>> if ((flags & MEMBLOCK_MIRROR) && !memblock_is_mirror(m))
>>> return true;
>>>
>>> + /* if we want boot memory skip non-boot memory regions */
>>> + if ((flags & MEMBLOCK_BOOT) && !memblock_is_boot(m))
>>> + return true;
>>> +
>>> /* skip nomap memory unless we were asked for it explicitly */
>>> if (!(flags & MEMBLOCK_NOMAP) && memblock_is_nomap(m))
>>> return true;
>>> @@ -1785,6 +1813,15 @@ bool __init_memblock memblock_is_map_memory(phys_addr_t addr)
>>> return !memblock_is_nomap(&memblock.memory.regions[i]);
>>> }
>>>
>>> +bool __init_memblock memblock_is_boot_memory(phys_addr_t addr)
>>> +{
>>> + int i = memblock_search(&memblock.memory, addr);
>>> +
>>> + if (i == -1)
>>> + return false;
>>> + return memblock_is_boot(&memblock.memory.regions[i]);
>>> +}
>>> +
>>> #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>>> int __init_memblock memblock_search_pfn_nid(unsigned long pfn,
>>> unsigned long *start_pfn, unsigned long *end_pfn)
>>> --
>>> 2.7.4
>>>
>>
>

2020-01-13 09:11:18

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH V11 1/5] mm/hotplug: Introduce arch callback validating the hot remove range



On 01/10/2020 02:12 PM, David Hildenbrand wrote:
> On 10.01.20 04:09, Anshuman Khandual wrote:
>> Currently there are two interfaces to initiate memory range hot removal i.e
>> remove_memory() and __remove_memory() which then calls try_remove_memory().
>> Platform gets called with arch_remove_memory() to tear down required kernel
>> page tables and other arch specific procedures. But there are platforms
>> like arm64 which might want to prevent removal of certain specific memory
>> ranges irrespective of their present usage or movability properties.
>
> Why? Is this only relevant for boot memory? I hope so, otherwise the
> arch code needs fixing IMHO.

Right, it is relevant only for boot memory on the arm64 platform. But this
new arch callback makes it flexible enough to reject any given memory range.

>
> If it's only boot memory, we should disallow offlining instead via a
> memory notifier - much cleaner.

Don't have much detailed understanding of the MMU notifier mechanism, but from
some initial reading it seems like we need a mm_struct for a notifier to
monitor various events on the page table. Just wondering how a physical memory
range like boot memory can be monitored, because it can be used both for the
kernel (init_mm) and for user space processes at the same time. Is there some
mechanism by which we could do this?

>
>>
>> Current arch call back arch_remove_memory() is too late in the process to
>> abort memory hot removal as memory block devices and firmware memory map
>> entries would have already been removed. Platforms should be able to abort
>> the process before taking the mem_hotplug_lock with mem_hotplug_begin().
>> This essentially requires a new arch callback for memory range validation.
>
> I somewhat dislike this very much. Memory removal should never fail if
> used sanely. See e.g., __remove_memory(), it will BUG() whenever
> something like that would strike.
>
>>
>> This differentiates memory range validation between memory hot add and hot
>> remove paths before carving out a new helper check_hotremove_memory_range()
>> which incorporates a new arch callback. This call back provides platforms
>> an opportunity to refuse memory removal at the very onset. In future the
>> same principle can be extended for memory hot add path if required.
>>
>> Platforms can choose to override this callback in order to reject specific
>> memory ranges from removal or can just fallback to a default implementation
>> which allows removal of all memory ranges.
>
> I suspect we want really want to disallow offlining instead. E.g., I

If boot memory pages can reliably be prevented from being offlined, then that
would indirectly prevent the hot remove process as well.

> remember s390x does that with certain areas needed for dumping/kexec.

Could not find any references to mmu_notifier in arch/s390, or in any other
arch for that matter apart from KVM (which has a user space component). Could
you please give some pointers?

>
> Somebody who added memory via add_memory() should always be able to
> remove the memory via remove_memory() again. Only boot memory can be
> treated in a special way, but boot memory is initially always online.
>

2020-01-13 09:15:46

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH V11 1/5] mm/hotplug: Introduce arch callback validating the hot remove range



> Am 13.01.2020 um 10:10 schrieb Anshuman Khandual <[email protected]>:
>
> 
>
>> On 01/10/2020 02:12 PM, David Hildenbrand wrote:
>>> On 10.01.20 04:09, Anshuman Khandual wrote:
>>> Currently there are two interfaces to initiate memory range hot removal i.e
>>> remove_memory() and __remove_memory() which then calls try_remove_memory().
>>> Platform gets called with arch_remove_memory() to tear down required kernel
>>> page tables and other arch specific procedures. But there are platforms
>>> like arm64 which might want to prevent removal of certain specific memory
>>> ranges irrespective of their present usage or movability properties.
>>
>> Why? Is this only relevant for boot memory? I hope so, otherwise the
>> arch code needs fixing IMHO.
>
> Right, it is relevant only for the boot memory on arm64 platform. But this
> new arch callback makes it flexible to reject any given memory range.
>
>>
>> If it's only boot memory, we should disallow offlining instead via a
>> memory notifier - much cleaner.
>
> Dont have much detail understanding of MMU notifier mechanism but from some
> initial reading, it seems like we need to have a mm_struct for a notifier
> to monitor various events on the page table. Just wondering how a physical
> memory range like boot memory can be monitored because it can be used both
> for for kernel (init_mm) or user space process at same time. Is there some
> mechanism we could do this ?
>
>>
>>>
>>> Current arch call back arch_remove_memory() is too late in the process to
>>> abort memory hot removal as memory block devices and firmware memory map
>>> entries would have already been removed. Platforms should be able to abort
>>> the process before taking the mem_hotplug_lock with mem_hotplug_begin().
>>> This essentially requires a new arch callback for memory range validation.
>>
>> I somewhat dislike this very much. Memory removal should never fail if
>> used sanely. See e.g., __remove_memory(), it will BUG() whenever
>> something like that would strike.
>>
>>>
>>> This differentiates memory range validation between memory hot add and hot
>>> remove paths before carving out a new helper check_hotremove_memory_range()
>>> which incorporates a new arch callback. This call back provides platforms
>>> an opportunity to refuse memory removal at the very onset. In future the
>>> same principle can be extended for memory hot add path if required.
>>>
>>> Platforms can choose to override this callback in order to reject specific
>>> memory ranges from removal or can just fallback to a default implementation
>>> which allows removal of all memory ranges.
>>
>> I suspect we want really want to disallow offlining instead. E.g., I
>
> If boot memory pages can be prevented from being offlined for sure, then it
> would indirectly definitely prevent hot remove process as well.
>
>> remember s390x does that with certain areas needed for dumping/kexec.
>
> Could not find any references to mmu_notifier in arch/s390 or any other arch
> for that matter apart from KVM (which has an user space component), could you
> please give some pointers ?

Memory (hotplug) notifier, not MMU notifier :)

Not on my notebook right now, grep for MEM_GOING_OFFLINE, that should be it.

2020-01-13 09:50:10

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH V11 1/5] mm/hotplug: Introduce arch callback validating the hot remove range



On 01/13/2020 02:44 PM, David Hildenbrand wrote:
>
>
>> Am 13.01.2020 um 10:10 schrieb Anshuman Khandual <[email protected]>:
>>
>> 
>>
>>> On 01/10/2020 02:12 PM, David Hildenbrand wrote:
>>>> On 10.01.20 04:09, Anshuman Khandual wrote:
>>>> Currently there are two interfaces to initiate memory range hot removal i.e
>>>> remove_memory() and __remove_memory() which then calls try_remove_memory().
>>>> Platform gets called with arch_remove_memory() to tear down required kernel
>>>> page tables and other arch specific procedures. But there are platforms
>>>> like arm64 which might want to prevent removal of certain specific memory
>>>> ranges irrespective of their present usage or movability properties.
>>>
>>> Why? Is this only relevant for boot memory? I hope so, otherwise the
>>> arch code needs fixing IMHO.
>>
>> Right, it is relevant only for the boot memory on arm64 platform. But this
>> new arch callback makes it flexible to reject any given memory range.
>>
>>>
>>> If it's only boot memory, we should disallow offlining instead via a
>>> memory notifier - much cleaner.
>>
>> Dont have much detail understanding of MMU notifier mechanism but from some
>> initial reading, it seems like we need to have a mm_struct for a notifier
>> to monitor various events on the page table. Just wondering how a physical
>> memory range like boot memory can be monitored because it can be used both
>> for for kernel (init_mm) or user space process at same time. Is there some
>> mechanism we could do this ?
>>
>>>
>>>>
>>>> Current arch call back arch_remove_memory() is too late in the process to
>>>> abort memory hot removal as memory block devices and firmware memory map
>>>> entries would have already been removed. Platforms should be able to abort
>>>> the process before taking the mem_hotplug_lock with mem_hotplug_begin().
>>>> This essentially requires a new arch callback for memory range validation.
>>>
>>> I somewhat dislike this very much. Memory removal should never fail if
>>> used sanely. See e.g., __remove_memory(), it will BUG() whenever
>>> something like that would strike.
>>>
>>>>
>>>> This differentiates memory range validation between memory hot add and hot
>>>> remove paths before carving out a new helper check_hotremove_memory_range()
>>>> which incorporates a new arch callback. This call back provides platforms
>>>> an opportunity to refuse memory removal at the very onset. In future the
>>>> same principle can be extended for memory hot add path if required.
>>>>
>>>> Platforms can choose to override this callback in order to reject specific
>>>> memory ranges from removal or can just fallback to a default implementation
>>>> which allows removal of all memory ranges.
>>>
>>> I suspect we want really want to disallow offlining instead. E.g., I
>>
>> If boot memory pages can be prevented from being offlined for sure, then it
>> would indirectly definitely prevent hot remove process as well.
>>
>>> remember s390x does that with certain areas needed for dumping/kexec.
>>
>> Could not find any references to mmu_notifier in arch/s390 or any other arch
>> for that matter apart from KVM (which has an user space component), could you
>> please give some pointers ?
>
> Memory (hotplug) notifier, not MMU notifier :)

They are so similarly named :)

>
> Not on my notebook right now, grep for MEM_GOING_OFFLINE, that should be it.
>

Got it, thanks! But we would still need boot memory enumeration via MEMBLOCK_BOOT
to reject the affected offline requests in the callback.
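
Roughly, the callback I have in mind would look like the sketch below
(illustrative only; it assumes the MEMBLOCK_BOOT helpers from patch 2 and the
generic memory hotplug notifier interface, and the names are made up for the
example):

static int boot_mem_offline_notifier(struct notifier_block *nb,
				     unsigned long action, void *data)
{
	struct memory_notify *arg = data;
	unsigned long pfn;

	if (action != MEM_GOING_OFFLINE)
		return NOTIFY_OK;

	/* Probe one pfn per memory section in the range being offlined. */
	for (pfn = arg->start_pfn; pfn < arg->start_pfn + arg->nr_pages;
	     pfn += PAGES_PER_SECTION) {
		if (memblock_is_boot_memory(PFN_PHYS(pfn)))
			return NOTIFY_BAD;
	}
	return NOTIFY_OK;
}

static struct notifier_block boot_mem_offline_nb = {
	.notifier_call = boot_mem_offline_notifier,
};

Returning NOTIFY_BAD from MEM_GOING_OFFLINE makes the offline request fail,
which in turn blocks the subsequent hot remove.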

2020-01-13 10:39:08

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH V11 1/5] mm/hotplug: Introduce arch callback validating the hot remove range

On 13.01.20 10:50, Anshuman Khandual wrote:
>
>
> On 01/13/2020 02:44 PM, David Hildenbrand wrote:
>>
>>
>>> Am 13.01.2020 um 10:10 schrieb Anshuman Khandual <[email protected]>:
>>>
>>> 
>>>
>>>> On 01/10/2020 02:12 PM, David Hildenbrand wrote:
>>>>> On 10.01.20 04:09, Anshuman Khandual wrote:
>>>>> Currently there are two interfaces to initiate memory range hot removal i.e
>>>>> remove_memory() and __remove_memory() which then calls try_remove_memory().
>>>>> Platform gets called with arch_remove_memory() to tear down required kernel
>>>>> page tables and other arch specific procedures. But there are platforms
>>>>> like arm64 which might want to prevent removal of certain specific memory
>>>>> ranges irrespective of their present usage or movability properties.
>>>>
>>>> Why? Is this only relevant for boot memory? I hope so, otherwise the
>>>> arch code needs fixing IMHO.
>>>
>>> Right, it is relevant only for the boot memory on arm64 platform. But this
>>> new arch callback makes it flexible to reject any given memory range.
>>>
>>>>
>>>> If it's only boot memory, we should disallow offlining instead via a
>>>> memory notifier - much cleaner.
>>>
>>> Dont have much detail understanding of MMU notifier mechanism but from some
>>> initial reading, it seems like we need to have a mm_struct for a notifier
>>> to monitor various events on the page table. Just wondering how a physical
>>> memory range like boot memory can be monitored because it can be used both
>>> for for kernel (init_mm) or user space process at same time. Is there some
>>> mechanism we could do this ?
>>>
>>>>
>>>>>
>>>>> Current arch call back arch_remove_memory() is too late in the process to
>>>>> abort memory hot removal as memory block devices and firmware memory map
>>>>> entries would have already been removed. Platforms should be able to abort
>>>>> the process before taking the mem_hotplug_lock with mem_hotplug_begin().
>>>>> This essentially requires a new arch callback for memory range validation.
>>>>
>>>> I somewhat dislike this very much. Memory removal should never fail if
>>>> used sanely. See e.g., __remove_memory(), it will BUG() whenever
>>>> something like that would strike.
>>>>
>>>>>
>>>>> This differentiates memory range validation between memory hot add and hot
>>>>> remove paths before carving out a new helper check_hotremove_memory_range()
>>>>> which incorporates a new arch callback. This call back provides platforms
>>>>> an opportunity to refuse memory removal at the very onset. In future the
>>>>> same principle can be extended for memory hot add path if required.
>>>>>
>>>>> Platforms can choose to override this callback in order to reject specific
>>>>> memory ranges from removal or can just fallback to a default implementation
>>>>> which allows removal of all memory ranges.
>>>>
>>>> I suspect we want really want to disallow offlining instead. E.g., I
>>>
>>> If boot memory pages can be prevented from being offlined for sure, then it
>>> would indirectly definitely prevent hot remove process as well.
>>>
>>>> remember s390x does that with certain areas needed for dumping/kexec.
>>>
>>> Could not find any references to mmu_notifier in arch/s390 or any other arch
>>> for that matter apart from KVM (which has an user space component), could you
>>> please give some pointers ?
>>
>> Memory (hotplug) notifier, not MMU notifier :)
>
> They are so similarly named :)
>
>>
>> Not on my notebook right now, grep for MEM_GOING_OFFLINE, that should be it.
>>
>
> Got it, thanks ! But we will still need boot memory enumeration via MEMBLOCK_BOOT
> to reject affected offline requests in the callback.

Do you really need that?

We have SECTION_IS_EARLY. You could iterate all involved sections (for
which you are getting notified) and check whether any one of them is marked
SECTION_IS_EARLY. If so, that memory was added during boot and not via add_memory().
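
(For reference, early_section() is essentially just a test of that flag on the
mem_section, roughly:

static inline int early_section(struct mem_section *section)
{
	return (section->section_mem_map & SECTION_IS_EARLY);
}

so the notifier only has to walk the sections covered by the offline request.)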


--
Thanks,

David / dhildenb

2020-01-14 02:59:21

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH V11 1/5] mm/hotplug: Introduce arch callback validating the hot remove range



On 01/13/2020 04:07 PM, David Hildenbrand wrote:
> On 13.01.20 10:50, Anshuman Khandual wrote:
>>
>>
>> On 01/13/2020 02:44 PM, David Hildenbrand wrote:
>>>
>>>
>>>> Am 13.01.2020 um 10:10 schrieb Anshuman Khandual <[email protected]>:
>>>>
>>>> 
>>>>
>>>>> On 01/10/2020 02:12 PM, David Hildenbrand wrote:
>>>>>> On 10.01.20 04:09, Anshuman Khandual wrote:
>>>>>> Currently there are two interfaces to initiate memory range hot removal i.e
>>>>>> remove_memory() and __remove_memory() which then calls try_remove_memory().
>>>>>> Platform gets called with arch_remove_memory() to tear down required kernel
>>>>>> page tables and other arch specific procedures. But there are platforms
>>>>>> like arm64 which might want to prevent removal of certain specific memory
>>>>>> ranges irrespective of their present usage or movability properties.
>>>>>
>>>>> Why? Is this only relevant for boot memory? I hope so, otherwise the
>>>>> arch code needs fixing IMHO.
>>>>
>>>> Right, it is relevant only for the boot memory on arm64 platform. But this
>>>> new arch callback makes it flexible to reject any given memory range.
>>>>
>>>>>
>>>>> If it's only boot memory, we should disallow offlining instead via a
>>>>> memory notifier - much cleaner.
>>>>
>>>> Dont have much detail understanding of MMU notifier mechanism but from some
>>>> initial reading, it seems like we need to have a mm_struct for a notifier
>>>> to monitor various events on the page table. Just wondering how a physical
>>>> memory range like boot memory can be monitored because it can be used both
>>>> for for kernel (init_mm) or user space process at same time. Is there some
>>>> mechanism we could do this ?
>>>>
>>>>>
>>>>>>
>>>>>> Current arch call back arch_remove_memory() is too late in the process to
>>>>>> abort memory hot removal as memory block devices and firmware memory map
>>>>>> entries would have already been removed. Platforms should be able to abort
>>>>>> the process before taking the mem_hotplug_lock with mem_hotplug_begin().
>>>>>> This essentially requires a new arch callback for memory range validation.
>>>>>
>>>>> I somewhat dislike this very much. Memory removal should never fail if
>>>>> used sanely. See e.g., __remove_memory(), it will BUG() whenever
>>>>> something like that would strike.
>>>>>
>>>>>>
>>>>>> This differentiates memory range validation between memory hot add and hot
>>>>>> remove paths before carving out a new helper check_hotremove_memory_range()
>>>>>> which incorporates a new arch callback. This call back provides platforms
>>>>>> an opportunity to refuse memory removal at the very onset. In future the
>>>>>> same principle can be extended for memory hot add path if required.
>>>>>>
>>>>>> Platforms can choose to override this callback in order to reject specific
>>>>>> memory ranges from removal or can just fallback to a default implementation
>>>>>> which allows removal of all memory ranges.
>>>>>
>>>>> I suspect we want really want to disallow offlining instead. E.g., I
>>>>
>>>> If boot memory pages can be prevented from being offlined for sure, then it
>>>> would indirectly definitely prevent hot remove process as well.
>>>>
>>>>> remember s390x does that with certain areas needed for dumping/kexec.
>>>>
>>>> Could not find any references to mmu_notifier in arch/s390 or any other arch
>>>> for that matter apart from KVM (which has an user space component), could you
>>>> please give some pointers ?
>>>
>>> Memory (hotplug) notifier, not MMU notifier :)
>>
>> They are so similarly named :)
>>
>>>
>>> Not on my notebook right now, grep for MEM_GOING_OFFLINE, that should be it.
>>>
>>
>> Got it, thanks ! But we will still need boot memory enumeration via MEMBLOCK_BOOT
>> to reject affected offline requests in the callback.
>
> Do you really need that?
>
> We have SECTION_IS_EARLY. You could iterate all involved sections (for
> which you are getting notified) and check if any one of these is marked
> SECTION_IS_EARLY. then, it was added during boot and not via add_memory().

Seems to be a better approach than adding a new memblock flag.

>
>

2020-01-14 11:09:05

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH V11 1/5] mm/hotplug: Introduce arch callback validating the hot remove range



On 01/14/2020 07:43 AM, Anshuman Khandual wrote:
>
>
> On 01/13/2020 04:07 PM, David Hildenbrand wrote:
>> On 13.01.20 10:50, Anshuman Khandual wrote:
>>>
>>>
>>> On 01/13/2020 02:44 PM, David Hildenbrand wrote:
>>>>
>>>>
>>>>> Am 13.01.2020 um 10:10 schrieb Anshuman Khandual <[email protected]>:
>>>>>
>>>>> 
>>>>>
>>>>>> On 01/10/2020 02:12 PM, David Hildenbrand wrote:
>>>>>>> On 10.01.20 04:09, Anshuman Khandual wrote:
>>>>>>> Currently there are two interfaces to initiate memory range hot removal i.e
>>>>>>> remove_memory() and __remove_memory() which then calls try_remove_memory().
>>>>>>> Platform gets called with arch_remove_memory() to tear down required kernel
>>>>>>> page tables and other arch specific procedures. But there are platforms
>>>>>>> like arm64 which might want to prevent removal of certain specific memory
>>>>>>> ranges irrespective of their present usage or movability properties.
>>>>>>
>>>>>> Why? Is this only relevant for boot memory? I hope so, otherwise the
>>>>>> arch code needs fixing IMHO.
>>>>>
>>>>> Right, it is relevant only for the boot memory on arm64 platform. But this
>>>>> new arch callback makes it flexible to reject any given memory range.
>>>>>
>>>>>>
>>>>>> If it's only boot memory, we should disallow offlining instead via a
>>>>>> memory notifier - much cleaner.
>>>>>
>>>>> Dont have much detail understanding of MMU notifier mechanism but from some
>>>>> initial reading, it seems like we need to have a mm_struct for a notifier
>>>>> to monitor various events on the page table. Just wondering how a physical
>>>>> memory range like boot memory can be monitored because it can be used both
>>>>> for for kernel (init_mm) or user space process at same time. Is there some
>>>>> mechanism we could do this ?
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Current arch call back arch_remove_memory() is too late in the process to
>>>>>>> abort memory hot removal as memory block devices and firmware memory map
>>>>>>> entries would have already been removed. Platforms should be able to abort
>>>>>>> the process before taking the mem_hotplug_lock with mem_hotplug_begin().
>>>>>>> This essentially requires a new arch callback for memory range validation.
>>>>>>
>>>>>> I somewhat dislike this very much. Memory removal should never fail if
>>>>>> used sanely. See e.g., __remove_memory(), it will BUG() whenever
>>>>>> something like that would strike.
>>>>>>
>>>>>>>
>>>>>>> This differentiates memory range validation between memory hot add and hot
>>>>>>> remove paths before carving out a new helper check_hotremove_memory_range()
>>>>>>> which incorporates a new arch callback. This call back provides platforms
>>>>>>> an opportunity to refuse memory removal at the very onset. In future the
>>>>>>> same principle can be extended for memory hot add path if required.
>>>>>>>
>>>>>>> Platforms can choose to override this callback in order to reject specific
>>>>>>> memory ranges from removal or can just fallback to a default implementation
>>>>>>> which allows removal of all memory ranges.
>>>>>>
>>>>>> I suspect we want really want to disallow offlining instead. E.g., I
>>>>>
>>>>> If boot memory pages can be prevented from being offlined for sure, then it
>>>>> would indirectly definitely prevent hot remove process as well.
>>>>>
>>>>>> remember s390x does that with certain areas needed for dumping/kexec.
>>>>>
>>>>> Could not find any references to mmu_notifier in arch/s390 or any other arch
>>>>> for that matter apart from KVM (which has an user space component), could you
>>>>> please give some pointers ?
>>>>
>>>> Memory (hotplug) notifier, not MMU notifier :)
>>>
>>> They are so similarly named :)
>>>
>>>>
>>>> Not on my notebook right now, grep for MEM_GOING_OFFLINE, that should be it.
>>>>
>>>
>>> Got it, thanks ! But we will still need boot memory enumeration via MEMBLOCK_BOOT
>>> to reject affected offline requests in the callback.
>>
>> Do you really need that?
>>
>> We have SECTION_IS_EARLY. You could iterate all involved sections (for
>> which you are getting notified) and check if any one of these is marked
>> SECTION_IS_EARLY. then, it was added during boot and not via add_memory().
>
> Seems to be a better approach than adding a new memblock flag.

These additional changes do the trick and prevent boot memory removal.
Hope this is in line with your earlier suggestion.

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 00f3e1836558..3b59e6a29dea 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -17,6 +17,7 @@
#include <linux/mman.h>
#include <linux/nodemask.h>
#include <linux/memblock.h>
+#include <linux/memory.h>
#include <linux/fs.h>
#include <linux/io.h>
#include <linux/mm.h>
@@ -1365,4 +1366,37 @@ void arch_remove_memory(int nid, u64 start, u64 size,
__remove_pages(start_pfn, nr_pages, altmap);
__remove_pgd_mapping(swapper_pg_dir, __phys_to_virt(start), size);
}
+
+static int boot_mem_remove_notifier(struct notifier_block *nb,
+ unsigned long action, void *data)
+{
+ unsigned long start_pfn, end_pfn, pfn, section_nr;
+ struct mem_section *ms;
+ struct memory_notify *arg = data;
+
+ start_pfn = arg->start_pfn;
+ end_pfn = start_pfn + arg->nr_pages;
+
+ if (action != MEM_GOING_OFFLINE)
+ return NOTIFY_OK;
+
+ for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
+ section_nr = pfn_to_section_nr(pfn);
+ ms = __nr_to_section(section_nr);
+
+ if (early_section(ms))
+ return NOTIFY_BAD;
+ }
+ return NOTIFY_OK;
+}
+
+static struct notifier_block boot_mem_remove_nb = {
+ .notifier_call = boot_mem_remove_notifier,
+};
+
+static int __init boot_mem_remove_init(void)
+{
+ return register_memory_notifier(&boot_mem_remove_nb);
+}
+device_initcall(boot_mem_remove_init);
#endif

>
>>
>>
>
>

2020-01-14 12:33:27

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH V11 1/5] mm/hotplug: Introduce arch callback validating the hot remove range

On 14.01.20 12:09, Anshuman Khandual wrote:
>
>
> On 01/14/2020 07:43 AM, Anshuman Khandual wrote:
>>
>>
>> On 01/13/2020 04:07 PM, David Hildenbrand wrote:
>>> On 13.01.20 10:50, Anshuman Khandual wrote:
>>>>
>>>>
>>>> On 01/13/2020 02:44 PM, David Hildenbrand wrote:
>>>>>
>>>>>
>>>>>> Am 13.01.2020 um 10:10 schrieb Anshuman Khandual <[email protected]>:
>>>>>>
>>>>>> 
>>>>>>
>>>>>>> On 01/10/2020 02:12 PM, David Hildenbrand wrote:
>>>>>>>> On 10.01.20 04:09, Anshuman Khandual wrote:
>>>>>>>> Currently there are two interfaces to initiate memory range hot removal i.e
>>>>>>>> remove_memory() and __remove_memory() which then calls try_remove_memory().
>>>>>>>> Platform gets called with arch_remove_memory() to tear down required kernel
>>>>>>>> page tables and other arch specific procedures. But there are platforms
>>>>>>>> like arm64 which might want to prevent removal of certain specific memory
>>>>>>>> ranges irrespective of their present usage or movability properties.
>>>>>>>
>>>>>>> Why? Is this only relevant for boot memory? I hope so, otherwise the
>>>>>>> arch code needs fixing IMHO.
>>>>>>
>>>>>> Right, it is relevant only for the boot memory on arm64 platform. But this
>>>>>> new arch callback makes it flexible to reject any given memory range.
>>>>>>
>>>>>>>
>>>>>>> If it's only boot memory, we should disallow offlining instead via a
>>>>>>> memory notifier - much cleaner.
>>>>>>
>>>>>> Dont have much detail understanding of MMU notifier mechanism but from some
>>>>>> initial reading, it seems like we need to have a mm_struct for a notifier
>>>>>> to monitor various events on the page table. Just wondering how a physical
>>>>>> memory range like boot memory can be monitored because it can be used both
>>>>>> for for kernel (init_mm) or user space process at same time. Is there some
>>>>>> mechanism we could do this ?
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Current arch call back arch_remove_memory() is too late in the process to
>>>>>>>> abort memory hot removal as memory block devices and firmware memory map
>>>>>>>> entries would have already been removed. Platforms should be able to abort
>>>>>>>> the process before taking the mem_hotplug_lock with mem_hotplug_begin().
>>>>>>>> This essentially requires a new arch callback for memory range validation.
>>>>>>>
>>>>>>> I somewhat dislike this very much. Memory removal should never fail if
>>>>>>> used sanely. See e.g., __remove_memory(), it will BUG() whenever
>>>>>>> something like that would strike.
>>>>>>>
>>>>>>>>
>>>>>>>> This differentiates memory range validation between memory hot add and hot
>>>>>>>> remove paths before carving out a new helper check_hotremove_memory_range()
>>>>>>>> which incorporates a new arch callback. This call back provides platforms
>>>>>>>> an opportunity to refuse memory removal at the very onset. In future the
>>>>>>>> same principle can be extended for memory hot add path if required.
>>>>>>>>
>>>>>>>> Platforms can choose to override this callback in order to reject specific
>>>>>>>> memory ranges from removal or can just fallback to a default implementation
>>>>>>>> which allows removal of all memory ranges.
>>>>>>>
>>>>>>> I suspect we want really want to disallow offlining instead. E.g., I
>>>>>>
>>>>>> If boot memory pages can be prevented from being offlined for sure, then it
>>>>>> would indirectly definitely prevent hot remove process as well.
>>>>>>
>>>>>>> remember s390x does that with certain areas needed for dumping/kexec.
>>>>>>
>>>>>> Could not find any references to mmu_notifier in arch/s390 or any other arch
>>>>>> for that matter apart from KVM (which has an user space component), could you
>>>>>> please give some pointers ?
>>>>>
>>>>> Memory (hotplug) notifier, not MMU notifier :)
>>>>
>>>> They are so similarly named :)
>>>>
>>>>>
>>>>> Not on my notebook right now, grep for MEM_GOING_OFFLINE, that should be it.
>>>>>
>>>>
>>>> Got it, thanks ! But we will still need boot memory enumeration via MEMBLOCK_BOOT
>>>> to reject affected offline requests in the callback.
>>>
>>> Do you really need that?
>>>
>>> We have SECTION_IS_EARLY. You could iterate all involved sections (for
>>> which you are getting notified) and check if any one of these is marked
>>> SECTION_IS_EARLY. then, it was added during boot and not via add_memory().
>>
>> Seems to be a better approach than adding a new memblock flag.
>
> These additional changes do the trick and prevent boot memory removal.
> Hope this is in line with your earlier suggestion.
>
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 00f3e1836558..3b59e6a29dea 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -17,6 +17,7 @@
> #include <linux/mman.h>
> #include <linux/nodemask.h>
> #include <linux/memblock.h>
> +#include <linux/memory.h>
> #include <linux/fs.h>
> #include <linux/io.h>
> #include <linux/mm.h>
> @@ -1365,4 +1366,37 @@ void arch_remove_memory(int nid, u64 start, u64 size,
> __remove_pages(start_pfn, nr_pages, altmap);
> __remove_pgd_mapping(swapper_pg_dir, __phys_to_virt(start), size);
> }
> +
> +static int boot_mem_remove_notifier(struct notifier_block *nb,
> + unsigned long action, void *data)
> +{
> + unsigned long start_pfn, end_pfn, pfn, section_nr;
> + struct mem_section *ms;
> + struct memory_notify *arg = data;
> +
> + start_pfn = arg->start_pfn;
> + end_pfn = start_pfn + arg->nr_pages;

You can initialize some of these directly

struct memory_notify *arg = data;
const unsigned long end_pfn = arg->start_pfn + arg->nr_pages;
unsigned long pfn = arg->start_pfn;

and avoid start_pfn.

> +
> + if (action != MEM_GOING_OFFLINE)
> + return NOTIFY_OK;
> +
> + for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
> + section_nr = pfn_to_section_nr(pfn);
> + ms = __nr_to_section(section_nr);

Also, I think you can avoid section_nr.

ms = __nr_to_section(pfn_to_section_nr(pfn));

> +
> + if (early_section(ms))
> + return NOTIFY_BAD;
> + }
> + return NOTIFY_OK;
> +}
> +
> +static struct notifier_block boot_mem_remove_nb = {
> + .notifier_call = boot_mem_remove_notifier,
> +};
> +
> +static int __init boot_mem_remove_init(void)
> +{
> + return register_memory_notifier(&boot_mem_remove_nb);
> +}
> +device_initcall(boot_mem_remove_init);
> #endif

Exactly what I was suggesting :)

If we ever need to offline+re-online boot memory (e.g., to a different
zone), we can think of something else.

--
Thanks,

David / dhildenb