2017-11-23 11:14:39

by Andrea Reale

Subject: [PATCH v2 0/5] Memory hotplug support for arm64 - complete patchset v2

Hi all,

this is a second round of patches to introduce memory hotplug and
hotremove support for arm64. It builds on the work previously published at
[1] and it implements the feedback received in the first round of reviews.

The patchset applies and has been tested on commit bebc6082da0a ("Linux
4.14").

Due to a small regression introduced with commit 8135d8926c08
("mm: memory_hotplug: memory hotremove supports thp migration"), you
will need to apply patch [2] first, until the fix is upstreamed.

Comments and feedback are gold.

[1] https://lkml.org/lkml/2017/4/11/536
[2] https://lkml.org/lkml/2017/11/20/902

Changes v1->v2:
- swapper pgtable updated in place on hot add, avoiding unnecessary copy
- stop_machine used to update swapper on hot add, avoiding races
- introduced check on offlining state before hot remove (see the usage
flow sketched after this list)
- new memblock flag used to mark partially unused vmemmap pages, avoiding
the nasty 0xFD hack used in the prev rev (and in x86 hot remove code)
- proper cleaning sequence for p[um]ds, ptes and related TLB management
- Removed macros that changed hot remove behavior based on number
of pgtable levels. Now this is hidden in the pgtable traversal macros.
- Check on the corner case where P[UM]Ds would have to be split during
hot remove: now this is forbidden.
- Minor fixes and refactoring.
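
For quick reference, the intended end-to-end flow is roughly the following
(just a sketch: the probe interface comes from ARCH_MEMORY_PROBE, selected
by patch 1/5; the "remove" handle is added by the "Add memory hotremove
probe device" patch in this series; $phys_addr stands for a section-aligned
physical address):

1. Hot add (physical, then logical)
- # echo $phys_addr > /sys/devices/system/memory/probe
- # echo online > /sys/devices/system/memory/memoryXX/state
2. Hot remove (logical, then physical)
- # echo offline > /sys/devices/system/memory/memoryXX/state
- # echo $phys_addr > /sys/devices/system/memory/remove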

Andrea Reale (4):
mm: memory_hotplug: Remove assumption on memory state before hotremove
mm: memory_hotplug: memblock to track partially removed vmemmap mem
mm: memory_hotplug: Add memory hotremove probe device
mm: memory-hotplug: Add memory hot remove support for arm64

Maciej Bielski (1):
mm: memory_hotplug: Memory hotplug (add) support for arm64

arch/arm64/Kconfig | 15 +
arch/arm64/configs/defconfig | 2 +
arch/arm64/include/asm/mmu.h | 7 +
arch/arm64/mm/init.c | 116 ++++++++
arch/arm64/mm/mmu.c | 609 ++++++++++++++++++++++++++++++++++++++++-
drivers/acpi/acpi_memhotplug.c | 2 +-
drivers/base/memory.c | 34 ++-
include/linux/memblock.h | 12 +
include/linux/memory_hotplug.h | 9 +-
mm/memblock.c | 32 +++
mm/memory_hotplug.c | 13 +-
11 files changed, 835 insertions(+), 16 deletions(-)

--
2.7.4



2017-11-23 11:14:59

by Maciej Bielski

Subject: [PATCH v2 1/5] mm: memory_hotplug: Memory hotplug (add) support for arm64

Introduces memory hotplug functionality (hot-add) for arm64.

Changes v1->v2:
- swapper pgtable updated in place on hot add, avoiding unnecessary copy:
all changes are additive and non-destructive.

- stop_machine used to update swapper on hot add, avoiding races

- check whether debug_pagealloc is enabled, to stay coherent with the mem_map

Signed-off-by: Maciej Bielski <[email protected]>
Signed-off-by: Andrea Reale <[email protected]>
---
arch/arm64/Kconfig | 12 ++++++
arch/arm64/configs/defconfig | 1 +
arch/arm64/include/asm/mmu.h | 3 ++
arch/arm64/mm/init.c | 87 ++++++++++++++++++++++++++++++++++++++++++++
arch/arm64/mm/mmu.c | 39 ++++++++++++++++++++
5 files changed, 142 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 0df64a6..c736bba 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -641,6 +641,14 @@ config HOTPLUG_CPU
Say Y here to experiment with turning CPUs off and on. CPUs
can be controlled through /sys/devices/system/cpu.

+config ARCH_HAS_ADD_PAGES
+ def_bool y
+ depends on ARCH_ENABLE_MEMORY_HOTPLUG
+
+config ARCH_ENABLE_MEMORY_HOTPLUG
+ def_bool y
+ depends on !NUMA
+
# Common NUMA Features
config NUMA
bool "Numa Memory Allocation and Scheduler Support"
@@ -715,6 +723,10 @@ config ARCH_HAS_CACHE_LINE_SIZE

source "mm/Kconfig"

+config ARCH_MEMORY_PROBE
+ def_bool y
+ depends on MEMORY_HOTPLUG
+
config SECCOMP
bool "Enable seccomp to safely compute untrusted bytecode"
---help---
diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
index 34480e9..5fc5656 100644
--- a/arch/arm64/configs/defconfig
+++ b/arch/arm64/configs/defconfig
@@ -80,6 +80,7 @@ CONFIG_ARM64_VA_BITS_48=y
CONFIG_SCHED_MC=y
CONFIG_NUMA=y
CONFIG_PREEMPT=y
+CONFIG_MEMORY_HOTPLUG=y
CONFIG_KSM=y
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_CMA=y
diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index 0d34bf0..2b3fa4d 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -40,5 +40,8 @@ extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
pgprot_t prot, bool page_mappings_only);
extern void *fixmap_remap_fdt(phys_addr_t dt_phys);
extern void mark_linear_text_alias_ro(void);
+#ifdef CONFIG_MEMORY_HOTPLUG
+extern void hotplug_paging(phys_addr_t start, phys_addr_t size);
+#endif

#endif
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 5960bef..e96e7d3 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -722,3 +722,90 @@ static int __init register_mem_limit_dumper(void)
return 0;
}
__initcall(register_mem_limit_dumper);
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+int add_pages(int nid, unsigned long start_pfn,
+ unsigned long nr_pages, bool want_memblock)
+{
+ int ret;
+ u64 start_addr = start_pfn << PAGE_SHIFT;
+ /*
+ * Mark the first page in the range as unusable. This is needed
+ * because __add_section (within __add_pages) wants pfn_valid
+ * of it to be false, and in arm64 pfn_valid is implemented by
+ * just checking the nomap flag for existing blocks.
+ *
+ * A small trick here is that __add_section() requires only
+ * phys_start_pfn (that is the first pfn of a section) to be
+ * invalid. Regardless of whether it was assumed (by the function
+ * author) that all pfns within a section are either all valid
+ * or all invalid, this allows us to avoid looping twice (once
+ * here, and once when memblock_clear_nomap() is called) through
+ * all pfns of the section and to modify only one pfn. Thanks to
+ * that, further down in __add_zone() only this very first pfn is
+ * skipped and the corresponding page is not flagged reserved.
+ * Therefore it is enough to correct this setup only for it.
+ *
+ * When arch_add_memory() returns, walk_memory_range() is called
+ * with the online_memory_block() callback, whose execution
+ * eventually reaches memory_block_action(), where again only the
+ * first pfn of a memory block is checked to be reserved. Above it
+ * was the first pfn of a section, here it is a block, but
+ * (drivers/base/memory.c):
+ * sections_per_block = block_sz / MIN_MEMORY_BLOCK_SIZE;
+ * (include/linux/memory.h):
+ * #define MIN_MEMORY_BLOCK_SIZE (1UL << SECTION_SIZE_BITS)
+ * so we can consider blocks and sections equivalent here.
+ */
+ memblock_mark_nomap(start_addr, 1<<PAGE_SHIFT);
+ ret = __add_pages(nid, start_pfn, nr_pages, want_memblock);
+
+ /*
+ * Make the pages usable after they have been added.
+ * This will make pfn_valid return true
+ */
+ memblock_clear_nomap(start_addr, 1<<PAGE_SHIFT);
+
+ /*
+ * This is a hack to avoid having to mix arch specific code
+ * into arch independent code. SetPageReserved is supposed
+ * to be called by __add_zone (within __add_section, within
+ * __add_pages). However, when it is called there, it assumes that
+ * pfn_valid returns true. Given the way pfn_valid is implemented
+ * on arm64 (a check on the nomap flag), the only way to make
+ * this evaluate true inside __add_zone is to clear the nomap
+ * flags of blocks in architecture independent code.
+ *
+ * To avoid this, we set the Reserved flag here after we cleared
+ * the nomap flag in the line above.
+ */
+ SetPageReserved(pfn_to_page(start_pfn));
+
+ return ret;
+}
+
+int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock)
+{
+ int ret;
+ unsigned long start_pfn = start >> PAGE_SHIFT;
+ unsigned long nr_pages = size >> PAGE_SHIFT;
+ unsigned long end_pfn = start_pfn + nr_pages;
+ unsigned long max_sparsemem_pfn = 1UL << (MAX_PHYSMEM_BITS-PAGE_SHIFT);
+
+ if (end_pfn > max_sparsemem_pfn) {
+ pr_err("end_pfn too big");
+ return -1;
+ }
+ hotplug_paging(start, size);
+
+ ret = add_pages(nid, start_pfn, nr_pages, want_memblock);
+
+ if (ret)
+ pr_warn("%s: Problem encountered in __add_pages() ret=%d\n",
+ __func__, ret);
+
+ return ret;
+}
+
+#endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index f1eb15e..d93043d 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -28,6 +28,7 @@
#include <linux/mman.h>
#include <linux/nodemask.h>
#include <linux/memblock.h>
+#include <linux/stop_machine.h>
#include <linux/fs.h>
#include <linux/io.h>
#include <linux/mm.h>
@@ -615,6 +616,44 @@ void __init paging_init(void)
SWAPPER_DIR_SIZE - PAGE_SIZE);
}

+#ifdef CONFIG_MEMORY_HOTPLUG
+
+/*
+ * hotplug_paging() is used by memory hotplug to build new page tables
+ * for hot added memory.
+ */
+
+struct mem_range {
+ phys_addr_t base;
+ phys_addr_t size;
+};
+
+static int __hotplug_paging(void *data)
+{
+ int flags = 0;
+ struct mem_range *section = data;
+
+ if (debug_pagealloc_enabled())
+ flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
+
+ __create_pgd_mapping(swapper_pg_dir, section->base,
+ __phys_to_virt(section->base), section->size,
+ PAGE_KERNEL, pgd_pgtable_alloc, flags);
+
+ return 0;
+}
+
+inline void hotplug_paging(phys_addr_t start, phys_addr_t size)
+{
+ struct mem_range section = {
+ .base = start,
+ .size = size,
+ };
+
+ stop_machine(__hotplug_paging, &section, NULL);
+}
+#endif /* CONFIG_MEMORY_HOTPLUG */
+
/*
* Check whether a kernel address is valid (derived from arch/x86/).
*/
--
2.7.4



2017-11-23 11:16:57

by Andrea Reale

Subject: [PATCH v2 5/5] mm: memory-hotplug: Add memory hot remove support for arm64

Implementation of pagetable cleanup routines for arm64 memory hot remove.

How to hot remove:
1. Logical Hot remove (offline)
- # echo offline > /sys/devices/system/memory/memoryXX/state
2. Physical Hot remove (remove)
- (if offline is successful)
- # echo $section_phy_address > /sys/devices/system/memory/remove

Changes v1->v2:
- introduced check on offlining state before hot remove:
in x86 (and possibly other architectures), offlining of pages and hot
remove of physical memory happen in a single step, i.e., via an acpi
event. In this patchset we are introducing a "remove" sysfs handle
that triggers the physical hot-remove process after manual offlining.

- new memblock flag used to mark partially unused vmemmap pages, avoiding
the nasty 0xFD hack used in the prev rev (and in x86 hot remove code):
the hot remove process needs to take care of freeing vmemmap pages
and mappings for the memory being removed. Sometimes, it might not be
possible to fully free a vmemmap page (because it is being used for
other mappings); in such a case we mark part of that page as unused and
we free it only when it is fully unused. In the previous version, in
symmetry to x86 hot remove code, we were doing this marking by filling
the unused parts of the page with an arbitrary 0xFD constant. In this
version, we are using a new memblock flag for the same purpose.

- proper cleaning sequence for p[um]ds, ptes and related TLB management
(see the sketch after this list):
i) clear the page table, ii) flush tlb, iii) free the pagetable page

- Removed macros that changed hot remove behavior based on number
of pgtable levels. Now this is hidden in the pgtable traversal macros.

- Check on the corner case where P[UM]Ds would have to be split during
hot remove: now this is forbidden.
Hot addition and removal is done at SECTION_SIZE_BITS granularity
(currently 1GB). The only case when we would have to split a P[UM]D
is when SECTION_SIZE_BITS is smaller than a P[UM]D mapped area (never
by default), AND when we are removing some P[UM]D-mapped memory that
was never hot-added (there since boot). If the above conditions hold,
we avoid splitting the P[UM]Ds and, instead, we forbid hot removal.

- Minor fixes and refactoring.
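
To make the cleaning sequence explicit, here is a condensed sketch of what
each of the free_p[um]d/pte_table() helpers below boils down to once a
table has been found empty (shown for the PTE-table case; the helper name
is made up for illustration and is not part of the patch):

static void teardown_empty_pte_table(pmd_t *pmd, unsigned long addr,
		struct page *table_page, bool linear_map)
{
	/* i) clear the page table: unhook the empty pte table from the pmd */
	spin_lock(&init_mm.page_table_lock);
	pmd_clear(pmd);
	spin_unlock(&init_mm.page_table_lock);

	/* ii) flush the tlb for the region the entry used to cover */
	addr &= PMD_MASK;
	flush_tlb_kernel_range(addr, addr + PMD_SIZE);

	/* iii) only now free the pagetable page itself */
	free_pagetable(table_page, 0, linear_map);
}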

Signed-off-by: Andrea Reale <[email protected]>
Signed-off-by: Maciej Bielski <[email protected]>
---
arch/arm64/Kconfig | 3 +
arch/arm64/configs/defconfig | 1 +
arch/arm64/include/asm/mmu.h | 4 +
arch/arm64/mm/init.c | 29 +++
arch/arm64/mm/mmu.c | 572 ++++++++++++++++++++++++++++++++++++++++++-
5 files changed, 601 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index c736bba..c362ddf 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -649,6 +649,9 @@ config ARCH_ENABLE_MEMORY_HOTPLUG
def_bool y
depends on !NUMA

+config ARCH_ENABLE_MEMORY_HOTREMOVE
+ def_bool y
+
# Common NUMA Features
config NUMA
bool "Numa Memory Allocation and Scheduler Support"
diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
index 5fc5656..cdac3b8 100644
--- a/arch/arm64/configs/defconfig
+++ b/arch/arm64/configs/defconfig
@@ -81,6 +81,7 @@ CONFIG_SCHED_MC=y
CONFIG_NUMA=y
CONFIG_PREEMPT=y
CONFIG_MEMORY_HOTPLUG=y
+CONFIG_MEMORY_HOTREMOVE=y
CONFIG_KSM=y
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_CMA=y
diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index 2b3fa4d..ca11567 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -42,6 +42,10 @@ extern void *fixmap_remap_fdt(phys_addr_t dt_phys);
extern void mark_linear_text_alias_ro(void);
#ifdef CONFIG_MEMORY_HOTPLUG
extern void hotplug_paging(phys_addr_t start, phys_addr_t size);
+#ifdef CONFIG_MEMORY_HOTREMOVE
+extern int remove_pagetable(unsigned long start,
+ unsigned long end, bool linear_map, bool check_split);
+#endif
#endif

#endif
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index e96e7d3..406b378 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -808,4 +808,33 @@ int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock)
return ret;
}

+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+ unsigned long start_pfn = start >> PAGE_SHIFT;
+ unsigned long nr_pages = size >> PAGE_SHIFT;
+ unsigned long va_start = (unsigned long) __va(start);
+ unsigned long va_end = (unsigned long)__va(start + size);
+ struct page *page = pfn_to_page(start_pfn);
+ struct zone *zone;
+ int ret = 0;
+
+ /*
+ * Check if mem can be removed without splitting
+ * PUD/PMD mappings.
+ */
+ ret = remove_pagetable(va_start, va_end, true, true);
+ if (!ret) {
+ zone = page_zone(page);
+ ret = __remove_pages(zone, start_pfn, nr_pages);
+ WARN_ON_ONCE(ret);
+
+ /* Actually remove the mapping */
+ remove_pagetable(va_start, va_end, true, false);
+ }
+
+ return ret;
+}
+
+#endif /* CONFIG_MEMORY_HOTREMOVE */
#endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index d93043d..e6f8c91 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -25,6 +25,7 @@
#include <linux/ioport.h>
#include <linux/kexec.h>
#include <linux/libfdt.h>
+#include <linux/memremap.h>
#include <linux/mman.h>
#include <linux/nodemask.h>
#include <linux/memblock.h>
@@ -652,12 +653,532 @@ inline void hotplug_paging(phys_addr_t start, phys_addr_t size)

stop_machine(__hotplug_paging, &section, NULL);
}
-#endif /* CONFIG_MEMORY_HOTPLUG */
+#ifdef CONFIG_MEMORY_HOTREMOVE
+
+static void free_pagetable(struct page *page, int order, bool linear_map)
+{
+ unsigned long magic;
+ unsigned int nr_pages = 1 << order;
+ struct vmem_altmap *altmap = to_vmem_altmap((unsigned long) page);
+
+ if (altmap) {
+ vmem_altmap_free(altmap, nr_pages);
+ return;
+ }
+
+ /* bootmem page has reserved flag */
+ if (PageReserved(page)) {
+ __ClearPageReserved(page);
+
+ magic = (unsigned long)page->lru.next;
+ if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
+ while (nr_pages--)
+ put_page_bootmem(page++);
+ } else {
+ while (nr_pages--)
+ free_reserved_page(page++);
+ }
+ } else {
+ /*
+ * Only linear_map pagetable allocations (those allocated via
+ * hotplug) call pgtable_page_ctor(); vmemmap pgtable
+ * allocations don't.
+ */
+ if (linear_map)
+ pgtable_page_dtor(page);
+
+ free_pages((unsigned long)page_address(page), order);
+ }
+}
+
+static void free_pte_table(unsigned long addr, pmd_t *pmd, bool linear_map)
+{
+ pte_t *pte;
+ struct page *page;
+ int i;
+
+ pte = pte_offset_kernel(pmd, 0L);
+ /* Check that there is no valid entry left in the PTE table */
+ for (i = 0; i < PTRS_PER_PTE; i++, pte++) {
+ if (!pte_none(*pte))
+ return;
+ }
+
+ page = pmd_page(*pmd);
+ /*
+ * This spin lock could only be taken in __pte_alloc_kernel
+ * in mm/memory.c and nowhere else (for arm64). Not sure if
+ * the function above can be called concurrently. In doubt,
+ * I am leaving it here for now, but it probably can be removed
+ */
+ spin_lock(&init_mm.page_table_lock);
+ pmd_clear(pmd);
+ spin_unlock(&init_mm.page_table_lock);
+
+ /* Make sure addr is aligned to the first address of the PMD */
+ addr &= PMD_MASK;
+ /*
+ * Invalidate TLB walk caches down to the PTE level.
+ * Not sure how the TLB walk caches are indexed, i.e., whether
+ * they are indexed just by addr & PMD_MASK or can be indexed
+ * by any address. Flushing the whole PMD range to stay on the
+ * safe side.
+ */
+ flush_tlb_kernel_range(addr, addr + PMD_SIZE);
+
+ free_pagetable(page, 0, linear_map);
+}
+
+static void free_pmd_table(unsigned long addr, pud_t *pud, bool linear_map)
+{
+ pmd_t *pmd;
+ struct page *page;
+ int i;
+
+ pmd = pmd_offset(pud, 0L);
+ /*
+ * If PMD is folded onto PUD, cleanup was already performed
+ * up in the call stack. No more work needs to be done.
+ */
+ if ((pud_t *) pmd == pud)
+ return;
+
+ /* Check if there is no valid entry in the PMD */
+ for (i = 0; i < PTRS_PER_PMD; i++, pmd++) {
+ if (!pmd_none(*pmd))
+ return;
+ }
+
+ page = pud_page(*pud);
+ /*
+ * This spin lock could only be taken in __pte_alloc_kernel
+ * in mm/memory.c and nowhere else (for arm64). Not sure if
+ * the function above can be called concurrently. In doubt,
+ * I am leaving it here for now, but it probably can be removed
+ */
+ spin_lock(&init_mm.page_table_lock);
+ pud_clear(pud);
+ spin_unlock(&init_mm.page_table_lock);
+
+ /* Make sure addr is aligned to the first address of the PUD */
+ addr &= PUD_MASK;
+ /*
+ * Invalidate TLB walk caches down to the PMD level.
+ * Not sure how the TLB walk caches are indexed, i.e., whether
+ * they are indexed just by addr & PUD_MASK or can be indexed
+ * by any address. Flushing the whole PUD range to stay on the
+ * safe side.
+ */
+ flush_tlb_kernel_range(addr, addr + PUD_SIZE);
+
+ free_pagetable(page, 0, linear_map);
+}
+
+static void free_pud_table(unsigned long addr, pgd_t *pgd, bool linear_map)
+{
+ pud_t *pud;
+ struct page *page;
+ int i;
+
+ pud = pud_offset(pgd, 0L);
+ /*
+ * If PUD is folded onto PGD, cleanup was already performed
+ * up in the call stack. No more work needs to be done.
+ */
+ if ((pgd_t *)pud == pgd)
+ return;
+
+ /* Check if there is no valid entry in the PUD */
+ for (i = 0; i < PTRS_PER_PUD; i++, pud++) {
+ if (!pud_none(*pud))
+ return;
+ }
+
+ page = pgd_page(*pgd);
+
+ /*
+ * This spin lock could only be taken in __pte_alloc_kernel
+ * in mm/memory.c and nowhere else (for arm64). Not sure if
+ * the function above can be called concurrently. In doubt,
+ * I am leaving it here for now, but it probably can be removed.
+ */
+ spin_lock(&init_mm.page_table_lock);
+ pgd_clear(pgd);
+ spin_unlock(&init_mm.page_table_lock);
+
+ /* Make sure addr is aligned to the first address of the PGD */
+ addr &= PGDIR_MASK;
+ /*
+ * Invalidate TLB walk caches down to the PUD level.
+ *
+ * Not sure how the TLB walk caches are indexed, i.e., whether
+ * they are indexed just by addr & PGDIR_MASK or can be indexed
+ * by any address. Flushing the whole PGD range to stay on the
+ * safe side.
+ */
+ flush_tlb_kernel_range(addr, addr + PGDIR_SIZE);
+
+ free_pagetable(page, 0, linear_map);
+}
+
+static void mark_n_free_pte_vmemmap(pte_t *pte,
+ unsigned long addr, unsigned long size)
+{
+ unsigned long page_offset = (addr & (~PAGE_MASK));
+ phys_addr_t page_start = pte_val(*pte) & PHYS_MASK & (s32)PAGE_MASK;
+ phys_addr_t pa_start = page_start + page_offset;
+
+ memblock_mark_unused_vmemmap(pa_start, size);
+
+ if (memblock_is_vmemmap_unused_range(&memblock.memory,
+ page_start, page_start + PAGE_SIZE)) {
+
+ free_pagetable(pte_page(*pte), 0, false);
+ memblock_clear_unused_vmemmap(page_start, PAGE_SIZE);
+
+ /*
+ * This spin lock could only be taken in __pte_alloc_kernel
+ * in mm/memory.c and nowhere else (for arm64). Not sure if
+ * the function above can be called concurrently. In doubt,
+ * I am leaving it here for now, but it probably can be removed.
+ */
+ spin_lock(&init_mm.page_table_lock);
+ pte_clear(&init_mm, addr, pte);
+ spin_unlock(&init_mm.page_table_lock);
+
+ flush_tlb_kernel_range(addr & PAGE_MASK,
+ (addr + PAGE_SIZE) & PAGE_MASK);
+ }
+}
+
+static void mark_n_free_pmd_vmemmap(pmd_t *pmd,
+ unsigned long addr, unsigned long size)
+{
+ unsigned long sec_offset = (addr & (~PMD_MASK));
+ phys_addr_t page_start = pmd_page_paddr(*pmd);
+ phys_addr_t pa_start = page_start + sec_offset;
+
+ memblock_mark_unused_vmemmap(pa_start, size);
+
+ if (memblock_is_vmemmap_unused_range(&memblock.memory,
+ page_start, page_start + PMD_SIZE)) {
+
+ free_pagetable(pmd_page(*pmd),
+ get_order(PMD_SIZE), false);
+
+ memblock_clear_unused_vmemmap(page_start, PMD_SIZE);
+ /*
+ * This spin lock could only be taken in __pte_alloc_kernel
+ * in mm/memory.c and nowhere else (for arm64). Not sure if
+ * the function above can be called concurrently. In doubt,
+ * I am leaving it here for now, but it probably can be removed.
+ */
+ spin_lock(&init_mm.page_table_lock);
+ pmd_clear(pmd);
+ spin_unlock(&init_mm.page_table_lock);
+
+ flush_tlb_kernel_range(addr & PMD_MASK,
+ (addr + PMD_SIZE) & PMD_MASK);
+ }
+}
+
+static void rm_pte_mapping(pte_t *pte, unsigned long addr,
+ unsigned long next, bool linear_map)
+{
+ /*
+ * Linear map pages were already freed when offlining.
+ * We only need to free vmemmap pages.
+ */
+ if (!linear_map)
+ free_pagetable(pte_page(*pte), 0, false);
+
+ /*
+ * This spin lock could only be taken in __pte_alloc_kernel
+ * in mm/memory.c and nowhere else (for arm64). Not sure if
+ * the function above can be called concurrently. In doubt,
+ * I am leaving it here for now, but it probably can be removed.
+ */
+ spin_lock(&init_mm.page_table_lock);
+ pte_clear(&init_mm, addr, pte);
+ spin_unlock(&init_mm.page_table_lock);
+
+ flush_tlb_kernel_range(addr, next);
+}
+
+static void rm_pmd_mapping(pmd_t *pmd, unsigned long addr,
+ unsigned long next, bool linear_map)
+{
+ /* Freeing vmemmap pages */
+ if (!linear_map)
+ free_pagetable(pmd_page(*pmd),
+ get_order(PMD_SIZE), false);
+ /*
+ * This spin lock could only be taken in __pte_alloc_kernel
+ * in mm/memory.c and nowhere else (for arm64). Not sure if
+ * the function above can be called concurrently. In doubt,
+ * I am leaving it here for now, but it probably can be removed.
+ */
+ spin_lock(&init_mm.page_table_lock);
+ pmd_clear(pmd);
+ spin_unlock(&init_mm.page_table_lock);
+
+ flush_tlb_kernel_range(addr, next);
+}
+
+static void rm_pud_mapping(pud_t *pud, unsigned long addr,
+ unsigned long next, bool linear_map)
+{
+ /** We never map vmemmap space on PUDs */
+ BUG_ON(!linear_map);
+ /*
+ * This spin lock could only be taken in __pte_alloc_kernel
+ * in mm/memory.c and nowhere else (for arm64). Not sure if
+ * the function above can be called concurrently. In doubt,
+ * I am leaving it here for now, but it probably can be removed.
+ */
+ spin_lock(&init_mm.page_table_lock);
+ pud_clear(pud);
+ spin_unlock(&init_mm.page_table_lock);
+
+ flush_tlb_kernel_range(addr, next);
+}

/*
- * Check whether a kernel address is valid (derived from arch/x86/).
+ * Used in hot-remove, cleans up PTE entries from addr to end from the pointed
+ * pte table. If linear_map is true, this is called to remove the tables
+ * for the memory being hot-removed. If false, this is called to clean-up the
+ * tables (and the memory) that were used for the vmemmap of memory being
+ * hot-removed.
*/
-int kern_addr_valid(unsigned long addr)
+static void remove_pte_table(pte_t *pte, unsigned long addr,
+ unsigned long end, bool linear_map)
+{
+ unsigned long next;
+
+ for (; addr < end; addr = next, pte++) {
+ next = (addr + PAGE_SIZE) & PAGE_MASK;
+ if (next > end)
+ next = end;
+
+ if (!pte_present(*pte))
+ continue;
+
+ if (PAGE_ALIGNED(addr) && PAGE_ALIGNED(next)) {
+ rm_pte_mapping(pte, addr, next, linear_map);
+ } else {
+ unsigned long sz = next - addr;
+ /*
+ * If we are here, we are freeing vmemmap pages, since
+ * linear_map memory ranges to be freed are always
+ * page aligned.
+ *
+ * If we are not removing the whole page, it means
+ * other page structs in this page are still being used and
+ * we cannot remove them. We use memblock to mark these
+ * unused pieces and we only remove them when they are
+ * fully unused.
+ */
+ mark_n_free_pte_vmemmap(pte, addr, sz);
+ }
+ }
+}
+
+/**
+ * Used in hot-remove, cleans up PMD entries from addr to end from the pointed
+ * pmd table.
+ *
+ * If linear_map is true, this is called to remove the tables for the
+ * memory being hot-removed. If false, this is called to clean-up the tables
+ * (and the memory) that were used for the vmemmap of memory being hot-removed.
+ *
+ * If check_split is true, no change is done on the table: the call only
+ * checks whether removing the entries would cause a section mapped PMD
+ * to be split. In such a case, -EBUSY is returned by the method.
+ */
+static int remove_pmd_table(pmd_t *pmd, unsigned long addr,
+ unsigned long end, bool linear_map, bool check_split)
+{
+ int err = 0;
+ unsigned long next;
+ pte_t *pte;
+
+ for (; !err && addr < end; addr = next, pmd++) {
+ next = pmd_addr_end(addr, end);
+
+ if (!pmd_present(*pmd))
+ continue;
+
+ if (pmd_sect(*pmd)) {
+ if (IS_ALIGNED(addr, PMD_SIZE) &&
+ IS_ALIGNED(next, PMD_SIZE)) {
+
+ if (!check_split)
+ rm_pmd_mapping(pmd, addr, next,
+ linear_map);
+
+ } else { /* not aligned to PMD size */
+
+ /*
+ * This should only occur for vmemmap.
+ * If it does happen for linear map,
+ * we do not support splitting PMDs,
+ * so we return error
+ */
+ if (linear_map) {
+ pr_warn("Hot-remove failed. Cannot split PMD mapping\n");
+ err = -EBUSY;
+ } else if (!check_split) {
+ unsigned long sz = next - addr;
+ /* Freeing vmemmap pages.*/
+ mark_n_free_pmd_vmemmap(pmd, addr, sz);
+ }
+ }
+ } else { /* ! pmd_sect() */
+
+ BUG_ON(!pmd_table(*pmd));
+ if (!check_split) {
+ pte = pte_offset_map(pmd, addr);
+ remove_pte_table(pte, addr, next, linear_map);
+ free_pte_table(addr, pmd, linear_map);
+ }
+ }
+ }
+
+ return err;
+}
+
+/**
+ * Used in hot-remove, cleans up PUD entries from addr to end from the pointed
+ * pud table.
+ *
+ * If linear_map is true, this is called to remove the tables for the
+ * memory being hot-removed. If false, this is called to clean-up the tables
+ * (and the memory) that were used for the vmemmap of memory being hot-removed.
+ *
+ * If check_split is true, no change is done on the table: the call only
+ * checks whether removing the entries would cause a section mapped PUD
+ * to be split. In such a case, -EBUSY is returned by the method.
+ */
+static int remove_pud_table(pud_t *pud, unsigned long addr,
+ unsigned long end, bool linear_map, bool check_split)
+{
+ int err = 0;
+ unsigned long next;
+ pmd_t *pmd;
+
+ for (; !err && addr < end; addr = next, pud++) {
+ next = pud_addr_end(addr, end);
+ if (!pud_present(*pud))
+ continue;
+
+ /*
+ * If we are using 4K granules, check if we are using
+ * 1GB section mapping.
+ */
+ if (pud_sect(*pud)) {
+ if (IS_ALIGNED(addr, PUD_SIZE) &&
+ IS_ALIGNED(next, PUD_SIZE)) {
+
+ if (!check_split)
+ rm_pud_mapping(pud, addr, next,
+ linear_map);
+
+ } else { /* not aligned to PUD size */
+ /*
+ * As above, we never map vmemmap
+ * space on PUDs
+ */
+ BUG_ON(!linear_map);
+ pr_warn("Hot-remove failed. Cannot split PUD mapping\n");
+ err = -EBUSY;
+ }
+ } else { /* !pud_sect() */
+ BUG_ON(!pud_table(*pud));
+
+ pmd = pmd_offset(pud, addr);
+ err = remove_pmd_table(pmd, addr, next,
+ linear_map, check_split);
+ if (!check_split)
+ free_pmd_table(addr, pud, linear_map);
+ }
+ }
+
+ return err;
+}
+
+/**
+ * Used in hot-remove, cleans up kernel page tables from addr to end.
+ *
+ * If linear_map is true, this is called to remove the tables for the
+ * memory being hot-removed. If false, this is called to clean-up the tables
+ * (and the memory) that were used for the vmemmap of memory being hot-removed.
+ *
+ * If check_split is true, no change is done on the table: the call only
+ * checks whether removing the entries would cause a section mapped P[UM]D
+ * to be split. In such a case, -EBUSY is returned by the method.
+ */
+int remove_pagetable(unsigned long start, unsigned long end,
+ bool linear_map, bool check_split)
+{
+ int err = 0;
+ unsigned long next;
+ unsigned long addr;
+ pgd_t *pgd;
+ pud_t *pud;
+
+ for (addr = start; addr < end; addr = next) {
+ next = pgd_addr_end(addr, end);
+
+ pgd = pgd_offset_k(addr);
+ if (pgd_none(*pgd))
+ continue;
+
+ pud = pud_offset(pgd, addr);
+ err = remove_pud_table(pud, addr, next,
+ linear_map, check_split);
+ if (err)
+ break;
+
+ if (!check_split)
+ free_pud_table(addr, pgd, linear_map);
+ }
+
+ if (!check_split)
+ flush_tlb_all();
+
+ return err;
+}
+
+#endif /* CONFIG_MEMORY_HOTREMOVE */
+#endif /* CONFIG_MEMORY_HOTPLUG */
+
+static unsigned long walk_kern_pgtable(unsigned long addr)
{
pgd_t *pgd;
pud_t *pud;
@@ -676,26 +1197,51 @@ int kern_addr_valid(unsigned long addr)
return 0;

if (pud_sect(*pud))
- return pfn_valid(pud_pfn(*pud));
+ return pud_pfn(*pud);

pmd = pmd_offset(pud, addr);
if (pmd_none(*pmd))
return 0;

if (pmd_sect(*pmd))
- return pfn_valid(pmd_pfn(*pmd));
+ return pmd_pfn(*pmd);

pte = pte_offset_kernel(pmd, addr);
if (pte_none(*pte))
return 0;

- return pfn_valid(pte_pfn(*pte));
+ return pte_pfn(*pte);
+}
+
+/*
+ * Check whether a kernel address is valid (derived from arch/x86/).
+ */
+int kern_addr_valid(unsigned long addr)
+{
+ return pfn_valid(walk_kern_pgtable(addr));
}
+
#ifdef CONFIG_SPARSEMEM_VMEMMAP
#if !ARM64_SWAPPER_USES_SECTION_MAPS
int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
{
- return vmemmap_populate_basepages(start, end, node);
+ int err;
+
+ err = vmemmap_populate_basepages(start, end, node);
+#ifdef CONFIG_MEMORY_HOTREMOVE
+ /*
+ * A bit inefficient (restarting from the PGD every time) but it
+ * saves us from lots of duplicated code. Also, this is only called
+ * at hot-add time, which should not be a frequent operation.
+ */
+ for (; start < end; start += PAGE_SIZE) {
+ unsigned long pfn = walk_kern_pgtable(start);
+ phys_addr_t pa_start = ((phys_addr_t)pfn) << PAGE_SHIFT;
+
+ memblock_clear_unused_vmemmap(pa_start, PAGE_SIZE);
+ }
+#endif
+ return err;
}
#else /* !ARM64_SWAPPER_USES_SECTION_MAPS */
int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
@@ -726,8 +1272,15 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
return -ENOMEM;

set_pmd(pmd, __pmd(__pa(p) | PROT_SECT_NORMAL));
- } else
+ } else {
+ unsigned long sec_offset = (addr & (~PMD_MASK));
+ phys_addr_t pa_start =
+ pmd_page_paddr(*pmd) + sec_offset;
vmemmap_verify((pte_t *)pmd, node, addr, next);
+#ifdef CONFIG_MEMORY_HOTREMOVE
+ memblock_clear_unused_vmemmap(pa_start, next - addr);
+#endif
+ }
} while (addr = next, addr != end);

return 0;
@@ -735,6 +1288,9 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
#endif /* CONFIG_ARM64_64K_PAGES */
void vmemmap_free(unsigned long start, unsigned long end)
{
+#ifdef CONFIG_MEMORY_HOTREMOVE
+ remove_pagetable(start, end, false, false);
+#endif
}
#endif /* CONFIG_SPARSEMEM_VMEMMAP */

--
2.7.4

