Date: Thu, 23 Nov 2017 11:15:42 +0000
From: Andrea Reale
To: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	m.bielski@virtualopensystems.com, arunks@qti.qualcomm.com,
	mark.rutland@arm.com, scott.branden@broadcom.com, will.deacon@arm.com,
	qiuxishi@huawei.com, catalin.marinas@arm.com, mhocko@suse.com,
	realean2@ie.ibm.com
Subject: [PATCH v2 5/5] mm: memory-hotplug: Add memory hot remove support for arm64
X-Mailing-List: linux-kernel@vger.kernel.org

Implementation of pagetable cleanup routines for arm64 memory hot remove.

How to offline:
 1. Logical Hot remove (offline)
 - # echo offline > /sys/devices/system/memory/memoryXX/state
 2. Physical Hot remove (offline)
 - (if offline is successful)
 - # echo $section_phy_address > /sys/devices/system/memory/remove

Changes v1->v2:
- introduced check on offlining state before hot remove:
  on x86 (and possibly other architectures), offlining of pages and hot
  remove of physical memory happen in a single step, i.e., via an ACPI
  event. In this patchset we are introducing a "remove" sysfs handle
  that triggers the physical hot-remove process after manual offlining,
  so the remove path first has to verify that the memory is offline.

- new memblock flag used to mark partially unused vmemmap pages, avoiding
  the nasty 0xFD hack used in the prev rev (and in x86 hot remove code):
  the hot remove process needs to take care of freeing vmemmap pages
  and mappings for the memory being removed. Sometimes it might not be
  possible to fully free a vmemmap page (because it is still being used
  for other mappings); in such a case we mark part of that page as unused
  and we free it only when it is fully unused. In the previous version,
  in symmetry to x86 hot remove code, we were doing this marking by
  filling the unused parts of the page with an arbitrary 0xFD constant.
  In this version, we are using a new memblock flag for the same purpose.

- proper cleaning sequence for p[um]ds, ptes and related TLB management:
  i) clear the page table, ii) flush tlb, iii) free the pagetable page

- Removed macros that changed hot remove behavior based on number
  of pgtable levels. Now this is hidden in the pgtable traversal macros.

- Check on the corner case where P[UM]Ds would have to be split during
  hot remove: now this is forbidden.
  Hot addition and removal are done at SECTION_SIZE_BITS granularity
  (currently 1GB). The only case when we would have to split a P[UM]D
  is when the section size is smaller than a P[UM]D-mapped area (never
  by default), AND when we are removing some P[UM]D-mapped memory that
  was never hot-added (there since boot). If the above conditions hold,
  we avoid splitting the P[UM]Ds and, instead, we forbid hot removal.

- Minor fixes and refactoring.

Signed-off-by: Andrea Reale
Signed-off-by: Maciej Bielski
---
 arch/arm64/Kconfig           |   3 +
 arch/arm64/configs/defconfig |   1 +
 arch/arm64/include/asm/mmu.h |   4 +
 arch/arm64/mm/init.c         |  29 +++
 arch/arm64/mm/mmu.c          | 572 ++++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 601 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index c736bba..c362ddf 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -649,6 +649,9 @@ config ARCH_ENABLE_MEMORY_HOTPLUG
 	def_bool y
 	depends on !NUMA

+config ARCH_ENABLE_MEMORY_HOTREMOVE
+	def_bool y
+
 # Common NUMA Features
 config NUMA
 	bool "Numa Memory Allocation and Scheduler Support"
diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
index 5fc5656..cdac3b8 100644
--- a/arch/arm64/configs/defconfig
+++ b/arch/arm64/configs/defconfig
@@ -81,6 +81,7 @@ CONFIG_SCHED_MC=y
 CONFIG_NUMA=y
 CONFIG_PREEMPT=y
 CONFIG_MEMORY_HOTPLUG=y
+CONFIG_MEMORY_HOTREMOVE=y
 CONFIG_KSM=y
 CONFIG_TRANSPARENT_HUGEPAGE=y
 CONFIG_CMA=y
diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index 2b3fa4d..ca11567 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -42,6 +42,10 @@ extern void *fixmap_remap_fdt(phys_addr_t dt_phys);
 extern void mark_linear_text_alias_ro(void);
 #ifdef CONFIG_MEMORY_HOTPLUG
 extern void hotplug_paging(phys_addr_t start, phys_addr_t size);
+#ifdef CONFIG_MEMORY_HOTREMOVE
+extern int remove_pagetable(unsigned long start,
+	unsigned long end, bool linear_map, bool check_split);
+#endif
 #endif

 #endif
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index e96e7d3..406b378 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -808,4 +808,33 @@ int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock)
 	return ret;
 }

+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+	unsigned long va_start = (unsigned long) __va(start);
+	unsigned long va_end = (unsigned long)__va(start + size);
+	struct page *page = pfn_to_page(start_pfn);
+	struct zone *zone;
+	int ret = 0;
+
+	/*
+	 * Check if mem can be removed without splitting
+	 * PUD/PMD mappings.
+	 */
+	ret = remove_pagetable(va_start, va_end, true, true);
+	if (!ret) {
+		zone = page_zone(page);
+		ret = __remove_pages(zone, start_pfn, nr_pages);
+		WARN_ON_ONCE(ret);
+
+		/* Actually remove the mapping */
+		remove_pagetable(va_start, va_end, true, false);
+	}
+
+	return ret;
+}
+
+#endif /* CONFIG_MEMORY_HOTREMOVE */
 #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index d93043d..e6f8c91 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -25,6 +25,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -652,12 +653,532 @@ inline void hotplug_paging(phys_addr_t start, phys_addr_t size)
 	stop_machine(__hotplug_paging, &section, NULL);
 }

-#endif /* CONFIG_MEMORY_HOTPLUG */
+#ifdef CONFIG_MEMORY_HOTREMOVE
+
+static void free_pagetable(struct page *page, int order, bool linear_map)
+{
+	unsigned long magic;
+	unsigned int nr_pages = 1 << order;
+	struct vmem_altmap *altmap = to_vmem_altmap((unsigned long) page);
+
+	if (altmap) {
+		vmem_altmap_free(altmap, nr_pages);
+		return;
+	}
+
+	/* bootmem page has reserved flag */
+	if (PageReserved(page)) {
+		__ClearPageReserved(page);
+
+		magic = (unsigned long)page->lru.next;
+		if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
+			while (nr_pages--)
+				put_page_bootmem(page++);
+		} else {
+			while (nr_pages--)
+				free_reserved_page(page++);
+		}
+	} else {
+		/*
+		 * Only linear_map pagetable allocations (those allocated via
+		 * hotplug) call pgtable_page_ctor; vmemmap pgtable
+		 * allocations don't.
+		 */
+		if (linear_map)
+			pgtable_page_dtor(page);
+
+		free_pages((unsigned long)page_address(page), order);
+	}
+}
+
+static void free_pte_table(unsigned long addr, pmd_t *pmd, bool linear_map)
+{
+	pte_t *pte;
+	struct page *page;
+	int i;
+
+	pte = pte_offset_kernel(pmd, 0L);
+	/* Check if there is no valid entry in the PTE table */
+	for (i = 0; i < PTRS_PER_PTE; i++, pte++) {
+		if (!pte_none(*pte))
+			return;
+	}
+
+	page = pmd_page(*pmd);
+	/*
+	 * This spin lock could be only taken in __pte_alloc_kernel
+	 * in mm/memory.c and nowhere else (for arm64). Not sure if
+	 * the function above can be called concurrently. In doubt,
+	 * I am leaving it here for now, but it probably can be removed
+	 */
+	spin_lock(&init_mm.page_table_lock);
+	pmd_clear(pmd);
+	spin_unlock(&init_mm.page_table_lock);
+
+	/* Make sure addr is aligned with first address of the PMD */
+	addr &= PMD_MASK;
+	/*
+	 * Invalidate TLB walk caches to PTE
+	 * Not sure what is the index of the TLB walk caches.
+	 * i.e., if it is indexed just by addr & PMD_MASK or it can be
+	 * indexed by any address. Flushing everything to stay on the safe
+	 * side.
+	 */
+	flush_tlb_kernel_range(addr, addr + PMD_SIZE);
+
+	free_pagetable(page, 0, linear_map);
+}
+
+static void free_pmd_table(unsigned long addr, pud_t *pud, bool linear_map)
+{
+	pmd_t *pmd;
+	struct page *page;
+	int i;
+
+	pmd = pmd_offset(pud, 0L);
+	/*
+	 * If PMD is folded onto PUD, cleanup was already performed
+	 * up in the call stack. No more work needs to be done.
+	 */
+	if ((pud_t *) pmd == pud)
+		return;
+
+	/* Check if there is no valid entry in the PMD */
+	for (i = 0; i < PTRS_PER_PMD; i++, pmd++) {
+		if (!pmd_none(*pmd))
+			return;
+	}
+
+	page = pud_page(*pud);
+	/*
+	 * This spin lock could be only taken in __pte_alloc_kernel
+	 * in mm/memory.c and nowhere else (for arm64). Not sure if
+	 * the function above can be called concurrently. In doubt,
+	 * I am leaving it here for now, but it probably can be removed
+	 */
+	spin_lock(&init_mm.page_table_lock);
+	pud_clear(pud);
+	spin_unlock(&init_mm.page_table_lock);
+
+	/* Make sure addr is aligned with first address of the PUD */
+	addr &= PUD_MASK;
+	/*
+	 * Invalidate TLB walk caches to PMD
+	 * Not sure what is the index of the TLB walk caches.
+	 * i.e., if it is indexed just by addr & PUD_MASK or it can be
+	 * indexed by any address. Flushing everything to stay on the safe
+	 * side.
+	 */
+	flush_tlb_kernel_range(addr, addr + PUD_SIZE);
+
+	free_pagetable(page, 0, linear_map);
+}
+
+static void free_pud_table(unsigned long addr, pgd_t *pgd, bool linear_map)
+{
+	pud_t *pud;
+	struct page *page;
+	int i;
+
+	pud = pud_offset(pgd, 0L);
+	/*
+	 * If PUD is folded onto PGD, cleanup was already performed
+	 * up in the call stack. No more work needs to be done.
+	 */
+	if ((pgd_t *)pud == pgd)
+		return;
+
+	/* Check if there is no valid entry in the PUD */
+	for (i = 0; i < PTRS_PER_PUD; i++, pud++) {
+		if (!pud_none(*pud))
+			return;
+	}
+
+	page = pgd_page(*pgd);
+
+	/*
+	 * This spin lock could be only
+	 * taken in __pte_alloc_kernel in
+	 * mm/memory.c and nowhere else
+	 * (for arm64). Not sure if the
+	 * function above can be called
+	 * concurrently. In doubt,
+	 * I am leaving it here for now,
+	 * but it probably can be removed.
+	 */
+	spin_lock(&init_mm.page_table_lock);
+	pgd_clear(pgd);
+	spin_unlock(&init_mm.page_table_lock);
+
+	/* Make sure addr is aligned with first address of the PGD */
+	addr &= PGDIR_MASK;
+	/*
+	 * Invalidate TLB walk caches to PUD
+	 *
+	 * Not sure what is the index of the TLB walk caches.
+	 * i.e., if it is indexed just by addr & PGDIR_MASK or it can be
+	 * indexed by any address. Flushing everything to stay on the safe
+	 * side
+	 */
+	flush_tlb_kernel_range(addr, addr + PGDIR_SIZE);
+
+	free_pagetable(page, 0, linear_map);
+}
+
+static void mark_n_free_pte_vmemmap(pte_t *pte,
+		unsigned long addr, unsigned long size)
+{
+	unsigned long page_offset = (addr & (~PAGE_MASK));
+	phys_addr_t page_start = pte_val(*pte) & PHYS_MASK & (s32)PAGE_MASK;
+	phys_addr_t pa_start = page_start + page_offset;
+
+	memblock_mark_unused_vmemmap(pa_start, size);
+
+	if (memblock_is_vmemmap_unused_range(&memblock.memory,
+			page_start, page_start + PAGE_SIZE)) {
+
+		free_pagetable(pte_page(*pte), 0, false);
+		memblock_clear_unused_vmemmap(page_start, PAGE_SIZE);
+
+		/*
+		 * This spin lock could be only
+		 * taken in __pte_alloc_kernel in
+		 * mm/memory.c and nowhere else
+		 * (for arm64). Not sure if the
+		 * function above can be called
+		 * concurrently. In doubt,
+		 * I am leaving it here for now,
+		 * but it probably can be removed.
+		 */
+		spin_lock(&init_mm.page_table_lock);
+		pte_clear(&init_mm, addr, pte);
+		spin_unlock(&init_mm.page_table_lock);
+
+		flush_tlb_kernel_range(addr & PAGE_MASK,
+				(addr + PAGE_SIZE) & PAGE_MASK);
+	}
+}
+
+static void mark_n_free_pmd_vmemmap(pmd_t *pmd,
+		unsigned long addr, unsigned long size)
+{
+	unsigned long sec_offset = (addr & (~PMD_MASK));
+	phys_addr_t page_start = pmd_page_paddr(*pmd);
+	phys_addr_t pa_start = page_start + sec_offset;
+
+	memblock_mark_unused_vmemmap(pa_start, size);
+
+	if (memblock_is_vmemmap_unused_range(&memblock.memory,
+			page_start, page_start + PMD_SIZE)) {
+
+		free_pagetable(pmd_page(*pmd),
+			get_order(PMD_SIZE), false);
+
+		memblock_clear_unused_vmemmap(page_start, PMD_SIZE);
+		/*
+		 * This spin lock could be only
+		 * taken in __pte_alloc_kernel in
+		 * mm/memory.c and nowhere else
+		 * (for arm64). Not sure if the
+		 * function above can be called
+		 * concurrently. In doubt,
+		 * I am leaving it here for now,
+		 * but it probably can be removed.
+		 */
+		spin_lock(&init_mm.page_table_lock);
+		pmd_clear(pmd);
+		spin_unlock(&init_mm.page_table_lock);
+
+		flush_tlb_kernel_range(addr & PMD_MASK,
+				(addr + PMD_SIZE) & PMD_MASK);
+	}
+}
+
+static void rm_pte_mapping(pte_t *pte, unsigned long addr,
+		unsigned long next, bool linear_map)
+{
+	/*
+	 * Linear map pages were already freed when offlining.
+	 * We only need to free vmemmap pages.
+	 */
+	if (!linear_map)
+		free_pagetable(pte_page(*pte), 0, false);
+
+	/*
+	 * This spin lock could be only
+	 * taken in __pte_alloc_kernel in
+	 * mm/memory.c and nowhere else
+	 * (for arm64). Not sure if the
+	 * function above can be called
+	 * concurrently. In doubt,
+	 * I am leaving it here for now,
+	 * but it probably can be removed.
+	 */
+	spin_lock(&init_mm.page_table_lock);
+	pte_clear(&init_mm, addr, pte);
+	spin_unlock(&init_mm.page_table_lock);
+
+	flush_tlb_kernel_range(addr, next);
+}
+
+static void rm_pmd_mapping(pmd_t *pmd, unsigned long addr,
+		unsigned long next, bool linear_map)
+{
+	/* Freeing vmemmap pages */
+	if (!linear_map)
+		free_pagetable(pmd_page(*pmd),
+			get_order(PMD_SIZE), false);
+	/*
+	 * This spin lock could be only
+	 * taken in __pte_alloc_kernel in
+	 * mm/memory.c and nowhere else
+	 * (for arm64). Not sure if the
+	 * function above can be called
+	 * concurrently. In doubt,
+	 * I am leaving it here for now,
+	 * but it probably can be removed.
+	 */
+	spin_lock(&init_mm.page_table_lock);
+	pmd_clear(pmd);
+	spin_unlock(&init_mm.page_table_lock);
+
+	flush_tlb_kernel_range(addr, next);
+}
+
+static void rm_pud_mapping(pud_t *pud, unsigned long addr,
+		unsigned long next, bool linear_map)
+{
+	/* We never map vmemmap space on PUDs */
+	BUG_ON(!linear_map);
+	/*
+	 * This spin lock could be only
+	 * taken in __pte_alloc_kernel in
+	 * mm/memory.c and nowhere else
+	 * (for arm64). Not sure if the
+	 * function above can be called
+	 * concurrently. In doubt,
+	 * I am leaving it here for now,
+	 * but it probably can be removed.
+	 */
+	spin_lock(&init_mm.page_table_lock);
+	pud_clear(pud);
+	spin_unlock(&init_mm.page_table_lock);
+
+	flush_tlb_kernel_range(addr, next);
+}

 /*
- * Check whether a kernel address is valid (derived from arch/x86/).
+ * Used in hot-remove, cleans up PTE entries from addr to end from the pointed
+ * pte table. If linear_map is true, this is called to remove the tables
+ * for the memory being hot-removed. If false, this is called to clean-up the
+ * tables (and the memory) that were used for the vmemmap of memory being
+ * hot-removed.
  */
-int kern_addr_valid(unsigned long addr)
+static void remove_pte_table(pte_t *pte, unsigned long addr,
+		unsigned long end, bool linear_map)
+{
+	unsigned long next;
+
+
+	for (; addr < end; addr = next, pte++) {
+		next = (addr + PAGE_SIZE) & PAGE_MASK;
+		if (next > end)
+			next = end;
+
+		if (!pte_present(*pte))
+			continue;
+
+		if (PAGE_ALIGNED(addr) && PAGE_ALIGNED(next)) {
+			rm_pte_mapping(pte, addr, next, linear_map);
+		} else {
+			unsigned long sz = next - addr;
+			/*
+			 * If we are here, we are freeing vmemmap pages since
+			 * linear_map mapped memory ranges to be freed
+			 * are aligned.
+			 *
+			 * If we are not removing the whole page, it means
+			 * other page structs in this page are being used and
+			 * we cannot remove them. We use memblock to mark these
+			 * unused pieces and free the page only once it is fully
+			 * unused.
+			 */
+			mark_n_free_pte_vmemmap(pte, addr, sz);
+		}
+	}
+}
+
+/**
+ * Used in hot-remove, cleans up PMD entries from addr to end from the pointed
+ * pmd table.
+ *
+ * If linear_map is true, this is called to remove the tables for the
+ * memory being hot-removed. If false, this is called to clean-up the tables
+ * (and the memory) that were used for the vmemmap of memory being hot-removed.
+ *
+ * If check_split is true, no change is done on the table: the call only
+ * checks whether removing the entries would cause a section mapped PMD
+ * to be split. In such a case, -EBUSY is returned by the method.
+ */
+static int remove_pmd_table(pmd_t *pmd, unsigned long addr,
+		unsigned long end, bool linear_map, bool check_split)
+{
+	int err = 0;
+	unsigned long next;
+	pte_t *pte;
+
+	for (; !err && addr < end; addr = next, pmd++) {
+		next = pmd_addr_end(addr, end);
+
+		if (!pmd_present(*pmd))
+			continue;
+
+		if (pmd_sect(*pmd)) {
+			if (IS_ALIGNED(addr, PMD_SIZE) &&
+			    IS_ALIGNED(next, PMD_SIZE)) {
+
+				if (!check_split)
+					rm_pmd_mapping(pmd, addr, next,
+						linear_map);
+
+			} else { /* not aligned to PMD size */
+
+				/*
+				 * This should only occur for vmemmap.
+				 * If it does happen for linear map,
+				 * we do not support splitting PMDs,
+				 * so we return error
+				 */
+				if (linear_map) {
+					pr_warn("Hot-remove failed. Cannot split PMD mapping\n");
+					err = -EBUSY;
+				} else if (!check_split) {
+					unsigned long sz = next - addr;
+					/* Freeing vmemmap pages. */
+					mark_n_free_pmd_vmemmap(pmd, addr, sz);
+				}
+			}
+		} else { /* ! pmd_sect() */
+
+			BUG_ON(!pmd_table(*pmd));
+			if (!check_split) {
+				pte = pte_offset_map(pmd, addr);
+				remove_pte_table(pte, addr, next, linear_map);
+				free_pte_table(addr, pmd, linear_map);
+			}
+		}
+	}
+
+	return err;
+}
+
+/**
+ * Used in hot-remove, cleans up PUD entries from addr to end from the pointed
+ * pud table.
+ *
+ * If linear_map is true, this is called to remove the tables for the
+ * memory being hot-removed. If false, this is called to clean-up the tables
+ * (and the memory) that were used for the vmemmap of memory being hot-removed.
+ *
+ * If check_split is true, no change is done on the table: the call only
+ * checks whether removing the entries would cause a section mapped PUD
+ * to be split. In such a case, -EBUSY is returned by the method.
+ */
+static int remove_pud_table(pud_t *pud, unsigned long addr,
+		unsigned long end, bool linear_map, bool check_split)
+{
+	int err = 0;
+	unsigned long next;
+	pmd_t *pmd;
+
+	for (; !err && addr < end; addr = next, pud++) {
+		next = pud_addr_end(addr, end);
+		if (!pud_present(*pud))
+			continue;
+
+		/*
+		 * If we are using 4K granules, check if we are using
+		 * 1GB section mapping.
+		 */
+		if (pud_sect(*pud)) {
+			if (IS_ALIGNED(addr, PUD_SIZE) &&
+			    IS_ALIGNED(next, PUD_SIZE)) {
+
+				if (!check_split)
+					rm_pud_mapping(pud, addr, next,
+						linear_map);
+
+			} else { /* not aligned to PUD size */
+				/*
+				 * As above, we never map vmemmap
+				 * space on PUDs
+				 */
+				BUG_ON(!linear_map);
+				pr_warn("Hot-remove failed. Cannot split PUD mapping\n");
+				err = -EBUSY;
+			}
+		} else { /* !pud_sect() */
+			BUG_ON(!pud_table(*pud));
+
+			pmd = pmd_offset(pud, addr);
+			err = remove_pmd_table(pmd, addr, next,
+					linear_map, check_split);
+			if (!check_split)
+				free_pmd_table(addr, pud, linear_map);
+		}
+	}
+
+	return err;
+}
+
+/**
+ * Used in hot-remove, cleans up kernel page tables from addr to end.
+ *
+ * If linear_map is true, this is called to remove the tables for the
+ * memory being hot-removed. If false, this is called to clean-up the tables
+ * (and the memory) that were used for the vmemmap of memory being hot-removed.
+ *
+ * If check_split is true, no change is done on the table: the call only
+ * checks whether removing the entries would cause a section mapped PUD
+ * to be split. In such a case, -EBUSY is returned by the method.
+ */
+int remove_pagetable(unsigned long start, unsigned long end,
+		bool linear_map, bool check_split)
+{
+	int err = 0;
+	unsigned long next;
+	unsigned long addr;
+	pgd_t *pgd;
+	pud_t *pud;
+
+	for (addr = start; addr < end; addr = next) {
+		next = pgd_addr_end(addr, end);
+
+		pgd = pgd_offset_k(addr);
+		if (pgd_none(*pgd))
+			continue;
+
+		pud = pud_offset(pgd, addr);
+		err = remove_pud_table(pud, addr, next,
+				linear_map, check_split);
+		if (err)
+			break;
+
+		if (!check_split)
+			free_pud_table(addr, pgd, linear_map);
+	}
+
+	if (!check_split)
+		flush_tlb_all();
+
+	return err;
+}
+
+
+#endif /* CONFIG_MEMORY_HOTREMOVE */
+#endif /* CONFIG_MEMORY_HOTPLUG */
+
+static unsigned long walk_kern_pgtable(unsigned long addr)
 {
 	pgd_t *pgd;
 	pud_t *pud;
@@ -676,26 +1197,51 @@ int kern_addr_valid(unsigned long addr)
 		return 0;

 	if (pud_sect(*pud))
-		return pfn_valid(pud_pfn(*pud));
+		return pud_pfn(*pud);

 	pmd = pmd_offset(pud, addr);
 	if (pmd_none(*pmd))
 		return 0;

 	if (pmd_sect(*pmd))
-		return pfn_valid(pmd_pfn(*pmd));
+		return pmd_pfn(*pmd);

 	pte = pte_offset_kernel(pmd, addr);
 	if (pte_none(*pte))
 		return 0;

-	return pfn_valid(pte_pfn(*pte));
+	return pte_pfn(*pte);
+}
+
+/*
+ * Check whether a kernel address is valid (derived from arch/x86/).
+ */
+int kern_addr_valid(unsigned long addr)
+{
+	return pfn_valid(walk_kern_pgtable(addr));
 }
+
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 #if !ARM64_SWAPPER_USES_SECTION_MAPS
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
 {
-	return vmemmap_populate_basepages(start, end, node);
+	int err;
+
+	err = vmemmap_populate_basepages(start, end, node);
+#ifdef CONFIG_MEMORY_HOTREMOVE
+	/*
+	 * A bit inefficient (restarting from PGD every time) but saves
+	 * from lots of duplicated code. Also, this is only called
+	 * at hot-add time, which should not be a frequent operation
+	 */
+	for (; start < end; start += PAGE_SIZE) {
+		unsigned long pfn = walk_kern_pgtable(start);
+		phys_addr_t pa_start = ((phys_addr_t)pfn) << PAGE_SHIFT;
+
+		memblock_clear_unused_vmemmap(pa_start, PAGE_SIZE);
+	}
+#endif
+	return err;
 }
 #else	/* !ARM64_SWAPPER_USES_SECTION_MAPS */
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
@@ -726,8 +1272,15 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
 				return -ENOMEM;

 			set_pmd(pmd, __pmd(__pa(p) | PROT_SECT_NORMAL));
-		} else
+		} else {
+			unsigned long sec_offset = (addr & (~PMD_MASK));
+			phys_addr_t pa_start =
+				pmd_page_paddr(*pmd) + sec_offset;
 			vmemmap_verify((pte_t *)pmd, node, addr, next);
+#ifdef CONFIG_MEMORY_HOTREMOVE
+			memblock_clear_unused_vmemmap(pa_start, next - addr);
+#endif
+		}
 	} while (addr = next, addr != end);

 	return 0;
@@ -735,6 +1288,9 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
 #endif	/* CONFIG_ARM64_64K_PAGES */
 void vmemmap_free(unsigned long start, unsigned long end)
 {
+#ifdef CONFIG_MEMORY_HOTREMOVE
+	remove_pagetable(start, end, false, false);
+#endif
 }
 #endif	/* CONFIG_SPARSEMEM_VMEMMAP */

-- 
2.7.4
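
Note for readers: the two-pass scheme used by arch_remove_memory() (a
check_split pass that changes nothing, then the real removal) may be easier
to follow in a small userspace model. The sketch below is not kernel code and
is not part of the patch. MODEL_BLOCK_SIZE, model_is_aligned(),
model_check_split() and model_remove_range() are hypothetical names
introduced only for illustration; a bool array stands in for section-mapped
entries, and 2 MiB is assumed as the block size (a PMD section with 4K
pages).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical block size: 2 MiB, like a PMD section with 4K pages. */
#define MODEL_BLOCK_SIZE (UINT64_C(2) << 20)

static bool model_is_aligned(uint64_t x, uint64_t a)
{
	return (x & (a - 1)) == 0;
}

/*
 * Pass 1: walk [start, end) block by block without modifying anything.
 * Returns 0 if every touched block is covered entirely, -EBUSY if some
 * block would have to be split.
 */
static int model_check_split(uint64_t start, uint64_t end, uint64_t bsz)
{
	uint64_t addr, next;

	/* bsz must be a non-zero power of two. */
	assert(bsz != 0 && (bsz & (bsz - 1)) == 0);

	for (addr = start; addr < end; addr = next) {
		/* End of the block containing addr, clamped to end. */
		next = (addr + bsz) & ~(bsz - 1);
		if (next > end)
			next = end;

		if (!model_is_aligned(addr, bsz) ||
		    !model_is_aligned(next, bsz))
			return -EBUSY;
	}

	return 0;
}

/*
 * Pass 2 runs only if pass 1 succeeded, so a rejected request leaves the
 * mapped[] array untouched. mapped[i] models block i of the address space.
 */
static int model_remove_range(bool *mapped, size_t nblocks,
			      uint64_t start, uint64_t end)
{
	uint64_t addr;
	int err;

	if (start > end || end > (uint64_t)nblocks * MODEL_BLOCK_SIZE)
		return -EINVAL;

	err = model_check_split(start, end, MODEL_BLOCK_SIZE);
	if (err)
		return err;

	for (addr = start; addr < end; addr += MODEL_BLOCK_SIZE)
		mapped[addr / MODEL_BLOCK_SIZE] = false;

	return 0;
}
```

In this model, removing a 4K range from a 2 MiB block is rejected with
-EBUSY before any entry is touched, which is the property the check_split
pass is meant to give the linear map. The model ignores address wrap-around
and page-table levels, both of which the real traversal macros handle.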