Hi,
this is another step to make memory hotplug more usable. The primary
goal of this patchset is to reduce the memory overhead of hot added
memory (at least for the SPARSE_VMEMMAP memory model). Currently we use
kmalloc to populate the memmap (struct page array), which has two main
drawbacks: a) it consumes additional memory until the hotadded memory
itself is onlined and b) the memmap might end up on a different NUMA
node, which is especially true for movable_node configurations.
a) is a problem especially for memory hotplug based memory "ballooning"
solutions where the delay between the physical memory hotplug and the
onlining can lead to OOM, which led to the introduction of hacks like
auto onlining (see 31bc3858ea3e ("memory-hotplug: add automatic onlining
policy for the newly added memory")).
b) can have performance drawbacks.
One way to mitigate both issues is to simply allocate the memmap array
(which is the largest memory footprint of the physical memory hotplug)
from the hotadded memory itself. The VMEMMAP memory model allows us to
map any pfn range so the memory doesn't need to be online to be usable
for the array. See patch 3 for more details. In short, I am reusing the
existing vmem_altmap which achieves the same thing for nvdimm device
memory.
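To give a rough picture of the idea, the sketch below models how such an
altmap hands out pfns from the beginning of the hotadded range so that
the memmap ends up backed by the new memory itself. It is a standalone
toy model, not kernel code; the struct, names and numbers are
illustrative assumptions only.

/* Standalone toy model of the altmap idea (not kernel code). */
#include <stdio.h>

/* simplified stand-in for struct vmem_altmap */
struct toy_altmap {
	unsigned long base_pfn;	/* first pfn of the hotadded range */
	unsigned long free;	/* pfns set aside for memmap storage */
	unsigned long alloc;	/* pfns already handed out */
};

/* hand out nr_pfns from the beginning of the hotadded range */
static unsigned long toy_altmap_alloc(struct toy_altmap *a,
				      unsigned long nr_pfns)
{
	unsigned long pfn = a->base_pfn + a->alloc;

	a->alloc += nr_pfns;
	return pfn;
}

int main(void)
{
	/* pretend we hotadd a 128MB section starting at pfn 0x100000 */
	struct toy_altmap a = { .base_pfn = 0x100000, .free = 512 };
	/* its memmap needs ~2MB, i.e. 512 4kB pfns, carved from the section */
	unsigned long first = toy_altmap_alloc(&a, 512);

	printf("memmap backed by pfns %#lx-%#lx of the new section\n",
	       first, first + 511);
	return 0;
}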
I am sending this as an RFC because it has seen only very limited
testing and I am mostly interested in opinions on the chosen approach.
I had to touch some arch code and I have no idea whether my changes make
sense there (especially ppc). Therefore I would highly appreciate it if
arch maintainers could check patch 2.
Patches 4 and 5 should be straightforward cleanups.
There is also one potential drawback, though. If somebody uses memory
hotplug for 1G (gigantic) hugetlb pages then this scheme will obviously
not work for them because each memory section will contain a 2MB
reserved area. I am not really sure anybody does that, or how reliably
it can actually work. Nevertheless, I _believe_ that onlining more
memory into virtual machines is a much more common use case. Anyway, if
there ever is a strong demand for such a use case we have basically 3
options: a) enlarge memory sections, b) enhance the altmap allocation
strategy and reuse low memory sections to host memmaps of other sections
on the same NUMA node, or c) make the memmap allocation strategy
configurable to fall back to the current allocation.
Are there any other concerns, ideas, comments?
The patches are based on the current mmotm tree (mmotm-2017-07-12-15-11)
Diffstat says
arch/arm64/mm/mmu.c | 9 ++++--
arch/ia64/mm/discontig.c | 4 ++-
arch/powerpc/mm/init_64.c | 34 ++++++++++++++++------
arch/s390/mm/vmem.c | 7 +++--
arch/sparc/mm/init_64.c | 6 ++--
arch/x86/mm/init_64.c | 13 +++++++--
include/linux/memory_hotplug.h | 7 +++--
include/linux/memremap.h | 34 +++++++++++++++-------
include/linux/mm.h | 25 ++++++++++++++--
include/linux/page-flags.h | 18 ++++++++++++
kernel/memremap.c | 6 ----
mm/compaction.c | 3 ++
mm/memory_hotplug.c | 66 +++++++++++++++++++-----------------------
mm/page_alloc.c | 25 ++++++++++++++--
mm/page_isolation.c | 11 ++++++-
mm/sparse-vmemmap.c | 13 +++++++--
mm/sparse.c | 36 ++++++++++++++++-------
17 files changed, 223 insertions(+), 94 deletions(-)
Shortlog
Michal Hocko (5):
mm, memory_hotplug: cleanup memory offline path
mm, arch: unify vmemmap_populate altmap handling
mm, memory_hotplug: allocate memmap from the added memory range for sparse-vmemmap
mm, sparse: complain about implicit altmap usage in vmemmap_populate
mm, sparse: rename kmalloc_section_memmap, __kfree_section_memmap
From: Michal Hocko <[email protected]>
check_pages_isolated_cb currently accounts the whole pfn range as being
offlined if test_pages_isolated succeeds on the range. This is based on
the assumption that all pages in the range are freed, which currently
holds in most cases but won't with later changes. I haven't double
checked, but if the range contains invalid pfns we could theoretically
overaccount and underflow the zone's managed pages.
Move the offlined pages counting to offline_isolated_pages_cb and
rely on __offline_isolated_pages to return the correct value.
check_pages_isolated_cb will still do its primary job and check the pfn
range.
While we are at it, remove check_pages_isolated and offline_isolated_pages
and use walk_system_ram_range directly, as done in online_pages.
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/memory_hotplug.h | 2 +-
mm/memory_hotplug.c | 43 ++++++++++--------------------------------
mm/page_alloc.c | 11 +++++++++--
3 files changed, 20 insertions(+), 36 deletions(-)
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index c8a5056a5ae0..8e8738a6e76d 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -101,7 +101,7 @@ extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
extern int online_pages(unsigned long, unsigned long, int);
extern int test_pages_in_a_zone(unsigned long start_pfn, unsigned long end_pfn,
unsigned long *valid_start, unsigned long *valid_end);
-extern void __offline_isolated_pages(unsigned long, unsigned long);
+extern unsigned long __offline_isolated_pages(unsigned long, unsigned long);
typedef void (*online_page_callback_t)(struct page *page);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d620d0427b6b..260139f2581c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1474,17 +1474,12 @@ static int
offline_isolated_pages_cb(unsigned long start, unsigned long nr_pages,
void *data)
{
- __offline_isolated_pages(start, start + nr_pages);
+ unsigned long offlined_pages;
+ offlined_pages = __offline_isolated_pages(start, start + nr_pages);
+ *(unsigned long *)data += offlined_pages;
return 0;
}
-static void
-offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
-{
- walk_system_ram_range(start_pfn, end_pfn - start_pfn, NULL,
- offline_isolated_pages_cb);
-}
-
/*
* Check all pages in range, recoreded as memory resource, are isolated.
*/
@@ -1492,26 +1487,7 @@ static int
check_pages_isolated_cb(unsigned long start_pfn, unsigned long nr_pages,
void *data)
{
- int ret;
- long offlined = *(long *)data;
- ret = test_pages_isolated(start_pfn, start_pfn + nr_pages, true);
- offlined = nr_pages;
- if (!ret)
- *(long *)data += offlined;
- return ret;
-}
-
-static long
-check_pages_isolated(unsigned long start_pfn, unsigned long end_pfn)
-{
- long offlined = 0;
- int ret;
-
- ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn, &offlined,
- check_pages_isolated_cb);
- if (ret < 0)
- offlined = (long)ret;
- return offlined;
+ return test_pages_isolated(start_pfn, start_pfn + nr_pages, true);
}
static int __init cmdline_parse_movable_node(char *p)
@@ -1620,7 +1596,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
unsigned long end_pfn, unsigned long timeout)
{
unsigned long pfn, nr_pages, expire;
- long offlined_pages;
+ unsigned long offlined_pages = 0;
int ret, drain, retry_max, node;
unsigned long flags;
unsigned long valid_start, valid_end;
@@ -1703,15 +1679,16 @@ static int __ref __offline_pages(unsigned long start_pfn,
if (ret)
goto failed_removal;
/* check again */
- offlined_pages = check_pages_isolated(start_pfn, end_pfn);
- if (offlined_pages < 0) {
+ if (walk_system_ram_range(start_pfn, end_pfn - start_pfn, NULL,
+ check_pages_isolated_cb)) {
ret = -EBUSY;
goto failed_removal;
}
- pr_info("Offlined Pages %ld\n", offlined_pages);
/* Ok, all of our target is isolated.
We cannot do rollback at this point. */
- offline_isolated_pages(start_pfn, end_pfn);
+ walk_system_ram_range(start_pfn, end_pfn - start_pfn, &offlined_pages,
+ offline_isolated_pages_cb);
+ pr_info("Offlined Pages %ld\n", offlined_pages);
/* reset pagetype flags and makes migrate type to be MOVABLE */
undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
/* removal success */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 80e4adb4c360..63a59864a21d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7754,7 +7754,7 @@ void zone_pcp_reset(struct zone *zone)
* All pages in the range must be in a single zone and isolated
* before calling this.
*/
-void
+unsigned long
__offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
{
struct page *page;
@@ -7762,12 +7762,15 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
unsigned int order, i;
unsigned long pfn;
unsigned long flags;
+ unsigned long offlined_pages = 0;
+
/* find the first valid pfn */
for (pfn = start_pfn; pfn < end_pfn; pfn++)
if (pfn_valid(pfn))
break;
if (pfn == end_pfn)
- return;
+ return offlined_pages;
+
offline_mem_sections(pfn, end_pfn);
zone = page_zone(pfn_to_page(pfn));
spin_lock_irqsave(&zone->lock, flags);
@@ -7785,12 +7788,14 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
if (unlikely(!PageBuddy(page) && PageHWPoison(page))) {
pfn++;
SetPageReserved(page);
+ offlined_pages++;
continue;
}
BUG_ON(page_count(page));
BUG_ON(!PageBuddy(page));
order = page_order(page);
+ offlined_pages += 1 << order;
#ifdef CONFIG_DEBUG_VM
pr_info("remove from free list %lx %d %lx\n",
pfn, 1 << order, end_pfn);
@@ -7803,6 +7808,8 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
pfn += (1 << order);
}
spin_unlock_irqrestore(&zone->lock, flags);
+
+ return offlined_pages;
}
#endif
--
2.13.2
From: Michal Hocko <[email protected]>
vmem_altmap allows vmemmap_populate to allocate the memmap (struct page
array) from an alternative allocator rather than bootmem or kmalloc.
Only x86 currently supports altmap handling, most likely because only
the nvdimm code uses this mechanism and it depends on ZONE_DEVICE, which
is present only on x86_64. This will change in follow-up patches so we
would like other architectures to support it as well.
Provide a generic vmemmap_populate implementation which simply resolves
the altmap and then calls into the arch specific __vmemmap_populate.
Architectures then only need to use __vmemmap_alloc_block_buf to
allocate the memmap. vmemmap_free then needs to call vmem_altmap_free
if there is any altmap associated with the address.
This patch shouldn't introduce any functional changes because
to_vmem_altmap always returns NULL on !x86_64.
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Michal Hocko <[email protected]>
---
arch/arm64/mm/mmu.c | 9 ++++++---
arch/ia64/mm/discontig.c | 4 +++-
arch/powerpc/mm/init_64.c | 29 ++++++++++++++++++++---------
arch/s390/mm/vmem.c | 7 ++++---
arch/sparc/mm/init_64.c | 6 +++---
arch/x86/mm/init_64.c | 4 ++--
include/linux/memremap.h | 13 ++-----------
include/linux/mm.h | 19 ++++++++++++++++++-
mm/sparse-vmemmap.c | 2 +-
9 files changed, 59 insertions(+), 34 deletions(-)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 0c429ec6fde8..5de1161e7a1b 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -649,12 +649,15 @@ int kern_addr_valid(unsigned long addr)
}
#ifdef CONFIG_SPARSEMEM_VMEMMAP
#if !ARM64_SWAPPER_USES_SECTION_MAPS
-int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
+int __meminit __vmemmap_populate(unsigned long start, unsigned long end, int node,
+ struct vmem_altmap *altmap)
{
+ WARN(altmap, "altmap unsupported\n");
return vmemmap_populate_basepages(start, end, node);
}
#else /* !ARM64_SWAPPER_USES_SECTION_MAPS */
-int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
+int __meminit __vmemmap_populate(unsigned long start, unsigned long end, int node,
+ struct vmem_altmap *altmap)
{
unsigned long addr = start;
unsigned long next;
@@ -677,7 +680,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
if (pmd_none(*pmd)) {
void *p = NULL;
- p = vmemmap_alloc_block_buf(PMD_SIZE, node);
+ p = __vmemmap_alloc_block_buf(PMD_SIZE, node, altmap);
if (!p)
return -ENOMEM;
diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
index 878626805369..2a939e877ced 100644
--- a/arch/ia64/mm/discontig.c
+++ b/arch/ia64/mm/discontig.c
@@ -753,8 +753,10 @@ void arch_refresh_nodedata(int update_node, pg_data_t *update_pgdat)
#endif
#ifdef CONFIG_SPARSEMEM_VMEMMAP
-int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
+int __meminit __vmemmap_populate(unsigned long start, unsigned long end, int node,
+ struct vmem_altmap *altmap)
{
+ WARN(altmap, "altmap unsupported\n");
return vmemmap_populate_basepages(start, end, node);
}
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index ec84b31c6c86..5ea5e870a589 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -44,6 +44,7 @@
#include <linux/slab.h>
#include <linux/of_fdt.h>
#include <linux/libfdt.h>
+#include <linux/memremap.h>
#include <asm/pgalloc.h>
#include <asm/page.h>
@@ -115,7 +116,8 @@ static struct vmemmap_backing *next;
static int num_left;
static int num_freed;
-static __meminit struct vmemmap_backing * vmemmap_list_alloc(int node)
+static __meminit struct vmemmap_backing * vmemmap_list_alloc(int node,
+ struct vmem_altmap *altmap)
{
struct vmemmap_backing *vmem_back;
/* get from freed entries first */
@@ -129,7 +131,7 @@ static __meminit struct vmemmap_backing * vmemmap_list_alloc(int node)
/* allocate a page when required and hand out chunks */
if (!num_left) {
- next = vmemmap_alloc_block(PAGE_SIZE, node);
+ next = __vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
if (unlikely(!next)) {
WARN_ON(1);
return NULL;
@@ -144,11 +146,12 @@ static __meminit struct vmemmap_backing * vmemmap_list_alloc(int node)
static __meminit void vmemmap_list_populate(unsigned long phys,
unsigned long start,
- int node)
+ int node,
+ struct vmem_altmap *altmap)
{
struct vmemmap_backing *vmem_back;
- vmem_back = vmemmap_list_alloc(node);
+ vmem_back = vmemmap_list_alloc(node, altmap);
if (unlikely(!vmem_back)) {
WARN_ON(1);
return;
@@ -161,14 +164,15 @@ static __meminit void vmemmap_list_populate(unsigned long phys,
vmemmap_list = vmem_back;
}
-int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
+int __meminit __vmemmap_populate(unsigned long start, unsigned long end, int node,
+ struct vmem_altmap *altmap)
{
unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
/* Align to the page size of the linear mapping. */
start = _ALIGN_DOWN(start, page_size);
- pr_debug("vmemmap_populate %lx..%lx, node %d\n", start, end, node);
+ pr_debug("__vmemmap_populate %lx..%lx, node %d\n", start, end, node);
for (; start < end; start += page_size) {
void *p;
@@ -177,11 +181,11 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
if (vmemmap_populated(start, page_size))
continue;
- p = vmemmap_alloc_block(page_size, node);
+ p = __vmemmap_alloc_block_buf(page_size, node, altmap);
if (!p)
return -ENOMEM;
- vmemmap_list_populate(__pa(p), start, node);
+ vmemmap_list_populate(__pa(p), start, node, altmap);
pr_debug(" * %016lx..%016lx allocated at %p\n",
start, start + page_size, p);
@@ -189,7 +193,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
rc = vmemmap_create_mapping(start, page_size, __pa(p));
if (rc < 0) {
pr_warning(
- "vmemmap_populate: Unable to create vmemmap mapping: %d\n",
+ "__vmemmap_populate: Unable to create vmemmap mapping: %d\n",
rc);
return -EFAULT;
}
@@ -253,6 +257,12 @@ void __ref vmemmap_free(unsigned long start, unsigned long end)
addr = vmemmap_list_free(start);
if (addr) {
struct page *page = pfn_to_page(addr >> PAGE_SHIFT);
+ struct vmem_altmap *altmap = to_vmem_altmap((unsigned long) page);
+
+ if (altmap) {
+ vmem_altmap_free(altmap, page_size >> PAGE_SHIFT);
+ goto unmap;
+ }
if (PageReserved(page)) {
/* allocated from bootmem */
@@ -272,6 +282,7 @@ void __ref vmemmap_free(unsigned long start, unsigned long end)
free_pages((unsigned long)(__va(addr)),
get_order(page_size));
+unmap:
vmemmap_remove_mapping(start, page_size);
}
}
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index c33c94b4be60..07120bc137a1 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -208,7 +208,8 @@ static void vmem_remove_range(unsigned long start, unsigned long size)
/*
* Add a backed mem_map array to the virtual mem_map array.
*/
-int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
+int __meminit __vmemmap_populate(unsigned long start, unsigned long end, int node,
+ struct vmem_altmap *altmap)
{
unsigned long pgt_prot, sgt_prot;
unsigned long address = start;
@@ -247,12 +248,12 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
* use large frames even if they are only partially
* used.
* Otherwise we would have also page tables since
- * vmemmap_populate gets called for each section
+ * __vmemmap_populate gets called for each section
* separately. */
if (MACHINE_HAS_EDAT1) {
void *new_page;
- new_page = vmemmap_alloc_block(PMD_SIZE, node);
+ new_page = __vmemmap_alloc_block_buf(PMD_SIZE, node, altmap);
if (!new_page)
goto out;
pmd_val(*pm_dir) = __pa(new_page) | sgt_prot;
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 3c40ebd50f92..4a77dfa85468 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2544,8 +2544,8 @@ unsigned long _PAGE_CACHE __read_mostly;
EXPORT_SYMBOL(_PAGE_CACHE);
#ifdef CONFIG_SPARSEMEM_VMEMMAP
-int __meminit vmemmap_populate(unsigned long vstart, unsigned long vend,
- int node)
+int __meminit __vmemmap_populate(unsigned long vstart, unsigned long vend,
+ int node, struct vmem_altmap *altmap)
{
unsigned long pte_base;
@@ -2587,7 +2587,7 @@ int __meminit vmemmap_populate(unsigned long vstart, unsigned long vend,
pte = pmd_val(*pmd);
if (!(pte & _PAGE_VALID)) {
- void *block = vmemmap_alloc_block(PMD_SIZE, node);
+ void *block = __vmemmap_alloc_block_buf(PMD_SIZE, node, altmap);
if (!block)
return -ENOMEM;
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 136422d7d539..cfe01150f24f 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1398,9 +1398,9 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
return 0;
}
-int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
+int __meminit __vmemmap_populate(unsigned long start, unsigned long end, int node,
+ struct vmem_altmap *altmap)
{
- struct vmem_altmap *altmap = to_vmem_altmap(start);
int err;
if (boot_cpu_has(X86_FEATURE_PSE))
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 93416196ba64..6f8f65d8ebdd 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -8,12 +8,12 @@ struct resource;
struct device;
/**
- * struct vmem_altmap - pre-allocated storage for vmemmap_populate
+ * struct vmem_altmap - pre-allocated storage for __vmemmap_populate
* @base_pfn: base of the entire dev_pagemap mapping
* @reserve: pages mapped, but reserved for driver use (relative to @base)
* @free: free pages set aside in the mapping for memmap storage
* @align: pages reserved to meet allocation alignments
- * @alloc: track pages consumed, private to vmemmap_populate()
+ * @alloc: track pages consumed, private to __vmemmap_populate()
*/
struct vmem_altmap {
const unsigned long base_pfn;
@@ -26,15 +26,6 @@ struct vmem_altmap {
unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
-#ifdef CONFIG_ZONE_DEVICE
-struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start);
-#else
-static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
-{
- return NULL;
-}
-#endif
-
/**
* struct dev_pagemap - metadata for ZONE_DEVICE mappings
* @altmap: pre-allocated/reserved memory for vmemmap allocations
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6f543a47fc92..957d4658977d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2441,10 +2441,27 @@ static inline void *vmemmap_alloc_block_buf(unsigned long size, int node)
return __vmemmap_alloc_block_buf(size, node, NULL);
}
+#ifdef CONFIG_ZONE_DEVICE
+struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start);
+#else
+static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
+{
+ return NULL;
+}
+#endif
+
void vmemmap_verify(pte_t *, int, unsigned long, unsigned long);
int vmemmap_populate_basepages(unsigned long start, unsigned long end,
int node);
-int vmemmap_populate(unsigned long start, unsigned long end, int node);
+int __vmemmap_populate(unsigned long start, unsigned long end, int node,
+ struct vmem_altmap *altmap);
+static inline int vmemmap_populate(unsigned long start, unsigned long end,
+ int node)
+{
+ struct vmem_altmap *altmap = to_vmem_altmap(start);
+ return __vmemmap_populate(start, end, node, altmap);
+}
+
void vmemmap_populate_print_last(void);
#ifdef CONFIG_MEMORY_HOTPLUG
void vmemmap_free(unsigned long start, unsigned long end);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index c50b1a14d55e..483ba270d522 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -14,7 +14,7 @@
* case the overhead consists of a few additional pages that are
* allocated to create a view of memory for vmemmap.
*
- * The architecture is expected to provide a vmemmap_populate() function
+ * The architecture is expected to provide a __vmemmap_populate() function
* to instantiate the mapping.
*/
#include <linux/mm.h>
--
2.13.2
From: Michal Hocko <[email protected]>
Physical memory hotadd has to allocate a memmap (struct page array) for
the newly added memory section. kmalloc is currently used for those
allocations.
This has some disadvantages: a) existing memory is consumed for that
purpose (~2MB per 128MB memory section) and b) if the whole node is
movable then we have off-node struct pages, which has performance
drawbacks.
a) has turned out to be a problem for memory hotplug based ballooning
because userspace might not react in time to online the memory while
the memory consumed during the physical hotadd is enough to push the
system to OOM. 31bc3858ea3e ("memory-hotplug: add automatic onlining
policy for the newly added memory") has been added to work around that
problem.
We can do much better when CONFIG_SPARSEMEM_VMEMMAP=y because vmemmap
page tables can map arbitrary memory. That means that we can simply use
the beginning of each memory section and map struct pages there. struct
pages which back the allocated space then just need to be treated
carefully so that we know they are not usable.
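For reference, the ~2MB per 128MB figure quoted above works out as
follows; the snippet assumes 4kB base pages, 128MB sections and a 64
byte struct page, which are the usual x86_64 values:

/* back-of-the-envelope check of the per-section memmap overhead */
#include <stdio.h>

int main(void)
{
	unsigned long page_size = 4096;			/* 4kB base pages */
	unsigned long section_size = 128UL << 20;	/* 128MB memory section */
	unsigned long struct_page_size = 64;		/* typical sizeof(struct page) */

	unsigned long pages_per_section = section_size / page_size;	/* 32768 */
	unsigned long memmap_bytes = pages_per_section * struct_page_size;

	printf("memmap per section: %lu KB (%lu pfns reserved at its start)\n",
	       memmap_bytes >> 10, memmap_bytes / page_size);
	return 0;
}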
Add {__Set,__Clear}PageVmemmap helpers to distinguish those pages in pfn
walkers. We do not have any spare page flag for this purpose so use the
combination of the PageReserved bit, which already tells the core mm
code to ignore the page, and VMEMMAP_PAGE (which sets all bits except
PAGE_MAPPING_FLAGS) stored in page->mapping.
On the memory hotplug front, reuse the vmem_altmap infrastructure to
override the default allocator used by __vmemmap_populate. Once the
memmap is allocated we need a way to mark the altmap pfns used for the
allocation and this is done by a new vmem_altmap::flush_alloc_pfns
callback. The mark_vmemmap_pages implementation then simply applies
__SetPageVmemmap to all struct pages backing those pfns. The callback is
called from sparse_add_one_section after the memmap has been initialized
to 0.
We also have to be careful about those pages during online and offline
operations. They are simply ignored.
Finally __ClearPageVmemmap is called when the vmemmap page tables are
torn down.
Please note that only memory hotplug currently uses this allocation
scheme. The boot time memmap allocation could use the same trick as well
but this is not done yet.
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: [email protected]
Cc: Dan Williams <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
arch/powerpc/mm/init_64.c | 5 +++++
arch/x86/mm/init_64.c | 9 +++++++++
include/linux/memory_hotplug.h | 5 ++++-
include/linux/memremap.h | 25 ++++++++++++++++++++++++-
include/linux/mm.h | 6 ++----
include/linux/page-flags.h | 18 ++++++++++++++++++
kernel/memremap.c | 6 ------
mm/compaction.c | 3 +++
mm/memory_hotplug.c | 23 ++++++++++++++++++++---
mm/page_alloc.c | 14 ++++++++++++++
mm/page_isolation.c | 11 ++++++++++-
mm/sparse-vmemmap.c | 11 +++++++++--
mm/sparse.c | 26 +++++++++++++++++++++-----
13 files changed, 139 insertions(+), 23 deletions(-)
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 5ea5e870a589..b14f2239a152 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -264,6 +264,11 @@ void __ref vmemmap_free(unsigned long start, unsigned long end)
goto unmap;
}
+ if (PageVmemmap(page)) {
+ __ClearPageVmemmap(page);
+ goto unmap;
+ }
+
if (PageReserved(page)) {
/* allocated from bootmem */
if (page_size < PAGE_SIZE) {
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index cfe01150f24f..e4f749e5652f 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -803,6 +803,15 @@ static void __meminit free_pagetable(struct page *page, int order)
return;
}
+ /*
+ * runtime vmemmap pages are residing inside the memory section so
+ * they do not have to be freed anywhere.
+ */
+ if (PageVmemmap(page)) {
+ __ClearPageVmemmap(page);
+ return;
+ }
+
/* bootmem page has reserved flag */
if (PageReserved(page)) {
__ClearPageReserved(page);
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 8e8738a6e76d..f2636ad2d00f 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -312,7 +312,10 @@ extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
extern bool is_memblock_offlined(struct memory_block *mem);
extern void remove_memory(int nid, u64 start, u64 size);
-extern int sparse_add_one_section(struct pglist_data *pgdat, unsigned long start_pfn);
+
+struct vmem_altmap;
+extern int sparse_add_one_section(struct pglist_data *pgdat, unsigned long start_pfn,
+ struct vmem_altmap *altmap);
extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
unsigned long map_offset);
extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 6f8f65d8ebdd..7aec9272fe4d 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -14,6 +14,8 @@ struct device;
* @free: free pages set aside in the mapping for memmap storage
* @align: pages reserved to meet allocation alignments
* @alloc: track pages consumed, private to __vmemmap_populate()
+ * @flush_alloc_pfns: callback to be called on the allocated range after it
+ * is mapped to the vmemmap - see mark_vmemmap_pages
*/
struct vmem_altmap {
const unsigned long base_pfn;
@@ -21,11 +23,32 @@ struct vmem_altmap {
unsigned long free;
unsigned long align;
unsigned long alloc;
+ void (*flush_alloc_pfns)(struct vmem_altmap *self);
};
-unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
+static inline unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
+{
+ /* number of pfns from base where pfn_to_page() is valid */
+ return altmap->reserve + altmap->free;
+}
void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
+static inline void mark_vmemmap_pages(struct vmem_altmap *self)
+{
+ unsigned long pfn = self->base_pfn + self->reserve;
+ unsigned long nr_pages = self->alloc;
+ unsigned long i;
+
+ /*
+ * All allocations for the memory hotplug are the same sized so align
+ * should be 0
+ */
+ WARN_ON(self->align);
+ for (i = 0; i < nr_pages; i++, pfn++) {
+ struct page *page = pfn_to_page(pfn);
+ __SetPageVmemmap(page);
+ }
+}
/**
* struct dev_pagemap - metadata for ZONE_DEVICE mappings
* @altmap: pre-allocated/reserved memory for vmemmap allocations
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 957d4658977d..3ce673570fb8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2426,6 +2426,8 @@ void sparse_mem_maps_populate_node(struct page **map_map,
unsigned long map_count,
int nodeid);
+struct page *__sparse_mem_map_populate(unsigned long pnum, int nid,
+ struct vmem_altmap *altmap);
struct page *sparse_mem_map_populate(unsigned long pnum, int nid);
pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
@@ -2436,10 +2438,6 @@ void *vmemmap_alloc_block(unsigned long size, int node);
struct vmem_altmap;
void *__vmemmap_alloc_block_buf(unsigned long size, int node,
struct vmem_altmap *altmap);
-static inline void *vmemmap_alloc_block_buf(unsigned long size, int node)
-{
- return __vmemmap_alloc_block_buf(size, node, NULL);
-}
#ifdef CONFIG_ZONE_DEVICE
struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index d33e3280c8ad..a6d4f94cf7bf 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -412,6 +412,24 @@ static __always_inline int __PageMovable(struct page *page)
PAGE_MAPPING_MOVABLE;
}
+#define VMEMMAP_PAGE ~PAGE_MAPPING_FLAGS
+static __always_inline int PageVmemmap(struct page *page)
+{
+ return PageReserved(page) && (unsigned long)page->mapping == VMEMMAP_PAGE;
+}
+
+static __always_inline void __ClearPageVmemmap(struct page *page)
+{
+ ClearPageReserved(page);
+ page->mapping = NULL;
+}
+
+static __always_inline void __SetPageVmemmap(struct page *page)
+{
+ SetPageReserved(page);
+ page->mapping = (void *)VMEMMAP_PAGE;
+}
+
#ifdef CONFIG_KSM
/*
* A KSM page is one of those write-protected "shared pages" or "merged pages"
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 124bed776532..a72eb5932d2f 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -393,12 +393,6 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
}
EXPORT_SYMBOL(devm_memremap_pages);
-unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
-{
- /* number of pfns from base where pfn_to_page() is valid */
- return altmap->reserve + altmap->free;
-}
-
void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns)
{
altmap->alloc -= nr_pfns;
diff --git a/mm/compaction.c b/mm/compaction.c
index fb548e4c7bd4..1ba5ad029258 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -741,6 +741,9 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
page = pfn_to_page(low_pfn);
+ if (PageVmemmap(page))
+ goto isolate_fail;
+
if (!valid_page)
valid_page = page;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 260139f2581c..19037d0191e5 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -35,6 +35,7 @@
#include <linux/memblock.h>
#include <linux/bootmem.h>
#include <linux/compaction.h>
+#include <linux/memremap.h>
#include <asm/tlbflush.h>
@@ -244,7 +245,7 @@ void __init register_page_bootmem_info_node(struct pglist_data *pgdat)
#endif /* CONFIG_HAVE_BOOTMEM_INFO_NODE */
static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
- bool want_memblock)
+ bool want_memblock, struct vmem_altmap *altmap)
{
int ret;
int i;
@@ -252,7 +253,7 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
if (pfn_valid(phys_start_pfn))
return -EEXIST;
- ret = sparse_add_one_section(NODE_DATA(nid), phys_start_pfn);
+ ret = sparse_add_one_section(NODE_DATA(nid), phys_start_pfn, altmap);
if (ret < 0)
return ret;
@@ -292,12 +293,23 @@ int __ref __add_pages(int nid, unsigned long phys_start_pfn,
int err = 0;
int start_sec, end_sec;
struct vmem_altmap *altmap;
+ struct vmem_altmap __section_altmap = {.base_pfn = phys_start_pfn};
/* during initialize mem_map, align hot-added range to section */
start_sec = pfn_to_section_nr(phys_start_pfn);
end_sec = pfn_to_section_nr(phys_start_pfn + nr_pages - 1);
+ /*
+ * Check device specific altmap and fallback to allocating from the
+ * begining of the section otherwise
+ */
altmap = to_vmem_altmap((unsigned long) pfn_to_page(phys_start_pfn));
+ if (!altmap) {
+ __section_altmap.free = nr_pages;
+ __section_altmap.flush_alloc_pfns = mark_vmemmap_pages;
+ altmap = &__section_altmap;
+ }
+
if (altmap) {
/*
* Validate altmap is within bounds of the total request
@@ -312,7 +324,7 @@ int __ref __add_pages(int nid, unsigned long phys_start_pfn,
}
for (i = start_sec; i <= end_sec; i++) {
- err = __add_section(nid, section_nr_to_pfn(i), want_memblock);
+ err = __add_section(nid, section_nr_to_pfn(i), want_memblock, altmap);
/*
* EEXIST is finally dealt with by ioresource collision
@@ -689,6 +701,8 @@ static int online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
if (PageReserved(pfn_to_page(start_pfn)))
for (i = 0; i < nr_pages; i++) {
page = pfn_to_page(start_pfn + i);
+ if (PageVmemmap(page))
+ continue;
if (PageHWPoison(page)) {
ClearPageReserved(page);
continue;
@@ -1406,6 +1420,9 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
continue;
page = pfn_to_page(pfn);
+ if (PageVmemmap(page))
+ continue;
+
if (PageHuge(page)) {
struct page *head = compound_head(page);
pfn = page_to_pfn(head) + (1<<compound_order(head)) - 1;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 63a59864a21d..78be043cd66d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1168,6 +1168,10 @@ static void free_one_page(struct zone *zone,
static void __meminit __init_single_page(struct page *page, unsigned long pfn,
unsigned long zone, int nid)
{
+ /*
+ * Please note that the page already has some state which has to be
+ * preserved (e.g. PageVmemmap)
+ */
set_page_links(page, zone, nid, pfn);
init_page_count(page);
page_mapcount_reset(page);
@@ -7781,6 +7785,16 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
continue;
}
page = pfn_to_page(pfn);
+
+ /*
+ * vmemmap pages are residing inside the memory section so
+ * they do not have to be freed anywhere.
+ */
+ if (PageVmemmap(page)) {
+ pfn++;
+ continue;
+ }
+
/*
* The HWPoisoned page may be not in buddy system, and
* page_count() is not 0.
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 757410d9f758..6e976aa94dee 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -242,6 +242,10 @@ __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn,
continue;
}
page = pfn_to_page(pfn);
+ if (PageVmemmap(page)) {
+ pfn++;
+ continue;
+ }
if (PageBuddy(page))
/*
* If the page is on a free list, it has to be on
@@ -272,10 +276,15 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
* are not aligned to pageblock_nr_pages.
* Then we just check migratetype first.
*/
- for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
+ for (pfn = start_pfn; pfn < end_pfn; ) {
page = __first_valid_page(pfn, pageblock_nr_pages);
+ if (PageVmemmap(page)) {
+ pfn++;
+ continue;
+ }
if (page && !is_migrate_isolate_page(page))
break;
+ pfn += pageblock_nr_pages;
}
page = __first_valid_page(start_pfn, end_pfn - start_pfn);
if ((pfn < end_pfn) || !page)
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 483ba270d522..42d6721cfb71 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -264,7 +264,8 @@ int __meminit vmemmap_populate_basepages(unsigned long start,
return 0;
}
-struct page * __meminit sparse_mem_map_populate(unsigned long pnum, int nid)
+struct page * __meminit __sparse_mem_map_populate(unsigned long pnum, int nid,
+ struct vmem_altmap *altmap)
{
unsigned long start;
unsigned long end;
@@ -274,12 +275,18 @@ struct page * __meminit sparse_mem_map_populate(unsigned long pnum, int nid)
start = (unsigned long)map;
end = (unsigned long)(map + PAGES_PER_SECTION);
- if (vmemmap_populate(start, end, nid))
+ if (__vmemmap_populate(start, end, nid, altmap))
return NULL;
return map;
}
+struct page * __meminit sparse_mem_map_populate(unsigned long pnum, int nid)
+{
+ WARN_ON(to_vmem_altmap(pnum));
+ return __sparse_mem_map_populate(pnum, nid, NULL);
+}
+
void __init sparse_mem_maps_populate_node(struct page **map_map,
unsigned long pnum_begin,
unsigned long pnum_end,
diff --git a/mm/sparse.c b/mm/sparse.c
index 7b4be3fd5cac..8460ecb9e69b 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -10,6 +10,7 @@
#include <linux/export.h>
#include <linux/spinlock.h>
#include <linux/vmalloc.h>
+#include <linux/memremap.h>
#include "internal.h"
#include <asm/dma.h>
@@ -666,10 +667,11 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
#endif
#ifdef CONFIG_SPARSEMEM_VMEMMAP
-static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid)
+static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid,
+ struct vmem_altmap *altmap)
{
/* This will make the necessary allocations eventually. */
- return sparse_mem_map_populate(pnum, nid);
+ return __sparse_mem_map_populate(pnum, nid, altmap);
}
static void __kfree_section_memmap(struct page *memmap)
{
@@ -709,7 +711,8 @@ static struct page *__kmalloc_section_memmap(void)
return ret;
}
-static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid)
+static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid,
+ struct vmem_altmap *altmap)
{
return __kmalloc_section_memmap();
}
@@ -761,7 +764,8 @@ static void free_map_bootmem(struct page *memmap)
* set. If this is <=0, then that means that the passed-in
* map was not consumed and must be freed.
*/
-int __meminit sparse_add_one_section(struct pglist_data *pgdat, unsigned long start_pfn)
+int __meminit sparse_add_one_section(struct pglist_data *pgdat, unsigned long start_pfn,
+ struct vmem_altmap *altmap)
{
unsigned long section_nr = pfn_to_section_nr(start_pfn);
struct mem_section *ms;
@@ -777,7 +781,7 @@ int __meminit sparse_add_one_section(struct pglist_data *pgdat, unsigned long st
ret = sparse_index_init(section_nr, pgdat->node_id);
if (ret < 0 && ret != -EEXIST)
return ret;
- memmap = kmalloc_section_memmap(section_nr, pgdat->node_id);
+ memmap = kmalloc_section_memmap(section_nr, pgdat->node_id, altmap);
if (!memmap)
return -ENOMEM;
usemap = __kmalloc_section_usemap();
@@ -794,8 +798,20 @@ int __meminit sparse_add_one_section(struct pglist_data *pgdat, unsigned long st
goto out;
}
+ /*
+ * TODO get rid of this somehow - we want to postpone the full
+ * initialization until memmap_init_zone.
+ */
memset(memmap, 0, sizeof(struct page) * PAGES_PER_SECTION);
+ /*
+ * now that we have a valid vmemmap mapping we can use
+ * pfn_to_page and flush struct pages which back the
+ * memmap
+ */
+ if (altmap && altmap->flush_alloc_pfns)
+ altmap->flush_alloc_pfns(altmap);
+
section_mark_present(ms);
ret = sparse_init_one_section(ms, section_nr, memmap, usemap);
--
2.13.2
From: Michal Hocko <[email protected]>
All current users of the altmap are in the memory hotplug code and
they use __vmemmap_populate explicitly (via __sparse_mem_map_populate).
Complain if somebody uses vmemmap_populate with an altmap registered
because that could be an unexpected usage. Also call __vmemmap_populate
with NULL from that code path.
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/mm.h | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3ce673570fb8..ae1fa053d09e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2456,8 +2456,12 @@ int __vmemmap_populate(unsigned long start, unsigned long end, int node,
static inline int vmemmap_populate(unsigned long start, unsigned long end,
int node)
{
- struct vmem_altmap *altmap = to_vmem_altmap(start);
- return __vmemmap_populate(start, end, node, altmap);
+ /*
+ * All users of the altmap have to be explicit and use
+ * __vmemmap_populate directly
+ */
+ WARN_ON(to_vmem_altmap(start));
+ return __vmemmap_populate(start, end, node, NULL);
}
void vmemmap_populate_print_last(void);
--
2.13.2
From: Michal Hocko <[email protected]>
Both functions will use the altmap rather than kmalloc for
sparsemem-vmemmap, so rename them to alloc_section_memmap and
free_section_memmap, which better reflects the functionality.
Signed-off-by: Michal Hocko <[email protected]>
---
mm/sparse.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/mm/sparse.c b/mm/sparse.c
index 8460ecb9e69b..eb2e45b35a28 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -667,13 +667,13 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
#endif
#ifdef CONFIG_SPARSEMEM_VMEMMAP
-static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid,
+static inline struct page *alloc_section_memmap(unsigned long pnum, int nid,
struct vmem_altmap *altmap)
{
/* This will make the necessary allocations eventually. */
return __sparse_mem_map_populate(pnum, nid, altmap);
}
-static void __kfree_section_memmap(struct page *memmap)
+static void free_section_memmap(struct page *memmap)
{
unsigned long start = (unsigned long)memmap;
unsigned long end = (unsigned long)(memmap + PAGES_PER_SECTION);
@@ -711,13 +711,13 @@ static struct page *__kmalloc_section_memmap(void)
return ret;
}
-static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid,
+static inline struct page *alloc_section_memmap(unsigned long pnum, int nid,
struct vmem_altmap *altmap)
{
return __kmalloc_section_memmap();
}
-static void __kfree_section_memmap(struct page *memmap)
+static void free_section_memmap(struct page *memmap)
{
if (is_vmalloc_addr(memmap))
vfree(memmap);
@@ -781,12 +781,12 @@ int __meminit sparse_add_one_section(struct pglist_data *pgdat, unsigned long st
ret = sparse_index_init(section_nr, pgdat->node_id);
if (ret < 0 && ret != -EEXIST)
return ret;
- memmap = kmalloc_section_memmap(section_nr, pgdat->node_id, altmap);
+ memmap = alloc_section_memmap(section_nr, pgdat->node_id, altmap);
if (!memmap)
return -ENOMEM;
usemap = __kmalloc_section_usemap();
if (!usemap) {
- __kfree_section_memmap(memmap);
+ free_section_memmap(memmap);
return -ENOMEM;
}
@@ -820,7 +820,7 @@ int __meminit sparse_add_one_section(struct pglist_data *pgdat, unsigned long st
pgdat_resize_unlock(pgdat, &flags);
if (ret <= 0) {
kfree(usemap);
- __kfree_section_memmap(memmap);
+ free_section_memmap(memmap);
}
return ret;
}
@@ -861,7 +861,7 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap)
if (PageSlab(usemap_page) || PageCompound(usemap_page)) {
kfree(usemap);
if (memmap)
- __kfree_section_memmap(memmap);
+ free_section_memmap(memmap);
return;
}
--
2.13.2
On Wed 26-07-17 10:33:28, Michal Hocko wrote:
> The patches are based on the current mmotm tree (mmotm-2017-07-12-15-11)
Btw. the patchset is also
git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git branch
attempts/memmap-in-section-hotplug
--
Michal Hocko
SUSE Labs
On Wed, Jul 26, 2017 at 10:33:31AM +0200, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> Physical memory hotadd has to allocate a memmap (struct page array) for
> the newly added memory section. kmalloc is currently used for those
> allocations.
>
> This has some disadvantages: a) existing memory is consumed for that
> purpose (~2MB per 128MB memory section) and b) if the whole node is
> movable then we have off-node struct pages, which has performance
> drawbacks.
>
> a) has turned out to be a problem for memory hotplug based ballooning
> because userspace might not react in time to online the memory while
> the memory consumed during the physical hotadd is enough to push the
> system to OOM. 31bc3858ea3e ("memory-hotplug: add automatic onlining
> policy for the newly added memory") has been added to work around that
> problem.
>
> We can do much better when CONFIG_SPARSEMEM_VMEMMAP=y because vmemmap
> page tables can map arbitrary memory. That means that we can simply
> use the beginning of each memory section and map struct pages there.
> struct pages which back the allocated space then just need to be treated
> carefully so that we know they are not usable.
>
> Add {__Set,__Clear}PageVmemmap helpers to distinguish those pages in pfn
> walkers. We do not have any spare page flag for this purpose so use the
> combination of the PageReserved bit, which already tells the core mm
> code to ignore the page, and VMEMMAP_PAGE (which sets all bits except
> PAGE_MAPPING_FLAGS) stored in page->mapping.
>
> On the memory hotplug front, reuse the vmem_altmap infrastructure to
> override the default allocator used by __vmemmap_populate. Once the
> memmap is allocated we need a way to mark the altmap pfns used for the
> allocation and this is done by a new vmem_altmap::flush_alloc_pfns
> callback. The mark_vmemmap_pages implementation then simply applies
> __SetPageVmemmap to all struct pages backing those pfns. The callback is
> called from sparse_add_one_section after the memmap has been initialized
> to 0.
>
> We also have to be careful about those pages during online and offline
> operations. They are simply ignored.
>
> Finally __ClearPageVmemmap is called when the vmemmap page tables are
> torn down.
>
> Please note that only memory hotplug currently uses this allocation
> scheme. The boot time memmap allocation could use the same trick as well
> but this is not done yet.
Which kernel are these patches based on? I tried linux-next and Linus'
vanilla tree, however the series does not apply.
In general I do like your idea, however if I understand your patches
correctly we might have an ordering problem on s390: it is not possible to
access hot-added memory on s390 before it is online (MEM_GOING_ONLINE
succeeded).
On MEM_GOING_ONLINE we ask the hypervisor to back the potentially available
hot-added memory region with physical pages. Accessing those ranges before
that will result in an exception.
However with your approach the memory is still allocated when add_memory()
is being called, correct? That wouldn't be a change to the current
behaviour; except for the ordering problem outlined above.
Just trying to make sure I get this right :)
On Wed, Jul 26, 2017 at 01:45:39PM +0200, Heiko Carstens wrote:
> > Please note that only memory hotplug currently uses this allocation
> > scheme. The boot time memmap allocation could use the same trick as
> > well but this is not done yet.
>
> Which kernel are these patches based on? I tried linux-next and Linus'
> vanilla tree, however the series does not apply.
I found the answer to this question in the meantime.. sorry ;)
On Wed 26-07-17 13:45:39, Heiko Carstens wrote:
[...]
> In general I do like your idea, however if I understand your patches
> correctly we might have an ordering problem on s390: it is not possible to
> access hot-added memory on s390 before it is online (MEM_GOING_ONLINE
> succeeded).
Could you point me to the code please? I cannot seem to find the
notifier which implements that.
> On MEM_GOING_ONLINE we ask the hypervisor to back the potentially available
> hot-added memory region with physical pages. Accessing those ranges before
> that will result in an exception.
Can we make the range which backs the memmap range available? E.g from
s390 specific __vmemmap_populate path?
> However with your approach the memory is still allocated when add_memory()
> is being called, correct? That wouldn't be a change to the current
> behaviour; except for the ordering problem outlined above.
Could you be more specific please? I do not change when the memmap is
allocated.
--
Michal Hocko
SUSE Labs
On Wed, 26 Jul 2017 14:30:41 +0200
Michal Hocko <[email protected]> wrote:
> On Wed 26-07-17 13:45:39, Heiko Carstens wrote:
> [...]
> > In general I do like your idea, however if I understand your patches
> > correctly we might have an ordering problem on s390: it is not possible to
> > access hot-added memory on s390 before it is online (MEM_GOING_ONLINE
> > succeeded).
>
> Could you point me to the code please? I cannot seem to find the
> notifier which implements that.
It is in drivers/s390/char/sclp_cmd.c: sclp_mem_notifier().
>
> > On MEM_GOING_ONLINE we ask the hypervisor to back the potentially available
> > hot-added memory region with physical pages. Accessing those ranges before
> > that will result in an exception.
>
> Can we make the range which backs the memmap range available? E.g from
> s390 specific __vmemmap_populate path?
No, only the complete range of a storage increment can be made available.
The size of those increments may vary between z/VM and LPAR, but at least
with LPAR it will always be minimum 256 MB, IIRC.
>
> > However with your approach the memory is still allocated when add_memory()
> > is being called, correct? That wouldn't be a change to the current
> > behaviour; except for the ordering problem outlined above.
>
> Could you be more specific please? I do not change when the memmap is
> allocated.
I guess this is about the difference between s390 and others, wrt when
we call add_memory(). It is also in drivers/s390/char/sclp_cmd.c, early
during memory detection, as opposed to other archs, where I guess this
could be triggered by an ACPI event during runtime, at least for newly
added and to-be-onlined memory.
This probably means that any approach that tries to allocate memmap
memory during add_memory(), out of the "to-be-onlined but still offline"
memory, will be difficult for s390, because we call add_memory() only once
during memory detection for the complete range of (hypervisor) defined
online and offline memory. The offline parts are then made available in
the MEM_GOING_ONLINE notifier called from online_pages(). Only after
this point the memory would then be available to allocate a memmap in it.
Nevertheless, we have great interest in such an "allocate memmap from
the added memory range" solution. I guess we would need some way to
separate the memmap allocation from add_memory(), which sounds odd,
or provide some way to have add_memory() only allocate a memmap for
online memory, and a mechanism to add the memmaps for offline memory
blocks later when they are being set online.
Regards,
Gerald
On Wed, Jul 26, 2017 at 10:33:28AM +0200, Michal Hocko wrote:
> Hi,
> this is another step to make memory hotplug more usable. The primary
> goal of this patchset is to reduce the memory overhead of hot added
> memory (at least for the SPARSE_VMEMMAP memory model). Currently we use
> kmalloc to populate the memmap (struct page array), which has two main
> drawbacks: a) it consumes additional memory until the hotadded memory
> itself is onlined and b) the memmap might end up on a different NUMA
> node, which is especially true for movable_node configurations.
>
> a) is a problem especially for memory hotplug based memory "ballooning"
> solutions where the delay between the physical memory hotplug and the
> onlining can lead to OOM, which led to the introduction of hacks like
> auto onlining (see 31bc3858ea3e ("memory-hotplug: add automatic onlining
> policy for the newly added memory")).
> b) can have performance drawbacks.
>
> One way to mitigate both issues is to simply allocate the memmap array
> (which is the largest memory footprint of the physical memory hotplug)
> from the hotadded memory itself. The VMEMMAP memory model allows us to
> map any pfn range so the memory doesn't need to be online to be usable
> for the array. See patch 3 for more details. In short, I am reusing the
> existing vmem_altmap which achieves the same thing for nvdimm device
> memory.
>
> I am sending this as an RFC because it has seen only very limited
> testing and I am mostly interested in opinions on the chosen approach.
> I had to touch some arch code and I have no idea whether my changes make
> sense there (especially ppc). Therefore I would highly appreciate it if
> arch maintainers could check patch 2.
>
> Patches 4 and 5 should be straightforward cleanups.
>
> There is also one potential drawback, though. If somebody uses memory
> hotplug for 1G (gigantic) hugetlb pages then this scheme will obviously
> not work for them because each memory section will contain a 2MB
> reserved area. I am not really sure anybody does that, or how reliably
> it can actually work. Nevertheless, I _believe_ that onlining more
> memory into virtual machines is a much more common use case. Anyway, if
> there ever is a strong demand for such a use case we have basically 3
> options: a) enlarge memory sections, b) enhance the altmap allocation
> strategy and reuse low memory sections to host memmaps of other sections
> on the same NUMA node, or c) make the memmap allocation strategy
> configurable to fall back to the current allocation.
>
> Are there any other concerns, ideas, comments?
>
This does not seem to be an opt-in change, i.e. if I am reading patch 3
correctly, if an altmap is not provided to __add_pages() you fall back
to allocating from the beginning of the zone. This will not work with
HMM, i.e. device private memory. So at the very least I would like to
see some way to opt out of this. Maybe a new argument like bool
forbid_altmap?
Cheers,
Jérôme
On Wed 26-07-17 17:06:59, Jerome Glisse wrote:
[...]
> This does not seem to be an opt-in change, i.e. if I am reading patch 3
> correctly, if an altmap is not provided to __add_pages() you fall back
> to allocating from the beginning of the zone. This will not work with
> HMM, i.e. device private memory. So at the very least I would like to
> see some way to opt out of this. Maybe a new argument like bool
> forbid_altmap?
OK, I see! I will think about how to make a sane api for that.
Thanks!
--
Michal Hocko
SUSE Labs
On Wed 26-07-17 19:20:39, Gerald Schaefer wrote:
> On Wed, 26 Jul 2017 14:30:41 +0200
> Michal Hocko <[email protected]> wrote:
>
> > On Wed 26-07-17 13:45:39, Heiko Carstens wrote:
> > [...]
> > > In general I do like your idea, however if I understand your patches
> > > correctly we might have an ordering problem on s390: it is not possible to
> > > access hot-added memory on s390 before it is online (MEM_GOING_ONLINE
> > > succeeded).
> >
> > Could you point me to the code please? I cannot seem to find the
> > notifier which implements that.
>
> It is in drivers/s390/char/sclp_cmd.c: sclp_mem_notifier().
Thanks for the pointer. I will have a look.
> > > On MEM_GOING_ONLINE we ask the hypervisor to back the potentially available
> > > hot-added memory region with physical pages. Accessing those ranges before
> > > that will result in an exception.
> >
> > Can we make the range which backs the memmap range available? E.g from
> > s390 specific __vmemmap_populate path?
>
> No, only the complete range of a storage increment can be made available.
> The size of those increments may vary between z/VM and LPAR, but at least
> with LPAR it will always be minimum 256 MB, IIRC.
Is there any problem doing that before we even get to __add_pages - e.g.
in arch_add_memory? X86 already does something along those lines by
calling init_memory_mapping AFAIU. Yes, it is a different thing than on
s390, but essentially it is preparing the physical address space for the
new memory, so it is not that far away...
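(For context, the x86 path being referred to looks roughly like the
excerpt below at this point in time; it is simplified and trims the
max_pfn/high_memory updates, see also the x86 hunk further down the
thread.)

/* simplified excerpt of arch/x86/mm/init_64.c, details trimmed */
int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock)
{
	unsigned long start_pfn = start >> PAGE_SHIFT;
	unsigned long nr_pages = size >> PAGE_SHIFT;

	/* make the newly added physical range accessible up front ... */
	init_memory_mapping(start, start + size);

	/* ... and only then create the sections (and their memmap) */
	return __add_pages(nid, start_pfn, nr_pages, want_memblock);
}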
> > > However with your approach the memory is still allocated when add_memory()
> > > is being called, correct? That wouldn't be a change to the current
> > > behaviour; except for the ordering problem outlined above.
> >
> > Could you be more specific please? I do not change when the memmap is
> > allocated.
>
> I guess this is about the difference between s390 and others, wrt when
> we call add_memory(). It is also in drivers/s390/char/sclp_cmd.c, early
> during memory detection, as opposed to other archs, where I guess this
> could be triggered by an ACPI event during runtime, at least for newly
> added and to-be-onlined memory.
I guess this is trying to answer my question above about arch_add_memory
but I still fail to grasp what this means.
> This probably means that any approach that tries to allocate memmap
> memory during add_memory(), out of the "to-be-onlined but still offline"
> memory, will be difficult for s390, because we call add_memory() only once
> during memory detection for the complete range of (hypervisor) defined
> online and offline memory. The offline parts are then made available in
> the MEM_GOING_ONLINE notifier called from online_pages(). Only after
> this point the memory would then be available to allocate a memmap in it.
Yes, this scheme is really unfortunate for the mechanism I am proposing;
the two are not compatible.
> Nevertheless, we have great interest in such an "allocate memmap from
> the added memory range" solution. I guess we would need some way to
> separate the memmap allocation from add_memory(), which sounds odd,
> or provide some way to have add_memory() only allocate a memmap for
> online memory, and a mechanism to add the memmaps for offline memory
> blocks later when they are being set online.
Well, we cannot move the memmap allocation to later. We do have users
which never online the memory (ZONE_DEVICE). And __add_pages is exactly
about adding the memmap for the range. I believe this should be addressed
somewhere at the arch_add_memory layer.
Jerome has noted that there will have to be an opt-out from using the
altmap because his hotplug usecase (HMM) cannot allocate from the added
range either. So I will use the same opt-out for s390 for now, until we
figure out how to implement it there.
--
Michal Hocko
SUSE Labs
On Wed 26-07-17 10:33:28, Michal Hocko wrote:
[...]
> There is also one potential drawback, though. If somebody uses memory
> hotplug for 1G (gigantic) hugetlb pages then this scheme will not work
> for them obviously because each memory section will contain 2MB reserved
> area.
Actually I am wrong here. It is not each section in general. I had
completely forgotten that we use large memblocks on x86_64 with a lot of
memory, so we hotadd 2GB memblocks, and the altmap will allocate from
the beginning of the _memblock_, so all the memmaps will end up in the
first memory section.
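For scale, a back-of-the-envelope calculation (my numbers, assuming the
usual 128MB x86_64 sections, 4KB pages and a 64-byte struct page) shows
why the memmap of a whole 2GB memblock fits into its first section:

#include <stdio.h>

int main(void)
{
	unsigned long section_size  = 128UL << 20;	/* 128MB x86_64 section */
	unsigned long memblock_size = 2UL << 30;	/* 2GB memblock */
	unsigned long page_size     = 4096;		/* 4KB pages */
	unsigned long page_struct   = 64;		/* typical sizeof(struct page) */

	unsigned long nr_pages     = memblock_size / page_size;	/* 524288 */
	unsigned long memmap_bytes = nr_pages * page_struct;		/* 32MB */

	printf("memmap for a 2GB memblock: %lu MB, first section: %lu MB\n",
	       memmap_bytes >> 20, section_size >> 20);
	return 0;
}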
--
Michal Hocko
SUSE Labs
On Thu 27-07-17 08:56:52, Michal Hocko wrote:
> On Wed 26-07-17 17:06:59, Jerome Glisse wrote:
> [...]
> > This does not seems to be an opt-in change ie if i am reading patch 3
> > correctly if an altmap is not provided to __add_pages() you fallback
> > to allocating from begining of zone. This will not work with HMM ie
> > device private memory. So at very least i would like to see some way
> > to opt-out of this. Maybe a new argument like bool forbid_altmap ?
>
> OK, I see! I will think about how to make a sane api for that.
This is what I came up with. The s390 guys mentioned that I cannot simply
use the new range at this stage yet. This will probably need some other
changes, but I guess we want an opt-in approach with an arch veto in general.
So what do you think about the following? Only x86 is updated for now and I
will split it into two parts, but the idea should be clear at least.
---
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index e4f749e5652f..a4a29af28bcf 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -772,7 +772,8 @@ static void update_end_of_memory_vars(u64 start, u64 size)
}
}
-int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock)
+int arch_add_memory(int nid, u64 start, u64 size,
+ struct mhp_restrictions *restrictions)
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
@@ -780,7 +781,9 @@ int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock)
init_memory_mapping(start, start + size);
- ret = __add_pages(nid, start_pfn, nr_pages, want_memblock);
+ /* newly hotadded memory range is ready to be used for the memmap */
+ restrictions->flags |= MHP_RANGE_ACCESSIBLE;
+ ret = __add_pages(nid, start_pfn, nr_pages, restrictions);
WARN_ON_ONCE(ret);
/* update max_pfn, max_low_pfn and high_memory */
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index f2636ad2d00f..928d93e2a555 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -129,9 +129,29 @@ extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
unsigned long nr_pages);
#endif /* CONFIG_MEMORY_HOTREMOVE */
+/*
+ * Do we want sysfs memblock files created. This will allow userspace to online
+ * and offline memory explicitly. Lack of this bit means that the caller has to
+ * call move_pfn_range_to_zone to finish the initialization.
+ */
+#define MHP_MEMBLOCK_API 1<<0
+
+/*
+ * Is the hotadded memory directly accessible or does it need special handling.
+ * We will try to allocate the memmap for the range from within the added memory
+ * if the bit is set.
+ */
+#define MHP_RANGE_ACCESSIBLE 1<<1
+
+/* Restrictions for the memory hotplug */
+struct mhp_restrictions {
+ unsigned long flags; /* MHP_ flags */
+ struct vmem_altmap *altmap; /* use this alternative allocator for memmaps */
+};
+
/* reasonably generic interface to expand the physical pages */
extern int __add_pages(int nid, unsigned long start_pfn,
- unsigned long nr_pages, bool want_memblock);
+ unsigned long nr_pages, struct mhp_restrictions *restrictions);
#ifdef CONFIG_NUMA
extern int memory_add_physaddr_to_nid(u64 start);
@@ -306,7 +326,8 @@ extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
void *arg, int (*func)(struct memory_block *, void *));
extern int add_memory(int nid, u64 start, u64 size);
extern int add_memory_resource(int nid, struct resource *resource, bool online);
-extern int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock);
+extern int arch_add_memory(int nid, u64 start, u64 size,
+ struct mhp_restrictions *restrictions);
extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
unsigned long nr_pages);
extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
diff --git a/kernel/memremap.c b/kernel/memremap.c
index a72eb5932d2f..cf0998cfcb13 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -286,6 +286,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
struct dev_pagemap *pgmap;
struct page_map *page_map;
int error, nid, is_ram;
+ struct mhp_restrictions restrictions = {};
unsigned long pfn;
align_start = res->start & ~(SECTION_SIZE - 1);
@@ -357,8 +358,11 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
if (error)
goto err_pfn_remap;
+ /* We do not want any optional features, only our own memmap */
+ restrictions.altmap = to_vmem_altmap((unsigned long) pfn_to_page(phys_start_pfn));
+
mem_hotplug_begin();
- error = arch_add_memory(nid, align_start, align_size, false);
+ error = arch_add_memory(nid, align_start, align_size, &restrictions);
if (!error)
move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
align_start >> PAGE_SHIFT,
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 19037d0191e5..9d11c3b5b448 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -287,12 +287,13 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
* add the new pages.
*/
int __ref __add_pages(int nid, unsigned long phys_start_pfn,
- unsigned long nr_pages, bool want_memblock)
+ unsigned long nr_pages,
+ struct mhp_restrictions *restrictions)
{
unsigned long i;
int err = 0;
int start_sec, end_sec;
- struct vmem_altmap *altmap;
+ struct vmem_altmap *altmap = restrictions->altmap;
struct vmem_altmap __section_altmap = {.base_pfn = phys_start_pfn};
/* during initialize mem_map, align hot-added range to section */
@@ -301,10 +302,9 @@ int __ref __add_pages(int nid, unsigned long phys_start_pfn,
/*
* Check device specific altmap and fallback to allocating from the
- * begining of the section otherwise
+ * beginning of the added range otherwise
*/
- altmap = to_vmem_altmap((unsigned long) pfn_to_page(phys_start_pfn));
- if (!altmap) {
+ if (!altmap && restrictions->flags & MHP_RANGE_ACCESSIBLE) {
__section_altmap.free = nr_pages;
__section_altmap.flush_alloc_pfns = mark_vmemmap_pages;
altmap = &__section_altmap;
@@ -324,7 +324,9 @@ int __ref __add_pages(int nid, unsigned long phys_start_pfn,
}
for (i = start_sec; i <= end_sec; i++) {
- err = __add_section(nid, section_nr_to_pfn(i), want_memblock, altmap);
+ err = __add_section(nid, section_nr_to_pfn(i),
+ restrictions->flags & MHP_MEMBLOCK_API,
+ altmap);
/*
* EEXIST is finally dealt with by ioresource collision
@@ -1160,6 +1162,7 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
bool new_pgdat;
bool new_node;
int ret;
+ struct mhp_restrictions restrictions = {};
start = res->start;
size = resource_size(res);
@@ -1191,8 +1194,10 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
goto error;
}
+ restrictions.flags = MHP_MEMBLOCK_API;
+
/* call arch's memory hotadd */
- ret = arch_add_memory(nid, start, size, true);
+ ret = arch_add_memory(nid, start, size, &restrictions);
if (ret < 0)
goto error;
--
Michal Hocko
SUSE Labs
On Wed 26-07-17 10:33:31, Michal Hocko wrote:
[...]
> @@ -312,7 +324,7 @@ int __ref __add_pages(int nid, unsigned long phys_start_pfn,
> }
>
> for (i = start_sec; i <= end_sec; i++) {
> - err = __add_section(nid, section_nr_to_pfn(i), want_memblock);
> + err = __add_section(nid, section_nr_to_pfn(i), want_memblock, altmap);
>
> /*
> * EEXIST is finally dealt with by ioresource collision
[...]
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index 483ba270d522..42d6721cfb71 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -794,8 +798,20 @@ int __meminit sparse_add_one_section(struct pglist_data *pgdat, unsigned long st
> goto out;
> }
>
> + /*
> + * TODO get rid of this somehow - we want to postpone the full
> + * initialization until memmap_init_zone.
> + */
> memset(memmap, 0, sizeof(struct page) * PAGES_PER_SECTION);
>
> + /*
> + * now that we have a valid vmemmap mapping we can use
> + * pfn_to_page and flush struct pages which back the
> + * memmap
> + */
> + if (altmap && altmap->flush_alloc_pfns)
> + altmap->flush_alloc_pfns(altmap);
> +
> section_mark_present(ms);
>
> ret = sparse_init_one_section(ms, section_nr, memmap, usemap);
I have only now realized that flush_alloc_pfns would go over the same
range of pfns for each section again and again. I hadn't noticed that
because my memblock has a single section, but larger memblocks (2GB on x86
with a lot of memory) would see the issue. So I will fold the following
into the patch.
---
commit 2658f448af25aa2d2ff7fea12e60a8fe78966f9b
Author: Michal Hocko <[email protected]>
Date: Fri Jul 28 19:45:25 2017 +0200
fold me "mm, memory_hotplug: allocate memmap from the added memory range for sparse-vmemmap"
- rewind base pfn of the vmem_altmap when flushing one section
(mark_vmemmap_pages) because we do not want to flush the same range
all over again when memblock has more than one section
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 7aec9272fe4d..55f82c652d51 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -18,7 +18,7 @@ struct device;
* is mapped to the vmemmap - see mark_vmemmap_pages
*/
struct vmem_altmap {
- const unsigned long base_pfn;
+ unsigned long base_pfn;
const unsigned long reserve;
unsigned long free;
unsigned long align;
@@ -48,6 +48,9 @@ static inline void mark_vmemmap_pages(struct vmem_altmap *self)
struct page *page = pfn_to_page(pfn);
__SetPageVmemmap(page);
}
+
+ self->alloc = 0;
+ self->base_pfn += nr_pages + self->reserve;
}
/**
* struct dev_pagemap - metadata for ZONE_DEVICE mappings
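For illustration only (my reconstruction, not part of the diff above),
the resulting helper after the fold-in roughly looks like this; advancing
base_pfn is what stops the per-section flush from re-marking the same
pages over and over:

static inline void mark_vmemmap_pages(struct vmem_altmap *self)
{
	/* loop body reconstructed for illustration */
	unsigned long pfn = self->base_pfn + self->reserve;
	unsigned long nr_pages = self->alloc;
	unsigned long i;

	/* mark only the pages handed out since the previous flush */
	for (i = 0; i < nr_pages; i++) {
		struct page *page = pfn_to_page(pfn + i);

		__SetPageVmemmap(page);
	}

	/* rewind so the next section's flush starts right behind us */
	self->alloc = 0;
	self->base_pfn += nr_pages + self->reserve;
}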
--
Michal Hocko
SUSE Labs
On Fri, 28 Jul 2017 14:19:41 +0200
Michal Hocko <[email protected]> wrote:
> On Thu 27-07-17 08:56:52, Michal Hocko wrote:
> > On Wed 26-07-17 17:06:59, Jerome Glisse wrote:
> > [...]
> > > This does not seems to be an opt-in change ie if i am reading patch 3
> > > correctly if an altmap is not provided to __add_pages() you fallback
> > > to allocating from begining of zone. This will not work with HMM ie
> > > device private memory. So at very least i would like to see some way
> > > to opt-out of this. Maybe a new argument like bool forbid_altmap ?
> >
> > OK, I see! I will think about how to make a sane api for that.
>
> This is what I came up with. s390 guys mentioned that I cannot simply
> use the new range at this stage yet. This will need probably some other
> changes but I guess we want an opt-in approach with an arch veto in general.
>
> So what do you think about the following? Only x86 is update now and I
> will split it into two parts but the idea should be clear at least.
This looks good, and the kernel will also boot again on s390 when applied
on top of the other 5 patches (plus adding the s390 part here).
Regards,
Gerald
On Wed, 26 Jul 2017 10:33:30 +0200
Michal Hocko <[email protected]> wrote:
> From: Michal Hocko <[email protected]>
>
> vmem_altmap allows vmemmap_populate to allocate memmap (struct page
> array) from an alternative allocator rather than bootmem resp.
> kmalloc. Only x86 currently supports altmap handling, most likely
> because only nvdim code uses this mechanism currently and the code
> depends on ZONE_DEVICE which is present only for x86_64. This will
> change in follow up changes so we would like other architectures
> to support it as well.
>
> Provide vmemmap_populate generic implementation which simply resolves
> altmap and then call into arch specific __vmemmap_populate.
> Architectures then only need to use __vmemmap_alloc_block_buf to
> allocate the memmap. vmemmap_free then needs to call vmem_altmap_free
> if there is any altmap associated with the address.
>
> This patch shouldn't introduce any functional changes because
> to_vmem_altmap always returns NULL on !x86_x64.
>
> Cc: Catalin Marinas <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Tony Luck <[email protected]>
> Cc: Fenghua Yu <[email protected]>
> Cc: Benjamin Herrenschmidt <[email protected]>
> Cc: Paul Mackerras <[email protected]>
> Cc: Michael Ellerman <[email protected]>
> Cc: Martin Schwidefsky <[email protected]>
> Cc: Heiko Carstens <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> arch/arm64/mm/mmu.c | 9 ++++++---
> arch/ia64/mm/discontig.c | 4 +++-
> arch/powerpc/mm/init_64.c | 29 ++++++++++++++++++++---------
> arch/s390/mm/vmem.c | 7 ++++---
> arch/sparc/mm/init_64.c | 6 +++---
> arch/x86/mm/init_64.c | 4 ++--
> include/linux/memremap.h | 13 ++-----------
> include/linux/mm.h | 19 ++++++++++++++++++-
> mm/sparse-vmemmap.c | 2 +-
> 9 files changed, 59 insertions(+), 34 deletions(-)
>
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 0c429ec6fde8..5de1161e7a1b 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -649,12 +649,15 @@ int kern_addr_valid(unsigned long addr)
> }
> #ifdef CONFIG_SPARSEMEM_VMEMMAP
> #if !ARM64_SWAPPER_USES_SECTION_MAPS
> -int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
> +int __meminit __vmemmap_populate(unsigned long start, unsigned long end, int node,
> + struct vmem_altmap *altmap)
> {
> + WARN(altmap, "altmap unsupported\n");
> return vmemmap_populate_basepages(start, end, node);
> }
> #else /* !ARM64_SWAPPER_USES_SECTION_MAPS */
> -int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
> +int __meminit __vmemmap_populate(unsigned long start, unsigned long end, int node,
> + struct vmem_altmap *altmap)
> {
> unsigned long addr = start;
> unsigned long next;
> @@ -677,7 +680,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
> if (pmd_none(*pmd)) {
> void *p = NULL;
>
> - p = vmemmap_alloc_block_buf(PMD_SIZE, node);
> + p = __vmemmap_alloc_block_buf(PMD_SIZE, node, altmap);
> if (!p)
> return -ENOMEM;
>
> diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
> index 878626805369..2a939e877ced 100644
> --- a/arch/ia64/mm/discontig.c
> +++ b/arch/ia64/mm/discontig.c
> @@ -753,8 +753,10 @@ void arch_refresh_nodedata(int update_node, pg_data_t *update_pgdat)
> #endif
>
> #ifdef CONFIG_SPARSEMEM_VMEMMAP
> -int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
> +int __meminit __vmemmap_populate(unsigned long start, unsigned long end, int node,
> + struct vmem_altmap *altmap)
> {
> + WARN(altmap, "altmap unsupported\n");
> return vmemmap_populate_basepages(start, end, node);
> }
>
> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> index ec84b31c6c86..5ea5e870a589 100644
> --- a/arch/powerpc/mm/init_64.c
> +++ b/arch/powerpc/mm/init_64.c
> @@ -44,6 +44,7 @@
> #include <linux/slab.h>
> #include <linux/of_fdt.h>
> #include <linux/libfdt.h>
> +#include <linux/memremap.h>
>
> #include <asm/pgalloc.h>
> #include <asm/page.h>
> @@ -115,7 +116,8 @@ static struct vmemmap_backing *next;
> static int num_left;
> static int num_freed;
>
> -static __meminit struct vmemmap_backing * vmemmap_list_alloc(int node)
> +static __meminit struct vmemmap_backing * vmemmap_list_alloc(int node,
> + struct vmem_altmap *altmap)
> {
> struct vmemmap_backing *vmem_back;
> /* get from freed entries first */
> @@ -129,7 +131,7 @@ static __meminit struct vmemmap_backing * vmemmap_list_alloc(int node)
>
> /* allocate a page when required and hand out chunks */
> if (!num_left) {
> - next = vmemmap_alloc_block(PAGE_SIZE, node);
> + next = __vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
> if (unlikely(!next)) {
> WARN_ON(1);
> return NULL;
> @@ -144,11 +146,12 @@ static __meminit struct vmemmap_backing * vmemmap_list_alloc(int node)
>
> static __meminit void vmemmap_list_populate(unsigned long phys,
> unsigned long start,
> - int node)
> + int node,
> + struct vmem_altmap *altmap)
> {
> struct vmemmap_backing *vmem_back;
>
> - vmem_back = vmemmap_list_alloc(node);
> + vmem_back = vmemmap_list_alloc(node, altmap);
> if (unlikely(!vmem_back)) {
> WARN_ON(1);
> return;
> @@ -161,14 +164,15 @@ static __meminit void vmemmap_list_populate(unsigned long phys,
> vmemmap_list = vmem_back;
> }
>
> -int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
> +int __meminit __vmemmap_populate(unsigned long start, unsigned long end, int node,
> + struct vmem_altmap *altmap)
> {
> unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
>
> /* Align to the page size of the linear mapping. */
> start = _ALIGN_DOWN(start, page_size);
>
> - pr_debug("vmemmap_populate %lx..%lx, node %d\n", start, end, node);
> + pr_debug("__vmemmap_populate %lx..%lx, node %d\n", start, end, node);
>
> for (; start < end; start += page_size) {
> void *p;
> @@ -177,11 +181,11 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
> if (vmemmap_populated(start, page_size))
> continue;
>
> - p = vmemmap_alloc_block(page_size, node);
> + p = __vmemmap_alloc_block_buf(page_size, node, altmap);
> if (!p)
> return -ENOMEM;
>
> - vmemmap_list_populate(__pa(p), start, node);
> + vmemmap_list_populate(__pa(p), start, node, altmap);
>
> pr_debug(" * %016lx..%016lx allocated at %p\n",
> start, start + page_size, p);
> @@ -189,7 +193,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
> rc = vmemmap_create_mapping(start, page_size, __pa(p));
> if (rc < 0) {
> pr_warning(
> - "vmemmap_populate: Unable to create vmemmap mapping: %d\n",
> + "__vmemmap_populate: Unable to create vmemmap mapping: %d\n",
> rc);
> return -EFAULT;
> }
> @@ -253,6 +257,12 @@ void __ref vmemmap_free(unsigned long start, unsigned long end)
> addr = vmemmap_list_free(start);
> if (addr) {
> struct page *page = pfn_to_page(addr >> PAGE_SHIFT);
> + struct vmem_altmap *altmap = to_vmem_altmap((unsigned long) page);
> +
> + if (altmap) {
> + vmem_altmap_free(altmap, page_size >> PAGE_SHIFT);
> + goto unmap;
> + }
>
> if (PageReserved(page)) {
> /* allocated from bootmem */
> @@ -272,6 +282,7 @@ void __ref vmemmap_free(unsigned long start, unsigned long end)
> free_pages((unsigned long)(__va(addr)),
> get_order(page_size));
>
> +unmap:
> vmemmap_remove_mapping(start, page_size);
> }
> }
> diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
> index c33c94b4be60..07120bc137a1 100644
> --- a/arch/s390/mm/vmem.c
> +++ b/arch/s390/mm/vmem.c
> @@ -208,7 +208,8 @@ static void vmem_remove_range(unsigned long start, unsigned long size)
> /*
> * Add a backed mem_map array to the virtual mem_map array.
> */
> -int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
> +int __meminit __vmemmap_populate(unsigned long start, unsigned long end, int node,
> + struct vmem_altmap *altmap)
> {
> unsigned long pgt_prot, sgt_prot;
> unsigned long address = start;
> @@ -247,12 +248,12 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
> * use large frames even if they are only partially
> * used.
> * Otherwise we would have also page tables since
> - * vmemmap_populate gets called for each section
> + * __vmemmap_populate gets called for each section
> * separately. */
> if (MACHINE_HAS_EDAT1) {
> void *new_page;
>
> - new_page = vmemmap_alloc_block(PMD_SIZE, node);
> + new_page = __vmemmap_alloc_block_buf(PMD_SIZE, node, altmap);
> if (!new_page)
> goto out;
> pmd_val(*pm_dir) = __pa(new_page) | sgt_prot;
There is another call to vmemmap_alloc_block() in this function, a couple
of lines below; this should also be replaced by __vmemmap_alloc_block_buf().
We won't be able to use this directly on s390 yet, but your work should
get us a big step closer, thanks.
Regards,
Gerald
On Mon 31-07-17 14:35:21, Gerald Schaefer wrote:
> On Fri, 28 Jul 2017 14:19:41 +0200
> Michal Hocko <[email protected]> wrote:
>
> > On Thu 27-07-17 08:56:52, Michal Hocko wrote:
> > > On Wed 26-07-17 17:06:59, Jerome Glisse wrote:
> > > [...]
> > > > This does not seems to be an opt-in change ie if i am reading patch 3
> > > > correctly if an altmap is not provided to __add_pages() you fallback
> > > > to allocating from begining of zone. This will not work with HMM ie
> > > > device private memory. So at very least i would like to see some way
> > > > to opt-out of this. Maybe a new argument like bool forbid_altmap ?
> > >
> > > OK, I see! I will think about how to make a sane api for that.
> >
> > This is what I came up with. s390 guys mentioned that I cannot simply
> > use the new range at this stage yet. This will need probably some other
> > changes but I guess we want an opt-in approach with an arch veto in general.
> >
> > So what do you think about the following? Only x86 is update now and I
> > will split it into two parts but the idea should be clear at least.
>
> This looks good, and the kernel will also boot again on s390 when applied
> on top of the other 5 patches (plus adding the s390 part here).
Thanks for testing, Gerald! I am still undecided whether the arch code
should veto MHP_RANGE_ACCESSIBLE if it cannot be supported or just set
it when it is supported. My last post did the latter, but the former
sounds like a clearer API to me. I will keep thinking about it.
Anyway, did you have any chance to consider mapping the new physical
memory range inside arch_add_memory rather than during online on s390?
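For the record, a minimal sketch of the two API variants being weighed
here (illustration only; range_is_accessible() and prepare_range() are
hypothetical placeholders, not existing arch hooks, and the MHP_* flag
macros are assumed to be parenthesized):

/* Variant A: the caller sets MHP_RANGE_ACCESSIBLE, the arch vetoes it */
int arch_add_memory_veto(int nid, u64 start, u64 size,
			 struct mhp_restrictions *restrictions)
{
	if (!range_is_accessible(start, size))	/* hypothetical arch check */
		restrictions->flags &= ~MHP_RANGE_ACCESSIBLE;

	return __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
			   restrictions);
}

/*
 * Variant B: the arch opts in only when it knows the range is usable
 * (this is what the x86 hunk in the previous mail does).
 */
int arch_add_memory_optin(int nid, u64 start, u64 size,
			  struct mhp_restrictions *restrictions)
{
	prepare_range(start, size);	/* hypothetical, e.g. init_memory_mapping() on x86 */
	restrictions->flags |= MHP_RANGE_ACCESSIBLE;

	return __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
			   restrictions);
}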
--
Michal Hocko
SUSE Labs
On Mon 31-07-17 14:40:53, Gerald Schaefer wrote:
[...]
> > @@ -247,12 +248,12 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
> > * use large frames even if they are only partially
> > * used.
> > * Otherwise we would have also page tables since
> > - * vmemmap_populate gets called for each section
> > + * __vmemmap_populate gets called for each section
> > * separately. */
> > if (MACHINE_HAS_EDAT1) {
> > void *new_page;
> >
> > - new_page = vmemmap_alloc_block(PMD_SIZE, node);
> > + new_page = __vmemmap_alloc_block_buf(PMD_SIZE, node, altmap);
> > if (!new_page)
> > goto out;
> > pmd_val(*pm_dir) = __pa(new_page) | sgt_prot;
>
> There is another call to vmemmap_alloc_block() in this function, a couple
> of lines below, this should also be replaced by __vmemmap_alloc_block_buf().
I've noticed that one, but in general I have only transformed PMD
mappings because we shouldn't even get to the pte level if the former works,
AFAICS. Memory sections should always be 2MB aligned unless I am missing
something. Or is this not true?
--
Michal Hocko
SUSE Labs
On Mon, 31 Jul 2017 14:55:56 +0200
Michal Hocko <[email protected]> wrote:
> On Mon 31-07-17 14:40:53, Gerald Schaefer wrote:
> [...]
> > > @@ -247,12 +248,12 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
> > > * use large frames even if they are only partially
> > > * used.
> > > * Otherwise we would have also page tables since
> > > - * vmemmap_populate gets called for each section
> > > + * __vmemmap_populate gets called for each section
> > > * separately. */
> > > if (MACHINE_HAS_EDAT1) {
> > > void *new_page;
> > >
> > > - new_page = vmemmap_alloc_block(PMD_SIZE, node);
> > > + new_page = __vmemmap_alloc_block_buf(PMD_SIZE, node, altmap);
> > > if (!new_page)
> > > goto out;
> > > pmd_val(*pm_dir) = __pa(new_page) | sgt_prot;
> >
> > There is another call to vmemmap_alloc_block() in this function, a couple
> > of lines below, this should also be replaced by __vmemmap_alloc_block_buf().
>
> I've noticed that one but in general I have only transformed PMD
> mappings because we shouldn't even get to pte level if the forme works
> AFAICS. Memory sections should be always 2MB aligned unless I am missing
> something. Or is this not true?
vmemmap_populate() on s390 will only stop at pmd level if we have HW
support for large pages (MACHINE_HAS_EDAT1). In that case we will allocate
a PMD_SIZE block with vmemmap_alloc_block() and map it on pmd level as
a large page.
Without HW large page support, we will continue to allocate a pte page,
populate the pmd entry with that, and fall through to the pte_none()
check below, with its PAGE_SIZE vmemmap_alloc_block() allocation. In this
case we should use the __vmemmap_alloc_block_buf().
Regards,
Gerald
On Mon 31-07-17 16:27:46, Gerald Schaefer wrote:
> On Mon, 31 Jul 2017 14:55:56 +0200
> Michal Hocko <[email protected]> wrote:
>
> > On Mon 31-07-17 14:40:53, Gerald Schaefer wrote:
> > [...]
> > > > @@ -247,12 +248,12 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
> > > > * use large frames even if they are only partially
> > > > * used.
> > > > * Otherwise we would have also page tables since
> > > > - * vmemmap_populate gets called for each section
> > > > + * __vmemmap_populate gets called for each section
> > > > * separately. */
> > > > if (MACHINE_HAS_EDAT1) {
> > > > void *new_page;
> > > >
> > > > - new_page = vmemmap_alloc_block(PMD_SIZE, node);
> > > > + new_page = __vmemmap_alloc_block_buf(PMD_SIZE, node, altmap);
> > > > if (!new_page)
> > > > goto out;
> > > > pmd_val(*pm_dir) = __pa(new_page) | sgt_prot;
> > >
> > > There is another call to vmemmap_alloc_block() in this function, a couple
> > > of lines below, this should also be replaced by __vmemmap_alloc_block_buf().
> >
> > I've noticed that one but in general I have only transformed PMD
> > mappings because we shouldn't even get to pte level if the forme works
> > AFAICS. Memory sections should be always 2MB aligned unless I am missing
> > something. Or is this not true?
>
> vmemmap_populate() on s390 will only stop at pmd level if we have HW
> support for large pages (MACHINE_HAS_EDAT1). In that case we will allocate
> a PMD_SIZE block with vmemmap_alloc_block() and map it on pmd level as
> a large page.
>
> Without HW large page support, we will continue to allocate a pte page,
> populate the pmd entry with that, and fall through to the pte_none()
> check below, with its PAGE_SIZE vmemmap_alloc_block() allocation. In this
> case we should use the __vmemmap_alloc_block_buf().
OK, I see. I had assumed that s390 supports large pages in general. I
will fold this in. Thanks!
---
commit df13e3a1237c3fef399e26b0f5a015715df12ede
Author: Michal Hocko <[email protected]>
Date: Mon Jul 31 16:34:18 2017 +0200
fold me "mm, arch: unify vmemmap_populate altmap handling"
- use altmap even for ptes in case the HW doesn't support large pages
as per Gerald Schaefer
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index 07120bc137a1..764b6393e66c 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -273,7 +273,7 @@ int __meminit __vmemmap_populate(unsigned long start, unsigned long end, int nod
if (pte_none(*pt_dir)) {
void *new_page;
- new_page = vmemmap_alloc_block(PAGE_SIZE, node);
+ new_page = __vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
if (!new_page)
goto out;
pte_val(*pt_dir) = __pa(new_page) | pgt_prot;
--
Michal Hocko
SUSE Labs
On Mon, 31 Jul 2017 14:53:19 +0200
Michal Hocko <[email protected]> wrote:
> On Mon 31-07-17 14:35:21, Gerald Schaefer wrote:
> > On Fri, 28 Jul 2017 14:19:41 +0200
> > Michal Hocko <[email protected]> wrote:
> >
> > > On Thu 27-07-17 08:56:52, Michal Hocko wrote:
> > > > On Wed 26-07-17 17:06:59, Jerome Glisse wrote:
> > > > [...]
> > > > > This does not seems to be an opt-in change ie if i am reading patch 3
> > > > > correctly if an altmap is not provided to __add_pages() you fallback
> > > > > to allocating from begining of zone. This will not work with HMM ie
> > > > > device private memory. So at very least i would like to see some way
> > > > > to opt-out of this. Maybe a new argument like bool forbid_altmap ?
> > > >
> > > > OK, I see! I will think about how to make a sane api for that.
> > >
> > > This is what I came up with. s390 guys mentioned that I cannot simply
> > > use the new range at this stage yet. This will need probably some other
> > > changes but I guess we want an opt-in approach with an arch veto in general.
> > >
> > > So what do you think about the following? Only x86 is update now and I
> > > will split it into two parts but the idea should be clear at least.
> >
> > This looks good, and the kernel will also boot again on s390 when applied
> > on top of the other 5 patches (plus adding the s390 part here).
>
> Thanks for testing Gerald! I am still undecided whether the arch code
> should veto MHP_RANGE_ACCESSIBLE if it cannot be supported or just set
> it when it is supported. My last post did the later but the first one
> sounds like a more clear API to me. I will keep thinking about it.
>
> Anyway, did you have any chance to consider mapping the new physical
> memory range inside arch_add_memory rather than during online on s390?
Well, it still looks like we cannot do w/o splitting up add_memory():
1) (only) set up section map during our initial memory detection, w/o
allocating struct pages, so that the sysfs entries get created also for
our offline memory (or else we have no way to online it later)
2) set up vmemmap and allocate struct pages with your new altmap approach
during our MEM_GOING_ONLINE callback, because only now the memory is really
accessible
Besides the obvious problem that this would need a new interface, there is
also the problem that (at least) show_valid_zones() in drivers/base/memory.c
operates on struct pages from _offline_ memory, for its page_zone() checks.
This will not work well if we have no struct pages for offline memory ...
BTW, the latter may also be an issue with your rework on any architecture.
Not sure if I understood it correctly, but the situation on s390 (i.e.
having offline memory blocks visible in sysfs) should be similar to
the scenario on x86, when you plug in memory, set it online in the acpi
handler, and then manually set it offline again via sysfs. Now the
memory is still visible in sysfs, and reading the valid_zones attribute
will trigger an access to struct pages for that memory. What if this
memory is now physically removed, in a race with such a struct page
access? Before your patch that would probably be OK, because the struct
pages were not part of the removed memory, but now you should get an
addressing exception, right?
Regards,
Gerald
On Mon 31-07-17 17:04:59, Gerald Schaefer wrote:
> On Mon, 31 Jul 2017 14:53:19 +0200
> Michal Hocko <[email protected]> wrote:
>
> > On Mon 31-07-17 14:35:21, Gerald Schaefer wrote:
> > > On Fri, 28 Jul 2017 14:19:41 +0200
> > > Michal Hocko <[email protected]> wrote:
> > >
> > > > On Thu 27-07-17 08:56:52, Michal Hocko wrote:
> > > > > On Wed 26-07-17 17:06:59, Jerome Glisse wrote:
> > > > > [...]
> > > > > > This does not seems to be an opt-in change ie if i am reading patch 3
> > > > > > correctly if an altmap is not provided to __add_pages() you fallback
> > > > > > to allocating from begining of zone. This will not work with HMM ie
> > > > > > device private memory. So at very least i would like to see some way
> > > > > > to opt-out of this. Maybe a new argument like bool forbid_altmap ?
> > > > >
> > > > > OK, I see! I will think about how to make a sane api for that.
> > > >
> > > > This is what I came up with. s390 guys mentioned that I cannot simply
> > > > use the new range at this stage yet. This will need probably some other
> > > > changes but I guess we want an opt-in approach with an arch veto in general.
> > > >
> > > > So what do you think about the following? Only x86 is update now and I
> > > > will split it into two parts but the idea should be clear at least.
> > >
> > > This looks good, and the kernel will also boot again on s390 when applied
> > > on top of the other 5 patches (plus adding the s390 part here).
> >
> > Thanks for testing Gerald! I am still undecided whether the arch code
> > should veto MHP_RANGE_ACCESSIBLE if it cannot be supported or just set
> > it when it is supported. My last post did the later but the first one
> > sounds like a more clear API to me. I will keep thinking about it.
> >
> > Anyway, did you have any chance to consider mapping the new physical
> > memory range inside arch_add_memory rather than during online on s390?
>
> Well, it still looks like we cannot do w/o splitting up add_memory():
> 1) (only) set up section map during our initial memory detection, w/o
> allocating struct pages, so that the sysfs entries get created also for
> our offline memory (or else we have no way to online it later)
> 2) set up vmemmap and allocate struct pages with your new altmap approach
> during our MEM_GOING_ONLINE callback, because only now the memory is really
> accessible
As I've tried to mention in my other response, this is not possible
because there are memory hotplug usecases which never do an explicit
online.
I am sorry to ask again, but why exactly can't we make the range
accessible from arch_add_memory on s390?
> Besides the obvious problem that this would need a new interface, there is
> also the problem that (at least) show_valid_zones() in drivers/base/memory.c
> operates on struct pages from _offline_ memory, for its page_zone() checks.
> This will not work well if we have no struct pages for offline memory ...
Yes.
> BTW, the latter may also be a issue with your rework on any architecture.
> Not sure if I understood it correctly, but the situation on s390 (i.e.
> having offline memory blocks visible in sysfs) should be similar to
> the scenario on x86, when you plug in memory, set it online in the acpi
> handler, and then manually set it offline again via sysfs. Now the
> memory is still visible in sysfs, and reading the valid_zones attribute
> will trigger an access to struct pages for that memory. What if this
> memory is now physically removed, in a race with such a struct page
> access?
The memmap goes away together with the whole section tear down. And we
shouldn't have any users of any struct page by that time. Memblock sysfs
should be down as well. I will go and double check whether there are any
possible races.
--
Michal Hocko
SUSE Labs
On Mon, 31 Jul 2017 17:53:50 +0200
Michal Hocko <[email protected]> wrote:
> On Mon 31-07-17 17:04:59, Gerald Schaefer wrote:
> > On Mon, 31 Jul 2017 14:53:19 +0200
> > Michal Hocko <[email protected]> wrote:
> >
> > > On Mon 31-07-17 14:35:21, Gerald Schaefer wrote:
> > > > On Fri, 28 Jul 2017 14:19:41 +0200
> > > > Michal Hocko <[email protected]> wrote:
> > > >
> > > > > On Thu 27-07-17 08:56:52, Michal Hocko wrote:
> > > > > > On Wed 26-07-17 17:06:59, Jerome Glisse wrote:
> > > > > > [...]
> > > > > > > This does not seems to be an opt-in change ie if i am reading patch 3
> > > > > > > correctly if an altmap is not provided to __add_pages() you fallback
> > > > > > > to allocating from begining of zone. This will not work with HMM ie
> > > > > > > device private memory. So at very least i would like to see some way
> > > > > > > to opt-out of this. Maybe a new argument like bool forbid_altmap ?
> > > > > >
> > > > > > OK, I see! I will think about how to make a sane api for that.
> > > > >
> > > > > This is what I came up with. s390 guys mentioned that I cannot simply
> > > > > use the new range at this stage yet. This will need probably some other
> > > > > changes but I guess we want an opt-in approach with an arch veto in general.
> > > > >
> > > > > So what do you think about the following? Only x86 is update now and I
> > > > > will split it into two parts but the idea should be clear at least.
> > > >
> > > > This looks good, and the kernel will also boot again on s390 when applied
> > > > on top of the other 5 patches (plus adding the s390 part here).
> > >
> > > Thanks for testing Gerald! I am still undecided whether the arch code
> > > should veto MHP_RANGE_ACCESSIBLE if it cannot be supported or just set
> > > it when it is supported. My last post did the later but the first one
> > > sounds like a more clear API to me. I will keep thinking about it.
> > >
> > > Anyway, did you have any chance to consider mapping the new physical
> > > memory range inside arch_add_memory rather than during online on s390?
> >
> > Well, it still looks like we cannot do w/o splitting up add_memory():
> > 1) (only) set up section map during our initial memory detection, w/o
> > allocating struct pages, so that the sysfs entries get created also for
> > our offline memory (or else we have no way to online it later)
> > 2) set up vmemmap and allocate struct pages with your new altmap approach
> > during our MEM_GOING_ONLINE callback, because only now the memory is really
> > accessible
>
> As I've tried to mentioned in my other response. This is not possible
> because there are memory hotplug usecases which never do an explicit
> online.
Of course the default behaviour should not change; we only need an option
to do the "2-stage approach". E.g. we would call __add_pages() from our
MEM_GOING_ONLINE handler, and not from arch_add_memory() as before, but
then we would need some way to add memory sections (for generating sysfs
memory blocks) only, without allocating struct pages. See also below and
the rough sketch that follows.
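A very rough sketch of what such a 2-stage notifier could look like
(hypothetical; the real s390 notifier in drivers/s390/char/sclp_cmd.c does
not work like this today, and the nid and locking details are glossed
over):

static int sclp_mem_notifier_2stage(struct notifier_block *nb,
				    unsigned long action, void *data)
{
	struct memory_notify *arg = data;
	struct mhp_restrictions restrictions = {
		/* the increment is backed by now, so the memmap can be
		 * allocated from the added range itself */
		.flags = MHP_RANGE_ACCESSIBLE,
	};
	int rc = 0;

	switch (action) {
	case MEM_GOING_ONLINE:
		/* stage 2: build the memmap only now */
		rc = __add_pages(0 /* nid */, arg->start_pfn, arg->nr_pages,
				 &restrictions);
		break;
	default:
		break;
	}

	return notifier_from_errno(rc);
}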
>
> I am sorry to ask again. But why exactly cannot we make the range
> accessible from arch_add_memory on s390?
We have no acpi or other events to indicate new memory; both online and
offline memory need to be (hypervisor) defined upfront, and then we want
to be able to use memory hotplug for ballooning during runtime.
Making the range accessible is equivalent to a hypervisor call that assigns
the memory to the guest. The problem with arch_add_memory() is now that
this gets called from add_memory(), which we call during initial memory
detection for the offline memory ranges. At that time, assigning all
offline memory to the guest, and thus making it accessible, would break
the ballooning usecase (even if it is still offline in Linux, the
hypervisor could not use it for other guests any more).
The main difference to other architectures is that we can not simply
call add_memory() (and thus arch_add_memory()) at the time when the
offline memory is actually supposed to get online (e.g. triggered by acpi).
We rather need to somehow make sure that the offline memory is detected
early, and sysfs entries are created for it, so that it can be set online
later on demand.
Maybe our design to use add_memory() for offline ranges during memory
detection was wrong, or overkill, since we actually only need to establish
a memory section, if I understood the sysfs code right. But I currently
see no other way to make sure that we get the sysfs attributes. And of
course the presence of users that work on offline struct pages, like
valid_zones, is also not helpful.
>
> > Besides the obvious problem that this would need a new interface, there is
> > also the problem that (at least) show_valid_zones() in drivers/base/memory.c
> > operates on struct pages from _offline_ memory, for its page_zone() checks.
> > This will not work well if we have no struct pages for offline memory ...
>
> Yes.
>
> > BTW, the latter may also be a issue with your rework on any architecture.
> > Not sure if I understood it correctly, but the situation on s390 (i.e.
> > having offline memory blocks visible in sysfs) should be similar to
> > the scenario on x86, when you plug in memory, set it online in the acpi
> > handler, and then manually set it offline again via sysfs. Now the
> > memory is still visible in sysfs, and reading the valid_zones attribute
> > will trigger an access to struct pages for that memory. What if this
> > memory is now physically removed, in a race with such a struct page
> > access?
>
> The memmap goes away together with the whole section tear down. And we
> shouldn't have any users of any struct page by that time. Memblock sysfs
> should be down as well. I will go and double check whether there are any
> possible races.
I was thinking of someone pulling out a DIMM whose range was (manually)
set offline before. It looks like (arch_)remove_memory() is not triggered
directly on setting it offline, but rather by an acpi event, probably after
physical memory removal. And that would mean that a user could just read
sysfs valid_zones in a loop, after setting it offline and before the
physical removal, thereby accessing struct pages in the offline range,
which would then race with the physical DIMM removal.
However, as you can see, s390 memory hotplug works in a special way,
so I may have gotten the wrong picture of how it works on "normal"
architectures :-)
Regards,
Gerald
On Mon, 31 Jul 2017 19:58:30 +0200
Gerald Schaefer <[email protected]> wrote:
> On Mon, 31 Jul 2017 17:53:50 +0200
> Michal Hocko <[email protected]> wrote:
>
> > On Mon 31-07-17 17:04:59, Gerald Schaefer wrote:
> > > On Mon, 31 Jul 2017 14:53:19 +0200
> > > Michal Hocko <[email protected]> wrote:
> > >
> > > > On Mon 31-07-17 14:35:21, Gerald Schaefer wrote:
> > > > > On Fri, 28 Jul 2017 14:19:41 +0200
> > > > > Michal Hocko <[email protected]> wrote:
> > > > >
> > > > > > On Thu 27-07-17 08:56:52, Michal Hocko wrote:
> > > > > > > On Wed 26-07-17 17:06:59, Jerome Glisse wrote:
> > > > > > > [...]
> > > > > > > > This does not seems to be an opt-in change ie if i am reading patch 3
> > > > > > > > correctly if an altmap is not provided to __add_pages() you fallback
> > > > > > > > to allocating from begining of zone. This will not work with HMM ie
> > > > > > > > device private memory. So at very least i would like to see some way
> > > > > > > > to opt-out of this. Maybe a new argument like bool forbid_altmap ?
> > > > > > >
> > > > > > > OK, I see! I will think about how to make a sane api for that.
> > > > > >
> > > > > > This is what I came up with. s390 guys mentioned that I cannot simply
> > > > > > use the new range at this stage yet. This will need probably some other
> > > > > > changes but I guess we want an opt-in approach with an arch veto in general.
> > > > > >
> > > > > > So what do you think about the following? Only x86 is update now and I
> > > > > > will split it into two parts but the idea should be clear at least.
> > > > >
> > > > > This looks good, and the kernel will also boot again on s390 when applied
> > > > > on top of the other 5 patches (plus adding the s390 part here).
> > > >
> > > > Thanks for testing Gerald! I am still undecided whether the arch code
> > > > should veto MHP_RANGE_ACCESSIBLE if it cannot be supported or just set
> > > > it when it is supported. My last post did the later but the first one
> > > > sounds like a more clear API to me. I will keep thinking about it.
> > > >
> > > > Anyway, did you have any chance to consider mapping the new physical
> > > > memory range inside arch_add_memory rather than during online on s390?
> > >
> > > Well, it still looks like we cannot do w/o splitting up add_memory():
> > > 1) (only) set up section map during our initial memory detection, w/o
> > > allocating struct pages, so that the sysfs entries get created also for
> > > our offline memory (or else we have no way to online it later)
> > > 2) set up vmemmap and allocate struct pages with your new altmap approach
> > > during our MEM_GOING_ONLINE callback, because only now the memory is really
> > > accessible
> >
> > As I've tried to mentioned in my other response. This is not possible
> > because there are memory hotplug usecases which never do an explicit
> > online.
>
> Of course the default behaviour should not change, we only need an option
> to do the "2-stage-approach". E.g. we would call __add_pages() from our
> MEM_GOING_ONLINE handler, and not from arch_add_memory() as before, but
> then we would need some way to add memory sections (for generating sysfs
> memory blocks) only, without allocating struct pages. See also below.
>
> >
> > I am sorry to ask again. But why exactly cannot we make the range
> > accessible from arch_add_memory on s390?
>
> We have no acpi or other events to indicate new memory, both online and
> offline memory needs to be (hypervisor) defined upfront, and then we want
> to be able to use memory hotplug for ballooning during runtime.
>
> Making the range accessible is equivalent to a hypervisor call that assigns
> the memory to the guest. The problem with arch_add_memory() is now that
> this gets called from add_memory(), which we call during initial memory
> detection for the offline memory ranges. At that time, assigning all
> offline memory to the guest, and thus making it accessible, would break
> the ballooning usecase (even if it is still offline in Linux, the
> hypervisor could not use it for other guests any more).
>
> The main difference to other architectures is that we can not simply
> call add_memory() (and thus arch_add_memory()) at the time when the
> offline memory is actually supposed to get online (e.g. triggered by acpi).
> We rather need to somehow make sure that the offline memory is detected
> early, and sysfs entries are created for it, so that it can be set online
> later on demand.
>
> Maybe our design to use add_memory() for offline ranges during memory
> detection was wrong, or overkill, since we actually only need to establish
> a memory section, if I understood the sysfs code right. But I currently
> see no other way to make sure that we get the sysfs attributes. And of
> course the presence of users that work on offline struct pages, like
> valid_zones, is also not helpful.
>
> >
> > > Besides the obvious problem that this would need a new interface, there is
> > > also the problem that (at least) show_valid_zones() in drivers/base/memory.c
> > > operates on struct pages from _offline_ memory, for its page_zone() checks.
> > > This will not work well if we have no struct pages for offline memory ...
> >
> > Yes.
> >
> > > BTW, the latter may also be a issue with your rework on any architecture.
> > > Not sure if I understood it correctly, but the situation on s390 (i.e.
> > > having offline memory blocks visible in sysfs) should be similar to
> > > the scenario on x86, when you plug in memory, set it online in the acpi
> > > handler, and then manually set it offline again via sysfs. Now the
> > > memory is still visible in sysfs, and reading the valid_zones attribute
> > > will trigger an access to struct pages for that memory. What if this
> > > memory is now physically removed, in a race with such a struct page
> > > access?
> >
> > The memmap goes away together with the whole section tear down. And we
> > shouldn't have any users of any struct page by that time. Memblock sysfs
> > should be down as well. I will go and double check whether there are any
> > possible races.
>
> I was thinking of someone pulling out a DIMM whose range was (manually)
> set offline before. It looks like (arch_)remove_memory() is not triggered
> directly on setting it offline, but rather by an acpi event, probably after
> physical memory removal.
The order here is a little bit different: first an ACPI event is sent and
processed by the kernel (including removing the ranges from sysfs), and only
then does the kernel call the ACPI _EJ0 method on the DIMM device, at which
point there shouldn't be any users left nor any races.
> And that would mean that a user could just read
> sysfs valid_zones in a loop, after setting it offline and before the
> physical removal, thereby accessing struct pages in the offline range,
> which would then race with the physical DIMM removal.
>
> However, as you can see, s390 memory hotplug works in a special way,
> so I may have gotten the wrong picture of how it works on "normal"
> architectures :-)
>
> Regards,
> Gerald
>
On Mon 31-07-17 19:58:30, Gerald Schaefer wrote:
> On Mon, 31 Jul 2017 17:53:50 +0200
> Michal Hocko <[email protected]> wrote:
>
> > On Mon 31-07-17 17:04:59, Gerald Schaefer wrote:
[...]
> > > Well, it still looks like we cannot do w/o splitting up add_memory():
> > > 1) (only) set up section map during our initial memory detection, w/o
> > > allocating struct pages, so that the sysfs entries get created also for
> > > our offline memory (or else we have no way to online it later)
> > > 2) set up vmemmap and allocate struct pages with your new altmap approach
> > > during our MEM_GOING_ONLINE callback, because only now the memory is really
> > > accessible
> >
> > As I've tried to mentioned in my other response. This is not possible
> > because there are memory hotplug usecases which never do an explicit
> > online.
>
> Of course the default behaviour should not change, we only need an option
> to do the "2-stage-approach". E.g. we would call __add_pages() from our
> MEM_GOING_ONLINE handler, and not from arch_add_memory() as before, but
> then we would need some way to add memory sections (for generating sysfs
> memory blocks) only, without allocating struct pages. See also below.
I would have to check that more deeply. I am not sure whether some parts of
the memblock infrastructure depend on struct page existence.
> > I am sorry to ask again. But why exactly cannot we make the range
> > accessible from arch_add_memory on s390?
>
> We have no acpi or other events to indicate new memory, both online and
> offline memory needs to be (hypervisor) defined upfront, and then we want
> to be able to use memory hotplug for ballooning during runtime.
>
> Making the range accessible is equivalent to a hypervisor call that assigns
> the memory to the guest. The problem with arch_add_memory() is now that
> this gets called from add_memory(), which we call during initial memory
> detection for the offline memory ranges. At that time, assigning all
> offline memory to the guest, and thus making it accessible, would break
> the ballooning usecase (even if it is still offline in Linux, the
> hypervisor could not use it for other guests any more).
OK, I guess I see your point. Thanks for the clarification. I will try
to think about this limitation, but I will likely simply disable the
feature for the initial inclusion. s390 can be done later.
> The main difference to other architectures is that we can not simply
> call add_memory() (and thus arch_add_memory()) at the time when the
> offline memory is actually supposed to get online (e.g. triggered by acpi).
> We rather need to somehow make sure that the offline memory is detected
> early, and sysfs entries are created for it, so that it can be set online
> later on demand.
>
> Maybe our design to use add_memory() for offline ranges during memory
> detection was wrong, or overkill, since we actually only need to establish
> a memory section, if I understood the sysfs code right. But I currently
> see no other way to make sure that we get the sysfs attributes. And of
> course the presence of users that work on offline struct pages, like
> valid_zones, is also not helpful.
Yeah, but I suspect that we can make the whole memory hotplug sysfs API
independent of memory sections and struct page states. I am adding this
to my todo list.
--
Michal Hocko
SUSE Labs