Changes since v4 [1]:
- Given that v4 was from March of 2017, the bulk of the changes result from
rebasing the patch set from a v4.11-rc2 baseline to v5.1-rc1.
- A unit test is added to ndctl to exercise the creation and dax
mounting of multiple independent namespaces in a single 128M section.
[1]: https://lwn.net/Articles/717383/
---
Quoting patch 7:
"The libnvdimm sub-system has suffered a series of hacks and broken
workarounds for the memory-hotplug implementation's awkward
section-aligned (128MB) granularity. For example the following backtrace
is emitted when attempting arch_add_memory() with physical address
ranges that intersect 'System RAM' (RAM) with 'Persistent Memory' (PMEM)
within a given section:
WARNING: CPU: 0 PID: 558 at kernel/memremap.c:300 devm_memremap_pages+0x3b5/0x4c0
devm_memremap_pages attempted on mixed region [mem 0x200000000-0x2fbffffff flags 0x200]
[..]
Call Trace:
dump_stack+0x86/0xc3
__warn+0xcb/0xf0
warn_slowpath_fmt+0x5f/0x80
devm_memremap_pages+0x3b5/0x4c0
__wrap_devm_memremap_pages+0x58/0x70 [nfit_test_iomap]
pmem_attach_disk+0x19a/0x440 [nd_pmem]
Recently it was discovered that the problem goes beyond RAM vs PMEM
collisions as some platforms produce PMEM vs PMEM collisions within a
given section. The libnvdimm workaround for that case revealed that the
libnvdimm section-alignment-padding implementation has been broken for a
long while. A fix for that long-standing breakage introduces as many
problems as it solves, as it would require a backward-incompatible change
to the namespace metadata interpretation. Instead of that dubious route
[2], address the root problem in the memory-hotplug implementation."
The approach taken is to observe that each section already maintains
an array of 'unsigned long' values to hold the pageblock_flags. A single
additional 'unsigned long' is added to house a 'sub-section active'
bitmask. Each bit tracks the mapped state of one sub-section's worth of
capacity, which is SECTION_SIZE / BITS_PER_LONG, or 2MB on x86-64.
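For reference, a minimal sketch of that arithmetic (illustrative only; it
assumes the x86-64 values SECTION_SIZE_BITS == 27 and BITS_PER_LONG == 64
and mirrors the SECTION_ACTIVE_SIZE definition added later in the series):
	/* illustrative only: one 'map_active' bit covers 2MB of a 128MB section */
	#define EX_SECTION_SIZE		(1UL << 27)		/* 128MB */
	#define EX_SUB_SECTION_SIZE	(EX_SECTION_SIZE / 64)	/* 2MB per bit */
	#define EX_SUB_SECTION_MASK	(~(EX_SUB_SECTION_SIZE - 1))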
The implication of allowing sections to be piecemeal mapped/unmapped is
that the valid_section() helper is no longer authoritative for determining
whether a section is fully mapped. Instead, pfn_valid() is updated to consult
the section-active bitmask. Given that typical memory hotplug still has
deep "section" dependencies, the sub-section capability is limited to
'want_memblock=false' invocations of arch_add_memory(), effectively only
devm_memremap_pages() users for now.
With this in place, the hacks in the libnvdimm sub-system can be
dropped, and other devm_memremap_pages() users need no longer be
constrained to 128MB mapping granularity.
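As a hedged illustration of the end state (a sketch only; 'ref' and 'kill'
stand for the caller-provided percpu_ref and shutdown callback that
devm_memremap_pages() already requires, and the 2MB figure assumes x86-64):
	/*
	 * Illustrative only: a ZONE_DEVICE user may now remap a range that is
	 * merely sub-section (2MB) aligned, even if it shares its 128MB
	 * section with System RAM or with another PMEM namespace.
	 */
	pgmap->res = *res;	/* 2MB aligned; need not be 128MB aligned */
	pgmap->ref = ref;	/* caller-provided struct percpu_ref */
	pgmap->kill = kill;	/* caller-provided shutdown callback */
	addr = devm_memremap_pages(dev, pgmap);
	if (IS_ERR(addr))
		return PTR_ERR(addr);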
[2]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com
---
Dan Williams (10):
mm/sparsemem: Introduce struct mem_section_usage
mm/sparsemem: Introduce common definitions for the size and mask of a section
mm/sparsemem: Add helpers track active portions of a section at boot
mm/hotplug: Prepare shrink_{zone,pgdat}_span for sub-section removal
mm/sparsemem: Convert kmalloc_section_memmap() to populate_section_memmap()
mm/sparsemem: Prepare for sub-section ranges
mm/sparsemem: Support sub-section hotplug
mm/devm_memremap_pages: Enable sub-section remap
libnvdimm/pfn: Fix fsdax-mode namespace info-block zero-fields
libnvdimm/pfn: Stop padding pmem namespaces to section alignment
arch/x86/mm/init_64.c | 15 +-
drivers/nvdimm/dax_devs.c | 2
drivers/nvdimm/pfn.h | 12 -
drivers/nvdimm/pfn_devs.c | 93 +++-------
include/linux/memory_hotplug.h | 7 -
include/linux/mm.h | 4
include/linux/mmzone.h | 60 ++++++
kernel/memremap.c | 57 ++----
mm/hmm.c | 2
mm/memory_hotplug.c | 119 +++++++-----
mm/page_alloc.c | 6 -
mm/sparse-vmemmap.c | 21 +-
mm/sparse.c | 382 ++++++++++++++++++++++++++++------------
13 files changed, 476 insertions(+), 304 deletions(-)
Up-level the local section size and mask from kernel/memremap.c to
global definitions. These will be used by the new sub-section hotplug
support.
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: Logan Gunthorpe <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
include/linux/mmzone.h | 2 ++
kernel/memremap.c | 10 ++++------
mm/hmm.c | 2 --
3 files changed, 6 insertions(+), 8 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 151dd7327e0b..69b9cb9cb2ed 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1081,6 +1081,8 @@ static inline unsigned long early_pfn_to_nid(unsigned long pfn)
* PFN_SECTION_SHIFT pfn to/from section number
*/
#define PA_SECTION_SHIFT (SECTION_SIZE_BITS)
+#define PA_SECTION_SIZE (1UL << PA_SECTION_SHIFT)
+#define PA_SECTION_MASK (~(PA_SECTION_SIZE-1))
#define PFN_SECTION_SHIFT (SECTION_SIZE_BITS - PAGE_SHIFT)
#define NR_MEM_SECTIONS (1UL << SECTIONS_SHIFT)
diff --git a/kernel/memremap.c b/kernel/memremap.c
index a856cb5ff192..dda1367b385d 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -14,8 +14,6 @@
#include <linux/hmm.h>
static DEFINE_XARRAY(pgmap_array);
-#define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1)
-#define SECTION_SIZE (1UL << PA_SECTION_SHIFT)
#if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
vm_fault_t device_private_entry_fault(struct vm_area_struct *vma,
@@ -98,8 +96,8 @@ static void devm_memremap_pages_release(void *data)
put_page(pfn_to_page(pfn));
/* pages are dead and unused, undo the arch mapping */
- align_start = res->start & ~(SECTION_SIZE - 1);
- align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE)
+ align_start = res->start & ~(PA_SECTION_SIZE - 1);
+ align_size = ALIGN(res->start + resource_size(res), PA_SECTION_SIZE)
- align_start;
nid = page_to_nid(pfn_to_page(align_start >> PAGE_SHIFT));
@@ -154,8 +152,8 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
if (!pgmap->ref || !pgmap->kill)
return ERR_PTR(-EINVAL);
- align_start = res->start & ~(SECTION_SIZE - 1);
- align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE)
+ align_start = res->start & ~(PA_SECTION_SIZE - 1);
+ align_size = ALIGN(res->start + resource_size(res), PA_SECTION_SIZE)
- align_start;
align_end = align_start + align_size - 1;
diff --git a/mm/hmm.c b/mm/hmm.c
index fe1cd87e49ac..ef9e4e6c9f92 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -33,8 +33,6 @@
#include <linux/mmu_notifier.h>
#include <linux/memory_hotplug.h>
-#define PA_SECTION_SIZE (1UL << PA_SECTION_SHIFT)
-
#if IS_ENABLED(CONFIG_HMM_MIRROR)
static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
Towards enabling memory hotplug to track partial population of a
section, introduce 'struct mem_section_usage'.
A pointer to a 'struct mem_section_usage' instance replaces the existing
pointer to a 'pageblock_flags' bitmap. Effectively it adds one more
'unsigned long' beyond the 'pageblock_flags' (usemap) allocation to
house a new 'map_active' bitmap. The new bitmap enables the memory
hot{plug,remove} implementation to act on incremental sub-divisions of a
section.
The primary motivation for this functionality is to support platforms
that mix "System RAM" and "Persistent Memory" within a single section,
or multiple PMEM ranges with different mapping lifetimes within a single
section. The section restriction for hotplug has caused an ongoing saga
of hacks and bugs for devm_memremap_pages() users.
Beyond the fixups to teach existing paths how to retrieve the 'usemap'
from a section, and updates to the usemap allocation path, there are no
expected behavior changes.
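For a sense of scale, a back-of-the-envelope size check (illustrative; it
assumes the common x86-64 configuration of 128MB sections, 2MB pageblocks,
and NR_PAGEBLOCK_BITS == 4):
	/*
	 * usemap_size()            == BITS_TO_LONGS(64 pageblocks * 4 bits) * 8
	 *                          == 32 bytes
	 * mem_section_usage_size() == sizeof(struct mem_section_usage) + 32
	 *                          == 8 ('map_active') + 32 == 40 bytes/section
	 */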
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Logan Gunthorpe <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
include/linux/mmzone.h | 23 ++++++++++++--
mm/memory_hotplug.c | 18 ++++++-----
mm/page_alloc.c | 2 +
mm/sparse.c | 81 ++++++++++++++++++++++++------------------------
4 files changed, 71 insertions(+), 53 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fba7741533be..151dd7327e0b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1107,6 +1107,19 @@ static inline unsigned long section_nr_to_pfn(unsigned long sec)
#define SECTION_ALIGN_UP(pfn) (((pfn) + PAGES_PER_SECTION - 1) & PAGE_SECTION_MASK)
#define SECTION_ALIGN_DOWN(pfn) ((pfn) & PAGE_SECTION_MASK)
+#define SECTION_ACTIVE_SIZE ((1UL << SECTION_SIZE_BITS) / BITS_PER_LONG)
+#define SECTION_ACTIVE_MASK (~(SECTION_ACTIVE_SIZE - 1))
+
+struct mem_section_usage {
+ /*
+ * SECTION_ACTIVE_SIZE portions of the section that are populated in
+ * the memmap
+ */
+ unsigned long map_active;
+ /* See declaration of similar field in struct zone */
+ unsigned long pageblock_flags[0];
+};
+
struct page;
struct page_ext;
struct mem_section {
@@ -1124,8 +1137,7 @@ struct mem_section {
*/
unsigned long section_mem_map;
- /* See declaration of similar field in struct zone */
- unsigned long *pageblock_flags;
+ struct mem_section_usage *usage;
#ifdef CONFIG_PAGE_EXTENSION
/*
* If SPARSEMEM, pgdat doesn't have page_ext pointer. We use
@@ -1156,6 +1168,11 @@ extern struct mem_section **mem_section;
extern struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT];
#endif
+static inline unsigned long *section_to_usemap(struct mem_section *ms)
+{
+ return ms->usage->pageblock_flags;
+}
+
static inline struct mem_section *__nr_to_section(unsigned long nr)
{
#ifdef CONFIG_SPARSEMEM_EXTREME
@@ -1167,7 +1184,7 @@ static inline struct mem_section *__nr_to_section(unsigned long nr)
return &mem_section[SECTION_NR_TO_ROOT(nr)][nr & SECTION_ROOT_MASK];
}
extern int __section_nr(struct mem_section* ms);
-extern unsigned long usemap_size(void);
+extern size_t mem_section_usage_size(void);
/*
* We use the lower bits of the mem_map pointer to store
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index f767582af4f8..2541a3a15854 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -164,9 +164,10 @@ void put_page_bootmem(struct page *page)
#ifndef CONFIG_SPARSEMEM_VMEMMAP
static void register_page_bootmem_info_section(unsigned long start_pfn)
{
- unsigned long *usemap, mapsize, section_nr, i;
+ unsigned long mapsize, section_nr, i;
struct mem_section *ms;
struct page *page, *memmap;
+ struct mem_section_usage *usage;
section_nr = pfn_to_section_nr(start_pfn);
ms = __nr_to_section(section_nr);
@@ -186,10 +187,10 @@ static void register_page_bootmem_info_section(unsigned long start_pfn)
for (i = 0; i < mapsize; i++, page++)
get_page_bootmem(section_nr, page, SECTION_INFO);
- usemap = ms->pageblock_flags;
- page = virt_to_page(usemap);
+ usage = ms->usage;
+ page = virt_to_page(usage);
- mapsize = PAGE_ALIGN(usemap_size()) >> PAGE_SHIFT;
+ mapsize = PAGE_ALIGN(mem_section_usage_size()) >> PAGE_SHIFT;
for (i = 0; i < mapsize; i++, page++)
get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
@@ -198,9 +199,10 @@ static void register_page_bootmem_info_section(unsigned long start_pfn)
#else /* CONFIG_SPARSEMEM_VMEMMAP */
static void register_page_bootmem_info_section(unsigned long start_pfn)
{
- unsigned long *usemap, mapsize, section_nr, i;
+ unsigned long mapsize, section_nr, i;
struct mem_section *ms;
struct page *page, *memmap;
+ struct mem_section_usage *usage;
section_nr = pfn_to_section_nr(start_pfn);
ms = __nr_to_section(section_nr);
@@ -209,10 +211,10 @@ static void register_page_bootmem_info_section(unsigned long start_pfn)
register_page_bootmem_memmap(section_nr, memmap, PAGES_PER_SECTION);
- usemap = ms->pageblock_flags;
- page = virt_to_page(usemap);
+ usage = ms->usage;
+ page = virt_to_page(usage);
- mapsize = PAGE_ALIGN(usemap_size()) >> PAGE_SHIFT;
+ mapsize = PAGE_ALIGN(mem_section_usage_size()) >> PAGE_SHIFT;
for (i = 0; i < mapsize; i++, page++)
get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 03fcf73d47da..bf23bc0b8399 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -388,7 +388,7 @@ static inline unsigned long *get_pageblock_bitmap(struct page *page,
unsigned long pfn)
{
#ifdef CONFIG_SPARSEMEM
- return __pfn_to_section(pfn)->pageblock_flags;
+ return section_to_usemap(__pfn_to_section(pfn));
#else
return page_zone(page)->pageblock_flags;
#endif /* CONFIG_SPARSEMEM */
diff --git a/mm/sparse.c b/mm/sparse.c
index 69904aa6165b..cdd2978d0ffe 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -288,33 +288,31 @@ struct page *sparse_decode_mem_map(unsigned long coded_mem_map, unsigned long pn
static void __meminit sparse_init_one_section(struct mem_section *ms,
unsigned long pnum, struct page *mem_map,
- unsigned long *pageblock_bitmap)
+ struct mem_section_usage *usage)
{
ms->section_mem_map &= ~SECTION_MAP_MASK;
ms->section_mem_map |= sparse_encode_mem_map(mem_map, pnum) |
SECTION_HAS_MEM_MAP;
- ms->pageblock_flags = pageblock_bitmap;
+ ms->usage = usage;
}
-unsigned long usemap_size(void)
+static unsigned long usemap_size(void)
{
return BITS_TO_LONGS(SECTION_BLOCKFLAGS_BITS) * sizeof(unsigned long);
}
-#ifdef CONFIG_MEMORY_HOTPLUG
-static unsigned long *__kmalloc_section_usemap(void)
+size_t mem_section_usage_size(void)
{
- return kmalloc(usemap_size(), GFP_KERNEL);
+ return sizeof(struct mem_section_usage) + usemap_size();
}
-#endif /* CONFIG_MEMORY_HOTPLUG */
#ifdef CONFIG_MEMORY_HOTREMOVE
-static unsigned long * __init
+static struct mem_section_usage * __init
sparse_early_usemaps_alloc_pgdat_section(struct pglist_data *pgdat,
unsigned long size)
{
+ struct mem_section_usage *usage;
unsigned long goal, limit;
- unsigned long *p;
int nid;
/*
* A page may contain usemaps for other sections preventing the
@@ -330,15 +328,16 @@ sparse_early_usemaps_alloc_pgdat_section(struct pglist_data *pgdat,
limit = goal + (1UL << PA_SECTION_SHIFT);
nid = early_pfn_to_nid(goal >> PAGE_SHIFT);
again:
- p = memblock_alloc_try_nid(size, SMP_CACHE_BYTES, goal, limit, nid);
- if (!p && limit) {
+ usage = memblock_alloc_try_nid(size, SMP_CACHE_BYTES, goal, limit, nid);
+ if (!usage && limit) {
limit = 0;
goto again;
}
- return p;
+ return usage;
}
-static void __init check_usemap_section_nr(int nid, unsigned long *usemap)
+static void __init check_usemap_section_nr(int nid,
+ struct mem_section_usage *usage)
{
unsigned long usemap_snr, pgdat_snr;
static unsigned long old_usemap_snr;
@@ -352,7 +351,7 @@ static void __init check_usemap_section_nr(int nid, unsigned long *usemap)
old_pgdat_snr = NR_MEM_SECTIONS;
}
- usemap_snr = pfn_to_section_nr(__pa(usemap) >> PAGE_SHIFT);
+ usemap_snr = pfn_to_section_nr(__pa(usage) >> PAGE_SHIFT);
pgdat_snr = pfn_to_section_nr(__pa(pgdat) >> PAGE_SHIFT);
if (usemap_snr == pgdat_snr)
return;
@@ -380,14 +379,15 @@ static void __init check_usemap_section_nr(int nid, unsigned long *usemap)
usemap_snr, pgdat_snr, nid);
}
#else
-static unsigned long * __init
+static struct mem_section_usage * __init
sparse_early_usemaps_alloc_pgdat_section(struct pglist_data *pgdat,
unsigned long size)
{
return memblock_alloc_node(size, SMP_CACHE_BYTES, pgdat->node_id);
}
-static void __init check_usemap_section_nr(int nid, unsigned long *usemap)
+static void __init check_usemap_section_nr(int nid,
+ struct mem_section_usage *usage)
{
}
#endif /* CONFIG_MEMORY_HOTREMOVE */
@@ -474,14 +474,13 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
unsigned long pnum_end,
unsigned long map_count)
{
- unsigned long pnum, usemap_longs, *usemap;
+ struct mem_section_usage *usage;
+ unsigned long pnum;
struct page *map;
- usemap_longs = BITS_TO_LONGS(SECTION_BLOCKFLAGS_BITS);
- usemap = sparse_early_usemaps_alloc_pgdat_section(NODE_DATA(nid),
- usemap_size() *
- map_count);
- if (!usemap) {
+ usage = sparse_early_usemaps_alloc_pgdat_section(NODE_DATA(nid),
+ mem_section_usage_size() * map_count);
+ if (!usage) {
pr_err("%s: node[%d] usemap allocation failed", __func__, nid);
goto failed;
}
@@ -497,9 +496,9 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
pnum_begin = pnum;
goto failed;
}
- check_usemap_section_nr(nid, usemap);
- sparse_init_one_section(__nr_to_section(pnum), pnum, map, usemap);
- usemap += usemap_longs;
+ check_usemap_section_nr(nid, usage);
+ sparse_init_one_section(__nr_to_section(pnum), pnum, map, usage);
+ usage = (void *) usage + mem_section_usage_size();
}
sparse_buffer_fini();
return;
@@ -693,9 +692,9 @@ int __meminit sparse_add_one_section(int nid, unsigned long start_pfn,
struct vmem_altmap *altmap)
{
unsigned long section_nr = pfn_to_section_nr(start_pfn);
+ struct mem_section_usage *usage;
struct mem_section *ms;
struct page *memmap;
- unsigned long *usemap;
int ret;
/*
@@ -709,8 +708,8 @@ int __meminit sparse_add_one_section(int nid, unsigned long start_pfn,
memmap = kmalloc_section_memmap(section_nr, nid, altmap);
if (!memmap)
return -ENOMEM;
- usemap = __kmalloc_section_usemap();
- if (!usemap) {
+ usage = kzalloc(mem_section_usage_size(), GFP_KERNEL);
+ if (!usage) {
__kfree_section_memmap(memmap, altmap);
return -ENOMEM;
}
@@ -728,11 +727,11 @@ int __meminit sparse_add_one_section(int nid, unsigned long start_pfn,
page_init_poison(memmap, sizeof(struct page) * PAGES_PER_SECTION);
section_mark_present(ms);
- sparse_init_one_section(ms, section_nr, memmap, usemap);
+ sparse_init_one_section(ms, section_nr, memmap, usage);
out:
if (ret < 0) {
- kfree(usemap);
+ kfree(usage);
__kfree_section_memmap(memmap, altmap);
}
return ret;
@@ -769,20 +768,20 @@ static inline void clear_hwpoisoned_pages(struct page *memmap, int nr_pages)
}
#endif
-static void free_section_usemap(struct page *memmap, unsigned long *usemap,
- struct vmem_altmap *altmap)
+static void free_section_usage(struct page *memmap,
+ struct mem_section_usage *usage, struct vmem_altmap *altmap)
{
- struct page *usemap_page;
+ struct page *usage_page;
- if (!usemap)
+ if (!usage)
return;
- usemap_page = virt_to_page(usemap);
+ usage_page = virt_to_page(usage);
/*
* Check to see if allocation came from hot-plug-add
*/
- if (PageSlab(usemap_page) || PageCompound(usemap_page)) {
- kfree(usemap);
+ if (PageSlab(usage_page) || PageCompound(usage_page)) {
+ kfree(usage);
if (memmap)
__kfree_section_memmap(memmap, altmap);
return;
@@ -801,19 +800,19 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
unsigned long map_offset, struct vmem_altmap *altmap)
{
struct page *memmap = NULL;
- unsigned long *usemap = NULL;
+ struct mem_section_usage *usage = NULL;
if (ms->section_mem_map) {
- usemap = ms->pageblock_flags;
+ usage = ms->usage;
memmap = sparse_decode_mem_map(ms->section_mem_map,
__section_nr(ms));
ms->section_mem_map = 0;
- ms->pageblock_flags = NULL;
+ ms->usage = NULL;
}
clear_hwpoisoned_pages(memmap + map_offset,
PAGES_PER_SECTION - map_offset);
- free_section_usemap(memmap, usemap, altmap);
+ free_section_usage(memmap, usage, altmap);
}
#endif /* CONFIG_MEMORY_HOTREMOVE */
#endif /* CONFIG_MEMORY_HOTPLUG */
Prepare for hot{plug,remove} of sub-ranges of a section by tracking a
section-active bitmask, each bit representing 2MB (SECTION_SIZE (128MB) /
map_active bitmask length (64)). If it turns out that 2MB is too large
an active-tracking granularity, it is trivial to increase the size of
the map_active bitmap.
The implication of a partially populated section is that pfn_valid()
needs to go beyond a valid_section() check and read the sub-section
active ranges from the bitmask.
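As a worked example of that lookup (illustrative only; 'ms' stands for the
'struct mem_section' covering the pfn, and the helpers added below implement
exactly this math):
	/*
	 * Illustrative only: a pfn whose physical address sits 6MB into its
	 * 128MB section is governed by bit 3 of 'map_active', since each bit
	 * covers SECTION_ACTIVE_SIZE (2MB) and 6MB / 2MB == 3.
	 */
	phys_addr_t offset = PFN_PHYS(pfn) & ~PA_SECTION_MASK;	/* 6MB */
	int idx = offset / SECTION_ACTIVE_SIZE;			/* 3 */
	bool mapped = ms->usage->map_active & (1UL << idx);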
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Logan Gunthorpe <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
include/linux/mmzone.h | 29 ++++++++++++++++++++++++++++-
mm/page_alloc.c | 4 +++-
mm/sparse.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 79 insertions(+), 2 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 69b9cb9cb2ed..ae4aa7f63d2e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1122,6 +1122,8 @@ struct mem_section_usage {
unsigned long pageblock_flags[0];
};
+void section_active_init(unsigned long pfn, unsigned long nr_pages);
+
struct page;
struct page_ext;
struct mem_section {
@@ -1259,12 +1261,36 @@ static inline struct mem_section *__pfn_to_section(unsigned long pfn)
extern int __highest_present_section_nr;
+static inline int section_active_index(phys_addr_t phys)
+{
+ return (phys & ~(PA_SECTION_MASK)) / SECTION_ACTIVE_SIZE;
+}
+
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
+{
+ int idx = section_active_index(PFN_PHYS(pfn));
+
+ return !!(ms->usage->map_active & (1UL << idx));
+}
+#else
+static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
+{
+ return 1;
+}
+#endif
+
#ifndef CONFIG_HAVE_ARCH_PFN_VALID
static inline int pfn_valid(unsigned long pfn)
{
+ struct mem_section *ms;
+
if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
return 0;
- return valid_section(__nr_to_section(pfn_to_section_nr(pfn)));
+ ms = __nr_to_section(pfn_to_section_nr(pfn));
+ if (!valid_section(ms))
+ return 0;
+ return pfn_section_valid(ms, pfn);
}
#endif
@@ -1295,6 +1321,7 @@ void sparse_init(void);
#else
#define sparse_init() do {} while (0)
#define sparse_index_init(_sec, _nid) do {} while (0)
+#define section_active_init(_pfn, _nr_pages) do {} while (0)
#endif /* CONFIG_SPARSEMEM */
/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bf23bc0b8399..508a810fd514 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7221,10 +7221,12 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
/* Print out the early node map */
pr_info("Early memory node ranges\n");
- for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid)
+ for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
pr_info(" node %3d: [mem %#018Lx-%#018Lx]\n", nid,
(u64)start_pfn << PAGE_SHIFT,
((u64)end_pfn << PAGE_SHIFT) - 1);
+ section_active_init(start_pfn, end_pfn - start_pfn);
+ }
/* Initialise every node */
mminit_verify_pageflags_layout();
diff --git a/mm/sparse.c b/mm/sparse.c
index cdd2978d0ffe..3cd7ce46e749 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -210,6 +210,54 @@ static inline unsigned long first_present_section_nr(void)
return next_present_section_nr(-1);
}
+static unsigned long section_active_mask(unsigned long pfn,
+ unsigned long nr_pages)
+{
+ int idx_start, idx_size;
+ phys_addr_t start, size;
+
+ if (!nr_pages)
+ return 0;
+
+ start = PFN_PHYS(pfn);
+ size = PFN_PHYS(min(nr_pages, PAGES_PER_SECTION
+ - (pfn & ~PAGE_SECTION_MASK)));
+ size = ALIGN(size, SECTION_ACTIVE_SIZE);
+
+ idx_start = section_active_index(start);
+ idx_size = section_active_index(size);
+
+ if (idx_size == 0)
+ return -1;
+ return ((1UL << idx_size) - 1) << idx_start;
+}
+
+void section_active_init(unsigned long pfn, unsigned long nr_pages)
+{
+ int end_sec = pfn_to_section_nr(pfn + nr_pages - 1);
+ int i, start_sec = pfn_to_section_nr(pfn);
+
+ if (!nr_pages)
+ return;
+
+ for (i = start_sec; i <= end_sec; i++) {
+ struct mem_section *ms;
+ unsigned long mask;
+ unsigned long pfns;
+
+ pfns = min(nr_pages, PAGES_PER_SECTION
+ - (pfn & ~PAGE_SECTION_MASK));
+ mask = section_active_mask(pfn, pfns);
+
+ ms = __nr_to_section(i);
+ pr_debug("%s: sec: %d mask: %#018lx\n", __func__, i, mask);
+ ms->usage->map_active = mask;
+
+ pfn += pfns;
+ nr_pages -= pfns;
+ }
+}
+
/* Record a memory area against a node. */
void __init memory_present(int nid, unsigned long start, unsigned long end)
{
Allow sub-section sized ranges to be added to the memmap.
populate_section_memmap() takes an explicit pfn range rather than
assuming a full section, and those parameters are plumbed all the way
through to vmemmap_populate(). There should be no sub-section usage in
current deployments. New warnings are added to clarify which memmap
allocation paths are sub-section capable.
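As a worked example of the rounding that __populate_section_memmap() applies
(illustrative pfn values; it assumes 4KB pages so one 2MB sub-section spans
512 pfns):
	/*
	 * Illustrative only: a request for 0x100 pfns starting at pfn 0x10010
	 * is expanded to the covering sub-section, pfns [0x10000, 0x10200),
	 * before vmemmap_populate() is called.
	 */
	unsigned long pfn = 0x10010, nr_pages = 0x100;
	unsigned long subsec = PHYS_PFN(SECTION_ACTIVE_SIZE);	/* 512 pfns */
	unsigned long end = ALIGN(pfn + nr_pages, subsec);	/* 0x10200 */

	pfn = ALIGN_DOWN(pfn, subsec);				/* 0x10000 */
	nr_pages = end - pfn;					/* 0x200 */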
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Logan Gunthorpe <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
arch/x86/mm/init_64.c | 4 ++-
include/linux/mm.h | 4 ++-
mm/sparse-vmemmap.c | 21 +++++++++++------
mm/sparse.c | 61 +++++++++++++++++++++++++++++++------------------
4 files changed, 57 insertions(+), 33 deletions(-)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index bccff68e3267..799887eada60 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1461,7 +1461,9 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
{
int err;
- if (boot_cpu_has(X86_FEATURE_PSE))
+ if (end - start < PAGES_PER_SECTION * sizeof(struct page))
+ err = vmemmap_populate_basepages(start, end, node);
+ else if (boot_cpu_has(X86_FEATURE_PSE))
err = vmemmap_populate_hugepages(start, end, node, altmap);
else if (altmap) {
pr_err_once("%s: no cpu support for altmap allocations\n",
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 76769749b5a5..76ba638ceda8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2666,8 +2666,8 @@ const char * arch_vma_name(struct vm_area_struct *vma);
void print_vma_addr(char *prefix, unsigned long rip);
void *sparse_buffer_alloc(unsigned long size);
-struct page *sparse_mem_map_populate(unsigned long pnum, int nid,
- struct vmem_altmap *altmap);
+struct page * __populate_section_memmap(unsigned long pfn,
+ unsigned long nr_pages, int nid, struct vmem_altmap *altmap);
pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 7fec05796796..dcb023aa23d1 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -245,19 +245,26 @@ int __meminit vmemmap_populate_basepages(unsigned long start,
return 0;
}
-struct page * __meminit sparse_mem_map_populate(unsigned long pnum, int nid,
- struct vmem_altmap *altmap)
+struct page * __meminit __populate_section_memmap(unsigned long pfn,
+ unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
{
unsigned long start;
unsigned long end;
- struct page *map;
- map = pfn_to_page(pnum * PAGES_PER_SECTION);
- start = (unsigned long)map;
- end = (unsigned long)(map + PAGES_PER_SECTION);
+ /*
+ * The minimum granularity of memmap extensions is
+ * SECTION_ACTIVE_SIZE as allocations are tracked in the
+ * 'map_active' bitmap of the section.
+ */
+ end = ALIGN(pfn + nr_pages, PHYS_PFN(SECTION_ACTIVE_SIZE));
+ pfn &= PHYS_PFN(SECTION_ACTIVE_MASK);
+ nr_pages = end - pfn;
+
+ start = (unsigned long) pfn_to_page(pfn);
+ end = start + nr_pages * sizeof(struct page);
if (vmemmap_populate(start, end, nid, altmap))
return NULL;
- return map;
+ return pfn_to_page(pfn);
}
diff --git a/mm/sparse.c b/mm/sparse.c
index 3cd7ce46e749..38f80639c6cc 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -452,8 +452,8 @@ static unsigned long __init section_map_size(void)
return PAGE_ALIGN(sizeof(struct page) * PAGES_PER_SECTION);
}
-struct page __init *sparse_mem_map_populate(unsigned long pnum, int nid,
- struct vmem_altmap *altmap)
+struct page __init *__populate_section_memmap(unsigned long pfn,
+ unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
{
unsigned long size = section_map_size();
struct page *map = sparse_buffer_alloc(size);
@@ -534,10 +534,13 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
}
sparse_buffer_init(map_count * section_map_size(), nid);
for_each_present_section_nr(pnum_begin, pnum) {
+ unsigned long pfn = section_nr_to_pfn(pnum);
+
if (pnum >= pnum_end)
break;
- map = sparse_mem_map_populate(pnum, nid, NULL);
+ map = __populate_section_memmap(pfn, PAGES_PER_SECTION,
+ nid, NULL);
if (!map) {
pr_err("%s: node[%d] memory map backing failed. Some memory will not be available.",
__func__, nid);
@@ -637,17 +640,17 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
#endif
#ifdef CONFIG_SPARSEMEM_VMEMMAP
-static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid,
- struct vmem_altmap *altmap)
+static struct page *populate_section_memmap(unsigned long pfn,
+ unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
{
- /* This will make the necessary allocations eventually. */
- return sparse_mem_map_populate(pnum, nid, altmap);
+ return __populate_section_memmap(pfn, nr_pages, nid, altmap);
}
-static void __kfree_section_memmap(struct page *memmap,
+
+static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
struct vmem_altmap *altmap)
{
- unsigned long start = (unsigned long)memmap;
- unsigned long end = (unsigned long)(memmap + PAGES_PER_SECTION);
+ unsigned long start = (unsigned long) pfn_to_page(pfn);
+ unsigned long end = start + nr_pages * sizeof(struct page);
vmemmap_free(start, end, altmap);
}
@@ -661,11 +664,18 @@ static void free_map_bootmem(struct page *memmap)
}
#endif /* CONFIG_MEMORY_HOTREMOVE */
#else
-static struct page *__kmalloc_section_memmap(void)
+struct page *populate_section_memmap(unsigned long pfn,
+ unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
{
struct page *page, *ret;
unsigned long memmap_size = sizeof(struct page) * PAGES_PER_SECTION;
+ if ((pfn & ~PAGE_SECTION_MASK) || nr_pages != PAGES_PER_SECTION) {
+ WARN(1, "%s: called with section unaligned parameters\n",
+ __func__);
+ return NULL;
+ }
+
page = alloc_pages(GFP_KERNEL|__GFP_NOWARN, get_order(memmap_size));
if (page)
goto got_map_page;
@@ -682,15 +692,17 @@ static struct page *__kmalloc_section_memmap(void)
return ret;
}
-static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid,
+static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
struct vmem_altmap *altmap)
{
- return __kmalloc_section_memmap();
-}
+ struct page *memmap = pfn_to_page(pfn);
+
+ if ((pfn & ~PAGE_SECTION_MASK) || nr_pages != PAGES_PER_SECTION) {
+ WARN(1, "%s: called with section unaligned parameters\n",
+ __func__);
+ return;
+ }
-static void __kfree_section_memmap(struct page *memmap,
- struct vmem_altmap *altmap)
-{
if (is_vmalloc_addr(memmap))
vfree(memmap);
else
@@ -753,12 +765,13 @@ int __meminit sparse_add_one_section(int nid, unsigned long start_pfn,
if (ret < 0 && ret != -EEXIST)
return ret;
ret = 0;
- memmap = kmalloc_section_memmap(section_nr, nid, altmap);
+ memmap = populate_section_memmap(start_pfn, PAGES_PER_SECTION, nid,
+ altmap);
if (!memmap)
return -ENOMEM;
usage = kzalloc(mem_section_usage_size(), GFP_KERNEL);
if (!usage) {
- __kfree_section_memmap(memmap, altmap);
+ depopulate_section_memmap(start_pfn, PAGES_PER_SECTION, altmap);
return -ENOMEM;
}
@@ -780,7 +793,7 @@ int __meminit sparse_add_one_section(int nid, unsigned long start_pfn,
out:
if (ret < 0) {
kfree(usage);
- __kfree_section_memmap(memmap, altmap);
+ depopulate_section_memmap(start_pfn, PAGES_PER_SECTION, altmap);
}
return ret;
}
@@ -817,7 +830,8 @@ static inline void clear_hwpoisoned_pages(struct page *memmap, int nr_pages)
#endif
static void free_section_usage(struct page *memmap,
- struct mem_section_usage *usage, struct vmem_altmap *altmap)
+ struct mem_section_usage *usage, unsigned long pfn,
+ unsigned long nr_pages, struct vmem_altmap *altmap)
{
struct page *usage_page;
@@ -831,7 +845,7 @@ static void free_section_usage(struct page *memmap,
if (PageSlab(usage_page) || PageCompound(usage_page)) {
kfree(usage);
if (memmap)
- __kfree_section_memmap(memmap, altmap);
+ depopulate_section_memmap(pfn, nr_pages, altmap);
return;
}
@@ -860,7 +874,8 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
clear_hwpoisoned_pages(memmap + map_offset,
PAGES_PER_SECTION - map_offset);
- free_section_usage(memmap, usage, altmap);
+ free_section_usage(memmap, usage, section_nr_to_pfn(__section_nr(ms)),
+ PAGES_PER_SECTION, altmap);
}
#endif /* CONFIG_MEMORY_HOTREMOVE */
#endif /* CONFIG_MEMORY_HOTPLUG */
Sub-section hotplug support reduces the unit of operation of hotplug
from section-sized units (PAGES_PER_SECTION) to sub-section-sized units
(PAGES_PER_SUB_SECTION). Teach shrink_{zone,pgdat}_span() to consider
PAGES_PER_SUB_SECTION boundaries as the points where pfn_valid(), not
valid_section(), can toggle.
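For the record, the step size those loops now use (illustrative; it assumes
4KB pages):
	/*
	 * Illustrative only: PAGES_PER_SUB_SECTION == SECTION_ACTIVE_SIZE /
	 * PAGE_SIZE == 2MB / 4KB == 512 pfns per iteration, instead of
	 * PAGES_PER_SECTION == 128MB / 4KB == 32768 pfns.
	 */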
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Logan Gunthorpe <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
include/linux/mmzone.h | 2 ++
mm/memory_hotplug.c | 16 ++++++++--------
2 files changed, 10 insertions(+), 8 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ae4aa7f63d2e..067ee217c692 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1111,6 +1111,8 @@ static inline unsigned long section_nr_to_pfn(unsigned long sec)
#define SECTION_ACTIVE_SIZE ((1UL << SECTION_SIZE_BITS) / BITS_PER_LONG)
#define SECTION_ACTIVE_MASK (~(SECTION_ACTIVE_SIZE - 1))
+#define PAGES_PER_SUB_SECTION (SECTION_ACTIVE_SIZE / PAGE_SIZE)
+#define PAGE_SUB_SECTION_MASK (~(PAGES_PER_SUB_SECTION-1))
struct mem_section_usage {
/*
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 2541a3a15854..0ea3bb58d223 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -326,10 +326,10 @@ static unsigned long find_smallest_section_pfn(int nid, struct zone *zone,
{
struct mem_section *ms;
- for (; start_pfn < end_pfn; start_pfn += PAGES_PER_SECTION) {
+ for (; start_pfn < end_pfn; start_pfn += PAGES_PER_SUB_SECTION) {
ms = __pfn_to_section(start_pfn);
- if (unlikely(!valid_section(ms)))
+ if (unlikely(!pfn_valid(start_pfn)))
continue;
if (unlikely(pfn_to_nid(start_pfn) != nid))
@@ -354,10 +354,10 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
/* pfn is the end pfn of a memory section. */
pfn = end_pfn - 1;
- for (; pfn >= start_pfn; pfn -= PAGES_PER_SECTION) {
+ for (; pfn >= start_pfn; pfn -= PAGES_PER_SUB_SECTION) {
ms = __pfn_to_section(pfn);
- if (unlikely(!valid_section(ms)))
+ if (unlikely(!pfn_valid(pfn)))
continue;
if (unlikely(pfn_to_nid(pfn) != nid))
@@ -416,10 +416,10 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
* it check the zone has only hole or not.
*/
pfn = zone_start_pfn;
- for (; pfn < zone_end_pfn; pfn += PAGES_PER_SECTION) {
+ for (; pfn < zone_end_pfn; pfn += PAGES_PER_SUB_SECTION) {
ms = __pfn_to_section(pfn);
- if (unlikely(!valid_section(ms)))
+ if (unlikely(!pfn_valid(pfn)))
continue;
if (page_zone(pfn_to_page(pfn)) != zone)
@@ -484,10 +484,10 @@ static void shrink_pgdat_span(struct pglist_data *pgdat,
* has only hole or not.
*/
pfn = pgdat_start_pfn;
- for (; pfn < pgdat_end_pfn; pfn += PAGES_PER_SECTION) {
+ for (; pfn < pgdat_end_pfn; pfn += PAGES_PER_SUB_SECTION) {
ms = __pfn_to_section(pfn);
- if (unlikely(!valid_section(ms)))
+ if (unlikely(!pfn_valid(pfn)))
continue;
if (pfn_to_nid(pfn) != nid)
Prepare the memory hot-{add,remove} paths for handling sub-section
ranges by plumbing the starting page frame and number of pages being
handled through arch_{add,remove}_memory() to
sparse_{add,remove}_one_section().
This is simply plumbing, small cleanups, and some identifier renames. No
intended functional changes.
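As a worked example of the per-section chunking that __add_pages() and
__remove_pages() now perform (illustrative numbers; it assumes 4KB pages so
PAGES_PER_SECTION == 32768):
	/*
	 * Illustrative only: a 24576-pfn (96MB) request starting 16384 pfns
	 * (64MB) into a section is split into 16384 pfns from the first
	 * section and 8192 pfns from the next.
	 */
	pfns = min(nr_pages, PAGES_PER_SECTION - (pfn & ~PAGE_SECTION_MASK));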
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Logan Gunthorpe <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
arch/x86/mm/init_64.c | 11 +++++
include/linux/memory_hotplug.h | 7 ++-
mm/memory_hotplug.c | 85 ++++++++++++++++++++++------------------
mm/sparse.c | 7 ++-
4 files changed, 66 insertions(+), 44 deletions(-)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 799887eada60..4ae817a79ac3 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -781,6 +781,17 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
{
int ret;
+ /*
+ * Only allow partial section hotplug for !memblock ranges,
+ * since register_new_memory() requires section alignment, and
+ * CONFIG_SPARSEMEM_VMEMMAP=n requires sections to be fully
+ * populated.
+ */
+ if ((!IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP) || want_memblock)
+ && ((start_pfn & ~PAGE_SECTION_MASK)
+ || (nr_pages & ~PAGE_SECTION_MASK)))
+ return -EINVAL;
+
ret = __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
WARN_ON_ONCE(ret);
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 8ade08c50d26..83ee937fb67f 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -336,9 +336,10 @@ extern int arch_add_memory(int nid, u64 start, u64 size,
extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
unsigned long nr_pages, struct vmem_altmap *altmap);
extern bool is_memblock_offlined(struct memory_block *mem);
-extern int sparse_add_one_section(int nid, unsigned long start_pfn,
- struct vmem_altmap *altmap);
-extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
+extern int sparse_add_section(int nid, unsigned long pfn,
+ unsigned long nr_pages, struct vmem_altmap *altmap);
+extern void sparse_remove_section(struct zone *zone, struct mem_section *ms,
+ unsigned long pfn, unsigned long nr_pages,
unsigned long map_offset, struct vmem_altmap *altmap);
extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
unsigned long pnum);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0ea3bb58d223..e093348f5d04 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -250,22 +250,23 @@ void __init register_page_bootmem_info_node(struct pglist_data *pgdat)
}
#endif /* CONFIG_HAVE_BOOTMEM_INFO_NODE */
-static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
- struct vmem_altmap *altmap, bool want_memblock)
+static int __meminit __add_section(int nid, unsigned long pfn,
+ unsigned long nr_pages, struct vmem_altmap *altmap,
+ bool want_memblock)
{
int ret;
- if (pfn_valid(phys_start_pfn))
+ if (pfn_valid(pfn))
return -EEXIST;
- ret = sparse_add_one_section(nid, phys_start_pfn, altmap);
+ ret = sparse_add_section(nid, pfn, nr_pages, altmap);
if (ret < 0)
return ret;
if (!want_memblock)
return 0;
- return hotplug_memory_register(nid, __pfn_to_section(phys_start_pfn));
+ return hotplug_memory_register(nid, __pfn_to_section(pfn));
}
/*
@@ -274,23 +275,18 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
* call this function after deciding the zone to which to
* add the new pages.
*/
-int __ref __add_pages(int nid, unsigned long phys_start_pfn,
- unsigned long nr_pages, struct vmem_altmap *altmap,
- bool want_memblock)
+int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
+ struct vmem_altmap *altmap, bool want_memblock)
{
unsigned long i;
int err = 0;
int start_sec, end_sec;
- /* during initialize mem_map, align hot-added range to section */
- start_sec = pfn_to_section_nr(phys_start_pfn);
- end_sec = pfn_to_section_nr(phys_start_pfn + nr_pages - 1);
-
if (altmap) {
/*
* Validate altmap is within bounds of the total request
*/
- if (altmap->base_pfn != phys_start_pfn
+ if (altmap->base_pfn != pfn
|| vmem_altmap_offset(altmap) > nr_pages) {
pr_warn_once("memory add fail, invalid altmap\n");
err = -EINVAL;
@@ -299,9 +295,16 @@ int __ref __add_pages(int nid, unsigned long phys_start_pfn,
altmap->alloc = 0;
}
+ start_sec = pfn_to_section_nr(pfn);
+ end_sec = pfn_to_section_nr(pfn + nr_pages - 1);
for (i = start_sec; i <= end_sec; i++) {
- err = __add_section(nid, section_nr_to_pfn(i), altmap,
- want_memblock);
+ unsigned long pfns;
+
+ pfns = min(nr_pages, PAGES_PER_SECTION
+ - (pfn & ~PAGE_SECTION_MASK));
+ err = __add_section(nid, pfn, pfns, altmap, want_memblock);
+ pfn += pfns;
+ nr_pages -= pfns;
/*
* EEXIST is finally dealt with by ioresource collision
@@ -506,10 +509,10 @@ static void shrink_pgdat_span(struct pglist_data *pgdat,
pgdat->node_spanned_pages = 0;
}
-static void __remove_zone(struct zone *zone, unsigned long start_pfn)
+static void __remove_zone(struct zone *zone, unsigned long start_pfn,
+ unsigned long nr_pages)
{
struct pglist_data *pgdat = zone->zone_pgdat;
- int nr_pages = PAGES_PER_SECTION;
unsigned long flags;
pgdat_resize_lock(zone->zone_pgdat, &flags);
@@ -518,11 +521,11 @@ static void __remove_zone(struct zone *zone, unsigned long start_pfn)
pgdat_resize_unlock(zone->zone_pgdat, &flags);
}
-static int __remove_section(struct zone *zone, struct mem_section *ms,
- unsigned long map_offset, struct vmem_altmap *altmap)
+static int __remove_section(struct zone *zone, unsigned long pfn,
+ unsigned long nr_pages, unsigned long map_offset,
+ struct vmem_altmap *altmap)
{
- unsigned long start_pfn;
- int scn_nr;
+ struct mem_section *ms = __nr_to_section(pfn_to_section_nr(pfn));
int ret = -EINVAL;
if (!valid_section(ms))
@@ -532,18 +535,16 @@ static int __remove_section(struct zone *zone, struct mem_section *ms,
if (ret)
return ret;
- scn_nr = __section_nr(ms);
- start_pfn = section_nr_to_pfn((unsigned long)scn_nr);
- __remove_zone(zone, start_pfn);
+ __remove_zone(zone, pfn, nr_pages);
- sparse_remove_one_section(zone, ms, map_offset, altmap);
+ sparse_remove_section(zone, ms, pfn, nr_pages, map_offset, altmap);
return 0;
}
/**
* __remove_pages() - remove sections of pages from a zone
* @zone: zone from which pages need to be removed
- * @phys_start_pfn: starting pageframe (must be aligned to start of a section)
+ * @pfn: starting pageframe (must be aligned to start of a section)
* @nr_pages: number of pages to remove (must be multiple of section size)
* @altmap: alternative device page map or %NULL if default memmap is used
*
@@ -552,12 +553,11 @@ static int __remove_section(struct zone *zone, struct mem_section *ms,
* sure that pages are marked reserved and zones are adjust properly by
* calling offline_pages().
*/
-int __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
+int __remove_pages(struct zone *zone, unsigned long pfn,
unsigned long nr_pages, struct vmem_altmap *altmap)
{
- unsigned long i;
unsigned long map_offset = 0;
- int sections_to_remove, ret = 0;
+ int i, start_sec, end_sec, ret = 0;
/* In the ZONE_DEVICE case device driver owns the memory region */
if (is_dev_zone(zone)) {
@@ -566,7 +566,7 @@ int __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
} else {
resource_size_t start, size;
- start = phys_start_pfn << PAGE_SHIFT;
+ start = pfn << PAGE_SHIFT;
size = nr_pages * PAGE_SIZE;
ret = release_mem_region_adjustable(&iomem_resource, start,
@@ -582,18 +582,27 @@ int __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
clear_zone_contiguous(zone);
/*
- * We can only remove entire sections
+ * Only ZONE_DEVICE memory is enabled to remove
+ * section-unaligned ranges. See register_new_memory() which
+ * assumes section alignment and is skipped for ZONE_DEVICE
+ * ranges.
*/
- BUG_ON(phys_start_pfn & ~PAGE_SECTION_MASK);
- BUG_ON(nr_pages % PAGES_PER_SECTION);
+ if (!is_dev_zone(zone) && ((pfn | nr_pages) & ~PAGE_SECTION_MASK)) {
+ WARN(1, "section unaligned removal not supported\n");
+ return -EINVAL;
+ }
- sections_to_remove = nr_pages / PAGES_PER_SECTION;
- for (i = 0; i < sections_to_remove; i++) {
- unsigned long pfn = phys_start_pfn + i*PAGES_PER_SECTION;
+ start_sec = pfn_to_section_nr(pfn);
+ end_sec = pfn_to_section_nr(pfn + nr_pages - 1);
+ for (i = start_sec; i <= end_sec; i++) {
+ unsigned long pfns;
cond_resched();
- ret = __remove_section(zone, __pfn_to_section(pfn), map_offset,
- altmap);
+ pfns = min(nr_pages, PAGES_PER_SECTION
+ - (pfn & ~PAGE_SECTION_MASK));
+ ret = __remove_section(zone, pfn, pfns, map_offset, altmap);
+ pfn += pfns;
+ nr_pages -= pfns;
map_offset = 0;
if (ret)
break;
diff --git a/mm/sparse.c b/mm/sparse.c
index 38f80639c6cc..767713c88cf5 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -748,8 +748,8 @@ static void free_map_bootmem(struct page *memmap)
* set. If this is <=0, then that means that the passed-in
* map was not consumed and must be freed.
*/
-int __meminit sparse_add_one_section(int nid, unsigned long start_pfn,
- struct vmem_altmap *altmap)
+int __meminit sparse_add_section(int nid, unsigned long start_pfn,
+ unsigned long nr_pages, struct vmem_altmap *altmap)
{
unsigned long section_nr = pfn_to_section_nr(start_pfn);
struct mem_section_usage *usage;
@@ -858,7 +858,8 @@ static void free_section_usage(struct page *memmap,
free_map_bootmem(memmap);
}
-void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
+void sparse_remove_section(struct zone *zone, struct mem_section *ms,
+ unsigned long pfn, unsigned long nr_pages,
unsigned long map_offset, struct vmem_altmap *altmap)
{
struct page *memmap = NULL;
The libnvdimm sub-system has suffered a series of hacks and broken
workarounds for the memory-hotplug implementation's awkward
section-aligned (128MB) granularity. For example the following backtrace
is emitted when attempting arch_add_memory() with physical address
ranges that intersect 'System RAM' (RAM) with 'Persistent Memory' (PMEM)
within a given section:
WARNING: CPU: 0 PID: 558 at kernel/memremap.c:300 devm_memremap_pages+0x3b5/0x4c0
devm_memremap_pages attempted on mixed region [mem 0x200000000-0x2fbffffff flags 0x200]
[..]
Call Trace:
dump_stack+0x86/0xc3
__warn+0xcb/0xf0
warn_slowpath_fmt+0x5f/0x80
devm_memremap_pages+0x3b5/0x4c0
__wrap_devm_memremap_pages+0x58/0x70 [nfit_test_iomap]
pmem_attach_disk+0x19a/0x440 [nd_pmem]
Recently it was discovered that the problem goes beyond RAM vs PMEM
collisions as some platforms produce PMEM vs PMEM collisions within a
given section. The libnvdimm workaround for that case revealed that the
libnvdimm section-alignment-padding implementation has been broken for a
long while. A fix for that long-standing breakage introduces as many
problems as it solves, as it would require a backward-incompatible change
to the namespace metadata interpretation. Instead of that dubious route
[1], address the root problem in the memory-hotplug implementation.
[1]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Logan Gunthorpe <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
mm/sparse.c | 235 ++++++++++++++++++++++++++++++++++++++++-------------------
1 file changed, 158 insertions(+), 77 deletions(-)
diff --git a/mm/sparse.c b/mm/sparse.c
index 767713c88cf5..d41ad9643f86 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -83,8 +83,15 @@ static int __meminit sparse_index_init(unsigned long section_nr, int nid)
unsigned long root = SECTION_NR_TO_ROOT(section_nr);
struct mem_section *section;
+ /*
+ * An existing section is possible in the sub-section hotplug
+ * case. First hot-add instantiates, follow-on hot-add reuses
+ * the existing section.
+ *
+ * The mem_hotplug_lock resolves the apparent race below.
+ */
if (mem_section[root])
- return -EEXIST;
+ return 0;
section = sparse_index_alloc(nid);
if (!section)
@@ -338,6 +345,15 @@ static void __meminit sparse_init_one_section(struct mem_section *ms,
unsigned long pnum, struct page *mem_map,
struct mem_section_usage *usage)
{
+ /*
+ * Given that SPARSEMEM_VMEMMAP=y supports sub-section hotplug,
+ * ->section_mem_map can not be guaranteed to point to a full
+ * section's worth of memory. The field is only valid / used
+ * in the SPARSEMEM_VMEMMAP=n case.
+ */
+ if (IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP))
+ mem_map = NULL;
+
ms->section_mem_map &= ~SECTION_MAP_MASK;
ms->section_mem_map |= sparse_encode_mem_map(mem_map, pnum) |
SECTION_HAS_MEM_MAP;
@@ -743,58 +759,164 @@ static void free_map_bootmem(struct page *memmap)
#endif /* CONFIG_MEMORY_HOTREMOVE */
#endif /* CONFIG_SPARSEMEM_VMEMMAP */
-/*
- * returns the number of sections whose mem_maps were properly
- * set. If this is <=0, then that means that the passed-in
- * map was not consumed and must be freed.
+#ifndef CONFIG_MEMORY_HOTREMOVE
+static void free_map_bootmem(struct page *memmap)
+{
+}
+#endif
+
+static bool is_early_section(struct mem_section *ms)
+{
+ struct page *usage_page;
+
+ usage_page = virt_to_page(ms->usage);
+ if (PageSlab(usage_page) || PageCompound(usage_page))
+ return false;
+ else
+ return true;
+}
+
+static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
+ int nid, struct vmem_altmap *altmap)
+{
+ unsigned long mask = section_active_mask(pfn, nr_pages);
+ struct mem_section *ms = __pfn_to_section(pfn);
+ bool early_section = is_early_section(ms);
+ struct page *memmap = NULL;
+
+ if (WARN(!ms->usage || (ms->usage->map_active & mask) != mask,
+ "section already deactivated: active: %#lx mask: %#lx\n",
+ ms->usage ? ms->usage->map_active : 0, mask))
+ return;
+
+ if (WARN(!IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP)
+ && nr_pages < PAGES_PER_SECTION,
+ "partial memory section removal not supported\n"))
+ return;
+
+ /*
+ * There are 3 cases to handle across two configurations
+ * (SPARSEMEM_VMEMMAP={y,n}):
+ *
+ * 1/ deactivation of a partial hot-added section (only possible
+ * in the SPARSEMEM_VMEMMAP=y case).
+ * a/ section was present at memory init
+ * b/ section was hot-added post memory init
+ * 2/ deactivation of a complete hot-added section
+ * 3/ deactivation of a complete section from memory init
+ *
+ * For 1/, when map_active does not go to zero we will not be
+ * freeing the usage map, but still need to free the vmemmap
+ * range.
+ *
+ * For 2/ and 3/ the SPARSEMEM_VMEMMAP={y,n} cases are unified
+ */
+ ms->usage->map_active ^= mask;
+ if (ms->usage->map_active == 0) {
+ unsigned long section_nr = pfn_to_section_nr(pfn);
+
+ if (!early_section) {
+ kfree(ms->usage);
+ ms->usage = NULL;
+ }
+ memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
+ ms->section_mem_map = sparse_encode_mem_map(NULL, section_nr);
+ }
+
+ if (early_section && memmap)
+ free_map_bootmem(memmap);
+ else
+ depopulate_section_memmap(pfn, nr_pages, altmap);
+}
+
+static struct page * __meminit section_activate(int nid, unsigned long pfn,
+ unsigned long nr_pages, struct vmem_altmap *altmap)
+{
+ unsigned long mask = section_active_mask(pfn, nr_pages);
+ struct mem_section *ms = __pfn_to_section(pfn);
+ struct mem_section_usage *usage = NULL;
+ struct page *memmap;
+ int rc = 0;
+
+ if (!ms->usage) {
+ usage = kzalloc(mem_section_usage_size(), GFP_KERNEL);
+ if (!usage)
+ return ERR_PTR(-ENOMEM);
+ ms->usage = usage;
+ }
+
+ if (!mask)
+ rc = -EINVAL;
+ else if (mask & ms->usage->map_active)
+ rc = -EEXIST;
+ else
+ ms->usage->map_active |= mask;
+
+ if (rc) {
+ if (usage)
+ ms->usage = NULL;
+ kfree(usage);
+ return ERR_PTR(rc);
+ }
+
+ /*
+ * The early init code does not consider partially populated
+ * initial sections, it simply assumes that memory will never be
+ * referenced. If we hot-add memory into such a section then we
+ * do not need to populate the memmap and can simply reuse what
+ * is already there.
+ */
+ if (nr_pages < PAGES_PER_SECTION && is_early_section(ms))
+ return pfn_to_page(pfn);
+
+ memmap = populate_section_memmap(pfn, nr_pages, nid, altmap);
+ if (!memmap) {
+ section_deactivate(pfn, nr_pages, nid, altmap);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ return memmap;
+}
+
+/**
+ * sparse_add_section() - create a new memmap section, or populate an
+ * existing one
+ * @zone: host zone for the new memory mapping
+ * @start_pfn: first pfn to add (section aligned if zone != ZONE_DEVICE)
+ * @nr_pages: number of new pages to add
+ *
+ * returns the number of sections whose mem_maps were properly set. If
+ * this is <=0, then that means that the passed-in map was not consumed
+ * and must be freed.
*/
int __meminit sparse_add_section(int nid, unsigned long start_pfn,
unsigned long nr_pages, struct vmem_altmap *altmap)
{
unsigned long section_nr = pfn_to_section_nr(start_pfn);
- struct mem_section_usage *usage;
- struct mem_section *ms;
+ struct mem_section *ms = __pfn_to_section(start_pfn);
struct page *memmap;
int ret;
- /*
- * no locking for this, because it does its own
- * plus, it does a kmalloc
- */
ret = sparse_index_init(section_nr, nid);
if (ret < 0 && ret != -EEXIST)
return ret;
- ret = 0;
- memmap = populate_section_memmap(start_pfn, PAGES_PER_SECTION, nid,
- altmap);
- if (!memmap)
- return -ENOMEM;
- usage = kzalloc(mem_section_usage_size(), GFP_KERNEL);
- if (!usage) {
- depopulate_section_memmap(start_pfn, PAGES_PER_SECTION, altmap);
- return -ENOMEM;
- }
- ms = __pfn_to_section(start_pfn);
- if (ms->section_mem_map & SECTION_MARKED_PRESENT) {
- ret = -EEXIST;
- goto out;
- }
+ memmap = section_activate(nid, start_pfn, nr_pages, altmap);
+ if (IS_ERR(memmap))
+ return PTR_ERR(memmap);
+ ret = 0;
/*
* Poison uninitialized struct pages in order to catch invalid flags
* combinations.
*/
- page_init_poison(memmap, sizeof(struct page) * PAGES_PER_SECTION);
+ page_init_poison(pfn_to_page(start_pfn), sizeof(struct page) * nr_pages);
section_mark_present(ms);
- sparse_init_one_section(ms, section_nr, memmap, usage);
+ sparse_init_one_section(ms, section_nr, memmap, ms->usage);
-out:
- if (ret < 0) {
- kfree(usage);
- depopulate_section_memmap(start_pfn, PAGES_PER_SECTION, altmap);
- }
+ if (ret < 0)
+ section_deactivate(start_pfn, nr_pages, nid, altmap);
return ret;
}
@@ -829,54 +951,13 @@ static inline void clear_hwpoisoned_pages(struct page *memmap, int nr_pages)
}
#endif
-static void free_section_usage(struct page *memmap,
- struct mem_section_usage *usage, unsigned long pfn,
- unsigned long nr_pages, struct vmem_altmap *altmap)
-{
- struct page *usage_page;
-
- if (!usage)
- return;
-
- usage_page = virt_to_page(usage);
- /*
- * Check to see if allocation came from hot-plug-add
- */
- if (PageSlab(usage_page) || PageCompound(usage_page)) {
- kfree(usage);
- if (memmap)
- depopulate_section_memmap(pfn, nr_pages, altmap);
- return;
- }
-
- /*
- * The usemap came from bootmem. This is packed with other usemaps
- * on the section which has pgdat at boot time. Just keep it as is now.
- */
-
- if (memmap)
- free_map_bootmem(memmap);
-}
-
void sparse_remove_section(struct zone *zone, struct mem_section *ms,
unsigned long pfn, unsigned long nr_pages,
unsigned long map_offset, struct vmem_altmap *altmap)
{
- struct page *memmap = NULL;
- struct mem_section_usage *usage = NULL;
-
- if (ms->section_mem_map) {
- usage = ms->usage;
- memmap = sparse_decode_mem_map(ms->section_mem_map,
- __section_nr(ms));
- ms->section_mem_map = 0;
- ms->usage = NULL;
- }
-
- clear_hwpoisoned_pages(memmap + map_offset,
- PAGES_PER_SECTION - map_offset);
- free_section_usage(memmap, usage, section_nr_to_pfn(__section_nr(ms)),
- PAGES_PER_SECTION, altmap);
+ clear_hwpoisoned_pages(pfn_to_page(pfn) + map_offset,
+ nr_pages - map_offset);
+ section_deactivate(pfn, nr_pages, zone_to_nid(zone), altmap);
}
#endif /* CONFIG_MEMORY_HOTREMOVE */
#endif /* CONFIG_MEMORY_HOTPLUG */
At namespace creation time, there is the potential for the "expected to be
zero" fields of a 'pfn' info-block to be filled with indeterminate data.
While the kernel buffer is zeroed on allocation, it is immediately
overwritten by nd_pfn_validate(), filling it with the current contents of
the on-media info-block location. For fields like 'flags' and the
'padding', it potentially means that future implementations cannot rely
on those fields being zero.
In preparation for no longer using the 'start_pad' and 'end_trunc' fields
for section alignment, arrange for fields that are not explicitly
initialized to be guaranteed zero. Bump the minor version to indicate that
it is safe to assume the 'padding' and 'flags' are zero. Otherwise, this
corruption is expected to be benign since all other critical fields are
explicitly initialized.
Fixes: 32ab0a3f5170 ("libnvdimm, pmem: 'struct page' for pmem")
Cc: <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/nvdimm/dax_devs.c | 2 +-
drivers/nvdimm/pfn.h | 1 +
drivers/nvdimm/pfn_devs.c | 18 +++++++++++++++---
3 files changed, 17 insertions(+), 4 deletions(-)
diff --git a/drivers/nvdimm/dax_devs.c b/drivers/nvdimm/dax_devs.c
index 0453f49dc708..326f02ffca81 100644
--- a/drivers/nvdimm/dax_devs.c
+++ b/drivers/nvdimm/dax_devs.c
@@ -126,7 +126,7 @@ int nd_dax_probe(struct device *dev, struct nd_namespace_common *ndns)
nvdimm_bus_unlock(&ndns->dev);
if (!dax_dev)
return -ENOMEM;
- pfn_sb = devm_kzalloc(dev, sizeof(*pfn_sb), GFP_KERNEL);
+ pfn_sb = devm_kmalloc(dev, sizeof(*pfn_sb), GFP_KERNEL);
nd_pfn->pfn_sb = pfn_sb;
rc = nd_pfn_validate(nd_pfn, DAX_SIG);
dev_dbg(dev, "dax: %s\n", rc == 0 ? dev_name(dax_dev) : "<none>");
diff --git a/drivers/nvdimm/pfn.h b/drivers/nvdimm/pfn.h
index dde9853453d3..e901e3a3b04c 100644
--- a/drivers/nvdimm/pfn.h
+++ b/drivers/nvdimm/pfn.h
@@ -36,6 +36,7 @@ struct nd_pfn_sb {
__le32 end_trunc;
/* minor-version-2 record the base alignment of the mapping */
__le32 align;
+ /* minor-version-3 guarantee the padding and flags are zero */
u8 padding[4000];
__le64 checksum;
};
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index d271bd731af7..f0e918186504 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -420,6 +420,15 @@ static int nd_pfn_clear_memmap_errors(struct nd_pfn *nd_pfn)
return 0;
}
+/**
+ * nd_pfn_validate - read and validate info-block
+ * @nd_pfn: fsdax namespace runtime state / properties
+ * @sig: 'devdax' or 'fsdax' signature
+ *
+ * Upon return the info-block buffer contents (->pfn_sb) are
+ * indeterminate when validation fails, and a coherent info-block
+ * otherwise.
+ */
int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
{
u64 checksum, offset;
@@ -565,7 +574,7 @@ int nd_pfn_probe(struct device *dev, struct nd_namespace_common *ndns)
nvdimm_bus_unlock(&ndns->dev);
if (!pfn_dev)
return -ENOMEM;
- pfn_sb = devm_kzalloc(dev, sizeof(*pfn_sb), GFP_KERNEL);
+ pfn_sb = devm_kmalloc(dev, sizeof(*pfn_sb), GFP_KERNEL);
nd_pfn = to_nd_pfn(pfn_dev);
nd_pfn->pfn_sb = pfn_sb;
rc = nd_pfn_validate(nd_pfn, PFN_SIG);
@@ -702,7 +711,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
u64 checksum;
int rc;
- pfn_sb = devm_kzalloc(&nd_pfn->dev, sizeof(*pfn_sb), GFP_KERNEL);
+ pfn_sb = devm_kmalloc(&nd_pfn->dev, sizeof(*pfn_sb), GFP_KERNEL);
if (!pfn_sb)
return -ENOMEM;
@@ -711,11 +720,14 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
sig = DAX_SIG;
else
sig = PFN_SIG;
+
rc = nd_pfn_validate(nd_pfn, sig);
if (rc != -ENODEV)
return rc;
/* no info block, do init */;
+ memset(pfn_sb, 0, sizeof(*pfn_sb));
+
nd_region = to_nd_region(nd_pfn->dev.parent);
if (nd_region->ro) {
dev_info(&nd_pfn->dev,
@@ -768,7 +780,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
memcpy(pfn_sb->uuid, nd_pfn->uuid, 16);
memcpy(pfn_sb->parent_uuid, nd_dev_to_uuid(&ndns->dev), 16);
pfn_sb->version_major = cpu_to_le16(1);
- pfn_sb->version_minor = cpu_to_le16(2);
+ pfn_sb->version_minor = cpu_to_le16(3);
pfn_sb->start_pad = cpu_to_le32(start_pad);
pfn_sb->end_trunc = cpu_to_le32(end_trunc);
pfn_sb->align = cpu_to_le32(nd_pfn->align);
Teach devm_memremap_pages() about the new sub-section capabilities of
arch_{add,remove}_memory(). Effectively, just replace all usage of
align_start, align_end, and align_size with res->start, res->end, and
resource_size(res). The existing sanity check will still make sure that
the two separate remap attempts do not collide within a sub-section (2MB
on x86).
Cc: Michal Hocko <[email protected]>
Cc: Toshi Kani <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: Logan Gunthorpe <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
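A standalone illustration of what dropping the section padding buys (128MB
sections assumed for x86-64; the resource numbers are invented for the
example): the old code rounded the remap range out to section boundaries,
which is how a small namespace ends up overlapping 'System RAM' or a
neighboring namespace in the same section, while the new code hands
arch_add_memory() the exact resource.

#include <stdio.h>

#define PA_SECTION_SIZE (1UL << 27)     /* 128MB sections, x86-64 assumption */
#define ALIGN(x, a)     (((x) + (a) - 1) & ~((a) - 1))

int main(void)
{
        /* hypothetical 48MB pmem resource that is only 2MB aligned */
        unsigned long start = 0x102000000UL;
        unsigned long size  = 48UL << 20;

        /* old behaviour: pad the range out to section boundaries */
        unsigned long align_start = start & ~(PA_SECTION_SIZE - 1);
        unsigned long align_size  = ALIGN(start + size, PA_SECTION_SIZE)
                - align_start;

        printf("requested: [%#lx-%#lx] %3lu MB\n",
               start, start + size - 1, size >> 20);
        printf("old remap: [%#lx-%#lx] %3lu MB\n",
               align_start, align_start + align_size - 1, align_size >> 20);
        printf("new remap: [%#lx-%#lx] %3lu MB\n",
               start, start + size - 1, size >> 20);
        return 0;
}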
kernel/memremap.c | 55 +++++++++++++++++++++--------------------------------
1 file changed, 22 insertions(+), 33 deletions(-)
diff --git a/kernel/memremap.c b/kernel/memremap.c
index dda1367b385d..08344869e717 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -59,7 +59,7 @@ static unsigned long pfn_first(struct dev_pagemap *pgmap)
struct vmem_altmap *altmap = &pgmap->altmap;
unsigned long pfn;
- pfn = res->start >> PAGE_SHIFT;
+ pfn = PHYS_PFN(res->start);
if (pgmap->altmap_valid)
pfn += vmem_altmap_offset(altmap);
return pfn;
@@ -87,7 +87,6 @@ static void devm_memremap_pages_release(void *data)
struct dev_pagemap *pgmap = data;
struct device *dev = pgmap->dev;
struct resource *res = &pgmap->res;
- resource_size_t align_start, align_size;
unsigned long pfn;
int nid;
@@ -96,25 +95,21 @@ static void devm_memremap_pages_release(void *data)
put_page(pfn_to_page(pfn));
/* pages are dead and unused, undo the arch mapping */
- align_start = res->start & ~(PA_SECTION_SIZE - 1);
- align_size = ALIGN(res->start + resource_size(res), PA_SECTION_SIZE)
- - align_start;
-
- nid = page_to_nid(pfn_to_page(align_start >> PAGE_SHIFT));
+ nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
mem_hotplug_begin();
if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
- pfn = align_start >> PAGE_SHIFT;
+ pfn = PHYS_PFN(res->start);
__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
- align_size >> PAGE_SHIFT, NULL);
+ PHYS_PFN(resource_size(res)), NULL);
} else {
- arch_remove_memory(nid, align_start, align_size,
+ arch_remove_memory(nid, res->start, resource_size(res),
pgmap->altmap_valid ? &pgmap->altmap : NULL);
- kasan_remove_zero_shadow(__va(align_start), align_size);
+ kasan_remove_zero_shadow(__va(res->start), resource_size(res));
}
mem_hotplug_done();
- untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
+ untrack_pfn(NULL, PHYS_PFN(res->start), resource_size(res));
pgmap_array_delete(res);
dev_WARN_ONCE(dev, pgmap->altmap.alloc,
"%s: failed to free all reserved pages\n", __func__);
@@ -141,7 +136,6 @@ static void devm_memremap_pages_release(void *data)
*/
void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
{
- resource_size_t align_start, align_size, align_end;
struct vmem_altmap *altmap = pgmap->altmap_valid ?
&pgmap->altmap : NULL;
struct resource *res = &pgmap->res;
@@ -152,26 +146,21 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
if (!pgmap->ref || !pgmap->kill)
return ERR_PTR(-EINVAL);
- align_start = res->start & ~(PA_SECTION_SIZE - 1);
- align_size = ALIGN(res->start + resource_size(res), PA_SECTION_SIZE)
- - align_start;
- align_end = align_start + align_size - 1;
-
- conflict_pgmap = get_dev_pagemap(PHYS_PFN(align_start), NULL);
+ conflict_pgmap = get_dev_pagemap(PHYS_PFN(res->start), NULL);
if (conflict_pgmap) {
dev_WARN(dev, "Conflicting mapping in same section\n");
put_dev_pagemap(conflict_pgmap);
return ERR_PTR(-ENOMEM);
}
- conflict_pgmap = get_dev_pagemap(PHYS_PFN(align_end), NULL);
+ conflict_pgmap = get_dev_pagemap(PHYS_PFN(res->end), NULL);
if (conflict_pgmap) {
dev_WARN(dev, "Conflicting mapping in same section\n");
put_dev_pagemap(conflict_pgmap);
return ERR_PTR(-ENOMEM);
}
- is_ram = region_intersects(align_start, align_size,
+ is_ram = region_intersects(res->start, resource_size(res),
IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE);
if (is_ram != REGION_DISJOINT) {
@@ -192,8 +181,8 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
if (nid < 0)
nid = numa_mem_id();
- error = track_pfn_remap(NULL, &pgprot, PHYS_PFN(align_start), 0,
- align_size);
+ error = track_pfn_remap(NULL, &pgprot, PHYS_PFN(res->start), 0,
+ resource_size(res));
if (error)
goto err_pfn_remap;
@@ -211,16 +200,16 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
* arch_add_memory().
*/
if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
- error = add_pages(nid, align_start >> PAGE_SHIFT,
- align_size >> PAGE_SHIFT, NULL, false);
+ error = add_pages(nid, PHYS_PFN(res->start),
+ PHYS_PFN(resource_size(res)), NULL, false);
} else {
- error = kasan_add_zero_shadow(__va(align_start), align_size);
+ error = kasan_add_zero_shadow(__va(res->start), resource_size(res));
if (error) {
mem_hotplug_done();
goto err_kasan;
}
- error = arch_add_memory(nid, align_start, align_size, altmap,
+ error = arch_add_memory(nid, res->start, resource_size(res), altmap,
false);
}
@@ -228,8 +217,8 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
struct zone *zone;
zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE];
- move_pfn_range_to_zone(zone, align_start >> PAGE_SHIFT,
- align_size >> PAGE_SHIFT, altmap);
+ move_pfn_range_to_zone(zone, PHYS_PFN(res->start),
+ PHYS_PFN(resource_size(res)), altmap);
}
mem_hotplug_done();
@@ -241,8 +230,8 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
* to allow us to do the work while not holding the hotplug lock.
*/
memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
- align_start >> PAGE_SHIFT,
- align_size >> PAGE_SHIFT, pgmap);
+ PHYS_PFN(res->start),
+ PHYS_PFN(resource_size(res)), pgmap);
percpu_ref_get_many(pgmap->ref, pfn_end(pgmap) - pfn_first(pgmap));
error = devm_add_action_or_reset(dev, devm_memremap_pages_release,
@@ -253,9 +242,9 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
return __va(res->start);
err_add_memory:
- kasan_remove_zero_shadow(__va(align_start), align_size);
+ kasan_remove_zero_shadow(__va(res->start), resource_size(res));
err_kasan:
- untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
+ untrack_pfn(NULL, PHYS_PFN(res->start), resource_size(res));
err_pfn_remap:
pgmap_array_delete(res);
err_array:
Now that the mm core supports section-unaligned hotplug of ZONE_DEVICE
memory, we no longer need to add padding at pfn/dax device creation
time. The kernel will still honor padding established by older kernels.
Reported-by: Jeff Moyer <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
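The SUB_SECTION_ALIGN_{UP,DOWN} helpers this patch switches to (added to
include/linux/mmzone.h at the end of the diff) shrink the alignment unit from
a 128MB section to one bit of the section-active mask. A standalone sketch of
that arithmetic, assuming x86-64 values (128MB sections, 4K pages, 64-bit
longs):

#include <stdio.h>

#define PAGE_SHIFT              12
#define SECTION_SIZE_BITS       27      /* 128MB sections (x86-64 assumption) */
#define PAGES_PER_SECTION       (1UL << (SECTION_SIZE_BITS - PAGE_SHIFT))
#define BITS_PER_LONG           64
#define PAGES_PER_SUB_SECTION   (PAGES_PER_SECTION / BITS_PER_LONG)
#define PAGE_SUB_SECTION_MASK   (~(PAGES_PER_SUB_SECTION - 1))

#define SUB_SECTION_ALIGN_UP(pfn)   (((pfn) + PAGES_PER_SUB_SECTION - 1) \
                                     & PAGE_SUB_SECTION_MASK)
#define SUB_SECTION_ALIGN_DOWN(pfn) ((pfn) & PAGE_SUB_SECTION_MASK)

int main(void)
{
        unsigned long pfn = 0x48233;    /* arbitrary, deliberately unaligned pfn */

        printf("sub-section: %lu pfns (%lu MB), %lu sub-sections per section\n",
               PAGES_PER_SUB_SECTION,
               (PAGES_PER_SUB_SECTION << PAGE_SHIFT) >> 20,
               PAGES_PER_SECTION / PAGES_PER_SUB_SECTION);
        printf("pfn %#lx aligns down to %#lx, up to %#lx\n",
               pfn, SUB_SECTION_ALIGN_DOWN(pfn), SUB_SECTION_ALIGN_UP(pfn));
        return 0;
}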
drivers/nvdimm/pfn.h | 11 ++-----
drivers/nvdimm/pfn_devs.c | 75 +++++++--------------------------------------
include/linux/mmzone.h | 4 ++
3 files changed, 19 insertions(+), 71 deletions(-)
diff --git a/drivers/nvdimm/pfn.h b/drivers/nvdimm/pfn.h
index e901e3a3b04c..ae589cc528f2 100644
--- a/drivers/nvdimm/pfn.h
+++ b/drivers/nvdimm/pfn.h
@@ -41,18 +41,13 @@ struct nd_pfn_sb {
__le64 checksum;
};
-#ifdef CONFIG_SPARSEMEM
-#define PFN_SECTION_ALIGN_DOWN(x) SECTION_ALIGN_DOWN(x)
-#define PFN_SECTION_ALIGN_UP(x) SECTION_ALIGN_UP(x)
-#else
/*
* In this case ZONE_DEVICE=n and we will disable 'pfn' device support,
* but we still want pmem to compile.
*/
-#define PFN_SECTION_ALIGN_DOWN(x) (x)
-#define PFN_SECTION_ALIGN_UP(x) (x)
+#ifndef SUB_SECTION_ALIGN_DOWN
+#define SUB_SECTION_ALIGN_DOWN(x) (x)
+#define SUB_SECTION_ALIGN_UP(x) (x)
#endif
-#define PHYS_SECTION_ALIGN_DOWN(x) PFN_PHYS(PFN_SECTION_ALIGN_DOWN(PHYS_PFN(x)))
-#define PHYS_SECTION_ALIGN_UP(x) PFN_PHYS(PFN_SECTION_ALIGN_UP(PHYS_PFN(x)))
#endif /* __NVDIMM_PFN_H */
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index f0e918186504..b7928bfc5691 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -595,14 +595,14 @@ static u32 info_block_reserve(void)
}
/*
- * We hotplug memory at section granularity, pad the reserved area from
- * the previous section base to the namespace base address.
+ * We hotplug memory at sub-section granularity, pad the reserved area
+ * from the previous section base to the namespace base address.
*/
static unsigned long init_altmap_base(resource_size_t base)
{
unsigned long base_pfn = PHYS_PFN(base);
- return PFN_SECTION_ALIGN_DOWN(base_pfn);
+ return SUB_SECTION_ALIGN_DOWN(base_pfn);
}
static unsigned long init_altmap_reserve(resource_size_t base)
@@ -610,7 +610,7 @@ static unsigned long init_altmap_reserve(resource_size_t base)
unsigned long reserve = info_block_reserve() >> PAGE_SHIFT;
unsigned long base_pfn = PHYS_PFN(base);
- reserve += base_pfn - PFN_SECTION_ALIGN_DOWN(base_pfn);
+ reserve += base_pfn - SUB_SECTION_ALIGN_DOWN(base_pfn);
return reserve;
}
@@ -641,8 +641,7 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap)
nd_pfn->npfns = le64_to_cpu(pfn_sb->npfns);
pgmap->altmap_valid = false;
} else if (nd_pfn->mode == PFN_MODE_PMEM) {
- nd_pfn->npfns = PFN_SECTION_ALIGN_UP((resource_size(res)
- - offset) / PAGE_SIZE);
+ nd_pfn->npfns = PHYS_PFN((resource_size(res) - offset));
if (le64_to_cpu(nd_pfn->pfn_sb->npfns) > nd_pfn->npfns)
dev_info(&nd_pfn->dev,
"number of pfns truncated from %lld to %ld\n",
@@ -658,50 +657,10 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap)
return 0;
}
-static u64 phys_pmem_align_down(struct nd_pfn *nd_pfn, u64 phys)
-{
- return min_t(u64, PHYS_SECTION_ALIGN_DOWN(phys),
- ALIGN_DOWN(phys, nd_pfn->align));
-}
-
-/*
- * Check if pmem collides with 'System RAM', or other regions when
- * section aligned. Trim it accordingly.
- */
-static void trim_pfn_device(struct nd_pfn *nd_pfn, u32 *start_pad, u32 *end_trunc)
-{
- struct nd_namespace_common *ndns = nd_pfn->ndns;
- struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
- struct nd_region *nd_region = to_nd_region(nd_pfn->dev.parent);
- const resource_size_t start = nsio->res.start;
- const resource_size_t end = start + resource_size(&nsio->res);
- resource_size_t adjust, size;
-
- *start_pad = 0;
- *end_trunc = 0;
-
- adjust = start - PHYS_SECTION_ALIGN_DOWN(start);
- size = resource_size(&nsio->res) + adjust;
- if (region_intersects(start - adjust, size, IORESOURCE_SYSTEM_RAM,
- IORES_DESC_NONE) == REGION_MIXED
- || nd_region_conflict(nd_region, start - adjust, size))
- *start_pad = PHYS_SECTION_ALIGN_UP(start) - start;
-
- /* Now check that end of the range does not collide. */
- adjust = PHYS_SECTION_ALIGN_UP(end) - end;
- size = resource_size(&nsio->res) + adjust;
- if (region_intersects(start, size, IORESOURCE_SYSTEM_RAM,
- IORES_DESC_NONE) == REGION_MIXED
- || !IS_ALIGNED(end, nd_pfn->align)
- || nd_region_conflict(nd_region, start, size))
- *end_trunc = end - phys_pmem_align_down(nd_pfn, end);
-}
-
static int nd_pfn_init(struct nd_pfn *nd_pfn)
{
struct nd_namespace_common *ndns = nd_pfn->ndns;
struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
- u32 start_pad, end_trunc, reserve = info_block_reserve();
resource_size_t start, size;
struct nd_region *nd_region;
struct nd_pfn_sb *pfn_sb;
@@ -736,43 +695,35 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
return -ENXIO;
}
- memset(pfn_sb, 0, sizeof(*pfn_sb));
-
- trim_pfn_device(nd_pfn, &start_pad, &end_trunc);
- if (start_pad + end_trunc)
- dev_info(&nd_pfn->dev, "%s alignment collision, truncate %d bytes\n",
- dev_name(&ndns->dev), start_pad + end_trunc);
-
/*
* Note, we use 64 here for the standard size of struct page,
* debugging options may cause it to be larger in which case the
* implementation will limit the pfns advertised through
* ->direct_access() to those that are included in the memmap.
*/
- start = nsio->res.start + start_pad;
+ start = nsio->res.start;
size = resource_size(&nsio->res);
- npfns = PFN_SECTION_ALIGN_UP((size - start_pad - end_trunc - reserve)
- / PAGE_SIZE);
+ npfns = PHYS_PFN(size - SZ_8K);
if (nd_pfn->mode == PFN_MODE_PMEM) {
/*
* The altmap should be padded out to the block size used
* when populating the vmemmap. This *should* be equal to
* PMD_SIZE for most architectures.
*/
- offset = ALIGN(start + reserve + 64 * npfns,
- max(nd_pfn->align, PMD_SIZE)) - start;
+ offset = ALIGN(start + SZ_8K + 64 * npfns,
+ max(nd_pfn->align, SECTION_ACTIVE_SIZE)) - start;
} else if (nd_pfn->mode == PFN_MODE_RAM)
- offset = ALIGN(start + reserve, nd_pfn->align) - start;
+ offset = ALIGN(start + SZ_8K, nd_pfn->align) - start;
else
return -ENXIO;
- if (offset + start_pad + end_trunc >= size) {
+ if (offset >= size) {
dev_err(&nd_pfn->dev, "%s unable to satisfy requested alignment\n",
dev_name(&ndns->dev));
return -ENXIO;
}
- npfns = (size - offset - start_pad - end_trunc) / SZ_4K;
+ npfns = PHYS_PFN(size - offset);
pfn_sb->mode = cpu_to_le32(nd_pfn->mode);
pfn_sb->dataoff = cpu_to_le64(offset);
pfn_sb->npfns = cpu_to_le64(npfns);
@@ -781,8 +732,6 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
memcpy(pfn_sb->parent_uuid, nd_dev_to_uuid(&ndns->dev), 16);
pfn_sb->version_major = cpu_to_le16(1);
pfn_sb->version_minor = cpu_to_le16(3);
- pfn_sb->start_pad = cpu_to_le32(start_pad);
- pfn_sb->end_trunc = cpu_to_le32(end_trunc);
pfn_sb->align = cpu_to_le32(nd_pfn->align);
checksum = nd_sb_checksum((struct nd_gen_sb *) pfn_sb);
pfn_sb->checksum = cpu_to_le64(checksum);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 067ee217c692..3117aa9d0f33 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1114,6 +1114,10 @@ static inline unsigned long section_nr_to_pfn(unsigned long sec)
#define PAGES_PER_SUB_SECTION (SECTION_ACTIVE_SIZE / PAGE_SIZE)
#define PAGE_SUB_SECTION_MASK (~(PAGES_PER_SUB_SECTION-1))
+#define SUB_SECTION_ALIGN_UP(pfn) (((pfn) + PAGES_PER_SUB_SECTION - 1) \
+ & PAGE_SUB_SECTION_MASK)
+#define SUB_SECTION_ALIGN_DOWN(pfn) ((pfn) & PAGE_SUB_SECTION_MASK)
+
struct mem_section_usage {
/*
* SECTION_ACTIVE_SIZE portions of the section that are populated in
On Fri 22-03-19 09:57:54, Dan Williams wrote:
> [...]
> The approach is taken is to observe that each section already maintains
> an array of 'unsigned long' values to hold the pageblock_flags. A single
> additional 'unsigned long' is added to house a 'sub-section active'
> bitmask. Each bit tracks the mapped state of one sub-section's worth of
> capacity which is SECTION_SIZE / BITS_PER_LONG, or 2MB on x86-64.
So the hotplugable unit is pageblock now, right? Why is this
sufficient? What prevents new and creative HW to come up with
alignements that do not fit there? Do not get me wrong but the section
as a unit is deeply carved into the memory hotplug and removing all those
assumptions is a major undertaking and I would like to know that you are
not just shifting the problem to a smaller unit and a new/creative HW
will force us to go even more complicated.
What is the fundamental reason that pmem sections cannot be assigned
to a section aligned memory range? The physical address space is
quite large to impose 128MB sections IMHO. I thought this is merely a
configuration issue. How often this really happens and how often it is
unavoidable.
> The implication of allowing sections to be piecemeal mapped/unmapped is
> that the valid_section() helper is no longer authoritative to determine
> if a section is fully mapped. Instead pfn_valid() is updated to consult
> the section-active bitmask. Given that typical memory hotplug still has
> deep "section" dependencies the sub-section capability is limited to
> 'want_memblock=false' invocations of arch_add_memory(), effectively only
> devm_memremap_pages() users for now.
Does this mean that pfn_valid is more expensive now? How much? For any
pfn? Also what about the section life time? Who is removing section now?
I will probably have much more question, but it's friday and I am mostly
offline already. I would just like to hear much more about the new
design and resulting assumptions.
--
Michal Hocko
SUSE Labs
On Fri, Mar 22, 2019 at 11:06 AM Michal Hocko <[email protected]> wrote:
>
> On Fri 22-03-19 09:57:54, Dan Williams wrote:
> > [...]
> So the hotplugable unit is pageblock now, right?
No, with this patchset the hotplug unit is 2MB.
> Why is this sufficient?
2MB is sufficient because it allows mapping a namespace at PMD
granularity and there is no practical need to go smaller.
> What prevents new and creative HW to come up with alignements that do not fit there?
There is a resource in hardware memory controllers called
address-decode-registers that control the mapping granularity. The
minimum granularity today is 64MB and the pressure as memory sizes
increases is to make that granularity larger, not smaller. So the
hardware pressure is going in the opposite direction of your concern,
at least for persistent memory.
User-defined memory namespaces have this problem, but 2MB is the
default alignment and is sufficient for most uses.
PCI Address BARs that are also mapped with devm_memremap_pages are
aligned to their size and there is no expectation to support smaller
than 2MB.
All that said, to support a smaller sub-section granularity, just add
more bits to the section-active bitmask.
> Do not get me wrong but the section
> as a unit is deeply carved into the memory hotplug and removing all those
> assumptions is a major undertaking
Right, as stated in the cover letter, this does not remove all those
assumptions, it only removes the ones that impact
devm_memremap_pages(). Specifying that sub-section is only supported
in the 'want_memblock=false' case to arch_add_memory().
> and I would like to know that you are
> not just shifting the problem to a smaller unit and a new/creative HW
> will force us to go even more complicated.
HW will not do this to us. It's software that has the problem.
Namespace creation is unnecessarily constrained to 128MB alignment.
I'm also open to exploring lifting the section alignment constraint
for the 'want_memblock=true', but first things first.
> What is the fundamental reason that pmem sections cannot be assigned
> to a section aligned memory range? The physical address space is
> quite large to impose 128MB sections IMHO. I thought this is merely a
> configuration issue.
1) it's not just hardware that imposes this, software wants to be able
to avoid the constraint
2) the flexibility of the memory controller initialization code is
constrained by address-decode-registers. So while it is simple to say
"just configure it to be aligned" it's not that easy in practice
without throwing away usable memory capacity.
> How often this really happens and how often it is unavoidable.
Again, software can cause this problem at will. Multiple shipping
systems expose this alignment problem in physical address space, for
example: https://github.com/pmem/ndctl/issues/76
> > The implication of allowing sections to be piecemeal mapped/unmapped is
> > that the valid_section() helper is no longer authoritative to determine
> > if a section is fully mapped. Instead pfn_valid() is updated to consult
> > the section-active bitmask. Given that typical memory hotplug still has
> > deep "section" dependencies the sub-section capability is limited to
> > 'want_memblock=false' invocations of arch_add_memory(), effectively only
> > devm_memremap_pages() users for now.
>
> Does this mean that pfn_valid is more expensive now? How much?
Negligible, the information to determine whether the sub-section is
valid for a given pfn is in the same cacheline as the section-valid
flag.
> Also what about the section life time?
Section is live as long as any sub-section is active.
> Who is removing section now?
Last arch_remove_memory() that removes the last sub-section clears out
the remaining sub-section active bits.
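For illustration, a small userspace model of the scheme described in these
answers (hypothetical names, assuming one 64-bit active mask per 128MB section
and one bit per 2MB sub-section, i.e. 512 pfns with 4K pages): pfn_valid()
tests the bit covering the pfn, removal clears bits, and the section is only
torn down once the last bit goes away.

#include <stdio.h>
#include <stdint.h>

#define PFNS_PER_SUB_SECTION 512UL

struct section {
        uint64_t active;        /* one bit per 2MB sub-section */
};

static int model_pfn_valid(const struct section *sec, unsigned long pfn_in_section)
{
        return (sec->active >> (pfn_in_section / PFNS_PER_SUB_SECTION)) & 1;
}

static void model_remove(struct section *sec, unsigned long first, unsigned long nr)
{
        for (unsigned long i = 0; i < nr; i++)
                sec->active &= ~(1ULL << (first + i));
        if (!sec->active)
                printf("last sub-section removed, section can be torn down\n");
}

int main(void)
{
        struct section sec = { .active = 0xf }; /* first four 2MB sub-sections mapped */

        printf("pfn 0x200 valid: %d\n", model_pfn_valid(&sec, 0x200)); /* sub-section 1: yes */
        printf("pfn 0x900 valid: %d\n", model_pfn_valid(&sec, 0x900)); /* sub-section 4: no */
        model_remove(&sec, 0, 4);
        return 0;
}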
> I will probably have much more question, but it's friday and I am mostly
> offline already. I would just like to hear much more about the new
> design and resulting assumptions.
Happy to accommodate this discussion. The section alignment has been
an absolute horror to contend with. So I have years worth of pain to
share for as deep as you want to go on probing why this is needed.
On Fri 22-03-19 11:32:11, Dan Williams wrote:
> On Fri, Mar 22, 2019 at 11:06 AM Michal Hocko <[email protected]> wrote:
> >
> > On Fri 22-03-19 09:57:54, Dan Williams wrote:
> > > [...]
> > So the hotplugable unit is pageblock now, right?
>
> No, with this patchset the hotplug unit is 2MB.
Which is a pageblock unit on x86 with hugetlb enabled. I was just
wondering whether this is really bound to pageblock or the math just
works out to be the same.
> > Why is this sufficient?
>
> 2MB is sufficient because it allows mapping a namespace at PMD
> granularity and there is no practical need to go smaller.
>
> > What prevents new and creative HW to come up with alignements that do not fit there?
>
> There is a resource in hardware memory controllers called
> address-decode-registers that control the mapping granularity. The
> minimum granularity today is 64MB and the pressure as memory sizes
> increases is to make that granularity larger, not smaller. So the
> hardware pressure is going in the opposite direction of your concern,
> at least for persistent memory.
OK, this is good to know and actually against subsection direction.
> User-defined memory namespaces have this problem, but 2MB is the
> default alignment and is sufficient for most uses.
What does prevent users to go and use a larger alignment?
> PCI Address BARs that are also mapped with devm_memremap_pages are
> aligned to their size and there is no expectation to support smaller
> than 2MB.
>
> All that said, to support a smaller sub-section granularity, just add
> more bits to the section-active bitmask.
>
> > Do not get me wrong but the section
> > as a unit is deeply carved into the memory hotplug and removing all those
> > assumptions is a major undertaking
>
> Right, as stated in the cover letter, this does not remove all those
> assumptions, it only removes the ones that impact
> devm_memremap_pages(). Specifying that sub-section is only supported
> in the 'want_memblock=false' case to arch_add_memory().
And this is exactly the problem. Having different assumptions depending
on whether there is a memblock interface or not is utterly wrong and a
maintainability mess.
> > and I would like to know that you are
> > not just shifting the problem to a smaller unit and a new/creative HW
> > will force us to go even more complicated.
>
> HW will not do this to us. It's software that has the problem.
> Namespace creation is unnecessarily constrained to 128MB alignment.
And why is that a problem? A lack of documentation that this is a
requirement? Something will not work with a larger alignment? Something
else?
Why do we have to go a mile to tweak the kernel, especially something as
fragile as memory hotplug, just to support sub mem section ranges? This
is something that is not clearly explained in the cover letter. Sure you
are talking about hacks at the higher level to deal with this but I do
not see any fundamental reason to actually support that at all.
> I'm also open to exploring lifting the section alignment constraint
> for the 'want_memblock=true', but first things first.
I disagree. If you want to get rid of the section requirement then
do it first and build on top. This is a normal kernel development
process.
> > What is the fundamental reason that pmem sections cannot be assigned
> > to a section aligned memory range? The physical address space is
> > quite large to impose 128MB sections IMHO. I thought this is merely a
> > configuration issue.
>
> 1) it's not just hardware that imposes this, software wants to be able
> to avoid the constraint
>
> 2) the flexibility of the memory controller initialization code is
> constrained by address-decode-registers. So while it is simple to say
> "just configure it to be aligned" it's not that easy in practice
> without throwing away usable memory capacity.
Yes, and we are talking about 128MB. Is sacrificing that unit worth all
the trouble?
[...]
> > I will probably have much more question, but it's friday and I am mostly
> > offline already. I would just like to hear much more about the new
> > design and resulting assumptions.
>
> Happy to accommodate this discussion. The section alignment has been
> an absolute horror to contend with. So I have years worth of pain to
> share for as deep as you want to go on probing why this is needed.
I can feel your frustration. I am not entirely happy about the section
size limitation myself but you have to realize that this is a simplicity
vs. feature-set compromise. It works reasonably well for many use cases
but falls flat on some others. But you cannot simply build on top of
existing foundations and tweak some code paths to handle one particular
case. This is exactly how the memory hotplug code ended up in the
unfortunate state it is in now. If you want to make the code more
reusable then there is a _lot_ of ground work to do first before you can
add a shiny new feature.
--
Michal Hocko
SUSE Labs
Michal Hocko <[email protected]> writes:
>> > and I would like to know that you are
>> > not just shifting the problem to a smaller unit and a new/creative HW
>> > will force us to go even more complicated.
>>
>> HW will not do this to us. It's software that has the problem.
>> Namespace creation is unnecessarily constrained to 128MB alignment.
>
> And why is that a problem? A lack of documentation that this is a
> requirement? Something will not work with a larger alignment? Someting
> else?
See this email for one user-visible problem:
https://lore.kernel.org/lkml/[email protected]/
Cheers,
Jeff
On Mon 25-03-19 10:28:00, Jeff Moyer wrote:
> Michal Hocko <[email protected]> writes:
>
> >> > and I would like to know that you are
> >> > not just shifting the problem to a smaller unit and a new/creative HW
> >> > will force us to go even more complicated.
> >>
> >> HW will not do this to us. It's software that has the problem.
> >> Namespace creation is unnecessarily constrained to 128MB alignment.
> >
> > And why is that a problem? A lack of documentation that this is a
> > requirement? Something will not work with a larger alignment? Someting
> > else?
>
> See this email for one user-visible problem:
> https://lore.kernel.org/lkml/[email protected]/
: # ndctl create-namespace -m fsdax -s 128m
: Error: '--size=' must align to interleave-width: 6 and alignment: 2097152
: did you intend --size=132M?
:
: failed to create namespace: Invalid argument
So the size is in section size units. So what prevents userspace from
using a proper alignment? I am sorry if this is a stupid question, but I
am not really familiar with ndctl or the pmem side of it.
--
Michal Hocko
SUSE Labs
On Mon, Mar 25, 2019 at 3:20 AM Michal Hocko <[email protected]> wrote:
>
> On Fri 22-03-19 11:32:11, Dan Williams wrote:
> > On Fri, Mar 22, 2019 at 11:06 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Fri 22-03-19 09:57:54, Dan Williams wrote:
> > > > [...]
> > > So the hotplugable unit is pageblock now, right?
> >
> > No, with this patchset the hotplug unit is 2MB.
>
> Which is a pageblock unit on x86 with hugetlb enabled. I was just
> wondering whether this is really bound to pageblock or the math just
> works out to be the same.
Ah, ok just coincidental math.
> > > Why is this sufficient?
> >
> > 2MB is sufficient because it allows mapping a namespace at PMD
> > granularity and there is no practical need to go smaller.
> >
> > > What prevents new and creative HW to come up with alignements that do not fit there?
> >
> > There is a resource in hardware memory controllers called
> > address-decode-registers that control the mapping granularity. The
> > minimum granularity today is 64MB and the pressure as memory sizes
> > increases is to make that granularity larger, not smaller. So the
> > hardware pressure is going in the opposite direction of your concern,
> > at least for persistent memory.
>
> OK, this is good to know and actually against subsection direction.
Seems I forgot to mention timescales. The 64MB granularity is present
on current generation platforms, and I expect multiple platform
generations (potentially years) until it might change in the future.
That does not even take into consideration the configuration
flexibility of PCI BAR ranges and the interaction with the
peer-to-peer DMA facility which maps 'struct page' for physical ranges
that are not memory. There is no pressure for PCI BAR ranges to submit
to a 128MB alignment.
> > User-defined memory namespaces have this problem, but 2MB is the
> > default alignment and is sufficient for most uses.
>
> What does prevent users to go and use a larger alignment?
Given that we are living with 64MB granularity on mainstream platforms
for the foreseeable future, the reason users can't rely on a larger
alignment to address the issue is that the physical alignment may
change from one boot to the next.
No, you can't just wish hardware / platform firmware won't do this,
because there are not enough platform resources to give every hardware
device a guaranteed alignment.
The effect is that even if the driver deploys a software alignment
mitigation when it first sees the persistent memory range, that
alignment can be violated on a subsequent boot leading to data being
unavailable. There is no facility to communicate to the administrator
what went wrong in this scenario as several events can trigger a
physical map layout change. Add / remove of hardware and hardware
failure are the most likely causes.
An additional pain point for users is that EFI pre-boot environment
has little chance to create a namespace that Linux might be able to
use. The section size is an arbitrary Linux constraint and we should
not encode something Linux specific that might change in the future
into OS agnostic software.
> > PCI Address BARs that are also mapped with devm_memremap_pages are
> > aligned to their size and there is no expectation to support smaller
> > than 2MB.
> >
> > All that said, to support a smaller sub-section granularity, just add
> > more bits to the section-active bitmask.
> >
> > > Do not get me wrong but the section
> > > as a unit is deeply carved into the memory hotplug and removing all those
> > > assumptions is a major undertaking
> >
> > Right, as stated in the cover letter, this does not remove all those
> > assumptions, it only removes the ones that impact
> > devm_memremap_pages(). Specifying that sub-section is only supported
> > in the 'want_memblock=false' case to arch_add_memory().
>
> And this is exactly the problem. Having different assumptions depending
> on whether there is a memblock interface or not is utterly wrong and a
> maintainability mess.
In this case I disagree with you. The hotplug code already has the
want_memblock=false semantic in the implementation. The sub-section
hotplug infrastructure is a strict superset of what is there already.
Now, if it created parallel infrastructure that would indeed be a
maintainability burden, but in this case there are no behavior changes
for typical memory hotplug as it just hotplugs full sections at a time
like always. The 'section' concept is not going away.
> > > and I would like to know that you are
> > > not just shifting the problem to a smaller unit and a new/creative HW
> > > will force us to go even more complicated.
> >
> > HW will not do this to us. It's software that has the problem.
> > Namespace creation is unnecessarily constrained to 128MB alignment.
>
> And why is that a problem?
Data loss, inability to cope with some hardware configurations, and
difficulty interoperating with non-Linux software.
> A lack of documentation that this is a requirement?
It's not a requirement. It's an arbitrary Linux implementation detail.
> Something will not work with a larger alignment? Someting else?
[..]
> Why do we have to go a mile to tweak the kernel, especially something as
> fragile as memory hotplug, just to support sub mem section ranges. This
> is somthing that is not clearly explained in the cover letter. Sure you
> are talking about hacks at the higher level to deal with this but I do
> not see any fundamental reason to actually support that at all.
Like it or not, 'struct page' mappings for arbitrary hardware-physical
memory ranges is a facility that has grown from the pmem case, to hmm,
and peer-to-peer DMA. Unless you want to do the work to eliminate the
'struct page' requirement across the kernel I think it is unreasonable
to effectively archive the arch_add_memory() implementation and
prevent it from reacting to growing demands.
Note that I did try to eliminate 'struct page' before creating
devm_memremap_pages(), that effort failed because 'struct page' is
just too ingrained into too many kernel code paths.
> > I'm also open to exploring lifting the section alignment constraint
> > for the 'want_memblock=true', but first things first.
>
> I disagree. If you want to get rid of the the section requirement then
> do it first and build on top. This is a normal kernel development
> process.
>
> > > What is the fundamental reason that pmem sections cannot be assigned
> > > to a section aligned memory range? The physical address space is
> > > quite large to impose 128MB sections IMHO. I thought this is merely a
> > > configuration issue.
> >
> > 1) it's not just hardware that imposes this, software wants to be able
> > to avoid the constraint
> >
> > 2) the flexibility of the memory controller initialization code is
> > constrained by address-decode-registers. So while it is simple to say
> > "just configure it to be aligned" it's not that easy in practice
> > without throwing away usable memory capacity.
>
> Yes and we are talking about 128MB is sacrifying that unit worth all the
> troubles?
>
> [...]
>
> > > I will probably have much more question, but it's friday and I am mostly
> > > offline already. I would just like to hear much more about the new
> > > design and resulting assumptions.
> >
> > Happy to accommodate this discussion. The section alignment has been
> > an absolute horror to contend with. So I have years worth of pain to
> > share for as deep as you want to go on probing why this is needed.
>
> I can feel your frustration. I am not entirely happy about the section
> size limitation myself but you have to realize that this is simplicy vs.
> feature set compromise.
You have to realize that arch_add_memory() is no longer just a
front-end for typical memory hotplug. The requirements have changed.
Simplicity should be maintained for as long as it can get the job
done, and the simplicity is currently failing.
> It works reasonably well for many usecases but
> falls flat on some others. But you cannot simply build on top of
> existing foundations and tweak some code paths to handle one particular
> case. This is exactly how the memory hotplug ended in the unfortunate
> state it is now. If you want to make the code more reusable then there
> is a _lot_ of ground work first before you can add a shiny new feature.
This *is* the ground work. It solves the today's hardware page-mapping
pain points with a path to go deeper and enable sub-section hotplug
for typical memory hotplug, but in the meantime the two cases share
common code paths. This is not a tweak on the side, it's a super-set,
and typical memory-hotplug is properly fixed up to use just the subset
that it needs.
On Mon 25-03-19 13:03:47, Dan Williams wrote:
> On Mon, Mar 25, 2019 at 3:20 AM Michal Hocko <[email protected]> wrote:
[...]
> > > User-defined memory namespaces have this problem, but 2MB is the
> > > default alignment and is sufficient for most uses.
> >
> > What does prevent users to go and use a larger alignment?
>
> Given that we are living with 64MB granularity on mainstream platforms
> for the foreseeable future, the reason users can't rely on a larger
> alignment to address the issue is that the physical alignment may
> change from one boot to the next.
I would love to learn more about this inter-boot volatility. Could you
expand on that some more? I thought that the HW configuration presented
to the OS would be more or less stable unless the underlying HW changes.
> No, you can't just wish hardware / platform firmware won't do this,
> because there are not enough platform resources to give every hardware
> device a guaranteed alignment.
The guarantee is one part, and I can see how nobody wants to give you
something that strong, but how often does that happen in real life?
> The effect is that even if the driver deploys a software alignment
> mitigation when it first sees the persistent memory range, that
> alignment can be violated on a subsequent boot leading to data being
> unavailable. There is no facility to communicate to the administrator
> what went wrong in this scenario as several events can trigger a
> physical map layout change. Add / remove of hardware and hardware
> failure are the most likely causes.
This is indeed bad and unexpected! That is exactly something to have in
the changelog!
> An additional pain point for users is that EFI pre-boot environment
> has little chance to create a namespace that Linux might be able to
> use. The section size is an arbitrary Linux constraint and we should
> not encode something Linux specific that might change in the future
> into OS agnostic software.
This looks like a fair point but please keep in mind that there are
hotplug restrictions on other platforms as well (4MB on Windows IIRC) so
there will be some knowledge required all the time. Besides that there
are likely to be some restrictions depending on the implementation.
[...]
> > > Right, as stated in the cover letter, this does not remove all those
> > > assumptions, it only removes the ones that impact
> > > devm_memremap_pages(). Specifying that sub-section is only supported
> > > in the 'want_memblock=false' case to arch_add_memory().
> >
> > And this is exactly the problem. Having different assumptions depending
> > on whether there is a memblock interface or not is utterly wrong and a
> > maintainability mess.
>
> In this case I disagree with you. The hotplug code already has the
> want_memblock=false semantic in the implementation.
want_memblock was a hack to allow memory hotplug to not have a
user-visible sysfs interface. It was added to reduce code duplication
IIRC. Besides that, this hasn't changed the underlying assumptions about
hotplugable units or other invariants that were in place.
> The sub-section
> hotplug infrastructure is a strict superset of what is there already.
> Now, if it created parallel infrastructure that would indeed be a
> maintainability burden, but in this case there are no behavior changes
> for typical memory hotplug as it just hotplugs full sections at a time
> like always. The 'section' concept is not going away.
You are really neglecting many details here. E.g. a memory section can
be shared between two different types of memory. We've had some bugs in
the hotplug code when one section is shared between two different NUMA
nodes (e.g. 4aa9fc2a435a ("Revert "mm, memory_hotplug: initialize struct
pages for the full memory section"")). We do not allow hot-removing such
sections because it would open another can of worms. I am not saying
your implementation is incorrect - I still haven't had time to look
deeply - but stating that this is a strict superset of want_memblock is
simply wrong.
[...]
> > Why do we have to go a mile to tweak the kernel, especially something as
> > fragile as memory hotplug, just to support sub mem section ranges. This
> > is somthing that is not clearly explained in the cover letter. Sure you
> > are talking about hacks at the higher level to deal with this but I do
> > not see any fundamental reason to actually support that at all.
>
> Like it or not, 'struct page' mappings for arbitrary hardware-physical
> memory ranges is a facility that has grown from the pmem case, to hmm,
> and peer-to-peer DMA. Unless you want to do the work to eliminate the
> 'struct page' requirement across the kernel I think it is unreasonable
> to effectively archive the arch_add_memory() implementation and
> prevent it from reacting to growing demands.
I am definitely not blocking memory hotplug from being reused more! All
I am saying is that there is much more ground work to be done before you
can add features like that. There are some general assumptions in the
code, like it or not, and you should start by removing those to build on
top.
Pmem/nvdimm development is full of the "we have to do it now and find a
way to graft it into the existing infrastructure" pattern that I really
hate. Cleanup will come later, I have heard. Have a look at all the
zone_device hacks that remain. Why is this any different?
And just to make myself clear: there are places where the section
cannot go away because that is the unit in which the memory model
maintains struct pages. But the hotplug code is full of constructs where
we iterate over mem sections as one unit and operate on them as a whole.
Those have to go away before you can consider subsection hotadd/remove.
> > I can feel your frustration. I am not entirely happy about the section
> > size limitation myself but you have to realize that this is simplicy vs.
> > feature set compromise.
>
> You have to realize that arch_add_memory() is no longer just a
> front-end for typical memory hotplug. The requirements have changed.
> Simplicity should be maintained for as long as it can get the job
> done, and the simplicity is currently failing.
I do agree. But you also have to realize that this requires a lot of
work. As long as users of the API are not willing to do that work, then
I am afraid the facility will remain dumb. But putting in hacks to make
a specific use case (almost) work is not the right way.
--
Michal Hocko
SUSE Labs
On Tue, Mar 26, 2019 at 1:04 AM Michal Hocko <[email protected]> wrote:
>
> On Mon 25-03-19 13:03:47, Dan Williams wrote:
> > On Mon, Mar 25, 2019 at 3:20 AM Michal Hocko <[email protected]> wrote:
> [...]
> > > > User-defined memory namespaces have this problem, but 2MB is the
> > > > default alignment and is sufficient for most uses.
> > >
> > > What does prevent users to go and use a larger alignment?
> >
> > Given that we are living with 64MB granularity on mainstream platforms
> > for the foreseeable future, the reason users can't rely on a larger
> > alignment to address the issue is that the physical alignment may
> > change from one boot to the next.
>
> I would love to learn more about this inter boot volatility. Could you
> expand on that some more? I though that the HW configuration presented
> to the OS would be more or less stable unless the underlying HW changes.
Even if the configuration is static there can be hardware failures
that prevent a DIMM or a PCI device from being included in the memory
map. When that happens the BIOS needs to re-layout the map and the
result is not guaranteed to maintain the previous alignment.
> > No, you can't just wish hardware / platform firmware won't do this,
> > because there are not enough platform resources to give every hardware
> > device a guaranteed alignment.
>
> Guarantee is one part and I can see how nobody wants to give you
> something as strong but how often does that happen in the real life?
I expect a "rare" event to happen every day in a data-center fleet.
Failure rates tend towards 100% daily occurrence at scale, and in this
case the kernel has everything it needs to mitigate such an event.
Setting aside the success rate of a software-alignment mitigation, the
reason I am charging this hill again after a 2 year hiatus is the
realization that this problem is wider spread than the original
failing scenario. Back in 2017 the problem seemed limited to custom
memmap= configurations, and collisions between PMEM and System RAM.
Now it is clear that the collisions can happen between PMEM regions
and namespaces as well, and the problem spans platforms from multiple
vendors. Here is the most recent collision problem:
https://github.com/pmem/ndctl/issues/76, from a third-party platform.
The fix for that issue uncovered a bug in the padding implementation,
and a fix for that bug would result in even more hacks in the nvdimm
code for what is a core kernel deficiency. Code review of those
changes resulted in changing direction to go after the core
deficiency.
> > The effect is that even if the driver deploys a software alignment
> > mitigation when it first sees the persistent memory range, that
> > alignment can be violated on a subsequent boot leading to data being
> > unavailable. There is no facility to communicate to the administrator
> > what went wrong in this scenario as several events can trigger a
> > physical map layout change. Add / remove of hardware and hardware
> > failure are the most likely causes.
>
> This is indeed bad and unexpected! That is exactly something to have in
> the chagelog!
Apologies that was indeed included in the 2017 changelog (see: "a user
could inadvertently lose access to nvdimm namespaces" note here:
https://lwn.net/Articles/717383/), and I failed to carry it forward.
>
> > An additional pain point for users is that EFI pre-boot environment
> > has little chance to create a namespace that Linux might be able to
> > use. The section size is an arbitrary Linux constraint and we should
> > not encode something Linux specific that might change in the future
> > into OS agnostic software.
>
> This looks like a fair point but please keep in mind that there hotplug
> restrictions are on other platforms as well (4MB on Windows IIRC) so
> there will be some knowledge required all the time. Besides that there
> are likely to be some restrictions depending on the implementation.
Windows does not have an equivalent constraint, so it's only Linux
that imposes an arbitrary alignment restriction on pmem to agents like
EFI.
> [...]
> > > > Right, as stated in the cover letter, this does not remove all those
> > > > assumptions, it only removes the ones that impact
> > > > devm_memremap_pages(). Specifying that sub-section is only supported
> > > > in the 'want_memblock=false' case to arch_add_memory().
> > >
> > > And this is exactly the problem. Having different assumptions depending
> > > on whether there is a memblock interface or not is utterly wrong and a
> > > maintainability mess.
> >
> > In this case I disagree with you. The hotplug code already has the
> > want_memblock=false semantic in the implementation.
>
> want_memblock was a hack to allow memory hotplug to not have user
> visible sysfs interface. It was added to reduce the code duplication
> IIRC. Besides that this hasn't changed the underlying assumptions about
> hotplugable units or other invariants that were in place.
Neither does this patch series for the typical memory hotplug case.
For the device-memory use case I've gone through and fixed up the
underlying assumptions.
> > The sub-section
> > hotplug infrastructure is a strict superset of what is there already.
> > Now, if it created parallel infrastructure that would indeed be a
> > maintainability burden, but in this case there are no behavior changes
> > for typical memory hotplug as it just hotplugs full sections at a time
> > like always. The 'section' concept is not going away.
>
> You are really neglecting many details here. E.g. memory section can be
> shared between two different types of memory. We've had some bugs in the
> hotplug code when one section can be shared between two different NUMA
> nodes (e.g. 4aa9fc2a435a ("Revert "mm, memory_hotplug: initialize struct
> pages for the full memory section""). We do not allow to hotremove such
> sections because it would open another can of worms. I am not saying
> your implementation is incorrect - still haven't time to look deeply -
> but stating that this is a strict superset of want_memblock is simply
> wrong.
Please have a look at the code and the handling of "early" sections.
The assertion that I neglected to consider that detail is not true.
My "superset" contention is from the arch_add_memory() api
perspective. All typical memory hotplug use cases are a sub-case of
the new support.
> [...]
> > > Why do we have to go a mile to tweak the kernel, especially something as
> > > fragile as memory hotplug, just to support sub mem section ranges. This
> > > is something that is not clearly explained in the cover letter. Sure you
> > > are talking about hacks at the higher level to deal with this but I do
> > > not see any fundamental reason to actually support that at all.
> >
> > Like it or not, 'struct page' mappings for arbitrary hardware-physical
> > memory ranges is a facility that has grown from the pmem case, to hmm,
> > and peer-to-peer DMA. Unless you want to do the work to eliminate the
> > 'struct page' requirement across the kernel I think it is unreasonable
> > to effectively archive the arch_add_memory() implementation and
> > prevent it from reacting to growing demands.
>
> I am definitely not blocking memory hotplug to be reused more! All I am
> saying is that there is much more ground work to be done before you can
> add features like that. There are some general assumptions in the code,
> like it or not, and you should start by removing those to build on top.
Let's talk about specifics please, because I don't think you've had a
chance to consider the details in the patches. Your "start by removing
those [assumptions] to build on top" request is indeed what the
preparation patches in this series aim to achieve.
The general assumptions of the current (pre-patch-series) implementation are:
- Sections that describe boot memory (early sections) are never
unplugged / removed.
- pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y case, devolves to a
valid_section() check
- __add_pages() and helper routines assume all operations occur in
PAGES_PER_SECTION units.
- the memblock sysfs interface only comprehends full sections
Those assumptions are removed / handled with the following
implementation details respectively:
- Partially populated early sections can be extended with additional
sub-sections, and those sub-sections can be removed with
arch_remove_memory(). With this in place we no longer lose usable
memory capacity to padding.
- pfn_valid() goes beyond valid_section() to also check the
active-sub-section mask (a minimal sketch of that check follows this
list). As stated before, this indication is in the same cacheline as the
valid_section() data, so the performance impact is expected to be
negligible. So far the lkp robot has not reported any regressions.
- Outside of the core vmemmap population routines which are replaced,
other helper routines like shrink_{zone,pgdat}_span() are updated to
handle the smaller granularity. Core memory hotplug routines that deal
with online memory are not updated. That's a feature, not a bug, until
we decide that sub-section hotplug makes sense for online / typical
memory as well.
- the existing memblock sysfs user api guarantees / assumptions are
not touched since this capability is limited to !online
!sysfs-accessible sections for now.
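
To make the pfn_valid() item above concrete, the check being added is
along these lines (field and helper names here are illustrative
assumptions, not necessarily the exact identifiers used in the patches):

/*
 * One extra bitmap per section tracks which SECTION_SIZE / BITS_PER_LONG
 * (2MB on x86-64) sub-sections are actually mapped.
 */
#define PFNS_PER_SUBSECTION	(PAGES_PER_SECTION / BITS_PER_LONG)

struct mem_section_usage {
	unsigned long subsection_map;	/* 1 bit per 2MB sub-section */
	unsigned long pageblock_flags[0];
};

static inline bool pfn_subsection_valid(struct mem_section *ms, unsigned long pfn)
{
	unsigned long idx = (pfn & (PAGES_PER_SECTION - 1)) / PFNS_PER_SUBSECTION;

	return ms->usage->subsection_map & (1UL << idx);
}

/*
 * pfn_valid() then becomes, roughly,
 *	present_section(ms) && pfn_subsection_valid(ms, pfn)
 * rather than relying on valid_section() alone.
 */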
> Pmem/nvdimm development is full of "we have to do it now and find a way
> to graft it into the existing infrastructure" pattern that I really
> hate. Clean up will come later, I have heard. Have a look at all
> zone_device hacks that remained. Why is this any different?
This is indeed different because unlike memmap_init_zone_device(),
which is arguably a side-hack to move 'struct page' init outside the
mem_hotplug_lock just for ZONE_DEVICE, this implementation is reused
in the main memory hotplug path. It's not a "temporary implementation
until something better comes along", it moves the implementation
forward not sideways.
> And just to make myself clear. There are places where sections cannot go
> away because that is the unit in which the memory model maintains struct
> pages. But the hotplug code is full of constructs where we iterate over mem
> sections as one unit and operate on them as a whole. Those have to go away
> before you can consider subsection hotadd/remove.
>
> > > I can feel your frustration. I am not entirely happy about the section
> > > size limitation myself but you have to realize that this is a simplicity
> > > vs. feature-set compromise.
> >
> > You have to realize that arch_add_memory() is no longer just a
> > front-end for typical memory hotplug. The requirements have changed.
> > Simplicity should be maintained for as long as it can get the job
> > done, and the simplicity is currently failing.
>
> I do agree. But you also have to realize that this requires a lot of
> work. As long as users of the api are not willing to do that work then
> I am afraid the facility will remain dumb. But putting in hacks to make
> a specific use case (almost) work is not the right way.
Please look at the patches. This isn't a half-step, it's a solution to
a problem that has haunted the implementation for years. If there are
opportunities for additional cleanups please point them out.
On Tue 26-03-19 17:20:41, Dan Williams wrote:
> On Tue, Mar 26, 2019 at 1:04 AM Michal Hocko <[email protected]> wrote:
> >
> > On Mon 25-03-19 13:03:47, Dan Williams wrote:
> > > On Mon, Mar 25, 2019 at 3:20 AM Michal Hocko <[email protected]> wrote:
> > [...]
> > > > > User-defined memory namespaces have this problem, but 2MB is the
> > > > > default alignment and is sufficient for most uses.
> > > >
> > > > What does prevent users to go and use a larger alignment?
> > >
> > > Given that we are living with 64MB granularity on mainstream platforms
> > > for the foreseeable future, the reason users can't rely on a larger
> > > alignment to address the issue is that the physical alignment may
> > > change from one boot to the next.
> >
> > I would love to learn more about this inter boot volatility. Could you
> > expand on that some more? I though that the HW configuration presented
> > to the OS would be more or less stable unless the underlying HW changes.
>
> Even if the configuration is static there can be hardware failures
> that prevent a DIMM, or a PCI device to be included in the memory map.
> When that happens the BIOS needs to re-layout the map and the result
> is not guaranteed to maintain the previous alignment.
>
> > > No, you can't just wish hardware / platform firmware won't do this,
> > > because there are not enough platform resources to give every hardware
> > > device a guaranteed alignment.
> >
> > Guarantee is one part and I can see how nobody wants to give you
> > something as strong but how often does that happen in the real life?
>
> I expect a "rare" event to happen everyday in a data-center fleet.
> Failure rates tend towards 100% daily occurrence at scale and in this
> case the kernel has everything it needs to mitigate such an event.
>
> Setting aside the success rate of a software-alignment mitigation, the
> reason I am charging this hill again after a 2 year hiatus is the
> realization that this problem is wider spread than the original
> failing scenario. Back in 2017 the problem seemed limited to custom
> memmap= configurations, and collisions between PMEM and System RAM.
> Now it is clear that the collisions can happen between PMEM regions
> and namespaces as well, and the problem spans platforms from multiple
> vendors. Here is the most recent collision problem:
> https://github.com/pmem/ndctl/issues/76, from a third-party platform.
>
> The fix for that issue uncovered a bug in the padding implementation,
> and a fix for that bug would result in even more hacks in the nvdimm
> code for what is a core kernel deficiency. Code review of those
> changes resulted in changing direction to go after the core
> deficiency.
This kind of information along with real world examples is exactly what
you should have added into the cover letter. The previous very vague
claims were not really convincing, nor something that can be considered a
proper justification. Please do realize that people who are not working
with the affected HW are unlikely to have an idea how serious/relevant
those problems really are.
People are asking for a smaller memory hotplug granularity for other
usecases (e.g. memory ballooning into VMs) which are quite dubious to
be honest and not really worth all the code rework. If we are talking
about something that can be worked around elsewhere then it is preferred
because the code base is not in an excellent shape and putting more on
top is just going to cause more headaches.
I will try to find some time to review this more deeply (no promises
though because time is hectic and this is not a simple feature). For the
future, please try harder to write up a proper justification and a
highlevel design description which tells a bit about all important parts
of the new scheme.
--
Michal Hocko
SUSE Labs
On Wed, Mar 27, 2019 at 9:13 AM Michal Hocko <[email protected]> wrote:
>
> On Tue 26-03-19 17:20:41, Dan Williams wrote:
> > On Tue, Mar 26, 2019 at 1:04 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Mon 25-03-19 13:03:47, Dan Williams wrote:
> > > > On Mon, Mar 25, 2019 at 3:20 AM Michal Hocko <[email protected]> wrote:
> > > [...]
> > > > > > User-defined memory namespaces have this problem, but 2MB is the
> > > > > > default alignment and is sufficient for most uses.
> > > > >
> > > > > What does prevent users to go and use a larger alignment?
> > > >
> > > > Given that we are living with 64MB granularity on mainstream platforms
> > > > for the foreseeable future, the reason users can't rely on a larger
> > > > alignment to address the issue is that the physical alignment may
> > > > change from one boot to the next.
> > >
> > > I would love to learn more about this inter boot volatility. Could you
> > > expand on that some more? I though that the HW configuration presented
> > > to the OS would be more or less stable unless the underlying HW changes.
> >
> > Even if the configuration is static there can be hardware failures
> > that prevent a DIMM, or a PCI device to be included in the memory map.
> > When that happens the BIOS needs to re-layout the map and the result
> > is not guaranteed to maintain the previous alignment.
> >
> > > > No, you can't just wish hardware / platform firmware won't do this,
> > > > because there are not enough platform resources to give every hardware
> > > > device a guaranteed alignment.
> > >
> > > Guarantee is one part and I can see how nobody wants to give you
> > > something as strong but how often does that happen in the real life?
> >
> > I expect a "rare" event to happen everyday in a data-center fleet.
> > Failure rates tend towards 100% daily occurrence at scale and in this
> > case the kernel has everything it needs to mitigate such an event.
> >
> > Setting aside the success rate of a software-alignment mitigation, the
> > reason I am charging this hill again after a 2 year hiatus is the
> > realization that this problem is wider spread than the original
> > failing scenario. Back in 2017 the problem seemed limited to custom
> > memmap= configurations, and collisions between PMEM and System RAM.
> > Now it is clear that the collisions can happen between PMEM regions
> > and namespaces as well, and the problem spans platforms from multiple
> > vendors. Here is the most recent collision problem:
> > https://github.com/pmem/ndctl/issues/76, from a third-party platform.
> >
> > The fix for that issue uncovered a bug in the padding implementation,
> > and a fix for that bug would result in even more hacks in the nvdimm
> > code for what is a core kernel deficiency. Code review of those
> > changes resulted in changing direction to go after the core
> > deficiency.
>
> This kind of information along with real world examples is exactly what
> you should have added into the cover letter. A previous very vague
> claims were not really convincing or something that can be considered a
> proper justification. Please do realize that people who are not working
> with the affected HW are unlikely to have an idea how serious/relevant
> those problems really are.
>
> People are asking for a smaller memory hotplug granularity for other
> usecases (e.g. memory ballooning into VMs) which are quite dubious to
> be honest and not really worth all the code rework. If we are talking
> about something that can be worked around elsewhere then it is preferred
> because the code base is not in an excellent shape and putting more on
> top is just going to cause more headaches.
>
> I will try to find some time to review this more deeply (no promises
> though because time is hectic and this is not a simple feature). For the
> future, please try harder to write up a proper justification and a
> highlevel design description which tells a bit about all important parts
> of the new scheme.
Fair enough. I've been steeped in this for too long, and should have
taken a wider view to bring reviewers up to speed.
On 27.03.19 17:13, Michal Hocko wrote:
> On Tue 26-03-19 17:20:41, Dan Williams wrote:
>> On Tue, Mar 26, 2019 at 1:04 AM Michal Hocko <[email protected]> wrote:
>>>
>>> On Mon 25-03-19 13:03:47, Dan Williams wrote:
>>>> On Mon, Mar 25, 2019 at 3:20 AM Michal Hocko <[email protected]> wrote:
>>> [...]
>>>>>> User-defined memory namespaces have this problem, but 2MB is the
>>>>>> default alignment and is sufficient for most uses.
>>>>>
>>>>> What does prevent users to go and use a larger alignment?
>>>>
>>>> Given that we are living with 64MB granularity on mainstream platforms
>>>> for the foreseeable future, the reason users can't rely on a larger
>>>> alignment to address the issue is that the physical alignment may
>>>> change from one boot to the next.
>>>
>>> I would love to learn more about this inter boot volatility. Could you
>>> expand on that some more? I though that the HW configuration presented
>>> to the OS would be more or less stable unless the underlying HW changes.
>>
>> Even if the configuration is static there can be hardware failures
>> that prevent a DIMM, or a PCI device to be included in the memory map.
>> When that happens the BIOS needs to re-layout the map and the result
>> is not guaranteed to maintain the previous alignment.
>>
>>>> No, you can't just wish hardware / platform firmware won't do this,
>>>> because there are not enough platform resources to give every hardware
>>>> device a guaranteed alignment.
>>>
>>> Guarantee is one part and I can see how nobody wants to give you
>>> something as strong but how often does that happen in the real life?
>>
>> I expect a "rare" event to happen everyday in a data-center fleet.
>> Failure rates tend towards 100% daily occurrence at scale and in this
>> case the kernel has everything it needs to mitigate such an event.
>>
>> Setting aside the success rate of a software-alignment mitigation, the
>> reason I am charging this hill again after a 2 year hiatus is the
>> realization that this problem is wider spread than the original
>> failing scenario. Back in 2017 the problem seemed limited to custom
>> memmap= configurations, and collisions between PMEM and System RAM.
>> Now it is clear that the collisions can happen between PMEM regions
>> and namespaces as well, and the problem spans platforms from multiple
>> vendors. Here is the most recent collision problem:
>> https://github.com/pmem/ndctl/issues/76, from a third-party platform.
>>
>> The fix for that issue uncovered a bug in the padding implementation,
>> and a fix for that bug would result in even more hacks in the nvdimm
>> code for what is a core kernel deficiency. Code review of those
>> changes resulted in changing direction to go after the core
>> deficiency.
>
> This kind of information along with real world examples is exactly what
> you should have added into the cover letter. A previous very vague
> claims were not really convincing or something that can be considered a
> proper justification. Please do realize that people who are not working
> with the affected HW are unlikely to have an idea how serious/relevant
> those problems really are.
>
> People are asking for a smaller memory hotplug granularity for other
> usecases (e.g. memory ballooning into VMs) which are quite dubious to
> be honest and not really worth all the code rework. If we are talking
> about something that can be worked around elsewhere then it is preferred
> because the code base is not in an excellent shape and putting more on
> top is just going to cause more headaches.
At least for virtio-mem, it will be handled similarly to xen-balloon and
hyper-v balloon, where whole sections are added and some parts are kept
"soft-offline". But there, one device "owns" the complete section, it
does not overlap with other devices. One section only has one owner.
As we discussed a similar approach back then with virtio-mem
(online/offline of smaller blocks), you had a pretty good point that
such complexity is better avoided in core MM. Sections really seem to be
the granularity with which core MM should work. At least speaking about
!pmem memory hotplug.
>
> I will try to find some time to review this more deeply (no promises
> though because time is hectic and this is not a simple feature). For the
> future, please try harder to write up a proper justification and a
> highlevel design description which tells a bit about all important parts
> of the new scheme.
>
--
Thanks,
David / dhildenb
On Thu 28-03-19 14:38:15, David Hildenbrand wrote:
> On 27.03.19 17:13, Michal Hocko wrote:
[...]
> > People are asking for a smaller memory hotplug granularity for other
> > usecases (e.g. memory ballooning into VMs) which are quite dubious to
> > be honest and not really worth all the code rework. If we are talking
> > about something that can be worked around elsewhere then it is preferred
> > because the code base is not in an excellent shape and putting more on
> > top is just going to cause more headaches.
>
> At least for virtio-mem, it will be handled similar to xen-balloon and
> hyper-v balloon, where whole actions are added and some parts are kept
> "soft-offline". But there, one device "owns" the complete section, it
> does not overlap with other devices. One section only has one owner.
This is exactly what I meant by handling it at a higher level.
--
Michal Hocko
SUSE Labs
On 22.03.19 17:57, Dan Williams wrote:
> Changes since v4 [1]:
> - Given v4 was from March of 2017 the bulk of the changes result from
> rebasing the patch set from a v4.11-rc2 baseline to v5.1-rc1.
>
> - A unit test is added to ndctl to exercise the creation and dax
> mounting of multiple independent namespaces in a single 128M section.
>
> [1]: https://lwn.net/Articles/717383/
>
> ---
I'm gonna have to ask some very basic questions:
You are using the term "Sub-section memory hotplug support", but is it
actually what you mean? To rephrase, aren't we talking here about
"Sub-section device memory hotplug support" or similar?
Reason I am asking is because I wonder how that would interact with the
memory block device infrastructure and hotplugging of system ram -
add_memory()/add_memory_resource(). I *assume* you are not changing the
add_memory() interface, so that one still only works with whole sections
(or well, memory_block_size_bytes()) - check_hotplug_memory_range().
In general, mix and matching system RAM and persistent memory per
section, I am not a friend of that. Especially when it comes to memory
block devices. But I am getting the feeling that we are rather targeting
PMEM vs. PMEM with this patch series.
>
> Quote patch7:
>
> "The libnvdimm sub-system has suffered a series of hacks and broken
> workarounds for the memory-hotplug implementation's awkward
> section-aligned (128MB) granularity. For example the following backtrace
> is emitted when attempting arch_add_memory() with physical address
> ranges that intersect 'System RAM' (RAM) with 'Persistent Memory' (PMEM)
> within a given section:
>
> WARNING: CPU: 0 PID: 558 at kernel/memremap.c:300 devm_memremap_pages+0x3b5/0x4c0
> devm_memremap_pages attempted on mixed region [mem 0x200000000-0x2fbffffff flags 0x200]
> [..]
> Call Trace:
> dump_stack+0x86/0xc3
> __warn+0xcb/0xf0
> warn_slowpath_fmt+0x5f/0x80
> devm_memremap_pages+0x3b5/0x4c0
> __wrap_devm_memremap_pages+0x58/0x70 [nfit_test_iomap]
> pmem_attach_disk+0x19a/0x440 [nd_pmem]
>
> Recently it was discovered that the problem goes beyond RAM vs PMEM
> collisions as some platform produce PMEM vs PMEM collisions within a
As side-noted by Michal, I wonder if PMEM vs. PMEM cannot rather be
implemented "on top" of what we have right now. Or is this what we
already have that you call "hacks in nvdimm" code? (no NVDIMM expert,
sorry for the stupid questions)
> given section. The libnvdimm workaround for that case revealed that the
> libnvdimm section-alignment-padding implementation has been broken for a
> long while. A fix for that long-standing breakage introduces as many
> problems as it solves as it would require a backward-incompatible change
> to the namespace metadata interpretation. Instead of that dubious route
> [2], address the root problem in the memory-hotplug implementation."
>
> The approach is taken is to observe that each section already maintains
> an array of 'unsigned long' values to hold the pageblock_flags. A single
> additional 'unsigned long' is added to house a 'sub-section active'
> bitmask. Each bit tracks the mapped state of one sub-section's worth of
> capacity which is SECTION_SIZE / BITS_PER_LONG, or 2MB on x86-64.
>
> The implication of allowing sections to be piecemeal mapped/unmapped is
> that the valid_section() helper is no longer authoritative to determine
> if a section is fully mapped. Instead pfn_valid() is updated to consult
> the section-active bitmask. Given that typical memory hotplug still has
> deep "section" dependencies the sub-section capability is limited to
> 'want_memblock=false' invocations of arch_add_memory(), effectively only
> devm_memremap_pages() users for now.
Ah, there it is. And my point would be, please don't ever unlock
something like that for want_memblock=true. Especially not for memory
added after boot via device drivers (add_memory()).
>
> With this in place the hacks in the libnvdimm sub-system can be
> dropped, and other devm_memremap_pages() users need no longer be
> constrained to 128MB mapping granularity.
--
Thanks,
David / dhildenb
On Thu, Mar 28, 2019 at 1:10 PM David Hildenbrand <[email protected]> wrote:
>
> On 22.03.19 17:57, Dan Williams wrote:
> > Changes since v4 [1]:
> > - Given v4 was from March of 2017 the bulk of the changes result from
> > rebasing the patch set from a v4.11-rc2 baseline to v5.1-rc1.
> >
> > - A unit test is added to ndctl to exercise the creation and dax
> > mounting of multiple independent namespaces in a single 128M section.
> >
> > [1]: https://lwn.net/Articles/717383/
> >
> > ---
>
> I'm gonna have to ask some very basic questions:
No worries.
>
> You are using the term "Sub-section memory hotplug support", but is it
> actually what you mean? To rephrase, aren't we talking here about
> "Sub-section device memory hotplug support" or similar?
Specifically it is support for passing @start and @size arguments to
arch_add_memory() that are not section aligned. It's not limited to
"device memory" which is otherwise not a concept that
arch_add_memory() understands, it just groks spans of pfns.
> Reason I am asking is because I wonder how that would interact with the
> memory block device infrastructure and hotplugging of system ram -
> add_memory()/add_memory_resource(). I *assume* you are not changing the
> add_memory() interface, so that one still only works with whole sections
> (or well, memory_block_size_bytes()) - check_hotplug_memory_range().
Like you found below, the implementation enforces that add_memory_*()
interfaces maintain section alignment for @start and @size.
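
To make the two alignment regimes explicit, the checks amount to roughly
the following (sketch only; these helper names are illustrative, the real
gatekeeper on the add_memory() side is check_hotplug_memory_range()):

static bool range_aligned(u64 start, u64 size, u64 align)
{
	return size && IS_ALIGNED(start, align) && IS_ALIGNED(size, align);
}

/* add_memory()/add_memory_resource(): whole memory blocks, as today */
static int check_memblock_range(u64 start, u64 size)
{
	return range_aligned(start, size, memory_block_size_bytes()) ? 0 : -EINVAL;
}

/* arch_add_memory() via devm_memremap_pages(): 2MB sub-sections suffice */
static int check_subsection_range(u64 start, u64 size)
{
	return range_aligned(start, size, SZ_2M) ? 0 : -EINVAL;
}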
> In general, mix and matching system RAM and persistent memory per
> section, I am not a friend of that.
You have no choice. The platform may decide to map PMEM and System RAM
in the same section because the Linux section is too large compared to
typical memory controller mapping granularity capability.
> Especially when it comes to memory
> block devices. But I am getting the feeling that we are rather targeting
> PMEM vs. PMEM with this patch series.
The collisions are between System RAM, PMEM regions, and PMEM
namespaces (sub-divisions of regions that each need their own mapping
lifetime).
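
As a purely made-up example of the kind of layout involved (the addresses
are illustrative, not from a real platform):

  0x100000000 - 0x107ffffff   one 128MB section
    0x100000000 - 0x103ffffff  System RAM              (64MB)
    0x104000000 - 0x105ffffff  PMEM region, namespace0 (32MB, own lifetime)
    0x106000000 - 0x107ffffff  PMEM region, namespace1 (32MB, own lifetime)

Every boundary above is 2MB aligned, so sub-section hotplug can map and
unmap each piece independently; at 128MB section granularity the three
owners collide within a single section.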
> > Quote patch7:
> >
> > "The libnvdimm sub-system has suffered a series of hacks and broken
> > workarounds for the memory-hotplug implementation's awkward
> > section-aligned (128MB) granularity. For example the following backtrace
> > is emitted when attempting arch_add_memory() with physical address
> > ranges that intersect 'System RAM' (RAM) with 'Persistent Memory' (PMEM)
> > within a given section:
> >
> > WARNING: CPU: 0 PID: 558 at kernel/memremap.c:300 devm_memremap_pages+0x3b5/0x4c0
> > devm_memremap_pages attempted on mixed region [mem 0x200000000-0x2fbffffff flags 0x200]
> > [..]
> > Call Trace:
> > dump_stack+0x86/0xc3
> > __warn+0xcb/0xf0
> > warn_slowpath_fmt+0x5f/0x80
> > devm_memremap_pages+0x3b5/0x4c0
> > __wrap_devm_memremap_pages+0x58/0x70 [nfit_test_iomap]
> > pmem_attach_disk+0x19a/0x440 [nd_pmem]
> >
> > Recently it was discovered that the problem goes beyond RAM vs PMEM
> > collisions as some platform produce PMEM vs PMEM collisions within a
>
> As side-noted by Michal, I wonder if PMEM vs. PMEM cannot rather be
> implemented "on top" of what we have right now. Or is this what we
> already have that you call "hacks in nvdimm" code? (no NVDIMM expert,
> sorry for the stupid questions)
It doesn't work, because even if the padding was implemented 100%
correct, which thus far has failed to be the case, the platform may
change physical alignments from one boot to the next for a variety of
reasons.
>
> > given section. The libnvdimm workaround for that case revealed that the
> > libnvdimm section-alignment-padding implementation has been broken for a
> > long while. A fix for that long-standing breakage introduces as many
> > problems as it solves as it would require a backward-incompatible change
> > to the namespace metadata interpretation. Instead of that dubious route
> > [2], address the root problem in the memory-hotplug implementation."
> >
> > The approach is taken is to observe that each section already maintains
> > an array of 'unsigned long' values to hold the pageblock_flags. A single
> > additional 'unsigned long' is added to house a 'sub-section active'
> > bitmask. Each bit tracks the mapped state of one sub-section's worth of
> > capacity which is SECTION_SIZE / BITS_PER_LONG, or 2MB on x86-64.
> >
> > The implication of allowing sections to be piecemeal mapped/unmapped is
> > that the valid_section() helper is no longer authoritative to determine
> > if a section is fully mapped. Instead pfn_valid() is updated to consult
> > the section-active bitmask. Given that typical memory hotplug still has
> > deep "section" dependencies the sub-section capability is limited to
> > 'want_memblock=false' invocations of arch_add_memory(), effectively only
> > devm_memremap_pages() users for now.
>
> Ah, there it is. And my point would be, please don't ever unlock
> something like that for want_memblock=true. Especially not for memory
> added after boot via device drivers (add_memory()).
I don't see a strong reason why not, as long as it does not regress
existing use cases. It might need to be an opt-in for new tooling that
is aware of finer granularity hotplug. That said, I have no pressing
need to go there and just care about the arch_add_memory() capability
for now.
>> You are using the term "Sub-section memory hotplug support", but is it
>> actually what you mean? To rephrase, aren't we talking here about
>> "Sub-section device memory hotplug support" or similar?
>
> Specifically it is support for passing @start and @size arguments to
> arch_add_memory() that are not section aligned. It's not limited to
> "device memory" which is otherwise not a concept that
> arch_add_memory() understands, it just groks spans of pfns.
Okay, so everything that does not have a memory block device as of now.
>
>> Reason I am asking is because I wonder how that would interact with the
>> memory block device infrastructure and hotplugging of system ram -
>> add_memory()/add_memory_resource(). I *assume* you are not changing the
>> add_memory() interface, so that one still only works with whole sections
>> (or well, memory_block_size_bytes()) - check_hotplug_memory_range().
>
> Like you found below, the implementation enforces that add_memory_*()
> interfaces maintain section alignment for @start and @size.
>
>> In general, mix and matching system RAM and persistent memory per
>> section, I am not a friend of that.
>
> You have no choice. The platform may decide to map PMEM and System RAM
> in the same section because the Linux section is too large compared to
> typical memory controller mapping granularity capability.
I might be very wrong here, but do we actually care about something like
64MB getting lost in the cracks? I mean if it simplifies core MM, let go
of the couple of MB of system ram and handle the PMEM part only. Treat
the system ram parts like memory holes we already have in ordinary
sections (well, there we simply set the relevant struct pages to
PG_reserved). Of course, if we have hundreds of unaligned devices, stuff
will start to add up ... but I assume this is not the case?
>
>> Especially when it comes to memory
>> block devices. But I am getting the feeling that we are rather targeting
>> PMEM vs. PMEM with this patch series.
>
> The collisions are between System RAM, PMEM regions, and PMEM
> namespaces (sub-divisions of regions that each need their own mapping
> lifetime).
Understood. I wonder if that PMEM only mapping (including separate
lifetime) could be handled differently. But I am absolutely no expert,
just curious.
>
>>> Quote patch7:
>>>
>>> "The libnvdimm sub-system has suffered a series of hacks and broken
>>> workarounds for the memory-hotplug implementation's awkward
>>> section-aligned (128MB) granularity. For example the following backtrace
>>> is emitted when attempting arch_add_memory() with physical address
>>> ranges that intersect 'System RAM' (RAM) with 'Persistent Memory' (PMEM)
>>> within a given section:
>>>
>>> WARNING: CPU: 0 PID: 558 at kernel/memremap.c:300 devm_memremap_pages+0x3b5/0x4c0
>>> devm_memremap_pages attempted on mixed region [mem 0x200000000-0x2fbffffff flags 0x200]
>>> [..]
>>> Call Trace:
>>> dump_stack+0x86/0xc3
>>> __warn+0xcb/0xf0
>>> warn_slowpath_fmt+0x5f/0x80
>>> devm_memremap_pages+0x3b5/0x4c0
>>> __wrap_devm_memremap_pages+0x58/0x70 [nfit_test_iomap]
>>> pmem_attach_disk+0x19a/0x440 [nd_pmem]
>>>
>>> Recently it was discovered that the problem goes beyond RAM vs PMEM
>>> collisions as some platform produce PMEM vs PMEM collisions within a
>>
>> As side-noted by Michal, I wonder if PMEM vs. PMEM cannot rather be
>> implemented "on top" of what we have right now. Or is this what we
>> already have that you call "hacks in nvdimm" code? (no NVDIMM expert,
>> sorry for the stupid questions)
>
> It doesn't work, because even if the padding was implemented 100%
> correct, which thus far has failed to be the case, the platform may
> change physical alignments from one boot to the next for a variety of
> reasons.
Would ignoring the System RAM parts (as mentioned above) help or doesn't
it make any difference in terms of complexity?
>
>>
>>> given section. The libnvdimm workaround for that case revealed that the
>>> libnvdimm section-alignment-padding implementation has been broken for a
>>> long while. A fix for that long-standing breakage introduces as many
>>> problems as it solves as it would require a backward-incompatible change
>>> to the namespace metadata interpretation. Instead of that dubious route
>>> [2], address the root problem in the memory-hotplug implementation."
>>>
>>> The approach is taken is to observe that each section already maintains
>>> an array of 'unsigned long' values to hold the pageblock_flags. A single
>>> additional 'unsigned long' is added to house a 'sub-section active'
>>> bitmask. Each bit tracks the mapped state of one sub-section's worth of
>>> capacity which is SECTION_SIZE / BITS_PER_LONG, or 2MB on x86-64.
>>>
>>> The implication of allowing sections to be piecemeal mapped/unmapped is
>>> that the valid_section() helper is no longer authoritative to determine
>>> if a section is fully mapped. Instead pfn_valid() is updated to consult
>>> the section-active bitmask. Given that typical memory hotplug still has
>>> deep "section" dependencies the sub-section capability is limited to
>>> 'want_memblock=false' invocations of arch_add_memory(), effectively only
>>> devm_memremap_pages() users for now.
>>
>> Ah, there it is. And my point would be, please don't ever unlock
>> something like that for want_memblock=true. Especially not for memory
>> added after boot via device drivers (add_memory()).
>
> I don't see a strong reason why not, as long as it does not regress
> existing use cases. It might need to be an opt-in for new tooling that
> is aware of finer granularity hotplug. That said, I have no pressing
> need to go there and just care about the arch_add_memory() capability
> for now.
Especially onlining/offlining of memory might end up very ugly. And that
goes hand in hand with memory block devices. They are either online or
offline, not something in between. (I went that path and Michal
correctly told me why it is not a good idea)
I was recently trying to teach memory block devices who their owner is /
of which type they are. Right now I am looking into the option of using
drivers. Memory block devices that could belong to different drivers at
a time are well ... totally broken. I assume it would still be a special
case, though, but conceptually speaking about the interface it would be
allowed.
Memory block devices (and therefore 1..X sections) should have one owner
only. Anything else just does not fit.
--
Thanks,
David / dhildenb
On Thu, Mar 28, 2019 at 2:17 PM David Hildenbrand <[email protected]> wrote:
>
> >> You are using the term "Sub-section memory hotplug support", but is it
> >> actually what you mean? To rephrase, aren't we talking here about
> >> "Sub-section device memory hotplug support" or similar?
> >
> > Specifically it is support for passing @start and @size arguments to
> > arch_add_memory() that are not section aligned. It's not limited to
> > "device memory" which is otherwise not a concept that
> > arch_add_memory() understands, it just groks spans of pfns.
>
> Okay, so everything that does not have a memory block devices as of now.
>
> >
> >> Reason I am asking is because I wonder how that would interact with the
> >> memory block device infrastructure and hotplugging of system ram -
> >> add_memory()/add_memory_resource(). I *assume* you are not changing the
> >> add_memory() interface, so that one still only works with whole sections
> >> (or well, memory_block_size_bytes()) - check_hotplug_memory_range().
> >
> > Like you found below, the implementation enforces that add_memory_*()
> > interfaces maintain section alignment for @start and @size.
> >
> >> In general, mix and matching system RAM and persistent memory per
> >> section, I am not a friend of that.
> >
> > You have no choice. The platform may decide to map PMEM and System RAM
> > in the same section because the Linux section is too large compared to
> > typical memory controller mapping granularity capability.
>
> I might be very wrong here, but do we actually care about something like
> 64MB getting lost in the cracks? I mean if it simplifies core MM, let go
> of the couple of MB of system ram and handle the PMEM part only. Treat
> the system ram parts like memory holes we already have in ordinary
> sections (well, there we simply set the relevant struct pages to
> PG_reserved). Of course, if we have hundreds of unaligned devices and
> stuff will start to add up ... but I assume this is not the case?
That's precisely what we do today and it has become untenable as the
collision scenarios pile up. This thread [1] is worth a read if you
care about some of the gory details why I'm back to pushing for
sub-section support, but most of it has already been summarized in the
current discussion on this thread.
[1]: https://lore.kernel.org/lkml/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com/
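
For reference, the existing workaround is conceptually just inward
rounding (a sketch, not the actual drivers/nvdimm/pfn_devs.c code):

/* Round the usable range inward to 128MB sections, sacrificing the rest. */
static void section_pad(u64 start, u64 size, u64 *new_start, u64 *new_size)
{
	u64 section = 128ULL << 20;	/* SECTION_SIZE on x86-64 */
	u64 aligned_start = ALIGN(start, section);
	u64 aligned_end = ALIGN_DOWN(start + size, section);

	*new_start = aligned_start;
	*new_size = aligned_end > aligned_start ? aligned_end - aligned_start : 0;
}

With sub-section hotplug that rounding, and the capacity it throws away,
is no longer needed.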
>
> >
> >> Especially when it comes to memory
> >> block devices. But I am getting the feeling that we are rather targeting
> >> PMEM vs. PMEM with this patch series.
> >
> > The collisions are between System RAM, PMEM regions, and PMEM
> > namespaces (sub-divisions of regions that each need their own mapping
> > lifetime).
>
> Understood. I wonder if that PMEM only mapping (including separate
> lifetime) could be handled differently. But I am absolutely no expert,
> just curious.
I refer you to the above thread trying to fix the libnvdimm-local hacks.
>
> >
> >>> Quote patch7:
> >>>
> >>> "The libnvdimm sub-system has suffered a series of hacks and broken
> >>> workarounds for the memory-hotplug implementation's awkward
> >>> section-aligned (128MB) granularity. For example the following backtrace
> >>> is emitted when attempting arch_add_memory() with physical address
> >>> ranges that intersect 'System RAM' (RAM) with 'Persistent Memory' (PMEM)
> >>> within a given section:
> >>>
> >>> WARNING: CPU: 0 PID: 558 at kernel/memremap.c:300 devm_memremap_pages+0x3b5/0x4c0
> >>> devm_memremap_pages attempted on mixed region [mem 0x200000000-0x2fbffffff flags 0x200]
> >>> [..]
> >>> Call Trace:
> >>> dump_stack+0x86/0xc3
> >>> __warn+0xcb/0xf0
> >>> warn_slowpath_fmt+0x5f/0x80
> >>> devm_memremap_pages+0x3b5/0x4c0
> >>> __wrap_devm_memremap_pages+0x58/0x70 [nfit_test_iomap]
> >>> pmem_attach_disk+0x19a/0x440 [nd_pmem]
> >>>
> >>> Recently it was discovered that the problem goes beyond RAM vs PMEM
> >>> collisions as some platform produce PMEM vs PMEM collisions within a
> >>
> >> As side-noted by Michal, I wonder if PMEM vs. PMEM cannot rather be
> >> implemented "on top" of what we have right now. Or is this what we
> >> already have that you call "hacks in nvdimm" code? (no NVDIMM expert,
> >> sorry for the stupid questions)
> >
> > It doesn't work, because even if the padding was implemented 100%
> > correct, which thus far has failed to be the case, the platform may
> > change physical alignments from one boot to the next for a variety of
> > reasons.
>
> Would ignoring the System RAM parts (as mentioned above) help or doesn't
> it make any difference in terms of complexity?
Doesn't help much, that's only one of many collision sources.
> >>> given section. The libnvdimm workaround for that case revealed that the
> >>> libnvdimm section-alignment-padding implementation has been broken for a
> >>> long while. A fix for that long-standing breakage introduces as many
> >>> problems as it solves as it would require a backward-incompatible change
> >>> to the namespace metadata interpretation. Instead of that dubious route
> >>> [2], address the root problem in the memory-hotplug implementation."
> >>>
> >>> The approach is taken is to observe that each section already maintains
> >>> an array of 'unsigned long' values to hold the pageblock_flags. A single
> >>> additional 'unsigned long' is added to house a 'sub-section active'
> >>> bitmask. Each bit tracks the mapped state of one sub-section's worth of
> >>> capacity which is SECTION_SIZE / BITS_PER_LONG, or 2MB on x86-64.
> >>>
> >>> The implication of allowing sections to be piecemeal mapped/unmapped is
> >>> that the valid_section() helper is no longer authoritative to determine
> >>> if a section is fully mapped. Instead pfn_valid() is updated to consult
> >>> the section-active bitmask. Given that typical memory hotplug still has
> >>> deep "section" dependencies the sub-section capability is limited to
> >>> 'want_memblock=false' invocations of arch_add_memory(), effectively only
> >>> devm_memremap_pages() users for now.
> >>
> >> Ah, there it is. And my point would be, please don't ever unlock
> >> something like that for want_memblock=true. Especially not for memory
> >> added after boot via device drivers (add_memory()).
> >
> > I don't see a strong reason why not, as long as it does not regress
> > existing use cases. It might need to be an opt-in for new tooling that
> > is aware of finer granularity hotplug. That said, I have no pressing
> > need to go there and just care about the arch_add_memory() capability
> > for now.
>
> Especially onlining/offlining of memory might end up very ugly. And that
> goes hand in hand with memory block devices. They are either online or
> offline, not something in between. (I went that path and Michal
> correctly told me why it is not a good idea)
Thread reference?
> I was recently trying to teach memory block devices who their owner is /
> of which type they are. Right now I am looking into the option of using
> drivers. Memory block devices that could belong to different drivers at
> a time are well ... totally broken.
Sub-section support is aimed at a similar case where different
portions of a 128MB span need to be handed out to devices / drivers with
independent lifetimes.
> I assume it would still be a special
> case, though, but conceptually speaking about the interface it would be
> allowed.
>
> Memory block devices (and therefore 1..X sections) should have one owner
> only. Anything else just does not fit.
Yes, but I would say the problem there is that the
memory-block-devices interface design is showing its age and is being
pressured by how systems want to deploy and use memory today.
>>>> Reason I am asking is because I wonder how that would interact with the
>>>> memory block device infrastructure and hotplugging of system ram -
>>>> add_memory()/add_memory_resource(). I *assume* you are not changing the
>>>> add_memory() interface, so that one still only works with whole sections
>>>> (or well, memory_block_size_bytes()) - check_hotplug_memory_range().
>>>
>>> Like you found below, the implementation enforces that add_memory_*()
>>> interfaces maintain section alignment for @start and @size.
>>>
>>>> In general, mix and matching system RAM and persistent memory per
>>>> section, I am not a friend of that.
>>>
>>> You have no choice. The platform may decide to map PMEM and System RAM
>>> in the same section because the Linux section is too large compared to
>>> typical memory controller mapping granularity capability.
>>
>> I might be very wrong here, but do we actually care about something like
>> 64MB getting lost in the cracks? I mean if it simplifies core MM, let go
>> of the couple of MB of system ram and handle the PMEM part only. Treat
>> the system ram parts like memory holes we already have in ordinary
>> sections (well, there we simply set the relevant struct pages to
>> PG_reserved). Of course, if we have hundreds of unaligned devices and
>> stuff will start to add up ... but I assume this is not the case?
>
> That's precisely what we do today and it has become untenable as the
> collision scenarios pile up. This thread [1] is worth a read if you
> care about some of the gory details why I'm back to pushing for
> sub-section support, but most if it has already been summarized in the
> current discussion on this thread.
Thanks, exactly what I am interested in, will have a look!
>>>
>>> I don't see a strong reason why not, as long as it does not regress
>>> existing use cases. It might need to be an opt-in for new tooling that
>>> is aware of finer granularity hotplug. That said, I have no pressing
>>> need to go there and just care about the arch_add_memory() capability
>>> for now.
>>
>> Especially onlining/offlining of memory might end up very ugly. And that
>> goes hand in hand with memory block devices. They are either online or
>> offline, not something in between. (I went that path and Michal
>> correctly told me why it is not a good idea)
>
> Thread reference?
Sure:
https://marc.info/?l=linux-mm&m=152362539714432&w=2
Onlining/offlining subsections was what I tried (while still
adding/removing whole sections). But with the memory block device model
(online/offline memory blocks), this really was in some sense dirty,
although it worked.
>
>> I was recently trying to teach memory block devices who their owner is /
>> of which type they are. Right now I am looking into the option of using
>> drivers. Memory block devices that could belong to different drivers at
>> a time are well ... totally broken.
>
> Sub-section support is aimed at a similar case where different
> portions of an 128MB span need to handed out to devices / drivers with
> independent lifetimes.
Right, but we are stuck here with memory block devices having a certain,
bigger granularity. We already went from 128MB to 2048MB because "there
were too many" (at 2MB, a 1TB host would need roughly half a million memory
block devices, versus 512 at 2GB). Modeling this at the 2MB level (e.g.
subsections), no way. And as I said, multiple users for one memory block
device, very ugly.
What would be interesting is having memory block devices of variable
size. (64MB, 1024GB, 6GB ..), maybe even representing the unit in which
e.g. add_memory() was performed. But it would also have downsides when
it comes to changing the zone of memory blocks. Memory would be
onlined/offlined in way bigger chunks.
E.g. one DIMM = one memory block device.
>
>> I assume it would still be a special
>> case, though, but conceptually speaking about the interface it would be
>> allowed.
>>
>> Memory block devices (and therefore 1..X sections) should have one owner
>> only. Anything else just does not fit.
>
> Yes, but I would say the problem there is that the
> memory-block-devices interface design is showing its age and is being
> pressured with how systems want to deploy and use memory today.
Maybe, I guess the main "issue" started to pop up when different things
(RAM vs. PMEM) started to be mapped into memory side by side. But
it is ABI, and basic kdump would completely break if removed. And of
course memory unplug and much more. It is a crucial part of how system
ram is handled today and might not at all be easy to replace.
--
Thanks,
David / dhildenb
On 27.03.19 01:20, Dan Williams wrote:
> On Tue, Mar 26, 2019 at 1:04 AM Michal Hocko <[email protected]> wrote:
>>
>> On Mon 25-03-19 13:03:47, Dan Williams wrote:
>>> On Mon, Mar 25, 2019 at 3:20 AM Michal Hocko <[email protected]> wrote:
>> [...]
>>>>> User-defined memory namespaces have this problem, but 2MB is the
>>>>> default alignment and is sufficient for most uses.
>>>>
>>>> What does prevent users to go and use a larger alignment?
>>>
>>> Given that we are living with 64MB granularity on mainstream platforms
>>> for the foreseeable future, the reason users can't rely on a larger
>>> alignment to address the issue is that the physical alignment may
>>> change from one boot to the next.
>>
>> I would love to learn more about this inter boot volatility. Could you
>> expand on that some more? I though that the HW configuration presented
>> to the OS would be more or less stable unless the underlying HW changes.
>
> Even if the configuration is static there can be hardware failures
> that prevent a DIMM, or a PCI device to be included in the memory map.
> When that happens the BIOS needs to re-layout the map and the result
> is not guaranteed to maintain the previous alignment.
>
>>> No, you can't just wish hardware / platform firmware won't do this,
>>> because there are not enough platform resources to give every hardware
>>> device a guaranteed alignment.
>>
>> Guarantee is one part and I can see how nobody wants to give you
>> something as strong but how often does that happen in the real life?
>
> I expect a "rare" event to happen everyday in a data-center fleet.
> Failure rates tend towards 100% daily occurrence at scale and in this
> case the kernel has everything it needs to mitigate such an event.
>
> Setting aside the success rate of a software-alignment mitigation, the
> reason I am charging this hill again after a 2 year hiatus is the
> realization that this problem is wider spread than the original
> failing scenario. Back in 2017 the problem seemed limited to custom
> memmap= configurations, and collisions between PMEM and System RAM.
> Now it is clear that the collisions can happen between PMEM regions
> and namespaces as well, and the problem spans platforms from multiple
> vendors. Here is the most recent collision problem:
> https://github.com/pmem/ndctl/issues/76, from a third-party platform.
>
> The fix for that issue uncovered a bug in the padding implementation,
> and a fix for that bug would result in even more hacks in the nvdimm
> code for what is a core kernel deficiency. Code review of those
> changes resulted in changing direction to go after the core
> deficiency.
>
>>> The effect is that even if the driver deploys a software alignment
>>> mitigation when it first sees the persistent memory range, that
>>> alignment can be violated on a subsequent boot leading to data being
>>> unavailable. There is no facility to communicate to the administrator
>>> what went wrong in this scenario as several events can trigger a
>>> physical map layout change. Add / remove of hardware and hardware
>>> failure are the most likely causes.
>>
>> This is indeed bad and unexpected! That is exactly something to have in
>> the chagelog!
>
> Apologies that was indeed included in the 2017 changelog (see: "a user
> could inadvertently lose access to nvdimm namespaces" note here:
> https://lwn.net/Articles/717383/), and I failed to carry it forward.
>
>>
>>> An additional pain point for users is that EFI pre-boot environment
>>> has little chance to create a namespace that Linux might be able to
>>> use. The section size is an arbitrary Linux constraint and we should
>>> not encode something Linux specific that might change in the future
>>> into OS agnostic software.
>>
>> This looks like a fair point but please keep in mind that there hotplug
>> restrictions are on other platforms as well (4MB on Windows IIRC) so
>> there will be some knowledge required all the time. Besides that there
>> are likely to be some restrictions depending on the implementation.
>
> Windows does not have an equivalent constraint, so it's only Linux
> that imposes an arbitrary alignment restriction on pmem to agents like
> EFI.
>
>> [...]
>>>>> Right, as stated in the cover letter, this does not remove all those
>>>>> assumptions, it only removes the ones that impact
>>>>> devm_memremap_pages(). Specifying that sub-section is only supported
>>>>> in the 'want_memblock=false' case to arch_add_memory().
>>>>
>>>> And this is exactly the problem. Having different assumptions depending
>>>> on whether there is a memblock interface or not is utterly wrong and a
>>>> maintainability mess.
>>>
>>> In this case I disagree with you. The hotplug code already has the
>>> want_memblock=false semantic in the implementation.
>>
>> want_memblock was a hack to allow memory hotplug to not have user
>> visible sysfs interface. It was added to reduce the code duplication
>> IIRC. Besides that this hasn't changed the underlying assumptions about
>> hotplugable units or other invariants that were in place.
>
> Neither does this patch series for the typical memory hotplug case.
> For the device-memory use case I've gone through and fixed up the
> underlying assumptions.
>
>>> The sub-section
>>> hotplug infrastructure is a strict superset of what is there already.
>>> Now, if it created parallel infrastructure that would indeed be a
>>> maintainability burden, but in this case there are no behavior changes
>>> for typical memory hotplug as it just hotplugs full sections at a time
>>> like always. The 'section' concept is not going away.
>>
>> You are really neglecting many details here. E.g. memory section can be
>> shared between two different types of memory. We've had some bugs in the
>> hotplug code when one section can be shared between two different NUMA
>> nodes (e.g. 4aa9fc2a435a ("Revert "mm, memory_hotplug: initialize struct
>> pages for the full memory section""). We do not allow to hotremove such
>> sections because it would open another can of worms. I am not saying
>> your implementation is incorrect - still haven't time to look deeply -
>> but stating that this is a strict superset of want_memblock is simply
>> wrong.
>
> Please have a look at the code and the handling of "early" sections.
> The assertion that I neglected to consider that detail is not true.
>
> My "superset" contention is from the arch_add_memory() api
> perspective. All typical memory hotplug use cases are a sub-case of
> the new support.
>
>> [...]
>>>> Why do we have to go a mile to tweak the kernel, especially something as
>>>> fragile as memory hotplug, just to support sub mem section ranges. This
>>>> is somthing that is not clearly explained in the cover letter. Sure you
>>>> are talking about hacks at the higher level to deal with this but I do
>>>> not see any fundamental reason to actually support that at all.
>>>
>>> Like it or not, 'struct page' mappings for arbitrary hardware-physical
>>> memory ranges is a facility that has grown from the pmem case, to hmm,
>>> and peer-to-peer DMA. Unless you want to do the work to eliminate the
>>> 'struct page' requirement across the kernel I think it is unreasonable
>>> to effectively archive the arch_add_memory() implementation and
>>> prevent it from reacting to growing demands.
>>
>> I am definitely not blocking memory hotplug to be reused more! All I am
>> saying is that there is much more ground work to be done before you can
>> add features like that. There are some general assumptions in the code,
>> like it or not, and you should start by removing those to build on top.
>
> Let's talk about specifics please, because I don't think you've had a
> chance to consider the details in the patches. Your "start by removing
> those [assumptions] to build on top" request is indeed what the
> preparation patches in this series aim to achieve.
>
> The general assumptions of the current (pre-patch-series) implementation are:
>
> - Sections that describe boot memory (early sections) are never
> unplugged / removed.
I'm not sure if this is completely true, and it also recently popped up
while discussing some work Oscar is doing ("[PATCH 0/4]
mm,memory_hotplug: allocate memmap from hotadded memory").
We have powernv (arch/powerpc/platforms/powernv/memtrace.c), which will
offline + remove memory from the system that it didn't originally add.
As far as I understand, this can easily be boot memory. Not sure if
there is anything blocking this code from removing boot memory.
Also, ACPI memory hotplug (drivers/acpi/acpi_memhotplug.c) seems to have
a case where memory provided by a DIMM is already used by the kernel (I
assume this means it was detected and added during boot). This memory
can theoretically be removed. I have yet to figure out how that special
case fits into the big picture.
>
> - pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y, case devolves to a
> valid_section() check
>
> - __add_pages() and helper routines assume all operations occur in
> PAGES_PER_SECTION units.
>
> - the memblock sysfs interface only comprehends full sections
>
> Those assumptions are removed / handled with the following
> implementation details respectively:
>
> - Partially populated early sections can be extended with additional
> sub-sections, and those sub-sections can be removed with
> arch_remove_memory(). With this in place we no longer lose usable
> memory capacity to padding.
>
> - pfn_valid() goes beyond valid_section() to also check the
> active-sub-section mask (see the sketch below). As stated before, this
> indication is in the same cacheline as the valid_section() data, so the
> performance impact is expected to be negligible. So far the lkp robot
> has not reported any regressions.
>
> - Outside of the core vmemmap population routines which are replaced,
> other helper routines like shrink_{zone,pgdat}_span() are updated to
> handle the smaller granularity. Core memory hotplug routines that deal
> with online memory are not updated. That's a feature, not a bug, until
> we decide that sub-section hotplug makes sense for online / typical
> memory as well.
>
> - the existing memblock sysfs user api guarantees / assumptions are
> not touched since this capability is limited to !online
> !sysfs-accessible sections for now.
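For illustration only - this is not code from the patches, and every
name and constant below is a hypothetical stand-in - here is a
stand-alone C model of the pfn_valid() point above: a per-section
'unsigned long' bitmap with one bit per 2MB sub-section, consulted in
addition to a section-present flag.

/*
 * Stand-alone model of the sub-section bitmap idea; the constants mirror
 * x86-64 (128MB sections, 4K pages), but every name here is hypothetical.
 */
#include <stdbool.h>
#include <stdio.h>

#define PAGE_SHIFT              12
#define SECTION_SIZE_BITS       27      /* 128MB sections */
#define PFN_SECTION_SHIFT       (SECTION_SIZE_BITS - PAGE_SHIFT)
#define PAGES_PER_SECTION       (1UL << PFN_SECTION_SHIFT)
#define SUBSECTIONS_PER_SECTION (sizeof(unsigned long) * 8)
#define PAGES_PER_SUBSECTION    (PAGES_PER_SECTION / SUBSECTIONS_PER_SECTION)

struct model_section {
	bool present;			/* stand-in for valid_section() */
	unsigned long subsection_map;	/* one bit per 2MB sub-section */
};

/* pfn_valid()-style check: section present *and* sub-section bit set. */
static bool model_pfn_valid(const struct model_section *sec, unsigned long pfn)
{
	unsigned long idx;

	if (!sec->present)
		return false;
	idx = (pfn & (PAGES_PER_SECTION - 1)) / PAGES_PER_SUBSECTION;
	return sec->subsection_map & (1UL << idx);
}

int main(void)
{
	/* Only the first two sub-sections (4MB) of the section are mapped. */
	struct model_section sec = { .present = true, .subsection_map = 0x3 };

	printf("%d\n", model_pfn_valid(&sec, 0));	/* 1: sub-section 0 */
	printf("%d\n", model_pfn_valid(&sec, 512));	/* 1: sub-section 1 */
	printf("%d\n", model_pfn_valid(&sec, 1024));	/* 0: sub-section 2 */
	return 0;
}

Built as an ordinary userspace program this prints 1/1/0, mirroring a
section where only the first 4MB worth of sub-sections is mapped while
the rest of the 128MB section is not.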
So to expand on that, the main difference between RAM hotplug and device
memory hotplug is that
- Memory has to be onlined/offlined. Sections are marked as being either
online or offline. Not relevant for device memory. Onlining/offlining
implies working on the buddy / core MM.
- Memory is exposed and managed via the memblock sysfs API. memblocks are
multiples of sections. The RAM hotplug granularity really is the size of
memblocks. E.g. kdump uses memblock sysfs events to reload when new
memory is added/onlined. Onlining controlled by userspace operates on
memblocks as they are added. Other users heavily use the memblock API.
So I think the hotplug granularity of RAM really is memblocks (actually
sections). Changing that might be very complicated, would break APIs and
has questionable benefit.
I am starting to wonder if RAM (memdev) really is the special case and
what you are proposing is the right thing to do for everything that
- doesn't use the memdev sysfs interface
- doesn't require memory (sections) to be onlined
So it boils down to memblock=true being the special case. We would have
to make sure that memblock=true cannot be mixed with memblock=false on
the same sections (or even memory blocks).
(not having had a detailed look at the patches yet) Michal, what do you
think?
--
Thanks,
David / dhildenb
On 22.03.19 17:57, Dan Williams wrote:
> [...]
I started to explore the wonderful world of system RAM memory hotplug
(memory block devices) and it is full of issues. Too many to name them
all, but two examples are memory block devices that span several nodes
(such memory can only be added during boot; mem->nid would be completely
misleading) or that we assume struct pages have been initialized when
removing memory, while they really haven't.
It is already a mess that we have multiple sections per memory block
device (that was never properly cleaned up, and I think I spotted
several issues). I don't like going in the direction of sub-sections for
memory block devices; it would only add to that mess.
Memory block devices are an important concept for memory hotplug/unplug.
This is the granularity at which memory gets onlined/offlined by user
space. I don't see this interface going away. On the other hand, memory
block devices only make sense for memory that is onlined/offlined in
such chunks, i.e. system RAM. So ZONE_DEVICE memory doesn't run into
that restriction.
I think we should restrict adding/removing system RAM via
online_pages()/offline_pages()/add_memory()/remove_memory() to
- memory block device granularity (already mostly checked)
- single zones (already mostly checked)
- single nodes (basically not checked as far as I can see)
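Purely as an illustration of the kind of restriction listed above (the
helper names and numbers are hypothetical, not from the kernel), two of
those checks - memory block granularity and single node - could be
modeled as follows; the single-zone check would be analogous.

/*
 * Stand-alone model of the proposed restriction; none of these names or
 * numbers come from the kernel, they only illustrate the policy.
 */
#include <stdbool.h>
#include <stdio.h>

#define BLOCK_PAGES	32768UL		/* e.g. one 128MB block, 4K pages */

/* Toy topology: pretend the first 4GB of pfns belong to node 0. */
static int model_pfn_to_nid(unsigned long pfn)
{
	return pfn < (1UL << 20) ? 0 : 1;
}

/* Allow offline/remove only for block-aligned, single-node ranges. */
static bool model_range_removable(unsigned long start_pfn,
				  unsigned long nr_pages)
{
	int nid = model_pfn_to_nid(start_pfn);
	unsigned long pfn;

	if (start_pfn % BLOCK_PAGES || nr_pages % BLOCK_PAGES)
		return false;		/* not memory-block granularity */

	for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn += BLOCK_PAGES)
		if (model_pfn_to_nid(pfn) != nid)
			return false;	/* range spans multiple nodes */
	return true;
}

int main(void)
{
	printf("%d\n", model_range_removable(0, BLOCK_PAGES));	  /* 1 */
	printf("%d\n", model_range_removable(100, BLOCK_PAGES));  /* 0: unaligned */
	printf("%d\n", model_range_removable((1UL << 20) - BLOCK_PAGES,
					     2 * BLOCK_PAGES));	  /* 0: two nodes */
	return 0;
}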
Cleaning this mess up might take some time. Essentially, all special
handling related to memory block devices should be factored out from
arch_add_memory()/arch_remove_memory() to add_memory()/remove_memory().
I started looking into that. __add_pages() doesn't properly revert what
it has already done when it fails.
I don't have a strong opinion against adding sub-section memory hotadd
as long as we don't use it for memory block device hotplug. Meaning,
use it internally, but don't use it along with memory block device hotplug.
As add_memory() only works at memory block device granularity, memory
block devices are not an issue for something like that. The real issue is
memory added during boot that e.g. has holes or overlaps with pmem and
friends. Trying to offline/online/remove such memory should be
completely blocked.
To easily detect multiple nodes per memory block device, I am thinking
about making mem->nid indicate that, e.g. nid == -1 for uninitialized,
nid == -2 for mixed nodes; don't allow offlining/removing such memory.
Might require some refactoring.
The question is how we could easily detect
- memory block devices with some sub-sections missing. Offlining code
forbids this right now as the holes are marked as PG_reserved.
- memory block devices with some sub-sections being ZONE_DEVICE memory
like pmem.
Both cases could only happen when memory was added during boot.
Offlining/removing such memory has to be forbidden.
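Again purely illustrative (nothing below is kernel code; the states and
names are hypothetical), that detection question could be modeled as
tracking a per-sub-section state within each memory block and refusing
to offline/remove a block unless every sub-section is present system
RAM:

/*
 * Stand-alone model of the detection question above; the states and
 * names are hypothetical, nothing here is taken from the kernel.
 */
#include <stdbool.h>
#include <stdio.h>

#define SUBSECTIONS_PER_BLOCK	64

enum model_subsec_state {
	SUBSEC_ABSENT,		/* hole: sub-section never mapped */
	SUBSEC_SYSTEM_RAM,	/* ordinary RAM, managed by the buddy */
	SUBSEC_ZONE_DEVICE,	/* pmem and friends */
};

struct model_memory_block {
	enum model_subsec_state subsec[SUBSECTIONS_PER_BLOCK];
};

/*
 * A block would only be offlinable/removable if every sub-section is
 * present and none of them is device memory.
 */
static bool model_block_removable(const struct model_memory_block *blk)
{
	for (int i = 0; i < SUBSECTIONS_PER_BLOCK; i++)
		if (blk->subsec[i] != SUBSEC_SYSTEM_RAM)
			return false;
	return true;
}

int main(void)
{
	struct model_memory_block blk;

	for (int i = 0; i < SUBSECTIONS_PER_BLOCK; i++)
		blk.subsec[i] = SUBSEC_SYSTEM_RAM;
	printf("%d\n", model_block_removable(&blk));	/* 1 */

	blk.subsec[10] = SUBSEC_ZONE_DEVICE;	/* pmem intersects the block */
	printf("%d\n", model_block_removable(&blk));	/* 0 */
	return 0;
}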
--
Thanks,
David / dhildenb