2015-06-27 02:21:21

by Xishi Qiu

[permalink] [raw]
Subject: [RFC v2 PATCH 0/8] mm: mirrored memory support for page buddy allocations

Intel Xeon processor E7 v3 product family-based platforms introduces support
for partial memory mirroring called as 'Address Range Mirroring'. This feature
allows BIOS to specify a subset of total available memory to be mirrored (and
optionally also specify whether to mirror the range 0-4 GB). This capability
allows user to make an appropriate tradeoff between non-mirrored memory range
and mirrored memory range thus optimizing total available memory and still
achieving highly reliable memory range for mission critical workloads and/or
kernel space.

Tony has already send a patchset to supprot this feature at boot time.
https://lkml.org/lkml/2015/5/8/521
This patchset is based on Tony's, it can support the feature after boot time.
Use mirrored memory for all kernel allocations.

TBD:
- Add compatibility with memory online/offline, memory compaction, CMA...
- Need to discuss the implementation ideas, add a new zone or a new
migratetype or others.

V2:
- Use memblock which marked MEMBLOCK_MIRROR to find mirrored memory instead
of mirror_info.
- Remove __GFP_MIRROR and /proc/sys/vm/mirrorable.
- Use mirrored memory for all kernel allocations.


Xishi Qiu (8):
mm: add a new config to manage the code
mm: introduce MIGRATE_MIRROR to manage the mirrored pages
mm: find mirrored memory in memblock
mm: add mirrored memory to buddy system
mm: introduce a new zone_stat_item NR_FREE_MIRROR_PAGES
mm: add free mirrored pages info
mm: add the buddy system interface
mm: add the PCP interface

drivers/base/node.c | 17 ++++---
fs/proc/meminfo.c | 6 +++
include/linux/memblock.h | 29 ++++++++++--
include/linux/mmzone.h | 10 ++++
include/linux/vmstat.h | 2 +
mm/Kconfig | 8 ++++
mm/memblock.c | 33 +++++++++++--
mm/nobootmem.c | 3 ++
mm/page_alloc.c | 117 ++++++++++++++++++++++++++++++++++++-----------
mm/vmstat.c | 4 ++
10 files changed, 190 insertions(+), 39 deletions(-)

--
2.0.0


2015-06-27 02:23:53

by Xishi Qiu

[permalink] [raw]
Subject: [RFC v2 PATCH 1/8] mm: add a new config to manage the code

This patch introduces a new config called "CONFIG_ACPI_MIRROR_MEMORY", set it
off by default.

Signed-off-by: Xishi Qiu <[email protected]>
---
mm/Kconfig | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 390214d..c40bb8b 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -200,6 +200,14 @@ config MEMORY_HOTREMOVE
depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
depends on MIGRATION

+config MEMORY_MIRROR
+ bool "Address range mirroring support"
+ depends on X86 && MEMORY_FAILURE
+ default n
+ help
+ This feature depends on hardware and firmware support.
+ ACPI or EFI records the mirror info.
+
#
# If we have space for more page flags then we can enable additional
# optimizations and functionality.
--
2.0.0

2015-06-27 02:24:45

by Xishi Qiu

[permalink] [raw]
Subject: [RFC v2 PATCH 2/8] mm: introduce MIGRATE_MIRROR to manage the mirrored pages

This patch introduces a new migratetype called "MIGRATE_MIRROR", it is used to
allocate mirrored pages.
When cat /proc/pagetypeinfo, you can see the count of free mirrored blocks.

Signed-off-by: Xishi Qiu <[email protected]>
---
include/linux/mmzone.h | 9 +++++++++
mm/page_alloc.c | 3 +++
mm/vmstat.c | 3 +++
3 files changed, 15 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 54d74f6..54e891a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -39,6 +39,9 @@ enum {
MIGRATE_UNMOVABLE,
MIGRATE_RECLAIMABLE,
MIGRATE_MOVABLE,
+#ifdef CONFIG_MEMORY_MIRROR
+ MIGRATE_MIRROR,
+#endif
MIGRATE_PCPTYPES, /* the number of types on the pcp lists */
MIGRATE_RESERVE = MIGRATE_PCPTYPES,
#ifdef CONFIG_CMA
@@ -69,6 +72,12 @@ enum {
# define is_migrate_cma(migratetype) false
#endif

+#ifdef CONFIG_MEMORY_MIRROR
+# define is_migrate_mirror(migratetype) unlikely((migratetype) == MIGRATE_MIRROR)
+#else
+# define is_migrate_mirror(migratetype) false
+#endif
+
#define for_each_migratetype_order(order, type) \
for (order = 0; order < MAX_ORDER; order++) \
for (type = 0; type < MIGRATE_TYPES; type++)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ebffa0e..6e4d79f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3216,6 +3216,9 @@ static void show_migration_types(unsigned char type)
[MIGRATE_UNMOVABLE] = 'U',
[MIGRATE_RECLAIMABLE] = 'E',
[MIGRATE_MOVABLE] = 'M',
+#ifdef CONFIG_MEMORY_MIRROR
+ [MIGRATE_MIRROR] = 'O',
+#endif
[MIGRATE_RESERVE] = 'R',
#ifdef CONFIG_CMA
[MIGRATE_CMA] = 'C',
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4f5cd97..d0323e0 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -901,6 +901,9 @@ static char * const migratetype_names[MIGRATE_TYPES] = {
"Unmovable",
"Reclaimable",
"Movable",
+#ifdef CONFIG_MEMORY_MIRROR
+ "Mirror",
+#endif
"Reserve",
#ifdef CONFIG_CMA
"CMA",
--
2.0.0

2015-06-27 02:25:21

by Xishi Qiu

[permalink] [raw]
Subject: [RFC v2 PATCH 3/8] mm: find mirrored memory in memblock

Add a macro for_each_mirror_pfn_range() to find mirrored memory in memblock.
This patch is based on Tony's patchset "Find mirrored memory, use for boot time
allocations"

Signed-off-by: Xishi Qiu <[email protected]>
---
include/linux/memblock.h | 25 ++++++++++++++++++++++---
mm/memblock.c | 6 +++++-
2 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 0215ffd..97f71ca 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -171,7 +171,8 @@ static inline bool memblock_is_mirror(struct memblock_region *m)
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
unsigned long *end_pfn);
-void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
+void __next_mem_pfn_range(int *idx, int nid, ulong flags,
+ unsigned long *out_start_pfn,
unsigned long *out_end_pfn, int *out_nid);

/**
@@ -185,8 +186,26 @@ void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
* Walks over configured memory ranges.
*/
#define for_each_mem_pfn_range(i, nid, p_start, p_end, p_nid) \
- for (i = -1, __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid); \
- i >= 0; __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid))
+ for (i = -1, __next_mem_pfn_range(&i, nid, MEMBLOCK_NONE, \
+ p_start, p_end, p_nid); \
+ i >= 0; __next_mem_pfn_range(&i, nid, MEMBLOCK_NONE, \
+ p_start, p_end, p_nid))
+
+/**
+ * for_each_mirror_pfn_range - early mirrored memory pfn range iterator
+ * @i: an integer used as loop variable
+ * @nid: node selector, %MAX_NUMNODES for all nodes
+ * @p_start: ptr to ulong for start pfn of the range, can be %NULL
+ * @p_end: ptr to ulong for end pfn of the range, can be %NULL
+ * @p_nid: ptr to int for nid of the range, can be %NULL
+ *
+ * Walks over configured mirrored memory ranges.
+ */
+#define for_each_mirror_pfn_range(i, nid, p_start, p_end, p_nid) \
+ for (i = -1, __next_mem_pfn_range(&i, nid, MEMBLOCK_MIRROR, \
+ p_start, p_end, p_nid); \
+ i >= 0; __next_mem_pfn_range(&i, nid, MEMBLOCK_MIRROR, \
+ p_start, p_end, p_nid))
#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */

/**
diff --git a/mm/memblock.c b/mm/memblock.c
index 1b444c7..7612876 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1040,7 +1040,7 @@ void __init_memblock __next_mem_range_rev(u64 *idx, int nid, ulong flags,
/*
* Common iterator interface used to define for_each_mem_range().
*/
-void __init_memblock __next_mem_pfn_range(int *idx, int nid,
+void __init_memblock __next_mem_pfn_range(int *idx, int nid, ulong flags,
unsigned long *out_start_pfn,
unsigned long *out_end_pfn, int *out_nid)
{
@@ -1050,6 +1050,10 @@ void __init_memblock __next_mem_pfn_range(int *idx, int nid,
while (++*idx < type->cnt) {
r = &type->regions[*idx];

+ /* if we want mirror memory skip non-mirror memory regions */
+ if ((flags & MEMBLOCK_MIRROR) && !memblock_is_mirror(r))
+ continue;
+
if (PFN_UP(r->base) >= PFN_DOWN(r->base + r->size))
continue;
if (nid == MAX_NUMNODES || nid == r->nid)
--
2.0.0

2015-06-27 02:26:09

by Xishi Qiu

[permalink] [raw]
Subject: [RFC v2 PATCH 4/8] mm: add mirrored memory to buddy system

Before free bootmem, set mirrored pageblock's migratetype to MIGRATE_MIRROR, so
they could free to buddy system's MIGRATE_MIRROR list.
When set reserved memory, skip the mirrored memory.

Signed-off-by: Xishi Qiu <[email protected]>
---
include/linux/memblock.h | 3 +++
mm/memblock.c | 21 +++++++++++++++++++++
mm/nobootmem.c | 3 +++
mm/page_alloc.c | 3 +++
4 files changed, 30 insertions(+)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 97f71ca..53be030 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -81,6 +81,9 @@ int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
ulong choose_memblock_flags(void);
+#ifdef CONFIG_MEMORY_MIRROR
+void memblock_mark_migratemirror(void);
+#endif

/* Low level functions */
int memblock_add_range(struct memblock_type *type,
diff --git a/mm/memblock.c b/mm/memblock.c
index 7612876..0d0b210 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -19,6 +19,7 @@
#include <linux/debugfs.h>
#include <linux/seq_file.h>
#include <linux/memblock.h>
+#include <linux/page-isolation.h>

#include <asm-generic/sections.h>
#include <linux/io.h>
@@ -818,6 +819,26 @@ int __init_memblock memblock_mark_mirror(phys_addr_t base, phys_addr_t size)
return memblock_setclr_flag(base, size, 1, MEMBLOCK_MIRROR);
}

+#ifdef CONFIG_MEMORY_MIRROR
+void __init_memblock memblock_mark_migratemirror(void)
+{
+ unsigned long start_pfn, end_pfn, pfn;
+ int i, node;
+ struct page *page;
+
+ printk(KERN_DEBUG "Mirrored memory:\n");
+ for_each_mirror_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn,
+ &node) {
+ printk(KERN_DEBUG " node %3d: [mem %#010llx-%#010llx]\n",
+ node, PFN_PHYS(start_pfn), PFN_PHYS(end_pfn) - 1);
+ for (pfn = start_pfn; pfn < end_pfn;
+ pfn += pageblock_nr_pages) {
+ page = pfn_to_page(pfn);
+ set_pageblock_migratetype(page, MIGRATE_MIRROR);
+ }
+ }
+}
+#endif

/**
* __next__mem_range - next function for for_each_free_mem_range() etc.
diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index 5258386..31aa6d4 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -129,6 +129,9 @@ static unsigned long __init free_low_memory_core_early(void)
u64 i;

memblock_clear_hotplug(0, -1);
+#ifdef CONFIG_MEMORY_MIRROR
+ memblock_mark_migratemirror();
+#endif

for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end,
NULL)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e4d79f..aea78a5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4118,6 +4118,9 @@ static void setup_zone_migrate_reserve(struct zone *zone)

block_migratetype = get_pageblock_migratetype(page);

+ if (is_migrate_mirror(block_migratetype))
+ continue;
+
/* Only test what is necessary when the reserves are not met */
if (reserve > 0) {
/*
--
2.0.0

2015-06-27 02:26:43

by Xishi Qiu

[permalink] [raw]
Subject: [RFC v2 PATCH 5/8] mm: introduce a new zone_stat_item NR_FREE_MIRROR_PAGES

This patch introduces a new zone_stat_item called "NR_FREE_MIRROR_PAGES", it is
used to storage free mirrored pages count.

Signed-off-by: Xishi Qiu <[email protected]>
---
include/linux/mmzone.h | 1 +
include/linux/vmstat.h | 2 ++
mm/vmstat.c | 1 +
3 files changed, 4 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 54e891a..7cc0a29 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -166,6 +166,7 @@ enum zone_stat_item {
WORKINGSET_NODERECLAIM,
NR_ANON_TRANSPARENT_HUGEPAGES,
NR_FREE_CMA_PAGES,
+ NR_FREE_MIRROR_PAGES,
NR_VM_ZONE_STAT_ITEMS };

/*
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 82e7db7..d0a7268 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -283,6 +283,8 @@ static inline void __mod_zone_freepage_state(struct zone *zone, int nr_pages,
__mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);
if (is_migrate_cma(migratetype))
__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
+ if (is_migrate_mirror(migratetype))
+ __mod_zone_page_state(zone, NR_FREE_MIRROR_PAGES, nr_pages);
}

extern const char * const vmstat_text[];
diff --git a/mm/vmstat.c b/mm/vmstat.c
index d0323e0..7ee11ca 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -739,6 +739,7 @@ const char * const vmstat_text[] = {
"workingset_nodereclaim",
"nr_anon_transparent_hugepages",
"nr_free_cma",
+ "nr_free_mirror",

/* enum writeback_stat_item counters */
"nr_dirty_threshold",
--
2.0.0

2015-06-27 02:30:36

by Xishi Qiu

[permalink] [raw]
Subject: [RFC v2 PATCH 6/8] mm: add free mirrored pages info

Add the count of free mirrored pages in the following paths:
/proc/meminfo
/proc/zoneinfo
/sys/devices/system/node/node XX/meminfo
/sys/devices/system/node/node XX/vmstat

Signed-off-by: Xishi Qiu <[email protected]>
---
drivers/base/node.c | 17 +++++++++++------
fs/proc/meminfo.c | 6 ++++++
mm/page_alloc.c | 7 +++++--
3 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index a2aa65b..d1a3556 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -114,6 +114,9 @@ static ssize_t node_read_meminfo(struct device *dev,
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
"Node %d AnonHugePages: %8lu kB\n"
#endif
+#ifdef CONFIG_MEMORY_MIRROR
+ "Node %d MirrorFree: %8lu kB\n"
+#endif
,
nid, K(node_page_state(nid, NR_FILE_DIRTY)),
nid, K(node_page_state(nid, NR_WRITEBACK)),
@@ -130,14 +133,16 @@ static ssize_t node_read_meminfo(struct device *dev,
nid, K(node_page_state(nid, NR_SLAB_RECLAIMABLE) +
node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
nid, K(node_page_state(nid, NR_SLAB_RECLAIMABLE)),
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE))
- , nid,
- K(node_page_state(nid, NR_ANON_TRANSPARENT_HUGEPAGES) *
- HPAGE_PMD_NR));
-#else
- nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE)));
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ , nid, K(node_page_state(nid, NR_ANON_TRANSPARENT_HUGEPAGES) *
+ HPAGE_PMD_NR)
+#endif
+#ifdef CONFIG_MEMORY_MIRROR
+ , nid, K(node_page_state(nid, NR_FREE_MIRROR_PAGES))
#endif
+ );
+
n += hugetlb_report_node_meminfo(nid, buf + n);
return n;
}
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index d3ebf2e..d1ebb20 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -145,6 +145,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
"CmaTotal: %8lu kB\n"
"CmaFree: %8lu kB\n"
#endif
+#ifdef CONFIG_MEMORY_MIRROR
+ "MirrorFree: %8lu kB\n"
+#endif
,
K(i.totalram),
K(i.freeram),
@@ -204,6 +207,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
, K(totalcma_pages)
, K(global_page_state(NR_FREE_CMA_PAGES))
#endif
+#ifdef CONFIG_MEMORY_MIRROR
+ , K(global_page_state(NR_FREE_MIRROR_PAGES))
+#endif
);

hugetlb_report_meminfo(m);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index aea78a5..4c5bc50 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3268,7 +3268,7 @@ void show_free_areas(unsigned int filter)
" unevictable:%lu dirty:%lu writeback:%lu unstable:%lu\n"
" slab_reclaimable:%lu slab_unreclaimable:%lu\n"
" mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n"
- " free:%lu free_pcp:%lu free_cma:%lu\n",
+ " free:%lu free_pcp:%lu free_cma:%lu free_mirror:%lu\n",
global_page_state(NR_ACTIVE_ANON),
global_page_state(NR_INACTIVE_ANON),
global_page_state(NR_ISOLATED_ANON),
@@ -3287,7 +3287,8 @@ void show_free_areas(unsigned int filter)
global_page_state(NR_BOUNCE),
global_page_state(NR_FREE_PAGES),
free_pcp,
- global_page_state(NR_FREE_CMA_PAGES));
+ global_page_state(NR_FREE_CMA_PAGES),
+ global_page_state(NR_FREE_MIRROR_PAGES));

for_each_populated_zone(zone) {
int i;
@@ -3328,6 +3329,7 @@ void show_free_areas(unsigned int filter)
" free_pcp:%lukB"
" local_pcp:%ukB"
" free_cma:%lukB"
+ " free_mirror:%lukB"
" writeback_tmp:%lukB"
" pages_scanned:%lu"
" all_unreclaimable? %s"
@@ -3361,6 +3363,7 @@ void show_free_areas(unsigned int filter)
K(free_pcp),
K(this_cpu_read(zone->pageset->pcp.count)),
K(zone_page_state(zone, NR_FREE_CMA_PAGES)),
+ K(zone_page_state(zone, NR_FREE_MIRROR_PAGES)),
K(zone_page_state(zone, NR_WRITEBACK_TEMP)),
K(zone_page_state(zone, NR_PAGES_SCANNED)),
(!zone_reclaimable(zone) ? "yes" : "no")
--
2.0.0

2015-06-27 02:31:35

by Xishi Qiu

[permalink] [raw]
Subject: [RFC v2 PATCH 7/8] mm: add the buddy system interface

Add the buddy system interface for address range mirroring feature.
Use mirrored memory for all kernel allocations. If there is no mirrored pages
left, try to use other types pages.

Signed-off-by: Xishi Qiu <[email protected]>
---
include/linux/memblock.h | 1 +
mm/memblock.c | 6 +++---
mm/page_alloc.c | 19 +++++++++++++++++++
3 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 53be030..8c33ac0 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -81,6 +81,7 @@ int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
ulong choose_memblock_flags(void);
+extern struct static_key system_has_mirror;
#ifdef CONFIG_MEMORY_MIRROR
void memblock_mark_migratemirror(void);
#endif
diff --git a/mm/memblock.c b/mm/memblock.c
index 0d0b210..430ad87 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -55,14 +55,14 @@ int memblock_debug __initdata_memblock;
#ifdef CONFIG_MOVABLE_NODE
bool movable_node_enabled __initdata_memblock = false;
#endif
-static bool system_has_some_mirror __initdata_memblock = false;
+struct static_key system_has_mirror = STATIC_KEY_INIT;
static int memblock_can_resize __initdata_memblock;
static int memblock_memory_in_slab __initdata_memblock = 0;
static int memblock_reserved_in_slab __initdata_memblock = 0;

ulong __init_memblock choose_memblock_flags(void)
{
- return system_has_some_mirror ? MEMBLOCK_MIRROR : MEMBLOCK_NONE;
+ return static_key_false(&system_has_mirror) ? MEMBLOCK_MIRROR : MEMBLOCK_NONE;
}

/* inline so we don't get a warning when pr_debug is compiled out */
@@ -814,7 +814,7 @@ int __init_memblock memblock_clear_hotplug(phys_addr_t base, phys_addr_t size)
*/
int __init_memblock memblock_mark_mirror(phys_addr_t base, phys_addr_t size)
{
- system_has_some_mirror = true;
+ static_key_slow_inc(&system_has_mirror);

return memblock_setclr_flag(base, size, 1, MEMBLOCK_MIRROR);
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4c5bc50..8a6125e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1033,6 +1033,9 @@ static int fallbacks[MIGRATE_TYPES][4] = {
[MIGRATE_UNMOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE, MIGRATE_RESERVE },
[MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_RESERVE },
[MIGRATE_MOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_RESERVE },
+#ifdef CONFIG_MEMORY_MIRROR
+ [MIGRATE_MIRROR] = { MIGRATE_RESERVE }, /* Never used */
+#endif
#ifdef CONFIG_CMA
[MIGRATE_CMA] = { MIGRATE_RESERVE }, /* Never used */
#endif
@@ -1295,6 +1298,15 @@ retry_reserve:
page = __rmqueue_smallest(zone, order, migratetype);

if (unlikely(!page) && migratetype != MIGRATE_RESERVE) {
+ /*
+ * If there is no mirrored memory left, alloc other types
+ * memory. But we should not change the pageblock's
+ * migratetype between mirror and others, so just use
+ * MIGRATE_RECLAIMABLE to retry
+ */
+ if (is_migrate_mirror(migratetype))
+ return __rmqueue(zone, order, MIGRATE_RECLAIMABLE);
+
if (migratetype == MIGRATE_MOVABLE)
page = __rmqueue_cma_fallback(zone, order);

@@ -2872,6 +2884,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
if (IS_ENABLED(CONFIG_CMA) && ac.migratetype == MIGRATE_MOVABLE)
alloc_flags |= ALLOC_CMA;

+#ifdef CONFIG_MEMORY_MIRROR
+ /* Alloc mirrored memory for kernel */
+ if (static_key_false(&system_has_mirror)
+ && !(gfp_mask & __GFP_MOVABLE))
+ ac.migratetype = MIGRATE_MIRROR;
+#endif
+
retry_cpuset:
cpuset_mems_cookie = read_mems_allowed_begin();

--
2.0.0

2015-06-27 02:29:03

by Xishi Qiu

[permalink] [raw]
Subject: [RFC v2 PATCH 8/8] mm: add the PCP interface

Abstract the PCP code in __rmqueue_pcp(), and do not call fallback in
rmqueue_bulk() when the migratetype is mirror.

Signed-off-by: Xishi Qiu <[email protected]>
---
mm/page_alloc.c | 85 +++++++++++++++++++++++++++++++++++++++++----------------
1 file changed, 61 insertions(+), 24 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8a6125e..bb44463 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1337,11 +1337,20 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
unsigned long count, struct list_head *list,
int migratetype, bool cold)
{
- int i;
+ int i, mt;
+ struct page *page;

spin_lock(&zone->lock);
for (i = 0; i < count; ++i) {
- struct page *page = __rmqueue(zone, order, migratetype);
+ /*
+ * If there is no mirrored memory left, just keep the list
+ * empty, because we can not mix other types pages into the
+ * mirror list.
+ */
+ if (is_migrate_mirror(migratetype))
+ page = __rmqueue_smallest(zone, order, migratetype);
+ else
+ page = __rmqueue(zone, order, migratetype);
if (unlikely(page == NULL))
break;

@@ -1359,15 +1368,61 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
else
list_add_tail(&page->lru, list);
list = &page->lru;
- if (is_migrate_cma(get_freepage_migratetype(page)))
+
+ mt = get_freepage_migratetype(page);
+ if (is_migrate_cma(mt))
__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
-(1 << order));
+ if (is_migrate_mirror(mt))
+ __mod_zone_page_state(zone, NR_FREE_MIRROR_PAGES,
+ -(1 << order));
}
__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
spin_unlock(&zone->lock);
return i;
}

+static struct page *__rmqueue_pcp(struct zone *zone, unsigned int order,
+ gfp_t gfp_flags, int migratetype)
+{
+ struct page *page;
+ struct per_cpu_pages *pcp;
+ struct list_head *list;
+ bool cold;
+
+ cold = ((gfp_flags & __GFP_COLD) != 0);
+ pcp = &this_cpu_ptr(zone->pageset)->pcp;
+
+retry:
+ list = &pcp->lists[migratetype];
+ if (list_empty(list)) {
+ pcp->count += rmqueue_bulk(zone, 0,
+ pcp->batch, list,
+ migratetype, cold);
+ if (unlikely(list_empty(list))) {
+ /*
+ * If there is no mirrored memory left, alloc other
+ * types PCP, use MIGRATE_RECLAIMABLE to retry
+ */
+ if (is_migrate_mirror(migratetype)) {
+ migratetype = MIGRATE_RECLAIMABLE;
+ goto retry;
+ } else
+ return NULL;
+ }
+ }
+
+ if (cold)
+ page = list_entry(list->prev, struct page, lru);
+ else
+ page = list_entry(list->next, struct page, lru);
+
+ list_del(&page->lru);
+ pcp->count--;
+
+ return page;
+}
+
#ifdef CONFIG_NUMA
/*
* Called from the vmstat counter updater to drain pagesets of this
@@ -1713,30 +1768,12 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
{
unsigned long flags;
struct page *page;
- bool cold = ((gfp_flags & __GFP_COLD) != 0);

if (likely(order == 0)) {
- struct per_cpu_pages *pcp;
- struct list_head *list;
-
local_irq_save(flags);
- pcp = &this_cpu_ptr(zone->pageset)->pcp;
- list = &pcp->lists[migratetype];
- if (list_empty(list)) {
- pcp->count += rmqueue_bulk(zone, 0,
- pcp->batch, list,
- migratetype, cold);
- if (unlikely(list_empty(list)))
- goto failed;
- }
-
- if (cold)
- page = list_entry(list->prev, struct page, lru);
- else
- page = list_entry(list->next, struct page, lru);
-
- list_del(&page->lru);
- pcp->count--;
+ page = __rmqueue_pcp(zone, order, gfp_flags, migratetype);
+ if (!page)
+ goto failed;
} else {
if (unlikely(gfp_flags & __GFP_NOFAIL)) {
/*
--
2.0.0

2015-06-29 06:53:10

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC v2 PATCH 1/8] mm: add a new config to manage the code

On 2015/06/27 11:23, Xishi Qiu wrote:
> This patch introduces a new config called "CONFIG_ACPI_MIRROR_MEMORY", set it
CONFIG_MEMORY_MIRROR
> off by default.
>
> Signed-off-by: Xishi Qiu <[email protected]>
> ---
> mm/Kconfig | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 390214d..c40bb8b 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -200,6 +200,14 @@ config MEMORY_HOTREMOVE
> depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
> depends on MIGRATION
>
> +config MEMORY_MIRROR

In following patches, you use CONFIG_MEMORY_MIRROR.

I think the name is too generic besides it's depends on ACPI.
But I'm not sure address based memory mirror is planned in other platform.

So, hmm. How about dividing the config into 2 parts like attached ? (just an example)

Thanks,
-Kame


Attachments:
0001-add-a-new-config-option-for-memory-mirror.patch (1.87 kB)

2015-06-29 07:34:42

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC v2 PATCH 2/8] mm: introduce MIGRATE_MIRROR to manage the mirrored pages

On 2015/06/27 11:24, Xishi Qiu wrote:
> This patch introduces a new migratetype called "MIGRATE_MIRROR", it is used to
> allocate mirrored pages.
> When cat /proc/pagetypeinfo, you can see the count of free mirrored blocks.
>
> Signed-off-by: Xishi Qiu <[email protected]>

My fear about this approarch is that this may break something existing.

Now, when we add MIGRATE_MIRROR type, we'll hide attributes of pageblocks as
MIGRATE_UNMOVABOLE, MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE.

Logically, MIRROR attribute is independent from page mobility and this overwrites
will make some information lost.

Then,

> ---
> include/linux/mmzone.h | 9 +++++++++
> mm/page_alloc.c | 3 +++
> mm/vmstat.c | 3 +++
> 3 files changed, 15 insertions(+)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 54d74f6..54e891a 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -39,6 +39,9 @@ enum {
> MIGRATE_UNMOVABLE,
> MIGRATE_RECLAIMABLE,
> MIGRATE_MOVABLE,
> +#ifdef CONFIG_MEMORY_MIRROR
> + MIGRATE_MIRROR,
> +#endif

I think
MIGRATE_MIRROR_UNMOVABLE,
MIGRATE_MIRROR_RECLAIMABLE,
MIGRATE_MIRROR_MOVABLE, <== adding this may need discuss.
MIGRATE_MIRROR_RESERVED, <== reserved pages should be maintained per mirrored/unmirrored.

should be added with the following fallback list.

/*
* MIRROR page range is defined by firmware at boot. The range is limited
* and is used only for kernel memory mirroring.
*/
[MIGRATE_UNMOVABLE_MIRROR] = {MIGRATE_RECLAIMABLE_MIRROR, MIGRATE_RESERVE}
[MIGRATE_RECLAIMABLE_MIRROR] = {MIGRATE_UNMOVABLE_MIRROR, MIGRATE_RESERVE}

Then, we'll not lose the original information of "Reclaiable Pages".

One problem here is whteher we should have MIGRATE_RESERVE_MIRROR.

If we never allow users to allocate mirrored memory, we should have MIGRATE_RESERVE_MIRROR.
But it seems to require much more code change to do that.

Creating a zone or adding an attribues to zones are another design choice.

Anyway, your patch doesn't takes care of reserved memory calculation at this point.
Please check setup_zone_migrate_reserve() That will be a problem.

Thanks,
-Kame

> MIGRATE_PCPTYPES, /* the number of types on the pcp lists */
> MIGRATE_RESERVE = MIGRATE_PCPTYPES,
> #ifdef CONFIG_CMA
> @@ -69,6 +72,12 @@ enum {
> # define is_migrate_cma(migratetype) false
> #endif
>
> +#ifdef CONFIG_MEMORY_MIRROR
> +# define is_migrate_mirror(migratetype) unlikely((migratetype) == MIGRATE_MIRROR)
> +#else
> +# define is_migrate_mirror(migratetype) false
> +#endif
> +
> #define for_each_migratetype_order(order, type) \
> for (order = 0; order < MAX_ORDER; order++) \
> for (type = 0; type < MIGRATE_TYPES; type++)
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ebffa0e..6e4d79f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3216,6 +3216,9 @@ static void show_migration_types(unsigned char type)
> [MIGRATE_UNMOVABLE] = 'U',
> [MIGRATE_RECLAIMABLE] = 'E',
> [MIGRATE_MOVABLE] = 'M',
> +#ifdef CONFIG_MEMORY_MIRROR
> + [MIGRATE_MIRROR] = 'O',
> +#endif
> [MIGRATE_RESERVE] = 'R',
> #ifdef CONFIG_CMA
> [MIGRATE_CMA] = 'C',
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 4f5cd97..d0323e0 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -901,6 +901,9 @@ static char * const migratetype_names[MIGRATE_TYPES] = {
> "Unmovable",
> "Reclaimable",
> "Movable",
> +#ifdef CONFIG_MEMORY_MIRROR
> + "Mirror",
> +#endif
> "Reserve",
> #ifdef CONFIG_CMA
> "CMA",
>

2015-06-29 07:40:38

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC v2 PATCH 4/8] mm: add mirrored memory to buddy system

On 2015/06/27 11:25, Xishi Qiu wrote:
> Before free bootmem, set mirrored pageblock's migratetype to MIGRATE_MIRROR, so
> they could free to buddy system's MIGRATE_MIRROR list.
> When set reserved memory, skip the mirrored memory.
>
> Signed-off-by: Xishi Qiu <[email protected]>
> ---
> include/linux/memblock.h | 3 +++
> mm/memblock.c | 21 +++++++++++++++++++++
> mm/nobootmem.c | 3 +++
> mm/page_alloc.c | 3 +++
> 4 files changed, 30 insertions(+)
>
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index 97f71ca..53be030 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -81,6 +81,9 @@ int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
> int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
> int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
> ulong choose_memblock_flags(void);
> +#ifdef CONFIG_MEMORY_MIRROR
> +void memblock_mark_migratemirror(void);
> +#endif
>
> /* Low level functions */
> int memblock_add_range(struct memblock_type *type,
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 7612876..0d0b210 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -19,6 +19,7 @@
> #include <linux/debugfs.h>
> #include <linux/seq_file.h>
> #include <linux/memblock.h>
> +#include <linux/page-isolation.h>
>
> #include <asm-generic/sections.h>
> #include <linux/io.h>
> @@ -818,6 +819,26 @@ int __init_memblock memblock_mark_mirror(phys_addr_t base, phys_addr_t size)
> return memblock_setclr_flag(base, size, 1, MEMBLOCK_MIRROR);
> }
>
> +#ifdef CONFIG_MEMORY_MIRROR
> +void __init_memblock memblock_mark_migratemirror(void)
> +{
> + unsigned long start_pfn, end_pfn, pfn;
> + int i, node;
> + struct page *page;
> +
> + printk(KERN_DEBUG "Mirrored memory:\n");
> + for_each_mirror_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn,
> + &node) {
> + printk(KERN_DEBUG " node %3d: [mem %#010llx-%#010llx]\n",
> + node, PFN_PHYS(start_pfn), PFN_PHYS(end_pfn) - 1);
> + for (pfn = start_pfn; pfn < end_pfn;
> + pfn += pageblock_nr_pages) {
> + page = pfn_to_page(pfn);
> + set_pageblock_migratetype(page, MIGRATE_MIRROR);
> + }
> + }
> +}
> +#endif
>
> /**
> * __next__mem_range - next function for for_each_free_mem_range() etc.
> diff --git a/mm/nobootmem.c b/mm/nobootmem.c
> index 5258386..31aa6d4 100644
> --- a/mm/nobootmem.c
> +++ b/mm/nobootmem.c
> @@ -129,6 +129,9 @@ static unsigned long __init free_low_memory_core_early(void)
> u64 i;
>
> memblock_clear_hotplug(0, -1);
> +#ifdef CONFIG_MEMORY_MIRROR
> + memblock_mark_migratemirror();
> +#endif
>
> for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end,
> NULL)
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6e4d79f..aea78a5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4118,6 +4118,9 @@ static void setup_zone_migrate_reserve(struct zone *zone)
>
> block_migratetype = get_pageblock_migratetype(page);
>
> + if (is_migrate_mirror(block_migratetype))
> + continue;
> +

If mirrored area will not have reserved memory, this should break the page allocator's
logic.

I think both of mirrored and unmirrored range should have reserved area.

Thanks,
-Kame

> /* Only test what is necessary when the reserves are not met */
> if (reserve > 0) {
> /*
>

2015-06-29 15:19:13

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC v2 PATCH 0/8] mm: mirrored memory support for page buddy allocations

On 06/26/2015 07:19 PM, Xishi Qiu wrote:
> drivers/base/node.c | 17 ++++---
> fs/proc/meminfo.c | 6 +++
> include/linux/memblock.h | 29 ++++++++++--
> include/linux/mmzone.h | 10 ++++
> include/linux/vmstat.h | 2 +
> mm/Kconfig | 8 ++++
> mm/memblock.c | 33 +++++++++++--
> mm/nobootmem.c | 3 ++
> mm/page_alloc.c | 117 ++++++++++++++++++++++++++++++++++++-----------
> mm/vmstat.c | 4 ++
> 10 files changed, 190 insertions(+), 39 deletions(-)

Has there been any performance analysis done on this code? I'm always
nervous when I see page_alloc.c churn.

2015-06-29 23:11:37

by Tony Luck

[permalink] [raw]
Subject: RE: [RFC v2 PATCH 7/8] mm: add the buddy system interface

> @@ -814,7 +814,7 @@ int __init_memblock memblock_clear_hotplug(phys_addr_t base, phys_addr_t size)
> */
> int __init_memblock memblock_mark_mirror(phys_addr_t base, phys_addr_t size)
> {
> - system_has_some_mirror = true;
> + static_key_slow_inc(&system_has_mirror);
>
> return memblock_setclr_flag(base, size, 1, MEMBLOCK_MIRROR);
> }

This generates some WARN_ON noise when called from efi_find_mirror():

[ 0.000000] e820: last_pfn = 0x7b800 max_arch_pfn = 0x400000000
[ 0.000000] ------------[ cut here ]------------
[ 0.000000] WARNING: CPU: 0 PID: 0 at kernel/jump_label.c:61 static_key_slow_inc+0x57/0xc0()
[ 0.000000] static_key_slow_inc used before call to jump_label_init
[ 0.000000] Modules linked in:

[ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.1.0 #4
[ 0.000000] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRHSXSD1.86B.0065.R01.1505011640 05/01/2015
[ 0.000000] 0000000000000000 ee366a8dff38f745 ffffffff81997d68 ffffffff816683b4
[ 0.000000] 0000000000000000 ffffffff81997dc0 ffffffff81997da8 ffffffff8107b0aa
[ 0.000000] ffffffff81d48822 ffffffff81f281a0 0000000040000000 0000001fcb7a4000
[ 0.000000] Call Trace:
[ 0.000000] [<ffffffff816683b4>] dump_stack+0x45/0x57
[ 0.000000] [<ffffffff8107b0aa>] warn_slowpath_common+0x8a/0xc0
[ 0.000000] [<ffffffff8107b135>] warn_slowpath_fmt+0x55/0x70
[ 0.000000] [<ffffffff81660273>] ? memblock_add_range+0x175/0x19e
[ 0.000000] [<ffffffff81176c57>] static_key_slow_inc+0x57/0xc0
[ 0.000000] [<ffffffff81660655>] memblock_mark_mirror+0x19/0x33
[ 0.000000] [<ffffffff81b12c18>] efi_find_mirror+0x59/0xdd
[ 0.000000] [<ffffffff81afb8a6>] setup_arch+0x642/0xccf
[ 0.000000] [<ffffffff81af3120>] ? early_idt_handler_array+0x120/0x120
[ 0.000000] [<ffffffff81663480>] ? printk+0x55/0x6b
[ 0.000000] [<ffffffff81af3120>] ? early_idt_handler_array+0x120/0x120
[ 0.000000] [<ffffffff81af3d93>] start_kernel+0xe8/0x4eb
[ 0.000000] [<ffffffff81af3120>] ? early_idt_handler_array+0x120/0x120
[ 0.000000] [<ffffffff81af3120>] ? early_idt_handler_array+0x120/0x120
[ 0.000000] [<ffffffff81af35ee>] x86_64_start_reservations+0x2a/0x2c
[ 0.000000] [<ffffffff81af373c>] x86_64_start_kernel+0x14c/0x16f
[ 0.000000] ---[ end trace baa7fa0514e3bc58 ]---
[ 0.000000] ------------[ cut here ]------------




2015-06-30 01:01:34

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC v2 PATCH 7/8] mm: add the buddy system interface

On 2015/06/30 8:11, Luck, Tony wrote:
>> @@ -814,7 +814,7 @@ int __init_memblock memblock_clear_hotplug(phys_addr_t base, phys_addr_t size)
>> */
>> int __init_memblock memblock_mark_mirror(phys_addr_t base, phys_addr_t size)
>> {
>> - system_has_some_mirror = true;
>> + static_key_slow_inc(&system_has_mirror);
>>
>> return memblock_setclr_flag(base, size, 1, MEMBLOCK_MIRROR);
>> }
>
> This generates some WARN_ON noise when called from efi_find_mirror():
>

It seems jump_label_init() is called after memory initialization. (init/main.c::start_kernel())
So, it may be difficut to use static_key function for our purpose because
kernel memory allocation may occur before jump_label is ready.

Thanks,
-Kame

> [ 0.000000] e820: last_pfn = 0x7b800 max_arch_pfn = 0x400000000
> [ 0.000000] ------------[ cut here ]------------
> [ 0.000000] WARNING: CPU: 0 PID: 0 at kernel/jump_label.c:61 static_key_slow_inc+0x57/0xc0()
> [ 0.000000] static_key_slow_inc used before call to jump_label_init
> [ 0.000000] Modules linked in:
>
> [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.1.0 #4
> [ 0.000000] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRHSXSD1.86B.0065.R01.1505011640 05/01/2015
> [ 0.000000] 0000000000000000 ee366a8dff38f745 ffffffff81997d68 ffffffff816683b4
> [ 0.000000] 0000000000000000 ffffffff81997dc0 ffffffff81997da8 ffffffff8107b0aa
> [ 0.000000] ffffffff81d48822 ffffffff81f281a0 0000000040000000 0000001fcb7a4000
> [ 0.000000] Call Trace:
> [ 0.000000] [<ffffffff816683b4>] dump_stack+0x45/0x57
> [ 0.000000] [<ffffffff8107b0aa>] warn_slowpath_common+0x8a/0xc0
> [ 0.000000] [<ffffffff8107b135>] warn_slowpath_fmt+0x55/0x70
> [ 0.000000] [<ffffffff81660273>] ? memblock_add_range+0x175/0x19e
> [ 0.000000] [<ffffffff81176c57>] static_key_slow_inc+0x57/0xc0
> [ 0.000000] [<ffffffff81660655>] memblock_mark_mirror+0x19/0x33
> [ 0.000000] [<ffffffff81b12c18>] efi_find_mirror+0x59/0xdd
> [ 0.000000] [<ffffffff81afb8a6>] setup_arch+0x642/0xccf
> [ 0.000000] [<ffffffff81af3120>] ? early_idt_handler_array+0x120/0x120
> [ 0.000000] [<ffffffff81663480>] ? printk+0x55/0x6b
> [ 0.000000] [<ffffffff81af3120>] ? early_idt_handler_array+0x120/0x120
> [ 0.000000] [<ffffffff81af3d93>] start_kernel+0xe8/0x4eb
> [ 0.000000] [<ffffffff81af3120>] ? early_idt_handler_array+0x120/0x120
> [ 0.000000] [<ffffffff81af3120>] ? early_idt_handler_array+0x120/0x120
> [ 0.000000] [<ffffffff81af35ee>] x86_64_start_reservations+0x2a/0x2c
> [ 0.000000] [<ffffffff81af373c>] x86_64_start_kernel+0x14c/0x16f
> [ 0.000000] ---[ end trace baa7fa0514e3bc58 ]---
> [ 0.000000] ------------[ cut here ]------------
>
>
>
>
>

2015-06-30 01:30:13

by Xishi Qiu

[permalink] [raw]
Subject: Re: [RFC v2 PATCH 0/8] mm: mirrored memory support for page buddy allocations

On 2015/6/29 23:19, Dave Hansen wrote:

> On 06/26/2015 07:19 PM, Xishi Qiu wrote:
>> drivers/base/node.c | 17 ++++---
>> fs/proc/meminfo.c | 6 +++
>> include/linux/memblock.h | 29 ++++++++++--
>> include/linux/mmzone.h | 10 ++++
>> include/linux/vmstat.h | 2 +
>> mm/Kconfig | 8 ++++
>> mm/memblock.c | 33 +++++++++++--
>> mm/nobootmem.c | 3 ++
>> mm/page_alloc.c | 117 ++++++++++++++++++++++++++++++++++++-----------
>> mm/vmstat.c | 4 ++
>> 10 files changed, 190 insertions(+), 39 deletions(-)
>
> Has there been any performance analysis done on this code? I'm always
> nervous when I see page_alloc.c churn.
>

Not yet, which benchmark do you suggest?

Thanks,
Xishi Qiu

>
>
> .
>


2015-06-30 01:34:16

by Xishi Qiu

[permalink] [raw]
Subject: Re: [RFC v2 PATCH 7/8] mm: add the buddy system interface

On 2015/6/30 9:01, Kamezawa Hiroyuki wrote:

> On 2015/06/30 8:11, Luck, Tony wrote:
>>> @@ -814,7 +814,7 @@ int __init_memblock memblock_clear_hotplug(phys_addr_t base, phys_addr_t size)
>>> */
>>> int __init_memblock memblock_mark_mirror(phys_addr_t base, phys_addr_t size)
>>> {
>>> - system_has_some_mirror = true;
>>> + static_key_slow_inc(&system_has_mirror);
>>>
>>> return memblock_setclr_flag(base, size, 1, MEMBLOCK_MIRROR);
>>> }
>>
>> This generates some WARN_ON noise when called from efi_find_mirror():
>>
>
> It seems jump_label_init() is called after memory initialization. (init/main.c::start_kernel())
> So, it may be difficut to use static_key function for our purpose because
> kernel memory allocation may occur before jump_label is ready.
>
> Thanks,
> -Kame
>

Hi Kame,

How about like this? Use static bool in bootmem, and use jump label in buddy system.
This means we use two variable to do it.

Thanks,
Xishi Qiu

>> [ 0.000000] e820: last_pfn = 0x7b800 max_arch_pfn = 0x400000000
>> [ 0.000000] ------------[ cut here ]------------
>> [ 0.000000] WARNING: CPU: 0 PID: 0 at kernel/jump_label.c:61 static_key_slow_inc+0x57/0xc0()
>> [ 0.000000] static_key_slow_inc used before call to jump_label_init
>> [ 0.000000] Modules linked in:
>>
>> [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.1.0 #4
>> [ 0.000000] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRHSXSD1.86B.0065.R01.1505011640 05/01/2015
>> [ 0.000000] 0000000000000000 ee366a8dff38f745 ffffffff81997d68 ffffffff816683b4
>> [ 0.000000] 0000000000000000 ffffffff81997dc0 ffffffff81997da8 ffffffff8107b0aa
>> [ 0.000000] ffffffff81d48822 ffffffff81f281a0 0000000040000000 0000001fcb7a4000
>> [ 0.000000] Call Trace:
>> [ 0.000000] [<ffffffff816683b4>] dump_stack+0x45/0x57
>> [ 0.000000] [<ffffffff8107b0aa>] warn_slowpath_common+0x8a/0xc0
>> [ 0.000000] [<ffffffff8107b135>] warn_slowpath_fmt+0x55/0x70
>> [ 0.000000] [<ffffffff81660273>] ? memblock_add_range+0x175/0x19e
>> [ 0.000000] [<ffffffff81176c57>] static_key_slow_inc+0x57/0xc0
>> [ 0.000000] [<ffffffff81660655>] memblock_mark_mirror+0x19/0x33
>> [ 0.000000] [<ffffffff81b12c18>] efi_find_mirror+0x59/0xdd
>> [ 0.000000] [<ffffffff81afb8a6>] setup_arch+0x642/0xccf
>> [ 0.000000] [<ffffffff81af3120>] ? early_idt_handler_array+0x120/0x120
>> [ 0.000000] [<ffffffff81663480>] ? printk+0x55/0x6b
>> [ 0.000000] [<ffffffff81af3120>] ? early_idt_handler_array+0x120/0x120
>> [ 0.000000] [<ffffffff81af3d93>] start_kernel+0xe8/0x4eb
>> [ 0.000000] [<ffffffff81af3120>] ? early_idt_handler_array+0x120/0x120
>> [ 0.000000] [<ffffffff81af3120>] ? early_idt_handler_array+0x120/0x120
>> [ 0.000000] [<ffffffff81af35ee>] x86_64_start_reservations+0x2a/0x2c
>> [ 0.000000] [<ffffffff81af373c>] x86_64_start_kernel+0x14c/0x16f
>> [ 0.000000] ---[ end trace baa7fa0514e3bc58 ]---
>> [ 0.000000] ------------[ cut here ]------------
>>
>>
>>
>>
>>
>
>
>
> .
>


2015-06-30 01:52:22

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC v2 PATCH 0/8] mm: mirrored memory support for page buddy allocations

On 06/29/2015 06:26 PM, Xishi Qiu wrote:
>> > Has there been any performance analysis done on this code? I'm always
>> > nervous when I see page_alloc.c churn.
>> >
> Not yet, which benchmark do you suggest?

mmtests is always a good place to start. aim9. I'm partial to
will-it-scale.

2015-06-30 02:01:36

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC v2 PATCH 7/8] mm: add the buddy system interface

On 2015/06/30 10:31, Xishi Qiu wrote:
> On 2015/6/30 9:01, Kamezawa Hiroyuki wrote:
>
>> On 2015/06/30 8:11, Luck, Tony wrote:
>>>> @@ -814,7 +814,7 @@ int __init_memblock memblock_clear_hotplug(phys_addr_t base, phys_addr_t size)
>>>> */
>>>> int __init_memblock memblock_mark_mirror(phys_addr_t base, phys_addr_t size)
>>>> {
>>>> - system_has_some_mirror = true;
>>>> + static_key_slow_inc(&system_has_mirror);
>>>>
>>>> return memblock_setclr_flag(base, size, 1, MEMBLOCK_MIRROR);
>>>> }
>>>
>>> This generates some WARN_ON noise when called from efi_find_mirror():
>>>
>>
>> It seems jump_label_init() is called after memory initialization. (init/main.c::start_kernel())
>> So, it may be difficut to use static_key function for our purpose because
>> kernel memory allocation may occur before jump_label is ready.
>>
>> Thanks,
>> -Kame
>>
>
> Hi Kame,
>
> How about like this? Use static bool in bootmem, and use jump label in buddy system.
> This means we use two variable to do it.
>

I think it can be done but it should be done in separated patch with enough comment/changelog.

Thanks,
-Kame


2015-06-30 02:46:35

by Xishi Qiu

[permalink] [raw]
Subject: Re: [RFC v2 PATCH 2/8] mm: introduce MIGRATE_MIRROR to manage the mirrored pages

On 2015/6/29 15:32, Kamezawa Hiroyuki wrote:

> On 2015/06/27 11:24, Xishi Qiu wrote:
>> This patch introduces a new migratetype called "MIGRATE_MIRROR", it is used to
>> allocate mirrored pages.
>> When cat /proc/pagetypeinfo, you can see the count of free mirrored blocks.
>>
>> Signed-off-by: Xishi Qiu <[email protected]>
>
> My fear about this approarch is that this may break something existing.
>
> Now, when we add MIGRATE_MIRROR type, we'll hide attributes of pageblocks as
> MIGRATE_UNMOVABOLE, MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE.
>
> Logically, MIRROR attribute is independent from page mobility and this overwrites
> will make some information lost.
>
> Then,
>
>> ---
>> include/linux/mmzone.h | 9 +++++++++
>> mm/page_alloc.c | 3 +++
>> mm/vmstat.c | 3 +++
>> 3 files changed, 15 insertions(+)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 54d74f6..54e891a 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -39,6 +39,9 @@ enum {
>> MIGRATE_UNMOVABLE,
>> MIGRATE_RECLAIMABLE,
>> MIGRATE_MOVABLE,
>> +#ifdef CONFIG_MEMORY_MIRROR
>> + MIGRATE_MIRROR,
>> +#endif
>
> I think
> MIGRATE_MIRROR_UNMOVABLE,
> MIGRATE_MIRROR_RECLAIMABLE,
> MIGRATE_MIRROR_MOVABLE, <== adding this may need discuss.
> MIGRATE_MIRROR_RESERVED, <== reserved pages should be maintained per mirrored/unmirrored.
>

Hi Kame,

You mean add 3 or 4 new migratetype?

> should be added with the following fallback list.
>
> /*
> * MIRROR page range is defined by firmware at boot. The range is limited
> * and is used only for kernel memory mirroring.
> */
> [MIGRATE_UNMOVABLE_MIRROR] = {MIGRATE_RECLAIMABLE_MIRROR, MIGRATE_RESERVE}
> [MIGRATE_RECLAIMABLE_MIRROR] = {MIGRATE_UNMOVABLE_MIRROR, MIGRATE_RESERVE}
>

Why not like this:
{MIGRATE_RECLAIMABLE_MIRROR, MIGRATE_MIRROR_RESERVED, MIGRATE_RESERVE}

> Then, we'll not lose the original information of "Reclaiable Pages".
>
> One problem here is whteher we should have MIGRATE_RESERVE_MIRROR.
>
> If we never allow users to allocate mirrored memory, we should have MIGRATE_RESERVE_MIRROR.
> But it seems to require much more code change to do that.
>
> Creating a zone or adding an attribues to zones are another design choice.
>

If we add a new zone, mirror_zone will span others, I'm worry about this
maybe have problems.

Thanks,
Xishi Qiu

> Anyway, your patch doesn't takes care of reserved memory calculation at this point.
> Please check setup_zone_migrate_reserve() That will be a problem.
>
> Thanks,
> -Kame
>


2015-06-30 02:49:09

by Xishi Qiu

[permalink] [raw]
Subject: Re: [RFC v2 PATCH 0/8] mm: mirrored memory support for page buddy allocations

On 2015/6/30 9:52, Dave Hansen wrote:

> On 06/29/2015 06:26 PM, Xishi Qiu wrote:
>>>> Has there been any performance analysis done on this code? I'm always
>>>> nervous when I see page_alloc.c churn.
>>>>
>> Not yet, which benchmark do you suggest?
>
> mmtests is always a good place to start. aim9. I'm partial to
> will-it-scale.
>

I see, thank you.

>
>
> .
>


2015-06-30 02:52:29

by Xishi Qiu

[permalink] [raw]
Subject: Re: [RFC v2 PATCH 1/8] mm: add a new config to manage the code

On 2015/6/29 14:50, Kamezawa Hiroyuki wrote:

> On 2015/06/27 11:23, Xishi Qiu wrote:
>> This patch introduces a new config called "CONFIG_ACPI_MIRROR_MEMORY", set it
> CONFIG_MEMORY_MIRROR
>> off by default.
>>
>> Signed-off-by: Xishi Qiu <[email protected]>
>> ---
>> mm/Kconfig | 8 ++++++++
>> 1 file changed, 8 insertions(+)
>>
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 390214d..c40bb8b 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -200,6 +200,14 @@ config MEMORY_HOTREMOVE
>> depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
>> depends on MIGRATION
>>
>> +config MEMORY_MIRROR
>
> In following patches, you use CONFIG_MEMORY_MIRROR.
>
> I think the name is too generic besides it's depends on ACPI.
> But I'm not sure address based memory mirror is planned in other platform.
>
> So, hmm. How about dividing the config into 2 parts like attached ? (just an example)
>

Seems like a good idea, thank you.

> Thanks,
> -Kame


2015-06-30 07:54:02

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC v2 PATCH 2/8] mm: introduce MIGRATE_MIRROR to manage the mirrored pages

On 2015/06/30 11:45, Xishi Qiu wrote:
> On 2015/6/29 15:32, Kamezawa Hiroyuki wrote:
>
>> On 2015/06/27 11:24, Xishi Qiu wrote:
>>> This patch introduces a new migratetype called "MIGRATE_MIRROR", it is used to
>>> allocate mirrored pages.
>>> When cat /proc/pagetypeinfo, you can see the count of free mirrored blocks.
>>>
>>> Signed-off-by: Xishi Qiu <[email protected]>
>>
>> My fear about this approarch is that this may break something existing.
>>
>> Now, when we add MIGRATE_MIRROR type, we'll hide attributes of pageblocks as
>> MIGRATE_UNMOVABOLE, MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE.
>>
>> Logically, MIRROR attribute is independent from page mobility and this overwrites
>> will make some information lost.
>>
>> Then,
>>
>>> ---
>>> include/linux/mmzone.h | 9 +++++++++
>>> mm/page_alloc.c | 3 +++
>>> mm/vmstat.c | 3 +++
>>> 3 files changed, 15 insertions(+)
>>>
>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>>> index 54d74f6..54e891a 100644
>>> --- a/include/linux/mmzone.h
>>> +++ b/include/linux/mmzone.h
>>> @@ -39,6 +39,9 @@ enum {
>>> MIGRATE_UNMOVABLE,
>>> MIGRATE_RECLAIMABLE,
>>> MIGRATE_MOVABLE,
>>> +#ifdef CONFIG_MEMORY_MIRROR
>>> + MIGRATE_MIRROR,
>>> +#endif
>>
>> I think
>> MIGRATE_MIRROR_UNMOVABLE,
>> MIGRATE_MIRROR_RECLAIMABLE,
>> MIGRATE_MIRROR_MOVABLE, <== adding this may need discuss.
>> MIGRATE_MIRROR_RESERVED, <== reserved pages should be maintained per mirrored/unmirrored.
>>
>
> Hi Kame,
>
> You mean add 3 or 4 new migratetype?
>

yes. But please check how NR_MIGRATETYPE_BITS will be.
I think this will not have big impact in x86-64 .

>> should be added with the following fallback list.
>>
>> /*
>> * MIRROR page range is defined by firmware at boot. The range is limited
>> * and is used only for kernel memory mirroring.
>> */
>> [MIGRATE_UNMOVABLE_MIRROR] = {MIGRATE_RECLAIMABLE_MIRROR, MIGRATE_RESERVE}
>> [MIGRATE_RECLAIMABLE_MIRROR] = {MIGRATE_UNMOVABLE_MIRROR, MIGRATE_RESERVE}
>>
>
> Why not like this:
> {MIGRATE_RECLAIMABLE_MIRROR, MIGRATE_MIRROR_RESERVED, MIGRATE_RESERVE}
>

My mistake.
[MIGRATE_UNMOVABLE_MIRROR] = {MIGRATE_RECLAIMABLE_MIRROR, MIGRATE_RESERVE_MIRROR}
[MIGRATE_RECLAIMABLE_MIRROR] = {MIGRATE_UNMOVABLE_MIRROR, MIGRATE_RESERVE_MIRROR}

was my intention. This means mirrored memory and unmirrored memory is separated completely.

But this should affect kswapd or other memory reclaim logic.

for example, kswapd stops free pages are more than hi watermark.
But mirrored/unmirrored pages exhausted cases are not handled in this series.
You need some extra check in memory reclaim logic if you go with migration_type.



>> Then, we'll not lose the original information of "Reclaiable Pages".
>>
>> One problem here is whteher we should have MIGRATE_RESERVE_MIRROR.
>>
>> If we never allow users to allocate mirrored memory, we should have MIGRATE_RESERVE_MIRROR.
>> But it seems to require much more code change to do that.
>>
>> Creating a zone or adding an attribues to zones are another design choice.
>>
>
> If we add a new zone, mirror_zone will span others, I'm worry about this
> maybe have problems.

Yes. that's problem. And zoneid bit is very limited resource.
(....But memory reclaim logic can be unchanged.)

Anyway, I'd like to see your solution with above changes 1st rather than adding zones.

Thanks,
-Kame

2015-06-30 09:31:31

by Xishi Qiu

[permalink] [raw]
Subject: Re: [RFC v2 PATCH 2/8] mm: introduce MIGRATE_MIRROR to manage the mirrored pages

On 2015/6/30 15:53, Kamezawa Hiroyuki wrote:

> On 2015/06/30 11:45, Xishi Qiu wrote:
>> On 2015/6/29 15:32, Kamezawa Hiroyuki wrote:
>>
>>> On 2015/06/27 11:24, Xishi Qiu wrote:
>>>> This patch introduces a new migratetype called "MIGRATE_MIRROR", it is used to
>>>> allocate mirrored pages.
>>>> When cat /proc/pagetypeinfo, you can see the count of free mirrored blocks.
>>>>
>>>> Signed-off-by: Xishi Qiu <[email protected]>
>>>
>>> My fear about this approarch is that this may break something existing.
>>>
>>> Now, when we add MIGRATE_MIRROR type, we'll hide attributes of pageblocks as
>>> MIGRATE_UNMOVABOLE, MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE.
>>>
>>> Logically, MIRROR attribute is independent from page mobility and this overwrites
>>> will make some information lost.
>>>
>>> Then,
>>>
>>>> ---
>>>> include/linux/mmzone.h | 9 +++++++++
>>>> mm/page_alloc.c | 3 +++
>>>> mm/vmstat.c | 3 +++
>>>> 3 files changed, 15 insertions(+)
>>>>
>>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>>>> index 54d74f6..54e891a 100644
>>>> --- a/include/linux/mmzone.h
>>>> +++ b/include/linux/mmzone.h
>>>> @@ -39,6 +39,9 @@ enum {
>>>> MIGRATE_UNMOVABLE,
>>>> MIGRATE_RECLAIMABLE,
>>>> MIGRATE_MOVABLE,
>>>> +#ifdef CONFIG_MEMORY_MIRROR
>>>> + MIGRATE_MIRROR,
>>>> +#endif
>>>
>>> I think
>>> MIGRATE_MIRROR_UNMOVABLE,
>>> MIGRATE_MIRROR_RECLAIMABLE,
>>> MIGRATE_MIRROR_MOVABLE, <== adding this may need discuss.
>>> MIGRATE_MIRROR_RESERVED, <== reserved pages should be maintained per mirrored/unmirrored.
>>>
>>
>> Hi Kame,
>>
>> You mean add 3 or 4 new migratetype?
>>
>
> yes. But please check how NR_MIGRATETYPE_BITS will be.
> I think this will not have big impact in x86-64 .
>
>>> should be added with the following fallback list.
>>>
>>> /*
>>> * MIRROR page range is defined by firmware at boot. The range is limited
>>> * and is used only for kernel memory mirroring.
>>> */
>>> [MIGRATE_UNMOVABLE_MIRROR] = {MIGRATE_RECLAIMABLE_MIRROR, MIGRATE_RESERVE}
>>> [MIGRATE_RECLAIMABLE_MIRROR] = {MIGRATE_UNMOVABLE_MIRROR, MIGRATE_RESERVE}
>>>
>>
>> Why not like this:
>> {MIGRATE_RECLAIMABLE_MIRROR, MIGRATE_MIRROR_RESERVED, MIGRATE_RESERVE}
>>
>
> My mistake.
> [MIGRATE_UNMOVABLE_MIRROR] = {MIGRATE_RECLAIMABLE_MIRROR, MIGRATE_RESERVE_MIRROR}
> [MIGRATE_RECLAIMABLE_MIRROR] = {MIGRATE_UNMOVABLE_MIRROR, MIGRATE_RESERVE_MIRROR}
>
> was my intention. This means mirrored memory and unmirrored memory is separated completely.
>
> But this should affect kswapd or other memory reclaim logic.
>
> for example, kswapd stops free pages are more than hi watermark.
> But mirrored/unmirrored pages exhausted cases are not handled in this series.
> You need some extra check in memory reclaim logic if you go with migration_type.
>

OK, I understand. Thank you for your suggestion.

Thanks,
Xishi Qiu

>
>
>>> Then, we'll not lose the original information of "Reclaiable Pages".
>>>
>>> One problem here is whteher we should have MIGRATE_RESERVE_MIRROR.
>>>
>>> If we never allow users to allocate mirrored memory, we should have MIGRATE_RESERVE_MIRROR.
>>> But it seems to require much more code change to do that.
>>>
>>> Creating a zone or adding an attribues to zones are another design choice.
>>>
>>
>> If we add a new zone, mirror_zone will span others, I'm worry about this
>> maybe have problems.
>
> Yes. that's problem. And zoneid bit is very limited resource.
> (....But memory reclaim logic can be unchanged.)
>
> Anyway, I'd like to see your solution with above changes 1st rather than adding zones.
>
> Thanks,
> -Kame
>
>
>
> .
>


2015-06-30 09:42:04

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC v2 PATCH 0/8] mm: mirrored memory support for page buddy allocations

On Sat, Jun 27, 2015 at 10:19:54AM +0800, Xishi Qiu wrote:
> Intel Xeon processor E7 v3 product family-based platforms introduces support
> for partial memory mirroring called as 'Address Range Mirroring'. This feature
> allows BIOS to specify a subset of total available memory to be mirrored (and
> optionally also specify whether to mirror the range 0-4 GB). This capability
> allows user to make an appropriate tradeoff between non-mirrored memory range
> and mirrored memory range thus optimizing total available memory and still
> achieving highly reliable memory range for mission critical workloads and/or
> kernel space.
>
> Tony has already send a patchset to supprot this feature at boot time.
> https://lkml.org/lkml/2015/5/8/521
> This patchset is based on Tony's, it can support the feature after boot time.
> Use mirrored memory for all kernel allocations.
>

This is my first time glancing through the series so I'm not aware of any
past discussion. Hopefully there are no repeats. Broadly speaking though
I'm not comfortable with the series.

First and foremost, there is uncontrolled access to the memory because
it's any kernel request. This includes even short-lived ones that do not
need mirroring such as network buffers or caches. Network network traffic
can be retried, caches can be reconstructed from disk etc. Kernel page
tables, struct page corruption etc are much harder to recover from.

Who are the expected users of this memory and how are they meant to be
prioritised? What happens if they fail to be mirrored? What happens if the
mirrored memory is all used up and a high priority request arrives? Is there
any prioritisation of one subsystem over another? What about boot-memory
allocations, should they ever use mirrored memory? The expected users are
important and this series does not address it.

Callers do not specify the flag, you just assume that kernel allocations
must be mirrored. If the allocation request fails, then you assume it was
MIGRATE_RECLAIMABLE later in the series. This is wrong as it'll break
fragmentation avoidance on machines with mirrored memory. Even if you
were to use migrate types to handle mirrored memory, you need to treat
mirrored memory as a type of reserve or else as a first preference for
allocations requested.

The fact that this will be used by very few machines but affects the memory
footprint of the page allocator is a general concern. When active, it affects
the fast paths for all users whether they care about mirroring or not.

If all free memory is in the MIGRATE_MIRROR then all user-space requests
will be rejected but reclaim will not make any progress if the zone
is balanced. The system may go prematurely OOM as no progress is made.
Getting around this is tricky and affects a few fast paths. Generally, the
easiest approach would be zone-based but I recognise that it has problems
of its own.

Basically, overall I feel this series is the wrong approach but not knowing
who the users are making is much harder to judge. I strongly suspect that
if mirrored memory is to be properly used then it needs to be available
before the page allocator is even active. Once active, there needs to
be controlled access for allocation requests that are really critical
to mirror and not just all kernel allocations. None of that would use a
MIGRATE_TYPE approach. It would be alterations to the bootmem allocator and
access to an explicit reserve that is not accounted for as "free memory"
and accessed via an explicit GFP flag.

--
Mel Gorman
SUSE Labs

2015-06-30 10:47:08

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC v2 PATCH 0/8] mm: mirrored memory support for page buddy allocations


* Mel Gorman <[email protected]> wrote:

> [...]
>
> Basically, overall I feel this series is the wrong approach but not knowing who
> the users are making is much harder to judge. I strongly suspect that if
> mirrored memory is to be properly used then it needs to be available before the
> page allocator is even active. Once active, there needs to be controlled access
> for allocation requests that are really critical to mirror and not just all
> kernel allocations. None of that would use a MIGRATE_TYPE approach. It would be
> alterations to the bootmem allocator and access to an explicit reserve that is
> not accounted for as "free memory" and accessed via an explicit GFP flag.

So I think the main goal is to avoid kernel crashes when a #MC memory fault
arrives on a piece of memory that is owned by the kernel.

In that sense 'protecting' all kernel allocations is natural: we don't know how to
recover from faults that affect kernel memory.

We do know how to recover from faults that affect user-space memory alone.

So if a mechanism is in place that prioritizes 3 groups of allocators:

- non-recoverable memory (kernel allocations mostly)

- high priority user memory (critical apps that must never fail)

- recoverable user memory (non-dirty caches that can simply be dropped,
non-critical apps, etc.)

then we can make use of this hardware feature. I suspect this series tries to move
in that direction.

Thanks,

Ingo

2015-06-30 11:54:10

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC v2 PATCH 0/8] mm: mirrored memory support for page buddy allocations

On Tue, Jun 30, 2015 at 12:46:54PM +0200, Ingo Molnar wrote:
>
> * Mel Gorman <[email protected]> wrote:
>
> > [...]
> >
> > Basically, overall I feel this series is the wrong approach but not knowing who
> > the users are making is much harder to judge. I strongly suspect that if
> > mirrored memory is to be properly used then it needs to be available before the
> > page allocator is even active. Once active, there needs to be controlled access
> > for allocation requests that are really critical to mirror and not just all
> > kernel allocations. None of that would use a MIGRATE_TYPE approach. It would be
> > alterations to the bootmem allocator and access to an explicit reserve that is
> > not accounted for as "free memory" and accessed via an explicit GFP flag.
>
> So I think the main goal is to avoid kernel crashes when a #MC memory fault
> arrives on a piece of memory that is owned by the kernel.
>

Sounds logical. In that case, bootmem awareness would be crucial.
Enabling support in just the page allocator is too late.

> In that sense 'protecting' all kernel allocations is natural: we don't know how to
> recover from faults that affect kernel memory.
>

It potentially uses all mirrored memory on memory that does not need that
sort of guarantee. For example, if there was a MC on memory backing the
inode cache then potentially that is recoverable as long as the inodes
were not dirty. That's a minor detail as the kernel could later protect
only MIGRATE_UNMOVABLE requests instead of all kernel allocations if fatal
MC in kernel space could be distinguished from non-fatal checks.

Bootmem awareness is much more important either way. If that was addressed
then potentially a MIGRATE_UNMOVABLE_MIRROR type could be created that
is only used for MIGRATE_UNMOVABLE allocations and never for user-space.
That misses MIGRATE_RECLAIMABLE so if that is required then we need
something else that both preserves fragmentation avoidance and avoid
introducing loads of new migratetypes.

Reclaim-related issues could be partially avoided by forbidding use from
userspace and accounting for the size of MIGRATE_UNMOVABLE_MIRROR during
watermark checks.

> We do know how to recover from faults that affect user-space memory alone.
>
> So if a mechanism is in place that prioritizes 3 groups of allocators:
>
> - non-recoverable memory (kernel allocations mostly)
>

So bootmem at the very least followed by MIGRATE_UNMOVABLE requests whether
they are accounted for by zones of MIGRATE_TYPES.

> - high priority user memory (critical apps that must never fail)
>

This one is problematic with a MIGRATE_TYPE-based approach such as the one in
this series. If a high priority requires memory and MIGRATE_MIRROR is full
then some of it must be reclaimed. With a MIGRATE_TYPE approach, the kernel
may reclaim a lot of unnecessary memory trying to free some MIGRATE_MIRROR
memory with no guarantee of success. It'll look like unnecessary thrashing
from userspace but difficult to diagnose as reclaim stats are per-zone based.
Dealing with this needs either a zone-based approach or a lot of surgery
to reclaim (similar to what the node-based LRU series does actually when
it skips pages when the caller requires lowmem pages).

--
Mel Gorman
SUSE Labs

2015-06-30 18:12:48

by Tony Luck

[permalink] [raw]
Subject: RE: [RFC v2 PATCH 0/8] mm: mirrored memory support for page buddy allocations

> Sounds logical. In that case, bootmem awareness would be crucial.
> Enabling support in just the page allocator is too late.

Andrew already applied some patches from me that I think covered bootmem
mirror allocations:

commit fc6daaf93151877748f8096af6b3fddb147f22d6
mm/memblock: add extra "flags" to memblock to allow selection of memory based on attribute
commit a3f5bafcc04aaf62990e0cf3ced1cc6d8dc6fe95
mm/memblock: allocate boot time data structures from mirrored memory
commit b05b9f5f9dcf593a0e9327676b78e6c17b4218e8
x86, mirror: x86 enabling - find mirrored memory ranges

If I missed something, please let me know.


>> In that sense 'protecting' all kernel allocations is natural: we don't know how to
>> recover from faults that affect kernel memory.
>>
>
> It potentially uses all mirrored memory on memory that does not need that
> sort of guarantee. For example, if there was a MC on memory backing the
> inode cache then potentially that is recoverable as long as the inodes
> were not dirty.

Right now this is hard to do. On Intel we get a broadcast machine check that
may catch bystander cpus holding locks that we might need to look at kernel
structures to make decisions on what we just lost. That may get easier with
local machine check (only the logical cpu that tried to consume the corrupt
data gets the machine check ... patches for Linux are in for basic support of
this ... waiting for h/w that does it).


> That's a minor detail as the kernel could later protect
> only MIGRATE_UNMOVABLE requests instead of all kernel allocations if fatal
> MC in kernel space could be distinguished from non-fatal checks.

So the immediate use case is large memory servers (hundred+ Gbytes to
TBytes) running some applications that use most of memory in user mode
(like a database). We mirror enough memory to cover *all* the kernel allocations
so that a bad memory access with be fixed from the mirror for kernel, or result
in SIGBUS to a process for user page ... either way we don't crash the system.

Perhaps in the future we might find some places in the kernel where we can
cover a lot of memory without too many code changes ... e.g. things like
pagecopy(). At that time we'd have to think about allocation priorities.

-Tony

2015-07-13 04:57:28

by Xishi Qiu

[permalink] [raw]
Subject: Re: [RFC v2 PATCH 0/8] mm: mirrored memory support for page buddy allocations

On 2015/6/30 19:53, Mel Gorman wrote:

> On Tue, Jun 30, 2015 at 12:46:54PM +0200, Ingo Molnar wrote:
>>
>> * Mel Gorman <[email protected]> wrote:
>>
>>> [...]
>>>
>>> Basically, overall I feel this series is the wrong approach but not knowing who
>>> the users are making is much harder to judge. I strongly suspect that if
>>> mirrored memory is to be properly used then it needs to be available before the
>>> page allocator is even active. Once active, there needs to be controlled access
>>> for allocation requests that are really critical to mirror and not just all
>>> kernel allocations. None of that would use a MIGRATE_TYPE approach. It would be
>>> alterations to the bootmem allocator and access to an explicit reserve that is
>>> not accounted for as "free memory" and accessed via an explicit GFP flag.
>>
>> So I think the main goal is to avoid kernel crashes when a #MC memory fault
>> arrives on a piece of memory that is owned by the kernel.
>>
>
> Sounds logical. In that case, bootmem awareness would be crucial.
> Enabling support in just the page allocator is too late.
>
>> In that sense 'protecting' all kernel allocations is natural: we don't know how to
>> recover from faults that affect kernel memory.
>>
>
> It potentially uses all mirrored memory on memory that does not need that
> sort of guarantee. For example, if there was a MC on memory backing the
> inode cache then potentially that is recoverable as long as the inodes
> were not dirty. That's a minor detail as the kernel could later protect
> only MIGRATE_UNMOVABLE requests instead of all kernel allocations if fatal
> MC in kernel space could be distinguished from non-fatal checks.
>
> Bootmem awareness is much more important either way. If that was addressed
> then potentially a MIGRATE_UNMOVABLE_MIRROR type could be created that
> is only used for MIGRATE_UNMOVABLE allocations and never for user-space.
> That misses MIGRATE_RECLAIMABLE so if that is required then we need
> something else that both preserves fragmentation avoidance and avoid
> introducing loads of new migratetypes.
>
> Reclaim-related issues could be partially avoided by forbidding use from
> userspace and accounting for the size of MIGRATE_UNMOVABLE_MIRROR during
> watermark checks.
>
>> We do know how to recover from faults that affect user-space memory alone.
>>
>> So if a mechanism is in place that prioritizes 3 groups of allocators:
>>
>> - non-recoverable memory (kernel allocations mostly)
>>
>
> So bootmem at the very least followed by MIGRATE_UNMOVABLE requests whether
> they are accounted for by zones of MIGRATE_TYPES.
>
>> - high priority user memory (critical apps that must never fail)
>>
>
> This one is problematic with a MIGRATE_TYPE-based approach such as the one in
> this series. If a high priority requires memory and MIGRATE_MIRROR is full
> then some of it must be reclaimed. With a MIGRATE_TYPE approach, the kernel
> may reclaim a lot of unnecessary memory trying to free some MIGRATE_MIRROR
> memory with no guarantee of success. It'll look like unnecessary thrashing
> from userspace but difficult to diagnose as reclaim stats are per-zone based.
> Dealing with this needs either a zone-based approach or a lot of surgery
> to reclaim (similar to what the node-based LRU series does actually when
> it skips pages when the caller requires lowmem pages).
>

Hi Mel,

Thank you for your comment. Sorry for replying late and some of it is not
very understanding for me.

If fatal memory faults in kernel space could be distinguished from non-fatal,
we can use only MIGRATE_UNMOVABLE_MIRROR, if can't, use two types for
MIGRATE_RECLAIMABLE and MIGRATE_UNMOVABLE, right?

Reclaim-related issues is similar to CMA in zone_watermark_ok(), right?

If we protect high priority user memory, use a new mirrored zone may be
better, right?

How about use a flag(e.g. GFP_MIRROR) to in kernel space allocation?
Can we use it to sort kernel space allocation? And it can also called by
user space via madvise and mmap.

Thanks,
Xishi Qiu