2015-06-04 12:59:41

by Xishi Qiu

Subject: [RFC PATCH 00/12] mm: mirrored memory support for page buddy allocations

Platforms based on the Intel Xeon processor E7 v3 product family introduce
support for partial memory mirroring, called 'Address Range Mirroring'. This
feature allows the BIOS to specify a subset of the total available memory to be
mirrored (and optionally also to specify whether to mirror the range 0-4 GB).
This lets the user trade off non-mirrored memory against mirrored memory,
keeping as much total memory available as possible while still providing a
highly reliable memory range for mission-critical workloads and/or kernel
space.

Tony has already sent a patchset to support this feature at boot time.
https://lkml.org/lkml/2015/5/8/521

This patchset supports the feature after boot time. It introduces mirror_info
to save the mirrored memory ranges, and __GFP_MIRROR to allocate mirrored
pages.

I think adding a new migratetype is better and easier than adding a new zone,
so I use MIGRATE_MIRROR to manage the mirrored pages. However, this changes some
code in the core files. Please review and comment, thanks.

TBD:
1) call add_mirror_info() to fill mirrored memory info.
2) add compatibility with memory online/offline.
3) add more interface? others?

Xishi Qiu (12):
mm: add a new config to manage the code
mm: introduce mirror_info
mm: introduce MIGRATE_MIRROR to manage the mirrored pages
mm: add mirrored pages to buddy system
mm: introduce a new zone_stat_item NR_FREE_MIRROR_PAGES
mm: add free mirrored pages info
mm: introduce __GFP_MIRROR to allocate mirrored pages
mm: use mirrorable to switch allocate mirrored memory
mm: enable allocate mirrored memory at boot time
mm: add the buddy system interface
mm: add the PCP interface
mm: let slab/slub/slob use mirrored memory

arch/x86/mm/numa.c | 3 ++
drivers/base/node.c | 17 ++++---
fs/proc/meminfo.c | 6 +++
include/linux/gfp.h | 5 +-
include/linux/mmzone.h | 23 +++++++++
include/linux/vmstat.h | 2 +
kernel/sysctl.c | 9 ++++
mm/Kconfig | 8 +++
mm/page_alloc.c | 134 ++++++++++++++++++++++++++++++++++++++++++++++---
mm/slab.c | 3 +-
mm/slob.c | 2 +-
mm/slub.c | 2 +-
mm/vmstat.c | 4 ++
13 files changed, 202 insertions(+), 16 deletions(-)

--
2.0.0


2015-06-04 13:12:04

by Xishi Qiu

Subject: [RFC PATCH 01/12] mm: add a new config to manage the code

This patch introduces a new config option called "CONFIG_MEMORY_MIRROR"; it is
used to turn the feature on or off.

Signed-off-by: Xishi Qiu <[email protected]>
---
mm/Kconfig | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 390214d..4f2a726 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -200,6 +200,14 @@ config MEMORY_HOTREMOVE
depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
depends on MIGRATION

+config MEMORY_MIRROR
+ bool "Address range mirroring support"
+ depends on X86 && NUMA
+ default y
+ help
+ This feature depends on hardware and firmware support.
+ ACPI or EFI records the mirror info.
+
#
# If we have space for more page flags then we can enable additional
# optimizations and functionality.
--
2.0.0

2015-06-04 13:12:39

by Xishi Qiu

Subject: [RFC PATCH 02/12] mm: introduce mirror_info

This patch introduces a new struct called "mirror_info"; it is used to store
the mirrored address ranges reported by EFI or ACPI.

TBD: call add_mirror_info() to fill it.

Signed-off-by: Xishi Qiu <[email protected]>
---
arch/x86/mm/numa.c | 3 +++
include/linux/mmzone.h | 15 +++++++++++++++
mm/page_alloc.c | 33 +++++++++++++++++++++++++++++++++
3 files changed, 51 insertions(+)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 4053bb5..781fd68 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -619,6 +619,9 @@ static int __init numa_init(int (*init_func)(void))
/* In case that parsing SRAT failed. */
WARN_ON(memblock_clear_hotplug(0, ULLONG_MAX));
numa_reset_distance();
+#ifdef CONFIG_MEMORY_MIRROR
+ memset(&mirror_info, 0, sizeof(mirror_info));
+#endif

ret = init_func();
if (ret < 0)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 54d74f6..1fae07b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -69,6 +69,21 @@ enum {
# define is_migrate_cma(migratetype) false
#endif

+#ifdef CONFIG_MEMORY_MIRROR
+struct numa_mirror_info {
+ int node;
+ unsigned long start;
+ unsigned long size;
+};
+
+struct mirror_info {
+ int count;
+ struct numa_mirror_info info[MAX_NUMNODES];
+};
+
+extern struct mirror_info mirror_info;
+#endif
+
#define for_each_migratetype_order(order, type) \
for (order = 0; order < MAX_ORDER; order++) \
for (type = 0; type < MIGRATE_TYPES; type++)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ebffa0e..41a95a7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -210,6 +210,10 @@ static char * const zone_names[MAX_NR_ZONES] = {
int min_free_kbytes = 1024;
int user_min_free_kbytes = -1;

+#ifdef CONFIG_MEMORY_MIRROR
+struct mirror_info mirror_info;
+#endif
+
static unsigned long __meminitdata nr_kernel_pages;
static unsigned long __meminitdata nr_all_pages;
static unsigned long __meminitdata dma_reserve;
@@ -545,6 +549,31 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
return 0;
}

+#ifdef CONFIG_MEMORY_MIRROR
+static void __init add_mirror_info(int node,
+ unsigned long start, unsigned long size)
+{
+ mirror_info.info[mirror_info.count].node = node;
+ mirror_info.info[mirror_info.count].start = start;
+ mirror_info.info[mirror_info.count].size = size;
+
+ mirror_info.count++;
+}
+
+static void __init print_mirror_info(void)
+{
+ int i;
+
+ printk("Mirror info\n");
+ for (i = 0; i < mirror_info.count; i++)
+ printk(" node %3d: [mem %#010lx-%#010lx]\n",
+ mirror_info.info[i].node,
+ mirror_info.info[i].start,
+ mirror_info.info[i].start +
+ mirror_info.info[i].size - 1);
+}
+#endif
+
/*
* Freeing function for a buddy system allocator.
*
@@ -5438,6 +5467,10 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
(u64)zone_movable_pfn[i] << PAGE_SHIFT);
}

+#ifdef CONFIG_MEMORY_MIRROR
+ print_mirror_info();
+#endif
+
/* Print out the early node map */
pr_info("Early memory node ranges\n");
for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid)
--
2.0.0
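
A minimal user-space sketch of the table introduced in 02/12, for illustration
only. The posted add_mirror_info() writes info[count] without checking the
array bound; the variant below adds that check. MAX_MIRROR_RANGES, the uint64_t
fields and the int return value are assumptions made for this sketch and are
not part of the posted patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical capacity; the patch sizes the array with MAX_NUMNODES. */
#define MAX_MIRROR_RANGES 8

struct numa_mirror_info {
	int node;
	uint64_t start;		/* physical start address of the range */
	uint64_t size;		/* length of the range in bytes */
};

struct mirror_info {
	int count;
	struct numa_mirror_info info[MAX_MIRROR_RANGES];
};

static struct mirror_info mirror_info;

/* Record one range. Returns 0 on success, -1 if the table is full. */
static int add_mirror_info(int node, uint64_t start, uint64_t size)
{
	assert(size != 0);	/* an empty range would be a caller bug */

	if (mirror_info.count >= MAX_MIRROR_RANGES)
		return -1;

	mirror_info.info[mirror_info.count].node = node;
	mirror_info.info[mirror_info.count].start = start;
	mirror_info.info[mirror_info.count].size = size;
	mirror_info.count++;
	return 0;
}
```

Returning an error instead of overflowing lets the caller decide whether to
warn or to ignore the extra range.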

2015-06-04 13:02:17

by Xishi Qiu

Subject: [RFC PATCH 03/12] mm: introduce MIGRATE_MIRROR to manage the mirrored pages

This patch introduces a new migrate type called "MIGRATE_MIRROR"; it is used
to hold the list of mirrored pages.
Running cat /proc/pagetypeinfo shows the count of free mirrored blocks.

e.g.
euler-linux:~ # cat /proc/pagetypeinfo
Page block order: 9
Pages per block: 512

Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
Node 0, zone DMA, type Unmovable 1 1 0 0 2 1 1 0 1 0 0
Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA, type Movable 0 0 0 0 0 0 0 0 0 0 3
Node 0, zone DMA, type Mirror 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA, type Reserve 0 0 0 0 0 0 0 0 0 1 0
Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32, type Unmovable 0 0 1 0 0 0 0 1 1 1 0
Node 0, zone DMA32, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32, type Movable 1 2 6 6 6 4 5 3 3 2 738
Node 0, zone DMA32, type Mirror 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32, type Reserve 0 0 0 0 0 0 0 0 0 0 1
Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone Normal, type Unmovable 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone Normal, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone Normal, type Movable 0 0 1 1 0 0 0 2 1 0 4254
Node 0, zone Normal, type Mirror 148 104 63 70 26 11 2 2 1 1 973
Node 0, zone Normal, type Reserve 0 0 0 0 0 0 0 0 0 0 1
Node 0, zone Normal, type Isolate 0 0 0 0 0 0 0 0 0 0 0

Number of blocks type Unmovable Reclaimable Movable Mirror Reserve Isolate
Node 0, zone DMA 1 0 6 0 1 0
Node 0, zone DMA32 2 0 1525 0 1 0
Node 0, zone Normal 0 0 8702 2048 2 0
Page block order: 9
Pages per block: 512

Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
Node 1, zone Normal, type Unmovable 0 0 0 0 0 0 0 0 0 0 0
Node 1, zone Normal, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
Node 1, zone Normal, type Movable 2 2 1 1 2 1 2 2 2 3 3996
Node 1, zone Normal, type Mirror 68 94 57 6 8 1 0 0 3 1 2003
Node 1, zone Normal, type Reserve 0 0 0 0 0 0 0 0 0 0 1
Node 1, zone Normal, type Isolate 0 0 0 0 0 0 0 0 0 0 0

Number of blocks type Unmovable Reclaimable Movable Mirror Reserve Isolate
Node 1, zone Normal 0 0 8190 4096 2 0


Signed-off-by: Xishi Qiu <[email protected]>
---
include/linux/mmzone.h | 6 ++++++
mm/page_alloc.c | 3 +++
mm/vmstat.c | 3 +++
3 files changed, 12 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 1fae07b..b444335 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -39,6 +39,9 @@ enum {
MIGRATE_UNMOVABLE,
MIGRATE_RECLAIMABLE,
MIGRATE_MOVABLE,
+#ifdef CONFIG_MEMORY_MIRROR
+ MIGRATE_MIRROR,
+#endif
MIGRATE_PCPTYPES, /* the number of types on the pcp lists */
MIGRATE_RESERVE = MIGRATE_PCPTYPES,
#ifdef CONFIG_CMA
@@ -82,6 +85,9 @@ struct mirror_info {
};

extern struct mirror_info mirror_info;
+# define is_migrate_mirror(migratetype) unlikely((migratetype) == MIGRATE_MIRROR)
+#else
+# define is_migrate_mirror(migratetype) false
#endif

#define for_each_migratetype_order(order, type) \
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 41a95a7..3b2ff46 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3245,6 +3245,9 @@ static void show_migration_types(unsigned char type)
[MIGRATE_UNMOVABLE] = 'U',
[MIGRATE_RECLAIMABLE] = 'E',
[MIGRATE_MOVABLE] = 'M',
+#ifdef CONFIG_MEMORY_MIRROR
+ [MIGRATE_MIRROR] = 'O',
+#endif
[MIGRATE_RESERVE] = 'R',
#ifdef CONFIG_CMA
[MIGRATE_CMA] = 'C',
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4f5cd97..d0323e0 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -901,6 +901,9 @@ static char * const migratetype_names[MIGRATE_TYPES] = {
"Unmovable",
"Reclaimable",
"Movable",
+#ifdef CONFIG_MEMORY_MIRROR
+ "Mirror",
+#endif
"Reserve",
#ifdef CONFIG_CMA
"CMA",
--
2.0.0
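
The /proc/pagetypeinfo sample in 03/12 can be turned into sizes with a little
arithmetic. The sketch below assumes 4 KiB pages (the sample itself only states
512 pages per block); the helper names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES	4096ULL	/* assumption: 4 KiB pages */
#define PAGES_PER_BLOCK	512ULL	/* "Pages per block: 512" in the sample */
#define NR_ORDERS	11	/* orders 0..10 in the sample */

/* Bytes covered by a number of pageblocks. */
static uint64_t blocks_to_bytes(uint64_t blocks)
{
	return blocks * PAGES_PER_BLOCK * PAGE_SIZE_BYTES;
}

/* Sum one "Free pages count" row: each entry at order n is 2^n pages. */
static uint64_t free_pages(const unsigned int counts[NR_ORDERS])
{
	uint64_t pages = 0;
	int order;

	assert(counts != NULL);
	for (order = 0; order < NR_ORDERS; order++)
		pages += (uint64_t)counts[order] << order;
	return pages;
}

/* "Node 0, zone Normal, type Mirror" row from the sample output. */
static const unsigned int node0_normal_mirror[NR_ORDERS] = {
	148, 104, 63, 70, 26, 11, 2, 2, 1, 1, 973
};
```

Under these assumptions the 2048 Mirror blocks on node 0 cover 4 GiB, the 4096
Mirror blocks on node 1 cover 8 GiB, and the node 0 row sums to 999440 free
mirrored pages.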

2015-06-04 13:01:37

by Xishi Qiu

Subject: [RFC PATCH 04/12] mm: add mirrored pages to buddy system

Set the migratetype of mirrored pageblocks to MIGRATE_MIRROR, so that their
pages are freed to the buddy system's MIGRATE_MIRROR list when bootmem is freed.

Signed-off-by: Xishi Qiu <[email protected]>
---
mm/page_alloc.c | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3b2ff46..8fe0187 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -572,6 +572,25 @@ static void __init print_mirror_info(void)
mirror_info.info[i].start +
mirror_info.info[i].size - 1);
}
+
+static inline bool is_mirror_pfn(unsigned long pfn)
+{
+ int i;
+ unsigned long addr = pfn << PAGE_SHIFT;
+
+ /* 0-4G is always mirrored, so ignore it */
+ if (addr < (4UL << 30))
+ return false;
+
+ for (i = 0; i < mirror_info.count; i++) {
+ if (addr >= mirror_info.info[i].start &&
+ addr < mirror_info.info[i].start +
+ mirror_info.info[i].size)
+ return true;
+ }
+
+ return false;
+}
#endif

/*
@@ -4147,6 +4166,9 @@ static void setup_zone_migrate_reserve(struct zone *zone)

block_migratetype = get_pageblock_migratetype(page);

+ if (is_migrate_mirror(block_migratetype))
+ continue;
+
/* Only test what is necessary when the reserves are not met */
if (reserve > 0) {
/*
@@ -4246,6 +4268,11 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
&& !(pfn & (pageblock_nr_pages - 1)))
set_pageblock_migratetype(page, MIGRATE_MOVABLE);

+#ifdef CONFIG_MEMORY_MIRROR
+ if (is_mirror_pfn(pfn))
+ set_pageblock_migratetype(page, MIGRATE_MIRROR);
+#endif
+
INIT_LIST_HEAD(&page->lru);
#ifdef WANT_PAGE_VIRTUAL
/* The shift won't overflow because ZONE_NORMAL is below 4G. */
--
2.0.0
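
The range test done by is_mirror_pfn() in 04/12 can be tried out in user space.
This is only a sketch: the range table is example data rather than anything
reported by firmware, PAGE_SHIFT is assumed to be 12, and 64-bit types are used
so that the shift cannot truncate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumption: 4 KiB pages */

struct range {
	uint64_t start;	/* physical start address */
	uint64_t size;	/* length in bytes */
};

/* Example data only: 1 GiB at 1 GiB, and 4 GiB at 4 GiB. */
static const struct range mirrored[] = {
	{ 1ULL << 30, 1ULL << 30 },
	{ 4ULL << 30, 4ULL << 30 },
};

static bool is_mirror_pfn(uint64_t pfn)
{
	uint64_t addr;
	size_t i;

	assert(pfn <= (UINT64_MAX >> PAGE_SHIFT));	/* shift must not overflow */
	addr = pfn << PAGE_SHIFT;

	/* Same policy as the patch: memory below 4 GiB is never tagged. */
	if (addr < (4ULL << 30))
		return false;

	for (i = 0; i < sizeof(mirrored) / sizeof(mirrored[0]); i++) {
		if (addr >= mirrored[i].start &&
		    addr - mirrored[i].start < mirrored[i].size)
			return true;
	}
	return false;
}
```

The end of each range is exclusive, so the first pfn at 8 GiB is not treated as
mirrored here, and the 1 GiB range is ignored because it lies below 4 GiB.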

2015-06-04 13:04:56

by Xishi Qiu

Subject: [RFC PATCH 05/12] mm: introduce a new zone_stat_item NR_FREE_MIRROR_PAGES

This patch introduces a new zone_stat_item called "NR_FREE_MIRROR_PAGES"; it is
used to store the count of free mirrored pages.

Signed-off-by: Xishi Qiu <[email protected]>
---
include/linux/mmzone.h | 1 +
include/linux/vmstat.h | 2 ++
mm/vmstat.c | 1 +
3 files changed, 4 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b444335..f82e3ae 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -178,6 +178,7 @@ enum zone_stat_item {
WORKINGSET_NODERECLAIM,
NR_ANON_TRANSPARENT_HUGEPAGES,
NR_FREE_CMA_PAGES,
+ NR_FREE_MIRROR_PAGES,
NR_VM_ZONE_STAT_ITEMS };

/*
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 82e7db7..d0a7268 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -283,6 +283,8 @@ static inline void __mod_zone_freepage_state(struct zone *zone, int nr_pages,
__mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);
if (is_migrate_cma(migratetype))
__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
+ if (is_migrate_mirror(migratetype))
+ __mod_zone_page_state(zone, NR_FREE_MIRROR_PAGES, nr_pages);
}

extern const char * const vmstat_text[];
diff --git a/mm/vmstat.c b/mm/vmstat.c
index d0323e0..7ee11ca 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -739,6 +739,7 @@ const char * const vmstat_text[] = {
"workingset_nodereclaim",
"nr_anon_transparent_hugepages",
"nr_free_cma",
+ "nr_free_mirror",

/* enum writeback_stat_item counters */
"nr_dirty_threshold",
--
2.0.0
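
The accounting added in 05/12 can be modelled as follows. This is a simplified
user-space sketch, not the kernel code: the zone is reduced to a plain counter
array and only two migrate types are modelled.

```c
#include <assert.h>

enum migratetype { MIGRATE_MOVABLE, MIGRATE_MIRROR };
enum zone_stat_item { NR_FREE_PAGES, NR_FREE_MIRROR_PAGES, NR_VM_ZONE_STAT_ITEMS };

/* Toy zone: just the per-zone counters. */
struct zone {
	long vm_stat[NR_VM_ZONE_STAT_ITEMS];
};

/* nr_pages is positive when pages are freed, negative when allocated. */
static void mod_zone_freepage_state(struct zone *zone, long nr_pages,
				    enum migratetype mt)
{
	zone->vm_stat[NR_FREE_PAGES] += nr_pages;
	if (mt == MIGRATE_MIRROR)
		zone->vm_stat[NR_FREE_MIRROR_PAGES] += nr_pages;

	/* Mirrored free pages are a subset of all free pages. */
	assert(zone->vm_stat[NR_FREE_MIRROR_PAGES] <=
	       zone->vm_stat[NR_FREE_PAGES]);
}
```

The point of the sketch is that NR_FREE_MIRROR_PAGES is counted in addition to
NR_FREE_PAGES, in the same way as NR_FREE_CMA_PAGES, not instead of it.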

2015-06-04 13:10:23

by Xishi Qiu

Subject: [RFC PATCH 06/12] mm: add free mirrored pages info

Add the count of free mirrored pages to the following files:
/proc/meminfo
/proc/zoneinfo
/sys/devices/system/node/nodeXX/meminfo
/sys/devices/system/node/nodeXX/vmstat

Signed-off-by: Xishi Qiu <[email protected]>
---
drivers/base/node.c | 17 +++++++++++------
fs/proc/meminfo.c | 6 ++++++
mm/page_alloc.c | 7 +++++--
3 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index a2aa65b..d1a3556 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -114,6 +114,9 @@ static ssize_t node_read_meminfo(struct device *dev,
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
"Node %d AnonHugePages: %8lu kB\n"
#endif
+#ifdef CONFIG_MEMORY_MIRROR
+ "Node %d MirrorFree: %8lu kB\n"
+#endif
,
nid, K(node_page_state(nid, NR_FILE_DIRTY)),
nid, K(node_page_state(nid, NR_WRITEBACK)),
@@ -130,14 +133,16 @@ static ssize_t node_read_meminfo(struct device *dev,
nid, K(node_page_state(nid, NR_SLAB_RECLAIMABLE) +
node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
nid, K(node_page_state(nid, NR_SLAB_RECLAIMABLE)),
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE))
- , nid,
- K(node_page_state(nid, NR_ANON_TRANSPARENT_HUGEPAGES) *
- HPAGE_PMD_NR));
-#else
- nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE)));
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ , nid, K(node_page_state(nid, NR_ANON_TRANSPARENT_HUGEPAGES) *
+ HPAGE_PMD_NR)
+#endif
+#ifdef CONFIG_MEMORY_MIRROR
+ , nid, K(node_page_state(nid, NR_FREE_MIRROR_PAGES))
#endif
+ );
+
n += hugetlb_report_node_meminfo(nid, buf + n);
return n;
}
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index d3ebf2e..d1ebb20 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -145,6 +145,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
"CmaTotal: %8lu kB\n"
"CmaFree: %8lu kB\n"
#endif
+#ifdef CONFIG_MEMORY_MIRROR
+ "MirrorFree: %8lu kB\n"
+#endif
,
K(i.totalram),
K(i.freeram),
@@ -204,6 +207,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
, K(totalcma_pages)
, K(global_page_state(NR_FREE_CMA_PAGES))
#endif
+#ifdef CONFIG_MEMORY_MIRROR
+ , K(global_page_state(NR_FREE_MIRROR_PAGES))
+#endif
);

hugetlb_report_meminfo(m);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8fe0187..249a8f6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3316,7 +3316,7 @@ void show_free_areas(unsigned int filter)
" unevictable:%lu dirty:%lu writeback:%lu unstable:%lu\n"
" slab_reclaimable:%lu slab_unreclaimable:%lu\n"
" mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n"
- " free:%lu free_pcp:%lu free_cma:%lu\n",
+ " free:%lu free_pcp:%lu free_cma:%lu free_mirror:%lu\n",
global_page_state(NR_ACTIVE_ANON),
global_page_state(NR_INACTIVE_ANON),
global_page_state(NR_ISOLATED_ANON),
@@ -3335,7 +3335,8 @@ void show_free_areas(unsigned int filter)
global_page_state(NR_BOUNCE),
global_page_state(NR_FREE_PAGES),
free_pcp,
- global_page_state(NR_FREE_CMA_PAGES));
+ global_page_state(NR_FREE_CMA_PAGES),
+ global_page_state(NR_FREE_MIRROR_PAGES));

for_each_populated_zone(zone) {
int i;
@@ -3376,6 +3377,7 @@ void show_free_areas(unsigned int filter)
" free_pcp:%lukB"
" local_pcp:%ukB"
" free_cma:%lukB"
+ " free_mirror:%lukB"
" writeback_tmp:%lukB"
" pages_scanned:%lu"
" all_unreclaimable? %s"
@@ -3409,6 +3411,7 @@ void show_free_areas(unsigned int filter)
K(free_pcp),
K(this_cpu_read(zone->pageset->pcp.count)),
K(zone_page_state(zone, NR_FREE_CMA_PAGES)),
+ K(zone_page_state(zone, NR_FREE_MIRROR_PAGES)),
K(zone_page_state(zone, NR_WRITEBACK_TEMP)),
K(zone_page_state(zone, NR_PAGES_SCANNED)),
(!zone_reclaimable(zone) ? "yes" : "no")
--
2.0.0

2015-06-04 13:09:24

by Xishi Qiu

Subject: [RFC PATCH 07/12] mm: introduce __GFP_MIRROR to allocate mirrored pages

This patch introduces a new gfp flag called "__GFP_MIRROR"; it is used to
allocate mirrored pages through the buddy system.

Signed-off-by: Xishi Qiu <[email protected]>
---
include/linux/gfp.h | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 15928f0..89d0091 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -35,6 +35,7 @@ struct vm_area_struct;
#define ___GFP_NO_KSWAPD 0x400000u
#define ___GFP_OTHER_NODE 0x800000u
#define ___GFP_WRITE 0x1000000u
+#define ___GFP_MIRROR 0x2000000u
/* If the above are modified, __GFP_BITS_SHIFT may need updating */

/*
@@ -95,13 +96,15 @@ struct vm_area_struct;
#define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
#define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */

+#define __GFP_MIRROR ((__force gfp_t)___GFP_MIRROR) /* Allocate mirrored memory */
+
/*
* This may seem redundant, but it's a way of annotating false positives vs.
* allocations that simply cannot be supported (e.g. page tables).
*/
#define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)

-#define __GFP_BITS_SHIFT 25 /* Room for N __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 26 /* Room for N __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))

/* This equals 0, but use constants in case they ever change */
--
2.0.0
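
Why 07/12 also bumps __GFP_BITS_SHIFT from 25 to 26 can be checked with a few
lines of C. The flag values are copied from the diff; gfp_bits_mask() and
mask_covers() are helper names invented for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int gfp_t;

#define ___GFP_WRITE	0x1000000u	/* bit 24 */
#define ___GFP_MIRROR	0x2000000u	/* bit 25, added by the patch */

/* Mask covering the low `shift` flag bits, like __GFP_BITS_MASK. */
static gfp_t gfp_bits_mask(unsigned int shift)
{
	assert(shift < 32);	/* 1u << 32 would be undefined */
	return (gfp_t)((1u << shift) - 1);
}

/* True if every bit of `flag` survives masking with `shift` bits. */
static bool mask_covers(gfp_t flag, unsigned int shift)
{
	return (flag & gfp_bits_mask(shift)) == flag;
}
```

With a shift of 25 the mask stops at bit 24, so ___GFP_MIRROR would be masked
off; a shift of 26 keeps it.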

2015-06-04 13:10:13

by Xishi Qiu

Subject: [RFC PATCH 08/12] mm: use mirrorable to switch allocate mirrored memory

Add a new interface in path /proc/sys/vm/mirrorable. When set to 1, it means
we should allocate mirrored memory for both user and kernel processes.

Signed-off-by: Xishi Qiu <[email protected]>
---
include/linux/mmzone.h | 1 +
kernel/sysctl.c | 9 +++++++++
mm/page_alloc.c | 1 +
3 files changed, 11 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f82e3ae..20888dd 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -85,6 +85,7 @@ struct mirror_info {
};

extern struct mirror_info mirror_info;
+extern int sysctl_mirrorable;
# define is_migrate_mirror(migratetype) unlikely((migratetype) == MIGRATE_MIRROR)
#else
# define is_migrate_mirror(migratetype) false
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 2082b1a..dc2625e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1514,6 +1514,15 @@ static struct ctl_table vm_table[] = {
.extra2 = &one,
},
#endif
+#ifdef CONFIG_MEMORY_MIRROR
+ {
+ .procname = "mirrorable",
+ .data = &sysctl_mirrorable,
+ .maxlen = sizeof(sysctl_mirrorable),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ },
+#endif
{
.procname = "user_reserve_kbytes",
.data = &sysctl_user_reserve_kbytes,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 249a8f6..63b90ca 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -212,6 +212,7 @@ int user_min_free_kbytes = -1;

#ifdef CONFIG_MEMORY_MIRROR
struct mirror_info mirror_info;
+int sysctl_mirrorable = 0;
#endif

static unsigned long __meminitdata nr_kernel_pages;
--
2.0.0

2015-06-04 13:09:13

by Xishi Qiu

Subject: [RFC PATCH 09/12] mm: enable allocate mirrored memory at boot time

Add a boot option called "mirrorable" to enable allocating mirrored memory at
boot time (after bootmem is freed).

Signed-off-by: Xishi Qiu <[email protected]>
---
mm/page_alloc.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 63b90ca..d4d2066 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -213,6 +213,13 @@ int user_min_free_kbytes = -1;
#ifdef CONFIG_MEMORY_MIRROR
struct mirror_info mirror_info;
int sysctl_mirrorable = 0;
+
+static int __init set_mirrorable(char *p)
+{
+ sysctl_mirrorable = 1;
+ return 0;
+}
+early_param("mirrorable", set_mirrorable);
#endif

static unsigned long __meminitdata nr_kernel_pages;
--
2.0.0

2015-06-04 13:09:46

by Xishi Qiu

Subject: [RFC PATCH 10/12] mm: add the buddy system interface

Add the buddy system interface for the address range mirroring feature.
Allocate mirrored pages from the MIGRATE_MIRROR list. If there are no mirrored
pages left, fall back to pages of other types.

Signed-off-by: Xishi Qiu <[email protected]>
---
mm/page_alloc.c | 40 +++++++++++++++++++++++++++++++++++++++-
1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d4d2066..0fb55288 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -599,6 +599,26 @@ static inline bool is_mirror_pfn(unsigned long pfn)

return false;
}
+
+static inline bool change_to_mirror(gfp_t gfp_flags, int high_zoneidx)
+{
+ /*
+ * Do not alloc mirrored memory below 4G, because 0-4G is
+ * all mirrored by default, and the list is always empty.
+ */
+ if (high_zoneidx < ZONE_NORMAL)
+ return false;
+
+ /* Alloc mirrored memory for only kernel */
+ if (gfp_flags & __GFP_MIRROR)
+ return true;
+
+ /* Alloc mirrored memory for both user and kernel */
+ if (sysctl_mirrorable)
+ return true;
+
+ return false;
+}
#endif

/*
@@ -1796,7 +1816,10 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
WARN_ON_ONCE(order > 1);
}
spin_lock_irqsave(&zone->lock, flags);
- page = __rmqueue(zone, order, migratetype);
+ if (is_migrate_mirror(migratetype))
+ page = __rmqueue_smallest(zone, order, migratetype);
+ else
+ page = __rmqueue(zone, order, migratetype);
spin_unlock(&zone->lock);
if (!page)
goto failed;
@@ -2928,6 +2951,11 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
if (IS_ENABLED(CONFIG_CMA) && ac.migratetype == MIGRATE_MOVABLE)
alloc_flags |= ALLOC_CMA;

+#ifdef CONFIG_MEMORY_MIRROR
+ if (change_to_mirror(gfp_mask, ac.high_zoneidx))
+ ac.migratetype = MIGRATE_MIRROR;
+#endif
+
retry_cpuset:
cpuset_mems_cookie = read_mems_allowed_begin();

@@ -2943,9 +2971,19 @@ retry_cpuset:

/* First allocation attempt */
alloc_mask = gfp_mask|__GFP_HARDWALL;
+retry:
page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
if (unlikely(!page)) {
/*
+ * If there is no mirrored memory, we will alloc other
+ * types memory.
+ */
+ if (is_migrate_mirror(ac.migratetype)) {
+ ac.migratetype = gfpflags_to_migratetype(gfp_mask);
+ goto retry;
+ }
+
+ /*
* Runtime PM, block IO and its error handling path
* can deadlock because I/O on the device might not
* complete.
--
2.0.0
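
The decision and fallback flow of 10/12 can be sketched in user space as below.
This is a toy model under stated assumptions: the free lists are reduced to one
page counter per migrate type, alloc_page_type() is a made-up name, and the
default type is passed in by the caller where the patch calls
gfpflags_to_migratetype().

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int gfp_t;
#define __GFP_MIRROR 0x2000000u

enum zone_type { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL };
enum { MIGRATE_UNMOVABLE, MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE,
       MIGRATE_MIRROR, MIGRATE_TYPES };

static int sysctl_mirrorable;
static unsigned long nr_free[MIGRATE_TYPES];	/* toy free lists */

/* Same three checks as change_to_mirror() in the patch. */
static bool change_to_mirror(gfp_t gfp_flags, int high_zoneidx)
{
	if (high_zoneidx < ZONE_NORMAL)
		return false;
	if (gfp_flags & __GFP_MIRROR)
		return true;
	return sysctl_mirrorable != 0;
}

/* Returns the migrate type the page came from, or -1 if nothing is free. */
static int alloc_page_type(gfp_t gfp, int high_zoneidx, int default_mt)
{
	int mt = default_mt;

	assert(default_mt >= 0 && default_mt < MIGRATE_TYPES &&
	       default_mt != MIGRATE_MIRROR);

	if (change_to_mirror(gfp, high_zoneidx))
		mt = MIGRATE_MIRROR;

	/* Models the "goto retry": no mirrored page left, use the default. */
	if (mt == MIGRATE_MIRROR && nr_free[mt] == 0)
		mt = default_mt;

	if (nr_free[mt] == 0)
		return -1;
	nr_free[mt]--;
	return mt;
}
```

In this model a request that cannot be served from the mirror list is served
from the caller's default type, so __GFP_MIRROR behaves as a preference rather
than a guarantee.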

2015-06-04 13:05:26

by Xishi Qiu

Subject: [RFC PATCH 11/12] mm: add the PCP interface

Add the PCP interface for the address range mirroring feature.

Signed-off-by: Xishi Qiu <[email protected]>
---
mm/page_alloc.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0fb55288..cf3b7cb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1401,11 +1401,16 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
unsigned long count, struct list_head *list,
int migratetype, bool cold)
{
- int i;
+ int i, mt;

spin_lock(&zone->lock);
for (i = 0; i < count; ++i) {
- struct page *page = __rmqueue(zone, order, migratetype);
+ struct page *page;
+
+ if (is_migrate_mirror(migratetype))
+ page = __rmqueue_smallest(zone, order, migratetype);
+ else
+ page = __rmqueue(zone, order, migratetype);
if (unlikely(page == NULL))
break;

@@ -1423,9 +1428,14 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
else
list_add_tail(&page->lru, list);
list = &page->lru;
- if (is_migrate_cma(get_freepage_migratetype(page)))
+
+ mt = get_freepage_migratetype(page);
+ if (is_migrate_cma(mt))
__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
-(1 << order));
+ if (is_migrate_mirror(mt))
+ __mod_zone_page_state(zone, NR_FREE_MIRROR_PAGES,
+ -(1 << order));
}
__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
spin_unlock(&zone->lock);
--
2.0.0
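
The changelog of 11/12 does not say why mirrored requests go through
__rmqueue_smallest(). As far as I understand the allocator of this era,
__rmqueue() may fall back to the free lists of other migrate types when the
requested list is empty, while __rmqueue_smallest() only searches the requested
type. The toy model below illustrates that "same type only" search; the counter
array and the return convention are assumptions made for this sketch.

```c
#include <assert.h>

#define MAX_ORDER 11
enum { MIGRATE_MOVABLE, MIGRATE_MIRROR, MIGRATE_TYPES };

/* Toy free lists: number of free blocks per order and migrate type. */
static unsigned int nr_free[MAX_ORDER][MIGRATE_TYPES];

/*
 * Take one block of `order` from the lists of type `mt` only.
 * Returns 0 on success, -1 if that type has no block large enough.
 */
static int rmqueue_smallest(unsigned int order, int mt)
{
	unsigned int cur;

	assert(order < MAX_ORDER && mt >= 0 && mt < MIGRATE_TYPES);

	for (cur = order; cur < MAX_ORDER; cur++) {
		if (nr_free[cur][mt] == 0)
			continue;
		nr_free[cur][mt]--;
		/* Split: each halving leaves one buddy on the lower list. */
		while (cur > order) {
			cur--;
			nr_free[cur][mt]++;
		}
		return 0;
	}
	return -1;	/* never steals from another migrate type */
}
```

In this model a mirrored request either gets a mirrored block or fails, which
leaves the fallback decision to the caller.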

2015-06-04 13:09:06

by Xishi Qiu

Subject: [RFC PATCH 12/12] mm: let slab/slub/slob use mirrored memory

Add the __GFP_MIRROR flag when allocating a new slab.

Signed-off-by: Xishi Qiu <[email protected]>
---
mm/slab.c | 3 ++-
mm/slob.c | 2 +-
mm/slub.c | 2 +-
3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 7eb38dd..3b3ef22 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1594,7 +1594,8 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
if (memcg_charge_slab(cachep, flags, cachep->gfporder))
return NULL;

- page = alloc_pages_exact_node(nodeid, flags | __GFP_NOTRACK, cachep->gfporder);
+ page = alloc_pages_exact_node(nodeid, flags | __GFP_NOTRACK | __GFP_MIRROR,
+ cachep->gfporder);
if (!page) {
memcg_uncharge_slab(cachep, cachep->gfporder);
slab_out_of_memory(cachep, flags, nodeid);
diff --git a/mm/slob.c b/mm/slob.c
index 4765f65..4ff9bde 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -452,7 +452,7 @@ __do_kmalloc_node(size_t size, gfp_t gfp, int node, unsigned long caller)

if (likely(order))
gfp |= __GFP_COMP;
- ret = slob_new_pages(gfp, order, node);
+ ret = slob_new_pages(gfp | __GFP_MIRROR, order, node);

trace_kmalloc_node(caller, ret,
size, PAGE_SIZE << order, gfp, node);
diff --git a/mm/slub.c b/mm/slub.c
index 54c0876..1219e33 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1315,7 +1315,7 @@ static inline struct page *alloc_slab_page(struct kmem_cache *s,
struct page *page;
int order = oo_order(oo);

- flags |= __GFP_NOTRACK;
+ flags |= __GFP_NOTRACK | __GFP_MIRROR;

if (memcg_charge_slab(s, flags, order))
return NULL;
--
2.0.0

2015-06-04 16:57:14

by Luck, Tony

Subject: RE: [RFC PATCH 02/12] mm: introduce mirror_info

+#ifdef CONFIG_MEMORY_MIRROR
+struct numa_mirror_info {
+ int node;
+ unsigned long start;
+ unsigned long size;
+};
+
+struct mirror_info {
+ int count;
+ struct numa_mirror_info info[MAX_NUMNODES];
+};

Do we really need this? My patch series leaves all the mirrored memory in
the memblock allocator tagged with the MEMBLOCK_MIRROR flag. Can't
we use that information when freeing the boot memory into the runtime
free lists?

If we can't ... then [MAX_NUMNODES] may not be enough. We may have
more than one mirrored range on each node. Current h/w allows two ranges
per node.

-Tony

2015-06-04 17:01:54

by Luck, Tony

Subject: RE: [RFC PATCH 08/12] mm: use mirrorable to switch allocate mirrored memory

> Add a new interface in path /proc/sys/vm/mirrorable. When set to 1, it means
> we should allocate mirrored memory for both user and kernel processes.

With some "to be defined later" mechanism for how the user requests mirror vs.
not mirror. Plus some capability/ulimit pieces that restrict who can do this and how
much they can get???

-Tony

2015-06-04 17:09:06

by Luck, Tony

Subject: RE: [RFC PATCH 10/12] mm: add the buddy system interface

+#ifdef CONFIG_MEMORY_MIRROR
+ if (change_to_mirror(gfp_mask, ac.high_zoneidx))
+ ac.migratetype = MIGRATE_MIRROR;
+#endif

We may have to be smarter than this here. I'd like to encourage the
enterprise Linux distributions to set CONFIG_MEMORY_MIRROR=y
But the reality is that most systems will not configure any mirrored
memory - so we don't want the common code path for memory
allocation to call functions that set the migrate type, try to allocate
and then fall back to a non-mirror when that may be a complete waste
of time.

Maybe a global "got_mirror" that is true if we have some mirrored
memory. Then code is

if (got_mirror && change_to_mirror(...))
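
A minimal user-space sketch of the guard suggested above. got_mirror is the
name proposed in this mail; the counter nr_policy_checks and the reduced
change_to_mirror() exist only to make the short-circuit visible and are not
part of any posted patch.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int gfp_t;
#define __GFP_MIRROR 0x2000000u

/* Hypothetical: set once at boot if any mirrored range exists. */
static bool got_mirror;

/* Counts how often the policy check actually runs. */
static unsigned long nr_policy_checks;

static bool change_to_mirror(gfp_t gfp_flags)
{
	nr_policy_checks++;
	return (gfp_flags & __GFP_MIRROR) != 0;
}

static bool want_mirror(gfp_t gfp_flags)
{
	/* && short-circuits: without mirrored memory the check is skipped. */
	return got_mirror && change_to_mirror(gfp_flags);
}
```

On a system without mirrored memory the hot path then costs one load and one
branch, and the mirror/non-mirror retry is never entered.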

2015-06-04 17:15:11

by Luck, Tony

Subject: RE: [RFC PATCH 12/12] mm: let slab/slub/slob use mirrored memory

- page = alloc_pages_exact_node(nodeid, flags | __GFP_NOTRACK, cachep->gfporder);
+ page = alloc_pages_exact_node(nodeid, flags | __GFP_NOTRACK | __GFP_MIRROR,
+ cachep->gfporder);

Set some global "got_mirror"[*] to __GFP_MIRROR if we have any mirrored memory, else to 0.

then

page = alloc_pages_exact_node(nodeid, flags | __GFP_NOTRACK | got_mirror,
cachep->gfporder);

-Tony

[*] Someone will suggest a better name. I'm bad at picking names.
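
The mask variant described in this mail could look roughly like the sketch
below. mirror_init() and slab_page_flags() are hypothetical names, and the
__GFP_NOTRACK value is only illustrative.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int gfp_t;
#define __GFP_NOTRACK	0x200000u	/* illustrative value */
#define __GFP_MIRROR	0x2000000u

/* Either 0 or __GFP_MIRROR, decided once at boot. */
static gfp_t got_mirror;

static void mirror_init(bool have_mirrored_memory)
{
	got_mirror = have_mirrored_memory ? __GFP_MIRROR : 0;
}

/* Flags a slab allocator would pass down to the page allocator. */
static gfp_t slab_page_flags(gfp_t flags)
{
	assert(got_mirror == 0 || got_mirror == __GFP_MIRROR);
	return flags | __GFP_NOTRACK | got_mirror;
}
```

Because got_mirror is a plain mask, the slab paths need no extra branch: OR-ing
in 0 leaves the flags unchanged on systems without mirrored memory.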

2015-06-04 18:41:51

by Dave Hansen

Subject: Re: [RFC PATCH 08/12] mm: use mirrorable to switch allocate mirrored memory

On 06/04/2015 06:02 AM, Xishi Qiu wrote:
> Add a new interface in path /proc/sys/vm/mirrorable. When set to 1, it means
> we should allocate mirrored memory for both user and kernel processes.

That's a pretty dangerously short name. :)

How would this end up getting used? It seems like it would be dangerous
to use once userspace was very far along. So would the kernel set it to
1 and then let (early??) userspace set it back to 0? That would let
important userspace like /bin/init get mirrored memory without having to
actually change much in userspace.

This definitely needs some good documentation.

Also, if it's insane to turn it back *on*, maybe it should be a one-way
trip to turn off.
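
A minimal sketch of the "one-way trip" idea, written as a plain user-space
setter. It assumes the kernel starts with the switch on, which is only one
possible reading of the suggestion above; set_mirrorable() here is a
hypothetical helper, not the boot-parameter handler from 09/12.

```c
#include <assert.h>

/* Assumption: the switch starts enabled and may only be turned off. */
static int sysctl_mirrorable = 1;

/* Returns 0 on success, -1 if the requested value is not allowed. */
static int set_mirrorable(int val)
{
	if (val != 0 && val != 1)
		return -1;		/* only 0 and 1 are meaningful */
	if (val == 1 && sysctl_mirrorable == 0)
		return -1;		/* once off, it stays off */
	sysctl_mirrorable = val;
	return 0;
}
```

Writing 1 while the switch is still on is accepted as a no-op; only the
transition from 0 back to 1 is refused.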

2015-06-04 18:44:44

by Dave Hansen

Subject: Re: [RFC PATCH 11/12] mm: add the PCP interface

On 06/04/2015 06:04 AM, Xishi Qiu wrote:
> spin_lock(&zone->lock);
> for (i = 0; i < count; ++i) {
> - struct page *page = __rmqueue(zone, order, migratetype);
> + struct page *page;
> +
> + if (is_migrate_mirror(migratetype))
> + page = __rmqueue_smallest(zone, order, migratetype);
> + else
> + page = __rmqueue(zone, order, migratetype);
> if (unlikely(page == NULL))
> break;

Why is this necessary/helpful? The changelog doesn't tell me either. :(

Why was this code modified instead of putting the changes in
__rmqueue() itself (like CMA did)?

2015-06-05 01:56:58

by Xishi Qiu

Subject: Re: [RFC PATCH 02/12] mm: introduce mirror_info

On 2015/6/5 0:57, Luck, Tony wrote:

> +#ifdef CONFIG_MEMORY_MIRROR
> +struct numa_mirror_info {
> + int node;
> + unsigned long start;
> + unsigned long size;
> +};
> +
> +struct mirror_info {
> + int count;
> + struct numa_mirror_info info[MAX_NUMNODES];
> +};
>
> Do we really need this? My patch series leaves all the mirrored memory in
> the memblock allocator tagged with the MEMBLOCK_MIRROR flag. Can't
> we use that information when freeing the boot memory into the runtime
> free lists?
>

Hi Tony,

I used this code for testing earlier. Once your patchset is merged into
mainline, I'll rewrite it to use MEMBLOCK_MIRROR instead of mirror_info.

I see Andrew has added your patches to the mm tree, right?

Thanks,
Xishi Qiu

> If we can't ... then [MAX_NUMNODES] may not be enough. We may have
> more than one mirrored range on each node. Current h/w allows two ranges
> per node.
>
> -Tony
>
> .
>


2015-06-05 03:17:16

by Xishi Qiu

Subject: Re: [RFC PATCH 08/12] mm: use mirrorable to switch allocate mirrored memory

On 2015/6/5 2:41, Dave Hansen wrote:

> On 06/04/2015 06:02 AM, Xishi Qiu wrote:
>> Add a new interface in path /proc/sys/vm/mirrorable. When set to 1, it means
>> we should allocate mirrored memory for both user and kernel processes.
>
> That's a pretty dangerously short name. :)
>

Hi Dave,

Thanks for your comment. I'm not sure whether we should add this interface
for user processes. However, some important userspace (e.g. /bin/init, key
business applications like databases) may want mirrored memory to improve
reliability.

If we want this interface, I think the code needs more changes.

Thanks,
Xishi Qiu

> How would this end up getting used? It seems like it would be dangerous
> to use once userspace was very far along. So would the kernel set it to
> 1 and then let (early??) userspace set it back to 0? That would let
> important userspace like /bin/init get mirrored memory without having to
> actually change much in userspace.
>
> This definitely needs some good documentation.
>
> Also, if it's insane to turn it back *on*, maybe it should be a one-way
> trip to turn off.
>
> .
>


2015-06-05 03:17:30

by Xishi Qiu

Subject: Re: [RFC PATCH 10/12] mm: add the buddy system interface

On 2015/6/5 1:09, Luck, Tony wrote:

> +#ifdef CONFIG_MEMORY_MIRROR
> + if (change_to_mirror(gfp_mask, ac.high_zoneidx))
> + ac.migratetype = MIGRATE_MIRROR;
> +#endif
>
> We may have to be smarter than this here. I'd like to encourage the
> enterprise Linux distributions to set CONFIG_MEMORY_MIRROR=y
> But the reality is that most systems will not configure any mirrored
> memory - so we don't want the common code path for memory
> allocation to call functions that set the migrate type, try to allocate
> and then fall back to a non-mirror when that may be a complete waste
> of time.
>
> Maybe a global "got_mirror" that is true if we have some mirrored
> memory. Then code is
>
> if (got_mirror && change_to_mirror(...))
>

Yes, I will change this in the next version.
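
A small userspace sketch of the suggested guard, relying on C's short-circuit evaluation so the policy function is not even called on systems without mirrored memory. The got_mirror name comes from the suggestion above; demo_change_to_mirror(), DEMO_GFP_MIRROR and the call counter are hypothetical and only there to make the effect visible.

```c
#include <stdbool.h>

#define DEMO_GFP_MIRROR	0x2000000u	/* stand-in for __GFP_MIRROR */

static bool got_mirror;			/* would be set at boot if any mirrored range exists */
static unsigned int policy_calls;	/* counts how often the policy check actually ran */

/* Stand-in for change_to_mirror(): decides from the gfp flags alone. */
static bool demo_change_to_mirror(unsigned int gfp_flags)
{
	policy_calls++;
	return (gfp_flags & DEMO_GFP_MIRROR) != 0;
}

/* The && short-circuits: with got_mirror false, the policy check is skipped. */
static bool demo_use_mirror(unsigned int gfp_flags)
{
	return got_mirror && demo_change_to_mirror(gfp_flags);
}
```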

Thanks,

> .
>


2015-06-08 11:52:32

by Leon Romanovsky

Subject: Re: [RFC PATCH 01/12] mm: add a new config to manage the code

On Thu, Jun 4, 2015 at 3:56 PM, Xishi Qiu <[email protected]> wrote:
> This patch introduces a new config called "CONFIG_ACPI_MIRROR_MEMORY", it is
> used to on/off the feature.
>
> Signed-off-by: Xishi Qiu <[email protected]>
> ---
> mm/Kconfig | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 390214d..4f2a726 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -200,6 +200,14 @@ config MEMORY_HOTREMOVE
> depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
> depends on MIGRATION
>
> +config MEMORY_MIRROR
> + bool "Address range mirroring support"
> + depends on X86 && NUMA
> + default y
Is it correct for the systems (NOT xeon) without memory support built in?

> + help
> + This feature depends on hardware and firmware support.
> + ACPI or EFI records the mirror info.
> +
> #
> # If we have space for more page flags then we can enable additional
> # optimizations and functionality.
> --
> 2.0.0
>
>



--
Leon Romanovsky | Independent Linux Consultant
http://www.leon.nu | [email protected]

2015-06-08 15:17:06

by Luck, Tony

Subject: RE: [RFC PATCH 01/12] mm: add a new config to manage the code

> > +config MEMORY_MIRROR
> > + bool "Address range mirroring support"
> > + depends on X86 && NUMA
> > + default y
> Is it correct for the systems (NOT xeon) without memory support built in?

Is the "&& NUMA" doing that? If you support NUMA, then you are not a minimal
config for a tablet or laptop.

If you want a symbol that has a stronger correlation to high end Xeon features
then perhaps MEMORY_FAILURE?

-Tony

2015-06-08 16:36:31

by Leon Romanovsky

Subject: Re: [RFC PATCH 01/12] mm: add a new config to manage the code

On Mon, Jun 8, 2015 at 6:14 PM, Luck, Tony <[email protected]> wrote:
>> > +config MEMORY_MIRROR
>> > + bool "Address range mirroring support"
>> > + depends on X86 && NUMA
>> > + default y
>> Is it correct for the systems (NOT xeon) without memory support built in?
>
> Is the "&& NUMA" doing that? If you support NUMA, then you are not a minimal
> config for a tablet or laptop.
>
> If you want a symbol that has a stronger correlation to high end Xeon features
> then perhaps MEMORY_FAILURE?
I would like to see the default set to be "n".
On my machine (x86_64) defconfig enables this feature and I don't know
if this feature can work there.

➜ linux-mm git:(dev) ✗ make defconfig ARCH=x86
HOSTCC scripts/basic/fixdep
HOSTCC scripts/basic/bin2c
HOSTCC scripts/kconfig/conf.o
HOSTCC scripts/kconfig/zconf.tab.o
HOSTLD scripts/kconfig/conf
*** Default configuration is based on 'x86_64_defconfig'
#
# configuration written to .config
#
➜ linux-mm git:(dev) ✗ grep CONFIG_MEMORY_MIRROR .config
CONFIG_MEMORY_MIRROR=y


>
> -Tony



--
Leon Romanovsky | Independent Linux Consultant
http://www.leon.nu | [email protected]

2015-06-09 06:55:47

by Kamezawa Hiroyuki

Subject: Re: [RFC PATCH 01/12] mm: add a new config to manage the code

On 2015/06/04 21:56, Xishi Qiu wrote:
> This patch introduces a new config called "CONFIG_ACPI_MIRROR_MEMORY", it is
> used to on/off the feature.
>
> Signed-off-by: Xishi Qiu <[email protected]>
> ---
> mm/Kconfig | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 390214d..4f2a726 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -200,6 +200,14 @@ config MEMORY_HOTREMOVE
> depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
> depends on MIGRATION
>
> +config MEMORY_MIRROR
> + bool "Address range mirroring support"
> + depends on X86 && NUMA
> + default y
> + help
> + This feature depends on hardware and firmware support.
> + ACPI or EFI records the mirror info.

default y... is there no runtime impact when the user doesn't use memory mirroring?

Thanks,
-Kame


2015-06-09 07:00:17

by Kamezawa Hiroyuki

Subject: Re: [RFC PATCH 02/12] mm: introduce mirror_info

On 2015/06/04 21:57, Xishi Qiu wrote:
> This patch introduces a new struct called "mirror_info", it is used to storage
> the mirror address range which reported by EFI or ACPI.
>
> TBD: call add_mirror_info() to fill it.
>
> Signed-off-by: Xishi Qiu <[email protected]>
> ---
> arch/x86/mm/numa.c | 3 +++
> include/linux/mmzone.h | 15 +++++++++++++++
> mm/page_alloc.c | 33 +++++++++++++++++++++++++++++++++
> 3 files changed, 51 insertions(+)
>
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 4053bb5..781fd68 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -619,6 +619,9 @@ static int __init numa_init(int (*init_func)(void))
> /* In case that parsing SRAT failed. */
> WARN_ON(memblock_clear_hotplug(0, ULLONG_MAX));
> numa_reset_distance();
> +#ifdef CONFIG_MEMORY_MIRROR
> + memset(&mirror_info, 0, sizeof(mirror_info));
> +#endif
>
> ret = init_func();
> if (ret < 0)
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 54d74f6..1fae07b 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -69,6 +69,21 @@ enum {
> # define is_migrate_cma(migratetype) false
> #endif
>
> +#ifdef CONFIG_MEMORY_MIRROR
> +struct numa_mirror_info {
> + int node;
> + unsigned long start;
> + unsigned long size;
> +};
> +
> +struct mirror_info {
> + int count;
> + struct numa_mirror_info info[MAX_NUMNODES];
> +};

MAX_NUMNODES may not be enough when the firmware cannot use a contiguous
address range for mirroring.
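
A minimal userspace sketch of a range table sized for more than one range per node, with the bounds check that the quoted add_mirror_info() lacks. The demo_* names and the node count are placeholders; the factor of two follows the remark elsewhere in this thread that current hardware allows two ranges per node, and real firmware may need more.

```c
#define DEMO_MAX_NUMNODES	4	/* placeholder, not the kernel's value */
#define DEMO_RANGES_PER_NODE	2	/* assumption: two mirrored ranges per node */
#define DEMO_MAX_MIRROR_RANGES	(DEMO_MAX_NUMNODES * DEMO_RANGES_PER_NODE)

struct demo_mirror_range {
	int node;
	unsigned long start;
	unsigned long size;
};

struct demo_mirror_info {
	int count;
	struct demo_mirror_range info[DEMO_MAX_MIRROR_RANGES];
};

static struct demo_mirror_info demo_mirror_info;

/* Returns 0 on success, -1 if the range is empty or the table is full. */
static int demo_add_mirror_info(int node, unsigned long start,
				unsigned long size)
{
	struct demo_mirror_range *r;

	if (size == 0 || demo_mirror_info.count >= DEMO_MAX_MIRROR_RANGES)
		return -1;	/* refuse instead of writing past the array */

	r = &demo_mirror_info.info[demo_mirror_info.count++];
	r->node = node;
	r->start = start;
	r->size = size;
	return 0;
}
```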


> +
> +extern struct mirror_info mirror_info;
> +#endif

If this structure will not be updated after boot, marking it __read_mostly
should be helpful.


> +
> #define for_each_migratetype_order(order, type) \
> for (order = 0; order < MAX_ORDER; order++) \
> for (type = 0; type < MIGRATE_TYPES; type++)
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ebffa0e..41a95a7 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -210,6 +210,10 @@ static char * const zone_names[MAX_NR_ZONES] = {
> int min_free_kbytes = 1024;
> int user_min_free_kbytes = -1;
>
> +#ifdef CONFIG_MEMORY_MIRROR
> +struct mirror_info mirror_info;
> +#endif
> +
> static unsigned long __meminitdata nr_kernel_pages;
> static unsigned long __meminitdata nr_all_pages;
> static unsigned long __meminitdata dma_reserve;
> @@ -545,6 +549,31 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
> return 0;
> }
>
> +#ifdef CONFIG_MEMORY_MIRROR
> +static void __init add_mirror_info(int node,
> + unsigned long start, unsigned long size)
> +{
> + mirror_info.info[mirror_info.count].node = node;
> + mirror_info.info[mirror_info.count].start = start;
> + mirror_info.info[mirror_info.count].size = size;
> +
> + mirror_info.count++;
> +}
> +
> +static void __init print_mirror_info(void)
> +{
> + int i;
> +
> + printk("Mirror info\n");
> + for (i = 0; i < mirror_info.count; i++)
> + printk(" node %3d: [mem %#010lx-%#010lx]\n",
> + mirror_info.info[i].node,
> + mirror_info.info[i].start,
> + mirror_info.info[i].start +
> + mirror_info.info[i].size - 1);
> +}
> +#endif
> +
> /*
> * Freeing function for a buddy system allocator.
> *
> @@ -5438,6 +5467,10 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
> (u64)zone_movable_pfn[i] << PAGE_SHIFT);
> }
>
> +#ifdef CONFIG_MEMORY_MIRROR
> + print_mirror_info();
> +#endif
> +
> /* Print out the early node map */
> pr_info("Early memory node ranges\n");
> for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid)
>

2015-06-09 06:55:33

by Kamezawa Hiroyuki

Subject: Re: [RFC PATCH 03/12] mm: introduce MIGRATE_MIRROR to manage the mirrored pages

On 2015/06/04 21:58, Xishi Qiu wrote:
> This patch introduces a new MIGRATE_TYPES called "MIGRATE_MIRROR", it is used
> to storage the mirrored pages list.
> When cat /proc/pagetypeinfo, you can see the count of free mirrored blocks.
>

I guess you need to add Mel to CC.

> e.g.
> euler-linux:~ # cat /proc/pagetypeinfo
> Page block order: 9
> Pages per block: 512
>
> Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
> Node    0, zone      DMA, type    Unmovable      1      1      0      0      2      1      1      0      1      0      0
> Node    0, zone      DMA, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
> Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      0      3
> Node    0, zone      DMA, type       Mirror      0      0      0      0      0      0      0      0      0      0      0
> Node    0, zone      DMA, type      Reserve      0      0      0      0      0      0      0      0      0      1      0
> Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
> Node    0, zone    DMA32, type    Unmovable      0      0      1      0      0      0      0      1      1      1      0
> Node    0, zone    DMA32, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
> Node    0, zone    DMA32, type      Movable      1      2      6      6      6      4      5      3      3      2    738
> Node    0, zone    DMA32, type       Mirror      0      0      0      0      0      0      0      0      0      0      0
> Node    0, zone    DMA32, type      Reserve      0      0      0      0      0      0      0      0      0      0      1
> Node    0, zone    DMA32, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
> Node    0, zone   Normal, type    Unmovable      0      0      0      0      0      0      0      0      0      0      0
> Node    0, zone   Normal, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
> Node    0, zone   Normal, type      Movable      0      0      1      1      0      0      0      2      1      0   4254
> Node    0, zone   Normal, type       Mirror    148    104     63     70     26     11      2      2      1      1    973
> Node    0, zone   Normal, type      Reserve      0      0      0      0      0      0      0      0      0      0      1
> Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
>
> Number of blocks type      Unmovable  Reclaimable      Movable       Mirror      Reserve      Isolate
> Node 0, zone      DMA              1            0            6            0            1            0
> Node 0, zone    DMA32              2            0         1525            0            1            0
> Node 0, zone   Normal              0            0         8702         2048            2            0
> Page block order: 9
> Pages per block: 512



>
> Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
> Node    1, zone   Normal, type    Unmovable      0      0      0      0      0      0      0      0      0      0      0
> Node    1, zone   Normal, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
> Node    1, zone   Normal, type      Movable      2      2      1      1      2      1      2      2      2      3   3996
> Node    1, zone   Normal, type       Mirror     68     94     57      6      8      1      0      0      3      1   2003
> Node    1, zone   Normal, type      Reserve      0      0      0      0      0      0      0      0      0      0      1
> Node    1, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
>
> Number of blocks type      Unmovable  Reclaimable      Movable       Mirror      Reserve      Isolate
> Node 1, zone   Normal              0            0         8190         4096            2            0
>
>
> Signed-off-by: Xishi Qiu <[email protected]>
> ---
> include/linux/mmzone.h | 6 ++++++
> mm/page_alloc.c | 3 +++
> mm/vmstat.c | 3 +++
> 3 files changed, 12 insertions(+)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 1fae07b..b444335 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -39,6 +39,9 @@ enum {
> MIGRATE_UNMOVABLE,
> MIGRATE_RECLAIMABLE,
> MIGRATE_MOVABLE,
> +#ifdef CONFIG_MEMORY_MIRROR
> + MIGRATE_MIRROR,
> +#endif

From reading this patch, I can't tell how the fallback logic will work.
I think the update to the fallback order array should be in this patch...
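
For illustration only, a userspace model of a fallback table with a row for the new type, loosely following the shape of the allocator's fallback array of that era. The demo_* names are stand-ins, and the "mirror never steals, and nothing steals from mirror" row contents are an assumption, not what the series implements.

```c
enum demo_migratetype {
	DEMO_UNMOVABLE,
	DEMO_RECLAIMABLE,
	DEMO_MOVABLE,
	DEMO_MIRROR,
	DEMO_RESERVE,
	DEMO_NR_TYPES
};

/* Each row lists the types to try, in order, terminated by DEMO_RESERVE. */
static const int demo_fallbacks[DEMO_NR_TYPES][4] = {
	[DEMO_UNMOVABLE]   = { DEMO_RECLAIMABLE, DEMO_MOVABLE,   DEMO_RESERVE },
	[DEMO_RECLAIMABLE] = { DEMO_UNMOVABLE,   DEMO_MOVABLE,   DEMO_RESERVE },
	[DEMO_MOVABLE]     = { DEMO_RECLAIMABLE, DEMO_UNMOVABLE, DEMO_RESERVE },
	[DEMO_MIRROR]      = { DEMO_RESERVE },	/* assumption: no fallback */
	[DEMO_RESERVE]     = { DEMO_RESERVE },	/* never used */
};

/* Number of other types a request of type mt may fall back to. */
static int demo_fallback_count(int mt)
{
	int n = 0;

	while (demo_fallbacks[mt][n] != DEMO_RESERVE)
		n++;
	return n;
}

/* Nonzero if a request of type 'from' may steal pages of type 'to'. */
static int demo_can_fall_back_to(int from, int to)
{
	int i;

	for (i = 0; demo_fallbacks[from][i] != DEMO_RESERVE; i++)
		if (demo_fallbacks[from][i] == to)
			return 1;
	return 0;
}
```

Keeping the row next to the enum change makes the intended behaviour reviewable in one place.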

> MIGRATE_PCPTYPES, /* the number of types on the pcp lists */
> MIGRATE_RESERVE = MIGRATE_PCPTYPES,
> #ifdef CONFIG_CMA
> @@ -82,6 +85,9 @@ struct mirror_info {
> };
>
> extern struct mirror_info mirror_info;
> +# define is_migrate_mirror(migratetype) unlikely((migratetype) == MIGRATE_MIRROR)
> +#else
> +# define is_migrate_mirror(migratetype) false
> #endif
>
> #define for_each_migratetype_order(order, type) \
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 41a95a7..3b2ff46 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3245,6 +3245,9 @@ static void show_migration_types(unsigned char type)
> [MIGRATE_UNMOVABLE] = 'U',
> [MIGRATE_RECLAIMABLE] = 'E',
> [MIGRATE_MOVABLE] = 'M',
> +#ifdef CONFIG_MEMORY_MIRROR
> + [MIGRATE_MIRROR] = 'O',
> +#endif
> [MIGRATE_RESERVE] = 'R',
> #ifdef CONFIG_CMA
> [MIGRATE_CMA] = 'C',
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 4f5cd97..d0323e0 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -901,6 +901,9 @@ static char * const migratetype_names[MIGRATE_TYPES] = {
> "Unmovable",
> "Reclaimable",
> "Movable",
> +#ifdef CONFIG_MEMORY_MIRROR
> + "Mirror",
> +#endif
> "Reserve",
> #ifdef CONFIG_CMA
> "CMA",
>

2015-06-09 07:02:36

by Kamezawa Hiroyuki

Subject: Re: [RFC PATCH 07/12] mm: introduce __GFP_MIRROR to allocate mirrored pages

On 2015/06/04 22:02, Xishi Qiu wrote:
> This patch introduces a new gfp flag called "__GFP_MIRROR", it is used to
> allocate mirrored pages through buddy system.
>
> Signed-off-by: Xishi Qiu <[email protected]>

In Tony's original proposal, the motivation was to mirror all kernel memory.

Is the purpose of this patch to make the mirrored range available to user space?

But, hmm... I don't think adding a new GFP flag is a good idea. It adds many conditional jumps.

How about using GFP_KERNEL for user memory when the user wants mirrored memory, with
all kernel memory mirrored?

Thanks,
-Kame

> ---
> include/linux/gfp.h | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 15928f0..89d0091 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -35,6 +35,7 @@ struct vm_area_struct;
> #define ___GFP_NO_KSWAPD 0x400000u
> #define ___GFP_OTHER_NODE 0x800000u
> #define ___GFP_WRITE 0x1000000u
> +#define ___GFP_MIRROR 0x2000000u
> /* If the above are modified, __GFP_BITS_SHIFT may need updating */
>
> /*
> @@ -95,13 +96,15 @@ struct vm_area_struct;
> #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
> #define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */
>
> +#define __GFP_MIRROR ((__force gfp_t)___GFP_MIRROR) /* Allocate mirrored memory */
> +
> /*
> * This may seem redundant, but it's a way of annotating false positives vs.
> * allocations that simply cannot be supported (e.g. page tables).
> */
> #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
>
> -#define __GFP_BITS_SHIFT 25 /* Room for N __GFP_FOO bits */
> +#define __GFP_BITS_SHIFT 26 /* Room for N __GFP_FOO bits */
> #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
>
> /* This equals 0, but use constants in case they ever change */
>
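
The bit arithmetic behind the __GFP_BITS_SHIFT bump in the quoted hunk can be checked with a few lines of plain C. The two flag values are copied from the hunk; the demo_* helper names are made up for this sketch.

```c
#define DEMO_GFP_WRITE	0x1000000u	/* ___GFP_WRITE value from the hunk */
#define DEMO_GFP_MIRROR	0x2000000u	/* ___GFP_MIRROR value from the hunk */

/* Mask of all valid flag bits for a given __GFP_BITS_SHIFT. */
static unsigned int demo_gfp_bits_mask(unsigned int shift)
{
	return (1u << shift) - 1u;
}

/* Nonzero if the flag survives masking with the given shift. */
static int demo_flag_fits(unsigned int flag, unsigned int shift)
{
	return (flag & demo_gfp_bits_mask(shift)) == flag;
}
```

0x2000000 is bit 25, so a shift of 25 (bits 0 to 24) would mask the new flag away, while 26 keeps it.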

2015-06-09 07:06:39

by Kamezawa Hiroyuki

Subject: Re: [RFC PATCH 08/12] mm: use mirrorable to switch allocate mirrored memory

On 2015/06/04 22:02, Xishi Qiu wrote:
> Add a new interface in path /proc/sys/vm/mirrorable. When set to 1, it means
> we should allocate mirrored memory for both user and kernel processes.
>
> Signed-off-by: Xishi Qiu <[email protected]>

I can't see why we need this switch. If it is set, will all GFP_HIGHUSER allocations
use mirrored memory?

Or will you add a special mmap/madvise flag to use mirrored memory?

Thanks,
-Kame

> ---
> include/linux/mmzone.h | 1 +
> kernel/sysctl.c | 9 +++++++++
> mm/page_alloc.c | 1 +
> 3 files changed, 11 insertions(+)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index f82e3ae..20888dd 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -85,6 +85,7 @@ struct mirror_info {
> };
>
> extern struct mirror_info mirror_info;
> +extern int sysctl_mirrorable;
> # define is_migrate_mirror(migratetype) unlikely((migratetype) == MIGRATE_MIRROR)
> #else
> # define is_migrate_mirror(migratetype) false
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 2082b1a..dc2625e 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1514,6 +1514,15 @@ static struct ctl_table vm_table[] = {
> .extra2 = &one,
> },
> #endif
> +#ifdef CONFIG_MEMORY_MIRROR
> + {
> + .procname = "mirrorable",
> + .data = &sysctl_mirrorable,
> + .maxlen = sizeof(sysctl_mirrorable),
> + .mode = 0644,
> + .proc_handler = proc_dointvec_minmax,
> + },
> +#endif
> {
> .procname = "user_reserve_kbytes",
> .data = &sysctl_user_reserve_kbytes,
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 249a8f6..63b90ca 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -212,6 +212,7 @@ int user_min_free_kbytes = -1;
>
> #ifdef CONFIG_MEMORY_MIRROR
> struct mirror_info mirror_info;
> +int sysctl_mirrorable = 0;
> #endif
>
> static unsigned long __meminitdata nr_kernel_pages;
>

2015-06-09 07:13:11

by Kamezawa Hiroyuki

Subject: Re: [RFC PATCH 10/12] mm: add the buddy system interface

On 2015/06/04 22:04, Xishi Qiu wrote:
> Add the buddy system interface for address range mirroring feature.
> Allocate mirrored pages in MIGRATE_MIRROR list. If there is no mirrored pages
> left, use other types pages.
>
> Signed-off-by: Xishi Qiu <[email protected]>
> ---
> mm/page_alloc.c | 40 +++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 39 insertions(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d4d2066..0fb55288 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -599,6 +599,26 @@ static inline bool is_mirror_pfn(unsigned long pfn)
>
> return false;
> }
> +
> +static inline bool change_to_mirror(gfp_t gfp_flags, int high_zoneidx)
> +{
> + /*
> + * Do not alloc mirrored memory below 4G, because 0-4G is
> + * all mirrored by default, and the list is always empty.
> + */
> + if (high_zoneidx < ZONE_NORMAL)
> + return false;
> +
> + /* Alloc mirrored memory for only kernel */
> + if (gfp_flags & __GFP_MIRROR)
> + return true;

GFP_KERNEL itself should imply mirror, I think.

> +
> + /* Alloc mirrored memory for both user and kernel */
> + if (sysctl_mirrorable)
> + return true;

Reading this, I think this sysctl is not good. The user cannot know what is mirrored
because memory may not be mirrored until the sysctl is set.

Thanks,
-Kame


> +
> + return false;
> +}
> #endif
>
> /*
> @@ -1796,7 +1816,10 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
> WARN_ON_ONCE(order > 1);
> }
> spin_lock_irqsave(&zone->lock, flags);
> - page = __rmqueue(zone, order, migratetype);
> + if (is_migrate_mirror(migratetype))
> + page = __rmqueue_smallest(zone, order, migratetype);
> + else
> + page = __rmqueue(zone, order, migratetype);
> spin_unlock(&zone->lock);
> if (!page)
> goto failed;
> @@ -2928,6 +2951,11 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> if (IS_ENABLED(CONFIG_CMA) && ac.migratetype == MIGRATE_MOVABLE)
> alloc_flags |= ALLOC_CMA;
>
> +#ifdef CONFIG_MEMORY_MIRROR
> + if (change_to_mirror(gfp_mask, ac.high_zoneidx))
> + ac.migratetype = MIGRATE_MIRROR;
> +#endif
> +
> retry_cpuset:
> cpuset_mems_cookie = read_mems_allowed_begin();
>
> @@ -2943,9 +2971,19 @@ retry_cpuset:
>
> /* First allocation attempt */
> alloc_mask = gfp_mask|__GFP_HARDWALL;
> +retry:
> page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
> if (unlikely(!page)) {
> /*
> + * If there is no mirrored memory, we will alloc other
> + * types memory.
> + */
> + if (is_migrate_mirror(ac.migratetype)) {
> + ac.migratetype = gfpflags_to_migratetype(gfp_mask);
> + goto retry;
> + }
> +
> + /*
> * Runtime PM, block IO and its error handling path
> * can deadlock because I/O on the device might not
> * complete.
>

2015-06-09 10:13:07

by Xishi Qiu

Subject: Re: [RFC PATCH 10/12] mm: add the buddy system interface

On 2015/6/9 15:12, Kamezawa Hiroyuki wrote:

> On 2015/06/04 22:04, Xishi Qiu wrote:
>> Add the buddy system interface for address range mirroring feature.
>> Allocate mirrored pages in MIGRATE_MIRROR list. If there is no mirrored pages
>> left, use other types pages.
>>
>> Signed-off-by: Xishi Qiu <[email protected]>
>> ---
>> mm/page_alloc.c | 40 +++++++++++++++++++++++++++++++++++++++-
>> 1 file changed, 39 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index d4d2066..0fb55288 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -599,6 +599,26 @@ static inline bool is_mirror_pfn(unsigned long pfn)
>>
>> return false;
>> }
>> +
>> +static inline bool change_to_mirror(gfp_t gfp_flags, int high_zoneidx)
>> +{
>> + /*
>> + * Do not alloc mirrored memory below 4G, because 0-4G is
>> + * all mirrored by default, and the list is always empty.
>> + */
>> + if (high_zoneidx < ZONE_NORMAL)
>> + return false;
>> +
>> + /* Alloc mirrored memory for only kernel */
>> + if (gfp_flags & __GFP_MIRROR)
>> + return true;
>
> GFP_KERNEL itself should imply mirror, I think.
>

Hi Kame,

How about this: #define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_MIRROR) ?

Thanks,
Xishi Qiu

>> +
>> + /* Alloc mirrored memory for both user and kernel */
>> + if (sysctl_mirrorable)
>> + return true;
>
> Reading this, I think this sysctl is not good. The user cannot know what is mirrored
> because memory may not be mirrored until the sysctl is set.
>
> Thanks,
> -Kame
>
>
>> +
>> + return false;
>> +}
>> #endif
>>
>> /*
>> @@ -1796,7 +1816,10 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
>> WARN_ON_ONCE(order > 1);
>> }
>> spin_lock_irqsave(&zone->lock, flags);
>> - page = __rmqueue(zone, order, migratetype);
>> + if (is_migrate_mirror(migratetype))
>> + page = __rmqueue_smallest(zone, order, migratetype);
>> + else
>> + page = __rmqueue(zone, order, migratetype);
>> spin_unlock(&zone->lock);
>> if (!page)
>> goto failed;
>> @@ -2928,6 +2951,11 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>> if (IS_ENABLED(CONFIG_CMA) && ac.migratetype == MIGRATE_MOVABLE)
>> alloc_flags |= ALLOC_CMA;
>>
>> +#ifdef CONFIG_MEMORY_MIRROR
>> + if (change_to_mirror(gfp_mask, ac.high_zoneidx))
>> + ac.migratetype = MIGRATE_MIRROR;
>> +#endif
>> +
>> retry_cpuset:
>> cpuset_mems_cookie = read_mems_allowed_begin();
>>
>> @@ -2943,9 +2971,19 @@ retry_cpuset:
>>
>> /* First allocation attempt */
>> alloc_mask = gfp_mask|__GFP_HARDWALL;
>> +retry:
>> page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
>> if (unlikely(!page)) {
>> /*
>> + * If there is no mirrored memory, we will alloc other
>> + * types memory.
>> + */
>> + if (is_migrate_mirror(ac.migratetype)) {
>> + ac.migratetype = gfpflags_to_migratetype(gfp_mask);
>> + goto retry;
>> + }
>> +
>> + /*
>> * Runtime PM, block IO and its error handling path
>> * can deadlock because I/O on the device might not
>> * complete.
>>
>
>
>
> .
>


2015-06-09 10:19:06

by Xishi Qiu

Subject: Re: [RFC PATCH 08/12] mm: use mirrorable to switch allocate mirrored memory

On 2015/6/9 15:06, Kamezawa Hiroyuki wrote:

> On 2015/06/04 22:02, Xishi Qiu wrote:
>> Add a new interface in path /proc/sys/vm/mirrorable. When set to 1, it means
>> we should allocate mirrored memory for both user and kernel processes.
>>
>> Signed-off-by: Xishi Qiu <[email protected]>
>
> I can't see why do we need this switch. If this is set, all GFP_HIGHUSER will use
> mirrored memory ?
>
> Or will you add special MMAP/madvise flag to use mirrored memory ?
>

Hi Kame,

Yes,

MMAP/madvise
-> add VM_MIRROR
-> add GFP_MIRROR
-> use MIGRATE_MIRROR list to alloc mirrored pages

So users can use mirrored memory. What do you think?
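
The chain above, sketched as a userspace model: a per-mapping flag is translated into an allocation flag, which in turn selects the free list. Every name and value here (DEMO_VM_MIRROR, demo_vma_to_gfp() and so on) is hypothetical; none of this exists in the posted series yet.

```c
#define DEMO_VM_MIRROR	0x1u		/* hypothetical vma flag set via mmap/madvise */
#define DEMO_GFP_MIRROR	0x2000000u	/* stand-in for the proposed gfp flag */

enum demo_migratetype {
	DEMO_MOVABLE,
	DEMO_MIRROR
};

/* Step 1 -> 2: a mapping marked VM_MIRROR asks for mirrored pages. */
static unsigned int demo_vma_to_gfp(unsigned int vm_flags, unsigned int gfp)
{
	if (vm_flags & DEMO_VM_MIRROR)
		gfp |= DEMO_GFP_MIRROR;
	return gfp;
}

/* Step 2 -> 3: the gfp flag selects the MIGRATE_MIRROR free list. */
static enum demo_migratetype demo_gfp_to_migratetype(unsigned int gfp)
{
	return (gfp & DEMO_GFP_MIRROR) ? DEMO_MIRROR : DEMO_MOVABLE;
}
```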

Thanks,
Xishi Qiu

> Thanks,
> -Kame
>
>> ---
>> include/linux/mmzone.h | 1 +
>> kernel/sysctl.c | 9 +++++++++
>> mm/page_alloc.c | 1 +
>> 3 files changed, 11 insertions(+)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index f82e3ae..20888dd 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -85,6 +85,7 @@ struct mirror_info {
>> };
>>
>> extern struct mirror_info mirror_info;
>> +extern int sysctl_mirrorable;
>> # define is_migrate_mirror(migratetype) unlikely((migratetype) == MIGRATE_MIRROR)
>> #else
>> # define is_migrate_mirror(migratetype) false
>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>> index 2082b1a..dc2625e 100644
>> --- a/kernel/sysctl.c
>> +++ b/kernel/sysctl.c
>> @@ -1514,6 +1514,15 @@ static struct ctl_table vm_table[] = {
>> .extra2 = &one,
>> },
>> #endif
>> +#ifdef CONFIG_MEMORY_MIRROR
>> + {
>> + .procname = "mirrorable",
>> + .data = &sysctl_mirrorable,
>> + .maxlen = sizeof(sysctl_mirrorable),
>> + .mode = 0644,
>> + .proc_handler = proc_dointvec_minmax,
>> + },
>> +#endif
>> {
>> .procname = "user_reserve_kbytes",
>> .data = &sysctl_user_reserve_kbytes,
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 249a8f6..63b90ca 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -212,6 +212,7 @@ int user_min_free_kbytes = -1;
>>
>> #ifdef CONFIG_MEMORY_MIRROR
>> struct mirror_info mirror_info;
>> +int sysctl_mirrorable = 0;
>> #endif
>>
>> static unsigned long __meminitdata nr_kernel_pages;
>>
>
>
>
> .
>


2015-06-09 10:19:48

by Xishi Qiu

Subject: Re: [RFC PATCH 01/12] mm: add a new config to manage the code

On 2015/6/9 14:44, Kamezawa Hiroyuki wrote:

> On 2015/06/04 21:56, Xishi Qiu wrote:
>> This patch introduces a new config called "CONFIG_ACPI_MIRROR_MEMORY", it is
>> used to on/off the feature.
>>
>> Signed-off-by: Xishi Qiu <[email protected]>
>> ---
>> mm/Kconfig | 8 ++++++++
>> 1 file changed, 8 insertions(+)
>>
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 390214d..4f2a726 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -200,6 +200,14 @@ config MEMORY_HOTREMOVE
>> depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
>> depends on MIGRATION
>>
>> +config MEMORY_MIRROR
>> + bool "Address range mirroring support"
>> + depends on X86 && NUMA
>> + default y
>> + help
>> + This feature depends on hardware and firmware support.
>> + ACPI or EFI records the mirror info.
>
> default y...no runtime influence when the user doesn't use memory mirror ?
>

It is a new feature, so how about changing it: default y -> n?

Thanks,
Xishi Qiu

> Thanks,
> -Kame
>
>
>
>
> .
>


2015-06-10 03:06:54

by Kamezawa Hiroyuki

Subject: Re: [RFC PATCH 10/12] mm: add the buddy system interface

On 2015/06/09 19:04, Xishi Qiu wrote:
> On 2015/6/9 15:12, Kamezawa Hiroyuki wrote:
>
>> On 2015/06/04 22:04, Xishi Qiu wrote:
>>> Add the buddy system interface for address range mirroring feature.
>>> Allocate mirrored pages in MIGRATE_MIRROR list. If there is no mirrored pages
>>> left, use other types pages.
>>>
>>> Signed-off-by: Xishi Qiu <[email protected]>
>>> ---
>>> mm/page_alloc.c | 40 +++++++++++++++++++++++++++++++++++++++-
>>> 1 file changed, 39 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index d4d2066..0fb55288 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -599,6 +599,26 @@ static inline bool is_mirror_pfn(unsigned long pfn)
>>>
>>> return false;
>>> }
>>> +
>>> +static inline bool change_to_mirror(gfp_t gfp_flags, int high_zoneidx)
>>> +{
>>> + /*
>>> + * Do not alloc mirrored memory below 4G, because 0-4G is
>>> + * all mirrored by default, and the list is always empty.
>>> + */
>>> + if (high_zoneidx < ZONE_NORMAL)
>>> + return false;
>>> +
>>> + /* Alloc mirrored memory for only kernel */
>>> + if (gfp_flags & __GFP_MIRROR)
>>> + return true;
>>
>> GFP_KERNEL itself should imply mirror, I think.
>>
>
> Hi Kame,
>
> How about like this: #define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_MIRROR) ?
>

Hm... that cannot cover GFP_ATOMIC et al.

I guess mirrored memory should be allocated if !__GFP_HIGHMEM or !__GFP_MOVABLE.
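
To make the difference concrete, a small userspace comparison of the two policies: tagging GFP_KERNEL with an explicit mirror bit versus treating every non-movable request as a mirror request. The DEMO_* values are simplified stand-ins for the real gfp bits, and only the __GFP_MOVABLE half of the condition is modelled.

```c
#define DEMO_GFP_MOVABLE	0x08u
#define DEMO_GFP_WAIT		0x10u
#define DEMO_GFP_HIGH		0x20u
#define DEMO_GFP_IO		0x40u
#define DEMO_GFP_FS		0x80u
#define DEMO_GFP_MIRROR		0x2000000u

#define DEMO_GFP_ATOMIC		(DEMO_GFP_HIGH)
#define DEMO_GFP_KERNEL_TAGGED	(DEMO_GFP_WAIT | DEMO_GFP_IO | DEMO_GFP_FS | \
				 DEMO_GFP_MIRROR)
#define DEMO_GFP_USER_MOVABLE	(DEMO_GFP_WAIT | DEMO_GFP_IO | DEMO_GFP_FS | \
				 DEMO_GFP_MOVABLE)

/* Policy A: only requests carrying the explicit flag get mirrored memory. */
static int demo_mirror_by_flag(unsigned int gfp)
{
	return (gfp & DEMO_GFP_MIRROR) != 0;
}

/* Policy B: every request that is not movable gets mirrored memory. */
static int demo_mirror_by_movable(unsigned int gfp)
{
	return (gfp & DEMO_GFP_MOVABLE) == 0;
}
```

Under policy A an atomic allocation is missed because it never includes the tagged GFP_KERNEL mask; under policy B it is covered without touching any call site.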

thanks,
-Kame



2015-06-10 03:08:16

by Kamezawa Hiroyuki

Subject: Re: [RFC PATCH 01/12] mm: add a new config to manage the code

On 2015/06/09 19:10, Xishi Qiu wrote:
> On 2015/6/9 14:44, Kamezawa Hiroyuki wrote:
>
>> On 2015/06/04 21:56, Xishi Qiu wrote:
>>> This patch introduces a new config called "CONFIG_ACPI_MIRROR_MEMORY", it is
>>> used to on/off the feature.
>>>
>>> Signed-off-by: Xishi Qiu <[email protected]>
>>> ---
>>> mm/Kconfig | 8 ++++++++
>>> 1 file changed, 8 insertions(+)
>>>
>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>> index 390214d..4f2a726 100644
>>> --- a/mm/Kconfig
>>> +++ b/mm/Kconfig
>>> @@ -200,6 +200,14 @@ config MEMORY_HOTREMOVE
>>> depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
>>> depends on MIGRATION
>>>
>>> +config MEMORY_MIRROR
>>> + bool "Address range mirroring support"
>>> + depends on X86 && NUMA
>>> + default y
>>> + help
>>> + This feature depends on hardware and firmware support.
>>> + ACPI or EFI records the mirror info.
>>
>> default y...no runtime influence when the user doesn't use memory mirror ?
>>
>
> It is a new feature, so how about like this: default y -> n?
>

That's okay with me. But it would be better to check the performance impact before
merging, because you modified core memory management code.

Thanks,
-Kame


2015-06-10 03:10:31

by Kamezawa Hiroyuki

Subject: Re: [RFC PATCH 08/12] mm: use mirrorable to switch allocate mirrored memory

On 2015/06/09 19:09, Xishi Qiu wrote:
> On 2015/6/9 15:06, Kamezawa Hiroyuki wrote:
>
>> On 2015/06/04 22:02, Xishi Qiu wrote:
>>> Add a new interface in path /proc/sys/vm/mirrorable. When set to 1, it means
>>> we should allocate mirrored memory for both user and kernel processes.
>>>
>>> Signed-off-by: Xishi Qiu <[email protected]>
>>
>> I can't see why do we need this switch. If this is set, all GFP_HIGHUSER will use
>> mirrored memory ?
>>
>> Or will you add special MMAP/madvise flag to use mirrored memory ?
>>
>
> Hi Kame,
>
> Yes,
>
> MMAP/madvise
> -> add VM_MIRROR
> -> add GFP_MIRROR
> -> use MIGRATE_MIRROR list to alloc mirrored pages
>
> So user can use mirrored memory. What do you think?
>

I see. Please explain it (your final plan) in the patch description or in the cover letter of the series.

Thanks,
-Kame

2015-06-10 20:42:33

by Luck, Tony

Subject: RE: [RFC PATCH 10/12] mm: add the buddy system interface

> I guess, mirrored memory should be allocated if !__GFP_HIGHMEM or !__GFP_MOVABLE

HIGHMEM shouldn't matter - partial memory mirror only makes any sense on X86_64 systems ... 32-bit kernels
don't even boot on systems with 64GB, and the minimum rational configuration for a machine that supports
mirror is 128GB (4 cpu sockets * 2 memory controllers per socket * 4 channels per controller * 4GB DIMM ...
leaving any channels empty likely leaves you short of memory bandwidth for these high core count processors).
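
The 128GB figure is just the product of the four numbers above, assuming one DIMM per channel. As a quick check (the DEMO_* names are made up for this sketch):

```c
/* Figures taken from the paragraph above; one DIMM per channel is assumed. */
enum {
	DEMO_SOCKETS = 4,
	DEMO_CONTROLLERS_PER_SOCKET = 2,
	DEMO_CHANNELS_PER_CONTROLLER = 4,
	DEMO_DIMM_GB = 4
};

/* Smallest fully populated configuration, in GB. */
static unsigned int demo_min_memory_gb(void)
{
	return DEMO_SOCKETS * DEMO_CONTROLLERS_PER_SOCKET *
	       DEMO_CHANNELS_PER_CONTROLLER * DEMO_DIMM_GB;
}
```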

MOVABLE is mostly the opposite of MIRROR - we never want to fill a kernel allocation from a MOVABLE page. I
want all kernel allocations to be from MIRROR.

-Tony

2015-06-12 08:11:28

by Naoya Horiguchi

Subject: Re: [RFC PATCH 08/12] mm: use mirrorable to switch allocate mirrored memory

On Thu, Jun 04, 2015 at 09:02:49PM +0800, Xishi Qiu wrote:
> Add a new interface in path /proc/sys/vm/mirrorable. When set to 1, it means
> we should allocate mirrored memory for both user and kernel processes.

As Dave and Kamezawa-san commented, documentation is not enough, so please
add a section in Documentation/sysctl/vm.txt for this new tuning parameter.

Thanks,
Naoya Horiguchi

>
> Signed-off-by: Xishi Qiu <[email protected]>
> ---
> include/linux/mmzone.h | 1 +
> kernel/sysctl.c | 9 +++++++++
> mm/page_alloc.c | 1 +
> 3 files changed, 11 insertions(+)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index f82e3ae..20888dd 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -85,6 +85,7 @@ struct mirror_info {
> };
>
> extern struct mirror_info mirror_info;
> +extern int sysctl_mirrorable;
> # define is_migrate_mirror(migratetype) unlikely((migratetype) == MIGRATE_MIRROR)
> #else
> # define is_migrate_mirror(migratetype) false
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 2082b1a..dc2625e 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1514,6 +1514,15 @@ static struct ctl_table vm_table[] = {
> .extra2 = &one,
> },
> #endif
> +#ifdef CONFIG_MEMORY_MIRROR
> + {
> + .procname = "mirrorable",
> + .data = &sysctl_mirrorable,
> + .maxlen = sizeof(sysctl_mirrorable),
> + .mode = 0644,
> + .proc_handler = proc_dointvec_minmax,
> + },
> +#endif
> {
> .procname = "user_reserve_kbytes",
> .data = &sysctl_user_reserve_kbytes,
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 249a8f6..63b90ca 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -212,6 +212,7 @@ int user_min_free_kbytes = -1;
>
> #ifdef CONFIG_MEMORY_MIRROR
> struct mirror_info mirror_info;
> +int sysctl_mirrorable = 0;
> #endif
>
> static unsigned long __meminitdata nr_kernel_pages;
> --
> 2.0.0
>
>

2015-06-12 08:52:25

by Naoya Horiguchi

Subject: Re: [RFC PATCH 00/12] mm: mirrored memory support for page buddy allocations

On Thu, Jun 04, 2015 at 08:54:22PM +0800, Xishi Qiu wrote:
> Intel Xeon processor E7 v3 product family-based platforms introduces support
> for partial memory mirroring called as 'Address Range Mirroring'. This feature
> allows BIOS to specify a subset of total available memory to be mirrored (and
> optionally also specify whether to mirror the range 0-4 GB). This capability
> allows user to make an appropriate tradeoff between non-mirrored memory range
> and mirrored memory range thus optimizing total available memory and still
> achieving highly reliable memory range for mission critical workloads and/or
> kernel space.
>
> Tony has already sent a patchset to support this feature at boot time.
> https://lkml.org/lkml/2015/5/8/521
>
> This patchset can support the feature after boot time. It introduces mirror_info
> to save the mirrored memory range. Then use __GFP_MIRROR to allocate mirrored
> pages.
>
> I think adding a new migratetype is better and easier than a new zone, so I use
> MIGRATE_MIRROR to manage the mirrored pages. However it changed some code in the
> core file, please review and comment, thanks.
>
> TBD:
> 1) call add_mirror_info() to fill mirrored memory info.
> 2) add compatibility with memory online/offline.

Maybe simply disable offlining of any memory block that includes MIGRATE_MIRROR pages?
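
As a rough illustration of that idea, the check could look like the following
standalone C sketch. The enum, the function name and the idea of walking a
per-block array of pageblock types are assumptions for this example, not code
from the patchset or from the memory hotplug core.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced set of pageblock types for this sketch. */
enum demo_pageblock_type { DEMO_UNMOVABLE, DEMO_MOVABLE, DEMO_MIRROR };

/*
 * A memory block may be offlined only if none of its pageblocks is
 * mirrored. All other offlining conditions are ignored here.
 */
static bool demo_block_may_offline(const enum demo_pageblock_type *pb,
				   size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (pb[i] == DEMO_MIRROR)
			return false;
	return true;
}
```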

> 3) add more interface? others?

4?) I don't have the whole picture of how address range mirroring works,
but I'm curious about what happens when an uncorrected memory error occurs
on a mirrored page. If HW/FW does some useful work that is invisible to the
kernel, please document it somewhere. My questions are:
- Can the kernel with this patchset really continue operating without
breaking consistency? More specifically, the corrupted page is replaced with
its mirror page, but can everything that holds a reference (like a struct
page or pfn) to the corrupted page properly switch that reference to the
mirror page? Or is there no need to worry about that? (This is difficult for
kernel pages like slab, which is why hwpoison currently doesn't handle any
kernel pages.)
- How can we test/confirm that the whole scheme works fine? Is the current
memory error injection framework enough?

It would be really nice if a roadmap, including testing, were shared.

# And please CC me as [email protected] (my primary email address :)

Thanks,
Naoya Horiguchi

> Xishi Qiu (12):
> mm: add a new config to manage the code
> mm: introduce mirror_info
> mm: introduce MIGRATE_MIRROR to manage the mirrored pages
> mm: add mirrored pages to buddy system
> mm: introduce a new zone_stat_item NR_FREE_MIRROR_PAGES
> mm: add free mirrored pages info
> mm: introduce __GFP_MIRROR to allocate mirrored pages
> mm: use mirrorable to switch allocate mirrored memory
> mm: enable allocate mirrored memory at boot time
> mm: add the buddy system interface
> mm: add the PCP interface
> mm: let slab/slub/slob use mirrored memory
>
> arch/x86/mm/numa.c | 3 ++
> drivers/base/node.c | 17 ++++---
> fs/proc/meminfo.c | 6 +++
> include/linux/gfp.h | 5 +-
> include/linux/mmzone.h | 23 +++++++++
> include/linux/vmstat.h | 2 +
> kernel/sysctl.c | 9 ++++
> mm/Kconfig | 8 +++
> mm/page_alloc.c | 134 ++++++++++++++++++++++++++++++++++++++++++++++---
> mm/slab.c | 3 +-
> mm/slob.c | 2 +-
> mm/slub.c | 2 +-
> mm/vmstat.c | 4 ++
> 13 files changed, 202 insertions(+), 16 deletions(-)
>
> --
> 2.0.0
>
>

2015-06-12 09:13:54

by Xishi Qiu

Subject: Re: [RFC PATCH 00/12] mm: mirrored memory support for page buddy allocations

On 2015/6/12 16:42, Naoya Horiguchi wrote:

> On Thu, Jun 04, 2015 at 08:54:22PM +0800, Xishi Qiu wrote:
>> Intel Xeon processor E7 v3 product family-based platforms introduces support
>> for partial memory mirroring called as 'Address Range Mirroring'. This feature
>> allows BIOS to specify a subset of total available memory to be mirrored (and
>> optionally also specify whether to mirror the range 0-4 GB). This capability
>> allows user to make an appropriate tradeoff between non-mirrored memory range
>> and mirrored memory range thus optimizing total available memory and still
>> achieving highly reliable memory range for mission critical workloads and/or
>> kernel space.
>>
>> Tony has already sent a patchset to support this feature at boot time.
>> https://lkml.org/lkml/2015/5/8/521
>>
>> This patchset can support the feature after boot time. It introduces mirror_info
>> to save the mirrored memory range. Then use __GFP_MIRROR to allocate mirrored
>> pages.
>>
>> I think adding a new migratetype is better and easier than a new zone, so I use
>> MIGRATE_MIRROR to manage the mirrored pages. However it changed some code in the
>> core file, please review and comment, thanks.
>>
>> TBD:
>> 1) call add_mirror_info() to fill mirrored memory info.
>> 2) add compatibility with memory online/offline.
>
> Maybe simply disabling memory offlining of memory block including MIGRATE_MIRROR?
>
>> 3) add more interface? others?
>
> 4?) I don't have the whole picture of how address range mirroring works,
> but I'm curious about what happens when an uncorrected memory error happens
> on a mirror page. If HW/FW do some useful work invisible from kernel,
> please document it somewhere. And my questions are:

Hi Naoya,

I think the hardware and BIOS will do the work when a page is corrupted, and it is
invisible to the kernel. The kernel just uses the mirrored memory (allocating pages
from a special physical address range).

Thanks,
Xishi Qiu

> - can the kernel with this patchset really continue its operation without
> breaking consistency? More specifically, the corrupted page is replaced with
> its mirror page, but can any other pages which have references (like struct
> page or pfn) for the corrupted page properly switch these references to the
> mirror page? Or no worry about that? (This is difficult for kernel pages
> like slab, and that's why currently hwpoison doesn't handle any kernel pages.)
> - How can we test/confirm that the whole scheme works fine? Is current memory
> error injection framework enough?
>
> It's really nice if any roadmap including testing is shared.
>
> # And please CC me as [email protected] (my primary email address :)
>
> Thanks,
> Naoya Horiguchi
>
>> Xishi Qiu (12):
>> mm: add a new config to manage the code
>> mm: introduce mirror_info
>> mm: introduce MIGRATE_MIRROR to manage the mirrored pages
>> mm: add mirrored pages to buddy system
>> mm: introduce a new zone_stat_item NR_FREE_MIRROR_PAGES
>> mm: add free mirrored pages info
>> mm: introduce __GFP_MIRROR to allocate mirrored pages
>> mm: use mirrorable to switch allocate mirrored memory
>> mm: enable allocate mirrored memory at boot time
>> mm: add the buddy system interface
>> mm: add the PCP interface
>> mm: let slab/slub/slob use mirrored memory
>>
>> arch/x86/mm/numa.c | 3 ++
>> drivers/base/node.c | 17 ++++---
>> fs/proc/meminfo.c | 6 +++
>> include/linux/gfp.h | 5 +-
>> include/linux/mmzone.h | 23 +++++++++
>> include/linux/vmstat.h | 2 +
>> kernel/sysctl.c | 9 ++++
>> mm/Kconfig | 8 +++
>> mm/page_alloc.c | 134 ++++++++++++++++++++++++++++++++++++++++++++++---
>> mm/slab.c | 3 +-
>> mm/slob.c | 2 +-
>> mm/slub.c | 2 +-
>> mm/vmstat.c | 4 ++
>> 13 files changed, 202 insertions(+), 16 deletions(-)
>>
>> --
>> 2.0.0
>>
>>


2015-06-12 19:03:51

by Luck, Tony

Subject: Re: [RFC PATCH 00/12] mm: mirrored memory support for page buddy allocations

On Fri, Jun 12, 2015 at 08:42:33AM +0000, Naoya Horiguchi wrote:
> 4?) I don't have the whole picture of how address range mirroring works,
> but I'm curious about what happens when an uncorrected memory error happens
> on a mirror page. If HW/FW do some useful work invisible from kernel,
> please document it somewhere. And my questions are:
> - can the kernel with this patchset really continue its operation without
> breaking consistency? More specifically, the corrupted page is replaced with
> its mirror page, but can any other pages which have references (like struct
> page or pfn) for the corrupted page properly switch these references to the
> mirror page? Or no worry about that? (This is difficult for kernel pages
> like slab, and that's why currently hwpoison doesn't handle any kernel pages.)

The mirror is operated by h/w (perhaps with some platform firmware
intervention when things start breaking badly).

In normal operation there are two DIMM addresses backing each
system physical address in the mirrored range (thus total system
memory capacity is reduced when mirror is enabled). Memory writes
are directed to both locations. Memory reads are interleaved to
maintain bandwidth, so could come from either address.

When a read returns with an ECC failure the h/w automatically:
1) Re-issues the read to the other DIMM address. If that also fails - then
we do the normal machine check processing for an uncorrected error
2) But if the other side of the mirror is good, we can send the good
data to the reader (cpu, or dma) and, in parallel try to fix the
bad side by writing the good data to it.
3) A corrected error will be logged, it may indicate whether the
attempt to fix succeeded or not.
4) If platform firmware wants, it can be notified of the correction
and it may keep statistics on the rate of errors, correction status,
etc. If things get very bad it may "break" the mirror and direct
all future reads to the remaining "good" side. If it does this, it will
likely tell the OS via some ACPI method.
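
The read path in steps 1) to 3) can be modelled in a few lines of standalone C.
This is only a toy model under stated assumptions: the structure layout, the
names, the "interleave by address parity" rule and the assumption that the
repair write always succeeds are all invented for illustration and say nothing
about how the memory controller is really built.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_CELLS 8

enum demo_read_status { DEMO_READ_OK, DEMO_READ_CORRECTED, DEMO_READ_UNCORRECTED };

struct demo_cell {
	uint64_t data;
	bool ecc_bad;		/* a read of this copy would fail its ECC check */
};

struct demo_mirror {
	struct demo_cell side[2][DEMO_CELLS];	/* two DIMM locations per address */
	unsigned int corrected_log;		/* corrected errors logged so far */
};

/* Writes go to both copies. */
static void demo_mirror_write(struct demo_mirror *m, unsigned int addr,
			      uint64_t data)
{
	assert(addr < DEMO_CELLS);
	for (int s = 0; s < 2; s++) {
		m->side[s][addr].data = data;
		m->side[s][addr].ecc_bad = false;
	}
}

/* Reads are interleaved; an ECC failure is retried on the other copy. */
static enum demo_read_status demo_mirror_read(struct demo_mirror *m,
					      unsigned int addr, uint64_t *out)
{
	assert(addr < DEMO_CELLS);
	int first = addr & 1;
	int other = first ^ 1;

	if (!m->side[first][addr].ecc_bad) {
		*out = m->side[first][addr].data;
		return DEMO_READ_OK;
	}
	if (m->side[other][addr].ecc_bad)
		return DEMO_READ_UNCORRECTED;	/* normal machine check path */

	*out = m->side[other][addr].data;
	/* Repair the bad copy with the good data and log the correction. */
	m->side[first][addr] = m->side[other][addr];
	m->corrected_log++;
	return DEMO_READ_CORRECTED;
}
```

The caller sees good data in both the OK and the CORRECTED case; only the log
counter tells the two apart, which matches the point that the OS mostly notices
this through corrected error logs.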

All of this is done at much less than page granularity. Cache coherence
is maintained ... apart from some small performance glitches and the corrected
error logs, the OS is unaware of all of this.

Note that in current implementations the mirror copies are both behind
the same memory controller ... so this isn't intended to cope with high
level failure of a memory controller ... just to deal with randomly
distributed ECC errors.

> - How can we test/confirm that the whole scheme works fine? Is current memory
> error injection framework enough?

Still working on that piece. To validate you need to be able to
inject errors to just one side of the mirror, and I'm not really
sure that the ACPI/EINJ interface is up to the task.

-Tony

2015-06-15 00:28:07

by Naoya Horiguchi

Subject: Re: [RFC PATCH 00/12] mm: mirrored memory support for page buddy allocations

On Fri, Jun 12, 2015 at 12:03:35PM -0700, Luck, Tony wrote:
> On Fri, Jun 12, 2015 at 08:42:33AM +0000, Naoya Horiguchi wrote:
> > 4?) I don't have the whole picture of how address range mirroring works,
> > but I'm curious about what happens when an uncorrected memory error happens
> > on a mirror page. If HW/FW do some useful work invisible from kernel,
> > please document it somewhere. And my questions are:
> > - can the kernel with this patchset really continue its operation without
> > breaking consistency? More specifically, the corrupted page is replaced with
> > its mirror page, but can any other pages which have references (like struct
> > page or pfn) for the corrupted page properly switch these references to the
> > mirror page? Or no worry about that? (This is difficult for kernel pages
> > like slab, and that's why currently hwpoison doesn't handle any kernel pages.)
>
> The mirror is operated by h/w (perhaps with some platform firmware
> intervention when things start breaking badly).
>
> In normal operation there are two DIMM addresses backing each
> system physical address in the mirrored range (thus total system
> memory capacity is reduced when mirror is enabled). Memory writes
> are directed to both locations. Memory reads are interleaved to
> maintain bandwidth, so could come from either address.

I had misunderstood this: I thought both the mirrored page and its mirror copy
were visible to the OS, which is incorrect.

> When a read returns with an ECC failure the h/w automatically:
> 1) Re-issues the read to the other DIMM address. If that also fails - then
> we do the normal machine check processing for an uncorrected error
> 2) But if the other side of the mirror is good, we can send the good
> data to the reader (cpu, or dma) and, in parallel try to fix the
> bad side by writing the good data to it.
> 3) A corrected error will be logged, it may indicate whether the
> attempt to fix succeeded or not.
> 4) If platform firmware wants, it can be notified of the correction
> and it may keep statistics on the rate of errors, correction status,
> etc. If things get very bad it may "break" the mirror and direct
> all future reads to the remaining "good" side. If it does this, it will
> likely tell the OS via some ACPI method.

Thanks, this fully answered my question.

> All of this is done at much less than page granularity. Cache coherence
> is maintained ... apart from some small performance glitches and the corrected
> error logs, the OS is unaware of all of this.
>
> Note that in current implementations the mirror copies are both behind
> the same memory controller ... so this isn't intended to cope with high
> level failure of a memory controller ... just to deal with randomly
> distributed ECC errors.

OK, I looked at "Memory Address Range Mirroring Validation Guide" and Fig 2-2
clearly shows that.

> > - How can we test/confirm that the whole scheme works fine? Is current memory
> > error injection framework enough?
>
> Still working on that piece. To validate you need to be able to
> inject errors to just one side of the mirror, and I'm not really
> sure that the ACPI/EINJ interface is up to the task.

OK.

Thanks,
Naoya Horiguchi

2015-06-15 08:48:16

by Kamezawa Hiroyuki

Subject: Re: [RFC PATCH 10/12] mm: add the buddy system interface

On 2015/06/11 5:40, Luck, Tony wrote:
>> I guess, mirrored memory should be allocated if !__GFP_HIGHMEM or !__GFP_MOVABLE
>
> HIGHMEM shouldn't matter - partial memory mirror only makes any sense on X86_64 systems ... 32-bit kernels
> don't even boot on systems with 64GB, and the minimum rational configuration for a machine that supports
> mirror is 128GB (4 cpu sockets * 2 memory controller per socket * 4 channels per controller * 4GB DIMM ...
> leaving any channels empty likely leaves you short of memory bandwidth for these high core count processors).
>
> MOVABLE is mostly the opposite of MIRROR - we never want to fill a kernel allocation from a MOVABLE page. I
> want all kernel allocations to be from MIRROR.
>

So, there are 3 ideas.

(1) kernel only from MIRROR / user only from MOVABLE (Tony)
(2) kernel only from MIRROR / user from MOVABLE + MIRROR(ASAP) (AKPM suggested)
This makes use of the fact that MOVABLE memory is reclaimable, but Tony pointed out
that memory reclaim can be critical for GFP_ATOMIC.
(3) kernel only from MIRROR / user from MOVABLE, special user from MIRROR (Xishi)

2 implementation ideas:
- creating a new ZONE
- creating a new allocation attribute

I'm not convinced that we need a new structure in mm. Wouldn't it be good to use
ZONE_MOVABLE for non-mirrored memory?
Then, disable fallback from ZONE_MOVABLE -> ZONE_NORMAL for (1) and (3).
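
To make the difference between (1), (2) and (3) concrete, here is a standalone
C sketch that maps "who is asking" to "which pools may serve the request". All
names and the bitmask encoding are assumptions for illustration; none of this
exists in the patchset, and a "special user" is treated like any other user
under (1) and (2).

```c
#include <assert.h>

#define DEMO_POOL_MIRROR   (1u << 0)
#define DEMO_POOL_MOVABLE  (1u << 1)

enum demo_policy { DEMO_IDEA_1, DEMO_IDEA_2, DEMO_IDEA_3 };
enum demo_requester { DEMO_REQ_KERNEL, DEMO_REQ_USER, DEMO_REQ_SPECIAL_USER };

/* Return the set of pools a request may be served from. */
static unsigned int demo_allowed_pools(enum demo_policy p,
				       enum demo_requester r)
{
	if (r == DEMO_REQ_KERNEL)
		return DEMO_POOL_MIRROR;	/* common to all three ideas */

	switch (p) {
	case DEMO_IDEA_1:
		return DEMO_POOL_MOVABLE;
	case DEMO_IDEA_2:
		return DEMO_POOL_MOVABLE | DEMO_POOL_MIRROR;
	case DEMO_IDEA_3:
		return r == DEMO_REQ_SPECIAL_USER ? DEMO_POOL_MIRROR
						  : DEMO_POOL_MOVABLE;
	}
	assert(!"unknown policy");
	return 0;
}
```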

Thanks,
-Kame

2015-06-15 17:20:46

by Luck, Tony

Subject: Re: [RFC PATCH 10/12] mm: add the buddy system interface

On Mon, Jun 15, 2015 at 05:47:27PM +0900, Kamezawa Hiroyuki wrote:
> So, there are 3 ideas.
>
> (1) kernel only from MIRROR / user only from MOVABLE (Tony)
> (2) kernel only from MIRROR / user from MOVABLE + MIRROR(ASAP) (AKPM suggested)
> This makes use of the fact MOVABLE memory is reclaimable but Tony pointed out
> the memory reclaim can be critical for GFP_ATOMIC.
> (3) kernel only from MIRROR / user from MOVABLE, special user from MIRROR (Xishi)
>
> 2 Implementation ideas.
> - creating ZONE
> - creating new allocation attribute
>
> I don't convince whether we need some new structure in mm. Isn't it good to use
> ZONE_MOVABLE for not-mirrored memory ?
> Then, disable fallback from ZONE_MOVABLE -> ZONE_NORMAL for (1) and (3)

We might need to rename it ... right now the memory hotplug
people use ZONE_MOVABLE to indicate regions of physical memory
that can be removed from the system. I'm wondering whether
people will want systems that have both removable and mirrored
areas? Then we have four attribute combinations:

mirror=no removable=no - prefer to use for user, could use for kernel if we run out of mirror
mirror=no removable=yes - can only be used for user (kernel allocation makes it not-removable)
mirror=yes removable=no - use for kernel, possibly for special users if we define some interface
mirror=yes removable=yes - must not use for kernel ... would have to give to user ... seems like a bad idea to configure a system this way
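
The four combinations can also be written down as a small lookup in standalone
C. The enum values and names below are assumptions for illustration only;
"special only" stands for the "special users if we define some interface" case.

```c
#include <stdbool.h>

enum demo_use { DEMO_USE_NEVER, DEMO_USE_FALLBACK, DEMO_USE_SPECIAL_ONLY,
		DEMO_USE_PREFERRED };

struct demo_region_use {
	enum demo_use kernel;
	enum demo_use user;
};

static struct demo_region_use demo_classify_region(bool mirror, bool removable)
{
	struct demo_region_use u;

	if (removable) {
		/* A kernel allocation would make the region not-removable. */
		u.kernel = DEMO_USE_NEVER;
		u.user = DEMO_USE_PREFERRED;
	} else if (mirror) {
		u.kernel = DEMO_USE_PREFERRED;
		u.user = DEMO_USE_SPECIAL_ONLY;
	} else {
		u.kernel = DEMO_USE_FALLBACK;	/* only if we run out of mirror */
		u.user = DEMO_USE_PREFERRED;
	}
	return u;
}
```

Note that the sketch gives the same answer for both removable=yes rows; the
mirror=yes removable=yes row is the one that looks like a bad way to configure
a system.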

-Tony

2015-06-16 00:32:23

by Kamezawa Hiroyuki

Subject: Re: [RFC PATCH 10/12] mm: add the buddy system interface

On 2015/06/16 2:20, Luck, Tony wrote:
> On Mon, Jun 15, 2015 at 05:47:27PM +0900, Kamezawa Hiroyuki wrote:
>> So, there are 3 ideas.
>>
>> (1) kernel only from MIRROR / user only from MOVABLE (Tony)
>> (2) kernel only from MIRROR / user from MOVABLE + MIRROR(ASAP) (AKPM suggested)
>> This makes use of the fact MOVABLE memory is reclaimable but Tony pointed out
>> the memory reclaim can be critical for GFP_ATOMIC.
>> (3) kernel only from MIRROR / user from MOVABLE, special user from MIRROR (Xishi)
>>
>> 2 Implementation ideas.
>> - creating ZONE
>> - creating new allocation attribute
>>
>> I don't convince whether we need some new structure in mm. Isn't it good to use
>> ZONE_MOVABLE for not-mirrored memory ?
>> Then, disable fallback from ZONE_MOVABLE -> ZONE_NORMAL for (1) and (3)
>
> We might need to rename it ... right now the memory hotplug
> people use ZONE_MOVABLE to indicate regions of physical memory
> that can be removed from the system. I'm wondering whether
> people will want systems that have both removable and mirrored
> areas? Then we have four attribute combinations:
>
> mirror=no removable=no - prefer to use for user, could use for kernel if we run out of mirror
> mirror=no removable=yes - can only be used for user (kernel allocation makes it not-removable)
> mirror=yes removable=no - use for kernel, possibly for special users if we define some interface
> mirror=yes removable=yes - must not use for kernel ... would have to give to user ... seems like a bad idea to configure a system this way
>

Thank you for the clarification. I see that the "mirror=no, removable=no" case may require a new name.

IMHO, the value of Address-Based-Memory-Mirror is that users can protect their system's
important functions without using full-memory mirror. So, I feel that considering
"mirror=no, removable=no" just makes our discussion/implementation complex without real
user value.

Shouldn't we start by thinking about just these 2 cases:
mirror=no removable=yes
mirror=yes removable=no
?

And then, if the naming is a problem, an alias name can be added.

Thanks,
-Kame





2015-06-16 07:53:53

by Vlastimil Babka

Subject: Re: [RFC PATCH 00/12] mm: mirrored memory support for page buddy allocations

On 06/04/2015 02:54 PM, Xishi Qiu wrote:
> Intel Xeon processor E7 v3 product family-based platforms introduces support
> for partial memory mirroring called as 'Address Range Mirroring'. This feature
> allows BIOS to specify a subset of total available memory to be mirrored (and
> optionally also specify whether to mirror the range 0-4 GB). This capability
> allows user to make an appropriate tradeoff between non-mirrored memory range
> and mirrored memory range thus optimizing total available memory and still
> achieving highly reliable memory range for mission critical workloads and/or
> kernel space.
>
> Tony has already sent a patchset to support this feature at boot time.
> https://lkml.org/lkml/2015/5/8/521
>
> This patchset can support the feature after boot time. It introduces mirror_info
> to save the mirrored memory range. Then use __GFP_MIRROR to allocate mirrored
> pages.
>
> I think adding a new migratetype is better and easier than a new zone, so I use

If the mirrored memory is in a single reasonably compact (no large holes) range
(per NUMA node) and won't dynamically change its size, then zone might be a
better option. For one thing, it will still allow distinguishing movable and
unmovable allocations within the mirrored memory.

We had enough fun with MIGRATE_CMA and all kinds of checks it added to allocator
hot paths, and even CMA is now considering moving to a separate zone.

> MIGRATE_MIRROR to manage the mirrored pages. However it changed some code in the
> core file, please review and comment, thanks.
>
> TBD:
> 1) call add_mirror_info() to fill mirrored memory info.
> 2) add compatibility with memory online/offline.
> 3) add more interface? others?
>
> Xishi Qiu (12):
> mm: add a new config to manage the code
> mm: introduce mirror_info
> mm: introduce MIGRATE_MIRROR to manage the mirrored pages
> mm: add mirrored pages to buddy system
> mm: introduce a new zone_stat_item NR_FREE_MIRROR_PAGES
> mm: add free mirrored pages info
> mm: introduce __GFP_MIRROR to allocate mirrored pages
> mm: use mirrorable to switch allocate mirrored memory
> mm: enable allocate mirrored memory at boot time
> mm: add the buddy system interface
> mm: add the PCP interface
> mm: let slab/slub/slob use mirrored memory
>
> arch/x86/mm/numa.c | 3 ++
> drivers/base/node.c | 17 ++++---
> fs/proc/meminfo.c | 6 +++
> include/linux/gfp.h | 5 +-
> include/linux/mmzone.h | 23 +++++++++
> include/linux/vmstat.h | 2 +
> kernel/sysctl.c | 9 ++++
> mm/Kconfig | 8 +++
> mm/page_alloc.c | 134 ++++++++++++++++++++++++++++++++++++++++++++++---
> mm/slab.c | 3 +-
> mm/slob.c | 2 +-
> mm/slub.c | 2 +-
> mm/vmstat.c | 4 ++
> 13 files changed, 202 insertions(+), 16 deletions(-)
>

2015-06-16 08:18:36

by Xishi Qiu

Subject: Re: [RFC PATCH 00/12] mm: mirrored memory support for page buddy allocations

On 2015/6/16 15:53, Vlastimil Babka wrote:

> On 06/04/2015 02:54 PM, Xishi Qiu wrote:
>> Intel Xeon processor E7 v3 product family-based platforms introduces support
>> for partial memory mirroring called as 'Address Range Mirroring'. This feature
>> allows BIOS to specify a subset of total available memory to be mirrored (and
>> optionally also specify whether to mirror the range 0-4 GB). This capability
>> allows user to make an appropriate tradeoff between non-mirrored memory range
>> and mirrored memory range thus optimizing total available memory and still
>> achieving highly reliable memory range for mission critical workloads and/or
>> kernel space.
>>
>> Tony has already sent a patchset to support this feature at boot time.
>> https://lkml.org/lkml/2015/5/8/521
>>
>> This patchset can support the feature after boot time. It introduces mirror_info
>> to save the mirrored memory range. Then use __GFP_MIRROR to allocate mirrored
>> pages.
>>
>> I think adding a new migratetype is better and easier than a new zone, so I use
>
> If the mirrored memory is in a single reasonably compact (no large holes) range
> (per NUMA node) and won't dynamically change its size, then zone might be a
> better option. For one thing, it will still allow distinguishing movable and
> unmovable allocations within the mirrored memory.
>
> We had enough fun with MIGRATE_CMA and all kinds of checks it added to allocator
> hot paths, and even CMA is now considering moving to a separate zone.
>

Hi, what about a case like this:
e.g. node 0: 0-4G(dma and dma32)
node 1: 4G-8G(normal), 8-12G(mirror), 12-16G(normal),
Would there be more than one normal zone in a node, or would the normal zone just span the mirror zone?

Thanks,
Xishi Qiu

2015-06-16 09:46:31

by Vlastimil Babka

Subject: Re: [RFC PATCH 00/12] mm: mirrored memory support for page buddy allocations

On 06/16/2015 10:17 AM, Xishi Qiu wrote:
> On 2015/6/16 15:53, Vlastimil Babka wrote:
>
>> On 06/04/2015 02:54 PM, Xishi Qiu wrote:
>>>
>>> I think adding a new migratetype is better and easier than a new zone, so I use
>>
>> If the mirrored memory is in a single reasonably compact (no large holes) range
>> (per NUMA node) and won't dynamically change its size, then zone might be a
>> better option. For one thing, it will still allow distinguishing movable and
>> unmovable allocations within the mirrored memory.
>>
>> We had enough fun with MIGRATE_CMA and all kinds of checks it added to allocator
>> hot paths, and even CMA is now considering moving to a separate zone.
>>
>
> Hi, how about the problem of this case:
> e.g. node 0: 0-4G(dma and dma32)
> node 1: 4G-8G(normal), 8-12G(mirror), 12-16G(normal),
> so more than one normal zone in a node? or normal zone just span the mirror zone?

Normal zone can span the mirror zone just fine. However, it will cause zone
scanners such as compaction to skip over the mirror zone inefficiently. Hmm...

2015-06-18 01:23:43

by Xishi Qiu

Subject: Re: [RFC PATCH 00/12] mm: mirrored memory support for page buddy allocations

On 2015/6/16 17:46, Vlastimil Babka wrote:

> On 06/16/2015 10:17 AM, Xishi Qiu wrote:
>> On 2015/6/16 15:53, Vlastimil Babka wrote:
>>
>>> On 06/04/2015 02:54 PM, Xishi Qiu wrote:
>>>>
>>>> I think adding a new migratetype is better and easier than a new zone, so I use
>>>
>>> If the mirrored memory is in a single reasonably compact (no large holes) range
>>> (per NUMA node) and won't dynamically change its size, then zone might be a
>>> better option. For one thing, it will still allow distinguishing movable and
>>> unmovable allocations within the mirrored memory.
>>>
>>> We had enough fun with MIGRATE_CMA and all kinds of checks it added to allocator
>>> hot paths, and even CMA is now considering moving to a separate zone.
>>>
>>
>> Hi, how about the problem of this case:
>> e.g. node 0: 0-4G(dma and dma32)
>> node 1: 4G-8G(normal), 8-12G(mirror), 12-16G(normal),
>> so more than one normal zone in a node? or normal zone just span the mirror zone?
>
> Normal zone can span the mirror zone just fine. However, it will result in zone
> scanners such as compaction to skip over the mirror zone inefficiently. Hmm...
>

Hi Vlastimil,

If there are many mirror regions in one node, then there will be many holes in the
normal zone. Is this fine?

Thanks,
Xishi Qiu



2015-06-18 05:58:20

by Vlastimil Babka

Subject: Re: [RFC PATCH 00/12] mm: mirrored memory support for page buddy allocations

On 18.6.2015 3:23, Xishi Qiu wrote:
> On 2015/6/16 17:46, Vlastimil Babka wrote:
>
>> On 06/16/2015 10:17 AM, Xishi Qiu wrote:
>>> On 2015/6/16 15:53, Vlastimil Babka wrote:
>>>
>>>> On 06/04/2015 02:54 PM, Xishi Qiu wrote:
>>>>>
>>>>> I think adding a new migratetype is better and easier than a new zone, so I use
>>>>
>>>> If the mirrored memory is in a single reasonably compact (no large holes) range
>>>> (per NUMA node) and won't dynamically change its size, then zone might be a
>>>> better option. For one thing, it will still allow distinguishing movable and
>>>> unmovable allocations within the mirrored memory.
>>>>
>>>> We had enough fun with MIGRATE_CMA and all kinds of checks it added to allocator
>>>> hot paths, and even CMA is now considering moving to a separate zone.
>>>>
>>>
>>> Hi, how about the problem of this case:
>>> e.g. node 0: 0-4G(dma and dma32)
>>> node 1: 4G-8G(normal), 8-12G(mirror), 12-16G(normal),
>>> so more than one normal zone in a node? or normal zone just span the mirror zone?
>>
>> Normal zone can span the mirror zone just fine. However, it will result in zone
>> scanners such as compaction to skip over the mirror zone inefficiently. Hmm...

On the other hand, it would skip just as inefficiently over MIGRATE_MIRROR
pageblocks within a Normal zone, since migrating pages between MIGRATE_MIRROR
pageblocks and pageblocks of other types would violate what the allocations requested.

Having separate zone instead would allow compaction to run specifically on the
zone and defragment movable allocations there (i.e. userspace pages if/when
userspace requesting mirrored memory is supported).

>>
>
> Hi Vlastimil,
>
> If there are many mirror regions in one node, then it will be many holes in the
> normal zone, is this fine?

Yeah, it doesn't matter how many holes there are.

> Thanks,
> Xishi Qiu

2015-06-18 09:38:09

by Xishi Qiu

Subject: Re: [RFC PATCH 00/12] mm: mirrored memory support for page buddy allocations

On 2015/6/18 13:58, Vlastimil Babka wrote:

> On 18.6.2015 3:23, Xishi Qiu wrote:
>> On 2015/6/16 17:46, Vlastimil Babka wrote:
>>
>>> On 06/16/2015 10:17 AM, Xishi Qiu wrote:
>>>> On 2015/6/16 15:53, Vlastimil Babka wrote:
>>>>
>>>>> On 06/04/2015 02:54 PM, Xishi Qiu wrote:
>>>>>>
>>>>>> I think adding a new migratetype is better and easier than a new zone, so I use
>>>>>
>>>>> If the mirrored memory is in a single reasonably compact (no large holes) range
>>>>> (per NUMA node) and won't dynamically change its size, then zone might be a
>>>>> better option. For one thing, it will still allow distinguishing movable and
>>>>> unmovable allocations within the mirrored memory.
>>>>>
>>>>> We had enough fun with MIGRATE_CMA and all kinds of checks it added to allocator
>>>>> hot paths, and even CMA is now considering moving to a separate zone.
>>>>>
>>>>
>>>> Hi, how about the problem of this case:
>>>> e.g. node 0: 0-4G(dma and dma32)
>>>> node 1: 4G-8G(normal), 8-12G(mirror), 12-16G(normal),
>>>> so more than one normal zone in a node? or normal zone just span the mirror zone?
>>>
>>> Normal zone can span the mirror zone just fine. However, it will result in zone
>>> scanners such as compaction to skip over the mirror zone inefficiently. Hmm...
>
> On the other hand, it would skip just as inefficiently over MIGRATE_MIRROR
> pageblocks within a Normal zone. Since migrating pages between MIGRATE_MIRROR
> and other types pageblocks would violate what the allocations requested.
>
> Having separate zone instead would allow compaction to run specifically on the
> zone and defragment movable allocations there (i.e. userspace pages if/when
> userspace requesting mirrored memory is supported).
>
>>>
>>
>> Hi Vlastimil,
>>
>> If there are many mirror regions in one node, then it will be many holes in the
>> normal zone, is this fine?
>
> Yeah, it doesn't matter how many holes there are.

So mirror zone and normal zone will span each other, right?

e.g. node 1: 4G-8G(normal), 8-12G(mirror), 12-16G(normal), 16-24G(mirror), 24-28G(normal) ...
normal: start=4G, size=28-4=24G,
mirror: start=8G, size=24-8=16G,
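
The span arithmetic above can be checked with a small sketch. It assumes that a
zone's span simply runs from the lowest start to the highest end of its ranges,
holes included, as the example implies. `struct range`, `zone_span()` and the
GiB units are illustrative names for this sketch, not kernel interfaces.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one physical range, in GiB. */
struct range {
	uint64_t start;
	uint64_t end;		/* exclusive */
	int mirrored;		/* 1 = mirrored, 0 = normal */
};

/*
 * Compute the span (start and size, holes included) covered by all
 * ranges of the requested kind. Returns 0 if no such range exists.
 */
static int zone_span(const struct range *r, size_t n, int mirrored,
		     uint64_t *start, uint64_t *size)
{
	uint64_t lo = UINT64_MAX, hi = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (r[i].mirrored != mirrored)
			continue;
		if (r[i].start < lo)
			lo = r[i].start;
		if (r[i].end > hi)
			hi = r[i].end;
	}
	if (hi == 0)
		return 0;
	*start = lo;
	*size = hi - lo;
	return 1;
}

/* Node 1 layout from the example above. */
static const struct range node1[] = {
	{  4,  8, 0 }, {  8, 12, 1 }, { 12, 16, 0 },
	{ 16, 24, 1 }, { 24, 28, 0 },
};
```

For `node1` this gives a normal span starting at 4G with size 24G and a mirror
span starting at 8G with size 16G, so each span contains parts of the other.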

I think a zone is defined by a special address range, like 16M (DMA) or 4G (DMA32).
Is it appropriate to add a new mirror zone whose physical address range is volatile?

Thanks,
Xishi Qiu

2015-06-18 09:56:01

by Vlastimil Babka

Subject: Re: [RFC PATCH 00/12] mm: mirrored memory support for page buddy allocations

On 06/18/2015 11:37 AM, Xishi Qiu wrote:
> On 2015/6/18 13:58, Vlastimil Babka wrote:
>
>> On 18.6.2015 3:23, Xishi Qiu wrote:
>>> On 2015/6/16 17:46, Vlastimil Babka wrote:
>>>
>>
>> On the other hand, it would skip just as inefficiently over MIGRATE_MIRROR
>> pageblocks within a Normal zone. Since migrating pages between MIGRATE_MIRROR
>> and other types pageblocks would violate what the allocations requested.
>>
>> Having separate zone instead would allow compaction to run specifically on the
>> zone and defragment movable allocations there (i.e. userspace pages if/when
>> userspace requesting mirrored memory is supported).
>>
>>>>
>>>
>>> Hi Vlastimil,
>>>
>>> If there are many mirror regions in one node, then it will be many holes in the
>>> normal zone, is this fine?
>>
>> Yeah, it doesn't matter how many holes there are.
>
> So mirror zone and normal zone will span each other, right?
>
> e.g. node 1: 4G-8G(normal), 8-12G(mirror), 12-16G(normal), 16-24G(mirror), 24-28G(normal) ...
> normal: start=4G, size=28-4=24G,
> mirror: start=8G, size=24-8=16G,

Yes, that works. It's somewhat unfortunate wrt performance that the
hardware does it like this though.

> I think zone is defined according to the special address range, like 16M(DMA), 4G(DMA32),

Traditionally yes. But then there is ZONE_MOVABLE, and at this year's LSF/MM
we discussed (and didn't outright reject) ZONE_CMA...
I'm not saying others will favour the new zone approach though, it's
just my opinion that it might be a better option than a new migratetype.

> and is it appropriate to add a new mirror zone with a volatile physical address?

By "volatile" you mean what, that the example above would change
dynamically? That would be rather challenging...

> Thanks,
> Xishi Qiu
>

2015-06-18 20:33:46

by Luck, Tony

Subject: Re: [RFC PATCH 00/12] mm: mirrored memory support for page buddy allocations

On Thu, Jun 18, 2015 at 11:55:42AM +0200, Vlastimil Babka wrote:
> >>>If there are many mirror regions in one node, then it will be many holes in the
> >>>normal zone, is this fine?
> >>
> >>Yeah, it doesn't matter how many holes there are.
> >
> >So mirror zone and normal zone will span each other, right?
> >
> >e.g. node 1: 4G-8G(normal), 8-12G(mirror), 12-16G(normal), 16-24G(mirror), 24-28G(normal) ...
> >normal: start=4G, size=28-4=24G,
> >mirror: start=8G, size=24-8=16G,
>
> Yes, that works. It's somewhat unfortunate wrt performance that the hardware
> does it like this though.

With current Xeon h/w you can have one mirrored range per memory
controller ... and there are two memory controllers on a cpu socket,
so two mirrored ranges per node. So a map might look like:

SKT0: MC0:    0-2G    Mirrored (but we may want to ignore mirror here to keep it for ZONE_DMA)
SKT0: MC0:   2G-4G    No memory ... I/O mapping area
SKT0: MC0:   4G-34G   Not mirrored
SKT0: MC1:  34G-40G   Mirrored
SKT0: MC1:  40G-66G   Not mirrored

SKT1: MC0:  66G-70G   Mirrored
SKT1: MC0:  70G-98G   Not mirrored
SKT1: MC1:  98G-102G  Mirrored
SKT1: MC1: 102G-130G  Not mirrored

... and so on.
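
As a rough illustration of what such a map means for lookups, the sketch below
stores the mirrored ranges from the example in a table and tests whether an
address falls inside one. The names `mirror_map` and `addr_is_mirrored()` are
made up for this sketch, and it assumes the 0-2G range is left out of the table
(the "ignore mirror here" option mentioned in the map).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GiB (1ULL << 30)

/* Hypothetical table entry: one mirrored physical range. */
struct mirror_range {
	uint64_t start;
	uint64_t end;		/* exclusive */
};

/* Mirrored ranges from the example map; 0-2G deliberately omitted. */
static const struct mirror_range mirror_map[] = {
	{ 34 * GiB,  40 * GiB },	/* SKT0: MC1 */
	{ 66 * GiB,  70 * GiB },	/* SKT1: MC0 */
	{ 98 * GiB, 102 * GiB },	/* SKT1: MC1 */
};

/* Linear scan; with only a few ranges per node the table stays small. */
static bool addr_is_mirrored(uint64_t addr)
{
	size_t i;

	for (i = 0; i < sizeof(mirror_map) / sizeof(mirror_map[0]); i++) {
		if (addr >= mirror_map[i].start && addr < mirror_map[i].end)
			return true;
	}
	return false;
}
```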

> >I think zone is defined according to the special address range, like 16M(DMA), 4G(DMA32),
>
> Traditionally yes. But then there is ZONE_MOVABLE, this year's LSF/MM we
> discussed (and didn't outright deny) ZONE_CMA...
> I'm not saying others will favour the new zone approach though, it's just my
> opinion that it might be a better option than a new migratetype.

If we are going to have lots of zones ... then perhaps we will
need a fast way to look at a "struct page" and decide which zone
it belongs to. Complicated math on the address doesn't sound ideal.
If the complex zone model is just for 64-bit, are there enough bits
available in page->flags (3 bits for 8 options ... which we are close
to filling now ... 4 bits for future breathing room)?
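
The bit counts mentioned here are plain arithmetic: the field has to be wide
enough to encode every zone index. A minimal sketch follows. `zone_bits()` is
a hypothetical helper, not a kernel function, and says nothing about how
page->flags is actually laid out.

```c
/* Smallest number of bits able to encode nr_zones distinct zone ids. */
static unsigned int zone_bits(unsigned int nr_zones)
{
	unsigned int bits = 0;

	/* 64-bit shift so the loop also terminates for very large counts. */
	while ((1ULL << bits) < nr_zones)
		bits++;
	return bits;
}
```

Eight zones fit in 3 bits; a ninth zone needs a 4th bit, which then leaves
room for up to 16.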

> >and is it appropriate to add a new mirror zone with a volatile physical address?
>
> By "volatile" you mean what, that the example above would change
> dynamically? That would be rather challenging...

If we hot-add another cpu together with on-die memory controllers connected
to more memory ... then some of the new memory might be mirrored. Current
h/w doesn't allow mirrored areas to grow/shrink (though if there are a lot
of errors we may break a mirror so a whole range could lose the mirror attribute).

-Tony

2015-06-19 01:37:27

by Xishi Qiu

Subject: Re: [RFC PATCH 00/12] mm: mirrored memory support for page buddy allocations

On 2015/6/19 4:33, Luck, Tony wrote:

> On Thu, Jun 18, 2015 at 11:55:42AM +0200, Vlastimil Babka wrote:
>>>>> If there are many mirror regions in one node, then it will be many holes in the
>>>>> normal zone, is this fine?
>>>>
>>>> Yeah, it doesn't matter how many holes there are.
>>>
>>> So mirror zone and normal zone will span each other, right?
>>>
>>> e.g. node 1: 4G-8G(normal), 8-12G(mirror), 12-16G(normal), 16-24G(mirror), 24-28G(normal) ...
>>> normal: start=4G, size=28-4=24G,
>>> mirror: start=8G, size=24-8=16G,
>>
>> Yes, that works. It's somewhat unfortunate wrt performance that the hardware
>> does it like this though.
>
> With current Xeon h/w you can have one mirrored range per memory
> controller ... and there are two memory controllers on a cpu socket,
> so two mirrored ranges per node. So a map might look like:
>
> SKT0: MC0: 0-2G Mirrored (but we may want to ignore mirror here to keep it for ZONE_DMA)
> SKT0: MC0: 2G-4G No memory ... I/O mapping area
> SKT0: MC0: 4G-34G Not mirrored
> SKT0: MC1: 34G-40G Mirrored
> SKT0: MC1: 40G-66G Not mirrored
>
> SKT1: MC0: 66G-70G Mirror
> SKT1: MC0: 70G-98G Not Mirrored
> SKT1: MC1: 98G-102G Mirror
> SKT1: MC1: 102G-130G Not Mirrored
>
> ... and so on.
>
>>> I think zone is defined according to the special address range, like 16M(DMA), 4G(DMA32),
>>
>> Traditionally yes. But then there is ZONE_MOVABLE, this year's LSF/MM we
>> discussed (and didn't outright deny) ZONE_CMA...
>> I'm not saying others will favour the new zone approach though, it's just my
>> opinion that it might be a better option than a new migratetype.
>
> If we are going to have lots of zones ... then perhaps we will
> need a fast way to look at a "struct page" and decide which zone
> it belongs to. Complicated math on the address deosn't sound ideal.
> If the complex zone model is just for 64-bit, are there enough bits
> available in page->flags (3 bits for 8 options ... which we are close
> to filling now ... 4 bits for future breathing room).
>
>>> and is it appropriate to add a new mirror zone with a volatile physical address?
>>
>> By "volatile" you mean what, that the example above would change
>> dynamically? That would be rather challenging...
>
> If we hot-add another cpu together with on die memory controllers connected
> to more memory ... then some of the new memory might be mirrored. Current
> h/w doesn't allow mirrored areas to grow/shrink (though if there are a lot
> of errors we may break a mirror so a whole range could lose the mirror attribute).
>
> -Tony
>

Hi Tony,

What is your suggestion: a new zone or a new migratetype?
Adding a new zone may require changing more mm code.

Thanks,
Xishi Qiu

> .
>


2015-06-19 18:42:49

by Luck, Tony

Subject: RE: [RFC PATCH 00/12] mm: mirrored memory support for page buddy allocations

> What's your suggestions? a new zone or a new migratetype?
> Maybe add a new zone will change more mm code.

I don't understand this code well enough (yet) to make a recommendation. I think
our primary concern may not be "how much code we change", but more "how can
we minimize the run-time impact on systems that don't have any mirrored memory?"

Just putting all the heavy work behind a CONFIG option isn't sufficient ... we want
enterprise distributions to ship with the option turned on ... even though most
machines won't be using this feature.

-Tony

2015-06-25 09:49:00

by Xishi Qiu

Subject: Re: [RFC PATCH 10/12] mm: add the buddy system interface

On 2015/6/10 11:06, Kamezawa Hiroyuki wrote:

> On 2015/06/09 19:04, Xishi Qiu wrote:
>> On 2015/6/9 15:12, Kamezawa Hiroyuki wrote:
>>
>>> On 2015/06/04 22:04, Xishi Qiu wrote:
>>>> Add the buddy system interface for address range mirroring feature.
>>>> Allocate mirrored pages in MIGRATE_MIRROR list. If there is no mirrored pages
>>>> left, use other types pages.
>>>>
>>>> Signed-off-by: Xishi Qiu <[email protected]>
>>>> ---
>>>> mm/page_alloc.c | 40 +++++++++++++++++++++++++++++++++++++++-
>>>> 1 file changed, 39 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> index d4d2066..0fb55288 100644
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -599,6 +599,26 @@ static inline bool is_mirror_pfn(unsigned long pfn)
>>>>
>>>> return false;
>>>> }
>>>> +
>>>> +static inline bool change_to_mirror(gfp_t gfp_flags, int high_zoneidx)
>>>> +{
>>>> + /*
>>>> + * Do not alloc mirrored memory below 4G, because 0-4G is
>>>> + * all mirrored by default, and the list is always empty.
>>>> + */
>>>> + if (high_zoneidx < ZONE_NORMAL)
>>>> + return false;
>>>> +
>>>> + /* Alloc mirrored memory for only kernel */
>>>> + if (gfp_flags & __GFP_MIRROR)
>>>> + return true;
>>>
>>> GFP_KERNEL itself should imply mirror, I think.
>>>
>>
>> Hi Kame,
>>
>> How about like this: #define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_MIRROR) ?
>>
>
> Hm.... it cannot cover GFP_ATOMIC at el.
>
> I guess, mirrored memory should be allocated if !__GFP_HIGHMEM or !__GFP_MOVABLE


Hi Kame,

Can we distinguish allocations from user or kernel by GFP flags alone?

Thanks,
Xishi Qiu

2015-06-25 23:55:15

by Kamezawa Hiroyuki

Subject: Re: [RFC PATCH 10/12] mm: add the buddy system interface

On 2015/06/25 18:44, Xishi Qiu wrote:
> On 2015/6/10 11:06, Kamezawa Hiroyuki wrote:
>
>> On 2015/06/09 19:04, Xishi Qiu wrote:
>>> On 2015/6/9 15:12, Kamezawa Hiroyuki wrote:
>>>
>>>> On 2015/06/04 22:04, Xishi Qiu wrote:
>>>>> Add the buddy system interface for address range mirroring feature.
>>>>> Allocate mirrored pages in MIGRATE_MIRROR list. If there is no mirrored pages
>>>>> left, use other types pages.
>>>>>
>>>>> Signed-off-by: Xishi Qiu <[email protected]>
>>>>> ---
>>>>> mm/page_alloc.c | 40 +++++++++++++++++++++++++++++++++++++++-
>>>>> 1 file changed, 39 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>> index d4d2066..0fb55288 100644
>>>>> --- a/mm/page_alloc.c
>>>>> +++ b/mm/page_alloc.c
>>>>> @@ -599,6 +599,26 @@ static inline bool is_mirror_pfn(unsigned long pfn)
>>>>>
>>>>> return false;
>>>>> }
>>>>> +
>>>>> +static inline bool change_to_mirror(gfp_t gfp_flags, int high_zoneidx)
>>>>> +{
>>>>> + /*
>>>>> + * Do not alloc mirrored memory below 4G, because 0-4G is
>>>>> + * all mirrored by default, and the list is always empty.
>>>>> + */
>>>>> + if (high_zoneidx < ZONE_NORMAL)
>>>>> + return false;
>>>>> +
>>>>> + /* Alloc mirrored memory for only kernel */
>>>>> + if (gfp_flags & __GFP_MIRROR)
>>>>> + return true;
>>>>
>>>> GFP_KERNEL itself should imply mirror, I think.
>>>>
>>>
>>> Hi Kame,
>>>
>>> How about like this: #define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_MIRROR) ?
>>>
>>
>> Hm.... it cannot cover GFP_ATOMIC at el.
>>
>> I guess, mirrored memory should be allocated if !__GFP_HIGHMEM or !__GFP_MOVABLE
>
>
> Hi Kame,
>
> Can we distinguish allocations form user or kernel only by GFP flags?
>

Allocations for user pages and file caches are now *always* done with __GFP_MOVABLE.

With this flag, pages are allocated from the MIGRATE_MOVABLE migration type.
The MOVABLE migration type means the page can be the target of page
compaction or memory hot-remove.
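
A small sketch of why testing __GFP_MOVABLE covers more callers than adding a
flag to GFP_KERNEL would. All `MOCK_` names and bit values below are invented
for illustration and are not the kernel's real definitions. The composite masks
are simplified: only the GFP_KERNEL composition is taken from this thread, the
others are assumptions.

```c
#include <stdbool.h>

/* Mock flag bits for illustration only; not the kernel's real values. */
#define MOCK_GFP_WAIT		0x01u
#define MOCK_GFP_IO		0x02u
#define MOCK_GFP_FS		0x04u
#define MOCK_GFP_HIGH		0x08u
#define MOCK_GFP_HIGHMEM	0x10u
#define MOCK_GFP_MOVABLE	0x20u

/* Simplified composite masks (assumed shapes, see lead-in). */
#define MOCK_GFP_KERNEL	(MOCK_GFP_WAIT | MOCK_GFP_IO | MOCK_GFP_FS)
#define MOCK_GFP_ATOMIC	(MOCK_GFP_HIGH)
#define MOCK_GFP_HIGHUSER_MOVABLE \
	(MOCK_GFP_KERNEL | MOCK_GFP_HIGHMEM | MOCK_GFP_MOVABLE)

/* A request without the movable bit is treated as a kernel allocation. */
static bool wants_mirror(unsigned int gfp_mask)
{
	return !(gfp_mask & MOCK_GFP_MOVABLE);
}
```

Both the kernel-style and the atomic-style mask pass the test without either
of them needing an extra bit, while the user-style movable mask does not.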

Thanks,
-Kame






2015-06-26 01:51:06

by Xishi Qiu

Subject: Re: [RFC PATCH 10/12] mm: add the buddy system interface

On 2015/6/26 7:54, Kamezawa Hiroyuki wrote:

> On 2015/06/25 18:44, Xishi Qiu wrote:
>> On 2015/6/10 11:06, Kamezawa Hiroyuki wrote:
>>
>>> On 2015/06/09 19:04, Xishi Qiu wrote:
>>>> On 2015/6/9 15:12, Kamezawa Hiroyuki wrote:
>>>>
>>>>> On 2015/06/04 22:04, Xishi Qiu wrote:
>>>>>> Add the buddy system interface for address range mirroring feature.
>>>>>> Allocate mirrored pages in MIGRATE_MIRROR list. If there is no mirrored pages
>>>>>> left, use other types pages.
>>>>>>
>>>>>> Signed-off-by: Xishi Qiu <[email protected]>
>>>>>> ---
>>>>>> mm/page_alloc.c | 40 +++++++++++++++++++++++++++++++++++++++-
>>>>>> 1 file changed, 39 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>>> index d4d2066..0fb55288 100644
>>>>>> --- a/mm/page_alloc.c
>>>>>> +++ b/mm/page_alloc.c
>>>>>> @@ -599,6 +599,26 @@ static inline bool is_mirror_pfn(unsigned long pfn)
>>>>>>
>>>>>> return false;
>>>>>> }
>>>>>> +
>>>>>> +static inline bool change_to_mirror(gfp_t gfp_flags, int high_zoneidx)
>>>>>> +{
>>>>>> + /*
>>>>>> + * Do not alloc mirrored memory below 4G, because 0-4G is
>>>>>> + * all mirrored by default, and the list is always empty.
>>>>>> + */
>>>>>> + if (high_zoneidx < ZONE_NORMAL)
>>>>>> + return false;
>>>>>> +
>>>>>> + /* Alloc mirrored memory for only kernel */
>>>>>> + if (gfp_flags & __GFP_MIRROR)
>>>>>> + return true;
>>>>>
>>>>> GFP_KERNEL itself should imply mirror, I think.
>>>>>
>>>>
>>>> Hi Kame,
>>>>
>>>> How about like this: #define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_MIRROR) ?
>>>>
>>>
>>> Hm.... it cannot cover GFP_ATOMIC at el.
>>>
>>> I guess, mirrored memory should be allocated if !__GFP_HIGHMEM or !__GFP_MOVABLE
>>
>>
>> Hi Kame,
>>
>> Can we distinguish allocations form user or kernel only by GFP flags?
>>
>
> Allocation from user and file caches are now *always* done with __GFP_MOVABLE.
>
> By this, pages will be allocated from MIGRATE_MOVABLE migration type.
> MOVABLE migration type means it's can
> be the target for page compaction or memory-hot-remove.
>
> Thanks,
> -Kame
>

So if we want all kernel memory to be allocated from mirrored memory, how about a change like this?
__alloc_pages_nodemask()
	gfpflags_to_migratetype()
		if (!(gfp_mask & __GFP_MOVABLE))
			return MIGRATE_MIRROR;

Thanks,
Xishi Qiu

>
>
>
>
>
>
>
> .
>


2015-06-26 08:35:41

by Kamezawa Hiroyuki

Subject: Re: [RFC PATCH 10/12] mm: add the buddy system interface

On 2015/06/26 10:43, Xishi Qiu wrote:
> On 2015/6/26 7:54, Kamezawa Hiroyuki wrote:
>
>> On 2015/06/25 18:44, Xishi Qiu wrote:
>>> On 2015/6/10 11:06, Kamezawa Hiroyuki wrote:
>>>
>>>> On 2015/06/09 19:04, Xishi Qiu wrote:
>>>>> On 2015/6/9 15:12, Kamezawa Hiroyuki wrote:
>>>>>
>>>>>> On 2015/06/04 22:04, Xishi Qiu wrote:
>>>>>>> Add the buddy system interface for address range mirroring feature.
>>>>>>> Allocate mirrored pages in MIGRATE_MIRROR list. If there is no mirrored pages
>>>>>>> left, use other types pages.
>>>>>>>
>>>>>>> Signed-off-by: Xishi Qiu <[email protected]>
>>>>>>> ---
>>>>>>> mm/page_alloc.c | 40 +++++++++++++++++++++++++++++++++++++++-
>>>>>>> 1 file changed, 39 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>>>> index d4d2066..0fb55288 100644
>>>>>>> --- a/mm/page_alloc.c
>>>>>>> +++ b/mm/page_alloc.c
>>>>>>> @@ -599,6 +599,26 @@ static inline bool is_mirror_pfn(unsigned long pfn)
>>>>>>>
>>>>>>> return false;
>>>>>>> }
>>>>>>> +
>>>>>>> +static inline bool change_to_mirror(gfp_t gfp_flags, int high_zoneidx)
>>>>>>> +{
>>>>>>> + /*
>>>>>>> + * Do not alloc mirrored memory below 4G, because 0-4G is
>>>>>>> + * all mirrored by default, and the list is always empty.
>>>>>>> + */
>>>>>>> + if (high_zoneidx < ZONE_NORMAL)
>>>>>>> + return false;
>>>>>>> +
>>>>>>> + /* Alloc mirrored memory for only kernel */
>>>>>>> + if (gfp_flags & __GFP_MIRROR)
>>>>>>> + return true;
>>>>>>
>>>>>> GFP_KERNEL itself should imply mirror, I think.
>>>>>>
>>>>>
>>>>> Hi Kame,
>>>>>
>>>>> How about like this: #define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_MIRROR) ?
>>>>>
>>>>
>>>> Hm.... it cannot cover GFP_ATOMIC at el.
>>>>
>>>> I guess, mirrored memory should be allocated if !__GFP_HIGHMEM or !__GFP_MOVABLE
>>>
>>>
>>> Hi Kame,
>>>
>>> Can we distinguish allocations form user or kernel only by GFP flags?
>>>
>>
>> Allocation from user and file caches are now *always* done with __GFP_MOVABLE.
>>
>> By this, pages will be allocated from MIGRATE_MOVABLE migration type.
>> MOVABLE migration type means it's can
>> be the target for page compaction or memory-hot-remove.
>>
>> Thanks,
>> -Kame
>>
>
> So if we want all kernel memory allocated from mirror, how about change like this?
> __alloc_pages_nodemask()
> gfpflags_to_migratetype()
> if (!(gfp_mask & __GFP_MOVABLE))
> return MIGRATE_MIRROR

Using it together with a jump label may reduce the performance impact.
==
static inline bool memory_mirror_enabled(void)
{
	return static_key_false(&memory_mirror_enabled);
}

gfpflags_to_migratetype()
	if (memory_mirror_enabled()) { /* We want to mirror all unmovable pages */
		if (!(gfp_mask & __GFP_MOVABLE))
			return MIGRATE_MIRROR;
	}
==
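
The control flow of that idea can be sketched in plain C as below. This is
only a userspace model: an ordinary boolean stands in for the static key, so it
does not show the code patching that makes a jump label cheap, and the `MOCK_`
and `DEMO_` names are invented. In kernel code the key would be a separate
static key object with its own name, distinct from the helper that tests it.
The real gfpflags_to_migratetype() also handles reclaimable allocations, which
this sketch leaves out.

```c
#include <stdbool.h>

enum mock_migratetype { MOCK_UNMOVABLE, MOCK_MOVABLE, MOCK_MIRROR };

#define DEMO_GFP_MOVABLE 0x1u	/* mock bit, not the kernel's value */

/* Stand-in for the static key; set once if mirrored memory is present. */
static bool mirror_enabled;

static enum mock_migratetype mock_gfp_to_migratetype(unsigned int gfp_mask)
{
	if (gfp_mask & DEMO_GFP_MOVABLE)
		return MOCK_MOVABLE;

	/* Only systems that enabled mirroring divert unmovable requests. */
	if (mirror_enabled)
		return MOCK_MIRROR;

	return MOCK_UNMOVABLE;
}
```

With `mirror_enabled` false the function behaves as if the feature did not
exist, which is the property wanted for machines without mirrored memory.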

BTW, I think the current memory compaction code scans ranges of the MOVABLE migrate type.
So, if you use a migration type other than MOVABLE for user pages, you may see
page fragmentation. If you want to extend MIRROR to user pages, please check
mm/compaction.c.


Thanks,
-Kame



2015-06-26 10:38:32

by Xishi Qiu

Subject: Re: [RFC PATCH 10/12] mm: add the buddy system interface

On 2015/6/26 16:34, Kamezawa Hiroyuki wrote:

> On 2015/06/26 10:43, Xishi Qiu wrote:
>> On 2015/6/26 7:54, Kamezawa Hiroyuki wrote:
>>
>>> On 2015/06/25 18:44, Xishi Qiu wrote:
>>>> On 2015/6/10 11:06, Kamezawa Hiroyuki wrote:
>>>>
>>>>> On 2015/06/09 19:04, Xishi Qiu wrote:
>>>>>> On 2015/6/9 15:12, Kamezawa Hiroyuki wrote:
>>>>>>
>>>>>>> On 2015/06/04 22:04, Xishi Qiu wrote:
>>>>>>>> Add the buddy system interface for address range mirroring feature.
>>>>>>>> Allocate mirrored pages in MIGRATE_MIRROR list. If there is no mirrored pages
>>>>>>>> left, use other types pages.
>>>>>>>>
>>>>>>>> Signed-off-by: Xishi Qiu <[email protected]>
>>>>>>>> ---
>>>>>>>> mm/page_alloc.c | 40 +++++++++++++++++++++++++++++++++++++++-
>>>>>>>> 1 file changed, 39 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>>>>> index d4d2066..0fb55288 100644
>>>>>>>> --- a/mm/page_alloc.c
>>>>>>>> +++ b/mm/page_alloc.c
>>>>>>>> @@ -599,6 +599,26 @@ static inline bool is_mirror_pfn(unsigned long pfn)
>>>>>>>>
>>>>>>>> return false;
>>>>>>>> }
>>>>>>>> +
>>>>>>>> +static inline bool change_to_mirror(gfp_t gfp_flags, int high_zoneidx)
>>>>>>>> +{
>>>>>>>> + /*
>>>>>>>> + * Do not alloc mirrored memory below 4G, because 0-4G is
>>>>>>>> + * all mirrored by default, and the list is always empty.
>>>>>>>> + */
>>>>>>>> + if (high_zoneidx < ZONE_NORMAL)
>>>>>>>> + return false;
>>>>>>>> +
>>>>>>>> + /* Alloc mirrored memory for only kernel */
>>>>>>>> + if (gfp_flags & __GFP_MIRROR)
>>>>>>>> + return true;
>>>>>>>
>>>>>>> GFP_KERNEL itself should imply mirror, I think.
>>>>>>>
>>>>>>
>>>>>> Hi Kame,
>>>>>>
>>>>>> How about like this: #define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_MIRROR) ?
>>>>>>
>>>>>
>>>>> Hm.... it cannot cover GFP_ATOMIC at el.
>>>>>
>>>>> I guess, mirrored memory should be allocated if !__GFP_HIGHMEM or !__GFP_MOVABLE
>>>>
>>>>
>>>> Hi Kame,
>>>>
>>>> Can we distinguish allocations form user or kernel only by GFP flags?
>>>>
>>>
>>> Allocation from user and file caches are now *always* done with __GFP_MOVABLE.
>>>
>>> By this, pages will be allocated from MIGRATE_MOVABLE migration type.
>>> MOVABLE migration type means it's can
>>> be the target for page compaction or memory-hot-remove.
>>>
>>> Thanks,
>>> -Kame
>>>
>>
>> So if we want all kernel memory allocated from mirror, how about change like this?
>> __alloc_pages_nodemask()
>> gfpflags_to_migratetype()
>> if (!(gfp_mask & __GFP_MOVABLE))
>> return MIGRATE_MIRROR
>
> Maybe used with jump label can reduce performance impact.

Hi Kame,

I don't understand jump labels yet, but I will try.

> ==
> static inline bool memory_mirror_enabled(void)
> {
> return static_key_false(&memory_mirror_enabled);
> }
>
>
>
> gfpflags_to_migratetype()
> if (memory_mirror_enabled()) { /* We want to mirror all unmovable pages */
> if (!(gfp_mask & __GFP_MOVABLE))
> return MIGRATE_MIRROR
> }
> ==
>
> BTW, I think current memory compaction code scans ranges of MOVABLE migrate type.
> So, if you use other migration type than MOVABLE for user pages, you may see
> page fragmentation. If you want to expand this MIRROR to user pages, please check
> mm/compaction.c
>

As Tony said, "how can we minimize the run-time impact on systems that don't have
any mirrored memory", so I think the idea of "kernel only from MIRROR / user only
from MOVABLE" may be better.

Thanks,
Xishi Qiu


2015-06-26 18:42:52

by Luck, Tony

Subject: RE: [RFC PATCH 10/12] mm: add the buddy system interface

> gfpflags_to_migratetype()
> if (memory_mirror_enabled()) { /* We want to mirror all unmovable pages */
> if (!(gfp_mask & __GFP_MOVABLE))
> return MIGRATE_MIRROR
> }

I'm not sure that we can divide memory into just two buckets of "mirrored" and "movable".

My expectation is that there will be memory that is neither mirrored, nor movable. We'd
allocate that memory to user processes. Uncorrected errors in that memory would result
in the death of the process (except in the case where the page is a clean copy mapped from
a disk file ... e.g. .text mapping instructions from an executable). Linux would offline
the affected 4K page so as not to hit the problem again.
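
The outcome described above can be written as a small decision function. This
is a sketch under assumptions: `struct page_info`, `enum ue_action` and
`on_uncorrected_error()` are hypothetical names, and the mirrored branch (the
error is absorbed by the mirror copy and the page stays online) is an
assumption of this sketch, not something stated in this mail.

```c
#include <stdbool.h>

/* Hypothetical per-page facts needed for the decision. */
struct page_info {
	bool mirrored;		/* backed by mirrored memory */
	bool file_backed;	/* mapped from a disk file */
	bool dirty;		/* modified since it was read */
};

enum ue_action {
	UE_USE_MIRROR_COPY,	/* assumed: the mirror supplies good data */
	UE_RELOAD_FROM_DISK,	/* clean file page can be read again */
	UE_KILL_PROCESS,	/* data is lost, the owner cannot continue */
};

static enum ue_action on_uncorrected_error(const struct page_info *p,
					   bool *offline_page)
{
	if (p->mirrored) {
		*offline_page = false;
		return UE_USE_MIRROR_COPY;
	}

	/* Non-mirrored: retire the 4K page so it is not hit again. */
	*offline_page = true;

	if (p->file_backed && !p->dirty)
		return UE_RELOAD_FROM_DISK;

	return UE_KILL_PROCESS;
}
```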

-Tony