[What we are doing]
This patchset provides a boot option for users to specify the ZONE_MOVABLE memory
map for each node in the system.
movablecore_map=nn[KMG]@ss[KMG]
This option ensures that the memory range from ss to ss+nn is movable memory.
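For example (the address here is only illustrative), booting with
movablecore_map=4G@0x100000000
requests that the 4GB physical address range [0x100000000, 0x200000000)
be movable memory.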
[Why we do this]
If we hot remove memory, that memory cannot contain kernel memory,
because Linux cannot currently migrate kernel memory. Therefore,
we have to guarantee that the memory to be hot removed contains
only movable memory.
Linux has two boot options, kernelcore= and movablecore=, for
creating movable memory. These boot options can specify the amount
of memory to use as kernel or movable memory. Using them, we can
create ZONE_MOVABLE, which has only movable memory.
But they do not fulfill a requirement of memory hot remove, because
even if we specify the boot options, movable memory is distributed
evenly across the nodes. So when we want to hot remove memory whose
range is 0x80000000-0xc0000000, we have no way to specify that
memory as movable memory.
So we propose a new feature that specifies the memory range to use
as movable memory.
[Ways to do this]
There may be 2 ways to specify movable memory.
1. use firmware information
2. use boot option
1. use firmware information
According to the ACPI 5.0 spec, the SRAT table has a memory affinity
structure, and the structure has a Hot Pluggable Field. See "5.2.16.2
Memory Affinity Structure". Using this information, we might be able to
let firmware specify movable memory. For example, if the Hot Pluggable
Field is set, Linux marks the memory as movable memory (see the sketch
below).
2. use boot option
This is our proposal. A new boot option can specify the memory range
to use as movable memory.
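For illustration only, a minimal sketch of how way 1 could hook into the
kernel's SRAT parsing (the struct and flag names are the kernel's ACPI
table definitions; mark_movable_range() is a hypothetical helper that this
patchset does not provide):

	/* Sketch: honor the SRAT Hot Pluggable flag (illustrative only). */
	static void __init srat_check_hotpluggable(struct acpi_srat_mem_affinity *ma)
	{
		if (!(ma->flags & ACPI_SRAT_MEM_ENABLED))
			return;
		if (ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE)
			mark_movable_range(ma->base_address,	/* hypothetical */
					   ma->base_address + ma->length);
	}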
[How we do this]
We chose the second way, because with the first way users cannot
easily change the memory range to use as movable memory. We think
that creating movable memory may cause a NUMA-related performance
regression. In that case, the user can easily turn the feature off
if we provide a boot option. A boot option also lets the user easily
select which memory to use as movable memory.
[How to use]
Specify the following boot option:
movablecore_map=nn[KMG]@ss[KMG]
This means the physical address range from ss to ss+nn will be set up
as ZONE_MOVABLE.
And the following points should be considered.
1) If the range falls within a single node, then memory from ss to the
end of that node will be ZONE_MOVABLE.
2) If the range covers two or more nodes, then memory from ss to the end
of the first node will be ZONE_MOVABLE, and all the other covered nodes
will have only ZONE_MOVABLE (see the worked example below).
3) If no range falls in a node, then that node will have no ZONE_MOVABLE
unless kernelcore or movablecore is specified.
4) This option can be specified at most MAX_NUMNODES times.
5) If kernelcore or movablecore is also specified, movablecore_map will be
satisfied with higher priority.
6) This option does not conflict with the memmap option.
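A worked example (the node layout is hypothetical): suppose node0 spans
[0, 4G) and node1 spans [4G, 8G). With movablecore_map=512M@3G, the range
lies within node0, so by rule 1 node0's ZONE_MOVABLE covers [3G, 4G), from
ss to the end of the node. With movablecore_map=2G@3G, the range [3G, 5G)
crosses into node1, so by rule 2 node0's ZONE_MOVABLE is [3G, 4G) and all
of node1 becomes ZONE_MOVABLE.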
Tang Chen (4):
page_alloc: add movable_memmap kernel parameter
page_alloc: Introduce zone_movable_limit[] to keep movable limit for
nodes
page_alloc: Make movablecore_map has higher priority
page_alloc: Bootmem limit with movablecore_map
Yasuaki Ishimatsu (1):
x86: get pg_data_t's memory from other node
Documentation/kernel-parameters.txt | 17 +++
arch/x86/mm/numa.c | 11 ++-
include/linux/memblock.h | 1 +
include/linux/mm.h | 11 ++
mm/memblock.c | 15 +++-
mm/page_alloc.c | 216 ++++++++++++++++++++++++++++++++++-
6 files changed, 263 insertions(+), 8 deletions(-)
This patch adds functions to parse the movablecore_map boot option. Since
the option can be specified more than once, all the ranges are stored in
the global movablecore_map.map array.
The array is kept sorted by start_pfn in monotonically increasing order,
and overlapping ranges are merged.
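For example (values are illustrative), booting with
movablecore_map=2G@4G movablecore_map=2G@5G
yields the overlapping ranges [4G, 6G) and [5G, 7G), which the parser
merges into a single map entry covering [4G, 7G), kept sorted by start_pfn.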
Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
Reviewed-by: Wen Congyang <[email protected]>
Tested-by: Lin Feng <[email protected]>
---
Documentation/kernel-parameters.txt | 17 +++++
include/linux/mm.h | 11 +++
mm/page_alloc.c | 126 +++++++++++++++++++++++++++++++++++
3 files changed, 154 insertions(+), 0 deletions(-)
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 9776f06..785f878 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1620,6 +1620,23 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
that the amount of memory usable for all allocations
is not too small.
+ movablecore_map=nn[KMG]@ss[KMG]
+ [KNL,X86,IA-64,PPC] This parameter is similar to
+ memmap except it specifies the memory map of
+ ZONE_MOVABLE.
+ If several areas are all within one node, then from
+ the lowest ss to the end of the node will be ZONE_MOVABLE.
+ If an area covers two or more nodes, the area from
+ ss to the end of the 1st node will be ZONE_MOVABLE,
+ and all the remaining nodes will have only ZONE_MOVABLE.
+ If memmap is specified at the same time, the
+ movablecore_map will be limited within the memmap
+ areas. If kernelcore or movablecore is also specified,
+ movablecore_map will have higher priority to be
+ satisfied. So the administrator should be careful that
+ the total movablecore_map area is not too large.
+ Otherwise the kernel won't have enough memory to boot.
+
MTD_Partition= [MTD]
Format: <name>,<region-number>,<size>,<offset>
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fa06804..647c980 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1328,6 +1328,17 @@ extern void free_bootmem_with_active_regions(int nid,
unsigned long max_low_pfn);
extern void sparse_memory_present_with_active_regions(int nid);
+#define MOVABLECORE_MAP_MAX MAX_NUMNODES
+struct movablecore_entry {
+ unsigned long start; /* start pfn of memory segment */
+ unsigned long end; /* end pfn of memory segment */
+};
+
+struct movablecore_map {
+ int nr_map;
+ struct movablecore_entry map[MOVABLECORE_MAP_MAX];
+};
+
#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
#if !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && \
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5b74de6..fb5cf12 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -198,6 +198,9 @@ static unsigned long __meminitdata nr_all_pages;
static unsigned long __meminitdata dma_reserve;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
+/* Movable memory ranges, will also be used by memblock subsystem. */
+struct movablecore_map movablecore_map;
+
static unsigned long __meminitdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
static unsigned long __initdata required_kernelcore;
@@ -4986,6 +4989,129 @@ static int __init cmdline_parse_movablecore(char *p)
early_param("kernelcore", cmdline_parse_kernelcore);
early_param("movablecore", cmdline_parse_movablecore);
+/**
+ * insert_movablecore_map - Insert a memory range into movablecore_map.map.
+ * @start_pfn: start pfn of the range
+ * @end_pfn: end pfn of the range
+ *
+ * This function will also merge the overlapped ranges, and sort the array
+ * by start_pfn in monotonic increasing order.
+ */
+static void __init insert_movablecore_map(unsigned long start_pfn,
+ unsigned long end_pfn)
+{
+ int pos, overlap;
+
+ /*
+ * pos will be at the 1st overlapped range, or the position
+ * where the element should be inserted.
+ */
+ for (pos = 0; pos < movablecore_map.nr_map; pos++)
+ if (start_pfn <= movablecore_map.map[pos].end)
+ break;
+
+ /* If there is no overlapped range, just insert the element. */
+ if (pos == movablecore_map.nr_map ||
+ end_pfn < movablecore_map.map[pos].start) {
+ /*
+ * If pos is not the end of array, we need to move all
+ * the rest elements backward.
+ */
+ if (pos < movablecore_map.nr_map)
+ memmove(&movablecore_map.map[pos+1],
+ &movablecore_map.map[pos],
+ sizeof(struct movablecore_entry) *
+ (movablecore_map.nr_map - pos));
+ movablecore_map.map[pos].start = start_pfn;
+ movablecore_map.map[pos].end = end_pfn;
+ movablecore_map.nr_map++;
+ return;
+ }
+
+ /* overlap will be at the last overlapped range */
+ for (overlap = pos + 1; overlap < movablecore_map.nr_map; overlap++)
+ if (end_pfn < movablecore_map.map[overlap].start)
+ break;
+
+ /*
+ * If there are more ranges overlapped, we need to merge them,
+ * and move the rest elements forward.
+ */
+ overlap--;
+ movablecore_map.map[pos].start = min(start_pfn,
+ movablecore_map.map[pos].start);
+ movablecore_map.map[pos].end = max(end_pfn,
+ movablecore_map.map[overlap].end);
+
+ if (pos != overlap && overlap + 1 != movablecore_map.nr_map)
+ memmove(&movablecore_map.map[pos+1],
+ &movablecore_map.map[overlap+1],
+ sizeof(struct movablecore_entry) *
+ (movablecore_map.nr_map - overlap - 1));
+
+ movablecore_map.nr_map -= overlap - pos;
+}
+
+/**
+ * movablecore_map_add_region - Add a memory range into movablecore_map.
+ * @start: physical start address of range
+ * @size: size of the range
+ *
+ * This function transforms the physical address range into pfns, and then
+ * adds the range into movablecore_map by calling insert_movablecore_map().
+ */
+static void __init movablecore_map_add_region(u64 start, u64 size)
+{
+ unsigned long start_pfn, end_pfn;
+
+ /* In case size == 0 or start + size overflows */
+ if (start + size <= start)
+ return;
+
+ if (movablecore_map.nr_map >= ARRAY_SIZE(movablecore_map.map)) {
+ pr_err("movable_memory_map: too many entries;"
+ " ignoring [mem %#010llx-%#010llx]\n",
+ (unsigned long long) start,
+ (unsigned long long) (start + size - 1));
+ return;
+ }
+
+ start_pfn = PFN_DOWN(start);
+ end_pfn = PFN_UP(start + size);
+ insert_movablecore_map(start_pfn, end_pfn);
+}
+
+/*
+ * movablecore_map=nn[KMG]@ss[KMG] sets the region of memory to be used as
+ * movable memory.
+ */
+static int __init cmdline_parse_movablecore_map(char *p)
+{
+ char *oldp;
+ u64 start_at, mem_size;
+
+ if (!p)
+ goto err;
+
+ oldp = p;
+ mem_size = memparse(p, &p);
+ if (p == oldp)
+ goto err;
+
+ if (*p == '@') {
+ oldp = ++p;
+ start_at = memparse(p, &p);
+ if (p == oldp || *p != '\0')
+ goto err;
+
+ movablecore_map_add_region(start_at, mem_size);
+ return 0;
+ }
+err:
+ return -EINVAL;
+}
+early_param("movablecore_map", cmdline_parse_movablecore_map);
+
#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
/**
--
1.7.1
This patch introduces a new array, zone_movable_limit[], to store the
ZONE_MOVABLE limit from the movablecore_map boot option for all nodes.
The function sanitize_zone_movable_limit() will find out to which
node each range in movablecore_map.map[] belongs, and calculate the
lower boundary of ZONE_MOVABLE for each node.
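As an illustration (the numbers are hypothetical): if node1 spans pfns
[0x680000, 0x980000) and movablecore_map contains [0x700000, 0x800000),
then zone_movable_limit[1] is set to 0x700000, the lowest movable pfn that
falls inside the node. Nodes with no matching range keep
zone_movable_limit[nid] == 0, which means no limit for that node.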
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Wen Congyang <[email protected]>
Reviewed-by: Lai Jiangshan <[email protected]>
Tested-by: Lin Feng <[email protected]>
---
mm/page_alloc.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 55 insertions(+), 0 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fb5cf12..f23d76a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -206,6 +206,7 @@ static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
static unsigned long __initdata required_kernelcore;
static unsigned long __initdata required_movablecore;
static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
+static unsigned long __meminitdata zone_movable_limit[MAX_NUMNODES];
/* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
int movable_zone;
@@ -4323,6 +4324,55 @@ static unsigned long __meminit zone_absent_pages_in_node(int nid,
return __absent_pages_in_range(nid, zone_start_pfn, zone_end_pfn);
}
+/**
+ * sanitize_zone_movable_limit - Sanitize the zone_movable_limit array.
+ *
+ * zone_movable_limit is initialized as 0. This function will try to get
+ * the first ZONE_MOVABLE pfn of each node from movablecore_map, and
+ * assign them to zone_movable_limit.
+ * zone_movable_limit[nid] == 0 means no limit for the node.
+ *
+ * Note: Each range is represented as [start_pfn, end_pfn)
+ */
+static void __meminit sanitize_zone_movable_limit(void)
+{
+ int map_pos = 0, i, nid;
+ unsigned long start_pfn, end_pfn;
+
+ if (!movablecore_map.nr_map)
+ return;
+
+ /* Iterate all ranges from minimum to maximum */
+ for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
+ /*
+ * If we have found lowest pfn of ZONE_MOVABLE of the node
+ * specified by user, just go on to check next range.
+ */
+ if (zone_movable_limit[nid])
+ continue;
+
+ while (map_pos < movablecore_map.nr_map) {
+ if (end_pfn <= movablecore_map.map[map_pos].start)
+ break;
+
+ if (start_pfn >= movablecore_map.map[map_pos].end) {
+ map_pos++;
+ continue;
+ }
+
+ /*
+ * The start_pfn of ZONE_MOVABLE is either the minimum
+ * pfn specified by movablecore_map, or 0, which means
+ * the node has no ZONE_MOVABLE.
+ */
+ zone_movable_limit[nid] = max(start_pfn,
+ movablecore_map.map[map_pos].start);
+
+ break;
+ }
+ }
+}
+
#else /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
static inline unsigned long __meminit zone_spanned_pages_in_node(int nid,
unsigned long zone_type,
@@ -4341,6 +4391,10 @@ static inline unsigned long __meminit zone_absent_pages_in_node(int nid,
return zholes_size[zone_type];
}
+static void __meminit sanitize_zone_movable_limit(void)
+{
+}
+
#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
static void __meminit calculate_node_totalpages(struct pglist_data *pgdat,
@@ -4906,6 +4960,7 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
/* Find the PFNs that ZONE_MOVABLE begins at in each node */
memset(zone_movable_pfn, 0, sizeof(zone_movable_pfn));
+ sanitize_zone_movable_limit();
find_zone_movable_pfns_for_nodes();
/* Print out the zone ranges */
--
1.7.1
If kernelcore or movablecore is specified together with
movablecore_map, movablecore_map will be satisfied with
higher priority.
This patch makes find_zone_movable_pfns_for_nodes()
calculate zone_movable_pfn[] under the limits from
zone_movable_limit[].
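As an illustration (the numbers are hypothetical): if node1 spans
[16G, 32G) and zone_movable_limit[1] corresponds to 24G, kernelcore pages
for node1 are taken only from [16G, 24G). If that is not enough,
zone_movable_pfn[1] is simply capped at the limit and the remaining
kernelcore is redistributed to the other usable nodes by the restart loop.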
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Wen Congyang <[email protected]>
Reviewed-by: Lai Jiangshan <[email protected]>
Tested-by: Lin Feng <[email protected]>
---
mm/page_alloc.c | 35 +++++++++++++++++++++++++++++++----
1 files changed, 31 insertions(+), 4 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f23d76a..05bafbb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4800,12 +4800,25 @@ static void __init find_zone_movable_pfns_for_nodes(void)
required_kernelcore = max(required_kernelcore, corepages);
}
- /* If kernelcore was not specified, there is no ZONE_MOVABLE */
- if (!required_kernelcore)
+ /*
+ * No matter kernelcore/movablecore was limited or not, movable_zone
+ * should always be set to a usable zone index.
+ */
+ find_usable_zone_for_movable();
+
+ /*
+ * If neither kernelcore/movablecore nor movablecore_map is specified,
+ * there is no ZONE_MOVABLE. But if movablecore_map is specified, the
+ * start pfn of ZONE_MOVABLE has been stored in zone_movable_limit[].
+ */
+ if (!required_kernelcore) {
+ if (movablecore_map.nr_map)
+ memcpy(zone_movable_pfn, zone_movable_limit,
+ sizeof(zone_movable_pfn));
goto out;
+ }
/* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
- find_usable_zone_for_movable();
usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];
restart:
@@ -4833,10 +4846,24 @@ restart:
for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
unsigned long size_pages;
+ /*
+ * Find more memory for kernelcore in
+ * [zone_movable_pfn[nid], zone_movable_limit[nid]).
+ */
start_pfn = max(start_pfn, zone_movable_pfn[nid]);
if (start_pfn >= end_pfn)
continue;
+ if (zone_movable_limit[nid]) {
+ end_pfn = min(end_pfn, zone_movable_limit[nid]);
+ /* No range left for kernelcore in this node */
+ if (start_pfn >= end_pfn) {
+ zone_movable_pfn[nid] =
+ zone_movable_limit[nid];
+ break;
+ }
+ }
+
/* Account for what is only usable for kernelcore */
if (start_pfn < usable_startpfn) {
unsigned long kernel_pages;
@@ -4896,12 +4923,12 @@ restart:
if (usable_nodes && required_kernelcore > usable_nodes)
goto restart;
+out:
/* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
for (nid = 0; nid < MAX_NUMNODES; nid++)
zone_movable_pfn[nid] =
roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES);
-out:
/* restore the node_state */
node_states[N_HIGH_MEMORY] = saved_node_state;
}
--
1.7.1
This patch makes sure bootmem does not allocate memory from areas that
may become ZONE_MOVABLE. The map info comes from the movablecore_map boot option.
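As an illustration (the ranges are hypothetical): if movablecore_map
contains [4G, 8G) and memblock finds a free range [2G, 10G), a top-down
candidate that would land inside [4G, 8G) is rejected and the search end
is clipped down to 4G, so the allocation is retried below the movable area.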
Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
Reviewed-by: Wen Congyang <[email protected]>
Tested-by: Lin Feng <[email protected]>
---
include/linux/memblock.h | 1 +
mm/memblock.c | 15 ++++++++++++++-
2 files changed, 15 insertions(+), 1 deletions(-)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index d452ee1..6e25597 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -42,6 +42,7 @@ struct memblock {
extern struct memblock memblock;
extern int memblock_debug;
+extern struct movablecore_map movablecore_map;
#define memblock_dbg(fmt, ...) \
if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
diff --git a/mm/memblock.c b/mm/memblock.c
index 6259055..33b3b4d 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -101,6 +101,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
{
phys_addr_t this_start, this_end, cand;
u64 i;
+ int curr = movablecore_map.nr_map - 1;
/* pump up @end */
if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
@@ -114,13 +115,25 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
this_start = clamp(this_start, start, end);
this_end = clamp(this_end, start, end);
- if (this_end < size)
+restart:
+ if (this_end <= this_start || this_end < size)
continue;
+ for (; curr >= 0; curr--) {
+ if (movablecore_map.map[curr].start < this_end)
+ break;
+ }
+
cand = round_down(this_end - size, align);
+ if (curr >= 0 && cand < movablecore_map.map[curr].end) {
+ this_end = movablecore_map.map[curr].start;
+ goto restart;
+ }
+
if (cand >= this_start)
return cand;
}
+
return 0;
}
--
1.7.1
From: Yasuaki Ishimatsu <[email protected]>
If the system can create a movable node, in which all memory of
the node is allocated as ZONE_MOVABLE, setup_node_data() cannot
allocate memory for that node's pg_data_t on the node itself.
So when memblock_alloc_nid() fails, setup_node_data() retries with
memblock_alloc().
Signed-off-by: Yasuaki Ishimatsu <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
---
arch/x86/mm/numa.c | 11 ++++++++---
1 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 2d125be..734bbd2 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -224,9 +224,14 @@ static void __init setup_node_data(int nid, u64 start, u64 end)
} else {
nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
if (!nd_pa) {
- pr_err("Cannot find %zu bytes in node %d\n",
- nd_size, nid);
- return;
+ pr_warn("Cannot find %zu bytes in node %d\n",
+ nd_size, nid);
+ nd_pa = memblock_alloc(nd_size, SMP_CACHE_BYTES);
+ if (!nd_pa) {
+ pr_err("Cannot find %zu bytes in other node\n",
+ nd_size);
+ return;
+ }
}
nd = __va(nd_pa);
}
--
1.7.1
On 2012-11-23 18:44, Tang Chen wrote:
> From: Yasuaki Ishimatsu <[email protected]>
>
> If the system can create a movable node, in which all memory of
> the node is allocated as ZONE_MOVABLE, setup_node_data() cannot
> allocate memory for that node's pg_data_t on the node itself.
> So when memblock_alloc_nid() fails, setup_node_data() retries with
> memblock_alloc().
>
> Signed-off-by: Yasuaki Ishimatsu <[email protected]>
> Signed-off-by: Lai Jiangshan <[email protected]>
> Signed-off-by: Tang Chen <[email protected]>
> ---
> arch/x86/mm/numa.c | 11 ++++++++---
> 1 files changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 2d125be..734bbd2 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -224,9 +224,14 @@ static void __init setup_node_data(int nid, u64 start, u64 end)
> } else {
> nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
> if (!nd_pa) {
> - pr_err("Cannot find %zu bytes in node %d\n",
> - nd_size, nid);
> - return;
> + pr_warn("Cannot find %zu bytes in node %d\n",
> + nd_size, nid);
Hi Tang,
Should this be a "pr_info", since the allocation failure is expected?
Regards!
Gerry
On 11/24/2012 09:19 AM, Jiang Liu wrote:
> On 2012-11-23 18:44, Tang Chen wrote:
>> From: Yasuaki Ishimatsu<[email protected]>
>> @@ -224,9 +224,14 @@ static void __init setup_node_data(int nid, u64 start, u64 end)
>> } else {
>> nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
>> if (!nd_pa) {
>> - pr_err("Cannot find %zu bytes in node %d\n",
>> - nd_size, nid);
>> - return;
>> + pr_warn("Cannot find %zu bytes in node %d\n",
>> + nd_size, nid);
> Hi Tang,
> Should this be a "pr_info", since the allocation failure is expected?
Hi Liu,
Sure, followed. Thanks. :)
> Regards!
> Gerry
>
On 2012-11-23 18:44, Tang Chen wrote:
> This patch makes sure bootmem does not allocate memory from areas that
> may become ZONE_MOVABLE. The map info comes from the movablecore_map boot option.
>
> Signed-off-by: Tang Chen <[email protected]>
> Signed-off-by: Lai Jiangshan <[email protected]>
> Reviewed-by: Wen Congyang <[email protected]>
> Tested-by: Lin Feng <[email protected]>
> ---
> include/linux/memblock.h | 1 +
> mm/memblock.c | 15 ++++++++++++++-
> 2 files changed, 15 insertions(+), 1 deletions(-)
>
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index d452ee1..6e25597 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -42,6 +42,7 @@ struct memblock {
>
> extern struct memblock memblock;
> extern int memblock_debug;
> +extern struct movablecore_map movablecore_map;
>
> #define memblock_dbg(fmt, ...) \
> if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 6259055..33b3b4d 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -101,6 +101,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
> {
> phys_addr_t this_start, this_end, cand;
> u64 i;
> + int curr = movablecore_map.nr_map - 1;
>
> /* pump up @end */
> if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
> @@ -114,13 +115,25 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
> this_start = clamp(this_start, start, end);
> this_end = clamp(this_end, start, end);
>
> - if (this_end < size)
> +restart:
> + if (this_end <= this_start || this_end < size)
> continue;
>
> + for (; curr >= 0; curr--) {
> + if (movablecore_map.map[curr].start < this_end)
movablecore_map.map[curr].start should be movablecore_map.map[curr].start << PAGE_SHIFT.
Maybe you can change movablecore_map.map[].start/end to movablecore_map.map[].start_pfn/end_pfn
to avoid confusion.
> + break;
> + }
> +
> cand = round_down(this_end - size, align);
> + if (curr >= 0 && cand < movablecore_map.map[curr].end) {
> + this_end = movablecore_map.map[curr].start;
Ditto.
> + goto restart;
> + }
> +
> if (cand >= this_start)
> return cand;
> }
> +
> return 0;
> }
>
>
Hi Tang,
I tested this patchset on x86_64, and I found that this patch didn't
work as expected.
For example, if node2's memory pfn range is [0x680000-0x980000) and
I boot the kernel with movablecore_map=4G@0x680000000, all memory in node2 will be
in ZONE_MOVABLE, but bootmem can still be allocated from [0x780000000-0x980000000),
which means bootmem *is allocated* from ZONE_MOVABLE. This is because movablecore_map
only contains [0x680000000-0x780000000). I think we can fix up movablecore_map; how
about this:
Signed-off-by: Jianguo Wu <[email protected]>
Signed-off-by: Jiang Liu <[email protected]>
---
arch/x86/mm/srat.c | 15 +++++++++++++++
include/linux/mm.h | 3 +++
mm/page_alloc.c | 2 +-
3 files changed, 19 insertions(+), 1 deletions(-)
diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index 4ddf497..f1aac08 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -147,6 +147,8 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
{
u64 start, end;
int node, pxm;
+ int i;
+ unsigned long start_pfn, end_pfn;
if (srat_disabled())
return -1;
@@ -181,6 +183,19 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
printk(KERN_INFO "SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx]\n",
node, pxm,
(unsigned long long) start, (unsigned long long) end - 1);
+
+ start_pfn = PFN_DOWN(start);
+ end_pfn = PFN_UP(end);
+ for (i = 0; i < movablecore_map.nr_map; i++) {
+ if (end_pfn <= movablecore_map.map[i].start)
+ break;
+
+ if (movablecore_map.map[i].end < end_pfn) {
+ insert_movablecore_map(movablecore_map.map[i].end,
+ end_pfn);
+ }
+ }
+
return 0;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5a65251..7a23403 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1356,6 +1356,9 @@ extern int __meminit __early_pfn_to_nid(unsigned long pfn);
#endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */
#endif
+extern void insert_movablecore_map(unsigned long start_pfn,
+ unsigned long end_pfn);
+
extern void set_dma_reserve(unsigned long new_dma_reserve);
extern void memmap_init_zone(unsigned long, int, unsigned long,
unsigned long, enum memmap_context);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 544c829..e6b5090 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5089,7 +5089,7 @@ early_param("movablecore", cmdline_parse_movablecore);
* This function will also merge the overlapped ranges, and sort the array
* by start_pfn in monotonic increasing order.
*/
-static void __init insert_movablecore_map(unsigned long start_pfn,
+void __init insert_movablecore_map(unsigned long start_pfn,
unsigned long end_pfn)
{
int pos, overlap;
--
1.7.6.1
.
Thanks,
Jianguo Wu
On 2012-11-23 18:44, Tang Chen wrote:
> This patch makes sure bootmem does not allocate memory from areas that
> may become ZONE_MOVABLE. The map info comes from the movablecore_map boot option.
>
> Signed-off-by: Tang Chen <[email protected]>
> Signed-off-by: Lai Jiangshan <[email protected]>
> Reviewed-by: Wen Congyang <[email protected]>
> Tested-by: Lin Feng <[email protected]>
> ---
> include/linux/memblock.h | 1 +
> mm/memblock.c | 15 ++++++++++++++-
> 2 files changed, 15 insertions(+), 1 deletions(-)
>
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index d452ee1..6e25597 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -42,6 +42,7 @@ struct memblock {
>
> extern struct memblock memblock;
> extern int memblock_debug;
> +extern struct movablecore_map movablecore_map;
>
> #define memblock_dbg(fmt, ...) \
> if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 6259055..33b3b4d 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -101,6 +101,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
> {
> phys_addr_t this_start, this_end, cand;
> u64 i;
> + int curr = movablecore_map.nr_map - 1;
>
> /* pump up @end */
> if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
> @@ -114,13 +115,25 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
> this_start = clamp(this_start, start, end);
> this_end = clamp(this_end, start, end);
>
> - if (this_end < size)
> +restart:
> + if (this_end <= this_start || this_end < size)
> continue;
>
> + for (; curr >= 0; curr--) {
> + if (movablecore_map.map[curr].start < this_end)
> + break;
> + }
> +
> cand = round_down(this_end - size, align);
> + if (curr >= 0 && cand < movablecore_map.map[curr].end) {
> + this_end = movablecore_map.map[curr].start;
> + goto restart;
> + }
> +
> if (cand >= this_start)
> return cand;
> }
> +
> return 0;
> }
>
>
On 11/26/2012 08:22 PM, wujianguo wrote:
> On 2012-11-23 18:44, Tang Chen wrote:
>> This patch makes sure bootmem does not allocate memory from areas that
>> may become ZONE_MOVABLE. The map info comes from the movablecore_map boot option.
>>
>> Signed-off-by: Tang Chen<[email protected]>
>> Signed-off-by: Lai Jiangshan<[email protected]>
>> Reviewed-by: Wen Congyang<[email protected]>
>> Tested-by: Lin Feng<[email protected]>
>> ---
>> include/linux/memblock.h | 1 +
>> mm/memblock.c | 15 ++++++++++++++-
>> 2 files changed, 15 insertions(+), 1 deletions(-)
>>
>> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
>> index d452ee1..6e25597 100644
>> --- a/include/linux/memblock.h
>> +++ b/include/linux/memblock.h
>> @@ -42,6 +42,7 @@ struct memblock {
>>
>> extern struct memblock memblock;
>> extern int memblock_debug;
>> +extern struct movablecore_map movablecore_map;
>>
>> #define memblock_dbg(fmt, ...) \
>> if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
>> diff --git a/mm/memblock.c b/mm/memblock.c
>> index 6259055..33b3b4d 100644
>> --- a/mm/memblock.c
>> +++ b/mm/memblock.c
>> @@ -101,6 +101,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
>> {
>> phys_addr_t this_start, this_end, cand;
>> u64 i;
>> + int curr = movablecore_map.nr_map - 1;
>>
>> /* pump up @end */
>> if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
>> @@ -114,13 +115,25 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
>> this_start = clamp(this_start, start, end);
>> this_end = clamp(this_end, start, end);
>>
>> - if (this_end < size)
>> +restart:
>> + if (this_end <= this_start || this_end < size)
>> continue;
>>
>> + for (; curr >= 0; curr--) {
>> + if (movablecore_map.map[curr].start < this_end)
>
> movablecore_map.map[curr].start should be movablecore_map.map[curr].start << PAGE_SHIFT.
> Maybe you can change movablecore_map.map[].start/end to movablecore_map.map[].start_pfn/end_pfn
> to avoid confusion.
Hi Wu,
Yes, it was my mistake; I forgot to shift the pfn.
This was also caught in my partner's testing, and I have fixed it in my
v3 patch.
Thanks for the comments. :)
>
>> + break;
>> + }
>> +
>> cand = round_down(this_end - size, align);
>> + if (curr >= 0 && cand < movablecore_map.map[curr].end) {
>> + this_end = movablecore_map.map[curr].start;
>
> Ditto.
>
>> + goto restart;
>> + }
>> +
>> if (cand >= this_start)
>> return cand;
>> }
>> +
>> return 0;
>> }
>>
>>
>
>
On 11/26/2012 08:40 PM, wujianguo wrote:
> Hi Tang,
> I tested this patchset on x86_64, and I found that this patch didn't
> work as expected.
> For example, if node2's memory pfn range is [0x680000-0x980000) and
> I boot the kernel with movablecore_map=4G@0x680000000, all memory in node2 will be
> in ZONE_MOVABLE, but bootmem can still be allocated from [0x780000000-0x980000000),
> which means bootmem *is allocated* from ZONE_MOVABLE. This is because movablecore_map
> only contains [0x680000000-0x780000000). I think we can fix up movablecore_map; how
> about this:
Hi Wu,
That is really a problem. Before NUMA memory gets initialized, the
memblock subsystem is used to allocate memory. I didn't find any
approach that could fully address this when I was making the patches.
There is always a risk that memblock allocates memory in ZONE_MOVABLE.
I think we can only do our best to prevent it from happening.
Your patch is very helpful. And after a short look at the code, it seems
that acpi_numa_memory_affinity_init() is an architecture-dependent
function. Could we do this somewhere that does not depend on the
architecture?
Thanks. :)
>
> Signed-off-by: Jianguo Wu<[email protected]>
> Signed-off-by: Jiang Liu<[email protected]>
> ---
> arch/x86/mm/srat.c | 15 +++++++++++++++
> include/linux/mm.h | 3 +++
> mm/page_alloc.c | 2 +-
> 3 files changed, 19 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
> index 4ddf497..f1aac08 100644
> --- a/arch/x86/mm/srat.c
> +++ b/arch/x86/mm/srat.c
> @@ -147,6 +147,8 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
> {
> u64 start, end;
> int node, pxm;
> + int i;
> + unsigned long start_pfn, end_pfn;
>
> if (srat_disabled())
> return -1;
> @@ -181,6 +183,19 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
> printk(KERN_INFO "SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx]\n",
> node, pxm,
> (unsigned long long) start, (unsigned long long) end - 1);
> +
> + start_pfn = PFN_DOWN(start);
> + end_pfn = PFN_UP(end);
> + for (i = 0; i < movablecore_map.nr_map; i++) {
> + if (end_pfn <= movablecore_map.map[i].start)
> + break;
> +
> + if (movablecore_map.map[i].end < end_pfn) {
> + insert_movablecore_map(movablecore_map.map[i].end,
> + end_pfn);
> + }
> + }
> +
> return 0;
> }
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5a65251..7a23403 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1356,6 +1356,9 @@ extern int __meminit __early_pfn_to_nid(unsigned long pfn);
> #endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */
> #endif
>
> +extern void insert_movablecore_map(unsigned long start_pfn,
> + unsigned long end_pfn);
> +
> extern void set_dma_reserve(unsigned long new_dma_reserve);
> extern void memmap_init_zone(unsigned long, int, unsigned long,
> unsigned long, enum memmap_context);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 544c829..e6b5090 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5089,7 +5089,7 @@ early_param("movablecore", cmdline_parse_movablecore);
> * This function will also merge the overlapped ranges, and sort the array
> * by start_pfn in monotonic increasing order.
> */
> -static void __init insert_movablecore_map(unsigned long start_pfn,
> +void __init insert_movablecore_map(unsigned long start_pfn,
> unsigned long end_pfn)
> {
> int pos, overlap;
> -- 1.7.6.1
> .
>
> Thanks,
> Jianguo Wu
>
> On 2012-11-23 18:44, Tang Chen wrote:
>> This patch makes sure bootmem does not allocate memory from areas that
>> may become ZONE_MOVABLE. The map info comes from the movablecore_map boot option.
>>
>> Signed-off-by: Tang Chen<[email protected]>
>> Signed-off-by: Lai Jiangshan<[email protected]>
>> Reviewed-by: Wen Congyang<[email protected]>
>> Tested-by: Lin Feng<[email protected]>
>> ---
>> include/linux/memblock.h | 1 +
>> mm/memblock.c | 15 ++++++++++++++-
>> 2 files changed, 15 insertions(+), 1 deletions(-)
>>
>> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
>> index d452ee1..6e25597 100644
>> --- a/include/linux/memblock.h
>> +++ b/include/linux/memblock.h
>> @@ -42,6 +42,7 @@ struct memblock {
>>
>> extern struct memblock memblock;
>> extern int memblock_debug;
>> +extern struct movablecore_map movablecore_map;
>>
>> #define memblock_dbg(fmt, ...) \
>> if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
>> diff --git a/mm/memblock.c b/mm/memblock.c
>> index 6259055..33b3b4d 100644
>> --- a/mm/memblock.c
>> +++ b/mm/memblock.c
>> @@ -101,6 +101,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
>> {
>> phys_addr_t this_start, this_end, cand;
>> u64 i;
>> + int curr = movablecore_map.nr_map - 1;
>>
>> /* pump up @end */
>> if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
>> @@ -114,13 +115,25 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
>> this_start = clamp(this_start, start, end);
>> this_end = clamp(this_end, start, end);
>>
>> - if (this_end < size)
>> +restart:
>> + if (this_end <= this_start || this_end < size)
>> continue;
>>
>> + for (; curr >= 0; curr--) {
>> + if (movablecore_map.map[curr].start < this_end)
>> + break;
>> + }
>> +
>> cand = round_down(this_end - size, align);
>> + if (curr >= 0 && cand < movablecore_map.map[curr].end) {
>> + this_end = movablecore_map.map[curr].start;
>> + goto restart;
>> + }
>> +
>> if (cand >= this_start)
>> return cand;
>> }
>> +
>> return 0;
>> }
>>
>>
>
>
On 11/26/2012 05:15 AM, Tang Chen wrote:
>
> Hi Wu,
>
> That is really a problem. Before NUMA memory gets initialized, the
> memblock subsystem is used to allocate memory. I didn't find any
> approach that could fully address this when I was making the patches.
> There is always a risk that memblock allocates memory in ZONE_MOVABLE.
> I think we can only do our best to prevent it from happening.
>
> Your patch is very helpful. And after a short look at the code, it seems
> that acpi_numa_memory_affinity_init() is an architecture-dependent
> function. Could we do this somewhere that does not depend on the
> architecture?
>
The movable memory should be classified as a non-RAM type in memblock,
that way we will not allocate from it early on.
-hpa
On 2012/11/26 23:48, H. Peter Anvin wrote:
> On 11/26/2012 05:15 AM, Tang Chen wrote:
>>
>> Hi Wu,
>>
>> That is really a problem. Before NUMA memory gets initialized, the
>> memblock subsystem is used to allocate memory. I didn't find any
>> approach that could fully address this when I was making the patches.
>> There is always a risk that memblock allocates memory in ZONE_MOVABLE.
>> I think we can only do our best to prevent it from happening.
>>
>> Your patch is very helpful. And after a short look at the code, it seems
>> that acpi_numa_memory_affinity_init() is an architecture-dependent
>> function. Could we do this somewhere that does not depend on the
>> architecture?
>>
>
> The movable memory should be classified as a non-RAM type in memblock,
> that way we will not allocate from it early on.
>
> -hpa
yep, we can put movable memory in reserved.regions in memblock.
>
>
> .
>
On 2012-11-26 23:48, H. Peter Anvin wrote:
> On 11/26/2012 05:15 AM, Tang Chen wrote:
>>
>> Hi Wu,
>>
>> That is really a problem. Before NUMA memory gets initialized, the
>> memblock subsystem is used to allocate memory. I didn't find any
>> approach that could fully address this when I was making the patches.
>> There is always a risk that memblock allocates memory in ZONE_MOVABLE.
>> I think we can only do our best to prevent it from happening.
>>
>> Your patch is very helpful. And after a short look at the code, it seems
>> that acpi_numa_memory_affinity_init() is an architecture-dependent
>> function. Could we do this somewhere that does not depend on the
>> architecture?
>>
>
> The movable memory should be classified as a non-RAM type in memblock,
> that way we will not allocate from it early on.
Hi Peter,
I have tried to reserve movable memory from the bootmem allocator, but the
ACPICA subsystem is initialized later than the movable zones are set up.
So I am still trying to figure out a way to set up/reserve movable zones
according to information from static ACPI tables such as SRAT/MPST etc.
Regards!
Gerry
>
> -hpa
>
>
> .
>
On 11/26/2012 05:12 PM, Jiang Liu wrote:
> Hi Peter,
>
> I have tried to reserve movable memory from the bootmem allocator, but the
> ACPICA subsystem is initialized later than the movable zones are set up.
> So I am still trying to figure out a way to set up/reserve movable zones
> according to information from static ACPI tables such as SRAT/MPST etc.
>
[Adding Len Brown]
Right, for the case of platform-configured memory. Len, I'm wondering
if there is any reasonable way we can get memory-map-related stuff out
of ACPI before we initialize the full ACPICA... we could of course write
an ad hoc static parser (these are just static tables, after all), but
I'm not sure if that fits into your overall view of how the subsystem
should work?
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
On 2012-11-23 18:44, Tang Chen wrote:
> [What we are doing]
> This patchset provides a boot option for users to specify the ZONE_MOVABLE memory
> map for each node in the system.
>
> movablecore_map=nn[KMG]@ss[KMG]
>
Hi Tang,
DMA addresses can't be set as movable. If someone boots the kernel with
movablecore_map=4G@0xa00000, or another memory region that contains DMA
addresses, the system may fail to boot. Should this case be handled or
mentioned in the change log and kernel-parameters.txt?
Thanks,
Jianguo Wu
> This option ensures that the memory range from ss to ss+nn is movable memory.
>
>
> [Why we do this]
> If we hot remove memory, that memory cannot contain kernel memory,
> because Linux cannot currently migrate kernel memory. Therefore,
> we have to guarantee that the memory to be hot removed contains
> only movable memory.
>
> Linux has two boot options, kernelcore= and movablecore=, for
> creating movable memory. These boot options can specify the amount
> of memory to use as kernel or movable memory. Using them, we can
> create ZONE_MOVABLE, which has only movable memory.
>
> But they do not fulfill a requirement of memory hot remove, because
> even if we specify the boot options, movable memory is distributed
> evenly across the nodes. So when we want to hot remove memory whose
> range is 0x80000000-0xc0000000, we have no way to specify that
> memory as movable memory.
>
> So we propose a new feature that specifies the memory range to use
> as movable memory.
>
>
> [Ways to do this]
> There may be 2 ways to specify movable memory.
> 1. use firmware information
> 2. use boot option
>
> 1. use firmware information
> According to the ACPI 5.0 spec, the SRAT table has a memory affinity
> structure, and the structure has a Hot Pluggable Field. See "5.2.16.2
> Memory Affinity Structure". Using this information, we might be able
> to let firmware specify movable memory. For example, if the Hot
> Pluggable Field is set, Linux marks the memory as movable memory.
>
> 2. use boot option
> This is our proposal. A new boot option can specify the memory range
> to use as movable memory.
>
>
> [How we do this]
> We chose the second way, because with the first way users cannot
> easily change the memory range to use as movable memory. We think
> that creating movable memory may cause a NUMA-related performance
> regression. In that case, the user can easily turn the feature off
> if we provide a boot option. A boot option also lets the user easily
> select which memory to use as movable memory.
>
>
> [How to use]
> Specify the following boot option:
> movablecore_map=nn[KMG]@ss[KMG]
>
> That means physical address range from ss to ss+nn will be allocated as
> ZONE_MOVABLE.
>
> And the following points should be considered.
>
> 1) If the range is involved in a single node, then from ss to the end of
> the node will be ZONE_MOVABLE.
> 2) If the range covers two or more nodes, then from ss to the end of
> the node will be ZONE_MOVABLE, and all the other nodes will only
> have ZONE_MOVABLE.
> 3) If no range is in the node, then the node will have no ZONE_MOVABLE
> unless kernelcore or movablecore is specified.
> 4) This option could be specified at most MAX_NUMNODES times.
> 5) If kernelcore or movablecore is also specified, movablecore_map will have
> higher priority to be satisfied.
> 6) This option has no conflict with memmap option.
>
>
>
> Tang Chen (4):
> page_alloc: add movable_memmap kernel parameter
> page_alloc: Introduce zone_movable_limit[] to keep movable limit for
> nodes
> page_alloc: Make movablecore_map has higher priority
> page_alloc: Bootmem limit with movablecore_map
>
> Yasuaki Ishimatsu (1):
> x86: get pg_data_t's memory from other node
>
> Documentation/kernel-parameters.txt | 17 +++
> arch/x86/mm/numa.c | 11 ++-
> include/linux/memblock.h | 1 +
> include/linux/mm.h | 11 ++
> mm/memblock.c | 15 +++-
> mm/page_alloc.c | 216 ++++++++++++++++++++++++++++++++++-
> 6 files changed, 263 insertions(+), 8 deletions(-)
>
At 11/27/2012 08:58 AM, Jianguo Wu Wrote:
> On 2012/11/26 23:48, H. Peter Anvin wrote:
>
>> On 11/26/2012 05:15 AM, Tang Chen wrote:
>>>
>>> Hi Wu,
>>>
>>> That is really a problem. Before NUMA memory gets initialized, the
>>> memblock subsystem is used to allocate memory. I didn't find any
>>> approach that could fully address this when I was making the patches.
>>> There is always a risk that memblock allocates memory in ZONE_MOVABLE.
>>> I think we can only do our best to prevent it from happening.
>>>
>>> Your patch is very helpful. And after a short look at the code, it seems
>>> that acpi_numa_memory_affinity_init() is an architecture-dependent
>>> function. Could we do this somewhere that does not depend on the
>>> architecture?
>>>
>>
>> The movable memory should be classified as a non-RAM type in memblock,
>> that way we will not allocate from it early on.
>>
>> -hpa
>
>
> yep, we can put movable memory in reserved.regions in memblock.
Hmm, I don't think so. If so, memory in reserved.regions would contain two
types of memory: bootmem and movable memory. We put all pages not in
reserved.regions into the buddy system. If we put movable memory in
reserved.regions, we have no chance to put those pages into the buddy
system, and can't use them after the system boots.
Thanks
Wen Congyang
>
>>
>>
>> .
>>
>
>
>
>
On 2012/11/27 11:19, Wen Congyang wrote:
> At 11/27/2012 08:58 AM, Jianguo Wu Wrote:
>> On 2012/11/26 23:48, H. Peter Anvin wrote:
>>
>>> On 11/26/2012 05:15 AM, Tang Chen wrote:
>>>>
>>>> Hi Wu,
>>>>
>>>> That is really a problem. Before NUMA memory gets initialized, the
>>>> memblock subsystem is used to allocate memory. I didn't find any
>>>> approach that could fully address this when I was making the patches.
>>>> There is always a risk that memblock allocates memory in ZONE_MOVABLE.
>>>> I think we can only do our best to prevent it from happening.
>>>>
>>>> Your patch is very helpful. And after a short look at the code, it seems
>>>> that acpi_numa_memory_affinity_init() is an architecture-dependent
>>>> function. Could we do this somewhere that does not depend on the
>>>> architecture?
>>>>
>>>
>>> The movable memory should be classified as a non-RAM type in memblock,
>>> that way we will not allocate from it early on.
>>>
>>> -hpa
>>
>>
>> yep, we can put movable memory in reserved.regions in memblock.
>
> Hmm, I don't think so. If so, memory in reserved.regions would contain two
> types of memory: bootmem and movable memory. We put all pages not in
> reserved.regions into the buddy system. If we put movable memory in
> reserved.regions, we have no chance to put those pages into the buddy
> system, and can't use them after the system boots.
>
Yes, you are right. Or we can fix up movablecore_map when adding a memory region to memblock.
> Thanks
> Wen Congyang
>
>>
>>>
>>>
>>> .
>>>
>>
>>
>>
>>
>
>
> .
>
At 11/27/2012 11:22 AM, Jianguo Wu Wrote:
> On 2012/11/27 11:19, Wen Congyang wrote:
>
>> At 11/27/2012 08:58 AM, Jianguo Wu Wrote:
>>> On 2012/11/26 23:48, H. Peter Anvin wrote:
>>>
>>>> On 11/26/2012 05:15 AM, Tang Chen wrote:
>>>>>
>>>>> Hi Wu,
>>>>>
>>>>> That is really a problem. Before NUMA memory gets initialized, the
>>>>> memblock subsystem is used to allocate memory. I didn't find any
>>>>> approach that could fully address this when I was making the patches.
>>>>> There is always a risk that memblock allocates memory in ZONE_MOVABLE.
>>>>> I think we can only do our best to prevent it from happening.
>>>>>
>>>>> Your patch is very helpful. And after a short look at the code, it seems
>>>>> that acpi_numa_memory_affinity_init() is an architecture-dependent
>>>>> function. Could we do this somewhere that does not depend on the
>>>>> architecture?
>>>>>
>>>>
>>>> The movable memory should be classified as a non-RAM type in memblock,
>>>> that way we will not allocate from it early on.
>>>>
>>>> -hpa
>>>
>>>
>>> yep, we can put movable memory in reserved.regions in memblock.
>>
>> Hmm, I don't think so. If so, memory in reserved.regions would contain two
>> types of memory: bootmem and movable memory. We put all pages not in
>> reserved.regions into the buddy system. If we put movable memory in
>> reserved.regions, we have no chance to put those pages into the buddy
>> system, and can't use them after the system boots.
>>
>
> Yes, you are right. Or we can fix up movablecore_map when adding a memory region to memblock.
If so, we would need to know each node's address range...
Thanks
Wen Congyang
>> Thanks
>> Wen Congyang
>>
>>>
>>>>
>>>>
>>>> .
>>>>
>>>
>>>
>>>
>>>
>>
>>
>> .
>>
>
>
>
>
At 11/26/2012 11:48 PM, H. Peter Anvin Wrote:
> On 11/26/2012 05:15 AM, Tang Chen wrote:
>>
>> Hi Wu,
>>
>> That is really a problem. Before NUMA memory gets initialized, the
>> memblock subsystem is used to allocate memory. I didn't find any
>> approach that could fully address this when I was making the patches.
>> There is always a risk that memblock allocates memory in ZONE_MOVABLE.
>> I think we can only do our best to prevent it from happening.
>>
>> Your patch is very helpful. And after a short look at the code, it seems
>> that acpi_numa_memory_affinity_init() is an architecture-dependent
>> function. Could we do this somewhere that does not depend on the
>> architecture?
>>
>
> The movable memory should be classified as a non-RAM type in memblock,
> that way we will not allocate from it early on.
Hi, hpa
The problem is this:
node1's address range is [18G, 34G), and the user-specified movable map is [8G, 24G).
We don't know node1's address range before NUMA init, so we can't prevent
allocating boot memory in the range [24G, 34G).
You say the movable memory should be classified as a non-RAM type in
memblock; what exactly do you mean? We don't save a type in memblock,
because we only add E820_RAM and E820_RESERVED_KERN to memblock.
Thanks
Wen Congyang
>
> -hpa
>
>
On 11/26/2012 07:15 PM, Wen Congyang wrote:
>
> Hi, hpa
>
> The problem is this:
> node1's address range is [18G, 34G), and the user-specified movable map is [8G, 24G).
> We don't know node1's address range before NUMA init, so we can't prevent
> allocating boot memory in the range [24G, 34G).
>
> You say the movable memory should be classified as a non-RAM type in
> memblock; what exactly do you mean? We don't save a type in memblock,
> because we only add E820_RAM and E820_RESERVED_KERN to memblock.
>
We either need to keep the type or not add it to the memblocks.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
On 11/27/2012 11:10 AM, wujianguo wrote:
> On 2012-11-23 18:44, Tang Chen wrote:
>> [What we are doing]
>> This patchset provides a boot option for users to specify the ZONE_MOVABLE memory
>> map for each node in the system.
>>
>> movablecore_map=nn[KMG]@ss[KMG]
>>
>
> Hi Tang,
> DMA addresses can't be set as movable. If someone boots the kernel with
> movablecore_map=4G@0xa00000, or another memory region that contains DMA
> addresses, the system may fail to boot. Should this case be handled or
> mentioned in the change log and kernel-parameters.txt?
Hi Wu,
Right, DMA addresses can't be set as movable. And I should have mentioned
it in the doc more clearly. :)
Actually, the situation is not limited to DMA addresses. Because we limit
memblock allocation, if users set too much memory as movable, even without
touching DMA addresses, there will be too little memory for the kernel to
use, and the kernel will also fail to boot.
I added the following info to the doc, but obviously it was not clear
enough. :)
+ If kernelcore or movablecore is also specified,
+ movablecore_map will have higher priority to be
+ satisfied. So the administrator should be careful that
+ the total movablecore_map area is not too large.
+ Otherwise the kernel won't have enough memory to boot.
As for how to fix it: as you said, we can handle the case where the
user specifies a DMA address as movable. But how do we handle the "too
little memory for the kernel to start" case? Is there any info about
how much memory the kernel needs at minimum?
Thanks for the comments. :)
>
> Thanks,
> Jianguo Wu
>
On 11/26/2012 09:43 PM, Tang Chen wrote:
>
> As for how to fix it: as you said, we can handle the case where the
> user specifies a DMA address as movable. But how do we handle the "too
> little memory for the kernel to start" case? Is there any info about
> how much memory the kernel needs at minimum?
>
Not really, and it depends on so many variables.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
On 2012/11/27 13:43, Tang Chen wrote:
> On 11/27/2012 11:10 AM, wujianguo wrote:
>> On 2012-11-23 18:44, Tang Chen wrote:
>>> [What we are doing]
>>> This patchset provides a boot option for users to specify the ZONE_MOVABLE memory
>>> map for each node in the system.
>>>
>>> movablecore_map=nn[KMG]@ss[KMG]
>>>
>>
>> Hi Tang,
>> DMA addresses can't be set as movable. If someone boots the kernel with
>> movablecore_map=4G@0xa00000, or another memory region that contains DMA
>> addresses, the system may fail to boot. Should this case be handled or
>> mentioned in the change log and kernel-parameters.txt?
>
> Hi Wu,
>
> Right, DMA addresses can't be set as movable. And I should have mentioned
> it in the doc more clearly. :)
>
> Actually, the situation is not limited to DMA addresses. Because we limit
> memblock allocation, if users set too much memory as movable, even without
> touching DMA addresses, there will be too little memory for the kernel to
> use, and the kernel will also fail to boot.
>
> I added the following info to the doc, but obviously it was not clear
> enough. :)
> + If kernelcore or movablecore is also specified,
> + movablecore_map will have higher priority to be
> + satisfied. So the administrator should be careful that
> + the total movablecore_map area is not too large.
> + Otherwise the kernel won't have enough memory to boot.
>
>
> As for how to fix it: as you said, we can handle the case where the
> user specifies a DMA address as movable. But how do we handle the "too
> little memory for the kernel to start" case? Is there any info about
> how much memory the kernel needs at minimum?
>
As far as I know, bootmem is mostly used by page structs when CONFIG_SPARSEMEM=y.
But it is hard to calculate exactly how much bootmem is needed.
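As a rough illustration (assuming a 64-byte struct page and 4 KiB pages):
the memmap alone consumes 64/4096, about 1.6%, of the memory it describes,
so a 64 GB node needs on the order of 1 GB of bootmem just for its page
structs, before any other early allocations.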
>
> Thanks for the comments. :)
>
>>
>> Thanks,
>> Jianguo Wu
>>
>
>
>
> .
>
Hi Tang,
On Fri, Nov 23, 2012 at 6:44 PM, Tang Chen <[email protected]> wrote:
> [What we are doing]
> This patchset provides a boot option for users to specify the ZONE_MOVABLE memory
> map for each node in the system.
>
> movablecore_map=nn[KMG]@ss[KMG]
>
> This option ensures that the memory range from ss to ss+nn is movable memory.
>
>
> [Why we do this]
> If we hot remove memory, that memory cannot contain kernel memory,
> because Linux cannot currently migrate kernel memory. Therefore,
> we have to guarantee that the memory to be hot removed contains
> only movable memory.
>
> Linux has two boot options, kernelcore= and movablecore=, for
> creating movable memory. These boot options can specify the amount
> of memory to use as kernel or movable memory. Using them, we can
> create ZONE_MOVABLE, which has only movable memory.
>
> But they do not fulfill a requirement of memory hot remove, because
> even if we specify the boot options, movable memory is distributed
> evenly across the nodes. So when we want to hot remove memory whose
> range is 0x80000000-0xc0000000, we have no way to specify that
> memory as movable memory.
>
Sorry, I still don't get your idea.
Why do you need a specific range that is movable?
Could you describe the requirement and situation a bit more?
Thank you.
> So we propose a new feature that specifies the memory range to use
> as movable memory.
>
>
> [Ways to do this]
> There may be 2 ways to specify movable memory.
> 1. use firmware information
> 2. use boot option
>
> 1. use firmware information
> According to the ACPI 5.0 spec, the SRAT table has a memory affinity
> structure, and the structure has a Hot Pluggable Field. See "5.2.16.2
> Memory Affinity Structure". Using this information, we might be able
> to let firmware specify movable memory. For example, if the Hot
> Pluggable Field is set, Linux marks the memory as movable memory.
>
> 2. use boot option
> This is our proposal. A new boot option can specify the memory range
> to use as movable memory.
>
>
> [How we do this]
> We chose the second way, because with the first way users cannot
> easily change the memory range to use as movable memory. We think
> that creating movable memory may cause a NUMA-related performance
> regression. In that case, the user can easily turn the feature off
> if we provide a boot option. A boot option also lets the user easily
> select which memory to use as movable memory.
>
>
> [How to use]
> Specify the following boot option:
> movablecore_map=nn[KMG]@ss[KMG]
>
> That means physical address range from ss to ss+nn will be allocated as
> ZONE_MOVABLE.
>
> And the following points should be considered.
>
> 1) If the range is involved in a single node, then from ss to the end of
> the node will be ZONE_MOVABLE.
> 2) If the range covers two or more nodes, then from ss to the end of
> the node will be ZONE_MOVABLE, and all the other nodes will only
> have ZONE_MOVABLE.
> 3) If no range is in the node, then the node will have no ZONE_MOVABLE
> unless kernelcore or movablecore is specified.
> 4) This option could be specified at most MAX_NUMNODES times.
> 5) If kernelcore or movablecore is also specified, movablecore_map will have
> higher priority to be satisfied.
> 6) This option has no conflict with memmap option.
>
>
>
> Tang Chen (4):
> page_alloc: add movable_memmap kernel parameter
> page_alloc: Introduce zone_movable_limit[] to keep movable limit for
> nodes
> page_alloc: Make movablecore_map has higher priority
> page_alloc: Bootmem limit with movablecore_map
>
> Yasuaki Ishimatsu (1):
> x86: get pg_data_t's memory from other node
>
> Documentation/kernel-parameters.txt | 17 +++
> arch/x86/mm/numa.c | 11 ++-
> include/linux/memblock.h | 1 +
> include/linux/mm.h | 11 ++
> mm/memblock.c | 15 +++-
> mm/page_alloc.c | 216 ++++++++++++++++++++++++++++++++++-
> 6 files changed, 263 insertions(+), 8 deletions(-)
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
--
Regards,
-Bob
On 11/27/2012 04:00 PM, Bob Liu wrote:
> Hi Tang,
>
> On Fri, Nov 23, 2012 at 6:44 PM, Tang Chen<[email protected]> wrote:
>> [...]
>
> Sorry, I still don't get your idea.
> Why do you need to specify a range that is movable?
> Could you describe the requirement and situation a bit more?
> Thank you.
Hi Liu,
This feature is used in memory hotplug.
In order to implement whole-node hotplug, we need to make sure the
node contains no kernel memory, because memory used by the kernel
cannot be migrated. (Kernel memory is directly mapped:
VA = PA + __PAGE_OFFSET, so its physical address cannot be changed.)
Users could specify all the memory on a node as movable, so that the
node can be hot-removed.
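For reference, a simplified sketch of that direct mapping (the real
macros live under arch/x86/include/asm/):

/* Direct-mapped kernel addresses sit at a fixed offset from their
 * physical addresses, so such pages can never change physical
 * location, i.e. they cannot be migrated. */
#define __va(pa) ((void *)((unsigned long)(pa) + PAGE_OFFSET))
#define __pa(va) ((unsigned long)(va) - PAGE_OFFSET)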
Another approach would be something like:
movable_node=1,3-5,8
This would set all the memory on those nodes movable, and the rest of
memory would work as usual. But movablecore_map is more flexible.
Thanks. :)
On 11/27/2012 12:29 AM, Tang Chen wrote:
> Another approach would be something like:
> movable_node=1,3-5,8
> This would set all the memory on those nodes movable, and the rest of
> memory would work as usual. But movablecore_map is more flexible.
... but *much* harder for users, so movable_node is better in most cases.
-hpa
At 11/27/2012 04:49 PM, H. Peter Anvin Wrote:
> On 11/27/2012 12:29 AM, Tang Chen wrote:
>> Another approach would be something like:
>> movable_node=1,3-5,8
>> This would set all the memory on those nodes movable, and the rest of
>> memory would work as usual. But movablecore_map is more flexible.
>
> ... but *much* harder for users, so movable_node is better in most cases.
But NUMA is initialized very late, and we need the information in SRAT...
Thanks
Wen Congyang
>
> -hpa
>
>
On 11/27/2012 01:47 AM, Wen Congyang wrote:
> At 11/27/2012 04:49 PM, H. Peter Anvin Wrote:
>> On 11/27/2012 12:29 AM, Tang Chen wrote:
>>> Another approach would be something like:
>>> movable_node=1,3-5,8
>>> This would set all the memory on those nodes movable, and the rest of
>>> memory would work as usual. But movablecore_map is more flexible.
>>
>> ... but *much* harder for users, so movable_node is better in most cases.
>
> But NUMA is initialized very late, and we need the information in SRAT...
>
> Thanks
> Wen Congyang
>
I think you need to deal with it for usability reasons, though...
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
Hi HPA and Tang,
2012/11/27 17:49, H. Peter Anvin wrote:
> On 11/27/2012 12:29 AM, Tang Chen wrote:
>> Another approach would be something like:
>> movable_node=1,3-5,8
>> This would set all the memory on those nodes movable, and the rest of
>> memory would work as usual. But movablecore_map is more flexible.
>
> ... but *much* harder for users, so movable_node is better in most cases.
It seems that movable_node is easier to use than movablecore_map.
But I do not think movable_node is better, because node numbers are
assigned by the OS and can change easily.
For example:
If the system has 4 nodes and we set movable_node=2, we can hot-remove node2.
node0     node1     node2     node3
+-----+   +-----+   +-----+   +-----+
|     |   |     |   |/////|   |     |
|     |   |     |   |/////|   |     |
|     |   |     |   |/////|   |     |
|     |   |     |   |/////|   |     |
+-----+   +-----+   +-----+   +-----+
                    movable
                     node
But if we hot-remove node2 and reboot the system, node3 is renumbered to
node2 and becomes the movable node.
node0     node1     node2
+-----+   +-----+   +-----+
|     |   |     |   |/////|
|     |   |     |   |/////|
|     |   |     |   |/////|
|     |   |     |   |/////|
+-----+   +-----+   +-----+
                    movable
                     node
Originally, node3 was not a movable node, and changing its attribute to
movable is not intended. So if users use movable_node, they must confirm
that the boot option is still set correctly after each hotplug.
But memory ranges are set by firmware and do not change. So if we set
node2 as a movable node with movablecore_map, the issue does not occur.
Thanks,
Yasuaki Ishimatsu
>
> -hpa
>
On Tue, Nov 27, 2012 at 4:29 PM, Tang Chen <[email protected]> wrote:
> On 11/27/2012 04:00 PM, Bob Liu wrote:
>>
>> Hi Tang,
>>
>> On Fri, Nov 23, 2012 at 6:44 PM, Tang Chen<[email protected]>
>> wrote:
>>> [...]
>>
>> Sorry, I still don't get your idea.
>> Why do you need to specify a range that is movable?
>> Could you describe the requirement and situation a bit more?
>> Thank you.
>
>
> Hi Liu,
>
> This feature is used in memory hotplug.
>
> In order to implement whole-node hotplug, we need to make sure the
> node contains no kernel memory, because memory used by the kernel
> cannot be migrated. (Kernel memory is directly mapped:
> VA = PA + __PAGE_OFFSET, so its physical address cannot be changed.)
>
> Users could specify all the memory on a node as movable, so that the
> node can be hot-removed.
>
Thank you for your explanation. It's reasonable.
But I think it's a bit duplicated with CMA. I'm not sure, but maybe we
can combine it with CMA, which is already in mainline?
> [...]
--
Regards,
--Bob
On 11/27/2012 08:09 PM, Bob Liu wrote:
> On Tue, Nov 27, 2012 at 4:29 PM, Tang Chen<[email protected]> wrote:
>> [...]
>
> Thank you for your explanation. It's reasonable.
>
> But I think it's a bit duplicated with CMA. I'm not sure, but maybe we
> can combine it with CMA, which is already in mainline?
>
Hi Liu,
Thanks for your advice. :)
CMA is the Contiguous Memory Allocator, right? What I'm trying to do
is control where ZONE_MOVABLE starts on each node. Could CMA do this
job?
Also, after a short investigation, CMA seems to be based on memblock.
But we need to keep memblock from allocating memory in ZONE_MOVABLE.
As a result, we need to know the ranges before memblock can be used.
I'm afraid we still need some way to get the ranges, such as a boot
option, or static ACPI tables such as SRAT/MPST.
I don't know much about CMA for now. So if you have any better idea,
please share it with us, thanks. :)
On Tue, Nov 27, 2012 at 8:49 PM, Tang Chen <[email protected]> wrote:
> On 11/27/2012 08:09 PM, Bob Liu wrote:
>>
>> On Tue, Nov 27, 2012 at 4:29 PM, Tang Chen<[email protected]>
>> wrote:
>> [...]
> Hi Liu,
>
> Thanks for your advice. :)
>
> CMA is the Contiguous Memory Allocator, right? What I'm trying to do
> is control where ZONE_MOVABLE starts on each node. Could CMA do this
> job?
CMA will not control the start of ZONE_MOVABLE on each node, but it can
declare a memory area that is always movable, and no non-movable
allocation requests will be served from that area.
Currently CMA uses a boot parameter, "cma=", to declare a memory size
that is always movable. I think it might fulfill your requirement if
the boot parameter were extended with a start address.
More info at http://lwn.net/Articles/468044/
>
> Also, after a short investigation, CMA seems to be based on memblock.
> But we need to keep memblock from allocating memory in ZONE_MOVABLE.
> As a result, we need to know the ranges before memblock can be used.
> I'm afraid we still need some way to get the ranges, such as a boot
> option, or static ACPI tables such as SRAT/MPST.
>
Yes, it's based on memblock, with a boot option.
In setup_arch32():
    dma_contiguous_reserve(0); => will declare a CMA area using
    memblock_reserve()
> I don't know much about CMA for now. So if you have any better idea,
> please share it with us, thanks. :)
My idea is to reuse CMA, as in the patch below (not even compiled), and
boot with "cma=size@start_address".
I don't know whether it can work or whether it suits your requirement;
if not, forgive me for the noise.
diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
index 612afcc..564962a 100644
--- a/drivers/base/dma-contiguous.c
+++ b/drivers/base/dma-contiguous.c
@@ -59,11 +59,18 @@ struct cma *dma_contiguous_default_area;
  */
 static const unsigned long size_bytes = CMA_SIZE_MBYTES * SZ_1M;
 static long size_cmdline = -1;
+static long cma_start_cmdline = -1;
 
 static int __init early_cma(char *p)
 {
 	pr_debug("%s(%s)\n", __func__, p);
 	size_cmdline = memparse(p, &p);
+
+	if (*p == '@')
+		cma_start_cmdline = memparse(p + 1, &p);
+	printk("cma start: 0x%lx, size: 0x%lx\n",
+	       cma_start_cmdline, size_cmdline);
 	return 0;
 }
 early_param("cma", early_cma);
@@ -127,8 +134,10 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
 	if (selected_size) {
 		pr_debug("%s: reserving %ld MiB for global area\n", __func__,
 			 selected_size / SZ_1M);
-
-		dma_declare_contiguous(NULL, selected_size, 0, limit);
+		if (cma_start_cmdline != -1)
+			dma_declare_contiguous(NULL, selected_size,
+					       cma_start_cmdline, limit);
+		else
+			dma_declare_contiguous(NULL, selected_size, 0, limit);
 	}
 };
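(If the sketch worked, booting with e.g. "cma=2G@0x100000000" would
declare a 2 GiB always-movable area starting at the 4 GiB boundary.)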
--
Regards,
--Bob
On 11/27/2012 11:10 AM, wujianguo wrote:
>
> Hi Tang,
> DMA addresses can't be set as movable. If someone boots the kernel with
> movablecore_map=4G@0xa00000, or another memory region that contains DMA
> addresses, the system may fail to boot. Should this case be handled or
> mentioned in the changelog and kernel-parameters.txt?
Hi Wu,
I think we can use MAX_DMA_PFN and MAX_DMA32_PFN to prevent DMA
addresses from being set as movable: just ignore addresses lower than
them, and set the rest as movable. What do you think?
And since we cannot figure out the minimum amount of memory the kernel
needs, I think for now we can just add a warning to kernel-parameters.txt.
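A minimal sketch of that clamping (sanitize_movablecore_map() is a
hypothetical helper; MAX_DMA_PFN and MAX_DMA32_PFN are the existing x86
limits):

/* Keep a parsed movablecore_map range out of the DMA/DMA32 windows by
 * raising its start pfn; hypothetical helper for illustration only. */
static void __init sanitize_movablecore_map(unsigned long *start_pfn,
					    unsigned long *end_pfn)
{
	if (*start_pfn < MAX_DMA32_PFN)
		*start_pfn = MAX_DMA32_PFN;	/* ignore the DMA part */
	if (*end_pfn < *start_pfn)
		*end_pfn = *start_pfn;	/* range was entirely below DMA32 */
}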
Thanks. :)
On 2012-11-28 11:47, Tang Chen wrote:
> [...]
On one other OS, there is a mechanism to dynamically convert pages from
movable zones into normal zones.
Regards!
Gerry
On 2012-11-28 11:24, Bob Liu wrote:
> [...]
Seems a good idea to reserve memory by reusing the CMA logic, though it
needs more investigation. One of CMA's goals is to ensure that pages in
CMA are really movable, and this patchset tries to achieve the same goal
at first glance.
On 2012/11/28 11:47, Tang Chen wrote:
> [...]
> I think we can use MAX_DMA_PFN and MAX_DMA32_PFN to prevent DMA
> addresses from being set as movable: just ignore addresses lower than
> them, and set the rest as movable. What do you think?
>
I think it's OK for now.
> And since we cannot figure out the minimum amount of memory the kernel
> needs, I think for now we can just add a warning to kernel-parameters.txt.
>
> Thanks. :)
At 11/28/2012 12:01 PM, Jiang Liu Wrote:
> [...]
> On one other OS, there is a mechanism to dynamically convert pages from
> movable zones into normal zones.
Does the OS do it automatically, or does the user convert it?
We can convert pages from movable zones into normal zones by the following
interface:
echo online_kernel >/sys/devices/system/memory/memoryX/state
We have posted a patchset to implement it, and it is in the mm tree now.
Thanks
Wen Congyang
On 2012-11-28 13:21, Wen Congyang wrote:
> At 11/28/2012 12:01 PM, Jiang Liu Wrote:
>> [...]
>> On one other OS, there is a mechanism to dynamically convert pages from
>> movable zones into normal zones.
>
> Does the OS do it automatically, or does the user convert it?
>
> We can convert pages from movable zones into normal zones by the following
> interface:
> echo online_kernel >/sys/devices/system/memory/memoryX/state
>
> We have posted a patchset to implement it, and it is in the mm tree now.
The OS converts it automatically; no manual operations are needed.
Hi Bob, Liu Jiang,
About CMA, could you give me more info?
Thanks for your patience and nice advice. :)
1) I saw the following on http://lwn.net/Articles/447405/:
     The "CMA" type is sticky; pages which are marked as being for CMA
     should never have their migration type changed by the kernel.
As Wen said, we now support a user interface to change movable memory
into kernel memory. But from the above, memory specified as CMA cannot
be changed, right? If so, I don't think using CMA is a good idea.
2) Is CMA implemented only on the ARM platform? I found the following
in kernel-parameters.txt:
     cma=nn[MG] [ARM,KNL]
         Sets the size of kernel global memory area for contiguous
         memory allocations. For more information, see
         include/linux/dma-contiguous.h
We are developing on x86. Could we use it?
3) Is CMA used only for DMA? I am a little confused here. :)
I found that the main CMA code is implemented in dma-contiguous.c.
4) The boot options cma=xxx and movablecore_map=xxx have different
meanings for users. Reusing CMA could confuse them, I'm afraid.
And even if we reuse the "cma=" option, we still need to do the work
in patches 3~5, right?
Thanks. :)
On 11/28/2012 12:08 PM, Jiang Liu wrote:
> [...]
Hi Chen,
If a pageblock's migration type is movable, it may be converted to
reclaimable under memory pressure. CMA was introduced to guarantee
that CMA pages won't be converted to other migrate types.
And we are trying to avoid allocating kernel/DMA memory from specific
memory ranges, so we could easily reclaim pages when hot-removing
memory devices.
I think the idea is not to directly reuse CMA for hotplug, but to
reuse the mechanism to reserve specific memory ranges from the bootmem
allocator, so CMA and hotplug could share the same code.
Basically we may try to reuse dma_declare_contiguous(), so that
we don't need to add special logic to the bootmem allocator.
Regards!
Gerry
On 2012-11-28 14:16, Tang Chen wrote:
> [...]
At 11/28/2012 12:08 PM, Jiang Liu Wrote:
> [...]
> Seems a good idea to reserve memory by reusing the CMA logic, though it
> needs more investigation. One of CMA's goals is to ensure that pages in
> CMA are really movable, and this patchset tries to achieve the same goal
> at first glance.
Hmm, I don't like reusing CMA, because CMA is used for DMA. If we reuse
it for movable memory, I think the movable zone is enough. And the start
address is not acceptable, because we want to specify a start address
for each node.
I think we can implement movablecore_map like this:
1. parse the parameter
2. reserve the memory after efi_reserve_boot_services()
3. release the memory in mem_init
What about this?
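A minimal sketch of steps 2 and 3, assuming the memblock interfaces;
movablecore_map[] and nr_movablecore_map are hypothetical names for the
parsed ranges:

struct movable_range {
	phys_addr_t start;
	phys_addr_t size;
};
static struct movable_range movablecore_map[MAX_NUMNODES] __initdata;
static int nr_movablecore_map __initdata;

static void __init movablecore_reserve(void)	/* step 2 */
{
	int i;

	for (i = 0; i < nr_movablecore_map; i++)
		memblock_reserve(movablecore_map[i].start,
				 movablecore_map[i].size);
}

static void __init movablecore_release(void)	/* step 3, from mem_init() */
{
	int i;

	for (i = 0; i < nr_movablecore_map; i++)
		memblock_free(movablecore_map[i].start,
			      movablecore_map[i].size);
}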
Thanks
Wen Congyang
On 2012-11-28 16:29, Wen Congyang wrote:
> [...]
>
> Hmm, I don't like reusing CMA, because CMA is used for DMA. If we reuse
> it for movable memory, I think the movable zone is enough. And the start
> address is not acceptable, because we want to specify a start address
> for each node.
>
> I think we can implement movablecore_map like this:
> 1. parse the parameter
> 2. reserve the memory after efi_reserve_boot_services()
This sounds good, but the code to reserve memory for movable
nodes will be similar to dma_declare_contiguous().
> 3. release the memory in mem_init
>
> What about this?
>
> Thanks
> Wen Congyang
At 11/28/2012 04:28 PM, Jiang Liu Wrote:
>> [...]
>> Hmm, I don't like reusing CMA, because CMA is used for DMA. If we reuse
>> it for movable memory, I think the movable zone is enough. And the start
>> address is not acceptable, because we want to specify a start address
>> for each node.
>>
>> I think we can implement movablecore_map like this:
>> 1. parse the parameter
>> 2. reserve the memory after efi_reserve_boot_services()
> This sounds good, but the code to reserve memory for movable
> nodes will be similar to dma_declare_contiguous().
Yes, it may be very similar. I think we can move it into mm/page_alloc.c,
and both CMA and movablecore_map can use this function.
Thanks
Wen Congyang
Hi all,
This seems like a great chance to discuss the memory hotplug feature
within this thread, so I will try to give some high-level thoughts about
memory hotplug on x86/IA64. Any comments are welcome!
First of all, I think usability really matters. Ideally, the memory
hotplug feature should just work out of the box, and we shouldn't expect
administrators to add several extra platform-dependent parameters to
enable it. But how do we enable memory (or CPU/node) hotplug out of the
box? I think the key point is to cooperate with the
BIOS/ACPI/firmware/device management teams.
I still position memory hotplug as an advanced feature for high-end
servers, and those systems may/should provide management interfaces to
configure CPU/memory/node hotplug. The configuration UI may be provided
by the BIOS, the BMC, or a centralized system management suite. Once an
administrator enables the hotplug feature through those management UIs,
the OS should support system device hotplug out of the box. For example,
the HP SuperDome2 management suite provides an interface to configure a
node as a floating (hot-removable) node, and OpenSolaris supports
CPU/memory hotplug out of the box without any extra configuration. So we
should shape the interfaces between firmware and OS to better support
system device hotplug.
On the other hand, I think there are no commercially available x86/IA64
platforms with system device hotplug capabilities in the field yet, or
at most a limited quantity. So backward compatibility is not a big issue
for us now, and I think it's doable to rely on firmware to provide
better support for system device hotplug.
Then what should be enhanced to better support system device hotplug?
1) The ACPI specification should be enhanced to provide a static table
describing components with hotplug features, so the OS could reserve
special resources for hotplug at early boot stages; for example,
reserving enough CPU ids for CPU hot-add. Currently we guess the maximum
number of CPUs supported by the platform by counting CPU entries in the
APIC table, which is not reliable.
2) The BIOS should implement SRAT, MPST and PMTT tables to better
support memory hotplug. SRAT associates memory ranges with proximity
domains, with an extra "hot pluggable" flag. PMTT provides memory device
topology information, such as "socket -> memory controller -> DIMM".
MPST is used for memory power management and provides a way to associate
memory ranges with the memory devices in PMTT. With all the information
from SRAT, MPST and PMTT, the OS could figure out hotpluggable memory
ranges automatically, so no extra kernel parameters would be needed.
3) Enhance ACPICA to provide a method to scan static ACPI tables before
the memory subsystem has been initialized, because the OS needs to
access SRAT, MPST and PMTT when initializing the memory subsystem.
4) The last and most important issue is how to minimize the performance
drop caused by memory hotplug. As proposed by this patchset, once we
configure all memory of a NUMA node as movable, it essentially disables
NUMA optimization of kernel memory allocation from that node. According
to experience, that causes a huge performance drop: we have observed a
10-30% performance drop with memory hotplug enabled, and on another OS
the average performance drop caused by memory hotplug is about 10%.
If we can't resolve the performance drop, memory hotplug is just a
feature for demos :( With help from hardware, we do have some chances
to reduce the performance penalty caused by memory hotplug.
As we know, Linux can migrate movable pages, but can't migrate
non-movable pages used by the kernel, DMA, etc. And the hardest part is
how to deal with those unmovable pages when hot-removing a memory
device. Now hardware has given us a hand with a technology named memory
migration, which can transparently migrate memory between memory
devices. There are no OS-visible changes, other than the NUMA topology,
before and after a hardware memory migration.
And if there are multiple memory devices within a NUMA node, we could
configure some memory devices to host unmovable memory and the others to
host movable memory. With this configuration, there won't be a big
performance drop, because we have preserved all NUMA optimizations.
We could then achieve memory hot-remove by:
1) Using the existing page migration mechanism to reclaim movable pages.
2) For memory devices hosting unmovable pages, we need to:
2.1) find a movable memory device on another node with enough capacity
and reclaim it.
2.2) use the hardware migration technology to migrate unmovable memory
to the just-reclaimed memory device on the other node.
I hope we could expect users to adopt memory hotplug technology
with all these implemented.
Back to this patch: we could rely on the mechanism it provides
to automatically mark memory ranges as movable using information
from the ACPI SRAT/MPST/PMTT tables, so an administrator would not need to
manually configure kernel parameters to enable memory hotplug.
Again, any comments are welcome!
Regards!
Gerry
On 2012-11-23 18:44, Tang Chen wrote:
> [snip]
> 1. use firmware information
> According to ACPI spec 5.0, SRAT table has memory affinity structure
> and the structure has Hot Pluggable Filed. See "5.2.16.2 Memory
> Affinity Structure". If we use the information, we might be able to
> specify movable memory by firmware. For example, if Hot Pluggable
> Filed is enabled, Linux sets the memory as movable memory.
>
> 2. use boot option
> This is our proposal. New boot option can specify memory range to use
> as movable memory.
Isn't this just moving the work to the user? To pick good values for the
movable areas, they need to know how the memory lines up across
node boundaries ... because they need to make sure to allow some
non-movable memory allocations on each node so that the kernel can
take advantage of node locality.
So the user would have to read at least the SRAT table, and perhaps
more, to figure out what to provide as arguments.
Since this is going to be used on a dynamic system where nodes might
be added and removed - the right values for these arguments might
change from one boot to the next. So even if the user gets them right
on day 1, a month later when a new node has been added, or a broken
node removed, the values would be stale.
-Tony
On 11/28/2012 01:34 PM, Luck, Tony wrote:
>>
>> 2. use boot option
>> This is our proposal. New boot option can specify memory range to use
>> as movable memory.
>
> [snip]
>
I gave this feedback in person at LCE: I consider the kernel
configuration option to be useless for anything other than debugging.
Trying to promote it as an actual solution, to be used by end users in
the field, is ridiculous at best.
-hpa
On Wed, Nov 28, 2012 at 04:29:01PM +0800, Wen Congyang wrote:
>At 11/28/2012 12:08 PM, Jiang Liu Wrote:
>> On 2012-11-28 11:24, Bob Liu wrote:
>>> On Tue, Nov 27, 2012 at 8:49 PM, Tang Chen <[email protected]> wrote:
>>>> On 11/27/2012 08:09 PM, Bob Liu wrote:
>>>>>
>>>>> On Tue, Nov 27, 2012 at 4:29 PM, Tang Chen<[email protected]>
>>>>> wrote:
>>>>>>
>>>>>> Hi Liu,
>>>>>>
>>>>>>
>>>>>> This feature is used in memory hotplug.
>>>>>>
>>>>>> In order to implement a whole node hotplug, we need to make sure the
>>>>>> node contains no kernel memory, because memory used by kernel could
>>>>>> not be migrated. (Since the kernel memory is directly mapped,
>>>>>> VA = PA + __PAGE_OFFSET. So the physical address could not be changed.)
>>>>>>
>>>>>> User could specify all the memory on a node to be movable, so that the
>>>>>> node could be hot-removed.
>>>>>>
>>>>>
>>>>> Thank you for your explanation. It's reasonable.
>>>>>
>>>>> But i think it's a bit duplicated with CMA, i'm not sure but maybe we
>>>>> can combine it with CMA which already in mainline?
>>>>>
>>>> Hi Liu,
>>>>
>>>> Thanks for your advice. :)
>>>>
>>>> CMA is the Contiguous Memory Allocator, right? What I'm trying to do is
>>>> control where the start of ZONE_MOVABLE is on each node. Could
>>>> CMA do this job?
>>>
>>> CMA will not control the start of ZONE_MOVABLE on each node, but it
>>> can declare a memory area that is always movable,
>>> and no non-movable allocation requests will be served from that area.
>>>
>>> Currently CMA uses a boot parameter, "cma=", to declare a memory size
>>> that is always movable.
>>> I think it might fulfill your requirement if the boot parameter is
>>> extended with a start address.
>>>
>>> more info at http://lwn.net/Articles/468044/
>>>>
>>>> Also, after a short investigation, CMA seems to be based on
>>>> memblock. But we need to keep memblock from allocating memory in
>>>> ZONE_MOVABLE. As a result, we need to know the ranges before memblock
>>>> can be used, so I'm afraid we still need an approach to get the ranges,
>>>> such as a boot option, or static ACPI tables such as SRAT/MPST.
>>>>
>>>
>>> Yes, it's based on memblock and configured with a boot option.
>>> In setup_arch()
>>> dma_contiguous_reserve(0); => will declare a CMA area using
>>> memblock_reserve()
>>>
>>>> I'm don't know much about CMA for now. So if you have any better idea,
>>>> please share with us, thanks. :)
>>>
>>> My idea is to reuse CMA like the patch below (not even compiled) and boot
>>> with "cma=size@start_address".
>>> I don't know whether it can work or whether it suits your
>>> requirement; if not, forgive me for the noise.
>>>
>>> diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
>>> index 612afcc..564962a 100644
>>> --- a/drivers/base/dma-contiguous.c
>>> +++ b/drivers/base/dma-contiguous.c
>>> @@ -59,11 +59,18 @@ struct cma *dma_contiguous_default_area;
>>> */
>>> static const unsigned long size_bytes = CMA_SIZE_MBYTES * SZ_1M;
>>> static long size_cmdline = -1;
>>> +static long cma_start_cmdline = -1;
>>>
>>> static int __init early_cma(char *p)
>>> {
>>> 	pr_debug("%s(%s)\n", __func__, p);
>>> 	size_cmdline = memparse(p, &p);
>>> +
>>> +	if (*p == '@')
>>> +		cma_start_cmdline = memparse(p + 1, &p);
>>> +	printk("cma start: 0x%lx, size: 0x%lx\n",
>>> +	       cma_start_cmdline, size_cmdline);
>>> 	return 0;
>>> }
>>> early_param("cma", early_cma);
>>> @@ -127,8 +134,10 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
>>> 	if (selected_size) {
>>> 		pr_debug("%s: reserving %ld MiB for global area\n", __func__,
>>> 			 selected_size / SZ_1M);
>>> -
>>> -		dma_declare_contiguous(NULL, selected_size, 0, limit);
>>> +		if (cma_start_cmdline != -1)
>>> +			dma_declare_contiguous(NULL, selected_size,
>>> +					       cma_start_cmdline, limit);
>>> +		else
>>> +			dma_declare_contiguous(NULL, selected_size, 0, limit);
>>> 	}
>>> };
>> Seems a good idea to reserve memory by reusing the CMA logic, though it needs
>> more investigation. One of CMA's goals is to ensure that pages in CMA are
>> really movable, and this patchset tries to achieve the same goal at first glance.
>
>Hmm, I don't like reusing CMA, because CMA is meant for DMA. If we reuse it
>for movable memory, I think the movable zone is enough. And a single start
>address is not acceptable, because we want to specify the start address for each node.
>
>I think we can implement movablecore_map like this:
>1. parse the parameter
>2. reserve the memory after efi_reserve_boot_services()
>3. release the memory in mem_init()
>
Hi Tang,
I haven't read the patchset yet, but could you give a short description of
how you designed the implementation in this patchset?
Regards,
Jaegeuk
>What about this?
>
>Thanks
>Wen Congyang
On 11/29/2012 08:43 AM, Jaegeuk Hanse wrote:
> Hi Tang,
>
> I haven't read the patchset yet, but could you give a short description of
> how you designed the implementation in this patchset?
>
> Regards,
> Jaegeuk
>
Hi Jaegeuk,
Thanks for joining in. :)
This feature is used in memory hotplug.
In order to implement whole-node hotplug, we need to make sure the
node contains no kernel memory, because memory used by the kernel cannot
be migrated. (Kernel memory is directly mapped, VA = PA + __PAGE_OFFSET,
so its physical address cannot be changed.)
With this boot option, a user can specify all the memory on a node to
be movable (which means it is in ZONE_MOVABLE), so that the node
could be hot-removed.
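To make the constraint concrete: for direct-mapped kernel memory the
translation is plain constant-offset arithmetic, so there is nothing the
kernel could update if the physical page moved. A two-function sketch, using
the x86_64 __PAGE_OFFSET value:

#include <stdint.h>

#define __PAGE_OFFSET 0xffff880000000000ULL	/* x86_64 direct-map base */

/* VA and PA differ by a constant for direct-mapped memory, so relocating
 * the physical page would invalidate every kernel VA pointing at it --
 * which is exactly why such pages block node hot-remove. */
static inline uint64_t virt_to_phys(uint64_t va) { return va - __PAGE_OFFSET; }
static inline uint64_t phys_to_virt(uint64_t pa) { return pa + __PAGE_OFFSET; }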
Thanks.
On Wed, Nov 28, 2012 at 04:47:42PM +0800, Jiang Liu wrote:
>[snip]
> As we know, Linux could migrate movable page, but can't migrate
>non-movable pages used by kernel/DMA etc. And the most hard part is how
>to deal with those unmovable pages when hot-removing a memory device.
>Now hardware has given us a hand with a technology named memory migration,
>which could transparently migrate memory between memory devices. There's
>no OS visible changes except NUMA topology before and after hardware memory
>migration.
> And if there are multiple memory devices within a NUMA node,
>we could configure some memory devices to host unmovable memory and the
>other to host movable memory. With this configuration, there won't be
>bigger performance drop because we have preserved all NUMA optimizations.
>We also could achieve memory hotplug remove by:
>1) Use existing page migration mechanism to reclaim movable pages.
>2) For memory devices hosting unmovable pages, we need:
>2.1) find a movable memory device on other nodes with enough capacity
>and reclaim it.
>2.2) use hardware migration technology to migrate unmovable memory to
Hi Jiang,
Could you give an explanation of how hardware migration technology works?
Regards,
Jaegeuk
>the just reclaimed memory device on other nodes.
>
>[snip]
On 2012-11-29 9:42, Jaegeuk Hanse wrote:
> On Wed, Nov 28, 2012 at 04:47:42PM +0800, Jiang Liu wrote:
>> [snip]
>
> Hi Jiang,
>
> Could you give an explanation of how hardware migration technology works?
Hi Jaegeuk,
Some servers now support a hardware memory RAS feature called memory
mirroring, something like RAID1. The mirrored memory devices are configured
with the same addresses and host the same contents, and you can transparently
hot-remove one of the mirrored memory devices without any help from the OS.
We can think of memory migration as an extension of the memory mirroring
technology. The basic flow for memory migration is:
1) Find a spare memory device with enough capacity in the system.
2) The OS issues a request to firmware to migrate from the source memory
device (A) to the spare memory device (B).
3) Firmware configures A and B into mirror mode, with A as master
and B as slave.
4) Firmware resilvers the mirror to synchronize the content from A to B.
5) Firmware reconfigures B as master and A as slave.
6) Firmware deconfigures the memory mirror and removes A.
7) Firmware reports the result to the OS.
8) Now the user can hot-remove the source memory device A from the system.
During memory migration, A and B are in mirror mode, so CPUs and IO devices
can access them as normal. After memory migration, memory device B has
the same address ranges and content as memory device A, so there are no
OS-visible changes except latency (because A and B may belong to different
NUMA domains).
So hardware memory migration can be used to migrate pages that can't be
migrated by the OS.
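Purely as an illustrative restatement of this flow (no such firmware
interface exists in Linux today; every name below is invented), the
OS-visible side reduces to issuing a request and watching firmware walk the
mirror states:

#include <stdio.h>

/* Hypothetical OS-visible states for the eight steps above. */
enum hwmig_state {
	HWMIG_IDLE,		/* steps 1-2: OS picks spare B, issues request  */
	HWMIG_MIRROR_A_MASTER,	/* steps 3-4: A master / B slave, resilvering   */
	HWMIG_MIRROR_B_MASTER,	/* step 5: roles swapped                        */
	HWMIG_DONE,		/* steps 6-7: mirror torn down, result reported */
};

/* Stand-in for firmware advancing the mirror through steps 3-7. */
static enum hwmig_state firmware_advance(enum hwmig_state s)
{
	return s == HWMIG_DONE ? s : (enum hwmig_state)(s + 1);
}

int main(void)
{
	enum hwmig_state s = HWMIG_IDLE;

	/* The OS only polls; mirroring keeps A and B usable throughout. */
	while (s != HWMIG_DONE) {
		s = firmware_advance(s);
		printf("migration state -> %d\n", s);
	}
	puts("step 8: source device A can now be hot-removed");
	return 0;
}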
Regards!
Gerry
On 2012-11-29 10:49, Wanpeng Li wrote:
> On Thu, Nov 29, 2012 at 10:25:40AM +0800, Jiang Liu wrote:
>> On 2012-11-29 9:42, Jaegeuk Hanse wrote:
>>> On Wed, Nov 28, 2012 at 04:47:42PM +0800, Jiang Liu wrote:
>>>> [snip]
>>>
>>> Hi Jiang,
>>>
>>> Could you give an explanation of how hardware migration technology works?
>> Hi Jaegeuk,
>> Now some severs support a hardware memory RAS feature called memory
>> mirror, something like RAID1. The mirrored memory devices will be configured
>> with the same address and host same contents. And you could transparently
>> hot-remove one of the mirrored memory device without any help from OS.
>>
>> We could think memory migration as an extension to the memory mirror technology.
>> The basic flow for memory migration is:
>> 1) Find a spare memory device with enough capacity in the system.
>> 2) OS issues a request to firmware to migrate from source memory device (A)
>> to the spare memory device (B).
>> 3) Firmware configures A and B into memory mode, and configure A as master
>> and B as slave.
>
> Hi Jiang,
>
> Thanks for your detailed explanation. But why do we need to configure which
> is master and which is slave? It seems that in your explanation the OS can
> only know about the change after firmware reports the results.
Hi Wanpeng,
It's a hardware requirement. The memory mirror is designed so that
1) all memory read/write transactions are directed to the master, and
2) the master synchronizes write transactions to the slave.
But all these details are handled by the memory controller and are transparent
to the OS. From the Linux mm subsystem's view, it doesn't know or care whether
a memory range is mirrored or not. All the magic is hidden by hardware.
Regards!
Gerry
Hi Tony,
2012/11/29 6:34, Luck, Tony wrote:
>> 1. use firmware information
>> According to ACPI spec 5.0, SRAT table has memory affinity structure
>> and the structure has Hot Pluggable Filed. See "5.2.16.2 Memory
>> Affinity Structure". If we use the information, we might be able to
>> specify movable memory by firmware. For example, if Hot Pluggable
>> Filed is enabled, Linux sets the memory as movable memory.
>>
>> 2. use boot option
>> This is our proposal. New boot option can specify memory range to use
>> as movable memory.
>
> Isn't this just moving the work to the user? To pick good values for the
Yes.
> movable areas, they need to know how the memory lines up across
> node boundaries ... because they need to make sure to allow some
> non-movable memory allocations on each node so that the kernel can
> take advantage of node locality.
There is no problem.
Linux already has two boot options, kernelcore= and movablecore=.
If we use them, non-movable memory is divided evenly among the nodes.
But there is currently no way to specify that a particular node should be
movable, so we proposed the new boot option.
> So the user would have to read at least the SRAT table, and perhaps
> more, to figure out what to provide as arguments.
>
> Since this is going to be used on a dynamic system where nodes might
> be added an removed - the right values for these arguments might
> change from one boot to the next. So even if the user gets them right
> on day 1, a month later when a new node has been added, or a broken
> node removed the values would be stale.
I don't think so. Even if we hot-add/remove a node, the memory range of
each memory device does not change. So we don't need to change the boot
option.
Thanks,
Yasuaki Ishimatsu
>
> -Tony
>
On Wed, Nov 28, 2012 at 01:38:47PM -0800, H. Peter Anvin wrote:
> On 11/28/2012 01:34 PM, Luck, Tony wrote:
> > [snip]
>
> I gave this feedback in person at LCE: I consider the kernel
> configuration option to be useless for anything other than debugging.
> Trying to promote it as an actual solution, to be used by end users in
> the field, is ridiculous at best.
>
I've not been paying a whole pile of attention to this because it's not an
area I'm active in, but I agree that configuring ZONE_MOVABLE like
this at boot time is going to be problematic. As awkward as it is, it
would probably work out better to boot with only one node by default and
then hot-add the nodes at runtime using either an online sysfs file or
an online-reserved file that hot-adds the memory to ZONE_MOVABLE. Still
clumsy, but better than specifying addresses on the command line.
That said, I also find using ZONE_MOVABLE to be a problem in itself that
will cause problems down the road. Maybe this was discussed already but
just in case I'll describe the problems I see.
If any significant percentage of memory is in ZONE_MOVABLE then the memory
hotplug people will have to deal with all the lowmem/highmem problems
that used to be faced by 32-bit x86 with PAE enabled. As a simple example,
metadata-intensive workloads will not be able to use all of memory because
the kernel's allocations will be confined to a subset of memory. A more
complex example is that page table page allocations are also restricted,
meaning it's possible that a process will not even be able to mmap() a high
percentage of memory simply because it cannot allocate the page tables to
store the mappings. ZONE_MOVABLE works up to a *point*, but it's a hack. It
was a hack when it was introduced, but at least then the expectation was
that ZONE_MOVABLE was going to be used for huge pages, and there was at least
an expectation that it would not be available for normal usage.
Fundamentally, the reason one would want to use ZONE_MOVABLE is that
we cannot migrate a lot of kernel memory -- slab pages, page table pages,
device-allocated buffers, etc. My understanding is that other OSes get around
this by requiring that subsystems and drivers have callbacks that allow the
core VM to force certain memory to be released, but that may be impractical
for Linux. I don't know for sure though; this is just what I heard.
For Linux, the hotplug people need to start thinking about how to get
around this migration problem. The first problem faced is the memory model
and how it maps virt->phys addresses. We have a 1:1 mapping because it's
fast, not because it's a fundamental requirement. Start considering
what happens if the memory model is changed to allow some sections to have
fast lookup for virt_to_phys and other sections to have slow lookups. On
hotplug, try to empty all the sections. If a section cannot be emptied
because of kernel pages then the section gets marked as "offline-migrated"
or something. Stop the whole machine (yes, I mean stop_machine), copy
those unmovable pages to another location, update the kernel virt->phys
mapping for the section being offlined so the virt addresses point to the
new physical addresses, and resume. Virt->phys lookups are going to be
a lot slower because a full section lookup will be necessary every time,
effectively breaking SPARSE_VMEMMAP, and there will be a performance penalty,
but it should work. This will cover some slab pages where the data is only
accessed via the virtual address -- inode caches, dcache, etc.
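A toy illustration of that fast/slow split (this is not kernel code; the
section size, the fixed-size table and all names are assumptions made up for
the example):

#include <stdint.h>
#include <stdio.h>

#define PAGE_OFFSET	0xffff880000000000ULL	/* x86_64 linear-map base */
#define SECTION_SHIFT	27			/* 128M sections, an assumption */
#define SECTION_MASK	((1ULL << SECTION_SHIFT) - 1)
#define NR_SECTIONS	1024			/* toy fixed size */

/* Per-section state: most sections keep the fast 1:1 mapping. */
static struct {
	int	 migrated;	 /* set when hot-remove relocated the section */
	uint64_t new_phys_base;	 /* where its contents live now */
} sections[NR_SECTIONS];

static uint64_t toy_virt_to_phys(uint64_t va)
{
	uint64_t linear = va - PAGE_OFFSET;
	unsigned idx = (unsigned)(linear >> SECTION_SHIFT);

	if (!sections[idx].migrated)
		return linear;		/* fast path: 1:1 mapping */

	/* slow path: section was emptied and moved during hot-remove */
	return sections[idx].new_phys_base + (linear & SECTION_MASK);
}

int main(void)
{
	uint64_t va = PAGE_OFFSET + (3ULL << SECTION_SHIFT) + 0x1234;

	printf("before: pa = 0x%llx\n", (unsigned long long)toy_virt_to_phys(va));

	/* pretend stop_machine copied section 3 away and updated the map */
	sections[3].migrated = 1;
	sections[3].new_phys_base = 0x2000000000ULL;

	printf("after:  pa = 0x%llx\n", (unsigned long long)toy_virt_to_phys(va));
	return 0;
}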
It will not work where the physical address is used. The obvious example
is page table pages. For page tables, during stop_machine you will have to
walk all processes' page tables looking for references to the page you're
trying to move and update them. It is possible to just plain migrate
page table pages, but when it was last implemented years ago there was a
constant performance penalty for everybody and it was not popular. Taking a
heavy-handed approach just during memory hot-remove might be more palatable.
For the remaining pages, such as those that have been handed to devices
or are pinned for DMA, your options become more limited. You may
still have to restrict allocation of these pages (where possible) to a
region that cannot be hot-removed, but at least these will be relatively
few pages.
The big downside of this proposal is that it's unproven, not designed,
would be extremely intrusive, and I expect it would be a *massive* amount
of development effort that would be difficult to get right. The upside is
that configuring it will be a lot easier, because all you'll need is a
variation of kernelcore= to reserve a percentage of memory for allocations we
*really* cannot migrate because the physical pages are owned by a device that
cannot release them, potentially forever. The other upside is that it does
not hit crazy lowmem/highmem-style problems.
ZONE_MOVABLE will at least allow a node to be removed very quickly, but
because it will paint you into a corner, there should be a plan for what
you're going to replace it with.
--
Mel Gorman
SUSE Labs
On Thu, Nov 29, 2012 at 07:38:26PM +0900, Yasuaki Ishimatsu wrote:
> Hi Tony,
> [snip]
> >Isn't this just moving the work to the user? To pick good values for the
>
> Yes.
>
> >movable areas, they need to know how the memory lines up across
> >node boundaries ... because they need to make sure to allow some
> >non-movable memory allocations on each node so that the kernel can
> >take advantage of node locality.
>
> There is no problem.
> Linux has already two boot options, kernelcore= and movablecore=.
> So if we use them, non-movable memory is divided into each node evenly.
>
The motivation for those options was to reserve a percentage of memory
to be used for hugepage allocation. If hugepages were not being used at
a particular time then they could be used for other purposes. While the
system could in theory face lowmem/highmem style problems, in practice
it did not happen because the memory would be allocated as hugetlbfs
pages and unavailable anyway. The same does not really apply to a general
purpose system that you want to support memory hot-remove on so be wary of
lowmem/highmem style problems caused by relying too heavily on ZONE_MOVABLE.
--
Mel Gorman
SUSE Labs
On 11/29/2012 06:38 PM, Yasuaki Ishimatsu wrote:
> Hi Tony,
> [snip]
>> Since this is going to be used on a dynamic system where nodes might
>> be added an removed - the right values for these arguments might
>> change from one boot to the next. So even if the user gets them right
>> on day 1, a month later when a new node has been added, or a broken
>> node removed the values would be stale.
>
> I don't think so. Even if we hot add/remove node, the memory range of
> each memory device is not changed. So we don't need to change the boot
> option.
Hi Yasuaki,
Addresses assigned to each memory device may change under different
hardware configurations.
In my experience with some hotplug-capable Xeon and Itanium systems,
a typical algorithm adopted by the BIOS to support memory hotplug is:
1) For backward compatibility, the BIOS assigns contiguous addresses to memory
devices present at boot time. In other words, there are no holes in the memory
addresses except the hole just below 4G reserved for MMIO and other
arch-specific usage.
2) To support memory hotplug, the BIOS reserves enough memory address ranges
at the high end.
Let's take a typical 4 sockets system as an example. Say we have four
sockets S0-S3, and each socket supports two memory devices(M0-M1) at maximum.
Each memory device supports 128G memory at maximum. And at boot, all memory
slots are fully populated with 4GB memory. Then the address assignment looks
like:
0-2G: S0.M0
2-4G: MMIO
4-8G: S0.M1
8-12G: S1.M0
12-16G: S1.M1
16-20G: S2.M0
20-24G: S2.M1
24-28G: S3.M0
28-32G: S3.M1
32-34G: S0.M0 (memory recovered from the MMIO hole)
1024-1152G: reserved for S0.M0
1152-1280G: reserved for S0.M1
1280-1408G: reserved for S1.M0
1408-1536G: reserved for S1.M1
1536-1664G: reserved for S2.M0
1664-1792G: reserved for S2.M1
1792-1920G: reserved for S3.M0
1920-2048G: reserved for S3.M1
If we hot-remove S2.M0 and add back a bigger memory device with 8G memory, it will
be assigned a new memory address range 1536-1544G.
Based on the above algorithm, suppose we configure 16-24G (S2.M0 and S2.M1) as
movable memory. Then:
1) memory on S3 will be configured as movable if S2 isn't present at boot time
(the same effect as "movable_node" in the discussion at
https://lkml.org/lkml/2012/11/27/154);
2) S2.M0 will be configured as non-movable and S3.M0 as movable if S1.M0
isn't present at boot;
3) and what happens if we replace S1.M0 with an 8GB memory device?
To summarize, a kernel parameter configuring movable memory for hotplug easily
becomes invalid when the hardware configuration changes, and that may confuse
administrators. I still think the most reliable way is to figure out movable
memory for hotplug by parsing hardware configuration information from the BIOS.
Regards!
Gerry
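To make the staleness concrete, a hypothetical illustration based on the map
above (all values invented): an administrator boots with

  movablecore_map=8G@16G

to cover S2.M0 and S2.M1. If S1.M0 is absent on the next boot, the BIOS packs
the remaining devices downward:

  4-8G:   S0.M1
  8-12G:  S1.M1
  12-16G: S2.M0
  16-20G: S2.M1
  20-24G: S3.M0

and the unchanged option now marks S2.M1 and S3.M0 movable, while S2.M0
silently becomes eligible for kernel allocations, exactly case 2) above.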
Hi Yasuaki,
Forgot to mention that I have no objection to this patchset.
I think it's a good starting point, but we still need to improve the
usability of memory hotplug by passing platform-specific information
from the BIOS. And the mechanism provided by this patchset may be used
to improve usability too.
Regards!
Gerry
On 11/29/2012 03:00 AM, Mel Gorman wrote:
>
> I've not been paying a whole pile of attention to this because it's not an
> area I'm active in but I agree that configuring ZONE_MOVABLE like
> this at boot-time is going to be problematic. As awkward as it is, it
> would probably work out better to only boot with one node by default and
> then hot-add the nodes at runtime using either an online sysfs file or
> an online-reserved file that hot-adds the memory to ZONE_MOVABLE. Still
> clumsy but better than specifying addresses on the command line.
>
> That said, I also find using ZONE_MOVABLE to be a problem in itself that
> will cause problems down the road. Maybe this was discussed already but
> just in case I'll describe the problems I see.
>
Yes, and it does mean that we definitely don't want everything that can
be in ZONE_MOVABLE to be there without administrator control. I suspect
that a lot of users of such platforms actually will not use the feature,
and don't want to take the substantial penalty.
The other bit is that if you really really want high reliability, memory
mirroring is the way to go; it is the only way you will be able to
hotremove memory without having to have a pre-event to migrate the
memory away from the affected node before the memory is offlined.
-hpa
> The other bit is that if you really really want high reliability, memory
> mirroring is the way to go; it is the only way you will be able to
> hotremove memory without having to have a pre-event to migrate the
> memory away from the affected node before the memory is offlined.
Some platforms don't support cross-node mirrors ... but we still want to
be able to remove a node.
-Tony
On 11/29/2012 02:41 PM, Luck, Tony wrote:
>> The other bit is that if you really really want high reliability, memory
>> mirroring is the way to go; it is the only way you will be able to
>> hotremove memory without having to have a pre-event to migrate the
>> memory away from the affected node before the memory is offlined.
>
> Some platforms don't support cross-node mirrors ... but we still want to
> be able to remove a node.
>
Yes, well, those platforms don't support that degree of "really really
high reliability", since the unannounced failure of the node controller
will bring down the system.
-hpa
> If any significant percentage of memory is in ZONE_MOVABLE then the memory
> hotplug people will have to deal with all the lowmem/highmem problems
> that used to be faced by 32-bit x86 with PAE enabled.
While these problems may still exist on large systems - I think it becomes
harder to construct workloads that run into problems. In those bad old days
a significant fraction of lowmem was consumed by the kernel ... so it was
pretty easy to find meta-data intensive workloads that would push it over
a cliff. Here we are talking about systems with say 128GB per node divided
into 64GB moveable and 64GB non-moveable (and I'd regard this as a rather
low-end machine). Unless the workload consists of zillions of tiny processes
all mapping shared memory blocks, the percentage of memory allocated to
the kernel is going to be tiny compared with the old 4GB days.
-Tony
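To put rough numbers on Tony's point (illustrative arithmetic only): in the
32-bit PAE days lowmem was about 896MB of the 4GB address space, so a few
hundred MB of kernel metadata could consume over half of it; against a 64GB
non-movable half of a node, the same few hundred MB is well under 1%.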
Hi Mel,
Thanks for your great comments!
On 2012-11-29 19:00, Mel Gorman wrote:
> On Wed, Nov 28, 2012 at 01:38:47PM -0800, H. Peter Anvin wrote:
>> On 11/28/2012 01:34 PM, Luck, Tony wrote:
>>>>
>>>> 2. use boot option
>>>> This is our proposal. New boot option can specify memory range to use
>>>> as movable memory.
>>>
>>> Isn't this just moving the work to the user? To pick good values for the
>>> movable areas, they need to know how the memory lines up across
>>> node boundaries ... because they need to make sure to allow some
>>> non-movable memory allocations on each node so that the kernel can
>>> take advantage of node locality.
>>>
>>> So the user would have to read at least the SRAT table, and perhaps
>>> more, to figure out what to provide as arguments.
>>>
>>> Since this is going to be used on a dynamic system where nodes might
>>> be added and removed - the right values for these arguments might
>>> change from one boot to the next. So even if the user gets them right
>>> on day 1, a month later when a new node has been added, or a broken
>>> node removed the values would be stale.
>>>
>>
>> I gave this feedback in person at LCE: I consider the kernel
>> configuration option to be useless for anything other than debugging.
>> Trying to promote it as an actual solution, to be used by end users in
>> the field, is ridiculous at best.
>>
>
> I've not been paying a whole pile of attention to this because it's not an
> area I'm active in but I agree that configuring ZONE_MOVABLE like
> this at boot-time is going to be problematic. As awkward as it is, it
> would probably work out better to only boot with one node by default and
> then hot-add the nodes at runtime using either an online sysfs file or
> an online-reserved file that hot-adds the memory to ZONE_MOVABLE. Still
> clumsy but better than specifying addresses on the command line.
>
> That said, I also find using ZONE_MOVABLE to be a problem in itself that
> will cause problems down the road. Maybe this was discussed already but
> just in case I'll describe the problems I see.
>
> If any significant percentage of memory is in ZONE_MOVABLE then the memory
> hotplug people will have to deal with all the lowmem/highmem problems
> that used to be faced by 32-bit x86 with PAE enabled. As a simple example,
> metadata intensive workloads will not be able to use all of memory because
> the kernel allocations will be confined to a subset of memory. A more
> complex example is that page table page allocations are also restricted
> meaning it's possible that a process will not even be able to mmap() a high
> percentage of memory simply because it cannot allocate the page tables to
> store the mappings. ZONE_MOVABLE works up to a *point*, but it's a hack. It
> was a hack when it was introduced but at least then the expectation was
> that ZONE_MOVABLE was going to be used for huge pages and there was at
> least an expectation that it would not be available for normal usage.
>
> Fundamentally the reason one would want to use ZONE_MOVABLE is because
> we cannot migrate a lot of kernel memory -- slab pages, page table pages,
> device-allocated buffers etc. My understanding is that other OS's get around
> this by requiring that subsystems and drivers have callbacks that allow the
> core VM to force certain memory to be released but that may be impractical
> for Linux. I don't know for sure though, this is just what I heard.
As far as I know, one other OS limits immovable pages to the low end, and the
limit increases on demand. But the drawback of this solution is a serious
performance drop (about 10% on average) because it essentially disables NUMA
optimization for kernel/DMA memory allocations.
> For Linux, the hotplug people need to start thinking about how to get
> around this migration problem. The first problem faced is the memory model
> and how it maps virt->phys addresses. We have a 1:1 mapping because it's
> fast but not because it's a fundamental requirement. Start considering
> what happens if the memory model is changed to allow some sections to have
> fast lookup for virt_to_phys and other sections to have slow lookups. On
> hotplug, try and empty all the sections. If the section cannot be emptied
> because of kernel pages then the section gets marked as "offline-migrated"
> or something. Stop the whole machine (yes, I mean stop_machine), copy
> those unmovable pages to another location, update the kernel virt->phys
> mapping for the section being offlined so the virt addresses point to the
> new physical addresses and resume. Virt->phys lookups are going to be
> a lot slower because a full section lookup will be necessary every time
> effectively breaking SPARSE_VMEMMAP and there will be a performance penalty
> but it should work. This will cover some slab pages where the data is only
> accessed via the virtual address -- inode caches, dcache etc.
>
> It will not work where the physical address is used. The obvious example
> is page table pages. For page tables, during stop machine you will have to
> walk all processes page tables looking for references to the page you're
> trying to move and update them. It is possible to just plain migrate
> page table pages but when it was last implemented years ago there was a
> constant performance penalty for everybody and it was not popular. Taking a
> heavy-handed approach just during memory hot-remove might be more palatable.
>
> For the remaining pages such as those that have been handed to devices
> or are pinned for DMA then your options become more limited. You may
> still have to restrict allocating these pages (where possible) to a
> region that cannot be hot-removed but at least this will be relatively
> few pages.
>
> The big downside of this proposal is that it's unproven, not designed,
> would be extremely intrusive and I expect it would be a *massive* amount
> of development effort that will be difficult to get right. The upside is
> configuring it will be a lot easier because all you'll need is a variation
> of kernelcore= to reserve a percentage of memory for allocations we *really*
> cannot migrate because the physical pages are owned by a device that cannot
> release them, potentially forever. The other upside is that it does not
> hit crazy lowmem/highmem style problems.
>
> ZONE_MOVABLE at least will allow a node to be removed very quickly, but
> because it will paint you into a corner, there should be a plan for what
> you're going to replace it with.
I have some thoughts here. The basic idea is that it needs cooperation
between OS, BIOS and hardware to implement a flexible memory hotplug
solution.
As you have mentioned, ZONE_MOVABLE is a quick but slightly dirty
solution. It's quick because we can rely on existing mechanisms
to configure the movable zone, and no changes to the memory model are needed.
It's a little dirty because:
1) We need to handle the case of running out of immovable pages. The hotplug
implementation shouldn't cause extra service interruptions when normal zones
are under pressure. Otherwise it would be ironic if service
interruptions were caused by a feature meant to improve service
availability.
2) We still can't handle normal pages used by the kernel, devices, etc.
3) It may cause a serious performance drop if we configure all memory
on a NUMA node as ZONE_MOVABLE.
For the first issue, I think we could automatically convert pages
from movable zones into normal zones. Congyan from Fujitsu has provided
a patchset to manually convert pages from movable zones into normal zones;
I think we could extend that mechanism to convert automatically when
normal zones are under pressure, by hooking into the slow page allocation
path.
We rely on hardware features to solve the second and third issues.
Some new platforms provide a new RAS feature called "hardware memory
migration", which transparently migrates memory from one memory device
to another. With hardware memory migration, we could configure one
memory device on a NUMA node to host the normal zone, and the other memory
devices to host the movable zone. With this configuration there is no
performance drop, because each NUMA node still has a local normal zone.
When trying to remove a memory device hosting the normal zone, we just
need to find a spare memory device and use hardware memory migration
to transparently migrate the memory contents to the spare one. The drawback
is a strong dependency on hardware features, so it's not a common
solution for all architectures.
Regards!
Gerry
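As a sketch of the per-node split Gerry describes, mapped onto the proposed
boot option (all addresses invented; assume two 64G devices per node and that
the option may be given once per node, as the cover letter allows):

  movablecore_map=64G@64G movablecore_map=64G@192G

Node 0 then keeps 0-64G (M0) as ZONE_NORMAL and 64-128G (M1) as ZONE_MOVABLE;
node 1 keeps 128-192G normal and 192-256G movable. Each node retains a local
normal zone, and hardware memory migration would still be needed to evacuate
M0 before removing a whole node.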
Hi Jiang,
2012/11/30 11:56, Jiang Liu wrote:
> Hi Mel,
> Thanks for your great comments!
>
> [...]
I agree with you. If the BIOS and hardware support memory hotplug, the OS
should use them. But if the OS cannot use them, we need to solve the problem
in the OS. I think our proposal using ZONE_MOVABLE is a first step toward
supporting memory hotplug.
Thanks,
Yasuaki Ishimatsu
Disk I/O is still a big consumer of lowmem.
"Luck, Tony" <[email protected]> wrote:
>> If any significant percentage of memory is in ZONE_MOVABLE then the memory
>> hotplug people will have to deal with all the lowmem/highmem problems
>> that used to be faced by 32-bit x86 with PAE enabled.
>
>While these problems may still exist on large systems - I think it becomes
>harder to construct workloads that run into problems. In those bad old days
>a significant fraction of lowmem was consumed by the kernel ... so it was
>pretty easy to find meta-data intensive workloads that would push it over
>a cliff. Here we are talking about systems with say 128GB per node divided
>into 64GB moveable and 64GB non-moveable (and I'd regard this as a rather
>low-end machine). Unless the workload consists of zillions of tiny processes
>all mapping shared memory blocks, the percentage of memory allocated to
>the kernel is going to be tiny compared with the old 4GB days.
>
>-Tony
--
Sent from my mobile phone. Please excuse brevity and lack of formatting.
On 11/28/2012 12:08 PM, Jiang Liu wrote:
> On 2012-11-28 11:24, Bob Liu wrote:
>> On Tue, Nov 27, 2012 at 8:49 PM, Tang Chen <[email protected]> wrote:
>>> On 11/27/2012 08:09 PM, Bob Liu wrote:
>>>>
>>>> On Tue, Nov 27, 2012 at 4:29 PM, Tang Chen<[email protected]>
>>>> wrote:
>>>>>
>>>>> Hi Liu,
>>>>>
>>>>>
>>>>> This feature is used in memory hotplug.
>>>>>
>>>>> In order to implement whole-node hotplug, we need to make sure the
>>>>> node contains no kernel memory, because memory used by the kernel
>>>>> cannot be migrated. (Since kernel memory is directly mapped,
>>>>> VA = PA + __PAGE_OFFSET, the physical address cannot be changed.)
>>>>>
>>>>> The user could specify all the memory on a node to be movable, so that
>>>>> the node could be hot-removed.
>>>>>
>>>>
>>>> Thank you for your explanation. It's reasonable.
>>>>
>>>> But I think it overlaps a bit with CMA. I'm not sure, but maybe we
>>>> can combine it with CMA, which is already in mainline?
>>>>
>>> Hi Liu,
>>>
>>> Thanks for your advice. :)
>>>
>>> CMA is the Contiguous Memory Allocator, right? What I'm trying to do is
>>> control where the start of ZONE_MOVABLE is on each node. Could
>>> CMA do this job?
>>
>> CMA will not control the start of ZONE_MOVABLE on each node, but it
>> can declare a memory area that is always movable, and no non-movable
>> allocation requests will be served from that area.
>>
>> Currently CMA uses the boot parameter "cma=" to declare a memory size
>> that is always movable.
>> I think it might fulfill your requirement if the boot parameter were
>> extended with a start address.
>>
>> more info at http://lwn.net/Articles/468044/
>>>
>>> And also, after a short investigation, CMA seems to be based on
>>> memblock. But we need to prevent memblock from allocating memory in
>>> ZONE_MOVABLE. As a result, we need to know the ranges before memblock
>>> can be used. I'm afraid we still need an approach to get the ranges,
>>> such as a boot option, or static ACPI tables such as SRAT/MPST.
>>>
>>
>> Yes, it's based on memblock and uses a boot option.
>> In setup_arch32(),
>> dma_contiguous_reserve(0) will declare a CMA area using
>> memblock_reserve().
>>
>>> I don't know much about CMA for now. So if you have any better ideas,
>>> please share them with us, thanks. :)
>>
>> My idea is to reuse CMA as in the patch below (not even compiled) and
>> boot with "cma=size@start_address", e.g. "cma=4G@0x100000000".
>> I don't know whether it works or whether it suits your requirement;
>> if not, forgive me for the noise.
>>
>> diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
>> index 612afcc..564962a 100644
>> --- a/drivers/base/dma-contiguous.c
>> +++ b/drivers/base/dma-contiguous.c
>> @@ -59,11 +59,18 @@ struct cma *dma_contiguous_default_area;
>> */
>> static const unsigned long size_bytes = CMA_SIZE_MBYTES * SZ_1M;
>> static long size_cmdline = -1;
>> +static long cma_start_cmdline = -1;
>>
>> static int __init early_cma(char *p)
>> {
>> + char *oldp;
>> pr_debug("%s(%s)\n", __func__, p);
>> + oldp = p;
>> size_cmdline = memparse(p, &p);
>> +
>> + if (*p == '@')
>> + cma_start_cmdline = memparse(p+1, &p);
>> + printk("cma start: 0x%lx, size: 0x%lx\n", cma_start_cmdline, size_cmdline);
>> return 0;
>> }
>> early_param("cma", early_cma);
>> @@ -127,8 +134,10 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
>> if (selected_size) {
>> pr_debug("%s: reserving %ld MiB for global area\n", __func__,
>> selected_size / SZ_1M);
>> -
>> - dma_declare_contiguous(NULL, selected_size, 0, limit);
>> + if (cma_start_cmdline != -1)
>> + dma_declare_contiguous(NULL, selected_size,
>> cma_start_cmdline, limit);
>> + else
>> + dma_declare_contiguous(NULL, selected_size, 0, limit);
>> }
>> };
> Seems a good idea to reserve memory by reusing the CMA logic, though it
> needs more investigation. One of CMA's goals is to ensure pages in CMA are
> really movable, and at first glance this patchset tries to achieve the same
> goal.
>
The approach is already implemented: https://lkml.org/lkml/2012/7/4/145
(it adds a new MIGRATE_HOTREMOVE type rather than reusing MIGRATE_CMA).
MIGRATE_HOTREMOVE and MIGRATE_CMA both have this problem:
https://lkml.org/lkml/2012/7/5/83
R.I.P. for this idea.
zone->managed_pages (which you proposed, but which manages neither
MIGRATE_HOTREMOVE nor MIGRATE_CMA) plus a proxy zone (handling all
MIGRATE_HOTREMOVE, MIGRATE_CMA and ZONE_MOVABLE pages of the node)
may be a good idea.
Thanks,
Lai
On 11/30/2012 06:58 AM, Luck, Tony wrote:
>> If any significant percentage of memory is in ZONE_MOVABLE then the memory
>> hotplug people will have to deal with all the lowmem/highmem problems
>> that used to be faced by 32-bit x86 with PAE enabled.
>
> While these problems may still exist on large systems - I think it becomes
> harder to construct workloads that run into problems. In those bad old days
> a significant fraction of lowmem was consumed by the kernel ... so it was
> pretty easy to find meta-data intensive workloads that would push it over
> a cliff. Here we are talking about systems with say 128GB per node divided
> into 64GB moveable and 64GB non-moveable (and I'd regard this as a rather
> low-end machine). Unless the workload consists of zillions of tiny processes
> all mapping shared memory blocks, the percentage of memory allocated to
> the kernel is going to be tiny compared with the old 4GB days.
>
Which is a perfectly common workload for containers, where you can have
hundreds of machines (per node) sold to third parties, a lot of them
consuming every single bit of metadata they can.
On Fri, Nov 30, 2012 at 02:58:40AM +0000, Luck, Tony wrote:
> > If any significant percentage of memory is in ZONE_MOVABLE then the memory
> > hotplug people will have to deal with all the lowmem/highmem problems
> > that used to be faced by 32-bit x86 with PAE enabled.
>
> While these problems may still exist on large systems - I think it becomes
> harder to construct workloads that run into problems. In those bad old days
> a significant fraction of lowmem was consumed by the kernel ... so it was
> pretty easy to find meta-data intensive workloads that would push it over
> a cliff. Here we are talking about systems with say 128GB per node divided
> into 64GB moveable and 64GB non-moveable (and I'd regard this as a rather
> low-end machine). Unless the workload consists of zillions of tiny processes
> all mapping shared memory blocks, the percentage of memory allocated to
> the kernel is going to be tiny compared with the old 4GB days.
>
Sure, if that's how the end-user decides to configure it. My concern is
what they'll do is configure node-0 to be ZONE_NORMAL and all other nodes
to be ZONE_MOVABLE -- 3 to 1 ratio "highmem" to "lowmem" effectively on
a 4-node machine or 7 to 1 on an 8-node. It'll be harder than it was in
the old days to trigger the problems but it'll still be possible and it
will generate bug reports down the road. Some will be obvious at least --
OOM killer triggered for GFP_KERNEL with plenty of free memory but all in
ZONE_MOVABLE. Others will be less obvious -- major stalls during IO tests
while ramping up with large amounts of reclaim activity visible even though
only 20-40% of memory is in use.
I'm not even getting into the impact this has on NUMA performance.
I'm not saying that ZONE_MOVABLE will not work. It will and it'll work
in the short-term but it's far from being a great long-term solution and
it is going to generate bug reports that will have to be supported by
distributions. Even if the interface to how it is configured gets ironed
out there still should be a replacement plan in place. FWIW, I dislike the
command-line configuration option. If it was me, I would have gone with
starting a machine with memory mostly off-lined and used sysfs files or
different sysfs strings written to the "online" file to determine if a
section was ZONE_MOVABLE or the next best alternative.
--
Mel Gorman
SUSE Labs
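For comparison, a sketch of the runtime interface Mel describes (hedged: the
"online_movable" and "online_kernel" strings follow a contemporaneous
proposal on the list, and memory32 is an arbitrary section):

  # online a memory section as ZONE_MOVABLE
  echo online_movable > /sys/devices/system/memory/memory32/state
  # or online it as kernel-usable memory
  echo online_kernel > /sys/devices/system/memory/memory32/state

Unlike the command line, the movable/normal decision is then made per section
at runtime instead of being fixed for the whole boot.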
On 11/30/2012 11:15 AM, Yasuaki Ishimatsu wrote:
> Hi Jiang,
>
>> [...]
>
> I agree with you. If the BIOS and hardware support memory hotplug, the OS
> should use them. But if the OS cannot use them, we need to solve the problem
> in the OS. I think our proposal using ZONE_MOVABLE is a first step toward
> supporting memory hotplug.
Hi Yasuaki,
It's true, we should start with the first step and then improve it.
Regards!
Gerry
On Thu, 2012-11-29 at 10:25 +0800, Jiang Liu wrote:
> On 2012-11-29 9:42, Jaegeuk Hanse wrote:
> > On Wed, Nov 28, 2012 at 04:47:42PM +0800, Jiang Liu wrote:
> >> Hi all,
> >> Seems it's a great chance to discuss about the memory hotplug feature
> >> within this thread. So I will try to give some high level thoughts about memory
> >> hotplug feature on x86/IA64. Any comments are welcomed!
> >> First of all, I think usability really matters. Ideally, the memory hotplug
> >> feature should just work out of the box, and we shouldn't expect administrators
> >> to add several extra platform-dependent parameters to enable memory hotplug.
> >> But how do we enable memory (or CPU/node) hotplug out of the box? I think the
> >> key point is to cooperate with the BIOS/ACPI/firmware/device management teams.
> >> I still position memory hotplug as an advanced feature for high-end
> >> servers, and those systems may/should provide some management interfaces to
> >> configure CPU/memory/node hotplug features. The configuration UI may be provided
> >> by the BIOS, BMC or a centralized system management suite. Once the administrator
> >> enables the hotplug feature through that management UI, the OS should support
> >> system device hotplug out of the box. For example, the HP SuperDome2 management
> >> suite provides an interface to configure a node as a floating node
> >> (hot-removable), and OpenSolaris supports CPU/memory hotplug out of the box
> >> without any extra configuration. So we should shape the interfaces between
> >> firmware and OS to better support system device hotplug.
Well described. I agree with you. I am also OK to have the boot option
for the time being, but we should be able to get the info from ACPI for
better TCE.
> >> On the other hand, I think there are no commercially available x86/IA64
> >> platforms with system device hotplug capabilities in the field yet, or at most
> >> a limited quantity. So backward compatibility is not a big issue for us now.
HP SuperDome is IA64-based and supports node hotplug when running HP-UX.
It implements a vendor-unique ACPI interface to describe movable
memory ranges.
> >> So I think it's doable to rely on firmware to provide better support for system
> >> device hotplug.
> >> Then what should be enhanced to better support system device hotplug?
> >>
> >> 1) ACPI specification should be enhanced to provide a static table to describe
> >> components with hotplug features, so the OS could reserve special resources
> >> for hotplug at early boot stages, for example reserving enough CPU ids for
> >> CPU hot-add. Currently we guess the maximum number of CPUs supported by the
> >> platform by counting CPU entries in the APIC table, which is not reliable.
Right. HP SuperDome implements vendor-unique ACPI interface for this as
well. For Linux, it is nice to have a standard interface defined.
> >> 2) BIOS should implement SRAT, MPST and PMTT tables to better support memory
> >> hotplug. SRAT associates memory ranges with proximity domains with an extra
> >> "hotpluggable" flag. PMTT provides memory device topology information, such
> >> as "socket->memory controller->DIMM". MPST is used for memory power management
> >> and provides a way to associate memory ranges with memory devices in PMTT.
> >> With all information from SRAT, MPST and PMTT, OS could figure out hotplug
> >> memory ranges automatically, so no extra kernel parameters needed.
I agree that using SRAT is a good compromise. The hotpluggable flag is
supposed to indicate the platform's capability, but it could be used for this
purpose until we have a better interface defined.
> >> 3) Enhance ACPICA to provide a method to scan static ACPI tables before
> >> the memory subsystem has been initialized, because the OS needs to access
> >> SRAT, MPST and PMTT when initializing the memory subsystem.
I do not think this is an ACPICA issue. HP-UX also uses ACPICA, and can
access ACPI tables and walk the ACPI namespace during early boot. This
is achieved by having the acpi_os layer use a special early boot-time
memory allocator. Therefore, boot-time and hot-add configuration code
are very consistent in HP-UX.
> >> 4) The last and most important issue is how to minimize the performance
> >> drop caused by memory hotplug. As proposed by this patchset, once we
> >> configure all memory of a NUMA node as movable, it essentially disables
> >> NUMA optimization of kernel memory allocation from that node. According
> >> to experience, that will cause a huge performance drop. We have observed
> >> a 10-30% performance drop with memory hotplug enabled, and on another
> >> OS the average performance drop caused by memory hotplug is about 10%.
> >> If we can't resolve the performance drop, memory hotplug is just a demo
> >> feature :( With help from hardware, we do have some chance to reduce the
> >> performance penalty caused by memory hotplug.
> >> As we know, Linux can migrate movable pages, but can't migrate
> >> non-movable pages used by the kernel, DMA, etc. And the hardest part is
> >> how to deal with those unmovable pages when hot-removing a memory device.
> >> Now hardware has given us a hand with a technology named memory migration,
> >> which can transparently migrate memory between memory devices. There are
> >> no OS-visible changes except the NUMA topology before and after hardware
> >> memory migration.
> >> And if there are multiple memory devices within a NUMA node,
> >> we could configure some memory devices to host unmovable memory and the
> >> others to host movable memory. With this configuration, there won't be a
> >> big performance drop because we have preserved all NUMA optimizations.
> >> We could also achieve memory hot-remove by:
> >> 1) Use existing page migration mechanism to reclaim movable pages.
> >> 2) For memory devices hosting unmovable pages, we need:
> >> 2.1) find a movable memory device on other nodes with enough capacity
> >> and reclaim it.
> >> 2.2) use hardware migration technology to migrate unmovable memory to
> >> the just reclaimed memory device on other nodes.
> >>
> >> I hope we can expect users to adopt memory hotplug technology
> >> once all of this is implemented.
> >>
> >> Back to this patch, we could rely on the mechanism provided
> >> by it to automatically mark memory ranges as movable with information
> >> from the ACPI SRAT/MPST/PMTT tables. So we don't need administrators to
> >> manually configure kernel parameters to enable memory hotplug.
Right.
Thanks,
-Toshi
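To ground the firmware-driven approach Toshi and Gerry describe, a minimal
sketch (not from this patchset; add_movable_range() is a hypothetical
stand-in for the map-insertion helper, while the structure and flag names are
the standard ACPICA ones):

#include <acpi/actbl1.h>	/* struct acpi_srat_mem_affinity, ACPI_SRAT_MEM_* */

/* Hypothetical helper: record [start, end) as a candidate movable range. */
extern void __init add_movable_range(u64 start, u64 end);

/*
 * Sketch of an SRAT memory-affinity walk: mark every enabled,
 * hot-pluggable range as movable so no kernel memory lands there.
 */
static int __init srat_mark_hotpluggable(struct acpi_srat_mem_affinity *ma)
{
	if (!(ma->flags & ACPI_SRAT_MEM_ENABLED))
		return 0;		/* entry not in use */

	if (ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE)
		add_movable_range(ma->base_address,
				  ma->base_address + ma->length);
	return 0;
}

With something like this wired into early SRAT parsing, the kernel could
populate the same movable-range table the boot option fills in, with no
command-line input from the administrator.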