2012-08-02 06:00:48

by Lai Jiangshan

Subject: [RFC PATCH 00/23 V2] memory,numa: introduce MOVABLE-dedicated node and online_movable for hotplug

A) Introduction:

This patchset adds a MOVABLE-dedicated node and an online_movable option for memory management.

It is useful for anti-fragmentation (hugepages, high-order allocations, ...) and
for memory hot-removal (virtualization, power saving, and moving memory between
systems to make better use of it).

B) Changes from V1:

The original V1 patchset of MOVABLE-dedicated node is here:
http://comments.gmane.org/gmane.linux.kernel.mm/78122

The new V2 adds N_MEMORY and the notion of a "MOVABLE-dedicated node",
and fixes some related problems.

The original V1 patchset of "add online_movable" is here:
https://lkml.org/lkml/2012/7/4/145

The new V2 discards the MIGRATE_HOTREMOVE approach and uses a more
straightforward implementation (only one patch).

C) User Interface:

When users (e.g., administrators of big systems) need to configure some
node/memory as MOVABLE:
1. Use the kernelcore_max_addr=XX boot option at boot time.
2. Use the online_movable hotplug action at runtime.
We may introduce a more convenient interface later, such as a
movable_node=NODE_LIST boot option.
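
For example (the address and the memory block number are illustrative):

	boot:    kernelcore_max_addr=4G
	runtime: % echo online_movable > /sys/devices/system/memory/memory32/state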

D) Patches

Patch 1       introduces N_MEMORY.
Patches 2-13  use N_MEMORY instead of N_HIGH_MEMORY.
              The patches are separated by subsystem;
              *these conversions were (and must be) checked carefully*.
              Patch 13 also changes the node_states initialization.
Patches 14,15,17 fix problems in the current code (all related to hotplug).
Patch 16      adds a config option to allow a MOVABLE-dedicated node.
Patches 18-22 add kernelcore_max_addr.
Patch 23      adds online_movable.
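
A minimal sketch of the typical conversion done in patches 2-13 (the
pr_info() loop itself is illustrative, not taken from any patch):

	int nid;

	/*
	 * Iterate every node that has any memory at all. On a system
	 * with a MOVABLE-dedicated node, N_HIGH_MEMORY would miss the
	 * nodes that contain only ZONE_MOVABLE memory.
	 */
	for_each_node_state(nid, N_MEMORY)	/* was: N_HIGH_MEMORY */
		pr_info("node %d has memory\n", nid);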


Lai Jiangshan (19):
node_states: introduce N_MEMORY
cpuset: use N_MEMORY instead N_HIGH_MEMORY
procfs: use N_MEMORY instead N_HIGH_MEMORY
oom: use N_MEMORY instead N_HIGH_MEMORY
mm,migrate: use N_MEMORY instead N_HIGH_MEMORY
mempolicy: use N_MEMORY instead N_HIGH_MEMORY
memcontrol: use N_MEMORY instead N_HIGH_MEMORY
hugetlb: use N_MEMORY instead N_HIGH_MEMORY
vmstat: use N_MEMORY instead N_HIGH_MEMORY
kthread: use N_MEMORY instead N_HIGH_MEMORY
init: use N_MEMORY instead N_HIGH_MEMORY
vmscan: use N_MEMORY instead N_HIGH_MEMORY
page_alloc: use N_MEMORY instead N_HIGH_MEMORY and change the node_states initialization
slub, hotplug: ignore unrelated node's hot-adding and hot-removing
memory_hotplug: fix missing nodemask management
numa: add CONFIG_MOVABLE_NODE for movable-dedicated node
page_alloc.c: don't subtract unrelated memmap from zone's present pages
page_alloc: add kernelcore_max_addr
mm, memory-hotplug: add online_movable

Yasuaki Ishimatsu (4):
x86: get pg_data_t's memory from other node
x86: use memblock_set_current_limit() to set memblock.current_limit
memblock: limit memory address from memblock
memblock: compare current_limit with end variable at
memblock_find_in_range_node()

Documentation/cgroups/cpusets.txt | 2 +-
Documentation/kernel-parameters.txt | 9 +++
Documentation/memory-hotplug.txt | 16 ++++-
arch/x86/kernel/setup.c | 4 +-
arch/x86/mm/init_64.c | 4 +-
arch/x86/mm/numa.c | 8 ++-
drivers/base/memory.c | 19 +++--
drivers/base/node.c | 8 ++-
fs/proc/kcore.c | 2 +-
fs/proc/task_mmu.c | 4 +-
include/linux/cpuset.h | 2 +-
include/linux/memblock.h | 1 +
include/linux/memory_hotplug.h | 13 +++-
include/linux/nodemask.h | 5 ++
init/main.c | 2 +-
kernel/cpuset.c | 32 ++++----
kernel/kthread.c | 2 +-
mm/Kconfig | 8 ++
mm/hugetlb.c | 24 +++---
mm/memblock.c | 10 ++-
mm/memcontrol.c | 18 +++---
mm/memory_hotplug.c | 137 ++++++++++++++++++++++++++++++++---
mm/mempolicy.c | 12 ++--
mm/migrate.c | 2 +-
mm/oom_kill.c | 2 +-
mm/page_alloc.c | 96 +++++++++++++++----------
mm/page_cgroup.c | 2 +-
mm/slub.c | 6 ++
mm/vmscan.c | 4 +-
mm/vmstat.c | 4 +-
30 files changed, 335 insertions(+), 123 deletions(-)


2012-08-02 06:00:51

by Lai Jiangshan

Subject: [RFC PATCH 02/23 V2] cpuset: use N_MEMORY instead N_HIGH_MEMORY

N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory; we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan <[email protected]>
---
Documentation/cgroups/cpusets.txt | 2 +-
include/linux/cpuset.h | 2 +-
kernel/cpuset.c | 32 ++++++++++++++++----------------
3 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt
index cefd3d8..12e01d4 100644
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
@@ -218,7 +218,7 @@ and name space for cpusets, with a minimum of additional kernel code.
The cpus and mems files in the root (top_cpuset) cpuset are
read-only. The cpus file automatically tracks the value of
cpu_online_mask using a CPU hotplug notifier, and the mems file
-automatically tracks the value of node_states[N_HIGH_MEMORY]--i.e.,
+automatically tracks the value of node_states[N_MEMORY]--i.e.,
nodes with memory--using the cpuset_track_online_nodes() hook.


diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 838320f..8c8a60d 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -144,7 +144,7 @@ static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
return node_possible_map;
}

-#define cpuset_current_mems_allowed (node_states[N_HIGH_MEMORY])
+#define cpuset_current_mems_allowed (node_states[N_MEMORY])
static inline void cpuset_init_current_mems_allowed(void) {}

static inline int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index f33c715..2b133db 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -302,10 +302,10 @@ static void guarantee_online_cpus(const struct cpuset *cs,
* are online, with memory. If none are online with memory, walk
* up the cpuset hierarchy until we find one that does have some
* online mems. If we get all the way to the top and still haven't
- * found any online mems, return node_states[N_HIGH_MEMORY].
+ * found any online mems, return node_states[N_MEMORY].
*
* One way or another, we guarantee to return some non-empty subset
- * of node_states[N_HIGH_MEMORY].
+ * of node_states[N_MEMORY].
*
* Call with callback_mutex held.
*/
@@ -313,14 +313,14 @@ static void guarantee_online_cpus(const struct cpuset *cs,
static void guarantee_online_mems(const struct cpuset *cs, nodemask_t *pmask)
{
while (cs && !nodes_intersects(cs->mems_allowed,
- node_states[N_HIGH_MEMORY]))
+ node_states[N_MEMORY]))
cs = cs->parent;
if (cs)
nodes_and(*pmask, cs->mems_allowed,
- node_states[N_HIGH_MEMORY]);
+ node_states[N_MEMORY]);
else
- *pmask = node_states[N_HIGH_MEMORY];
- BUG_ON(!nodes_intersects(*pmask, node_states[N_HIGH_MEMORY]));
+ *pmask = node_states[N_MEMORY];
+ BUG_ON(!nodes_intersects(*pmask, node_states[N_MEMORY]));
}

/*
@@ -1100,7 +1100,7 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
return -ENOMEM;

/*
- * top_cpuset.mems_allowed tracks node_stats[N_HIGH_MEMORY];
+ * top_cpuset.mems_allowed tracks node_states[N_MEMORY];
* it's read-only
*/
if (cs == &top_cpuset) {
@@ -1122,7 +1122,7 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
goto done;

if (!nodes_subset(trialcs->mems_allowed,
- node_states[N_HIGH_MEMORY])) {
+ node_states[N_MEMORY])) {
retval = -EINVAL;
goto done;
}
@@ -2034,7 +2034,7 @@ static struct cpuset *cpuset_next(struct list_head *queue)
* before dropping down to the next. It always processes a node before
* any of its children.
*
- * In the case of memory hot-unplug, it will remove nodes from N_HIGH_MEMORY
+ * In the case of memory hot-unplug, it will remove nodes from N_MEMORY
* if all present pages from a node are offlined.
*/
static void
@@ -2073,7 +2073,7 @@ scan_cpusets_upon_hotplug(struct cpuset *root, enum hotplug_event event)

/* Continue past cpusets with all mems online */
if (nodes_subset(cp->mems_allowed,
- node_states[N_HIGH_MEMORY]))
+ node_states[N_MEMORY]))
continue;

oldmems = cp->mems_allowed;
@@ -2081,7 +2081,7 @@ scan_cpusets_upon_hotplug(struct cpuset *root, enum hotplug_event event)
/* Remove offline mems from this cpuset. */
mutex_lock(&callback_mutex);
nodes_and(cp->mems_allowed, cp->mems_allowed,
- node_states[N_HIGH_MEMORY]);
+ node_states[N_MEMORY]);
mutex_unlock(&callback_mutex);

/* Move tasks from the empty cpuset to a parent */
@@ -2134,8 +2134,8 @@ void cpuset_update_active_cpus(bool cpu_online)

#ifdef CONFIG_MEMORY_HOTPLUG
/*
- * Keep top_cpuset.mems_allowed tracking node_states[N_HIGH_MEMORY].
- * Call this routine anytime after node_states[N_HIGH_MEMORY] changes.
+ * Keep top_cpuset.mems_allowed tracking node_states[N_MEMORY].
+ * Call this routine anytime after node_states[N_MEMORY] changes.
* See cpuset_update_active_cpus() for CPU hotplug handling.
*/
static int cpuset_track_online_nodes(struct notifier_block *self,
@@ -2148,7 +2148,7 @@ static int cpuset_track_online_nodes(struct notifier_block *self,
case MEM_ONLINE:
oldmems = top_cpuset.mems_allowed;
mutex_lock(&callback_mutex);
- top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
+ top_cpuset.mems_allowed = node_states[N_MEMORY];
mutex_unlock(&callback_mutex);
update_tasks_nodemask(&top_cpuset, &oldmems, NULL);
break;
@@ -2177,7 +2177,7 @@ static int cpuset_track_online_nodes(struct notifier_block *self,
void __init cpuset_init_smp(void)
{
cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
- top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
+ top_cpuset.mems_allowed = node_states[N_MEMORY];

hotplug_memory_notifier(cpuset_track_online_nodes, 10);

@@ -2245,7 +2245,7 @@ void cpuset_init_current_mems_allowed(void)
*
* Description: Returns the nodemask_t mems_allowed of the cpuset
* attached to the specified @tsk. Guaranteed to return some non-empty
- * subset of node_states[N_HIGH_MEMORY], even if this means going outside the
+ * subset of node_states[N_MEMORY], even if this means going outside the
* tasks cpuset.
**/

--
1.7.1

2012-08-02 06:00:55

by Lai Jiangshan

Subject: [RFC PATCH 03/23 V2] procfs: use N_MEMORY instead N_HIGH_MEMORY

N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory; we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan <[email protected]>
---
fs/proc/kcore.c | 2 +-
fs/proc/task_mmu.c | 4 ++--
2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 86c67ee..e96d4f1 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -249,7 +249,7 @@ static int kcore_update_ram(void)
/* Not inialized....update now */
/* find out "max pfn" */
end_pfn = 0;
- for_each_node_state(nid, N_HIGH_MEMORY) {
+ for_each_node_state(nid, N_MEMORY) {
unsigned long node_end;
node_end = NODE_DATA(nid)->node_start_pfn +
NODE_DATA(nid)->node_spanned_pages;
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 4540b8f..ed3d381 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1080,7 +1080,7 @@ static struct page *can_gather_numa_stats(pte_t pte, struct vm_area_struct *vma,
return NULL;

nid = page_to_nid(page);
- if (!node_isset(nid, node_states[N_HIGH_MEMORY]))
+ if (!node_isset(nid, node_states[N_MEMORY]))
return NULL;

return page;
@@ -1232,7 +1232,7 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
if (md->writeback)
seq_printf(m, " writeback=%lu", md->writeback);

- for_each_node_state(n, N_HIGH_MEMORY)
+ for_each_node_state(n, N_MEMORY)
if (md->node[n])
seq_printf(m, " N%d=%lu", n, md->node[n]);
out:
--
1.7.1

2012-08-02 06:01:01

by Lai Jiangshan

Subject: [RFC PATCH 06/23 V2] mempolicy: use N_MEMORY instead N_HIGH_MEMORY

N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory; we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan <[email protected]>
---
mm/mempolicy.c | 12 ++++++------
1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 1d771e4..ad0381d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -212,9 +212,9 @@ static int mpol_set_nodemask(struct mempolicy *pol,
/* if mode is MPOL_DEFAULT, pol is NULL. This is right. */
if (pol == NULL)
return 0;
- /* Check N_HIGH_MEMORY */
+ /* Check N_MEMORY */
nodes_and(nsc->mask1,
- cpuset_current_mems_allowed, node_states[N_HIGH_MEMORY]);
+ cpuset_current_mems_allowed, node_states[N_MEMORY]);

VM_BUG_ON(!nodes);
if (pol->mode == MPOL_PREFERRED && nodes_empty(*nodes))
@@ -1363,7 +1363,7 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode,
goto out_put;
}

- if (!nodes_subset(*new, node_states[N_HIGH_MEMORY])) {
+ if (!nodes_subset(*new, node_states[N_MEMORY])) {
err = -EINVAL;
goto out_put;
}
@@ -2314,7 +2314,7 @@ void __init numa_policy_init(void)
* fall back to the largest node if they're all smaller.
*/
nodes_clear(interleave_nodes);
- for_each_node_state(nid, N_HIGH_MEMORY) {
+ for_each_node_state(nid, N_MEMORY) {
unsigned long total_pages = node_present_pages(nid);

/* Preserve the largest node */
@@ -2395,7 +2395,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context)
*nodelist++ = '\0';
if (nodelist_parse(nodelist, nodes))
goto out;
- if (!nodes_subset(nodes, node_states[N_HIGH_MEMORY]))
+ if (!nodes_subset(nodes, node_states[N_MEMORY]))
goto out;
} else
nodes_clear(nodes);
@@ -2429,7 +2429,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context)
* Default to online nodes with memory if no nodelist
*/
if (!nodelist)
- nodes = node_states[N_HIGH_MEMORY];
+ nodes = node_states[N_MEMORY];
break;
case MPOL_LOCAL:
/*
--
1.7.1

2012-08-02 06:01:06

by Lai Jiangshan

Subject: [RFC PATCH 09/23 V2] vmstat: use N_MEMORY instead N_HIGH_MEMORY

N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory; we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan <[email protected]>
---
mm/vmstat.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1bbbbd9..aa3da12 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -917,7 +917,7 @@ static int pagetypeinfo_show(struct seq_file *m, void *arg)
pg_data_t *pgdat = (pg_data_t *)arg;

/* check memoryless node */
- if (!node_state(pgdat->node_id, N_HIGH_MEMORY))
+ if (!node_state(pgdat->node_id, N_MEMORY))
return 0;

seq_printf(m, "Page block order: %d\n", pageblock_order);
@@ -1279,7 +1279,7 @@ static int unusable_show(struct seq_file *m, void *arg)
pg_data_t *pgdat = (pg_data_t *)arg;

/* check memoryless node */
- if (!node_state(pgdat->node_id, N_HIGH_MEMORY))
+ if (!node_state(pgdat->node_id, N_MEMORY))
return 0;

walk_zones_in_node(m, pgdat, unusable_show_print);
--
1.7.1

2012-08-02 06:01:13

by Lai Jiangshan

Subject: [RFC PATCH 11/23 V2] init: use N_MEMORY instead N_HIGH_MEMORY

N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory; we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan <[email protected]>
---
init/main.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/init/main.c b/init/main.c
index 4121d1f..c9317aa 100644
--- a/init/main.c
+++ b/init/main.c
@@ -846,7 +846,7 @@ static int __init kernel_init(void * unused)
/*
* init can allocate pages on any node
*/
- set_mems_allowed(node_states[N_HIGH_MEMORY]);
+ set_mems_allowed(node_states[N_MEMORY]);
/*
* init can run on any cpu.
*/
--
1.7.1

2012-08-02 06:01:26

by Lai Jiangshan

Subject: [RFC PATCH 17/23 V2] page_alloc.c: don't subtract unrelated memmap from zone's present pages

A)======
Currently, the memory page map (the struct page array) is not defined in
struct zone. It is defined in several ways:

FLATMEM: a global memmap, which can be allocated from any zone <= ZONE_NORMAL.
CONFIG_DISCONTIGMEM: a node-specific memmap, which can be allocated from any
	zone <= ZONE_NORMAL within that node.
CONFIG_SPARSEMEM: a memory-section-specific memmap, which can be allocated
	from any zone; with CONFIG_SPARSEMEM_VMEMMAP it is not even physically
	contiguous.

So the memmap is not directly related to the zone, and its memory can be
allocated outside the zone; it is therefore wrong to subtract the memmap's
size from the zone's present pages.

B)======
When the system has large holes, the subtracted present-pages size can become
very small or negative, making memory management behave badly in that zone or
rendering the zone unusable, even though the real present-pages size is
actually large.

C)======
The subtraction is also a problem for memory hot-removal:
zone->present_pages may underflow and wrap around to a huge value
(it is an unsigned long).

D)======
The memmap is a large, long-lived, unreclaimable allocation, so it is
reasonable to subtract it when computing watermarks. But a new, proper
approach is needed to do that, and the new approach should also handle
other long-lived unreclaimable memory.

The current approach of blindly subtracting it from present pages is wrong;
remove it.
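
A worked example of the size being subtracted (assuming 4KiB pages and a
64-byte struct page, as on x86_64): a zone spanning 262144 pages (1GiB) gets

	memmap_pages = PAGE_ALIGN(262144 * sizeof(struct page)) >> PAGE_SHIFT
	             = (16 MiB) >> PAGE_SHIFT = 4096 pages

subtracted from its present pages, even though with SPARSEMEM that memmap may
reside in a different zone, or even on a different node.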

Signed-off-by: Lai Jiangshan <[email protected]>
---
mm/page_alloc.c | 20 +-------------------
1 files changed, 1 insertions(+), 19 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 737faf7..03ad63d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4360,30 +4360,12 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,

for (j = 0; j < MAX_NR_ZONES; j++) {
struct zone *zone = pgdat->node_zones + j;
- unsigned long size, realsize, memmap_pages;
+ unsigned long size, realsize;

size = zone_spanned_pages_in_node(nid, j, zones_size);
realsize = size - zone_absent_pages_in_node(nid, j,
zholes_size);

- /*
- * Adjust realsize so that it accounts for how much memory
- * is used by this zone for memmap. This affects the watermark
- * and per-cpu initialisations
- */
- memmap_pages =
- PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
- if (realsize >= memmap_pages) {
- realsize -= memmap_pages;
- if (memmap_pages)
- printk(KERN_DEBUG
- " %s zone: %lu pages used for memmap\n",
- zone_names[j], memmap_pages);
- } else
- printk(KERN_WARNING
- " %s zone: %lu pages exceeds realsize %lu\n",
- zone_names[j], memmap_pages, realsize);
-
/* Account for reserved pages */
if (j == 0 && realsize > dma_reserve) {
realsize -= dma_reserve;
--
1.7.1

2012-08-02 06:01:28

by Lai Jiangshan

Subject: [RFC PATCH 16/23 V2] numa: add CONFIG_MOVABLE_NODE for movable-dedicated node

All the preparation is done, so we can actually introduce N_MEMORY.
Add CONFIG_MOVABLE_NODE so that it can be used for a MOVABLE-dedicated node.

Signed-off-by: Lai Jiangshan <[email protected]>
---
drivers/base/node.c | 6 ++++++
include/linux/nodemask.h | 4 ++++
mm/Kconfig | 8 ++++++++
mm/page_alloc.c | 3 +++
4 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 31f4805..4bf5629 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -621,6 +621,9 @@ static struct node_attr node_state_attr[] = {
#ifdef CONFIG_HIGHMEM
_NODE_ATTR(has_high_memory, N_HIGH_MEMORY),
#endif
+#ifdef CONFIG_MOVABLE_NODE
+ _NODE_ATTR(has_memory, N_MEMORY),
+#endif
};

static struct attribute *node_state_attrs[] = {
@@ -631,6 +634,9 @@ static struct attribute *node_state_attrs[] = {
#ifdef CONFIG_HIGHMEM
&node_state_attr[4].attr.attr,
#endif
+#ifdef CONFIG_MOVABLE_NODE
+ &node_state_attr[4].attr.attr,
+#endif
NULL
};

diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index c6ebdc9..4e2cbfa 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -380,7 +380,11 @@ enum node_states {
#else
N_HIGH_MEMORY = N_NORMAL_MEMORY,
#endif
+#ifdef CONFIG_MOVABLE_NODE
+ N_MEMORY, /* The node has memory (regular, high, movable) */
+#else
N_MEMORY = N_HIGH_MEMORY,
+#endif
N_CPU, /* The node has one or more cpus */
NR_NODE_STATES
};
diff --git a/mm/Kconfig b/mm/Kconfig
index 82fed4e..4371c65 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -140,6 +140,14 @@ config ARCH_DISCARD_MEMBLOCK
config NO_BOOTMEM
boolean

+config MOVABLE_NODE
+ boolean "Enable to assign a node has only movable memory"
+ depends on HAVE_MEMBLOCK
+ depends on NO_BOOTMEM
+ depends on X86_64
+ depends on NUMA
+ default y
+
# eventually, we can have this option just 'select SPARSEMEM'
config MEMORY_HOTPLUG
bool "Allow for memory hot-add"
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0571f2a..737faf7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -91,6 +91,9 @@ nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
#ifdef CONFIG_HIGHMEM
[N_HIGH_MEMORY] = { { [0] = 1UL } },
#endif
+#ifdef CONFIG_MOVABLE_NODE
+ [N_MEMORY] = { { [0] = 1UL } },
+#endif
[N_CPU] = { { [0] = 1UL } },
#endif /* NUMA */
};
--
1.7.1

2012-08-02 06:01:16

by Lai Jiangshan

Subject: [RFC PATCH 15/23 V2] memory_hotplug: fix missing nodemask management

Currently memory_hotplug only manages node_states[N_HIGH_MEMORY]; it
forgot to manage node_states[N_NORMAL_MEMORY]. Fix it.
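
The intended invariant, condensed from the diff below (the HIGHMEM case is
analogous):

	/* when pages of zone 'zoneid' on node 'nid' go online: */
	node_set_state(nid, N_MEMORY);
	if (zoneid <= ZONE_NORMAL && N_NORMAL_MEMORY != N_MEMORY)
		node_set_state(nid, N_NORMAL_MEMORY);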

Signed-off-by: Lai Jiangshan <[email protected]>
---
Documentation/memory-hotplug.txt | 2 +-
mm/memory_hotplug.c | 23 +++++++++++++++++++++--
2 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt
index 6d0c251..89f21b2 100644
--- a/Documentation/memory-hotplug.txt
+++ b/Documentation/memory-hotplug.txt
@@ -382,7 +382,7 @@ struct memory_notify {

start_pfn is start_pfn of online/offline memory.
nr_pages is # of pages of online/offline memory.
-status_change_nid is set node id when N_HIGH_MEMORY of nodemask is (will be)
+status_change_nid is set node id when N_MEMORY of nodemask is (will be)
set/clear. It means a new(memoryless) node gets new memory by online and a
node loses all memory. If this is -1, then nodemask status is not changed.
If status_changed_nid >= 0, callback should create/discard structures for the
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 427bb29..c44b39e 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -522,8 +522,18 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages)
init_per_zone_wmark_min();

if (onlined_pages) {
+ enum zone_type zoneid = zone_idx(zone);
+
kswapd_run(zone_to_nid(zone));
- node_set_state(zone_to_nid(zone), N_HIGH_MEMORY);
+
+ node_set_state(nid, N_MEMORY);
+ if (zoneid <= ZONE_NORMAL && N_NORMAL_MEMORY != N_MEMORY)
+ node_set_state(nid, N_NORMAL_MEMORY);
+#ifdef CONFIG_HIGHMEM
+ if (zoneid <= ZONE_HIGHMEM && N_HIGH_MEMORY != N_MEMORY)
+ node_set_state(nid, N_HIGH_MEMORY);
+#endif
+
}

vm_total_pages = nr_free_pagecache_pages();
@@ -966,7 +976,16 @@ repeat:
init_per_zone_wmark_min();

if (!node_present_pages(node)) {
- node_clear_state(node, N_HIGH_MEMORY);
+ enum zone_type zoneid = zone_idx(zone);
+
+ node_clear_state(node, N_MEMORY);
+ if (zoneid <= ZONE_NORMAL && N_NORMAL_MEMORY != N_MEMORY)
+ node_clear_state(node, N_NORMAL_MEMORY);
+#ifdef CONFIG_HIGHMEM
+ if (zoneid <= ZONE_HIGHMEM && N_HIGH_MEMORY != N_MEMORY)
+ node_clear_state(node, N_HIGH_MEMORY);
+#endif
+
kswapd_stop(node);
}

--
1.7.1

2012-08-02 06:01:19

by Lai Jiangshan

Subject: [RFC PATCH 14/23 V2] slub, hotplug: ignore unrelated node's hot-adding and hot-removing

SLUB only focuses on the nodes which have normal memory, so ignore the
other nodes' hot-adding and hot-removing.

Signed-off-by: Lai Jiangshan <[email protected]>
---
mm/slub.c | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 8c691fa..4c5bdc0 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3577,6 +3577,9 @@ static void slab_mem_offline_callback(void *arg)
if (offline_node < 0)
return;

+ if (page_zonenum(pfn_to_page(marg->start_pfn)) > ZONE_NORMAL)
+ return;
+
down_read(&slub_lock);
list_for_each_entry(s, &slab_caches, list) {
n = get_node(s, offline_node);
@@ -3611,6 +3614,9 @@ static int slab_mem_going_online_callback(void *arg)
if (nid < 0)
return 0;

+ if (page_zonenum(pfn_to_page(marg->start_pfn)) > ZONE_NORMAL)
+ return 0;
+
/*
* We are bringing a node online. No memory is available yet. We must
* allocate a kmem_cache_node structure in order to bring the node
--
1.7.1

2012-08-02 06:01:40

by Lai Jiangshan

Subject: [RFC PATCH 20/23 V2] x86: use memblock_set_current_limit() to set memblock.current_limit

From: Yasuaki Ishimatsu <[email protected]>

memblock.current_limit is set directly, even though
memblock_set_current_limit() is provided for this purpose. So fix it.

Signed-off-by: Yasuaki Ishimatsu <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/kernel/setup.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f4b9b80..bb9d9f8 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -889,7 +889,7 @@ void __init setup_arch(char **cmdline_p)

cleanup_highmap();

- memblock.current_limit = get_max_mapped();
+ memblock_set_current_limit(get_max_mapped());
memblock_x86_fill();

/*
@@ -925,7 +925,7 @@ void __init setup_arch(char **cmdline_p)
max_low_pfn = max_pfn;
}
#endif
- memblock.current_limit = get_max_mapped();
+ memblock_set_current_limit(get_max_mapped());
dma_contiguous_reserve(0);

/*
--
1.7.1

2012-08-02 06:01:45

by Lai Jiangshan

Subject: [RFC PATCH 23/23 V2] mm, memory-hotplug: add online_movable

When a memory block/section is onlined by "online_movable", the kernel
will have no direct references to the pages of that memory block, so we
can remove the memory at any time when needed.

This makes things easy when we dynamically hot-add/remove memory, makes
better use of memory, and helps THP.

Current constraint: only a memory block which is adjacent to ZONE_MOVABLE
can be onlined from ZONE_NORMAL to ZONE_MOVABLE.

For the opposite onlining behavior, we also introduce "online_kernel" to
change a memory block in ZONE_MOVABLE to a kernel zone (ZONE_NORMAL) when
it is onlined.
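
Usage example (the memory block number is illustrative):

	% echo online_movable > /sys/devices/system/memory/memory32/state

This onlines the block into ZONE_MOVABLE by sliding the boundary of the
adjacent kernel zone down over the block; "online_kernel" performs the
mirror operation.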

Signed-off-by: Lai Jiangshan <[email protected]>
---
Documentation/memory-hotplug.txt | 14 ++++-
drivers/base/memory.c | 19 ++++--
include/linux/memory_hotplug.h | 13 ++++-
mm/memory_hotplug.c | 114 +++++++++++++++++++++++++++++++++++--
4 files changed, 144 insertions(+), 16 deletions(-)

diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt
index 89f21b2..7b1269c 100644
--- a/Documentation/memory-hotplug.txt
+++ b/Documentation/memory-hotplug.txt
@@ -161,7 +161,8 @@ a recent addition and not present on older kernels.
in the memory block.
'state' : read-write
at read: contains online/offline state of memory.
- at write: user can specify "online", "offline" command
+ at write: user can specify "online_kernel",
+ "online_movable", "online", "offline" command
which will be performed on al sections in the block.
'phys_device' : read-only: designed to show the name of physical memory
device. This is not well implemented now.
@@ -255,6 +256,17 @@ For onlining, you have to write "online" to the section's state file as:

% echo online > /sys/devices/system/memory/memoryXXX/state

+This onlining will not change the ZONE type of the target memory section.
+If the memory section is in ZONE_NORMAL, you can change it to ZONE_MOVABLE:
+
+% echo online_movable > /sys/devices/system/memory/memoryXXX/state
+(NOTE: current limitation: this memory section must be adjacent to ZONE_MOVABLE)
+
+And if the memory section is in ZONE_MOVABLE, you can change it to ZONE_NORMAL:
+
+% echo online_kernel > /sys/devices/system/memory/memoryXXX/state
+(NOTE: current limitation: this memory section must be adjacent to ZONE_NORMAL)
+
After this, section memoryXXX's state will be 'online' and the amount of
available memory will be increased.

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 7dda4f7..1ad2f48 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -246,7 +246,7 @@ static bool pages_correctly_reserved(unsigned long start_pfn,
* OK to have direct references to sparsemem variables in here.
*/
static int
-memory_block_action(unsigned long phys_index, unsigned long action)
+memory_block_action(unsigned long phys_index, unsigned long action, int online_type)
{
unsigned long start_pfn, start_paddr;
unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
@@ -262,7 +262,7 @@ memory_block_action(unsigned long phys_index, unsigned long action)
if (!pages_correctly_reserved(start_pfn, nr_pages))
return -EBUSY;

- ret = online_pages(start_pfn, nr_pages);
+ ret = online_pages(start_pfn, nr_pages, online_type);
break;
case MEM_OFFLINE:
start_paddr = page_to_pfn(first_page) << PAGE_SHIFT;
@@ -279,7 +279,8 @@ memory_block_action(unsigned long phys_index, unsigned long action)
}

static int memory_block_change_state(struct memory_block *mem,
- unsigned long to_state, unsigned long from_state_req)
+ unsigned long to_state, unsigned long from_state_req,
+ int online_type)
{
int ret = 0;

@@ -293,7 +294,7 @@ static int memory_block_change_state(struct memory_block *mem,
if (to_state == MEM_OFFLINE)
mem->state = MEM_GOING_OFFLINE;

- ret = memory_block_action(mem->start_section_nr, to_state);
+ ret = memory_block_action(mem->start_section_nr, to_state, online_type);

if (ret) {
mem->state = from_state_req;
@@ -325,10 +326,14 @@ store_mem_state(struct device *dev,

mem = container_of(dev, struct memory_block, dev);

- if (!strncmp(buf, "online", min((int)count, 6)))
- ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE);
+ if (!strncmp(buf, "online_kernel", min((int)count, 13)))
+ ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE, ONLINE_KERNEL);
+ else if (!strncmp(buf, "online_movable", min((int)count, 14)))
+ ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE, ONLINE_MOVABLE);
+ else if (!strncmp(buf, "online", min((int)count, 6)))
+ ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE, ONLINE_KEEP);
else if(!strncmp(buf, "offline", min((int)count, 7)))
- ret = memory_block_change_state(mem, MEM_OFFLINE, MEM_ONLINE);
+ ret = memory_block_change_state(mem, MEM_OFFLINE, MEM_ONLINE, -1);

if (ret)
return ret;
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 910550f..047cd1d 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -25,6 +25,13 @@ enum {
MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE = NODE_INFO,
};

+/* Types for controlling the zone type of onlined memory */
+enum {
+ ONLINE_KEEP,
+ ONLINE_KERNEL,
+ ONLINE_MOVABLE,
+};
+
/*
* pgdat resizing functions
*/
@@ -45,6 +52,10 @@ void pgdat_resize_init(struct pglist_data *pgdat)
}
/*
* Zone resizing functions
+ *
+ * Note: any attempt to resize a zone should hold both pgdat_resize_lock()
+ * and zone_span_writelock(). This ensures that the size of a zone
+ * can't be changed while pgdat_resize_lock() is held.
*/
static inline unsigned zone_span_seqbegin(struct zone *zone)
{
@@ -70,7 +81,7 @@ extern int zone_grow_free_lists(struct zone *zone, unsigned long new_nr_pages);
extern int zone_grow_waitqueues(struct zone *zone, unsigned long nr_pages);
extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
/* VM interface that may be used by firmware interface */
-extern int online_pages(unsigned long, unsigned long);
+extern int online_pages(unsigned long, unsigned long, int);
extern void __offline_isolated_pages(unsigned long, unsigned long);

typedef void (*online_page_callback_t)(struct page *page);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c44b39e..b5ee3db 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -210,6 +210,89 @@ static void grow_zone_span(struct zone *zone, unsigned long start_pfn,
zone_span_writeunlock(zone);
}

+static void resize_zone(struct zone *zone, unsigned long start_pfn,
+ unsigned long end_pfn)
+{
+
+ zone_span_writelock(zone);
+
+ zone->zone_start_pfn = start_pfn;
+ zone->spanned_pages = end_pfn - start_pfn;
+
+ zone_span_writeunlock(zone);
+}
+
+static void fix_zone_id(struct zone *zone, unsigned long start_pfn,
+ unsigned long end_pfn)
+{
+ enum zone_type zid = zone_idx(zone);
+ int nid = zone->zone_pgdat->node_id;
+ unsigned long pfn;
+
+ for (pfn = start_pfn; pfn < end_pfn; pfn++)
+ set_page_links(pfn_to_page(pfn), zid, nid, pfn);
+}
+
+static int move_pfn_range_left(struct zone *z1, struct zone *z2,
+ unsigned long start_pfn, unsigned long end_pfn)
+{
+ unsigned long flags;
+
+ pgdat_resize_lock(z1->zone_pgdat, &flags);
+
+ /* can't move pfns which are higher than @z2 */
+ if (end_pfn > z2->zone_start_pfn + z2->spanned_pages)
+ goto out_fail;
+ /* the moved-out part must be at the leftmost of @z2 */
+ if (start_pfn > z2->zone_start_pfn)
+ goto out_fail;
+ /* the moved range must overlap @z2 */
+ if (end_pfn <= z2->zone_start_pfn)
+ goto out_fail;
+
+ resize_zone(z1, z1->zone_start_pfn, end_pfn);
+ resize_zone(z2, end_pfn, z2->zone_start_pfn + z2->spanned_pages);
+
+ pgdat_resize_unlock(z1->zone_pgdat, &flags);
+
+ fix_zone_id(z1, start_pfn, end_pfn);
+
+ return 0;
+out_fail:
+ pgdat_resize_unlock(z1->zone_pgdat, &flags);
+ return -1;
+}
+
+static int move_pfn_range_right(struct zone *z1, struct zone *z2,
+ unsigned long start_pfn, unsigned long end_pfn)
+{
+ unsigned long flags;
+
+ pgdat_resize_lock(z1->zone_pgdat, &flags);
+
+ /* can't move pfns which are lower than @z1 */
+ if (z1->zone_start_pfn > start_pfn)
+ goto out_fail;
+ /* the moved-out part must be at the rightmost of @z1 */
+ if (z1->zone_start_pfn + z1->spanned_pages > end_pfn)
+ goto out_fail;
+ /* the moved range must overlap @z1 */
+ if (start_pfn >= z1->zone_start_pfn + z1->spanned_pages)
+ goto out_fail;
+
+ resize_zone(z1, z1->zone_start_pfn, start_pfn);
+ resize_zone(z2, start_pfn, z2->zone_start_pfn + z2->spanned_pages);
+
+ pgdat_resize_unlock(z1->zone_pgdat, &flags);
+
+ fix_zone_id(z2, start_pfn, end_pfn);
+
+ return 0;
+out_fail:
+ pgdat_resize_unlock(z1->zone_pgdat, &flags);
+ return -1;
+}
+
static void grow_pgdat_span(struct pglist_data *pgdat, unsigned long start_pfn,
unsigned long end_pfn)
{
@@ -457,7 +540,7 @@ static int online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
}


-int __ref online_pages(unsigned long pfn, unsigned long nr_pages)
+int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_type)
{
unsigned long onlined_pages = 0;
struct zone *zone;
@@ -467,6 +550,29 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages)
struct memory_notify arg;

lock_memory_hotplug();
+ /*
+ * This doesn't need a lock to do pfn_to_page().
+ * The section can't be removed here because of the
+ * memory_block->state_mutex.
+ */
+ zone = page_zone(pfn_to_page(pfn));
+
+ if (online_type == ONLINE_KERNEL && zone_idx(zone) == ZONE_MOVABLE) {
+ if (move_pfn_range_left(zone - 1, zone, pfn, pfn + nr_pages)) {
+ unlock_memory_hotplug();
+ return -1;
+ }
+ }
+ if (online_type == ONLINE_MOVABLE && zone_idx(zone) == ZONE_MOVABLE - 1) {
+ if (move_pfn_range_right(zone, zone + 1, pfn, pfn + nr_pages)) {
+ unlock_memory_hotplug();
+ return -1;
+ }
+ }
+
+ /* The previous code may have changed the zone of the pfn range */
+ zone = page_zone(pfn_to_page(pfn));
+
arg.start_pfn = pfn;
arg.nr_pages = nr_pages;
arg.status_change_nid = -1;
@@ -483,12 +589,6 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages)
return ret;
}
/*
- * This doesn't need a lock to do pfn_to_page().
- * The section can't be removed here because of the
- * memory_block->state_mutex.
- */
- zone = page_zone(pfn_to_page(pfn));
- /*
* If this zone is not populated, then it is not in zonelist.
* This means the page allocator ignores this zone.
* So, zonelist must be updated after online.
--
1.7.1

2012-08-02 06:01:38

by Lai Jiangshan

Subject: [RFC PATCH 21/23 V2] memblock: limit memory address from memblock

From: Yasuaki Ishimatsu <[email protected]>

Setting kernelcore_max_addr means that all memory above that address is
allocated as ZONE_MOVABLE. So memory allocated by memblock should also be
limited by the parameter.

This patch limits the addresses that memblock allocates from.
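
The clamp added to memblock_set_current_limit() below is equivalent to this
one-liner (a sketch; min_not_zero() is the helper from linux/kernel.h, not
what the patch literally uses):

	/* never raise current_limit above the kernelcore_max_addr cap */
	memblock.current_limit = min_not_zero(memblock_limit, limit);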

Signed-off-by: Yasuaki Ishimatsu <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---
include/linux/memblock.h | 1 +
mm/memblock.c | 5 ++++-
mm/page_alloc.c | 6 +++++-
3 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 19dc455..f2977ae 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -42,6 +42,7 @@ struct memblock {

extern struct memblock memblock;
extern int memblock_debug;
+extern phys_addr_t memblock_limit;

#define memblock_dbg(fmt, ...) \
if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
diff --git a/mm/memblock.c b/mm/memblock.c
index 5cc6731..663b805 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -931,7 +931,10 @@ int __init_memblock memblock_is_region_reserved(phys_addr_t base, phys_addr_t si

void __init_memblock memblock_set_current_limit(phys_addr_t limit)
{
- memblock.current_limit = limit;
+ if (!memblock_limit || (memblock_limit > limit))
+ memblock.current_limit = limit;
+ else
+ memblock.current_limit = memblock_limit;
}

static void __init_memblock memblock_dump(struct memblock_type *type, char *name)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 65ac5c9..c4d3aa0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -209,6 +209,8 @@ static unsigned long __initdata required_kernelcore;
static unsigned long __initdata required_movablecore;
static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];

+phys_addr_t memblock_limit;
+
/* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
int movable_zone;
EXPORT_SYMBOL(movable_zone);
@@ -4876,7 +4878,9 @@ static int __init cmdline_parse_core(char *p, unsigned long *core)
*/
static int __init cmdline_parse_kernelcore_max_addr(char *p)
{
- return cmdline_parse_core(p, &required_kernelcore_max_pfn);
+ cmdline_parse_core(p, &required_kernelcore_max_pfn);
+ memblock_limit = required_kernelcore_max_pfn << PAGE_SHIFT;
+ return 0;
}
early_param("kernelcore_max_addr", cmdline_parse_kernelcore_max_addr);
#endif
--
1.7.1

2012-08-02 06:02:38

by Lai Jiangshan

Subject: [RFC PATCH 22/23 V2] memblock: compare current_limit with end variable at memblock_find_in_range_node()

From: Yasuaki Ishimatsu <[email protected]>

memblock_find_in_range_node() does not compare the end argument with
memblock.current_limit. Thus even if memblock.current_limit is smaller
than end, the function can return an address above memblock.current_limit.

This patch adds the check to memblock_find_in_range_node().
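
A concrete example: with memblock.current_limit = 4G, a caller passing
end = 8G could previously be handed an address in (4G, 8G]; with this
patch, end is first clamped down to 4G.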

Signed-off-by: Yasuaki Ishimatsu <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---
mm/memblock.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/memblock.c b/mm/memblock.c
index 663b805..ce7fcb6 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -99,11 +99,12 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
phys_addr_t align, int nid)
{
phys_addr_t this_start, this_end, cand;
+ phys_addr_t current_limit = memblock.current_limit;
u64 i;

/* pump up @end */
- if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
- end = memblock.current_limit;
+ if ((end == MEMBLOCK_ALLOC_ACCESSIBLE) || (end > current_limit))
+ end = current_limit;

/* avoid allocating the first page */
start = max_t(phys_addr_t, start, PAGE_SIZE);
--
1.7.1

2012-08-02 06:02:55

by Lai Jiangshan

Subject: [RFC PATCH 19/23 V2] x86: get pg_data_t's memory from other node

From: Yasuaki Ishimatsu <[email protected]>

If the system can create a movable node, in which all of the node's memory
is allocated as ZONE_MOVABLE, setup_node_data() cannot allocate memory for
the node's pg_data_t from that node. So when memblock_alloc_nid() fails,
setup_node_data() falls back to memblock_alloc().
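
The resulting allocation order, condensed from the diff below:

	/* prefer node-local memory for the node's pg_data_t ... */
	nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
	if (!nd_pa)
		/* ... but fall back to any node, e.g. for a movable node */
		nd_pa = memblock_alloc(nd_size, SMP_CACHE_BYTES);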

Signed-off-by: Yasuaki Ishimatsu <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/mm/numa.c | 9 +++++++--
1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 2d125be..a86e315 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -223,9 +223,14 @@ static void __init setup_node_data(int nid, u64 start, u64 end)
remapped = true;
} else {
nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
- if (!nd_pa) {
- pr_err("Cannot find %zu bytes in node %d\n",
+ if (!nd_pa) {
+ printk(KERN_WARNING "Cannot find %zu bytes in node %d\n",
nd_size, nid);
+ nd_pa = memblock_alloc(nd_size, SMP_CACHE_BYTES);
+ }
+ if (!nd_pa) {
+ pr_err("Cannot find %zu bytes in other node\n",
+ nd_size);
return;
}
nd = __va(nd_pa);
--
1.7.1

2012-08-02 06:03:18

by Lai Jiangshan

Subject: [RFC PATCH 18/23 V2] page_alloc: add kernelcore_max_addr

The current ZONE_MOVABLE (kernelcore=) boot-option policy doesn't meet our
requirements. We need something like a kernelcore_max_addr=XX boot option
to limit the upper address of the kernel core.

The memory at higher addresses will be migratable (movable) and thus easier
to offline (always ready to be offlined when the system doesn't require so
much memory).

This makes things easy when we dynamically hot-add/remove memory, makes
better use of memory, and helps THP.

kernelcore_max_addr=, kernelcore= and movablecore= can all be safely
specified at the same time (or any two of them).
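
Example (illustrative): on a machine with 8GiB of RAM, booting with

	kernelcore_max_addr=4G

keeps all non-movable kernel allocations below 4GiB; the memory from 4GiB
up goes to ZONE_MOVABLE and so remains easy to offline.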

Signed-off-by: Lai Jiangshan <[email protected]>
---
Documentation/kernel-parameters.txt | 9 +++++++++
mm/page_alloc.c | 29 ++++++++++++++++++++++++++++-
2 files changed, 37 insertions(+), 1 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 12783fa..48dff61 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1216,6 +1216,15 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
use the HighMem zone if it exists, and the Normal
zone if it does not.

+ kernelcore_max_addr=nn[KMG] [KNL,X86,IA-64,PPC] This parameter
+ has the same effect as the kernelcore parameter, except it
+ specifies the upper physical address of the memory range
+ usable by the kernel for non-movable allocations.
+ If both kernelcore and kernelcore_max_addr are
+ specified, kernelcore_max_addr takes priority over
+ kernelcore.
+ See the kernelcore parameter.
+
kgdbdbgp= [KGDB,HW] kgdb over EHCI usb debug port.
Format: <Controller#>[,poll interval]
The controller # is the number of the ehci usb debug
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 03ad63d..65ac5c9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -204,6 +204,7 @@ static unsigned long __meminitdata dma_reserve;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
static unsigned long __meminitdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
+static unsigned long __initdata required_kernelcore_max_pfn;
static unsigned long __initdata required_kernelcore;
static unsigned long __initdata required_movablecore;
static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
@@ -4600,6 +4601,7 @@ static void __init find_zone_movable_pfns_for_nodes(void)
{
int i, nid;
unsigned long usable_startpfn;
+ unsigned long kernelcore_max_pfn;
unsigned long kernelcore_node, kernelcore_remaining;
/* save the state before borrow the nodemask */
nodemask_t saved_node_state = node_states[N_MEMORY];
@@ -4628,6 +4630,9 @@ static void __init find_zone_movable_pfns_for_nodes(void)
required_kernelcore = max(required_kernelcore, corepages);
}

+ if (required_kernelcore_max_pfn && !required_kernelcore)
+ required_kernelcore = totalpages;
+
/* If kernelcore was not specified, there is no ZONE_MOVABLE */
if (!required_kernelcore)
goto out;
@@ -4636,6 +4641,12 @@ static void __init find_zone_movable_pfns_for_nodes(void)
find_usable_zone_for_movable();
usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];

+ if (required_kernelcore_max_pfn)
+ kernelcore_max_pfn = required_kernelcore_max_pfn;
+ else
+ kernelcore_max_pfn = ULONG_MAX >> PAGE_SHIFT;
+ kernelcore_max_pfn = max(kernelcore_max_pfn, usable_startpfn);
+
restart:
/* Spread kernelcore memory as evenly as possible throughout nodes */
kernelcore_node = required_kernelcore / usable_nodes;
@@ -4662,8 +4673,12 @@ restart:
unsigned long size_pages;

start_pfn = max(start_pfn, zone_movable_pfn[nid]);
- if (start_pfn >= end_pfn)
+ end_pfn = min(kernelcore_max_pfn, end_pfn);
+ if (start_pfn >= end_pfn) {
+ if (!zone_movable_pfn[nid])
+ zone_movable_pfn[nid] = start_pfn;
continue;
+ }

/* Account for what is only usable for kernelcore */
if (start_pfn < usable_startpfn) {
@@ -4854,6 +4869,18 @@ static int __init cmdline_parse_core(char *p, unsigned long *core)
return 0;
}

+#ifdef CONFIG_MOVABLE_NODE
+/*
+ * kernelcore_max_addr=addr sets the upper physical address of the memory
+ * range used for allocations that cannot be reclaimed or migrated.
+ */
+static int __init cmdline_parse_kernelcore_max_addr(char *p)
+{
+ return cmdline_parse_core(p, &required_kernelcore_max_pfn);
+}
+early_param("kernelcore_max_addr", cmdline_parse_kernelcore_max_addr);
+#endif
+
/*
* kernelcore=size sets the amount of memory for use for allocations that
* cannot be reclaimed or migrated.
--
1.7.1

2012-08-02 06:03:55

by Lai Jiangshan

Subject: [RFC PATCH 13/23 V2] page_alloc: use N_MEMORY instead N_HIGH_MEMORY change the node_states initialization

N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory; we should
use N_MEMORY instead.

Since we have introduced N_MEMORY, we also update the node_states
initialization.

Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/mm/init_64.c | 4 +++-
mm/page_alloc.c | 40 ++++++++++++++++++++++------------------
2 files changed, 25 insertions(+), 19 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 2b6b4a3..005f00c 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -625,7 +625,9 @@ void __init paging_init(void)
* numa support is not compiled in, and later node_set_state
* will not set it back.
*/
- node_clear_state(0, N_NORMAL_MEMORY);
+ node_clear_state(0, N_MEMORY);
+ if (N_MEMORY != N_NORMAL_MEMORY)
+ node_clear_state(0, N_NORMAL_MEMORY);

zone_sizes_init();
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4a4f921..0571f2a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1646,7 +1646,7 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
*
* If the zonelist cache is present in the passed in zonelist, then
* returns a pointer to the allowed node mask (either the current
- * tasks mems_allowed, or node_states[N_HIGH_MEMORY].)
+ * tasks mems_allowed, or node_states[N_MEMORY].)
*
* If the zonelist cache is not available for this zonelist, does
* nothing and returns NULL.
@@ -1675,7 +1675,7 @@ static nodemask_t *zlc_setup(struct zonelist *zonelist, int alloc_flags)

allowednodes = !in_interrupt() && (alloc_flags & ALLOC_CPUSET) ?
&cpuset_current_mems_allowed :
- &node_states[N_HIGH_MEMORY];
+ &node_states[N_MEMORY];
return allowednodes;
}

@@ -3070,7 +3070,7 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask)
return node;
}

- for_each_node_state(n, N_HIGH_MEMORY) {
+ for_each_node_state(n, N_MEMORY) {

/* Don't want a node to appear more than once */
if (node_isset(n, *used_node_mask))
@@ -3212,7 +3212,7 @@ static int default_zonelist_order(void)
* local memory, NODE_ORDER may be suitable.
*/
average_size = total_size /
- (nodes_weight(node_states[N_HIGH_MEMORY]) + 1);
+ (nodes_weight(node_states[N_MEMORY]) + 1);
for_each_online_node(nid) {
low_kmem_size = 0;
total_size = 0;
@@ -4587,7 +4587,7 @@ unsigned long __init find_min_pfn_with_active_regions(void)
/*
* early_calculate_totalpages()
* Sum pages in active regions for movable zone.
- * Populate N_HIGH_MEMORY for calculating usable_nodes.
+ * Populate N_MEMORY for calculating usable_nodes.
*/
static unsigned long __init early_calculate_totalpages(void)
{
@@ -4600,7 +4600,7 @@ static unsigned long __init early_calculate_totalpages(void)

totalpages += pages;
if (pages)
- node_set_state(nid, N_HIGH_MEMORY);
+ node_set_state(nid, N_MEMORY);
}
return totalpages;
}
@@ -4617,9 +4617,9 @@ static void __init find_zone_movable_pfns_for_nodes(void)
unsigned long usable_startpfn;
unsigned long kernelcore_node, kernelcore_remaining;
/* save the state before borrow the nodemask */
- nodemask_t saved_node_state = node_states[N_HIGH_MEMORY];
+ nodemask_t saved_node_state = node_states[N_MEMORY];
unsigned long totalpages = early_calculate_totalpages();
- int usable_nodes = nodes_weight(node_states[N_HIGH_MEMORY]);
+ int usable_nodes = nodes_weight(node_states[N_MEMORY]);

/*
* If movablecore was specified, calculate what size of
@@ -4654,7 +4654,7 @@ static void __init find_zone_movable_pfns_for_nodes(void)
restart:
/* Spread kernelcore memory as evenly as possible throughout nodes */
kernelcore_node = required_kernelcore / usable_nodes;
- for_each_node_state(nid, N_HIGH_MEMORY) {
+ for_each_node_state(nid, N_MEMORY) {
unsigned long start_pfn, end_pfn;

/*
@@ -4746,23 +4746,27 @@ restart:

out:
/* restore the node_state */
- node_states[N_HIGH_MEMORY] = saved_node_state;
+ node_states[N_MEMORY] = saved_node_state;
}

-/* Any regular memory on that node ? */
-static void check_for_regular_memory(pg_data_t *pgdat)
+/* Any regular or high memory on that node ? */
+static void check_for_memory(pg_data_t *pgdat, int nid)
{
-#ifdef CONFIG_HIGHMEM
enum zone_type zone_type;

- for (zone_type = 0; zone_type <= ZONE_NORMAL; zone_type++) {
+ if (N_MEMORY == N_NORMAL_MEMORY)
+ return;
+
+ for (zone_type = 0; zone_type <= ZONE_MOVABLE - 1; zone_type++) {
struct zone *zone = &pgdat->node_zones[zone_type];
if (zone->present_pages) {
- node_set_state(zone_to_nid(zone), N_NORMAL_MEMORY);
+ node_set_state(nid, N_HIGH_MEMORY);
+ if (N_NORMAL_MEMORY != N_HIGH_MEMORY &&
+ zone_type <= ZONE_NORMAL)
+ node_set_state(nid, N_NORMAL_MEMORY);
break;
}
}
-#endif
}

/**
@@ -4845,8 +4849,8 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)

/* Any memory on that node */
if (pgdat->node_present_pages)
- node_set_state(nid, N_HIGH_MEMORY);
- check_for_regular_memory(pgdat);
+ node_set_state(nid, N_MEMORY);
+ check_for_memory(pgdat, nid);
}
}

--
1.7.1

2012-08-02 06:01:10

by Lai Jiangshan

Subject: [RFC PATCH 12/23 V2] vmscan: use N_MEMORY instead N_HIGH_MEMORY

N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory; we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan <[email protected]>
---
mm/vmscan.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 66e4310..1888026 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2921,7 +2921,7 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
int nid;

if (action == CPU_ONLINE || action == CPU_ONLINE_FROZEN) {
- for_each_node_state(nid, N_HIGH_MEMORY) {
+ for_each_node_state(nid, N_MEMORY) {
pg_data_t *pgdat = NODE_DATA(nid);
const struct cpumask *mask;

@@ -2976,7 +2976,7 @@ static int __init kswapd_init(void)
int nid;

swap_setup();
- for_each_node_state(nid, N_HIGH_MEMORY)
+ for_each_node_state(nid, N_MEMORY)
kswapd_run(nid);
hotcpu_notifier(cpu_callback, 0);
return 0;
--
1.7.1

2012-08-02 06:04:38

by Lai Jiangshan

Subject: [RFC PATCH 10/23 V2] kthread: use N_MEMORY instead N_HIGH_MEMORY

N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory; we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/kthread.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 3d3de63..4139962 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -280,7 +280,7 @@ int kthreadd(void *unused)
set_task_comm(tsk, "kthreadd");
ignore_signals(tsk);
set_cpus_allowed_ptr(tsk, cpu_all_mask);
- set_mems_allowed(node_states[N_HIGH_MEMORY]);
+ set_mems_allowed(node_states[N_MEMORY]);

current->flags |= PF_NOFREEZE;

--
1.7.1

2012-08-02 06:04:52

by Lai Jiangshan

Subject: [RFC PATCH 08/23 V2] hugetlb: use N_MEMORY instead N_HIGH_MEMORY

N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory; we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan <[email protected]>
---
drivers/base/node.c | 2 +-
mm/hugetlb.c | 24 ++++++++++++------------
2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index af1a177..31f4805 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -227,7 +227,7 @@ static node_registration_func_t __hugetlb_unregister_node;
static inline bool hugetlb_register_node(struct node *node)
{
if (__hugetlb_register_node &&
- node_state(node->dev.id, N_HIGH_MEMORY)) {
+ node_state(node->dev.id, N_MEMORY)) {
__hugetlb_register_node(node);
return true;
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e198831..661db47 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1046,7 +1046,7 @@ static void return_unused_surplus_pages(struct hstate *h,
* on-line nodes with memory and will handle the hstate accounting.
*/
while (nr_pages--) {
- if (!free_pool_huge_page(h, &node_states[N_HIGH_MEMORY], 1))
+ if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
break;
}
}
@@ -1150,14 +1150,14 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
int __weak alloc_bootmem_huge_page(struct hstate *h)
{
struct huge_bootmem_page *m;
- int nr_nodes = nodes_weight(node_states[N_HIGH_MEMORY]);
+ int nr_nodes = nodes_weight(node_states[N_MEMORY]);

while (nr_nodes) {
void *addr;

addr = __alloc_bootmem_node_nopanic(
NODE_DATA(hstate_next_node_to_alloc(h,
- &node_states[N_HIGH_MEMORY])),
+ &node_states[N_MEMORY])),
huge_page_size(h), huge_page_size(h), 0);

if (addr) {
@@ -1229,7 +1229,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
if (!alloc_bootmem_huge_page(h))
break;
} else if (!alloc_fresh_huge_page(h,
- &node_states[N_HIGH_MEMORY]))
+ &node_states[N_MEMORY]))
break;
}
h->max_huge_pages = i;
@@ -1497,7 +1497,7 @@ static ssize_t nr_hugepages_store_common(bool obey_mempolicy,
if (!(obey_mempolicy &&
init_nodemask_of_mempolicy(nodes_allowed))) {
NODEMASK_FREE(nodes_allowed);
- nodes_allowed = &node_states[N_HIGH_MEMORY];
+ nodes_allowed = &node_states[N_MEMORY];
}
} else if (nodes_allowed) {
/*
@@ -1507,11 +1507,11 @@ static ssize_t nr_hugepages_store_common(bool obey_mempolicy,
count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
init_nodemask_of_node(nodes_allowed, nid);
} else
- nodes_allowed = &node_states[N_HIGH_MEMORY];
+ nodes_allowed = &node_states[N_MEMORY];

h->max_huge_pages = set_max_huge_pages(h, count, nodes_allowed);

- if (nodes_allowed != &node_states[N_HIGH_MEMORY])
+ if (nodes_allowed != &node_states[N_MEMORY])
NODEMASK_FREE(nodes_allowed);

return len;
@@ -1812,7 +1812,7 @@ static void hugetlb_register_all_nodes(void)
{
int nid;

- for_each_node_state(nid, N_HIGH_MEMORY) {
+ for_each_node_state(nid, N_MEMORY) {
struct node *node = &node_devices[nid];
if (node->dev.id == nid)
hugetlb_register_node(node);
@@ -1906,8 +1906,8 @@ void __init hugetlb_add_hstate(unsigned order)
h->free_huge_pages = 0;
for (i = 0; i < MAX_NUMNODES; ++i)
INIT_LIST_HEAD(&h->hugepage_freelists[i]);
- h->next_nid_to_alloc = first_node(node_states[N_HIGH_MEMORY]);
- h->next_nid_to_free = first_node(node_states[N_HIGH_MEMORY]);
+ h->next_nid_to_alloc = first_node(node_states[N_MEMORY]);
+ h->next_nid_to_free = first_node(node_states[N_MEMORY]);
snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
huge_page_size(h)/1024);

@@ -1995,11 +1995,11 @@ static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
if (!(obey_mempolicy &&
init_nodemask_of_mempolicy(nodes_allowed))) {
NODEMASK_FREE(nodes_allowed);
- nodes_allowed = &node_states[N_HIGH_MEMORY];
+ nodes_allowed = &node_states[N_MEMORY];
}
h->max_huge_pages = set_max_huge_pages(h, tmp, nodes_allowed);

- if (nodes_allowed != &node_states[N_HIGH_MEMORY])
+ if (nodes_allowed != &node_states[N_MEMORY])
NODEMASK_FREE(nodes_allowed);
}
out:
--
1.7.1

2012-08-02 06:05:36

by Lai Jiangshan

Subject: [RFC PATCH 05/23 V2] mm,migrate: use N_MEMORY instead N_HIGH_MEMORY

N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory; we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan <[email protected]>
---
mm/migrate.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index be26d5c..dbe4f86 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1226,7 +1226,7 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
if (node < 0 || node >= MAX_NUMNODES)
goto out_pm;

- if (!node_state(node, N_HIGH_MEMORY))
+ if (!node_state(node, N_MEMORY))
goto out_pm;

err = -EACCES;
--
1.7.1

2012-08-02 06:05:34

by Lai Jiangshan

[permalink] [raw]
Subject: [RFC PATCH 07/23 V2] memcontrol: use N_MEMORY instead N_HIGH_MEMORY

N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle nodes which have memory, so we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan <[email protected]>
---
mm/memcontrol.c | 18 +++++++++---------
mm/page_cgroup.c | 2 +-
2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f72b5e5..4402c2e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -797,7 +797,7 @@ static unsigned long mem_cgroup_nr_lru_pages(struct mem_cgroup *memcg,
int nid;
u64 total = 0;

- for_each_node_state(nid, N_HIGH_MEMORY)
+ for_each_node_state(nid, N_MEMORY)
total += mem_cgroup_node_nr_lru_pages(memcg, nid, lru_mask);
return total;
}
@@ -1549,9 +1549,9 @@ static void mem_cgroup_may_update_nodemask(struct mem_cgroup *memcg)
return;

/* make a nodemask where this memcg uses memory from */
- memcg->scan_nodes = node_states[N_HIGH_MEMORY];
+ memcg->scan_nodes = node_states[N_MEMORY];

- for_each_node_mask(nid, node_states[N_HIGH_MEMORY]) {
+ for_each_node_mask(nid, node_states[N_MEMORY]) {

if (!test_mem_cgroup_node_reclaimable(memcg, nid, false))
node_clear(nid, memcg->scan_nodes);
@@ -1622,7 +1622,7 @@ static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap)
/*
* Check rest of nodes.
*/
- for_each_node_state(nid, N_HIGH_MEMORY) {
+ for_each_node_state(nid, N_MEMORY) {
if (node_isset(nid, memcg->scan_nodes))
continue;
if (test_mem_cgroup_node_reclaimable(memcg, nid, noswap))
@@ -3700,7 +3700,7 @@ move_account:
drain_all_stock_sync(memcg);
ret = 0;
mem_cgroup_start_move(memcg);
- for_each_node_state(node, N_HIGH_MEMORY) {
+ for_each_node_state(node, N_MEMORY) {
for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
enum lru_list lru;
for_each_lru(lru) {
@@ -4025,7 +4025,7 @@ static int mem_control_numa_stat_show(struct cgroup *cont, struct cftype *cft,

total_nr = mem_cgroup_nr_lru_pages(memcg, LRU_ALL);
seq_printf(m, "total=%lu", total_nr);
- for_each_node_state(nid, N_HIGH_MEMORY) {
+ for_each_node_state(nid, N_MEMORY) {
node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL);
seq_printf(m, " N%d=%lu", nid, node_nr);
}
@@ -4033,7 +4033,7 @@ static int mem_control_numa_stat_show(struct cgroup *cont, struct cftype *cft,

file_nr = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_FILE);
seq_printf(m, "file=%lu", file_nr);
- for_each_node_state(nid, N_HIGH_MEMORY) {
+ for_each_node_state(nid, N_MEMORY) {
node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid,
LRU_ALL_FILE);
seq_printf(m, " N%d=%lu", nid, node_nr);
@@ -4042,7 +4042,7 @@ static int mem_control_numa_stat_show(struct cgroup *cont, struct cftype *cft,

anon_nr = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_ANON);
seq_printf(m, "anon=%lu", anon_nr);
- for_each_node_state(nid, N_HIGH_MEMORY) {
+ for_each_node_state(nid, N_MEMORY) {
node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid,
LRU_ALL_ANON);
seq_printf(m, " N%d=%lu", nid, node_nr);
@@ -4051,7 +4051,7 @@ static int mem_control_numa_stat_show(struct cgroup *cont, struct cftype *cft,

unevictable_nr = mem_cgroup_nr_lru_pages(memcg, BIT(LRU_UNEVICTABLE));
seq_printf(m, "unevictable=%lu", unevictable_nr);
- for_each_node_state(nid, N_HIGH_MEMORY) {
+ for_each_node_state(nid, N_MEMORY) {
node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid,
BIT(LRU_UNEVICTABLE));
seq_printf(m, " N%d=%lu", nid, node_nr);
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index eb750f8..e775239 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -271,7 +271,7 @@ void __init page_cgroup_init(void)
if (mem_cgroup_disabled())
return;

- for_each_node_state(nid, N_HIGH_MEMORY) {
+ for_each_node_state(nid, N_MEMORY) {
unsigned long start_pfn, end_pfn;

start_pfn = node_start_pfn(nid);
--
1.7.1

2012-08-02 06:06:29

by Lai Jiangshan

[permalink] [raw]
Subject: [RFC PATCH 04/23 V2] oom: use N_MEMORY instead N_HIGH_MEMORY

N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle nodes which have memory, so we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan <[email protected]>
---
mm/oom_kill.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ac300c9..1e58f12 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -257,7 +257,7 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
* the page allocator means a mempolicy is in effect. Cpuset policy
* is enforced in get_page_from_freelist().
*/
- if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask)) {
+ if (nodemask && !nodes_subset(node_states[N_MEMORY], *nodemask)) {
*totalpages = total_swap_pages;
for_each_node_mask(nid, *nodemask)
*totalpages += node_spanned_pages(nid);
--
1.7.1

2012-08-02 06:06:53

by Lai Jiangshan

[permalink] [raw]
Subject: [RFC PATCH 01/23 V2] node_states: introduce N_MEMORY

We have N_NORMAL_MEMORY standing for the nodes that have normal memory with
zone_type <= ZONE_NORMAL.

And we have N_HIGH_MEMORY standing for the nodes that have normal or high
memory.

But we don't have any word to stand for the nodes that have *any* memory.

And we have N_CPU but without N_MEMORY.

Current code reuses N_HIGH_MEMORY for this purpose, because at present any
node which has memory must have high memory or normal memory.

A) But this reuse is bad for *readability*, because the name
N_HIGH_MEMORY stands only for high or normal memory:

A.example 1)
mem_cgroup_nr_lru_pages():
for_each_node_state(nid, N_HIGH_MEMORY)

The reader will be confused (why does this function only count high- or
normal-memory nodes? does it count ZONE_MOVABLE's lru pages?) until
someone tells them that N_HIGH_MEMORY is reused to stand for nodes that
have any memory.

A.cont) If we introduce N_MEMORY, we can reduce this confusion
AND make the code clearer:

A.example 2) mm/page_cgroup.c uses N_HIGH_MEMORY twice:

One is in page_cgroup_init(void):
for_each_node_state(nid, N_HIGH_MEMORY) {

It means that if the node has memory, we will allocate a page_cgroup map
for the node. We should use N_MEMORY here instead to gain clarity.

The second use is in alloc_page_cgroup():
if (node_state(nid, N_HIGH_MEMORY))
addr = vzalloc_node(size, nid);

It means the node has high or normal memory that the kernel can allocate
from. We should keep N_HIGH_MEMORY here, and it will be better once the
"any memory" semantic of N_HIGH_MEMORY is removed.

B) This reuse becomes outdated once we introduce the MOVABLE-dedicated
node. A MOVABLE-dedicated node should appear in neither
node_states[N_HIGH_MEMORY] nor node_states[N_NORMAL_MEMORY],
because a MOVABLE-dedicated node has no high or normal memory.

On x86_64, N_HIGH_MEMORY == N_NORMAL_MEMORY, so if a MOVABLE-dedicated
node were in node_states[N_HIGH_MEMORY], it would also be in
node_states[N_NORMAL_MEMORY], which breaks SLUB.

SLUB uses
for_each_node_state(nid, N_NORMAL_MEMORY)
and would create a kmem_cache_node for the MOVABLE-dedicated node,
causing problems.

In one word, we need N_MEMORY. We introduce it here as an alias of
N_HIGH_MEMORY and fix all improper usages of N_HIGH_MEMORY in later
patches.
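
As an illustrative sketch of the intended end state after the later
conversion patches (init_per_node_struct() is a made-up placeholder,
not a real kernel helper):

	/* iterate over nodes with any memory, including movable-only nodes */
	for_each_node_state(nid, N_MEMORY)
		init_per_node_struct(nid);

	/* check for memory the kernel itself can allocate from */
	if (node_state(nid, N_HIGH_MEMORY))
		addr = vzalloc_node(size, nid);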

Signed-off-by: Lai Jiangshan <[email protected]>
---
include/linux/nodemask.h | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 7afc363..c6ebdc9 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -380,6 +380,7 @@ enum node_states {
#else
N_HIGH_MEMORY = N_NORMAL_MEMORY,
#endif
+ N_MEMORY = N_HIGH_MEMORY,
N_CPU, /* The node has one or more cpus */
NR_NODE_STATES
};
--
1.7.1

Subject: Re: [RFC PATCH 05/23 V2] mm,migrate: use N_MEMORY instead N_HIGH_MEMORY

On Thu, 2 Aug 2012, Lai Jiangshan wrote:

> The code here needs to handle nodes which have memory, so we should
> use N_MEMORY instead.

Acked-by: Christoph Lameter <[email protected]>

Subject: Re: [RFC PATCH 09/23 V2] vmstat: use N_MEMORY instead N_HIGH_MEMORY

On Thu, 2 Aug 2012, Lai Jiangshan wrote:

> The code here needs to handle nodes which have memory, so we should
> use N_MEMORY instead.

Acked-by: Christoph Lameter <[email protected]>

2012-08-04 14:02:49

by Hillf Danton

[permalink] [raw]
Subject: Re: [RFC PATCH 08/23 V2] hugetlb: use N_MEMORY instead N_HIGH_MEMORY

On Thu, Aug 2, 2012 at 2:01 PM, Lai Jiangshan <[email protected]> wrote:
> N_HIGH_MEMORY stands for the nodes that have normal or high memory.
> N_MEMORY stands for the nodes that have any memory.
>
> The code here needs to handle nodes which have memory, so we should
> use N_MEMORY instead.
>
> Signed-off-by: Lai Jiangshan <[email protected]>
> ---
> drivers/base/node.c | 2 +-
> mm/hugetlb.c | 24 ++++++++++++------------
> 2 files changed, 13 insertions(+), 13 deletions(-)
>

Better if the patch is split for hugetlb and node respectively.

Acked-by: Hillf Danton <[email protected]>

> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index af1a177..31f4805 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -227,7 +227,7 @@ static node_registration_func_t __hugetlb_unregister_node;
> static inline bool hugetlb_register_node(struct node *node)
> {
> if (__hugetlb_register_node &&
> - node_state(node->dev.id, N_HIGH_MEMORY)) {
> + node_state(node->dev.id, N_MEMORY)) {
> __hugetlb_register_node(node);
> return true;
> }
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index e198831..661db47 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1046,7 +1046,7 @@ static void return_unused_surplus_pages(struct hstate *h,
> * on-line nodes with memory and will handle the hstate accounting.
> */
> while (nr_pages--) {
> - if (!free_pool_huge_page(h, &node_states[N_HIGH_MEMORY], 1))
> + if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
> break;
> }
> }
> @@ -1150,14 +1150,14 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
> int __weak alloc_bootmem_huge_page(struct hstate *h)
> {
> struct huge_bootmem_page *m;
> - int nr_nodes = nodes_weight(node_states[N_HIGH_MEMORY]);
> + int nr_nodes = nodes_weight(node_states[N_MEMORY]);
>
> while (nr_nodes) {
> void *addr;
>
> addr = __alloc_bootmem_node_nopanic(
> NODE_DATA(hstate_next_node_to_alloc(h,
> - &node_states[N_HIGH_MEMORY])),
> + &node_states[N_MEMORY])),
> huge_page_size(h), huge_page_size(h), 0);
>
> if (addr) {
> @@ -1229,7 +1229,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> if (!alloc_bootmem_huge_page(h))
> break;
> } else if (!alloc_fresh_huge_page(h,
> - &node_states[N_HIGH_MEMORY]))
> + &node_states[N_MEMORY]))
> break;
> }
> h->max_huge_pages = i;
> @@ -1497,7 +1497,7 @@ static ssize_t nr_hugepages_store_common(bool obey_mempolicy,
> if (!(obey_mempolicy &&
> init_nodemask_of_mempolicy(nodes_allowed))) {
> NODEMASK_FREE(nodes_allowed);
> - nodes_allowed = &node_states[N_HIGH_MEMORY];
> + nodes_allowed = &node_states[N_MEMORY];
> }
> } else if (nodes_allowed) {
> /*
> @@ -1507,11 +1507,11 @@ static ssize_t nr_hugepages_store_common(bool obey_mempolicy,
> count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
> init_nodemask_of_node(nodes_allowed, nid);
> } else
> - nodes_allowed = &node_states[N_HIGH_MEMORY];
> + nodes_allowed = &node_states[N_MEMORY];
>
> h->max_huge_pages = set_max_huge_pages(h, count, nodes_allowed);
>
> - if (nodes_allowed != &node_states[N_HIGH_MEMORY])
> + if (nodes_allowed != &node_states[N_MEMORY])
> NODEMASK_FREE(nodes_allowed);
>
> return len;
> @@ -1812,7 +1812,7 @@ static void hugetlb_register_all_nodes(void)
> {
> int nid;
>
> - for_each_node_state(nid, N_HIGH_MEMORY) {
> + for_each_node_state(nid, N_MEMORY) {
> struct node *node = &node_devices[nid];
> if (node->dev.id == nid)
> hugetlb_register_node(node);
> @@ -1906,8 +1906,8 @@ void __init hugetlb_add_hstate(unsigned order)
> h->free_huge_pages = 0;
> for (i = 0; i < MAX_NUMNODES; ++i)
> INIT_LIST_HEAD(&h->hugepage_freelists[i]);
> - h->next_nid_to_alloc = first_node(node_states[N_HIGH_MEMORY]);
> - h->next_nid_to_free = first_node(node_states[N_HIGH_MEMORY]);
> + h->next_nid_to_alloc = first_node(node_states[N_MEMORY]);
> + h->next_nid_to_free = first_node(node_states[N_MEMORY]);
> snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
> huge_page_size(h)/1024);
>
> @@ -1995,11 +1995,11 @@ static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
> if (!(obey_mempolicy &&
> init_nodemask_of_mempolicy(nodes_allowed))) {
> NODEMASK_FREE(nodes_allowed);
> - nodes_allowed = &node_states[N_HIGH_MEMORY];
> + nodes_allowed = &node_states[N_MEMORY];
> }
> h->max_huge_pages = set_max_huge_pages(h, tmp, nodes_allowed);
>
> - if (nodes_allowed != &node_states[N_HIGH_MEMORY])
> + if (nodes_allowed != &node_states[N_MEMORY])
> NODEMASK_FREE(nodes_allowed);
> }
> out:
> --
> 1.7.1
>

2012-08-06 09:23:25

by Lai Jiangshan

[permalink] [raw]
Subject: [RFC V3 PATCH 00/25] memory,numa: introduce MOVABLE-dedicated node and online_movable for hotplug

A) Introduction:

This patchset adds a MOVABLE-dedicated node and online_movable for memory
management.

It is used for anti-fragmentation (hugepages, big-order allocations, ...) and
hot removal of memory (virtualization, power saving, moving memory between
systems to make better use of memory).

B) User Interface:

When users (e.g. big-system administrators) need to configure some
node/memory as MOVABLE:
1 Use kernelcore_max_addr=XX at boot time
2 Use the online_movable hotplug action at runtime (sketched below)
We may introduce some more convenient interface later, such as a
movable_node=NODE_LIST boot option.
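
For example (a minimal sketch; the block number XXX is made up for
illustration):

# at boot: keep kernel (non-movable) allocations below 4G
kernelcore_max_addr=4G

# at runtime: online a hot-added memory block as movable
% echo online_movable > /sys/devices/system/memory/memoryXXX/state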

C) Patches

Patch1-3 Fix problems in the current code (all related to hotplug)
Patch4 Cleanup for node_state_attr
Patch5 Introduce N_MEMORY
Patch6-18 Use N_MEMORY instead of N_HIGH_MEMORY.
The patches are separated by subsystem;
*these conversions were (and must be) checked carefully*.
Patch18 also changes the node_states initialization
Patch19 Add config to allow a MOVABLE-dedicated node
Patch20-24 Add kernelcore_max_addr
Patch25 Add online_movable and online_kernel


D) Changes

Changes from V2 to V3:
Proper nodemask management

Changes from V1 to V2:

The original V1 patchset of the MOVABLE-dedicated node is here:
http://comments.gmane.org/gmane.linux.kernel.mm/78122

The new V2 adds N_MEMORY and a notion of "MOVABLE-dedicated node",
and fixes some related problems.

The original V1 patchset of "add online_movable" is here:
https://lkml.org/lkml/2012/7/4/145

The new V2 discards the MIGRATE_HOTREMOVE approach and uses a more
straightforward implementation (only 1 patch).


Lai Jiangshan (21):
page_alloc.c: don't subtract unrelated memmap from zone's present
pages
memory_hotplug: fix missing nodemask management
slub, hotplug: ignore unrelated node's hot-adding and hot-removing
node: cleanup node_state_attr
node_states: introduce N_MEMORY
cpuset: use N_MEMORY instead N_HIGH_MEMORY
procfs: use N_MEMORY instead N_HIGH_MEMORY
memcontrol: use N_MEMORY instead N_HIGH_MEMORY
oom: use N_MEMORY instead N_HIGH_MEMORY
mm,migrate: use N_MEMORY instead N_HIGH_MEMORY
mempolicy: use N_MEMORY instead N_HIGH_MEMORY
hugetlb: use N_MEMORY instead N_HIGH_MEMORY
vmstat: use N_MEMORY instead N_HIGH_MEMORY
kthread: use N_MEMORY instead N_HIGH_MEMORY
init: use N_MEMORY instead N_HIGH_MEMORY
vmscan: use N_MEMORY instead N_HIGH_MEMORY
page_alloc: use N_MEMORY instead N_HIGH_MEMORY change the node_states
initialization
hotplug: update nodemasks management
numa: add CONFIG_MOVABLE_NODE for movable-dedicated node
page_alloc: add kernelcore_max_addr
mm, memory-hotplug: add online_movable and online_kernel

Yasuaki Ishimatsu (4):
x86: get pg_data_t's memory from other node
x86: use memblock_set_current_limit() to set memblock.current_limit
memblock: limit memory address from memblock
memblock: compare current_limit with end variable at
memblock_find_in_range_node()

Documentation/cgroups/cpusets.txt | 2 +-
Documentation/kernel-parameters.txt | 9 ++
Documentation/memory-hotplug.txt | 24 +++-
arch/x86/kernel/setup.c | 4 +-
arch/x86/mm/init_64.c | 4 +-
arch/x86/mm/numa.c | 8 +-
drivers/base/memory.c | 19 ++-
drivers/base/node.c | 28 +++--
fs/proc/kcore.c | 2 +-
fs/proc/task_mmu.c | 4 +-
include/linux/cpuset.h | 2 +-
include/linux/memblock.h | 1 +
include/linux/memory.h | 2 +
include/linux/memory_hotplug.h | 13 ++-
include/linux/nodemask.h | 5 +
init/main.c | 2 +-
kernel/cpuset.c | 32 +++---
kernel/kthread.c | 2 +-
mm/Kconfig | 8 ++
mm/hugetlb.c | 24 ++--
mm/memblock.c | 10 +-
mm/memcontrol.c | 18 ++--
mm/memory_hotplug.c | 232 ++++++++++++++++++++++++++++++++---
mm/mempolicy.c | 12 +-
mm/migrate.c | 2 +-
mm/oom_kill.c | 2 +-
mm/page_alloc.c | 96 +++++++++------
mm/page_cgroup.c | 2 +-
mm/slub.c | 4 +-
mm/vmscan.c | 4 +-
mm/vmstat.c | 4 +-
31 files changed, 437 insertions(+), 144 deletions(-)

--
1.7.4.4

2012-08-06 09:23:34

by Lai Jiangshan

[permalink] [raw]
Subject: [RFC V3 PATCH 05/25] node_states: introduce N_MEMORY

We have N_NORMAL_MEMORY standing for the nodes that have normal memory with
zone_type <= ZONE_NORMAL.

And we have N_HIGH_MEMORY standing for the nodes that have normal or high
memory.

But we don't have any word to stand for the nodes that have *any* memory.

And we have N_CPU but without N_MEMORY.

Current code reuses N_HIGH_MEMORY for this purpose, because at present any
node which has memory must have high memory or normal memory.

A) But this reuse is bad for *readability*, because the name
N_HIGH_MEMORY stands only for high or normal memory:

A.example 1)
mem_cgroup_nr_lru_pages():
for_each_node_state(nid, N_HIGH_MEMORY)

The reader will be confused (why does this function only count high- or
normal-memory nodes? does it count ZONE_MOVABLE's lru pages?) until
someone tells them that N_HIGH_MEMORY is reused to stand for nodes that
have any memory.

A.cont) If we introduce N_MEMORY, we can reduce this confusion
AND make the code clearer:

A.example 2) mm/page_cgroup.c uses N_HIGH_MEMORY twice:

One is in page_cgroup_init(void):
for_each_node_state(nid, N_HIGH_MEMORY) {

It means that if the node has memory, we will allocate a page_cgroup map
for the node. We should use N_MEMORY here instead to gain clarity.

The second use is in alloc_page_cgroup():
if (node_state(nid, N_HIGH_MEMORY))
addr = vzalloc_node(size, nid);

It means the node has high or normal memory that the kernel can allocate
from. We should keep N_HIGH_MEMORY here, and it will be better once the
"any memory" semantic of N_HIGH_MEMORY is removed.

B) This reuse becomes outdated once we introduce the MOVABLE-dedicated
node. A MOVABLE-dedicated node should appear in neither
node_states[N_HIGH_MEMORY] nor node_states[N_NORMAL_MEMORY],
because a MOVABLE-dedicated node has no high or normal memory.

On x86_64, N_HIGH_MEMORY == N_NORMAL_MEMORY, so if a MOVABLE-dedicated
node were in node_states[N_HIGH_MEMORY], it would also be in
node_states[N_NORMAL_MEMORY], which breaks SLUB.

SLUB uses
for_each_node_state(nid, N_NORMAL_MEMORY)
and would create a kmem_cache_node for the MOVABLE-dedicated node,
causing problems.

In one word, we need N_MEMORY. We introduce it here as an alias of
N_HIGH_MEMORY and fix all improper usages of N_HIGH_MEMORY in later
patches.

Signed-off-by: Lai Jiangshan <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Acked-by: Hillf Danton <[email protected]>
---
include/linux/nodemask.h | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 7afc363..c6ebdc9 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -380,6 +380,7 @@ enum node_states {
#else
N_HIGH_MEMORY = N_NORMAL_MEMORY,
#endif
+ N_MEMORY = N_HIGH_MEMORY,
N_CPU, /* The node has one or more cpus */
NR_NODE_STATES
};
--
1.7.4.4

2012-08-06 09:23:49

by Lai Jiangshan

[permalink] [raw]
Subject: [RFC V3 PATCH 15/25] init: use N_MEMORY instead N_HIGH_MEMORY

N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle nodes which have memory, so we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan <[email protected]>
---
init/main.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/init/main.c b/init/main.c
index 4121d1f..c9317aa 100644
--- a/init/main.c
+++ b/init/main.c
@@ -846,7 +846,7 @@ static int __init kernel_init(void * unused)
/*
* init can allocate pages on any node
*/
- set_mems_allowed(node_states[N_HIGH_MEMORY]);
+ set_mems_allowed(node_states[N_MEMORY]);
/*
* init can run on any cpu.
*/
--
1.7.4.4

2012-08-06 09:23:57

by Lai Jiangshan

[permalink] [raw]
Subject: [RFC V3 PATCH 17/25] page_alloc: use N_MEMORY instead N_HIGH_MEMORY change the node_states initialization

N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle nodes which have memory, so we should
use N_MEMORY instead.

Since we introduced N_MEMORY, we also update the initialization of
node_states.

Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/mm/init_64.c | 4 +++-
mm/page_alloc.c | 40 ++++++++++++++++++++++------------------
2 files changed, 25 insertions(+), 19 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 2b6b4a3..005f00c 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -625,7 +625,9 @@ void __init paging_init(void)
* numa support is not compiled in, and later node_set_state
* will not set it back.
*/
- node_clear_state(0, N_NORMAL_MEMORY);
+ node_clear_state(0, N_MEMORY);
+ if (N_MEMORY != N_NORMAL_MEMORY)
+ node_clear_state(0, N_NORMAL_MEMORY);

zone_sizes_init();
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9312702..edffc35 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1646,7 +1646,7 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
*
* If the zonelist cache is present in the passed in zonelist, then
* returns a pointer to the allowed node mask (either the current
- * tasks mems_allowed, or node_states[N_HIGH_MEMORY].)
+ * tasks mems_allowed, or node_states[N_MEMORY].)
*
* If the zonelist cache is not available for this zonelist, does
* nothing and returns NULL.
@@ -1675,7 +1675,7 @@ static nodemask_t *zlc_setup(struct zonelist *zonelist, int alloc_flags)

allowednodes = !in_interrupt() && (alloc_flags & ALLOC_CPUSET) ?
&cpuset_current_mems_allowed :
- &node_states[N_HIGH_MEMORY];
+ &node_states[N_MEMORY];
return allowednodes;
}

@@ -3070,7 +3070,7 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask)
return node;
}

- for_each_node_state(n, N_HIGH_MEMORY) {
+ for_each_node_state(n, N_MEMORY) {

/* Don't want a node to appear more than once */
if (node_isset(n, *used_node_mask))
@@ -3212,7 +3212,7 @@ static int default_zonelist_order(void)
* local memory, NODE_ORDER may be suitable.
*/
average_size = total_size /
- (nodes_weight(node_states[N_HIGH_MEMORY]) + 1);
+ (nodes_weight(node_states[N_MEMORY]) + 1);
for_each_online_node(nid) {
low_kmem_size = 0;
total_size = 0;
@@ -4569,7 +4569,7 @@ unsigned long __init find_min_pfn_with_active_regions(void)
/*
* early_calculate_totalpages()
* Sum pages in active regions for movable zone.
- * Populate N_HIGH_MEMORY for calculating usable_nodes.
+ * Populate N_MEMORY for calculating usable_nodes.
*/
static unsigned long __init early_calculate_totalpages(void)
{
@@ -4582,7 +4582,7 @@ static unsigned long __init early_calculate_totalpages(void)

totalpages += pages;
if (pages)
- node_set_state(nid, N_HIGH_MEMORY);
+ node_set_state(nid, N_MEMORY);
}
return totalpages;
}
@@ -4599,9 +4599,9 @@ static void __init find_zone_movable_pfns_for_nodes(void)
unsigned long usable_startpfn;
unsigned long kernelcore_node, kernelcore_remaining;
/* save the state before borrow the nodemask */
- nodemask_t saved_node_state = node_states[N_HIGH_MEMORY];
+ nodemask_t saved_node_state = node_states[N_MEMORY];
unsigned long totalpages = early_calculate_totalpages();
- int usable_nodes = nodes_weight(node_states[N_HIGH_MEMORY]);
+ int usable_nodes = nodes_weight(node_states[N_MEMORY]);

/*
* If movablecore was specified, calculate what size of
@@ -4636,7 +4636,7 @@ static void __init find_zone_movable_pfns_for_nodes(void)
restart:
/* Spread kernelcore memory as evenly as possible throughout nodes */
kernelcore_node = required_kernelcore / usable_nodes;
- for_each_node_state(nid, N_HIGH_MEMORY) {
+ for_each_node_state(nid, N_MEMORY) {
unsigned long start_pfn, end_pfn;

/*
@@ -4728,23 +4728,27 @@ restart:

out:
/* restore the node_state */
- node_states[N_HIGH_MEMORY] = saved_node_state;
+ node_states[N_MEMORY] = saved_node_state;
}

-/* Any regular memory on that node ? */
-static void check_for_regular_memory(pg_data_t *pgdat)
+/* Any regular or high memory on that node ? */
+static void check_for_memory(pg_data_t *pgdat, int nid)
{
-#ifdef CONFIG_HIGHMEM
enum zone_type zone_type;

- for (zone_type = 0; zone_type <= ZONE_NORMAL; zone_type++) {
+ if (N_MEMORY == N_NORMAL_MEMORY)
+ return;
+
+ for (zone_type = 0; zone_type <= ZONE_MOVABLE - 1; zone_type++) {
struct zone *zone = &pgdat->node_zones[zone_type];
if (zone->present_pages) {
- node_set_state(zone_to_nid(zone), N_NORMAL_MEMORY);
+ node_set_state(nid, N_HIGH_MEMORY);
+ if (N_NORMAL_MEMORY != N_HIGH_MEMORY &&
+ zone_type <= ZONE_NORMAL)
+ node_set_state(nid, N_NORMAL_MEMORY);
break;
}
}
-#endif
}

/**
@@ -4827,8 +4831,8 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)

/* Any memory on that node */
if (pgdat->node_present_pages)
- node_set_state(nid, N_HIGH_MEMORY);
- check_for_regular_memory(pgdat);
+ node_set_state(nid, N_MEMORY);
+ check_for_memory(pgdat, nid);
}
}

--
1.7.4.4

2012-08-06 09:23:52

by Lai Jiangshan

[permalink] [raw]
Subject: [RFC V3 PATCH 16/25] vmscan: use N_MEMORY instead N_HIGH_MEMORY

N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle nodes which have memory, so we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan <[email protected]>
Acked-by: Hillf Danton <[email protected]>
---
mm/vmscan.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 66e4310..1888026 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2921,7 +2921,7 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
int nid;

if (action == CPU_ONLINE || action == CPU_ONLINE_FROZEN) {
- for_each_node_state(nid, N_HIGH_MEMORY) {
+ for_each_node_state(nid, N_MEMORY) {
pg_data_t *pgdat = NODE_DATA(nid);
const struct cpumask *mask;

@@ -2976,7 +2976,7 @@ static int __init kswapd_init(void)
int nid;

swap_setup();
- for_each_node_state(nid, N_HIGH_MEMORY)
+ for_each_node_state(nid, N_MEMORY)
kswapd_run(nid);
hotcpu_notifier(cpu_callback, 0);
return 0;
--
1.7.4.4

2012-08-06 09:24:00

by Lai Jiangshan

[permalink] [raw]
Subject: [RFC V3 PATCH 18/25] hotplug: update nodemasks management

Update the nodemask management for N_MEMORY: the memory-hotplug notifier
argument now reports N_NORMAL_MEMORY, N_HIGH_MEMORY and N_MEMORY
transitions separately.
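
As a rough sketch of how a subsystem is expected to consume the new
field (the callback and its registration below are made-up examples,
not part of this patch):

#include <linux/memory.h>
#include <linux/notifier.h>

static int example_memory_callback(struct notifier_block *self,
				   unsigned long action, void *arg)
{
	struct memory_notify *mn = arg;

	if (action == MEM_ONLINE && mn->status_change_nid >= 0) {
		/* the node gained its first memory of any kind:
		 * create per-node structures here */
	}

	if (action == MEM_GOING_OFFLINE && mn->status_change_nid_high >= 0) {
		/* the node is about to lose its last high/normal memory */
	}

	return NOTIFY_OK;
}

/* registered e.g. via hotplug_memory_notifier(example_memory_callback, 0); */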

Signed-off-by: Lai Jiangshan <[email protected]>
---
Documentation/memory-hotplug.txt | 5 +++-
include/linux/memory.h | 1 +
mm/memory_hotplug.c | 49 +++++++++++++++++++++++++++++++++----
3 files changed, 48 insertions(+), 7 deletions(-)

diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt
index 6e6cbc7..70bc1c7 100644
--- a/Documentation/memory-hotplug.txt
+++ b/Documentation/memory-hotplug.txt
@@ -378,6 +378,7 @@ struct memory_notify {
unsigned long start_pfn;
unsigned long nr_pages;
int status_change_nid_normal;
+ int status_change_nid_high;
int status_change_nid;
}

@@ -385,7 +386,9 @@ start_pfn is start_pfn of online/offline memory.
nr_pages is # of pages of online/offline memory.
status_change_nid_normal is set node id when N_NORMAL_MEMORY of nodemask
is (will be) set/clear, if this is -1, then nodemask status is not changed.
-status_change_nid is set node id when N_HIGH_MEMORY of nodemask is (will be)
+status_change_nid_high is set node id when N_HIGH_MEMORY of nodemask
+is (will be) set/clear, if this is -1, then nodemask status is not changed.
+status_change_nid is set node id when N_MEMORY of nodemask is (will be)
set/clear. It means a new(memoryless) node gets new memory by online and a
node loses all memory. If this is -1, then nodemask status is not changed.
If status_changed_nid* >= 0, the callback should create/discard structures
for the node if necessary.
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 6b9202b..8089e49 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -54,6 +54,7 @@ struct memory_notify {
unsigned long start_pfn;
unsigned long nr_pages;
int status_change_nid_normal;
+ int status_change_nid_high;
int status_change_nid;
};

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 3438c4a..c2c96a4 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -462,7 +462,7 @@ static void check_nodemasks_changes_online(unsigned long nr_pages,
int nid = zone_to_nid(zone);
enum zone_type zone_last = ZONE_NORMAL;

- if (N_HIGH_MEMORY == N_NORMAL_MEMORY)
+ if (N_MEMORY == N_NORMAL_MEMORY)
zone_last = ZONE_MOVABLE;

if (zone_idx(zone) <= zone_last && !node_state(nid, N_NORMAL_MEMORY))
@@ -470,7 +470,20 @@ static void check_nodemasks_changes_online(unsigned long nr_pages,
else
arg->status_change_nid_normal = -1;

- if (!node_state(nid, N_HIGH_MEMORY))
+#ifdef CONFIG_HIGHMEM
+ zone_last = ZONE_HIGHMEM;
+ if (N_MEMORY == N_HIGH_MEMORY)
+ zone_last = ZONE_MOVABLE;
+
+ if (zone_idx(zone) <= zone_last && !node_state(nid, N_HIGH_MEMORY))
+ arg->status_change_nid_high = nid;
+ else
+ arg->status_change_nid_high = -1;
+#else
+ arg->status_change_nid_high = arg->status_change_nid_normal;
+#endif
+
+ if (!node_state(nid, N_MEMORY))
arg->status_change_nid = nid;
else
arg->status_change_nid = -1;
@@ -481,7 +494,10 @@ static void set_nodemasks(int node, struct memory_notify *arg)
if (arg->status_change_nid_normal >= 0)
node_set_state(node, N_NORMAL_MEMORY);

- node_set_state(node, N_HIGH_MEMORY);
+ if (arg->status_change_nid_high >= 0)
+ node_set_state(node, N_HIGH_MEMORY);
+
+ node_set_state(node, N_MEMORY);
}


@@ -899,7 +915,7 @@ static void check_nodemasks_changes_offline(unsigned long nr_pages,
unsigned long present_pages = 0;
enum zone_type zt, zone_last = ZONE_NORMAL;

- if (N_HIGH_MEMORY == N_NORMAL_MEMORY)
+ if (N_MEMORY == N_NORMAL_MEMORY)
zone_last = ZONE_MOVABLE;

for (zt = 0; zt <= zone_last; zt++)
@@ -909,6 +925,21 @@ static void check_nodemasks_changes_offline(unsigned long nr_pages,
else
arg->status_change_nid_normal = -1;

+#ifdef CONFIG_HIGHMEM
+ zone_last = ZONE_HIGHMEM;
+ if (N_MEMORY == N_HIGH_MEMORY)
+ zone_last = ZONE_MOVABLE;
+
+ for (; zt <= zone_last; zt++)
+ present_pages += pgdat->node_zones[zt].present_pages;
+ if (zone_idx(zone) <= zone_last && nr_pages >= present_pages)
+ arg->status_change_nid_high = zone_to_nid(zone);
+ else
+ arg->status_change_nid_high = -1;
+#else
+ arg->status_change_nid_high = arg->status_change_nid_normal;
+#endif
+
zone_last = ZONE_MOVABLE;
for (; zt <= zone_last; zt++)
present_pages += pgdat->node_zones[zt].present_pages;
@@ -923,11 +954,17 @@ static void clear_nodemasks(int node, struct memory_notify *arg)
if (arg->status_change_nid_normal >= 0)
node_clear_state(node, N_NORMAL_MEMORY);

- if (N_HIGH_MEMORY == N_NORMAL_MEMORY)
+ if (N_MEMORY == N_NORMAL_MEMORY)
return;

- if (arg->status_change_nid >= 0)
+ if (arg->status_change_nid_high >= 0)
node_clear_state(node, N_HIGH_MEMORY);
+
+ if (N_MEMORY == N_HIGH_MEMORY)
+ return;
+
+ if (arg->status_change_nid >= 0)
+ node_clear_state(node, N_MEMORY);
}

static int __ref offline_pages(unsigned long start_pfn,
--
1.7.4.4

2012-08-06 09:24:24

by Lai Jiangshan

[permalink] [raw]
Subject: [RFC V3 PATCH 23/25] memblock: limit memory address from memblock

From: Yasuaki Ishimatsu <[email protected]>

Setting kernelcore_max_addr means that all memory above the given boot
parameter is allocated as ZONE_MOVABLE. So memory allocated by memblock
should also be limited by the parameter.

This patch limits the addresses used by memblock accordingly.
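
For example (illustrative numbers): booting with kernelcore_max_addr=4G
sets memblock_limit to 0x100000000, so a later
memblock_set_current_limit(get_max_mapped()) call with an 8G mapping is
clamped and memblock.current_limit stays at 4G, keeping boot-time
memblock allocations out of the would-be movable range.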

Signed-off-by: Yasuaki Ishimatsu <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---
include/linux/memblock.h | 1 +
mm/memblock.c | 5 ++++-
mm/page_alloc.c | 6 +++++-
3 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 19dc455..f2977ae 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -42,6 +42,7 @@ struct memblock {

extern struct memblock memblock;
extern int memblock_debug;
+extern phys_addr_t memblock_limit;

#define memblock_dbg(fmt, ...) \
if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
diff --git a/mm/memblock.c b/mm/memblock.c
index 5cc6731..663b805 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -931,7 +931,10 @@ int __init_memblock memblock_is_region_reserved(phys_addr_t base, phys_addr_t si

void __init_memblock memblock_set_current_limit(phys_addr_t limit)
{
- memblock.current_limit = limit;
+ if (!memblock_limit || (memblock_limit > limit))
+ memblock.current_limit = limit;
+ else
+ memblock.current_limit = memblock_limit;
}

static void __init_memblock memblock_dump(struct memblock_type *type, char *name)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 65ac5c9..c4d3aa0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -209,6 +209,8 @@ static unsigned long __initdata required_kernelcore;
static unsigned long __initdata required_movablecore;
static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];

+phys_addr_t memblock_limit;
+
/* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
int movable_zone;
EXPORT_SYMBOL(movable_zone);
@@ -4876,7 +4878,9 @@ static int __init cmdline_parse_core(char *p, unsigned long *core)
*/
static int __init cmdline_parse_kernelcore_max_addr(char *p)
{
- return cmdline_parse_core(p, &required_kernelcore_max_pfn);
+ cmdline_parse_core(p, &required_kernelcore_max_pfn);
+ memblock_limit = required_kernelcore_max_pfn << PAGE_SHIFT;
+ return 0;
}
early_param("kernelcore_max_addr", cmdline_parse_kernelcore_max_addr);
#endif
--
1.7.4.4

2012-08-06 09:24:14

by Lai Jiangshan

[permalink] [raw]
Subject: [RFC V3 PATCH 22/25] x86: use memblock_set_current_limit() to set memblock.current_limit

From: Yasuaki Ishimatsu <[email protected]>

memblock.current_limit is set directly even though
memblock_set_current_limit() is provided for this purpose. So fix it.

Signed-off-by: Yasuaki Ishimatsu <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/kernel/setup.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f4b9b80..bb9d9f8 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -889,7 +889,7 @@ void __init setup_arch(char **cmdline_p)

cleanup_highmap();

- memblock.current_limit = get_max_mapped();
+ memblock_set_current_limit(get_max_mapped());
memblock_x86_fill();

/*
@@ -925,7 +925,7 @@ void __init setup_arch(char **cmdline_p)
max_low_pfn = max_pfn;
}
#endif
- memblock.current_limit = get_max_mapped();
+ memblock_set_current_limit(get_max_mapped());
dma_contiguous_reserve(0);

/*
--
1.7.4.4

2012-08-06 09:24:22

by Lai Jiangshan

[permalink] [raw]
Subject: [RFC V3 PATCH 24/25] memblock: compare current_limit with end variable at memblock_find_in_range_node()

From: Yasuaki Ishimatsu <[email protected]>

memblock_find_in_range_node() does not compare memblock.current_limit
with the end variable. Thus even if memblock.current_limit is smaller
than the end variable, the function may return a memory address that is
bigger than memblock.current_limit.

This patch adds the check to memblock_find_in_range_node().
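
For example (illustrative numbers): with memblock.current_limit = 4G,
both a request with end == MEMBLOCK_ALLOC_ACCESSIBLE and a request with
end = 8G are now clamped to 4G before the candidate range is searched.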

Signed-off-by: Yasuaki Ishimatsu <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---
mm/memblock.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/memblock.c b/mm/memblock.c
index 663b805..ce7fcb6 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -99,11 +99,12 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
phys_addr_t align, int nid)
{
phys_addr_t this_start, this_end, cand;
+ phys_addr_t current_limit = memblock.current_limit;
u64 i;

/* pump up @end */
- if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
- end = memblock.current_limit;
+ if ((end == MEMBLOCK_ALLOC_ACCESSIBLE) || (end > current_limit))
+ end = current_limit;

/* avoid allocating the first page */
start = max_t(phys_addr_t, start, PAGE_SIZE);
--
1.7.4.4

2012-08-06 09:24:31

by Lai Jiangshan

[permalink] [raw]
Subject: [RFC V3 PATCH 25/25] mm, memory-hotplug: add online_movable and online_kernel

When a memory block/section is onlined by "online_movable", the kernel
will have no direct references to the pages of that memory block, so we
can remove the memory at any time when needed.

It makes dynamic memory hot-add/remove easy, makes better use of memory,
and helps THP.

Current constraint: only a memory block which is adjacent to ZONE_MOVABLE
can be onlined from ZONE_NORMAL to ZONE_MOVABLE.

For the opposite onlining behavior, we also introduce "online_kernel" to
move a memory block from ZONE_MOVABLE back to ZONE_NORMAL when onlining.
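
For example, assuming memory block memory8 currently sits in ZONE_NORMAL
right at the boundary to ZONE_MOVABLE (the block number is made up):

% echo online_movable > /sys/devices/system/memory/memory8/state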

Signed-off-by: Lai Jiangshan <[email protected]>
---
Documentation/memory-hotplug.txt | 14 +++++-
drivers/base/memory.c | 19 +++++---
include/linux/memory_hotplug.h | 13 +++++-
mm/memory_hotplug.c | 101 +++++++++++++++++++++++++++++++++++++-
4 files changed, 137 insertions(+), 10 deletions(-)

diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt
index 70bc1c7..8e5eacb 100644
--- a/Documentation/memory-hotplug.txt
+++ b/Documentation/memory-hotplug.txt
@@ -161,7 +161,8 @@ a recent addition and not present on older kernels.
in the memory block.
'state' : read-write
at read: contains online/offline state of memory.
- at write: user can specify "online", "offline" command
+ at write: user can specify "online_kernel",
+ "online_movable", "online", "offline" command
which will be performed on all sections in the block.
'phys_device' : read-only: designed to show the name of physical memory
device. This is not well implemented now.
@@ -255,6 +256,17 @@ For onlining, you have to write "online" to the section's state file as:

% echo online > /sys/devices/system/memory/memoryXXX/state

+This onlining will not change the ZONE type of the target memory section.
+If the memory section is in ZONE_NORMAL, you can change it to ZONE_MOVABLE:
+
+% echo online_movable > /sys/devices/system/memory/memoryXXX/state
+(NOTE: current limit: this memory section must be adjacent to ZONE_MOVABLE)
+
+And if the memory section is in ZONE_MOVABLE, you can change it to ZONE_NORMAL:
+
+% echo online_kernel > /sys/devices/system/memory/memoryXXX/state
+(NOTE: current limit: this memory section must be adjacent to ZONE_NORMAL)
+
After this, section memoryXXX's state will be 'online' and the amount of
available memory will be increased.

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 7dda4f7..1ad2f48 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -246,7 +246,7 @@ static bool pages_correctly_reserved(unsigned long start_pfn,
* OK to have direct references to sparsemem variables in here.
*/
static int
-memory_block_action(unsigned long phys_index, unsigned long action)
+memory_block_action(unsigned long phys_index, unsigned long action, int online_type)
{
unsigned long start_pfn, start_paddr;
unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
@@ -262,7 +262,7 @@ memory_block_action(unsigned long phys_index, unsigned long action)
if (!pages_correctly_reserved(start_pfn, nr_pages))
return -EBUSY;

- ret = online_pages(start_pfn, nr_pages);
+ ret = online_pages(start_pfn, nr_pages, online_type);
break;
case MEM_OFFLINE:
start_paddr = page_to_pfn(first_page) << PAGE_SHIFT;
@@ -279,7 +279,8 @@ memory_block_action(unsigned long phys_index, unsigned long action)
}

static int memory_block_change_state(struct memory_block *mem,
- unsigned long to_state, unsigned long from_state_req)
+ unsigned long to_state, unsigned long from_state_req,
+ int online_type)
{
int ret = 0;

@@ -293,7 +294,7 @@ static int memory_block_change_state(struct memory_block *mem,
if (to_state == MEM_OFFLINE)
mem->state = MEM_GOING_OFFLINE;

- ret = memory_block_action(mem->start_section_nr, to_state);
+ ret = memory_block_action(mem->start_section_nr, to_state, online_type);

if (ret) {
mem->state = from_state_req;
@@ -325,10 +326,14 @@ store_mem_state(struct device *dev,

mem = container_of(dev, struct memory_block, dev);

- if (!strncmp(buf, "online", min((int)count, 6)))
- ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE);
+ if (!strncmp(buf, "online_kernel", min((int)count, 13)))
+ ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE, ONLINE_KERNEL);
+ else if (!strncmp(buf, "online_movable", min((int)count, 14)))
+ ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE, ONLINE_MOVABLE);
+ else if (!strncmp(buf, "online", min((int)count, 6)))
+ ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE, ONLINE_KEEP);
else if(!strncmp(buf, "offline", min((int)count, 7)))
- ret = memory_block_change_state(mem, MEM_OFFLINE, MEM_ONLINE);
+ ret = memory_block_change_state(mem, MEM_OFFLINE, MEM_ONLINE, -1);

if (ret)
return ret;
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 910550f..047cd1d 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -25,6 +25,13 @@ enum {
MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE = NODE_INFO,
};

+/* Types for control the zone type of onlined memory */
+enum {
+ ONLINE_KEEP,
+ ONLINE_KERNEL,
+ ONLINE_MOVABLE,
+};
+
/*
* pgdat resizing functions
*/
@@ -45,6 +52,10 @@ void pgdat_resize_init(struct pglist_data *pgdat)
}
/*
* Zone resizing functions
+ *
+ * Note: any attempt to resize a zone should have pgdat_resize_lock() and
+ * zone_span_writelock() both held. This ensures the size of a zone
+ * can't be changed while pgdat_resize_lock() is held.
*/
static inline unsigned zone_span_seqbegin(struct zone *zone)
{
@@ -70,7 +81,7 @@ extern int zone_grow_free_lists(struct zone *zone, unsigned long new_nr_pages);
extern int zone_grow_waitqueues(struct zone *zone, unsigned long nr_pages);
extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
/* VM interface that may be used by firmware interface */
-extern int online_pages(unsigned long, unsigned long);
+extern int online_pages(unsigned long, unsigned long, int);
extern void __offline_isolated_pages(unsigned long, unsigned long);

typedef void (*online_page_callback_t)(struct page *page);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c2c96a4..4e1db0a 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -210,6 +210,89 @@ static void grow_zone_span(struct zone *zone, unsigned long start_pfn,
zone_span_writeunlock(zone);
}

+static void resize_zone(struct zone *zone, unsigned long start_pfn,
+ unsigned long end_pfn)
+{
+
+ zone_span_writelock(zone);
+
+ zone->zone_start_pfn = start_pfn;
+ zone->spanned_pages = end_pfn - start_pfn;
+
+ zone_span_writeunlock(zone);
+}
+
+static void fix_zone_id(struct zone *zone, unsigned long start_pfn,
+ unsigned long end_pfn)
+{
+ enum zone_type zid = zone_idx(zone);
+ int nid = zone->zone_pgdat->node_id;
+ unsigned long pfn;
+
+ for (pfn = start_pfn; pfn < end_pfn; pfn++)
+ set_page_links(pfn_to_page(pfn), zid, nid, pfn);
+}
+
+static int move_pfn_range_left(struct zone *z1, struct zone *z2,
+ unsigned long start_pfn, unsigned long end_pfn)
+{
+ unsigned long flags;
+
+ pgdat_resize_lock(z1->zone_pgdat, &flags);
+
+ /* can't move pfns which are higher than @z2 */
+ if (end_pfn > z2->zone_start_pfn + z2->spanned_pages)
+ goto out_fail;
+ /* the moved-out part must be at the leftmost of @z2 */
+ if (start_pfn > z2->zone_start_pfn)
+ goto out_fail;
+ /* the range must overlap @z2 */
+ if (end_pfn <= z2->zone_start_pfn)
+ goto out_fail;
+
+ resize_zone(z1, z1->zone_start_pfn, end_pfn);
+ resize_zone(z2, end_pfn, z2->zone_start_pfn + z2->spanned_pages);
+
+ pgdat_resize_unlock(z1->zone_pgdat, &flags);
+
+ fix_zone_id(z1, start_pfn, end_pfn);
+
+ return 0;
+out_fail:
+ pgdat_resize_unlock(z1->zone_pgdat, &flags);
+ return -1;
+}
+
+static int move_pfn_range_right(struct zone *z1, struct zone *z2,
+ unsigned long start_pfn, unsigned long end_pfn)
+{
+ unsigned long flags;
+
+ pgdat_resize_lock(z1->zone_pgdat, &flags);
+
+ /* can't move pfns which are lower than @z1 */
+ if (z1->zone_start_pfn > start_pfn)
+ goto out_fail;
+ /* the moved-out part must be at the rightmost of @z1 */
+ if (z1->zone_start_pfn + z1->spanned_pages > end_pfn)
+ goto out_fail;
+ /* the range must overlap @z1 */
+ if (start_pfn >= z1->zone_start_pfn + z1->spanned_pages)
+ goto out_fail;
+
+ resize_zone(z1, z1->zone_start_pfn, start_pfn);
+ resize_zone(z2, start_pfn, z2->zone_start_pfn + z2->spanned_pages);
+
+ pgdat_resize_unlock(z1->zone_pgdat, &flags);
+
+ fix_zone_id(z2, start_pfn, end_pfn);
+
+ return 0;
+out_fail:
+ pgdat_resize_unlock(z1->zone_pgdat, &flags);
+ return -1;
+}
+
static void grow_pgdat_span(struct pglist_data *pgdat, unsigned long start_pfn,
unsigned long end_pfn)
{
@@ -501,7 +584,7 @@ static void set_nodemasks(int node, struct memory_notify *arg)
}


-int __ref online_pages(unsigned long pfn, unsigned long nr_pages)
+int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_type)
{
unsigned long onlined_pages = 0;
struct zone *zone;
@@ -518,6 +601,22 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages)
*/
zone = page_zone(pfn_to_page(pfn));

+ if (online_type == ONLINE_KERNEL && zone_idx(zone) == ZONE_MOVABLE) {
+ if (move_pfn_range_left(zone - 1, zone, pfn, pfn + nr_pages)) {
+ unlock_memory_hotplug();
+ return -1;
+ }
+ }
+ if (online_type == ONLINE_MOVABLE && zone_idx(zone) == ZONE_MOVABLE - 1) {
+ if (move_pfn_range_right(zone, zone + 1, pfn, pfn + nr_pages)) {
+ unlock_memory_hotplug();
+ return -1;
+ }
+ }
+
+ /* The previous code may have changed the zone of the pfn range */
+ zone = page_zone(pfn_to_page(pfn));
+
arg.start_pfn = pfn;
arg.nr_pages = nr_pages;
check_nodemasks_changes_online(nr_pages, zone, &arg);
--
1.7.4.4

2012-08-06 09:24:12

by Lai Jiangshan

[permalink] [raw]
Subject: [RFC V3 PATCH 19/25] numa: add CONFIG_MOVABLE_NODE for movable-dedicated node

All the preparations are done, so we can actually introduce N_MEMORY.
Add CONFIG_MOVABLE_NODE so that we can use it for the movable-dedicated node.

Signed-off-by: Lai Jiangshan <[email protected]>
---
drivers/base/node.c | 6 ++++++
include/linux/nodemask.h | 4 ++++
mm/Kconfig | 8 ++++++++
mm/page_alloc.c | 3 +++
4 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 4c3aa7c..653b5e2 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -620,6 +620,9 @@ static struct node_attr node_state_attr[] = {
#ifdef CONFIG_HIGHMEM
[N_HIGH_MEMORY] = _NODE_ATTR(has_high_memory, N_HIGH_MEMORY),
#endif
+#ifdef CONFIG_MOVABLE_NODE
+ [N_MEMORY] = _NODE_ATTR(has_memory, N_MEMORY),
+#endif
[N_CPU] = _NODE_ATTR(has_cpu, N_CPU),
};

@@ -630,6 +633,9 @@ static struct attribute *node_state_attrs[] = {
#ifdef CONFIG_HIGHMEM
&node_state_attr[N_HIGH_MEMORY].attr.attr,
#endif
+#ifdef CONFIG_MOVABLE_NODE
+ &node_state_attr[N_MEMORY].attr.attr,
+#endif
&node_state_attr[N_CPU].attr.attr,
NULL
};
diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index c6ebdc9..4e2cbfa 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -380,7 +380,11 @@ enum node_states {
#else
N_HIGH_MEMORY = N_NORMAL_MEMORY,
#endif
+#ifdef CONFIG_MOVABLE_NODE
+ N_MEMORY, /* The node has memory(regular, high, movable) */
+#else
N_MEMORY = N_HIGH_MEMORY,
+#endif
N_CPU, /* The node has one or more cpus */
NR_NODE_STATES
};
diff --git a/mm/Kconfig b/mm/Kconfig
index 82fed4e..4371c65 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -140,6 +140,14 @@ config ARCH_DISCARD_MEMBLOCK
config NO_BOOTMEM
boolean

+config MOVABLE_NODE
+ boolean "Enable to assign a node has only movable memory"
+ depends on HAVE_MEMBLOCK
+ depends on NO_BOOTMEM
+ depends on X86_64
+ depends on NUMA
+ default y
+
# eventually, we can have this option just 'select SPARSEMEM'
config MEMORY_HOTPLUG
bool "Allow for memory hot-add"
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index edffc35..03ad63d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -91,6 +91,9 @@ nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
#ifdef CONFIG_HIGHMEM
[N_HIGH_MEMORY] = { { [0] = 1UL } },
#endif
+#ifdef CONFIG_MOVABLE_NODE
+ [N_MEMORY] = { { [0] = 1UL } },
+#endif
[N_CPU] = { { [0] = 1UL } },
#endif /* NUMA */
};
--
1.7.4.4

2012-08-06 09:24:10

by Lai Jiangshan

[permalink] [raw]
Subject: [RFC V3 PATCH 20/25] page_alloc: add kernelcore_max_addr

The current ZONE_MOVABLE (kernelcore=) boot-option policy doesn't meet our
requirement. We need something like a kernelcore_max_addr=XX boot option
to limit the upper address of kernelcore.

Memory at higher addresses will be migratable (movable), and is easier to
offline (always ready to be offlined when the system doesn't require so
much memory).

It makes dynamic memory hot-add/remove easy, makes better use of memory,
and helps THP.

kernelcore_max_addr=, kernelcore= and movablecore= can all be safely
specified at the same time (or any 2 of them).
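
For example (illustrative values only), to keep all kernel (non-movable)
allocations below 4G and let everything above it become ZONE_MOVABLE:

	kernelcore_max_addr=4G

or combined with kernelcore=, in which case kernelcore_max_addr takes
priority per the documentation change below:

	kernelcore_max_addr=4G kernelcore=2G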

Signed-off-by: Lai Jiangshan <[email protected]>
---
Documentation/kernel-parameters.txt | 9 +++++++++
mm/page_alloc.c | 29 ++++++++++++++++++++++++++++-
2 files changed, 37 insertions(+), 1 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 12783fa..48dff61 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1216,6 +1216,15 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
use the HighMem zone if it exists, and the Normal
zone if it does not.

+ kernelcore_max_addr=nn[KMG] [KNL,X86,IA-64,PPC] This parameter
+ has the same effect as the kernelcore parameter, except it
+ specifies the upper physical address of the memory range
+ usable by the kernel for non-movable allocations.
+ If both kernelcore and kernelcore_max_addr are
+ specified, this parameter takes priority over
+ kernelcore.
+ See the kernelcore parameter.
+
kgdbdbgp= [KGDB,HW] kgdb over EHCI usb debug port.
Format: <Controller#>[,poll interval]
The controller # is the number of the ehci usb debug
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 03ad63d..65ac5c9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -204,6 +204,7 @@ static unsigned long __meminitdata dma_reserve;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
static unsigned long __meminitdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
+static unsigned long __initdata required_kernelcore_max_pfn;
static unsigned long __initdata required_kernelcore;
static unsigned long __initdata required_movablecore;
static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
@@ -4600,6 +4601,7 @@ static void __init find_zone_movable_pfns_for_nodes(void)
{
int i, nid;
unsigned long usable_startpfn;
+ unsigned long kernelcore_max_pfn;
unsigned long kernelcore_node, kernelcore_remaining;
/* save the state before borrow the nodemask */
nodemask_t saved_node_state = node_states[N_MEMORY];
@@ -4628,6 +4630,9 @@ static void __init find_zone_movable_pfns_for_nodes(void)
required_kernelcore = max(required_kernelcore, corepages);
}

+ if (required_kernelcore_max_pfn && !required_kernelcore)
+ required_kernelcore = totalpages;
+
/* If kernelcore was not specified, there is no ZONE_MOVABLE */
if (!required_kernelcore)
goto out;
@@ -4636,6 +4641,12 @@ static void __init find_zone_movable_pfns_for_nodes(void)
find_usable_zone_for_movable();
usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];

+ if (required_kernelcore_max_pfn)
+ kernelcore_max_pfn = required_kernelcore_max_pfn;
+ else
+ kernelcore_max_pfn = ULONG_MAX >> PAGE_SHIFT;
+ kernelcore_max_pfn = max(kernelcore_max_pfn, usable_startpfn);
+
restart:
/* Spread kernelcore memory as evenly as possible throughout nodes */
kernelcore_node = required_kernelcore / usable_nodes;
@@ -4662,8 +4673,12 @@ restart:
unsigned long size_pages;

start_pfn = max(start_pfn, zone_movable_pfn[nid]);
- if (start_pfn >= end_pfn)
+ end_pfn = min(kernelcore_max_pfn, end_pfn);
+ if (start_pfn >= end_pfn) {
+ if (!zone_movable_pfn[nid])
+ zone_movable_pfn[nid] = start_pfn;
continue;
+ }

/* Account for what is only usable for kernelcore */
if (start_pfn < usable_startpfn) {
@@ -4854,6 +4869,18 @@ static int __init cmdline_parse_core(char *p, unsigned long *core)
return 0;
}

+#ifdef CONFIG_MOVABLE_NODE
+/*
+ * kernelcore_max_addr=addr sets the upper physical address of the memory
+ * range used for allocations that cannot be reclaimed or migrated.
+ */
+static int __init cmdline_parse_kernelcore_max_addr(char *p)
+{
+ return cmdline_parse_core(p, &required_kernelcore_max_pfn);
+}
+early_param("kernelcore_max_addr", cmdline_parse_kernelcore_max_addr);
+#endif
+
/*
* kernelcore=size sets the amount of memory for use for allocations that
* cannot be reclaimed or migrated.
--
1.7.4.4

2012-08-06 09:25:46

by Lai Jiangshan

[permalink] [raw]
Subject: [RFC V3 PATCH 21/25] x86: get pg_data_t's memory from other node

From: Yasuaki Ishimatsu <[email protected]>

If the system creates a movable node, in which all of the node's memory
is allocated as ZONE_MOVABLE, setup_node_data() cannot allocate the
node's pg_data_t from that node.
So when memblock_alloc_nid() fails, setup_node_data() retries with
memblock_alloc().

Signed-off-by: Yasuaki Ishimatsu <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/mm/numa.c | 14 ++++++++++----
1 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 2d125be..a86e315 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -223,9 +223,15 @@ static void __init setup_node_data(int nid, u64 start, u64 end)
remapped = true;
} else {
nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
- if (!nd_pa) {
- pr_err("Cannot find %zu bytes in node %d\n",
- nd_size, nid);
- return;
+ if (!nd_pa) {
+ printk(KERN_WARNING "Cannot find %zu bytes in node %d\n",
+ nd_size, nid);
+ /* fall back to any node only when the nid-local allocation fails */
+ nd_pa = memblock_alloc(nd_size, SMP_CACHE_BYTES);
+ if (!nd_pa) {
+ pr_err("Cannot find %zu bytes in other node\n",
+ nd_size);
+ return;
+ }
}
nd = __va(nd_pa);
--
1.7.4.4

2012-08-06 09:23:48

by Lai Jiangshan

[permalink] [raw]
Subject: [RFC V3 PATCH 13/25] vmstat: use N_MEMORY instead N_HIGH_MEMORY

N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle nodes which have memory, so we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
---
mm/vmstat.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1bbbbd9..aa3da12 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -917,7 +917,7 @@ static int pagetypeinfo_show(struct seq_file *m, void *arg)
pg_data_t *pgdat = (pg_data_t *)arg;

/* check memoryless node */
- if (!node_state(pgdat->node_id, N_HIGH_MEMORY))
+ if (!node_state(pgdat->node_id, N_MEMORY))
return 0;

seq_printf(m, "Page block order: %d\n", pageblock_order);
@@ -1279,7 +1279,7 @@ static int unusable_show(struct seq_file *m, void *arg)
pg_data_t *pgdat = (pg_data_t *)arg;

/* check memoryless node */
- if (!node_state(pgdat->node_id, N_HIGH_MEMORY))
+ if (!node_state(pgdat->node_id, N_MEMORY))
return 0;

walk_zones_in_node(m, pgdat, unusable_show_print);
--
1.7.4.4

2012-08-06 09:23:46

by Lai Jiangshan

[permalink] [raw]
Subject: [RFC V3 PATCH 08/25] memcontrol: use N_MEMORY instead N_HIGH_MEMORY

N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle nodes which have memory, so we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan <[email protected]>
---
mm/memcontrol.c | 18 +++++++++---------
mm/page_cgroup.c | 2 +-
2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f72b5e5..4402c2e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -797,7 +797,7 @@ static unsigned long mem_cgroup_nr_lru_pages(struct mem_cgroup *memcg,
int nid;
u64 total = 0;

- for_each_node_state(nid, N_HIGH_MEMORY)
+ for_each_node_state(nid, N_MEMORY)
total += mem_cgroup_node_nr_lru_pages(memcg, nid, lru_mask);
return total;
}
@@ -1549,9 +1549,9 @@ static void mem_cgroup_may_update_nodemask(struct mem_cgroup *memcg)
return;

/* make a nodemask where this memcg uses memory from */
- memcg->scan_nodes = node_states[N_HIGH_MEMORY];
+ memcg->scan_nodes = node_states[N_MEMORY];

- for_each_node_mask(nid, node_states[N_HIGH_MEMORY]) {
+ for_each_node_mask(nid, node_states[N_MEMORY]) {

if (!test_mem_cgroup_node_reclaimable(memcg, nid, false))
node_clear(nid, memcg->scan_nodes);
@@ -1622,7 +1622,7 @@ static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap)
/*
* Check rest of nodes.
*/
- for_each_node_state(nid, N_HIGH_MEMORY) {
+ for_each_node_state(nid, N_MEMORY) {
if (node_isset(nid, memcg->scan_nodes))
continue;
if (test_mem_cgroup_node_reclaimable(memcg, nid, noswap))
@@ -3700,7 +3700,7 @@ move_account:
drain_all_stock_sync(memcg);
ret = 0;
mem_cgroup_start_move(memcg);
- for_each_node_state(node, N_HIGH_MEMORY) {
+ for_each_node_state(node, N_MEMORY) {
for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
enum lru_list lru;
for_each_lru(lru) {
@@ -4025,7 +4025,7 @@ static int mem_control_numa_stat_show(struct cgroup *cont, struct cftype *cft,

total_nr = mem_cgroup_nr_lru_pages(memcg, LRU_ALL);
seq_printf(m, "total=%lu", total_nr);
- for_each_node_state(nid, N_HIGH_MEMORY) {
+ for_each_node_state(nid, N_MEMORY) {
node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL);
seq_printf(m, " N%d=%lu", nid, node_nr);
}
@@ -4033,7 +4033,7 @@ static int mem_control_numa_stat_show(struct cgroup *cont, struct cftype *cft,

file_nr = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_FILE);
seq_printf(m, "file=%lu", file_nr);
- for_each_node_state(nid, N_HIGH_MEMORY) {
+ for_each_node_state(nid, N_MEMORY) {
node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid,
LRU_ALL_FILE);
seq_printf(m, " N%d=%lu", nid, node_nr);
@@ -4042,7 +4042,7 @@ static int mem_control_numa_stat_show(struct cgroup *cont, struct cftype *cft,

anon_nr = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_ANON);
seq_printf(m, "anon=%lu", anon_nr);
- for_each_node_state(nid, N_HIGH_MEMORY) {
+ for_each_node_state(nid, N_MEMORY) {
node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid,
LRU_ALL_ANON);
seq_printf(m, " N%d=%lu", nid, node_nr);
@@ -4051,7 +4051,7 @@ static int mem_control_numa_stat_show(struct cgroup *cont, struct cftype *cft,

unevictable_nr = mem_cgroup_nr_lru_pages(memcg, BIT(LRU_UNEVICTABLE));
seq_printf(m, "unevictable=%lu", unevictable_nr);
- for_each_node_state(nid, N_HIGH_MEMORY) {
+ for_each_node_state(nid, N_MEMORY) {
node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid,
BIT(LRU_UNEVICTABLE));
seq_printf(m, " N%d=%lu", nid, node_nr);
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index eb750f8..e775239 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -271,7 +271,7 @@ void __init page_cgroup_init(void)
if (mem_cgroup_disabled())
return;

- for_each_node_state(nid, N_HIGH_MEMORY) {
+ for_each_node_state(nid, N_MEMORY) {
unsigned long start_pfn, end_pfn;

start_pfn = node_start_pfn(nid);
--
1.7.4.4

2012-08-06 09:27:11

by Lai Jiangshan

[permalink] [raw]
Subject: [RFC V3 PATCH 14/25] kthread: use N_MEMORY instead N_HIGH_MEMORY

N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/kthread.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 3d3de63..4139962 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -280,7 +280,7 @@ int kthreadd(void *unused)
set_task_comm(tsk, "kthreadd");
ignore_signals(tsk);
set_cpus_allowed_ptr(tsk, cpu_all_mask);
- set_mems_allowed(node_states[N_HIGH_MEMORY]);
+ set_mems_allowed(node_states[N_MEMORY]);

current->flags |= PF_NOFREEZE;

--
1.7.4.4

2012-08-06 09:23:45

by Lai Jiangshan

[permalink] [raw]
Subject: [RFC V3 PATCH 09/25] oom: use N_MEMORY instead N_HIGH_MEMORY

N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan <[email protected]>
Acked-by: Hillf Danton <[email protected]>
---
mm/oom_kill.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ac300c9..1e58f12 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -257,7 +257,7 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
* the page allocator means a mempolicy is in effect. Cpuset policy
* is enforced in get_page_from_freelist().
*/
- if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask)) {
+ if (nodemask && !nodes_subset(node_states[N_MEMORY], *nodemask)) {
*totalpages = total_swap_pages;
for_each_node_mask(nid, *nodemask)
*totalpages += node_spanned_pages(nid);
--
1.7.4.4

2012-08-06 09:23:44

by Lai Jiangshan

[permalink] [raw]
Subject: [RFC V3 PATCH 10/25] mm,migrate: use N_MEMORY instead N_HIGH_MEMORY

N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
---
mm/migrate.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index be26d5c..dbe4f86 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1226,7 +1226,7 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
if (node < 0 || node >= MAX_NUMNODES)
goto out_pm;

- if (!node_state(node, N_HIGH_MEMORY))
+ if (!node_state(node, N_MEMORY))
goto out_pm;

err = -EACCES;
--
1.7.4.4

2012-08-06 09:27:47

by Lai Jiangshan

[permalink] [raw]
Subject: [RFC V3 PATCH 12/25] hugetlb: use N_MEMORY instead N_HIGH_MEMORY

N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan <[email protected]>
Acked-by: Hillf Danton <[email protected]>
---
drivers/base/node.c | 2 +-
mm/hugetlb.c | 24 ++++++++++++------------
2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 5d7731e..4c3aa7c 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -227,7 +227,7 @@ static node_registration_func_t __hugetlb_unregister_node;
static inline bool hugetlb_register_node(struct node *node)
{
if (__hugetlb_register_node &&
- node_state(node->dev.id, N_HIGH_MEMORY)) {
+ node_state(node->dev.id, N_MEMORY)) {
__hugetlb_register_node(node);
return true;
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e198831..661db47 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1046,7 +1046,7 @@ static void return_unused_surplus_pages(struct hstate *h,
* on-line nodes with memory and will handle the hstate accounting.
*/
while (nr_pages--) {
- if (!free_pool_huge_page(h, &node_states[N_HIGH_MEMORY], 1))
+ if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
break;
}
}
@@ -1150,14 +1150,14 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
int __weak alloc_bootmem_huge_page(struct hstate *h)
{
struct huge_bootmem_page *m;
- int nr_nodes = nodes_weight(node_states[N_HIGH_MEMORY]);
+ int nr_nodes = nodes_weight(node_states[N_MEMORY]);

while (nr_nodes) {
void *addr;

addr = __alloc_bootmem_node_nopanic(
NODE_DATA(hstate_next_node_to_alloc(h,
- &node_states[N_HIGH_MEMORY])),
+ &node_states[N_MEMORY])),
huge_page_size(h), huge_page_size(h), 0);

if (addr) {
@@ -1229,7 +1229,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
if (!alloc_bootmem_huge_page(h))
break;
} else if (!alloc_fresh_huge_page(h,
- &node_states[N_HIGH_MEMORY]))
+ &node_states[N_MEMORY]))
break;
}
h->max_huge_pages = i;
@@ -1497,7 +1497,7 @@ static ssize_t nr_hugepages_store_common(bool obey_mempolicy,
if (!(obey_mempolicy &&
init_nodemask_of_mempolicy(nodes_allowed))) {
NODEMASK_FREE(nodes_allowed);
- nodes_allowed = &node_states[N_HIGH_MEMORY];
+ nodes_allowed = &node_states[N_MEMORY];
}
} else if (nodes_allowed) {
/*
@@ -1507,11 +1507,11 @@ static ssize_t nr_hugepages_store_common(bool obey_mempolicy,
count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
init_nodemask_of_node(nodes_allowed, nid);
} else
- nodes_allowed = &node_states[N_HIGH_MEMORY];
+ nodes_allowed = &node_states[N_MEMORY];

h->max_huge_pages = set_max_huge_pages(h, count, nodes_allowed);

- if (nodes_allowed != &node_states[N_HIGH_MEMORY])
+ if (nodes_allowed != &node_states[N_MEMORY])
NODEMASK_FREE(nodes_allowed);

return len;
@@ -1812,7 +1812,7 @@ static void hugetlb_register_all_nodes(void)
{
int nid;

- for_each_node_state(nid, N_HIGH_MEMORY) {
+ for_each_node_state(nid, N_MEMORY) {
struct node *node = &node_devices[nid];
if (node->dev.id == nid)
hugetlb_register_node(node);
@@ -1906,8 +1906,8 @@ void __init hugetlb_add_hstate(unsigned order)
h->free_huge_pages = 0;
for (i = 0; i < MAX_NUMNODES; ++i)
INIT_LIST_HEAD(&h->hugepage_freelists[i]);
- h->next_nid_to_alloc = first_node(node_states[N_HIGH_MEMORY]);
- h->next_nid_to_free = first_node(node_states[N_HIGH_MEMORY]);
+ h->next_nid_to_alloc = first_node(node_states[N_MEMORY]);
+ h->next_nid_to_free = first_node(node_states[N_MEMORY]);
snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
huge_page_size(h)/1024);

@@ -1995,11 +1995,11 @@ static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
if (!(obey_mempolicy &&
init_nodemask_of_mempolicy(nodes_allowed))) {
NODEMASK_FREE(nodes_allowed);
- nodes_allowed = &node_states[N_HIGH_MEMORY];
+ nodes_allowed = &node_states[N_MEMORY];
}
h->max_huge_pages = set_max_huge_pages(h, tmp, nodes_allowed);

- if (nodes_allowed != &node_states[N_HIGH_MEMORY])
+ if (nodes_allowed != &node_states[N_MEMORY])
NODEMASK_FREE(nodes_allowed);
}
out:
--
1.7.4.4

2012-08-06 09:28:33

by Lai Jiangshan

[permalink] [raw]
Subject: [RFC V3 PATCH 07/25] procfs: use N_MEMORY instead N_HIGH_MEMORY

N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan <[email protected]>
Acked-by: Hillf Danton <[email protected]>
---
fs/proc/kcore.c | 2 +-
fs/proc/task_mmu.c | 4 ++--
2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 86c67ee..e96d4f1 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -249,7 +249,7 @@ static int kcore_update_ram(void)
/* Not inialized....update now */
/* find out "max pfn" */
end_pfn = 0;
- for_each_node_state(nid, N_HIGH_MEMORY) {
+ for_each_node_state(nid, N_MEMORY) {
unsigned long node_end;
node_end = NODE_DATA(nid)->node_start_pfn +
NODE_DATA(nid)->node_spanned_pages;
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 4540b8f..ed3d381 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1080,7 +1080,7 @@ static struct page *can_gather_numa_stats(pte_t pte, struct vm_area_struct *vma,
return NULL;

nid = page_to_nid(page);
- if (!node_isset(nid, node_states[N_HIGH_MEMORY]))
+ if (!node_isset(nid, node_states[N_MEMORY]))
return NULL;

return page;
@@ -1232,7 +1232,7 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
if (md->writeback)
seq_printf(m, " writeback=%lu", md->writeback);

- for_each_node_state(n, N_HIGH_MEMORY)
+ for_each_node_state(n, N_MEMORY)
if (md->node[n])
seq_printf(m, " N%d=%lu", n, md->node[n]);
out:
--
1.7.4.4

2012-08-06 09:28:32

by Lai Jiangshan

[permalink] [raw]
Subject: [RFC V3 PATCH 11/25] mempolicy: use N_MEMORY instead N_HIGH_MEMORY

N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan <[email protected]>
---
mm/mempolicy.c | 12 ++++++------
1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 1d771e4..ad0381d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -212,9 +212,9 @@ static int mpol_set_nodemask(struct mempolicy *pol,
/* if mode is MPOL_DEFAULT, pol is NULL. This is right. */
if (pol == NULL)
return 0;
- /* Check N_HIGH_MEMORY */
+ /* Check N_MEMORY */
nodes_and(nsc->mask1,
- cpuset_current_mems_allowed, node_states[N_HIGH_MEMORY]);
+ cpuset_current_mems_allowed, node_states[N_MEMORY]);

VM_BUG_ON(!nodes);
if (pol->mode == MPOL_PREFERRED && nodes_empty(*nodes))
@@ -1363,7 +1363,7 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode,
goto out_put;
}

- if (!nodes_subset(*new, node_states[N_HIGH_MEMORY])) {
+ if (!nodes_subset(*new, node_states[N_MEMORY])) {
err = -EINVAL;
goto out_put;
}
@@ -2314,7 +2314,7 @@ void __init numa_policy_init(void)
* fall back to the largest node if they're all smaller.
*/
nodes_clear(interleave_nodes);
- for_each_node_state(nid, N_HIGH_MEMORY) {
+ for_each_node_state(nid, N_MEMORY) {
unsigned long total_pages = node_present_pages(nid);

/* Preserve the largest node */
@@ -2395,7 +2395,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context)
*nodelist++ = '\0';
if (nodelist_parse(nodelist, nodes))
goto out;
- if (!nodes_subset(nodes, node_states[N_HIGH_MEMORY]))
+ if (!nodes_subset(nodes, node_states[N_MEMORY]))
goto out;
} else
nodes_clear(nodes);
@@ -2429,7 +2429,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context)
* Default to online nodes with memory if no nodelist
*/
if (!nodelist)
- nodes = node_states[N_HIGH_MEMORY];
+ nodes = node_states[N_MEMORY];
break;
case MPOL_LOCAL:
/*
--
1.7.4.4

2012-08-06 09:23:33

by Lai Jiangshan

[permalink] [raw]
Subject: [RFC V3 PATCH 01/25] page_alloc.c: don't subtract unrelated memmap from zone's present pages

A)======
Currently, the memory-page-map (the struct page array) is not defined in
struct zone. It is defined in several ways:

FLATMEM: global memmap, can be allocated from any zone <= ZONE_NORMAL
CONFIG_DISCONTIGMEM: node-specific memmap, can be allocated from any
		     zone <= ZONE_NORMAL within that node.
CONFIG_SPARSEMEM: memory-section-specific memmap, can be allocated from any
		  zone; with CONFIG_SPARSEMEM_VMEMMAP it is not even
		  physically contiguous.

So the memmap is not directly related to the zone, and its memory can be
allocated outside the zone, so it is wrong to subtract the memmap's size
from the zone's present pages.

B)======
When the system has large holes, the subtracted present-pages size becomes
very small or even negative, which makes memory management at that zone work
badly, or makes the zone unusable, even though the real present-pages size
is actually large.
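
For example (assuming 4KB pages and a 64-byte struct page): a zone spanning
1GB covers 262144 pages, so memmap_pages = 262144 * 64 / 4096 = 4096 pages.
If holes leave only 3000 present pages in that zone, realsize (3000) is
smaller than memmap_pages (4096), and the code below can only hit the
"exceeds realsize" warning path, leaving the zone's accounting meaningless.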

C)======
The subtracted present-pages size is also a problem for memory hot-removal:
zone->present_pages may underflow and wrap to a huge value (it is an
unsigned long).
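
(Concretely: if a zone has 262144 real pages but present_pages was recorded
as 262144 - 4096 = 258048 after the memmap subtraction, offlining all
262144 pages leaves present_pages = 258048 - 262144, which wraps around to
a huge unsigned value.)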

D)======
The memory-page-map is large, long-living, unreclaimable memory, so it is
reasonable to subtract it when computing watermarks. But a proper approach
is needed to do that correctly, and such an approach should also handle
other long-living unreclaimable memory.

The current approach blindly subtracts the memmap size and gets it wrong;
remove it.

Signed-off-by: Lai Jiangshan <[email protected]>
---
mm/page_alloc.c | 20 +-------------------
1 files changed, 1 insertions(+), 19 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4a4f921..9312702 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4357,30 +4357,12 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,

for (j = 0; j < MAX_NR_ZONES; j++) {
struct zone *zone = pgdat->node_zones + j;
- unsigned long size, realsize, memmap_pages;
+ unsigned long size, realsize;

size = zone_spanned_pages_in_node(nid, j, zones_size);
realsize = size - zone_absent_pages_in_node(nid, j,
zholes_size);

- /*
- * Adjust realsize so that it accounts for how much memory
- * is used by this zone for memmap. This affects the watermark
- * and per-cpu initialisations
- */
- memmap_pages =
- PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
- if (realsize >= memmap_pages) {
- realsize -= memmap_pages;
- if (memmap_pages)
- printk(KERN_DEBUG
- " %s zone: %lu pages used for memmap\n",
- zone_names[j], memmap_pages);
- } else
- printk(KERN_WARNING
- " %s zone: %lu pages exceeds realsize %lu\n",
- zone_names[j], memmap_pages, realsize);
-
/* Account for reserved pages */
if (j == 0 && realsize > dma_reserve) {
realsize -= dma_reserve;
--
1.7.4.4

2012-08-06 09:29:34

by Lai Jiangshan

[permalink] [raw]
Subject: [RFC V3 PATCH 06/25] cpuset: use N_MEMORY instead N_HIGH_MEMORY

N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan <[email protected]>
Acked-by: Hillf Danton <[email protected]>
---
Documentation/cgroups/cpusets.txt | 2 +-
include/linux/cpuset.h | 2 +-
kernel/cpuset.c | 32 ++++++++++++++++----------------
3 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt
index cefd3d8..12e01d4 100644
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
@@ -218,7 +218,7 @@ and name space for cpusets, with a minimum of additional kernel code.
The cpus and mems files in the root (top_cpuset) cpuset are
read-only. The cpus file automatically tracks the value of
cpu_online_mask using a CPU hotplug notifier, and the mems file
-automatically tracks the value of node_states[N_HIGH_MEMORY]--i.e.,
+automatically tracks the value of node_states[N_MEMORY]--i.e.,
nodes with memory--using the cpuset_track_online_nodes() hook.


diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 838320f..8c8a60d29 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -144,7 +144,7 @@ static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
return node_possible_map;
}

-#define cpuset_current_mems_allowed (node_states[N_HIGH_MEMORY])
+#define cpuset_current_mems_allowed (node_states[N_MEMORY])
static inline void cpuset_init_current_mems_allowed(void) {}

static inline int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index f33c715..2b133db 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -302,10 +302,10 @@ static void guarantee_online_cpus(const struct cpuset *cs,
* are online, with memory. If none are online with memory, walk
* up the cpuset hierarchy until we find one that does have some
* online mems. If we get all the way to the top and still haven't
- * found any online mems, return node_states[N_HIGH_MEMORY].
+ * found any online mems, return node_states[N_MEMORY].
*
* One way or another, we guarantee to return some non-empty subset
- * of node_states[N_HIGH_MEMORY].
+ * of node_states[N_MEMORY].
*
* Call with callback_mutex held.
*/
@@ -313,14 +313,14 @@ static void guarantee_online_cpus(const struct cpuset *cs,
static void guarantee_online_mems(const struct cpuset *cs, nodemask_t *pmask)
{
while (cs && !nodes_intersects(cs->mems_allowed,
- node_states[N_HIGH_MEMORY]))
+ node_states[N_MEMORY]))
cs = cs->parent;
if (cs)
nodes_and(*pmask, cs->mems_allowed,
- node_states[N_HIGH_MEMORY]);
+ node_states[N_MEMORY]);
else
- *pmask = node_states[N_HIGH_MEMORY];
- BUG_ON(!nodes_intersects(*pmask, node_states[N_HIGH_MEMORY]));
+ *pmask = node_states[N_MEMORY];
+ BUG_ON(!nodes_intersects(*pmask, node_states[N_MEMORY]));
}

/*
@@ -1100,7 +1100,7 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
return -ENOMEM;

/*
- * top_cpuset.mems_allowed tracks node_stats[N_HIGH_MEMORY];
+ * top_cpuset.mems_allowed tracks node_states[N_MEMORY];
* it's read-only
*/
if (cs == &top_cpuset) {
@@ -1122,7 +1122,7 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
goto done;

if (!nodes_subset(trialcs->mems_allowed,
- node_states[N_HIGH_MEMORY])) {
+ node_states[N_MEMORY])) {
retval = -EINVAL;
goto done;
}
@@ -2034,7 +2034,7 @@ static struct cpuset *cpuset_next(struct list_head *queue)
* before dropping down to the next. It always processes a node before
* any of its children.
*
- * In the case of memory hot-unplug, it will remove nodes from N_HIGH_MEMORY
+ * In the case of memory hot-unplug, it will remove nodes from N_MEMORY
* if all present pages from a node are offlined.
*/
static void
@@ -2073,7 +2073,7 @@ scan_cpusets_upon_hotplug(struct cpuset *root, enum hotplug_event event)

/* Continue past cpusets with all mems online */
if (nodes_subset(cp->mems_allowed,
- node_states[N_HIGH_MEMORY]))
+ node_states[N_MEMORY]))
continue;

oldmems = cp->mems_allowed;
@@ -2081,7 +2081,7 @@ scan_cpusets_upon_hotplug(struct cpuset *root, enum hotplug_event event)
/* Remove offline mems from this cpuset. */
mutex_lock(&callback_mutex);
nodes_and(cp->mems_allowed, cp->mems_allowed,
- node_states[N_HIGH_MEMORY]);
+ node_states[N_MEMORY]);
mutex_unlock(&callback_mutex);

/* Move tasks from the empty cpuset to a parent */
@@ -2134,8 +2134,8 @@ void cpuset_update_active_cpus(bool cpu_online)

#ifdef CONFIG_MEMORY_HOTPLUG
/*
- * Keep top_cpuset.mems_allowed tracking node_states[N_HIGH_MEMORY].
- * Call this routine anytime after node_states[N_HIGH_MEMORY] changes.
+ * Keep top_cpuset.mems_allowed tracking node_states[N_MEMORY].
+ * Call this routine anytime after node_states[N_MEMORY] changes.
* See cpuset_update_active_cpus() for CPU hotplug handling.
*/
static int cpuset_track_online_nodes(struct notifier_block *self,
@@ -2148,7 +2148,7 @@ static int cpuset_track_online_nodes(struct notifier_block *self,
case MEM_ONLINE:
oldmems = top_cpuset.mems_allowed;
mutex_lock(&callback_mutex);
- top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
+ top_cpuset.mems_allowed = node_states[N_MEMORY];
mutex_unlock(&callback_mutex);
update_tasks_nodemask(&top_cpuset, &oldmems, NULL);
break;
@@ -2177,7 +2177,7 @@ static int cpuset_track_online_nodes(struct notifier_block *self,
void __init cpuset_init_smp(void)
{
cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
- top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
+ top_cpuset.mems_allowed = node_states[N_MEMORY];

hotplug_memory_notifier(cpuset_track_online_nodes, 10);

@@ -2245,7 +2245,7 @@ void cpuset_init_current_mems_allowed(void)
*
* Description: Returns the nodemask_t mems_allowed of the cpuset
* attached to the specified @tsk. Guaranteed to return some non-empty
- * subset of node_states[N_HIGH_MEMORY], even if this means going outside the
+ * subset of node_states[N_MEMORY], even if this means going outside the
* tasks cpuset.
**/

--
1.7.4.4

2012-08-06 09:23:31

by Lai Jiangshan

[permalink] [raw]
Subject: [RFC V3 PATCH 03/25] slub, hotplug: ignore unrelated node's hot-adding and hot-removing

SLUB only focuses on the nodes which have normal memory, so ignore other
nodes' hot-adding and hot-removing: status_change_nid_normal only tracks
N_NORMAL_MEMORY changes, so check it instead of status_change_nid.

Signed-off-by: Lai Jiangshan <[email protected]>
---
mm/slub.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 8c691fa..f8b137a 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3568,7 +3568,7 @@ static void slab_mem_offline_callback(void *arg)
struct memory_notify *marg = arg;
int offline_node;

- offline_node = marg->status_change_nid;
+ offline_node = marg->status_change_nid_normal;

/*
* If the node still has available memory. we need kmem_cache_node
@@ -3601,7 +3601,7 @@ static int slab_mem_going_online_callback(void *arg)
struct kmem_cache_node *n;
struct kmem_cache *s;
struct memory_notify *marg = arg;
- int nid = marg->status_change_nid;
+ int nid = marg->status_change_nid_normal;
int ret = 0;

/*
--
1.7.4.4

2012-08-06 09:30:42

by Lai Jiangshan

[permalink] [raw]
Subject: [RFC V3 PATCH 02/25] memory_hotplug: fix missing nodemask management

Currently memory_hotplug only manages node_states[N_HIGH_MEMORY]; it
forgets to manage node_states[N_NORMAL_MEMORY]. Fix it.
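
(Note on the helpers below: when CONFIG_HIGHMEM is not set, nodemask.h
defines N_HIGH_MEMORY as an alias of N_NORMAL_MEMORY, so the test
"if (N_HIGH_MEMORY == N_NORMAL_MEMORY)" is a compile-time check for "no
separate highmem state"; in that case every zone up to ZONE_MOVABLE counts
toward N_NORMAL_MEMORY, which is why zone_last is raised to ZONE_MOVABLE.)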

Signed-off-by: Lai Jiangshan <[email protected]>
---
Documentation/memory-hotplug.txt | 5 ++-
include/linux/memory.h | 1 +
mm/memory_hotplug.c | 94 +++++++++++++++++++++++++++++++------
3 files changed, 83 insertions(+), 17 deletions(-)

diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt
index 6d0c251..6e6cbc7 100644
--- a/Documentation/memory-hotplug.txt
+++ b/Documentation/memory-hotplug.txt
@@ -377,15 +377,18 @@ The third argument is passed by pointer of struct memory_notify.
struct memory_notify {
unsigned long start_pfn;
unsigned long nr_pages;
+ int status_change_nid_normal;
int status_change_nid;
}

start_pfn is start_pfn of online/offline memory.
nr_pages is # of pages of online/offline memory.
+status_change_nid_normal is set to the node id when N_NORMAL_MEMORY of the
+nodemask is (or will be) set/cleared; if this is -1, the nodemask status is not changed.
status_change_nid is set node id when N_HIGH_MEMORY of nodemask is (will be)
set/clear. It means a new(memoryless) node gets new memory by online and a
node loses all memory. If this is -1, then nodemask status is not changed.
-If status_changed_nid >= 0, callback should create/discard structures for the
+If status_change_nid* >= 0, the callback should create/discard structures for the
node if necessary.

--------------
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 1ac7f6e..6b9202b 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -53,6 +53,7 @@ int arch_get_memory_phys_device(unsigned long start_pfn);
struct memory_notify {
unsigned long start_pfn;
unsigned long nr_pages;
+ int status_change_nid_normal;
int status_change_nid;
};

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 427bb29..3438c4a 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -456,6 +456,34 @@ static int online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
return 0;
}

+static void check_nodemasks_changes_online(unsigned long nr_pages,
+ struct zone *zone, struct memory_notify *arg)
+{
+ int nid = zone_to_nid(zone);
+ enum zone_type zone_last = ZONE_NORMAL;
+
+ if (N_HIGH_MEMORY == N_NORMAL_MEMORY)
+ zone_last = ZONE_MOVABLE;
+
+ if (zone_idx(zone) <= zone_last && !node_state(nid, N_NORMAL_MEMORY))
+ arg->status_change_nid_normal = nid;
+ else
+ arg->status_change_nid_normal = -1;
+
+ if (!node_state(nid, N_HIGH_MEMORY))
+ arg->status_change_nid = nid;
+ else
+ arg->status_change_nid = -1;
+}
+
+static void set_nodemasks(int node, struct memory_notify *arg)
+{
+ if (arg->status_change_nid_normal >= 0)
+ node_set_state(node, N_NORMAL_MEMORY);
+
+ node_set_state(node, N_HIGH_MEMORY);
+}
+

int __ref online_pages(unsigned long pfn, unsigned long nr_pages)
{
@@ -467,13 +495,18 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages)
struct memory_notify arg;

lock_memory_hotplug();
+ /*
+ * This doesn't need a lock to do pfn_to_page().
+ * The section can't be removed here because of the
+ * memory_block->state_mutex.
+ */
+ zone = page_zone(pfn_to_page(pfn));
+
arg.start_pfn = pfn;
arg.nr_pages = nr_pages;
- arg.status_change_nid = -1;
+ check_nodemasks_changes_online(nr_pages, zone, &arg);

nid = page_to_nid(pfn_to_page(pfn));
- if (node_present_pages(nid) == 0)
- arg.status_change_nid = nid;

ret = memory_notify(MEM_GOING_ONLINE, &arg);
ret = notifier_to_errno(ret);
@@ -483,12 +516,6 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages)
return ret;
}
/*
- * This doesn't need a lock to do pfn_to_page().
- * The section can't be removed here because of the
- * memory_block->state_mutex.
- */
- zone = page_zone(pfn_to_page(pfn));
- /*
* If this zone is not populated, then it is not in zonelist.
* This means the page allocator ignores this zone.
* So, zonelist must be updated after online.
@@ -523,7 +550,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages)

if (onlined_pages) {
kswapd_run(zone_to_nid(zone));
- node_set_state(zone_to_nid(zone), N_HIGH_MEMORY);
+ set_nodemasks(zone_to_nid(zone), &arg);
}

vm_total_pages = nr_free_pagecache_pages();
@@ -865,6 +892,44 @@ check_pages_isolated(unsigned long start_pfn, unsigned long end_pfn)
return offlined;
}

+static void check_nodemasks_changes_offline(unsigned long nr_pages,
+ struct zone *zone, struct memory_notify *arg)
+{
+ struct pglist_data *pgdat = zone->zone_pgdat;
+ unsigned long present_pages = 0;
+ enum zone_type zt, zone_last = ZONE_NORMAL;
+
+ if (N_HIGH_MEMORY == N_NORMAL_MEMORY)
+ zone_last = ZONE_MOVABLE;
+
+ for (zt = 0; zt <= zone_last; zt++)
+ present_pages += pgdat->node_zones[zt].present_pages;
+ if (zone_idx(zone) <= zone_last && nr_pages >= present_pages)
+ arg->status_change_nid_normal = zone_to_nid(zone);
+ else
+ arg->status_change_nid_normal = -1;
+
+ zone_last = ZONE_MOVABLE;
+ for (; zt <= zone_last; zt++)
+ present_pages += pgdat->node_zones[zt].present_pages;
+ if (nr_pages >= present_pages)
+ arg->status_change_nid = zone_to_nid(zone);
+ else
+ arg->status_change_nid = -1;
+}
+
+static void clear_nodemasks(int node, struct memory_notify *arg)
+{
+ if (arg->status_change_nid_normal >= 0)
+ node_clear_state(node, N_NORMAL_MEMORY);
+
+ if (N_HIGH_MEMORY == N_NORMAL_MEMORY)
+ return;
+
+ if (arg->status_change_nid >= 0)
+ node_clear_state(node, N_HIGH_MEMORY);
+}
+
static int __ref offline_pages(unsigned long start_pfn,
unsigned long end_pfn, unsigned long timeout)
{
@@ -898,9 +963,7 @@ static int __ref offline_pages(unsigned long start_pfn,

arg.start_pfn = start_pfn;
arg.nr_pages = nr_pages;
- arg.status_change_nid = -1;
- if (nr_pages >= node_present_pages(node))
- arg.status_change_nid = node;
+ check_nodemasks_changes_offline(nr_pages, zone, &arg);

ret = memory_notify(MEM_GOING_OFFLINE, &arg);
ret = notifier_to_errno(ret);
@@ -965,10 +1028,9 @@ repeat:

init_per_zone_wmark_min();

- if (!node_present_pages(node)) {
- node_clear_state(node, N_HIGH_MEMORY);
+ clear_nodemasks(node, &arg);
+ if (arg.status_change_nid >= 0)
kswapd_stop(node);
- }

vm_total_pages = nr_free_pagecache_pages();
writeback_set_ratelimit();
--
1.7.4.4

2012-08-06 09:30:41

by Lai Jiangshan

[permalink] [raw]
Subject: [RFC V3 PATCH 04/25] node: cleanup node_state_attr

Make it more readable and easier to add new states.
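
(Designated initializers tie each array slot to its enum value, so adding a
new state, such as N_MEMORY later in this series, needs only one new entry
instead of renumbering the index-based references in node_state_attrs[].)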

Signed-off-by: Lai Jiangshan <[email protected]>
---
drivers/base/node.c | 20 ++++++++++----------
1 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index af1a177..5d7731e 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -614,23 +614,23 @@ static ssize_t show_node_state(struct device *dev,
{ __ATTR(name, 0444, show_node_state, NULL), state }

static struct node_attr node_state_attr[] = {
- _NODE_ATTR(possible, N_POSSIBLE),
- _NODE_ATTR(online, N_ONLINE),
- _NODE_ATTR(has_normal_memory, N_NORMAL_MEMORY),
- _NODE_ATTR(has_cpu, N_CPU),
+ [N_POSSIBLE] = _NODE_ATTR(possible, N_POSSIBLE),
+ [N_ONLINE] = _NODE_ATTR(online, N_ONLINE),
+ [N_NORMAL_MEMORY] = _NODE_ATTR(has_normal_memory, N_NORMAL_MEMORY),
#ifdef CONFIG_HIGHMEM
- _NODE_ATTR(has_high_memory, N_HIGH_MEMORY),
+ [N_HIGH_MEMORY] = _NODE_ATTR(has_high_memory, N_HIGH_MEMORY),
#endif
+ [N_CPU] = _NODE_ATTR(has_cpu, N_CPU),
};

static struct attribute *node_state_attrs[] = {
- &node_state_attr[0].attr.attr,
- &node_state_attr[1].attr.attr,
- &node_state_attr[2].attr.attr,
- &node_state_attr[3].attr.attr,
+ &node_state_attr[N_POSSIBLE].attr.attr,
+ &node_state_attr[N_ONLINE].attr.attr,
+ &node_state_attr[N_NORMAL_MEMORY].attr.attr,
#ifdef CONFIG_HIGHMEM
- &node_state_attr[4].attr.attr,
+ &node_state_attr[N_HIGH_MEMORY].attr.attr,
#endif
+ &node_state_attr[N_CPU].attr.attr,
NULL
};

--
1.7.4.4

2012-08-23 08:22:52

by Yasuaki Ishimatsu

[permalink] [raw]
Subject: Re: [RFC V3 PATCH 00/25] memory,numa: introduce MOVABLE-dedicated node and online_movable for hotplug

Hi Lai,

Sorry for the late reply.
I'm trying to apply your patchset to linux-3.6-rc3,
but I failed to apply it.
Which base kernel do your patches apply to?

Thanks,
Yasuaki Ishimatsu


2012/08/06 18:22, Lai Jiangshan wrote:
> A) Introduction:
>
> This patchset adds MOVABLE-dedicated node and online_movable for memory-management.
>
> It is used for anti-fragmentation(hugepage, big-order allocation...),
> hot-removal-of-memory(virtualization, power-conserve, move memory between systems
> to make better utilities of memories).
>
> B) User Interface:
>
> When users(big system manager) need config some node/memory as MOVABLE:
> 1 Use kernelcore_max_addr=XX when boot
> 2 Use movable_online hotplug action when running
> We may introduce some more convenient interface, such as
> movable_node=NODE_LIST boot option.
>
> C) Patches
>
> Patch1-3 Fix problems of the current code.(all related with hotplug)
> Patch4 cleanup for node_state_attr
> Patch5 introduce N_MEMORY
> Patch6-18 use N_MEMORY instead N_HIGH_MEMORY.
> The patches are separated by subsystem,
> *these conversions was(must be) checked carefully*.
> Patch18 also changes the node_states initialization
> Patch19 Add config to allow MOVABLE-dedicated node
> Patch20-24 Add kernelcore_max_addr
> Patch25 Add online_movable and online_kernel
>
>
> D) changes
> change V3-v2:
> Proper nodemask management
>
> change V2-V1:
>
> The original V1 patchset of MOVABLE-dedicated node is here:
> http://comments.gmane.org/gmane.linux.kernel.mm/78122
>
> The new V2 adds N_MEMORY and a notion of "MOVABLE-dedicated node".
> And fix some related problems.
>
> The orignal V1 patchset of "add online_movable" is here:
> https://lkml.org/lkml/2012/7/4/145
>
> The new V2 discards the MIGRATE_HOTREMOVE approach, and use a more straight
> implementation(only 1 patch).
>
>
> Lai Jiangshan (21):
> page_alloc.c: don't subtract unrelated memmap from zone's present
> pages
> memory_hotplug: fix missing nodemask management
> slub, hotplug: ignore unrelated node's hot-adding and hot-removing
> node: cleanup node_state_attr
> node_states: introduce N_MEMORY
> cpuset: use N_MEMORY instead N_HIGH_MEMORY
> procfs: use N_MEMORY instead N_HIGH_MEMORY
> memcontrol: use N_MEMORY instead N_HIGH_MEMORY
> oom: use N_MEMORY instead N_HIGH_MEMORY
> mm,migrate: use N_MEMORY instead N_HIGH_MEMORY
> mempolicy: use N_MEMORY instead N_HIGH_MEMORY
> hugetlb: use N_MEMORY instead N_HIGH_MEMORY
> vmstat: use N_MEMORY instead N_HIGH_MEMORY
> kthread: use N_MEMORY instead N_HIGH_MEMORY
> init: use N_MEMORY instead N_HIGH_MEMORY
> vmscan: use N_MEMORY instead N_HIGH_MEMORY
> page_alloc: use N_MEMORY instead N_HIGH_MEMORY change the node_states
> initialization
> hotplug: update nodemasks management
> numa: add CONFIG_MOVABLE_NODE for movable-dedicated node
> page_alloc: add kernelcore_max_addr
> mm, memory-hotplug: add online_movable and online_kernel
>
> Yasuaki Ishimatsu (4):
> x86: get pg_data_t's memory from other node
> x86: use memblock_set_current_limit() to set memblock.current_limit
> memblock: limit memory address from memblock
> memblock: compare current_limit with end variable at
> memblock_find_in_range_node()
>
> Documentation/cgroups/cpusets.txt | 2 +-
> Documentation/kernel-parameters.txt | 9 ++
> Documentation/memory-hotplug.txt | 24 +++-
> arch/x86/kernel/setup.c | 4 +-
> arch/x86/mm/init_64.c | 4 +-
> arch/x86/mm/numa.c | 8 +-
> drivers/base/memory.c | 19 ++-
> drivers/base/node.c | 28 +++--
> fs/proc/kcore.c | 2 +-
> fs/proc/task_mmu.c | 4 +-
> include/linux/cpuset.h | 2 +-
> include/linux/memblock.h | 1 +
> include/linux/memory.h | 2 +
> include/linux/memory_hotplug.h | 13 ++-
> include/linux/nodemask.h | 5 +
> init/main.c | 2 +-
> kernel/cpuset.c | 32 +++---
> kernel/kthread.c | 2 +-
> mm/Kconfig | 8 ++
> mm/hugetlb.c | 24 ++--
> mm/memblock.c | 10 +-
> mm/memcontrol.c | 18 ++--
> mm/memory_hotplug.c | 232 ++++++++++++++++++++++++++++++++---
> mm/mempolicy.c | 12 +-
> mm/migrate.c | 2 +-
> mm/oom_kill.c | 2 +-
> mm/page_alloc.c | 96 +++++++++------
> mm/page_cgroup.c | 2 +-
> mm/slub.c | 4 +-
> mm/vmscan.c | 4 +-
> mm/vmstat.c | 4 +-
> 31 files changed, 437 insertions(+), 144 deletions(-)
>