2012-10-29 15:06:07

by Lai Jiangshan

Subject: [V5 PATCH 00/26] mm, memory-hotplug: dynamic configure movable memory and introduce movable node

Movable memory is a very important concept in memory management;
we need to consolidate it and put it to use on real systems.

Movable memory is needed for:
o anti-fragmentation (hugepages, high-order allocations...)
o logical hot-remove (virtualization, Memory Capacity on Demand)
o physical hot-remove (power saving, hardware partitioning, hardware fault management)

All of these require configuring memory dynamically, so that memory is used
better and more safely. We also need physical hot-remove, so we need movable nodes too.
(Some systems support physical memory migration, in which case not all memory
on a physical node has to be movable. A movable node is still needed as the
logical node if we want physical migration to be transparent.)

We add the dynamic configuration commands "online_movable" and "online_kernel".
We also add a non-dynamic boot option, kernelcore_max_addr.
We may add more dynamic/non-dynamic configuration in the future.


The patchset is based on 3.7-rc3 with these three patches already applied:
https://lkml.org/lkml/2012/10/24/151
https://lkml.org/lkml/2012/10/26/150

You can also simply pull all the patches from:
git pull https://github.com/laijs/linux.git hotplug-next



Issues):

mempolicy (MPOL_BIND) does not behave well when the nodemask contains movable
nodes only: kernel allocations fail, and the task cannot create new tasks or
other kernel objects.

So we change the strategy/policy:
when the bound nodemask has movable node(s) only, we apply
the mempolicy to userspace allocations only and do not apply it
to kernel allocations.
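
The policy change described above can be sketched in user space. This is only a model under stated assumptions, not the code from patch 26: nodemask_model_t is a plain 64-bit mask (one bit per node) rather than the kernel's nodemask_t, and effective_nodes() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one bit per node, not the kernel's nodemask_t. */
typedef uint64_t nodemask_model_t;

/*
 * Return the set of nodes an allocation may use.
 *   bound:  nodes named by the MPOL_BIND policy
 *   normal: nodes that have kernel-usable (non-movable) memory
 *   all:    nodes that have any memory at all
 */
static nodemask_model_t effective_nodes(nodemask_model_t bound,
                                        nodemask_model_t normal,
                                        nodemask_model_t all,
                                        bool user_alloc)
{
    bool movable_only;

    /* Nodes with normal memory must be a subset of nodes with memory. */
    assert((normal & ~all) == 0);

    /* The bound mask is "movable only" if no bound node has normal memory. */
    movable_only = (bound & normal) == 0;

    /* Kernel allocation with a movable-only policy: ignore the policy. */
    if (!user_alloc && movable_only)
        return all;

    /* Otherwise the policy applies as usual. */
    return bound & all;
}
```

With node 0 holding normal memory and nodes 1-2 movable only, a policy bound to nodes 1-2 still restricts userspace allocations to those nodes, while kernel allocations fall back to every node with memory.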

CPUSET has the same problem, but the relevant code is spread throughout
page_alloc.c and we have not fixed it yet. We can/will change the allocation
strategy to one of these 3 strategies:
1) the same strategy as mempolicy
2) change cpuset so that the nodemask always has at least one normal node
3) split the nodemask: nodemask_user and nodemask_kernel

Thoughts?



Patches):

Patch1-3: add online_movable and online_kernel, but do not result in a movable node
Patch4: cleanup for node_state_attr
Patch5: introduce N_MEMORY
Patch6-17: use N_MEMORY instead of N_HIGH_MEMORY.
The patches are separated by subsystem;
Patch17 also changes the node_states initialization
Patch18-20: add MOVABLE-dedicated node
Patch21-25: add kernelcore_max_addr
Patch26: mempolicy handles movable nodes




Changes):

change V5-V4:
consolidate online_movable/online_kernel
nodemask management

change V4-V3:
rebase.
online_movable/online_kernel can create a zone from empty
or empty a zone

change V3-V2:
Proper nodemask management

change V2-V1:

The original V1 patchset of MOVABLE-dedicated node is here:
http://comments.gmane.org/gmane.linux.kernel.mm/78122

The new V2 adds N_MEMORY and a notion of "MOVABLE-dedicated node",
and fixes some related problems.

The original V1 patchset of "add online_movable" is here:
https://lkml.org/lkml/2012/7/4/145

The new V2 discards the MIGRATE_HOTREMOVE approach and uses a more
straightforward implementation (only 1 patch).



Lai Jiangshan (22):
mm, memory-hotplug: dynamic configure movable memory and portion
memory
memory_hotplug: handle empty zone when online_movable/online_kernel
memory_hotplug: ensure every online node has NORMAL memory
node: cleanup node_state_attr
node_states: introduce N_MEMORY
cpuset: use N_MEMORY instead N_HIGH_MEMORY
procfs: use N_MEMORY instead N_HIGH_MEMORY
memcontrol: use N_MEMORY instead N_HIGH_MEMORY
oom: use N_MEMORY instead N_HIGH_MEMORY
mm,migrate: use N_MEMORY instead N_HIGH_MEMORY
mempolicy: use N_MEMORY instead N_HIGH_MEMORY
hugetlb: use N_MEMORY instead N_HIGH_MEMORY
vmstat: use N_MEMORY instead N_HIGH_MEMORY
kthread: use N_MEMORY instead N_HIGH_MEMORY
init: use N_MEMORY instead N_HIGH_MEMORY
vmscan: use N_MEMORY instead N_HIGH_MEMORY
page_alloc: use N_MEMORY instead N_HIGH_MEMORY change the node_states
initialization
hotplug: update nodemasks management
numa: add CONFIG_MOVABLE_NODE for movable-dedicated node
memory_hotplug: allow online/offline memory to result movable node
page_alloc: add kernelcore_max_addr
mempolicy: fix is_valid_nodemask()

Yasuaki Ishimatsu (4):
x86: get pg_data_t's memory from other node
x86: use memblock_set_current_limit() to set memblock.current_limit
memblock: limit memory address from memblock
memblock: compare current_limit with end variable at
memblock_find_in_range_node()

Documentation/cgroups/cpusets.txt | 2 +-
Documentation/kernel-parameters.txt | 9 +
Documentation/memory-hotplug.txt | 19 ++-
arch/x86/kernel/setup.c | 4 +-
arch/x86/mm/init_64.c | 4 +-
arch/x86/mm/numa.c | 8 +-
drivers/base/memory.c | 27 ++--
drivers/base/node.c | 28 ++--
fs/proc/kcore.c | 2 +-
fs/proc/task_mmu.c | 4 +-
include/linux/cpuset.h | 2 +-
include/linux/memblock.h | 1 +
include/linux/memory.h | 1 +
include/linux/memory_hotplug.h | 13 ++-
include/linux/nodemask.h | 5 +
init/main.c | 2 +-
kernel/cpuset.c | 32 ++--
kernel/kthread.c | 2 +-
mm/Kconfig | 8 +
mm/hugetlb.c | 24 ++--
mm/memblock.c | 10 +-
mm/memcontrol.c | 18 +-
mm/memory_hotplug.c | 283 +++++++++++++++++++++++++++++++++--
mm/mempolicy.c | 48 ++++---
mm/migrate.c | 2 +-
mm/oom_kill.c | 2 +-
mm/page_alloc.c | 76 +++++++---
mm/page_cgroup.c | 2 +-
mm/vmscan.c | 4 +-
mm/vmstat.c | 4 +-
30 files changed, 508 insertions(+), 138 deletions(-)

--
1.7.4.4



2012-10-29 15:06:15

by Lai Jiangshan

Subject: [V5 PATCH 01/26] mm, memory-hotplug: dynamic configure movable memory and portion memory

Add online_movable and online_kernel for logical memory hotplug.
This is the dynamic version of "movablecore" & "kernelcore".

We introduce it for the same reasons that "movablecore" & "kernelcore" were
introduced, but it is dynamic and works at run time:

o We can configure memory as kernelcore or movablecore after boot.

When the userspace workload increases and we need more hugepages, we can
use "online_movable" to add memory and allow the system to use more
THP (transparent huge pages), and vice versa when the kernel workload increases.

It also helps virtualization to configure host/guest memory dynamically
and so reduce wasted memory.

Memory Capacity on Demand

o When a new node is physically onlined after boot, we need to use
"online_movable" or "online_kernel" to configure/partition it
as expected when we logically online it.

This configuration also helps physical memory migration.

o All the benefits of the existing "movablecore" & "kernelcore".

o Preparation for movable nodes, which are very important for power saving,
hardware partitioning and high-availability systems (hardware fault management).

(Note, we don't introduce movable-node here.)


Action behavior:
When a memory block/memory section is onlined by "online_movable", the kernel
will not hold direct references to the pages of that memory block,
so we can remove that memory at any time when needed.

When it is onlined by "online_kernel", the kernel can use it.
When it is onlined by "online", the zone type is not changed.
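
The three onlining commands (plus "offline") can be told apart with a small parser. The sketch below is a user-space illustration only: the enum and function names are hypothetical, and the matching is deliberately stricter than the strncmp()/min() comparison in the patch, because it requires the whole command word to be present. The point it shows is the ordering: the longer "online_kernel" and "online_movable" must be tested before plain "online", which is a prefix of both.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names for illustration; not the kernel's identifiers. */
enum state_cmd_model {
    CMD_INVALID = -1,
    CMD_ONLINE_KEEP,     /* "online": keep the current zone type */
    CMD_ONLINE_KERNEL,   /* "online_kernel" */
    CMD_ONLINE_MOVABLE,  /* "online_movable" */
    CMD_OFFLINE          /* "offline" */
};

/* True if the first count bytes of buf start with the full word cmd. */
static int has_prefix(const char *buf, size_t count, const char *cmd)
{
    size_t len = strlen(cmd);

    return count >= len && strncmp(buf, cmd, len) == 0;
}

static enum state_cmd_model parse_state(const char *buf, size_t count)
{
    assert(buf != NULL);

    /* Longest commands first: "online" is a prefix of the other two. */
    if (has_prefix(buf, count, "online_kernel"))
        return CMD_ONLINE_KERNEL;
    if (has_prefix(buf, count, "online_movable"))
        return CMD_ONLINE_MOVABLE;
    if (has_prefix(buf, count, "online"))
        return CMD_ONLINE_KEEP;
    if (has_prefix(buf, count, "offline"))
        return CMD_OFFLINE;
    return CMD_INVALID;
}
```

A write of "online_movable\n" (as produced by echo) therefore selects the movable variant rather than being mistaken for plain "online".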

Current constraints:
Only a memory block that is adjacent to ZONE_MOVABLE
can be onlined from ZONE_NORMAL into ZONE_MOVABLE.
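
To make the adjacency constraint concrete, here is a user-space model of the range checks performed when the tail of one zone is handed to the zone on its right. It follows the three checks in move_pfn_range_right() from this patch, but struct zone_model and move_range_right() are hypothetical stand-ins, locking is omitted, and the right-hand zone is assumed to be non-empty and to start where the left-hand zone ends.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two struct zone fields that matter here. */
struct zone_model {
    unsigned long start_pfn;
    unsigned long spanned_pages;
};

/*
 * Move [start_pfn, end_pfn) from the right end of z1 to the left end of z2.
 * Returns 0 on success, -1 if the range is not the rightmost part of z1.
 * Assumption: z2 is non-empty and begins at the end of z1.
 */
static int move_range_right(struct zone_model *z1, struct zone_model *z2,
                            unsigned long start_pfn, unsigned long end_pfn)
{
    unsigned long z1_end, z2_end;

    assert(z1 != NULL && z2 != NULL);
    z1_end = z1->start_pfn + z1->spanned_pages;
    z2_end = z2->start_pfn + z2->spanned_pages;

    if (start_pfn < z1->start_pfn)  /* range starts below z1 */
        return -1;
    if (end_pfn < z1_end)           /* range must reach z1's right edge */
        return -1;
    if (start_pfn >= z1_end)        /* range must overlap z1 */
        return -1;

    /* Shrink z1 to end at start_pfn; grow z2 to begin at start_pfn. */
    z1->spanned_pages = start_pfn - z1->start_pfn;
    z2->spanned_pages = z2_end - start_pfn;
    z2->start_pfn = start_pfn;
    return 0;
}
```

A block in the middle of the left-hand zone is rejected by the second check, which is exactly the "must be adjacent to ZONE_MOVABLE" limitation.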


Signed-off-by: Lai Jiangshan <[email protected]>
---
Documentation/memory-hotplug.txt | 14 +++++-
drivers/base/memory.c | 27 ++++++----
include/linux/memory_hotplug.h | 13 +++++-
mm/memory_hotplug.c | 101 +++++++++++++++++++++++++++++++++++++-
4 files changed, 142 insertions(+), 13 deletions(-)

diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt
index 6e6cbc7..c6f993d 100644
--- a/Documentation/memory-hotplug.txt
+++ b/Documentation/memory-hotplug.txt
@@ -161,7 +161,8 @@ a recent addition and not present on older kernels.
in the memory block.
'state' : read-write
at read: contains online/offline state of memory.
- at write: user can specify "online", "offline" command
+ at write: user can specify "online_kernel",
+ "online_movable", "online", "offline" command
which will be performed on al sections in the block.
'phys_device' : read-only: designed to show the name of physical memory
device. This is not well implemented now.
@@ -255,6 +256,17 @@ For onlining, you have to write "online" to the section's state file as:

% echo online > /sys/devices/system/memory/memoryXXX/state

+This onlining will not change the ZONE type of the target memory section,
+If the memory section is in ZONE_NORMAL, you can change it to ZONE_MOVABLE:
+
+% echo online_movable > /sys/devices/system/memory/memoryXXX/state
+(NOTE: current limit: this memory section must be adjacent to ZONE_MOVABLE)
+
+And if the memory section is in ZONE_MOVABLE, you can change it to ZONE_NORMAL:
+
+% echo online_kernel > /sys/devices/system/memory/memoryXXX/state
+(NOTE: current limit: this memory section must be adjacent to ZONE_NORMAL)
+
After this, section memoryXXX's state will be 'online' and the amount of
available memory will be increased.

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 86c8821..15a1dd7 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -246,7 +246,7 @@ static bool pages_correctly_reserved(unsigned long start_pfn,
* OK to have direct references to sparsemem variables in here.
*/
static int
-memory_block_action(unsigned long phys_index, unsigned long action)
+memory_block_action(unsigned long phys_index, unsigned long action, int online_type)
{
unsigned long start_pfn;
unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
@@ -261,7 +261,7 @@ memory_block_action(unsigned long phys_index, unsigned long action)
if (!pages_correctly_reserved(start_pfn, nr_pages))
return -EBUSY;

- ret = online_pages(start_pfn, nr_pages);
+ ret = online_pages(start_pfn, nr_pages, online_type);
break;
case MEM_OFFLINE:
ret = offline_pages(start_pfn, nr_pages);
@@ -276,7 +276,8 @@ memory_block_action(unsigned long phys_index, unsigned long action)
}

static int __memory_block_change_state(struct memory_block *mem,
- unsigned long to_state, unsigned long from_state_req)
+ unsigned long to_state, unsigned long from_state_req,
+ int online_type)
{
int ret = 0;

@@ -288,7 +289,7 @@ static int __memory_block_change_state(struct memory_block *mem,
if (to_state == MEM_OFFLINE)
mem->state = MEM_GOING_OFFLINE;

- ret = memory_block_action(mem->start_section_nr, to_state);
+ ret = memory_block_action(mem->start_section_nr, to_state, online_type);

if (ret) {
mem->state = from_state_req;
@@ -311,12 +312,14 @@ out:
}

static int memory_block_change_state(struct memory_block *mem,
- unsigned long to_state, unsigned long from_state_req)
+ unsigned long to_state, unsigned long from_state_req,
+ int online_type)
{
int ret;

mutex_lock(&mem->state_mutex);
- ret = __memory_block_change_state(mem, to_state, from_state_req);
+ ret = __memory_block_change_state(mem, to_state, from_state_req,
+ online_type);
mutex_unlock(&mem->state_mutex);

return ret;
@@ -330,10 +333,14 @@ store_mem_state(struct device *dev,

mem = container_of(dev, struct memory_block, dev);

- if (!strncmp(buf, "online", min((int)count, 6)))
- ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE);
+ if (!strncmp(buf, "online_kernel", min((int)count, 13)))
+ ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE, ONLINE_KERNEL);
+ else if (!strncmp(buf, "online_movable", min((int)count, 14)))
+ ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE, ONLINE_MOVABLE);
+ else if (!strncmp(buf, "online", min((int)count, 6)))
+ ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE, ONLINE_KEEP);
else if(!strncmp(buf, "offline", min((int)count, 7)))
- ret = memory_block_change_state(mem, MEM_OFFLINE, MEM_ONLINE);
+ ret = memory_block_change_state(mem, MEM_OFFLINE, MEM_ONLINE, -1);

if (ret)
return ret;
@@ -669,7 +676,7 @@ int offline_memory_block(struct memory_block *mem)

mutex_lock(&mem->state_mutex);
if (mem->state != MEM_OFFLINE)
- ret = __memory_block_change_state(mem, MEM_OFFLINE, MEM_ONLINE);
+ ret = __memory_block_change_state(mem, MEM_OFFLINE, MEM_ONLINE, -1);
mutex_unlock(&mem->state_mutex);

return ret;
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 95573ec..4a45c4e 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -26,6 +26,13 @@ enum {
MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE = NODE_INFO,
};

+/* Types for control the zone type of onlined memory */
+enum {
+ ONLINE_KEEP,
+ ONLINE_KERNEL,
+ ONLINE_MOVABLE,
+};
+
/*
* pgdat resizing functions
*/
@@ -46,6 +53,10 @@ void pgdat_resize_init(struct pglist_data *pgdat)
}
/*
* Zone resizing functions
+ *
+ * Note: any attempt to resize a zone should has pgdat_resize_lock()
+ * zone_span_writelock() both held. This ensure the size of a zone
+ * can't be changed while pgdat_resize_lock() held.
*/
static inline unsigned zone_span_seqbegin(struct zone *zone)
{
@@ -71,7 +82,7 @@ extern int zone_grow_free_lists(struct zone *zone, unsigned long new_nr_pages);
extern int zone_grow_waitqueues(struct zone *zone, unsigned long nr_pages);
extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
/* VM interface that may be used by firmware interface */
-extern int online_pages(unsigned long, unsigned long);
+extern int online_pages(unsigned long, unsigned long, int);
extern void __offline_isolated_pages(unsigned long, unsigned long);

typedef void (*online_page_callback_t)(struct page *page);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index a1920fb..6d3bec4 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -221,6 +221,89 @@ static void grow_zone_span(struct zone *zone, unsigned long start_pfn,
zone_span_writeunlock(zone);
}

+static void resize_zone(struct zone *zone, unsigned long start_pfn,
+ unsigned long end_pfn)
+{
+
+ zone_span_writelock(zone);
+
+ zone->zone_start_pfn = start_pfn;
+ zone->spanned_pages = end_pfn - start_pfn;
+
+ zone_span_writeunlock(zone);
+}
+
+static void fix_zone_id(struct zone *zone, unsigned long start_pfn,
+ unsigned long end_pfn)
+{
+ enum zone_type zid = zone_idx(zone);
+ int nid = zone->zone_pgdat->node_id;
+ unsigned long pfn;
+
+ for (pfn = start_pfn; pfn < end_pfn; pfn++)
+ set_page_links(pfn_to_page(pfn), zid, nid, pfn);
+}
+
+static int move_pfn_range_left(struct zone *z1, struct zone *z2,
+ unsigned long start_pfn, unsigned long end_pfn)
+{
+ unsigned long flags;
+
+ pgdat_resize_lock(z1->zone_pgdat, &flags);
+
+ /* can't move pfns which are higher than @z2 */
+ if (end_pfn > z2->zone_start_pfn + z2->spanned_pages)
+ goto out_fail;
+ /* the move out part mast at the left most of @z2 */
+ if (start_pfn > z2->zone_start_pfn)
+ goto out_fail;
+ /* must included/overlap */
+ if (end_pfn <= z2->zone_start_pfn)
+ goto out_fail;
+
+ resize_zone(z1, z1->zone_start_pfn, end_pfn);
+ resize_zone(z2, end_pfn, z2->zone_start_pfn + z2->spanned_pages);
+
+ pgdat_resize_unlock(z1->zone_pgdat, &flags);
+
+ fix_zone_id(z1, start_pfn, end_pfn);
+
+ return 0;
+out_fail:
+ pgdat_resize_unlock(z1->zone_pgdat, &flags);
+ return -1;
+}
+
+static int move_pfn_range_right(struct zone *z1, struct zone *z2,
+ unsigned long start_pfn, unsigned long end_pfn)
+{
+ unsigned long flags;
+
+ pgdat_resize_lock(z1->zone_pgdat, &flags);
+
+ /* can't move pfns which are lower than @z1 */
+ if (z1->zone_start_pfn > start_pfn)
+ goto out_fail;
+ /* the move out part mast at the right most of @z1 */
+ if (z1->zone_start_pfn + z1->spanned_pages > end_pfn)
+ goto out_fail;
+ /* must included/overlap */
+ if (start_pfn >= z1->zone_start_pfn + z1->spanned_pages)
+ goto out_fail;
+
+ resize_zone(z1, z1->zone_start_pfn, start_pfn);
+ resize_zone(z2, start_pfn, z2->zone_start_pfn + z2->spanned_pages);
+
+ pgdat_resize_unlock(z1->zone_pgdat, &flags);
+
+ fix_zone_id(z2, start_pfn, end_pfn);
+
+ return 0;
+out_fail:
+ pgdat_resize_unlock(z1->zone_pgdat, &flags);
+ return -1;
+}
+
static void grow_pgdat_span(struct pglist_data *pgdat, unsigned long start_pfn,
unsigned long end_pfn)
{
@@ -515,7 +598,7 @@ static void node_states_set_node(int node, struct memory_notify *arg)
}


-int __ref online_pages(unsigned long pfn, unsigned long nr_pages)
+int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_type)
{
unsigned long onlined_pages = 0;
struct zone *zone;
@@ -532,6 +615,22 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages)
*/
zone = page_zone(pfn_to_page(pfn));

+ if (online_type == ONLINE_KERNEL && zone_idx(zone) == ZONE_MOVABLE) {
+ if (move_pfn_range_left(zone - 1, zone, pfn, pfn + nr_pages)) {
+ unlock_memory_hotplug();
+ return -1;
+ }
+ }
+ if (online_type == ONLINE_MOVABLE && zone_idx(zone) == ZONE_MOVABLE - 1) {
+ if (move_pfn_range_right(zone, zone + 1, pfn, pfn + nr_pages)) {
+ unlock_memory_hotplug();
+ return -1;
+ }
+ }
+
+ /* Previous code may changed the zone of the pfn range */
+ zone = page_zone(pfn_to_page(pfn));
+
arg.start_pfn = pfn;
arg.nr_pages = nr_pages;
node_states_check_changes_online(nr_pages, zone, &arg);
--
1.7.4.4

2012-10-29 15:48:35

by Lai Jiangshan

Subject: [V5 PATCH 17/26] page_alloc: use N_MEMORY instead N_HIGH_MEMORY change the node_states initialization

N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.

Since we introduced N_MEMORY, we also update the initialization of node_states.
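
The difference between the three memory-related node states can be illustrated with a small user-space model. This is a sketch of the semantics described above, not the kernel's check_for_memory(): the zone enum, the flag names and classify_node() are all hypothetical, and the per-zone page counts are passed in as a plain array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical zone indices, ordered like the kernel's zones. */
enum zone_model { ZM_DMA, ZM_NORMAL, ZM_HIGHMEM, ZM_MOVABLE, ZM_MAX };

/* Hypothetical flags standing in for the node state bitmaps. */
#define HAS_NORMAL_MEMORY 0x1u  /* models N_NORMAL_MEMORY */
#define HAS_HIGH_MEMORY   0x2u  /* models N_HIGH_MEMORY: normal or high */
#define HAS_MEMORY        0x4u  /* models N_MEMORY: any memory at all */

/* Derive a node's states from the present pages of each of its zones. */
static unsigned int classify_node(const unsigned long present[ZM_MAX])
{
    unsigned int states = 0;
    int z;

    assert(present != NULL);
    for (z = 0; z < ZM_MAX; z++) {
        if (!present[z])
            continue;
        states |= HAS_MEMORY;             /* any zone counts */
        if (z < ZM_MOVABLE)
            states |= HAS_HIGH_MEMORY;    /* everything below ZONE_MOVABLE */
        if (z <= ZM_NORMAL)
            states |= HAS_NORMAL_MEMORY;  /* kernel-usable lowmem only */
    }
    return states;
}
```

A node whose only populated zone is the movable one ends up with HAS_MEMORY alone, which is the case the older N_HIGH_MEMORY bitmap could not express.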

Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/mm/init_64.c | 4 +++-
mm/page_alloc.c | 40 ++++++++++++++++++++++------------------
2 files changed, 25 insertions(+), 19 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 3baff25..2ead3c8 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -630,7 +630,9 @@ void __init paging_init(void)
* numa support is not compiled in, and later node_set_state
* will not set it back.
*/
- node_clear_state(0, N_NORMAL_MEMORY);
+ node_clear_state(0, N_MEMORY);
+ if (N_MEMORY != N_NORMAL_MEMORY)
+ node_clear_state(0, N_NORMAL_MEMORY);

zone_sizes_init();
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b1ef9b0..b70c929 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1692,7 +1692,7 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
*
* If the zonelist cache is present in the passed in zonelist, then
* returns a pointer to the allowed node mask (either the current
- * tasks mems_allowed, or node_states[N_HIGH_MEMORY].)
+ * tasks mems_allowed, or node_states[N_MEMORY].)
*
* If the zonelist cache is not available for this zonelist, does
* nothing and returns NULL.
@@ -1721,7 +1721,7 @@ static nodemask_t *zlc_setup(struct zonelist *zonelist, int alloc_flags)

allowednodes = !in_interrupt() && (alloc_flags & ALLOC_CPUSET) ?
&cpuset_current_mems_allowed :
- &node_states[N_HIGH_MEMORY];
+ &node_states[N_MEMORY];
return allowednodes;
}

@@ -3194,7 +3194,7 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask)
return node;
}

- for_each_node_state(n, N_HIGH_MEMORY) {
+ for_each_node_state(n, N_MEMORY) {

/* Don't want a node to appear more than once */
if (node_isset(n, *used_node_mask))
@@ -3336,7 +3336,7 @@ static int default_zonelist_order(void)
* local memory, NODE_ORDER may be suitable.
*/
average_size = total_size /
- (nodes_weight(node_states[N_HIGH_MEMORY]) + 1);
+ (nodes_weight(node_states[N_MEMORY]) + 1);
for_each_online_node(nid) {
low_kmem_size = 0;
total_size = 0;
@@ -4669,7 +4669,7 @@ unsigned long __init find_min_pfn_with_active_regions(void)
/*
* early_calculate_totalpages()
* Sum pages in active regions for movable zone.
- * Populate N_HIGH_MEMORY for calculating usable_nodes.
+ * Populate N_MEMORY for calculating usable_nodes.
*/
static unsigned long __init early_calculate_totalpages(void)
{
@@ -4682,7 +4682,7 @@ static unsigned long __init early_calculate_totalpages(void)

totalpages += pages;
if (pages)
- node_set_state(nid, N_HIGH_MEMORY);
+ node_set_state(nid, N_MEMORY);
}
return totalpages;
}
@@ -4699,9 +4699,9 @@ static void __init find_zone_movable_pfns_for_nodes(void)
unsigned long usable_startpfn;
unsigned long kernelcore_node, kernelcore_remaining;
/* save the state before borrow the nodemask */
- nodemask_t saved_node_state = node_states[N_HIGH_MEMORY];
+ nodemask_t saved_node_state = node_states[N_MEMORY];
unsigned long totalpages = early_calculate_totalpages();
- int usable_nodes = nodes_weight(node_states[N_HIGH_MEMORY]);
+ int usable_nodes = nodes_weight(node_states[N_MEMORY]);

/*
* If movablecore was specified, calculate what size of
@@ -4736,7 +4736,7 @@ static void __init find_zone_movable_pfns_for_nodes(void)
restart:
/* Spread kernelcore memory as evenly as possible throughout nodes */
kernelcore_node = required_kernelcore / usable_nodes;
- for_each_node_state(nid, N_HIGH_MEMORY) {
+ for_each_node_state(nid, N_MEMORY) {
unsigned long start_pfn, end_pfn;

/*
@@ -4828,23 +4828,27 @@ restart:

out:
/* restore the node_state */
- node_states[N_HIGH_MEMORY] = saved_node_state;
+ node_states[N_MEMORY] = saved_node_state;
}

-/* Any regular memory on that node ? */
-static void __init check_for_regular_memory(pg_data_t *pgdat)
+/* Any regular or high memory on that node ? */
+static void check_for_memory(pg_data_t *pgdat, int nid)
{
-#ifdef CONFIG_HIGHMEM
enum zone_type zone_type;

- for (zone_type = 0; zone_type <= ZONE_NORMAL; zone_type++) {
+ if (N_MEMORY == N_NORMAL_MEMORY)
+ return;
+
+ for (zone_type = 0; zone_type <= ZONE_MOVABLE - 1; zone_type++) {
struct zone *zone = &pgdat->node_zones[zone_type];
if (zone->present_pages) {
- node_set_state(zone_to_nid(zone), N_NORMAL_MEMORY);
+ node_set_state(nid, N_HIGH_MEMORY);
+ if (N_NORMAL_MEMORY != N_HIGH_MEMORY &&
+ zone_type <= ZONE_NORMAL)
+ node_set_state(nid, N_NORMAL_MEMORY);
break;
}
}
-#endif
}

/**
@@ -4927,8 +4931,8 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)

/* Any memory on that node */
if (pgdat->node_present_pages)
- node_set_state(nid, N_HIGH_MEMORY);
- check_for_regular_memory(pgdat);
+ node_set_state(nid, N_MEMORY);
+ check_for_memory(pgdat, nid);
}
}

--
1.7.4.4

2012-10-29 15:48:32

by Lai Jiangshan

Subject: [V5 PATCH 13/26] vmstat: use N_MEMORY instead N_HIGH_MEMORY

N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
---
mm/vmstat.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index c737057..1b5cacd 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -930,7 +930,7 @@ static int pagetypeinfo_show(struct seq_file *m, void *arg)
pg_data_t *pgdat = (pg_data_t *)arg;

/* check memoryless node */
- if (!node_state(pgdat->node_id, N_HIGH_MEMORY))
+ if (!node_state(pgdat->node_id, N_MEMORY))
return 0;

seq_printf(m, "Page block order: %d\n", pageblock_order);
@@ -1292,7 +1292,7 @@ static int unusable_show(struct seq_file *m, void *arg)
pg_data_t *pgdat = (pg_data_t *)arg;

/* check memoryless node */
- if (!node_state(pgdat->node_id, N_HIGH_MEMORY))
+ if (!node_state(pgdat->node_id, N_MEMORY))
return 0;

walk_zones_in_node(m, pgdat, unusable_show_print);
--
1.7.4.4

2012-10-29 15:48:30

by Lai Jiangshan

Subject: [V5 PATCH 04/26] node: cleanup node_state_attr

Use [index] = init_value.
Use N_xxxxx instead of hard-coded indices.

This makes the code more readable and makes it easier to add a new state.
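
For readers unfamiliar with the C99 feature this cleanup relies on, here is a minimal stand-alone sketch of designated array initializers indexed by an enum. The enum, table and helper names are hypothetical; only the attribute name strings are taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical enum standing in for enum node_states. */
enum state_model { SM_POSSIBLE, SM_ONLINE, SM_NORMAL_MEMORY, SM_CPU, SM_NR };

/* Each slot is tied to its enum value, so reordering entries is harmless. */
static const char *const state_name[] = {
    [SM_POSSIBLE]      = "possible",
    [SM_ONLINE]        = "online",
    [SM_NORMAL_MEMORY] = "has_normal_memory",
    [SM_CPU]           = "has_cpu",
};

/* The array is sized by its highest index, so this catches a missing tail. */
static_assert(sizeof(state_name) / sizeof(state_name[0]) == SM_NR,
              "state_name must cover every state");

static const char *state_to_name(enum state_model s)
{
    return (unsigned int)s < SM_NR ? state_name[s] : NULL;
}

/* Reverse lookup: return the state index for a name, or -1 if unknown. */
static int name_to_state(const char *name)
{
    int s;

    for (s = 0; s < SM_NR; s++)
        if (state_name[s] && strcmp(state_name[s], name) == 0)
            return s;
    return -1;
}
```

Adding a new state then means adding one enum value and one table line, without renumbering any existing index.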

Signed-off-by: Lai Jiangshan <[email protected]>
---
drivers/base/node.c | 20 ++++++++++----------
1 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index af1a177..5d7731e 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -614,23 +614,23 @@ static ssize_t show_node_state(struct device *dev,
{ __ATTR(name, 0444, show_node_state, NULL), state }

static struct node_attr node_state_attr[] = {
- _NODE_ATTR(possible, N_POSSIBLE),
- _NODE_ATTR(online, N_ONLINE),
- _NODE_ATTR(has_normal_memory, N_NORMAL_MEMORY),
- _NODE_ATTR(has_cpu, N_CPU),
+ [N_POSSIBLE] = _NODE_ATTR(possible, N_POSSIBLE),
+ [N_ONLINE] = _NODE_ATTR(online, N_ONLINE),
+ [N_NORMAL_MEMORY] = _NODE_ATTR(has_normal_memory, N_NORMAL_MEMORY),
#ifdef CONFIG_HIGHMEM
- _NODE_ATTR(has_high_memory, N_HIGH_MEMORY),
+ [N_HIGH_MEMORY] = _NODE_ATTR(has_high_memory, N_HIGH_MEMORY),
#endif
+ [N_CPU] = _NODE_ATTR(has_cpu, N_CPU),
};

static struct attribute *node_state_attrs[] = {
- &node_state_attr[0].attr.attr,
- &node_state_attr[1].attr.attr,
- &node_state_attr[2].attr.attr,
- &node_state_attr[3].attr.attr,
+ &node_state_attr[N_POSSIBLE].attr.attr,
+ &node_state_attr[N_ONLINE].attr.attr,
+ &node_state_attr[N_NORMAL_MEMORY].attr.attr,
#ifdef CONFIG_HIGHMEM
- &node_state_attr[4].attr.attr,
+ &node_state_attr[N_HIGH_MEMORY].attr.attr,
#endif
+ &node_state_attr[N_CPU].attr.attr,
NULL
};

--
1.7.4.4

2012-10-29 15:48:28

by Lai Jiangshan

Subject: [V5 PATCH 20/26] memory_hotplug: allow online/offline memory to result movable node

Memory management can now handle movable nodes, i.e. nodes which don't have
any normal memory, so we can dynamically configure and add a movable node by:
onlining ZONE_MOVABLE memory on a previously offline node
offlining the last normal memory, which results in a node without normal memory

Movable nodes are very important for power saving,
hardware partitioning and high-availability systems (hardware fault management).
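
The two gates this patch relaxes can be modelled as plain predicates. This is a sketch under assumptions: the config option becomes a run-time boolean parameter, the page counts are passed in directly instead of being read from zones, and the function names are hypothetical. The offline rule is inferred from the comments and the final "return present_pages == 0" visible in the diff context.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * May high/movable memory be onlined on this node?
 * Without movable-node support the node must already have normal memory.
 */
static bool may_online_movable(bool movable_node_enabled,
                               bool node_has_normal_memory)
{
    return movable_node_enabled || node_has_normal_memory;
}

/*
 * May nr_pages of normal memory be offlined from this node?
 *   normal_pages: normal memory currently present on the node
 *   other_pages:  high/movable memory currently present on the node
 */
static bool may_offline_normal(bool movable_node_enabled,
                               unsigned long normal_pages,
                               unsigned long other_pages,
                               unsigned long nr_pages)
{
    assert(nr_pages > 0);

    if (movable_node_enabled)
        return true;            /* a node with only movable memory is fine */
    if (normal_pages > nr_pages)
        return true;            /* some normal memory remains */
    return other_pages == 0;    /* the node becomes empty, so it goes offline */
}
```

With movable-node support disabled, the last normal block of a node that still holds movable memory cannot be offlined; with it enabled, the same request succeeds and leaves a movable node behind.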


Signed-off-by: Lai Jiangshan <[email protected]>
---
mm/memory_hotplug.c | 16 ++++++++++++++++
1 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index a55b547..756744c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -589,11 +589,19 @@ static int online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
return 0;
}

+#ifdef CONFIG_MOVABLE_NODE
+/* when CONFIG_MOVABLE_NODE, we allow online node don't have normal memory */
+static bool can_online_high_movable(struct zone *zone)
+{
+ return true;
+}
+#else /* #ifdef CONFIG_MOVABLE_NODE */
/* ensure every online node has NORMAL memory */
static bool can_online_high_movable(struct zone *zone)
{
return node_state(zone_to_nid(zone), N_NORMAL_MEMORY);
}
+#endif /* #ifdef CONFIG_MOVABLE_NODE */

/* check which state of node_states will be changed when online memory */
static void node_states_check_changes_online(unsigned long nr_pages,
@@ -1097,6 +1105,13 @@ check_pages_isolated(unsigned long start_pfn, unsigned long end_pfn)
return offlined;
}

+#ifdef CONFIG_MOVABLE_NODE
+/* when CONFIG_MOVABLE_NODE, we allow online node don't have normal memory */
+static bool can_offline_normal(struct zone *zone, unsigned long nr_pages)
+{
+ return true;
+}
+#else /* #ifdef CONFIG_MOVABLE_NODE */
/* ensure the node has NORMAL memory if it is still online */
static bool can_offline_normal(struct zone *zone, unsigned long nr_pages)
{
@@ -1120,6 +1135,7 @@ static bool can_offline_normal(struct zone *zone, unsigned long nr_pages)
*/
return present_pages == 0;
}
+#endif /* #ifdef CONFIG_MOVABLE_NODE */

/* check which state of node_states will be changed when offline memory */
static void node_states_check_changes_offline(unsigned long nr_pages,
--
1.7.4.4

2012-10-29 15:49:44

by Lai Jiangshan

Subject: [V5 PATCH 19/26] numa: add CONFIG_MOVABLE_NODE for movable-dedicated node

Everything is prepared, so we can now actually introduce N_MEMORY.
Add CONFIG_MOVABLE_NODE so that we can use it for movable-dedicated nodes.
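
The way the new state either gets its own slot or aliases an existing one can be reproduced in a few lines. The sketch below mirrors the shape of the enum in include/linux/nodemask.h after this patch, but it is a user-space model: the NM_ names are hypothetical, and the two MODEL_ macros stand in for CONFIG_HIGHMEM and CONFIG_MOVABLE_NODE.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for CONFIG_HIGHMEM and CONFIG_MOVABLE_NODE (assumed values). */
#define MODEL_HIGHMEM      0
#define MODEL_MOVABLE_NODE 1

enum node_states_model {
    NM_POSSIBLE,
    NM_ONLINE,
    NM_NORMAL_MEMORY,
#if MODEL_HIGHMEM
    NM_HIGH_MEMORY,                     /* own slot */
#else
    NM_HIGH_MEMORY = NM_NORMAL_MEMORY,  /* alias, no extra bitmap */
#endif
#if MODEL_MOVABLE_NODE
    NM_MEMORY,                          /* own slot */
#else
    NM_MEMORY = NM_HIGH_MEMORY,         /* alias, no extra bitmap */
#endif
    NM_CPU,
    NM_NR
};

/* Each enabled option adds exactly one bitmap to the four base states. */
static_assert(NM_NR == 4 + MODEL_HIGHMEM + MODEL_MOVABLE_NODE,
              "one slot per distinct state");

/* True when "has any memory" is tracked separately from "normal or high". */
static bool memory_state_is_separate(void)
{
    return NM_MEMORY != NM_HIGH_MEMORY;
}
```

Because an aliased state shares its bitmap, code that clears or sets one name also affects the other, which is why callers in this series compare the enum values before touching a second state.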

Signed-off-by: Lai Jiangshan <[email protected]>
---
drivers/base/node.c | 6 ++++++
include/linux/nodemask.h | 4 ++++
mm/Kconfig | 8 ++++++++
mm/page_alloc.c | 3 +++
4 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 4c3aa7c..9cdd66f 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -620,6 +620,9 @@ static struct node_attr node_state_attr[] = {
#ifdef CONFIG_HIGHMEM
[N_HIGH_MEMORY] = _NODE_ATTR(has_high_memory, N_HIGH_MEMORY),
#endif
+#ifdef CONFIG_MOVABLE_NODE
+ [N_MEMORY] = _NODE_ATTR(has_memory, N_MEMORY),
+#endif
[N_CPU] = _NODE_ATTR(has_cpu, N_CPU),
};

@@ -630,6 +633,9 @@ static struct attribute *node_state_attrs[] = {
#ifdef CONFIG_HIGHMEM
&node_state_attr[N_HIGH_MEMORY].attr.attr,
#endif
+#ifdef CONFIG_MOVABLE_NODE
+ &node_state_attr[N_MEMORY].attr.attr,
+#endif
&node_state_attr[N_CPU].attr.attr,
NULL
};
diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index c6ebdc9..4e2cbfa 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -380,7 +380,11 @@ enum node_states {
#else
N_HIGH_MEMORY = N_NORMAL_MEMORY,
#endif
+#ifdef CONFIG_MOVABLE_NODE
+ N_MEMORY, /* The node has memory(regular, high, movable) */
+#else
N_MEMORY = N_HIGH_MEMORY,
+#endif
N_CPU, /* The node has one or more cpus */
NR_NODE_STATES
};
diff --git a/mm/Kconfig b/mm/Kconfig
index a3f8ddd..957ebd5 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -143,6 +143,14 @@ config NO_BOOTMEM
config MEMORY_ISOLATION
boolean

+config MOVABLE_NODE
+ boolean "Enable to assign a node has only movable memory"
+ depends on HAVE_MEMBLOCK
+ depends on NO_BOOTMEM
+ depends on X86_64
+ depends on NUMA
+ default y
+
# eventually, we can have this option just 'select SPARSEMEM'
config MEMORY_HOTPLUG
bool "Allow for memory hot-add"
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b70c929..a42337f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -90,6 +90,9 @@ nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
#ifdef CONFIG_HIGHMEM
[N_HIGH_MEMORY] = { { [0] = 1UL } },
#endif
+#ifdef CONFIG_MOVABLE_NODE
+ [N_MEMORY] = { { [0] = 1UL } },
+#endif
[N_CPU] = { { [0] = 1UL } },
#endif /* NUMA */
};
--
1.7.4.4

2012-10-29 15:49:42

by Lai Jiangshan

Subject: [V5 PATCH 08/26] memcontrol: use N_MEMORY instead N_HIGH_MEMORY

N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan <[email protected]>
---
mm/memcontrol.c | 18 +++++++++---------
mm/page_cgroup.c | 2 +-
2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7acf43b..1b69665 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -800,7 +800,7 @@ static unsigned long mem_cgroup_nr_lru_pages(struct mem_cgroup *memcg,
int nid;
u64 total = 0;

- for_each_node_state(nid, N_HIGH_MEMORY)
+ for_each_node_state(nid, N_MEMORY)
total += mem_cgroup_node_nr_lru_pages(memcg, nid, lru_mask);
return total;
}
@@ -1611,9 +1611,9 @@ static void mem_cgroup_may_update_nodemask(struct mem_cgroup *memcg)
return;

/* make a nodemask where this memcg uses memory from */
- memcg->scan_nodes = node_states[N_HIGH_MEMORY];
+ memcg->scan_nodes = node_states[N_MEMORY];

- for_each_node_mask(nid, node_states[N_HIGH_MEMORY]) {
+ for_each_node_mask(nid, node_states[N_MEMORY]) {

if (!test_mem_cgroup_node_reclaimable(memcg, nid, false))
node_clear(nid, memcg->scan_nodes);
@@ -1684,7 +1684,7 @@ static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap)
/*
* Check rest of nodes.
*/
- for_each_node_state(nid, N_HIGH_MEMORY) {
+ for_each_node_state(nid, N_MEMORY) {
if (node_isset(nid, memcg->scan_nodes))
continue;
if (test_mem_cgroup_node_reclaimable(memcg, nid, noswap))
@@ -3759,7 +3759,7 @@ move_account:
drain_all_stock_sync(memcg);
ret = 0;
mem_cgroup_start_move(memcg);
- for_each_node_state(node, N_HIGH_MEMORY) {
+ for_each_node_state(node, N_MEMORY) {
for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
enum lru_list lru;
for_each_lru(lru) {
@@ -4087,7 +4087,7 @@ static int memcg_numa_stat_show(struct cgroup *cont, struct cftype *cft,

total_nr = mem_cgroup_nr_lru_pages(memcg, LRU_ALL);
seq_printf(m, "total=%lu", total_nr);
- for_each_node_state(nid, N_HIGH_MEMORY) {
+ for_each_node_state(nid, N_MEMORY) {
node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL);
seq_printf(m, " N%d=%lu", nid, node_nr);
}
@@ -4095,7 +4095,7 @@ static int memcg_numa_stat_show(struct cgroup *cont, struct cftype *cft,

file_nr = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_FILE);
seq_printf(m, "file=%lu", file_nr);
- for_each_node_state(nid, N_HIGH_MEMORY) {
+ for_each_node_state(nid, N_MEMORY) {
node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid,
LRU_ALL_FILE);
seq_printf(m, " N%d=%lu", nid, node_nr);
@@ -4104,7 +4104,7 @@ static int memcg_numa_stat_show(struct cgroup *cont, struct cftype *cft,

anon_nr = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_ANON);
seq_printf(m, "anon=%lu", anon_nr);
- for_each_node_state(nid, N_HIGH_MEMORY) {
+ for_each_node_state(nid, N_MEMORY) {
node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid,
LRU_ALL_ANON);
seq_printf(m, " N%d=%lu", nid, node_nr);
@@ -4113,7 +4113,7 @@ static int memcg_numa_stat_show(struct cgroup *cont, struct cftype *cft,

unevictable_nr = mem_cgroup_nr_lru_pages(memcg, BIT(LRU_UNEVICTABLE));
seq_printf(m, "unevictable=%lu", unevictable_nr);
- for_each_node_state(nid, N_HIGH_MEMORY) {
+ for_each_node_state(nid, N_MEMORY) {
node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid,
BIT(LRU_UNEVICTABLE));
seq_printf(m, " N%d=%lu", nid, node_nr);
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 5ddad0c..c1054ad 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -271,7 +271,7 @@ void __init page_cgroup_init(void)
if (mem_cgroup_disabled())
return;

- for_each_node_state(nid, N_HIGH_MEMORY) {
+ for_each_node_state(nid, N_MEMORY) {
unsigned long start_pfn, end_pfn;

start_pfn = node_start_pfn(nid);
--
1.7.4.4

2012-10-29 15:48:25

by Lai Jiangshan

Subject: [V5 PATCH 09/26] oom: use N_MEMORY instead N_HIGH_MEMORY

N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan <[email protected]>
Acked-by: Hillf Danton <[email protected]>
---
mm/oom_kill.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 79e0f3e..aa2d89c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -257,7 +257,7 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
* the page allocator means a mempolicy is in effect. Cpuset policy
* is enforced in get_page_from_freelist().
*/
- if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask)) {
+ if (nodemask && !nodes_subset(node_states[N_MEMORY], *nodemask)) {
*totalpages = total_swap_pages;
for_each_node_mask(nid, *nodemask)
*totalpages += node_spanned_pages(nid);
--
1.7.4.4

2012-10-29 15:50:23

by Lai Jiangshan

Subject: [V5 PATCH 24/26] memblock: limit memory address from memblock

From: Yasuaki Ishimatsu <[email protected]>

Setting the kernelcore_max_addr boot parameter (stored as
required_kernelcore_max_pfn) means all memory above that address is
allocated as ZONE_MOVABLE. So memory which is allocated by memblock
should also be limited by the parameter.

The patch limits the memory that memblock may hand out.

Signed-off-by: Yasuaki Ishimatsu <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---
include/linux/memblock.h | 1 +
mm/memblock.c | 5 ++++-
mm/page_alloc.c | 6 +++++-
3 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index d452ee1..3e52911 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -42,6 +42,7 @@ struct memblock {

extern struct memblock memblock;
extern int memblock_debug;
+extern phys_addr_t memblock_limit;

#define memblock_dbg(fmt, ...) \
if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
diff --git a/mm/memblock.c b/mm/memblock.c
index 6259055..ee2e307 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -957,7 +957,10 @@ void __init_memblock memblock_trim_memory(phys_addr_t align)

void __init_memblock memblock_set_current_limit(phys_addr_t limit)
{
- memblock.current_limit = limit;
+ if (!memblock_limit || (memblock_limit > limit))
+ memblock.current_limit = limit;
+ else
+ memblock.current_limit = memblock_limit;
}

static void __init_memblock memblock_dump(struct memblock_type *type, char *name)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 11df8b5..f76b696 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -208,6 +208,8 @@ static unsigned long __initdata required_kernelcore;
static unsigned long __initdata required_movablecore;
static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];

+phys_addr_t memblock_limit;
+
/* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
int movable_zone;
EXPORT_SYMBOL(movable_zone);
@@ -4976,7 +4978,9 @@ static int __init cmdline_parse_core(char *p, unsigned long *core)
*/
static int __init cmdline_parse_kernelcore_max_addr(char *p)
{
- return cmdline_parse_core(p, &required_kernelcore_max_pfn);
+ cmdline_parse_core(p, &required_kernelcore_max_pfn);
+ memblock_limit = (phys_addr_t)required_kernelcore_max_pfn << PAGE_SHIFT;
+ return 0;
}
early_param("kernelcore_max_addr", cmdline_parse_kernelcore_max_addr);
#endif
--
1.7.4.4
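
The clamping rule added to memblock_set_current_limit() can be sketched in isolation. This is only a userspace model under stated assumptions: phys_addr_t is replaced by a 64-bit integer, and clamp_current_limit() is a hypothetical name that takes the two values as parameters instead of reading the kernel globals.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t phys_addr_t;	/* stand-in for the kernel type (assumption) */

/*
 * Hypothetical model of the patched memblock_set_current_limit():
 * memblock_limit == 0 means no kernelcore_max_addr was given, so the
 * requested limit is used unchanged. Otherwise the smaller value wins.
 */
phys_addr_t clamp_current_limit(phys_addr_t memblock_limit, phys_addr_t limit)
{
	if (!memblock_limit || memblock_limit > limit)
		return limit;
	return memblock_limit;
}
```

With no boot parameter the requested limit passes through; with a boot parameter below the requested limit, the boot parameter takes precedence.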

2012-10-29 15:50:46

by Lai Jiangshan

Subject: [V5 PATCH 18/26] hotplug: update nodemasks management

Update nodemask management for N_MEMORY.

Signed-off-by: Lai Jiangshan <[email protected]>
---
Documentation/memory-hotplug.txt | 5 ++-
include/linux/memory.h | 1 +
mm/memory_hotplug.c | 87 +++++++++++++++++++++++++++++++-------
3 files changed, 77 insertions(+), 16 deletions(-)

diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt
index c6f993d..8e5eacb 100644
--- a/Documentation/memory-hotplug.txt
+++ b/Documentation/memory-hotplug.txt
@@ -390,6 +390,7 @@ struct memory_notify {
unsigned long start_pfn;
unsigned long nr_pages;
int status_change_nid_normal;
+ int status_change_nid_high;
int status_change_nid;
}

@@ -397,7 +398,9 @@ start_pfn is start_pfn of online/offline memory.
nr_pages is # of pages of online/offline memory.
status_change_nid_normal is set node id when N_NORMAL_MEMORY of nodemask
is (will be) set/clear, if this is -1, then nodemask status is not changed.
-status_change_nid is set node id when N_HIGH_MEMORY of nodemask is (will be)
+status_change_nid_high is set node id when N_HIGH_MEMORY of nodemask
+is (will be) set/clear, if this is -1, then nodemask status is not changed.
+status_change_nid is set node id when N_MEMORY of nodemask is (will be)
set/clear. It means a new(memoryless) node gets new memory by online and a
node loses all memory. If this is -1, then nodemask status is not changed.
If status_changed_nid* >= 0, callback should create/discard structures for the
diff --git a/include/linux/memory.h b/include/linux/memory.h
index a09216d..45e93b4 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -54,6 +54,7 @@ struct memory_notify {
unsigned long start_pfn;
unsigned long nr_pages;
int status_change_nid_normal;
+ int status_change_nid_high;
int status_change_nid;
};

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9af9641..a55b547 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -603,13 +603,15 @@ static void node_states_check_changes_online(unsigned long nr_pages,
enum zone_type zone_last = ZONE_NORMAL;

/*
- * If we have HIGHMEM, node_states[N_NORMAL_MEMORY] contains nodes
- * which have 0...ZONE_NORMAL, set zone_last to ZONE_NORMAL.
+ * If we have HIGHMEM or movable node, node_states[N_NORMAL_MEMORY]
+ * contains nodes which have zones of 0...ZONE_NORMAL,
+ * set zone_last to ZONE_NORMAL.
*
- * If we don't have HIGHMEM, node_states[N_NORMAL_MEMORY] contains nodes
- * which have 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE.
+ * If we don't have HIGHMEM nor movable node,
+ * node_states[N_NORMAL_MEMORY] contains nodes which have zones of
+ * 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE.
*/
- if (N_HIGH_MEMORY == N_NORMAL_MEMORY)
+ if (N_MEMORY == N_NORMAL_MEMORY)
zone_last = ZONE_MOVABLE;

/*
@@ -623,12 +625,34 @@ static void node_states_check_changes_online(unsigned long nr_pages,
else
arg->status_change_nid_normal = -1;

+#ifdef CONFIG_HIGHMEM
+ /*
+ * If we have movable node, node_states[N_HIGH_MEMORY]
+ * contains nodes which have zones of 0...ZONE_HIGHMEM,
+ * set zone_last to ZONE_HIGHMEM.
+ *
+ * If we don't have movable node, node_states[N_HIGH_MEMORY]
+ * contains nodes which have zones of 0...ZONE_MOVABLE,
+ * set zone_last to ZONE_MOVABLE.
+ */
+ zone_last = ZONE_HIGHMEM;
+ if (N_MEMORY == N_HIGH_MEMORY)
+ zone_last = ZONE_MOVABLE;
+
+ if (zone_idx(zone) <= zone_last && !node_state(nid, N_HIGH_MEMORY))
+ arg->status_change_nid_high = nid;
+ else
+ arg->status_change_nid_high = -1;
+#else
+ arg->status_change_nid_high = arg->status_change_nid_normal;
+#endif
+
/*
* if the node don't have memory befor online, we will need to
- * set the node to node_states[N_HIGH_MEMORY] after the memory
+ * set the node to node_states[N_MEMORY] after the memory
* is online.
*/
- if (!node_state(nid, N_HIGH_MEMORY))
+ if (!node_state(nid, N_MEMORY))
arg->status_change_nid = nid;
else
arg->status_change_nid = -1;
@@ -639,7 +663,10 @@ static void node_states_set_node(int node, struct memory_notify *arg)
if (arg->status_change_nid_normal >= 0)
node_set_state(node, N_NORMAL_MEMORY);

- node_set_state(node, N_HIGH_MEMORY);
+ if (arg->status_change_nid_high >= 0)
+ node_set_state(node, N_HIGH_MEMORY);
+
+ node_set_state(node, N_MEMORY);
}


@@ -1103,13 +1130,15 @@ static void node_states_check_changes_offline(unsigned long nr_pages,
enum zone_type zt, zone_last = ZONE_NORMAL;

/*
- * If we have HIGHMEM, node_states[N_NORMAL_MEMORY] contains nodes
- * which have 0...ZONE_NORMAL, set zone_last to ZONE_NORMAL.
+ * If we have HIGHMEM or movable node, node_states[N_NORMAL_MEMORY]
+ * contains nodes which have zones of 0...ZONE_NORMAL,
+ * set zone_last to ZONE_NORMAL.
*
- * If we don't have HIGHMEM, node_states[N_NORMAL_MEMORY] contains nodes
- * which have 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE.
+ * If we don't have HIGHMEM nor movable node,
+ * node_states[N_NORMAL_MEMORY] contains nodes which have zones of
+ * 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE.
*/
- if (N_HIGH_MEMORY == N_NORMAL_MEMORY)
+ if (N_MEMORY == N_NORMAL_MEMORY)
zone_last = ZONE_MOVABLE;

/*
@@ -1126,6 +1155,30 @@ static void node_states_check_changes_offline(unsigned long nr_pages,
else
arg->status_change_nid_normal = -1;

+#ifdef CONFIG_HIGHMEM
+ /*
+ * If we have movable node, node_states[N_HIGH_MEMORY]
+ * contains nodes which have zones of 0...ZONE_HIGHMEM,
+ * set zone_last to ZONE_HIGHMEM.
+ *
+ * If we don't have movable node, node_states[N_HIGH_MEMORY]
+ * contains nodes which have zones of 0...ZONE_MOVABLE,
+ * set zone_last to ZONE_MOVABLE.
+ */
+ zone_last = ZONE_HIGHMEM;
+ if (N_MEMORY == N_HIGH_MEMORY)
+ zone_last = ZONE_MOVABLE;
+
+ for (; zt <= zone_last; zt++)
+ present_pages += pgdat->node_zones[zt].present_pages;
+ if (zone_idx(zone) <= zone_last && nr_pages >= present_pages)
+ arg->status_change_nid_high = zone_to_nid(zone);
+ else
+ arg->status_change_nid_high = -1;
+#else
+ arg->status_change_nid_high = arg->status_change_nid_normal;
+#endif
+
/*
* node_states[N_HIGH_MEMORY] contains nodes which have 0...ZONE_MOVABLE
*/
@@ -1150,9 +1203,13 @@ static void node_states_clear_node(int node, struct memory_notify *arg)
if (arg->status_change_nid_normal >= 0)
node_clear_state(node, N_NORMAL_MEMORY);

- if ((N_HIGH_MEMORY != N_NORMAL_MEMORY) &&
- (arg->status_change_nid >= 0))
+ if ((N_MEMORY != N_NORMAL_MEMORY) &&
+ (arg->status_change_nid_high >= 0))
node_clear_state(node, N_HIGH_MEMORY);
+
+ if ((N_MEMORY != N_HIGH_MEMORY) &&
+ (arg->status_change_nid >= 0))
+ node_clear_state(node, N_MEMORY);
}

static int __ref __offline_pages(unsigned long start_pfn,
--
1.7.4.4
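
As a rough illustration of how the three status_change_nid* fields drive the node masks when memory comes online, here is a toy userspace model. It assumes a configuration where N_NORMAL_MEMORY, N_HIGH_MEMORY and N_MEMORY are three distinct masks; struct node_masks, struct notify_model and set_node_model() are hypothetical names, and each mask is a plain bitmap with one bit per node.

```c
#include <assert.h>

/* Toy stand-ins for three node_states[] bitmaps (assumption: all distinct). */
struct node_masks {
	unsigned long normal;	/* models node_states[N_NORMAL_MEMORY] */
	unsigned long high;	/* models node_states[N_HIGH_MEMORY] */
	unsigned long memory;	/* models node_states[N_MEMORY] */
};

/* Only the fields of struct memory_notify that matter here. */
struct notify_model {
	int status_change_nid_normal;
	int status_change_nid_high;
	int status_change_nid;
};

/*
 * Models the patched node_states_set_node(): the NORMAL and HIGH masks
 * are set only when the matching field is >= 0, while the N_MEMORY bit
 * is set unconditionally for a node that just got memory online.
 */
struct node_masks set_node_model(struct node_masks m, int node,
				 struct notify_model arg)
{
	if (arg.status_change_nid_normal >= 0)
		m.normal |= 1UL << node;
	if (arg.status_change_nid_high >= 0)
		m.high |= 1UL << node;
	m.memory |= 1UL << node;
	return m;
}
```

Onlining only movable memory on an empty node 1 would leave the normal and high masks untouched and set bit 1 in the memory mask, which is the state a movable-only node is expected to have.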

2012-10-29 15:51:00

by Lai Jiangshan

Subject: [V5 PATCH 07/26] procfs: use N_MEMORY instead N_HIGH_MEMORY

N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan <[email protected]>
Acked-by: Hillf Danton <[email protected]>
---
fs/proc/kcore.c | 2 +-
fs/proc/task_mmu.c | 4 ++--
2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 86c67ee..e96d4f1 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -249,7 +249,7 @@ static int kcore_update_ram(void)
/* Not inialized....update now */
/* find out "max pfn" */
end_pfn = 0;
- for_each_node_state(nid, N_HIGH_MEMORY) {
+ for_each_node_state(nid, N_MEMORY) {
unsigned long node_end;
node_end = NODE_DATA(nid)->node_start_pfn +
NODE_DATA(nid)->node_spanned_pages;
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 90c63f9..2d89601 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1126,7 +1126,7 @@ static struct page *can_gather_numa_stats(pte_t pte, struct vm_area_struct *vma,
return NULL;

nid = page_to_nid(page);
- if (!node_isset(nid, node_states[N_HIGH_MEMORY]))
+ if (!node_isset(nid, node_states[N_MEMORY]))
return NULL;

return page;
@@ -1279,7 +1279,7 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
if (md->writeback)
seq_printf(m, " writeback=%lu", md->writeback);

- for_each_node_state(n, N_HIGH_MEMORY)
+ for_each_node_state(n, N_MEMORY)
if (md->node[n])
seq_printf(m, " N%d=%lu", n, md->node[n]);
out:
--
1.7.4.4

2012-10-29 15:51:17

by Lai Jiangshan

Subject: [V5 PATCH 12/26] hugetlb: use N_MEMORY instead N_HIGH_MEMORY

N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan <[email protected]>
Acked-by: Hillf Danton <[email protected]>
---
drivers/base/node.c | 2 +-
mm/hugetlb.c | 24 ++++++++++++------------
2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 5d7731e..4c3aa7c 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -227,7 +227,7 @@ static node_registration_func_t __hugetlb_unregister_node;
static inline bool hugetlb_register_node(struct node *node)
{
if (__hugetlb_register_node &&
- node_state(node->dev.id, N_HIGH_MEMORY)) {
+ node_state(node->dev.id, N_MEMORY)) {
__hugetlb_register_node(node);
return true;
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 59a0059..7720ade 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1057,7 +1057,7 @@ static void return_unused_surplus_pages(struct hstate *h,
* on-line nodes with memory and will handle the hstate accounting.
*/
while (nr_pages--) {
- if (!free_pool_huge_page(h, &node_states[N_HIGH_MEMORY], 1))
+ if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
break;
}
}
@@ -1180,14 +1180,14 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
int __weak alloc_bootmem_huge_page(struct hstate *h)
{
struct huge_bootmem_page *m;
- int nr_nodes = nodes_weight(node_states[N_HIGH_MEMORY]);
+ int nr_nodes = nodes_weight(node_states[N_MEMORY]);

while (nr_nodes) {
void *addr;

addr = __alloc_bootmem_node_nopanic(
NODE_DATA(hstate_next_node_to_alloc(h,
- &node_states[N_HIGH_MEMORY])),
+ &node_states[N_MEMORY])),
huge_page_size(h), huge_page_size(h), 0);

if (addr) {
@@ -1259,7 +1259,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
if (!alloc_bootmem_huge_page(h))
break;
} else if (!alloc_fresh_huge_page(h,
- &node_states[N_HIGH_MEMORY]))
+ &node_states[N_MEMORY]))
break;
}
h->max_huge_pages = i;
@@ -1527,7 +1527,7 @@ static ssize_t nr_hugepages_store_common(bool obey_mempolicy,
if (!(obey_mempolicy &&
init_nodemask_of_mempolicy(nodes_allowed))) {
NODEMASK_FREE(nodes_allowed);
- nodes_allowed = &node_states[N_HIGH_MEMORY];
+ nodes_allowed = &node_states[N_MEMORY];
}
} else if (nodes_allowed) {
/*
@@ -1537,11 +1537,11 @@ static ssize_t nr_hugepages_store_common(bool obey_mempolicy,
count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
init_nodemask_of_node(nodes_allowed, nid);
} else
- nodes_allowed = &node_states[N_HIGH_MEMORY];
+ nodes_allowed = &node_states[N_MEMORY];

h->max_huge_pages = set_max_huge_pages(h, count, nodes_allowed);

- if (nodes_allowed != &node_states[N_HIGH_MEMORY])
+ if (nodes_allowed != &node_states[N_MEMORY])
NODEMASK_FREE(nodes_allowed);

return len;
@@ -1844,7 +1844,7 @@ static void hugetlb_register_all_nodes(void)
{
int nid;

- for_each_node_state(nid, N_HIGH_MEMORY) {
+ for_each_node_state(nid, N_MEMORY) {
struct node *node = &node_devices[nid];
if (node->dev.id == nid)
hugetlb_register_node(node);
@@ -1939,8 +1939,8 @@ void __init hugetlb_add_hstate(unsigned order)
for (i = 0; i < MAX_NUMNODES; ++i)
INIT_LIST_HEAD(&h->hugepage_freelists[i]);
INIT_LIST_HEAD(&h->hugepage_activelist);
- h->next_nid_to_alloc = first_node(node_states[N_HIGH_MEMORY]);
- h->next_nid_to_free = first_node(node_states[N_HIGH_MEMORY]);
+ h->next_nid_to_alloc = first_node(node_states[N_MEMORY]);
+ h->next_nid_to_free = first_node(node_states[N_MEMORY]);
snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
huge_page_size(h)/1024);
/*
@@ -2035,11 +2035,11 @@ static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
if (!(obey_mempolicy &&
init_nodemask_of_mempolicy(nodes_allowed))) {
NODEMASK_FREE(nodes_allowed);
- nodes_allowed = &node_states[N_HIGH_MEMORY];
+ nodes_allowed = &node_states[N_MEMORY];
}
h->max_huge_pages = set_max_huge_pages(h, tmp, nodes_allowed);

- if (nodes_allowed != &node_states[N_HIGH_MEMORY])
+ if (nodes_allowed != &node_states[N_MEMORY])
NODEMASK_FREE(nodes_allowed);
}
out:
--
1.7.4.4

2012-10-29 15:48:23

by Lai Jiangshan

Subject: [V5 PATCH 03/26] memory_hotplug: ensure every online node has NORMAL memory

The old memory hotplug code and the new online_movable command may leave an
online node without any normal memory, but memory management behaves badly
when a node is online but has no normal memory.
Example: a task bound to such a node may fail every kernel allocation, so
it cannot create a new task or any other kernel object.

So we disallow nodes without normal memory here; we will allow them
once the rest of the kernel is prepared for them.


Signed-off-by: Lai Jiangshan <[email protected]>
---
mm/memory_hotplug.c | 40 ++++++++++++++++++++++++++++++++++++++++
1 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index bdcdaf6..9af9641 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -589,6 +589,12 @@ static int online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
return 0;
}

+/* ensure every online node has NORMAL memory */
+static bool can_online_high_movable(struct zone *zone)
+{
+ return node_state(zone_to_nid(zone), N_NORMAL_MEMORY);
+}
+
/* check which state of node_states will be changed when online memory */
static void node_states_check_changes_online(unsigned long nr_pages,
struct zone *zone, struct memory_notify *arg)
@@ -654,6 +660,12 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
*/
zone = page_zone(pfn_to_page(pfn));

+ if ((zone_idx(zone) > ZONE_NORMAL || online_type == ONLINE_MOVABLE) &&
+ !can_online_high_movable(zone)) {
+ unlock_memory_hotplug();
+ return -1;
+ }
+
if (online_type == ONLINE_KERNEL && zone_idx(zone) == ZONE_MOVABLE) {
if (move_pfn_range_left(zone - 1, zone, pfn, pfn + nr_pages)) {
unlock_memory_hotplug();
@@ -1058,6 +1070,30 @@ check_pages_isolated(unsigned long start_pfn, unsigned long end_pfn)
return offlined;
}

+/* ensure the node has NORMAL memory if it is still online */
+static bool can_offline_normal(struct zone *zone, unsigned long nr_pages)
+{
+ struct pglist_data *pgdat = zone->zone_pgdat;
+ unsigned long present_pages = 0;
+ enum zone_type zt;
+
+ for (zt = 0; zt <= ZONE_NORMAL; zt++)
+ present_pages += pgdat->node_zones[zt].present_pages;
+
+ if (present_pages > nr_pages)
+ return true;
+
+ present_pages = 0;
+ for (; zt <= ZONE_MOVABLE; zt++)
+ present_pages += pgdat->node_zones[zt].present_pages;
+
+ /*
+ * we can't offline the last normal memory until all
+ * higher memory is offlined.
+ */
+ return present_pages == 0;
+}
+
/* check which state of node_states will be changed when offline memory */
static void node_states_check_changes_offline(unsigned long nr_pages,
struct zone *zone, struct memory_notify *arg)
@@ -1145,6 +1181,10 @@ static int __ref __offline_pages(unsigned long start_pfn,
node = zone_to_nid(zone);
nr_pages = end_pfn - start_pfn;

+ ret = -EINVAL;
+ if (zone_idx(zone) <= ZONE_NORMAL && !can_offline_normal(zone, nr_pages))
+ goto out;
+
/* set above range as isolated */
ret = start_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE, true);
if (ret)
--
1.7.4.4
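
The check in can_offline_normal() can be tried outside the kernel with a small model. The sketch below assumes a four-zone layout and takes the per-zone present_pages as a plain array; can_offline_normal_model() and the enum are illustrative stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified zone layout (assumption: one of several possible configs). */
enum zone_type { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, ZONE_MOVABLE, MAX_NR_ZONES };

/*
 * Model of can_offline_normal(): offlining nr_pages of normal memory is
 * allowed if some normal memory remains afterwards, or if the node has
 * no higher (highmem/movable) memory left at all.
 */
bool can_offline_normal_model(const unsigned long present[MAX_NR_ZONES],
			      unsigned long nr_pages)
{
	unsigned long present_pages = 0;
	int zt;

	/* Sum the zones up to and including ZONE_NORMAL. */
	for (zt = 0; zt <= ZONE_NORMAL; zt++)
		present_pages += present[zt];

	if (present_pages > nr_pages)
		return true;

	/* zt continues from ZONE_NORMAL + 1: sum the higher zones. */
	present_pages = 0;
	for (; zt <= ZONE_MOVABLE; zt++)
		present_pages += present[zt];

	/* The last normal memory may go only after all higher memory is gone. */
	return present_pages == 0;
}
```

For a node with 100 normal pages and 50 movable pages, offlining 40 normal pages is allowed, while offlining all 100 is refused until the movable pages have been offlined first.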

2012-10-29 15:51:52

by Lai Jiangshan

Subject: [V5 PATCH 23/26] x86: use memblock_set_current_limit() to set memblock.current_limit

From: Yasuaki Ishimatsu <[email protected]>

memblock.current_limit is set directly even though
memblock_set_current_limit() is provided for that purpose. Use the helper.

Signed-off-by: Yasuaki Ishimatsu <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/kernel/setup.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index ca45696..ab3017a 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -890,7 +890,7 @@ void __init setup_arch(char **cmdline_p)

cleanup_highmap();

- memblock.current_limit = get_max_mapped();
+ memblock_set_current_limit(get_max_mapped());
memblock_x86_fill();

/*
@@ -940,7 +940,7 @@ void __init setup_arch(char **cmdline_p)
max_low_pfn = max_pfn;
}
#endif
- memblock.current_limit = get_max_mapped();
+ memblock_set_current_limit(get_max_mapped());
dma_contiguous_reserve(0);

/*
--
1.7.4.4

2012-10-29 15:52:11

by Lai Jiangshan

Subject: [V5 PATCH 14/26] kthread: use N_MEMORY instead N_HIGH_MEMORY

N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/kthread.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 29fb60c..691dc2e 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -428,7 +428,7 @@ int kthreadd(void *unused)
set_task_comm(tsk, "kthreadd");
ignore_signals(tsk);
set_cpus_allowed_ptr(tsk, cpu_all_mask);
- set_mems_allowed(node_states[N_HIGH_MEMORY]);
+ set_mems_allowed(node_states[N_MEMORY]);

current->flags |= PF_NOFREEZE;

--
1.7.4.4

2012-10-29 15:48:20

by Lai Jiangshan

Subject: [V5 PATCH 05/26] node_states: introduce N_MEMORY

We have N_NORMAL_MEMORY to stand for the nodes that have normal memory with
zone_type <= ZONE_NORMAL.

And we have N_HIGH_MEMORY to stand for the nodes that have normal or high
memory.

But we don't have any word to stand for the nodes that have *any* memory.

And we have N_CPU but no N_MEMORY.

The current code reuses N_HIGH_MEMORY for this purpose because any node which
has memory must currently have high memory or normal memory.

A) But this reuse is bad for *readability*, because the name
N_HIGH_MEMORY just stands for high or normal memory:

A.example 1)
mem_cgroup_nr_lru_pages():
for_each_node_state(nid, N_HIGH_MEMORY)

The reader will be confused (why does this function count only high or
normal memory nodes? does it count ZONE_MOVABLE's lru pages?)
until someone tells them that N_HIGH_MEMORY is reused to stand for
nodes that have any memory.

A.cont) If we introduce N_MEMORY, we can reduce this confusion
AND make the code clearer:

A.example 2) mm/page_cgroup.c uses N_HIGH_MEMORY twice:

One is in page_cgroup_init(void):
for_each_node_state(nid, N_HIGH_MEMORY) {

It means that if the node has memory, we will allocate the page_cgroup map
for the node. We should use N_MEMORY here instead to make this clearer.

The second use is in alloc_page_cgroup():
if (node_state(nid, N_HIGH_MEMORY))
addr = vzalloc_node(size, nid);

It means the node has high or normal memory that the kernel can allocate
from. We should keep N_HIGH_MEMORY here, and it will be better
if the "any memory" semantic of N_HIGH_MEMORY is removed.

B) This reuse becomes outdated if we introduce MOVABLE-dedicated nodes.
A MOVABLE-dedicated node should appear in neither
node_states[N_HIGH_MEMORY] nor node_states[N_NORMAL_MEMORY],
because a MOVABLE-dedicated node has no high or normal memory.

On x86_64, N_HIGH_MEMORY=N_NORMAL_MEMORY, so if a MOVABLE-dedicated node
is in node_states[N_HIGH_MEMORY], it is also in
node_states[N_NORMAL_MEMORY], which makes SLUB misbehave.

SLUB uses
for_each_node_state(nid, N_NORMAL_MEMORY)
and creates a kmem_cache_node for the MOVABLE-dedicated node, which causes
problems.

In short, we need N_MEMORY. We just introduce it as an alias of
N_HIGH_MEMORY and fix all improper uses of N_HIGH_MEMORY in later patches.

Signed-off-by: Lai Jiangshan <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Acked-by: Hillf Danton <[email protected]>
---
include/linux/nodemask.h | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 7afc363..c6ebdc9 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -380,6 +380,7 @@ enum node_states {
#else
N_HIGH_MEMORY = N_NORMAL_MEMORY,
#endif
+ N_MEMORY = N_HIGH_MEMORY,
N_CPU, /* The node has one or more cpus */
NR_NODE_STATES
};
--
1.7.4.4
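
To see what "alias" means at the enum level, the following standalone sketch reproduces the shape of enum node_states after this patch. It is an approximation: the leading entries are abbreviated from memory of the 3.7 header, the file is assumed to be built without CONFIG_HIGHMEM, and same_slot() is a hypothetical helper added only for illustration.

```c
#include <assert.h>

/* Approximate copy of enum node_states with the new alias (assumption). */
enum node_states {
	N_POSSIBLE,		/* The node could become online at some point */
	N_ONLINE,		/* The node is online */
	N_NORMAL_MEMORY,	/* The node has regular memory */
#ifdef CONFIG_HIGHMEM
	N_HIGH_MEMORY,		/* The node has regular or high memory */
#else
	N_HIGH_MEMORY = N_NORMAL_MEMORY,
#endif
	N_MEMORY = N_HIGH_MEMORY,	/* alias: no new node_states[] slot */
	N_CPU,			/* The node has one or more cpus */
	NR_NODE_STATES
};

/* Returns nonzero when two states index the same node_states[] slot. */
int same_slot(enum node_states a, enum node_states b)
{
	return a == b;
}
```

Because N_MEMORY reuses the value of N_HIGH_MEMORY, NR_NODE_STATES does not grow and node_states[N_MEMORY] is the very same mask as node_states[N_HIGH_MEMORY], so converting callers one by one changes no behavior at this stage of the series.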

2012-10-29 15:52:34

by Lai Jiangshan

Subject: [V5 PATCH 22/26] x86: get pg_data_t's memory from other node

From: Yasuaki Ishimatsu <[email protected]>

If the system can create a movable node, in which all memory of the
node is allocated as ZONE_MOVABLE, setup_node_data() cannot
allocate memory for the node's pg_data_t from that node.
So when memblock_alloc_nid() fails, setup_node_data() retries with
memblock_alloc().

Signed-off-by: Yasuaki Ishimatsu <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/mm/numa.c | 8 ++++++--
1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 2d125be..a86e315 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -223,9 +223,13 @@ static void __init setup_node_data(int nid, u64 start, u64 end)
remapped = true;
} else {
nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
- if (!nd_pa) {
- pr_err("Cannot find %zu bytes in node %d\n",
+ if (!nd_pa)
+ printk(KERN_WARNING "Cannot find %zu bytes in node %d\n",
nd_size, nid);
+ nd_pa = memblock_alloc(nd_size, SMP_CACHE_BYTES);
+ if (!nd_pa) {
+ pr_err("Cannot find %zu bytes in other node\n",
+ nd_size);
return;
}
nd = __va(nd_pa);
--
1.7.4.4

2012-10-29 15:52:31

by Lai Jiangshan

Subject: [V5 PATCH 25/26] memblock: compare current_limit with end variable at memblock_find_in_range_node()

From: Yasuaki Ishimatsu <[email protected]>

memblock_find_in_range_node() does not compare memblock.current_limit
with the end variable. Thus even if memblock.current_limit is smaller than
end, the function may return a memory address that is above
memblock.current_limit.

The patch adds the check to memblock_find_in_range_node().

Signed-off-by: Yasuaki Ishimatsu <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---
mm/memblock.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/memblock.c b/mm/memblock.c
index ee2e307..50ab53c 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -100,11 +100,12 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
phys_addr_t align, int nid)
{
phys_addr_t this_start, this_end, cand;
+ phys_addr_t current_limit = memblock.current_limit;
u64 i;

/* pump up @end */
- if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
- end = memblock.current_limit;
+ if ((end == MEMBLOCK_ALLOC_ACCESSIBLE) || (end > current_limit))
+ end = current_limit;

/* avoid allocating the first page */
start = max_t(phys_addr_t, start, PAGE_SIZE);
--
1.7.4.4
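
The new upper-bound rule is small enough to model on its own. In the sketch below, effective_end() is a hypothetical helper, phys_addr_t is a 64-bit stand-in, and MEMBLOCK_ALLOC_ACCESSIBLE is assumed to be the 0 sentinel used by the kernel header.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t phys_addr_t;		/* stand-in for the kernel type */
#define MEMBLOCK_ALLOC_ACCESSIBLE 0	/* assumed sentinel: "up to current_limit" */

/*
 * Model of the patched "pump up @end" step: the search end is replaced
 * by current_limit both for the sentinel and for any end above the limit.
 */
phys_addr_t effective_end(phys_addr_t end, phys_addr_t current_limit)
{
	if (end == MEMBLOCK_ALLOC_ACCESSIBLE || end > current_limit)
		end = current_limit;
	return end;
}
```

Before the patch only the sentinel case was rewritten, so a caller passing an explicit end above current_limit could receive memory beyond the limit.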

2012-10-29 15:53:32

by Lai Jiangshan

Subject: [V5 PATCH 10/26] mm,migrate: use N_MEMORY instead N_HIGH_MEMORY

N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
---
mm/migrate.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 77ed2d7..d595e58 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1201,7 +1201,7 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
if (node < 0 || node >= MAX_NUMNODES)
goto out_pm;

- if (!node_state(node, N_HIGH_MEMORY))
+ if (!node_state(node, N_MEMORY))
goto out_pm;

err = -EACCES;
--
1.7.4.4

2012-10-29 15:53:49

by Lai Jiangshan

Subject: [V5 PATCH 15/26] init: use N_MEMORY instead N_HIGH_MEMORY

N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan <[email protected]>
---
init/main.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/init/main.c b/init/main.c
index 9cf77ab..9595968 100644
--- a/init/main.c
+++ b/init/main.c
@@ -855,7 +855,7 @@ static void __init kernel_init_freeable(void)
/*
* init can allocate pages on any node
*/
- set_mems_allowed(node_states[N_HIGH_MEMORY]);
+ set_mems_allowed(node_states[N_MEMORY]);
/*
* init can run on any cpu.
*/
--
1.7.4.4

2012-10-29 15:53:47

by Lai Jiangshan

Subject: [V5 PATCH 26/26] mempolicy: fix is_valid_nodemask()

is_valid_nodemask() was introduced by 19770b32, but it does not match
its comment, because it does not check zones above policy_zone.

Also, commit b377fd tells us that if the highest zone is ZONE_MOVABLE,
we should also apply memory policies to it, so ZONE_MOVABLE should be a
valid zone for policies. is_valid_nodemask() needs to be changed to match.

Fix: check all zones, even those whose zone id > policy_zone.
Use nodes_intersects() instead of open-coding the check.

Signed-off-by: Lai Jiangshan <[email protected]>
Reported-by: Wen Congyang <[email protected]>
---
mm/mempolicy.c | 36 ++++++++++++++++++++++--------------
1 files changed, 22 insertions(+), 14 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d4a084c..ed7c249 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -140,19 +140,7 @@ static const struct mempolicy_operations {
/* Check that the nodemask contains at least one populated zone */
static int is_valid_nodemask(const nodemask_t *nodemask)
{
- int nd, k;
-
- for_each_node_mask(nd, *nodemask) {
- struct zone *z;
-
- for (k = 0; k <= policy_zone; k++) {
- z = &NODE_DATA(nd)->node_zones[k];
- if (z->present_pages > 0)
- return 1;
- }
- }
-
- return 0;
+ return nodes_intersects(*nodemask, node_states[N_MEMORY]);
}

static inline int mpol_store_user_nodemask(const struct mempolicy *pol)
@@ -1572,6 +1560,26 @@ struct mempolicy *get_vma_policy(struct task_struct *task,
return pol;
}

+static int apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
+{
+ enum zone_type dynamic_policy_zone = policy_zone;
+
+ BUG_ON(dynamic_policy_zone == ZONE_MOVABLE);
+
+ /*
+ * If policy->v.nodes has movable memory only,
+ * we apply the policy only when gfp_zone(gfp) == ZONE_MOVABLE.
+ *
+ * policy->v.nodes always intersects node_states[N_MEMORY],
+ * so if the following test fails, it implies that
+ * policy->v.nodes has movable memory only.
+ */
+ if (!nodes_intersects(policy->v.nodes, node_states[N_HIGH_MEMORY]))
+ dynamic_policy_zone = ZONE_MOVABLE;
+
+ return zone >= dynamic_policy_zone;
+}
+
/*
* Return a nodemask representing a mempolicy for filtering nodes for
* page allocation
@@ -1580,7 +1588,7 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
{
/* Lower zones don't get a nodemask applied for MPOL_BIND */
if (unlikely(policy->mode == MPOL_BIND) &&
- gfp_zone(gfp) >= policy_zone &&
+ apply_policy_zone(policy, gfp_zone(gfp)) &&
cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
return &policy->v.nodes;

--
1.7.4.4
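
The zone decision made by apply_policy_zone() in the patch above can be
sketched in userspace as follows. This is only a model under stated
assumptions: nodemasks are 64-bit bitmaps, there are just three zones, and
all toy_* names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy zone ordering; the real enum zone_type depends on the kernel config. */
enum toy_zone { TOY_ZONE_DMA, TOY_ZONE_NORMAL, TOY_ZONE_MOVABLE };

typedef uint64_t toy_nodemask_t;

/*
 * Decide whether an MPOL_BIND nodemask should be applied to an allocation
 * that targets `zone`.
 *   policy_nodes  - nodes named by the policy
 *   n_high_memory - nodes that have normal or high memory
 *   policy_zone   - lowest zone that policies normally apply to
 */
static bool toy_apply_policy_zone(toy_nodemask_t policy_nodes,
                                  toy_nodemask_t n_high_memory,
                                  enum toy_zone policy_zone,
                                  enum toy_zone zone)
{
    enum toy_zone dynamic_policy_zone = policy_zone;

    /* No policy node has kernel-usable memory: movable-only nodemask. */
    if ((policy_nodes & n_high_memory) == 0)
        dynamic_policy_zone = TOY_ZONE_MOVABLE;

    return zone >= dynamic_policy_zone;
}
```

If only nodes 0 and 1 have normal memory and the policy binds to node 2,
the nodemask is applied to ZONE_MOVABLE allocations but not to ZONE_NORMAL
ones, so kernel allocations can fall back to other nodes.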

2012-10-29 15:53:45

by Lai Jiangshan

[permalink] [raw]
Subject: [V5 PATCH 06/26] cpuset: use N_MEMORY instead N_HIGH_MEMORY

N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.
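
To make the distinction concrete, here is a small userspace sketch of how
the three node states are meant to differ once movable-dedicated nodes
exist. The struct layout and the toy_* names are assumptions made for
illustration; the kernel tracks this in node_states[] and updates it
elsewhere.

```c
#include <stdint.h>

typedef uint64_t toy_nodemask_t;

/* Per-node page counts by zone class; a toy model, not struct pglist_data. */
struct toy_node {
    unsigned long normal_pages;   /* zones up to ZONE_NORMAL */
    unsigned long high_pages;     /* ZONE_HIGHMEM */
    unsigned long movable_pages;  /* ZONE_MOVABLE */
};

struct toy_node_states {
    toy_nodemask_t n_normal_memory;
    toy_nodemask_t n_high_memory;
    toy_nodemask_t n_memory;
};

/* Build the three masks for the first `nr` nodes (at most 64). */
static struct toy_node_states toy_classify(const struct toy_node *nodes, int nr)
{
    struct toy_node_states s = { 0, 0, 0 };

    for (int nid = 0; nid < nr && nid < 64; nid++) {
        const struct toy_node *n = &nodes[nid];
        toy_nodemask_t bit = (toy_nodemask_t)1 << nid;

        if (n->normal_pages)
            s.n_normal_memory |= bit;
        if (n->normal_pages || n->high_pages)
            s.n_high_memory |= bit;
        if (n->normal_pages || n->high_pages || n->movable_pages)
            s.n_memory |= bit;
    }
    return s;
}
```

A node that holds only movable pages ends up in n_memory but in neither of
the other two masks.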

Signed-off-by: Lai Jiangshan <[email protected]>
Acked-by: Hillf Danton <[email protected]>
---
Documentation/cgroups/cpusets.txt | 2 +-
include/linux/cpuset.h | 2 +-
kernel/cpuset.c | 32 ++++++++++++++++----------------
3 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt
index cefd3d8..12e01d4 100644
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
@@ -218,7 +218,7 @@ and name space for cpusets, with a minimum of additional kernel code.
The cpus and mems files in the root (top_cpuset) cpuset are
read-only. The cpus file automatically tracks the value of
cpu_online_mask using a CPU hotplug notifier, and the mems file
-automatically tracks the value of node_states[N_HIGH_MEMORY]--i.e.,
+automatically tracks the value of node_states[N_MEMORY]--i.e.,
nodes with memory--using the cpuset_track_online_nodes() hook.


diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 838320f..8c8a60d29 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -144,7 +144,7 @@ static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
return node_possible_map;
}

-#define cpuset_current_mems_allowed (node_states[N_HIGH_MEMORY])
+#define cpuset_current_mems_allowed (node_states[N_MEMORY])
static inline void cpuset_init_current_mems_allowed(void) {}

static inline int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index f33c715..2b133db 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -302,10 +302,10 @@ static void guarantee_online_cpus(const struct cpuset *cs,
* are online, with memory. If none are online with memory, walk
* up the cpuset hierarchy until we find one that does have some
* online mems. If we get all the way to the top and still haven't
- * found any online mems, return node_states[N_HIGH_MEMORY].
+ * found any online mems, return node_states[N_MEMORY].
*
* One way or another, we guarantee to return some non-empty subset
- * of node_states[N_HIGH_MEMORY].
+ * of node_states[N_MEMORY].
*
* Call with callback_mutex held.
*/
@@ -313,14 +313,14 @@ static void guarantee_online_cpus(const struct cpuset *cs,
static void guarantee_online_mems(const struct cpuset *cs, nodemask_t *pmask)
{
while (cs && !nodes_intersects(cs->mems_allowed,
- node_states[N_HIGH_MEMORY]))
+ node_states[N_MEMORY]))
cs = cs->parent;
if (cs)
nodes_and(*pmask, cs->mems_allowed,
- node_states[N_HIGH_MEMORY]);
+ node_states[N_MEMORY]);
else
- *pmask = node_states[N_HIGH_MEMORY];
- BUG_ON(!nodes_intersects(*pmask, node_states[N_HIGH_MEMORY]));
+ *pmask = node_states[N_MEMORY];
+ BUG_ON(!nodes_intersects(*pmask, node_states[N_MEMORY]));
}

/*
@@ -1100,7 +1100,7 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
return -ENOMEM;

/*
- * top_cpuset.mems_allowed tracks node_stats[N_HIGH_MEMORY];
+ * top_cpuset.mems_allowed tracks node_stats[N_MEMORY];
* it's read-only
*/
if (cs == &top_cpuset) {
@@ -1122,7 +1122,7 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
goto done;

if (!nodes_subset(trialcs->mems_allowed,
- node_states[N_HIGH_MEMORY])) {
+ node_states[N_MEMORY])) {
retval = -EINVAL;
goto done;
}
@@ -2034,7 +2034,7 @@ static struct cpuset *cpuset_next(struct list_head *queue)
* before dropping down to the next. It always processes a node before
* any of its children.
*
- * In the case of memory hot-unplug, it will remove nodes from N_HIGH_MEMORY
+ * In the case of memory hot-unplug, it will remove nodes from N_MEMORY
* if all present pages from a node are offlined.
*/
static void
@@ -2073,7 +2073,7 @@ scan_cpusets_upon_hotplug(struct cpuset *root, enum hotplug_event event)

/* Continue past cpusets with all mems online */
if (nodes_subset(cp->mems_allowed,
- node_states[N_HIGH_MEMORY]))
+ node_states[N_MEMORY]))
continue;

oldmems = cp->mems_allowed;
@@ -2081,7 +2081,7 @@ scan_cpusets_upon_hotplug(struct cpuset *root, enum hotplug_event event)
/* Remove offline mems from this cpuset. */
mutex_lock(&callback_mutex);
nodes_and(cp->mems_allowed, cp->mems_allowed,
- node_states[N_HIGH_MEMORY]);
+ node_states[N_MEMORY]);
mutex_unlock(&callback_mutex);

/* Move tasks from the empty cpuset to a parent */
@@ -2134,8 +2134,8 @@ void cpuset_update_active_cpus(bool cpu_online)

#ifdef CONFIG_MEMORY_HOTPLUG
/*
- * Keep top_cpuset.mems_allowed tracking node_states[N_HIGH_MEMORY].
- * Call this routine anytime after node_states[N_HIGH_MEMORY] changes.
+ * Keep top_cpuset.mems_allowed tracking node_states[N_MEMORY].
+ * Call this routine anytime after node_states[N_MEMORY] changes.
* See cpuset_update_active_cpus() for CPU hotplug handling.
*/
static int cpuset_track_online_nodes(struct notifier_block *self,
@@ -2148,7 +2148,7 @@ static int cpuset_track_online_nodes(struct notifier_block *self,
case MEM_ONLINE:
oldmems = top_cpuset.mems_allowed;
mutex_lock(&callback_mutex);
- top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
+ top_cpuset.mems_allowed = node_states[N_MEMORY];
mutex_unlock(&callback_mutex);
update_tasks_nodemask(&top_cpuset, &oldmems, NULL);
break;
@@ -2177,7 +2177,7 @@ static int cpuset_track_online_nodes(struct notifier_block *self,
void __init cpuset_init_smp(void)
{
cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
- top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
+ top_cpuset.mems_allowed = node_states[N_MEMORY];

hotplug_memory_notifier(cpuset_track_online_nodes, 10);

@@ -2245,7 +2245,7 @@ void cpuset_init_current_mems_allowed(void)
*
* Description: Returns the nodemask_t mems_allowed of the cpuset
* attached to the specified @tsk. Guaranteed to return some non-empty
- * subset of node_states[N_HIGH_MEMORY], even if this means going outside the
+ * subset of node_states[N_MEMORY], even if this means going outside the
* tasks cpuset.
**/

--
1.7.4.4
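
The hierarchy walk in guarantee_online_mems() above can be sketched in
userspace like this. Locking and the BUG_ON are left out, nodemasks are
modelled as 64-bit bitmaps, and the toy_* names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t toy_nodemask_t;

struct toy_cpuset {
    toy_nodemask_t mems_allowed;
    const struct toy_cpuset *parent;  /* NULL for the top cpuset */
};

/*
 * Walk up from `cs` until a cpuset is found whose mems_allowed contains a
 * node with memory; return that intersection, or n_memory itself if the
 * walk runs off the top of the hierarchy.
 */
static toy_nodemask_t toy_guarantee_online_mems(const struct toy_cpuset *cs,
                                                toy_nodemask_t n_memory)
{
    while (cs && (cs->mems_allowed & n_memory) == 0)
        cs = cs->parent;

    return cs ? (cs->mems_allowed & n_memory) : n_memory;
}
```

Because the mask passed in is N_MEMORY rather than N_HIGH_MEMORY, a cpuset
restricted to a movable-only node still resolves to that node instead of
falling through to its parent.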

2012-10-29 15:48:18

by Lai Jiangshan

[permalink] [raw]
Subject: [V5 PATCH 02/26] memory_hotplug: handle empty zone when online_movable/online_kernel

Make online_movable/online_kernel able to empty a zone
or to move memory into an empty zone.
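
A simplified userspace sketch of the two cases, for illustration only. It
models a zone as a start pfn plus a span, omits the locking, the
wait_table initialization and fix_zone_id(), and uses hypothetical toy_*
names.

```c
struct toy_zone {
    unsigned long zone_start_pfn;
    unsigned long spanned_pages;
};

/* Set the span to [start_pfn, end_pfn); an empty span resets both fields. */
static void toy_resize_zone(struct toy_zone *zone,
                            unsigned long start_pfn, unsigned long end_pfn)
{
    if (end_pfn - start_pfn) {
        zone->zone_start_pfn = start_pfn;
        zone->spanned_pages = end_pfn - start_pfn;
    } else {
        zone->zone_start_pfn = 0;
        zone->spanned_pages = 0;
    }
}

/*
 * Move [start_pfn, end_pfn) from the left edge of z2 to the end of z1.
 * Returns 0 on success, -1 if the range is not movable this way.
 */
static int toy_move_pfn_range_left(struct toy_zone *z1, struct toy_zone *z2,
                                   unsigned long start_pfn,
                                   unsigned long end_pfn)
{
    unsigned long z2_end_pfn = z2->zone_start_pfn + z2->spanned_pages;
    unsigned long z1_start_pfn;

    if (end_pfn > z2_end_pfn)            /* range extends past z2 */
        return -1;
    if (start_pfn > z2->zone_start_pfn)  /* range is not at z2's left edge */
        return -1;
    if (end_pfn <= z2->zone_start_pfn)   /* range does not overlap z2 */
        return -1;

    /* An empty z1 has no meaningful start, so it begins at start_pfn. */
    z1_start_pfn = z1->spanned_pages ? z1->zone_start_pfn : start_pfn;

    toy_resize_zone(z1, z1_start_pfn, end_pfn);
    toy_resize_zone(z2, end_pfn, z2_end_pfn);
    return 0;
}
```

Moving the first part of z2 into an empty z1 gives z1 its first span;
moving the remainder leaves z2 with zone_start_pfn = 0 and
spanned_pages = 0.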

Signed-off-by: Lai Jiangshan <[email protected]>
---
mm/memory_hotplug.c | 51 +++++++++++++++++++++++++++++++++++++++++++++------
1 files changed, 45 insertions(+), 6 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 6d3bec4..bdcdaf6 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -227,8 +227,17 @@ static void resize_zone(struct zone *zone, unsigned long start_pfn,

zone_span_writelock(zone);

- zone->zone_start_pfn = start_pfn;
- zone->spanned_pages = end_pfn - start_pfn;
+ if (end_pfn - start_pfn) {
+ zone->zone_start_pfn = start_pfn;
+ zone->spanned_pages = end_pfn - start_pfn;
+ } else {
+ /*
+ * keep it consistent with free_area_init_core():
+ * if spanned_pages = 0, then keep start_pfn = 0
+ */
+ zone->zone_start_pfn = 0;
+ zone->spanned_pages = 0;
+ }

zone_span_writeunlock(zone);
}
@@ -244,10 +253,19 @@ static void fix_zone_id(struct zone *zone, unsigned long start_pfn,
set_page_links(pfn_to_page(pfn), zid, nid, pfn);
}

-static int move_pfn_range_left(struct zone *z1, struct zone *z2,
+static int __meminit move_pfn_range_left(struct zone *z1, struct zone *z2,
unsigned long start_pfn, unsigned long end_pfn)
{
+ int ret;
unsigned long flags;
+ unsigned long z1_start_pfn;
+
+ if (!z1->wait_table) {
+ ret = init_currently_empty_zone(z1, start_pfn,
+ end_pfn - start_pfn, MEMMAP_HOTPLUG);
+ if (ret)
+ return ret;
+ }

pgdat_resize_lock(z1->zone_pgdat, &flags);

@@ -261,7 +279,13 @@ static int move_pfn_range_left(struct zone *z1, struct zone *z2,
if (end_pfn <= z2->zone_start_pfn)
goto out_fail;

- resize_zone(z1, z1->zone_start_pfn, end_pfn);
+ /* use start_pfn for z1's start_pfn if z1 is empty */
+ if (z1->spanned_pages)
+ z1_start_pfn = z1->zone_start_pfn;
+ else
+ z1_start_pfn = start_pfn;
+
+ resize_zone(z1, z1_start_pfn, end_pfn);
resize_zone(z2, end_pfn, z2->zone_start_pfn + z2->spanned_pages);

pgdat_resize_unlock(z1->zone_pgdat, &flags);
@@ -274,10 +298,19 @@ out_fail:
return -1;
}

-static int move_pfn_range_right(struct zone *z1, struct zone *z2,
+static int __meminit move_pfn_range_right(struct zone *z1, struct zone *z2,
unsigned long start_pfn, unsigned long end_pfn)
{
+ int ret;
unsigned long flags;
+ unsigned long z2_end_pfn;
+
+ if (!z2->wait_table) {
+ ret = init_currently_empty_zone(z2, start_pfn,
+ end_pfn - start_pfn, MEMMAP_HOTPLUG);
+ if (ret)
+ return ret;
+ }

pgdat_resize_lock(z1->zone_pgdat, &flags);

@@ -291,8 +324,14 @@ static int move_pfn_range_right(struct zone *z1, struct zone *z2,
if (start_pfn >= z1->zone_start_pfn + z1->spanned_pages)
goto out_fail;

+ /* use end_pfn for z2's end_pfn if z2 is empty */
+ if (z2->spanned_pages)
+ z2_end_pfn = z2->zone_start_pfn + z2->spanned_pages;
+ else
+ z2_end_pfn = end_pfn;
+
resize_zone(z1, z1->zone_start_pfn, start_pfn);
- resize_zone(z2, start_pfn, z2->zone_start_pfn + z2->spanned_pages);
+ resize_zone(z2, start_pfn, z2_end_pfn);

pgdat_resize_unlock(z1->zone_pgdat, &flags);

--
1.7.4.4

2012-10-29 15:55:05

by Lai Jiangshan

[permalink] [raw]
Subject: [V5 PATCH 16/26] vmscan: use N_MEMORY instead N_HIGH_MEMORY

N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.
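
For illustration, a userspace sketch of what the choice of mask means for
a per-node loop such as the one that starts kswapd. It assumes a 64-bit
bitmap nodemask; toy_kswapd_init() is a hypothetical stand-in that only
records which nodes would get a reclaim thread.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t toy_nodemask_t;

#define TOY_MAX_NODES 64

/*
 * Mark every node in `mask` as having a reclaim thread started and
 * return how many nodes were marked.
 */
static int toy_kswapd_init(toy_nodemask_t mask, bool started[TOY_MAX_NODES])
{
    int count = 0;

    for (int nid = 0; nid < TOY_MAX_NODES; nid++) {
        if (mask & ((toy_nodemask_t)1 << nid)) {
            started[nid] = true;
            count++;
        }
    }
    return count;
}
```

If node 2 is movable-only, iterating over a mask that excludes it (0x3)
leaves node 2 without a reclaim thread, while iterating over a mask of all
nodes with memory (0x7) covers it.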

Signed-off-by: Lai Jiangshan <[email protected]>
Acked-by: Hillf Danton <[email protected]>
---
mm/vmscan.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2624edc..98a2e11 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3135,7 +3135,7 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
int nid;

if (action == CPU_ONLINE || action == CPU_ONLINE_FROZEN) {
- for_each_node_state(nid, N_HIGH_MEMORY) {
+ for_each_node_state(nid, N_MEMORY) {
pg_data_t *pgdat = NODE_DATA(nid);
const struct cpumask *mask;

@@ -3191,7 +3191,7 @@ static int __init kswapd_init(void)
int nid;

swap_setup();
- for_each_node_state(nid, N_HIGH_MEMORY)
+ for_each_node_state(nid, N_MEMORY)
kswapd_run(nid);
hotcpu_notifier(cpu_callback, 0);
return 0;
--
1.7.4.4

2012-10-29 16:03:24

by Lai Jiangshan

[permalink] [raw]
Subject: [V5 PATCH 21/26] page_alloc: add kernelcore_max_addr

The current boot-option policy for setting up ZONE_MOVABLE (kernelcore=)
doesn't meet our requirements. We need something like a kernelcore_max_addr=XX
boot option to limit the upper address of kernelcore.

Memory above that address will be migratable (movable) and is
easier to offline (it is always ready to be offlined when the system doesn't
need that much memory).

This makes things easier when we dynamically hot-add/remove memory, makes
better use of memory, and helps THP.

kernelcore_max_addr=, kernelcore= and movablecore= can all be safely specified
at the same time (or any two of them).
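
A rough userspace sketch of the idea, for the simple case where only
kernelcore_max_addr= is given. It assumes 4 KiB pages, and both helpers
are hypothetical: the patch itself reuses cmdline_parse_core() for parsing
and does the clamping inside find_zone_movable_pfns_for_nodes().

```c
#include <stdlib.h>

#define TOY_PAGE_SHIFT 12  /* assumes 4 KiB pages */

/* Parse "nn[KMG]" into a page frame number. */
static unsigned long toy_parse_max_pfn(const char *arg)
{
    char *end;
    unsigned long long bytes = strtoull(arg, &end, 0);

    switch (*end) {
    case 'G': case 'g': bytes <<= 10; /* fall through */
    case 'M': case 'm': bytes <<= 10; /* fall through */
    case 'K': case 'k': bytes <<= 10; break;
    default: break;
    }
    return (unsigned long)(bytes >> TOY_PAGE_SHIFT);
}

/*
 * Given one memory range [start_pfn, end_pfn), return the pfn where
 * movable memory begins: everything at or above max_pfn is movable.
 */
static unsigned long toy_movable_start(unsigned long start_pfn,
                                       unsigned long end_pfn,
                                       unsigned long max_pfn)
{
    if (max_pfn <= start_pfn)
        return start_pfn;   /* whole range is movable */
    if (max_pfn >= end_pfn)
        return end_pfn;     /* whole range is kernelcore */
    return max_pfn;
}
```

With kernelcore_max_addr=4G, a range that lies entirely above 4G becomes
movable as a whole, a range entirely below stays kernelcore, and a range
that straddles the limit is split at it.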

Signed-off-by: Lai Jiangshan <[email protected]>
---
Documentation/kernel-parameters.txt | 9 +++++++++
mm/page_alloc.c | 29 ++++++++++++++++++++++++++++-
2 files changed, 37 insertions(+), 1 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 9776f06..2b72ffb 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1223,6 +1223,15 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
use the HighMem zone if it exists, and the Normal
zone if it does not.

+ kernelcore_max_addr=nn[KMG] [KNL,X86,IA-64,PPC] This parameter
+ has the same effect as the kernelcore parameter, except it
+ specifies the upper physical address of the memory range
+ usable by the kernel for non-movable allocations.
+ If both kernelcore and kernelcore_max_addr are
+ specified, this request takes priority over
+ kernelcore's.
+ See the kernelcore parameter.
+
kgdbdbgp= [KGDB,HW] kgdb over EHCI usb debug port.
Format: <Controller#>[,poll interval]
The controller # is the number of the ehci usb debug
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a42337f..11df8b5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -203,6 +203,7 @@ static unsigned long __meminitdata dma_reserve;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
static unsigned long __meminitdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
+static unsigned long __initdata required_kernelcore_max_pfn;
static unsigned long __initdata required_kernelcore;
static unsigned long __initdata required_movablecore;
static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
@@ -4700,6 +4701,7 @@ static void __init find_zone_movable_pfns_for_nodes(void)
{
int i, nid;
unsigned long usable_startpfn;
+ unsigned long kernelcore_max_pfn;
unsigned long kernelcore_node, kernelcore_remaining;
/* save the state before borrow the nodemask */
nodemask_t saved_node_state = node_states[N_MEMORY];
@@ -4728,6 +4730,9 @@ static void __init find_zone_movable_pfns_for_nodes(void)
required_kernelcore = max(required_kernelcore, corepages);
}

+ if (required_kernelcore_max_pfn && !required_kernelcore)
+ required_kernelcore = totalpages;
+
/* If kernelcore was not specified, there is no ZONE_MOVABLE */
if (!required_kernelcore)
goto out;
@@ -4736,6 +4741,12 @@ static void __init find_zone_movable_pfns_for_nodes(void)
find_usable_zone_for_movable();
usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];

+ if (required_kernelcore_max_pfn)
+ kernelcore_max_pfn = required_kernelcore_max_pfn;
+ else
+ kernelcore_max_pfn = ULONG_MAX >> PAGE_SHIFT;
+ kernelcore_max_pfn = max(kernelcore_max_pfn, usable_startpfn);
+
restart:
/* Spread kernelcore memory as evenly as possible throughout nodes */
kernelcore_node = required_kernelcore / usable_nodes;
@@ -4762,8 +4773,12 @@ restart:
unsigned long size_pages;

start_pfn = max(start_pfn, zone_movable_pfn[nid]);
- if (start_pfn >= end_pfn)
+ end_pfn = min(kernelcore_max_pfn, end_pfn);
+ if (start_pfn >= end_pfn) {
+ if (!zone_movable_pfn[nid])
+ zone_movable_pfn[nid] = start_pfn;
continue;
+ }

/* Account for what is only usable for kernelcore */
if (start_pfn < usable_startpfn) {
@@ -4954,6 +4969,18 @@ static int __init cmdline_parse_core(char *p, unsigned long *core)
return 0;
}

+#ifdef CONFIG_MOVABLE_NODE
+/*
+ * kernelcore_max_addr=addr sets the upper physical address of the memory
+ * range used for allocations that cannot be reclaimed or migrated.
+ */
+static int __init cmdline_parse_kernelcore_max_addr(char *p)
+{
+ return cmdline_parse_core(p, &required_kernelcore_max_pfn);
+}
+early_param("kernelcore_max_addr", cmdline_parse_kernelcore_max_addr);
+#endif
+
/*
* kernelcore=size sets the amount of memory for use for allocations that
* cannot be reclaimed or migrated.
--
1.7.4.4

2012-10-29 16:03:18

by Lai Jiangshan

[permalink] [raw]
Subject: [V5 PATCH 11/26] mempolicy: use N_MEMORY instead N_HIGH_MEMORY

N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.
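
As an illustration of why the mask matters for user-supplied nodemasks
(for example the target nodes of migrate_pages), here is a userspace
sketch of the subset check. Nodemasks are modelled as 64-bit bitmaps and
the toy_* names are hypothetical.

```c
#include <errno.h>
#include <stdint.h>

typedef uint64_t toy_nodemask_t;

/* Toy nodes_subset(): true if every node in `a` is also in `b`. */
static int toy_nodes_subset(toy_nodemask_t a, toy_nodemask_t b)
{
    return (a & ~b) == 0;
}

/* Reject a requested target mask that names a node outside `allowed`. */
static int toy_check_target_nodes(toy_nodemask_t requested,
                                  toy_nodemask_t allowed)
{
    return toy_nodes_subset(requested, allowed) ? 0 : -EINVAL;
}
```

If node 2 is movable-only, checking a request for node 2 against a mask
of nodes with normal or high memory (0x3) fails with -EINVAL, while
checking it against a mask of all nodes with memory (0x7) succeeds.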

Signed-off-by: Lai Jiangshan <[email protected]>
---
mm/mempolicy.c | 12 ++++++------
1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d04a8a5..d4a084c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -212,9 +212,9 @@ static int mpol_set_nodemask(struct mempolicy *pol,
/* if mode is MPOL_DEFAULT, pol is NULL. This is right. */
if (pol == NULL)
return 0;
- /* Check N_HIGH_MEMORY */
+ /* Check N_MEMORY */
nodes_and(nsc->mask1,
- cpuset_current_mems_allowed, node_states[N_HIGH_MEMORY]);
+ cpuset_current_mems_allowed, node_states[N_MEMORY]);

VM_BUG_ON(!nodes);
if (pol->mode == MPOL_PREFERRED && nodes_empty(*nodes))
@@ -1388,7 +1388,7 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode,
goto out_put;
}

- if (!nodes_subset(*new, node_states[N_HIGH_MEMORY])) {
+ if (!nodes_subset(*new, node_states[N_MEMORY])) {
err = -EINVAL;
goto out_put;
}
@@ -2361,7 +2361,7 @@ void __init numa_policy_init(void)
* fall back to the largest node if they're all smaller.
*/
nodes_clear(interleave_nodes);
- for_each_node_state(nid, N_HIGH_MEMORY) {
+ for_each_node_state(nid, N_MEMORY) {
unsigned long total_pages = node_present_pages(nid);

/* Preserve the largest node */
@@ -2442,7 +2442,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context)
*nodelist++ = '\0';
if (nodelist_parse(nodelist, nodes))
goto out;
- if (!nodes_subset(nodes, node_states[N_HIGH_MEMORY]))
+ if (!nodes_subset(nodes, node_states[N_MEMORY]))
goto out;
} else
nodes_clear(nodes);
@@ -2476,7 +2476,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context)
* Default to online nodes with memory if no nodelist
*/
if (!nodelist)
- nodes = node_states[N_HIGH_MEMORY];
+ nodes = node_states[N_MEMORY];
break;
case MPOL_LOCAL:
/*
--
1.7.4.4

2012-10-29 16:22:19

by Michal Hocko

[permalink] [raw]
Subject: Re: [V5 PATCH 08/26] memcontrol: use N_MEMORY instead N_HIGH_MEMORY

On Mon 29-10-12 23:20:58, Lai Jiangshan wrote:
> N_HIGH_MEMORY stands for the nodes that has normal or high memory.
> N_MEMORY stands for the nodes that has any memory.

What is the difference between those two?

> The code here need to handle with the nodes which have memory, we should
> use N_MEMORY instead.
>
> Signed-off-by: Lai Jiangshan <[email protected]>
> ---
> mm/memcontrol.c | 18 +++++++++---------
> mm/page_cgroup.c | 2 +-
> 2 files changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7acf43b..1b69665 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -800,7 +800,7 @@ static unsigned long mem_cgroup_nr_lru_pages(struct mem_cgroup *memcg,
> int nid;
> u64 total = 0;
>
> - for_each_node_state(nid, N_HIGH_MEMORY)
> + for_each_node_state(nid, N_MEMORY)
> total += mem_cgroup_node_nr_lru_pages(memcg, nid, lru_mask);
> return total;
> }
> @@ -1611,9 +1611,9 @@ static void mem_cgroup_may_update_nodemask(struct mem_cgroup *memcg)
> return;
>
> /* make a nodemask where this memcg uses memory from */
> - memcg->scan_nodes = node_states[N_HIGH_MEMORY];
> + memcg->scan_nodes = node_states[N_MEMORY];
>
> - for_each_node_mask(nid, node_states[N_HIGH_MEMORY]) {
> + for_each_node_mask(nid, node_states[N_MEMORY]) {
>
> if (!test_mem_cgroup_node_reclaimable(memcg, nid, false))
> node_clear(nid, memcg->scan_nodes);
> @@ -1684,7 +1684,7 @@ static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap)
> /*
> * Check rest of nodes.
> */
> - for_each_node_state(nid, N_HIGH_MEMORY) {
> + for_each_node_state(nid, N_MEMORY) {
> if (node_isset(nid, memcg->scan_nodes))
> continue;
> if (test_mem_cgroup_node_reclaimable(memcg, nid, noswap))
> @@ -3759,7 +3759,7 @@ move_account:
> drain_all_stock_sync(memcg);
> ret = 0;
> mem_cgroup_start_move(memcg);
> - for_each_node_state(node, N_HIGH_MEMORY) {
> + for_each_node_state(node, N_MEMORY) {
> for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
> enum lru_list lru;
> for_each_lru(lru) {
> @@ -4087,7 +4087,7 @@ static int memcg_numa_stat_show(struct cgroup *cont, struct cftype *cft,
>
> total_nr = mem_cgroup_nr_lru_pages(memcg, LRU_ALL);
> seq_printf(m, "total=%lu", total_nr);
> - for_each_node_state(nid, N_HIGH_MEMORY) {
> + for_each_node_state(nid, N_MEMORY) {
> node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL);
> seq_printf(m, " N%d=%lu", nid, node_nr);
> }
> @@ -4095,7 +4095,7 @@ static int memcg_numa_stat_show(struct cgroup *cont, struct cftype *cft,
>
> file_nr = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_FILE);
> seq_printf(m, "file=%lu", file_nr);
> - for_each_node_state(nid, N_HIGH_MEMORY) {
> + for_each_node_state(nid, N_MEMORY) {
> node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid,
> LRU_ALL_FILE);
> seq_printf(m, " N%d=%lu", nid, node_nr);
> @@ -4104,7 +4104,7 @@ static int memcg_numa_stat_show(struct cgroup *cont, struct cftype *cft,
>
> anon_nr = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_ANON);
> seq_printf(m, "anon=%lu", anon_nr);
> - for_each_node_state(nid, N_HIGH_MEMORY) {
> + for_each_node_state(nid, N_MEMORY) {
> node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid,
> LRU_ALL_ANON);
> seq_printf(m, " N%d=%lu", nid, node_nr);
> @@ -4113,7 +4113,7 @@ static int memcg_numa_stat_show(struct cgroup *cont, struct cftype *cft,
>
> unevictable_nr = mem_cgroup_nr_lru_pages(memcg, BIT(LRU_UNEVICTABLE));
> seq_printf(m, "unevictable=%lu", unevictable_nr);
> - for_each_node_state(nid, N_HIGH_MEMORY) {
> + for_each_node_state(nid, N_MEMORY) {
> node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid,
> BIT(LRU_UNEVICTABLE));
> seq_printf(m, " N%d=%lu", nid, node_nr);
> diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
> index 5ddad0c..c1054ad 100644
> --- a/mm/page_cgroup.c
> +++ b/mm/page_cgroup.c
> @@ -271,7 +271,7 @@ void __init page_cgroup_init(void)
> if (mem_cgroup_disabled())
> return;
>
> - for_each_node_state(nid, N_HIGH_MEMORY) {
> + for_each_node_state(nid, N_MEMORY) {
> unsigned long start_pfn, end_pfn;
>
> start_pfn = node_start_pfn(nid);
> --
> 1.7.4.4
>

--
Michal Hocko
SUSE Labs

2012-10-29 20:40:47

by David Rientjes

[permalink] [raw]
Subject: Re: [V5 PATCH 08/26] memcontrol: use N_MEMORY instead N_HIGH_MEMORY

On Mon, 29 Oct 2012, Michal Hocko wrote:

> > N_HIGH_MEMORY stands for the nodes that has normal or high memory.
> > N_MEMORY stands for the nodes that has any memory.
>
> What is the difference of those two?
>

Patch 5 in the series introduces it to be equal to N_HIGH_MEMORY, so
accepting this patch would be an implicit ack of the direction taken
there.

2012-10-29 20:46:41

by David Rientjes

[permalink] [raw]
Subject: Re: [V5 PATCH 05/26] node_states: introduce N_MEMORY

On Mon, 29 Oct 2012, Lai Jiangshan wrote:

> We have N_NORMAL_MEMORY for standing for the nodes that have normal memory with
> zone_type <= ZONE_NORMAL.
>
> And we have N_HIGH_MEMORY for standing for the nodes that have normal or high
> memory.
>

(In other words, all memory.)

> But we don't have any word to stand for the nodes that have *any* memory.
>

It's N_HIGH_MEMORY, or at least it's supposed to be. Is there a problem
where the bit isn't getting set for a node with memory?

> A) But this reusing is bad for *readability*. Because the name
> N_HIGH_MEMORY just stands for high or normal:
>
> A.example 1)
> mem_cgroup_nr_lru_pages():
> for_each_node_state(nid, N_HIGH_MEMORY)
>
> The user will be confused(why this function just counts for high or
> normal memory node? does it counts for ZONE_MOVABLE's lru pages?)
> until someone else tell them N_HIGH_MEMORY is reused to stand for
> nodes that have any memory.
>
> A.cont) If we introduce N_MEMORY, we can reduce this confusing
> AND make the code more clearly:
>
> A.example 2) mm/page_cgroup.c use N_HIGH_MEMORY twice:
>
> One is in page_cgroup_init(void):
> for_each_node_state(nid, N_HIGH_MEMORY) {
>
> It means if the node have memory, we will allocate page_cgroup map for
> the node. We should use N_MEMORY instead here to gaim more clearly.
>
> The second using is in alloc_page_cgroup():
> if (node_state(nid, N_HIGH_MEMORY))
> addr = vzalloc_node(size, nid);
>
> It means if the node has high or normal memory that can be allocated
> from kernel. We should keep N_HIGH_MEMORY here, and it will be better
> if the "any memory" semantic of N_HIGH_MEMORY is removed.
>
> B) This reusing is out-dated if we introduce MOVABLE-dedicated node.
> The MOVABLE-dedicated node should not appear in
> node_stats[N_HIGH_MEMORY] nor node_stats[N_NORMAL_MEMORY],
> because MOVABLE-dedicated node has no high or normal memory.
>
> In x86_64, N_HIGH_MEMORY=N_NORMAL_MEMORY, if a MOVABLE-dedicated node
> is in node_stats[N_HIGH_MEMORY], it is also means it is in
> node_stats[N_NORMAL_MEMORY], it causes SLUB wrong.
>
> The slub uses
> for_each_node_state(nid, N_NORMAL_MEMORY)
> and creates kmem_cache_node for MOVABLE-dedicated node and cause problem.
>
> In one word, we need a N_MEMORY. We just intrude it as an alias to
> N_HIGH_MEMORY and fix all im-proper usages of N_HIGH_MEMORY in late patches.
>

If this is really that problematic (and it appears it's not given that
there are many use cases of it and people tend to get it right), then why
not simply rename N_HIGH_MEMORY instead of introducing yet another
nodemask to the equation?

2012-10-29 20:58:16

by Michal Hocko

[permalink] [raw]
Subject: Re: [V5 PATCH 08/26] memcontrol: use N_MEMORY instead N_HIGH_MEMORY

On Mon 29-10-12 13:40:39, David Rientjes wrote:
> On Mon, 29 Oct 2012, Michal Hocko wrote:
>
> > > N_HIGH_MEMORY stands for the nodes that has normal or high memory.
> > > N_MEMORY stands for the nodes that has any memory.
> >
> > What is the difference of those two?
> >
>
> Patch 5 in the series

Strange, I do not see that one at the mailing list.

> introduces it to be equal to N_HIGH_MEMORY, so

So this is just a rename? If yes it would be much easier if it was
mentioned in the patch description.

> accepting this patch would be an implicit ack of the direction taken
> there.

--
Michal Hocko
SUSE Labs

2012-10-29 21:08:10

by David Rientjes

[permalink] [raw]
Subject: Re: [V5 PATCH 08/26] memcontrol: use N_MEMORY instead N_HIGH_MEMORY

On Mon, 29 Oct 2012, Michal Hocko wrote:

> > > > N_HIGH_MEMORY stands for the nodes that has normal or high memory.
> > > > N_MEMORY stands for the nodes that has any memory.
> > >
> > > What is the difference of those two?
> > >
> >
> > Patch 5 in the series
>
> Strange, I do not see that one at the mailing list.
>

http://marc.info/?l=linux-kernel&m=135152595827692

> > introduces it to be equal to N_HIGH_MEMORY, so
>
> So this is just a rename? If yes it would be much esier if it was
> mentioned in the patch description.
>

It's not even a rename even though it should be, it's adding yet another
node_states that is equal to N_HIGH_MEMORY since that state already
includes all memory. It's just a matter of taste but I think we should be
renaming it instead of aliasing it (unless you actually want to make
N_HIGH_MEMORY only include nodes with highmem, but nothing depends on
that).

2012-10-29 21:34:18

by Michal Hocko

[permalink] [raw]
Subject: Re: [V5 PATCH 08/26] memcontrol: use N_MEMORY instead N_HIGH_MEMORY

On Mon 29-10-12 14:08:05, David Rientjes wrote:
> On Mon, 29 Oct 2012, Michal Hocko wrote:
>
> > > > > N_HIGH_MEMORY stands for the nodes that has normal or high memory.
> > > > > N_MEMORY stands for the nodes that has any memory.
> > > >
> > > > What is the difference of those two?
> > > >
> > >
> > > Patch 5 in the series
> >
> > Strange, I do not see that one at the mailing list.
> >
>
> http://marc.info/?l=linux-kernel&m=135152595827692

Thanks!

> > > introduces it to be equal to N_HIGH_MEMORY, so
> >
> > So this is just a rename? If yes it would be much esier if it was
> > mentioned in the patch description.
> >
>
> It's not even a rename even though it should be, it's adding yet another
> node_states that is equal to N_HIGH_MEMORY since that state already
> includes all memory.

Which is really strange because I do not see any reason for yet another
alias if the follow up patches rename all of them (I didn't try to apply
the whole series to check that so I might be wrong here).

> It's just a matter of taste but I think we should be renaming it
> instead of aliasing it (unless you actually want to make N_HIGH_MEMORY
> only include nodes with highmem, but nothing depends on that).

Agreed, I've always considered N_HIGH_MEMORY misleading and confusing so
renaming it would really make a lot of sense to me.
--
Michal Hocko
SUSE Labs

2012-10-30 09:51:28

by Yasuaki Ishimatsu

[permalink] [raw]
Subject: Re: [V5 PATCH 00/26] mm, memory-hotplug: dynamic configure movable memory and introduce movable node

Hi Lai,

The patch-set is huge, so we hesitate to read it.
I think the patch-set contains multiple feature developments:
- Development of online_movable [PATCH 1 - 3]
- Cleanup node_state_attr [PATCH 4]
- Introduce N_MEMORY [PATCH 5 - 18]
- Development of kernelcore_max_addr [PATCH 19 - 25]
- Bug fix [PATCH 26]

Why don't you split the patch-set by feature?
If the patch-set is split up, many more people can easily participate
in your development.

Thanks,
Yasuaki Ishimatsu

2012/10/30 0:07, Lai Jiangshan wrote:
> Movable memory is a very important concept of memory-management,
> we need to consolidate it and make use of it on systems.
>
> Movable memory is needed for
> o anti-fragmentation(hugepage, big-order allocation...)
> o logic hot-remove(virtualization, Memory capacity on Demand)
> o physic hot-remove(power-saving, hardware partitioning, hardware fault management)
>
> All these require dynamic configuring the memory and making better utilities of memories
> and safer. We also need physic hot-remove, so we need movable node too.
> (Although some systems support physic-memory-migration, we don't require all
> memory on physic-node is movable, but movable node is still needed here
> for logic-node if we want to make physic-migration is transparent)
>
> We add dynamic configuration commands "online_movalbe" and "online_kernel".
> We also add non-dynamic boot option kernelcore_max_addr.
> We may add some more dynamic/non-dynamic configuration in future.
>
>
> The patchset is based on 3.7-rc3 with these three patches already applied:
> https://lkml.org/lkml/2012/10/24/151
> https://lkml.org/lkml/2012/10/26/150
>
> You can also simply pull all the patches from:
> git pull https://github.com/laijs/linux.git hotplug-next
>
>
>
> Issues):
>
> mempolicy(M_BIND) don't act well when the nodemask has movable nodes only,
> the kernel allocation will fail and the task can't create new task or other
> kernel objects.
>
> So we change the strategy/policy
> when the bound nodemask has movable node(s) only, we only
> apply mempolicy for userspace allocation, don't apply it
> for kernel allocation.
>
> CPUSET also has the same problem, but the code spread in page_alloc.c,
> and we doesn't fix it yet, we can/will change allocation strategy to one of
> these 3 strategies:
> 1) the same strategy as mempolicy
> 2) change cpuset, make nodemask always has at least a normal node
> 3) split nodemask: nodemask_user and nodemask_kernel
>
> Thoughts?
>
>
>
> Patches):
>
> patch1-3: add online_movable and online_kernel, bot don't result movable node
> Patch4 cleanup for node_state_attr
> Patch5 introduce N_MEMORY
> Patch6-17 use N_MEMORY instead N_HIGH_MEMORY.
> The patches are separated by subsystem,
> Patch18 also changes the node_states initialization
> Patch18-20 Add MOVABLE-dedicated node
> Patch21-25 Add kernelcore_max_addr
> patch26: mempolicy handle movable node
>
>
>
>
> Changes):
>
> change V5-V4:
> consolidate online_movable/online_kernel
> nodemask management
>
> change V4-v3
> rebase.
> online_movable/online_kernel can create a zone from empty
> or empty a zone
>
> change V3-v2:
> Proper nodemask management
>
> change V2-V1:
>
> The original V1 patchset of MOVABLE-dedicated node is here:
> http://comments.gmane.org/gmane.linux.kernel.mm/78122
>
> The new V2 adds N_MEMORY and a notion of "MOVABLE-dedicated node",
> and fixes some related problems.
>
> The original V1 patchset of "add online_movable" is here:
> https://lkml.org/lkml/2012/7/4/145
>
> The new V2 discards the MIGRATE_HOTREMOVE approach and uses a more
> straightforward implementation (only 1 patch).
>
>
>
> Lai Jiangshan (22):
> mm, memory-hotplug: dynamic configure movable memory and portion
> memory
> memory_hotplug: handle empty zone when online_movable/online_kernel
> memory_hotplug: ensure every online node has NORMAL memory
> node: cleanup node_state_attr
> node_states: introduce N_MEMORY
> cpuset: use N_MEMORY instead N_HIGH_MEMORY
> procfs: use N_MEMORY instead N_HIGH_MEMORY
> memcontrol: use N_MEMORY instead N_HIGH_MEMORY
> oom: use N_MEMORY instead N_HIGH_MEMORY
> mm,migrate: use N_MEMORY instead N_HIGH_MEMORY
> mempolicy: use N_MEMORY instead N_HIGH_MEMORY
> hugetlb: use N_MEMORY instead N_HIGH_MEMORY
> vmstat: use N_MEMORY instead N_HIGH_MEMORY
> kthread: use N_MEMORY instead N_HIGH_MEMORY
> init: use N_MEMORY instead N_HIGH_MEMORY
> vmscan: use N_MEMORY instead N_HIGH_MEMORY
> page_alloc: use N_MEMORY instead N_HIGH_MEMORY change the node_states
> initialization
> hotplug: update nodemasks management
> numa: add CONFIG_MOVABLE_NODE for movable-dedicated node
> memory_hotplug: allow online/offline memory to result movable node
> page_alloc: add kernelcore_max_addr
> mempolicy: fix is_valid_nodemask()
>
> Yasuaki Ishimatsu (4):
> x86: get pg_data_t's memory from other node
> x86: use memblock_set_current_limit() to set memblock.current_limit
> memblock: limit memory address from memblock
> memblock: compare current_limit with end variable at
> memblock_find_in_range_node()
>
> Documentation/cgroups/cpusets.txt | 2 +-
> Documentation/kernel-parameters.txt | 9 +
> Documentation/memory-hotplug.txt | 19 ++-
> arch/x86/kernel/setup.c | 4 +-
> arch/x86/mm/init_64.c | 4 +-
> arch/x86/mm/numa.c | 8 +-
> drivers/base/memory.c | 27 ++--
> drivers/base/node.c | 28 ++--
> fs/proc/kcore.c | 2 +-
> fs/proc/task_mmu.c | 4 +-
> include/linux/cpuset.h | 2 +-
> include/linux/memblock.h | 1 +
> include/linux/memory.h | 1 +
> include/linux/memory_hotplug.h | 13 ++-
> include/linux/nodemask.h | 5 +
> init/main.c | 2 +-
> kernel/cpuset.c | 32 ++--
> kernel/kthread.c | 2 +-
> mm/Kconfig | 8 +
> mm/hugetlb.c | 24 ++--
> mm/memblock.c | 10 +-
> mm/memcontrol.c | 18 +-
> mm/memory_hotplug.c | 283 +++++++++++++++++++++++++++++++++--
> mm/mempolicy.c | 48 ++++---
> mm/migrate.c | 2 +-
> mm/oom_kill.c | 2 +-
> mm/page_alloc.c | 76 +++++++---
> mm/page_cgroup.c | 2 +-
> mm/vmscan.c | 4 +-
> mm/vmstat.c | 4 +-
> 30 files changed, 508 insertions(+), 138 deletions(-)
>

2012-10-31 07:04:02

by Wen Congyang

Subject: Re: [V5 PATCH 05/26] node_states: introduce N_MEMORY

At 10/30/2012 04:46 AM, David Rientjes Wrote:
> On Mon, 29 Oct 2012, Lai Jiangshan wrote:
>
>> We have N_NORMAL_MEMORY to stand for the nodes that have normal memory with
>> zone_type <= ZONE_NORMAL.
>>
>> And we have N_HIGH_MEMORY to stand for the nodes that have normal or high
>> memory.
>>
>
> (In other words, all memory.)
>
>> But we don't have any name that stands for the nodes that have *any* memory.
>>
>
> It's N_HIGH_MEMORY, or at least it's supposed to be. Is there a problem
> where the bit isn't getting set for a node with memory?
>
>> A) But this reuse is bad for *readability*, because the name
>> N_HIGH_MEMORY just stands for high or normal:
>>
>> A.example 1)
>> mem_cgroup_nr_lru_pages():
>> for_each_node_state(nid, N_HIGH_MEMORY)
>>
>> The reader will be confused (why does this function count only high or
>> normal memory nodes? Does it count ZONE_MOVABLE's lru pages?)
>> until someone else tells them N_HIGH_MEMORY is reused to stand for
>> nodes that have any memory.
>>
>> A.cont) If we introduce N_MEMORY, we can reduce this confusion
>> AND make the code clearer:
>>
>> A.example 2) mm/page_cgroup.c use N_HIGH_MEMORY twice:
>>
>> One is in page_cgroup_init(void):
>> for_each_node_state(nid, N_HIGH_MEMORY) {
>>
>> It means that if the node has memory, we will allocate a page_cgroup map for
>> the node. We should use N_MEMORY here instead to gain clarity.
>>
>> The second use is in alloc_page_cgroup():
>> if (node_state(nid, N_HIGH_MEMORY))
>> addr = vzalloc_node(size, nid);
>>
>> It means the node has high or normal memory that can be allocated
>> by the kernel. We should keep N_HIGH_MEMORY here, and it will be better
>> if the "any memory" semantic of N_HIGH_MEMORY is removed.
>>
>> B) This reuse is outdated if we introduce a MOVABLE-dedicated node.
>> The MOVABLE-dedicated node should not appear in
>> node_states[N_HIGH_MEMORY] nor node_states[N_NORMAL_MEMORY],
>> because a MOVABLE-dedicated node has no high or normal memory.
>>
>> On x86_64, N_HIGH_MEMORY=N_NORMAL_MEMORY, so if a MOVABLE-dedicated node
>> is in node_states[N_HIGH_MEMORY], it is also in
>> node_states[N_NORMAL_MEMORY], which breaks SLUB.
>>
>> SLUB uses
>> for_each_node_state(nid, N_NORMAL_MEMORY)
>> and creates a kmem_cache_node for the MOVABLE-dedicated node, which causes
>> problems.
>>
>> In a word, we need N_MEMORY. We just introduce it as an alias of
>> N_HIGH_MEMORY and fix all improper usages of N_HIGH_MEMORY in later patches.
>>
>
> If this is really that problematic (and it appears it's not given that
> there are many use cases of it and people tend to get it right), then why
> not simply rename N_HIGH_MEMORY instead of introducing yet another
> nodemask to the equation?

The reason is that we need a node which only contains movable memory. This
feature is very important for node hotplug. So we will add a new nodemask
for movable memory. N_MEMORY contains movable memory but N_HIGH_MEMORY
doesn't contain it.
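
To make the distinction concrete, here is a toy userspace model of the three node states (not kernel code). All toy_* names and the TOY_ZONE_* flags are hypothetical, and the membership rules are simply the ones stated in this thread: N_NORMAL_MEMORY needs normal memory, N_HIGH_MEMORY needs normal or high memory, and N_MEMORY needs any memory at all.

```c
#include <assert.h>
#include <stdbool.h>

/* Kinds of memory a node may hold (toy flags, not the kernel's enum zone_type). */
enum toy_zone_kind {
	TOY_ZONE_NORMAL  = 1u << 0,
	TOY_ZONE_HIGH    = 1u << 1,
	TOY_ZONE_MOVABLE = 1u << 2,
};

/* Toy counterparts of the node states discussed in the thread. */
enum toy_node_state {
	TOY_N_NORMAL_MEMORY,
	TOY_N_HIGH_MEMORY,
	TOY_N_MEMORY,
	TOY_NR_NODE_STATES,
};

#define TOY_MAX_NODES 32

/* One bit per node for each state, standing in for node_states[]. */
unsigned int toy_node_states[TOY_NR_NODE_STATES];

/* Record which kinds of memory node nid has and update every state mask. */
void toy_set_node_zones(int nid, unsigned int zones)
{
	assert(nid >= 0 && nid < TOY_MAX_NODES);

	unsigned int bit = 1u << nid;
	int s;

	for (s = 0; s < TOY_NR_NODE_STATES; s++)
		toy_node_states[s] &= ~bit;

	if (zones & TOY_ZONE_NORMAL)
		toy_node_states[TOY_N_NORMAL_MEMORY] |= bit;
	if (zones & (TOY_ZONE_NORMAL | TOY_ZONE_HIGH))
		toy_node_states[TOY_N_HIGH_MEMORY] |= bit;
	if (zones != 0)
		toy_node_states[TOY_N_MEMORY] |= bit;
}

/* Test whether node nid is in the given state. */
bool toy_node_state(int nid, enum toy_node_state state)
{
	assert(nid >= 0 && nid < TOY_MAX_NODES);
	return (toy_node_states[state] >> nid) & 1u;
}
```

In this model a node set up with only TOY_ZONE_MOVABLE is in TOY_N_MEMORY but in neither of the other two masks, so a loop over the normal-memory mask (as SLUB does with N_NORMAL_MEMORY) would skip it.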

Thanks
Wen Congyang

> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2012-10-31 09:24:36

by Wen Congyang

Subject: Re: [V5 PATCH 00/26] mm, memory-hotplug: dynamic configure movable memory and introduce movable node

At 10/30/2012 05:50 PM, Yasuaki Ishimatsu Wrote:
> Hi Lai,
>
> The patch-set is huge. Therefore, we hesitate to read the patch-set.
> I think the patch-set has multiple feature developments.
> - Development of online_movable [PATCH 1 - 3]
> - Cleanup node_state_attr [PATCH 4]
> - Introduce N_MEMORY [PATCH 5 - 18]
> - Development of kernelcore_max_addr [PATCH 19 - 25]
> - Bug fix [PATCH 26]

I have split it into 6 patchsets:

part1: patch 1-3
http://marc.info/?l=linux-kernel&m=135166176108186&w=2

part2: patch 4
http://marc.info/?l=linux-kernel&m=135166705909544&w=2

part3: patch 5-18
http://marc.info/?l=linux-kernel&m=135167050510527&w=2

part4: patch 19-20
http://marc.info/?l=linux-kernel&m=135167344211401&w=2

part5: patch 21-25
http://marc.info/?l=linux-kernel&m=135167497312063&w=2

part6: patch 26
http://marc.info/?l=linux-kernel&m=135167512612132&w=2

>
> Why don't you separate the patch-set into each feature development?
> By separating the patch-set, many people can easily participate
> in your development.
>
> Thanks,
> Yasuaki Ishimatsu
>

2012-10-31 13:18:17

by Michal Hocko

Subject: Re: [V5 PATCH 08/26] memcontrol: use N_MEMORY instead N_HIGH_MEMORY

On Wed 31-10-12 15:03:36, Wen Congyang wrote:
> At 10/30/2012 04:46 AM, David Rientjes Wrote:
> > On Mon, 29 Oct 2012, Lai Jiangshan wrote:
[...]
> >> In a word, we need N_MEMORY. We just introduce it as an alias of
> >> N_HIGH_MEMORY and fix all improper usages of N_HIGH_MEMORY in later patches.
> >>
> >
> > If this is really that problematic (and it appears it's not given that
> > there are many use cases of it and people tend to get it right), then why
> > not simply rename N_HIGH_MEMORY instead of introducing yet another
> > nodemask to the equation?
>
> The reason is that we need a node which only contains movable memory. This
> feature is very important for node hotplug. So we will add a new nodemask
> for movable memory. N_MEMORY contains movable memory but N_HIGH_MEMORY
> doesn't contain it.

OK, so the N_MOVABLE_MEMORY (or whatever you will call it) requires that all
the allocations will be migratable?
How do you want to achieve that with the page_cgroup descriptors? (see
below)

On Mon 29-10-12 23:20:58, Lai Jiangshan wrote:
[...]
> diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
> index 5ddad0c..c1054ad 100644
> --- a/mm/page_cgroup.c
> +++ b/mm/page_cgroup.c
> @@ -271,7 +271,7 @@ void __init page_cgroup_init(void)
>  	if (mem_cgroup_disabled())
>  		return;
>
> -	for_each_node_state(nid, N_HIGH_MEMORY) {
> +	for_each_node_state(nid, N_MEMORY) {
>  		unsigned long start_pfn, end_pfn;
>
>  		start_pfn = node_start_pfn(nid);

This will call init_section_page_cgroup(pfn, nid) later which allocates
page_cgroup descriptors which are not movable. Or is there any code in
your patchset that handles this?
--
Michal Hocko
SUSE Labs