2013-04-12 01:14:19

by Cody P Schafer

Subject: [RFC PATCH v2 00/25] Dynamic NUMA: Runtime NUMA memory layout reconfiguration

These patches allow the NUMA memory layout (the mapping from physical pages to
NUMA nodes, i.e. which node each physical page belongs to) to be changed at
runtime, in place (without hotplugging).

Depends on "mm: avoid duplication of setup_nr_node_ids()",
http://comments.gmane.org/gmane.linux.kernel.mm/96880, which is merged into the
current MMOTM.

TODO:

- Update sysfs node information when reconfiguration occurs
- Currently, I use pageflag setters without "owning" the pages, which could cause
loss of pageflag updates when combined with non-atomic pageflag users in
mm/*. Some options for solving this: (a) make all pageflag access atomic,
(b) use pageblock flags, (c) use bits in a new bitmap, or (d) attempt to work
around races in a similar way to memory-failure.

= Why/when is this useful? =

In virtual machines (VMs) running on NUMA systems: both [a] if/when the
hypervisor decides to move their backing memory around (compacting,
prioritizing another VM's desired layout, etc.) and [b] in general for
migration of VMs.

The hardware is _already_ changing the NUMA layout underneath us. We have
powerpc64 systems whose firmware already moves the backing memory around and
has the ability to notify Linux of the new NUMA info.

= How are you managing to do this? =

Reconfiguration of page->node mappings is done at the page allocator
level by both pulling free pages out of the free lists (when a new memory
layout is committed) & redirecting pages on free to their new node.

Because we can't change page_to_nid(A) while A is allocated [1], an rbtree
holding the mapping from pfn ranges to node ids ('struct memlayout')
is introduced to track the pfn->node mapping for
yet-to-be-transplanted pages. A lookup in this rbtree occurs on any
page allocator path that decides which zone to free a page to.

To avoid horrible performance due to rbtree lookups all the time, the
rbtree is only consulted when the page is marked with a new pageflag
(LookupNode).

[1]: quite a few users of page_to_nid() depend on it not changing; some
accumulate per-node stats using it. We'd also have to change it via atomic
operations to avoid disturbing the other pageflags which share the same
unsigned long.
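
For illustration, here is a condensed sketch of that free-path check (it
mirrors the dnuma_page_needs_move()/memlayout_pfn_to_nid() helpers added later
in this series; the function name is made up and the WARN/zone-initialization
checks are trimmed):

	static int sketch_page_needs_move(struct page *page)
	{
		int new_nid;

		/* Fast path: page was not marked when the layout changed. */
		if (!TestClearPageLookupNode(page))
			return NUMA_NO_NODE;

		/* Slow path: consult the pfn -> nid rbtree (struct memlayout). */
		new_nid = memlayout_pfn_to_nid(page_to_pfn(page));
		if (new_nid == NUMA_NO_NODE || new_nid == page_to_nid(page))
			return NUMA_NO_NODE;	/* nothing to transplant */

		return new_nid;			/* free into the new node's zone */
	}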

= Code & testing =

A debugfs interface allows the NUMA memory layout to be changed, so you don't
need weird systems to test this; in fact, I've done all my testing so far in
plain old qemu-i386.

A script which stripes the memory between nodes or pushes all memory to a
(potentially new) node is available here:

https://raw.github.com/jmesmon/trifles/master/bin/dnuma-test

The patches are also available via:

https://github.com/jmesmon/linux.git dnuma/v32

f16107a..db6f5c3

= Current Limitations =

For the reconfiguration to be effective (and to avoid the allocator making
poorer choices), the cpu->node mappings also need to be updated. This patchset
does _not_ handle that. Also missing is a way to update the topology (node
distances), which is slightly less fatal.

These patches only work with SPARSEMEM, and the node id _must_ fit in the
page flags (it can't be pushed out to the section). This generally means that
32-bit platforms are out (unless you hack MAX_PHYS{ADDR,MEM}_BITS).

This code does the reconfiguration without hotplugging memory at all (1
errant page doesn't keep us from fixing the rest of them). But it still
depends on MEMORY_HOTPLUG for functions that online nodes & adjust
zone/pgdat size.

Things that need doing or would be nice to have but aren't bugs:

- While the interface is meant to be driven via a hypervisor/firmware, that
portion is not yet included.
- notifier for kernel users of memory that need/want their allocations on a
particular node (NODE_DATA(), for instance).
- notifier for userspace.
- a way to allocate things from the appropriate node prior to the page
allocator being fully updated (could just be "allocate it wrong now &
reallocate later").
- Make memlayout faster (potentially via per-node allocation, different data
structure, and/or more/smarter caching).
- (potentially) propagation of updated layout knowledge into kmem_caches
(SL*B).

--

Since v1: http://comments.gmane.org/gmane.linux.kernel.mm/95541

- Update watermarks.
- Update zone percpu pageset ->batch & ->high only when needed.
- Don't lazily adjust {pgdat,zone}->{present_pages,managed_pages}, set them all at once.
- Don't attempt to use more than nr_node_ids nodes.

--

Cody P Schafer (25):
rbtree: add postorder iteration functions.
rbtree: add rbtree_postorder_for_each_entry_safe() helper.
mm/memory_hotplug: factor out zone+pgdat growth.
memory_hotplug: export ensure_zone_is_initialized() in mm/internal.h
mm/memory_hotplug: use {pgdat,zone}_is_empty() when resizing zones &
pgdats
mm: add nid_zone() helper
page_alloc: add return_pages_to_zone() when DYNAMIC_NUMA is enabled.
page_alloc: in move_freepages(), skip pages instead of VM_BUG on node
differences.
page_alloc: when dynamic numa is enabled, don't check that all pages
in a block belong to the same zone
page-flags dnuma: reserve a pageflag for determining if a page needs a
node lookup.
memory_hotplug: factor out locks in mem_online_node()
mm: add memlayout & dnuma to track pfn->nid & transplant pages between
nodes
mm: memlayout+dnuma: add debugfs interface
page_alloc: use dnuma to transplant newly freed pages in
__free_pages_ok()
page_alloc: use dnuma to transplant newly freed pages in
free_hot_cold_page()
page_alloc: transplant pages that are being flushed from the per-cpu
lists
x86: memlayout: add an arch-specific initial memlayout setter.
init/main: call memlayout_global_init() in start_kernel().
dnuma: memlayout: add memory_add_physaddr_to_nid() for memory_hotplug
x86/mm/numa: when dnuma is enabled, use memlayout to handle memory
hotplug's physaddr_to_nid.
mm/memory_hotplug: VM_BUG if nid is too large.
mm/page_alloc: in page_outside_zone_boundaries(), avoid premature
decisions.
mm/page_alloc: make pr_err() in page_outside_zone_boundaries() more
useful
mm/page_alloc: use managed_pages instead of present_pages when
calculating default_zonelist_order()
mm: add an early_param "extra_nr_node_ids" to increase nr_node_ids
above the minimum by a percentage.

Documentation/kernel-parameters.txt | 6 +
arch/x86/mm/numa.c | 32 ++-
include/linux/dnuma.h | 97 ++++++++
include/linux/memlayout.h | 133 +++++++++++
include/linux/memory_hotplug.h | 4 +
include/linux/mm.h | 7 +-
include/linux/page-flags.h | 19 ++
include/linux/rbtree.h | 12 +
init/main.c | 2 +
lib/rbtree.c | 40 ++++
mm/Kconfig | 54 +++++
mm/Makefile | 2 +
mm/dnuma.c | 441 ++++++++++++++++++++++++++++++++++++
mm/internal.h | 13 +-
mm/memlayout-debugfs.c | 339 +++++++++++++++++++++++++++
mm/memlayout-debugfs.h | 39 ++++
mm/memlayout.c | 265 ++++++++++++++++++++++
mm/memory_hotplug.c | 54 +++--
mm/page_alloc.c | 154 +++++++++++--
19 files changed, 1669 insertions(+), 44 deletions(-)
create mode 100644 include/linux/dnuma.h
create mode 100644 include/linux/memlayout.h
create mode 100644 mm/dnuma.c
create mode 100644 mm/memlayout-debugfs.c
create mode 100644 mm/memlayout-debugfs.h
create mode 100644 mm/memlayout.c

--
1.8.2.1


2013-04-12 01:14:30

by Cody P Schafer

Subject: [RFC PATCH v2 08/25] page_alloc: in move_freepages(), skip pages instead of VM_BUG on node differences.

With dynamic NUMA, pages are going to be gradually moved from one node to
another, causing the page ranges that move_freepages() examines to
contain pages that actually belong to another node.

When dynamic NUMA is enabled, we skip these pages instead of VM_BUGing
out on them.

This additionally moves the VM_BUG_ON() (which detects a change in node)
so that it follows the pfn_valid_within() check.

Signed-off-by: Cody P Schafer <[email protected]>
---
mm/page_alloc.c | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1fbf5f2..75192eb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -957,6 +957,7 @@ int move_freepages(struct zone *zone,
struct page *page;
unsigned long order;
int pages_moved = 0;
+ int zone_nid = zone_to_nid(zone);

#ifndef CONFIG_HOLES_IN_ZONE
/*
@@ -970,14 +971,24 @@ int move_freepages(struct zone *zone,
#endif

for (page = start_page; page <= end_page;) {
- /* Make sure we are not inadvertently changing nodes */
- VM_BUG_ON(page_to_nid(page) != zone_to_nid(zone));
-
if (!pfn_valid_within(page_to_pfn(page))) {
page++;
continue;
}

+ if (page_to_nid(page) != zone_nid) {
+#ifndef CONFIG_DYNAMIC_NUMA
+ /*
+ * In the normal case (without Dynamic NUMA), all pages
+ * in a pageblock should belong to the same zone (and
+ * as a result all have the same nid).
+ */
+ VM_BUG_ON(page_to_nid(page) != zone_nid);
+#endif
+ page++;
+ continue;
+ }
+
if (!PageBuddy(page)) {
page++;
continue;
--
1.8.2.1

2013-04-12 01:14:34

by Cody P Schafer

Subject: [RFC PATCH v2 11/25] memory_hotplug: factor out locks in mem_online_node()

With dynamic NUMA, when onlining nodes, lock_memory_hotplug() is already
held at the point where mem_online_node()'s functionality is needed.

Factor out the locking and create a new function, __mem_online_node(), to
allow reuse.

Signed-off-by: Cody P Schafer <[email protected]>
---
include/linux/memory_hotplug.h | 1 +
mm/memory_hotplug.c | 29 ++++++++++++++++-------------
2 files changed, 17 insertions(+), 13 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index cd393014..391824d 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -248,6 +248,7 @@ static inline int is_mem_section_removable(unsigned long pfn,
static inline void try_offline_node(int nid) {}
#endif /* CONFIG_MEMORY_HOTREMOVE */

+extern int __mem_online_node(int nid);
extern int mem_online_node(int nid);
extern int add_memory(int nid, u64 start, u64 size);
extern int arch_add_memory(int nid, u64 start, u64 size);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index deea8c2..f5ea9b7 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1058,26 +1058,29 @@ static void rollback_node_hotadd(int nid, pg_data_t *pgdat)
return;
}

-
-/*
- * called by cpu_up() to online a node without onlined memory.
- */
-int mem_online_node(int nid)
+int __mem_online_node(int nid)
{
- pg_data_t *pgdat;
- int ret;
+ pg_data_t *pgdat;
+ int ret;

- lock_memory_hotplug();
pgdat = hotadd_new_pgdat(nid, 0);
- if (!pgdat) {
- ret = -ENOMEM;
- goto out;
- }
+ if (!pgdat)
+ return -ENOMEM;
+
node_set_online(nid);
ret = register_one_node(nid);
BUG_ON(ret);
+ return ret;
+}

-out:
+/*
+ * called by cpu_up() to online a node without onlined memory.
+ */
+int mem_online_node(int nid)
+{
+ int ret;
+ lock_memory_hotplug();
+ ret = __mem_online_node(nid);
unlock_memory_hotplug();
return ret;
}
--
1.8.2.1

2013-04-12 01:14:42

by Cody P Schafer

Subject: [RFC PATCH v2 12/25] mm: add memlayout & dnuma to track pfn->nid & transplant pages between nodes

On some systems, the hypervisor can (and will) relocate physical
addresses as seen in a VM between real NUMA nodes. For example, this occurs
on IBM Power systems running particular revisions of PHYP (IBM's
proprietary hypervisor).

This change set introduces the infrastructure for tracking & dynamically
changing "memory layouts" (or "memlayouts"): the mapping between page
ranges & the actual backing NUMA node.

A memlayout is stored as an rbtree which maps pfns (really, ranges of
pfns) to a node. This mapping (combined with the LookupNode pageflag) is
used to "transplant" pages (move them between nodes) when they are
freed back to the page allocator.

Additionally, when a new memlayout is committed, the currently free pages
that are now on the 'wrong' zone's freelist are immediately transplanted.

Hooks that tie it into the page allocator to actually perform the
"transplant on free" are in later patches.

Signed-off-by: Cody P Schafer <[email protected]>
---
include/linux/dnuma.h | 97 ++++++++++
include/linux/memlayout.h | 126 +++++++++++++
mm/Kconfig | 24 +++
mm/Makefile | 1 +
mm/dnuma.c | 439 ++++++++++++++++++++++++++++++++++++++++++++++
mm/memlayout.c | 237 +++++++++++++++++++++++++
6 files changed, 924 insertions(+)
create mode 100644 include/linux/dnuma.h
create mode 100644 include/linux/memlayout.h
create mode 100644 mm/dnuma.c
create mode 100644 mm/memlayout.c

diff --git a/include/linux/dnuma.h b/include/linux/dnuma.h
new file mode 100644
index 0000000..029a984
--- /dev/null
+++ b/include/linux/dnuma.h
@@ -0,0 +1,97 @@
+#ifndef LINUX_DNUMA_H_
+#define LINUX_DNUMA_H_
+
+#include <linux/mm.h>
+#include <linux/mmzone.h>
+#include <linux/memlayout.h>
+#include <linux/spinlock.h>
+#include <linux/atomic.h>
+
+#ifdef CONFIG_DYNAMIC_NUMA
+/* Must be called _before_ setting a new_ml to the pfn_to_node_map */
+void dnuma_online_required_nodes_and_zones(struct memlayout *new_ml);
+
+/* Must be called _after_ setting a new_ml to the pfn_to_node_map */
+void dnuma_move_free_pages(struct memlayout *new_ml);
+void dnuma_mark_page_range(struct memlayout *new_ml);
+
+static inline bool dnuma_is_active(void)
+{
+ struct memlayout *ml;
+ bool ret;
+
+ rcu_read_lock();
+ ml = rcu_dereference(pfn_to_node_map);
+ ret = ml && (ml->type != ML_INITIAL);
+ rcu_read_unlock();
+
+ return ret;
+}
+
+static inline bool dnuma_has_memlayout(void)
+{
+ return !!rcu_access_pointer(pfn_to_node_map);
+}
+
+static inline int dnuma_page_needs_move(struct page *page)
+{
+ int new_nid, old_nid;
+
+ if (!TestClearPageLookupNode(page))
+ return NUMA_NO_NODE;
+
+ /* FIXME: this does rcu_lock, deref, unlock */
+ if (WARN_ON(!dnuma_is_active()))
+ return NUMA_NO_NODE;
+
+ /* FIXME: and so does this (rcu lock, deref, and unlock) */
+ new_nid = memlayout_pfn_to_nid(page_to_pfn(page));
+ old_nid = page_to_nid(page);
+
+ if (new_nid == NUMA_NO_NODE) {
+ pr_alert("dnuma: pfn %05lx has moved from node %d to a non-memlayout range.\n",
+ page_to_pfn(page), old_nid);
+ return NUMA_NO_NODE;
+ }
+
+ if (new_nid == old_nid)
+ return NUMA_NO_NODE;
+
+ if (WARN_ON(!zone_is_initialized(
+ nid_zone(new_nid, page_zonenum(page)))))
+ return NUMA_NO_NODE;
+
+ return new_nid;
+}
+
+void dnuma_post_free_to_new_zone(struct page *page, int order);
+void dnuma_prior_free_to_new_zone(struct page *page, int order,
+ struct zone *dest_zone,
+ int dest_nid);
+
+#else /* !defined CONFIG_DYNAMIC_NUMA */
+
+static inline bool dnuma_is_active(void)
+{
+ return false;
+}
+
+static inline void dnuma_prior_free_to_new_zone(struct page *page, int order,
+ struct zone *dest_zone,
+ int dest_nid)
+{
+ BUG();
+}
+
+static inline void dnuma_post_free_to_new_zone(struct page *page, int order)
+{
+ BUG();
+}
+
+static inline int dnuma_page_needs_move(struct page *page)
+{
+ return NUMA_NO_NODE;
+}
+#endif /* !defined CONFIG_DYNAMIC_NUMA */
+
+#endif /* defined LINUX_DNUMA_H_ */
diff --git a/include/linux/memlayout.h b/include/linux/memlayout.h
new file mode 100644
index 0000000..6c26c52
--- /dev/null
+++ b/include/linux/memlayout.h
@@ -0,0 +1,126 @@
+#ifndef LINUX_MEMLAYOUT_H_
+#define LINUX_MEMLAYOUT_H_
+
+#include <linux/memblock.h> /* __init_memblock */
+#include <linux/mm.h> /* NODE_DATA, page_zonenum */
+#include <linux/mmzone.h> /* pfn_to_nid */
+#include <linux/rbtree.h>
+#include <linux/types.h> /* size_t */
+
+#ifdef CONFIG_DYNAMIC_NUMA
+# ifdef NODE_NOT_IN_PAGE_FLAGS
+# error "CONFIG_DYNAMIC_NUMA requires the NODE is in page flags. Try freeing up some flags by decreasing the maximum number of NUMA nodes, or switch to sparsmem-vmemmap"
+# endif
+
+enum memlayout_type {
+ ML_INITIAL,
+ ML_USER_DEBUG,
+ ML_NUM_TYPES
+};
+
+struct rangemap_entry {
+ struct rb_node node;
+ unsigned long pfn_start;
+ /* @pfn_end: inclusive, not stored as a count to make the lookup
+ * faster
+ */
+ unsigned long pfn_end;
+ int nid;
+};
+
+#define RME_FMT "{%05lx-%05lx}:%d"
+#define RME_EXP(rme) rme->pfn_start, rme->pfn_end, rme->nid
+
+struct memlayout {
+ /*
+ * - contains rangemap_entrys.
+ * - assumes no 'ranges' overlap.
+ */
+ struct rb_root root;
+ enum memlayout_type type;
+
+ /*
+ * When a memlayout is committed, 'cache' is accessed (the field is read
+ * from & written to) by multiple tasks without additional locking
+ * (other than the rcu locking for accessing the memlayout).
+ *
+ * Do not assume that it will not change. Use ACCESS_ONCE() to avoid
+ * potential races.
+ */
+ struct rangemap_entry *cache;
+
+#ifdef CONFIG_DNUMA_DEBUGFS
+ unsigned seq;
+ struct dentry *d;
+#endif
+};
+
+extern __rcu struct memlayout *pfn_to_node_map;
+
+/* FIXME: overflow potential in completion check */
+#define ml_for_each_pfn_in_range(rme, pfn) \
+ for (pfn = rme->pfn_start; \
+ pfn <= rme->pfn_end || pfn < rme->pfn_start; \
+ pfn++)
+
+static inline bool rme_bounds_pfn(struct rangemap_entry *rme, unsigned long pfn)
+{
+ return rme->pfn_start <= pfn && pfn <= rme->pfn_end;
+}
+
+static inline struct rangemap_entry *rme_next(struct rangemap_entry *rme)
+{
+ struct rb_node *node = rb_next(&rme->node);
+ if (!node)
+ return NULL;
+ return rb_entry(node, typeof(*rme), node);
+}
+
+static inline struct rangemap_entry *rme_first(struct memlayout *ml)
+{
+ struct rb_node *node = rb_first(&ml->root);
+ if (!node)
+ return NULL;
+ return rb_entry(node, struct rangemap_entry, node);
+}
+
+#define ml_for_each_range(ml, rme) \
+ for (rme = rme_first(ml); \
+ &rme->node; \
+ rme = rme_next(rme))
+
+struct memlayout *memlayout_create(enum memlayout_type);
+void memlayout_destroy(struct memlayout *ml);
+
+int memlayout_new_range(struct memlayout *ml,
+ unsigned long pfn_start, unsigned long pfn_end, int nid);
+int memlayout_pfn_to_nid(unsigned long pfn);
+
+/*
+ * Put ranges added by memlayout_new_range() into use by
+ * memlayout_pfn_get_nid() and retire old memlayout.
+ *
+ * No modifications to a memlayout should be made after it is committed.
+ */
+void memlayout_commit(struct memlayout *ml);
+
+/*
+ * Sets up an initial memlayout in early boot.
+ * A weak default which uses memblock is provided.
+ */
+void memlayout_global_init(void);
+
+#else /* !defined(CONFIG_DYNAMIC_NUMA) */
+
+/* memlayout_new_range() & memlayout_commit() are purposefully omitted */
+
+static inline void memlayout_global_init(void)
+{}
+
+static inline int memlayout_pfn_to_nid(unsigned long pfn)
+{
+ return NUMA_NO_NODE;
+}
+#endif /* !defined(CONFIG_DYNAMIC_NUMA) */
+
+#endif
diff --git a/mm/Kconfig b/mm/Kconfig
index 3bea74f..86f0984 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -169,6 +169,30 @@ config MOVABLE_NODE
config HAVE_BOOTMEM_INFO_NODE
def_bool n

+config DYNAMIC_NUMA
+ bool "Dynamic Numa: Allow NUMA layout to change after boot time"
+ depends on NUMA
+ depends on !DISCONTIGMEM
+ depends on MEMORY_HOTPLUG # locking + mem_online_node().
+ help
+ Dynamic NUMA (DNUMA) allows the movement of pages between NUMA nodes at
+ run time.
+
+ Typically, this is used on systems running under a hypervisor which
+ may move the running VM based on the hypervisor's needs. On such a
+ system, this config option enables Linux to update its knowledge of
+ the memory layout.
+
+ If the feature is enabled but not used, a very small amount of
+ overhead (an additional pageflag check) is added to all page frees.
+
+ This is only useful if you enable some of the additional options that
+ allow modifications of the NUMA memory layout (either through
+ hypervisor events or a userspace interface).
+
+ Choose Y if you are running Linux under a hypervisor that uses
+ this feature; otherwise, or if unsure, choose N.
+
# eventually, we can have this option just 'select SPARSEMEM'
config MEMORY_HOTPLUG
bool "Allow for memory hot-add"
diff --git a/mm/Makefile b/mm/Makefile
index 3a46287..82fe7c9b 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -58,3 +58,4 @@ obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
obj-$(CONFIG_CLEANCACHE) += cleancache.o
obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
+obj-$(CONFIG_DYNAMIC_NUMA) += dnuma.o memlayout.o
diff --git a/mm/dnuma.c b/mm/dnuma.c
new file mode 100644
index 0000000..2ee0903
--- /dev/null
+++ b/mm/dnuma.c
@@ -0,0 +1,439 @@
+#define pr_fmt(fmt) "dnuma: " fmt
+
+#include <linux/atomic.h>
+#include <linux/bootmem.h>
+#include <linux/dnuma.h>
+#include <linux/memory.h>
+#include <linux/mm.h>
+#include <linux/mmzone.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/types.h>
+
+#include "internal.h"
+
+/* - must be called under lock_memory_hotplug() */
+/* TODO: avoid iterating over all PFNs. */
+void dnuma_online_required_nodes_and_zones(struct memlayout *new_ml)
+{
+ struct rangemap_entry *rme;
+ ml_for_each_range(new_ml, rme) {
+ unsigned long pfn;
+ int nid = rme->nid;
+
+ if (!node_online(nid)) {
+ pr_info("onlining node %d [start]\n", nid);
+
+ /* Consult hotadd_new_pgdat() */
+ __mem_online_node(nid);
+
+ /* XXX: we aren't really onlining memory, but some code
+ * uses memory online notifications to tell if new
+ * nodes have been created.
+ *
+ * Also note that the notifiers expect to be able to do
+ * allocations, ie we must allow for might_sleep() */
+ {
+ int ret;
+
+ /* memory_notify() expects:
+ * - to add pages at the same time
+ * - to add zones at the same time
+ * We can do neither of these things.
+ *
+ * XXX: - slab uses .status_change_nid
+ * - slub uses .status_change_nid_normal
+ * FIXME: for slub, we may not be placing any
+ * "normal" memory in it, can we check
+ * for this?
+ */
+ struct memory_notify arg = {
+ .status_change_nid = nid,
+ .status_change_nid_normal = nid,
+ };
+
+ ret = memory_notify(MEM_GOING_ONLINE, &arg);
+ ret = notifier_to_errno(ret);
+ if (WARN_ON(ret)) {
+ /* XXX: other stuff will bug out if we
+ * keep going, need to actually cancel
+ * memlayout changes
+ */
+ memory_notify(MEM_CANCEL_ONLINE, &arg);
+ }
+ }
+
+ pr_info("onlining node %d [complete]\n", nid);
+ }
+
+ /* Determine the zones required */
+ for (pfn = rme->pfn_start; pfn <= rme->pfn_end; pfn++) {
+ struct zone *zone;
+ if (!pfn_valid(pfn))
+ continue;
+
+ zone = nid_zone(nid, page_zonenum(pfn_to_page(pfn)));
+ /* XXX: we (dnuma paths) can handle this (there will
+ * just be quite a few WARNS in the logs), but if we
+ * are indicating error above, should we bail out here
+ * as well? */
+ WARN_ON(ensure_zone_is_initialized(zone, 0, 0));
+ }
+ }
+}
+
+/*
+ * Cannot be folded into dnuma_move_unallocated_pages() because unmarked pages
+ * could be freed back into the zone as dnuma_move_unallocated_pages() was in
+ * the process of iterating over it.
+ */
+void dnuma_mark_page_range(struct memlayout *new_ml)
+{
+ struct rangemap_entry *rme;
+ ml_for_each_range(new_ml, rme) {
+ unsigned long pfn;
+ for (pfn = rme->pfn_start; pfn <= rme->pfn_end; pfn++) {
+ if (!pfn_valid(pfn))
+ continue;
+ /* FIXME: should we be skipping compound / buddied
+ * pages? */
+ /* FIXME: if PageReserved(), can we just poke the nid
+ * directly? Should we? */
+ SetPageLookupNode(pfn_to_page(pfn));
+ }
+ }
+}
+
+#if 0
+static void node_states_set_node(int node, struct memory_notify *arg)
+{
+ if (arg->status_change_nid_normal >= 0)
+ node_set_state(node, N_NORMAL_MEMORY);
+
+ if (arg->status_change_nid_high >= 0)
+ node_set_state(node, N_HIGH_MEMORY);
+
+ node_set_state(node, N_MEMORY);
+}
+#endif
+
+void dnuma_post_free_to_new_zone(struct page *page, int order)
+{
+}
+
+static void dnuma_prior_return_to_new_zone(struct page *page, int order,
+ struct zone *dest_zone,
+ int dest_nid)
+{
+ int i;
+ unsigned long pfn = page_to_pfn(page);
+
+ grow_pgdat_and_zone(dest_zone, pfn, pfn + (1UL << order));
+
+ for (i = 0; i < 1UL << order; i++)
+ set_page_node(&page[i], dest_nid);
+}
+
+static void clear_lookup_node(struct page *page, int order)
+{
+ int i;
+ for (i = 0; i < 1UL << order; i++)
+ ClearPageLookupNode(&page[i]);
+}
+
+/* Does not assume it is called with any locking (but can be called with zone
+ * locks held, if needed) */
+void dnuma_prior_free_to_new_zone(struct page *page, int order,
+ struct zone *dest_zone,
+ int dest_nid)
+{
+ dnuma_prior_return_to_new_zone(page, order, dest_zone, dest_nid);
+}
+
+/* must be called with zone->lock held and memlayout's update_lock held */
+static void remove_free_pages_from_zone(struct zone *zone, struct page *page,
+ int order)
+{
+ /* zone free stats */
+ zone->free_area[order].nr_free--;
+ __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
+
+ list_del(&page->lru);
+ __ClearPageBuddy(page);
+
+ /* Allowed because we hold the memlayout update_lock. */
+ clear_lookup_node(page, order);
+
+ /* XXX: can we shrink spanned_pages & start_pfn without too much work?
+ * - not crucial because having a
+ * larger-than-necessary span simply means that more
+ * PFNs are iterated over.
+ * - would be nice to be able to do this to cut down
+ * on overhead caused by PFN iterators.
+ */
+}
+
+/*
+ * __ref is to allow (__meminit) zone_pcp_update(), which we will have because
+ * DYNAMIC_NUMA depends on MEMORY_HOTPLUG (and all the MEMORY_HOTPLUG comments
+ * indicate __meminit is allowed when they are enabled).
+ */
+static void __ref add_free_page_to_node(int dest_nid, struct page *page,
+ int order)
+{
+ bool need_zonelists_rebuild = false;
+ struct zone *dest_zone = nid_zone(dest_nid, page_zonenum(page));
+ VM_BUG_ON(!zone_is_initialized(dest_zone));
+
+ if (zone_is_empty(dest_zone))
+ need_zonelists_rebuild = true;
+
+ /* Add page to new zone */
+ dnuma_prior_return_to_new_zone(page, order, dest_zone, dest_nid);
+ return_pages_to_zone(page, order, dest_zone);
+ dnuma_post_free_to_new_zone(order);
+
+ /* XXX: fixme, there are other states that need fixing up */
+ if (!node_state(dest_nid, N_MEMORY))
+ node_set_state(dest_nid, N_MEMORY);
+
+ if (need_zonelists_rebuild) {
+ /* XXX: also does stop_machine() */
+ zone_pcp_reset(dest_zone);
+ /* XXX: why is this locking actually needed? */
+ mutex_lock(&zonelists_mutex);
+#if 0
+ /* assumes that zone is unused */
+ setup_zone_pageset(dest_zone);
+ build_all_zonelists(NULL, NULL);
+#else
+ build_all_zonelists(NULL, dest_zone);
+#endif
+ mutex_unlock(&zonelists_mutex);
+ }
+}
+
+static struct rangemap_entry *add_split_pages_to_zones(
+ struct rangemap_entry *first_rme,
+ struct page *page, int order)
+{
+ int i;
+ struct rangemap_entry *rme = first_rme;
+ /*
+ * We avoid doing any hard work to try to split the pages optimally
+ * here because the page allocator splits them into 0-order pages
+ * anyway.
+ *
+ * XXX: All of the checks for NULL rmes and the nid conditional are to
+ * work around memlayouts potentially not covering all valid memory.
+ */
+ for (i = 0; i < (1 << order); i++) {
+ unsigned long pfn = page_to_pfn(page);
+ int nid;
+ while (rme && pfn > rme->pfn_end)
+ rme = rme_next(rme);
+
+ if (rme && pfn >= rme->pfn_start)
+ nid = rme->nid;
+ else
+ nid = page_to_nid(page + i);
+
+ add_free_page_to_node(nid, page + i, 0);
+ }
+
+ return rme;
+}
+
+#define _page_count_idx(managed, nid, zone_num) \
+ (managed + 2 * (zone_num + MAX_NR_ZONES * (nid)))
+#define page_count_idx(nid, zone_num) _page_count_idx(0, nid, zone_num)
+
+/* Because we hold lock_memory_hotplug(), we assume that no one else will be
+ * changing present_pages and managed_pages.
+ */
+static void update_page_counts(struct memlayout *new_ml)
+{
+ /* Perform a combined iteration of pgdat+zones and memlayout.
+ * - memlayouts are ordered, their lookup from pfn is slow, and they
+ * might have holes over valid pfns.
+ * - pgdat+zones are unordered, have O(1) lookups, and don't have holes
+ * over valid pfns.
+ */
+ struct rangemap_entry *rme;
+ unsigned long pfn = 0;
+ unsigned long *counts = kzalloc(2 * nr_node_ids * MAX_NR_ZONES *
+ sizeof(*counts),
+ GFP_KERNEL);
+ if (WARN_ON(!counts))
+ return;
+ rme = rme_first(new_ml);
+ for (pfn = 0; pfn < max_pfn; pfn++) {
+ int nid;
+ struct page *page;
+ size_t idx;
+
+ if (!pfn_valid(pfn))
+ continue;
+
+ page = pfn_to_page(pfn);
+recheck_rme:
+ if (!rme || pfn < rme->pfn_start) {
+ /* We are before the start of the current rme, or we
+ * are past the last rme, fallback on pgdat+zone+page
+ * data. */
+ nid = page_to_nid(page);
+ pr_debug("FALLBACK: pfn %05lx, put in node %d. current rme "RME_FMT"\n",
+ pfn, nid, RME_EXP(rme));
+ } else if (pfn > rme->pfn_end) {
+ rme = rme_next(rme);
+ goto recheck_rme;
+ } else {
+ nid = rme->nid;
+ }
+
+ idx = page_count_idx(nid, page_zonenum(page));
+ /* XXX: what happens if pages become
+ reserved/unreserved during this
+ process? */
+ if (!PageReserved(page))
+ counts[idx]++; /* managed_pages */
+ counts[idx + 1]++; /* present_pages */
+ }
+
+ {
+ int nid;
+ for (nid = 0; nid < nr_node_ids; nid++) {
+ unsigned long nid_present = 0;
+ int zone_num;
+ pg_data_t *node = NODE_DATA(nid);
+ if (!node)
+ continue;
+ for (zone_num = 0; zone_num < node->nr_zones;
+ zone_num++) {
+ struct zone *zone = &node->node_zones[zone_num];
+ size_t idx = page_count_idx(nid, zone_num);
+ pr_debug("nid %d zone %d mp=%lu pp=%lu -> mp=%lu pp=%lu\n",
+ nid, zone_num,
+ zone->managed_pages,
+ zone->present_pages,
+ counts[idx], counts[idx+1]);
+ zone->managed_pages = counts[idx];
+ zone->present_pages = counts[idx + 1];
+ nid_present += zone->present_pages;
+
+ /*
+ * recalculate pcp ->batch & ->high using
+ * zone->managed_pages
+ */
+ zone_pcp_update(zone);
+ }
+
+ pr_debug(" node %d zone * present_pages %lu to %lu\n",
+ node->node_id, node->node_present_pages,
+ nid_present);
+ node->node_present_pages = nid_present;
+ }
+ }
+
+ kfree(counts);
+}
+
+void __ref dnuma_move_free_pages(struct memlayout *new_ml)
+{
+ struct rangemap_entry *rme;
+
+ update_page_counts(new_ml);
+ init_per_zone_wmark_min();
+
+ /* FIXME: how does this removal of pages from a zone interact with
+ * migrate types? ISOLATION? */
+ ml_for_each_range(new_ml, rme) {
+ unsigned long pfn = rme->pfn_start;
+ int range_nid;
+ struct page *page;
+new_rme:
+ range_nid = rme->nid;
+
+ for (; pfn <= rme->pfn_end; pfn++) {
+ struct zone *zone;
+ int page_nid, order;
+ unsigned long flags, last_pfn, first_pfn;
+ if (!pfn_valid(pfn))
+ continue;
+
+ page = pfn_to_page(pfn);
+#if 0
+ /* XXX: can we ensure this is safe? Pages marked
+ * reserved could be freed into the page allocator if
+ * they mark memory areas that were allocated via
+ * earlier allocators. */
+ if (PageReserved(page)) {
+ set_page_node(page, range_nid);
+ /* TODO: adjust spanned_pages & present_pages &
+ * start_pfn. */
+ }
+#endif
+
+ /* Currently allocated, will be fixed up when freed. */
+ if (!PageBuddy(page))
+ continue;
+
+ page_nid = page_to_nid(page);
+ if (page_nid == range_nid)
+ continue;
+
+ zone = page_zone(page);
+ spin_lock_irqsave(&zone->lock, flags);
+
+ /* Someone allocated it since we last checked. It will
+ * be fixed up when it is freed */
+ if (!PageBuddy(page))
+ goto skip_unlock;
+
+ /* It has already been transplanted "somewhere",
+ * somewhere should be the proper zone. */
+ if (page_zone(page) != zone) {
+ VM_BUG_ON(zone != nid_zone(range_nid,
+ page_zonenum(page)));
+ goto skip_unlock;
+ }
+
+ order = page_order(page);
+ first_pfn = pfn & ~((1 << order) - 1);
+ last_pfn = pfn | ((1 << order) - 1);
+ if (WARN(pfn != first_pfn,
+ "pfn %05lx is not first_pfn %05lx\n",
+ pfn, first_pfn)) {
+ pfn = last_pfn;
+ goto skip_unlock;
+ }
+
+ if (last_pfn > rme->pfn_end) {
+ /*
+ * this higher order page doesn't fit into the
+ * current range even though it starts there.
+ */
+ pr_warn("order-%02d page (pfn %05lx-%05lx) extends beyond end of rme "RME_FMT"\n",
+ order, first_pfn, last_pfn,
+ RME_EXP(rme));
+
+ remove_free_pages_from_zone(zone, page, order);
+ spin_unlock_irqrestore(&zone->lock, flags);
+
+ rme = add_split_pages_to_zones(rme, page,
+ order);
+ pfn = last_pfn + 1;
+ goto new_rme;
+ }
+
+ remove_free_pages_from_zone(zone, page, order);
+ spin_unlock_irqrestore(&zone->lock, flags);
+
+ add_free_page_to_node(range_nid, page, order);
+ pfn = last_pfn;
+ continue;
+skip_unlock:
+ spin_unlock_irqrestore(&zone->lock, flags);
+ }
+ }
+}
diff --git a/mm/memlayout.c b/mm/memlayout.c
new file mode 100644
index 0000000..7d2905b
--- /dev/null
+++ b/mm/memlayout.c
@@ -0,0 +1,237 @@
+/*
+ * memlayout - provides a mapping of PFN ranges to nodes with the requirements
+ * that looking up a node from a PFN is fast, and changes to the mapping will
+ * occur relatively infrequently.
+ *
+ */
+#define pr_fmt(fmt) "memlayout: " fmt
+
+#include <linux/dnuma.h>
+#include <linux/export.h>
+#include <linux/memblock.h>
+#include <linux/printk.h>
+#include <linux/rbtree.h>
+#include <linux/rcupdate.h>
+#include <linux/slab.h>
+
+/* protected by memlayout_lock */
+__rcu struct memlayout *pfn_to_node_map;
+DEFINE_MUTEX(memlayout_lock);
+
+static void free_rme_tree(struct rb_root *root)
+{
+ struct rangemap_entry *pos, *n;
+ rbtree_postorder_for_each_entry_safe(pos, n, root, node) {
+ kfree(pos);
+ }
+}
+
+static void ml_destroy_mem(struct memlayout *ml)
+{
+ if (!ml)
+ return;
+ free_rme_tree(&ml->root);
+ kfree(ml);
+}
+
+static int find_insertion_point(struct memlayout *ml, unsigned long pfn_start,
+ unsigned long pfn_end, int nid, struct rb_node ***o_new,
+ struct rb_node **o_parent)
+{
+ struct rb_node **new = &ml->root.rb_node, *parent = NULL;
+ struct rangemap_entry *rme;
+ pr_debug("adding range: {%lX-%lX}:%d\n", pfn_start, pfn_end, nid);
+ while (*new) {
+ rme = rb_entry(*new, typeof(*rme), node);
+
+ parent = *new;
+ if (pfn_end < rme->pfn_start && pfn_start < rme->pfn_end)
+ new = &((*new)->rb_left);
+ else if (pfn_start > rme->pfn_end && pfn_end > rme->pfn_end)
+ new = &((*new)->rb_right);
+ else {
+ /* an embedded region, need to use an interval or
+ * sequence tree. */
+ pr_warn("tried to embed {%lX,%lX}:%d inside {%lX-%lX}:%d\n",
+ pfn_start, pfn_end, nid,
+ rme->pfn_start, rme->pfn_end, rme->nid);
+ return 1;
+ }
+ }
+
+ *o_new = new;
+ *o_parent = parent;
+ return 0;
+}
+
+int memlayout_new_range(struct memlayout *ml, unsigned long pfn_start,
+ unsigned long pfn_end, int nid)
+{
+ struct rb_node **new, *parent;
+ struct rangemap_entry *rme;
+
+ if (WARN_ON(nid < 0))
+ return -EINVAL;
+ if (WARN_ON(nid >= nr_node_ids))
+ return -EINVAL;
+
+ if (find_insertion_point(ml, pfn_start, pfn_end, nid, &new, &parent))
+ return 1;
+
+ rme = kmalloc(sizeof(*rme), GFP_KERNEL);
+ if (!rme)
+ return -ENOMEM;
+
+ rme->pfn_start = pfn_start;
+ rme->pfn_end = pfn_end;
+ rme->nid = nid;
+
+ rb_link_node(&rme->node, parent, new);
+ rb_insert_color(&rme->node, &ml->root);
+ return 0;
+}
+
+int memlayout_pfn_to_nid(unsigned long pfn)
+{
+ struct rb_node *node;
+ struct memlayout *ml;
+ struct rangemap_entry *rme;
+ rcu_read_lock();
+ ml = rcu_dereference(pfn_to_node_map);
+ if (!ml || (ml->type == ML_INITIAL))
+ goto out;
+
+ rme = ACCESS_ONCE(ml->cache);
+ if (rme && rme_bounds_pfn(rme, pfn)) {
+ rcu_read_unlock();
+ return rme->nid;
+ }
+
+ node = ml->root.rb_node;
+ while (node) {
+ struct rangemap_entry *rme = rb_entry(node, typeof(*rme), node);
+ bool greater_than_start = rme->pfn_start <= pfn;
+ bool less_than_end = pfn <= rme->pfn_end;
+
+ if (greater_than_start && !less_than_end)
+ node = node->rb_right;
+ else if (less_than_end && !greater_than_start)
+ node = node->rb_left;
+ else {
+ /* greater_than_start && less_than_end.
+ * the case (!greater_than_start && !less_than_end)
+ * is impossible */
+ int nid = rme->nid;
+ ACCESS_ONCE(ml->cache) = rme;
+ rcu_read_unlock();
+ return nid;
+ }
+ }
+
+out:
+ rcu_read_unlock();
+ return NUMA_NO_NODE;
+}
+
+void memlayout_destroy(struct memlayout *ml)
+{
+ ml_destroy_mem(ml);
+}
+
+struct memlayout *memlayout_create(enum memlayout_type type)
+{
+ struct memlayout *ml;
+
+ if (WARN_ON(type < 0 || type >= ML_NUM_TYPES))
+ return NULL;
+
+ ml = kmalloc(sizeof(*ml), GFP_KERNEL);
+ if (!ml)
+ return NULL;
+
+ ml->root = RB_ROOT;
+ ml->type = type;
+ ml->cache = NULL;
+
+ return ml;
+}
+
+void memlayout_commit(struct memlayout *ml)
+{
+ struct memlayout *old_ml;
+
+ if (ml->type == ML_INITIAL) {
+ if (WARN(dnuma_has_memlayout(),
+ "memlayout marked first is not first, ignoring.\n")) {
+ memlayout_destroy(ml);
+ ml_backlog_feed(ml);
+ return;
+ }
+
+ mutex_lock(&memlayout_lock);
+ rcu_assign_pointer(pfn_to_node_map, ml);
+ mutex_unlock(&memlayout_lock);
+ return;
+ }
+
+ lock_memory_hotplug();
+ dnuma_online_required_nodes_and_zones(ml);
+ unlock_memory_hotplug();
+
+ mutex_lock(&memlayout_lock);
+ old_ml = rcu_dereference_protected(pfn_to_node_map,
+ mutex_is_locked(&memlayout_lock));
+
+ rcu_assign_pointer(pfn_to_node_map, ml);
+
+ synchronize_rcu();
+ memlayout_destroy(old_ml);
+
+ /* Must be called only after the new value for pfn_to_node_map has
+ * propagated to all tasks, otherwise some pages may look up the old
+ * pfn_to_node_map on free & not transplant themselves to their new-new
+ * node. */
+ dnuma_mark_page_range(ml);
+
+ /* Do this after the free path is set up so that pages are free'd into
+ * their "new" zones so that after this completes, no free pages in the
+ * wrong zone remain. */
+ dnuma_move_free_pages(ml);
+
+ /* All new _non pcp_ page allocations now match the memlayout*/
+ drain_all_pages();
+ /* All new page allocations now match the memlayout */
+
+ mutex_unlock(&memlayout_lock);
+}
+
+/*
+ * The default memlayout global initializer, using memblock to determine
+ * affinities
+ *
+ * requires: slab_is_available() && memblock is not (yet) freed.
+ * sleeps: definitely: memlayout_commit() -> synchronize_rcu()
+ * potentially: kmalloc()
+ */
+__weak __meminit
+void memlayout_global_init(void)
+{
+ int i, nid, errs = 0;
+ unsigned long start, end;
+ struct memlayout *ml = memlayout_create(ML_INITIAL);
+ if (WARN_ON(!ml))
+ return;
+
+ for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid) {
+ int r = memlayout_new_range(ml, start, end - 1, nid);
+ if (r) {
+ pr_err("failed to add range [%05lx, %05lx] in node %d to mapping\n",
+ start, end, nid);
+ errs++;
+ } else
+ pr_devel("added range [%05lx, %05lx] in node %d\n",
+ start, end, nid);
+ }
+
+ memlayout_commit(ml);
+}
--
1.8.2.1

2013-04-12 01:14:48

by Cody P Schafer

Subject: [RFC PATCH v2 18/25] init/main: call memlayout_global_init() in start_kernel().

memlayout_global_init() initializes the first memlayout, which is
assumed to match the initial page-flag nid settings.

This is done in start_kernel() as the initdata used to populate the
memlayout is purged from memory early in the boot process (XXX: When?).

Signed-off-by: Cody P Schafer <[email protected]>
---
init/main.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/init/main.c b/init/main.c
index 63534a1..a1c2094 100644
--- a/init/main.c
+++ b/init/main.c
@@ -72,6 +72,7 @@
#include <linux/ptrace.h>
#include <linux/blkdev.h>
#include <linux/elevator.h>
+#include <linux/memlayout.h>

#include <asm/io.h>
#include <asm/bugs.h>
@@ -618,6 +619,7 @@ asmlinkage void __init start_kernel(void)
security_init();
dbg_late_init();
vfs_caches_init(totalram_pages);
+ memlayout_global_init();
signals_init();
/* rootfs populating might need page-writeback */
page_writeback_init();
--
1.8.2.1

2013-04-12 01:14:55

by Cody P Schafer

Subject: [RFC PATCH v2 25/25] mm: add an early_param "extra_nr_node_ids" to increase nr_node_ids above the minimum by a percentage.

For dynamic NUMA, sometimes the hypervisor we're running under will want
to split a single NUMA node into multiple NUMA nodes. If the number of
NUMA nodes is limited to the number available when the system booted (as
it is on x86), we may not be able to fully adopt the new memory layout
provided by the hypervisor.

This option allows reserving some extra node ids as a percentage of the
boot-time node ids. While not perfect (ideally nr_node_ids would be fully
dynamic), this allows decent functionality without invasive changes to
the SL{U,A}B allocators.
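
As a worked example (the node count here is hypothetical): on a system that
boots with 4 NUMA nodes, extra_nr_node_ids=50 reserves
DIV_ROUND_UP(4 * 50, 100) = 2 extra ids, so nr_node_ids becomes 6 (capped at
MAX_NUMNODES) and node ids 4 and 5 are added to node_possible_map so they can
be onlined later.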

Signed-off-by: Cody P Schafer <[email protected]>
---
Documentation/kernel-parameters.txt | 6 ++++++
mm/page_alloc.c | 24 ++++++++++++++++++++++++
2 files changed, 30 insertions(+)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 4609e81..b0523d8 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2033,6 +2033,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
use hotplug cpu feature to put more cpu back to online.
just like you compile the kernel NR_CPUS=n

+ extra_nr_node_ids= [NUMA] Increase the maximum number of NUMA nodes
+ above the number detected at boot by the specified
+ percentage (rounded up). For example:
+ extra_nr_node_ids=100 would double the number of
+ node_ids available (up to a max of MAX_NUMNODES).
+
nr_uarts= [SERIAL] maximum number of UARTs to be registered.

numa_balancing= [KNL,X86] Enable or disable automatic NUMA balancing.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 686d8f8..d333d91 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4798,6 +4798,17 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP

#if MAX_NUMNODES > 1
+
+static unsigned nr_node_ids_mod_percent;
+static int __init setup_extra_nr_node_ids(char *arg)
+{
+ int r = kstrtouint(arg, 10, &nr_node_ids_mod_percent);
+ if (r)
+ pr_err("invalid param value extra_nr_node_ids=\"%s\"\n", arg);
+ return 0;
+}
+early_param("extra_nr_node_ids", setup_extra_nr_node_ids);
+
/*
* Figure out the number of possible node ids.
*/
@@ -4809,6 +4820,19 @@ void __init setup_nr_node_ids(void)
for_each_node_mask(node, node_possible_map)
highest = node;
nr_node_ids = highest + 1;
+
+ /*
+ * expand nr_node_ids and node_possible_map so more can be onlined
+ * later
+ */
+ nr_node_ids +=
+ DIV_ROUND_UP(nr_node_ids * nr_node_ids_mod_percent, 100);
+
+ if (nr_node_ids > MAX_NUMNODES)
+ nr_node_ids = MAX_NUMNODES;
+
+ for (node = highest + 1; node < nr_node_ids; node++)
+ node_set(node, node_possible_map);
}
#endif

--
1.8.2.1

2013-04-12 01:14:51

by Cody P Schafer

Subject: [RFC PATCH v2 22/25] mm/page_alloc: in page_outside_zone_boundaries(), avoid premature decisions.

With some code that expands the zone boundaries, VM_BUG_ON(bad_range()) was being triggered.

Previously, page_outside_zone_boundaries() decided that once it detected
a page outside the boundaries, it was certainly outside, even if the
seqlock indicated the data was invalid & needed to be reread. This
methodology _almost_ works because zones are only ever grown. However,
because the zone span is stored as a start and a length, some expansions
momentarily appear as shifts to the left (when zone_start_pfn is
assigned prior to spanned_pages).

If we want to remove the seqlock around zone_start_pfn & spanned_pages,
always writing spanned_pages first, issuing a memory barrier, and then
writing the new zone_start_pfn _may_ work. The concern there is that we
could be seen as shrinking the span when zone_start_pfn is written (the
entire span would shift to the left). As there will be no pages in the
excess span that actually belong to the zone being manipulated, I don't
expect there to be issues.

Signed-off-by: Cody P Schafer <[email protected]>
---
mm/page_alloc.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 97bdf6b..a54baa9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -238,12 +238,13 @@ bool oom_killer_disabled __read_mostly;
#ifdef CONFIG_DEBUG_VM
static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
{
- int ret = 0;
+ int ret;
unsigned seq;
unsigned long pfn = page_to_pfn(page);
unsigned long sp, start_pfn;

do {
+ ret = 0;
seq = zone_span_seqbegin(zone);
start_pfn = zone->zone_start_pfn;
sp = zone->spanned_pages;
--
1.8.2.1

2013-04-12 01:15:15

by Cody P Schafer

Subject: [RFC PATCH v2 24/25] mm/page_alloc: use managed_pages instead of present_pages when calculating default_zonelist_order()

Signed-off-by: Cody P Schafer <[email protected]>
---
mm/page_alloc.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 20304cb..686d8f8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3488,8 +3488,8 @@ static int default_zonelist_order(void)
z = &NODE_DATA(nid)->node_zones[zone_type];
if (populated_zone(z)) {
if (zone_type < ZONE_NORMAL)
- low_kmem_size += z->present_pages;
- total_size += z->present_pages;
+ low_kmem_size += z->managed_pages;
+ total_size += z->managed_pages;
} else if (zone_type == ZONE_NORMAL) {
/*
* If any node has only lowmem, then node order
@@ -3519,8 +3519,8 @@ static int default_zonelist_order(void)
z = &NODE_DATA(nid)->node_zones[zone_type];
if (populated_zone(z)) {
if (zone_type < ZONE_NORMAL)
- low_kmem_size += z->present_pages;
- total_size += z->present_pages;
+ low_kmem_size += z->managed_pages;
+ total_size += z->managed_pages;
}
}
if (low_kmem_size &&
--
1.8.2.1

2013-04-12 01:15:42

by Cody P Schafer

Subject: [RFC PATCH v2 23/25] mm/page_alloc: make pr_err() in page_outside_zone_boundaries() more useful

Signed-off-by: Cody P Schafer <[email protected]>
---
mm/page_alloc.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a54baa9..20304cb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -253,8 +253,11 @@ static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
} while (zone_span_seqretry(zone, seq));

if (ret)
- pr_err("page %lu outside zone [ %lu - %lu ]\n",
- pfn, start_pfn, start_pfn + sp);
+ pr_err("page with pfn %05lx outside zone %s with pfn range {%05lx-%05lx} in node %d with pfn range {%05lx-%05lx}\n",
+ pfn, zone->name, start_pfn, start_pfn + sp,
+ zone->zone_pgdat->node_id,
+ zone->zone_pgdat->node_start_pfn,
+ pgdat_end_pfn(zone->zone_pgdat));

return ret;
}
--
1.8.2.1

2013-04-12 01:16:03

by Cody P Schafer

Subject: [RFC PATCH v2 20/25] x86/mm/numa: when dnuma is enabled, use memlayout to handle memory hotplug's physaddr_to_nid.

When a memlayout is tracked (i.e. CONFIG_DYNAMIC_NUMA is enabled), rather
than iterating over numa_meminfo, a lookup can be done using memlayout.

Signed-off-by: Cody P Schafer <[email protected]>
---
arch/x86/mm/numa.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 75819ef..f1609c0 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -28,7 +28,7 @@ struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
EXPORT_SYMBOL(node_data);

static struct numa_meminfo numa_meminfo
-#ifndef CONFIG_MEMORY_HOTPLUG
+#if !defined(CONFIG_MEMORY_HOTPLUG) || defined(CONFIG_DYNAMIC_NUMA)
__initdata
#endif
;
@@ -832,7 +832,7 @@ EXPORT_SYMBOL(cpumask_of_node);

#endif /* !CONFIG_DEBUG_PER_CPU_MAPS */

-#ifdef CONFIG_MEMORY_HOTPLUG
+#if defined(CONFIG_MEMORY_HOTPLUG) && !defined(CONFIG_DYNAMIC_NUMA)
int memory_add_physaddr_to_nid(u64 start)
{
struct numa_meminfo *mi = &numa_meminfo;
--
1.8.2.1

2013-04-12 01:16:02

by Cody P Schafer

Subject: [RFC PATCH v2 21/25] mm/memory_hotplug: VM_BUG if nid is too large.

Signed-off-by: Cody P Schafer <[email protected]>
---
mm/memory_hotplug.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index f5ea9b7..5fcd29e 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1063,6 +1063,8 @@ int __mem_online_node(int nid)
pg_data_t *pgdat;
int ret;

+ VM_BUG_ON(nid >= nr_node_ids);
+
pgdat = hotadd_new_pgdat(nid, 0);
if (!pgdat)
return -ENOMEM;
--
1.8.2.1

2013-04-12 01:16:36

by Cody P Schafer

Subject: [RFC PATCH v2 19/25] dnuma: memlayout: add memory_add_physaddr_to_nid() for memory_hotplug

Signed-off-by: Cody P Schafer <[email protected]>
---
mm/memlayout.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)

diff --git a/mm/memlayout.c b/mm/memlayout.c
index 45e7df6..4dc6706 100644
--- a/mm/memlayout.c
+++ b/mm/memlayout.c
@@ -247,3 +247,19 @@ void memlayout_global_init(void)

memlayout_commit(ml);
}
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+/*
+ * Provides a default memory_add_physaddr_to_nid() for memory hotplug, unless
+ * overridden by the arch.
+ */
+__weak
+int memory_add_physaddr_to_nid(u64 start)
+{
+ int nid = memlayout_pfn_to_nid(PFN_DOWN(start));
+ if (nid == NUMA_NO_NODE)
+ return 0;
+ return nid;
+}
+EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
+#endif
--
1.8.2.1

2013-04-12 01:14:40

by Cody P Schafer

Subject: [RFC PATCH v2 14/25] page_alloc: use dnuma to transplant newly freed pages in __free_pages_ok()

__free_pages_ok() handles higher-order (order != 0) pages. The transplant
hook is added here, as this is where the struct zone to free to is
decided.

Signed-off-by: Cody P Schafer <[email protected]>
---
mm/page_alloc.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4628443..f8ae178 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -59,6 +59,7 @@
#include <linux/migrate.h>
#include <linux/page-debug-flags.h>
#include <linux/sched/rt.h>
+#include <linux/dnuma.h>

#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -732,6 +733,13 @@ static void __free_pages_ok(struct page *page, unsigned int order)
{
unsigned long flags;
int migratetype;
+ int dest_nid = dnuma_page_needs_move(page);
+ struct zone *zone;
+
+ if (dest_nid != NUMA_NO_NODE)
+ zone = nid_zone(dest_nid, page_zonenum(page));
+ else
+ zone = page_zone(page);

if (!free_pages_prepare(page, order))
return;
@@ -740,7 +748,11 @@ static void __free_pages_ok(struct page *page, unsigned int order)
__count_vm_events(PGFREE, 1 << order);
migratetype = get_pageblock_migratetype(page);
set_freepage_migratetype(page, migratetype);
- free_one_page(page_zone(page), page, order, migratetype);
+ if (dest_nid != NUMA_NO_NODE)
+ dnuma_prior_free_to_new_zone(page, order, zone, dest_nid);
+ free_one_page(zone, page, order, migratetype);
+ if (dest_nid != NUMA_NO_NODE)
+ dnuma_post_free_to_new_zone(order);
local_irq_restore(flags);
}

--
1.8.2.1

2013-04-12 01:16:57

by Cody P Schafer

Subject: [RFC PATCH v2 17/25] x86: memlayout: add an arch-specific initial memlayout setter.

On x86, we have numa_meminfo specifically to track the NUMA layout, which
is precisely the data memlayout needs, so use it to create an initial
memlayout.

Signed-off-by: Cody P Schafer <[email protected]>
---
arch/x86/mm/numa.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index a71c4e2..75819ef 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -11,6 +11,7 @@
#include <linux/nodemask.h>
#include <linux/sched.h>
#include <linux/topology.h>
+#include <linux/dnuma.h>

#include <asm/e820.h>
#include <asm/proto.h>
@@ -32,6 +33,33 @@ __initdata
#endif
;

+#ifdef CONFIG_DYNAMIC_NUMA
+void __init memlayout_global_init(void)
+{
+ struct numa_meminfo *mi = &numa_meminfo;
+ int i;
+ struct numa_memblk *blk;
+ struct memlayout *ml = memlayout_create(ML_INITIAL);
+ if (WARN_ON(!ml))
+ return;
+
+ pr_devel("x86/memlayout: adding ranges from numa_meminfo\n");
+ for (i = 0; i < mi->nr_blks; i++) {
+ blk = mi->blk + i;
+ pr_devel(" adding range {%LX[%LX]-%LX[%LX]}:%d\n",
+ PFN_DOWN(blk->start), blk->start,
+ PFN_DOWN(blk->end - PAGE_SIZE / 2 - 1),
+ blk->end - 1, blk->nid);
+ memlayout_new_range(ml, PFN_DOWN(blk->start),
+ PFN_DOWN(blk->end - PAGE_SIZE / 2 - 1),
+ blk->nid);
+ }
+ pr_devel(" done adding ranges from numa_meminfo\n");
+
+ memlayout_commit(ml);
+}
+#endif
+
static int numa_distance_cnt;
static u8 *numa_distance;

--
1.8.2.1

2013-04-12 01:16:56

by Cody P Schafer

Subject: [RFC PATCH v2 16/25] page_alloc: transplant pages that are being flushed from the per-cpu lists

In free_pcppages_bulk(), check if a page needs to be moved to a new
node/zone & then perform the transplant (in a slightly deferred manner).

Signed-off-by: Cody P Schafer <[email protected]>
---
mm/page_alloc.c | 36 +++++++++++++++++++++++++++++++++++-
1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 98ac7c6..97bdf6b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -643,13 +643,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
int migratetype = 0;
int batch_free = 0;
int to_free = count;
+ struct page *pos, *page;
+ LIST_HEAD(need_move);

spin_lock(&zone->lock);
zone->all_unreclaimable = 0;
zone->pages_scanned = 0;

while (to_free) {
- struct page *page;
struct list_head *list;

/*
@@ -672,11 +673,23 @@ static void free_pcppages_bulk(struct zone *zone, int count,

do {
int mt; /* migratetype of the to-be-freed page */
+ int dest_nid;

page = list_entry(list->prev, struct page, lru);
/* must delete as __free_one_page list manipulates */
list_del(&page->lru);
mt = get_freepage_migratetype(page);
+
+ dest_nid = dnuma_page_needs_move(page);
+ if (dest_nid != NUMA_NO_NODE) {
+ dnuma_prior_free_to_new_zone(page, 0,
+ nid_zone(dest_nid,
+ page_zonenum(page)),
+ dest_nid);
+ list_add(&page->lru, &need_move);
+ continue;
+ }
+
/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
__free_one_page(page, zone, 0, mt);
trace_mm_page_pcpu_drain(page, 0, mt);
@@ -688,6 +701,27 @@ static void free_pcppages_bulk(struct zone *zone, int count,
} while (--to_free && --batch_free && !list_empty(list));
}
spin_unlock(&zone->lock);
+
+ list_for_each_entry_safe(page, pos, &need_move, lru) {
+ struct zone *dest_zone = page_zone(page);
+ int mt;
+
+ spin_lock(&dest_zone->lock);
+
+ VM_BUG_ON(dest_zone != page_zone(page));
+ pr_devel("freeing pcp page %pK with changed node\n", page);
+ list_del(&page->lru);
+ mt = get_freepage_migratetype(page);
+ __free_one_page(page, dest_zone, 0, mt);
+ trace_mm_page_pcpu_drain(page, 0, mt);
+
+ /* XXX: fold into "post_free_to_new_zone()" ? */
+ if (is_migrate_cma(mt))
+ __mod_zone_page_state(dest_zone, NR_FREE_CMA_PAGES, 1);
+ dnuma_post_free_to_new_zone(0);
+
+ spin_unlock(&dest_zone->lock);
+ }
}

static void free_one_page(struct zone *zone, struct page *page, int order,
--
1.8.2.1

2013-04-12 01:17:33

by Cody P Schafer

Subject: [RFC PATCH v2 15/25] page_alloc: use dnuma to transplant newly freed pages in free_hot_cold_page()

free_hot_cold_page() is used for order == 0 pages, and is where the
page's zone is decided.

In the normal case, these pages are freed to the per-cpu lists. When a
page needs transplanting (i.e. the actual node it belongs to has changed,
and it needs to be moved to another zone), the pcp lists are skipped &
the page is freed via free_one_page().

Signed-off-by: Cody P Schafer <[email protected]>
---
mm/page_alloc.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f8ae178..98ac7c6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1357,6 +1357,7 @@ void mark_free_pages(struct zone *zone)
*/
void free_hot_cold_page(struct page *page, int cold)
{
+ int dest_nid;
struct zone *zone = page_zone(page);
struct per_cpu_pages *pcp;
unsigned long flags;
@@ -1370,6 +1371,15 @@ void free_hot_cold_page(struct page *page, int cold)
local_irq_save(flags);
__count_vm_event(PGFREE);

+ dest_nid = dnuma_page_needs_move(page);
+ if (dest_nid != NUMA_NO_NODE) {
+ struct zone *dest_zone = nid_zone(dest_nid, page_zonenum(page));
+ dnuma_prior_free_to_new_zone(page, 0, dest_zone, dest_nid);
+ free_one_page(dest_zone, page, 0, migratetype);
+ dnuma_post_free_to_new_zone(0);
+ goto out;
+ }
+
/*
* We only track unmovable, reclaimable and movable on pcp lists.
* Free ISOLATE pages back to the allocator because they are being
--
1.8.2.1

2013-04-12 01:14:39

by Cody P Schafer

Subject: [RFC PATCH v2 13/25] mm: memlayout+dnuma: add debugfs interface

Add a debugfs interface to dnuma/memlayout. It keeps track of a
variable backlog of memory layouts, provides some statistics on
dnuma-moved pages & cache performance, and allows the setting of a new
global memlayout.

TODO: split out the statistics, backlog, & write interfaces from each other.

Signed-off-by: Cody P Schafer <[email protected]>
---
include/linux/dnuma.h | 2 +-
include/linux/memlayout.h | 7 +
mm/Kconfig | 30 ++++
mm/Makefile | 1 +
mm/dnuma.c | 4 +-
mm/memlayout-debugfs.c | 339 ++++++++++++++++++++++++++++++++++++++++++++++
mm/memlayout-debugfs.h | 39 ++++++
mm/memlayout.c | 20 ++-
8 files changed, 436 insertions(+), 6 deletions(-)
create mode 100644 mm/memlayout-debugfs.c
create mode 100644 mm/memlayout-debugfs.h

diff --git a/include/linux/dnuma.h b/include/linux/dnuma.h
index 029a984..7a33131 100644
--- a/include/linux/dnuma.h
+++ b/include/linux/dnuma.h
@@ -64,7 +64,7 @@ static inline int dnuma_page_needs_move(struct page *page)
return new_nid;
}

-void dnuma_post_free_to_new_zone(struct page *page, int order);
+void dnuma_post_free_to_new_zone(int order);
void dnuma_prior_free_to_new_zone(struct page *page, int order,
struct zone *dest_zone,
int dest_nid);
diff --git a/include/linux/memlayout.h b/include/linux/memlayout.h
index 6c26c52..14dbf35 100644
--- a/include/linux/memlayout.h
+++ b/include/linux/memlayout.h
@@ -56,6 +56,7 @@ struct memlayout {
};

extern __rcu struct memlayout *pfn_to_node_map;
+extern struct mutex memlayout_lock; /* update-side lock */

/* FIXME: overflow potential in completion check */
#define ml_for_each_pfn_in_range(rme, pfn) \
@@ -90,7 +91,13 @@ static inline struct rangemap_entry *rme_first(struct memlayout *ml)
rme = rme_next(rme))

struct memlayout *memlayout_create(enum memlayout_type);
+
+/*
+ * In most cases, these should only be used by the memlayout debugfs code (or
+ * internally within memlayout)
+ */
void memlayout_destroy(struct memlayout *ml);
+void memlayout_destroy_mem(struct memlayout *ml);

int memlayout_new_range(struct memlayout *ml,
unsigned long pfn_start, unsigned long pfn_end, int nid);
diff --git a/mm/Kconfig b/mm/Kconfig
index 86f0984..3820b3c 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -193,6 +193,36 @@ config DYNAMIC_NUMA
Choose Y if you are running Linux under a hypervisor that uses
this feature, otherwise choose N if unsure.

+config DNUMA_DEBUGFS
+ bool "Export DNUMA & memlayout internals via debugfs"
+ depends on DYNAMIC_NUMA
+ help
+ Export some dynamic numa info via debugfs in <debugfs>/memlayout.
+
+ Enables the tracking & export of statistics and the export of the
+ current memory layout.
+
+ If you are not debugging Dynamic NUMA or memlayout, choose N.
+
+config DNUMA_BACKLOG
+ int "Number of old memlayouts to keep (0 = None, -1 = unlimited)"
+ depends on DNUMA_DEBUGFS
+ help
+ Allows access to old memory layouts & statistics in debugfs.
+
+ Each memlayout will consume some memory, and when set to -1
+ (unlimited), this can result in unbounded kernel memory use.
+
+config DNUMA_DEBUGFS_WRITE
+ bool "Change NUMA layout via debugfs"
+ depends on DNUMA_DEBUGFS
+ help
+ Enable the use of <debugfs>/memlayout/{start,end,node,commit}
+
+ Write a PFN to 'start' & 'end', then a node id to 'node'.
+ Repeat this until you are satisfied with your memory layout, then
+ write '1' to 'commit'.
+
# eventually, we can have this option just 'select SPARSEMEM'
config MEMORY_HOTPLUG
bool "Allow for memory hot-add"
diff --git a/mm/Makefile b/mm/Makefile
index 82fe7c9b..b07926c 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -59,3 +59,4 @@ obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
obj-$(CONFIG_CLEANCACHE) += cleancache.o
obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
obj-$(CONFIG_DYNAMIC_NUMA) += dnuma.o memlayout.o
+obj-$(CONFIG_DNUMA_DEBUGFS) += memlayout-debugfs.o
diff --git a/mm/dnuma.c b/mm/dnuma.c
index 2ee0903..eb00b7b 100644
--- a/mm/dnuma.c
+++ b/mm/dnuma.c
@@ -11,6 +11,7 @@
#include <linux/types.h>

#include "internal.h"
+#include "memlayout-debugfs.h"

/* - must be called under lock_memory_hotplug() */
/* TODO: avoid iterating over all PFNs. */
@@ -117,8 +118,9 @@ static void node_states_set_node(int node, struct memory_notify *arg)
}
#endif

-void dnuma_post_free_to_new_zone(struct page *page, int order)
+void dnuma_post_free_to_new_zone(int order)
{
+ ml_stat_count_moved_pages(order);
}

static void dnuma_prior_return_to_new_zone(struct page *page, int order,
diff --git a/mm/memlayout-debugfs.c b/mm/memlayout-debugfs.c
new file mode 100644
index 0000000..a4fc2cb
--- /dev/null
+++ b/mm/memlayout-debugfs.c
@@ -0,0 +1,339 @@
+#include <linux/debugfs.h>
+
+#include <linux/slab.h> /* kmalloc */
+#include <linux/module.h> /* THIS_MODULE, needed for DEFINE_SIMPLE_ATTR */
+
+#include "memlayout-debugfs.h"
+
+#if CONFIG_DNUMA_BACKLOG > 0
+/* Fixed size backlog */
+#include <linux/kfifo.h>
+#include <linux/log2.h> /* roundup_pow_of_two */
+DEFINE_KFIFO(ml_backlog, struct memlayout *,
+ roundup_pow_of_two(CONFIG_DNUMA_BACKLOG));
+void ml_backlog_feed(struct memlayout *ml)
+{
+ if (kfifo_is_full(&ml_backlog)) {
+ struct memlayout *old_ml;
+ BUG_ON(!kfifo_get(&ml_backlog, &old_ml));
+ memlayout_destroy(old_ml);
+ }
+
+ kfifo_put(&ml_backlog, (const struct memlayout **)&ml);
+}
+#elif CONFIG_DNUMA_BACKLOG < 0
+/* Unlimited backlog */
+void ml_backlog_feed(struct memlayout *ml)
+{
+ /* we never use the rme_tree, so we destroy the non-debugfs portions to
+ * save memory */
+ memlayout_destroy_mem(ml);
+}
+#else /* CONFIG_DNUMA_BACKLOG == 0 */
+/* No backlog */
+void ml_backlog_feed(struct memlayout *ml)
+{
+ memlayout_destroy(ml);
+}
+#endif
+
+static atomic64_t dnuma_moved_page_ct;
+void ml_stat_count_moved_pages(int order)
+{
+ atomic64_add(1 << order, &dnuma_moved_page_ct);
+}
+
+static atomic_t ml_seq = ATOMIC_INIT(0);
+static struct dentry *root_dentry, *current_dentry;
+#define ML_LAYOUT_NAME_SZ \
+ ((size_t)(DIV_ROUND_UP(sizeof(unsigned) * 8, 3) \
+ + 1 + strlen("layout.")))
+#define ML_REGION_NAME_SZ ((size_t)(2 * BITS_PER_LONG / 4 + 2))
+
+static void ml_layout_name(struct memlayout *ml, char *name)
+{
+ sprintf(name, "layout.%u", ml->seq);
+}
+
+static int dfs_range_get(void *data, u64 *val)
+{
+ *val = (uintptr_t)data;
+ return 0;
+}
+DEFINE_SIMPLE_ATTRIBUTE(range_fops, dfs_range_get, NULL, "%lld\n");
+
+static void _ml_dbgfs_create_range(struct dentry *base,
+ struct rangemap_entry *rme, char *name)
+{
+ struct dentry *rd;
+ sprintf(name, "%05lx-%05lx", rme->pfn_start, rme->pfn_end);
+ rd = debugfs_create_file(name, 0400, base,
+ (void *)(uintptr_t)rme->nid, &range_fops);
+ if (!rd)
+ pr_devel("debugfs: failed to create "RME_FMT"\n",
+ RME_EXP(rme));
+ else
+ pr_devel("debugfs: created "RME_FMT"\n", RME_EXP(rme));
+}
+
+/* Must be called with memlayout_lock held */
+static void _ml_dbgfs_set_current(struct memlayout *ml, char *name)
+{
+ ml_layout_name(ml, name);
+ debugfs_remove(current_dentry);
+ current_dentry = debugfs_create_symlink("current", root_dentry, name);
+}
+
+static void ml_dbgfs_create_layout_assume_root(struct memlayout *ml)
+{
+ char name[ML_LAYOUT_NAME_SZ];
+ ml_layout_name(ml, name);
+ WARN_ON(!root_dentry);
+ ml->d = debugfs_create_dir(name, root_dentry);
+ WARN_ON(!ml->d);
+}
+
+# if defined(CONFIG_DNUMA_DEBUGFS_WRITE)
+
+#define DEFINE_DEBUGFS_GET(___type) \
+ static int debugfs_## ___type ## _get(void *data, u64 *val) \
+ { \
+ *val = *(___type *)data; \
+ return 0; \
+ }
+
+DEFINE_DEBUGFS_GET(u32);
+DEFINE_DEBUGFS_GET(u8);
+
+#define DEFINE_WATCHED_ATTR(___type, ___var) \
+ static int ___var ## _watch_set(void *data, u64 val) \
+ { \
+ ___type old_val = *(___type *)data; \
+ int ret = ___var ## _watch(old_val, val); \
+ if (!ret) \
+ *(___type *)data = val; \
+ return ret; \
+ } \
+ DEFINE_SIMPLE_ATTRIBUTE(___var ## _fops, \
+ debugfs_ ## ___type ## _get, \
+ ___var ## _watch_set, "%llu\n");
+
+#define DEFINE_ACTION_ATTR(___name)
+
+static u64 dnuma_user_start;
+static u64 dnuma_user_end;
+static u32 dnuma_user_node; /* XXX: I don't care about this var, remove? */
+static u8 dnuma_user_commit, dnuma_user_clear; /* same here */
+static struct memlayout *user_ml;
+static DEFINE_MUTEX(dnuma_user_lock);
+static int dnuma_user_node_watch(u32 old_val, u32 new_val)
+{
+ int ret = 0;
+ mutex_lock(&dnuma_user_lock);
+ if (!user_ml)
+ user_ml = memlayout_create(ML_USER_DEBUG);
+
+ if (WARN_ON(!user_ml)) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ if (new_val >= nr_node_ids) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ if (dnuma_user_start > dnuma_user_end) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ ret = memlayout_new_range(user_ml, dnuma_user_start, dnuma_user_end,
+ new_val);
+
+ if (!ret) {
+ dnuma_user_start = 0;
+ dnuma_user_end = 0;
+ }
+out:
+ mutex_unlock(&dnuma_user_lock);
+ return ret;
+}
+
+static int dnuma_user_commit_watch(u8 old_val, u8 new_val)
+{
+ mutex_lock(&dnuma_user_lock);
+ if (user_ml)
+ memlayout_commit(user_ml);
+ user_ml = NULL;
+ mutex_unlock(&dnuma_user_lock);
+ return 0;
+}
+
+static int dnuma_user_clear_watch(u8 old_val, u8 new_val)
+{
+ mutex_lock(&dnuma_user_lock);
+ if (user_ml)
+ memlayout_destroy(user_ml);
+ user_ml = NULL;
+ mutex_unlock(&dnuma_user_lock);
+ return 0;
+}
+
+DEFINE_WATCHED_ATTR(u32, dnuma_user_node);
+DEFINE_WATCHED_ATTR(u8, dnuma_user_commit);
+DEFINE_WATCHED_ATTR(u8, dnuma_user_clear);
+# endif /* defined(CONFIG_DNUMA_DEBUGFS_WRITE) */
+
+/* create the entire current memlayout.
+ * only used for the layout which exists prior to fs initialization
+ */
+static void ml_dbgfs_create_initial_layout(void)
+{
+ struct rangemap_entry *rme;
+ char name[max(ML_REGION_NAME_SZ, ML_LAYOUT_NAME_SZ)];
+ struct memlayout *old_ml, *new_ml;
+
+ new_ml = kmalloc(sizeof(*new_ml), GFP_KERNEL);
+ if (WARN(!new_ml, "memlayout allocation failed\n"))
+ return;
+
+ mutex_lock(&memlayout_lock);
+
+ old_ml = rcu_dereference_protected(pfn_to_node_map,
+ mutex_is_locked(&memlayout_lock));
+ if (WARN_ON(!old_ml))
+ goto e_out;
+ *new_ml = *old_ml;
+
+ if (WARN_ON(new_ml->d))
+ goto e_out;
+
+ /* this assumption holds as ml_dbgfs_create_initial_layout() (this
+ * function) is only called by ml_dbgfs_create_root() */
+ ml_dbgfs_create_layout_assume_root(new_ml);
+ if (!new_ml->d)
+ goto e_out;
+
+ ml_for_each_range(new_ml, rme) {
+ _ml_dbgfs_create_range(new_ml->d, rme, name);
+ }
+
+ _ml_dbgfs_set_current(new_ml, name);
+ rcu_assign_pointer(pfn_to_node_map, new_ml);
+ mutex_unlock(&memlayout_lock);
+
+ synchronize_rcu();
+ kfree(old_ml);
+ return;
+e_out:
+ mutex_unlock(&memlayout_lock);
+ kfree(new_ml);
+}
+
+static atomic64_t ml_cache_hits;
+static atomic64_t ml_cache_misses;
+
+void ml_stat_cache_miss(void)
+{
+ atomic64_inc(&ml_cache_misses);
+}
+
+void ml_stat_cache_hit(void)
+{
+ atomic64_inc(&ml_cache_hits);
+}
+
+/* returns 0 if root_dentry has been created */
+static int ml_dbgfs_create_root(void)
+{
+ if (root_dentry)
+ return 0;
+
+ if (!debugfs_initialized()) {
+ pr_devel("debugfs not registered or disabled.\n");
+ return -EINVAL;
+ }
+
+ root_dentry = debugfs_create_dir("memlayout", NULL);
+ if (!root_dentry) {
+ pr_devel("root dir creation failed\n");
+ return -EINVAL;
+ }
+
+ /* TODO: place in a different dir? (to keep memlayout & dnuma separate)
+ */
+ /* FIXME: use debugfs_create_atomic64() [does not yet exist]. */
+ debugfs_create_u64("moved-pages", 0400, root_dentry,
+ (uint64_t *)&dnuma_moved_page_ct.counter);
+ debugfs_create_u64("pfn-lookup-cache-misses", 0400, root_dentry,
+ (uint64_t *)&ml_cache_misses.counter);
+ debugfs_create_u64("pfn-lookup-cache-hits", 0400, root_dentry,
+ (uint64_t *)&ml_cache_hits.counter);
+
+# if defined(CONFIG_DNUMA_DEBUGFS_WRITE)
+ /* Set node last: on write, it adds the range. */
+ debugfs_create_x64("start", 0600, root_dentry, &dnuma_user_start);
+ debugfs_create_x64("end", 0600, root_dentry, &dnuma_user_end);
+ debugfs_create_file("node", 0200, root_dentry,
+ &dnuma_user_node, &dnuma_user_node_fops);
+ debugfs_create_file("commit", 0200, root_dentry,
+ &dnuma_user_commit, &dnuma_user_commit_fops);
+ debugfs_create_file("clear", 0200, root_dentry,
+ &dnuma_user_clear, &dnuma_user_clear_fops);
+# endif
+
+ /* uses root_dentry */
+ ml_dbgfs_create_initial_layout();
+
+ return 0;
+}
+
+static void ml_dbgfs_create_layout(struct memlayout *ml)
+{
+ if (ml_dbgfs_create_root()) {
+ ml->d = NULL;
+ return;
+ }
+ ml_dbgfs_create_layout_assume_root(ml);
+}
+
+static int ml_dbgfs_init_root(void)
+{
+ ml_dbgfs_create_root();
+ return 0;
+}
+
+void ml_dbgfs_init(struct memlayout *ml)
+{
+ ml->seq = atomic_inc_return(&ml_seq) - 1;
+ ml_dbgfs_create_layout(ml);
+}
+
+void ml_dbgfs_create_range(struct memlayout *ml, struct rangemap_entry *rme)
+{
+ char name[ML_REGION_NAME_SZ];
+ if (ml->d)
+ _ml_dbgfs_create_range(ml->d, rme, name);
+}
+
+void ml_dbgfs_set_current(struct memlayout *ml)
+{
+ char name[ML_LAYOUT_NAME_SZ];
+ _ml_dbgfs_set_current(ml, name);
+}
+
+void ml_destroy_dbgfs(struct memlayout *ml)
+{
+ if (ml && ml->d)
+ debugfs_remove_recursive(ml->d);
+}
+
+static void __exit ml_dbgfs_exit(void)
+{
+ debugfs_remove_recursive(root_dentry);
+ root_dentry = NULL;
+}
+
+module_init(ml_dbgfs_init_root);
+module_exit(ml_dbgfs_exit);
diff --git a/mm/memlayout-debugfs.h b/mm/memlayout-debugfs.h
new file mode 100644
index 0000000..12dc1eb
--- /dev/null
+++ b/mm/memlayout-debugfs.h
@@ -0,0 +1,39 @@
+#ifndef LINUX_MM_MEMLAYOUT_DEBUGFS_H_
+#define LINUX_MM_MEMLAYOUT_DEBUGFS_H_
+
+#include <linux/memlayout.h>
+
+#ifdef CONFIG_DNUMA_DEBUGFS
+void ml_stat_count_moved_pages(int order);
+void ml_stat_cache_hit(void);
+void ml_stat_cache_miss(void);
+void ml_dbgfs_init(struct memlayout *ml);
+void ml_dbgfs_create_range(struct memlayout *ml, struct rangemap_entry *rme);
+void ml_destroy_dbgfs(struct memlayout *ml);
+void ml_dbgfs_set_current(struct memlayout *ml);
+void ml_backlog_feed(struct memlayout *ml);
+#else /* !defined(CONFIG_DNUMA_DEBUGFS) */
+static inline void ml_stat_count_moved_pages(int order)
+{}
+static inline void ml_stat_cache_hit(void)
+{}
+static inline void ml_stat_cache_miss(void)
+{}
+
+static inline void ml_dbgfs_init(struct memlayout *ml)
+{}
+static inline void ml_dbgfs_create_range(struct memlayout *ml,
+ struct rangemap_entry *rme)
+{}
+static inline void ml_destroy_dbgfs(struct memlayout *ml)
+{}
+static inline void ml_dbgfs_set_current(struct memlayout *ml)
+{}
+
+static inline void ml_backlog_feed(struct memlayout *ml)
+{
+ memlayout_destroy(ml);
+}
+#endif
+
+#endif
diff --git a/mm/memlayout.c b/mm/memlayout.c
index 7d2905b..45e7df6 100644
--- a/mm/memlayout.c
+++ b/mm/memlayout.c
@@ -14,6 +14,8 @@
#include <linux/rcupdate.h>
#include <linux/slab.h>

+#include "memlayout-debugfs.h"
+
/* protected by memlayout_lock */
__rcu struct memlayout *pfn_to_node_map;
DEFINE_MUTEX(memlayout_lock);
@@ -26,7 +28,7 @@ static void free_rme_tree(struct rb_root *root)
}
}

-static void ml_destroy_mem(struct memlayout *ml)
+void memlayout_destroy_mem(struct memlayout *ml)
{
if (!ml)
return;
@@ -88,6 +90,8 @@ int memlayout_new_range(struct memlayout *ml, unsigned long pfn_start,

rb_link_node(&rme->node, parent, new);
rb_insert_color(&rme->node, &ml->root);
+
+ ml_dbgfs_create_range(ml, rme);
return 0;
}

@@ -104,9 +108,12 @@ int memlayout_pfn_to_nid(unsigned long pfn)
rme = ACCESS_ONCE(ml->cache);
if (rme && rme_bounds_pfn(rme, pfn)) {
rcu_read_unlock();
+ ml_stat_cache_hit();
return rme->nid;
}

+ ml_stat_cache_miss();
+
node = ml->root.rb_node;
while (node) {
struct rangemap_entry *rme = rb_entry(node, typeof(*rme), node);
@@ -135,7 +142,8 @@ out:

void memlayout_destroy(struct memlayout *ml)
{
- ml_destroy_mem(ml);
+ ml_destroy_dbgfs(ml);
+ memlayout_destroy_mem(ml);
}

struct memlayout *memlayout_create(enum memlayout_type type)
@@ -153,6 +161,7 @@ struct memlayout *memlayout_create(enum memlayout_type type)
ml->type = type;
ml->cache = NULL;

+ ml_dbgfs_init(ml);
return ml;
}

@@ -163,12 +172,12 @@ void memlayout_commit(struct memlayout *ml)
if (ml->type == ML_INITIAL) {
if (WARN(dnuma_has_memlayout(),
"memlayout marked first is not first, ignoring.\n")) {
- memlayout_destroy(ml);
ml_backlog_feed(ml);
return;
}

mutex_lock(&memlayout_lock);
+ ml_dbgfs_set_current(ml);
rcu_assign_pointer(pfn_to_node_map, ml);
mutex_unlock(&memlayout_lock);
return;
@@ -179,13 +188,16 @@ void memlayout_commit(struct memlayout *ml)
unlock_memory_hotplug();

mutex_lock(&memlayout_lock);
+
+ ml_dbgfs_set_current(ml);
+
old_ml = rcu_dereference_protected(pfn_to_node_map,
mutex_is_locked(&memlayout_lock));

rcu_assign_pointer(pfn_to_node_map, ml);

synchronize_rcu();
- memlayout_destroy(old_ml);
+ ml_backlog_feed(old_ml);

/* Must be called only after the new value for pfn_to_node_map has
* propagated to all tasks, otherwise some pages may look up the old
--
1.8.2.1

2013-04-12 01:18:14

by Cody P Schafer

[permalink] [raw]
Subject: [RFC PATCH v2 09/25] page_alloc: when dynamic numa is enabled, don't check that all pages in a block belong to the same zone

When dynamic numa is enabled, the first or last page in a pageblock may
have been transplanted to a new zone (or may not yet have been
transplanted to its new zone).

Disable a BUG_ON() which checks that start_page and end_page are in the
same zone; pages that are not in the proper zone are simply skipped.

Signed-off-by: Cody P Schafer <[email protected]>
---
mm/page_alloc.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 75192eb..95e4a23 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -959,13 +959,16 @@ int move_freepages(struct zone *zone,
int pages_moved = 0;
int zone_nid = zone_to_nid(zone);

-#ifndef CONFIG_HOLES_IN_ZONE
+#if !defined(CONFIG_HOLES_IN_ZONE) && !defined(CONFIG_DYNAMIC_NUMA)
/*
- * page_zone is not safe to call in this context when
- * CONFIG_HOLES_IN_ZONE is set. This bug check is probably redundant
- * anyway as we check zone boundaries in move_freepages_block().
- * Remove at a later date when no bug reports exist related to
- * grouping pages by mobility
+ * With CONFIG_HOLES_IN_ZONE set, this check is unsafe as start_page or
+ * end_page may not be "valid".
+ * With CONFIG_DYNAMIC_NUMA set, this condition is a valid occurrence &
+ * not a bug.
+ *
+ * This bug check is probably redundant anyway as we check zone
+ * boundaries in move_freepages_block(). Remove at a later date when
+ * no bug reports exist related to grouping pages by mobility
*/
BUG_ON(page_zone(start_page) != page_zone(end_page));
#endif
--
1.8.2.1

2013-04-12 01:18:13

by Cody P Schafer

[permalink] [raw]
Subject: [RFC PATCH v2 10/25] page-flags dnuma: reserve a pageflag for determining if a page needs a node lookup.

Add a pageflag called "lookup_node" / PG_lookup_node / Page*LookupNode().

It is used by dynamic numa to indicate that a page has a new node
assignment waiting for it.

FIXME: this also exempts PG_lookup_node from PAGE_FLAGS_CHECK_AT_PREP,
because PG_lookup_node is set asynchronously; this exemption should
eventually be avoided.
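
As a rough sketch of the intended use (illustrative only; in the series
proper dnuma_page_needs_move() plays this role on the free path, and the
helper name below is made up):

#include <linux/mm.h>
#include <linux/memlayout.h>
#include <linux/page-flags.h>

/* Only pages marked at memlayout-commit time pay for the rbtree lookup;
 * everything else stays on the fast path. */
static inline int sketch_page_dest_nid(struct page *page)
{
        /* Test-and-clear so that a concurrent re-mark from a newer
         * memlayout commit is never lost. */
        if (!TestClearPageLookupNode(page))
                return NUMA_NO_NODE;

        /* Marked: look up the page's new node in the pfn -> nid map. */
        return memlayout_pfn_to_nid(page_to_pfn(page));
}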

Signed-off-by: Cody P Schafer <[email protected]>
---
include/linux/page-flags.h | 19 +++++++++++++++++++
mm/page_alloc.c | 3 +++
2 files changed, 22 insertions(+)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6d53675..09dd94e 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -109,6 +109,9 @@ enum pageflags {
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
PG_compound_lock,
#endif
+#ifdef CONFIG_DYNAMIC_NUMA
+ PG_lookup_node, /* extra lookup required to find real node */
+#endif
__NR_PAGEFLAGS,

/* Filesystems */
@@ -275,6 +278,17 @@ PAGEFLAG_FALSE(HWPoison)
#define __PG_HWPOISON 0
#endif

+/* Setting is unconditional and simply leads to an extra lookup.
+ * Clearing must be conditional so we don't miss any memlayout changes.
+ */
+#ifdef CONFIG_DYNAMIC_NUMA
+PAGEFLAG(LookupNode, lookup_node)
+TESTCLEARFLAG(LookupNode, lookup_node)
+#else
+PAGEFLAG_FALSE(LookupNode)
+TESTCLEARFLAG_FALSE(LookupNode)
+#endif
+
u64 stable_page_flags(struct page *page);

static inline int PageUptodate(struct page *page)
@@ -509,7 +523,12 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
* Pages being prepped should not have any flags set. If they are set,
* there has been a kernel bug or struct page corruption.
*/
+#ifndef CONFIG_DYNAMIC_NUMA
#define PAGE_FLAGS_CHECK_AT_PREP ((1 << NR_PAGEFLAGS) - 1)
+#else
+#define PAGE_FLAGS_CHECK_AT_PREP (((1 << NR_PAGEFLAGS) - 1) \
+ & ~(1 << PG_lookup_node))
+#endif

#define PAGE_FLAGS_PRIVATE \
(1 << PG_private | 1 << PG_private_2)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 95e4a23..4628443 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6170,6 +6170,9 @@ static const struct trace_print_flags pageflag_names[] = {
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
{1UL << PG_compound_lock, "compound_lock" },
#endif
+#ifdef CONFIG_DYNAMIC_NUMA
+ {1UL << PG_lookup_node, "lookup_node" },
+#endif
};

static void dump_page_flags(unsigned long flags)
--
1.8.2.1

2013-04-12 01:14:28

by Cody P Schafer

[permalink] [raw]
Subject: [RFC PATCH v2 06/25] mm: add nid_zone() helper

Add nid_zone(), which returns the zone corresponding to a given nid & zonenum.

Signed-off-by: Cody P Schafer <[email protected]>
---
include/linux/mm.h | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9ddae00..1b6abae 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -708,9 +708,14 @@ static inline void page_nid_reset_last(struct page *page)
}
#endif

+static inline struct zone *nid_zone(int nid, enum zone_type zonenum)
+{
+ return &NODE_DATA(nid)->node_zones[zonenum];
+}
+
static inline struct zone *page_zone(const struct page *page)
{
- return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
+ return nid_zone(page_to_nid(page), page_zonenum(page));
}

#ifdef SECTION_IN_PAGE_FLAGS
--
1.8.2.1

2013-04-12 01:18:59

by Cody P Schafer

[permalink] [raw]
Subject: [RFC PATCH v2 07/25] page_alloc: add return_pages_to_zone() when DYNAMIC_NUMA is enabled.

Add return_pages_to_zone(), a minimized version of __free_pages_ok()
which handles adding pages that have been removed from another zone into
a new zone.
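
A hedged sketch of a caller (not part of this patch; the function name
below is made up, and nid_zone() is the helper added earlier in this
series):

/* Illustrative only: hand a page that has already been pulled out of its
 * old zone's free lists over to the matching zone on its new node. */
static void transplant_free_page(struct page *page, unsigned int order,
                                 int dest_nid)
{
        struct zone *dest_zone = nid_zone(dest_nid, page_zonenum(page));

        /* return_pages_to_zone() disables interrupts and frees the page
         * into dest_zone's buddy lists via free_one_page(). */
        return_pages_to_zone(page, order, dest_zone);
}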

Signed-off-by: Cody P Schafer <[email protected]>
---
mm/internal.h | 5 ++++-
mm/page_alloc.c | 17 +++++++++++++++++
2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/mm/internal.h b/mm/internal.h
index b11e574..a70c77b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -104,6 +104,10 @@ extern void prep_compound_page(struct page *page, unsigned long order);
#ifdef CONFIG_MEMORY_FAILURE
extern bool is_free_buddy_page(struct page *page);
#endif
+#ifdef CONFIG_DYNAMIC_NUMA
+void return_pages_to_zone(struct page *page, unsigned int order,
+ struct zone *zone);
+#endif

#ifdef CONFIG_MEMORY_HOTPLUG
/*
@@ -114,7 +118,6 @@ extern int ensure_zone_is_initialized(struct zone *zone,
#endif

#if defined CONFIG_COMPACTION || defined CONFIG_CMA
-
/*
* in mm/compaction.c
*/
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 96909bb..1fbf5f2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -442,6 +442,12 @@ static inline void set_page_order(struct page *page, int order)
__SetPageBuddy(page);
}

+static inline void set_free_page_order(struct page *page, int order)
+{
+ set_page_private(page, order);
+ VM_BUG_ON(!PageBuddy(page));
+}
+
static inline void rmv_page_order(struct page *page)
{
__ClearPageBuddy(page);
@@ -738,6 +744,17 @@ static void __free_pages_ok(struct page *page, unsigned int order)
local_irq_restore(flags);
}

+#ifdef CONFIG_DYNAMIC_NUMA
+void return_pages_to_zone(struct page *page, unsigned int order,
+ struct zone *zone)
+{
+ unsigned long flags;
+ local_irq_save(flags);
+ free_one_page(zone, page, order, get_freepage_migratetype(page));
+ local_irq_restore(flags);
+}
+#endif
+
/*
* Read access to zone->managed_pages is safe because it's unsigned long,
* but we still need to serialize writers. Currently all callers of
--
1.8.2.1

2013-04-12 01:14:25

by Cody P Schafer

[permalink] [raw]
Subject: [RFC PATCH v2 04/25] memory_hotplug: export ensure_zone_is_initialized() in mm/internal.h

Export ensure_zone_is_initialized() so that it can be used to initialize
new zones within the dynamic numa code.

Signed-off-by: Cody P Schafer <[email protected]>
---
mm/internal.h | 8 ++++++++
mm/memory_hotplug.c | 2 +-
2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/internal.h b/mm/internal.h
index 8562de0..b11e574 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -105,6 +105,14 @@ extern void prep_compound_page(struct page *page, unsigned long order);
extern bool is_free_buddy_page(struct page *page);
#endif

+#ifdef CONFIG_MEMORY_HOTPLUG
+/*
+ * in mm/memory_hotplug.c
+ */
+extern int ensure_zone_is_initialized(struct zone *zone,
+ unsigned long start_pfn, unsigned long num_pages);
+#endif
+
#if defined CONFIG_COMPACTION || defined CONFIG_CMA

/*
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 8f4d8d3..df04c36 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -284,7 +284,7 @@ static void fix_zone_id(struct zone *zone, unsigned long start_pfn,

/* Can fail with -ENOMEM from allocating a wait table with vmalloc() or
* alloc_bootmem_node_nopanic() */
-static int __ref ensure_zone_is_initialized(struct zone *zone,
+int __ref ensure_zone_is_initialized(struct zone *zone,
unsigned long start_pfn, unsigned long num_pages)
{
if (!zone_is_initialized(zone))
--
1.8.2.1

2013-04-12 01:19:28

by Cody P Schafer

[permalink] [raw]
Subject: [RFC PATCH v2 05/25] mm/memory_hotplug: use {pgdat,zone}_is_empty() when resizing zones & pgdats

Use the *_is_empty() helpers to be more clear about what we're actually
checking for.

Signed-off-by: Cody P Schafer <[email protected]>
---
mm/memory_hotplug.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index df04c36..deea8c2 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -242,7 +242,7 @@ static void grow_zone_span(struct zone *zone, unsigned long start_pfn,
zone_span_writelock(zone);

old_zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
- if (!zone->spanned_pages || start_pfn < zone->zone_start_pfn)
+ if (zone_is_empty(zone) || start_pfn < zone->zone_start_pfn)
zone->zone_start_pfn = start_pfn;

zone->spanned_pages = max(old_zone_end_pfn, end_pfn) -
@@ -383,7 +383,7 @@ static void grow_pgdat_span(struct pglist_data *pgdat, unsigned long start_pfn,
unsigned long old_pgdat_end_pfn =
pgdat->node_start_pfn + pgdat->node_spanned_pages;

- if (!pgdat->node_spanned_pages || start_pfn < pgdat->node_start_pfn)
+ if (pgdat_is_empty(pgdat) || start_pfn < pgdat->node_start_pfn)
pgdat->node_start_pfn = start_pfn;

pgdat->node_spanned_pages = max(old_pgdat_end_pfn, end_pfn) -
--
1.8.2.1

2013-04-12 01:14:23

by Cody P Schafer

[permalink] [raw]
Subject: [RFC PATCH v2 03/25] mm/memory_hotplug: factor out zone+pgdat growth.

Create a new function, grow_pgdat_and_zone(), which handles locking &
growing both a zone and the pgdat it is associated with.

Signed-off-by: Cody P Schafer <[email protected]>
---
include/linux/memory_hotplug.h | 3 +++
mm/memory_hotplug.c | 17 +++++++++++------
2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index b6a3be7..cd393014 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -78,6 +78,9 @@ static inline void zone_seqlock_init(struct zone *zone)
{
seqlock_init(&zone->span_seqlock);
}
+extern void grow_pgdat_and_zone(struct zone *zone, unsigned long start_pfn,
+ unsigned long end_pfn);
+
extern int zone_grow_free_lists(struct zone *zone, unsigned long new_nr_pages);
extern int zone_grow_waitqueues(struct zone *zone, unsigned long nr_pages);
extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 46de32a..8f4d8d3 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -390,13 +390,22 @@ static void grow_pgdat_span(struct pglist_data *pgdat, unsigned long start_pfn,
pgdat->node_start_pfn;
}

+void grow_pgdat_and_zone(struct zone *zone, unsigned long start_pfn,
+ unsigned long end_pfn)
+{
+ unsigned long flags;
+ pgdat_resize_lock(zone->zone_pgdat, &flags);
+ grow_zone_span(zone, start_pfn, end_pfn);
+ grow_pgdat_span(zone->zone_pgdat, start_pfn, end_pfn);
+ pgdat_resize_unlock(zone->zone_pgdat, &flags);
+}
+
static int __meminit __add_zone(struct zone *zone, unsigned long phys_start_pfn)
{
struct pglist_data *pgdat = zone->zone_pgdat;
int nr_pages = PAGES_PER_SECTION;
int nid = pgdat->node_id;
int zone_type;
- unsigned long flags;
int ret;

zone_type = zone - pgdat->node_zones;
@@ -404,11 +413,7 @@ static int __meminit __add_zone(struct zone *zone, unsigned long phys_start_pfn)
if (ret)
return ret;

- pgdat_resize_lock(zone->zone_pgdat, &flags);
- grow_zone_span(zone, phys_start_pfn, phys_start_pfn + nr_pages);
- grow_pgdat_span(zone->zone_pgdat, phys_start_pfn,
- phys_start_pfn + nr_pages);
- pgdat_resize_unlock(zone->zone_pgdat, &flags);
+ grow_pgdat_and_zone(zone, phys_start_pfn, phys_start_pfn + nr_pages);
memmap_init_zone(nr_pages, nid, zone_type,
phys_start_pfn, MEMMAP_HOTPLUG);
return 0;
--
1.8.2.1

2013-04-12 01:14:21

by Cody P Schafer

[permalink] [raw]
Subject: [RFC PATCH v2 01/25] rbtree: add postorder iteration functions.

Add postorder iteration functions for rbtree. These are useful for
safely freeing an entire rbtree without modifying the tree at all.
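
For example (illustrative struct and function names), a destructor can
walk the tree with these helpers and free each node only after its
children have been visited; the next patch wraps this pattern in a
helper macro:

#include <linux/rbtree.h>
#include <linux/slab.h>

struct item {
        struct rb_node node;
        int key;
};

/* Free every item in the tree.  The next node is fetched before the
 * current one is freed, and postorder guarantees both children have
 * already been visited, so no freed memory is ever dereferenced and the
 * tree itself is never rebalanced. */
static void free_all_items(struct rb_root *root)
{
        struct rb_node *pos = rb_first_postorder(root);

        while (pos) {
                struct rb_node *next = rb_next_postorder(pos);

                kfree(rb_entry(pos, struct item, node));
                pos = next;
        }
        *root = RB_ROOT;
}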

Signed-off-by: Cody P Schafer <[email protected]>
---
include/linux/rbtree.h | 4 ++++
lib/rbtree.c | 40 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 44 insertions(+)

diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h
index 0022c1b..2879e96 100644
--- a/include/linux/rbtree.h
+++ b/include/linux/rbtree.h
@@ -68,6 +68,10 @@ extern struct rb_node *rb_prev(const struct rb_node *);
extern struct rb_node *rb_first(const struct rb_root *);
extern struct rb_node *rb_last(const struct rb_root *);

+/* Postorder iteration - always visit the parent after its children */
+extern struct rb_node *rb_first_postorder(const struct rb_root *);
+extern struct rb_node *rb_next_postorder(const struct rb_node *);
+
/* Fast replacement of a single node without remove/rebalance/add/rebalance */
extern void rb_replace_node(struct rb_node *victim, struct rb_node *new,
struct rb_root *root);
diff --git a/lib/rbtree.c b/lib/rbtree.c
index c0e31fe..65f4eff 100644
--- a/lib/rbtree.c
+++ b/lib/rbtree.c
@@ -518,3 +518,43 @@ void rb_replace_node(struct rb_node *victim, struct rb_node *new,
*new = *victim;
}
EXPORT_SYMBOL(rb_replace_node);
+
+static struct rb_node *rb_left_deepest_node(const struct rb_node *node)
+{
+ for (;;) {
+ if (node->rb_left)
+ node = node->rb_left;
+ else if (node->rb_right)
+ node = node->rb_right;
+ else
+ return (struct rb_node *)node;
+ }
+}
+
+struct rb_node *rb_next_postorder(const struct rb_node *node)
+{
+ const struct rb_node *parent;
+ if (!node)
+ return NULL;
+ parent = rb_parent(node);
+
+ /* If we're sitting on node, we've already seen our children */
+ if (parent && node == parent->rb_left && parent->rb_right) {
+ /* If we are the parent's left node, go to the parent's right
+ * node then all the way down to the left */
+ return rb_left_deepest_node(parent->rb_right);
+ } else
+ /* Otherwise we are the parent's right node, and the parent
+ * should be next */
+ return (struct rb_node *)parent;
+}
+EXPORT_SYMBOL(rb_next_postorder);
+
+struct rb_node *rb_first_postorder(const struct rb_root *root)
+{
+ if (!root->rb_node)
+ return NULL;
+
+ return rb_left_deepest_node(root->rb_node);
+}
+EXPORT_SYMBOL(rb_first_postorder);
--
1.8.2.1

2013-04-12 01:19:52

by Cody P Schafer

[permalink] [raw]
Subject: [RFC PATCH v2 02/25] rbtree: add rbtree_postorder_for_each_entry_safe() helper.
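
A short usage sketch (illustrative struct and field names); as with
list_for_each_entry_safe(), 'n' caches the next entry so 'pos' may be
freed inside the loop body:

#include <linux/rbtree.h>
#include <linux/slab.h>

struct item {
        struct rb_node node;
        int key;
};

/* Free every item without ever rebalancing or re-walking the tree:
 * postorder visits both children before their parent, so freeing 'pos'
 * never invalidates a node the iteration still needs. */
static void free_item_tree(struct rb_root *root)
{
        struct item *pos, *n;

        rbtree_postorder_for_each_entry_safe(pos, n, root, node)
                kfree(pos);

        *root = RB_ROOT;
}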

Signed-off-by: Cody P Schafer <[email protected]>
---
include/linux/rbtree.h | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h
index 2879e96..1b239ca 100644
--- a/include/linux/rbtree.h
+++ b/include/linux/rbtree.h
@@ -85,4 +85,12 @@ static inline void rb_link_node(struct rb_node * node, struct rb_node * parent,
*rb_link = node;
}

+#define rbtree_postorder_for_each_entry_safe(pos, n, root, field) \
+ for (pos = rb_entry(rb_first_postorder(root), typeof(*pos), field),\
+ n = rb_entry(rb_next_postorder(&pos->field), \
+ typeof(*pos), field); \
+ &pos->field; \
+ pos = n, \
+ n = rb_entry(rb_next_postorder(&pos->field), typeof(*pos), field))
+
#endif /* _LINUX_RBTREE_H */
--
1.8.2.1