The kernel has recently added support for using persistent memory as
normal RAM:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4
Persistent memory is hot-added to nodes separate from other memory
types, which makes it convenient to create node-based memory policies.
Since persistent memory provides a larger, cheaper address space than
system RAM, but with slower access characteristics, we'd like the
kernel to use these memory-only nodes as a migration tier for pages
that would normally be discarded during memory reclaim. This is faster
than doing IO for swap or page cache, and makes better use of the
available physical address space.
The feature is not enabled by default. The user must opt-in to kernel
managed page migration by defining the demotion path. In the future,
we may want to have the kernel automatically create this based on
heterogeneous memory attributes and CPU locality.
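As a rough userspace sketch of the policy this enables (illustrative
only; the names and the fixed-size table are stand-ins, not the kernel
implementation), each node either has a demotion target or is a
terminal node, and reclaim on a node with a target migrates pages there
instead of discarding them:

```c
#define TERMINAL_NODE -1

/*
 * Stand-in for the per-node migration table the series adds: node 0
 * demotes to node 2, node 1 to node 3; nodes 2 and 3 are terminal,
 * so reclaim behaves normally there.
 */
static int demotion_target[4] = { 2, 3, TERMINAL_NODE, TERMINAL_NODE };

enum reclaim_action { RECLAIM_DEMOTE, RECLAIM_DISCARD };

/* Reclaim prefers migration to the demotion target over discard. */
static enum reclaim_action reclaim_page(int nid)
{
	if (demotion_target[nid] != TERMINAL_NODE)
		return RECLAIM_DEMOTE;
	return RECLAIM_DISCARD;
}
```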
Keith Busch (5):
node: Define and export memory migration path
mm: Split handling old page for migration
mm: Attempt to migrate page in lieu of discard
mm: Consider anonymous pages without swap
mm/migrate: Add page movement trace event
Documentation/ABI/stable/sysfs-devices-node | 11 +-
drivers/base/node.c | 73 +++++++++++++
include/linux/migrate.h | 6 ++
include/linux/node.h | 6 ++
include/linux/swap.h | 20 ++++
include/trace/events/migrate.h | 29 ++++-
mm/debug.c | 1 +
mm/migrate.c | 161 ++++++++++++++++++----------
mm/vmscan.c | 25 ++++-
9 files changed, 271 insertions(+), 61 deletions(-)
--
2.14.4
Trace the source and destination node of a page migration to help debug
memory usage.
Signed-off-by: Keith Busch <[email protected]>
---
include/trace/events/migrate.h | 26 ++++++++++++++++++++++++++
mm/migrate.c | 1 +
2 files changed, 27 insertions(+)
diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
index d25de0cc8714..3d4b7131e547 100644
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -6,6 +6,7 @@
#define _TRACE_MIGRATE_H
#include <linux/tracepoint.h>
+#include <trace/events/mmflags.h>
#define MIGRATE_MODE \
EM( MIGRATE_ASYNC, "MIGRATE_ASYNC") \
@@ -71,6 +72,31 @@ TRACE_EVENT(mm_migrate_pages,
__print_symbolic(__entry->mode, MIGRATE_MODE),
__print_symbolic(__entry->reason, MIGRATE_REASON))
);
+
+TRACE_EVENT(mm_migrate_move_page,
+
+ TP_PROTO(struct page *from, struct page *to, int status),
+
+ TP_ARGS(from, to, status),
+
+ TP_STRUCT__entry(
+ __field(struct page *, from)
+ __field(struct page *, to)
+ __field(int, status)
+ ),
+
+ TP_fast_assign(
+ __entry->from = from;
+ __entry->to = to;
+ __entry->status = status;
+ ),
+
+ TP_printk("node from=%d to=%d status=%d flags=%s refs=%d",
+ page_to_nid(__entry->from), page_to_nid(__entry->to),
+ __entry->status,
+ show_page_flags(__entry->from->flags & ((1UL << NR_PAGEFLAGS) - 1)),
+ page_ref_count(__entry->from))
+);
#endif /* _TRACE_MIGRATE_H */
/* This part must be outside protection */
diff --git a/mm/migrate.c b/mm/migrate.c
index 83fad87361bf..d97433da12c0 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -997,6 +997,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
page->mapping = NULL;
}
out:
+ trace_mm_migrate_move_page(page, newpage, rc);
return rc;
}
--
2.14.4
Refactor unmap_and_move() handling for the new page into a separate
function from locking and preparing the old page.
No functional change here: this is just making it easier to reuse this
part of the page migration from contexts that already locked the old page.
Signed-off-by: Keith Busch <[email protected]>
---
mm/migrate.c | 115 +++++++++++++++++++++++++++++++----------------------------
1 file changed, 61 insertions(+), 54 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index ac6f4939bb59..705b320d4b35 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1000,57 +1000,14 @@ static int move_to_new_page(struct page *newpage, struct page *page,
return rc;
}
-static int __unmap_and_move(struct page *page, struct page *newpage,
- int force, enum migrate_mode mode)
+static int __unmap_and_move_locked(struct page *page, struct page *newpage,
+ enum migrate_mode mode)
{
int rc = -EAGAIN;
int page_was_mapped = 0;
struct anon_vma *anon_vma = NULL;
bool is_lru = !__PageMovable(page);
- if (!trylock_page(page)) {
- if (!force || mode == MIGRATE_ASYNC)
- goto out;
-
- /*
- * It's not safe for direct compaction to call lock_page.
- * For example, during page readahead pages are added locked
- * to the LRU. Later, when the IO completes the pages are
- * marked uptodate and unlocked. However, the queueing
- * could be merging multiple pages for one bio (e.g.
- * mpage_readpages). If an allocation happens for the
- * second or third page, the process can end up locking
- * the same page twice and deadlocking. Rather than
- * trying to be clever about what pages can be locked,
- * avoid the use of lock_page for direct compaction
- * altogether.
- */
- if (current->flags & PF_MEMALLOC)
- goto out;
-
- lock_page(page);
- }
-
- if (PageWriteback(page)) {
- /*
- * Only in the case of a full synchronous migration is it
- * necessary to wait for PageWriteback. In the async case,
- * the retry loop is too short and in the sync-light case,
- * the overhead of stalling is too much
- */
- switch (mode) {
- case MIGRATE_SYNC:
- case MIGRATE_SYNC_NO_COPY:
- break;
- default:
- rc = -EBUSY;
- goto out_unlock;
- }
- if (!force)
- goto out_unlock;
- wait_on_page_writeback(page);
- }
-
/*
* By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
* we cannot notice that anon_vma is freed while we migrates a page.
@@ -1077,11 +1034,11 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
* This is much like races on refcount of oldpage: just don't BUG().
*/
if (unlikely(!trylock_page(newpage)))
- goto out_unlock;
+ goto out;
if (unlikely(!is_lru)) {
rc = move_to_new_page(newpage, page, mode);
- goto out_unlock_both;
+ goto out_unlock;
}
/*
@@ -1100,7 +1057,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
VM_BUG_ON_PAGE(PageAnon(page), page);
if (page_has_private(page)) {
try_to_free_buffers(page);
- goto out_unlock_both;
+ goto out_unlock;
}
} else if (page_mapped(page)) {
/* Establish migration ptes */
@@ -1110,22 +1067,19 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
page_was_mapped = 1;
}
-
if (!page_mapped(page))
rc = move_to_new_page(newpage, page, mode);
if (page_was_mapped)
remove_migration_ptes(page,
rc == MIGRATEPAGE_SUCCESS ? newpage : page, false);
-
-out_unlock_both:
- unlock_page(newpage);
out_unlock:
+ unlock_page(newpage);
/* Drop an anon_vma reference if we took one */
+out:
if (anon_vma)
put_anon_vma(anon_vma);
- unlock_page(page);
-out:
+
/*
* If migration is successful, decrease refcount of the newpage
* which will not free the page because new page owner increased
@@ -1141,7 +1095,60 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
else
putback_lru_page(newpage);
}
+ return rc;
+}
+
+static int __unmap_and_move(struct page *page, struct page *newpage,
+ int force, enum migrate_mode mode)
+{
+ int rc = -EAGAIN;
+
+ if (!trylock_page(page)) {
+ if (!force || mode == MIGRATE_ASYNC)
+ goto out;
+
+ /*
+ * It's not safe for direct compaction to call lock_page.
+ * For example, during page readahead pages are added locked
+ * to the LRU. Later, when the IO completes the pages are
+ * marked uptodate and unlocked. However, the queueing
+ * could be merging multiple pages for one bio (e.g.
+ * mpage_readpages). If an allocation happens for the
+ * second or third page, the process can end up locking
+ * the same page twice and deadlocking. Rather than
+ * trying to be clever about what pages can be locked,
+ * avoid the use of lock_page for direct compaction
+ * altogether.
+ */
+ if (current->flags & PF_MEMALLOC)
+ goto out;
+
+ lock_page(page);
+ }
+ if (PageWriteback(page)) {
+ /*
+ * Only in the case of a full synchronous migration is it
+ * necessary to wait for PageWriteback. In the async case,
+ * the retry loop is too short and in the sync-light case,
+ * the overhead of stalling is too much
+ */
+ switch (mode) {
+ case MIGRATE_SYNC:
+ case MIGRATE_SYNC_NO_COPY:
+ break;
+ default:
+ rc = -EBUSY;
+ goto out_unlock;
+ }
+ if (!force)
+ goto out_unlock;
+ wait_on_page_writeback(page);
+ }
+ rc = __unmap_and_move_locked(page, newpage, mode);
+out_unlock:
+ unlock_page(page);
+out:
return rc;
}
--
2.14.4
Age and reclaim anonymous pages from nodes that have an online migration node even
if swap is not enabled.
Signed-off-by: Keith Busch <[email protected]>
---
include/linux/swap.h | 20 ++++++++++++++++++++
mm/vmscan.c | 10 +++++-----
2 files changed, 25 insertions(+), 5 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4bfb5c4ac108..91b405a3b44f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -680,5 +680,25 @@ static inline bool mem_cgroup_swap_full(struct page *page)
}
#endif
+static inline bool reclaim_anon_pages(struct mem_cgroup *memcg,
+ int node_id)
+{
+ /* Always age anon pages when we have swap */
+ if (memcg == NULL) {
+ if (get_nr_swap_pages() > 0)
+ return true;
+ } else {
+ if (mem_cgroup_get_nr_swap_pages(memcg) > 0)
+ return true;
+ }
+
+ /* Also age anon pages if we can auto-migrate them */
+ if (next_migration_node(node_id) >= 0)
+ return true;
+
+ /* No way to reclaim anon pages */
+ return false;
+}
+
#endif /* __KERNEL__*/
#endif /* _LINUX_SWAP_H */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0a95804e946a..226c4c838947 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -327,7 +327,7 @@ unsigned long zone_reclaimable_pages(struct zone *zone)
nr = zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_FILE) +
zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_FILE);
- if (get_nr_swap_pages() > 0)
+ if (reclaim_anon_pages(NULL, zone_to_nid(zone)))
nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) +
zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON);
@@ -2206,7 +2206,7 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
* If we don't have swap space, anonymous page deactivation
* is pointless.
*/
- if (!file && !total_swap_pages)
+ if (!file && !reclaim_anon_pages(NULL, pgdat->node_id))
return false;
inactive = lruvec_lru_size(lruvec, inactive_lru, sc->reclaim_idx);
@@ -2287,7 +2287,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
enum lru_list lru;
/* If we have no swap space, do not bother scanning anon pages. */
- if (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0) {
+ if (!sc->may_swap || !reclaim_anon_pages(memcg, pgdat->node_id)) {
scan_balance = SCAN_FILE;
goto out;
}
@@ -2650,7 +2650,7 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
*/
pages_for_compaction = compact_gap(sc->order);
inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
- if (get_nr_swap_pages() > 0)
+ if (reclaim_anon_pages(NULL, pgdat->node_id))
inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
if (sc->nr_reclaimed < pages_for_compaction &&
inactive_lru_pages > pages_for_compaction)
@@ -3347,7 +3347,7 @@ static void age_active_anon(struct pglist_data *pgdat,
{
struct mem_cgroup *memcg;
- if (!total_swap_pages)
+ if (!reclaim_anon_pages(NULL, pgdat->node_id))
return;
memcg = mem_cgroup_iter(NULL, NULL, NULL);
--
2.14.4
If a memory node has a preferred migration path to demote cold pages,
attempt to move those inactive pages to that migration node before
reclaiming. This will better utilize available memory, provide a faster
tier than swapping or discarding, and allow such pages to be reused
immediately without IO to retrieve the data.
Some places we would like to see this used:
1. Persistent memory being used as a slower, cheaper DRAM replacement
2. Remote memory-only "expansion" NUMA nodes
3. Resolving memory imbalances where one NUMA node is seeing more
allocation activity than another. This helps keep more recent
allocations closer to the CPUs on the node doing the allocating.
Signed-off-by: Keith Busch <[email protected]>
---
include/linux/migrate.h | 6 ++++++
include/trace/events/migrate.h | 3 ++-
mm/debug.c | 1 +
mm/migrate.c | 45 ++++++++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 15 ++++++++++++++
5 files changed, 69 insertions(+), 1 deletion(-)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index e13d9bf2f9a5..a004cb1b2dbb 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -25,6 +25,7 @@ enum migrate_reason {
MR_MEMPOLICY_MBIND,
MR_NUMA_MISPLACED,
MR_CONTIG_RANGE,
+ MR_DEMOTION,
MR_TYPES
};
@@ -79,6 +80,7 @@ extern int migrate_huge_page_move_mapping(struct address_space *mapping,
extern int migrate_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page, enum migrate_mode mode,
int extra_count);
+extern bool migrate_demote_mapping(struct page *page);
#else
static inline void putback_movable_pages(struct list_head *l) {}
@@ -105,6 +107,10 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
return -ENOSYS;
}
+static inline bool migrate_demote_mapping(struct page *page)
+{
+ return false;
+}
#endif /* CONFIG_MIGRATION */
#ifdef CONFIG_COMPACTION
diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
index 705b33d1e395..d25de0cc8714 100644
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -20,7 +20,8 @@
EM( MR_SYSCALL, "syscall_or_cpuset") \
EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind") \
EM( MR_NUMA_MISPLACED, "numa_misplaced") \
- EMe(MR_CONTIG_RANGE, "contig_range")
+ EM(MR_CONTIG_RANGE, "contig_range") \
+ EMe(MR_DEMOTION, "demotion")
/*
* First define the enums in the above macros to be exported to userspace
diff --git a/mm/debug.c b/mm/debug.c
index c0b31b6c3877..53d499f65199 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -25,6 +25,7 @@ const char *migrate_reason_names[MR_TYPES] = {
"mempolicy_mbind",
"numa_misplaced",
"cma",
+ "demotion",
};
const struct trace_print_flags pageflag_names[] = {
diff --git a/mm/migrate.c b/mm/migrate.c
index 705b320d4b35..83fad87361bf 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1152,6 +1152,51 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
return rc;
}
+/**
+ * migrate_demote_mapping() - Migrate this page and its mappings to its
+ * demotion node.
+ * @page: An isolated, non-compound page that should move to
+ * its current node's migration path.
+ *
+ * @returns: True if migrate demotion was successful, false otherwise
+ */
+bool migrate_demote_mapping(struct page *page)
+{
+ int rc, next_nid = next_migration_node(page_to_nid(page));
+ struct page *newpage;
+
+ /*
+ * The flags are set to allocate only on the desired node in the
+ * migration path, and to fail fast if not immediately available. We
+ * are already in the memory reclaim path, we don't want heroic
+ * efforts to get a page.
+ */
+ gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY |
+ __GFP_NOMEMALLOC | __GFP_THISNODE;
+
+ VM_BUG_ON_PAGE(PageCompound(page), page);
+ VM_BUG_ON_PAGE(PageLRU(page), page);
+
+ if (next_nid < 0)
+ return false;
+
+ newpage = alloc_pages_node(next_nid, mask, 0);
+ if (!newpage)
+ return false;
+
+ /*
+ * MIGRATE_ASYNC is the most light weight and never blocks.
+ */
+ rc = __unmap_and_move_locked(page, newpage, MIGRATE_ASYNC);
+ if (rc != MIGRATEPAGE_SUCCESS) {
+ __free_pages(newpage, 0);
+ return false;
+ }
+
+ set_page_owner_migrate_reason(newpage, MR_DEMOTION);
+ return true;
+}
+
/*
* gcc 4.7 and 4.8 on arm get an ICEs when inlining unmap_and_move(). Work
* around it.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a5ad0b35ab8e..0a95804e946a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1261,6 +1261,21 @@ static unsigned long shrink_page_list(struct list_head *page_list,
; /* try to reclaim the page below */
}
+ if (!PageCompound(page)) {
+ if (migrate_demote_mapping(page)) {
+ unlock_page(page);
+ if (likely(put_page_testzero(page)))
+ goto free_it;
+
+ /*
+ * Speculative reference will free this page,
+ * so leave it off the LRU.
+ */
+ nr_reclaimed++;
+ continue;
+ }
+ }
+
/*
* Anonymous process memory has backing store?
* Try to allocate it some swap space here.
--
2.14.4
Prepare for the kernel to auto-migrate pages to other memory nodes with a
user defined node migration table. A user may create a single target for
each NUMA node to enable the kernel to do NUMA page migrations instead
of simply reclaiming colder pages. A node with no target is a "terminal
node", so reclaim acts normally there. The migration target does not
fundamentally _need_ to be a single node, but this implementation starts
there to limit complexity.
If you consider the migration path as a graph, cycles (loops) in the graph
are disallowed. This avoids wasting resources by constantly migrating
(A->B, B->A, A->B ...). The expectation is that cycles will never be
allowed.
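The loop-rejection rule can be sketched in userspace like this
(illustrative, with a fixed-size table and hypothetical names; this
variant also rejects pointing a node directly at itself):

```c
#include <stdbool.h>

#define MAX_NODES 8
#define TERMINAL_NODE -1

static int migration[MAX_NODES] = {
	[0 ... MAX_NODES - 1] = TERMINAL_NODE
};

/*
 * Set node 'nid' to demote to 'next' (or make it terminal when
 * next < 0).  Walk the existing path starting at 'next' and refuse
 * the update if it ever reaches an already-visited node, which
 * includes 'nid' itself: that would close a migration cycle.
 */
static bool set_migration_path(int nid, int next)
{
	bool visited[MAX_NODES] = { false };
	int i;

	if (next < 0) {
		migration[nid] = TERMINAL_NODE;
		return true;
	}
	if (next >= MAX_NODES)
		return false;

	visited[nid] = true;
	for (i = next; ; i = migration[i]) {
		if (visited[i])
			return false;	/* cycle detected */
		visited[i] = true;
		if (migration[i] == TERMINAL_NODE)
			break;
	}
	migration[nid] = next;
	return true;
}
```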
Signed-off-by: Keith Busch <[email protected]>
---
Documentation/ABI/stable/sysfs-devices-node | 11 ++++-
drivers/base/node.c | 73 +++++++++++++++++++++++++++++
include/linux/node.h | 6 +++
3 files changed, 89 insertions(+), 1 deletion(-)
diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
index 3e90e1f3bf0a..7439e1845e5d 100644
--- a/Documentation/ABI/stable/sysfs-devices-node
+++ b/Documentation/ABI/stable/sysfs-devices-node
@@ -90,4 +90,13 @@ Date: December 2009
Contact: Lee Schermerhorn <[email protected]>
Description:
The node's huge page size control/query attributes.
- See Documentation/admin-guide/mm/hugetlbpage.rst
\ No newline at end of file
+ See Documentation/admin-guide/mm/hugetlbpage.rst
+
+What: /sys/devices/system/node/nodeX/migration_path
+Date: March 2019
+Contact: Linux Memory Management list <[email protected]>
+Description:
+ Defines which node the kernel should attempt to migrate this
+ node's pages to when this node requires memory reclaim. A
+ negative value means this is a terminal node and memory can not
+ be reclaimed through kernel managed migration.
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 86d6cd92ce3d..20a90905555f 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -59,6 +59,10 @@ static inline ssize_t node_read_cpulist(struct device *dev,
static DEVICE_ATTR(cpumap, S_IRUGO, node_read_cpumask, NULL);
static DEVICE_ATTR(cpulist, S_IRUGO, node_read_cpulist, NULL);
+#define TERMINAL_NODE -1
+static int node_migration[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] = TERMINAL_NODE};
+static DEFINE_SPINLOCK(node_migration_lock);
+
#define K(x) ((x) << (PAGE_SHIFT - 10))
static ssize_t node_read_meminfo(struct device *dev,
struct device_attribute *attr, char *buf)
@@ -233,6 +237,74 @@ static ssize_t node_read_distance(struct device *dev,
}
static DEVICE_ATTR(distance, S_IRUGO, node_read_distance, NULL);
+static ssize_t migration_path_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ return sprintf(buf, "%d\n", node_migration[dev->id]);
+}
+
+static ssize_t migration_path_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ int i, err, nid = dev->id;
+ nodemask_t visited = NODE_MASK_NONE;
+ long next;
+
+ err = kstrtol(buf, 0, &next);
+ if (err)
+ return -EINVAL;
+
+ if (next < 0) {
+ spin_lock(&node_migration_lock);
+ WRITE_ONCE(node_migration[nid], TERMINAL_NODE);
+ spin_unlock(&node_migration_lock);
+ return count;
+ }
+ if (next >= MAX_NUMNODES || !node_online(next))
+ return -EINVAL;
+
+ /*
+ * Follow the entire migration path from 'nid' through the point where
+ * we hit a TERMINAL_NODE.
+ *
+ * Don't allow looped migration cycles in the path.
+ */
+ node_set(nid, visited);
+ spin_lock(&node_migration_lock);
+ for (i = next; node_migration[i] != TERMINAL_NODE;
+ i = node_migration[i]) {
+ /* Fail if we have visited this node already */
+ if (node_test_and_set(i, visited)) {
+ spin_unlock(&node_migration_lock);
+ return -EINVAL;
+ }
+ }
+ WRITE_ONCE(node_migration[nid], next);
+ spin_unlock(&node_migration_lock);
+
+ return count;
+}
+static DEVICE_ATTR_RW(migration_path);
+
+/**
+ * next_migration_node() - Get the next node in the migration path
+ * @current_node: The starting node to lookup the next node
+ *
+ * @returns: node id for next memory node in the migration path hierarchy from
+ * @current_node; -1 if @current_node is terminal or its migration
+ * node is not online.
+ */
+int next_migration_node(int current_node)
+{
+ int nid = READ_ONCE(node_migration[current_node]);
+
+ if (nid >= 0 && node_online(nid))
+ return nid;
+ return TERMINAL_NODE;
+}
+
static struct attribute *node_dev_attrs[] = {
&dev_attr_cpumap.attr,
&dev_attr_cpulist.attr,
@@ -240,6 +312,7 @@ static struct attribute *node_dev_attrs[] = {
&dev_attr_numastat.attr,
&dev_attr_distance.attr,
&dev_attr_vmstat.attr,
+ &dev_attr_migration_path.attr,
NULL
};
ATTRIBUTE_GROUPS(node_dev);
diff --git a/include/linux/node.h b/include/linux/node.h
index 257bb3d6d014..af46c7a8b94f 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -67,6 +67,7 @@ static inline int register_one_node(int nid)
return error;
}
+extern int next_migration_node(int current_node);
extern void unregister_one_node(int nid);
extern int register_cpu_under_node(unsigned int cpu, unsigned int nid);
extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid);
@@ -115,6 +116,11 @@ static inline void register_hugetlbfs_with_node(node_registration_func_t reg,
node_registration_func_t unreg)
{
}
+
+static inline int next_migration_node(int current_node)
+{
+ return -1;
+}
#endif
#define to_node(device) container_of(device, struct node, dev)
--
2.14.4
On 21 Mar 2019, at 13:01, Keith Busch wrote:
> The kernel has recently added support for using persistent memory as
> normal RAM:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4
>
> The persistent memory is hot added to nodes separate from other memory
> types, which makes it convenient to make node based memory policies.
>
> When persistent memory provides a larger and cheaper address space, but
> with slower access characteristics than system RAM, we'd like the kernel
> to make use of these memory-only nodes as a migration tier for pages
> that would normally be discarded during memory reclaim. This is faster
> than doing IO for swap or page cache, and makes better utilization of
> available physical address space.
>
> The feature is not enabled by default. The user must opt-in to kernel
> managed page migration by defining the demotion path. In the future,
> we may want to have the kernel automatically create this based on
> heterogeneous memory attributes and CPU locality.
>
Cc more people here.
Thank you for the patchset. This is definitely useful when we have larger PMEM
backing existing DRAM. I have several questions:
1. The name “page demotion” seems confusing to me, since I thought it was about demoting
large pages to small pages, as opposed to promoting small pages to THPs. Am I the only
one here?
2. For the demotion path, a common case would be from high-performance memory, like HBM
or Multi-Channel DRAM, to DRAM, then to PMEM, and finally to disks, right? A more general
demotion path would be derived from the memory performance description in HMAT[1],
right? Do you have any algorithm to form such a path from HMAT?
3. Do you have a plan for promoting pages from lower-level memory to higher-level memory,
like from PMEM to DRAM? Will this one-way demotion make all pages sink to PMEM and disk?
4. In your patch 3, you created a new method migrate_demote_mapping() to migrate pages to
another memory node; is there any problem with reusing the existing migrate_pages() interface?
5. In addition, you only migrate base pages; is there any performance concern with migrating THPs?
Is it too costly to migrate THPs?
Thanks.
[1] https://lwn.net/Articles/724562/
--
Best Regards,
Yan Zi
On Thu, Mar 21, 2019 at 02:20:51PM -0700, Zi Yan wrote:
> 1. The name “page demotion” seems confusing to me, since I thought it was about demoting
> large pages to small pages, as opposed to promoting small pages to THPs. Am I the only
> one here?
If you have a THP, we'll skip the page migration and fall through to
split_huge_page_to_list(), then the smaller pages can be considered,
migrated and reclaimed individually. Not that we couldn't try to migrate
a THP directly. It was just a simpler implementation for this first attempt.
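A compressed view of that ordering, with stand-in types (an
illustration of the described flow, not the shrink_page_list() code):

```c
#include <stdbool.h>

/* Stand-in page with only the state the sketch needs. */
struct fake_page {
	bool compound;		/* PageCompound(): a THP head */
	bool demotable;		/* a migration target page is available */
};

enum step { STEP_DEMOTED, STEP_SPLIT, STEP_RECLAIM };

/*
 * Order described above: base pages are demoted directly when
 * possible; a THP is not migrated as a unit but falls through to be
 * split (split_huge_page_to_list()), after which each subpage is
 * considered individually; pages that cannot be demoted continue to
 * normal reclaim (swap or discard).
 */
static enum step reclaim_step(const struct fake_page *p)
{
	if (!p->compound)
		return p->demotable ? STEP_DEMOTED : STEP_RECLAIM;
	return STEP_SPLIT;
}
```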
> 2. For the demotion path, a common case would be from high-performance memory, like HBM
> or Multi-Channel DRAM, to DRAM, then to PMEM, and finally to disks, right? More general
> case for demotion path would be derived from the memory performance description from HMAT[1],
> right? Do you have any algorithm to form such a path from HMAT?
Yes, I have a PoC for the kernel setting up a demotion path based on
HMAT properties here:
https://git.kernel.org/pub/scm/linux/kernel/git/kbusch/linux.git/commit/?h=mm-migrate&id=4d007659e1dd1b0dad49514348be4441fbe7cadb
The above is just from an experimental branch.
> 3. Do you have a plan for promoting pages from lower-level memory to higher-level memory,
> like from PMEM to DRAM? Will this one-way demotion make all pages sink to PMEM and disk?
Promoting previously demoted pages would require the application to do
something to make that happen if you turn demotion on with this series.
Kernel auto-promotion is still being investigated, and it's a little
trickier than reclaim.
If it sinks to disk, though, the next access behavior is the same as
before, without this series.
> 4. In your patch 3, you created a new method migrate_demote_mapping() to migrate pages to
> other memory node, is there any problem of reusing existing migrate_pages() interface?
Yes, we may not want to migrate everything in the shrink_page_list()
pages. We might want to keep a page, so we have to do those checks first. At
the point we know we want to attempt migration, the page is already
locked and not in a list, so it is just easier to directly invoke the
new __unmap_and_move_locked() that migrate_pages() eventually also calls.
> 5. In addition, you only migrate base pages, is there any performance concern on migrating THPs?
> Is it too costly to migrate THPs?
It was just easier to consider single pages first, so we let a THP split
if possible. I'm not sure of the cost in migrating THPs directly.
On Thu, Mar 21, 2019 at 3:36 PM Keith Busch <[email protected]> wrote:
>
> On Thu, Mar 21, 2019 at 02:20:51PM -0700, Zi Yan wrote:
> > 1. The name “page demotion” seems confusing to me, since I thought it was about demoting
> > large pages to small pages, as opposed to promoting small pages to THPs. Am I the only
> > one here?
>
> If you have a THP, we'll skip the page migration and fall through to
> split_huge_page_to_list(), then the smaller pages can be considered,
> migrated and reclaimed individually. Not that we couldn't try to migrate
> a THP directly. It was just a simpler implementation for this first attempt.
>
> > 2. For the demotion path, a common case would be from high-performance memory, like HBM
> > or Multi-Channel DRAM, to DRAM, then to PMEM, and finally to disks, right? More general
> > case for demotion path would be derived from the memory performance description from HMAT[1],
> > right? Do you have any algorithm to form such a path from HMAT?
>
> Yes, I have a PoC for the kernel setting up a demotion path based on
> HMAT properties here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/kbusch/linux.git/commit/?h=mm-migrate&id=4d007659e1dd1b0dad49514348be4441fbe7cadb
>
> The above is just from an experimental branch.
>
> > 3. Do you have a plan for promoting pages from lower-level memory to higher-level memory,
> > like from PMEM to DRAM? Will this one-way demotion make all pages sink to PMEM and disk?
>
> Promoting previously demoted pages would require the application to do
> something to make that happen if you turn demotion on with this series.
> Kernel auto-promotion is still being investigated, and it's a little
> trickier than reclaim.
Just FYI: I'm currently working on a patchset which tries to promote
pages from second-tier memory (i.e. PMEM) to DRAM via NUMA balancing.
But NUMA balancing can't deal with unmapped page cache; those pages
have to be promoted via a different path, i.e. mark_page_accessed().
And I do agree with Keith: promotion is definitely trickier than
reclaim since the kernel can't recognize "hot" pages accurately. NUMA
balancing is still coarse-grained and inaccurate, but it is simple. If
we would like to implement a more sophisticated algorithm, an in-kernel
implementation might not be a good idea.
Thanks,
Yang
>
> If it sinks to disk, though, the next access behavior is the same as
> before, without this series.
>
> > 4. In your patch 3, you created a new method migrate_demote_mapping() to migrate pages to
> > other memory node, is there any problem of reusing existing migrate_pages() interface?
>
> Yes, we may not want to migrate everything in the shrink_page_list()
> pages. We might want to keep a page, so we have to do those checks first. At
> the point we know we want to attempt migration, the page is already
> locked and not in a list, so it is just easier to directly invoke the
> new __unmap_and_move_locked() that migrate_pages() eventually also calls.
>
> > 5. In addition, you only migrate base pages, is there any performance concern on migrating THPs?
> > Is it too costly to migrate THPs?
>
> It was just easier to consider single pages first, so we let a THP split
> if possible. I'm not sure of the cost in migrating THPs directly.
>
On Thu, Mar 21, 2019 at 1:03 PM Keith Busch <[email protected]> wrote:
>
> If a memory node has a preferred migration path to demote cold pages,
> attempt to move those inactive pages to that migration node before
> reclaiming. This will better utilize available memory, provide a faster
> tier than swapping or discarding, and allow such pages to be reused
> immediately without IO to retrieve the data.
>
> Some places we would like to see this used:
>
> 1. Persistent memory being as a slower, cheaper DRAM replacement
> 2. Remote memory-only "expansion" NUMA nodes
> 3. Resolving memory imbalances where one NUMA node is seeing more
> allocation activity than another. This helps keep more recent
> allocations closer to the CPUs on the node doing the allocating.
>
> Signed-off-by: Keith Busch <[email protected]>
> ---
> include/linux/migrate.h | 6 ++++++
> include/trace/events/migrate.h | 3 ++-
> mm/debug.c | 1 +
> mm/migrate.c | 45 ++++++++++++++++++++++++++++++++++++++++++
> mm/vmscan.c | 15 ++++++++++++++
> 5 files changed, 69 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index e13d9bf2f9a5..a004cb1b2dbb 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -25,6 +25,7 @@ enum migrate_reason {
> MR_MEMPOLICY_MBIND,
> MR_NUMA_MISPLACED,
> MR_CONTIG_RANGE,
> + MR_DEMOTION,
> MR_TYPES
> };
>
> @@ -79,6 +80,7 @@ extern int migrate_huge_page_move_mapping(struct address_space *mapping,
> extern int migrate_page_move_mapping(struct address_space *mapping,
> struct page *newpage, struct page *page, enum migrate_mode mode,
> int extra_count);
> +extern bool migrate_demote_mapping(struct page *page);
> #else
>
> static inline void putback_movable_pages(struct list_head *l) {}
> @@ -105,6 +107,10 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
> return -ENOSYS;
> }
>
> +static inline bool migrate_demote_mapping(struct page *page)
> +{
> + return false;
> +}
> #endif /* CONFIG_MIGRATION */
>
> #ifdef CONFIG_COMPACTION
> diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
> index 705b33d1e395..d25de0cc8714 100644
> --- a/include/trace/events/migrate.h
> +++ b/include/trace/events/migrate.h
> @@ -20,7 +20,8 @@
> EM( MR_SYSCALL, "syscall_or_cpuset") \
> EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind") \
> EM( MR_NUMA_MISPLACED, "numa_misplaced") \
> - EMe(MR_CONTIG_RANGE, "contig_range")
> + EM(MR_CONTIG_RANGE, "contig_range") \
> + EMe(MR_DEMOTION, "demotion")
>
> /*
> * First define the enums in the above macros to be exported to userspace
> diff --git a/mm/debug.c b/mm/debug.c
> index c0b31b6c3877..53d499f65199 100644
> --- a/mm/debug.c
> +++ b/mm/debug.c
> @@ -25,6 +25,7 @@ const char *migrate_reason_names[MR_TYPES] = {
> "mempolicy_mbind",
> "numa_misplaced",
> "cma",
> + "demotion",
> };
>
> const struct trace_print_flags pageflag_names[] = {
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 705b320d4b35..83fad87361bf 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1152,6 +1152,51 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
> return rc;
> }
>
> +/**
> + * migrate_demote_mapping() - Migrate this page and its mappings to its
> + * demotion node.
> + * @page: An isolated, non-compound page that should move to
> + * its current node's migration path.
> + *
> + * @returns: True if migrate demotion was successful, false otherwise
> + */
> +bool migrate_demote_mapping(struct page *page)
> +{
> + int rc, next_nid = next_migration_node(page_to_nid(page));
> + struct page *newpage;
> +
> + /*
> + * The flags are set to allocate only on the desired node in the
> + * migration path, and to fail fast if not immediately available. We
> + * are already in the memory reclaim path, we don't want heroic
> + * efforts to get a page.
> + */
> + gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY |
> + __GFP_NOMEMALLOC | __GFP_THISNODE;
> +
> + VM_BUG_ON_PAGE(PageCompound(page), page);
> + VM_BUG_ON_PAGE(PageLRU(page), page);
> +
> + if (next_nid < 0)
> + return false;
> +
> + newpage = alloc_pages_node(next_nid, mask, 0);
> + if (!newpage)
> + return false;
> +
> + /*
> + * MIGRATE_ASYNC is the most lightweight and never blocks.
> + */
> + rc = __unmap_and_move_locked(page, newpage, MIGRATE_ASYNC);
> + if (rc != MIGRATEPAGE_SUCCESS) {
> + __free_pages(newpage, 0);
> + return false;
> + }
> +
> + set_page_owner_migrate_reason(newpage, MR_DEMOTION);
> + return true;
> +}
> +
> /*
> * gcc 4.7 and 4.8 on arm get an ICEs when inlining unmap_and_move(). Work
> * around it.
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a5ad0b35ab8e..0a95804e946a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1261,6 +1261,21 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> ; /* try to reclaim the page below */
> }
>
> + if (!PageCompound(page)) {
> + if (migrate_demote_mapping(page)) {
> + unlock_page(page);
> + if (likely(put_page_testzero(page)))
> + goto free_it;
> +
> + /*
> + * Speculative reference will free this page,
> + * so leave it off the LRU.
> + */
> + nr_reclaimed++;
> + continue;
> + }
> + }
It looks like the reclaim path falls through if the migration fails. But
then, with patch #4, you may end up trying to reclaim an anon page on a
swapless system if migration fails?
And, actually, I have the same question as Zi Yan: why not just put the
demote candidates into a separate list, then migrate all the candidates
in bulk with migrate_pages()?
Thanks,
Yang
> +
> /*
> * Anonymous process memory has backing store?
> * Try to allocate it some swap space here.
> --
> 2.14.4
>
<snip>
>> 2. For the demotion path, a common case would be from high-performance memory, like HBM
>> or Multi-Channel DRAM, to DRAM, then to PMEM, and finally to disks, right? More general
>> case for demotion path would be derived from the memory performance description from HMAT[1],
>> right? Do you have any algorithm to form such a path from HMAT?
>
> Yes, I have a PoC for the kernel setting up a demotion path based on
> HMAT properties here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/kbusch/linux.git/commit/?h=mm-migrate&id=4d007659e1dd1b0dad49514348be4441fbe7cadb
>
> The above is just from an experimental branch.
Got it. Thanks.
>
>> 3. Do you have a plan for promoting pages from lower-level memory to higher-level memory,
>> like from PMEM to DRAM? Will this one-way demotion make all pages sink to PMEM and disk?
>
> Promoting previously demoted pages would require the application to do
> something to make that happen if you turn demotion on with this series.
> Kernel auto-promotion is still being investigated, and it's a little
> trickier than reclaim.
>
> If it sinks to disk, though, the next access behavior is the same as
> before, without this series.
This means, when demotion is on, the path for a page would be DRAM->PMEM->Disk->DRAM->PMEM->… .
This could be a starting point.
I actually did something similar here for two-level heterogeneous memory structure: https://github.com/ysarch-lab/nimble_page_management_asplos_2019/blob/nimble_page_management_4_14_78/mm/memory_manage.c#L401.
What I did basically was call shrink_page_list() periodically, so pages get separated
into active and inactive lists. Then, pages on the _inactive_ list of fast memory (like DRAM)
are migrated to slow memory (like PMEM), and pages on the _active_ list of slow memory are migrated
to fast memory. It is kind of abusing the existing page lists. :)
My conclusion from those experiments is that you need high-throughput page migration mechanisms,
like multi-threaded page migration, migrating a bunch of pages in a batch (https://github.com/ysarch-lab/nimble_page_management_asplos_2019/blob/nimble_page_management_4_14_78/mm/copy_page.c), and
a new mechanism called exchange pages (https://github.com/ysarch-lab/nimble_page_management_asplos_2019/blob/nimble_page_management_4_14_78/mm/exchange.c), for page migration to become useful for managing multi-level
memory systems. Otherwise, the overheads (TLB shootdowns and other kernel activity
in the page migration process) may kill the benefit. Because the performance
gap between DRAM and PMEM is supposed to be smaller than the one between DRAM and disk,
the benefit of keeping data in DRAM might not compensate for the cost of migrating cold pages from DRAM
to PMEM. Namely, directly placing data in PMEM once DRAM is full might be better.
>> 4. In your patch 3, you created a new method migrate_demote_mapping() to migrate pages to
>> other memory node, is there any problem of reusing existing migrate_pages() interface?
>
> Yes, we may not want to migrate everything in the shrink_page_list()
> pages. We might want to keep a page, so we have to do those checks first. At
> the point we know we want to attempt migration, the page is already
> locked and not in a list, so it is just easier to directly invoke the
> new __unmap_and_move_locked() that migrate_pages() eventually also calls.
Right, I understand that you want to migrate only small pages to begin with. My question is
why not use the existing migrate_pages() in your patch 3, like:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a5ad0b35ab8e..0a0753af357f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1261,6 +1261,20 @@ static unsigned long shrink_page_list(struct list_head *page_list,
; /* try to reclaim the page below */
}
+ if (!PageCompound(page)) {
+ int next_nid = next_migration_node(page);
+ int err;
+
+ if (next_nid != TERMINAL_NODE) {
+ LIST_HEAD(migrate_list);
+ list_add(&page->lru, &migrate_list);
+ err = migrate_pages(&migrate_list, alloc_new_node_page, NULL,
+ next_nid, MIGRATE_ASYNC, MR_DEMOTION);
+ if (err)
+ putback_movable_pages(&migrate_list);
+ }
+ }
+
/*
* Anonymous process memory has backing store?
* Try to allocate it some swap space here.
Because your new migrate_demote_mapping() basically does the same thing as the code above.
If you are not OK with the gfp flags in alloc_new_node_page(), you can just write your own
alloc_new_node_page(). :)
>
>> 5. In addition, you only migrate base pages, is there any performance concern on migrating THPs?
>> Is it too costly to migrate THPs?
>
> It was just easier to consider single pages first, so we let a THP split
> if possible. I'm not sure of the cost in migrating THPs directly.
AFAICT, when migrating the same amount of 2MB data, migrating a THP is much quicker than migrating
512 4KB pages, because you save 511 TLB shootdowns in THP migration and copying 2MB of contiguous data
achieves higher throughput than copying individual 4KB pages. But it highly depends on whether
any subpage in a THP is hotter than the others, so migrating a THP as a whole might hurt performance
sometimes. Just some observations from my own experiments.
--
Best Regards,
Yan Zi
On 21 Mar 2019, at 16:02, Yang Shi wrote:
> On Thu, Mar 21, 2019 at 3:36 PM Keith Busch <[email protected]> wrote:
>>
>> On Thu, Mar 21, 2019 at 02:20:51PM -0700, Zi Yan wrote:
>>> 1. The name of “page demotion” seems confusing to me, since I thought it was about demoting large
>>> pages to small pages, as opposed to promoting small pages to THPs. Am I the only
>>> one here?
>>
>> If you have a THP, we'll skip the page migration and fall through to
>> split_huge_page_to_list(), then the smaller pages can be considered,
>> migrated and reclaimed individually. Not that we couldn't try to migrate
>> a THP directly. It was just a simpler implementation for this first attempt.
>>
>>> 2. For the demotion path, a common case would be from high-performance memory, like HBM
>>> or Multi-Channel DRAM, to DRAM, then to PMEM, and finally to disks, right? More general
>>> case for demotion path would be derived from the memory performance description from HMAT[1],
>>> right? Do you have any algorithm to form such a path from HMAT?
>>
>> Yes, I have a PoC for the kernel setting up a demotion path based on
>> HMAT properties here:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/kbusch/linux.git/commit/?h=mm-migrate&id=4d007659e1dd1b0dad49514348be4441fbe7cadb
>>
>> The above is just from an experimental branch.
>>
>>> 3. Do you have a plan for promoting pages from lower-level memory to higher-level memory,
>>> like from PMEM to DRAM? Will this one-way demotion make all pages sink to PMEM and disk?
>>
>> Promoting previously demoted pages would require the application to do
>> something to make that happen if you turn demotion on with this series.
>> Kernel auto-promotion is still being investigated, and it's a little
>> trickier than reclaim.
>
> Just FYI. I'm currently working on a patchset which tries to promote
> pages from second-tier memory (i.e. PMEM) to DRAM via NUMA balancing.
> But, NUMA balancing can't deal with unmapped page cache; those pages
> have to be promoted via a different path, i.e. mark_page_accessed().
Got it. Another concern is that NUMA balancing marks pages inaccessible
to obtain access information, which might add overhead on top of the page
migration overhead. Considering the benefit of migrating pages from PMEM to DRAM
is not as large as bringing data from disk to DRAM, the overheads might offset
the benefit, meaning you might see performance degradation.
>
> And, I do agree with Keith, promotion is definitely trickier than
> reclaim since kernel can't recognize "hot" pages accurately. NUMA
> balancing is still coarse-grained and inaccurate, but it is simple. If
> we would like to implement more sophisticated algorithm, in-kernel
> implementation might be not a good idea.
I agree. Or a hardware vendor, like Intel, could provide more information
on page hotness, such as multi-bit access bits or the page-modification log
Intel provides for virtualization.
--
Best Regards,
Yan Zi
On Thu, Mar 21, 2019 at 05:12:33PM -0700, Zi Yan wrote:
> > Yes, we may not want to migrate everything in the shrink_page_list()
> > pages. We might want to keep a page, so we have to do those checks first. At
> > the point we know we want to attempt migration, the page is already
> > locked and not in a list, so it is just easier to directly invoke the
> > new __unmap_and_move_locked() that migrate_pages() eventually also calls.
>
> Right, I understand that you want to only migrate small pages to begin with. My question is
> why not using the existing migrate_pages() in your patch 3. Like:
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a5ad0b35ab8e..0a0753af357f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1261,6 +1261,20 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> ; /* try to reclaim the page below */
> }
>
> + if (!PageCompound(page)) {
> + int next_nid = next_migration_node(page);
> + int err;
> +
> + if (next_nid != TERMINAL_NODE) {
> + LIST_HEAD(migrate_list);
> + list_add(&page->lru, &migrate_list);
> + err = migrate_pages(&migrate_list, alloc_new_node_page, NULL,
> + next_nid, MIGRATE_ASYNC, MR_DEMOTION);
> + if (err)
> + putback_movable_pages(&migrate_list);
> + }
> + }
> +
> /*
> * Anonymous process memory has backing store?
> * Try to allocate it some swap space here.
>
> Because your new migrate_demote_mapping() basically does the same thing as the code above.
> If you are not OK with the gfp flags in alloc_new_node_page(), you can just write your own
> alloc_new_node_page(). :)
The page is already locked, and you can't call migrate_pages()
with locked pages. You'd have to surround migrate_pages() with
unlock_page()/trylock_page(), and I thought that looked odd. Further,
it changes the flow if the subsequent trylock fails, and I'm trying to
be careful about not introducing different behavior when migration fails.
Patch 2/5 is included here so we can reuse the necessary code from a
locked-page context.
On Thu, Mar 21, 2019 at 04:58:16PM -0700, Yang Shi wrote:
> On Thu, Mar 21, 2019 at 1:03 PM Keith Busch <[email protected]> wrote:
> > + if (!PageCompound(page)) {
> > + if (migrate_demote_mapping(page)) {
> > + unlock_page(page);
> > + if (likely(put_page_testzero(page)))
> > + goto free_it;
> > +
> > + /*
> > + * Speculative reference will free this page,
> > + * so leave it off the LRU.
> > + */
> > + nr_reclaimed++;
> > + continue;
> > + }
> > + }
>
> It looks the reclaim path would fall through if the migration is
> failed. But, it looks, with patch #4, you may end up trying reclaim an
> anon page on swapless system if migration is failed?
Right, and add_to_swap() will fail, and the page jumps to the
activate_locked label, placing it back where it was before.
> And, actually I have the same question with Yan Zi. Why not just put
> the demote candidate into a separate list, then migrate all the
> candidates in bulk with migrate_pages()?
The page is already locked at the point we know we want to migrate it.