2020-06-29 23:51:17

by Dave Hansen

Subject: [RFC][PATCH 0/8] Migrate Pages in lieu of discard

I've been sitting on these for too long. The main purpose of this
post is to have a public discussion with the other folks who are
interested in this functionality and converge on a single
implementation.

This set directly incorporates a statistics patch from Yang Shi and
also includes one to ensure good behavior with cgroup reclaim which
was very closely derived from this series:

https://lore.kernel.org/linux-mm/[email protected]/

Since the last post, the major changes are:
- Added patch to skip migration when doing cgroup reclaim
- Added stats patch from Yang Shi

The full series is also available here:

https://github.com/hansendc/linux/tree/automigrate-20200629

--

We're starting to see systems with more and more kinds of memory such
as Intel's implementation of persistent memory.

Let's say you have a system with some DRAM and some persistent memory.
Today, once DRAM fills up, reclaim will start and some of the DRAM
contents will be thrown out. Allocations will, at some point, start
falling over to the slower persistent memory.

That has two nasty properties. First, the newer allocations can end
up in the slower persistent memory. Second, reclaimed data in DRAM
are just discarded even if there are gobs of space in persistent
memory that could be used.

This set implements a solution to these problems. At the end of the
reclaim process in shrink_page_list() just before the last page
refcount is dropped, the page is migrated to persistent memory instead
of being dropped.
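
Concretely, the hook ends up looking roughly like this. This is a
simplified sketch of the shrink_page_list() hunk in patch 3; THP
splitting and the speculative-reference handling are omitted:

	rc = migrate_demote_mapping(page);
	if (rc == MIGRATEPAGE_SUCCESS) {
		unlock_page(page);
		nr_reclaimed++;
		continue;	/* contents now live on the slower tier */
	}
	/* anything else falls through to the normal reclaim path */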

While I've talked about a DRAM/PMEM pairing, this approach would
function in any environment where memory tiers exist.

This is not perfect. It "strands" pages in slower memory and never
brings them back to fast DRAM. Other things need to be built to
promote hot pages back to DRAM.

This is part of a larger patch set. If you want to apply these or
play with them, I'd suggest using the tree from here. It includes
autonuma-based hot page promotion back to DRAM:

http://lkml.kernel.org/r/[email protected]

This is also all based on an upstream mechanism that allows
persistent memory to be onlined and used as if it were volatile:

http://lkml.kernel.org/r/[email protected]

Cc: Yang Shi <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Dan Williams <[email protected]>

--

Dave Hansen (5):
mm/numa: node demotion data structure and lookup
mm/vmscan: Attempt to migrate page in lieu of discard
mm/numa: automatically generate node migration order
mm/vmscan: never demote for memcg reclaim
mm/numa: new reclaim mode to enable reclaim-based migration

Keith Busch (2):
mm/migrate: Defer allocating new page until needed
mm/vmscan: Consider anonymous pages without swap

Yang Shi (1):
mm/vmscan: add page demotion counter

Documentation/admin-guide/sysctl/vm.rst | 9
include/linux/migrate.h | 6
include/linux/node.h | 9
include/linux/vm_event_item.h | 2
include/trace/events/migrate.h | 3
mm/debug.c | 1
mm/internal.h | 1
mm/migrate.c | 400 ++++++++++++++++++++++++++------
mm/page_alloc.c | 2
mm/vmscan.c | 88 ++++++-
mm/vmstat.c | 2
11 files changed, 439 insertions(+), 84 deletions(-)


2020-06-29 23:51:18

by Dave Hansen

Subject: [RFC][PATCH 1/8] mm/numa: node demotion data structure and lookup


From: Dave Hansen <[email protected]>

Prepare for the kernel to auto-migrate pages to other memory nodes
with a user-defined node migration table. This allows creating a single
migration target for each NUMA node to enable the kernel to do NUMA
page migrations instead of simply reclaiming colder pages. A node
with no target is a "terminal node", so reclaim acts normally there.
The migration target does not fundamentally _need_ to be a single node,
but this implementation starts there to limit complexity.

If you consider the migration path as a graph, cycles (loops) in the
graph are disallowed. This avoids wasting resources by constantly
migrating (A->B, B->A, A->B ...). The expectation is that cycles will
never be allowed.
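
As an illustration (hypothetical two-node DRAM+PMEM topology, not
taken from this posting): if node 0 is DRAM and node 1 is PMEM, the
table holds node_demotion[0] == 1 and node_demotion[1] == NUMA_NO_NODE,
and a reclaim-side caller would use the lookup roughly the way a later
patch in this series does:

	int target = next_demotion_node(page_to_nid(page));

	if (target == NUMA_NO_NODE)
		return -ENOSYS;	/* terminal node: reclaim as usual */
	/* otherwise allocate a page on 'target' and migrate to it */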

Signed-off-by: Dave Hansen <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Dan Williams <[email protected]>
---

b/mm/migrate.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)

diff -puN mm/migrate.c~0006-node-Define-and-export-memory-migration-path mm/migrate.c
--- a/mm/migrate.c~0006-node-Define-and-export-memory-migration-path 2020-06-29 16:34:36.849312609 -0700
+++ b/mm/migrate.c 2020-06-29 16:34:36.853312609 -0700
@@ -1159,6 +1159,29 @@ out:
return rc;
}

+static int node_demotion[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE};
+
+/**
+ * next_demotion_node() - Get the next node in the demotion path
+ * @node: The starting node to lookup the next node
+ *
+ * @returns: node id for next memory node in the demotion path hierarchy
+ * from @node; -1 if @node is terminal
+ */
+int next_demotion_node(int node)
+{
+ get_online_mems();
+ while (true) {
+ node = node_demotion[node];
+ if (node == NUMA_NO_NODE)
+ break;
+ if (node_online(node))
+ break;
+ }
+ put_online_mems();
+ return node;
+}
+
/*
* gcc 4.7 and 4.8 on arm get an ICEs when inlining unmap_and_move(). Work
* around it.
_

2020-06-29 23:51:22

by Dave Hansen

Subject: [RFC][PATCH 8/8] mm/numa: new reclaim mode to enable reclaim-based migration


From: Dave Hansen <[email protected]>

Some method is obviously needed to enable reclaim-based migration.

Just like traditional autonuma, there will be some workloads that
will benefit like workloads with more "static" configurations where
hot pages stay hot and cold pages stay cold. If pages come and go
from the hot and cold sets, the benefits of this approach will be
more limited.

The benefits are truly workload-based and *not* hardware-based.
We do not believe that there is a viable threshold where certain
hardware configurations should have this mechanism enabled while
others do not.

To be conservative, earlier work defaulted to disabling reclaim-
based migration and did not include a mechanism to enable it.
This proposes extending the existing "zone_reclaim_mode" (now
really node_reclaim_mode) as a method to enable it.

We are open to any alternative that allows end users to enable
this mechanism or disable it if workload harm is detected (just
like traditional autonuma).
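
As a sketch only (the hunks below define the bit but do not yet show
the check itself), the expectation is that reclaim code would gate
demotion on something like this hypothetical helper:

	static inline bool node_demotion_enabled(void)
	{
		return node_reclaim_mode & RECLAIM_MIGRATE;
	}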

The implementation here is pretty simple and entirely unoptimized.
On any memory hotplug events, assume that a node was added or
removed and recalculate all migration targets. This ensures that
the node_demotion[] array is always ready to be used in case the
new reclaim mode is enabled. This recalculation is far from
optimal; most glaringly, it does not even attempt to figure
out whether nodes are actually coming or going.

Signed-off-by: Dave Hansen <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Dan Williams <[email protected]>
---

b/Documentation/admin-guide/sysctl/vm.rst | 9 ++++
b/mm/migrate.c | 61 +++++++++++++++++++++++++++++-
b/mm/vmscan.c | 7 +--
3 files changed, 73 insertions(+), 4 deletions(-)

diff -puN Documentation/admin-guide/sysctl/vm.rst~enable-numa-demotion Documentation/admin-guide/sysctl/vm.rst
--- a/Documentation/admin-guide/sysctl/vm.rst~enable-numa-demotion 2020-06-29 16:35:01.012312549 -0700
+++ b/Documentation/admin-guide/sysctl/vm.rst 2020-06-29 16:35:01.021312549 -0700
@@ -941,6 +941,7 @@ This is value OR'ed together of
1 (bit currently ignored)
2 Zone reclaim writes dirty pages out
4 Zone reclaim swaps pages
+8 Zone reclaim migrates pages
= ===================================

zone_reclaim_mode is disabled by default. For file servers or workloads
@@ -965,3 +966,11 @@ of other processes running on other node
Allowing regular swap effectively restricts allocations to the local
node unless explicitly overridden by memory policies or cpuset
configurations.
+
+Page migration during reclaim is intended for systems with tiered memory
+configurations. These systems have multiple types of memory with varied
+performance characteristics instead of plain NUMA systems where the same
+kind of memory is found at varied distances. Allowing page migration
+during reclaim enables these systems to migrate pages from fast tiers to
+slow tiers when the fast tier is under pressure. This migration is
+performed before swap.
diff -puN mm/migrate.c~enable-numa-demotion mm/migrate.c
--- a/mm/migrate.c~enable-numa-demotion 2020-06-29 16:35:01.015312549 -0700
+++ b/mm/migrate.c 2020-06-29 16:35:01.022312549 -0700
@@ -49,6 +49,7 @@
#include <linux/sched/mm.h>
#include <linux/ptrace.h>
#include <linux/oom.h>
+#include <linux/memory.h>

#include <asm/tlbflush.h>

@@ -3165,6 +3166,10 @@ void set_migration_target_nodes(void)
* Avoid any oddities like cycles that could occur
* from changes in the topology. This will leave
* a momentary gap when migration is disabled.
+ *
+ * This is superfluous for memory offlining since
+ * MEM_GOING_OFFLINE does it independently, but it
+ * does not hurt to do it a second time.
*/
disable_all_migrate_targets();

@@ -3211,6 +3216,60 @@ again:
/* Is another pass necessary? */
if (!nodes_empty(next_pass))
goto again;
+}

- put_online_mems();
+/*
+ * React to hotplug events that might online or offline
+ * NUMA nodes.
+ *
+ * This leaves migrate-on-reclaim transiently disabled
+ * between the MEM_GOING_OFFLINE and MEM_OFFLINE events.
+ * This runs whether RECLAIM_MIGRATE is enabled or not.
+ * That ensures that the user can turn RECLAIM_MIGRATE on
+ * without needing to recalculate migration targets.
+ */
+#if defined(CONFIG_MEMORY_HOTPLUG)
+static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
+ unsigned long action, void *arg)
+{
+ switch (action) {
+ case MEM_GOING_OFFLINE:
+ /*
+ * Make sure there are not transient states where
+ * an offline node is a migration target. This
+ * will leave migration disabled until the offline
+ * completes and the MEM_OFFLINE case below runs.
+ */
+ disable_all_migrate_targets();
+ break;
+ case MEM_OFFLINE:
+ case MEM_ONLINE:
+ /*
+ * Recalculate the target nodes once the node
+ * reaches its final state (online or offline).
+ */
+ set_migration_target_nodes();
+ break;
+ case MEM_CANCEL_OFFLINE:
+ /*
+ * MEM_GOING_OFFLINE disabled all the migration
+ * targets. Reenable them.
+ */
+ set_migration_target_nodes();
+ break;
+ case MEM_GOING_ONLINE:
+ case MEM_CANCEL_ONLINE:
+ break;
+ }
+
+ return notifier_from_errno(0);
}
+
+static int __init migrate_on_reclaim_init(void)
+{
+ hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
+ return 0;
+}
+late_initcall(migrate_on_reclaim_init);
+#endif /* CONFIG_MEMORY_HOTPLUG */
+
diff -puN mm/vmscan.c~enable-numa-demotion mm/vmscan.c
--- a/mm/vmscan.c~enable-numa-demotion 2020-06-29 16:35:01.017312549 -0700
+++ b/mm/vmscan.c 2020-06-29 16:35:01.023312549 -0700
@@ -4165,9 +4165,10 @@ int node_reclaim_mode __read_mostly;
* These bit locations are exposed in the vm.zone_reclaim_mode sysctl
* ABI. New bits are OK, but existing bits can never change.
*/
-#define RECLAIM_RSVD (1<<0) /* (currently ignored/unused) */
-#define RECLAIM_WRITE (1<<1) /* Writeout pages during reclaim */
-#define RECLAIM_UNMAP (1<<2) /* Unmap pages during reclaim */
+#define RECLAIM_RSVD (1<<0) /* (currently ignored/unused) */
+#define RECLAIM_WRITE (1<<1) /* Writeout pages during reclaim */
+#define RECLAIM_UNMAP (1<<2) /* Unmap pages during reclaim */
+#define RECLAIM_MIGRATE (1<<3) /* Migrate pages during reclaim */

/*
* Priority for NODE_RECLAIM. This determines the fraction of pages
_

2020-06-29 23:52:25

by Dave Hansen

Subject: [RFC][PATCH 2/8] mm/migrate: Defer allocating new page until needed


From: Keith Busch <[email protected]>

Migrating pages had been allocating the new page before it was
actually needed. Subsequent operations may still fail, forcing the
error paths to clean up a newly allocated page that was never used.

Defer allocating the page until we are actually ready to make use of
it, after locking the original page. This simplifies error handling,
but should not have any functional change in behavior. This is just
refactoring page migration so the main part can more easily be reused
by other code.
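
A condensed view of the reordering, derived from the hunks below:

	/*
	 * Old flow: allocate newpage -> trylock page -> wait on writeback
	 *           -> unmap and move (early failures must free newpage)
	 * New flow: trylock page -> wait on writeback -> allocate newpage
	 *           -> unmap and move (early failures never see a newpage)
	 */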

#Signed-off-by: Keith Busch <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Dan Williams <[email protected]>
---

b/mm/migrate.c | 148 ++++++++++++++++++++++++++++-----------------------------
1 file changed, 75 insertions(+), 73 deletions(-)

diff -puN mm/migrate.c~0007-mm-migrate-Defer-allocating-new-page-until-needed mm/migrate.c
--- a/mm/migrate.c~0007-mm-migrate-Defer-allocating-new-page-until-needed 2020-06-29 16:34:37.896312607 -0700
+++ b/mm/migrate.c 2020-06-29 16:34:37.900312607 -0700
@@ -1014,56 +1014,17 @@ out:
return rc;
}

-static int __unmap_and_move(struct page *page, struct page *newpage,
- int force, enum migrate_mode mode)
+static int __unmap_and_move(new_page_t get_new_page,
+ free_page_t put_new_page,
+ unsigned long private, struct page *page,
+ enum migrate_mode mode,
+ enum migrate_reason reason)
{
int rc = -EAGAIN;
int page_was_mapped = 0;
struct anon_vma *anon_vma = NULL;
bool is_lru = !__PageMovable(page);
-
- if (!trylock_page(page)) {
- if (!force || mode == MIGRATE_ASYNC)
- goto out;
-
- /*
- * It's not safe for direct compaction to call lock_page.
- * For example, during page readahead pages are added locked
- * to the LRU. Later, when the IO completes the pages are
- * marked uptodate and unlocked. However, the queueing
- * could be merging multiple pages for one bio (e.g.
- * mpage_readpages). If an allocation happens for the
- * second or third page, the process can end up locking
- * the same page twice and deadlocking. Rather than
- * trying to be clever about what pages can be locked,
- * avoid the use of lock_page for direct compaction
- * altogether.
- */
- if (current->flags & PF_MEMALLOC)
- goto out;
-
- lock_page(page);
- }
-
- if (PageWriteback(page)) {
- /*
- * Only in the case of a full synchronous migration is it
- * necessary to wait for PageWriteback. In the async case,
- * the retry loop is too short and in the sync-light case,
- * the overhead of stalling is too much
- */
- switch (mode) {
- case MIGRATE_SYNC:
- case MIGRATE_SYNC_NO_COPY:
- break;
- default:
- rc = -EBUSY;
- goto out_unlock;
- }
- if (!force)
- goto out_unlock;
- wait_on_page_writeback(page);
- }
+ struct page *newpage;

/*
* By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
@@ -1082,6 +1043,12 @@ static int __unmap_and_move(struct page
if (PageAnon(page) && !PageKsm(page))
anon_vma = page_get_anon_vma(page);

+ newpage = get_new_page(page, private);
+ if (!newpage) {
+ rc = -ENOMEM;
+ goto out;
+ }
+
/*
* Block others from accessing the new page when we get around to
* establishing additional references. We are usually the only one
@@ -1091,11 +1058,11 @@ static int __unmap_and_move(struct page
* This is much like races on refcount of oldpage: just don't BUG().
*/
if (unlikely(!trylock_page(newpage)))
- goto out_unlock;
+ goto out_put;

if (unlikely(!is_lru)) {
rc = move_to_new_page(newpage, page, mode);
- goto out_unlock_both;
+ goto out_unlock;
}

/*
@@ -1114,7 +1081,7 @@ static int __unmap_and_move(struct page
VM_BUG_ON_PAGE(PageAnon(page), page);
if (page_has_private(page)) {
try_to_free_buffers(page);
- goto out_unlock_both;
+ goto out_unlock;
}
} else if (page_mapped(page)) {
/* Establish migration ptes */
@@ -1131,15 +1098,9 @@ static int __unmap_and_move(struct page
if (page_was_mapped)
remove_migration_ptes(page,
rc == MIGRATEPAGE_SUCCESS ? newpage : page, false);
-
-out_unlock_both:
- unlock_page(newpage);
out_unlock:
- /* Drop an anon_vma reference if we took one */
- if (anon_vma)
- put_anon_vma(anon_vma);
- unlock_page(page);
-out:
+ unlock_page(newpage);
+out_put:
/*
* If migration is successful, decrease refcount of the newpage
* which will not free the page because new page owner increased
@@ -1150,12 +1111,20 @@ out:
* state.
*/
if (rc == MIGRATEPAGE_SUCCESS) {
+ set_page_owner_migrate_reason(newpage, reason);
if (unlikely(!is_lru))
put_page(newpage);
else
putback_lru_page(newpage);
+ } else if (put_new_page) {
+ put_new_page(newpage, private);
+ } else {
+ put_page(newpage);
}
-
+out:
+ /* Drop an anon_vma reference if we took one */
+ if (anon_vma)
+ put_anon_vma(anon_vma);
return rc;
}

@@ -1203,8 +1172,7 @@ static ICE_noinline int unmap_and_move(n
int force, enum migrate_mode mode,
enum migrate_reason reason)
{
- int rc = MIGRATEPAGE_SUCCESS;
- struct page *newpage = NULL;
+ int rc = -EAGAIN;

if (!thp_migration_supported() && PageTransHuge(page))
return -ENOMEM;
@@ -1219,17 +1187,57 @@ static ICE_noinline int unmap_and_move(n
__ClearPageIsolated(page);
unlock_page(page);
}
+ rc = MIGRATEPAGE_SUCCESS;
goto out;
}

- newpage = get_new_page(page, private);
- if (!newpage)
- return -ENOMEM;
+ if (!trylock_page(page)) {
+ if (!force || mode == MIGRATE_ASYNC)
+ return rc;

- rc = __unmap_and_move(page, newpage, force, mode);
- if (rc == MIGRATEPAGE_SUCCESS)
- set_page_owner_migrate_reason(newpage, reason);
+ /*
+ * It's not safe for direct compaction to call lock_page.
+ * For example, during page readahead pages are added locked
+ * to the LRU. Later, when the IO completes the pages are
+ * marked uptodate and unlocked. However, the queueing
+ * could be merging multiple pages for one bio (e.g.
+ * mpage_readpages). If an allocation happens for the
+ * second or third page, the process can end up locking
+ * the same page twice and deadlocking. Rather than
+ * trying to be clever about what pages can be locked,
+ * avoid the use of lock_page for direct compaction
+ * altogether.
+ */
+ if (current->flags & PF_MEMALLOC)
+ return rc;
+
+ lock_page(page);
+ }
+
+ if (PageWriteback(page)) {
+ /*
+ * Only in the case of a full synchronous migration is it
+ * necessary to wait for PageWriteback. In the async case,
+ * the retry loop is too short and in the sync-light case,
+ * the overhead of stalling is too much
+ */
+ switch (mode) {
+ case MIGRATE_SYNC:
+ case MIGRATE_SYNC_NO_COPY:
+ break;
+ default:
+ rc = -EBUSY;
+ goto out_unlock;
+ }
+ if (!force)
+ goto out_unlock;
+ wait_on_page_writeback(page);
+ }
+ rc = __unmap_and_move(get_new_page, put_new_page, private,
+ page, mode, reason);

+out_unlock:
+ unlock_page(page);
out:
if (rc != -EAGAIN) {
/*
@@ -1269,9 +1277,8 @@ out:
if (rc != -EAGAIN) {
if (likely(!__PageMovable(page))) {
putback_lru_page(page);
- goto put_new;
+ goto done;
}
-
lock_page(page);
if (PageMovable(page))
putback_movable_page(page);
@@ -1280,13 +1287,8 @@ out:
unlock_page(page);
put_page(page);
}
-put_new:
- if (put_new_page)
- put_new_page(newpage, private);
- else
- put_page(newpage);
}
-
+done:
return rc;
}

_

2020-06-29 23:52:26

by Dave Hansen

Subject: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard


From: Dave Hansen <[email protected]>

If a memory node has a preferred migration path to demote cold pages,
attempt to move those inactive pages to that migration node before
reclaiming. This will better utilize available memory, provide a faster
tier than swapping or discarding, and allow such pages to be reused
immediately without IO to retrieve the data.

When handling anonymous pages, this will be considered before swap if
enabled. Should the demotion fail for any reason, the page reclaim
will proceed as if the demotion feature was not enabled.
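
The failure handling is worth spelling out (condensed from the vmscan
hunk below): a THP that fails with -ENOMEM is split and the head page
retried immediately, and anything short of MIGRATEPAGE_SUCCESS falls
through to the existing swap/discard path:

	rc = migrate_demote_mapping(page);
	if (rc == -ENOMEM && PageTransHuge(page) &&
	    !split_huge_page_to_list(page, page_list))
		rc = migrate_demote_mapping(page);	/* retry the head */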

Some places we would like to see this used:

1. Persistent memory being used as a slower, cheaper DRAM replacement
2. Remote memory-only "expansion" NUMA nodes
3. Resolving memory imbalances where one NUMA node is seeing more
allocation activity than another. This helps keep more recent
allocations closer to the CPUs on the node doing the allocating.

Yang Shi's patches used an alternative approach where to-be-discarded
pages were collected on a separate discard list and then migrated
as a batch with migrate_pages(). This results in simpler code and
has all the performance advantages of batching, but has the
disadvantage that pages which fail to migrate never get swapped.

#Signed-off-by: Keith Busch <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Dan Williams <[email protected]>
---

b/include/linux/migrate.h | 6 ++++
b/include/trace/events/migrate.h | 3 +-
b/mm/debug.c | 1
b/mm/migrate.c | 52 +++++++++++++++++++++++++++++++++++++++
b/mm/vmscan.c | 25 ++++++++++++++++++
5 files changed, 86 insertions(+), 1 deletion(-)

diff -puN include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/linux/migrate.h
--- a/include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.950312604 -0700
+++ b/include/linux/migrate.h 2020-06-29 16:34:38.963312604 -0700
@@ -25,6 +25,7 @@ enum migrate_reason {
MR_MEMPOLICY_MBIND,
MR_NUMA_MISPLACED,
MR_CONTIG_RANGE,
+ MR_DEMOTION,
MR_TYPES
};

@@ -78,6 +79,7 @@ extern int migrate_huge_page_move_mappin
struct page *newpage, struct page *page);
extern int migrate_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page, int extra_count);
+extern int migrate_demote_mapping(struct page *page);
#else

static inline void putback_movable_pages(struct list_head *l) {}
@@ -104,6 +106,10 @@ static inline int migrate_huge_page_move
return -ENOSYS;
}

+static inline int migrate_demote_mapping(struct page *page)
+{
+ return -ENOSYS;
+}
#endif /* CONFIG_MIGRATION */

#ifdef CONFIG_COMPACTION
diff -puN include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/trace/events/migrate.h
--- a/include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.952312604 -0700
+++ b/include/trace/events/migrate.h 2020-06-29 16:34:38.963312604 -0700
@@ -20,7 +20,8 @@
EM( MR_SYSCALL, "syscall_or_cpuset") \
EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind") \
EM( MR_NUMA_MISPLACED, "numa_misplaced") \
- EMe(MR_CONTIG_RANGE, "contig_range")
+ EM( MR_CONTIG_RANGE, "contig_range") \
+ EMe(MR_DEMOTION, "demotion")

/*
* First define the enums in the above macros to be exported to userspace
diff -puN mm/debug.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/debug.c
--- a/mm/debug.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.954312604 -0700
+++ b/mm/debug.c 2020-06-29 16:34:38.963312604 -0700
@@ -25,6 +25,7 @@ const char *migrate_reason_names[MR_TYPE
"mempolicy_mbind",
"numa_misplaced",
"cma",
+ "demotion",
};

const struct trace_print_flags pageflag_names[] = {
diff -puN mm/migrate.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/migrate.c
--- a/mm/migrate.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.956312604 -0700
+++ b/mm/migrate.c 2020-06-29 16:34:38.964312604 -0700
@@ -1151,6 +1151,58 @@ int next_demotion_node(int node)
return node;
}

+static struct page *alloc_demote_node_page(struct page *page, unsigned long node)
+{
+ /*
+ * 'mask' targets allocation only to the desired node in the
+ * migration path, and fails fast if the allocation can not be
+ * immediately satisfied. Reclaim is already active and heroic
+ * allocation efforts are unwanted.
+ */
+ gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY |
+ __GFP_NOMEMALLOC | __GFP_THISNODE | __GFP_HIGHMEM |
+ __GFP_MOVABLE;
+ struct page *newpage;
+
+ if (PageTransHuge(page)) {
+ mask |= __GFP_COMP;
+ newpage = alloc_pages_node(node, mask, HPAGE_PMD_ORDER);
+ if (newpage)
+ prep_transhuge_page(newpage);
+ } else
+ newpage = alloc_pages_node(node, mask, 0);
+
+ return newpage;
+}
+
+/**
+ * migrate_demote_mapping() - Migrate this page and its mappings to its
+ * demotion node.
+ * @page: A locked, isolated, non-huge page that should migrate to its current
+ * node's demotion target, if available. Since this is intended to be
+ * called during memory reclaim, all flag options are set to fail fast.
+ *
+ * @returns: MIGRATEPAGE_SUCCESS if successful, -errno otherwise.
+ */
+int migrate_demote_mapping(struct page *page)
+{
+ int next_nid = next_demotion_node(page_to_nid(page));
+
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
+ VM_BUG_ON_PAGE(PageHuge(page), page);
+ VM_BUG_ON_PAGE(PageLRU(page), page);
+
+ if (next_nid == NUMA_NO_NODE)
+ return -ENOSYS;
+ if (PageTransHuge(page) && !thp_migration_supported())
+ return -ENOMEM;
+
+ /* MIGRATE_ASYNC is the most light weight and never blocks.*/
+ return __unmap_and_move(alloc_demote_node_page, NULL, next_nid,
+ page, MIGRATE_ASYNC, MR_DEMOTION);
+}
+
+
/*
* gcc 4.7 and 4.8 on arm get an ICEs when inlining unmap_and_move(). Work
* around it.
diff -puN mm/vmscan.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/vmscan.c
--- a/mm/vmscan.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.959312604 -0700
+++ b/mm/vmscan.c 2020-06-29 16:34:38.965312604 -0700
@@ -1077,6 +1077,7 @@ static unsigned long shrink_page_list(st
LIST_HEAD(free_pages);
unsigned nr_reclaimed = 0;
unsigned pgactivate = 0;
+ int rc;

memset(stat, 0, sizeof(*stat));
cond_resched();
@@ -1229,6 +1230,30 @@ static unsigned long shrink_page_list(st
; /* try to reclaim the page below */
}

+ rc = migrate_demote_mapping(page);
+ /*
+ * -ENOMEM on a THP may indicate either migration is
+ * unsupported or there was not enough contiguous
+ * space. Split the THP into base pages and retry the
+ * head immediately. The tail pages will be considered
+ * individually within the current loop's page list.
+ */
+ if (rc == -ENOMEM && PageTransHuge(page) &&
+ !split_huge_page_to_list(page, page_list))
+ rc = migrate_demote_mapping(page);
+
+ if (rc == MIGRATEPAGE_SUCCESS) {
+ unlock_page(page);
+ if (likely(put_page_testzero(page)))
+ goto free_it;
+ /*
+ * Speculative reference will free this page,
+ * so leave it off the LRU.
+ */
+ nr_reclaimed++;
+ continue;
+ }
+
/*
* Anonymous process memory has backing store?
* Try to allocate it some swap space here.
_

2020-06-29 23:53:18

by Dave Hansen

Subject: [RFC][PATCH 4/8] mm/vmscan: add page demotion counter


From: Yang Shi <[email protected]>

Account the number of demoted pages into reclaim_state->nr_demoted.

Add pgdemote_kswapd and pgdemote_direct VM counters shown in
/proc/vmstat.

[ daveh:
- tweaked the __count_vm_events() calls a bit, and made them look
at the THP size directly rather than getting data from migrate_pages()
]

Signed-off-by: Yang Shi <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Dan Williams <[email protected]>
---

b/include/linux/vm_event_item.h | 2 ++
b/mm/migrate.c | 13 ++++++++++++-
b/mm/vmscan.c | 1 +
b/mm/vmstat.c | 2 ++
4 files changed, 17 insertions(+), 1 deletion(-)

diff -puN include/linux/vm_event_item.h~mm-vmscan-add-page-demotion-counter include/linux/vm_event_item.h
--- a/include/linux/vm_event_item.h~mm-vmscan-add-page-demotion-counter 2020-06-29 16:34:40.332312601 -0700
+++ b/include/linux/vm_event_item.h 2020-06-29 16:34:40.342312601 -0700
@@ -32,6 +32,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
PGREFILL,
PGSTEAL_KSWAPD,
PGSTEAL_DIRECT,
+ PGDEMOTE_KSWAPD,
+ PGDEMOTE_DIRECT,
PGSCAN_KSWAPD,
PGSCAN_DIRECT,
PGSCAN_DIRECT_THROTTLE,
diff -puN mm/migrate.c~mm-vmscan-add-page-demotion-counter mm/migrate.c
--- a/mm/migrate.c~mm-vmscan-add-page-demotion-counter 2020-06-29 16:34:40.334312601 -0700
+++ b/mm/migrate.c 2020-06-29 16:34:40.343312601 -0700
@@ -1187,6 +1187,7 @@ static struct page *alloc_demote_node_pa
int migrate_demote_mapping(struct page *page)
{
int next_nid = next_demotion_node(page_to_nid(page));
+ int ret;

VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(PageHuge(page), page);
@@ -1198,8 +1199,18 @@ int migrate_demote_mapping(struct page *
return -ENOMEM;

/* MIGRATE_ASYNC is the most light weight and never blocks.*/
- return __unmap_and_move(alloc_demote_node_page, NULL, next_nid,
+ ret = __unmap_and_move(alloc_demote_node_page, NULL, next_nid,
page, MIGRATE_ASYNC, MR_DEMOTION);
+
+ if (ret == MIGRATEPAGE_SUCCESS) {
+ int nr_demoted = hpage_nr_pages(page);
+ if (current_is_kswapd())
+ __count_vm_events(PGDEMOTE_KSWAPD, nr_demoted);
+ else
+ __count_vm_events(PGDEMOTE_DIRECT, nr_demoted);
+ }
+
+ return ret;
}


diff -puN mm/vmscan.c~mm-vmscan-add-page-demotion-counter mm/vmscan.c
--- a/mm/vmscan.c~mm-vmscan-add-page-demotion-counter 2020-06-29 16:34:40.336312601 -0700
+++ b/mm/vmscan.c 2020-06-29 16:34:40.344312601 -0700
@@ -140,6 +140,7 @@ struct scan_control {
unsigned int immediate;
unsigned int file_taken;
unsigned int taken;
+ unsigned int demoted;
} nr;

/* for recording the reclaimed slab by now */
diff -puN mm/vmstat.c~mm-vmscan-add-page-demotion-counter mm/vmstat.c
--- a/mm/vmstat.c~mm-vmscan-add-page-demotion-counter 2020-06-29 16:34:40.339312601 -0700
+++ b/mm/vmstat.c 2020-06-29 16:34:40.345312601 -0700
@@ -1198,6 +1198,8 @@ const char * const vmstat_text[] = {
"pgrefill",
"pgsteal_kswapd",
"pgsteal_direct",
+ "pgdemote_kswapd",
+ "pgdemote_direct",
"pgscan_kswapd",
"pgscan_direct",
"pgscan_direct_throttle",
_

2020-06-29 23:53:23

by Dave Hansen

Subject: [RFC][PATCH 6/8] mm/vmscan: Consider anonymous pages without swap


From: Keith Busch <[email protected]>

Age and reclaim anonymous pages if a migration path is available. The
node then has recourse for inactive anonymous pages beyond swap: they
can be demoted to the next node in the migration path.

#Signed-off-by: Keith Busch <[email protected]>
Cc: Keith Busch <[email protected]>
[vishal: fixup the migration->demotion rename]
Signed-off-by: Vishal Verma <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Dan Williams <[email protected]>

--

Changes from Dave 06/2020:
* rename reclaim_anon_pages()->can_reclaim_anon_pages()

---

b/include/linux/node.h | 9 +++++++++
b/mm/vmscan.c | 32 +++++++++++++++++++++++++++-----
2 files changed, 36 insertions(+), 5 deletions(-)

diff -puN include/linux/node.h~0009-mm-vmscan-Consider-anonymous-pages-without-swap include/linux/node.h
--- a/include/linux/node.h~0009-mm-vmscan-Consider-anonymous-pages-without-swap 2020-06-29 16:34:42.861312594 -0700
+++ b/include/linux/node.h 2020-06-29 16:34:42.867312594 -0700
@@ -180,4 +180,13 @@ static inline void register_hugetlbfs_wi

#define to_node(device) container_of(device, struct node, dev)

+#ifdef CONFIG_MIGRATION
+extern int next_demotion_node(int node);
+#else
+static inline int next_demotion_node(int node)
+{
+ return NUMA_NO_NODE;
+}
+#endif
+
#endif /* _LINUX_NODE_H_ */
diff -puN mm/vmscan.c~0009-mm-vmscan-Consider-anonymous-pages-without-swap mm/vmscan.c
--- a/mm/vmscan.c~0009-mm-vmscan-Consider-anonymous-pages-without-swap 2020-06-29 16:34:42.863312594 -0700
+++ b/mm/vmscan.c 2020-06-29 16:34:42.868312594 -0700
@@ -288,6 +288,26 @@ static bool writeback_throttling_sane(st
}
#endif

+static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
+ int node_id)
+{
+ /* Always age anon pages when we have swap */
+ if (memcg == NULL) {
+ if (get_nr_swap_pages() > 0)
+ return true;
+ } else {
+ if (mem_cgroup_get_nr_swap_pages(memcg) > 0)
+ return true;
+ }
+
+ /* Also age anon pages if we can auto-migrate them */
+ if (next_demotion_node(node_id) >= 0)
+ return true;
+
+ /* No way to reclaim anon pages */
+ return false;
+}
+
/*
* This misses isolated pages which are not accounted for to save counters.
* As the data only determines if reclaim or compaction continues, it is
@@ -299,7 +319,7 @@ unsigned long zone_reclaimable_pages(str

nr = zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_FILE) +
zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_FILE);
- if (get_nr_swap_pages() > 0)
+ if (can_reclaim_anon_pages(NULL, zone_to_nid(zone)))
nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) +
zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON);

@@ -2267,7 +2287,7 @@ static void get_scan_count(struct lruvec
enum lru_list lru;

/* If we have no swap space, do not bother scanning anon pages. */
- if (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0) {
+ if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id)) {
scan_balance = SCAN_FILE;
goto out;
}
@@ -2572,7 +2592,9 @@ static void shrink_lruvec(struct lruvec
* Even if we did not try to evict anon pages at all, we want to
* rebalance the anon lru active/inactive ratio.
*/
- if (total_swap_pages && inactive_is_low(lruvec, LRU_INACTIVE_ANON))
+ if (can_reclaim_anon_pages(lruvec_memcg(lruvec),
+ lruvec_pgdat(lruvec)->node_id) &&
+ inactive_is_low(lruvec, LRU_INACTIVE_ANON))
shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
sc, LRU_ACTIVE_ANON);
}
@@ -2642,7 +2664,7 @@ static inline bool should_continue_recla
*/
pages_for_compaction = compact_gap(sc->order);
inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
- if (get_nr_swap_pages() > 0)
+ if (can_reclaim_anon_pages(NULL, pgdat->node_id))
inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);

return inactive_lru_pages > pages_for_compaction;
@@ -3395,7 +3417,7 @@ static void age_active_anon(struct pglis
struct mem_cgroup *memcg;
struct lruvec *lruvec;

- if (!total_swap_pages)
+ if (!can_reclaim_anon_pages(NULL, pgdat->node_id))
return;

lruvec = mem_cgroup_lruvec(NULL, pgdat);
_

2020-06-29 23:53:24

by Dave Hansen

Subject: [RFC][PATCH 7/8] mm/vmscan: never demote for memcg reclaim


From: Dave Hansen <[email protected]>

Global reclaim aims to reduce the amount of memory used on
a given node or set of nodes. Migrating pages to another
node serves this purpose.

memcg reclaim is different. Its goal is to reduce the
total memory consumption of the entire memcg, across all
nodes. Migration does not assist memcg reclaim because
it just moves page contents between nodes rather than
actually reducing memory consumption.

Signed-off-by: Dave Hansen <[email protected]>
Suggested-by: Yang Shi <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Dan Williams <[email protected]>
---

b/mm/vmscan.c | 61 +++++++++++++++++++++++++++++++++++++++-------------------
1 file changed, 42 insertions(+), 19 deletions(-)

diff -puN mm/vmscan.c~never-demote-for-memcg-reclaim mm/vmscan.c
--- a/mm/vmscan.c~never-demote-for-memcg-reclaim 2020-06-29 16:34:44.018312591 -0700
+++ b/mm/vmscan.c 2020-06-29 16:34:44.023312591 -0700
@@ -289,7 +289,8 @@ static bool writeback_throttling_sane(st
#endif

static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
- int node_id)
+ int node_id,
+ struct scan_control *sc)
{
/* Always age anon pages when we have swap */
if (memcg == NULL) {
@@ -300,8 +301,14 @@ static inline bool can_reclaim_anon_page
return true;
}

- /* Also age anon pages if we can auto-migrate them */
- if (next_demotion_node(node_id) >= 0)
+ /*
+ * Also age anon pages if we can auto-migrate them.
+ *
+ * Migrating a page does not reduce consumption of a
+ * memcg so should not be performed when in memcg
+ * reclaim.
+ */
+ if ((!sc || !cgroup_reclaim(sc)) && (next_demotion_node(node_id) >= 0))
return true;

/* No way to reclaim anon pages */
@@ -319,7 +326,7 @@ unsigned long zone_reclaimable_pages(str

nr = zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_FILE) +
zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_FILE);
- if (can_reclaim_anon_pages(NULL, zone_to_nid(zone)))
+ if (can_reclaim_anon_pages(NULL, zone_to_nid(zone), NULL))
nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) +
zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON);

@@ -1084,6 +1091,32 @@ static void page_check_dirty_writeback(s
mapping->a_ops->is_dirty_writeback(page, dirty, writeback);
}

+
+static int shrink_do_demote_mapping(struct page *page,
+ struct list_head *page_list,
+ struct scan_control *sc)
+{
+ int rc;
+
+ /* It is pointless to do demotion in memcg reclaim */
+ if (cgroup_reclaim(sc))
+ return -ENOTSUPP;
+
+ rc = migrate_demote_mapping(page);
+ /*
+ * -ENOMEM on a THP may indicate either migration is
+ * unsupported or there was not enough contiguous
+ * space. Split the THP into base pages and retry the
+ * head immediately. The tail pages will be considered
+ * individually within the current loop's page list.
+ */
+ if (rc == -ENOMEM && PageTransHuge(page) &&
+ !split_huge_page_to_list(page, page_list))
+ rc = migrate_demote_mapping(page);
+
+ return rc;
+}
+
/*
* shrink_page_list() returns the number of reclaimed pages
*/
@@ -1251,17 +1284,7 @@ static unsigned long shrink_page_list(st
; /* try to reclaim the page below */
}

- rc = migrate_demote_mapping(page);
- /*
- * -ENOMEM on a THP may indicate either migration is
- * unsupported or there was not enough contiguous
- * space. Split the THP into base pages and retry the
- * head immediately. The tail pages will be considered
- * individually within the current loop's page list.
- */
- if (rc == -ENOMEM && PageTransHuge(page) &&
- !split_huge_page_to_list(page, page_list))
- rc = migrate_demote_mapping(page);
+ rc = shrink_do_demote_mapping(page, page_list, sc);

if (rc == MIGRATEPAGE_SUCCESS) {
unlock_page(page);
@@ -2287,7 +2310,7 @@ static void get_scan_count(struct lruvec
enum lru_list lru;

/* If we have no swap space, do not bother scanning anon pages. */
- if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id)) {
+ if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id, sc)) {
scan_balance = SCAN_FILE;
goto out;
}
@@ -2593,7 +2616,7 @@ static void shrink_lruvec(struct lruvec
* rebalance the anon lru active/inactive ratio.
*/
if (can_reclaim_anon_pages(lruvec_memcg(lruvec),
- lruvec_pgdat(lruvec)->node_id) &&
+ lruvec_pgdat(lruvec)->node_id, sc) &&
inactive_is_low(lruvec, LRU_INACTIVE_ANON))
shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
sc, LRU_ACTIVE_ANON);
@@ -2664,7 +2687,7 @@ static inline bool should_continue_recla
*/
pages_for_compaction = compact_gap(sc->order);
inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
- if (can_reclaim_anon_pages(NULL, pgdat->node_id))
+ if (can_reclaim_anon_pages(NULL, pgdat->node_id, sc))
inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);

return inactive_lru_pages > pages_for_compaction;
@@ -3417,7 +3440,7 @@ static void age_active_anon(struct pglis
struct mem_cgroup *memcg;
struct lruvec *lruvec;

- if (!can_reclaim_anon_pages(NULL, pgdat->node_id))
+ if (!can_reclaim_anon_pages(NULL, pgdat->node_id, sc))
return;

lruvec = mem_cgroup_lruvec(NULL, pgdat);
_

2020-06-29 23:53:26

by Dave Hansen

Subject: [RFC][PATCH 5/8] mm/numa: automatically generate node migration order


From: Dave Hansen <[email protected]>

When memory fills up on a node, memory contents can be
automatically migrated to another node. The biggest problems are
knowing when to migrate and to where the migration should be
targeted.

The most straightforward way to generate the "to where" list
would be to follow the page allocator fallback lists. Those
lists already tell us where to look next when memory is full. It
would also be logical to move memory in that order.

But, the allocator fallback lists have a fatal flaw: most nodes
appear in all the lists. This would potentially lead to
migration cycles (A->B, B->A, A->B, ...).

Instead of using the allocator fallback lists directly, keep a
separate node migration ordering. But, reuse the same data used
to generate page allocator fallback in the first place:
find_next_best_node().

This means that the firmware data used to populate node distances
essentially dictates the ordering for now. It should also be
architecture-neutral since all NUMA architectures have a working
find_next_best_node().
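
As a worked example (hypothetical topology, assuming firmware
distances make node 2 closest to node 0 and node 3 closest to node 1):
on a two-socket box where nodes 0-1 have CPUs and DRAM and nodes 2-3
are PMEM, the passes in the code below proceed roughly like this:

	/*
	 * Pass 1: this_pass = {0,1}, used_targets = {0,1}
	 *         node 0 -> node 2, node 1 -> node 3, next_pass = {2,3}
	 * Pass 2: this_pass = {2,3}, used_targets = {0,1,2,3}
	 *         no unused targets remain, so nodes 2 and 3 stay
	 *         terminal (node_demotion[] == NUMA_NO_NODE)
	 */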

The protocol for node_demotion[] access and writing is not
standard. It has no specific locking and is intended to be read
locklessly. Readers must take care to avoid observing changes
that appear incoherent. This was done so that node_demotion[]
locking has no chance of becoming a bottleneck on large systems
with lots of CPUs in direct reclaim.

This code is unused for now. It will be called later in the
series.

Signed-off-by: Dave Hansen <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Dan Williams <[email protected]>
---

b/mm/internal.h | 1
b/mm/migrate.c | 130 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
b/mm/page_alloc.c | 2
3 files changed, 131 insertions(+), 2 deletions(-)

diff -puN mm/internal.h~auto-setup-default-migration-path-from-firmware mm/internal.h
--- a/mm/internal.h~auto-setup-default-migration-path-from-firmware 2020-06-29 16:34:41.629312597 -0700
+++ b/mm/internal.h 2020-06-29 16:34:41.638312597 -0700
@@ -192,6 +192,7 @@ extern int user_min_free_kbytes;

extern void zone_pcp_update(struct zone *zone);
extern void zone_pcp_reset(struct zone *zone);
+extern int find_next_best_node(int node, nodemask_t *used_node_mask);

#if defined CONFIG_COMPACTION || defined CONFIG_CMA

diff -puN mm/migrate.c~auto-setup-default-migration-path-from-firmware mm/migrate.c
--- a/mm/migrate.c~auto-setup-default-migration-path-from-firmware 2020-06-29 16:34:41.631312597 -0700
+++ b/mm/migrate.c 2020-06-29 16:34:41.639312597 -0700
@@ -1128,6 +1128,10 @@ out:
return rc;
}

+/*
+ * Writes to this array occur without locking. READ_ONCE()
+ * is recommended for readers.
+ */
static int node_demotion[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE};

/**
@@ -1141,7 +1145,13 @@ int next_demotion_node(int node)
{
get_online_mems();
while (true) {
- node = node_demotion[node];
+ /*
+ * node_demotion[] is updated without excluding
+ * this function from running. READ_ONCE() avoids
+ * 'node' checks reading different values from
+ * node_demotion[].
+ */
+ node = READ_ONCE(node_demotion[node]);
if (node == NUMA_NO_NODE)
break;
if (node_online(node))
@@ -3086,3 +3096,121 @@ void migrate_vma_finalize(struct migrate
}
EXPORT_SYMBOL(migrate_vma_finalize);
#endif /* CONFIG_DEVICE_PRIVATE */
+
+/* Disable reclaim-based migration. */
+static void disable_all_migrate_targets(void)
+{
+ int node;
+
+ for_each_online_node(node)
+ node_demotion[node] = NUMA_NO_NODE;
+}
+
+/*
+ * Find an automatic demotion target for 'node'.
+ * Failing here is OK. It might just indicate
+ * being at the end of a chain.
+ */
+static int establish_migrate_target(int node, nodemask_t *used)
+{
+ int migration_target;
+
+ /*
+ * Can not set a migration target on a
+ * node with it already set.
+ *
+ * No need for READ_ONCE() here since this
+ * in the write path for node_demotion[].
+ * This should be the only thread writing.
+ */
+ if (node_demotion[node] != NUMA_NO_NODE)
+ return NUMA_NO_NODE;
+
+ migration_target = find_next_best_node(node, used);
+ if (migration_target == NUMA_NO_NODE)
+ return NUMA_NO_NODE;
+
+ node_demotion[node] = migration_target;
+
+ return migration_target;
+}
+
+/*
+ * When memory fills up on a node, memory contents can be
+ * automatically migrated to another node instead of
+ * discarded at reclaim.
+ *
+ * Establish a "migration path" which will start at nodes
+ * with CPUs and will follow the priorities used to build the
+ * page allocator zonelists.
+ *
+ * The difference here is that cycles must be avoided. If
+ * node0 migrates to node1, then neither node1, nor anything
+ * node1 migrates to can migrate to node0.
+ *
+ * This function can run simultaneously with readers of
+ * node_demotion[]. However, it can not run simultaneously
+ * with itself. Exclusion is provided by memory hotplug events
+ * being single-threaded.
+ */
+void set_migration_target_nodes(void)
+{
+ nodemask_t next_pass = NODE_MASK_NONE;
+ nodemask_t this_pass = NODE_MASK_NONE;
+ nodemask_t used_targets = NODE_MASK_NONE;
+ int node;
+
+ get_online_mems();
+ /*
+ * Avoid any oddities like cycles that could occur
+ * from changes in the topology. This will leave
+ * a momentary gap when migration is disabled.
+ */
+ disable_all_migrate_targets();
+
+ /*
+ * Ensure that the "disable" is visible across the system.
+ * Readers will see either a combination of before+disable
+ * state or disable+after. They will never see before and
+ * after state together.
+ *
+ * The before+after state together might have cycles and
+ * could cause readers to do things like loop until this
+ * function finishes. This ensures they can only see a
+ * single "bad" read and would, for instance, only loop
+ * once.
+ */
+ smp_wmb();
+
+ /*
+ * Allocations go close to CPUs, first. Assume that
+ * the migration path starts at the nodes with CPUs.
+ */
+ next_pass = node_states[N_CPU];
+again:
+ this_pass = next_pass;
+ next_pass = NODE_MASK_NONE;
+ /*
+ * To avoid cycles in the migration "graph", ensure
+ * that migration sources are not future targets by
+ * setting them in 'used_targets'.
+ *
+ * But, do this only once per pass so that multiple
+ * source nodes can share a target node.
+ */
+ nodes_or(used_targets, used_targets, this_pass);
+ for_each_node_mask(node, this_pass) {
+ int target_node = establish_migrate_target(node, &used_targets);
+
+ if (target_node == NUMA_NO_NODE)
+ continue;
+
+ /* Visit targets from this pass in the next pass: */
+ node_set(target_node, next_pass);
+ }
+ /* Is another pass necessary? */
+ if (!nodes_empty(next_pass))
+ goto again;
+
+ put_online_mems();
+}
diff -puN mm/page_alloc.c~auto-setup-default-migration-path-from-firmware mm/page_alloc.c
--- a/mm/page_alloc.c~auto-setup-default-migration-path-from-firmware 2020-06-29 16:34:41.634312597 -0700
+++ b/mm/page_alloc.c 2020-06-29 16:34:41.641312597 -0700
@@ -5591,7 +5591,7 @@ static int node_load[MAX_NUMNODES];
*
* Return: node id of the found node or %NUMA_NO_NODE if no node is found.
*/
-static int find_next_best_node(int node, nodemask_t *used_node_mask)
+int find_next_best_node(int node, nodemask_t *used_node_mask)
{
int n, val;
int min_val = INT_MAX;
_

2020-06-30 07:27:02

by Huang, Ying

Subject: Re: [RFC][PATCH 8/8] mm/numa: new reclaim mode to enable reclaim-based migration

Hi, Dave,

Dave Hansen <[email protected]> writes:

> From: Dave Hansen <[email protected]>
>
> Some method is obviously needed to enable reclaim-based migration.
>
> Just like traditional autonuma, there will be some workloads that
> will benefit like workloads with more "static" configurations where
> hot pages stay hot and cold pages stay cold. If pages come and go
> from the hot and cold sets, the benefits of this approach will be
> more limited.
>
> The benefits are truly workload-based and *not* hardware-based.
> We do not believe that there is a viable threshold where certain
> hardware configurations should have this mechanism enabled while
> others do not.
>
> To be conservative, earlier work defaulted to disabling reclaim-
> based migration and did not include a mechanism to enable it.
> This proposes extending the existing "zone_reclaim_mode" (now
> really node_reclaim_mode) as a method to enable it.
>
> We are open to any alternative that allows end users to enable
> this mechanism or disable it if workload harm is detected (just
> like traditional autonuma).
>
> The implementation here is pretty simple and entirely unoptimized.
> On any memory hotplug events, assume that a node was added or
> removed and recalculate all migration targets. This ensures that
> the node_demotion[] array is always ready to be used in case the
> new reclaim mode is enabled. This recalculation is far from
> optimal; most glaringly, it does not even attempt to figure
> out whether nodes are actually coming or going.
>
> Signed-off-by: Dave Hansen <[email protected]>
> Cc: Yang Shi <[email protected]>
> Cc: David Rientjes <[email protected]>
> Cc: Huang Ying <[email protected]>
> Cc: Dan Williams <[email protected]>
> ---
>
> b/Documentation/admin-guide/sysctl/vm.rst | 9 ++++
> b/mm/migrate.c | 61 +++++++++++++++++++++++++++++-
> b/mm/vmscan.c | 7 +--
> 3 files changed, 73 insertions(+), 4 deletions(-)
>
> diff -puN Documentation/admin-guide/sysctl/vm.rst~enable-numa-demotion Documentation/admin-guide/sysctl/vm.rst
> --- a/Documentation/admin-guide/sysctl/vm.rst~enable-numa-demotion 2020-06-29 16:35:01.012312549 -0700
> +++ b/Documentation/admin-guide/sysctl/vm.rst 2020-06-29 16:35:01.021312549 -0700
> @@ -941,6 +941,7 @@ This is value OR'ed together of
> 1 (bit currently ignored)
> 2 Zone reclaim writes dirty pages out
> 4 Zone reclaim swaps pages
> +8 Zone reclaim migrates pages
> = ===================================
>
> zone_reclaim_mode is disabled by default. For file servers or workloads
> @@ -965,3 +966,11 @@ of other processes running on other node
> Allowing regular swap effectively restricts allocations to the local
> node unless explicitly overridden by memory policies or cpuset
> configurations.
> +
> +Page migration during reclaim is intended for systems with tiered memory
> +configurations. These systems have multiple types of memory with varied
> +performance characteristics instead of plain NUMA systems where the same
> +kind of memory is found at varied distances. Allowing page migration
> +during reclaim enables these systems to migrate pages from fast tiers to
> +slow tiers when the fast tier is under pressure. This migration is
> +performed before swap.
> diff -puN mm/migrate.c~enable-numa-demotion mm/migrate.c
> --- a/mm/migrate.c~enable-numa-demotion 2020-06-29 16:35:01.015312549 -0700
> +++ b/mm/migrate.c 2020-06-29 16:35:01.022312549 -0700
> @@ -49,6 +49,7 @@
> #include <linux/sched/mm.h>
> #include <linux/ptrace.h>
> #include <linux/oom.h>
> +#include <linux/memory.h>
>
> #include <asm/tlbflush.h>
>
> @@ -3165,6 +3166,10 @@ void set_migration_target_nodes(void)
> * Avoid any oddities like cycles that could occur
> * from changes in the topology. This will leave
> * a momentary gap when migration is disabled.
> + *
> + * This is superfluous for memory offlining since
> + * MEM_GOING_OFFLINE does it independently, but it
> + * does not hurt to do it a second time.
> */
> disable_all_migrate_targets();
>
> @@ -3211,6 +3216,60 @@ again:
> /* Is another pass necessary? */
> if (!nodes_empty(next_pass))
> goto again;
> +}
>
> - put_online_mems();
> +/*
> + * React to hotplug events that might online or offline
> + * NUMA nodes.
> + *
> + * This leaves migrate-on-reclaim transiently disabled
> + * between the MEM_GOING_OFFLINE and MEM_OFFLINE events.
> + * This runs whether RECLAIM_MIGRATE is enabled or not.
> + * That ensures that the user can turn RECLAIM_MIGRATE on
> + * without needing to recalculate migration targets.
> + */
> +#if defined(CONFIG_MEMORY_HOTPLUG)
> +static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
> + unsigned long action, void *arg)
> +{
> + switch (action) {
> + case MEM_GOING_OFFLINE:
> + /*
> + * Make sure there are not transient states where
> + * an offline node is a migration target. This
> + * will leave migration disabled until the offline
> + * completes and the MEM_OFFLINE case below runs.
> + */
> + disable_all_migrate_targets();
> + break;
> + case MEM_OFFLINE:
> + case MEM_ONLINE:
> + /*
> + * Recalculate the target nodes once the node
> + * reaches its final state (online or offline).
> + */
> + set_migration_target_nodes();
> + break;
> + case MEM_CANCEL_OFFLINE:
> + /*
> + * MEM_GOING_OFFLINE disabled all the migration
> + * targets. Reenable them.
> + */
> + set_migration_target_nodes();
> + break;
> + case MEM_GOING_ONLINE:
> + case MEM_CANCEL_ONLINE:
> + break;
> + }
> +
> + return notifier_from_errno(0);
> }
> +
> +static int __init migrate_on_reclaim_init(void)
> +{
> + hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
> + return 0;
> +}
> +late_initcall(migrate_on_reclaim_init);
> +#endif /* CONFIG_MEMORY_HOTPLUG */
> +
> diff -puN mm/vmscan.c~enable-numa-demotion mm/vmscan.c
> --- a/mm/vmscan.c~enable-numa-demotion 2020-06-29 16:35:01.017312549 -0700
> +++ b/mm/vmscan.c 2020-06-29 16:35:01.023312549 -0700
> @@ -4165,9 +4165,10 @@ int node_reclaim_mode __read_mostly;
> * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
> * ABI. New bits are OK, but existing bits can never change.
> */
> -#define RECLAIM_RSVD (1<<0) /* (currently ignored/unused) */
> -#define RECLAIM_WRITE (1<<1) /* Writeout pages during reclaim */
> -#define RECLAIM_UNMAP (1<<2) /* Unmap pages during reclaim */
> +#define RECLAIM_RSVD (1<<0) /* (currently ignored/unused) */
> +#define RECLAIM_WRITE (1<<1) /* Writeout pages during reclaim */
> +#define RECLAIM_UNMAP (1<<2) /* Unmap pages during reclaim */
> +#define RECLAIM_MIGRATE (1<<3) /* Migrate pages during reclaim */
>
> /*
> * Priority for NODE_RECLAIM. This determines the fraction of pages

I found that RECLAIM_MIGRATE is defined but never referenced in the
patch.

If my understanding of the code is correct, shrink_do_demote_mapping()
is called by shrink_page_list(), which is used by kswapd and direct
reclaim. So as long as the persistent memory node is onlined,
reclaim-based migration will be enabled regardless of node reclaim mode.

Best Regards,
Huang, Ying

2020-06-30 08:24:27

by Huang, Ying

Subject: Re: [RFC][PATCH 5/8] mm/numa: automatically generate node migration order

Dave Hansen <[email protected]> writes:

> +/*
> + * Find an automatic demotion target for 'node'.
> + * Failing here is OK. It might just indicate
> + * being at the end of a chain.
> + */
> +static int establish_migrate_target(int node, nodemask_t *used)
> +{
> + int migration_target;
> +
> + /*
> + * Can not set a migration target on a
> + * node with it already set.
> + *
> + * No need for READ_ONCE() here since this
> + * in the write path for node_demotion[].
> + * This should be the only thread writing.
> + */
> + if (node_demotion[node] != NUMA_NO_NODE)
> + return NUMA_NO_NODE;
> +
> + migration_target = find_next_best_node(node, used);
> + if (migration_target == NUMA_NO_NODE)
> + return NUMA_NO_NODE;
> +
> + node_demotion[node] = migration_target;
> +
> + return migration_target;
> +}
> +
> +/*
> + * When memory fills up on a node, memory contents can be
> + * automatically migrated to another node instead of
> + * discarded at reclaim.
> + *
> + * Establish a "migration path" which will start at nodes
> + * with CPUs and will follow the priorities used to build the
> + * page allocator zonelists.
> + *
> + * The difference here is that cycles must be avoided. If
> + * node0 migrates to node1, then neither node1, nor anything
> + * node1 migrates to can migrate to node0.
> + *
> + * This function can run simultaneously with readers of
> + * node_demotion[]. However, it can not run simultaneously
> + * with itself. Exclusion is provided by memory hotplug events
> + * being single-threaded.
> + */
> +void set_migration_target_nodes(void)
> +{
> + nodemask_t next_pass = NODE_MASK_NONE;
> + nodemask_t this_pass = NODE_MASK_NONE;
> + nodemask_t used_targets = NODE_MASK_NONE;
> + int node;
> +
> + get_online_mems();
> + /*
> + * Avoid any oddities like cycles that could occur
> + * from changes in the topology. This will leave
> + * a momentary gap when migration is disabled.
> + */
> + disable_all_migrate_targets();
> +
> + /*
> + * Ensure that the "disable" is visible across the system.
> + * Readers will see either a combination of before+disable
> + * state or disable+after. They will never see before and
> + * after state together.
> + *
> + * The before+after state together might have cycles and
> + * could cause readers to do things like loop until this
> + * function finishes. This ensures they can only see a
> + * single "bad" read and would, for instance, only loop
> + * once.
> + */
> + smp_wmb();
> +
> + /*
> + * Allocations go close to CPUs, first. Assume that
> + * the migration path starts at the nodes with CPUs.
> + */
> + next_pass = node_states[N_CPU];
> +again:
> + this_pass = next_pass;
> + next_pass = NODE_MASK_NONE;
> + /*
> + * To avoid cycles in the migration "graph", ensure
> + * that migration sources are not future targets by
> + * setting them in 'used_targets'.
> + *
> + * But, do this only once per pass so that multiple
> + * source nodes can share a target node.

establish_migrate_target() calls find_next_best_node(), which will set
target_node in used_targets. So it seems that the nodes_or() below is
only necessary to initialize used_targets, and multiple source nodes
cannot share one target node in the current implementation.

Best Regards,
Huang, Ying

> + */
> + nodes_or(used_targets, used_targets, this_pass);
> + for_each_node_mask(node, this_pass) {
> + int target_node = establish_migrate_target(node, &used_targets);
> +
> + if (target_node == NUMA_NO_NODE)
> + continue;
> +
> + /* Visit targets from this pass in the next pass: */
> + node_set(target_node, next_pass);
> + }
> + /* Is another pass necessary? */
> + if (!nodes_empty(next_pass))
> + goto again;
> +
> + put_online_mems();
> +}
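
To make this concrete, here is a small userspace simulation (clearly not
kernel code; the node layout and distance table are invented for a
hypothetical 2 DRAM + 2 PMEM machine) of the pass logic quoted above. It
shows that, because the toy find_next_best_node() marks its pick in
'used', DRAM node 1 ends up with PMEM node 3 even though PMEM node 2 is
nearer to it, i.e. two sources in one pass never share a target:

#include <stdio.h>
#include <stdbool.h>

#define NR_NODES 4
#define NO_NODE  (-1)

/* Invented distances: nodes 0,1 are DRAM with CPUs; 2,3 are PMEM. */
static const int dist[NR_NODES][NR_NODES] = {
	{ 10, 21, 17, 28 },
	{ 21, 10, 14, 17 },	/* node 1 is actually closest to PMEM node 2 */
	{ 17, 14, 10, 21 },
	{ 28, 17, 21, 10 },
};
static const bool has_cpu[NR_NODES] = { true, true, false, false };
static int node_demotion[NR_NODES] = { NO_NODE, NO_NODE, NO_NODE, NO_NODE };

/* Toy find_next_best_node(): nearest node not yet used, and mark it used. */
static int find_next_best_node(int node, bool used[NR_NODES])
{
	int best = NO_NODE;

	for (int n = 0; n < NR_NODES; n++) {
		if (used[n] || n == node)
			continue;
		if (best == NO_NODE || dist[node][n] < dist[node][best])
			best = n;
	}
	if (best != NO_NODE)
		used[best] = true;	/* this is why a target cannot be shared */
	return best;
}

int main(void)
{
	bool used[NR_NODES] = { false };
	bool this_pass[NR_NODES], next_pass[NR_NODES];
	bool more = true;

	for (int n = 0; n < NR_NODES; n++)
		next_pass[n] = has_cpu[n];	/* start at nodes with CPUs */

	while (more) {
		more = false;
		/* Sources of this pass can never become future targets. */
		for (int n = 0; n < NR_NODES; n++) {
			this_pass[n] = next_pass[n];
			next_pass[n] = false;
			if (this_pass[n])
				used[n] = true;
		}
		for (int n = 0; n < NR_NODES; n++) {
			int t;

			if (!this_pass[n])
				continue;
			t = find_next_best_node(n, used);
			if (t == NO_NODE)
				continue;
			node_demotion[n] = t;
			next_pass[t] = true;
			more = true;
		}
	}

	for (int n = 0; n < NR_NODES; n++)
		printf("node %d demotes to %d\n", n, node_demotion[n]);
	return 0;
}

With the invented distances this prints 0 -> 2 and 1 -> 3 (and NO_NODE for
the PMEM nodes), even though node 1 is closer to node 2.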

2020-06-30 20:52:34

by Shakeel Butt

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/8] Migrate Pages in lieu of discard

On Mon, Jun 29, 2020 at 4:48 PM Dave Hansen <[email protected]> wrote:
>
> I've been sitting on these for too long. Tha main purpose of this
> post is to have a public discussion with the other folks who are
> interested in this functionalty and converge on a single
> implementation.
>
> This set directly incorporates a statictics patch from Yang Shi and
> also includes one to ensure good behavior with cgroup reclaim which
> was very closely derived from this series:
>
> https://lore.kernel.org/linux-mm/[email protected]/
>
> Since the last post, the major changes are:
> - Added patch to skip migration when doing cgroup reclaim
> - Added stats patch from Yang Shi
>
> The full series is also available here:
>
> https://github.com/hansendc/linux/tree/automigrate-20200629
>
> --
>
> We're starting to see systems with more and more kinds of memory such
> as Intel's implementation of persistent memory.
>
> Let's say you have a system with some DRAM and some persistent memory.
> Today, once DRAM fills up, reclaim will start and some of the DRAM
> contents will be thrown out. Allocations will, at some point, start
> falling over to the slower persistent memory.
>
> That has two nasty properties. First, the newer allocations can end
> up in the slower persistent memory. Second, reclaimed data in DRAM
> are just discarded even if there are gobs of space in persistent
> memory that could be used.
>
> This set implements a solution to these problems. At the end of the
> reclaim process in shrink_page_list() just before the last page
> refcount is dropped, the page is migrated to persistent memory instead
> of being dropped.
>
> While I've talked about a DRAM/PMEM pairing, this approach would
> function in any environment where memory tiers exist.
>
> This is not perfect. It "strands" pages in slower memory and never
> brings them back to fast DRAM. Other things need to be built to
> promote hot pages back to DRAM.
>
> This is part of a larger patch set. If you want to apply these or
> play with them, I'd suggest using the tree from here. It includes
> autonuma-based hot page promotion back to DRAM:
>
> http://lkml.kernel.org/r/[email protected]
>
> This is also all based on an upstream mechanism that allows
> persistent memory to be onlined and used as if it were volatile:
>
> http://lkml.kernel.org/r/[email protected]
>

I have a high-level question. Given a reclaim request for a set of
nodes, if there is no demotion path out of that set, should the kernel
still consider migrations within the set of nodes? Basically, should the
decision to allow migrations within a reclaim request be taken at the
node level or at the reclaim-request (or allocation) level?

2020-06-30 20:54:59

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/8] Migrate Pages in lieu of discard

On 6/30/20 11:36 AM, Shakeel Butt wrote:
>> This is part of a larger patch set. If you want to apply these or
>> play with them, I'd suggest using the tree from here. It includes
>> autonuma-based hot page promotion back to DRAM:
>>
>> http://lkml.kernel.org/r/[email protected]
>>
>> This is also all based on an upstream mechanism that allows
>> persistent memory to be onlined and used as if it were volatile:
>>
>> http://lkml.kernel.org/r/[email protected]
>>
> I have a high level question. Given a reclaim request for a set of
> nodes, if there is no demotion path out of that set, should the kernel
> still consider the migrations within the set of nodes?

OK, to be specific, we're talking about a case where we've arrived at
try_to_free_pages() and, say, all of the nodes on the system are set in
sc->nodemask? Isn't the common case that all nodes are set in
sc->nodemask? Since there is never a demotion path out of the set of
all nodes, the common case would be that there is no demotion path out
of a reclaim node set.

If that's true, I'd say that the kernel still needs to consider
migrations even within the set.

2020-06-30 20:58:24

by Shakeel Butt

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/8] Migrate Pages in lieu of discard

On Tue, Jun 30, 2020 at 11:51 AM Dave Hansen <[email protected]> wrote:
>
> On 6/30/20 11:36 AM, Shakeel Butt wrote:
> >> This is part of a larger patch set. If you want to apply these or
> >> play with them, I'd suggest using the tree from here. It includes
> >> autonuma-based hot page promotion back to DRAM:
> >>
> >> http://lkml.kernel.org/r/[email protected]
> >>
> >> This is also all based on an upstream mechanism that allows
> >> persistent memory to be onlined and used as if it were volatile:
> >>
> >> http://lkml.kernel.org/r/[email protected]
> >>
> > I have a high level question. Given a reclaim request for a set of
> > nodes, if there is no demotion path out of that set, should the kernel
> > still consider the migrations within the set of nodes?
>
> OK, to be specific, we're talking about a case where we've arrived at
> try_to_free_pages()

Yes.

> and, say, all of the nodes on the system are set in
> sc->nodemask? Isn't the common case that all nodes are set in
> sc->nodemask?

Depends on the workload but for normal users, yes.

> Since there is never a demotion path out of the set of
> all nodes, the common case would be that there is no demotion path out
> of a reclaim node set.
>
> If that's true, I'd say that the kernel still needs to consider
> migrations even within the set.

In my opinion it should be a user-defined policy, but I think that
discussion is orthogonal to this patch series. As I understand it, this
patch series aims to add the migration-within-reclaim infrastructure;
the policies, optimizations and heuristics can come later.

BTW, is this proposal only for systems having multiple tiers of memory?
Can a multi-node DRAM-only system take advantage of this proposal? For
example, I have a system with two DRAM nodes running two jobs
hardwalled to each node. For each job the other node is effectively
low-tier memory. If I can describe the per-job demotion paths, then
these jobs can take advantage of this proposal during occasional
peaks.

2020-06-30 20:59:08

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/8] Migrate Pages in lieu of discard

On 6/30/20 12:25 PM, Shakeel Butt wrote:
>> Since there is never a demotion path out of the set of
>> all nodes, the common case would be that there is no demotion path out
>> of a reclaim node set.
>>
>> If that's true, I'd say that the kernel still needs to consider
>> migrations even within the set.
> In my opinion it should be a user defined policy but I think that
> discussion is orthogonal to this patch series. As I understand, this
> patch series aims to add the migration-within-reclaim infrastructure,
> IMO the policies, optimizations, heuristics can come later.

Yes, this should be considered as adding the infrastructure plus one
_simple_ policy implementation which sets up migration away from nodes
with CPUs to more distant nodes without CPUs.

This simple policy will be useful for (but not limited to) volatile-use
persistent memory like Intel's Optane DIMMs.

> BTW is this proposal only for systems having multi-tiers of memory?
> Can a multi-node DRAM-only system take advantage of this proposal? For
> example I have a system with two DRAM nodes running two jobs
> hardwalled to each node. For each job the other node is kind of
> low-tier memory. If I can describe the per-job demotion paths then
> these jobs can take advantage of this proposal during occasional
> peaks.

I don't see any reason it could not work there. There would just need
to be a way to set up a different demotion path policy than what was
done here.

2020-07-01 00:48:25

by David Rientjes

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard

On Mon, 29 Jun 2020, Dave Hansen wrote:

> From: Dave Hansen <[email protected]>
>
> If a memory node has a preferred migration path to demote cold pages,
> attempt to move those inactive pages to that migration node before
> reclaiming. This will better utilize available memory, provide a faster
> tier than swapping or discarding, and allow such pages to be reused
> immediately without IO to retrieve the data.
>
> When handling anonymous pages, this will be considered before swap if
> enabled. Should the demotion fail for any reason, the page reclaim
> will proceed as if the demotion feature was not enabled.
>

Thanks for sharing these patches and kick-starting the conversation, Dave.

Could this cause us to break a user's mbind() or allow a user to
circumvent their cpuset.mems?

Because we don't have a mapping of the page back to its allocation
context (or the process context in which it was allocated), it seems like
both are possible.

So let's assume that migration nodes cannot be other DRAM nodes.
Otherwise, memory pressure could be intentionally or unintentionally
induced to migrate these pages to another node. Do we have such a
restriction on migration nodes?

> Some places we would like to see this used:
>
> 1. Persistent memory being as a slower, cheaper DRAM replacement
> 2. Remote memory-only "expansion" NUMA nodes
> 3. Resolving memory imbalances where one NUMA node is seeing more
> allocation activity than another. This helps keep more recent
> allocations closer to the CPUs on the node doing the allocating.
>

(3) is the concerning one given the above if we are to use
migrate_demote_mapping() for DRAM node balancing.

> Yang Shi's patches used an alternative approach where to-be-discarded
> pages were collected on a separate discard list and then discarded
> as a batch with migrate_pages(). This results in simpler code and
> has all the performance advantages of batching, but has the
> disadvantage that pages which fail to migrate never get swapped.
>
> #Signed-off-by: Keith Busch <[email protected]>
> Signed-off-by: Dave Hansen <[email protected]>
> Cc: Keith Busch <[email protected]>
> Cc: Yang Shi <[email protected]>
> Cc: David Rientjes <[email protected]>
> Cc: Huang Ying <[email protected]>
> Cc: Dan Williams <[email protected]>
> ---
>
> b/include/linux/migrate.h | 6 ++++
> b/include/trace/events/migrate.h | 3 +-
> b/mm/debug.c | 1
> b/mm/migrate.c | 52 +++++++++++++++++++++++++++++++++++++++
> b/mm/vmscan.c | 25 ++++++++++++++++++
> 5 files changed, 86 insertions(+), 1 deletion(-)
>
> diff -puN include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/linux/migrate.h
> --- a/include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.950312604 -0700
> +++ b/include/linux/migrate.h 2020-06-29 16:34:38.963312604 -0700
> @@ -25,6 +25,7 @@ enum migrate_reason {
> MR_MEMPOLICY_MBIND,
> MR_NUMA_MISPLACED,
> MR_CONTIG_RANGE,
> + MR_DEMOTION,
> MR_TYPES
> };
>
> @@ -78,6 +79,7 @@ extern int migrate_huge_page_move_mappin
> struct page *newpage, struct page *page);
> extern int migrate_page_move_mapping(struct address_space *mapping,
> struct page *newpage, struct page *page, int extra_count);
> +extern int migrate_demote_mapping(struct page *page);
> #else
>
> static inline void putback_movable_pages(struct list_head *l) {}
> @@ -104,6 +106,10 @@ static inline int migrate_huge_page_move
> return -ENOSYS;
> }
>
> +static inline int migrate_demote_mapping(struct page *page)
> +{
> + return -ENOSYS;
> +}
> #endif /* CONFIG_MIGRATION */
>
> #ifdef CONFIG_COMPACTION
> diff -puN include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/trace/events/migrate.h
> --- a/include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.952312604 -0700
> +++ b/include/trace/events/migrate.h 2020-06-29 16:34:38.963312604 -0700
> @@ -20,7 +20,8 @@
> EM( MR_SYSCALL, "syscall_or_cpuset") \
> EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind") \
> EM( MR_NUMA_MISPLACED, "numa_misplaced") \
> - EMe(MR_CONTIG_RANGE, "contig_range")
> + EM( MR_CONTIG_RANGE, "contig_range") \
> + EMe(MR_DEMOTION, "demotion")
>
> /*
> * First define the enums in the above macros to be exported to userspace
> diff -puN mm/debug.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/debug.c
> --- a/mm/debug.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.954312604 -0700
> +++ b/mm/debug.c 2020-06-29 16:34:38.963312604 -0700
> @@ -25,6 +25,7 @@ const char *migrate_reason_names[MR_TYPE
> "mempolicy_mbind",
> "numa_misplaced",
> "cma",
> + "demotion",
> };
>
> const struct trace_print_flags pageflag_names[] = {
> diff -puN mm/migrate.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/migrate.c
> --- a/mm/migrate.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.956312604 -0700
> +++ b/mm/migrate.c 2020-06-29 16:34:38.964312604 -0700
> @@ -1151,6 +1151,58 @@ int next_demotion_node(int node)
> return node;
> }
>
> +static struct page *alloc_demote_node_page(struct page *page, unsigned long node)
> +{
> + /*
> + * 'mask' targets allocation only to the desired node in the
> + * migration path, and fails fast if the allocation can not be
> + * immediately satisfied. Reclaim is already active and heroic
> + * allocation efforts are unwanted.
> + */
> + gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY |
> + __GFP_NOMEMALLOC | __GFP_THISNODE | __GFP_HIGHMEM |
> + __GFP_MOVABLE;

GFP_NOWAIT has the side-effect that it does __GFP_KSWAPD_RECLAIM: do we
actually want to kick kswapd on the pmem node?

If not, GFP_TRANSHUGE_LIGHT does a trick where it does
GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM. You could probably do the same
here although the __GFP_IO and __GFP_FS would be unnecessary (but not
harmful).
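
To illustrate the suggestion, such a mask might look roughly like the
sketch below (an illustration only, not a tested replacement for the
hunk above):

	gfp_t mask = (GFP_HIGHUSER_MOVABLE | __GFP_THISNODE | __GFP_NOWARN |
		      __GFP_NORETRY | __GFP_NOMEMALLOC) & ~__GFP_RECLAIM;
	/*
	 * Clearing __GFP_RECLAIM drops both __GFP_DIRECT_RECLAIM and
	 * __GFP_KSWAPD_RECLAIM, so the demotion allocation neither enters
	 * direct reclaim nor wakes kswapd on the target node. __GFP_IO and
	 * __GFP_FS remain set via GFP_HIGHUSER_MOVABLE but are harmless
	 * here, as noted above.
	 */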

> + struct page *newpage;
> +
> + if (PageTransHuge(page)) {
> + mask |= __GFP_COMP;
> + newpage = alloc_pages_node(node, mask, HPAGE_PMD_ORDER);
> + if (newpage)
> + prep_transhuge_page(newpage);
> + } else
> + newpage = alloc_pages_node(node, mask, 0);
> +
> + return newpage;
> +}
> +
> +/**
> + * migrate_demote_mapping() - Migrate this page and its mappings to its
> + * demotion node.
> + * @page: A locked, isolated, non-huge page that should migrate to its current
> + * node's demotion target, if available. Since this is intended to be
> + * called during memory reclaim, all flag options are set to fail fast.
> + *
> + * @returns: MIGRATEPAGE_SUCCESS if successful, -errno otherwise.
> + */
> +int migrate_demote_mapping(struct page *page)
> +{
> + int next_nid = next_demotion_node(page_to_nid(page));
> +
> + VM_BUG_ON_PAGE(!PageLocked(page), page);
> + VM_BUG_ON_PAGE(PageHuge(page), page);
> + VM_BUG_ON_PAGE(PageLRU(page), page);
> +
> + if (next_nid == NUMA_NO_NODE)
> + return -ENOSYS;
> + if (PageTransHuge(page) && !thp_migration_supported())
> + return -ENOMEM;
> +
> + /* MIGRATE_ASYNC is the most light weight and never blocks.*/
> + return __unmap_and_move(alloc_demote_node_page, NULL, next_nid,
> + page, MIGRATE_ASYNC, MR_DEMOTION);
> +}
> +
> +
> /*
> * gcc 4.7 and 4.8 on arm get an ICEs when inlining unmap_and_move(). Work
> * around it.
> diff -puN mm/vmscan.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/vmscan.c
> --- a/mm/vmscan.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.959312604 -0700
> +++ b/mm/vmscan.c 2020-06-29 16:34:38.965312604 -0700
> @@ -1077,6 +1077,7 @@ static unsigned long shrink_page_list(st
> LIST_HEAD(free_pages);
> unsigned nr_reclaimed = 0;
> unsigned pgactivate = 0;
> + int rc;
>
> memset(stat, 0, sizeof(*stat));
> cond_resched();
> @@ -1229,6 +1230,30 @@ static unsigned long shrink_page_list(st
> ; /* try to reclaim the page below */
> }
>
> + rc = migrate_demote_mapping(page);
> + /*
> + * -ENOMEM on a THP may indicate either migration is
> + * unsupported or there was not enough contiguous
> + * space. Split the THP into base pages and retry the
> + * head immediately. The tail pages will be considered
> + * individually within the current loop's page list.
> + */
> + if (rc == -ENOMEM && PageTransHuge(page) &&
> + !split_huge_page_to_list(page, page_list))
> + rc = migrate_demote_mapping(page);
> +
> + if (rc == MIGRATEPAGE_SUCCESS) {
> + unlock_page(page);
> + if (likely(put_page_testzero(page)))
> + goto free_it;
> + /*
> + * Speculative reference will free this page,
> + * so leave it off the LRU.
> + */
> + nr_reclaimed++;

nr_reclaimed += nr_pages instead?

> + continue;
> + }
> +
> /*
> * Anonymous process memory has backing store?
> * Try to allocate it some swap space here.

2020-07-01 00:50:34

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC][PATCH 8/8] mm/numa: new reclaim mode to enable reclaim-based migration

Hi, Yang,

Yang Shi <[email protected]> writes:

>>> diff -puN mm/vmscan.c~enable-numa-demotion mm/vmscan.c
>>> --- a/mm/vmscan.c~enable-numa-demotion 2020-06-29 16:35:01.017312549 -0700
>>> +++ b/mm/vmscan.c 2020-06-29 16:35:01.023312549 -0700
>>> @@ -4165,9 +4165,10 @@ int node_reclaim_mode __read_mostly;
>>> * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
>>> * ABI. New bits are OK, but existing bits can never change.
>>> */
>>> -#define RECLAIM_RSVD (1<<0) /* (currently ignored/unused) */
>>> -#define RECLAIM_WRITE (1<<1) /* Writeout pages during reclaim */
>>> -#define RECLAIM_UNMAP (1<<2) /* Unmap pages during reclaim */
>>> +#define RECLAIM_RSVD (1<<0) /* (currently ignored/unused) */
>>> +#define RECLAIM_WRITE (1<<1) /* Writeout pages during reclaim */
>>> +#define RECLAIM_UNMAP (1<<2) /* Unmap pages during reclaim */
>>> +#define RECLAIM_MIGRATE (1<<3) /* Migrate pages during reclaim */
>>> /*
>>> * Priority for NODE_RECLAIM. This determines the fraction of pages
>> I found that RECLAIM_MIGRATE is defined but never referenced in the
>> patch.
>>
>> If my understanding of the code were correct, shrink_do_demote_mapping()
>> is called by shrink_page_list(), which is used by kswapd and direct
>> reclaim. So as long as the persistent memory node is onlined,
>> reclaim-based migration will be enabled regardless of node reclaim mode.
>
> It looks so according to the code. But the intention of a new node
> reclaim mode is to do migration on reclaim *only when* the
> RECLAIM_MODE is enabled by the users.
>
> It looks the patch just clear the migration target node masks if the
> memory is offlined.
>
> So, I'm supposed you need check if node_reclaim is enabled before
> doing migration in shrink_page_list() and also need make node reclaim
> to adopt the new mode.

But why shouldn't we migrate in kswapd and direct reclaim? I think that
we may need a way to control it, but shouldn't disable it
unconditionally.

> Please refer to
> https://lore.kernel.org/linux-mm/[email protected]/
>

Best Regards,
Huang, Ying

2020-07-01 01:13:36

by Yang Shi

[permalink] [raw]
Subject: Re: [RFC][PATCH 8/8] mm/numa: new reclaim mode to enable reclaim-based migration



On 6/30/20 5:48 PM, Huang, Ying wrote:
> Hi, Yang,
>
> Yang Shi <[email protected]> writes:
>
>>>> diff -puN mm/vmscan.c~enable-numa-demotion mm/vmscan.c
>>>> --- a/mm/vmscan.c~enable-numa-demotion 2020-06-29 16:35:01.017312549 -0700
>>>> +++ b/mm/vmscan.c 2020-06-29 16:35:01.023312549 -0700
>>>> @@ -4165,9 +4165,10 @@ int node_reclaim_mode __read_mostly;
>>>> * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
>>>> * ABI. New bits are OK, but existing bits can never change.
>>>> */
>>>> -#define RECLAIM_RSVD (1<<0) /* (currently ignored/unused) */
>>>> -#define RECLAIM_WRITE (1<<1) /* Writeout pages during reclaim */
>>>> -#define RECLAIM_UNMAP (1<<2) /* Unmap pages during reclaim */
>>>> +#define RECLAIM_RSVD (1<<0) /* (currently ignored/unused) */
>>>> +#define RECLAIM_WRITE (1<<1) /* Writeout pages during reclaim */
>>>> +#define RECLAIM_UNMAP (1<<2) /* Unmap pages during reclaim */
>>>> +#define RECLAIM_MIGRATE (1<<3) /* Migrate pages during reclaim */
>>>> /*
>>>> * Priority for NODE_RECLAIM. This determines the fraction of pages
>>> I found that RECLAIM_MIGRATE is defined but never referenced in the
>>> patch.
>>>
>>> If my understanding of the code were correct, shrink_do_demote_mapping()
>>> is called by shrink_page_list(), which is used by kswapd and direct
>>> reclaim. So as long as the persistent memory node is onlined,
>>> reclaim-based migration will be enabled regardless of node reclaim mode.
>> It looks so according to the code. But the intention of a new node
>> reclaim mode is to do migration on reclaim *only when* the
>> RECLAIM_MODE is enabled by the users.
>>
>> It looks the patch just clear the migration target node masks if the
>> memory is offlined.
>>
>> So, I'm supposed you need check if node_reclaim is enabled before
>> doing migration in shrink_page_list() and also need make node reclaim
>> to adopt the new mode.
> But why shouldn't we migrate in kswapd and direct reclaim? I think that
> we may need a way to control it, but shouldn't disable it
> unconditionally.

Let me share some background. In past discussions on LKML and at last
year's LSFMM, the opt-in approach was preferred since the new feature
might not be stable and mature yet. So the new node reclaim mode was
suggested by both Mel and Michal. I suppose this is still a valid
point now.

Once it is mature and stable enough, we could definitely make it the
universally preferred and default behavior.
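
As an illustration of that opt-in (a sketch only, not code from either
series), the demotion attempt in shrink_page_list() could be gated on
the new RECLAIM_MIGRATE bit from patch 8/8:

	/*
	 * Only attempt demotion when the admin opted in via the
	 * RECLAIM_MIGRATE bit of vm.zone_reclaim_mode; otherwise fall
	 * through to the normal swap/discard path, just like the
	 * !CONFIG_MIGRATION stub does.
	 */
	if (node_reclaim_mode & RECLAIM_MIGRATE)
		rc = migrate_demote_mapping(page);
	else
		rc = -ENOSYS;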

>
>> Please refer to
>> https://lore.kernel.org/linux-mm/[email protected]/
>>
> Best Regards,
> Huang, Ying

2020-07-01 01:29:13

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC][PATCH 8/8] mm/numa: new reclaim mode to enable reclaim-based migration

Yang Shi <[email protected]> writes:

> On 6/30/20 5:48 PM, Huang, Ying wrote:
>> Hi, Yang,
>>
>> Yang Shi <[email protected]> writes:
>>
>>>>> diff -puN mm/vmscan.c~enable-numa-demotion mm/vmscan.c
>>>>> --- a/mm/vmscan.c~enable-numa-demotion 2020-06-29 16:35:01.017312549 -0700
>>>>> +++ b/mm/vmscan.c 2020-06-29 16:35:01.023312549 -0700
>>>>> @@ -4165,9 +4165,10 @@ int node_reclaim_mode __read_mostly;
>>>>> * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
>>>>> * ABI. New bits are OK, but existing bits can never change.
>>>>> */
>>>>> -#define RECLAIM_RSVD (1<<0) /* (currently ignored/unused) */
>>>>> -#define RECLAIM_WRITE (1<<1) /* Writeout pages during reclaim */
>>>>> -#define RECLAIM_UNMAP (1<<2) /* Unmap pages during reclaim */
>>>>> +#define RECLAIM_RSVD (1<<0) /* (currently ignored/unused) */
>>>>> +#define RECLAIM_WRITE (1<<1) /* Writeout pages during reclaim */
>>>>> +#define RECLAIM_UNMAP (1<<2) /* Unmap pages during reclaim */
>>>>> +#define RECLAIM_MIGRATE (1<<3) /* Migrate pages during reclaim */
>>>>> /*
>>>>> * Priority for NODE_RECLAIM. This determines the fraction of pages
>>>> I found that RECLAIM_MIGRATE is defined but never referenced in the
>>>> patch.
>>>>
>>>> If my understanding of the code were correct, shrink_do_demote_mapping()
>>>> is called by shrink_page_list(), which is used by kswapd and direct
>>>> reclaim. So as long as the persistent memory node is onlined,
>>>> reclaim-based migration will be enabled regardless of node reclaim mode.
>>> It looks so according to the code. But the intention of a new node
>>> reclaim mode is to do migration on reclaim *only when* the
>>> RECLAIM_MODE is enabled by the users.
>>>
>>> It looks the patch just clear the migration target node masks if the
>>> memory is offlined.
>>>
>>> So, I'm supposed you need check if node_reclaim is enabled before
>>> doing migration in shrink_page_list() and also need make node reclaim
>>> to adopt the new mode.
>> But why shouldn't we migrate in kswapd and direct reclaim? I think that
>> we may need a way to control it, but shouldn't disable it
>> unconditionally.
>
> Let me share some background. In the past discussions on LKML and last
> year's LSFMM the opt-in approach was preferred since the new feature
> might be not stable and mature.  So the new node reclaim mode was
> suggested by both Mel and Michal. I'm supposed this is still a valid
> point now.

Is there any technical reason? I think the code isn't very complex. If
we are really worried about stability and maturity, isn't it enough to
provide some way to enable/disable the feature, even for kswapd and
direct reclaim?

Best Regards,
Huang, Ying

> Once it is mature and stable enough we definitely could make it
> universally preferred and default behavior.
>
>>
>>> Please refer to
>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>
>> Best Regards,
>> Huang, Ying

2020-07-01 01:30:48

by Yang Shi

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard



On 6/30/20 5:47 PM, David Rientjes wrote:
> On Mon, 29 Jun 2020, Dave Hansen wrote:
>
>> From: Dave Hansen <[email protected]>
>>
>> If a memory node has a preferred migration path to demote cold pages,
>> attempt to move those inactive pages to that migration node before
>> reclaiming. This will better utilize available memory, provide a faster
>> tier than swapping or discarding, and allow such pages to be reused
>> immediately without IO to retrieve the data.
>>
>> When handling anonymous pages, this will be considered before swap if
>> enabled. Should the demotion fail for any reason, the page reclaim
>> will proceed as if the demotion feature was not enabled.
>>
> Thanks for sharing these patches and kick-starting the conversation, Dave.
>
> Could this cause us to break a user's mbind() or allow a user to
> circumvent their cpuset.mems?
>
> Because we don't have a mapping of the page back to its allocation
> context (or the process context in which it was allocated), it seems like
> both are possible.

Yes, this could break the memory placement policy enforced by mbind and
cpuset. I discussed this with Michal on the mailing list and tried to
find a way to solve it, but unfortunately it does not seem easy, as you
mentioned above. The memory policy and cpuset are stored in task_struct
rather than mm_struct, and it is not easy to trace back to the
task_struct from a page (the owner field of mm_struct might be helpful,
but it depends on CONFIG_MEMCG and is not the preferred way).

>
> So let's assume that migration nodes cannot be other DRAM nodes.
> Otherwise, memory pressure could be intentionally or unintentionally
> induced to migrate these pages to another node. Do we have such a
> restriction on migration nodes?
>
>> Some places we would like to see this used:
>>
>> 1. Persistent memory being as a slower, cheaper DRAM replacement
>> 2. Remote memory-only "expansion" NUMA nodes
>> 3. Resolving memory imbalances where one NUMA node is seeing more
>> allocation activity than another. This helps keep more recent
>> allocations closer to the CPUs on the node doing the allocating.
>>
> (3) is the concerning one given the above if we are to use
> migrate_demote_mapping() for DRAM node balancing.
>
>> Yang Shi's patches used an alternative approach where to-be-discarded
>> pages were collected on a separate discard list and then discarded
>> as a batch with migrate_pages(). This results in simpler code and
>> has all the performance advantages of batching, but has the
>> disadvantage that pages which fail to migrate never get swapped.
>>
>> #Signed-off-by: Keith Busch <[email protected]>
>> Signed-off-by: Dave Hansen <[email protected]>
>> Cc: Keith Busch <[email protected]>
>> Cc: Yang Shi <[email protected]>
>> Cc: David Rientjes <[email protected]>
>> Cc: Huang Ying <[email protected]>
>> Cc: Dan Williams <[email protected]>
>> ---
>>
>> b/include/linux/migrate.h | 6 ++++
>> b/include/trace/events/migrate.h | 3 +-
>> b/mm/debug.c | 1
>> b/mm/migrate.c | 52 +++++++++++++++++++++++++++++++++++++++
>> b/mm/vmscan.c | 25 ++++++++++++++++++
>> 5 files changed, 86 insertions(+), 1 deletion(-)
>>
>> diff -puN include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/linux/migrate.h
>> --- a/include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.950312604 -0700
>> +++ b/include/linux/migrate.h 2020-06-29 16:34:38.963312604 -0700
>> @@ -25,6 +25,7 @@ enum migrate_reason {
>> MR_MEMPOLICY_MBIND,
>> MR_NUMA_MISPLACED,
>> MR_CONTIG_RANGE,
>> + MR_DEMOTION,
>> MR_TYPES
>> };
>>
>> @@ -78,6 +79,7 @@ extern int migrate_huge_page_move_mappin
>> struct page *newpage, struct page *page);
>> extern int migrate_page_move_mapping(struct address_space *mapping,
>> struct page *newpage, struct page *page, int extra_count);
>> +extern int migrate_demote_mapping(struct page *page);
>> #else
>>
>> static inline void putback_movable_pages(struct list_head *l) {}
>> @@ -104,6 +106,10 @@ static inline int migrate_huge_page_move
>> return -ENOSYS;
>> }
>>
>> +static inline int migrate_demote_mapping(struct page *page)
>> +{
>> + return -ENOSYS;
>> +}
>> #endif /* CONFIG_MIGRATION */
>>
>> #ifdef CONFIG_COMPACTION
>> diff -puN include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/trace/events/migrate.h
>> --- a/include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.952312604 -0700
>> +++ b/include/trace/events/migrate.h 2020-06-29 16:34:38.963312604 -0700
>> @@ -20,7 +20,8 @@
>> EM( MR_SYSCALL, "syscall_or_cpuset") \
>> EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind") \
>> EM( MR_NUMA_MISPLACED, "numa_misplaced") \
>> - EMe(MR_CONTIG_RANGE, "contig_range")
>> + EM( MR_CONTIG_RANGE, "contig_range") \
>> + EMe(MR_DEMOTION, "demotion")
>>
>> /*
>> * First define the enums in the above macros to be exported to userspace
>> diff -puN mm/debug.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/debug.c
>> --- a/mm/debug.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.954312604 -0700
>> +++ b/mm/debug.c 2020-06-29 16:34:38.963312604 -0700
>> @@ -25,6 +25,7 @@ const char *migrate_reason_names[MR_TYPE
>> "mempolicy_mbind",
>> "numa_misplaced",
>> "cma",
>> + "demotion",
>> };
>>
>> const struct trace_print_flags pageflag_names[] = {
>> diff -puN mm/migrate.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/migrate.c
>> --- a/mm/migrate.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.956312604 -0700
>> +++ b/mm/migrate.c 2020-06-29 16:34:38.964312604 -0700
>> @@ -1151,6 +1151,58 @@ int next_demotion_node(int node)
>> return node;
>> }
>>
>> +static struct page *alloc_demote_node_page(struct page *page, unsigned long node)
>> +{
>> + /*
>> + * 'mask' targets allocation only to the desired node in the
>> + * migration path, and fails fast if the allocation can not be
>> + * immediately satisfied. Reclaim is already active and heroic
>> + * allocation efforts are unwanted.
>> + */
>> + gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY |
>> + __GFP_NOMEMALLOC | __GFP_THISNODE | __GFP_HIGHMEM |
>> + __GFP_MOVABLE;
> GFP_NOWAIT has the side-effect that it does __GFP_KSWAPD_RECLAIM: do we
> actually want to kick kswapd on the pmem node?
>
> If not, GFP_TRANSHUGE_LIGHT does a trick where it does
> GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM. You could probably do the same
> here although the __GFP_IO and __GFP_FS would be unnecessary (but not
> harmful).

I'm not sure how Dave thought about this; however, IMHO kicking kswapd
on the pmem node would help free memory and thus improve the migration
success rate. In my implementation, as Dave mentioned in the commit log,
the migration candidates are put on a separate list and then migrated in
a batch by calling migrate_pages(). Kicking kswapd on pmem would help
improve the success rate since migrate_pages() will retry a couple of
times.

Dave's implementation (as you see in this patch) does migration on a
per-page basis; if migration fails it will try swap. Kicking kswapd on
pmem would also help later migrations. However, IMHO a migration retry
should still be faster than swap.
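
A rough sketch of that batched path (illustrative only; the function name
and placement are assumptions, and the collection of candidates onto the
list in shrink_page_list() is omitted):

static void demote_page_list(struct list_head *demote_pages,
			     struct pglist_data *pgdat)
{
	int target_nid = next_demotion_node(pgdat->node_id);

	if (list_empty(demote_pages))
		return;
	if (target_nid == NUMA_NO_NODE)
		goto putback;

	/* migrate_pages() retries internally, so the whole batch benefits
	 * from any memory kswapd frees up on the target node. */
	migrate_pages(demote_pages, alloc_demote_node_page, NULL,
		      target_nid, MIGRATE_ASYNC, MR_DEMOTION);

putback:
	/* Pages that still failed go back to the LRU; this is the
	 * "failed pages never get swapped" trade-off Dave mentioned. */
	if (!list_empty(demote_pages))
		putback_movable_pages(demote_pages);
}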

>
>> + struct page *newpage;
>> +
>> + if (PageTransHuge(page)) {
>> + mask |= __GFP_COMP;
>> + newpage = alloc_pages_node(node, mask, HPAGE_PMD_ORDER);
>> + if (newpage)
>> + prep_transhuge_page(newpage);
>> + } else
>> + newpage = alloc_pages_node(node, mask, 0);
>> +
>> + return newpage;
>> +}
>> +
>> +/**
>> + * migrate_demote_mapping() - Migrate this page and its mappings to its
>> + * demotion node.
>> + * @page: A locked, isolated, non-huge page that should migrate to its current
>> + * node's demotion target, if available. Since this is intended to be
>> + * called during memory reclaim, all flag options are set to fail fast.
>> + *
>> + * @returns: MIGRATEPAGE_SUCCESS if successful, -errno otherwise.
>> + */
>> +int migrate_demote_mapping(struct page *page)
>> +{
>> + int next_nid = next_demotion_node(page_to_nid(page));
>> +
>> + VM_BUG_ON_PAGE(!PageLocked(page), page);
>> + VM_BUG_ON_PAGE(PageHuge(page), page);
>> + VM_BUG_ON_PAGE(PageLRU(page), page);
>> +
>> + if (next_nid == NUMA_NO_NODE)
>> + return -ENOSYS;
>> + if (PageTransHuge(page) && !thp_migration_supported())
>> + return -ENOMEM;
>> +
>> + /* MIGRATE_ASYNC is the most light weight and never blocks.*/
>> + return __unmap_and_move(alloc_demote_node_page, NULL, next_nid,
>> + page, MIGRATE_ASYNC, MR_DEMOTION);
>> +}
>> +
>> +
>> /*
>> * gcc 4.7 and 4.8 on arm get an ICEs when inlining unmap_and_move(). Work
>> * around it.
>> diff -puN mm/vmscan.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/vmscan.c
>> --- a/mm/vmscan.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.959312604 -0700
>> +++ b/mm/vmscan.c 2020-06-29 16:34:38.965312604 -0700
>> @@ -1077,6 +1077,7 @@ static unsigned long shrink_page_list(st
>> LIST_HEAD(free_pages);
>> unsigned nr_reclaimed = 0;
>> unsigned pgactivate = 0;
>> + int rc;
>>
>> memset(stat, 0, sizeof(*stat));
>> cond_resched();
>> @@ -1229,6 +1230,30 @@ static unsigned long shrink_page_list(st
>> ; /* try to reclaim the page below */
>> }
>>
>> + rc = migrate_demote_mapping(page);
>> + /*
>> + * -ENOMEM on a THP may indicate either migration is
>> + * unsupported or there was not enough contiguous
>> + * space. Split the THP into base pages and retry the
>> + * head immediately. The tail pages will be considered
>> + * individually within the current loop's page list.
>> + */
>> + if (rc == -ENOMEM && PageTransHuge(page) &&
>> + !split_huge_page_to_list(page, page_list))
>> + rc = migrate_demote_mapping(page);
>> +
>> + if (rc == MIGRATEPAGE_SUCCESS) {
>> + unlock_page(page);
>> + if (likely(put_page_testzero(page)))
>> + goto free_it;
>> + /*
>> + * Speculative reference will free this page,
>> + * so leave it off the LRU.
>> + */
>> + nr_reclaimed++;
> nr_reclaimed += nr_pages instead?
>
>> + continue;
>> + }
>> +
>> /*
>> * Anonymous process memory has backing store?
>> * Try to allocate it some swap space here.

2020-07-01 01:43:08

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard

David Rientjes <[email protected]> writes:

> On Mon, 29 Jun 2020, Dave Hansen wrote:
>
>> From: Dave Hansen <[email protected]>
>>
>> If a memory node has a preferred migration path to demote cold pages,
>> attempt to move those inactive pages to that migration node before
>> reclaiming. This will better utilize available memory, provide a faster
>> tier than swapping or discarding, and allow such pages to be reused
>> immediately without IO to retrieve the data.
>>
>> When handling anonymous pages, this will be considered before swap if
>> enabled. Should the demotion fail for any reason, the page reclaim
>> will proceed as if the demotion feature was not enabled.
>>
>
> Thanks for sharing these patches and kick-starting the conversation, Dave.
>
> Could this cause us to break a user's mbind() or allow a user to
> circumvent their cpuset.mems?
>
> Because we don't have a mapping of the page back to its allocation
> context (or the process context in which it was allocated), it seems like
> both are possible.

For mbind, I think we don't have enough information during reclaim to
enforce the node binding policy. But for cpuset, if cgroup v2 (with the
unified hierarchy) is used, it's possible to get the node binding policy
via something like,

cgroup_get_e_css(page->mem_cgroup, &cpuset_cgrp_subsys)
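
For illustration only, such a check might look roughly like the sketch
below; cpuset_effective_nodemask_of_css() is a made-up helper that would
need to be added to kernel/cgroup/cpuset.c (struct cpuset is private to
that file), so treat this as a sketch of the idea rather than working
code:

static bool demotion_allowed_by_cpuset(struct page *page, int target_nid)
{
	struct cgroup_subsys_state *css;
	bool allowed = true;

	if (!page->mem_cgroup)
		return true;

	/* cgroup v2 only: find the effective cpuset via the page's memcg. */
	css = cgroup_get_e_css(page->mem_cgroup->css.cgroup,
			       &cpuset_cgrp_subsys);
	if (css) {
		/* Hypothetical helper exposing the cpuset's effective_mems. */
		allowed = node_isset(target_nid,
				     cpuset_effective_nodemask_of_css(css));
		css_put(css);
	}
	return allowed;
}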

> So let's assume that migration nodes cannot be other DRAM nodes.
> Otherwise, memory pressure could be intentionally or unintentionally
> induced to migrate these pages to another node. Do we have such a
> restriction on migration nodes?
>
>> Some places we would like to see this used:
>>
>> 1. Persistent memory being as a slower, cheaper DRAM replacement
>> 2. Remote memory-only "expansion" NUMA nodes
>> 3. Resolving memory imbalances where one NUMA node is seeing more
>> allocation activity than another. This helps keep more recent
>> allocations closer to the CPUs on the node doing the allocating.
>>
>
> (3) is the concerning one given the above if we are to use
> migrate_demote_mapping() for DRAM node balancing.
>
>> Yang Shi's patches used an alternative approach where to-be-discarded
>> pages were collected on a separate discard list and then discarded
>> as a batch with migrate_pages(). This results in simpler code and
>> has all the performance advantages of batching, but has the
>> disadvantage that pages which fail to migrate never get swapped.
>>
>> #Signed-off-by: Keith Busch <[email protected]>
>> Signed-off-by: Dave Hansen <[email protected]>
>> Cc: Keith Busch <[email protected]>
>> Cc: Yang Shi <[email protected]>
>> Cc: David Rientjes <[email protected]>
>> Cc: Huang Ying <[email protected]>
>> Cc: Dan Williams <[email protected]>
>> ---
>>
>> b/include/linux/migrate.h | 6 ++++
>> b/include/trace/events/migrate.h | 3 +-
>> b/mm/debug.c | 1
>> b/mm/migrate.c | 52 +++++++++++++++++++++++++++++++++++++++
>> b/mm/vmscan.c | 25 ++++++++++++++++++
>> 5 files changed, 86 insertions(+), 1 deletion(-)
>>
>> diff -puN include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/linux/migrate.h
>> --- a/include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.950312604 -0700
>> +++ b/include/linux/migrate.h 2020-06-29 16:34:38.963312604 -0700
>> @@ -25,6 +25,7 @@ enum migrate_reason {
>> MR_MEMPOLICY_MBIND,
>> MR_NUMA_MISPLACED,
>> MR_CONTIG_RANGE,
>> + MR_DEMOTION,
>> MR_TYPES
>> };
>>
>> @@ -78,6 +79,7 @@ extern int migrate_huge_page_move_mappin
>> struct page *newpage, struct page *page);
>> extern int migrate_page_move_mapping(struct address_space *mapping,
>> struct page *newpage, struct page *page, int extra_count);
>> +extern int migrate_demote_mapping(struct page *page);
>> #else
>>
>> static inline void putback_movable_pages(struct list_head *l) {}
>> @@ -104,6 +106,10 @@ static inline int migrate_huge_page_move
>> return -ENOSYS;
>> }
>>
>> +static inline int migrate_demote_mapping(struct page *page)
>> +{
>> + return -ENOSYS;
>> +}
>> #endif /* CONFIG_MIGRATION */
>>
>> #ifdef CONFIG_COMPACTION
>> diff -puN include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/trace/events/migrate.h
>> --- a/include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.952312604 -0700
>> +++ b/include/trace/events/migrate.h 2020-06-29 16:34:38.963312604 -0700
>> @@ -20,7 +20,8 @@
>> EM( MR_SYSCALL, "syscall_or_cpuset") \
>> EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind") \
>> EM( MR_NUMA_MISPLACED, "numa_misplaced") \
>> - EMe(MR_CONTIG_RANGE, "contig_range")
>> + EM( MR_CONTIG_RANGE, "contig_range") \
>> + EMe(MR_DEMOTION, "demotion")
>>
>> /*
>> * First define the enums in the above macros to be exported to userspace
>> diff -puN mm/debug.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/debug.c
>> --- a/mm/debug.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.954312604 -0700
>> +++ b/mm/debug.c 2020-06-29 16:34:38.963312604 -0700
>> @@ -25,6 +25,7 @@ const char *migrate_reason_names[MR_TYPE
>> "mempolicy_mbind",
>> "numa_misplaced",
>> "cma",
>> + "demotion",
>> };
>>
>> const struct trace_print_flags pageflag_names[] = {
>> diff -puN mm/migrate.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/migrate.c
>> --- a/mm/migrate.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.956312604 -0700
>> +++ b/mm/migrate.c 2020-06-29 16:34:38.964312604 -0700
>> @@ -1151,6 +1151,58 @@ int next_demotion_node(int node)
>> return node;
>> }
>>
>> +static struct page *alloc_demote_node_page(struct page *page, unsigned long node)
>> +{
>> + /*
>> + * 'mask' targets allocation only to the desired node in the
>> + * migration path, and fails fast if the allocation can not be
>> + * immediately satisfied. Reclaim is already active and heroic
>> + * allocation efforts are unwanted.
>> + */
>> + gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY |
>> + __GFP_NOMEMALLOC | __GFP_THISNODE | __GFP_HIGHMEM |
>> + __GFP_MOVABLE;
>
> GFP_NOWAIT has the side-effect that it does __GFP_KSWAPD_RECLAIM: do we
> actually want to kick kswapd on the pmem node?

I think it would be a good idea to kick kswapd on the PMEM node,
because otherwise we will discard more pages on the DRAM node. And in
general, the DRAM pages are hotter than the PMEM pages, because the
cold DRAM pages are migrated to the PMEM node.

> If not, GFP_TRANSHUGE_LIGHT does a trick where it does
> GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM. You could probably do the same
> here although the __GFP_IO and __GFP_FS would be unnecessary (but not
> harmful).
>
>> + struct page *newpage;
>> +
>> + if (PageTransHuge(page)) {
>> + mask |= __GFP_COMP;
>> + newpage = alloc_pages_node(node, mask, HPAGE_PMD_ORDER);
>> + if (newpage)
>> + prep_transhuge_page(newpage);
>> + } else
>> + newpage = alloc_pages_node(node, mask, 0);
>> +
>> + return newpage;
>> +}
>> +

Best Regards,
Huang, Ying

2020-07-01 05:42:58

by David Rientjes

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard

On Tue, 30 Jun 2020, Yang Shi wrote:

> > > From: Dave Hansen <[email protected]>
> > >
> > > If a memory node has a preferred migration path to demote cold pages,
> > > attempt to move those inactive pages to that migration node before
> > > reclaiming. This will better utilize available memory, provide a faster
> > > tier than swapping or discarding, and allow such pages to be reused
> > > immediately without IO to retrieve the data.
> > >
> > > When handling anonymous pages, this will be considered before swap if
> > > enabled. Should the demotion fail for any reason, the page reclaim
> > > will proceed as if the demotion feature was not enabled.
> > >
> > Thanks for sharing these patches and kick-starting the conversation, Dave.
> >
> > Could this cause us to break a user's mbind() or allow a user to
> > circumvent their cpuset.mems?
> >
> > Because we don't have a mapping of the page back to its allocation
> > context (or the process context in which it was allocated), it seems like
> > both are possible.
>
> Yes, this could break the memory placement policy enforced by mbind and
> cpuset. I discussed this with Michal on mailing list and tried to find a way
> to solve it, but unfortunately it seems not easy as what you mentioned above.
> The memory policy and cpuset is stored in task_struct rather than mm_struct.
> It is not easy to trace back to task_struct from page (owner field of
> mm_struct might be helpful, but it depends on CONFIG_MEMCG and is not
> preferred way).
>

Yeah, and Ying made a similar response to this message.

We can do this if we consider pmem not to be a separate memory tier from
the system perspective, however, but rather from the socket perspective. In
other words, a node can only demote to a series of exclusive pmem ranges
and promote to the same series of ranges in reverse order. So DRAM node 0
can only demote to PMEM node 2 while DRAM node 1 can only demote to PMEM
node 3 -- a pmem range cannot be demoted to, or promoted from, more than
one DRAM node.

This naturally takes care of mbind() and cpuset.mems if we consider pmem
just to be slower volatile memory and we don't need to deal with the
latency concerns of cross socket migration. A user page will never be
demoted to a pmem range across the socket and will never be promoted to a
different DRAM node that it doesn't have access to.

That can work with the NUMA abstraction for pmem, but it could also
theoretically be a new memory zone instead. If all memory living on pmem
is migratable (the natural way that memory hotplug is done, so we can
offline), this zone would live above ZONE_MOVABLE. Zonelist ordering
would determine whether we can allocate directly from this memory based on
system config or a new gfp flag that could be set for users of a mempolicy
that allows allocations directly from pmem. If abstracted as a NUMA node
instead, interleave over nodes {0,2,3} or a cpuset.mems of {0,2,3} doesn't
make much sense.

Kswapd would need to be enlightened for proper pgdat and pmem balancing
but in theory it should be simpler because it only has its own node to
manage. Existing per-zone watermarks might be easy to use to fine tune
the policy from userspace: the scale factor determines how much memory we
try to keep free on DRAM for migration from pmem, for example. We also
wouldn't have to deal with node hotplug or updating of demotion/promotion
node chains.

Maybe the strongest advantage of the node abstraction is the ability to
use autonuma and migrate_pages()/move_pages() API for moving pages
explicitly? Mempolicies could be used for migration to "top-tier" memory,
i.e. ZONE_NORMAL or ZONE_MOVABLE, instead.

2020-07-01 08:48:22

by Greg Thelen

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/8] mm/migrate: Defer allocating new page until needed

Dave Hansen <[email protected]> wrote:

> From: Keith Busch <[email protected]>
>
> Migrating pages had been allocating the new page before it was actually
> needed. Subsequent operations may still fail, which would have to handle
> cleaning up the newly allocated page when it was never used.
>
> Defer allocating the page until we are actually ready to make use of
> it, after locking the original page. This simplifies error handling,
> but should not have any functional change in behavior. This is just
> refactoring page migration so the main part can more easily be reused
> by other code.

Is there any concern that the src page is now held PG_locked over the
dst page allocation, which might wander into
reclaim/cond_resched/oom_kill? I don't have a deadlock in mind; I'm
just wondering about the additional latency imposed on unrelated
threads that want to access the src page.

> #Signed-off-by: Keith Busch <[email protected]>

Is the commented-out Signed-off-by intentional? The same applies to later patches.

> Signed-off-by: Dave Hansen <[email protected]>
> Cc: Keith Busch <[email protected]>
> Cc: Yang Shi <[email protected]>
> Cc: David Rientjes <[email protected]>
> Cc: Huang Ying <[email protected]>
> Cc: Dan Williams <[email protected]>
> ---
>
> b/mm/migrate.c | 148 ++++++++++++++++++++++++++++-----------------------------
> 1 file changed, 75 insertions(+), 73 deletions(-)
>
> diff -puN mm/migrate.c~0007-mm-migrate-Defer-allocating-new-page-until-needed mm/migrate.c
> --- a/mm/migrate.c~0007-mm-migrate-Defer-allocating-new-page-until-needed 2020-06-29 16:34:37.896312607 -0700
> +++ b/mm/migrate.c 2020-06-29 16:34:37.900312607 -0700
> @@ -1014,56 +1014,17 @@ out:
> return rc;
> }
>
> -static int __unmap_and_move(struct page *page, struct page *newpage,
> - int force, enum migrate_mode mode)
> +static int __unmap_and_move(new_page_t get_new_page,
> + free_page_t put_new_page,
> + unsigned long private, struct page *page,
> + enum migrate_mode mode,
> + enum migrate_reason reason)
> {
> int rc = -EAGAIN;
> int page_was_mapped = 0;
> struct anon_vma *anon_vma = NULL;
> bool is_lru = !__PageMovable(page);
> -
> - if (!trylock_page(page)) {
> - if (!force || mode == MIGRATE_ASYNC)
> - goto out;
> -
> - /*
> - * It's not safe for direct compaction to call lock_page.
> - * For example, during page readahead pages are added locked
> - * to the LRU. Later, when the IO completes the pages are
> - * marked uptodate and unlocked. However, the queueing
> - * could be merging multiple pages for one bio (e.g.
> - * mpage_readpages). If an allocation happens for the
> - * second or third page, the process can end up locking
> - * the same page twice and deadlocking. Rather than
> - * trying to be clever about what pages can be locked,
> - * avoid the use of lock_page for direct compaction
> - * altogether.
> - */
> - if (current->flags & PF_MEMALLOC)
> - goto out;
> -
> - lock_page(page);
> - }
> -
> - if (PageWriteback(page)) {
> - /*
> - * Only in the case of a full synchronous migration is it
> - * necessary to wait for PageWriteback. In the async case,
> - * the retry loop is too short and in the sync-light case,
> - * the overhead of stalling is too much
> - */
> - switch (mode) {
> - case MIGRATE_SYNC:
> - case MIGRATE_SYNC_NO_COPY:
> - break;
> - default:
> - rc = -EBUSY;
> - goto out_unlock;
> - }
> - if (!force)
> - goto out_unlock;
> - wait_on_page_writeback(page);
> - }
> + struct page *newpage;
>
> /*
> * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
> @@ -1082,6 +1043,12 @@ static int __unmap_and_move(struct page
> if (PageAnon(page) && !PageKsm(page))
> anon_vma = page_get_anon_vma(page);
>
> + newpage = get_new_page(page, private);
> + if (!newpage) {
> + rc = -ENOMEM;
> + goto out;
> + }
> +
> /*
> * Block others from accessing the new page when we get around to
> * establishing additional references. We are usually the only one
> @@ -1091,11 +1058,11 @@ static int __unmap_and_move(struct page
> * This is much like races on refcount of oldpage: just don't BUG().
> */
> if (unlikely(!trylock_page(newpage)))
> - goto out_unlock;
> + goto out_put;
>
> if (unlikely(!is_lru)) {
> rc = move_to_new_page(newpage, page, mode);
> - goto out_unlock_both;
> + goto out_unlock;
> }
>
> /*
> @@ -1114,7 +1081,7 @@ static int __unmap_and_move(struct page
> VM_BUG_ON_PAGE(PageAnon(page), page);
> if (page_has_private(page)) {
> try_to_free_buffers(page);
> - goto out_unlock_both;
> + goto out_unlock;
> }
> } else if (page_mapped(page)) {
> /* Establish migration ptes */
> @@ -1131,15 +1098,9 @@ static int __unmap_and_move(struct page
> if (page_was_mapped)
> remove_migration_ptes(page,
> rc == MIGRATEPAGE_SUCCESS ? newpage : page, false);
> -
> -out_unlock_both:
> - unlock_page(newpage);
> out_unlock:
> - /* Drop an anon_vma reference if we took one */
> - if (anon_vma)
> - put_anon_vma(anon_vma);
> - unlock_page(page);
> -out:
> + unlock_page(newpage);
> +out_put:
> /*
> * If migration is successful, decrease refcount of the newpage
> * which will not free the page because new page owner increased
> @@ -1150,12 +1111,20 @@ out:
> * state.
> */
> if (rc == MIGRATEPAGE_SUCCESS) {
> + set_page_owner_migrate_reason(newpage, reason);
> if (unlikely(!is_lru))
> put_page(newpage);
> else
> putback_lru_page(newpage);
> + } else if (put_new_page) {
> + put_new_page(newpage, private);
> + } else {
> + put_page(newpage);
> }
> -
> +out:
> + /* Drop an anon_vma reference if we took one */
> + if (anon_vma)
> + put_anon_vma(anon_vma);
> return rc;
> }
>
> @@ -1203,8 +1172,7 @@ static ICE_noinline int unmap_and_move(n
> int force, enum migrate_mode mode,
> enum migrate_reason reason)
> {
> - int rc = MIGRATEPAGE_SUCCESS;
> - struct page *newpage = NULL;
> + int rc = -EAGAIN;
>
> if (!thp_migration_supported() && PageTransHuge(page))
> return -ENOMEM;
> @@ -1219,17 +1187,57 @@ static ICE_noinline int unmap_and_move(n
> __ClearPageIsolated(page);
> unlock_page(page);
> }
> + rc = MIGRATEPAGE_SUCCESS;
> goto out;
> }
>
> - newpage = get_new_page(page, private);
> - if (!newpage)
> - return -ENOMEM;
> + if (!trylock_page(page)) {
> + if (!force || mode == MIGRATE_ASYNC)
> + return rc;
>
> - rc = __unmap_and_move(page, newpage, force, mode);
> - if (rc == MIGRATEPAGE_SUCCESS)
> - set_page_owner_migrate_reason(newpage, reason);
> + /*
> + * It's not safe for direct compaction to call lock_page.
> + * For example, during page readahead pages are added locked
> + * to the LRU. Later, when the IO completes the pages are
> + * marked uptodate and unlocked. However, the queueing
> + * could be merging multiple pages for one bio (e.g.
> + * mpage_readpages). If an allocation happens for the
> + * second or third page, the process can end up locking
> + * the same page twice and deadlocking. Rather than
> + * trying to be clever about what pages can be locked,
> + * avoid the use of lock_page for direct compaction
> + * altogether.
> + */
> + if (current->flags & PF_MEMALLOC)
> + return rc;
> +
> + lock_page(page);
> + }
> +
> + if (PageWriteback(page)) {
> + /*
> + * Only in the case of a full synchronous migration is it
> + * necessary to wait for PageWriteback. In the async case,
> + * the retry loop is too short and in the sync-light case,
> + * the overhead of stalling is too much
> + */
> + switch (mode) {
> + case MIGRATE_SYNC:
> + case MIGRATE_SYNC_NO_COPY:
> + break;
> + default:
> + rc = -EBUSY;
> + goto out_unlock;
> + }
> + if (!force)
> + goto out_unlock;
> + wait_on_page_writeback(page);
> + }
> + rc = __unmap_and_move(get_new_page, put_new_page, private,
> + page, mode, reason);
>
> +out_unlock:
> + unlock_page(page);
> out:
> if (rc != -EAGAIN) {
> /*
> @@ -1269,9 +1277,8 @@ out:
> if (rc != -EAGAIN) {
> if (likely(!__PageMovable(page))) {
> putback_lru_page(page);
> - goto put_new;
> + goto done;
> }
> -
> lock_page(page);
> if (PageMovable(page))
> putback_movable_page(page);
> @@ -1280,13 +1287,8 @@ out:
> unlock_page(page);
> put_page(page);
> }
> -put_new:
> - if (put_new_page)
> - put_new_page(newpage, private);
> - else
> - put_page(newpage);
> }
> -
> +done:
> return rc;
> }
>
> _

2020-07-01 08:54:46

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard

David Rientjes <[email protected]> writes:

> On Tue, 30 Jun 2020, Yang Shi wrote:
>
>> > > From: Dave Hansen <[email protected]>
>> > >
>> > > If a memory node has a preferred migration path to demote cold pages,
>> > > attempt to move those inactive pages to that migration node before
>> > > reclaiming. This will better utilize available memory, provide a faster
>> > > tier than swapping or discarding, and allow such pages to be reused
>> > > immediately without IO to retrieve the data.
>> > >
>> > > When handling anonymous pages, this will be considered before swap if
>> > > enabled. Should the demotion fail for any reason, the page reclaim
>> > > will proceed as if the demotion feature was not enabled.
>> > >
>> > Thanks for sharing these patches and kick-starting the conversation, Dave.
>> >
>> > Could this cause us to break a user's mbind() or allow a user to
>> > circumvent their cpuset.mems?
>> >
>> > Because we don't have a mapping of the page back to its allocation
>> > context (or the process context in which it was allocated), it seems like
>> > both are possible.
>>
>> Yes, this could break the memory placement policy enforced by mbind and
>> cpuset. I discussed this with Michal on mailing list and tried to find a way
>> to solve it, but unfortunately it seems not easy as what you mentioned above.
>> The memory policy and cpuset is stored in task_struct rather than mm_struct.
>> It is not easy to trace back to task_struct from page (owner field of
>> mm_struct might be helpful, but it depends on CONFIG_MEMCG and is not
>> preferred way).
>>
>
> Yeah, and Ying made a similar response to this message.
>
> We can do this if we consider pmem not to be a separate memory tier from
> the system perspective, however, but rather the socket perspective. In
> other words, a node can only demote to a series of exclusive pmem ranges
> and promote to the same series of ranges in reverse order. So DRAM node 0
> can only demote to PMEM node 2 while DRAM node 1 can only demote to PMEM
> node 3 -- a pmem range cannot be demoted to, or promoted from, more than
> one DRAM node.
>
> This naturally takes care of mbind() and cpuset.mems if we consider pmem
> just to be slower volatile memory and we don't need to deal with the
> latency concerns of cross socket migration. A user page will never be
> demoted to a pmem range across the socket and will never be promoted to a
> different DRAM node that it doesn't have access to.
>
> That can work with the NUMA abstraction for pmem, but it could also
> theoretically be a new memory zone instead. If all memory living on pmem
> is migratable (the natural way that memory hotplug is done, so we can
> offline), this zone would live above ZONE_MOVABLE. Zonelist ordering
> would determine whether we can allocate directly from this memory based on
> system config or a new gfp flag that could be set for users of a mempolicy
> that allows allocations directly from pmem. If abstracted as a NUMA node
> instead, interleave over nodes {0,2,3} or a cpuset.mems of {0,2,3} doesn't
> make much sense.

Why can't we just bind the memory of the application to nodes 0, 2, 3
via mbind() or cpuset.mems? Then the application can allocate memory
directly from PMEM. And if we bind the memory of the application via
mbind() to node 0, we can only allocate memory directly from DRAM.
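
Just to illustrate, something like the following from userspace (the node
numbers, the region size, and the MPOL_BIND choice are only for the
example; build with -lnuma):

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        /* Nodes 0, 2 and 3 from the example above: one bit per node. */
        unsigned long nodemask = (1UL << 0) | (1UL << 2) | (1UL << 3);
        size_t len = 64UL << 20;        /* 64MB region, purely illustrative */
        void *buf;

        if (posix_memalign(&buf, 4096, len))
                return 1;

        /*
         * Restrict this region to DRAM node 0 plus PMEM nodes 2 and 3,
         * so allocations here may fall back to PMEM directly.
         */
        if (mbind(buf, len, MPOL_BIND, &nodemask,
                  8 * sizeof(nodemask), MPOL_MF_MOVE))
                perror("mbind");

        return 0;
}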

Best Regards,
Huang, Ying

2020-07-01 14:27:16

by Zi Yan

[permalink] [raw]
Subject: Re: [RFC] [PATCH 0/8] Migrate Pages in lieu of discard

On 30 Jun 2020, at 15:31, Dave Hansen wrote:

>
>
>> BTW is this proposal only for systems having multi-tiers of memory?
>> Can a multi-node DRAM-only system take advantage of this proposal? For
>> example I have a system with two DRAM nodes running two jobs
>> hardwalled to each node. For each job the other node is kind of
>> low-tier memory. If I can describe the per-job demotion paths then
>> these jobs can take advantage of this proposal during occasional
>> peaks.
>
> I don't see any reason it could not work there. There would just need
> to be a way to set up a different demotion path policy than what was
> done here.

We might need a different threshold (or GFP flag) for allocating new pages
on the remote node for demotion. Otherwise, we could see scenarios like
this: two nodes in a system are almost full and Node A is trying to demote
some pages to Node B, which in turn triggers page demotion from Node B back
to Node A. We might be able to avoid such a demotion cycle by not allowing
Node A to demote pages again and instead having it swap pages to disk while
Node B is demoting its pages to Node A, but that still leads to a longer
reclaim path than having Node A swap to disk directly. In such cases, Node A
should just swap pages to disk without bothering Node B at all.

Maybe something like a GFP_DEMOTION flag for allocating demotion pages,
where the flag requires more free pages to be available in the destination
node, could avoid the situation above?
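
A very rough sketch of that idea, as a tweak to the alloc_demote_node_page()
helper from patch 3 (the ZONE_NORMAL choice and the 2x headroom are arbitrary
placeholders standing in for a real GFP_DEMOTION policy):

/* Sketch only; assumes the mm/vmscan.c context of patch 3. */
static struct page *alloc_demote_node_page(struct page *page, unsigned long node)
{
        struct zone *zone = &NODE_DATA(node)->node_zones[ZONE_NORMAL];
        gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY |
                     __GFP_NOMEMALLOC | __GFP_THISNODE | __GFP_HIGHMEM |
                     __GFP_MOVABLE;

        /*
         * Refuse to demote into a node that is not comfortably above
         * its high watermark, so two nearly-full nodes cannot keep
         * demoting into each other.
         */
        if (zone_page_state(zone, NR_FREE_PAGES) < 2 * high_wmark_pages(zone))
                return NULL;    /* let the source node swap/discard instead */

        return alloc_pages_node(node, mask, 0);
}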



Best Regards,
Yan Zi


2020-07-01 14:33:48

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC] [PATCH 0/8] Migrate Pages in lieu of discard

On 7/1/20 7:24 AM, Zi Yan wrote:
> On 30 Jun 2020, at 15:31, Dave Hansen wrote:
>>> BTW is this proposal only for systems having multi-tiers of
>>> memory? Can a multi-node DRAM-only system take advantage of
>>> this proposal? For example I have a system with two DRAM nodes
>>> running two jobs hardwalled to each node. For each job the
>>> other node is kind of low-tier memory. If I can describe the
>>> per-job demotion paths then these jobs can take advantage of
>>> this proposal during occasional peaks.
>> I don't see any reason it could not work there. There would just
>> need to be a way to set up a different demotion path policy than
>> what was done here.
> We might need a different threshold (or GFP flag) for allocating
> new pages on the remote node for demotion. Otherwise, we could see
> scenarios like this: two nodes in a system are almost full and Node A
> is trying to demote some pages to Node B, which in turn triggers page
> demotion from Node B back to Node A.

I've always assumed that migration cycles would be illegal since it's
so hard to guarantee forward reclaim progress with them in place.

2020-07-01 14:49:14

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/8] mm/migrate: Defer allocating new page until needed

On 7/1/20 1:47 AM, Greg Thelen wrote:
> Dave Hansen <[email protected]> wrote:
>> From: Keith Busch <[email protected]>
>> Defer allocating the page until we are actually ready to make use of
>> it, after locking the original page. This simplifies error handling,
>> but should not have any functional change in behavior. This is just
>> refactoring page migration so the main part can more easily be reused
>> by other code.
>
> Is there any concern that the src page is now held PG_locked over the
> dst page allocation, which might wander into
> reclaim/cond_resched/oom_kill? I don't have a deadlock in mind. I'm
> just wondering about the additional latency imposed on unrelated threads
> who want to access the src page.

It's not great. *But*, the alternative is to toss the page contents out
and let users encounter a fault and an allocation. They would be
subject to all the latency associated with an allocation, just at a
slightly later time.

If it's a problem it seems like it would be pretty easy to fix, at least
for non-cgroup reclaim. We know which node we're reclaiming from and we
know if it has a demotion path, so we could proactively allocate a
single migration target page before doing the source lock_page(). That
creates some other problems, but I think it would be straightforward.

>> #Signed-off-by: Keith Busch <[email protected]>
>
> Is commented Signed-off-by intentional? Same applies to later patches.

Yes, Keith is no longer at Intel, so that @intel.com mail would bounce.
I left the @intel.com SoB so it would be clear that the code originated
from Keith while at Intel, but commented it out to avoid it being picked
up by anyone's tooling.

2020-07-01 15:16:17

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard

On 6/30/20 10:41 PM, David Rientjes wrote:
> Maybe the strongest advantage of the node abstraction is the ability to
> use autonuma and migrate_pages()/move_pages() API for moving pages
> explicitly? Mempolicies could be used for migration to "top-tier" memory,
> i.e. ZONE_NORMAL or ZONE_MOVABLE, instead.

I totally agree that we _could_ introduce this new memory class as a zone.

Doing it as nodes is pretty natural since the firmware today describes
both slow (versus DRAM) and fast memory as separate nodes. It also
means that apps can get visibility into placement with existing NUMA
tooling and ABIs. To me, those are the two strongest reasons for the
node approach for PMEM.

Looking to the future, I don't think the zone approach scales. I know
folks want to build stuff within a single socket which is a mix of:

1. High-Bandwidth, on-package memory (a la MCDRAM)
2. DRAM
3. DRAM-cached PMEM (aka. "memory mode" PMEM)
4. Non-cached PMEM

Right now, #1 doesn't exist on modern platforms and #3/#4 can't be mixed
(you only get 3 _or_ 4 at once). I'd love to provide something here
that Intel can use to build future crazy platform configurations that
don't require kernel enabling.

2020-07-01 16:04:06

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC][PATCH 8/8] mm/numa: new reclaim mode to enable reclaim-based migration

On 6/30/20 10:50 AM, Yang Shi wrote:
> So, I suppose you need to check if node_reclaim is enabled before doing
> migration in shrink_page_list() and also need to make node reclaim adopt
> the new mode.
>
> Please refer to
> https://lore.kernel.org/linux-mm/[email protected]/
>
> I copied the related chunks here:

Thanks for those! I'll incorporate them for the next version.

2020-07-01 16:50:09

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard

On 6/30/20 5:47 PM, David Rientjes wrote:
> On Mon, 29 Jun 2020, Dave Hansen wrote:
>> From: Dave Hansen <[email protected]>
>>
>> If a memory node has a preferred migration path to demote cold pages,
>> attempt to move those inactive pages to that migration node before
>> reclaiming. This will better utilize available memory, provide a faster
>> tier than swapping or discarding, and allow such pages to be reused
>> immediately without IO to retrieve the data.
>>
>> When handling anonymous pages, this will be considered before swap if
>> enabled. Should the demotion fail for any reason, the page reclaim
>> will proceed as if the demotion feature was not enabled.
>>
>
> Thanks for sharing these patches and kick-starting the conversation, Dave.
>
> Could this cause us to break a user's mbind() or allow a user to
> circumvent their cpuset.mems?

In its current form, yes.

My current rationale for this is that while it's not as deferential as
it can be to the user/kernel ABI contract, it's good *overall* behavior.
The auto-migration only kicks in when the data is about to go away. So
while the user's data might be slower than they like, it is *WAY* faster
than they deserve because it should be off on the disk.

> Because we don't have a mapping of the page back to its allocation
> context (or the process context in which it was allocated), it seems like
> both are possible.
>
> So let's assume that migration nodes cannot be other DRAM nodes.
> Otherwise, memory pressure could be intentionally or unintentionally
> induced to migrate these pages to another node. Do we have such a
> restriction on migration nodes?

There's nothing explicit. On a normal, balanced system where there's a
1:1:1 relationship between CPU sockets, DRAM nodes and PMEM nodes, it's
implicit since the migration path is one deep and goes from DRAM->PMEM.

If there were some oddball system where there was a memory only DRAM
node, it might very well end up being a migration target.

>> Some places we would like to see this used:
>>
>> 1. Persistent memory being used as a slower, cheaper DRAM replacement
>> 2. Remote memory-only "expansion" NUMA nodes
>> 3. Resolving memory imbalances where one NUMA node is seeing more
>> allocation activity than another. This helps keep more recent
>> allocations closer to the CPUs on the node doing the allocating.
>
> (3) is the concerning one given the above if we are to use
> migrate_demote_mapping() for DRAM node balancing.

Yeah, agreed. That's the sketchiest of the three. :)

>> +static struct page *alloc_demote_node_page(struct page *page, unsigned long node)
>> +{
>> + /*
>> + * 'mask' targets allocation only to the desired node in the
>> + * migration path, and fails fast if the allocation can not be
>> + * immediately satisfied. Reclaim is already active and heroic
>> + * allocation efforts are unwanted.
>> + */
>> + gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY |
>> + __GFP_NOMEMALLOC | __GFP_THISNODE | __GFP_HIGHMEM |
>> + __GFP_MOVABLE;
>
> GFP_NOWAIT has the side-effect that it does __GFP_KSWAPD_RECLAIM: do we
> actually want to kick kswapd on the pmem node?

In my mental model, cold data flows from:

DRAM -> PMEM -> swap

Kicking kswapd here ensures that while we're doing DRAM->PMEM migrations
for kinda cold data, kswapd can be working on doing the PMEM->swap part
on really cold data.

...
>> @@ -1229,6 +1230,30 @@ static unsigned long shrink_page_list(st
>> ; /* try to reclaim the page below */
>> }
>>
>> + rc = migrate_demote_mapping(page);
>> + /*
>> + * -ENOMEM on a THP may indicate either migration is
>> + * unsupported or there was not enough contiguous
>> + * space. Split the THP into base pages and retry the
>> + * head immediately. The tail pages will be considered
>> + * individually within the current loop's page list.
>> + */
>> + if (rc == -ENOMEM && PageTransHuge(page) &&
>> + !split_huge_page_to_list(page, page_list))
>> + rc = migrate_demote_mapping(page);
>> +
>> + if (rc == MIGRATEPAGE_SUCCESS) {
>> + unlock_page(page);
>> + if (likely(put_page_testzero(page)))
>> + goto free_it;
>> + /*
>> + * Speculative reference will free this page,
>> + * so leave it off the LRU.
>> + */
>> + nr_reclaimed++;
>
> nr_reclaimed += nr_pages instead?

Oh, good catch. I also need to go double-check that 'nr_pages' isn't
wrong elsewhere because of the split.
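
For reference, the hunk with that accounting fix applied would look roughly
like this (untested, and the post-split staleness of 'nr_pages' still needs
the double-check mentioned above):

                        if (rc == MIGRATEPAGE_SUCCESS) {
                                unlock_page(page);
                                if (likely(put_page_testzero(page)))
                                        goto free_it;
                                /*
                                 * Speculative reference will free this page,
                                 * so leave it off the LRU.
                                 */
                                nr_reclaimed += nr_pages;
                        }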

2020-07-01 17:24:43

by Yang Shi

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard



On 6/30/20 10:41 PM, David Rientjes wrote:
> On Tue, 30 Jun 2020, Yang Shi wrote:
>
>>>> From: Dave Hansen <[email protected]>
>>>>
>>>> If a memory node has a preferred migration path to demote cold pages,
>>>> attempt to move those inactive pages to that migration node before
>>>> reclaiming. This will better utilize available memory, provide a faster
>>>> tier than swapping or discarding, and allow such pages to be reused
>>>> immediately without IO to retrieve the data.
>>>>
>>>> When handling anonymous pages, this will be considered before swap if
>>>> enabled. Should the demotion fail for any reason, the page reclaim
>>>> will proceed as if the demotion feature was not enabled.
>>>>
>>> Thanks for sharing these patches and kick-starting the conversation, Dave.
>>>
>>> Could this cause us to break a user's mbind() or allow a user to
>>> circumvent their cpuset.mems?
>>>
>>> Because we don't have a mapping of the page back to its allocation
>>> context (or the process context in which it was allocated), it seems like
>>> both are possible.
>> Yes, this could break the memory placement policy enforced by mbind and
>> cpuset. I discussed this with Michal on mailing list and tried to find a way
>> to solve it, but unfortunately it seems not easy as what you mentioned above.
>> The memory policy and cpuset is stored in task_struct rather than mm_struct.
>> It is not easy to trace back to task_struct from page (owner field of
>> mm_struct might be helpful, but it depends on CONFIG_MEMCG and is not
>> preferred way).
>>
> Yeah, and Ying made a similar response to this message.
>
> We can do this if we consider pmem not to be a separate memory tier from
> the system perspective, however, but rather the socket perspective. In
> other words, a node can only demote to a series of exclusive pmem ranges
> and promote to the same series of ranges in reverse order. So DRAM node 0
> can only demote to PMEM node 2 while DRAM node 1 can only demote to PMEM
> node 3 -- a pmem range cannot be demoted to, or promoted from, more than
> one DRAM node.
>
> This naturally takes care of mbind() and cpuset.mems if we consider pmem
> just to be slower volatile memory and we don't need to deal with the
> latency concerns of cross socket migration. A user page will never be
> demoted to a pmem range across the socket and will never be promoted to a
> different DRAM node that it doesn't have access to.

But I don't see too much benefit to limit the migration target to the
so-called *paired* pmem node. IMHO it is fine to migrate to a remote (on
a different socket) pmem node since even the cross socket access should
be much faster than refault or swap from disk.

>
> That can work with the NUMA abstraction for pmem, but it could also
> theoretically be a new memory zone instead. If all memory living on pmem
> is migratable (the natural way that memory hotplug is done, so we can
> offline), this zone would live above ZONE_MOVABLE. Zonelist ordering
> would determine whether we can allocate directly from this memory based on
> system config or a new gfp flag that could be set for users of a mempolicy
> that allows allocations directly from pmem. If abstracted as a NUMA node
> instead, interleave over nodes {0,2,3} or a cpuset.mems of {0,2,3} doesn't
> make much sense.
>
> Kswapd would need to be enlightened for proper pgdat and pmem balancing
> but in theory it should be simpler because it only has its own node to
> manage. Existing per-zone watermarks might be easy to use to fine tune
> the policy from userspace: the scale factor determines how much memory we
> try to keep free on DRAM for migration from pmem, for example. We also
> wouldn't have to deal with node hotplug or updating of demotion/promotion
> node chains.
>
> Maybe the strongest advantage of the node abstraction is the ability to
> use autonuma and migrate_pages()/move_pages() API for moving pages
> explicitly? Mempolicies could be used for migration to "top-tier" memory,
> i.e. ZONE_NORMAL or ZONE_MOVABLE, instead.

I think using pmem as a node is more natural than zone and less
intrusive since we can just reuse all the numa APIs. If we treat pmem as
a new zone I think the implementation may be more intrusive and
complicated (i.e. need a new gfp flag) and user can't control the memory
placement.

Actually there had been such proposal before, please see
https://www.spinics.net/lists/linux-mm/msg151788.html


2020-07-01 18:21:04

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard

On 7/1/20 1:54 AM, Huang, Ying wrote:
> Why can't we just bind the memory of the application to nodes 0, 2, 3
> via mbind() or cpuset.mems? Then the application can allocate memory
> directly from PMEM. And if we bind the memory of the application via
> mbind() to node 0, we can only allocate memory directly from DRAM.

Applications use cpuset.mems precisely because they don't want to
allocate directly from PMEM. They want the good, deterministic,
performance they get from DRAM.

Even if they don't allocate directly from PMEM, is it OK for such an app
to get its cold data migrated to PMEM? That's a much more subtle
question and I suspect the kernel isn't going to have a single answer
for it. I suspect we'll need a cpuset-level knob to turn auto-demotion
on or off.

2020-07-01 18:26:44

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/8] mm/numa: automatically generate node migration order

On 6/30/20 1:22 AM, Huang, Ying wrote:
>> + /*
>> + * To avoid cycles in the migration "graph", ensure
>> + * that migration sources are not future targets by
>> + * setting them in 'used_targets'.
>> + *
>> + * But, do this only once per pass so that multiple
>> + * source nodes can share a target node.
> establish_migrate_target() calls find_next_best_node(), which will set
> target_node in used_targets. So it seems that the nodes_or() below is
> only necessary to initialize used_targets, and multiple source nodes
> cannot share one target node in current implementation.

Yes, that is true. My focus on this implementation was simplicity and
sanity for common configurations. I can certainly imagine scenarios
where this is suboptimal.

I'm totally open to other ways of doing this.

2020-07-01 18:33:37

by Yang Shi

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/8] mm/migrate: Defer allocating new page until needed



On 7/1/20 7:46 AM, Dave Hansen wrote:
> On 7/1/20 1:47 AM, Greg Thelen wrote:
>> Dave Hansen <[email protected]> wrote:
>>> From: Keith Busch <[email protected]>
>>> Defer allocating the page until we are actually ready to make use of
>>> it, after locking the original page. This simplifies error handling,
>>> but should not have any functional change in behavior. This is just
>>> refactoring page migration so the main part can more easily be reused
>>> by other code.
>> Is there any concern that the src page is now held PG_locked over the
>> dst page allocation, which might wander into
>> reclaim/cond_resched/oom_kill? I don't have a deadlock in mind. I'm
>> just wondering about the additional latency imposed on unrelated threads
>> who want to access the src page.
> It's not great. *But*, the alternative is to toss the page contents out
> and let users encounter a fault and an allocation. They would be
> subject to all the latency associated with an allocation, just at a
> slightly later time.
>
> If it's a problem it seems like it would be pretty easy to fix, at least
> for non-cgroup reclaim. We know which node we're reclaiming from and we
> know if it has a demotion path, so we could proactively allocate a
> single migration target page before doing the source lock_page(). That
> creates some other problems, but I think it would be straightforward.

If so, this patch looks pointless, if I read it correctly. The patch
defers the page allocation in __unmap_and_move() until the page lock is
held so that __unmap_and_move() can be called from the reclaim path,
where the src page is already locked before __unmap_and_move() is
called; otherwise it would deadlock on itself.

Actually, you always allocate the target page with the src page locked
in this implementation, unless you move the target page allocation to
before shrink_page_list(), but then the problem is that you don't know
how many pages you need to allocate.

The alternative may be to unlock the src page, allocate the target page,
then lock the src page again. But if so, why not just call
migrate_pages() directly, as I did in my series? It puts the src pages
on a separate list, unlocks them, then migrates them in a batch later.
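
Roughly like this, for the batched flavor (a sketch only: the function name
is made up, next_demotion_node() is from patch 1, alloc_demote_node_page()
is from patch 3, and MR_DEMOTION is assumed to be the migrate reason this
series adds):

static void demote_page_list(struct list_head *demote_pages,
                             struct pglist_data *pgdat)
{
        int target_nid = next_demotion_node(pgdat->node_id);

        if (list_empty(demote_pages))
                return;

        if (target_nid == NUMA_NO_NODE) {
                /* No demotion target: put the pages back on the LRU. */
                putback_movable_pages(demote_pages);
                return;
        }

        /*
         * shrink_page_list() isolated and unlocked these pages; anything
         * migrate_pages() cannot move is put back as well.
         */
        if (migrate_pages(demote_pages, alloc_demote_node_page, NULL,
                          target_nid, MIGRATE_ASYNC, MR_DEMOTION))
                putback_movable_pages(demote_pages);
}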

>>> #Signed-off-by: Keith Busch <[email protected]>
>> Is commented Signed-off-by intentional? Same applies to later patches.
> Yes, Keith is no longer at Intel, so that @intel.com mail would bounce.
> I left the @intel.com SoB so it would be clear that the code originated
> from Keith while at Intel, but commented it out to avoid it being picked
> up by anyone's tooling.

2020-07-01 19:27:37

by David Rientjes

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard

On Wed, 1 Jul 2020, Dave Hansen wrote:

> > Could this cause us to break a user's mbind() or allow a user to
> > circumvent their cpuset.mems?
>
> In its current form, yes.
>
> My current rationale for this is that while it's not as deferential as
> it can be to the user/kernel ABI contract, it's good *overall* behavior.
> The auto-migration only kicks in when the data is about to go away. So
> while the user's data might be slower than they like, it is *WAY* faster
> than they deserve because it should be off on the disk.
>

It's outside the scope of this patchset, but eventually there will be a
promotion path that I think requires a strict 1:1 relationship between
DRAM and PMEM nodes because otherwise mbind(), set_mempolicy(), and
cpuset.mems become ineffective for nodes facing memory pressure.

For the purposes of this patchset, agreed that DRAM -> PMEM -> swap makes
perfect sense. Theoretically, I think you could have DRAM N0 and N1 and
then a single PMEM N2 and this N2 can be the terminal node for both N0 and
N1. On promotion, I think we need to rely on something stronger than
autonuma to decide which DRAM node to promote to: specifically any user
policy put into effect (memory tiering or autonuma shouldn't be allowed to
subvert these user policies).

As others have mentioned, we lose the allocation or process context at the
time of demotion or promotion and any workaround for that requires some
hacks, such as mapping the page to cpuset (what is the right solution for
shared pages?) or adding NUMA locality handling to memcg.

I think a 1:1 relationship between DRAM and PMEM nodes is required if we
consider the eventual promotion of this memory so that user memory can't
eventually reappear on a DRAM node that is not allowed by mbind(),
set_mempolicy(), or cpuset.mems. I think it also makes this patchset much
simpler.

> > Because we don't have a mapping of the page back to its allocation
> > context (or the process context in which it was allocated), it seems like
> > both are possible.
> >
> > So let's assume that migration nodes cannot be other DRAM nodes.
> > Otherwise, memory pressure could be intentionally or unintentionally
> > induced to migrate these pages to another node. Do we have such a
> > restriction on migration nodes?
>
> There's nothing explicit. On a normal, balanced system where there's a
> 1:1:1 relationship between CPU sockets, DRAM nodes and PMEM nodes, it's
> implicit since the migration path is one deep and goes from DRAM->PMEM.
>
> If there were some oddball system where there was a memory only DRAM
> node, it might very well end up being a migration target.
>

Shouldn't DRAM->DRAM demotion be banned? It's all DRAM and within the
control of mempolicies and cpusets today, so I had assumed this is outside
the scope of memory tiering support. I had assumed that memory tiering
support was all about separate tiers :)

> >> +static struct page *alloc_demote_node_page(struct page *page, unsigned long node)
> >> +{
> >> + /*
> >> + * 'mask' targets allocation only to the desired node in the
> >> + * migration path, and fails fast if the allocation can not be
> >> + * immediately satisfied. Reclaim is already active and heroic
> >> + * allocation efforts are unwanted.
> >> + */
> >> + gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY |
> >> + __GFP_NOMEMALLOC | __GFP_THISNODE | __GFP_HIGHMEM |
> >> + __GFP_MOVABLE;
> >
> > GFP_NOWAIT has the side-effect that it does __GFP_KSWAPD_RECLAIM: do we
> > actually want to kick kswapd on the pmem node?
>
> In my mental model, cold data flows from:
>
> DRAM -> PMEM -> swap
>
> Kicking kswapd here ensures that while we're doing DRAM->PMEM migrations
> for kinda cold data, kswapd can be working on doing the PMEM->swap part
> on really cold data.
>

Makes sense.

2020-07-01 19:46:47

by David Rientjes

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard

On Wed, 1 Jul 2020, Yang Shi wrote:

> > We can do this if we consider pmem not to be a separate memory tier from
> > the system perspective, however, but rather the socket perspective. In
> > other words, a node can only demote to a series of exclusive pmem ranges
> > and promote to the same series of ranges in reverse order. So DRAM node 0
> > can only demote to PMEM node 2 while DRAM node 1 can only demote to PMEM
> > node 3 -- a pmem range cannot be demoted to, or promoted from, more than
> > one DRAM node.
> >
> > This naturally takes care of mbind() and cpuset.mems if we consider pmem
> > just to be slower volatile memory and we don't need to deal with the
> > latency concerns of cross socket migration. A user page will never be
> > demoted to a pmem range across the socket and will never be promoted to a
> > different DRAM node that it doesn't have access to.
>
> But I don't see too much benefit to limit the migration target to the
> so-called *paired* pmem node. IMHO it is fine to migrate to a remote (on a
> different socket) pmem node since even the cross socket access should be much
> faster than refault or swap from disk.
>

Hi Yang,

Right, but any eventual promotion path would allow this to subvert the
user mempolicy or cpuset.mems if the demoted memory is eventually promoted
to a DRAM node on its socket. We've discussed not having the ability to
map from the demoted page to either of these contexts and it becomes more
difficult for shared memory. We have page_to_nid() and page_zone() so we
can always find the appropriate demotion or promotion node for a given
page if there is a 1:1 relationship.

Do we lose anything with the strict 1:1 relationship between DRAM and PMEM
nodes? It seems much simpler in terms of implementation and is more
intuitive.

> I think using pmem as a node is more natural than zone and less intrusive
> since we can just reuse all the numa APIs. If we treat pmem as a new zone I
> think the implementation may be more intrusive and complicated (i.e. need a
> new gfp flag) and user can't control the memory placement.
>

This is an important decision to make, I'm not sure that we actually
*want* all of these NUMA APIs :) If my memory is demoted, I can simply do
migrate_pages() back to DRAM and cause other memory to be demoted in its
place. Things like MPOL_INTERLEAVE over nodes {0,1,2} don't make sense.
Kswapd for a DRAM node putting pressure on a PMEM node for demotion that
then puts the kswapd for the PMEM node under pressure to reclaim it serves
*only* to spend unnecessary cpu cycles.

Users could control the memory placement through a new mempolicy flag,
which I think is needed anyway for explicit allocation policies for PMEM
nodes. Consider if PMEM is a zone so that it has the natural 1:1
relationship with DRAM, now your system only has nodes {0,1} as today, no
new NUMA topology to consider, and a mempolicy flag MPOL_F_TOPTIER that
specifies memory must be allocated from ZONE_MOVABLE or ZONE_NORMAL (and I
can then mlock() if I want to disable demotion on memory pressure).
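
Purely as a strawman of what that could look like from userspace
(MPOL_F_TOPTIER and its value are invented here; only MPOL_BIND and
set_mempolicy() exist today):

#include <numaif.h>

/* Hypothetical flag and value, not part of any kernel ABI today. */
#ifndef MPOL_F_TOPTIER
#define MPOL_F_TOPTIER  (1 << 13)
#endif

/*
 * Ask that this task's memory be placed only in the top-tier (DRAM)
 * zones of nodes 0 and 1; demotion would then be a separate knob.
 */
int bind_to_top_tier(void)
{
        unsigned long nodemask = (1UL << 0) | (1UL << 1);

        return set_mempolicy(MPOL_BIND | MPOL_F_TOPTIER,
                             &nodemask, 8 * sizeof(nodemask));
}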

2020-07-01 19:51:05

by David Rientjes

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard

On Wed, 1 Jul 2020, Dave Hansen wrote:

> Even if they don't allocate directly from PMEM, is it OK for such an app
> to get its cold data migrated to PMEM? That's a much more subtle
> question and I suspect the kernel isn't going to have a single answer
> for it. I suspect we'll need a cpuset-level knob to turn auto-demotion
> on or off.
>

I think the answer is whether the app's cold data can be reclaimed,
otherwise migration to PMEM is likely better in terms of performance. So
any such app today should just be mlocking its cold data if it can't
handle overhead from reclaim?

2020-07-02 01:23:13

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/8] mm/numa: automatically generate node migration order

Dave Hansen <[email protected]> writes:

> On 6/30/20 1:22 AM, Huang, Ying wrote:
>>> + /*
>>> + * To avoid cycles in the migration "graph", ensure
>>> + * that migration sources are not future targets by
>>> + * setting them in 'used_targets'.
>>> + *
>>> + * But, do this only once per pass so that multiple
>>> + * source nodes can share a target node.
>> establish_migrate_target() calls find_next_best_node(), which will set
>> target_node in used_targets. So it seems that the nodes_or() below is
>> only necessary to initialize used_targets, and multiple source nodes
>> cannot share one target node in current implementation.
>
> Yes, that is true. My focus on this implementation was simplicity and
> sanity for common configurations. I can certainly imagine scenarios
> where this is suboptimal.
>
> I'm totally open to other ways of doing this.

OK. So when we really need to share one target node among multiple
source nodes, we can add a parameter to find_next_best_node() to specify
whether to set target_node in used_targets.
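
A rough sketch of that tweak, with a grossly simplified distance-only
selection standing in for the real find_next_best_node() body (the
'claim_target' parameter is the proposed addition):

static int find_next_best_node(int node, nodemask_t *used_node_mask,
                               bool claim_target)
{
        int n, best_node = NUMA_NO_NODE;
        int best_distance = INT_MAX;

        for_each_node_state(n, N_MEMORY) {
                if (node_isset(n, *used_node_mask))
                        continue;
                if (node_distance(node, n) < best_distance) {
                        best_distance = node_distance(node, n);
                        best_node = n;
                }
        }

        /*
         * Only consume the target when the caller asks for exclusive
         * targets; leaving it clear in 'used_node_mask' lets several
         * source nodes share one demotion target.
         */
        if (claim_target && best_node != NUMA_NO_NODE)
                node_set(best_node, *used_node_mask);

        return best_node;
}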

Best Regards,
Huang, Ying

2020-07-02 01:51:48

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard

David Rientjes <[email protected]> writes:

> On Wed, 1 Jul 2020, Dave Hansen wrote:
>
>> Even if they don't allocate directly from PMEM, is it OK for such an app
>> to get its cold data migrated to PMEM? That's a much more subtle
>> question and I suspect the kernel isn't going to have a single answer
>> for it. I suspect we'll need a cpuset-level knob to turn auto-demotion
>> on or off.
>>
>
> I think the answer is whether the app's cold data can be reclaimed,
> otherwise migration to PMEM is likely better in terms of performance. So
> any such app today should just be mlocking its cold data if it can't
> handle overhead from reclaim?

Yes. That's a way to solve the problem. A cpuset-level knob may be
more flexible, because you don't need to change the application source
code.

Best Regards,
Huang, Ying

2020-07-02 05:05:16

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard

David Rientjes <[email protected]> writes:

> On Wed, 1 Jul 2020, Dave Hansen wrote:
>
>> > Could this cause us to break a user's mbind() or allow a user to
>> > circumvent their cpuset.mems?
>>
>> In its current form, yes.
>>
>> My current rationale for this is that while it's not as deferential as
>> it can be to the user/kernel ABI contract, it's good *overall* behavior.
>> The auto-migration only kicks in when the data is about to go away. So
>> while the user's data might be slower than they like, it is *WAY* faster
>> than they deserve because it should be off on the disk.
>>
>
> It's outside the scope of this patchset, but eventually there will be a
> promotion path that I think requires a strict 1:1 relationship between
> DRAM and PMEM nodes because otherwise mbind(), set_mempolicy(), and
> cpuset.mems become ineffective for nodes facing memory pressure.

I have posted a patchset for AutoNUMA-based promotion support:

https://lore.kernel.org/lkml/[email protected]/

There, the page is promoted upon a NUMA hint page fault. So all memory
policies (mbind(), set_mempolicy(), and cpuset.mems) are available. We
can refuse to promote the page to DRAM nodes that are not allowed by
any memory policy. So, a 1:1 relationship isn't necessary for promotion.

> For the purposes of this patchset, agreed that DRAM -> PMEM -> swap makes
> perfect sense. Theoretically, I think you could have DRAM N0 and N1 and
> then a single PMEM N2 and this N2 can be the terminal node for both N0 and
> N1. On promotion, I think we need to rely on something stronger than
> autonuma to decide which DRAM node to promote to: specifically any user
> policy put into effect (memory tiering or autonuma shouldn't be allowed to
> subvert these user policies).
>
> As others have mentioned, we lose the allocation or process context at the
> time of demotion or promotion

As above, we have the process context at the time of promotion.

> and any workaround for that requires some
> hacks, such as mapping the page to cpuset (what is the right solution for
> shared pages?) or adding NUMA locality handling to memcg.

It sounds natural to me to add a NUMA node restriction to memcg.

Best Regards,
Huang, Ying

2020-07-02 10:04:27

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard

On Wed, 1 Jul 2020 12:45:17 -0700
David Rientjes <[email protected]> wrote:

> On Wed, 1 Jul 2020, Yang Shi wrote:
>
> > > We can do this if we consider pmem not to be a separate memory tier from
> > > the system perspective, however, but rather the socket perspective. In
> > > other words, a node can only demote to a series of exclusive pmem ranges
> > > and promote to the same series of ranges in reverse order. So DRAM node 0
> > > can only demote to PMEM node 2 while DRAM node 1 can only demote to PMEM
> > > node 3 -- a pmem range cannot be demoted to, or promoted from, more than
> > > one DRAM node.
> > >
> > > This naturally takes care of mbind() and cpuset.mems if we consider pmem
> > > just to be slower volatile memory and we don't need to deal with the
> > > latency concerns of cross socket migration. A user page will never be
> > > demoted to a pmem range across the socket and will never be promoted to a
> > > different DRAM node that it doesn't have access to.
> >
> > But I don't see too much benefit to limit the migration target to the
> > so-called *paired* pmem node. IMHO it is fine to migrate to a remote (on a
> > different socket) pmem node since even the cross socket access should be much
> > faster than refault or swap from disk.
> >
>
> Hi Yang,
>
> Right, but any eventual promotion path would allow this to subvert the
> user mempolicy or cpuset.mems if the demoted memory is eventually promoted
> to a DRAM node on its socket. We've discussed not having the ability to
> map from the demoted page to either of these contexts and it becomes more
> difficult for shared memory. We have page_to_nid() and page_zone() so we
> can always find the appropriate demotion or promotion node for a given
> page if there is a 1:1 relationship.
>
> Do we lose anything with the strict 1:1 relationship between DRAM and PMEM
> nodes? It seems much simpler in terms of implementation and is more
> intuitive.

Hi David, Yang,

The 1:1 mapping implies a particular system topology. In the medium
term we are likely to see systems with a central pool of persistent memory
with equal access characteristics from multiple CPU-containing nodes, each
with local DRAM.

Clearly we could fake a split of such a pmem pool to keep the 1:1 mapping,
but it's certainly not elegant and may be very wasteful of resources.

Can a zone based approach work well without such a hard wall?

Jonathan

>
> > I think using pmem as a node is more natural than zone and less intrusive
> > since we can just reuse all the numa APIs. If we treat pmem as a new zone I
> > think the implementation may be more intrusive and complicated (i.e. need a
> > new gfp flag) and user can't control the memory placement.
> >
>
> This is an important decision to make, I'm not sure that we actually
> *want* all of these NUMA APIs :) If my memory is demoted, I can simply do
> migrate_pages() back to DRAM and cause other memory to be demoted in its
> place. Things like MPOL_INTERLEAVE over nodes {0,1,2} don't make sense.
> Kswapd for a DRAM node putting pressure on a PMEM node for demotion that
> then puts the kswapd for the PMEM node under pressure to reclaim it serves
> *only* to spend unnecessary cpu cycles.
>
> Users could control the memory placement through a new mempolicy flag,
> which I think is needed anyway for explicit allocation policies for PMEM
> nodes. Consider if PMEM is a zone so that it has the natural 1:1
> relationship with DRAM, now your system only has nodes {0,1} as today, no
> new NUMA topology to consider, and a mempolicy flag MPOL_F_TOPTIER that
> specifies memory must be allocated from ZONE_MOVABLE or ZONE_NORMAL (and I
> can then mlock() if I want to disable demotion on memory pressure).
>


2020-07-03 09:32:56

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC][PATCH 8/8] mm/numa: new reclaim mode to enable reclaim-based migration

Dave Hansen <[email protected]> writes:
> +/*
> + * React to hotplug events that might online or offline
> + * NUMA nodes.
> + *
> + * This leaves migrate-on-reclaim transiently disabled
> + * between the MEM_GOING_OFFLINE and MEM_OFFLINE events.
> + * This runs whether RECLAIM_MIGRATE is enabled or not.
> + * That ensures that the user can turn RECLAIM_MIGRATE on and
> + * off without needing to recalculate migration targets.
> + */
> +#if defined(CONFIG_MEMORY_HOTPLUG)
> +static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
> + unsigned long action, void *arg)
> +{
> + switch (action) {
> + case MEM_GOING_OFFLINE:
> + /*
> + * Make sure there are not transient states where
> + * an offline node is a migration target. This
> + * will leave migration disabled until the offline
> + * completes and the MEM_OFFLINE case below runs.
> + */
> + disable_all_migrate_targets();
> + break;
> + case MEM_OFFLINE:
> + case MEM_ONLINE:
> + /*
> + * Recalculate the target nodes once the node
> + * reaches its final state (online or offline).
> + */
> + set_migration_target_nodes();
> + break;
> + case MEM_CANCEL_OFFLINE:
> + /*
> + * MEM_GOING_OFFLINE disabled all the migration
> + * targets. Reenable them.
> + */
> + set_migration_target_nodes();
> + break;
> + case MEM_GOING_ONLINE:
> + case MEM_CANCEL_ONLINE:
> + break;

I think we need to call
disable_all_migrate_targets()/set_migration_target_nodes() for CPU
online/offline events too, because those events influence node_state(nid,
N_CPU), which in turn influences the node demotion relationship. A rough
sketch of that is below the quoted hunk.

> + }
> +
> + return notifier_from_errno(0);
> }
> +
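
Something along these lines, perhaps (a sketch only: the names are made up,
CPUHP_AP_ONLINE_DYN is just a convenient dynamic state, and whether the
callbacks run after node_state(nid, N_CPU) has been updated still needs
checking):

static int migrate_on_reclaim_cpu_event(unsigned int cpu)
{
        /*
         * A node gaining or losing its last CPU changes
         * node_state(nid, N_CPU), so rebuild the demotion order.
         */
        set_migration_target_nodes();
        return 0;
}

static int __init register_demotion_cpu_notifier(void)
{
        int ret;

        ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mm/demotion:online",
                                migrate_on_reclaim_cpu_event,
                                migrate_on_reclaim_cpu_event);
        return ret < 0 ? ret : 0;
}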

Best Regards,
Huang, Ying