2020-10-07 18:06:06

by Dave Hansen

Subject: [RFC][PATCH 0/9] [v4][RESEND] Migrate Pages in lieu of discard


Changes since (automigrate-20200818):
* Fall back to normal reclaim when demotion fails

The full series is also available here:

https://github.com/hansendc/linux/tree/automigrate-20201007

I really just want folks to look at:

[RFC][PATCH 5/9] mm/migrate: demote pages during reclaim

I've reworked that so that it can both use the high-level migration
API, and fall back to normal reclaim if migration fails. I think
that gives us the best of both worlds.

I'm posting the series in case folks want to run the whole thing.

--

We're starting to see systems with more and more kinds of memory such
as Intel's implementation of persistent memory.

Let's say you have a system with some DRAM and some persistent memory.
Today, once DRAM fills up, reclaim will start and some of the DRAM
contents will be thrown out. Allocations will, at some point, start
falling over to the slower persistent memory.

That has two nasty properties. First, the newer allocations can end
up in the slower persistent memory. Second, reclaimed data in DRAM
are just discarded even if there are gobs of space in persistent
memory that could be used.

This set implements a solution to these problems. At the end of the
reclaim process in shrink_page_list() just before the last page
refcount is dropped, the page is migrated to persistent memory instead
of being dropped.

While I've talked about a DRAM/PMEM pairing, this approach would
function in any environment where memory tiers exist.
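
To make the described flow concrete, here is a minimal, standalone
userspace sketch of the decision this series adds. This is not the
code from patch 5/9: reclaim_page(), demote_page() and the
node_demotion[] contents below are made up for illustration; only
next_demotion_node() mirrors a helper from the series.

#include <stdbool.h>
#include <stdio.h>

#define NUMA_NO_NODE	-1

/* Toy stand-in for the kernel's node_demotion[] (values made up). */
static const int node_demotion[] = { 2, 3, NUMA_NO_NODE, NUMA_NO_NODE };

static int next_demotion_node(int node)
{
	return node_demotion[node];
}

/* Pretend demotion can fail, e.g. no target node or the target is full. */
static bool demote_page(int page, int target)
{
	(void)page;	/* the sketch does not inspect the page itself */
	return target != NUMA_NO_NODE;
}

/*
 * The decision added to reclaim: try to demote the page one tier
 * down; if that is not possible, fall back to normal reclaim
 * (swap or discard), as described above.
 */
static void reclaim_page(int page, int node)
{
	int target = next_demotion_node(node);

	if (demote_page(page, target))
		printf("page %d: demoted from node %d to node %d\n",
		       page, node, target);
	else
		printf("page %d: normal reclaim (swap/discard)\n", page);
}

int main(void)
{
	reclaim_page(0, 0);	/* DRAM node 0 demotes to PMEM node 2 */
	reclaim_page(1, 2);	/* PMEM node 2 has no target: falls back */
	return 0;
}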

This is not perfect. It "strands" pages in slower memory and never
brings them back to fast DRAM. Other things need to be built to
promote hot pages back to DRAM.

This is also all based on an upstream mechanism that allows
persistent memory to be onlined and used as if it were volatile:

http://lkml.kernel.org/r/[email protected]

== Open Issues ==

* For cpusets and memory policies that restrict allocations
to PMEM, is it OK to demote to PMEM? Do we need a cgroup-
level API to opt-in or opt-out of these migrations?

--

Changes since (https://lwn.net/Articles/824830/):
* Use higher-level migrate_pages() API approach from Yang Shi's
earlier patches.
* made sure to actually check node_reclaim_mode's new bit
* disabled migration entirely before introducing RECLAIM_MIGRATE
* Replace GFP_NOWAIT with explicit __GFP_KSWAPD_RECLAIM and
comment why we want that.
* Comment on the effects of the code that keeps multiple source
  nodes from sharing target nodes

Cc: Yang Shi <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>


2020-10-07 18:06:11

by Dave Hansen

Subject: [RFC][PATCH 2/9] mm/numa: automatically generate node migration order


From: Dave Hansen <[email protected]>

When memory fills up on a node, memory contents can be
automatically migrated to another node. The biggest problems are
knowing when to migrate and to where the migration should be
targeted.

The most straightforward way to generate the "to where" list
would be to follow the page allocator fallback lists. Those
lists already tell us where to look next when memory is full. It
would also be logical to move memory in that order.

But, the allocator fallback lists have a fatal flaw: most nodes
appear in all the lists. This would potentially lead to
migration cycles (A->B, B->A, A->B, ...).
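
For instance (node numbering and ordering here are hypothetical):

	node 0 fallback order: 1, 2, 3
	node 1 fallback order: 0, 3, 2

Following these lists directly would yield node0->node1 and
node1->node0: an immediate cycle.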

Instead of using the allocator fallback lists directly, keep a
separate node migration ordering. But, reuse the same data used
to generate page allocator fallback in the first place:
find_next_best_node().

This means that the firmware data used to populate node distances
essentially dictates the ordering for now. It should also be
architecture-neutral since all NUMA architectures have a working
find_next_best_node().
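
As a concrete (and entirely hypothetical) example: on a two-socket
box where nodes 0-1 are DRAM with CPUs and nodes 2-3 are CPU-less
PMEM, with node 2 nearest node 0 and node 3 nearest node 1, one
would expect the code below to end up with roughly:

	node_demotion[0] == 2            /* DRAM 0 demotes to its local PMEM */
	node_demotion[1] == 3            /* DRAM 1 demotes to its local PMEM */
	node_demotion[2] == NUMA_NO_NODE /* end of the chain */
	node_demotion[3] == NUMA_NO_NODE /* end of the chain */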

The protocol for node_demotion[] access and writing is not
standard. It has no specific locking and is intended to be read
locklessly. Readers must take care to avoid observing changes
that appear incoherent. This was done so that node_demotion[]
locking has no chance of becoming a bottleneck on large systems
with lots of CPUs in direct reclaim.

This code is unused for now. It will be called later in the
series.

Signed-off-by: Dave Hansen <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>
---

b/mm/internal.h | 1
b/mm/migrate.c | 137 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
b/mm/page_alloc.c | 2
3 files changed, 138 insertions(+), 2 deletions(-)

diff -puN mm/internal.h~auto-setup-default-migration-path-from-firmware mm/internal.h
--- a/mm/internal.h~auto-setup-default-migration-path-from-firmware 2020-10-07 09:15:27.027642452 -0700
+++ b/mm/internal.h 2020-10-07 09:15:27.039642452 -0700
@@ -203,6 +203,7 @@ extern int user_min_free_kbytes;

extern void zone_pcp_update(struct zone *zone);
extern void zone_pcp_reset(struct zone *zone);
+extern int find_next_best_node(int node, nodemask_t *used_node_mask);

#if defined CONFIG_COMPACTION || defined CONFIG_CMA

diff -puN mm/migrate.c~auto-setup-default-migration-path-from-firmware mm/migrate.c
--- a/mm/migrate.c~auto-setup-default-migration-path-from-firmware 2020-10-07 09:15:27.031642452 -0700
+++ b/mm/migrate.c 2020-10-07 09:15:27.041642452 -0700
@@ -1161,6 +1161,10 @@ out:
return rc;
}

+/*
+ * Writes to this array occur without locking. READ_ONCE()
+ * is recommended for readers to ensure consistent reads.
+ */
static int node_demotion[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE};

/**
@@ -1174,7 +1178,13 @@ static int node_demotion[MAX_NUMNODES] =
*/
int next_demotion_node(int node)
{
- return node_demotion[node];
+ /*
+ * node_demotion[] is updated without excluding
+ * this function from running. READ_ONCE() avoids
+ * reading multiple, inconsistent 'node' values
+ * during an update.
+ */
+ return READ_ONCE(node_demotion[node]);
}

/*
@@ -3112,3 +3122,128 @@ void migrate_vma_finalize(struct migrate
}
EXPORT_SYMBOL(migrate_vma_finalize);
#endif /* CONFIG_DEVICE_PRIVATE */
+
+/* Disable reclaim-based migration. */
+static void disable_all_migrate_targets(void)
+{
+ int node;
+
+ for_each_online_node(node)
+ node_demotion[node] = NUMA_NO_NODE;
+}
+
+/*
+ * Find an automatic demotion target for 'node'.
+ * Failing here is OK. It might just indicate
+ * being at the end of a chain.
+ */
+static int establish_migrate_target(int node, nodemask_t *used)
+{
+ int migration_target;
+
+ /*
+ * Can not set a migration target on a
+ * node with it already set.
+ *
+ * No need for READ_ONCE() here since this is
+ * in the write path for node_demotion[].
+ * This should be the only thread writing.
+ */
+ if (node_demotion[node] != NUMA_NO_NODE)
+ return NUMA_NO_NODE;
+
+ migration_target = find_next_best_node(node, used);
+ if (migration_target == NUMA_NO_NODE)
+ return NUMA_NO_NODE;
+
+ node_demotion[node] = migration_target;
+
+ return migration_target;
+}
+
+/*
+ * When memory fills up on a node, memory contents can be
+ * automatically migrated to another node instead of
+ * discarded at reclaim.
+ *
+ * Establish a "migration path" which will start at nodes
+ * with CPUs and will follow the priorities used to build the
+ * page allocator zonelists.
+ *
+ * The difference here is that cycles must be avoided. If
+ * node0 migrates to node1, then neither node1, nor anything
+ * node1 migrates to can migrate to node0.
+ *
+ * This function can run simultaneously with readers of
+ * node_demotion[]. However, it can not run simultaneously
+ * with itself. Exclusion is provided by memory hotplug events
+ * being single-threaded.
+ */
+void __set_migration_target_nodes(void)
+{
+ nodemask_t next_pass = NODE_MASK_NONE;
+ nodemask_t this_pass = NODE_MASK_NONE;
+ nodemask_t used_targets = NODE_MASK_NONE;
+ int node;
+
+ /*
+ * Avoid any oddities like cycles that could occur
+ * from changes in the topology. This will leave
+ * a momentary gap when migration is disabled.
+ */
+ disable_all_migrate_targets();
+
+ /*
+ * Ensure that the "disable" is visible across the system.
+ * Readers will see either a combination of before+disable
+ * state or disable+after. They will never see before and
+ * after state together.
+ *
+ * The before+after state together might have cycles and
+ * could cause readers to do things like loop until this
+ * function finishes. This ensures they can only see a
+ * single "bad" read and would, for instance, only loop
+ * once.
+ */
+ smp_wmb();
+
+ /*
+ * Allocations go close to CPUs, first. Assume that
+ * the migration path starts at the nodes with CPUs.
+ */
+ next_pass = node_states[N_CPU];
+again:
+ this_pass = next_pass;
+ next_pass = NODE_MASK_NONE;
+ /*
+ * To avoid cycles in the migration "graph", ensure
+ * that migration sources are not future targets by
+ * setting them in 'used_targets'. Do this only
+ * once per pass so that multiple source nodes can
+ * share a target node.
+ *
+ * 'used_targets' will become unavailable in future
+ * passes. This limits some opportunities for
+ * multiple source nodes to share a destination.
+ */
+ nodes_or(used_targets, used_targets, this_pass);
+ for_each_node_mask(node, this_pass) {
+ int target_node = establish_migrate_target(node, &used_targets);
+
+ if (target_node == NUMA_NO_NODE)
+ continue;
+
+ /* Visit targets from this pass in the next pass: */
+ node_set(target_node, next_pass);
+ }
+ /* Is another pass necessary? */
+ if (!nodes_empty(next_pass))
+ goto again;
+}
+
+void set_migration_target_nodes(void)
+{
+ get_online_mems();
+ __set_migration_target_nodes();
+ put_online_mems();
+}
diff -puN mm/page_alloc.c~auto-setup-default-migration-path-from-firmware mm/page_alloc.c
--- a/mm/page_alloc.c~auto-setup-default-migration-path-from-firmware 2020-10-07 09:15:27.035642452 -0700
+++ b/mm/page_alloc.c 2020-10-07 09:15:27.043642452 -0700
@@ -5632,7 +5632,7 @@ static int node_load[MAX_NUMNODES];
*
* Return: node id of the found node or %NUMA_NO_NODE if no node is found.
*/
-static int find_next_best_node(int node, nodemask_t *used_node_mask)
+int find_next_best_node(int node, nodemask_t *used_node_mask)
{
int n, val;
int min_val = INT_MAX;
_

2020-10-07 18:06:22

by Dave Hansen

Subject: [RFC][PATCH 3/9] mm/migrate: update migration order on hotplug events


From: Dave Hansen <[email protected]>

Reclaim-based migration is attempting to optimize data placement in
memory based on the system topology. If the system changes, so must
the migration ordering.

The implementation here is pretty simple and entirely unoptimized. On
any memory or CPU hotplug events, assume that a node was added or
removed and recalculate all migration targets. This ensures that the
node_demotion[] array is always ready to be used in case the new
reclaim mode is enabled.

This recalculation is far from optimal, most glaringly that it does
not even attempt to figure out if nodes are actually coming or going.
But, given the expected paucity of hotplug events, this should be
fine.
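
For example (hypothetical topology): hot-adding a CPU-less PMEM node 4
next to node 1 triggers a MEM_ONLINE notification, the whole
node_demotion[] array is rebuilt, and node 1 may then demote to the
new node 4 rather than its previous target, without any action from
the user.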

Signed-off-by: Dave Hansen <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>
---

b/mm/migrate.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 93 insertions(+)

diff -puN mm/migrate.c~enable-numa-demotion mm/migrate.c
--- a/mm/migrate.c~enable-numa-demotion 2020-10-07 09:15:28.260642449 -0700
+++ b/mm/migrate.c 2020-10-07 09:15:28.266642449 -0700
@@ -49,6 +49,7 @@
#include <linux/sched/mm.h>
#include <linux/ptrace.h>
#include <linux/oom.h>
+#include <linux/memory.h>

#include <asm/tlbflush.h>

@@ -3241,9 +3242,101 @@ again:
goto again;
}

+/*
+ * For callers that do not hold get_online_mems() already.
+ */
void set_migration_target_nodes(void)
{
get_online_mems();
__set_migration_target_nodes();
put_online_mems();
}
+
+/*
+ * React to hotplug events that might affect the migration targets,
+ * like events that online or offline NUMA nodes.
+ *
+ * The ordering is also currently dependent on which nodes have
+ * CPUs. That means we need CPU on/offline notification too.
+ */
+static int migration_online_cpu(unsigned int cpu)
+{
+ set_migration_target_nodes();
+ return 0;
+}
+
+static int migration_offline_cpu(unsigned int cpu)
+{
+ set_migration_target_nodes();
+ return 0;
+}
+
+/*
+ * This leaves migrate-on-reclaim transiently disabled
+ * between the MEM_GOING_OFFLINE and MEM_OFFLINE events.
+ * This runs whether reclaim-based migration is enabled or not.
+ * This ensures that the user can turn reclaim-based
+ * migration on at any time without needing to recalculate
+ * migration targets.
+ *
+ * These callbacks already hold get_online_mems(). That
+ * is why __set_migration_target_nodes() can be used as
+ * opposed to set_migration_target_nodes().
+ */
+#if defined(CONFIG_MEMORY_HOTPLUG)
+static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
+ unsigned long action, void *arg)
+{
+ switch (action) {
+ case MEM_GOING_OFFLINE:
+ /*
+ * Make sure there are no transient states where
+ * an offline node is a migration target. This
+ * will leave migration disabled until the offline
+ * completes and the MEM_OFFLINE case below runs.
+ */
+ disable_all_migrate_targets();
+ break;
+ case MEM_OFFLINE:
+ case MEM_ONLINE:
+ /*
+ * Recalculate the target nodes once the node
+ * reaches its final state (online or offline).
+ */
+ __set_migration_target_nodes();
+ break;
+ case MEM_CANCEL_OFFLINE:
+ /*
+ * MEM_GOING_OFFLINE disabled all the migration
+ * targets. Reenable them.
+ */
+ __set_migration_target_nodes();
+ break;
+ case MEM_GOING_ONLINE:
+ case MEM_CANCEL_ONLINE:
+ break;
+ }
+
+ return notifier_from_errno(0);
+}
+
+static int __init migrate_on_reclaim_init(void)
+{
+ int ret;
+
+ ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "migrate on reclaim",
+ migration_online_cpu,
+ migration_offline_cpu);
+ /*
+ * In the unlikely case that this fails, the automatic
+ * migration targets may become suboptimal for nodes
+ * where N_CPU changes. With such a small impact in a
+ * rare case, do not bother trying to do anything special.
+ */
+ WARN_ON(ret < 0);
+
+ hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
+ return 0;
+}
+late_initcall(migrate_on_reclaim_init);
+#endif /* CONFIG_MEMORY_HOTPLUG */
_

2020-10-07 18:06:26

by Dave Hansen

Subject: [RFC][PATCH 8/9] mm/vmscan: never demote for memcg reclaim


From: Dave Hansen <[email protected]>

Global reclaim aims to reduce the amount of memory used on
a given node or set of nodes. Migrating pages to another
node serves this purpose.

memcg reclaim is different. Its goal is to reduce the
total memory consumption of the entire memcg, across all
nodes. Migration does not assist memcg reclaim because
it just moves page contents between nodes rather than
actually reducing memory consumption.
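
For example (numbers made up): demoting a 2MB THP from a DRAM node to
a PMEM node changes where the memcg's pages live, but not how much is
charged:

	memcg usage before demotion: 512MB (all on DRAM node 0)
	memcg usage after demotion:  512MB (510MB on node 0, 2MB on PMEM node 2)

Usage is unchanged, so demotion gets memcg reclaim no closer to its
target; it has to actually free or swap pages instead.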

Signed-off-by: Dave Hansen <[email protected]>
Suggested-by: Yang Shi <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>
---

b/mm/vmscan.c | 33 +++++++++++++++++++++++++--------
1 file changed, 25 insertions(+), 8 deletions(-)

diff -puN mm/vmscan.c~never-demote-for-memcg-reclaim mm/vmscan.c
--- a/mm/vmscan.c~never-demote-for-memcg-reclaim 2020-10-07 09:15:34.546642433 -0700
+++ b/mm/vmscan.c 2020-10-07 09:15:34.554642433 -0700
@@ -291,8 +291,11 @@ static bool writeback_throttling_sane(st
#endif

static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
- int node_id)
+ int node_id,
+ struct scan_control *sc)
{
+ bool in_cgroup_reclaim = false;
+
/* Always age anon pages when we have swap */
if (memcg == NULL) {
if (get_nr_swap_pages() > 0)
@@ -302,8 +305,18 @@ static inline bool can_reclaim_anon_page
return true;
}

- /* Also age anon pages if we can auto-migrate them */
- if (next_demotion_node(node_id) >= 0)
+ /* Can only be in memcg reclaim in paths with valid 'sc': */
+ if (sc && cgroup_reclaim(sc))
+ in_cgroup_reclaim = true;
+
+ /*
+ * Also age anon pages if we can auto-migrate them.
+ *
+ * Migrating a page does not reduce consumption of a
+ * memcg, so it should not be performed when in memcg
+ * reclaim.
+ */
+ if (!in_cgroup_reclaim && (next_demotion_node(node_id) >= 0))
return true;

/* No way to reclaim anon pages */
@@ -321,7 +334,7 @@ unsigned long zone_reclaimable_pages(str

nr = zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_FILE) +
zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_FILE);
- if (can_reclaim_anon_pages(NULL, zone_to_nid(zone)))
+ if (can_reclaim_anon_pages(NULL, zone_to_nid(zone), NULL))
nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) +
zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON);

@@ -1064,6 +1077,10 @@ bool migrate_demote_page_ok(struct page
VM_BUG_ON_PAGE(PageHuge(page), page);
VM_BUG_ON_PAGE(PageLRU(page), page);

+ /* It is pointless to do demotion in memcg reclaim */
+ if (cgroup_reclaim(sc))
+ return false;
+
if (next_nid == NUMA_NO_NODE)
return false;
if (PageTransHuge(page) && !thp_migration_supported())
@@ -2368,7 +2385,7 @@ static void get_scan_count(struct lruvec
enum lru_list lru;

/* If we have no swap space, do not bother scanning anon pages. */
- if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id)) {
+ if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id, sc)) {
scan_balance = SCAN_FILE;
goto out;
}
@@ -2653,7 +2670,7 @@ static void shrink_lruvec(struct lruvec
* rebalance the anon lru active/inactive ratio.
*/
if (can_reclaim_anon_pages(lruvec_memcg(lruvec),
- lruvec_pgdat(lruvec)->node_id) &&
+ lruvec_pgdat(lruvec)->node_id, sc) &&
inactive_is_low(lruvec, LRU_INACTIVE_ANON))
shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
sc, LRU_ACTIVE_ANON);
@@ -2724,7 +2741,7 @@ static inline bool should_continue_recla
*/
pages_for_compaction = compact_gap(sc->order);
inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
- if (can_reclaim_anon_pages(NULL, pgdat->node_id))
+ if (can_reclaim_anon_pages(NULL, pgdat->node_id, sc))
inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);

return inactive_lru_pages > pages_for_compaction;
@@ -3483,7 +3500,7 @@ static void age_active_anon(struct pglis
struct mem_cgroup *memcg;
struct lruvec *lruvec;

- if (!can_reclaim_anon_pages(NULL, pgdat->node_id))
+ if (!can_reclaim_anon_pages(NULL, pgdat->node_id, sc))
return;

lruvec = mem_cgroup_lruvec(NULL, pgdat);
_

2020-10-07 18:06:29

by Dave Hansen

Subject: [RFC][PATCH 6/9] mm/vmscan: add page demotion counter


From: Yang Shi <[email protected]>

Account the number of demoted pages into reclaim_state->nr_demoted.

Add pgdemote_kswapd and pgdemote_direct VM counters shown in
/proc/vmstat.

[ daveh:
- __count_vm_events() a bit, and made them look at the THP
size directly rather than getting data from migrate_pages()
]
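
With this applied, the new counters show up in /proc/vmstat; for
illustration only (the values below are made up):

	pgdemote_kswapd 10394
	pgdemote_direct 217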

Signed-off-by: Yang Shi <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>
---

b/include/linux/vm_event_item.h | 2 ++
b/mm/vmscan.c | 6 ++++++
b/mm/vmstat.c | 2 ++
3 files changed, 10 insertions(+)

diff -puN include/linux/vm_event_item.h~mm-vmscan-add-page-demotion-counter include/linux/vm_event_item.h
--- a/include/linux/vm_event_item.h~mm-vmscan-add-page-demotion-counter 2020-10-07 09:15:32.171642439 -0700
+++ b/include/linux/vm_event_item.h 2020-10-07 09:15:32.179642439 -0700
@@ -33,6 +33,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
PGREUSE,
PGSTEAL_KSWAPD,
PGSTEAL_DIRECT,
+ PGDEMOTE_KSWAPD,
+ PGDEMOTE_DIRECT,
PGSCAN_KSWAPD,
PGSCAN_DIRECT,
PGSCAN_DIRECT_THROTTLE,
diff -puN mm/vmscan.c~mm-vmscan-add-page-demotion-counter mm/vmscan.c
--- a/mm/vmscan.c~mm-vmscan-add-page-demotion-counter 2020-10-07 09:15:32.173642439 -0700
+++ b/mm/vmscan.c 2020-10-07 09:15:32.180642439 -0700
@@ -147,6 +147,7 @@ struct scan_control {
unsigned int immediate;
unsigned int file_taken;
unsigned int taken;
+ unsigned int demoted;
} nr;

/* for recording the reclaimed slab by now */
@@ -1134,6 +1135,11 @@ static unsigned int demote_page_list(str
target_nid, MIGRATE_ASYNC, MR_DEMOTION,
&nr_succeeded);

+ if (current_is_kswapd())
+ __count_vm_events(PGDEMOTE_KSWAPD, nr_succeeded);
+ else
+ __count_vm_events(PGDEMOTE_DIRECT, nr_succeeded);
+
return nr_succeeded;
}

diff -puN mm/vmstat.c~mm-vmscan-add-page-demotion-counter mm/vmstat.c
--- a/mm/vmstat.c~mm-vmscan-add-page-demotion-counter 2020-10-07 09:15:32.175642439 -0700
+++ b/mm/vmstat.c 2020-10-07 09:15:32.181642439 -0700
@@ -1244,6 +1244,8 @@ const char * const vmstat_text[] = {
"pgreuse",
"pgsteal_kswapd",
"pgsteal_direct",
+ "pgdemote_kswapd",
+ "pgdemote_direct",
"pgscan_kswapd",
"pgscan_direct",
"pgscan_direct_throttle",
_

2020-10-07 18:24:06

by Oscar Salvador

Subject: Re: [RFC][PATCH 3/9] mm/migrate: update migration order on hotplug events

On 2020-10-07 18:17, Dave Hansen wrote:
> From: Dave Hansen <[email protected]>
>
> Reclaim-based migration is attempting to optimize data placement in
> memory based on the system topology. If the system changes, so must
> the migration ordering.
>
> The implementation here is pretty simple and entirely unoptimized. On
> any memory or CPU hotplug events, assume that a node was added or
> removed and recalculate all migration targets. This ensures that the
> node_demotion[] array is always ready to be used in case the new
> reclaim mode is enabled.
>
> This recalculation is far from optimal, most glaringly that it does
> not even attempt to figure out if nodes are actually coming or going.
> But, given the expected paucity of hotplug events, this should be
> fine.

Hi Dave,

I am still going through all the details, but just wanted to comment
early on this one.
Couldn't you hook into __try_online_node/try_offline_node?

In there we check whether a node should be brought up or removed due to
lack of cpus and memory.
That is being checked during hot-remove operations.

We also have node_states_check_changes_{offline,online} and their pair
node_states_{set,clear}_node, that checks during online/offline stages
which states should be removed from the node, but that is only wrt.
memory (I guess we would only be interested in N_MEMORY).

Thanks

2020-10-13 10:48:47

by Yang Shi

Subject: Re: [RFC][PATCH 0/9] [v4][RESEND] Migrate Pages in lieu of discard

On Wed, Oct 7, 2020 at 9:17 AM Dave Hansen <[email protected]> wrote:
>
>
> Changes since (automigrate-20200818):
> * Fall back to normal reclaim when demotion fails
>
> The full series is also available here:
>
> https://github.com/hansendc/linux/tree/automigrate-20201007
>
> I really just want folks to look at:
>
> [RFC][PATCH 5/9] mm/migrate: demote pages during reclaim
>
> I've reworked that so that it can both use the high-level migration
> API, and fall back to normal reclaim if migration fails. I think
> that gives us the best of both worlds.

Thanks for doing this. Although I was inclined to think the kswapds on
PMEM nodes could make enough space for retrying migration later
instead of doing swap right away, this approach might be
over-engineering and overkill. The simple immediate "retry regular
reclaim" approach also looks fine to me for the time being. We always
could optimize it later with more test results backed by real life
workloads.

>
> I'm posting the series in case folks want to run the whole thing.
>
> --
>
> We're starting to see systems with more and more kinds of memory such
> as Intel's implementation of persistent memory.
>
> Let's say you have a system with some DRAM and some persistent memory.
> Today, once DRAM fills up, reclaim will start and some of the DRAM
> contents will be thrown out. Allocations will, at some point, start
> falling over to the slower persistent memory.
>
> That has two nasty properties. First, the newer allocations can end
> up in the slower persistent memory. Second, reclaimed data in DRAM
> are just discarded even if there are gobs of space in persistent
> memory that could be used.
>
> This set implements a solution to these problems. At the end of the
> reclaim process in shrink_page_list() just before the last page
> refcount is dropped, the page is migrated to persistent memory instead
> of being dropped.
>
> While I've talked about a DRAM/PMEM pairing, this approach would
> function in any environment where memory tiers exist.
>
> This is not perfect. It "strands" pages in slower memory and never
> brings them back to fast DRAM. Other things need to be built to
> promote hot pages back to DRAM.
>
> This is also all based on an upstream mechanism that allows
> persistent memory to be onlined and used as if it were volatile:
>
> http://lkml.kernel.org/r/[email protected]
>
> == Open Issues ==
>
> * For cpusets and memory policies that restrict allocations
> to PMEM, is it OK to demote to PMEM? Do we need a cgroup-
> level API to opt-in or opt-out of these migrations?
>
> --
>
> Changes since (https://lwn.net/Articles/824830/):
> * Use higher-level migrate_pages() API approach from Yang Shi's
> earlier patches.
> * made sure to actually check node_reclaim_mode's new bit
> * disabled migration entirely before introducing RECLAIM_MIGRATE
> * Replace GFP_NOWAIT with explicit __GFP_KSWAPD_RECLAIM and
> comment why we want that.
> * Comment on the effects of the code that keeps multiple source
>   nodes from sharing target nodes
>
> Cc: Yang Shi <[email protected]>
> Cc: David Rientjes <[email protected]>
> Cc: Huang Ying <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: David Hildenbrand <[email protected]>
>

2020-10-19 21:17:00

by Huang, Ying

Subject: Re: [RFC][PATCH 6/9] mm/vmscan: add page demotion counter

Dave Hansen <[email protected]> writes:

> From: Yang Shi <[email protected]>
>
> Account the number of demoted pages into reclaim_state->nr_demoted.

It appears that you don't add "nr_demoted" into struct reclaim_state.

> Add pgdemote_kswapd and pgdemote_direct VM counters showed in
> /proc/vmstat.
>
> [ daveh:
> - __count_vm_events() a bit, and made them look at the THP
> size directly rather than getting data from migrate_pages()

It appears that we get the data from migrate_pages() now.

> ]
>
> Signed-off-by: Yang Shi <[email protected]>
> Signed-off-by: Dave Hansen <[email protected]>
> Cc: David Rientjes <[email protected]>
> Cc: Huang Ying <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: David Hildenbrand <[email protected]>
> ---
>
> b/include/linux/vm_event_item.h | 2 ++
> b/mm/vmscan.c | 6 ++++++
> b/mm/vmstat.c | 2 ++
> 3 files changed, 10 insertions(+)
>
> diff -puN include/linux/vm_event_item.h~mm-vmscan-add-page-demotion-counter include/linux/vm_event_item.h
> --- a/include/linux/vm_event_item.h~mm-vmscan-add-page-demotion-counter 2020-10-07 09:15:32.171642439 -0700
> +++ b/include/linux/vm_event_item.h 2020-10-07 09:15:32.179642439 -0700
> @@ -33,6 +33,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
> PGREUSE,
> PGSTEAL_KSWAPD,
> PGSTEAL_DIRECT,
> + PGDEMOTE_KSWAPD,
> + PGDEMOTE_DIRECT,
> PGSCAN_KSWAPD,
> PGSCAN_DIRECT,
> PGSCAN_DIRECT_THROTTLE,
> diff -puN mm/vmscan.c~mm-vmscan-add-page-demotion-counter mm/vmscan.c
> --- a/mm/vmscan.c~mm-vmscan-add-page-demotion-counter 2020-10-07 09:15:32.173642439 -0700
> +++ b/mm/vmscan.c 2020-10-07 09:15:32.180642439 -0700
> @@ -147,6 +147,7 @@ struct scan_control {
> unsigned int immediate;
> unsigned int file_taken;
> unsigned int taken;
> + unsigned int demoted;

It appears that this newly added field isn't used in the patch.

> } nr;
>
> /* for recording the reclaimed slab by now */
> @@ -1134,6 +1135,11 @@ static unsigned int demote_page_list(str
> target_nid, MIGRATE_ASYNC, MR_DEMOTION,
> &nr_succeeded);
>
> + if (current_is_kswapd())
> + __count_vm_events(PGDEMOTE_KSWAPD, nr_succeeded);
> + else
> + __count_vm_events(PGDEMOTE_DIRECT, nr_succeeded);
> +
> return nr_succeeded;
> }
>
> diff -puN mm/vmstat.c~mm-vmscan-add-page-demotion-counter mm/vmstat.c
> --- a/mm/vmstat.c~mm-vmscan-add-page-demotion-counter 2020-10-07 09:15:32.175642439 -0700
> +++ b/mm/vmstat.c 2020-10-07 09:15:32.181642439 -0700
> @@ -1244,6 +1244,8 @@ const char * const vmstat_text[] = {
> "pgreuse",
> "pgsteal_kswapd",
> "pgsteal_direct",
> + "pgdemote_kswapd",
> + "pgdemote_direct",
> "pgscan_kswapd",
> "pgscan_direct",
> "pgscan_direct_throttle",
> _

Best Regards,
Huang, Ying

2020-10-28 15:38:34

by Yang Shi

Subject: Re: [RFC][PATCH 6/9] mm/vmscan: add page demotion counter

On Mon, Oct 19, 2020 at 12:38 AM Huang, Ying <[email protected]> wrote:
>
> Dave Hansen <[email protected]> writes:
>
> > From: Yang Shi <[email protected]>
> >
> > Account the number of demoted pages into reclaim_state->nr_demoted.
>
> It appears that you don't add "nr_demoted" into struct reclaim_state.
>
> > Add pgdemote_kswapd and pgdemote_direct VM counters showed in
> > /proc/vmstat.
> >
> > [ daveh:
> > - __count_vm_events() a bit, and made them look at the THP
> > size directly rather than getting data from migrate_pages()
>
> It appears that we get the data from migrate_pages() now.
>
> > ]
> >
> > Signed-off-by: Yang Shi <[email protected]>
> > Signed-off-by: Dave Hansen <[email protected]>
> > Cc: David Rientjes <[email protected]>
> > Cc: Huang Ying <[email protected]>
> > Cc: Dan Williams <[email protected]>
> > Cc: David Hildenbrand <[email protected]>
> > ---
> >
> > b/include/linux/vm_event_item.h | 2 ++
> > b/mm/vmscan.c | 6 ++++++
> > b/mm/vmstat.c | 2 ++
> > 3 files changed, 10 insertions(+)
> >
> > diff -puN include/linux/vm_event_item.h~mm-vmscan-add-page-demotion-counter include/linux/vm_event_item.h
> > --- a/include/linux/vm_event_item.h~mm-vmscan-add-page-demotion-counter 2020-10-07 09:15:32.171642439 -0700
> > +++ b/include/linux/vm_event_item.h 2020-10-07 09:15:32.179642439 -0700
> > @@ -33,6 +33,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
> > PGREUSE,
> > PGSTEAL_KSWAPD,
> > PGSTEAL_DIRECT,
> > + PGDEMOTE_KSWAPD,
> > + PGDEMOTE_DIRECT,
> > PGSCAN_KSWAPD,
> > PGSCAN_DIRECT,
> > PGSCAN_DIRECT_THROTTLE,
> > diff -puN mm/vmscan.c~mm-vmscan-add-page-demotion-counter mm/vmscan.c
> > --- a/mm/vmscan.c~mm-vmscan-add-page-demotion-counter 2020-10-07 09:15:32.173642439 -0700
> > +++ b/mm/vmscan.c 2020-10-07 09:15:32.180642439 -0700
> > @@ -147,6 +147,7 @@ struct scan_control {
> > unsigned int immediate;
> > unsigned int file_taken;
> > unsigned int taken;
> > + unsigned int demoted;
>
> It appears that this newly added field isn't used in the patch.

My original patch tracked nr_demoted in reclaim_stat as well, but it
seems Dave dropped that part. If Dave thinks it is not necessary to
keep tracking nr_demoted in reclaim_stat, then that field should be
dropped.

>
> > } nr;
> >
> > /* for recording the reclaimed slab by now */
> > @@ -1134,6 +1135,11 @@ static unsigned int demote_page_list(str
> > target_nid, MIGRATE_ASYNC, MR_DEMOTION,
> > &nr_succeeded);
> >
> > + if (current_is_kswapd())
> > + __count_vm_events(PGDEMOTE_KSWAPD, nr_succeeded);
> > + else
> > + __count_vm_events(PGDEMOTE_DIRECT, nr_succeeded);
> > +
> > return nr_succeeded;
> > }
> >
> > diff -puN mm/vmstat.c~mm-vmscan-add-page-demotion-counter mm/vmstat.c
> > --- a/mm/vmstat.c~mm-vmscan-add-page-demotion-counter 2020-10-07 09:15:32.175642439 -0700
> > +++ b/mm/vmstat.c 2020-10-07 09:15:32.181642439 -0700
> > @@ -1244,6 +1244,8 @@ const char * const vmstat_text[] = {
> > "pgreuse",
> > "pgsteal_kswapd",
> > "pgsteal_direct",
> > + "pgdemote_kswapd",
> > + "pgdemote_direct",
> > "pgscan_kswapd",
> > "pgscan_direct",
> > "pgscan_direct_throttle",
> > _
>
> Best Regards,
> Huang, Ying
>