2019-04-11 03:59:02

by Yang Shi

Subject: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node


With Dave Hansen's patches merged into Linus's tree

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4

PMEM can be hot-plugged as a NUMA node now. But how to use PMEM as a NUMA node
effectively and efficiently is still an open question.

There have been a couple of proposals posted on the mailing list [1] [2] [3].


Changelog
=========
v1 --> v2:
* Dropped the default allocation node mask. The memory placement restriction
could be achieved by mempolicy or cpuset.
* Dropped the new mempolicy since its semantic is not that clear yet.
* Dropped PG_Promote flag.
* Defined N_CPU_MEM nodemask for the nodes which have both CPU and memory.
* Extended page_check_references() to implement "twice access" check for
anonymous page in NUMA balancing path.
* Reworked the memory demotion code.

v1: https://lore.kernel.org/linux-mm/[email protected]/


Design
======
Basically, the approach aims to spread data from DRAM (closest to the local
CPU) down to PMEM and disk (typically assuming the lower tier storage is
slower, larger and cheaper than the upper tier) according to their hotness.
The patchset tries to achieve this goal by doing memory promotion/demotion via
NUMA balancing and memory reclaim, as the diagram below shows:

DRAM <--> PMEM <--> Disk
  ^                   ^
  |-------------------|
           swap

When DRAM is under memory pressure, demote pages to PMEM via the page reclaim
path. Then NUMA balancing will promote pages back to DRAM once they are
referenced again. Memory pressure on the PMEM node would push its inactive
pages out to disk via swap.

Promotion/demotion happens only between "primary" nodes (nodes that have both
CPU and memory) and PMEM nodes: there is no promotion/demotion between PMEM
nodes, no promotion from DRAM to PMEM, and no demotion from PMEM to DRAM.

The HMAT is effectively going to enforce "cpu-less" nodes for any memory range
that has differentiated performance from the conventional memory pool, or
differentiated performance for a specific initiator, per Dan Williams. So,
assuming PMEM nodes are cpuless nodes sounds reasonable.

However, cpuless nodes might not be PMEM nodes. But, actually, memory
promotion/demotion doesn't care what kind of memory the target nodes are;
they could be DRAM, PMEM or something else, as long as they are second tier
memory (slower, larger and cheaper than regular DRAM), otherwise such
demotion sounds pointless.

Define an "N_CPU_MEM" nodemask for the nodes which have both CPU and memory,
in order to distinguish them from cpuless nodes (memory only, e.g. PMEM nodes)
and memoryless nodes (some architectures, e.g. Power, may have memoryless
nodes). Memory allocation typically happens on such nodes by default unless
cpuless nodes are specified explicitly; cpuless nodes are just fallback nodes.
Hence the N_CPU_MEM nodes are also known as "primary" nodes in this patchset.
With a two tier memory system (i.e. DRAM + PMEM), this sounds good enough to
demonstrate the promotion/demotion approach for now, and it looks more
architecture-independent. But it may be better to construct such a node mask
by reading hardware information (e.g. HMAT), particularly for more complex
memory hierarchies.
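
Conceptually (a rough sketch only; patch 1 instead tracks the state
incrementally on CPU/memory hotplug events), the "primary" mask is just the
intersection of N_CPU and N_MEMORY:

    /* Sketch only: "primary" nodes are those with both CPUs and memory. */
    nodemask_t primary_mask;

    nodes_and(primary_mask, node_states[N_CPU], node_states[N_MEMORY]);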

To reduce memory thrashing and PMEM bandwidth pressure, only promote a page in
NUMA balancing once it has faulted twice. The "twice access" check is
implemented by extending page_check_references() for anonymous pages.
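
In essence the filter boils down to something like the below (a simplified
sketch; the function name is illustrative, and the real code in patch 3 reuses
page_check_references() and its PAGEREF_* return values):

    /* Simplified sketch of the "twice access" filter for an anonymous page
     * taking a NUMA hinting fault while sitting on a PMEM node.
     */
    static bool should_promote(struct page *page)
    {
            if (TestClearPageReferenced(page))
                    return true;            /* second access: promote to DRAM */

            SetPageReferenced(page);        /* first access: just remember it */
            return false;                   /* keep the page on PMEM for now */
    }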

When doing demotion, demote to the less-contended local PMEM node. If the
local PMEM node is contended (i.e. migrate_pages() returns -ENOMEM), just do
swap instead of demotion. To keep things simple, demotion to a remote PMEM
node is not allowed for now if the local PMEM node is online. If the local
PMEM node is not online, just demote to a remote one. If no PMEM node is
online, just do normal swap.
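
Put together, the demote-or-swap decision in the reclaim path roughly looks
like the sketch below (condensed from patches 5 and 7; the function name is
illustrative, and locking, accounting and putback of failed pages are omitted):

    /* Condensed sketch of the demotion fallback logic (patches 5 and 7). */
    static bool try_demote(struct pglist_data *pgdat,
                           struct list_head *demote_pages)
    {
            unsigned int nr_succeeded = 0;
            nodemask_t used_mask;
            int target_nid, err;

            nodes_clear(used_mask);
            target_nid = find_next_best_node(pgdat->node_id, &used_mask, true);

            if (target_nid == NUMA_NO_NODE ||
                test_bit(PGDAT_CONTENDED, &NODE_DATA(target_nid)->flags))
                    return false;           /* no usable PMEM node: swap */

            err = migrate_pages(demote_pages, alloc_demote_page, NULL,
                                target_nid, MIGRATE_ASYNC, MR_DEMOTE,
                                &nr_succeeded);
            if (err == -ENOMEM)             /* PMEM under pressure */
                    set_bit(PGDAT_CONTENDED, &NODE_DATA(target_nid)->flags);

            return err == 0;
    }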

Only anonymous pages are handled for the time being, since NUMA balancing
can't promote unmapped page cache.

Added vmstat counters for pgdemote_kswapd, pgdemote_direct and
numa_pages_promoted.

There are definitely still some details that need to be sorted out, for
example, whether the demotion path should respect mempolicy, etc.

Any comment is welcome.


Test
====
The stress test was done with mmtests plus application workloads (e.g.
sysbench, grep, etc).

Memory pressure was generated by running mmtests' usemem-stress-numa-compact,
then other applications were run as workloads to stress the promotion and
demotion paths. The machine was still alive after the stress test had been
running for ~30 hours. /proc/vmstat also shows:

...
pgdemote_kswapd 3316563
pgdemote_direct 1930721
...
numa_pages_promoted 81838


TODO
====
1. Promote page cache. There are a couple of ways to handle this in the
kernel, e.g. promote via the active LRU in the reclaim path on the PMEM node,
or promote in mark_page_accessed().

2. Promote/demote HugeTLB. Currently HugeTLB pages are not on the LRU and
NUMA balancing just skips them.

3. Maybe place kernel pages (e.g. page tables, slabs, etc) on DRAM only.


[1]: https://lore.kernel.org/linux-mm/[email protected]/
[2]: https://lore.kernel.org/linux-mm/[email protected]/
[3]: https://lore.kernel.org/linux-mm/[email protected]/T/#me1c1ed102741ba945c57071de9749e16a76e9f3d


Yang Shi (9):
mm: define N_CPU_MEM node states
mm: page_alloc: make find_next_best_node return cpuless node
mm: numa: promote pages to DRAM when it gets accessed twice
mm: migrate: make migrate_pages() return nr_succeeded
mm: vmscan: demote anon DRAM pages to PMEM node
mm: vmscan: don't demote for memcg reclaim
mm: vmscan: check if the demote target node is contended or not
mm: vmscan: add page demotion counter
mm: numa: add page promotion counter

drivers/base/node.c | 2 +
include/linux/gfp.h | 12 +++
include/linux/migrate.h | 6 +-
include/linux/mmzone.h | 3 +
include/linux/nodemask.h | 3 +-
include/linux/vm_event_item.h | 3 +
include/linux/vmstat.h | 1 +
include/trace/events/migrate.h | 3 +-
mm/compaction.c | 3 +-
mm/debug.c | 1 +
mm/gup.c | 4 +-
mm/huge_memory.c | 15 ++++
mm/internal.h | 105 +++++++++++++++++++++++++
mm/memory-failure.c | 7 +-
mm/memory.c | 25 ++++++
mm/memory_hotplug.c | 10 ++-
mm/mempolicy.c | 7 +-
mm/migrate.c | 33 +++++---
mm/page_alloc.c | 19 +++--
mm/vmscan.c | 262 +++++++++++++++++++++++++++++++++++++++++----------------------
mm/vmstat.c | 14 +++-
21 files changed, 418 insertions(+), 120 deletions(-)


2019-04-11 03:58:18

by Yang Shi

Subject: [v2 PATCH 1/9] mm: define N_CPU_MEM node states

The kernel has some pre-defined node masks called node states, e.g.
N_MEMORY, N_CPU, etc. But there might be cpuless nodes, e.g. PMEM
nodes, and some architectures, e.g. Power, may have memoryless nodes.
It is not very straightforward to get the nodes with both CPUs and
memory. So, define an N_CPU_MEM node state. The nodes with both CPUs
and memory are called "primary" nodes. /sys/devices/system/node/primary
shows the currently online "primary" nodes.
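
For illustration only (not part of this patch; the function name is made up),
the new state can then be tested or iterated like any other node state:

    /* Illustration only: N_CPU_MEM behaves like any other node state. */
    static void dump_primary_nodes(void)
    {
            int nid;

            for_each_node_state(nid, N_CPU_MEM)
                    pr_info("node %d is a primary node (CPU + memory)\n", nid);
    }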

Signed-off-by: Yang Shi <[email protected]>
---
drivers/base/node.c | 2 ++
include/linux/nodemask.h | 3 ++-
mm/memory_hotplug.c | 6 ++++++
mm/page_alloc.c | 1 +
mm/vmstat.c | 11 +++++++++--
5 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 86d6cd9..1b963b2 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -634,6 +634,7 @@ static ssize_t show_node_state(struct device *dev,
#endif
[N_MEMORY] = _NODE_ATTR(has_memory, N_MEMORY),
[N_CPU] = _NODE_ATTR(has_cpu, N_CPU),
+ [N_CPU_MEM] = _NODE_ATTR(primary, N_CPU_MEM),
};

static struct attribute *node_state_attrs[] = {
@@ -645,6 +646,7 @@ static ssize_t show_node_state(struct device *dev,
#endif
&node_state_attr[N_MEMORY].attr.attr,
&node_state_attr[N_CPU].attr.attr,
+ &node_state_attr[N_CPU_MEM].attr.attr,
NULL
};

diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 27e7fa3..66a8964 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -398,7 +398,8 @@ enum node_states {
N_HIGH_MEMORY = N_NORMAL_MEMORY,
#endif
N_MEMORY, /* The node has memory(regular, high, movable) */
- N_CPU, /* The node has one or more cpus */
+ N_CPU, /* The node has one or more cpus */
+ N_CPU_MEM, /* The node has both cpus and memory */
NR_NODE_STATES
};

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index f767582..1140f3b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -729,6 +729,9 @@ static void node_states_set_node(int node, struct memory_notify *arg)

if (arg->status_change_nid >= 0)
node_set_state(node, N_MEMORY);
+
+ if (node_state(node, N_CPU))
+ node_set_state(node, N_CPU_MEM);
}

static void __meminit resize_zone_range(struct zone *zone, unsigned long start_pfn,
@@ -1569,6 +1572,9 @@ static void node_states_clear_node(int node, struct memory_notify *arg)

if (arg->status_change_nid >= 0)
node_clear_state(node, N_MEMORY);
+
+ if (node_state(node, N_CPU))
+ node_clear_state(node, N_CPU_MEM);
}

static int __ref __offline_pages(unsigned long start_pfn,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 03fcf73..7cd88a4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -122,6 +122,7 @@ struct pcpu_drain {
#endif
[N_MEMORY] = { { [0] = 1UL } },
[N_CPU] = { { [0] = 1UL } },
+ [N_CPU_MEM] = { { [0] = 1UL } },
#endif /* NUMA */
};
EXPORT_SYMBOL(node_states);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 36b56f8..1a431dc 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1910,15 +1910,22 @@ static void __init init_cpu_node_state(void)
int node;

for_each_online_node(node) {
- if (cpumask_weight(cpumask_of_node(node)) > 0)
+ if (cpumask_weight(cpumask_of_node(node)) > 0) {
node_set_state(node, N_CPU);
+ if (node_state(node, N_MEMORY))
+ node_set_state(node, N_CPU_MEM);
+ }
}
}

static int vmstat_cpu_online(unsigned int cpu)
{
+ int node = cpu_to_node(cpu);
+
refresh_zone_stat_thresholds();
- node_set_state(cpu_to_node(cpu), N_CPU);
+ node_set_state(node, N_CPU);
+ if (node_state(node, N_MEMORY))
+ node_set_state(node, N_CPU_MEM);
return 0;
}

--
1.8.3.1

2019-04-11 03:58:23

by Yang Shi

Subject: [v2 PATCH 2/9] mm: page_alloc: make find_next_best_node return cpuless node

We need to find the closest cpuless node to demote DRAM pages to. Add a
"cpuless" parameter to find_next_best_node() to skip DRAM nodes on
demand.
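
A sketch of the two call modes after this change (illustration only; the
variables come from the callers' context, and the cpuless caller is only
added later, in the demotion patch):

    /* cpuless == false: old behaviour, used for zonelist construction */
    node = find_next_best_node(local_node, &used_mask, false);

    /* cpuless == true: skip nodes with CPUs, i.e. find the closest PMEM node */
    target_nid = find_next_best_node(pgdat->node_id, &used_mask, true);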

Signed-off-by: Yang Shi <[email protected]>
---
mm/internal.h | 11 +++++++++++
mm/page_alloc.c | 14 ++++++++++----
2 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 9eeaf2b..a514808 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -292,6 +292,17 @@ static inline bool is_data_mapping(vm_flags_t flags)
return (flags & (VM_WRITE | VM_SHARED | VM_STACK)) == VM_WRITE;
}

+#ifdef CONFIG_NUMA
+extern int find_next_best_node(int node, nodemask_t *used_node_mask,
+ bool cpuless);
+#else
+static inline int find_next_best_node(int node, nodemask_t *used_node_mask,
+ bool cpuless)
+{
+ return 0;
+}
+#endif
+
/* mm/util.c */
void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
struct vm_area_struct *prev, struct rb_node *rb_parent);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7cd88a4..bda17c2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5362,6 +5362,7 @@ int numa_zonelist_order_handler(struct ctl_table *table, int write,
* find_next_best_node - find the next node that should appear in a given node's fallback list
* @node: node whose fallback list we're appending
* @used_node_mask: nodemask_t of already used nodes
+ * @cpuless: find next best cpuless node
*
* We use a number of factors to determine which is the next node that should
* appear on a given node's fallback list. The node should not have appeared
@@ -5373,7 +5374,8 @@ int numa_zonelist_order_handler(struct ctl_table *table, int write,
*
* Return: node id of the found node or %NUMA_NO_NODE if no node is found.
*/
-static int find_next_best_node(int node, nodemask_t *used_node_mask)
+int find_next_best_node(int node, nodemask_t *used_node_mask,
+ bool cpuless)
{
int n, val;
int min_val = INT_MAX;
@@ -5381,13 +5383,18 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask)
const struct cpumask *tmp = cpumask_of_node(0);

/* Use the local node if we haven't already */
- if (!node_isset(node, *used_node_mask)) {
+ if (!node_isset(node, *used_node_mask) &&
+ !cpuless) {
node_set(node, *used_node_mask);
return node;
}

for_each_node_state(n, N_MEMORY) {

+ /* Find next best cpuless node */
+ if (cpuless && (node_state(n, N_CPU)))
+ continue;
+
/* Don't want a node to appear more than once */
if (node_isset(n, *used_node_mask))
continue;
@@ -5419,7 +5426,6 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask)
return best_node;
}

-
/*
* Build zonelists ordered by node and zones within node.
* This results in maximum locality--normal zone overflows into local
@@ -5481,7 +5487,7 @@ static void build_zonelists(pg_data_t *pgdat)
nodes_clear(used_mask);

memset(node_order, 0, sizeof(node_order));
- while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
+ while ((node = find_next_best_node(local_node, &used_mask, false)) >= 0) {
/*
* We don't want to pressure a particular node.
* So adding penalty to the first node in same
--
1.8.3.1

2019-04-11 03:58:31

by Yang Shi

Subject: [v2 PATCH 3/9] mm: numa: promote pages to DRAM when it gets accessed twice

NUMA balancing would promote pages to DRAM as soon as they are accessed, but
that might be just a one-off access. To reduce migration thrashing and
memory bandwidth pressure, only promote pages which get accessed twice, by
extending page_check_references() to support a second-reference algorithm
for anonymous pages.

page_check_references() would walk all mapped ptes or pmds to check whether
the page is referenced, but such a walk sounds unnecessary for NUMA balancing
since NUMA balancing keeps the pte or pmd referenced bit set all the time, so
an anonymous page would always have at least one referenced pte or pmd. The
NUMA balancing path is distinguished from the page reclaim path via
scan_control, which is NULL in the NUMA balancing path.

This approach is definitely not the optimal one for distinguishing hot and
cold pages accurately; a much more sophisticated algorithm may be needed for
that.

Signed-off-by: Yang Shi <[email protected]>
---
mm/huge_memory.c | 11 ++++++
mm/internal.h | 80 ++++++++++++++++++++++++++++++++++++++
mm/memory.c | 21 ++++++++++
mm/vmscan.c | 116 ++++++++++++++++---------------------------------------
4 files changed, 146 insertions(+), 82 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 404acdc..0b18ac45 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1590,6 +1590,17 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
}

/*
+ * Promote the page when it gets NUMA fault twice.
+ * It is safe to set page flag since the page is locked now.
+ */
+ if (!node_state(page_nid, N_CPU_MEM) &&
+ page_check_references(page, NULL) != PAGEREF_PROMOTE) {
+ put_page(page);
+ page_nid = NUMA_NO_NODE;
+ goto clear_pmdnuma;
+ }
+
+ /*
* Migrate the THP to the requested node, returns with page unlocked
* and access rights restored.
*/
diff --git a/mm/internal.h b/mm/internal.h
index a514808..bee4d6c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -89,8 +89,88 @@ static inline void set_page_refcounted(struct page *page)
/*
* in mm/vmscan.c:
*/
+struct scan_control {
+ /* How many pages shrink_list() should reclaim */
+ unsigned long nr_to_reclaim;
+
+ /*
+ * Nodemask of nodes allowed by the caller. If NULL, all nodes
+ * are scanned.
+ */
+ nodemask_t *nodemask;
+
+ /*
+ * The memory cgroup that hit its limit and as a result is the
+ * primary target of this reclaim invocation.
+ */
+ struct mem_cgroup *target_mem_cgroup;
+
+ /* Writepage batching in laptop mode; RECLAIM_WRITE */
+ unsigned int may_writepage:1;
+
+ /* Can mapped pages be reclaimed? */
+ unsigned int may_unmap:1;
+
+ /* Can pages be swapped as part of reclaim? */
+ unsigned int may_swap:1;
+
+ /* e.g. boosted watermark reclaim leaves slabs alone */
+ unsigned int may_shrinkslab:1;
+
+ /*
+ * Cgroups are not reclaimed below their configured memory.low,
+ * unless we threaten to OOM. If any cgroups are skipped due to
+ * memory.low and nothing was reclaimed, go back for memory.low.
+ */
+ unsigned int memcg_low_reclaim:1;
+ unsigned int memcg_low_skipped:1;
+
+ unsigned int hibernation_mode:1;
+
+ /* One of the zones is ready for compaction */
+ unsigned int compaction_ready:1;
+
+ /* Allocation order */
+ s8 order;
+
+ /* Scan (total_size >> priority) pages at once */
+ s8 priority;
+
+ /* The highest zone to isolate pages for reclaim from */
+ s8 reclaim_idx;
+
+ /* This context's GFP mask */
+ gfp_t gfp_mask;
+
+ /* Incremented by the number of inactive pages that were scanned */
+ unsigned long nr_scanned;
+
+ /* Number of pages freed so far during a call to shrink_zones() */
+ unsigned long nr_reclaimed;
+
+ struct {
+ unsigned int dirty;
+ unsigned int unqueued_dirty;
+ unsigned int congested;
+ unsigned int writeback;
+ unsigned int immediate;
+ unsigned int file_taken;
+ unsigned int taken;
+ } nr;
+};
+
+enum page_references {
+ PAGEREF_RECLAIM,
+ PAGEREF_RECLAIM_CLEAN,
+ PAGEREF_KEEP,
+ PAGEREF_ACTIVATE,
+ PAGEREF_PROMOTE = PAGEREF_ACTIVATE,
+};
+
extern int isolate_lru_page(struct page *page);
extern void putback_lru_page(struct page *page);
+enum page_references page_check_references(struct page *page,
+ struct scan_control *sc);

/*
* in mm/rmap.c:
diff --git a/mm/memory.c b/mm/memory.c
index 47fe250..01c1ead 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3680,6 +3680,27 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
goto out;
}

+ /*
+ * Promote the page when it gets NUMA fault twice.
+ * Need lock the page before check its references.
+ */
+ if (!node_state(page_nid, N_CPU_MEM)) {
+ if (!trylock_page(page)) {
+ put_page(page);
+ target_nid = NUMA_NO_NODE;
+ goto out;
+ }
+
+ if (page_check_references(page, NULL) != PAGEREF_PROMOTE) {
+ unlock_page(page);
+ put_page(page);
+ target_nid = NUMA_NO_NODE;
+ goto out;
+ }
+
+ unlock_page(page);
+ }
+
/* Migrate to the requested node */
migrated = migrate_misplaced_page(page, vma, target_nid);
if (migrated) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a5ad0b3..0504845 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -63,76 +63,6 @@
#define CREATE_TRACE_POINTS
#include <trace/events/vmscan.h>

-struct scan_control {
- /* How many pages shrink_list() should reclaim */
- unsigned long nr_to_reclaim;
-
- /*
- * Nodemask of nodes allowed by the caller. If NULL, all nodes
- * are scanned.
- */
- nodemask_t *nodemask;
-
- /*
- * The memory cgroup that hit its limit and as a result is the
- * primary target of this reclaim invocation.
- */
- struct mem_cgroup *target_mem_cgroup;
-
- /* Writepage batching in laptop mode; RECLAIM_WRITE */
- unsigned int may_writepage:1;
-
- /* Can mapped pages be reclaimed? */
- unsigned int may_unmap:1;
-
- /* Can pages be swapped as part of reclaim? */
- unsigned int may_swap:1;
-
- /* e.g. boosted watermark reclaim leaves slabs alone */
- unsigned int may_shrinkslab:1;
-
- /*
- * Cgroups are not reclaimed below their configured memory.low,
- * unless we threaten to OOM. If any cgroups are skipped due to
- * memory.low and nothing was reclaimed, go back for memory.low.
- */
- unsigned int memcg_low_reclaim:1;
- unsigned int memcg_low_skipped:1;
-
- unsigned int hibernation_mode:1;
-
- /* One of the zones is ready for compaction */
- unsigned int compaction_ready:1;
-
- /* Allocation order */
- s8 order;
-
- /* Scan (total_size >> priority) pages at once */
- s8 priority;
-
- /* The highest zone to isolate pages for reclaim from */
- s8 reclaim_idx;
-
- /* This context's GFP mask */
- gfp_t gfp_mask;
-
- /* Incremented by the number of inactive pages that were scanned */
- unsigned long nr_scanned;
-
- /* Number of pages freed so far during a call to shrink_zones() */
- unsigned long nr_reclaimed;
-
- struct {
- unsigned int dirty;
- unsigned int unqueued_dirty;
- unsigned int congested;
- unsigned int writeback;
- unsigned int immediate;
- unsigned int file_taken;
- unsigned int taken;
- } nr;
-};
-
#ifdef ARCH_HAS_PREFETCH
#define prefetch_prev_lru_page(_page, _base, _field) \
do { \
@@ -1002,21 +932,32 @@ void putback_lru_page(struct page *page)
put_page(page); /* drop ref from isolate */
}

-enum page_references {
- PAGEREF_RECLAIM,
- PAGEREF_RECLAIM_CLEAN,
- PAGEREF_KEEP,
- PAGEREF_ACTIVATE,
-};
-
-static enum page_references page_check_references(struct page *page,
- struct scan_control *sc)
+/*
+ * Called by NUMA balancing to implement access twice check for
+ * promoting pages from cpuless nodes.
+ *
+ * The sc would be NULL in NUMA balancing path.
+ */
+enum page_references page_check_references(struct page *page,
+ struct scan_control *sc)
{
int referenced_ptes, referenced_page;
unsigned long vm_flags;
+ struct mem_cgroup *memcg = sc ? sc->target_mem_cgroup : NULL;
+
+ if (sc)
+ referenced_ptes = page_referenced(page, 1, memcg, &vm_flags);
+ else
+ /*
+ * The page should always has at least one referenced pte
+ * in NUMA balancing path since NUMA balancing set referenced
+ * bit by default in PAGE_NONE.
+ * So, it sounds unnecessary to walk rmap to get the number of
+ * referenced ptes. This also help avoid potential ptl
+ * deadlock for huge pmd.
+ */
+ referenced_ptes = 1;

- referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup,
- &vm_flags);
referenced_page = TestClearPageReferenced(page);

/*
@@ -1027,8 +968,19 @@ static enum page_references page_check_references(struct page *page,
return PAGEREF_RECLAIM;

if (referenced_ptes) {
- if (PageSwapBacked(page))
+ if (PageSwapBacked(page)) {
+ if (!sc) {
+ if (referenced_page)
+ return PAGEREF_ACTIVATE;
+
+ SetPageReferenced(page);
+
+ return PAGEREF_KEEP;
+ }
+
return PAGEREF_ACTIVATE;
+ }
+
/*
* All mapped pages start out with page table
* references from the instantiating fault, so we need
--
1.8.3.1

2019-04-11 03:59:31

by Yang Shi

Subject: [v2 PATCH 6/9] mm: vmscan: don't demote for memcg reclaim

Memcg reclaim happens when the limit is breached, but demotion just
migrates pages to another node instead of reclaiming them. That is
pointless for memcg reclaim since the usage is not reduced at all.

Signed-off-by: Yang Shi <[email protected]>
---
mm/vmscan.c | 38 +++++++++++++++++++++-----------------
1 file changed, 21 insertions(+), 17 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2a96609..80cd624 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1046,8 +1046,12 @@ static void page_check_dirty_writeback(struct page *page,
mapping->a_ops->is_dirty_writeback(page, dirty, writeback);
}

-static inline bool is_demote_ok(int nid)
+static inline bool is_demote_ok(int nid, struct scan_control *sc)
{
+ /* It is pointless to do demotion in memcg reclaim */
+ if (!global_reclaim(sc))
+ return false;
+
/* Current node is cpuless node */
if (!node_state(nid, N_CPU_MEM))
return false;
@@ -1267,7 +1271,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* Demotion only happen from primary nodes
* to cpuless nodes.
*/
- if (is_demote_ok(page_to_nid(page))) {
+ if (is_demote_ok(page_to_nid(page), sc)) {
list_add(&page->lru, &demote_pages);
unlock_page(page);
continue;
@@ -2219,7 +2223,7 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
* deactivation is pointless.
*/
if (!file && !total_swap_pages &&
- !is_demote_ok(pgdat->node_id))
+ !is_demote_ok(pgdat->node_id, sc))
return false;

inactive = lruvec_lru_size(lruvec, inactive_lru, sc->reclaim_idx);
@@ -2306,7 +2310,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
*
* If current node is already PMEM node, demotion is not applicable.
*/
- if (!is_demote_ok(pgdat->node_id)) {
+ if (!is_demote_ok(pgdat->node_id, sc)) {
/*
* If we have no swap space, do not bother scanning
* anon pages.
@@ -2315,18 +2319,18 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
scan_balance = SCAN_FILE;
goto out;
}
+ }

- /*
- * Global reclaim will swap to prevent OOM even with no
- * swappiness, but memcg users want to use this knob to
- * disable swapping for individual groups completely when
- * using the memory controller's swap limit feature would be
- * too expensive.
- */
- if (!global_reclaim(sc) && !swappiness) {
- scan_balance = SCAN_FILE;
- goto out;
- }
+ /*
+ * Global reclaim will swap to prevent OOM even with no
+ * swappiness, but memcg users want to use this knob to
+ * disable swapping for individual groups completely when
+ * using the memory controller's swap limit feature would be
+ * too expensive.
+ */
+ if (!global_reclaim(sc) && !swappiness) {
+ scan_balance = SCAN_FILE;
+ goto out;
}

/*
@@ -2675,7 +2679,7 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
*/
pages_for_compaction = compact_gap(sc->order);
inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
- if (get_nr_swap_pages() > 0 || is_demote_ok(pgdat->node_id))
+ if (get_nr_swap_pages() > 0 || is_demote_ok(pgdat->node_id, sc))
inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
if (sc->nr_reclaimed < pages_for_compaction &&
inactive_lru_pages > pages_for_compaction)
@@ -3373,7 +3377,7 @@ static void age_active_anon(struct pglist_data *pgdat,
struct mem_cgroup *memcg;

/* Aging anon page as long as demotion is fine */
- if (!total_swap_pages && !is_demote_ok(pgdat->node_id))
+ if (!total_swap_pages && !is_demote_ok(pgdat->node_id, sc))
return;

memcg = mem_cgroup_iter(NULL, NULL, NULL);
--
1.8.3.1

2019-04-11 03:59:34

by Yang Shi

Subject: [v2 PATCH 4/9] mm: migrate: make migrate_pages() return nr_succeeded

migrate_pages() returns the number of pages that were not migrated, or an
error code. When an error code is returned, there is no way to know how
many pages were actually migrated.

In the following patch, migrate_pages() is used to demote pages to a PMEM
node, and we need to account for how many pages are reclaimed (demoted)
since page reclaim behavior depends on this. Add an *nr_succeeded parameter
to make migrate_pages() report how many pages were migrated successfully in
all cases.
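
For example, this is roughly how the demotion path in the next patch consumes
the new parameter (sketch only; demote_pages, target_nid and nr_reclaimed come
from the caller's context there): nr_succeeded stays valid even when an error
code is returned, so partial success can still be accounted.

    unsigned int nr_succeeded = 0;
    int err;

    err = migrate_pages(&demote_pages, alloc_demote_page, NULL, target_nid,
                        MIGRATE_ASYNC, MR_DEMOTE, &nr_succeeded);
    nr_reclaimed += nr_succeeded;   /* counted regardless of err */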

Signed-off-by: Yang Shi <[email protected]>
---
include/linux/migrate.h | 5 +++--
mm/compaction.c | 3 ++-
mm/gup.c | 4 +++-
mm/memory-failure.c | 7 +++++--
mm/memory_hotplug.c | 4 +++-
mm/mempolicy.c | 7 +++++--
mm/migrate.c | 18 ++++++++++--------
mm/page_alloc.c | 4 +++-
8 files changed, 34 insertions(+), 18 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index e13d9bf..837fdd1 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -66,7 +66,8 @@ extern int migrate_page(struct address_space *mapping,
struct page *newpage, struct page *page,
enum migrate_mode mode);
extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
- unsigned long private, enum migrate_mode mode, int reason);
+ unsigned long private, enum migrate_mode mode, int reason,
+ unsigned int *nr_succeeded);
extern int isolate_movable_page(struct page *page, isolate_mode_t mode);
extern void putback_movable_page(struct page *page);

@@ -84,7 +85,7 @@ extern int migrate_page_move_mapping(struct address_space *mapping,
static inline void putback_movable_pages(struct list_head *l) {}
static inline int migrate_pages(struct list_head *l, new_page_t new,
free_page_t free, unsigned long private, enum migrate_mode mode,
- int reason)
+ int reason, unsigned int *nr_succeeded)
{ return -ENOSYS; }
static inline int isolate_movable_page(struct page *page, isolate_mode_t mode)
{ return -EBUSY; }
diff --git a/mm/compaction.c b/mm/compaction.c
index f171a83..c6a0ec4 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2065,6 +2065,7 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
unsigned long last_migrated_pfn;
const bool sync = cc->mode != MIGRATE_ASYNC;
bool update_cached;
+ unsigned int nr_succeeded = 0;

cc->migratetype = gfpflags_to_migratetype(cc->gfp_mask);
ret = compaction_suitable(cc->zone, cc->order, cc->alloc_flags,
@@ -2173,7 +2174,7 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,

err = migrate_pages(&cc->migratepages, compaction_alloc,
compaction_free, (unsigned long)cc, cc->mode,
- MR_COMPACTION);
+ MR_COMPACTION, &nr_succeeded);

trace_mm_compaction_migratepages(cc->nr_migratepages, err,
&cc->migratepages);
diff --git a/mm/gup.c b/mm/gup.c
index f84e226..b482b8c 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1217,6 +1217,7 @@ static long check_and_migrate_cma_pages(unsigned long start, long nr_pages,
long i;
bool drain_allow = true;
bool migrate_allow = true;
+ unsigned int nr_succeeded = 0;
LIST_HEAD(cma_page_list);

check_again:
@@ -1257,7 +1258,8 @@ static long check_and_migrate_cma_pages(unsigned long start, long nr_pages,
put_page(pages[i]);

if (migrate_pages(&cma_page_list, new_non_cma_page,
- NULL, 0, MIGRATE_SYNC, MR_CONTIG_RANGE)) {
+ NULL, 0, MIGRATE_SYNC, MR_CONTIG_RANGE,
+ &nr_succeeded)) {
/*
* some of the pages failed migration. Do get_user_pages
* without migration.
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index fc8b517..b5d8a8f 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1686,6 +1686,7 @@ static int soft_offline_huge_page(struct page *page, int flags)
int ret;
unsigned long pfn = page_to_pfn(page);
struct page *hpage = compound_head(page);
+ unsigned int nr_succeeded = 0;
LIST_HEAD(pagelist);

/*
@@ -1713,7 +1714,7 @@ static int soft_offline_huge_page(struct page *page, int flags)
}

ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
- MIGRATE_SYNC, MR_MEMORY_FAILURE);
+ MIGRATE_SYNC, MR_MEMORY_FAILURE, &nr_succeeded);
if (ret) {
pr_info("soft offline: %#lx: hugepage migration failed %d, type %lx (%pGp)\n",
pfn, ret, page->flags, &page->flags);
@@ -1742,6 +1743,7 @@ static int __soft_offline_page(struct page *page, int flags)
{
int ret;
unsigned long pfn = page_to_pfn(page);
+ unsigned int nr_succeeded = 0;

/*
* Check PageHWPoison again inside page lock because PageHWPoison
@@ -1801,7 +1803,8 @@ static int __soft_offline_page(struct page *page, int flags)
page_is_file_cache(page));
list_add(&page->lru, &pagelist);
ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
- MIGRATE_SYNC, MR_MEMORY_FAILURE);
+ MIGRATE_SYNC, MR_MEMORY_FAILURE,
+ &nr_succeeded);
if (ret) {
if (!list_empty(&pagelist))
putback_movable_pages(&pagelist);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 1140f3b..29414a4 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1375,6 +1375,7 @@ static struct page *new_node_page(struct page *page, unsigned long private)
unsigned long pfn;
struct page *page;
int ret = 0;
+ unsigned int nr_succeeded = 0;
LIST_HEAD(source);

for (pfn = start_pfn; pfn < end_pfn; pfn++) {
@@ -1435,7 +1436,8 @@ static struct page *new_node_page(struct page *page, unsigned long private)
if (!list_empty(&source)) {
/* Allocate a new page from the nearest neighbor node */
ret = migrate_pages(&source, new_node_page, NULL, 0,
- MIGRATE_SYNC, MR_MEMORY_HOTPLUG);
+ MIGRATE_SYNC, MR_MEMORY_HOTPLUG,
+ &nr_succeeded);
if (ret) {
list_for_each_entry(page, &source, lru) {
pr_warn("migrating pfn %lx failed ret:%d ",
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index af171cc..96d6e2e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -962,6 +962,7 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
nodemask_t nmask;
LIST_HEAD(pagelist);
int err = 0;
+ unsigned int nr_succeeded = 0;

nodes_clear(nmask);
node_set(source, nmask);
@@ -977,7 +978,7 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,

if (!list_empty(&pagelist)) {
err = migrate_pages(&pagelist, alloc_new_node_page, NULL, dest,
- MIGRATE_SYNC, MR_SYSCALL);
+ MIGRATE_SYNC, MR_SYSCALL, &nr_succeeded);
if (err)
putback_movable_pages(&pagelist);
}
@@ -1156,6 +1157,7 @@ static long do_mbind(unsigned long start, unsigned long len,
struct mempolicy *new;
unsigned long end;
int err;
+ unsigned int nr_succeeded = 0;
LIST_HEAD(pagelist);

if (flags & ~(unsigned long)MPOL_MF_VALID)
@@ -1228,7 +1230,8 @@ static long do_mbind(unsigned long start, unsigned long len,
if (!list_empty(&pagelist)) {
WARN_ON_ONCE(flags & MPOL_MF_LAZY);
nr_failed = migrate_pages(&pagelist, new_page, NULL,
- start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND);
+ start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND,
+ &nr_succeeded);
if (nr_failed)
putback_movable_pages(&pagelist);
}
diff --git a/mm/migrate.c b/mm/migrate.c
index ac6f493..84bba47 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1387,6 +1387,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
* @mode: The migration mode that specifies the constraints for
* page migration, if any.
* @reason: The reason for page migration.
+ * @nr_succeeded: The number of pages migrated successfully.
*
* The function returns after 10 attempts or if no pages are movable any more
* because the list has become empty or no retryable pages exist any more.
@@ -1397,11 +1398,10 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
*/
int migrate_pages(struct list_head *from, new_page_t get_new_page,
free_page_t put_new_page, unsigned long private,
- enum migrate_mode mode, int reason)
+ enum migrate_mode mode, int reason, unsigned int *nr_succeeded)
{
int retry = 1;
int nr_failed = 0;
- int nr_succeeded = 0;
int pass = 0;
struct page *page;
struct page *page2;
@@ -1455,7 +1455,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
retry++;
break;
case MIGRATEPAGE_SUCCESS:
- nr_succeeded++;
+ (*nr_succeeded)++;
break;
default:
/*
@@ -1472,11 +1472,11 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
nr_failed += retry;
rc = nr_failed;
out:
- if (nr_succeeded)
- count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
+ if (*nr_succeeded)
+ count_vm_events(PGMIGRATE_SUCCESS, *nr_succeeded);
if (nr_failed)
count_vm_events(PGMIGRATE_FAIL, nr_failed);
- trace_mm_migrate_pages(nr_succeeded, nr_failed, mode, reason);
+ trace_mm_migrate_pages(*nr_succeeded, nr_failed, mode, reason);

if (!swapwrite)
current->flags &= ~PF_SWAPWRITE;
@@ -1501,12 +1501,13 @@ static int do_move_pages_to_node(struct mm_struct *mm,
struct list_head *pagelist, int node)
{
int err;
+ unsigned int nr_succeeded = 0;

if (list_empty(pagelist))
return 0;

err = migrate_pages(pagelist, alloc_new_node_page, NULL, node,
- MIGRATE_SYNC, MR_SYSCALL);
+ MIGRATE_SYNC, MR_SYSCALL, &nr_succeeded);
if (err)
putback_movable_pages(pagelist);
return err;
@@ -1939,6 +1940,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
pg_data_t *pgdat = NODE_DATA(node);
int isolated;
int nr_remaining;
+ unsigned int nr_succeeded = 0;
LIST_HEAD(migratepages);

/*
@@ -1963,7 +1965,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
list_add(&page->lru, &migratepages);
nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_page,
NULL, node, MIGRATE_ASYNC,
- MR_NUMA_MISPLACED);
+ MR_NUMA_MISPLACED, &nr_succeeded);
if (nr_remaining) {
if (!list_empty(&migratepages)) {
list_del(&page->lru);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bda17c2..e53cc96 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8139,6 +8139,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
unsigned long pfn = start;
unsigned int tries = 0;
int ret = 0;
+ unsigned int nr_succeeded = 0;

migrate_prep();

@@ -8166,7 +8167,8 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
cc->nr_migratepages -= nr_reclaimed;

ret = migrate_pages(&cc->migratepages, alloc_migrate_target,
- NULL, 0, cc->mode, MR_CONTIG_RANGE);
+ NULL, 0, cc->mode, MR_CONTIG_RANGE,
+ &nr_succeeded);
}
if (ret < 0) {
putback_movable_pages(&cc->migratepages);
--
1.8.3.1

2019-04-11 03:59:36

by Yang Shi

Subject: [v2 PATCH 9/9] mm: numa: add page promotion counter

Add a counter for pages promoted by NUMA balancing.

Signed-off-by: Yang Shi <[email protected]>
---
include/linux/vm_event_item.h | 1 +
mm/huge_memory.c | 4 ++++
mm/memory.c | 4 ++++
mm/vmstat.c | 1 +
4 files changed, 10 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 499a3aa..9f52a62 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -51,6 +51,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
NUMA_HINT_FAULTS,
NUMA_HINT_FAULTS_LOCAL,
NUMA_PAGE_MIGRATE,
+ NUMA_PAGE_PROMOTE,
#endif
#ifdef CONFIG_MIGRATION
PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0b18ac45..ca9d688 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1609,6 +1609,10 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
migrated = migrate_misplaced_transhuge_page(vma->vm_mm, vma,
vmf->pmd, pmd, vmf->address, page, target_nid);
if (migrated) {
+ if (!node_state(page_nid, N_CPU_MEM) &&
+ node_state(target_nid, N_CPU_MEM))
+ count_vm_numa_events(NUMA_PAGE_PROMOTE, HPAGE_PMD_NR);
+
flags |= TNF_MIGRATED;
page_nid = target_nid;
} else
diff --git a/mm/memory.c b/mm/memory.c
index 01c1ead..7b1218b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3704,6 +3704,10 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
/* Migrate to the requested node */
migrated = migrate_misplaced_page(page, vma, target_nid);
if (migrated) {
+ if (!node_state(page_nid, N_CPU_MEM) &&
+ node_state(target_nid, N_CPU_MEM))
+ count_vm_numa_event(NUMA_PAGE_PROMOTE);
+
page_nid = target_nid;
flags |= TNF_MIGRATED;
} else
diff --git a/mm/vmstat.c b/mm/vmstat.c
index d1e4993..fd194e3 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1220,6 +1220,7 @@ int fragmentation_index(struct zone *zone, unsigned int order)
"numa_hint_faults",
"numa_hint_faults_local",
"numa_pages_migrated",
+ "numa_pages_promoted",
#endif
#ifdef CONFIG_MIGRATION
"pgmigrate_success",
--
1.8.3.1

2019-04-11 03:59:38

by Yang Shi

Subject: [v2 PATCH 5/9] mm: vmscan: demote anon DRAM pages to PMEM node

Since PMEM provides larger capacity than DRAM and much lower access
latency than disk, it is a good choice to use as a middle tier between
DRAM and disk in the page reclaim path.

With PMEM nodes, the demotion path of anonymous pages could be:

DRAM -> PMEM -> swap device

This patch demotes anonymous pages only for the time being and demotes
THP to PMEM as a whole. To avoid expensive page reclaim and/or
compaction on the PMEM node if there is memory pressure on it, the most
conservative gfp flags are used, which fail quickly if there is memory
pressure and just wake up kswapd on failure. migrate_pages() would split
the THP and migrate the base pages one by one upon THP allocation
failure.

Demote pages to the closest non-DRAM node even if the system is
swapless. The current page reclaim logic only scans the anon LRU when
swap is on and swappiness is set properly. Demoting to PMEM doesn't
need to care whether swap is available or not, but reclaiming from PMEM
still skips the anon LRU if swap is not available.

The demotion just happens from a DRAM node to its closest PMEM node.
Demoting to a remote PMEM node or migrating from PMEM back to DRAM on
reclaim is not allowed for now.

And, define a new migration reason for demotion, called MR_DEMOTE.
Demote pages via async migration to avoid blocking.

Signed-off-by: Yang Shi <[email protected]>
---
include/linux/gfp.h | 12 ++++
include/linux/migrate.h | 1 +
include/trace/events/migrate.h | 3 +-
mm/debug.c | 1 +
mm/internal.h | 13 +++++
mm/migrate.c | 15 ++++-
mm/vmscan.c | 127 +++++++++++++++++++++++++++++++++++------
7 files changed, 149 insertions(+), 23 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index fdab7de..57ced51 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -285,6 +285,14 @@
* available and will not wake kswapd/kcompactd on failure. The _LIGHT
* version does not attempt reclaim/compaction at all and is by default used
* in page fault path, while the non-light is used by khugepaged.
+ *
+ * %GFP_DEMOTE is for migration on memory reclaim (a.k.a demotion) allocations.
+ * The allocation might happen in kswapd or direct reclaim, so assuming
+ * __GFP_IO and __GFP_FS are not allowed looks safer. Demotion happens for
+ * user pages (on LRU) only and on specific node. Generally it will fail
+ * quickly if memory is not available, but may wake up kswapd on failure.
+ *
+ * %GFP_TRANSHUGE_DEMOTE is used for THP demotion allocation.
*/
#define GFP_ATOMIC (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
#define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
@@ -300,6 +308,10 @@
#define GFP_TRANSHUGE_LIGHT ((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
__GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
#define GFP_TRANSHUGE (GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)
+#define GFP_DEMOTE (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_NORETRY | \
+ __GFP_NOMEMALLOC | __GFP_NOWARN | __GFP_THISNODE | \
+ GFP_NOWAIT)
+#define GFP_TRANSHUGE_DEMOTE (GFP_DEMOTE | __GFP_COMP)

/* Convert GFP flags to their corresponding migrate type */
#define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 837fdd1..cfb1f57 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -25,6 +25,7 @@ enum migrate_reason {
MR_MEMPOLICY_MBIND,
MR_NUMA_MISPLACED,
MR_CONTIG_RANGE,
+ MR_DEMOTE,
MR_TYPES
};

diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
index 705b33d..c1d5b36 100644
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -20,7 +20,8 @@
EM( MR_SYSCALL, "syscall_or_cpuset") \
EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind") \
EM( MR_NUMA_MISPLACED, "numa_misplaced") \
- EMe(MR_CONTIG_RANGE, "contig_range")
+ EM( MR_CONTIG_RANGE, "contig_range") \
+ EMe(MR_DEMOTE, "demote")

/*
* First define the enums in the above macros to be exported to userspace
diff --git a/mm/debug.c b/mm/debug.c
index c0b31b6..cc0d7df 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -25,6 +25,7 @@
"mempolicy_mbind",
"numa_misplaced",
"cma",
+ "demote",
};

const struct trace_print_flags pageflag_names[] = {
diff --git a/mm/internal.h b/mm/internal.h
index bee4d6c..8c424b5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -383,6 +383,19 @@ static inline int find_next_best_node(int node, nodemask_t *used_node_mask,
}
#endif

+static inline bool has_cpuless_node_online(void)
+{
+ nodemask_t nmask;
+
+ nodes_andnot(nmask, node_states[N_MEMORY],
+ node_states[N_CPU_MEM]);
+
+ if (nodes_empty(nmask))
+ return false;
+
+ return true;
+}
+
/* mm/util.c */
void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
struct vm_area_struct *prev, struct rb_node *rb_parent);
diff --git a/mm/migrate.c b/mm/migrate.c
index 84bba47..c97a739 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1001,7 +1001,8 @@ static int move_to_new_page(struct page *newpage, struct page *page,
}

static int __unmap_and_move(struct page *page, struct page *newpage,
- int force, enum migrate_mode mode)
+ int force, enum migrate_mode mode,
+ enum migrate_reason reason)
{
int rc = -EAGAIN;
int page_was_mapped = 0;
@@ -1138,8 +1139,16 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
if (rc == MIGRATEPAGE_SUCCESS) {
if (unlikely(!is_lru))
put_page(newpage);
- else
+ else {
+ /*
+ * Put demoted pages on the target node's
+ * active LRU.
+ */
+ if (!PageUnevictable(newpage) &&
+ reason == MR_DEMOTE)
+ SetPageActive(newpage);
putback_lru_page(newpage);
+ }
}

return rc;
@@ -1193,7 +1202,7 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
goto out;
}

- rc = __unmap_and_move(page, newpage, force, mode);
+ rc = __unmap_and_move(page, newpage, force, mode, reason);
if (rc == MIGRATEPAGE_SUCCESS)
set_page_owner_migrate_reason(newpage, reason);

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0504845..2a96609 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1046,6 +1046,45 @@ static void page_check_dirty_writeback(struct page *page,
mapping->a_ops->is_dirty_writeback(page, dirty, writeback);
}

+static inline bool is_demote_ok(int nid)
+{
+ /* Current node is cpuless node */
+ if (!node_state(nid, N_CPU_MEM))
+ return false;
+
+ /* No online PMEM node */
+ if (!has_cpuless_node_online())
+ return false;
+
+ return true;
+}
+
+#ifdef CONFIG_NUMA
+static struct page *alloc_demote_page(struct page *page, unsigned long node)
+{
+ if (unlikely(PageHuge(page)))
+ /* HugeTLB demotion is not supported for now */
+ BUG();
+ else if (PageTransHuge(page)) {
+ struct page *thp;
+
+ thp = alloc_pages_node(node, GFP_TRANSHUGE_DEMOTE,
+ HPAGE_PMD_ORDER);
+ if (!thp)
+ return NULL;
+ prep_transhuge_page(thp);
+ return thp;
+ } else
+ return __alloc_pages_node(node, GFP_DEMOTE, 0);
+}
+#else
+static inline struct page *alloc_demote_page(struct page *page,
+ unsigned long node)
+{
+ return NULL;
+}
+#endif
+
/*
* shrink_page_list() returns the number of reclaimed pages
*/
@@ -1058,6 +1097,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
{
LIST_HEAD(ret_pages);
LIST_HEAD(free_pages);
+ LIST_HEAD(demote_pages);
unsigned nr_reclaimed = 0;

memset(stat, 0, sizeof(*stat));
@@ -1220,6 +1260,18 @@ static unsigned long shrink_page_list(struct list_head *page_list,
*/
if (PageAnon(page) && PageSwapBacked(page)) {
if (!PageSwapCache(page)) {
+ /*
+ * Demote anonymous pages only for now and
+ * skip MADV_FREE pages.
+ *
+ * Demotion only happen from primary nodes
+ * to cpuless nodes.
+ */
+ if (is_demote_ok(page_to_nid(page))) {
+ list_add(&page->lru, &demote_pages);
+ unlock_page(page);
+ continue;
+ }
if (!(sc->gfp_mask & __GFP_IO))
goto keep_locked;
if (PageTransHuge(page)) {
@@ -1429,6 +1481,29 @@ static unsigned long shrink_page_list(struct list_head *page_list,
VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
}

+ /* Demote pages to PMEM */
+ if (!list_empty(&demote_pages)) {
+ int err, target_nid;
+ unsigned int nr_succeeded = 0;
+ nodemask_t used_mask;
+
+ nodes_clear(used_mask);
+ target_nid = find_next_best_node(pgdat->node_id, &used_mask,
+ true);
+
+ err = migrate_pages(&demote_pages, alloc_demote_page, NULL,
+ target_nid, MIGRATE_ASYNC, MR_DEMOTE,
+ &nr_succeeded);
+
+ nr_reclaimed += nr_succeeded;
+
+ if (err) {
+ putback_movable_pages(&demote_pages);
+
+ list_splice(&ret_pages, &demote_pages);
+ }
+ }
+
mem_cgroup_uncharge_list(&free_pages);
try_to_unmap_flush();
free_unref_page_list(&free_pages);
@@ -2140,10 +2215,11 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
unsigned long gb;

/*
- * If we don't have swap space, anonymous page deactivation
- * is pointless.
+ * If we don't have swap space or PMEM online, anonymous page
+ * deactivation is pointless.
*/
- if (!file && !total_swap_pages)
+ if (!file && !total_swap_pages &&
+ !is_demote_ok(pgdat->node_id))
return false;

inactive = lruvec_lru_size(lruvec, inactive_lru, sc->reclaim_idx);
@@ -2223,22 +2299,34 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
unsigned long ap, fp;
enum lru_list lru;

- /* If we have no swap space, do not bother scanning anon pages. */
- if (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0) {
- scan_balance = SCAN_FILE;
- goto out;
- }
-
/*
- * Global reclaim will swap to prevent OOM even with no
- * swappiness, but memcg users want to use this knob to
- * disable swapping for individual groups completely when
- * using the memory controller's swap limit feature would be
- * too expensive.
+ * Anon pages can be demoted to PMEM. If there is PMEM node online,
+ * still scan anonymous LRU even though the systme is swapless or
+ * swapping is disabled by memcg.
+ *
+ * If current node is already PMEM node, demotion is not applicable.
*/
- if (!global_reclaim(sc) && !swappiness) {
- scan_balance = SCAN_FILE;
- goto out;
+ if (!is_demote_ok(pgdat->node_id)) {
+ /*
+ * If we have no swap space, do not bother scanning
+ * anon pages.
+ */
+ if (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0) {
+ scan_balance = SCAN_FILE;
+ goto out;
+ }
+
+ /*
+ * Global reclaim will swap to prevent OOM even with no
+ * swappiness, but memcg users want to use this knob to
+ * disable swapping for individual groups completely when
+ * using the memory controller's swap limit feature would be
+ * too expensive.
+ */
+ if (!global_reclaim(sc) && !swappiness) {
+ scan_balance = SCAN_FILE;
+ goto out;
+ }
}

/*
@@ -2587,7 +2675,7 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
*/
pages_for_compaction = compact_gap(sc->order);
inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
- if (get_nr_swap_pages() > 0)
+ if (get_nr_swap_pages() > 0 || is_demote_ok(pgdat->node_id))
inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
if (sc->nr_reclaimed < pages_for_compaction &&
inactive_lru_pages > pages_for_compaction)
@@ -3284,7 +3372,8 @@ static void age_active_anon(struct pglist_data *pgdat,
{
struct mem_cgroup *memcg;

- if (!total_swap_pages)
+ /* Aging anon page as long as demotion is fine */
+ if (!total_swap_pages && !is_demote_ok(pgdat->node_id))
return;

memcg = mem_cgroup_iter(NULL, NULL, NULL);
--
1.8.3.1

2019-04-11 03:59:46

by Yang Shi

Subject: [v2 PATCH 7/9] mm: vmscan: check if the demote target node is contended or not

When demoting to a PMEM node, the target node may be under memory pressure,
which may cause migrate_pages() to fail.

If the failure is caused by memory pressure (i.e. migrate_pages() returns
-ENOMEM), tag the node with PGDAT_CONTENDED. The tag is cleared once the
target node is balanced again.

Check whether the target node is tagged PGDAT_CONTENDED; if it is, just skip
demotion.
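
Condensed, the lifecycle of the flag is roughly the following sketch (the real
hunks are below; target_nid and err come from the reclaim context):

    /* 1. Demotion failed because the PMEM node is under memory pressure: */
    if (err == -ENOMEM)
            set_bit(PGDAT_CONTENDED, &NODE_DATA(target_nid)->flags);

    /* 2. is_demote_ok() sees the flag and makes reclaim skip demotion: */
    if (test_bit(PGDAT_CONTENDED, &NODE_DATA(target_nid)->flags))
            return false;

    /* 3. kswapd clears it once the node is balanced again: */
    clear_bit(PGDAT_CONTENDED, &pgdat->flags);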

Signed-off-by: Yang Shi <[email protected]>
---
include/linux/mmzone.h | 3 +++
mm/vmscan.c | 28 ++++++++++++++++++++++++++++
2 files changed, 31 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fba7741..de534db 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -520,6 +520,9 @@ enum pgdat_flags {
* many pages under writeback
*/
PGDAT_RECLAIM_LOCKED, /* prevents concurrent reclaim */
+ PGDAT_CONTENDED, /* the node has not enough free memory
+ * available
+ */
};

enum zone_flags {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 80cd624..50cde53 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1048,6 +1048,9 @@ static void page_check_dirty_writeback(struct page *page,

static inline bool is_demote_ok(int nid, struct scan_control *sc)
{
+ int node;
+ nodemask_t used_mask;
+
/* It is pointless to do demotion in memcg reclaim */
if (!global_reclaim(sc))
return false;
@@ -1060,6 +1063,13 @@ static inline bool is_demote_ok(int nid, struct scan_control *sc)
if (!has_cpuless_node_online())
return false;

+ /* Check if the demote target node is contended or not */
+ nodes_clear(used_mask);
+ node = find_next_best_node(nid, &used_mask, true);
+
+ if (test_bit(PGDAT_CONTENDED, &NODE_DATA(node)->flags))
+ return false;
+
return true;
}

@@ -1502,6 +1512,10 @@ static unsigned long shrink_page_list(struct list_head *page_list,
nr_reclaimed += nr_succeeded;

if (err) {
+ if (err == -ENOMEM)
+ set_bit(PGDAT_CONTENDED,
+ &NODE_DATA(target_nid)->flags);
+
putback_movable_pages(&demote_pages);

list_splice(&ret_pages, &demote_pages);
@@ -2596,6 +2610,19 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
* scan target and the percentage scanning already complete
*/
lru = (lru == LRU_FILE) ? LRU_BASE : LRU_FILE;
+
+ /*
+ * The shrink_page_list() may find the demote target node is
+ * contended, if so it doesn't make sense to scan anonymous
+ * LRU again.
+ *
+ * Need check if swap is available or not too since demotion
+ * may happen on swapless system.
+ */
+ if (!is_demote_ok(pgdat->node_id, sc) &&
+ (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0))
+ lru = LRU_FILE;
+
nr_scanned = targets[lru] - nr[lru];
nr[lru] = targets[lru] * (100 - percentage) / 100;
nr[lru] -= min(nr[lru], nr_scanned);
@@ -3458,6 +3485,7 @@ static void clear_pgdat_congested(pg_data_t *pgdat)
clear_bit(PGDAT_CONGESTED, &pgdat->flags);
clear_bit(PGDAT_DIRTY, &pgdat->flags);
clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
+ clear_bit(PGDAT_CONTENDED, &pgdat->flags);
}

/*
--
1.8.3.1

2019-04-11 04:00:21

by Yang Shi

Subject: [v2 PATCH 8/9] mm: vmscan: add page demotion counter

Account the number of demoted pages in reclaim_stat->nr_demoted.

Add pgdemote_kswapd and pgdemote_direct VM counters shown in
/proc/vmstat.

Signed-off-by: Yang Shi <[email protected]>
---
include/linux/vm_event_item.h | 2 ++
include/linux/vmstat.h | 1 +
mm/internal.h | 1 +
mm/vmscan.c | 7 +++++++
mm/vmstat.c | 2 ++
5 files changed, 13 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 47a3441..499a3aa 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -32,6 +32,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
PGREFILL,
PGSTEAL_KSWAPD,
PGSTEAL_DIRECT,
+ PGDEMOTE_KSWAPD,
+ PGDEMOTE_DIRECT,
PGSCAN_KSWAPD,
PGSCAN_DIRECT,
PGSCAN_DIRECT_THROTTLE,
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 2db8d60..eb5d21c 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -29,6 +29,7 @@ struct reclaim_stat {
unsigned nr_activate;
unsigned nr_ref_keep;
unsigned nr_unmap_fail;
+ unsigned nr_demoted;
};

#ifdef CONFIG_VM_EVENT_COUNTERS
diff --git a/mm/internal.h b/mm/internal.h
index 8c424b5..8ba4853 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -156,6 +156,7 @@ struct scan_control {
unsigned int immediate;
unsigned int file_taken;
unsigned int taken;
+ unsigned int demoted;
} nr;
};

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 50cde53..a52c8248 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1511,6 +1511,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,

nr_reclaimed += nr_succeeded;

+ stat->nr_demoted = nr_succeeded;
+ if (current_is_kswapd())
+ __count_vm_events(PGDEMOTE_KSWAPD, stat->nr_demoted);
+ else
+ __count_vm_events(PGDEMOTE_DIRECT, stat->nr_demoted);
+
if (err) {
if (err == -ENOMEM)
set_bit(PGDAT_CONTENDED,
@@ -2019,6 +2025,7 @@ static int current_may_throttle(void)
sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
sc->nr.writeback += stat.nr_writeback;
sc->nr.immediate += stat.nr_immediate;
+ sc->nr.demoted += stat.nr_demoted;
sc->nr.taken += nr_taken;
if (file)
sc->nr.file_taken += nr_taken;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1a431dc..d1e4993 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1192,6 +1192,8 @@ int fragmentation_index(struct zone *zone, unsigned int order)
"pgrefill",
"pgsteal_kswapd",
"pgsteal_direct",
+ "pgdemote_kswapd",
+ "pgdemote_direct",
"pgscan_kswapd",
"pgscan_direct",
"pgscan_direct_throttle",
--
1.8.3.1

2019-04-11 14:29:03

by Dave Hansen

Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

This isn't so much another approach as it is some tweaks on top of
what's there. Right?

This set seems to present a bunch of ideas, like "promote if accessed
twice". Seems like a good idea, but I'm a lot more interested in seeing
data about it being a good idea. What workloads is it good for? Bad for?

These look like fun to play with, but I'd be really curious what you
think needs to be done before we start merging these ideas.

2019-04-11 14:32:04

by Dave Hansen

Subject: Re: [v2 PATCH 5/9] mm: vmscan: demote anon DRAM pages to PMEM node

On 4/10/19 8:56 PM, Yang Shi wrote:
> include/linux/gfp.h | 12 ++++
> include/linux/migrate.h | 1 +
> include/trace/events/migrate.h | 3 +-
> mm/debug.c | 1 +
> mm/internal.h | 13 +++++
> mm/migrate.c | 15 ++++-
> mm/vmscan.c | 127 +++++++++++++++++++++++++++++++++++------
> 7 files changed, 149 insertions(+), 23 deletions(-)

Yikes, that's a lot of code.

And it only handles anonymous pages?

Also, I don't see anything in the code tying this to strictly demote
from DRAM to PMEM. Is that the end effect, or is it really implemented
that way and I missed it?

2019-04-11 16:08:28

by Dave Hansen

Subject: Re: [v2 PATCH 7/9] mm: vmscan: check if the demote target node is contended or not

On 4/10/19 8:56 PM, Yang Shi wrote:
> When demoting to PMEM node, the target node may have memory pressure,
> then the memory pressure may cause migrate_pages() fail.
>
> If the failure is caused by memory pressure (i.e. returning -ENOMEM),
> tag the node with PGDAT_CONTENDED. The tag would be cleared once the
> target node is balanced again.
>
> Check if the target node is PGDAT_CONTENDED or not, if it is just skip
> demotion.

This seems like an actively bad idea to me.

Why do we need an *active* note to say the node is contended? Why isn't
just getting a failure back from migrate_pages() enough? Have you
observed this in practice?

2019-04-12 08:48:36

by Michal Hocko

[permalink] [raw]
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

On Thu 11-04-19 11:56:50, Yang Shi wrote:
[...]
> Design
> ======
> Basically, the approach is aimed to spread data from DRAM (closest to local
> CPU) down further to PMEM and disk (typically assume the lower tier storage
> is slower, larger and cheaper than the upper tier) by their hotness. The
> patchset tries to achieve this goal by doing memory promotion/demotion via
> NUMA balancing and memory reclaim as what the below diagram shows:
>
> DRAM <--> PMEM <--> Disk
> ^ ^
> |-------------------|
> swap
>
> When DRAM has memory pressure, demote pages to PMEM via page reclaim path.
> Then NUMA balancing will promote pages to DRAM as long as the page is referenced
> again. The memory pressure on PMEM node would push the inactive pages of PMEM
> to disk via swap.
>
> The promotion/demotion happens only between "primary" nodes (the nodes have
> both CPU and memory) and PMEM nodes. No promotion/demotion between PMEM nodes
> and promotion from DRAM to PMEM and demotion from PMEM to DRAM.
>
> The HMAT is effectively going to enforce "cpu-less" nodes for any memory range
> that has differentiated performance from the conventional memory pool, or
> differentiated performance for a specific initiator, per Dan Williams. So,
> assuming PMEM nodes are cpuless nodes sounds reasonable.
>
> However, cpuless nodes might be not PMEM nodes. But, actually, memory
> promotion/demotion doesn't care what kind of memory will be the target nodes,
> it could be DRAM, PMEM or something else, as long as they are the second tier
> memory (slower, larger and cheaper than regular DRAM), otherwise it sounds
> pointless to do such demotion.
>
> Defined "N_CPU_MEM" nodemask for the nodes which have both CPU and memory in
> order to distinguish with cpuless nodes (memory only, i.e. PMEM nodes) and
> memoryless nodes (some architectures, i.e. Power, may have memoryless nodes).
> Typically, memory allocation would happen on such nodes by default unless
> cpuless nodes are specified explicitly, cpuless nodes would be just fallback
> nodes, so they are also as known as "primary" nodes in this patchset. With
> two tier memory system (i.e. DRAM + PMEM), this sounds good enough to
> demonstrate the promotion/demotion approach for now, and this looks more
> architecture-independent. But it may be better to construct such node mask
> by reading hardware information (i.e. HMAT), particularly for more complex
> memory hierarchy.

I still believe you are overcomplicating this without a strong reason.
Why cannot we start simple and build from there? In other words I do not
think we really need anything like N_CPU_MEM at all.

I would expect that the very first attempt wouldn't do much more than
migrate to-be-reclaimed pages (without an explicit binding) with a
very optimistic allocation strategy (effectively GFP_NOWAIT) and if
that fails then simply give up. All that hooked essentially to the
node_reclaim path with a new node_reclaim mode so that the behavior
would be opt-in. This should be the most simplistic way to start AFAICS
and something people can play with without risking regressions.
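
A minimal sketch of what such an opt-in could look like, assuming a new
RECLAIM_MIGRATE bit next to the existing node_reclaim mode bits (the new bit
and the helper are hypothetical, not from any posted series; the other bits
mirror the ones defined in mm/vmscan.c):

#define RECLAIM_OFF     0
#define RECLAIM_ZONE    (1 << 0)        /* Run shrink_inactive_list on the zone */
#define RECLAIM_WRITE   (1 << 1)        /* Writeout pages during reclaim */
#define RECLAIM_UNMAP   (1 << 2)        /* Unmap pages during reclaim */
#define RECLAIM_MIGRATE (1 << 3)        /* Hypothetical: demote instead of reclaiming */

/* node_reclaim_mode is the existing sysctl-backed global in mm/vmscan.c */
static inline bool node_demotion_enabled(void)
{
        return node_reclaim_mode & RECLAIM_MIGRATE;
}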

Once we see how that behaves in the real world and what kind of corner
cases users are able to trigger, then we can build on top. E.g. do we want
to migrate from cpuless nodes as well? I am not really sure TBH. On one
hand why not if other nodes are free to hold that memory? Swap out is
more expensive. Anyway, this is the kind of decision which would rather be
shaped by existing experience than by an ad-hoc decision right now.

I would also not touch the numa balancing logic at this stage and rather
see how the current implementation behaves.
--
Michal Hocko
SUSE Labs

2019-04-15 22:09:15

by Yang Shi

[permalink] [raw]
Subject: Re: [v2 PATCH 7/9] mm: vmscan: check if the demote target node is contended or not



On 4/11/19 9:06 AM, Dave Hansen wrote:
> On 4/10/19 8:56 PM, Yang Shi wrote:
>> When demoting to PMEM node, the target node may have memory pressure,
>> then the memory pressure may cause migrate_pages() fail.
>>
>> If the failure is caused by memory pressure (i.e. returning -ENOMEM),
>> tag the node with PGDAT_CONTENDED. The tag would be cleared once the
>> target node is balanced again.
>>
>> Check if the target node is PGDAT_CONTENDED or not, if it is just skip
>> demotion.
> This seems like an actively bad idea to me.
>
> Why do we need an *active* note to say the node is contended? Why isn't
> just getting a failure back from migrate_pages() enough? Have you
> observed this in practice?

The flag will be used to check if the target node is contended or not
before moving the page into the demotion list. If the target node is
contended (i.e. GFP_NOWAIT would likely fail), the page reclaim code
won't even scan the anonymous page list on a swapless system. It will just
try to reclaim page cache. This would save some scanning time.
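
As a rough sketch of that check (next_demotion_node() is an assumed helper
name, not necessarily what the series uses; PGDAT_CONTENDED is the flag
added by this patch):

/* Skip demotion if the chosen target node was recently marked contended
 * by a failed GFP_NOWAIT migration; the target's kswapd clears the bit
 * once the node is balanced again. */
static bool demotion_target_contended(int nid)
{
        int target = next_demotion_node(nid);   /* assumed helper */

        if (target == NUMA_NO_NODE)
                return true;

        return test_bit(PGDAT_CONTENDED, &NODE_DATA(target)->flags);
}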

Thanks,
Yang


2019-04-15 22:11:28

by Yang Shi

[permalink] [raw]
Subject: Re: [v2 PATCH 5/9] mm: vmscan: demote anon DRAM pages to PMEM node



On 4/11/19 7:31 AM, Dave Hansen wrote:
> On 4/10/19 8:56 PM, Yang Shi wrote:
>> include/linux/gfp.h | 12 ++++
>> include/linux/migrate.h | 1 +
>> include/trace/events/migrate.h | 3 +-
>> mm/debug.c | 1 +
>> mm/internal.h | 13 +++++
>> mm/migrate.c | 15 ++++-
>> mm/vmscan.c | 127 +++++++++++++++++++++++++++++++++++------
>> 7 files changed, 149 insertions(+), 23 deletions(-)
> Yikes, that's a lot of code.
>
> And it only handles anonymous pages?

Yes, for the time being. But it is easy to extend to all kinds of pages.

>
> Also, I don't see anything in the code tying this to strictly demote
> from DRAM to PMEM. Is that the end effect, or is it really implemented
> that way and I missed it?

No, it is not restricted to PMEM. It just tries to demote from a "preferred
node" (also called a compute node) to a memory-only node. On hardware with
PMEM, PMEM would be the memory-only node.

Thanks,
Yang


2019-04-15 22:13:56

by Dave Hansen

[permalink] [raw]
Subject: Re: [v2 PATCH 7/9] mm: vmscan: check if the demote target node is contended or not

On 4/15/19 3:06 PM, Yang Shi wrote:
>>>
>> This seems like an actively bad idea to me.
>>
>> Why do we need an *active* note to say the node is contended?  Why isn't
>> just getting a failure back from migrate_pages() enough?  Have you
>> observed this in practice?
>
> The flag will be used to check if the target node is contended or not
> before moving the page into the demotion list. If the target node is
> contended (i.e. GFP_NOWAIT would likely fail), the page reclaim code
> even won't scan anonymous page list on swapless system.

That seems like the actual problem that needs to get fixed.

On systems where we have demotions available, perhaps we need to start
scanning anonymous pages again, at least for zones where we *can* demote
from them.
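
A minimal sketch of that direction, assuming a can_demote_from() helper
that does not exist in the posted series:

/* Keep aging anonymous LRUs even on a swapless system as long as the
 * node has somewhere to demote to. */
static bool anon_can_be_aged(struct pglist_data *pgdat)
{
        if (get_nr_swap_pages() > 0)
                return true;

        return can_demote_from(pgdat->node_id); /* assumed helper */
}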

2019-04-15 22:16:56

by Dave Hansen

[permalink] [raw]
Subject: Re: [v2 PATCH 5/9] mm: vmscan: demote anon DRAM pages to PMEM node

On 4/15/19 3:10 PM, Yang Shi wrote:
>> Also, I don't see anything in the code tying this to strictly demote
>> from DRAM to PMEM.  Is that the end effect, or is it really implemented
>> that way and I missed it?
>
> No, not restrict to PMEM. It just tries to demote from "preferred node"
> (or called compute node) to a memory-only node. In the hardware with
> PMEM, PMEM would be the memory-only node.

If that's the case, your patch subject is pretty criminal. :)

2019-04-15 22:24:38

by Yang Shi

[permalink] [raw]
Subject: Re: [v2 PATCH 7/9] mm: vmscan: check if the demote target node is contended or not



On 4/15/19 3:13 PM, Dave Hansen wrote:
> On 4/15/19 3:06 PM, Yang Shi wrote:
>>> This seems like an actively bad idea to me.
>>>
>>> Why do we need an *active* note to say the node is contended?  Why isn't
>>> just getting a failure back from migrate_pages() enough?  Have you
>>> observed this in practice?
>> The flag will be used to check if the target node is contended or not
>> before moving the page into the demotion list. If the target node is
>> contended (i.e. GFP_NOWAIT would likely fail), the page reclaim code
>> even won't scan anonymous page list on swapless system.
> That seems like the actual problem that needs to get fixed.
>
> On systems where we have demotions available, perhaps we need to start
> scanning anonymous pages again, at least for zones where we *can* demote
> from them.

But the problem is, if we know the demotion would likely fail, why bother
scanning anonymous pages again? The flag will be cleared by the target
node's kswapd once the node gets balanced again. Then the anonymous pages
would get scanned next time.


2019-04-15 22:27:42

by Yang Shi

[permalink] [raw]
Subject: Re: [v2 PATCH 5/9] mm: vmscan: demote anon DRAM pages to PMEM node



On 4/15/19 3:14 PM, Dave Hansen wrote:
> On 4/15/19 3:10 PM, Yang Shi wrote:
>>> Also, I don't see anything in the code tying this to strictly demote
>>> from DRAM to PMEM.  Is that the end effect, or is it really implemented
>>> that way and I missed it?
>> No, not restrict to PMEM. It just tries to demote from "preferred node"
>> (or called compute node) to a memory-only node. In the hardware with
>> PMEM, PMEM would be the memory-only node.
> If that's the case, your patch subject is pretty criminal. :)

Aha, s/PMEM/cpuless/ would sound guiltless.


2019-04-16 00:10:06

by Yang Shi

[permalink] [raw]
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node



On 4/12/19 1:47 AM, Michal Hocko wrote:
> On Thu 11-04-19 11:56:50, Yang Shi wrote:
> [...]
>> Design
>> ======
>> Basically, the approach is aimed to spread data from DRAM (closest to local
>> CPU) down further to PMEM and disk (typically assume the lower tier storage
>> is slower, larger and cheaper than the upper tier) by their hotness. The
>> patchset tries to achieve this goal by doing memory promotion/demotion via
>> NUMA balancing and memory reclaim as what the below diagram shows:
>>
>> DRAM <--> PMEM <--> Disk
>> ^ ^
>> |-------------------|
>> swap
>>
>> When DRAM has memory pressure, demote pages to PMEM via page reclaim path.
>> Then NUMA balancing will promote pages to DRAM as long as the page is referenced
>> again. The memory pressure on PMEM node would push the inactive pages of PMEM
>> to disk via swap.
>>
>> The promotion/demotion happens only between "primary" nodes (the nodes have
>> both CPU and memory) and PMEM nodes. No promotion/demotion between PMEM nodes
>> and promotion from DRAM to PMEM and demotion from PMEM to DRAM.
>>
>> The HMAT is effectively going to enforce "cpu-less" nodes for any memory range
>> that has differentiated performance from the conventional memory pool, or
>> differentiated performance for a specific initiator, per Dan Williams. So,
>> assuming PMEM nodes are cpuless nodes sounds reasonable.
>>
>> However, cpuless nodes might be not PMEM nodes. But, actually, memory
>> promotion/demotion doesn't care what kind of memory will be the target nodes,
>> it could be DRAM, PMEM or something else, as long as they are the second tier
>> memory (slower, larger and cheaper than regular DRAM), otherwise it sounds
>> pointless to do such demotion.
>>
>> Defined "N_CPU_MEM" nodemask for the nodes which have both CPU and memory in
>> order to distinguish with cpuless nodes (memory only, i.e. PMEM nodes) and
>> memoryless nodes (some architectures, i.e. Power, may have memoryless nodes).
>> Typically, memory allocation would happen on such nodes by default unless
>> cpuless nodes are specified explicitly, cpuless nodes would be just fallback
>> nodes, so they are also as known as "primary" nodes in this patchset. With
>> two tier memory system (i.e. DRAM + PMEM), this sounds good enough to
>> demonstrate the promotion/demotion approach for now, and this looks more
>> architecture-independent. But it may be better to construct such node mask
>> by reading hardware information (i.e. HMAT), particularly for more complex
>> memory hierarchy.
> I still believe you are overcomplicating this without a strong reason.
> Why cannot we start simple and build from there? In other words I do not
> think we really need anything like N_CPU_MEM at all.

In this patchset N_CPU_MEM is used to tell us which nodes are cpuless
nodes. They would be the preferred demotion targets. Of course, we could
rely on the firmware to just demote to the next best node, but that may be
a "preferred" node; if so, I don't see much benefit achieved by demotion.
Am I missing anything?
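
For illustration, a sketch of how N_CPU_MEM could drive target selection
(choose_demotion_target() is a made-up name, not the actual patch code):

/* Pick the nearest node that has memory but is not a "primary"
 * (CPU + memory) node, i.e. the closest cpuless/PMEM node. */
static int choose_demotion_target(int nid)
{
        int n, best = NUMA_NO_NODE, best_dist = INT_MAX;

        for_each_node_state(n, N_MEMORY) {
                if (node_state(n, N_CPU_MEM))
                        continue;
                if (node_distance(nid, n) < best_dist) {
                        best_dist = node_distance(nid, n);
                        best = n;
                }
        }

        return best;
}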

>
> I would expect that the very first attempt wouldn't do much more than
> migrate to-be-reclaimed pages (without an explicit binding) with a

Do you mean respecting mempolicy or cpuset when doing demotion? I was
wondering about this, but I didn't do so in the current implementation since
it may need to walk the rmap to retrieve the mempolicy in the reclaim path.
Is there any easier way to do so?

> very optimistic allocation strategy (effectivelly GFP_NOWAIT) and if

Yes, this has been done in this patchset.

> that fails then simply give up. All that hooked essentially to the
> node_reclaim path with a new node_reclaim mode so that the behavior
> would be opt-in. This should be the most simplistic way to start AFAICS
> and something people can play with without risking regressions.

I agree it is safer to start with node reclaim. Once it is stable enough
and we are confident enough, it can be extended to global reclaim.

>
> Once we see how that behaves in the real world and what kind of corner
> case user are able to trigger then we can build on top. E.g. do we want
> to migrate from cpuless nodes as well? I am not really sure TBH. On one
> hand why not if other nodes are free to hold that memory? Swap out is
> more expensive. Anyway this is kind of decision which would rather be
> shaped on an existing experience rather than ad-hoc decistion right now.

I do agree.

>
> I would also not touch the numa balancing logic at this stage and rather
> see how the current implementation behaves.

I agree we would prefer to start from something simpler and see how it works.

The "twice access" optimization is aimed at reducing the PMEM bandwidth
burden since PMEM bandwidth is a scarce resource. I did compare "twice
access" to "no twice access"; it does save a lot of bandwidth for some
once-off access patterns. For example, when running a stress test with
mmtest's usemem-stress-numa-compact, the kernel would promote ~600,000
pages with "twice access" in 4 hours, but it would promote ~80,000,000
pages without "twice access".
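
A simplified sketch of the idea (the real series extends
page_check_references() in the NUMA hinting fault path; this stand-in just
reuses PG_referenced for brevity and is not the actual implementation):

/* Promote an anonymous page on the second hinting fault, not the first,
 * so once-off accesses do not burn PMEM bandwidth. */
static bool should_promote(struct page *page)
{
        /* First fault: remember the access, do not promote yet. */
        if (!PageReferenced(page)) {
                SetPageReferenced(page);
                return false;
        }

        /* Second fault while still marked referenced: promote. */
        ClearPageReferenced(page);
        return true;
}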

Thanks,
Yang


2019-04-16 07:49:40

by Michal Hocko

[permalink] [raw]
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

On Mon 15-04-19 17:09:07, Yang Shi wrote:
>
>
> On 4/12/19 1:47 AM, Michal Hocko wrote:
> > On Thu 11-04-19 11:56:50, Yang Shi wrote:
> > [...]
> > > Design
> > > ======
> > > Basically, the approach is aimed to spread data from DRAM (closest to local
> > > CPU) down further to PMEM and disk (typically assume the lower tier storage
> > > is slower, larger and cheaper than the upper tier) by their hotness. The
> > > patchset tries to achieve this goal by doing memory promotion/demotion via
> > > NUMA balancing and memory reclaim as what the below diagram shows:
> > >
> > > DRAM <--> PMEM <--> Disk
> > > ^ ^
> > > |-------------------|
> > > swap
> > >
> > > When DRAM has memory pressure, demote pages to PMEM via page reclaim path.
> > > Then NUMA balancing will promote pages to DRAM as long as the page is referenced
> > > again. The memory pressure on PMEM node would push the inactive pages of PMEM
> > > to disk via swap.
> > >
> > > The promotion/demotion happens only between "primary" nodes (the nodes have
> > > both CPU and memory) and PMEM nodes. No promotion/demotion between PMEM nodes
> > > and promotion from DRAM to PMEM and demotion from PMEM to DRAM.
> > >
> > > The HMAT is effectively going to enforce "cpu-less" nodes for any memory range
> > > that has differentiated performance from the conventional memory pool, or
> > > differentiated performance for a specific initiator, per Dan Williams. So,
> > > assuming PMEM nodes are cpuless nodes sounds reasonable.
> > >
> > > However, cpuless nodes might be not PMEM nodes. But, actually, memory
> > > promotion/demotion doesn't care what kind of memory will be the target nodes,
> > > it could be DRAM, PMEM or something else, as long as they are the second tier
> > > memory (slower, larger and cheaper than regular DRAM), otherwise it sounds
> > > pointless to do such demotion.
> > >
> > > Defined "N_CPU_MEM" nodemask for the nodes which have both CPU and memory in
> > > order to distinguish with cpuless nodes (memory only, i.e. PMEM nodes) and
> > > memoryless nodes (some architectures, i.e. Power, may have memoryless nodes).
> > > Typically, memory allocation would happen on such nodes by default unless
> > > cpuless nodes are specified explicitly, cpuless nodes would be just fallback
> > > nodes, so they are also as known as "primary" nodes in this patchset. With
> > > two tier memory system (i.e. DRAM + PMEM), this sounds good enough to
> > > demonstrate the promotion/demotion approach for now, and this looks more
> > > architecture-independent. But it may be better to construct such node mask
> > > by reading hardware information (i.e. HMAT), particularly for more complex
> > > memory hierarchy.
> > I still believe you are overcomplicating this without a strong reason.
> > Why cannot we start simple and build from there? In other words I do not
> > think we really need anything like N_CPU_MEM at all.
>
> In this patchset N_CPU_MEM is used to tell us what nodes are cpuless nodes.
> They would be the preferred demotion target.  Of course, we could rely on
> firmware to just demote to the next best node, but it may be a "preferred"
> node, if so I don't see too much benefit achieved by demotion. Am I missing
> anything?

Why cannot we simply demote in the proximity order? Why do you make
cpuless nodes so special? If other close nodes are vacant then just use
them.

> > I would expect that the very first attempt wouldn't do much more than
> > migrate to-be-reclaimed pages (without an explicit binding) with a
>
> Do you mean respect mempolicy or cpuset when doing demotion? I was wondering
> this, but I didn't do so in the current implementation since it may need
> walk the rmap to retrieve the mempolicy in the reclaim path. Is there any
> easier way to do so?

You definitely have to follow policy. You cannot demote to a node which
is outside of the cpuset/mempolicy because you are breaking contract
expected by the userspace. That implies doing a rmap walk.
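
A rough sketch of what such a check could look like on top of the rmap walk
infrastructure (the helpers below are assumptions for illustration, cpusets
are not covered, and the struct mempolicy field access follows the layout of
this era; only rmap_walk(), vma_policy() and MPOL_BIND are existing
interfaces):

struct demote_policy_check {
        int target_nid;
        bool allowed;
};

/* Called for every VMA mapping the page; reject the demotion target if
 * any mapping has a bind policy that excludes it. */
static bool demote_check_one(struct page *page, struct vm_area_struct *vma,
                             unsigned long addr, void *arg)
{
        struct demote_policy_check *dpc = arg;
        struct mempolicy *pol = vma_policy(vma);

        if (pol && pol->mode == MPOL_BIND &&
            !node_isset(dpc->target_nid, pol->v.nodes))
                dpc->allowed = false;

        return dpc->allowed;    /* returning false stops the walk */
}

static bool demotion_allowed_by_policy(struct page *page, int target_nid)
{
        struct demote_policy_check dpc = {
                .target_nid = target_nid,
                .allowed = true,
        };
        struct rmap_walk_control rwc = {
                .rmap_one = demote_check_one,
                .arg = &dpc,
        };

        rmap_walk(page, &rwc);
        return dpc.allowed;
}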

> > I would also not touch the numa balancing logic at this stage and rather
> > see how the current implementation behaves.
>
> I agree we would prefer start from something simpler and see how it works.
>
> The "twice access" optimization is aimed to reduce the PMEM bandwidth burden
> since the bandwidth of PMEM is scarce resource. I did compare "twice access"
> to "no twice access", it does save a lot bandwidth for some once-off access
> pattern. For example, when running stress test with mmtest's
> usemem-stress-numa-compact. The kernel would promote ~600,000 pages with
> "twice access" in 4 hours, but it would promote ~80,000,000 pages without
> "twice access".

I presume this is a result of a synthetic workload, right? Or do you
have any numbers for a real life usecase?
--
Michal Hocko
SUSE Labs

2019-04-16 14:31:24

by Dave Hansen

[permalink] [raw]
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

On 4/16/19 12:47 AM, Michal Hocko wrote:
> You definitely have to follow policy. You cannot demote to a node which
> is outside of the cpuset/mempolicy because you are breaking contract
> expected by the userspace. That implies doing a rmap walk.

What *is* the contract with userspace, anyway? :)

Obviously, the preferred policy doesn't have any strict contract.

The strict binding has a bit more of a contract, but it doesn't prevent
swapping. Strict binding also doesn't keep another app from moving the
memory.

We have a reasonable argument that demotion is better than swapping.
So, we could say that even if a VMA has a strict NUMA policy, demoting
pages mapped there still beats swapping them or tossing the page
cache. It's doing them a favor to demote them.

Or, maybe we just need a swap hybrid where demotion moves the page but
keeps it unmapped and in the swap cache. That way an access gets a
fault and we can promote the page back to where it should be. That
would be faster than I/O-based swap for sure.

Anyway, I agree that the kernel probably shouldn't be moving pages
around willy-nilly with no consideration for memory policies, but users
might give us some wiggle room too.

2019-04-16 14:41:13

by Michal Hocko

[permalink] [raw]
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

On Tue 16-04-19 07:30:20, Dave Hansen wrote:
> On 4/16/19 12:47 AM, Michal Hocko wrote:
> > You definitely have to follow policy. You cannot demote to a node which
> > is outside of the cpuset/mempolicy because you are breaking contract
> > expected by the userspace. That implies doing a rmap walk.
>
> What *is* the contract with userspace, anyway? :)
>
> Obviously, the preferred policy doesn't have any strict contract.
>
> The strict binding has a bit more of a contract, but it doesn't prevent
> swapping.

Yes, but swapping is not a problem for using binding for memory
partitioning.

> Strict binding also doesn't keep another app from moving the
> memory.

I would consider that a bug.
--
Michal Hocko
SUSE Labs

2019-04-16 15:35:29

by Zi Yan

[permalink] [raw]
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

On 16 Apr 2019, at 10:30, Dave Hansen wrote:

> On 4/16/19 12:47 AM, Michal Hocko wrote:
>> You definitely have to follow policy. You cannot demote to a node which
>> is outside of the cpuset/mempolicy because you are breaking contract
>> expected by the userspace. That implies doing a rmap walk.
>
> What *is* the contract with userspace, anyway? :)
>
> Obviously, the preferred policy doesn't have any strict contract.
>
> The strict binding has a bit more of a contract, but it doesn't prevent
> swapping. Strict binding also doesn't keep another app from moving the
> memory.
>
> We have a reasonable argument that demotion is better than swapping.
> So, we could say that even if a VMA has a strict NUMA policy, demoting
> pages mapped there pages still beats swapping them or tossing the page
> cache. It's doing them a favor to demote them.

I just wonder whether page migration is always better than swapping,
since SSD write throughput keeps improving but page migration throughput
is still low. For example, my machine has an SSD with 2GB/s write throughput
but the throughput of 4KB page migration is less than 1GB/s. Why would we
want to use page migration for demotion instead of swapping?


--
Best Regards,
Yan Zi



2019-04-16 15:47:51

by Dave Hansen

[permalink] [raw]
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

On 4/16/19 7:39 AM, Michal Hocko wrote:
>> Strict binding also doesn't keep another app from moving the
>> memory.
> I would consider that a bug.

A bug where, though? Certainly not in the kernel.

I'm just saying that if an app has an assumption that strict binding
means that its memory can *NEVER* move, then that assumption is simply
wrong. It's not the guarantee that we provide. In fact, we provide
APIs (migrate_pages() at least) that explicitly and intentionally break
that guarantee.

All that our NUMA APIs provide (even the strict ones) is a promise about
where newly-allocated pages will be allocated.

2019-04-16 15:56:39

by Dave Hansen

[permalink] [raw]
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

On 4/16/19 8:33 AM, Zi Yan wrote:
>> We have a reasonable argument that demotion is better than
>> swapping. So, we could say that even if a VMA has a strict NUMA
>> policy, demoting pages mapped there pages still beats swapping
>> them or tossing the page cache. It's doing them a favor to
>> demote them.
> I just wonder whether page migration is always better than
> swapping, since SSD write throughput keeps improving but page
> migration throughput is still low. For example, my machine has a
> SSD with 2GB/s writing throughput but the throughput of 4KB page
> migration is less than 1GB/s, why do we want to use page migration
> for demotion instead of swapping?

Just because we observe that page migration apparently has lower
throughput today doesn't mean that we should consider it a dead end.

2019-04-16 16:13:45

by Zi Yan

[permalink] [raw]
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

On 16 Apr 2019, at 11:55, Dave Hansen wrote:

> On 4/16/19 8:33 AM, Zi Yan wrote:
>>> We have a reasonable argument that demotion is better than
>>> swapping. So, we could say that even if a VMA has a strict NUMA
>>> policy, demoting pages mapped there pages still beats swapping
>>> them or tossing the page cache. It's doing them a favor to
>>> demote them.
>> I just wonder whether page migration is always better than
>> swapping, since SSD write throughput keeps improving but page
>> migration throughput is still low. For example, my machine has a
>> SSD with 2GB/s writing throughput but the throughput of 4KB page
>> migration is less than 1GB/s, why do we want to use page migration
>> for demotion instead of swapping?
>
> Just because we observe that page migration apparently has lower
> throughput today doesn't mean that we should consider it a dead end.

I definitely agree. I also want to make the point that we might
want to improve page migration as well to show that demotion via
page migration will work. Since most of the proposed demotion approaches
use the same page replacement policy as swapping, if we do not have
high-throughput page migration, we might draw the false conclusion that
demotion is no better than swapping, when demotion can actually do
much better. :)

--
Best Regards,
Yan Zi



2019-04-16 18:36:33

by Michal Hocko

[permalink] [raw]
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

On Tue 16-04-19 08:46:56, Dave Hansen wrote:
> On 4/16/19 7:39 AM, Michal Hocko wrote:
> >> Strict binding also doesn't keep another app from moving the
> >> memory.
> > I would consider that a bug.
>
> A bug where, though? Certainly not in the kernel.

The kernel should refrain from moving explicitly bound memory willy-nilly. I
certainly agree that there are corner cases, e.g. memory hotplug. We do
break CPU affinity for CPU offline as well, so this is something the user
should expect. But the kernel shouldn't move explicitly bound pages to a
different node implicitly. I am not sure whether we even do that during
compaction; if we do, then I would consider _this_ to be a bug. And NUMA
rebalancing under memory pressure falls into the same category IMO.
--
Michal Hocko
SUSE Labs

2019-04-16 19:20:30

by Yang Shi

[permalink] [raw]
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node



On 4/16/19 12:47 AM, Michal Hocko wrote:
> On Mon 15-04-19 17:09:07, Yang Shi wrote:
>>
>> On 4/12/19 1:47 AM, Michal Hocko wrote:
>>> On Thu 11-04-19 11:56:50, Yang Shi wrote:
>>> [...]
>>>> Design
>>>> ======
>>>> Basically, the approach is aimed to spread data from DRAM (closest to local
>>>> CPU) down further to PMEM and disk (typically assume the lower tier storage
>>>> is slower, larger and cheaper than the upper tier) by their hotness. The
>>>> patchset tries to achieve this goal by doing memory promotion/demotion via
>>>> NUMA balancing and memory reclaim as what the below diagram shows:
>>>>
>>>> DRAM <--> PMEM <--> Disk
>>>> ^ ^
>>>> |-------------------|
>>>> swap
>>>>
>>>> When DRAM has memory pressure, demote pages to PMEM via page reclaim path.
>>>> Then NUMA balancing will promote pages to DRAM as long as the page is referenced
>>>> again. The memory pressure on PMEM node would push the inactive pages of PMEM
>>>> to disk via swap.
>>>>
>>>> The promotion/demotion happens only between "primary" nodes (the nodes have
>>>> both CPU and memory) and PMEM nodes. No promotion/demotion between PMEM nodes
>>>> and promotion from DRAM to PMEM and demotion from PMEM to DRAM.
>>>>
>>>> The HMAT is effectively going to enforce "cpu-less" nodes for any memory range
>>>> that has differentiated performance from the conventional memory pool, or
>>>> differentiated performance for a specific initiator, per Dan Williams. So,
>>>> assuming PMEM nodes are cpuless nodes sounds reasonable.
>>>>
>>>> However, cpuless nodes might be not PMEM nodes. But, actually, memory
>>>> promotion/demotion doesn't care what kind of memory will be the target nodes,
>>>> it could be DRAM, PMEM or something else, as long as they are the second tier
>>>> memory (slower, larger and cheaper than regular DRAM), otherwise it sounds
>>>> pointless to do such demotion.
>>>>
>>>> Defined "N_CPU_MEM" nodemask for the nodes which have both CPU and memory in
>>>> order to distinguish with cpuless nodes (memory only, i.e. PMEM nodes) and
>>>> memoryless nodes (some architectures, i.e. Power, may have memoryless nodes).
>>>> Typically, memory allocation would happen on such nodes by default unless
>>>> cpuless nodes are specified explicitly, cpuless nodes would be just fallback
>>>> nodes, so they are also as known as "primary" nodes in this patchset. With
>>>> two tier memory system (i.e. DRAM + PMEM), this sounds good enough to
>>>> demonstrate the promotion/demotion approach for now, and this looks more
>>>> architecture-independent. But it may be better to construct such node mask
>>>> by reading hardware information (i.e. HMAT), particularly for more complex
>>>> memory hierarchy.
>>> I still believe you are overcomplicating this without a strong reason.
>>> Why cannot we start simple and build from there? In other words I do not
>>> think we really need anything like N_CPU_MEM at all.
>> In this patchset N_CPU_MEM is used to tell us what nodes are cpuless nodes.
>> They would be the preferred demotion target.  Of course, we could rely on
>> firmware to just demote to the next best node, but it may be a "preferred"
>> node, if so I don't see too much benefit achieved by demotion. Am I missing
>> anything?
> Why cannot we simply demote in the proximity order? Why do you make
> cpuless nodes so special? If other close nodes are vacant then just use
> them.

We could. But this raises another question: would we prefer to just
demote to the next fallback node (try just once), and if it is contended,
then just swap (i.e. DRAM0 -> PMEM0 -> Swap); or would we prefer to try
all the nodes in the fallback order to find the first less contended one
(i.e. DRAM0 -> PMEM0 -> DRAM1 -> PMEM1 -> Swap)?


|-----|     |-----|                       |-----|     |-----|
|PMEM0| --- |DRAM0| --- CPU0 --- CPU1 --- |DRAM1| --- |PMEM1|
|-----|     |-----|                       |-----|     |-----|

The first one sounds simpler, and the current implementation does so; this
needs to find out the closest PMEM node by recognizing cpuless nodes.

If we prefer to go with the second option, it is definitely unnecessary to
specialize any node.
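
A sketch of the allocation side of the two options (new_demotion_page() is a
hypothetical migrate_pages() callback name, not the series' actual code):

static struct page *new_demotion_page(struct page *page, unsigned long node)
{
        /* Option 1: only try the chosen local target, opportunistically,
         * and let the caller fall back to swap on failure.
         * Option 2 would drop __GFP_THISNODE so the page allocator walks
         * the full zonelist fallback (PMEM0 -> DRAM1 -> PMEM1 -> ...)
         * before the caller gives up and swaps. */
        gfp_t gfp = GFP_NOWAIT | __GFP_NOWARN | __GFP_THISNODE;

        return alloc_pages_node(node, gfp, 0);
}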

>
>>> I would expect that the very first attempt wouldn't do much more than
>>> migrate to-be-reclaimed pages (without an explicit binding) with a
>> Do you mean respect mempolicy or cpuset when doing demotion? I was wondering
>> this, but I didn't do so in the current implementation since it may need
>> walk the rmap to retrieve the mempolicy in the reclaim path. Is there any
>> easier way to do so?
> You definitely have to follow policy. You cannot demote to a node which
> is outside of the cpuset/mempolicy because you are breaking contract
> expected by the userspace. That implies doing a rmap walk.

OK. However, this may prevent demoting unmapped page cache since
there is no way to find those pages' policy.

And, we have to think about what we should do when the demotion target
conflicts with the mempolicy. The easiest way is to just skip those
conflicting pages in demotion. Or we may have to do the demotion one page
at a time instead of migrating a list of pages.

>
>>> I would also not touch the numa balancing logic at this stage and rather
>>> see how the current implementation behaves.
>> I agree we would prefer start from something simpler and see how it works.
>>
>> The "twice access" optimization is aimed to reduce the PMEM bandwidth burden
>> since the bandwidth of PMEM is scarce resource. I did compare "twice access"
>> to "no twice access", it does save a lot bandwidth for some once-off access
>> pattern. For example, when running stress test with mmtest's
>> usemem-stress-numa-compact. The kernel would promote ~600,000 pages with
>> "twice access" in 4 hours, but it would promote ~80,000,000 pages without
>> "twice access".
> I pressume this is a result of a synthetic workload, right? Or do you
> have any numbers for a real life usecase?

The test just uses usemem.


2019-04-16 21:24:52

by Dave Hansen

[permalink] [raw]
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

On 4/16/19 12:19 PM, Yang Shi wrote:
> would we prefer to try all the nodes in the fallback order to find the
> first less contended one (i.e. DRAM0 -> PMEM0 -> DRAM1 -> PMEM1 -> Swap)?

Once a page went to DRAM1, how would we tell that it originated in DRAM0
and is following the DRAM0 path rather than the DRAM1 path?

Memory on DRAM0's path would be:

DRAM0 -> PMEM0 -> DRAM1 -> PMEM1 -> Swap

Memory on DRAM1's path would be:

DRAM1 -> PMEM1 -> DRAM0 -> PMEM0 -> Swap

Keith Busch had a set of patches to let you specify the demotion order
via sysfs for fun. The rules we came up with were:
1. Pages keep no history of where they have been
2. Each node can only demote to one other node
3. The demotion path can not have cycles

That ensures that we *can't* follow the paths you described above, if we
follow those rules...
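
A sketch of how rules 2 and 3 can be enforced with a single-successor table
(node_demotion[] is modelled on the description of Keith's sysfs series
above, not taken from it verbatim):

/* One demotion target per node (rule 2); NUMA_NO_NODE terminates a path. */
static int node_demotion[MAX_NUMNODES] = {
        [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE
};

/* Rule 3: reject a configuration whose chain ever loops back on itself.
 * Floyd's cycle detection over the single-successor chain. */
static bool demotion_path_has_cycle(int start)
{
        int slow = start, fast = start;

        while (fast != NUMA_NO_NODE && node_demotion[fast] != NUMA_NO_NODE) {
                slow = node_demotion[slow];
                fast = node_demotion[node_demotion[fast]];
                if (slow == fast)
                        return true;
        }

        return false;
}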

2019-04-16 22:00:19

by Yang Shi

[permalink] [raw]
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node



On 4/16/19 2:22 PM, Dave Hansen wrote:
> On 4/16/19 12:19 PM, Yang Shi wrote:
>> would we prefer to try all the nodes in the fallback order to find the
>> first less contended one (i.e. DRAM0 -> PMEM0 -> DRAM1 -> PMEM1 -> Swap)?
> Once a page went to DRAM1, how would we tell that it originated in DRAM0
> and is following the DRAM0 path rather than the DRAM1 path?
>
> Memory on DRAM0's path would be:
>
> DRAM0 -> PMEM0 -> DRAM1 -> PMEM1 -> Swap
>
> Memory on DRAM1's path would be:
>
> DRAM1 -> PMEM1 -> DRAM0 -> PMEM0 -> Swap
>
> Keith Busch had a set of patches to let you specify the demotion order
> via sysfs for fun. The rules we came up with were:
> 1. Pages keep no history of where they have been
> 2. Each node can only demote to one other node

Does this mean any remote node? Or just DRAM to PMEM, but remote PMEM
might be ok?

> 3. The demotion path can not have cycles

I agree with these rules; actually, my implementation implies a similar
rule. I tried to understand what Michal means. My current implementation
expects demotion to happen from the initiator to the target in the same
local pair. But Michal may expect to be able to demote to a remote
initiator or target if the local target is contended.

IMHO, demotion within the local pair makes things much simpler.

>
> That ensures that we *can't* follow the paths you described above, if we
> follow those rules...

Yes, it might create a cycle.


2019-04-16 23:05:19

by Dave Hansen

[permalink] [raw]
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

On 4/16/19 2:59 PM, Yang Shi wrote:
> On 4/16/19 2:22 PM, Dave Hansen wrote:
>> Keith Busch had a set of patches to let you specify the demotion order
>> via sysfs for fun.  The rules we came up with were:
>> 1. Pages keep no history of where they have been
>> 2. Each node can only demote to one other node
>
> Does this mean any remote node? Or just DRAM to PMEM, but remote PMEM
> might be ok?

In Keith's code, I don't think we differentiated. We let any node
demote to any other node you want, as long as it follows the cycle rule.

2019-04-16 23:20:19

by Yang Shi

[permalink] [raw]
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node



On 4/16/19 4:04 PM, Dave Hansen wrote:
> On 4/16/19 2:59 PM, Yang Shi wrote:
>> On 4/16/19 2:22 PM, Dave Hansen wrote:
>>> Keith Busch had a set of patches to let you specify the demotion order
>>> via sysfs for fun.  The rules we came up with were:
>>> 1. Pages keep no history of where they have been
>>> 2. Each node can only demote to one other node
>> Does this mean any remote node? Or just DRAM to PMEM, but remote PMEM
>> might be ok?
> In Keith's code, I don't think we differentiated. We let any node
> demote to any other node you want, as long as it follows the cycle rule.

I recall Keith's code let the userspace define the target node. Anyway,
we may need to add one rule: do not migrate-on-reclaim from a PMEM node.
Demoting from PMEM to DRAM sounds pointless.


2019-04-16 23:20:52

by Yang Shi

[permalink] [raw]
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node


>>>> Why cannot we start simple and build from there? In other words I
>>>> do not
>>>> think we really need anything like N_CPU_MEM at all.
>>> In this patchset N_CPU_MEM is used to tell us what nodes are cpuless
>>> nodes.
>>> They would be the preferred demotion target.  Of course, we could
>>> rely on
>>> firmware to just demote to the next best node, but it may be a
>>> "preferred"
>>> node, if so I don't see too much benefit achieved by demotion. Am I
>>> missing
>>> anything?
>> Why cannot we simply demote in the proximity order? Why do you make
>> cpuless nodes so special? If other close nodes are vacant then just use
>> them.

And, I suppose we agree to *not* migrate from a PMEM node (cpuless
node) to any other node on the reclaim path, right? If so, we need to know
whether the current node is a DRAM node or a PMEM node. If it is a DRAM
node, do demotion; if it is a PMEM node, do swap. So, N_CPU_MEM is used to
tell us whether the current node is a DRAM node or not.

> We could. But, this raises another question, would we prefer to just
> demote to the next fallback node (just try once), if it is contended,
> then just swap (i.e. DRAM0 -> PMEM0 -> Swap); or would we prefer to
> try all the nodes in the fallback order to find the first less
> contended one (i.e. DRAM0 -> PMEM0 -> DRAM1 -> PMEM1 -> Swap)?
>
>
> |-----|     |-----|                       |-----|     |-----|
> |PMEM0| --- |DRAM0| --- CPU0 --- CPU1 --- |DRAM1| --- |PMEM1|
> |-----|     |-----|                       |-----|     |-----|
>
> The first one sounds simpler, and the current implementation does so
> and this needs find out the closest PMEM node by recognizing cpuless
> node.
>
> If we prefer go with the second option, it is definitely unnecessary
> to specialize any node.
>

2019-04-17 09:19:15

by Michal Hocko

[permalink] [raw]
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

On Tue 16-04-19 12:19:21, Yang Shi wrote:
>
>
> On 4/16/19 12:47 AM, Michal Hocko wrote:
[...]
> > Why cannot we simply demote in the proximity order? Why do you make
> > cpuless nodes so special? If other close nodes are vacant then just use
> > them.
>
> We could. But, this raises another question, would we prefer to just demote
> to the next fallback node (just try once), if it is contended, then just
> swap (i.e. DRAM0 -> PMEM0 -> Swap); or would we prefer to try all the nodes
> in the fallback order to find the first less contended one (i.e. DRAM0 ->
> PMEM0 -> DRAM1 -> PMEM1 -> Swap)?

I would go with the latter. Why? Because it is more natural. That
is the natural allocation path, so I do not see why this shouldn't be the
natural demotion path.

>
> |-----|     |-----|                       |-----|     |-----|
> |PMEM0| --- |DRAM0| --- CPU0 --- CPU1 --- |DRAM1| --- |PMEM1|
> |-----|     |-----|                       |-----|     |-----|
>
> The first one sounds simpler, and the current implementation does so and
> this needs find out the closest PMEM node by recognizing cpuless node.

Unless you are specifying an explicit nodemask, the allocator will
do the allocation fallback for the migration target for you.

> If we prefer go with the second option, it is definitely unnecessary to
> specialize any node.
>
> > > > I would expect that the very first attempt wouldn't do much more than
> > > > migrate to-be-reclaimed pages (without an explicit binding) with a
> > > Do you mean respect mempolicy or cpuset when doing demotion? I was wondering
> > > this, but I didn't do so in the current implementation since it may need
> > > walk the rmap to retrieve the mempolicy in the reclaim path. Is there any
> > > easier way to do so?
> > You definitely have to follow policy. You cannot demote to a node which
> > is outside of the cpuset/mempolicy because you are breaking contract
> > expected by the userspace. That implies doing a rmap walk.
>
> OK, however, this may prevent from demoting unmapped page cache since there
> is no way to find those pages' policy.

I do not really expect that hard numa binding for the page cache is a
usecase we really have to lose sleep over for now.

> And, we have to think about what we should do when the demotion target has
> conflict with the mempolicy.

Simply skip it.

> The easiest way is to just skip those conflict
> pages in demotion. Or we may have to do the demotion one page by one page
> instead of migrating a list of pages.

Yes, one page at a time sounds reasonable to me. This is how we do
reclaim anyway.
--
Michal Hocko
SUSE Labs

2019-04-17 09:24:30

by Michal Hocko

[permalink] [raw]
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

On Tue 16-04-19 14:22:33, Dave Hansen wrote:
> On 4/16/19 12:19 PM, Yang Shi wrote:
> > would we prefer to try all the nodes in the fallback order to find the
> > first less contended one (i.e. DRAM0 -> PMEM0 -> DRAM1 -> PMEM1 -> Swap)?
>
> Once a page went to DRAM1, how would we tell that it originated in DRAM0
> and is following the DRAM0 path rather than the DRAM1 path?
>
> Memory on DRAM0's path would be:
>
> DRAM0 -> PMEM0 -> DRAM1 -> PMEM1 -> Swap
>
> Memory on DRAM1's path would be:
>
> DRAM1 -> PMEM1 -> DRAM0 -> PMEM0 -> Swap
>
> Keith Busch had a set of patches to let you specify the demotion order
> via sysfs for fun. The rules we came up with were:

I am not a fan of any sysfs "fun"

> 1. Pages keep no history of where they have been

makes sense

> 2. Each node can only demote to one other node

Not really, see my other email. I do not really see any strong reason
not to use the full zonelist to demote to.

> 3. The demotion path can not have cycles

Yes. This could be achieved by a GFP_NOWAIT opportunistic allocation for
the migration target. That should prevent loops or artificial node
exhaustion quite naturally AFAICS. Maybe we will need some tricks to
raise the watermark, but I am not convinced something like that is really
necessary.

--
Michal Hocko
SUSE Labs

2019-04-17 15:21:33

by Keith Busch

[permalink] [raw]
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

On Tue, Apr 16, 2019 at 04:17:44PM -0700, Yang Shi wrote:
> On 4/16/19 4:04 PM, Dave Hansen wrote:
> > On 4/16/19 2:59 PM, Yang Shi wrote:
> > > On 4/16/19 2:22 PM, Dave Hansen wrote:
> > > > Keith Busch had a set of patches to let you specify the demotion order
> > > > via sysfs for fun.  The rules we came up with were:
> > > > 1. Pages keep no history of where they have been
> > > > 2. Each node can only demote to one other node
> > > Does this mean any remote node? Or just DRAM to PMEM, but remote PMEM
> > > might be ok?
> > In Keith's code, I don't think we differentiated. We let any node
> > demote to any other node you want, as long as it follows the cycle rule.
>
> I recall Keith's code let the userspace define the target node.

Right, you have to opt in in my original proposal since it may be a
bit presumptuous of the kernel to decide how a node's memory is going
to be used. User applications have other intentions for it.

It wouldn't be too difficult to have HMAT create a reasonable initial
migration graph too, and that could also be an opt-in user choice.

> Anyway, we may need add one rule: not migrate-on-reclaim from PMEM
> node. Demoting from PMEM to DRAM sounds pointless.

I really don't think we should be making such hard rules on PMEM. It
makes more sense to base migration rules on performance and locality
than on a persistence attribute.

2019-04-17 15:40:53

by Michal Hocko

[permalink] [raw]
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

On Wed 17-04-19 09:23:46, Keith Busch wrote:
> On Wed, Apr 17, 2019 at 11:23:18AM +0200, Michal Hocko wrote:
> > On Tue 16-04-19 14:22:33, Dave Hansen wrote:
> > > Keith Busch had a set of patches to let you specify the demotion order
> > > via sysfs for fun. The rules we came up with were:
> >
> > I am not a fan of any sysfs "fun"
>
> I'm hung up on the user facing interface, but there should be some way a
> user decides if a memory node is or is not a migrate target, right?

Why? Or to put it differently, why do we have to start with a user
interface at this stage when we actually barely have any real usecases
out there?

--
Michal Hocko
SUSE Labs

2019-04-17 15:49:19

by Keith Busch

[permalink] [raw]
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

On Wed, Apr 17, 2019 at 05:39:23PM +0200, Michal Hocko wrote:
> On Wed 17-04-19 09:23:46, Keith Busch wrote:
> > On Wed, Apr 17, 2019 at 11:23:18AM +0200, Michal Hocko wrote:
> > > On Tue 16-04-19 14:22:33, Dave Hansen wrote:
> > > > Keith Busch had a set of patches to let you specify the demotion order
> > > > via sysfs for fun. The rules we came up with were:
> > >
> > > I am not a fan of any sysfs "fun"
> >
> > I'm hung up on the user facing interface, but there should be some way a
> > user decides if a memory node is or is not a migrate target, right?
>
> Why? Or to put it differently, why do we have to start with a user
> interface at this stage when we actually barely have any real usecases
> out there?

The use case is an alternative to swap, right? The user has to decide
which storage is the swap target, so operating in the same spirit.

2019-04-17 16:35:22

by Keith Busch

[permalink] [raw]
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

On Wed, Apr 17, 2019 at 11:23:18AM +0200, Michal Hocko wrote:
> On Tue 16-04-19 14:22:33, Dave Hansen wrote:
> > Keith Busch had a set of patches to let you specify the demotion order
> > via sysfs for fun. The rules we came up with were:
>
> I am not a fan of any sysfs "fun"

I'm hung up on the user facing interface, but there should be some way a
user decides if a memory node is or is not a migrate target, right?

2019-04-17 16:40:54

by Michal Hocko

[permalink] [raw]
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

On Wed 17-04-19 09:37:39, Keith Busch wrote:
> On Wed, Apr 17, 2019 at 05:39:23PM +0200, Michal Hocko wrote:
> > On Wed 17-04-19 09:23:46, Keith Busch wrote:
> > > On Wed, Apr 17, 2019 at 11:23:18AM +0200, Michal Hocko wrote:
> > > > On Tue 16-04-19 14:22:33, Dave Hansen wrote:
> > > > > Keith Busch had a set of patches to let you specify the demotion order
> > > > > via sysfs for fun. The rules we came up with were:
> > > >
> > > > I am not a fan of any sysfs "fun"
> > >
> > > I'm hung up on the user facing interface, but there should be some way a
> > > user decides if a memory node is or is not a migrate target, right?
> >
> > Why? Or to put it differently, why do we have to start with a user
> > interface at this stage when we actually barely have any real usecases
> > out there?
>
> The use case is an alternative to swap, right? The user has to decide
> which storage is the swap target, so operating in the same spirit.

I do not follow. If you use rebalancing you can still deplete the memory
and end up in swap storage. If you want to reclaim/swap rather than
rebalance, then you do not enable rebalancing (via node_reclaim or a similar
mechanism).

--
Michal Hocko
SUSE Labs

2019-04-17 17:16:08

by Dave Hansen

[permalink] [raw]
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

On 4/17/19 2:23 AM, Michal Hocko wrote:
>> 3. The demotion path can not have cycles
> yes. This could be achieved by GFP_NOWAIT opportunistic allocation for
> the migration target. That should prevent from loops or artificial nodes
> exhausting quite naturaly AFAICS. Maybe we will need some tricks to
> raise the watermark but I am not convinced something like that is really
> necessary.

I don't think GFP_NOWAIT alone is good enough.

Let's say we have a system full of clean page cache and only two nodes:
0 and 1. GFP_NOWAIT will eventually kick off kswapd on both nodes.
Each kswapd will be migrating pages to the *other* node since each is in
the other's fallback path.

I think what you're saying is that, eventually, the kswapds will see
allocation failures and stop migrating, providing hysteresis. This is
probably true.

But, I'm more concerned about that window where the kswapds are throwing
pages at each other because they're effectively just wasting resources
in this window. I guess we should figure out how large this window is
and how fast (or if) the dampening occurs in practice.

2019-04-17 17:28:19

by Yang Shi

[permalink] [raw]
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node



On 4/17/19 9:39 AM, Michal Hocko wrote:
> On Wed 17-04-19 09:37:39, Keith Busch wrote:
>> On Wed, Apr 17, 2019 at 05:39:23PM +0200, Michal Hocko wrote:
>>> On Wed 17-04-19 09:23:46, Keith Busch wrote:
>>>> On Wed, Apr 17, 2019 at 11:23:18AM +0200, Michal Hocko wrote:
>>>>> On Tue 16-04-19 14:22:33, Dave Hansen wrote:
>>>>>> Keith Busch had a set of patches to let you specify the demotion order
>>>>>> via sysfs for fun. The rules we came up with were:
>>>>> I am not a fan of any sysfs "fun"
>>>> I'm hung up on the user facing interface, but there should be some way a
>>>> user decides if a memory node is or is not a migrate target, right?
>>> Why? Or to put it differently, why do we have to start with a user
>>> interface at this stage when we actually barely have any real usecases
>>> out there?
>> The use case is an alternative to swap, right? The user has to decide
>> which storage is the swap target, so operating in the same spirit.
> I do not follow. If you use rebalancing you can still deplete the memory
> and end up in a swap storage. If you want to reclaim/swap rather than
> rebalance then you do not enable rebalancing (by node_reclaim or similar
> mechanism).

I'm a little bit confused. Do you mean to just *not* do reclaim/swap in
rebalancing mode? If rebalancing is on, then node_reclaim just moves the
pages around nodes, and then kswapd or direct reclaim would take care of swap?

If so, node reclaim on a PMEM node may rebalance the pages to a DRAM
node? Should this be allowed?

I think both Keith and I intended to treat PMEM as a tier in the
reclaim hierarchy. The reclaim should push inactive pages down to PMEM,
then swap. So, PMEM is kind of a "terminal" node. So, he introduced a
sysfs-defined target node, and I introduced N_CPU_MEM.

>

2019-04-17 17:37:15

by Keith Busch

[permalink] [raw]
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

On Wed, Apr 17, 2019 at 10:26:05AM -0700, Yang Shi wrote:
> On 4/17/19 9:39 AM, Michal Hocko wrote:
> > On Wed 17-04-19 09:37:39, Keith Busch wrote:
> > > On Wed, Apr 17, 2019 at 05:39:23PM +0200, Michal Hocko wrote:
> > > > On Wed 17-04-19 09:23:46, Keith Busch wrote:
> > > > > On Wed, Apr 17, 2019 at 11:23:18AM +0200, Michal Hocko wrote:
> > > > > > On Tue 16-04-19 14:22:33, Dave Hansen wrote:
> > > > > > > Keith Busch had a set of patches to let you specify the demotion order
> > > > > > > via sysfs for fun. The rules we came up with were:
> > > > > > I am not a fan of any sysfs "fun"
> > > > > I'm hung up on the user facing interface, but there should be some way a
> > > > > user decides if a memory node is or is not a migrate target, right?
> > > > Why? Or to put it differently, why do we have to start with a user
> > > > interface at this stage when we actually barely have any real usecases
> > > > out there?
> > > The use case is an alternative to swap, right? The user has to decide
> > > which storage is the swap target, so operating in the same spirit.
> > I do not follow. If you use rebalancing you can still deplete the memory
> > and end up in a swap storage. If you want to reclaim/swap rather than
> > rebalance then you do not enable rebalancing (by node_reclaim or similar
> > mechanism).
>
> I'm a little bit confused. Do you mean just do *not* do reclaim/swap in
> rebalancing mode? If rebalancing is on, then node_reclaim just move the
> pages around nodes, then kswapd or direct reclaim would take care of swap?
>
> If so the node reclaim on PMEM node may rebalance the pages to DRAM node?
> Should this be allowed?
>
> I think both I and Keith was supposed to treat PMEM as a tier in the reclaim
> hierarchy. The reclaim should push inactive pages down to PMEM, then swap.
> So, PMEM is kind of a "terminal" node. So, he introduced sysfs defined
> target node, I introduced N_CPU_MEM.

Yeah, I think Yang and I view "demotion" as a separate feature from
numa rebalancing.

2019-04-17 17:52:58

by Michal Hocko

[permalink] [raw]
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

On Wed 17-04-19 10:26:05, Yang Shi wrote:
>
>
> On 4/17/19 9:39 AM, Michal Hocko wrote:
> > On Wed 17-04-19 09:37:39, Keith Busch wrote:
> > > On Wed, Apr 17, 2019 at 05:39:23PM +0200, Michal Hocko wrote:
> > > > On Wed 17-04-19 09:23:46, Keith Busch wrote:
> > > > > On Wed, Apr 17, 2019 at 11:23:18AM +0200, Michal Hocko wrote:
> > > > > > On Tue 16-04-19 14:22:33, Dave Hansen wrote:
> > > > > > > Keith Busch had a set of patches to let you specify the demotion order
> > > > > > > via sysfs for fun. The rules we came up with were:
> > > > > > I am not a fan of any sysfs "fun"
> > > > > I'm hung up on the user facing interface, but there should be some way a
> > > > > user decides if a memory node is or is not a migrate target, right?
> > > > Why? Or to put it differently, why do we have to start with a user
> > > > interface at this stage when we actually barely have any real usecases
> > > > out there?
> > > The use case is an alternative to swap, right? The user has to decide
> > > which storage is the swap target, so operating in the same spirit.
> > I do not follow. If you use rebalancing you can still deplete the memory
> > and end up in a swap storage. If you want to reclaim/swap rather than
> > rebalance then you do not enable rebalancing (by node_reclaim or similar
> > mechanism).
>
> I'm a little bit confused. Do you mean just do *not* do reclaim/swap in
> rebalancing mode? If rebalancing is on, then node_reclaim just move the
> pages around nodes, then kswapd or direct reclaim would take care of swap?

Yes, that was the idea I wanted to get through. Sorry if that was not
really clear.

> If so the node reclaim on PMEM node may rebalance the pages to DRAM node?
> Should this be allowed?

Why shouldn't it? If there are other vacant nodes to absorb that memory,
then why not use them?

> I think both I and Keith was supposed to treat PMEM as a tier in the reclaim
> hierarchy. The reclaim should push inactive pages down to PMEM, then swap.
> So, PMEM is kind of a "terminal" node. So, he introduced sysfs defined
> target node, I introduced N_CPU_MEM.

I understand that. And I am trying to figure out whether we really have
to treat PMEM specially here. Why is it any better than generic NUMA
rebalancing code that could be used for many other usecases which are
not PMEM specific? If you present PMEM as regular memory, then also use
it as normal memory.
--
Michal Hocko
SUSE Labs

2019-04-17 17:58:59

by Michal Hocko

[permalink] [raw]
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

On Wed 17-04-19 10:13:44, Dave Hansen wrote:
> On 4/17/19 2:23 AM, Michal Hocko wrote:
> >> 3. The demotion path can not have cycles
> > yes. This could be achieved by GFP_NOWAIT opportunistic allocation for
> > the migration target. That should prevent from loops or artificial nodes
> > exhausting quite naturaly AFAICS. Maybe we will need some tricks to
> > raise the watermark but I am not convinced something like that is really
> > necessary.
>
> I don't think GFP_NOWAIT alone is good enough.
>
> Let's say we have a system full of clean page cache and only two nodes:
> 0 and 1. GFP_NOWAIT will eventually kick off kswapd on both nodes.
> Each kswapd will be migrating pages to the *other* node since each is in
> the other's fallback path.

I was thinking along the lines of node-reclaim-based migration. You are right
that a parallel kswapd might reclaim enough to cause the ping-pong, and
we might need to play some watermark tricks, but as you say below this is
to be seen and a playground to explore. All I am saying is to try the
simplest approach first, without all the bells and whistles, to see
how this plays out with real workloads, and build on top of that.

We already do have a model - node_reclaim - which turned out to suck a lot
because the reclaim was just too aggressive wrt refaults. Maybe
migration will turn out to be much more feasible. And maybe I am completely
wrong and we need a much more complex solution.
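
For illustration, a minimal sketch of the GFP_NOWAIT opportunistic allocation
being discussed, assuming the usual migration "new page" callback shape; the
helper name and exact flag set are illustrative, not the actual patch:

#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * Allocate the demotion target page without waking reclaim on the target
 * node: no direct reclaim, no fallback to other nodes, no OOM warnings.
 * If this fails, the migration simply fails and the normal reclaim/swap
 * path takes over, which naturally limits ping-ponging.
 */
static struct page *alloc_demote_target(struct page *page, unsigned long node)
{
	gfp_t gfp = GFP_NOWAIT | __GFP_THISNODE | __GFP_NOWARN | __GFP_NOMEMALLOC;

	return alloc_pages_node((int)node, gfp, 0);
}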

> I think what you're saying is that, eventually, the kswapds will see
> allocation failures and stop migrating, providing hysteresis. This is
> probably true.
>
> But, I'm more concerned about that window where the kswapds are throwing
> pages at each other because they're effectively just wasting resources
> in this window. I guess we should figure out how large this window is
> and how fast (or if) the dampening occurs in practice.

--
Michal Hocko
SUSE Labs

2019-04-17 20:45:04

by Yang Shi

[permalink] [raw]
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node


>>
>>>> I would also not touch the numa balancing logic at this stage and
>>>> rather see how the current implementation behaves.
>>> I agree we would prefer to start from something simpler and see how it
>>> works.
>>>
>>> The "twice access" optimization is aimed to reduce the PMEM
>>> bandwidth burden
>>> since the bandwidth of PMEM is scarce resource. I did compare "twice
>>> access"
>>> to "no twice access", it does save a lot bandwidth for some once-off
>>> access
>>> pattern. For example, when running stress test with mmtest's
>>> usemem-stress-numa-compact. The kernel would promote ~600,000 pages
>>> with
>>> "twice access" in 4 hours, but it would promote ~80,000,000 pages
>>> without
>>> "twice access".
>> I presume this is a result of a synthetic workload, right? Or do you
>> have any numbers for a real-life usecase?
>
> The test just uses usemem.

I tried to run some more real-life-like usecases. The below shows the
result of running mmtest's db-sysbench-mariadb-oltp-rw-medium test,
which is a typical database workload, with and without the "twice access"
optimization.

                        w/         w/o
promotion            32771      312250

We can see the kernel did ~10x more promotions w/o the "twice access" optimization.

I also tried the kernel-devel and redis tests in mmtest, but they can't
generate enough memory pressure, so I had to run the usemem test to
generate memory pressure. However, this brought in huge noise,
particularly for the w/o "twice access" case. But the mysql test should
be able to demonstrate the improvement achieved by this optimization.

And, I'm wondering whether this optimization is also suitable to general
NUMA balancing or not.
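
For reference, a minimal sketch of the "twice access" policy itself, assuming
the PG_referenced software bit is used as the access marker; the series
actually extends page_check_references(), which also consults the PTE young
bits via rmap, so this stub only models the decision:

#include <linux/mm.h>
#include <linux/page-flags.h>

/*
 * Promote an anonymous page only on its second NUMA hint fault since the
 * last clear: the first fault just marks the page, the second one makes
 * it worth paying the migration (and PMEM bandwidth) cost.
 */
static bool promote_on_second_access(struct page *page)
{
	if (!PageReferenced(page)) {
		SetPageReferenced(page);	/* first access: remember it */
		return false;
	}

	ClearPageReferenced(page);		/* second access: promote */
	return true;
}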

2019-04-18 09:03:58

by Michal Hocko

[permalink] [raw]
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

On Wed 17-04-19 13:43:44, Yang Shi wrote:
[...]
> And, I'm wondering whether this optimization is also suitable to general
> NUMA balancing or not.

If there are convincing numbers then this should be a preferable way to
deal with it. Please note that the number of promotions is not the only
metric to watch. The overall performance/access latency would be another one.

--
Michal Hocko
SUSE Labs

2019-04-18 16:26:20

by Yang Shi

[permalink] [raw]
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node



On 4/17/19 10:51 AM, Michal Hocko wrote:
> On Wed 17-04-19 10:26:05, Yang Shi wrote:
>> On 4/17/19 9:39 AM, Michal Hocko wrote:
>>> On Wed 17-04-19 09:37:39, Keith Busch wrote:
>>>> On Wed, Apr 17, 2019 at 05:39:23PM +0200, Michal Hocko wrote:
>>>>> On Wed 17-04-19 09:23:46, Keith Busch wrote:
>>>>>> On Wed, Apr 17, 2019 at 11:23:18AM +0200, Michal Hocko wrote:
>>>>>>> On Tue 16-04-19 14:22:33, Dave Hansen wrote:
>>>>>>>> Keith Busch had a set of patches to let you specify the demotion order
>>>>>>>> via sysfs for fun. The rules we came up with were:
>>>>>>> I am not a fan of any sysfs "fun"
>>>>>> I'm hung up on the user facing interface, but there should be some way a
>>>>>> user decides if a memory node is or is not a migrate target, right?
>>>>> Why? Or to put it differently, why do we have to start with a user
>>>>> interface at this stage when we actually barely have any real usecases
>>>>> out there?
>>>> The use case is an alternative to swap, right? The user has to decide
>>>> which storage is the swap target, so operating in the same spirit.
>>> I do not follow. If you use rebalancing you can still deplete the memory
>>> and end up in a swap storage. If you want to reclaim/swap rather than
>>> rebalance then you do not enable rebalancing (by node_reclaim or similar
>>> mechanism).
>> I'm a little bit confused. Do you mean just do *not* do reclaim/swap in
>> rebalancing mode? If rebalancing is on, then node_reclaim just moves the
>> pages around nodes, then kswapd or direct reclaim would take care of swap?
> Yes, that was the idea I wanted to get through. Sorry if that was not
> really clear.
>
>> If so the node reclaim on PMEM node may rebalance the pages to DRAM node?
>> Should this be allowed?
> Why shouldn't it be? If there are other vacant nodes to absorb that memory,
> then why not use them?
>
>> I think both Keith and I intended to treat PMEM as a tier in the reclaim
>> hierarchy. The reclaim should push inactive pages down to PMEM, then to swap.
>> So, PMEM is kind of a "terminal" node; that is why he introduced the sysfs
>> defined target node and I introduced N_CPU_MEM.
> I understand that. And I am trying to figure out whether we really have
> to treat PMEM specially here. Why is it any better than generic NUMA
> rebalancing code that could be used for many other usecases which are
> not PMEM specific? If you present PMEM as regular memory then also use
> it as normal memory.

This also makes some sense. We just look at PMEM from different points of
view. In this patchset, taking the performance disparity into account may
outweigh treating it as normal memory.

A possibly ridiculous idea: may we have two modes, one for "rebalancing" and
the other for "demotion"?



2019-04-18 18:24:58

by Keith Busch

[permalink] [raw]
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

On Wed, Apr 17, 2019 at 10:13:44AM -0700, Dave Hansen wrote:
> On 4/17/19 2:23 AM, Michal Hocko wrote:
> > yes. This could be achieved by GFP_NOWAIT opportunistic allocation for
> > the migration target. That should prevent loops or artificial node
> > exhaustion quite naturally AFAICS. Maybe we will need some tricks to
> > raise the watermark but I am not convinced something like that is really
> > necessary.
>
> I don't think GFP_NOWAIT alone is good enough.
>
> Let's say we have a system full of clean page cache and only two nodes:
> 0 and 1. GFP_NOWAIT will eventually kick off kswapd on both nodes.
> Each kswapd will be migrating pages to the *other* node since each is in
> the other's fallback path.
>
> I think what you're saying is that, eventually, the kswapds will see
> allocation failures and stop migrating, providing hysteresis. This is
> probably true.
>
> But, I'm more concerned about that window where the kswapds are throwing
> pages at each other because they're effectively just wasting resources
> in this window. I guess we should figure out how large this window is
> and how fast (or if) the dampening occurs in practice.

I'm still refining tests to help answer this and have some preliminary
data. My test rig has CPU + memory Node 0, memory-only Node 1, and a
fast swap device. The test has an application strict-mbind more memory
than node 0's total to node 0, and forever write random cachelines from
per-cpu threads.

I'm testing two memory pressure policies:

Node 0 can migrate to Node 1, no cycles
Node 0 and Node 1 migrate with each other (0 -> 1 -> 0 cycles)

After the initial ramp up time, the second policy is ~7-10% slower than
no cycles. There doesn't appear to be a temporary window dealing with
bouncing pages: it's just a slower overall steady state. Looks like when
migration fails and falls back to swap, the newly freed pages occasionally
get sniped by the other node, keeping the pressure up.
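
For context, a rough userspace approximation of the described workload; the
buffer size and details are assumptions (the actual test program is not shown
here), but the shape is a strict mbind to node 0 of more memory than it has,
plus per-CPU threads dirtying random cachelines forever:

#include <numaif.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define BUF_SIZE	(64UL << 30)	/* assumed: larger than node 0's DRAM */
#define CACHELINE	64UL

static char *buf;

static void *writer(void *arg)
{
	unsigned int seed = (unsigned int)(unsigned long)arg;

	for (;;) {
		/* Dirty one random cacheline in the bound buffer. */
		size_t line = (size_t)rand_r(&seed) % (BUF_SIZE / CACHELINE);

		memset(buf + line * CACHELINE, (int)seed, CACHELINE);
	}
	return NULL;
}

int main(void)
{
	unsigned long nodemask = 1UL;	/* node 0 only */
	long i, ncpus = sysconf(_SC_NPROCESSORS_ONLN);

	buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	/* Strict bind: pages must come from node 0 ("strict mbind"). */
	if (mbind(buf, BUF_SIZE, MPOL_BIND, &nodemask,
		  sizeof(nodemask) * 8, MPOL_MF_STRICT))
		perror("mbind");

	for (i = 0; i < ncpus; i++) {
		pthread_t t;

		pthread_create(&t, NULL, writer, (void *)(i + 1));
	}
	pause();
	return 0;
}

(Build with -pthread -lnuma; how closely this reproduces the numbers above
obviously depends on the node sizes and the swap device.)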

2019-04-18 19:26:39

by Yang Shi

[permalink] [raw]
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node



On 4/18/19 11:16 AM, Keith Busch wrote:
> On Wed, Apr 17, 2019 at 10:13:44AM -0700, Dave Hansen wrote:
>> On 4/17/19 2:23 AM, Michal Hocko wrote:
>>> yes. This could be achieved by GFP_NOWAIT opportunistic allocation for
>>> the migration target. That should prevent loops or artificial node
>>> exhaustion quite naturally AFAICS. Maybe we will need some tricks to
>>> raise the watermark but I am not convinced something like that is really
>>> necessary.
>> I don't think GFP_NOWAIT alone is good enough.
>>
>> Let's say we have a system full of clean page cache and only two nodes:
>> 0 and 1. GFP_NOWAIT will eventually kick off kswapd on both nodes.
>> Each kswapd will be migrating pages to the *other* node since each is in
>> the other's fallback path.
>>
>> I think what you're saying is that, eventually, the kswapds will see
>> allocation failures and stop migrating, providing hysteresis. This is
>> probably true.
>>
>> But, I'm more concerned about that window where the kswapds are throwing
>> pages at each other because they're effectively just wasting resources
>> in this window. I guess we should figure out how large this window is
>> and how fast (or if) the dampening occurs in practice.
> I'm still refining tests to help answer this and have some preliminary
> data. My test rig has CPU + memory Node 0, memory-only Node 1, and a
> fast swap device. The test has an application strict-mbind more memory
> than node 0's total to node 0, and forever write random cachelines from
> per-cpu threads.

Thanks for the test. A follow-up question: how about the size of each
node? Is node 1 bigger than node 0? Since PMEM typically has larger
capacity, I'm wondering whether the capacity may make things
different or not.

> I'm testing two memory pressure policies:
>
> Node 0 can migrate to Node 1, no cycles
> Node 0 and Node 1 migrate with each other (0 -> 1 -> 0 cycles)
>
> After the initial ramp up time, the second policy is ~7-10% slower than
> no cycles. There doesn't appear to be a temporary window dealing with
> bouncing pages: it's just a slower overall steady state. Looks like when
> migration fails and falls back to swap, the newly freed pages occasionally
> get sniped by the other node, keeping the pressure up.

2019-04-18 21:08:35

by Zi Yan

[permalink] [raw]
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

On 18 Apr 2019, at 15:23, Yang Shi wrote:

> On 4/18/19 11:16 AM, Keith Busch wrote:
>> On Wed, Apr 17, 2019 at 10:13:44AM -0700, Dave Hansen wrote:
>>> On 4/17/19 2:23 AM, Michal Hocko wrote:
>>>> yes. This could be achieved by GFP_NOWAIT opportunistic allocation for
>>>> the migration target. That should prevent loops or artificial node
>>>> exhaustion quite naturally AFAICS. Maybe we will need some tricks to
>>>> raise the watermark but I am not convinced something like that is really
>>>> necessary.
>>> I don't think GFP_NOWAIT alone is good enough.
>>>
>>> Let's say we have a system full of clean page cache and only two nodes:
>>> 0 and 1. GFP_NOWAIT will eventually kick off kswapd on both nodes.
>>> Each kswapd will be migrating pages to the *other* node since each is in
>>> the other's fallback path.
>>>
>>> I think what you're saying is that, eventually, the kswapds will see
>>> allocation failures and stop migrating, providing hysteresis. This is
>>> probably true.
>>>
>>> But, I'm more concerned about that window where the kswapds are throwing
>>> pages at each other because they're effectively just wasting resources
>>> in this window. I guess we should figure out how large this window is
>>> and how fast (or if) the dampening occurs in practice.
>> I'm still refining tests to help answer this and have some preliminary
>> data. My test rig has CPU + memory Node 0, memory-only Node 1, and a
>> fast swap device. The test has an application strict-mbind more memory
>> than node 0's total to node 0, and forever write random cachelines from
>> per-cpu threads.
>
> Thanks for the test. A follow-up question: how about the size of each node? Is node 1 bigger than node 0? Since PMEM typically has larger capacity, I'm wondering whether the capacity may make things different or not.
>
>> I'm testing two memory pressure policies:
>>
>> Node 0 can migrate to Node 1, no cycles
>> Node 0 and Node 1 migrate with each other (0 -> 1 -> 0 cycles)
>>
>> After the initial ramp up time, the second policy is ~7-10% slower than
>> no cycles. There doesn't appear to be a temporary window dealing with
>> bouncing pages: it's just a slower overall steady state. Looks like when
>> migration fails and falls back to swap, the newly freed pages occasionally
>> get sniped by the other node, keeping the pressure up.


In addition to these two policies, I am curious about how MPOL_PREFERRED to
Node 0 performs. I just wonder how badly static page allocation does.
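
For completeness, a minimal sketch of that variant, assuming the policy is set
process-wide with set_mempolicy() before the test buffer is touched; purely
illustrative:

#include <numaif.h>
#include <stdio.h>

/*
 * Prefer node 0 but silently fall back to other nodes when it is full,
 * i.e. static placement with no later promotion/demotion.
 */
int prefer_node0(void)
{
	unsigned long nodemask = 1UL;	/* node 0 */

	if (set_mempolicy(MPOL_PREFERRED, &nodemask, sizeof(nodemask) * 8)) {
		perror("set_mempolicy");
		return -1;
	}
	return 0;
}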

--
Best Regards,
Yan Zi



2019-05-01 05:21:57

by Fengguang Wu

[permalink] [raw]
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

On Thu, Apr 18, 2019 at 11:02:27AM +0200, Michal Hocko wrote:
>On Wed 17-04-19 13:43:44, Yang Shi wrote:
>[...]
>> And, I'm wondering whether this optimization is also suitable to general
>> NUMA balancing or not.
>
>If there are convincing numbers then this should be a preferable way to
>deal with it. Please note that the number of promotions is not the only
>metric to watch. The overall performance/access latency would be another one.

Good question. Shi and I aligned today. I also talked with Mel (but
sorry, I must have missed some points due to my poor English listening). It
became clear that:

1) PMEM/DRAM page promotion/demotion is a hard problem to attack.
There will and should be multiple approaches for open discussion
before settling down. The criteria might be a balance of complexity,
overheads, performance, etc.

2) We need a lot more data to lay a solid foundation for effective
discussions. Testing will be a rather time-consuming part for
contributors. We'll need to work together to create a number of
benchmarks that can exercise the kernel promotion/demotion paths well
and gather the necessary numbers. By collaborating on a common set of
tests, we can not only amortize efforts, but also conveniently compare
different approaches or compare v1/v2/... of the same approach.

Ying has already created several LKP test cases for that purpose.
Shi and I plan to join the effort, too.

Thanks,
Fengguang

2019-05-01 06:47:34

by Fengguang Wu

[permalink] [raw]
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

On Wed, Apr 17, 2019 at 11:17:48AM +0200, Michal Hocko wrote:
>On Tue 16-04-19 12:19:21, Yang Shi wrote:
>>
>>
>> On 4/16/19 12:47 AM, Michal Hocko wrote:
>[...]
>> > Why can't we simply demote in proximity order? Why do you make
>> > cpuless nodes so special? If other close nodes are vacant then just use
>> > them.
>>
>> We could. But this raises another question: would we prefer to just demote
>> to the next fallback node (try just once) and, if it is contended, then just
>> swap (i.e. DRAM0 -> PMEM0 -> Swap); or would we prefer to try all the nodes
>> in fallback order to find the first less-contended one (i.e. DRAM0 ->
>> PMEM0 -> DRAM1 -> PMEM1 -> Swap)?
>
>I would go with the latter. Why? Because it is more natural. Because that
>is the natural allocation path, so I do not see why this shouldn't be the
>natural demotion path.

"Demotion" should be more performance wise by "demoting to the
next-level (cheaper/slower) memory". Otherwise something like this
may happen.

DRAM0 pressured => demote cold pages to DRAM1
DRAM1 pressured => demote cold pages to DRAM0

In effect, DRAM0/DRAM1 would exchange a fraction of the demoted cold pages,
which does not look helpful for overall system performance.

Over time, it's even possible that some cold pages get "demoted" along the
path DRAM0=>DRAM1=>DRAM0=>DRAM1=>...
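
A minimal sketch of that "demote downward only" rule, assuming the N_CPU_MEM
nodemask proposed in this series; the helper itself is illustrative:

#include <linux/kernel.h>
#include <linux/nodemask.h>
#include <linux/numa.h>
#include <linux/topology.h>

/*
 * Pick the nearest node that has memory but no CPU as the demotion
 * target, so cold pages never bounce between two DRAM nodes.
 */
static int next_demotion_node(int nid)
{
	int node, best = NUMA_NO_NODE, best_dist = INT_MAX;

	for_each_node_state(node, N_MEMORY) {
		if (node == nid || node_state(node, N_CPU_MEM))
			continue;	/* skip self and "primary" (DRAM) nodes */
		if (node_distance(nid, node) < best_dist) {
			best_dist = node_distance(nid, node);
			best = node;
		}
	}

	/* NUMA_NO_NODE means "no lower tier": reclaim/swap instead. */
	return best;
}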

Thanks,
Fengguang