2022-08-18 13:29:43

by Aneesh Kumar K.V

Subject: [PATCH v15 00/10] mm/demotion: Memory tiers and demotion

The current kernel has basic memory tiering support: Inactive pages on a
higher tier NUMA node can be migrated (demoted) to a lower tier NUMA node to
make room for new allocations on the higher tier NUMA node. Frequently accessed
pages on a lower tier NUMA node can be migrated (promoted) to a higher tier NUMA
node to improve performance.

In the current kernel, memory tiers are defined implicitly via a demotion path
relationship between NUMA nodes, which is created during the kernel
initialization and updated when a NUMA node is hot-added or hot-removed. The
current implementation puts all nodes with CPUs into the highest tier, and builds the
tier hierarchy tier-by-tier by establishing the per-node demotion targets based
on the distances between nodes.

The current memory tier kernel implementation needs to be improved for several
important use cases:

* The current tier initialization code always initializes each memory-only NUMA
node into a lower tier. But a memory-only NUMA node may have a high
performance memory device (e.g. a DRAM-backed memory-only node on a virtual
machine) and that should be put into a higher tier.

* The current tier hierarchy always puts CPU nodes into the top tier. But on a
system with HBM (e.g. GPU memory) devices, these memory-only HBM NUMA nodes
should be in the top tier, and DRAM nodes with CPUs are better placed
into the next lower tier.

* Also because the current tier hierarchy always puts CPU nodes into the top
tier, when a CPU is hot-added (or hot-removed) and turns a memory node from
CPU-less into a CPU node (or vice versa), the memory tier hierarchy gets
changed, even though no memory node is added or removed. This can make the
tier hierarchy unstable and make it difficult to support tier-based memory
accounting.

* A higher tier node can only be demoted to nodes with the shortest distance on the
next lower tier as defined by the demotion path, not any other node from any
lower tier. This strict demotion order does not work in all use
cases (e.g. some use cases may want to allow cross-socket demotion to another
node in the same demotion tier as a fallback when the preferred demotion node
is out of space), and has resulted in the feature request for an interface to
override the system-wide, per-node demotion order from the userspace. This
demotion order is also inconsistent with the page allocation fallback order
when all the nodes in a higher tier are out of space: The page allocation can
fall back to any node from any lower tier, whereas the demotion order doesn't
allow that.

This patch series makes the creation of memory tiers explicit, under
the control of device drivers.

Memory Tier Initialization
==========================

The Linux kernel presents memory devices as NUMA nodes, and each memory device is of
a specific type. The memory type of a device is represented by its abstract
distance. A memory tier corresponds to a range of abstract distance. This allows
for classifying memory devices with a specific performance range into a memory
tier.

By default, all memory nodes are assigned to the default tier with
abstract distance 512.

A device driver can move its memory nodes from the default tier. For example,
PMEM can move its memory nodes below the default tier, whereas GPU can move its
memory nodes above the default tier.

The kernel initialization code makes the decision on which exact tier a memory
node should be assigned to based on the requests from the device drivers as well
as the memory device hardware information provided by the firmware.

Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
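
To make the mapping concrete, below is a minimal user-space sketch (not
part of this series) of how an abstract distance maps to a tier range.
The chunk size and DRAM adistance are the values used in this series;
the HBM value is a made-up example:

        #include <stdio.h>

        #define MEMTIER_CHUNK_SIZE      128
        /* default DRAM type in this series: 4.5 chunks == 576 */
        #define MEMTIER_ADISTANCE_DRAM  ((4 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE >> 1))

        /* a tier covers the adistance range [start, start + MEMTIER_CHUNK_SIZE) */
        static int tier_start(int adistance)
        {
                return adistance & ~(MEMTIER_CHUNK_SIZE - 1);
        }

        int main(void)
        {
                int hbm  = MEMTIER_ADISTANCE_DRAM / 2; /* hypothetical faster device */
                int pmem = MEMTIER_ADISTANCE_DRAM * 5; /* MEMTIER_DEFAULT_DAX_ADISTANCE */

                printf("DRAM tier start: %d\n", tier_start(MEMTIER_ADISTANCE_DRAM));
                printf("HBM  tier start: %d (smaller adistance, higher tier)\n", tier_start(hbm));
                printf("PMEM tier start: %d (larger adistance, lower tier)\n", tier_start(pmem));
                return 0;
        }

tier_start(576) prints 512, which is the default tier abstract distance
mentioned above.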

Changes from v14
* Add Reviewed-by:
* Address review feedback w.r.t default adistance value

Changes from v13
* Address review feedback.
* Add patch dropping memtier from struct memory_dev_type

Changes from v12
* Fix kernel crash on module unload
* Address review feedback.
* Add node_random patch to this series based on review feedback

Changes from v11:
* A smaller abstract distance implies a faster (higher) memory tier.

Changes from v10:
* rename performance level to abstract distance
* Thanks to all the good feedback from Huang, Ying <[email protected]>.
Updated the patchset to cover most of the review feedback.

Changes from v9:
* Use performance level for initializing memory tiers.

Changes from v8:
* Drop the sysfs interface patches and related documentation changes.

Changes from v7:
* Fix kernel crash with demotion.
* Improve documentation.

Changes from v6:
* Drop the usage of rank.
* Address other review feedback.

Changes from v5:
* Remove patch supporting N_MEMORY node removal from memory tiers. Memory tiers
are going to be used for features other than demotion. Hence keep all N_MEMORY
nodes in memory tiers irrespective of whether they want to participate in promotion or demotion.
* Add NODE_DATA->memtier
* Rearrange patches to add sysfs files later.
* Add support to create memory tiers from userspace.
* Address other review feedback.


Changes from v4:
* Address review feedback.
* Reverse the meaning of "rank": higher rank value means higher tier.
* Add "/sys/devices/system/memtier/default_tier".
* Add node_is_toptier

v4:
Add support for explicit memory tiers and ranks.

v3:
- Modify patch 1 subject to make it more specific
- Remove /sys/kernel/mm/numa/demotion_targets interface, use
/sys/devices/system/node/demotion_targets instead and make
it writable to override node_states[N_DEMOTION_TARGETS].
- Add support to view per node demotion targets via sysfs

v2:
In v1, only the 1st patch of this patch series was sent, which was
implemented to avoid some of the limitations on the demotion
target sharing; however, for certain NUMA topologies, the demotion
targets found by that patch were not optimal, so the 1st patch
in this series is modified according to suggestions from Huang
and Baolin. Different examples of demotion list comparison
between the existing implementation and the changed implementation can
be found in the commit message of the 1st patch.


Aneesh Kumar K.V (9):
mm/demotion: Add support for explicit memory tiers
mm/demotion: Move memory demotion related code
mm/demotion: Add hotplug callbacks to handle new numa node onlined
mm/demotion/dax/kmem: Set node's abstract distance to
MEMTIER_DEFAULT_DAX_ADISTANCE
mm/demotion: Build demotion targets based on explicit memory tiers
mm/demotion: Add pg_data_t member to track node memory tier details
mm/demotion: Drop memtier from memtype
mm/demotion: Update node_is_toptier to work with memory tiers
lib/nodemask: Optimize node_random for nodemask with single NUMA node

Jagdish Gediya (1):
mm/demotion: Demote pages according to allocation fallback order

drivers/dax/kmem.c | 42 ++-
include/linux/memory-tiers.h | 102 ++++++
include/linux/migrate.h | 15 -
include/linux/mmzone.h | 3 +
include/linux/node.h | 5 -
include/linux/nodemask.h | 15 +-
mm/Makefile | 1 +
mm/huge_memory.c | 1 +
mm/memory-tiers.c | 645 +++++++++++++++++++++++++++++++++++
mm/migrate.c | 453 +-----------------------
mm/mprotect.c | 1 +
mm/vmscan.c | 59 +++-
mm/vmstat.c | 4 -
13 files changed, 849 insertions(+), 497 deletions(-)
create mode 100644 include/linux/memory-tiers.h
create mode 100644 mm/memory-tiers.c

--
2.37.2


2022-08-18 13:31:40

by Aneesh Kumar K.V

Subject: [PATCH v15 10/10] lib/nodemask: Optimize node_random for nodemask with single NUMA node

The most common case for certain node_random usage (demotion nodemask) is with
nodemask weight 1. We can avoid calling get_random_int() in that case and
always return the only node set in the nodemask.

A simple test as below:

before = rdtsc_ordered();
for (i = 0; i < 100; i++) {
        rand = node_random(&nmask);
}
after = rdtsc_ordered();

Without fix after - before : 16438
With fix after - before : 816

Reviewed-by: "Huang, Ying" <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
include/linux/nodemask.h | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 4b71a96190a8..ac5b6a371be5 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -504,12 +504,21 @@ static inline int num_node_state(enum node_states state)
static inline int node_random(const nodemask_t *maskp)
{
#if defined(CONFIG_NUMA) && (MAX_NUMNODES > 1)
- int w, bit = NUMA_NO_NODE;
+ int w, bit;

w = nodes_weight(*maskp);
- if (w)
+ switch (w) {
+ case 0:
+ bit = NUMA_NO_NODE;
+ break;
+ case 1:
+ bit = first_node(*maskp);
+ break;
+ default:
bit = bitmap_ord_to_pos(maskp->bits,
- get_random_int() % w, MAX_NUMNODES);
+ get_random_int() % w, MAX_NUMNODES);
+ break;
+ }
return bit;
#else
return 0;
--
2.37.2

2022-08-18 13:38:57

by Aneesh Kumar K.V

Subject: [PATCH v15 04/10] mm/demotion/dax/kmem: Set node's abstract distance to MEMTIER_DEFAULT_DAX_ADISTANCE

By default, all nodes are assigned to the default memory tier, which
is the memory tier designated for nodes with DRAM.

Set the dax kmem device node's tier to a slower memory tier by assigning
its abstract distance to MEMTIER_DEFAULT_DAX_ADISTANCE. Low-level drivers
like papr_scm or ACPI NFIT can initialize the memory device type to a
more accurate value based on device tree details or HMAT. If the
kernel doesn't find the memory type initialized, a default slower
memory type is assigned by the kmem driver.
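
For reference, a rough sketch (not part of this patch) of how a low-level
platform driver could use the interfaces added here to register a more
accurate memory type before the kmem driver onlines the node; the
adistance value is a made-up example:

        static struct memory_dev_type *pmem_type;

        static int example_platform_init(int nid)
        {
                /* hypothetical adistance derived from HMAT/device tree data */
                pmem_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM * 2);
                if (IS_ERR(pmem_type))
                        return PTR_ERR(pmem_type);

                /* kmem's later init_node_memory_type() call is then a no-op */
                init_node_memory_type(nid, pmem_type);
                return 0;
        }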

Reviewed-by: "Huang, Ying" <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
drivers/dax/kmem.c | 42 +++++++++++++++--
include/linux/memory-tiers.h | 42 ++++++++++++++++-
mm/memory-tiers.c | 91 +++++++++++++++++++++++++++---------
3 files changed, 149 insertions(+), 26 deletions(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index a37622060fff..4852a2dbdb27 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -11,9 +11,17 @@
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/mman.h>
+#include <linux/memory-tiers.h>
#include "dax-private.h"
#include "bus.h"

+/*
+ * Default abstract distance assigned to the NUMA node onlined
+ * by DAX/kmem if the low level platform driver didn't initialize
+ * one for this NUMA node.
+ */
+#define MEMTIER_DEFAULT_DAX_ADISTANCE (MEMTIER_ADISTANCE_DRAM * 5)
+
/* Memory resource name used for add_memory_driver_managed(). */
static const char *kmem_name;
/* Set if any memory will remain added when the driver will be unloaded. */
@@ -41,6 +49,7 @@ struct dax_kmem_data {
struct resource *res[];
};

+static struct memory_dev_type *dax_slowmem_type;
static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
{
struct device *dev = &dev_dax->dev;
@@ -79,11 +88,13 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
return -EINVAL;
}

+ init_node_memory_type(numa_node, dax_slowmem_type);
+
+ rc = -ENOMEM;
data = kzalloc(struct_size(data, res, dev_dax->nr_range), GFP_KERNEL);
if (!data)
- return -ENOMEM;
+ goto err_dax_kmem_data;

- rc = -ENOMEM;
data->res_name = kstrdup(dev_name(dev), GFP_KERNEL);
if (!data->res_name)
goto err_res_name;
@@ -155,6 +166,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
kfree(data->res_name);
err_res_name:
kfree(data);
+err_dax_kmem_data:
+ clear_node_memory_type(numa_node, dax_slowmem_type);
return rc;
}

@@ -162,6 +175,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
{
int i, success = 0;
+ int node = dev_dax->target_node;
struct device *dev = &dev_dax->dev;
struct dax_kmem_data *data = dev_get_drvdata(dev);

@@ -198,6 +212,14 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
kfree(data->res_name);
kfree(data);
dev_set_drvdata(dev, NULL);
+ /*
+ * Clear the memtype association on successful unplug.
+ * If not, we have memory blocks left which can be
+ * offlined/onlined later. We need to keep memory_dev_type
+ * for that. This implies this reference will be around
+ * till next reboot.
+ */
+ clear_node_memory_type(node, dax_slowmem_type);
}
}
#else
@@ -228,9 +250,22 @@ static int __init dax_kmem_init(void)
if (!kmem_name)
return -ENOMEM;

+ dax_slowmem_type = alloc_memory_type(MEMTIER_DEFAULT_DAX_ADISTANCE);
+ if (IS_ERR(dax_slowmem_type)) {
+ rc = PTR_ERR(dax_slowmem_type);
+ goto err_dax_slowmem_type;
+ }
+
rc = dax_driver_register(&device_dax_kmem_driver);
if (rc)
- kfree_const(kmem_name);
+ goto error_dax_driver;
+
+ return rc;
+
+error_dax_driver:
+ destroy_memory_type(dax_slowmem_type);
+err_dax_slowmem_type:
+ kfree_const(kmem_name);
return rc;
}

@@ -239,6 +274,7 @@ static void __exit dax_kmem_exit(void)
dax_driver_unregister(&device_dax_kmem_driver);
if (!any_hotremove_failed)
kfree_const(kmem_name);
+ destroy_memory_type(dax_slowmem_type);
}

MODULE_AUTHOR("Intel Corporation");
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 17b41e592be6..30aecff9ae79 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -2,6 +2,9 @@
#ifndef _LINUX_MEMORY_TIERS_H
#define _LINUX_MEMORY_TIERS_H

+#include <linux/types.h>
+#include <linux/nodemask.h>
+#include <linux/kref.h>
/*
* Each tier covers an abstract distance chunk of size 128
*/
@@ -16,12 +19,49 @@
#define MEMTIER_ADISTANCE_DRAM ((4 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE >> 1))
#define MEMTIER_HOTPLUG_PRIO 100

+struct memory_tier;
+struct memory_dev_type {
+ /* list of memory types that are part of same tier as this type */
+ struct list_head tier_sibiling;
+ /* abstract distance for this specific memory type */
+ int adistance;
+ /* Nodes of same abstract distance */
+ nodemask_t nodes;
+ struct kref kref;
+ struct memory_tier *memtier;
+};
+
#ifdef CONFIG_NUMA
-#include <linux/types.h>
extern bool numa_demotion_enabled;
+struct memory_dev_type *alloc_memory_type(int adistance);
+void destroy_memory_type(struct memory_dev_type *memtype);
+void init_node_memory_type(int node, struct memory_dev_type *default_type);
+void clear_node_memory_type(int node, struct memory_dev_type *memtype);

#else

#define numa_demotion_enabled false
+/*
+ * The CONFIG_NUMA implementation returns a non-NULL error pointer on failure.
+ */
+static inline struct memory_dev_type *alloc_memory_type(int adistance)
+{
+ return NULL;
+}
+
+static inline void destroy_memory_type(struct memory_dev_type *memtype)
+{
+
+}
+
+static inline void init_node_memory_type(int node, struct memory_dev_type *default_type)
+{
+
+}
+
+static inline void clear_node_memory_type(int node, struct memory_dev_type *memtype)
+{
+
+}
#endif /* CONFIG_NUMA */
#endif /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 05f05395468a..3ddf305df7d1 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -1,6 +1,4 @@
// SPDX-License-Identifier: GPL-2.0
-#include <linux/types.h>
-#include <linux/nodemask.h>
#include <linux/slab.h>
#include <linux/lockdep.h>
#include <linux/sysfs.h>
@@ -21,27 +19,10 @@ struct memory_tier {
int adistance_start;
};

-struct memory_dev_type {
- /* list of memory types that are part of same tier as this type */
- struct list_head tier_sibiling;
- /* abstract distance for this specific memory type */
- int adistance;
- /* Nodes of same abstract distance */
- nodemask_t nodes;
- struct memory_tier *memtier;
-};
-
static DEFINE_MUTEX(memory_tier_lock);
static LIST_HEAD(memory_tiers);
static struct memory_dev_type *node_memory_types[MAX_NUMNODES];
-/*
- * For now we can have 4 faster memory tiers with smaller adistance
- * than default DRAM tier.
- */
-static struct memory_dev_type default_dram_type = {
- .adistance = MEMTIER_ADISTANCE_DRAM,
- .tier_sibiling = LIST_HEAD_INIT(default_dram_type.tier_sibiling),
-};
+static struct memory_dev_type *default_dram_type;

static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memtype)
{
@@ -87,6 +68,14 @@ static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memty
return new_memtier;
}

+static inline void __init_node_memory_type(int node, struct memory_dev_type *memtype)
+{
+ if (!node_memory_types[node]) {
+ node_memory_types[node] = memtype;
+ kref_get(&memtype->kref);
+ }
+}
+
static struct memory_tier *set_node_memory_tier(int node)
{
struct memory_tier *memtier;
@@ -97,8 +86,7 @@ static struct memory_tier *set_node_memory_tier(int node)
if (!node_state(node, N_MEMORY))
return ERR_PTR(-EINVAL);

- if (!node_memory_types[node])
- node_memory_types[node] = &default_dram_type;
+ __init_node_memory_type(node, default_dram_type);

memtype = node_memory_types[node];
node_set(node, memtype->nodes);
@@ -144,6 +132,57 @@ static bool clear_node_memory_tier(int node)
return cleared;
}

+static void release_memtype(struct kref *kref)
+{
+ struct memory_dev_type *memtype;
+
+ memtype = container_of(kref, struct memory_dev_type, kref);
+ kfree(memtype);
+}
+
+struct memory_dev_type *alloc_memory_type(int adistance)
+{
+ struct memory_dev_type *memtype;
+
+ memtype = kmalloc(sizeof(*memtype), GFP_KERNEL);
+ if (!memtype)
+ return ERR_PTR(-ENOMEM);
+
+ memtype->adistance = adistance;
+ INIT_LIST_HEAD(&memtype->tier_sibiling);
+ memtype->nodes = NODE_MASK_NONE;
+ memtype->memtier = NULL;
+ kref_init(&memtype->kref);
+ return memtype;
+}
+EXPORT_SYMBOL_GPL(alloc_memory_type);
+
+void destroy_memory_type(struct memory_dev_type *memtype)
+{
+ kref_put(&memtype->kref, release_memtype);
+}
+EXPORT_SYMBOL_GPL(destroy_memory_type);
+
+void init_node_memory_type(int node, struct memory_dev_type *memtype)
+{
+
+ mutex_lock(&memory_tier_lock);
+ __init_node_memory_type(node, memtype);
+ mutex_unlock(&memory_tier_lock);
+}
+EXPORT_SYMBOL_GPL(init_node_memory_type);
+
+void clear_node_memory_type(int node, struct memory_dev_type *memtype)
+{
+ mutex_lock(&memory_tier_lock);
+ if (node_memory_types[node] == memtype) {
+ node_memory_types[node] = NULL;
+ kref_put(&memtype->kref, release_memtype);
+ }
+ mutex_unlock(&memory_tier_lock);
+}
+EXPORT_SYMBOL_GPL(clear_node_memory_type);
+
static int __meminit memtier_hotplug_callback(struct notifier_block *self,
unsigned long action, void *_arg)
{
@@ -178,6 +217,14 @@ static int __init memory_tier_init(void)
struct memory_tier *memtier;

mutex_lock(&memory_tier_lock);
+ /*
+ * For now we can have 4 faster memory tiers with smaller adistance
+ * than default DRAM tier.
+ */
+ default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);
+ if (!default_dram_type)
+ panic("%s() failed to allocate default DRAM tier\n", __func__);
+
/*
* Look at all the existing N_MEMORY nodes and add them to
* default memory tier or to a tier if we already have memory
--
2.37.2

2022-08-18 13:40:03

by Aneesh Kumar K.V

Subject: [PATCH v15 07/10] mm/demotion: Drop memtier from memtype

Now that we track node-specific memtier in pg_data_t, we can drop
memtier from memtype.
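
With this change, find_create_memory_tier() resolves the tier from the
rounded abstract distance instead of a cached pointer, and a node's own
tier comes from its pg_data_t. A minimal sketch of the node-side lookup
(locking/RCU details omitted; the memtier member was added to pg_data_t
earlier in this series):

        static struct memory_tier *node_tier(int node)
        {
                /* sketch only: the real code guards this with memory_tier_lock/RCU */
                return NODE_DATA(node)->memtier;
        }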

Reviewed-by: "Huang, Ying" <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
include/linux/memory-tiers.h | 1 -
mm/memory-tiers.c | 16 +++++++++-------
2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 548e69f23727..108083d74557 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -28,7 +28,6 @@ struct memory_dev_type {
/* Nodes of same abstract distance */
nodemask_t nodes;
struct kref kref;
- struct memory_tier *memtier;
};

#ifdef CONFIG_NUMA
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 2b7e91b45a75..455c104fab5d 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -100,17 +100,22 @@ static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memty

lockdep_assert_held_once(&memory_tier_lock);

+ adistance = round_down(adistance, memtier_adistance_chunk_size);
/*
* If the memtype is already part of a memory tier,
* just return that.
*/
- if (memtype->memtier)
- return memtype->memtier;
+ if (!list_empty(&memtype->tier_sibiling)) {
+ list_for_each_entry(memtier, &memory_tiers, list) {
+ if (adistance == memtier->adistance_start)
+ return memtier;
+ }
+ WARN_ON(1);
+ return ERR_PTR(-EINVAL);
+ }

- adistance = round_down(adistance, memtier_adistance_chunk_size);
list_for_each_entry(memtier, &memory_tiers, list) {
if (adistance == memtier->adistance_start) {
- memtype->memtier = memtier;
list_add(&memtype->tier_sibiling, &memtier->memory_types);
return memtier;
} else if (adistance < memtier->adistance_start) {
@@ -130,7 +135,6 @@ static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memty
list_add_tail(&new_memtier->list, &memtier->list);
else
list_add_tail(&new_memtier->list, &memory_tiers);
- memtype->memtier = new_memtier;
list_add(&memtype->tier_sibiling, &new_memtier->memory_types);
return new_memtier;
}
@@ -357,7 +361,6 @@ static bool clear_node_memory_tier(int node)
node_clear(node, memtype->nodes);
if (nodes_empty(memtype->nodes)) {
list_del(&memtype->tier_sibiling);
- memtype->memtier = NULL;
if (list_empty(&memtier->memory_types))
destroy_memory_tier(memtier);
}
@@ -385,7 +388,6 @@ struct memory_dev_type *alloc_memory_type(int adistance)
memtype->adistance = adistance;
INIT_LIST_HEAD(&memtype->tier_sibiling);
memtype->nodes = NODE_MASK_NONE;
- memtype->memtier = NULL;
kref_init(&memtype->kref);
return memtype;
}
--
2.37.2

2022-08-18 13:40:41

by Aneesh Kumar K.V

Subject: [PATCH v15 05/10] mm/demotion: Build demotion targets based on explicit memory tiers

This patch switches the demotion target building logic to use memory tiers
instead of NUMA distance. All N_MEMORY NUMA nodes will be placed in the
default memory tier, and additional memory tiers will be added by drivers like
dax kmem.

This patch builds the demotion targets for a NUMA node by looking at all
memory tiers below the tier to which the NUMA node belongs. The closest node
in the immediately following memory tier is used as a demotion target.

Since we are now only building demotion targets for N_MEMORY NUMA nodes,
the CPU hotplug callbacks are removed in this patch.
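
As a rough sketch of the consumer side, reclaim code keeps using the same
pattern as before (example_node_can_demote() below is a hypothetical
helper for illustration):

        /* illustration only: does @pgdat have anywhere to demote to? */
        static bool example_node_can_demote(pg_data_t *pgdat)
        {
                return next_demotion_node(pgdat->node_id) != NUMA_NO_NODE;
        }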

Reviewed-by: "Huang, Ying" <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
include/linux/memory-tiers.h | 13 ++
include/linux/migrate.h | 13 --
mm/memory-tiers.c | 238 +++++++++++++++++++--
mm/migrate.c | 394 -----------------------------------
mm/vmstat.c | 4 -
5 files changed, 239 insertions(+), 423 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 30aecff9ae79..548e69f23727 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -37,6 +37,14 @@ struct memory_dev_type *alloc_memory_type(int adistance);
void destroy_memory_type(struct memory_dev_type *memtype);
void init_node_memory_type(int node, struct memory_dev_type *default_type);
void clear_node_memory_type(int node, struct memory_dev_type *memtype);
+#ifdef CONFIG_MIGRATION
+int next_demotion_node(int node);
+#else
+static inline int next_demotion_node(int node)
+{
+ return NUMA_NO_NODE;
+}
+#endif

#else

@@ -63,5 +71,10 @@ static inline void clear_node_memory_type(int node, struct memory_dev_type *memt
{

}
+
+static inline int next_demotion_node(int node)
+{
+ return NUMA_NO_NODE;
+}
#endif /* CONFIG_NUMA */
#endif /* _LINUX_MEMORY_TIERS_H */
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 96f8c84413fe..704a04f5a074 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -100,19 +100,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,

#endif /* CONFIG_MIGRATION */

-#if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA)
-extern void set_migration_target_nodes(void);
-extern void migrate_on_reclaim_init(void);
-extern int next_demotion_node(int node);
-#else
-static inline void set_migration_target_nodes(void) {}
-static inline void migrate_on_reclaim_init(void) {}
-static inline int next_demotion_node(int node)
-{
- return NUMA_NO_NODE;
-}
-#endif
-
#ifdef CONFIG_COMPACTION
bool PageMovable(struct page *page);
void __SetPageMovable(struct page *page, const struct movable_operations *ops);
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 3ddf305df7d1..c29bb24449b8 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -6,6 +6,8 @@
#include <linux/memory.h>
#include <linux/memory-tiers.h>

+#include "internal.h"
+
struct memory_tier {
/* hierarchy of memory tiers */
struct list_head list;
@@ -19,10 +21,74 @@ struct memory_tier {
int adistance_start;
};

+struct demotion_nodes {
+ nodemask_t preferred;
+};
+
static DEFINE_MUTEX(memory_tier_lock);
static LIST_HEAD(memory_tiers);
static struct memory_dev_type *node_memory_types[MAX_NUMNODES];
static struct memory_dev_type *default_dram_type;
+#ifdef CONFIG_MIGRATION
+/*
+ * node_demotion[] examples:
+ *
+ * Example 1:
+ *
+ * Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM nodes.
+ *
+ * node distances:
+ * node 0 1 2 3
+ * 0 10 20 30 40
+ * 1 20 10 40 30
+ * 2 30 40 10 40
+ * 3 40 30 40 10
+ *
+ * memory_tiers0 = 0-1
+ * memory_tiers1 = 2-3
+ *
+ * node_demotion[0].preferred = 2
+ * node_demotion[1].preferred = 3
+ * node_demotion[2].preferred = <empty>
+ * node_demotion[3].preferred = <empty>
+ *
+ * Example 2:
+ *
+ * Node 0 & 1 are CPU + DRAM nodes, node 2 is memory-only DRAM node.
+ *
+ * node distances:
+ * node 0 1 2
+ * 0 10 20 30
+ * 1 20 10 30
+ * 2 30 30 10
+ *
+ * memory_tiers0 = 0-2
+ *
+ * node_demotion[0].preferred = <empty>
+ * node_demotion[1].preferred = <empty>
+ * node_demotion[2].preferred = <empty>
+ *
+ * Example 3:
+ *
+ * Node 0 is CPU + DRAM nodes, Node 1 is HBM node, node 2 is PMEM node.
+ *
+ * node distances:
+ * node 0 1 2
+ * 0 10 20 30
+ * 1 20 10 40
+ * 2 30 40 10
+ *
+ * memory_tiers0 = 1
+ * memory_tiers1 = 0
+ * memory_tiers2 = 2
+ *
+ * node_demotion[0].preferred = 2
+ * node_demotion[1].preferred = 0
+ * node_demotion[2].preferred = <empty>
+ *
+ */
+static struct demotion_nodes *node_demotion __read_mostly;
+#endif /* CONFIG_MIGRATION */

static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memtype)
{
@@ -68,6 +134,154 @@ static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memty
return new_memtier;
}

+static struct memory_tier *__node_get_memory_tier(int node)
+{
+ struct memory_dev_type *memtype;
+
+ memtype = node_memory_types[node];
+ if (memtype && node_isset(node, memtype->nodes))
+ return memtype->memtier;
+ return NULL;
+}
+
+#ifdef CONFIG_MIGRATION
+/**
+ * next_demotion_node() - Get the next node in the demotion path
+ * @node: The starting node to lookup the next node
+ *
+ * Return: node id for next memory node in the demotion path hierarchy
+ * from @node; NUMA_NO_NODE if @node is terminal. This does not keep
+ * @node online or guarantee that it *continues* to be the next demotion
+ * target.
+ */
+int next_demotion_node(int node)
+{
+ struct demotion_nodes *nd;
+ int target;
+
+ if (!node_demotion)
+ return NUMA_NO_NODE;
+
+ nd = &node_demotion[node];
+
+ /*
+ * node_demotion[] is updated without excluding this
+ * function from running.
+ *
+ * Make sure to use RCU over entire code blocks if
+ * node_demotion[] reads need to be consistent.
+ */
+ rcu_read_lock();
+ /*
+ * If there are multiple target nodes, just select one
+ * target node randomly.
+ *
+ * In addition, we can also use round-robin to select
+ * target node, but we should introduce another variable
+ * for node_demotion[] to record last selected target node,
+ * that may cause cache ping-pong due to the changing of
+ * last target node. Or introducing per-cpu data to avoid
+ * caching issue, which seems more complicated. So selecting
+ * target node randomly seems better until now.
+ */
+ target = node_random(&nd->preferred);
+ rcu_read_unlock();
+
+ return target;
+}
+
+static void disable_all_demotion_targets(void)
+{
+ int node;
+
+ for_each_node_state(node, N_MEMORY)
+ node_demotion[node].preferred = NODE_MASK_NONE;
+ /*
+ * Ensure that the "disable" is visible across the system.
+ * Readers will see either a combination of before+disable
+ * state or disable+after. They will never see before and
+ * after state together.
+ */
+ synchronize_rcu();
+}
+
+static __always_inline nodemask_t get_memtier_nodemask(struct memory_tier *memtier)
+{
+ nodemask_t nodes = NODE_MASK_NONE;
+ struct memory_dev_type *memtype;
+
+ list_for_each_entry(memtype, &memtier->memory_types, tier_sibiling)
+ nodes_or(nodes, nodes, memtype->nodes);
+
+ return nodes;
+}
+
+/*
+ * Find an automatic demotion target for all memory
+ * nodes. Failing here is OK. It might just indicate
+ * being at the end of a chain.
+ */
+static void establish_demotion_targets(void)
+{
+ struct memory_tier *memtier;
+ struct demotion_nodes *nd;
+ int target = NUMA_NO_NODE, node;
+ int distance, best_distance;
+ nodemask_t tier_nodes;
+
+ lockdep_assert_held_once(&memory_tier_lock);
+
+ if (!node_demotion || !IS_ENABLED(CONFIG_MIGRATION))
+ return;
+
+ disable_all_demotion_targets();
+
+ for_each_node_state(node, N_MEMORY) {
+ best_distance = -1;
+ nd = &node_demotion[node];
+
+ memtier = __node_get_memory_tier(node);
+ if (!memtier || list_is_last(&memtier->list, &memory_tiers))
+ continue;
+ /*
+ * Get the lower memtier to find the demotion node list.
+ */
+ memtier = list_next_entry(memtier, list);
+ tier_nodes = get_memtier_nodemask(memtier);
+ /*
+ * find_next_best_node, use 'used' nodemask as a skip list.
+ * Add all memory nodes except the selected memory tier
+ * nodelist to skip list so that we find the best node from the
+ * memtier nodelist.
+ */
+ nodes_andnot(tier_nodes, node_states[N_MEMORY], tier_nodes);
+
+ /*
+ * Find all the nodes in the memory tier node list of same best distance.
+ * add them to the preferred mask. We randomly select between nodes
+ * in the preferred mask when allocating pages during demotion.
+ */
+ do {
+ target = find_next_best_node(node, &tier_nodes);
+ if (target == NUMA_NO_NODE)
+ break;
+
+ distance = node_distance(node, target);
+ if (distance == best_distance || best_distance == -1) {
+ best_distance = distance;
+ node_set(target, nd->preferred);
+ } else {
+ break;
+ }
+ } while (1);
+ }
+}
+
+#else
+static inline void disable_all_demotion_targets(void) {}
+static inline void establish_demotion_targets(void) {}
+#endif /* CONFIG_MIGRATION */
+
static inline void __init_node_memory_type(int node, struct memory_dev_type *memtype)
{
if (!node_memory_types[node]) {
@@ -94,16 +308,6 @@ static struct memory_tier *set_node_memory_tier(int node)
return memtier;
}

-static struct memory_tier *__node_get_memory_tier(int node)
-{
- struct memory_dev_type *memtype;
-
- memtype = node_memory_types[node];
- if (memtype && node_isset(node, memtype->nodes))
- return memtype->memtier;
- return NULL;
-}
-
static void destroy_memory_tier(struct memory_tier *memtier)
{
list_del(&memtier->list);
@@ -186,6 +390,7 @@ EXPORT_SYMBOL_GPL(clear_node_memory_type);
static int __meminit memtier_hotplug_callback(struct notifier_block *self,
unsigned long action, void *_arg)
{
+ struct memory_tier *memtier;
struct memory_notify *arg = _arg;

/*
@@ -198,12 +403,15 @@ static int __meminit memtier_hotplug_callback(struct notifier_block *self,
switch (action) {
case MEM_OFFLINE:
mutex_lock(&memory_tier_lock);
- clear_node_memory_tier(arg->status_change_nid);
+ if (clear_node_memory_tier(arg->status_change_nid))
+ establish_demotion_targets();
mutex_unlock(&memory_tier_lock);
break;
case MEM_ONLINE:
mutex_lock(&memory_tier_lock);
- set_node_memory_tier(arg->status_change_nid);
+ memtier = set_node_memory_tier(arg->status_change_nid);
+ if (!IS_ERR(memtier))
+ establish_demotion_targets();
mutex_unlock(&memory_tier_lock);
break;
}
@@ -216,6 +424,11 @@ static int __init memory_tier_init(void)
int node;
struct memory_tier *memtier;

+#ifdef CONFIG_MIGRATION
+ node_demotion = kcalloc(nr_node_ids, sizeof(struct demotion_nodes),
+ GFP_KERNEL);
+ WARN_ON(!node_demotion);
+#endif
mutex_lock(&memory_tier_lock);
/*
* For now we can have 4 faster memory tiers with smaller adistance
@@ -238,6 +451,7 @@ static int __init memory_tier_init(void)
*/
break;
}
+ establish_demotion_targets();
mutex_unlock(&memory_tier_lock);

hotplug_memory_notifier(memtier_hotplug_callback, MEMTIER_HOTPLUG_PRIO);
diff --git a/mm/migrate.c b/mm/migrate.c
index 5d7fb417edbf..ea86594f4bc5 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2170,398 +2170,4 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
return 0;
}
#endif /* CONFIG_NUMA_BALANCING */
-
-/*
- * node_demotion[] example:
- *
- * Consider a system with two sockets. Each socket has
- * three classes of memory attached: fast, medium and slow.
- * Each memory class is placed in its own NUMA node. The
- * CPUs are placed in the node with the "fast" memory. The
- * 6 NUMA nodes (0-5) might be split among the sockets like
- * this:
- *
- * Socket A: 0, 1, 2
- * Socket B: 3, 4, 5
- *
- * When Node 0 fills up, its memory should be migrated to
- * Node 1. When Node 1 fills up, it should be migrated to
- * Node 2. The migration path start on the nodes with the
- * processors (since allocations default to this node) and
- * fast memory, progress through medium and end with the
- * slow memory:
- *
- * 0 -> 1 -> 2 -> stop
- * 3 -> 4 -> 5 -> stop
- *
- * This is represented in the node_demotion[] like this:
- *
- * { nr=1, nodes[0]=1 }, // Node 0 migrates to 1
- * { nr=1, nodes[0]=2 }, // Node 1 migrates to 2
- * { nr=0, nodes[0]=-1 }, // Node 2 does not migrate
- * { nr=1, nodes[0]=4 }, // Node 3 migrates to 4
- * { nr=1, nodes[0]=5 }, // Node 4 migrates to 5
- * { nr=0, nodes[0]=-1 }, // Node 5 does not migrate
- *
- * Moreover some systems may have multiple slow memory nodes.
- * Suppose a system has one socket with 3 memory nodes, node 0
- * is fast memory type, and node 1/2 both are slow memory
- * type, and the distance between fast memory node and slow
- * memory node is same. So the migration path should be:
- *
- * 0 -> 1/2 -> stop
- *
- * This is represented in the node_demotion[] like this:
- * { nr=2, {nodes[0]=1, nodes[1]=2} }, // Node 0 migrates to node 1 and node 2
- * { nr=0, nodes[0]=-1, }, // Node 1 dose not migrate
- * { nr=0, nodes[0]=-1, }, // Node 2 does not migrate
- */
-
-/*
- * Writes to this array occur without locking. Cycles are
- * not allowed: Node X demotes to Y which demotes to X...
- *
- * If multiple reads are performed, a single rcu_read_lock()
- * must be held over all reads to ensure that no cycles are
- * observed.
- */
-#define DEFAULT_DEMOTION_TARGET_NODES 15
-
-#if MAX_NUMNODES < DEFAULT_DEMOTION_TARGET_NODES
-#define DEMOTION_TARGET_NODES (MAX_NUMNODES - 1)
-#else
-#define DEMOTION_TARGET_NODES DEFAULT_DEMOTION_TARGET_NODES
-#endif
-
-struct demotion_nodes {
- unsigned short nr;
- short nodes[DEMOTION_TARGET_NODES];
-};
-
-static struct demotion_nodes *node_demotion __read_mostly;
-
-/**
- * next_demotion_node() - Get the next node in the demotion path
- * @node: The starting node to lookup the next node
- *
- * Return: node id for next memory node in the demotion path hierarchy
- * from @node; NUMA_NO_NODE if @node is terminal. This does not keep
- * @node online or guarantee that it *continues* to be the next demotion
- * target.
- */
-int next_demotion_node(int node)
-{
- struct demotion_nodes *nd;
- unsigned short target_nr, index;
- int target;
-
- if (!node_demotion)
- return NUMA_NO_NODE;
-
- nd = &node_demotion[node];
-
- /*
- * node_demotion[] is updated without excluding this
- * function from running. RCU doesn't provide any
- * compiler barriers, so the READ_ONCE() is required
- * to avoid compiler reordering or read merging.
- *
- * Make sure to use RCU over entire code blocks if
- * node_demotion[] reads need to be consistent.
- */
- rcu_read_lock();
- target_nr = READ_ONCE(nd->nr);
-
- switch (target_nr) {
- case 0:
- target = NUMA_NO_NODE;
- goto out;
- case 1:
- index = 0;
- break;
- default:
- /*
- * If there are multiple target nodes, just select one
- * target node randomly.
- *
- * In addition, we can also use round-robin to select
- * target node, but we should introduce another variable
- * for node_demotion[] to record last selected target node,
- * that may cause cache ping-pong due to the changing of
- * last target node. Or introducing per-cpu data to avoid
- * caching issue, which seems more complicated. So selecting
- * target node randomly seems better until now.
- */
- index = get_random_int() % target_nr;
- break;
- }
-
- target = READ_ONCE(nd->nodes[index]);
-
-out:
- rcu_read_unlock();
- return target;
-}
-
-/* Disable reclaim-based migration. */
-static void __disable_all_migrate_targets(void)
-{
- int node, i;
-
- if (!node_demotion)
- return;
-
- for_each_online_node(node) {
- node_demotion[node].nr = 0;
- for (i = 0; i < DEMOTION_TARGET_NODES; i++)
- node_demotion[node].nodes[i] = NUMA_NO_NODE;
- }
-}
-
-static void disable_all_migrate_targets(void)
-{
- __disable_all_migrate_targets();
-
- /*
- * Ensure that the "disable" is visible across the system.
- * Readers will see either a combination of before+disable
- * state or disable+after. They will never see before and
- * after state together.
- *
- * The before+after state together might have cycles and
- * could cause readers to do things like loop until this
- * function finishes. This ensures they can only see a
- * single "bad" read and would, for instance, only loop
- * once.
- */
- synchronize_rcu();
-}
-
-/*
- * Find an automatic demotion target for 'node'.
- * Failing here is OK. It might just indicate
- * being at the end of a chain.
- */
-static int establish_migrate_target(int node, nodemask_t *used,
- int best_distance)
-{
- int migration_target, index, val;
- struct demotion_nodes *nd;
-
- if (!node_demotion)
- return NUMA_NO_NODE;
-
- nd = &node_demotion[node];
-
- migration_target = find_next_best_node(node, used);
- if (migration_target == NUMA_NO_NODE)
- return NUMA_NO_NODE;
-
- /*
- * If the node has been set a migration target node before,
- * which means it's the best distance between them. Still
- * check if this node can be demoted to other target nodes
- * if they have a same best distance.
- */
- if (best_distance != -1) {
- val = node_distance(node, migration_target);
- if (val > best_distance)
- goto out_clear;
- }
-
- index = nd->nr;
- if (WARN_ONCE(index >= DEMOTION_TARGET_NODES,
- "Exceeds maximum demotion target nodes\n"))
- goto out_clear;
-
- nd->nodes[index] = migration_target;
- nd->nr++;
-
- return migration_target;
-out_clear:
- node_clear(migration_target, *used);
- return NUMA_NO_NODE;
-}
-
-/*
- * When memory fills up on a node, memory contents can be
- * automatically migrated to another node instead of
- * discarded at reclaim.
- *
- * Establish a "migration path" which will start at nodes
- * with CPUs and will follow the priorities used to build the
- * page allocator zonelists.
- *
- * The difference here is that cycles must be avoided. If
- * node0 migrates to node1, then neither node1, nor anything
- * node1 migrates to can migrate to node0. Also one node can
- * be migrated to multiple nodes if the target nodes all have
- * a same best-distance against the source node.
- *
- * This function can run simultaneously with readers of
- * node_demotion[]. However, it can not run simultaneously
- * with itself. Exclusion is provided by memory hotplug events
- * being single-threaded.
- */
-static void __set_migration_target_nodes(void)
-{
- nodemask_t next_pass;
- nodemask_t this_pass;
- nodemask_t used_targets = NODE_MASK_NONE;
- int node, best_distance;
-
- /*
- * Avoid any oddities like cycles that could occur
- * from changes in the topology. This will leave
- * a momentary gap when migration is disabled.
- */
- disable_all_migrate_targets();
-
- /*
- * Allocations go close to CPUs, first. Assume that
- * the migration path starts at the nodes with CPUs.
- */
- next_pass = node_states[N_CPU];
-again:
- this_pass = next_pass;
- next_pass = NODE_MASK_NONE;
- /*
- * To avoid cycles in the migration "graph", ensure
- * that migration sources are not future targets by
- * setting them in 'used_targets'. Do this only
- * once per pass so that multiple source nodes can
- * share a target node.
- *
- * 'used_targets' will become unavailable in future
- * passes. This limits some opportunities for
- * multiple source nodes to share a destination.
- */
- nodes_or(used_targets, used_targets, this_pass);
-
- for_each_node_mask(node, this_pass) {
- best_distance = -1;
-
- /*
- * Try to set up the migration path for the node, and the target
- * migration nodes can be multiple, so doing a loop to find all
- * the target nodes if they all have a best node distance.
- */
- do {
- int target_node =
- establish_migrate_target(node, &used_targets,
- best_distance);
-
- if (target_node == NUMA_NO_NODE)
- break;
-
- if (best_distance == -1)
- best_distance = node_distance(node, target_node);
-
- /*
- * Visit targets from this pass in the next pass.
- * Eventually, every node will have been part of
- * a pass, and will become set in 'used_targets'.
- */
- node_set(target_node, next_pass);
- } while (1);
- }
- /*
- * 'next_pass' contains nodes which became migration
- * targets in this pass. Make additional passes until
- * no more migrations targets are available.
- */
- if (!nodes_empty(next_pass))
- goto again;
-}
-
-/*
- * For callers that do not hold get_online_mems() already.
- */
-void set_migration_target_nodes(void)
-{
- get_online_mems();
- __set_migration_target_nodes();
- put_online_mems();
-}
-
-/*
- * This leaves migrate-on-reclaim transiently disabled between
- * the MEM_GOING_OFFLINE and MEM_OFFLINE events. This runs
- * whether reclaim-based migration is enabled or not, which
- * ensures that the user can turn reclaim-based migration at
- * any time without needing to recalculate migration targets.
- *
- * These callbacks already hold get_online_mems(). That is why
- * __set_migration_target_nodes() can be used as opposed to
- * set_migration_target_nodes().
- */
-#ifdef CONFIG_MEMORY_HOTPLUG
-static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
- unsigned long action, void *_arg)
-{
- struct memory_notify *arg = _arg;
-
- /*
- * Only update the node migration order when a node is
- * changing status, like online->offline. This avoids
- * the overhead of synchronize_rcu() in most cases.
- */
- if (arg->status_change_nid < 0)
- return notifier_from_errno(0);
-
- switch (action) {
- case MEM_GOING_OFFLINE:
- /*
- * Make sure there are not transient states where
- * an offline node is a migration target. This
- * will leave migration disabled until the offline
- * completes and the MEM_OFFLINE case below runs.
- */
- disable_all_migrate_targets();
- break;
- case MEM_OFFLINE:
- case MEM_ONLINE:
- /*
- * Recalculate the target nodes once the node
- * reaches its final state (online or offline).
- */
- __set_migration_target_nodes();
- break;
- case MEM_CANCEL_OFFLINE:
- /*
- * MEM_GOING_OFFLINE disabled all the migration
- * targets. Reenable them.
- */
- __set_migration_target_nodes();
- break;
- case MEM_GOING_ONLINE:
- case MEM_CANCEL_ONLINE:
- break;
- }
-
- return notifier_from_errno(0);
-}
-#endif
-
-void __init migrate_on_reclaim_init(void)
-{
- node_demotion = kcalloc(nr_node_ids,
- sizeof(struct demotion_nodes),
- GFP_KERNEL);
- WARN_ON(!node_demotion);
-#ifdef CONFIG_MEMORY_HOTPLUG
- hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
-#endif
- /*
- * At this point, all numa nodes with memory/CPus have their state
- * properly set, so we can build the demotion order now.
- * Let us hold the cpu_hotplug lock just, as we could possibily have
- * CPU hotplug events during boot.
- */
- cpus_read_lock();
- set_migration_target_nodes();
- cpus_read_unlock();
-}
#endif /* CONFIG_NUMA */
-
-
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 373d2730fcf2..35c6ff97cf29 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -28,7 +28,6 @@
#include <linux/mm_inline.h>
#include <linux/page_ext.h>
#include <linux/page_owner.h>
-#include <linux/migrate.h>

#include "internal.h"

@@ -2060,7 +2059,6 @@ static int vmstat_cpu_online(unsigned int cpu)

if (!node_state(cpu_to_node(cpu), N_CPU)) {
node_set_state(cpu_to_node(cpu), N_CPU);
- set_migration_target_nodes();
}

return 0;
@@ -2085,7 +2083,6 @@ static int vmstat_cpu_dead(unsigned int cpu)
return 0;

node_clear_state(node, N_CPU);
- set_migration_target_nodes();

return 0;
}
@@ -2118,7 +2115,6 @@ void __init init_mm_internals(void)

start_shepherd_timer();
#endif
- migrate_on_reclaim_init();
#ifdef CONFIG_PROC_FS
proc_create_seq("buddyinfo", 0444, NULL, &fragmentation_op);
proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op);
--
2.37.2

2022-08-19 07:29:17

by Bharata B Rao

Subject: Re: [PATCH v15 00/10] mm/demotion: Memory tiers and demotion

On 8/18/2022 6:40 PM, Aneesh Kumar K.V wrote:
> The current kernel has basic memory tiering support: Inactive pages on a
> higher tier NUMA node can be migrated (demoted) to a lower tier NUMA node to
> make room for new allocations on the higher tier NUMA node. Frequently accessed
> pages on a lower tier NUMA node can be migrated (promoted) to a higher tier NUMA
> node to improve performance.
>
> In the current kernel, memory tiers are defined implicitly via a demotion path
> relationship between NUMA nodes, which is created during the kernel
> initialization and updated when a NUMA node is hot-added or hot-removed. The
> current implementation puts all nodes with CPUs into the highest tier, and builds the
> tier hierarchy tier-by-tier by establishing the per-node demotion targets based
> on the distances between nodes.
>
> The current memory tier kernel implementation needs to be improved for several
> important use cases:
>
> * The current tier initialization code always initializes each memory-only NUMA
> node into a lower tier. But a memory-only NUMA node may have a high
> performance memory device (e.g. a DRAM-backed memory-only node on a virtual
> machine) and that should be put into a higher tier.
>
> * The current tier hierarchy always puts CPU nodes into the top tier. But on a
> system with HBM (e.g. GPU memory) devices, these memory-only HBM NUMA nodes
> should be in the top tier, and DRAM nodes with CPUs are better placed
> into the next lower tier.
>
> * Also because the current tier hierarchy always puts CPU nodes into the top
> tier, when a CPU is hot-added (or hot-removed) and turns a memory node from
> CPU-less into a CPU node (or vice versa), the memory tier hierarchy gets
> changed, even though no memory node is added or removed. This can make the
> tier hierarchy unstable and make it difficult to support tier-based memory
> accounting.
>
> * A higher tier node can only be demoted to nodes with the shortest distance on the
> next lower tier as defined by the demotion path, not any other node from any
> lower tier. This strict demotion order does not work in all use
> cases (e.g. some use cases may want to allow cross-socket demotion to another
> node in the same demotion tier as a fallback when the preferred demotion node
> is out of space), and has resulted in the feature request for an interface to
> override the system-wide, per-node demotion order from the userspace. This
> demotion order is also inconsistent with the page allocation fallback order
> when all the nodes in a higher tier are out of space: The page allocation can
> fall back to any node from any lower tier, whereas the demotion order doesn't
> allow that.
>
> This patch series makes the creation of memory tiers explicit, under
> the control of device drivers.
>
> Memory Tier Initialization
> ==========================
>
> The Linux kernel presents memory devices as NUMA nodes, and each memory device is of
> a specific type. The memory type of a device is represented by its abstract
> distance. A memory tier corresponds to a range of abstract distance. This allows
> for classifying memory devices with a specific performance range into a memory
> tier.
>
> By default, all memory nodes are assigned to the default tier with
> abstract distance 512.
>
> A device driver can move its memory nodes from the default tier. For example,
> PMEM can move its memory nodes below the default tier, whereas GPU can move its
> memory nodes above the default tier.
>
> The kernel initialization code makes the decision on which exact tier a memory
> node should be assigned to based on the requests from the device drivers as well
> as the memory device hardware information provided by the firmware.

I gave this patchset a quick try on two setups:

1. With QEMU, when an nvdimm device is bound to dax kmem driver, I can see
the memory node with pmem getting into a lower tier than DRAM.

2. In an experimental CXL setup that has DRAM as part of CXL memory, I see that
the CXL memory node falls into the same tier as the regular DRAM tier. This is
expected for now since there is no code (in low level ACPI driver?) yet to
map the latency or bandwidth info (when available from firmware) into an
abstract distance value, and register a memory type for the same. Guess these
bits can be covered as part of future enhancements.

Regards,
Bharata.

2022-08-20 01:18:02

by Andrew Morton

Subject: Re: [PATCH v15 00/10] mm/demotion: Memory tiers and demotion

On Fri, 19 Aug 2022 11:57:18 +0530 Bharata B Rao <[email protected]> wrote:

> > The kernel initialization code makes the decision on which exact tier a memory
> > node should be assigned to based on the requests from the device drivers as well
> > as the memory device hardware information provided by the firmware.
>
> I gave this patchset a quick try on two setups:
>
> 1. With QEMU, when an nvdimm device is bound to dax kmem driver, I can see
> the memory node with pmem getting into a lower tier than DRAM.
>
> 2. In an experimental CXL setup that has DRAM as part of CXL memory, I see that
> CXL memory node falls into the same tier as the regular DRAM tier. This is
> expected for now since there is no code (in low level ACPI driver?) yet to
> map the latency or bandwidth info (when available from firmware) into an
> abstract distance value, and register a memory type for the same. Guess these
> bits can be covered as part of future enhancements.

Should I add your Tested-by:?

2022-08-20 02:53:02

by Wei Xu

Subject: Re: [PATCH v15 00/10] mm/demotion: Memory tiers and demotion

Acked-by: Wei Xu <[email protected]>

On Thu, Aug 18, 2022 at 6:10 AM Aneesh Kumar K.V
<[email protected]> wrote:
>
> The current kernel has basic memory tiering support: Inactive pages on a
> higher tier NUMA node can be migrated (demoted) to a lower tier NUMA node to
> make room for new allocations on the higher tier NUMA node. Frequently accessed
> pages on a lower tier NUMA node can be migrated (promoted) to a higher tier NUMA
> node to improve performance.
>
> In the current kernel, memory tiers are defined implicitly via a demotion path
> relationship between NUMA nodes, which is created during the kernel
> initialization and updated when a NUMA node is hot-added or hot-removed. The
> current implementation puts all nodes with CPUs into the highest tier, and builds the
> tier hierarchy tier-by-tier by establishing the per-node demotion targets based
> on the distances between nodes.
>
> The current memory tier kernel implementation needs to be improved for several
> important use cases:
>
> * The current tier initialization code always initializes each memory-only NUMA
> node into a lower tier. But a memory-only NUMA node may have a high
> performance memory device (e.g. a DRAM-backed memory-only node on a virtual
> machine) and that should be put into a higher tier.
>
> * The current tier hierarchy always puts CPU nodes into the top tier. But on a
> system with HBM (e.g. GPU memory) devices, these memory-only HBM NUMA nodes
> should be in the top tier, and DRAM nodes with CPUs are better placed
> into the next lower tier.
>
> * Also because the current tier hierarchy always puts CPU nodes into the top
> tier, when a CPU is hot-added (or hot-removed) and turns a memory node from
> CPU-less into a CPU node (or vice versa), the memory tier hierarchy gets
> changed, even though no memory node is added or removed. This can make the
> tier hierarchy unstable and make it difficult to support tier-based memory
> accounting.
>
> * A higher tier node can only be demoted to nodes with the shortest distance on the
> next lower tier as defined by the demotion path, not any other node from any
> lower tier. This strict demotion order does not work in all use
> cases (e.g. some use cases may want to allow cross-socket demotion to another
> node in the same demotion tier as a fallback when the preferred demotion node
> is out of space), and has resulted in the feature request for an interface to
> override the system-wide, per-node demotion order from the userspace. This
> demotion order is also inconsistent with the page allocation fallback order
> when all the nodes in a higher tier are out of space: The page allocation can
> fall back to any node from any lower tier, whereas the demotion order doesn't
> allow that.
>
> This patch series makes the creation of memory tiers explicit, under
> the control of device drivers.
>
> Memory Tier Initialization
> ==========================
>
> The Linux kernel presents memory devices as NUMA nodes, and each memory device is of
> a specific type. The memory type of a device is represented by its abstract
> distance. A memory tier corresponds to a range of abstract distance. This allows
> for classifying memory devices with a specific performance range into a memory
> tier.
>
> By default, all memory nodes are assigned to the default tier with
> abstract distance 512.
>
> A device driver can move its memory nodes from the default tier. For example,
> PMEM can move its memory nodes below the default tier, whereas GPU can move its
> memory nodes above the default tier.
>
> The kernel initialization code makes the decision on which exact tier a memory
> node should be assigned to based on the requests from the device drivers as well
> as the memory device hardware information provided by the firmware.
>
> Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
>
> Changes from v14
> * Add Reviewed-by:
> * Address review feedback w.r.t default adistance value
>
> Changes from v13
> * Address review feedback.
> * Add patch dropping memtier from struct memory_dev_type
>
> Changes from v12
> * Fix kernel crash on module unload
> * Address review feedback.
> * Add node_random patch to this series based on review feedback
>
> Changes from v11:
> * A smaller abstract distance implies a faster (higher) memory tier.
>
> Changes from v10:
> * rename performance level to abstract distance
> * Thanks to all the good feedback from Huang, Ying <[email protected]>.
> Updated the patchset to cover most of the review feedback.
>
> Changes from v9:
> * Use performance level for initializing memory tiers.
>
> Changes from v8:
> * Drop the sysfs interface patches and related documentation changes.
>
> Changes from v7:
> * Fix kernel crash with demotion.
> * Improve documentation.
>
> Changes from v6:
> * Drop the usage of rank.
> * Address other review feedback.
>
> Changes from v5:
> * Remove patch supporting N_MEMORY node removal from memory tiers. Memory tiers
> are going to be used for features other than demotion. Hence keep all N_MEMORY
> nodes in memory tiers irrespective of whether they want to participate in promotion or demotion.
> * Add NODE_DATA->memtier
> * Rearrange patches to add sysfs files later.
> * Add support to create memory tiers from userspace.
> * Address other review feedback.
>
>
> Changes from v4:
> * Address review feedback.
> * Reverse the meaning of "rank": higher rank value means higher tier.
> * Add "/sys/devices/system/memtier/default_tier".
> * Add node_is_toptier
>
> v4:
> Add support for explicit memory tiers and ranks.
>
> v3:
> - Modify patch 1 subject to make it more specific
> - Remove /sys/kernel/mm/numa/demotion_targets interface, use
> /sys/devices/system/node/demotion_targets instead and make
> it writable to override node_states[N_DEMOTION_TARGETS].
> - Add support to view per node demotion targets via sysfs
>
> v2:
> In v1, only the 1st patch of this patch series was sent, which was
> implemented to avoid some of the limitations on the demotion
> target sharing; however, for certain NUMA topologies, the demotion
> targets found by that patch were not optimal, so the 1st patch
> in this series is modified according to suggestions from Huang
> and Baolin. Different examples of demotion list comparison
> between the existing implementation and the changed implementation can
> be found in the commit message of the 1st patch.
>
>
> Aneesh Kumar K.V (9):
> mm/demotion: Add support for explicit memory tiers
> mm/demotion: Move memory demotion related code
> mm/demotion: Add hotplug callbacks to handle new numa node onlined
> mm/demotion/dax/kmem: Set node's abstract distance to
> MEMTIER_DEFAULT_DAX_ADISTANCE
> mm/demotion: Build demotion targets based on explicit memory tiers
> mm/demotion: Add pg_data_t member to track node memory tier details
> mm/demotion: Drop memtier from memtype
> mm/demotion: Update node_is_toptier to work with memory tiers
> lib/nodemask: Optimize node_random for nodemask with single NUMA node
>
> Jagdish Gediya (1):
> mm/demotion: Demote pages according to allocation fallback order
>
> drivers/dax/kmem.c | 42 ++-
> include/linux/memory-tiers.h | 102 ++++++
> include/linux/migrate.h | 15 -
> include/linux/mmzone.h | 3 +
> include/linux/node.h | 5 -
> include/linux/nodemask.h | 15 +-
> mm/Makefile | 1 +
> mm/huge_memory.c | 1 +
> mm/memory-tiers.c | 645 +++++++++++++++++++++++++++++++++++
> mm/migrate.c | 453 +-----------------------
> mm/mprotect.c | 1 +
> mm/vmscan.c | 59 +++-
> mm/vmstat.c | 4 -
> 13 files changed, 849 insertions(+), 497 deletions(-)
> create mode 100644 include/linux/memory-tiers.h
> create mode 100644 mm/memory-tiers.c
>
> --
> 2.37.2
>

2022-08-22 03:53:55

by Bharata B Rao

Subject: Re: [PATCH v15 00/10] mm/demotion: Memory tiers and demotion

On 8/20/2022 6:04 AM, Andrew Morton wrote:
> On Fri, 19 Aug 2022 11:57:18 +0530 Bharata B Rao <[email protected]> wrote:
>
>>> The kernel initialization code makes the decision on which exact tier a memory
>>> node should be assigned to based on the requests from the device drivers as well
>>> as the memory device hardware information provided by the firmware.
>>
>> I gave this patchset a quick try on two setups:
>>
>> 1. With QEMU, when an nvdimm device is bound to dax kmem driver, I can see
>> the memory node with pmem getting into a lower tier than DRAM.
>>
>> 2. In an experimental CXL setup that has DRAM as part of CXL memory, I see that
>> CXL memory node falls into the same tier as the regular DRAM tier. This is
>> expected for now since there is no code (in low level ACPI driver?) yet to
>> map the latency or bandwidth info (when available from firmware) into an
>> abstract distance value, and register a memory type for the same. Guess these
>> bits can be covered as part of future enhancements.
>
> Should I add your Tested-by:?

Maybe not. I have done only very minimal testing of the specific scenarios
mentioned above. Thanks for checking.

Regards,
Bharata.

2022-09-12 00:21:43

by Andrew Morton

Subject: Re: [PATCH v15 00/10] mm/demotion: Memory tiers and demotion

On Thu, 18 Aug 2022 18:40:32 +0530 "Aneesh Kumar K.V" <[email protected]> wrote:

> This patch series makes the creation of memory tiers explicit, under
> the control of device drivers.

This series has been in mm-unstable for nearly four weeks and
everything has died down, so I'm planning on moving it into mm-stable
late this week unless someone stops me...

2022-09-26 21:24:56

by Yury Norov

Subject: Re: [PATCH v15 10/10] lib/nodemask: Optimize node_random for nodemask with single NUMA node

Hi Aneesh,

Please CC maintainers in your recipient list.

On Thu, Aug 18, 2022 at 06:40:42PM +0530, Aneesh Kumar K.V wrote:
> The most common case for certain node_random usage (demotion nodemask) is with
> nodemask weight 1. We can avoid calling get_random_int() in that case and
> always return the only node set in the nodemask.

Can you move the comment about get_random_int() to the code?

> A simple test as below
> before = rdtsc_ordered();
> for (i = 0; i < 100; i++) {
> rand = node_random(&nmask);
> }
> after = rdtsc_ordered();
>
> Without fix after - before : 16438
> With fix after - before : 816
>
> Reviewed-by: "Huang, Ying" <[email protected]>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> ---
> include/linux/nodemask.h | 15 ++++++++++++---
> 1 file changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
> index 4b71a96190a8..ac5b6a371be5 100644
> --- a/include/linux/nodemask.h
> +++ b/include/linux/nodemask.h
> @@ -504,12 +504,21 @@ static inline int num_node_state(enum node_states state)
> static inline int node_random(const nodemask_t *maskp)
> {
> #if defined(CONFIG_NUMA) && (MAX_NUMNODES > 1)
> - int w, bit = NUMA_NO_NODE;
> + int w, bit;
>
> w = nodes_weight(*maskp);
> - if (w)
> + switch (w) {
> + case 0:
> + bit = NUMA_NO_NODE;
> + break;

Why not 'return NUMA_NO_NODE' instead of the break?

> + case 1:
> + bit = first_node(*maskp);
> + break;
> + default:
> bit = bitmap_ord_to_pos(maskp->bits,
> - get_random_int() % w, MAX_NUMNODES);
> + get_random_int() % w, MAX_NUMNODES);

Don't fix tabs - it trashes the history.

> + break;
> + }
> return bit;
> #else
> return 0;
> --
> 2.37.2
>
>