2022-05-28 02:53:16

by Wei Xu

Subject: RFC: Memory Tiering Kernel Interfaces (v3)

Changes since v2
================
* Updated the design and examples to use "rank" instead of device ID
to determine the order between memory tiers for better flexibility.

Overview
========

The current kernel has the basic memory tiering support: Inactive
pages on a higher tier NUMA node can be migrated (demoted) to a lower
tier NUMA node to make room for new allocations on the higher tier
NUMA node. Frequently accessed pages on a lower tier NUMA node can be
migrated (promoted) to a higher tier NUMA node to improve the
performance.

In the current kernel, memory tiers are defined implicitly via a
demotion path relationship between NUMA nodes, which is created during
the kernel initialization and updated when a NUMA node is hot-added or
hot-removed. The current implementation puts all nodes with CPU into
the top tier, and builds the tier hierarchy tier-by-tier by
establishing the per-node demotion targets based on the distances
between nodes.

This current memory tier kernel interface needs to be improved for
several important use cases:

* The current tier initialization code always initializes
each memory-only NUMA node into a lower tier. But a memory-only
NUMA node may have a high performance memory device (e.g. a DRAM
device attached via CXL.mem or a DRAM-backed memory-only node on
a virtual machine) and should be put into a higher tier.

* The current tier hierarchy always puts CPU nodes into the top
tier. But on a system with HBM (e.g. GPU memory) devices, these
memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
with CPUs are better to be placed into the next lower tier.

* Also because the current tier hierarchy always puts CPU nodes
into the top tier, when a CPU is hot-added (or hot-removed) and
triggers a memory node from CPU-less into a CPU node (or vice
versa), the memory tier hierarchy gets changed, even though no
memory node is added or removed. This can make the tier
hierarchy unstable and make it difficult to support tier-based
memory accounting.

* A higher tier node can only be demoted to selected nodes on the
next lower tier as defined by the demotion path, not any other
node from any lower tier. This strict, hard-coded demotion order
does not work in all use cases (e.g. some use cases may want to
allow cross-socket demotion to another node in the same demotion
tier as a fallback when the preferred demotion node is out of
space), and has resulted in the feature request for an interface to
override the system-wide, per-node demotion order from the
userspace. This demotion order is also inconsistent with the page
allocation fallback order when all the nodes in a higher tier are
out of space: The page allocation can fall back to any node from
any lower tier, whereas the demotion order doesn't allow that.

* There are no interfaces for the userspace to learn about the memory
tier hierarchy in order to optimize its memory allocations.

I'd like to propose revised memory tier kernel interfaces based on
the discussions in the threads:

- https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
- https://lore.kernel.org/linux-mm/[email protected]/t/
- https://lore.kernel.org/linux-mm/[email protected]/T/
- https://lore.kernel.org/linux-mm/[email protected]/T/


High-level Design Ideas
=======================

* Define memory tiers explicitly, not implicitly.

* Memory tiers are defined based on hardware capabilities of memory
nodes, not their relative node distances between each other.

* The tier assignment of each node is independent from each other.
Moving a node from one tier to another tier doesn't affect the tier
assignment of any other node.

* The node-tier association is stable. A node can be reassigned to a
different tier only under the specific conditions that don't block
future tier-based memory cgroup accounting.

* A node can demote its pages to any nodes of any lower tiers. The
demotion target node selection follows the allocation fallback order
of the source node, which is built based on node distances. The
demotion targets are also restricted to only the nodes from the tiers
lower than the source node. We no longer need to maintain a separate
per-node demotion order (node_demotion[]).


Sysfs Interfaces
================

* /sys/devices/system/memtier/

This is the directory containing the information about memory tiers.

Each memory tier has its own subdirectory.

The order of memory tiers is determined by their rank values, not by
their memtier device names.

- /sys/devices/system/memtier/possible

Format: ordered list of "memtier(rank)"
Example: 0(64), 1(128), 2(192)

Read-only. When read, list all available memory tiers and their
associated ranks, ordered by the rank values (from the highest
tier to the lowest tier).

* /sys/devices/system/memtier/memtierN/

This is the directory containing the information about a particular
memory tier, memtierN, where N is the memtier device ID (e.g. 0, 1).

The memtier device ID number itself is just an identifier and has no
special meaning, i.e. memtier device ID numbers do not determine the
order of memory tiers.

- /sys/devices/system/memtier/memtierN/rank

Format: int
Example: 100

Read-only. When read, list the "rank" value associated with memtierN.

"Rank" is an opaque value. Its absolute value doesn't have any
special meaning. But the rank values of different memtiers can be
compared with each other to determine the memory tier order.
For example, if we have 3 memtiers: memtier0, memtier1, memtier2, and
their rank values are 10, 20, 15, then the memory tier order is:
memtier0 -> memtier2 -> memtier1, where memtier0 is the highest tier
and memtier1 is the lowest tier.

The rank value of each memtier should be unique.

- /sys/devices/system/memtier/memtierN/nodelist

Format: node_list
Example: 1-2

Read-only. When read, list the memory nodes in the specified tier.

If a memory tier has no memory nodes, the kernel can hide the sysfs
directory of this memory tier, though the tier itself can still be
visible from /sys/devices/system/memtier/possible.

* /sys/devices/system/node/nodeN/memtier

where N = 0, 1, ...

Format: int or empty
Example: 1

When read, list the device ID of the memory tier that the node belongs
to. Its value is empty for a CPU-only NUMA node.

When written, the kernel moves the node into the specified memory
tier if the move is allowed. The tier assignment of all other nodes
is not affected.

Initially, we can make this interface read-only.
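
As an illustration of how the rank values (rather than the memtier
device IDs) determine the tier order, here is a minimal userspace
sketch in C. The struct and function names are hypothetical and are
not part of the proposal; it only assumes, as in the rank example
above, that a smaller rank value means a higher tier.

```c
#include <stdlib.h>

/* Hypothetical userspace view of one memtier: device ID and rank. */
struct tier {
	int id;
	int rank;
};

/* Compare two tiers by rank value; a smaller rank is a higher tier. */
static int cmp_rank(const void *a, const void *b)
{
	const struct tier *x = a, *y = b;

	return (x->rank > y->rank) - (x->rank < y->rank);
}

/* Sort tiers in place from the highest tier to the lowest tier. */
void order_tiers(struct tier *tiers, size_t n)
{
	qsort(tiers, n, sizeof(*tiers), cmp_rank);
}
```

With memtier0, memtier1 and memtier2 at ranks 10, 20 and 15, sorting
this way yields memtier0, memtier2, memtier1, which matches the order
described for the rank file.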


Kernel Representation
=====================

* All memory tiering code is guarded by CONFIG_TIERED_MEMORY.

* #define MAX_MEMORY_TIERS 3

Support 3 memory tiers for now. This can be a kconfig option.

* #define MEMORY_DEFAULT_TIER_DEVICE 1

The default tier device that a memory node is assigned to.

* struct memtier_dev {
      nodemask_t nodelist;
      int rank;
      int tier;
  } memtier_devices[MAX_MEMORY_TIERS]

Store memory tiers by device IDs.

* struct memtier_dev *memory_tier(int tier)

Returns the memtier device for a given memory tier.

* int node_tier_dev_map[MAX_NUMNODES]

Map a node to its tier device ID.

For each CPU-only node c, node_tier_dev_map[c] = -1.


Memory Tier Initialization
==========================

By default, all memory nodes are assigned to the default tier
(MEMORY_DEFAULT_TIER_DEVICE). The default tier device has a rank value
in the middle of the possible rank value range (e.g. 127 if the range
is [0..255]).

A device driver can move up or down its memory nodes from the default
tier. For example, PMEM can move down its memory nodes below the
default tier, whereas GPU can move up its memory nodes above the
default tier.

The kernel initialization code makes the decision on which exact tier
a memory node should be assigned to based on the requests from the
device drivers as well as the memory device hardware information
provided by the firmware.


Memory Tier Reassignment
========================

After a memory node is hot-removed, it can be hot-added back to a
different memory tier. This is useful for supporting dynamically
provisioned CXL.mem NUMA nodes, which may connect to different
memory devices across hot-plug events. Such tier changes should
be compatible with tier-based memory accounting.

The userspace may also reassign an existing online memory node to a
different tier. However, this should only be allowed when no pages
are allocated from the memory node or when there are no non-root
memory cgroups (e.g. during the system boot). This restriction is
important for keeping memory tier hierarchy stable enough for
tier-based memory cgroup accounting.

Hot-adding/removing CPUs doesn't affect memory tier hierarchy.


Memory Allocation for Demotion
==============================

To allocate a new page as the demotion target for a page, the kernel
calls the allocation function (__alloc_pages_nodemask) with the
source page node as the preferred node and the union of all lower
tier nodes as the allowed nodemask. The actual target node selection
then follows the allocation fallback order that the kernel has
already defined.

The pseudo code looks like:

targets = NODE_MASK_NONE;
src_nid = page_to_nid(page);
src_tier = memtier_devices[node_tier_dev_map[src_nid]].tier;
for (i = src_tier + 1; i < MAX_MEMORY_TIERS; i++)
        nodes_or(targets, targets, memory_tier(i)->nodelist);
new_page = __alloc_pages_nodemask(gfp, order, src_nid, targets);

The mempolicy of cpuset, vma and owner task of the source page can
be set to refine the demotion target nodemask, e.g. to prevent
demotion or select a particular allowed node as the demotion target.
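
The demotion target selection described in this section can be
modelled in userspace with a short C sketch. It is a simplification
under stated assumptions: the function name and the fixed NR_NODES are
hypothetical, a larger rank value means a lower tier, and the
allocation fallback order is approximated by sorting purely on node
distance (the real zonelist construction has additional rules).

```c
#define NR_NODES 4

/*
 * Fill order[] with the demotion targets of src: every node whose tier
 * rank value is larger (i.e. a lower tier) than that of src, nearest
 * node first. Returns the number of targets found.
 */
int demotion_order(int src, int rank[NR_NODES],
		   int dist[NR_NODES][NR_NODES], int order[NR_NODES])
{
	int n = 0;

	for (int nid = 0; nid < NR_NODES; nid++) {
		int pos;

		if (rank[nid] <= rank[src])
			continue;	/* same or higher tier: not a target */

		/* Insertion sort by distance from the source node. */
		pos = n++;
		while (pos > 0 && dist[src][order[pos - 1]] > dist[src][nid]) {
			order[pos] = order[pos - 1];
			pos--;
		}
		order[pos] = nid;
	}
	return n;
}
```

Fed with the per-node ranks and the distance matrix of a given
topology, this produces a per-node list of the same shape as the
"Demotion fallback order" tables in the examples.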


Memory Allocation for Promotion
===============================

The page allocation for promotion is similar to demotion, except that (1)
the target nodemask uses the promotion tiers, (2) the preferred node can
be the accessing CPU node, not the source page node.
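
The two differences from demotion can be sketched in C as follows.
This is only a userspace model: the function names, the bitmask
representation of the nodemask and the fixed NR_NODES are assumptions
made for illustration, and the preferred-node choice is approximated
by picking the allowed node nearest to the accessing CPU node.

```c
#define NR_NODES 4

/*
 * Return a bitmask of promotion targets for src: all nodes whose tier
 * rank value is smaller (i.e. a higher tier) than that of src.
 */
unsigned int promotion_mask(int src, int rank[NR_NODES])
{
	unsigned int mask = 0;

	for (int nid = 0; nid < NR_NODES; nid++)
		if (rank[nid] < rank[src])
			mask |= 1u << nid;
	return mask;
}

/*
 * Pick the allowed node nearest to the node of the accessing CPU.
 * Returns -1 if the mask is empty.
 */
int pick_promotion_node(unsigned int mask, int cpu_nid,
			int dist[NR_NODES][NR_NODES])
{
	int best = -1;

	for (int nid = 0; nid < NR_NODES; nid++) {
		if (!(mask & (1u << nid)))
			continue;
		if (best < 0 || dist[cpu_nid][nid] < dist[cpu_nid][best])
			best = nid;
	}
	return best;
}
```

Because a node's distance to itself is the smallest entry in its row,
the accessing CPU node wins whenever it is in the promotion mask, even
if another higher tier node is closer to the source page node.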


Examples
========

* Example 1:

Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.

                20
  Node 0 (DRAM) ---- Node 1 (DRAM)
       |        \  /        |
       | 30   40 X 40       | 30
       |        /  \        |
  Node 2 (PMEM) ---- Node 3 (PMEM)
                40

node distances:
node   0    1    2    3
   0  10   20   30   40
   1  20   10   40   30
   2  30   40   10   40
   3  40   30   40   10

$ cat /sys/devices/system/memtier/possible
0(64), 1(128), 2(192)

$ grep '' /sys/devices/system/memtier/memtier*/rank
/sys/devices/system/memtier/memtier1/rank:128
/sys/devices/system/memtier/memtier2/rank:192

$ grep '' /sys/devices/system/memtier/memtier*/nodelist
/sys/devices/system/memtier/memtier1/nodelist:0-1
/sys/devices/system/memtier/memtier2/nodelist:2-3

$ grep '' /sys/devices/system/node/node*/memtier
/sys/devices/system/node/node0/memtier:1
/sys/devices/system/node/node1/memtier:1
/sys/devices/system/node/node2/memtier:2
/sys/devices/system/node/node3/memtier:2

Demotion fallback order:
node 0: 2, 3
node 1: 3, 2
node 2: empty
node 3: empty

To prevent cross-socket demotion and memory access, the user can set
mempolicy, e.g. cpuset.mems=0,2.


* Example 2:

Node 0 & 1 are DRAM nodes.
Node 2 is a PMEM node and closer to node 0.

                20
  Node 0 (DRAM) ---- Node 1 (DRAM)
       |            /
       | 30        / 40
       |          /
  Node 2 (PMEM)

node distances:
node   0    1    2
   0  10   20   30
   1  20   10   40
   2  30   40   10

$ cat /sys/devices/system/memtier/possible
0(64), 1(128), 2(192)

$ grep '' /sys/devices/system/memtier/memtier*/rank
/sys/devices/system/memtier/memtier1/rank:128
/sys/devices/system/memtier/memtier2/rank:192

$ grep '' /sys/devices/system/memtier/memtier*/nodelist
/sys/devices/system/memtier/memtier1/nodelist:0-1
/sys/devices/system/memtier/memtier2/nodelist:2

$ grep '' /sys/devices/system/node/node*/memtier
/sys/devices/system/node/node0/memtier:1
/sys/devices/system/node/node1/memtier:1
/sys/devices/system/node/node2/memtier:2

Demotion fallback order:
node 0: 2
node 1: 2
node 2: empty


* Example 3:

Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.

All nodes are in the same tier.

                20
  Node 0 (DRAM) ---- Node 1 (DRAM)
         \                /
          \ 30           / 30
           \            /
            Node 2 (DRAM)

node distances:
node   0    1    2
   0  10   20   30
   1  20   10   30
   2  30   30   10

$ cat /sys/devices/system/memtier/possible
0(64), 1(128), 2(192)

$ grep '' /sys/devices/system/memtier/memtier*/rank
/sys/devices/system/memtier/memtier1/rank:128

$ grep '' /sys/devices/system/memtier/memtier*/nodelist
/sys/devices/system/memtier/memtier1/nodelist:0-2

$ grep '' /sys/devices/system/node/node*/memtier
/sys/devices/system/node/node0/memtier:1
/sys/devices/system/node/node1/memtier:1
/sys/devices/system/node/node2/memtier:1

Demotion fallback order:
node 0: empty
node 1: empty
node 2: empty


* Example 4:

Node 0 is a DRAM node with CPU.
Node 1 is a PMEM node.
Node 2 is a GPU node.

                50
  Node 0 (DRAM) ---- Node 2 (GPU)
         \                /
          \ 30           / 60
           \            /
            Node 1 (PMEM)

node distances:
node   0    1    2
   0  10   30   50
   1  30   10   60
   2  50   60   10

$ cat /sys/devices/system/memtier/possible
0(64), 1(128), 2(192)

$ grep '' /sys/devices/system/memtier/memtier*/rank
/sys/devices/system/memtier/memtier0/rank:64
/sys/devices/system/memtier/memtier1/rank:128
/sys/devices/system/memtier/memtier2/rank:192

$ grep '' /sys/devices/system/memtier/memtier*/nodelist
/sys/devices/system/memtier/memtier0/nodelist:2
/sys/devices/system/memtier/memtier1/nodelist:0
/sys/devices/system/memtier/memtier2/nodelist:1

$ grep '' /sys/devices/system/node/node*/memtier
/sys/devices/system/node/node0/memtier:1
/sys/devices/system/node/node1/memtier:2
/sys/devices/system/node/node2/memtier:0

Demotion fallback order:
node 0: 1
node 1: empty
node 2: 0, 1


* Example 5:

Node 0 is a DRAM node with CPU.
Node 1 is a GPU node.
Node 2 is a PMEM node.
Node 3 is a large, slow DRAM node without CPU.

                  100
  Node 0 (DRAM)  ----  Node 1 (GPU)
    /     |               /    |
   /40    |30        120 /     | 110
  |       |             /      |
  |  Node 2 (PMEM) ----       /
  |        \                 /
   \     80 \               /
    ------- Node 3 (Slow DRAM)

node distances:
node    0    1    2    3
   0   10  100   30   40
   1  100   10  120  110
   2   30  120   10   80
   3   40  110   80   10

MAX_MEMORY_TIERS=4 (memtier3 is a memory tier added later).

$ cat /sys/devices/system/memtier/possible
0(64), 1(128), 3(160), 2(192)

$ grep '' /sys/devices/system/memtier/memtier*/rank
/sys/devices/system/memtier/memtier0/rank:64
/sys/devices/system/memtier/memtier1/rank:128
/sys/devices/system/memtier/memtier2/rank:192
/sys/devices/system/memtier/memtier3/rank:160

$ grep '' /sys/devices/system/memtier/memtier*/nodelist
/sys/devices/system/memtier/memtier0/nodelist:1
/sys/devices/system/memtier/memtier1/nodelist:0
/sys/devices/system/memtier/memtier2/nodelist:2
/sys/devices/system/memtier/memtier3/nodelist:3

$ grep '' /sys/devices/system/node/node*/memtier
/sys/devices/system/node/node0/memtier:1
/sys/devices/system/node/node1/memtier:0
/sys/devices/system/node/node2/memtier:2
/sys/devices/system/node/node3/memtier:3

Demotion fallback order:
node 0: 2, 3
node 1: 0, 3, 2
node 2: empty
node 3: 2


2022-05-28 16:59:08

by Aneesh Kumar K.V

Subject: [RFC PATCH v4 0/7] mm/demotion: Memory tiers and demotion

The current kernel has the basic memory tiering support: Inactive
pages on a higher tier NUMA node can be migrated (demoted) to a lower
tier NUMA node to make room for new allocations on the higher tier
NUMA node. Frequently accessed pages on a lower tier NUMA node can be
migrated (promoted) to a higher tier NUMA node to improve the
performance.

In the current kernel, memory tiers are defined implicitly via a
demotion path relationship between NUMA nodes, which is created during
the kernel initialization and updated when a NUMA node is hot-added or
hot-removed. The current implementation puts all nodes with CPU into
the top tier, and builds the tier hierarchy tier-by-tier by establishing
the per-node demotion targets based on the distances between nodes.

This current memory tier kernel interface needs to be improved for
several important use cases:

* The current tier initialization code always initializes
each memory-only NUMA node into a lower tier. But a memory-only
NUMA node may have a high performance memory device (e.g. a DRAM
device attached via CXL.mem or a DRAM-backed memory-only node on
a virtual machine) and should be put into a higher tier.

* The current tier hierarchy always puts CPU nodes into the top
tier. But on a system with HBM (e.g. GPU memory) devices, these
memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
with CPUs are better to be placed into the next lower tier.

* Also because the current tier hierarchy always puts CPU nodes
into the top tier, when a CPU is hot-added (or hot-removed) and
triggers a memory node from CPU-less into a CPU node (or vice
versa), the memory tier hierarchy gets changed, even though no
memory node is added or removed. This can make the tier
hierarchy unstable and make it difficult to support tier-based
memory accounting.

* A higher tier node can only be demoted to selected nodes on the
next lower tier as defined by the demotion path, not any other
node from any lower tier. This strict, hard-coded demotion order
does not work in all use cases (e.g. some use cases may want to
allow cross-socket demotion to another node in the same demotion
tier as a fallback when the preferred demotion node is out of
space), and has resulted in the feature request for an interface to
override the system-wide, per-node demotion order from the
userspace. This demotion order is also inconsistent with the page
allocation fallback order when all the nodes in a higher tier are
out of space: The page allocation can fall back to any node from
any lower tier, whereas the demotion order doesn't allow that.

* There are no interfaces for the userspace to learn about the memory
tier hierarchy in order to optimize its memory allocations.

This patch series makes the creation of memory tiers explicit, under
the control of userspace or a device driver.

Memory Tier Initialization
==========================

By default, all memory nodes are assigned to the default tier (1).
The default tier device has a rank value (200).

A device driver can move up or down its memory nodes from the default
tier. For example, PMEM can move down its memory nodes below the
default tier, whereas GPU can move up its memory nodes above the
default tier.

The kernel initialization code makes the decision on which exact tier
a memory node should be assigned to based on the requests from the
device drivers as well as the memory device hardware information
provided by the firmware.

Hot-adding/removing CPUs doesn't affect memory tier hierarchy.

Memory Allocation for Demotion
==============================
This patch series keeps the demotion target page allocation logic the
same. The demotion page allocation picks the closest NUMA node in the
next lower tier to the NUMA node that pages are being demoted from.

This will be later improved to use the same page allocation strategy
using fallback list.
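
For contrast with the fallback-list approach, the behaviour kept by
this series (closest node in the next lower tier only) can be modelled
with the following userspace C sketch. The function name, the
tier_pos[] array (0 is the top tier, larger values are lower tiers)
and the fixed NR_NODES are assumptions for illustration and do not
correspond to identifiers in the patches.

```c
#define NR_NODES 3
#define NO_NODE (-1)

/*
 * Return the node closest to src among the nodes of the next lower
 * non-empty tier, or NO_NODE if src is already in the lowest tier.
 */
int closest_in_next_lower_tier(int src, int tier_pos[NR_NODES],
			       int dist[NR_NODES][NR_NODES])
{
	int next = -1;
	int best = NO_NODE;

	/* Find the nearest tier position below the source tier. */
	for (int nid = 0; nid < NR_NODES; nid++)
		if (tier_pos[nid] > tier_pos[src] &&
		    (next < 0 || tier_pos[nid] < next))
			next = tier_pos[nid];
	if (next < 0)
		return NO_NODE;

	/* Among the nodes of that tier, pick the one nearest to src. */
	for (int nid = 0; nid < NR_NODES; nid++) {
		if (tier_pos[nid] != next)
			continue;
		if (best == NO_NODE || dist[src][nid] < dist[src][best])
			best = nid;
	}
	return best;
}
```

Unlike an allocation with a nodemask covering all lower tiers, this
never skips a tier: a top tier node only demotes to the tier directly
below it, even when that tier is out of space.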

Sysfs Interface:
-------------
Listing current list of memory tiers and rank details:

:/sys/devices/system/memtier$ ls
default_rank max_tier memtier1 power uevent
:/sys/devices/system/memtier$ cat default_rank
200
:/sys/devices/system/memtier$ cat max_tier
3
:/sys/devices/system/memtier$

Per node memory tier details:

For a cpu only NUMA node:

:/sys/devices/system/node# cat node0/memtier
:/sys/devices/system/node# echo 1 > node0/memtier
:/sys/devices/system/node# cat node0/memtier
:/sys/devices/system/node#

For a NUMA node with memory:
:/sys/devices/system/node# cat node1/memtier
1
:/sys/devices/system/node# ls ../memtier/
default_rank max_tier memtier1 power uevent
:/sys/devices/system/node# echo 2 > node1/memtier
:/sys/devices/system/node#
:/sys/devices/system/node# ls ../memtier/
default_rank max_tier memtier1 memtier2 power uevent
:/sys/devices/system/node# cat node1/memtier
2
:/sys/devices/system/node#
:/sys/devices/system/node# cat ../memtier/memtier2/rank
300
:/sys/devices/system/node#
:/sys/devices/system/node# cat ../memtier/memtier1/rank
200
:/sys/devices/system/node#

Removing a NUMA node from demotion:
:/sys/devices/system/node# cat node1/memtier
2
:/sys/devices/system/node# echo none > node1/memtier
:/sys/devices/system/node#
:/sys/devices/system/node# cat node1/memtier
:/sys/devices/system/node#
:/sys/devices/system/node# ls ../memtier/
default_rank max_tier memtier1 power uevent
:/sys/devices/system/node#

The above also resulted in removal of memtier2 which was created in the earlier step.
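
The values accepted when writing to nodeN/memtier in the session above
(a tier ID or the string "none") can be modelled with a small
userspace parser. This is a sketch only: the enum and function names
are hypothetical, and the strict matching used here (exact "none",
optional trailing newline, decimal digits only) is an assumption, not
a description of what the kernel patches do.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

enum memtier_req { REQ_INVALID, REQ_NONE, REQ_TIER };

/*
 * Classify a string written to nodeN/memtier. On REQ_TIER, *tier holds
 * the parsed tier ID. A single trailing newline (as added by echo) is
 * ignored.
 */
enum memtier_req parse_memtier_request(const char *buf, unsigned long *tier)
{
	char *end;
	size_t len = strlen(buf);

	if (len > 0 && buf[len - 1] == '\n')
		len--;
	if (len == 4 && strncmp(buf, "none", 4) == 0)
		return REQ_NONE;
	if (len == 0 || buf[0] < '0' || buf[0] > '9')
		return REQ_INVALID;

	errno = 0;
	*tier = strtoul(buf, &end, 10);
	if (errno || end != buf + len)
		return REQ_INVALID;	/* overflow or trailing garbage */
	return REQ_TIER;
}
```

Matching the whole token rather than a prefix means that an input such
as "nonexistent" is rejected instead of being treated as "none".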


Changelog
----------

v4:
Add support for explicit memory tiers and ranks.

v3:
- Modify patch 1 subject to make it more specific
- Remove /sys/kernel/mm/numa/demotion_targets interface, use
/sys/devices/system/node/demotion_targets instead and make
it writable to override node_states[N_DEMOTION_TARGETS].
- Add support to view per node demotion targets via sysfs

v2:
In v1, only the first patch of this series was sent. It was
implemented to avoid some of the limitations on demotion target
sharing; however, for certain NUMA topologies, the demotion
targets found by that patch were not optimal, so the first patch
in this series is modified according to suggestions from Huang
and Baolin. Examples comparing the demotion lists of the existing
implementation and the changed implementation can be found in the
commit message of the first patch.


Aneesh Kumar K.V (2):
mm/demotion: Add support to associate rank with memory tier
mm/demotion: Add support for removing node from demotion memory tiers

Jagdish Gediya (5):
mm/demotion: Add support for explicit memory tiers
mm/demotion: Expose per node memory tier to sysfs
mm/demotion: Build demotion targets based on explicit memory tiers
mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM
mm/demotion: Demote pages according to allocation fallback order

drivers/base/node.c | 43 +++
drivers/dax/kmem.c | 4 +
include/linux/migrate.h | 39 ++-
mm/Kconfig | 11 +
mm/migrate.c | 756 ++++++++++++++++++++++++++--------------
mm/vmscan.c | 38 +-
mm/vmstat.c | 5 -
7 files changed, 590 insertions(+), 306 deletions(-)

--
2.36.1


2022-05-28 18:23:27

by Aneesh Kumar K.V

Subject: [RFC PATCH v4 6/7] mm/demotion: Add support for removing node from demotion memory tiers

This patch adds the special string "none" as a supported memtier value
that we can use to stop a specific node from being used as a demotion target.

For example:
:/sys/devices/system/node/node1# cat memtier
1
:/sys/devices/system/node/node1# cat ../../memtier/memtier1/nodelist
1-3
:/sys/devices/system/node/node1# echo none > memtier
:/sys/devices/system/node/node1#
:/sys/devices/system/node/node1# cat memtier
:/sys/devices/system/node/node1# cat ../../memtier/memtier1/nodelist
2-3
:/sys/devices/system/node/node1#

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
drivers/base/node.c | 7 ++++++-
include/linux/migrate.h | 1 +
mm/migrate.c | 15 +++++++++++++--
3 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 892f7c23c94e..5311cf1db500 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -578,10 +578,15 @@ static ssize_t memtier_store(struct device *dev,
 			     struct device_attribute *attr,
 			     const char *buf, size_t count)
 {
+	int ret;
 	unsigned long tier;
 	int node = dev->id;

-	int ret = kstrtoul(buf, 10, &tier);
+	if (!strncmp(buf, "none", strlen("none"))) {
+		node_remove_from_memory_tier(node);
+		return count;
+	}
+	ret = kstrtoul(buf, 10, &tier);
 	if (ret)
 		return ret;

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index fd09fd009a69..77c581f47953 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -178,6 +178,7 @@ enum memory_tier_type {
#define MAX_MEMORY_TIERS 3

int next_demotion_node(int node);
+void node_remove_from_memory_tier(int node);
int node_get_memory_tier_id(int node);
int node_set_memory_tier_rank(int node, int tier);
int node_reset_memory_tier(int node, int tier);
diff --git a/mm/migrate.c b/mm/migrate.c
index f013d14f77ed..114c7428b9f3 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2354,7 +2354,7 @@ static struct memory_tier *__get_memory_tier_from_id(int id)
}


-static void node_remove_from_memory_tier(int node)
+void node_remove_from_memory_tier(int node)
 {
 	struct memory_tier *memtier;

@@ -2418,7 +2418,18 @@ int node_reset_memory_tier(int node, int tier)
 	mutex_lock(&memory_tier_lock);

 	current_tier = __node_get_memory_tier(node);
-	if (!current_tier || current_tier->dev.id == tier)
+	if (!current_tier) {
+		/*
+		 * If a N_MEMORY node doesn't have a tier index, then
+		 * we removed it from demotion earlier and we are trying
+		 * add it back. Just add the node to requested tier.
+		 */
+		if (node_state(node, N_MEMORY))
+			ret = __node_set_memory_tier(node, tier);
+		goto out;
+	}
+
+	if (current_tier->dev.id == tier)
 		goto out;

 	node_clear(node, current_tier->nodelist);
--
2.36.1


2022-05-28 18:25:24

by Aneesh Kumar K.V

Subject: [RFC PATCH v4 4/7] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM

From: Jagdish Gediya <[email protected]>

By default, all nodes are assigned to DEFAULT_MEMORY_TIER, which
is memory tier 1 and is designated for nodes with DRAM, so it
is not the right tier for dax devices.

Set the dax kmem device node's tier to MEMORY_TIER_PMEM. In the future,
support should be added to distinguish the dax devices that should
not be in MEMORY_TIER_PMEM and to set the right memory tier for them.

Signed-off-by: Jagdish Gediya <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
drivers/dax/kmem.c | 4 ++++
mm/migrate.c | 2 ++
2 files changed, 6 insertions(+)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index a37622060fff..991782aa2448 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -11,6 +11,7 @@
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/mman.h>
+#include <linux/migrate.h>
#include "dax-private.h"
#include "bus.h"

@@ -147,6 +148,9 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)

 	dev_set_drvdata(dev, data);

+#ifdef CONFIG_TIERED_MEMORY
+	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
+#endif
 	return 0;

err_request_mem:
diff --git a/mm/migrate.c b/mm/migrate.c
index d819a64db5b1..59d8558dd2ee 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2418,6 +2418,8 @@ int node_set_memory_tier(int node, int tier)

 	return ret;
 }
+EXPORT_SYMBOL_GPL(node_set_memory_tier);
+

 /**
  * next_demotion_node() - Get the next node in the demotion path
--
2.36.1


2022-05-28 18:38:11

by Wei Xu

Subject: Re: RFC: Memory Tiering Kernel Interfaces (v3)

On Fri, May 27, 2022 at 7:05 AM Hesham Almatary
<[email protected]> wrote:
>
> Hello Wei and Ying,
>
> Please find my comments below based on a discussion with Jonathan.
>
> On Fri, 27 May 2022 10:58:39 +0800
> Ying Huang <[email protected]> wrote:
>
> > On Thu, 2022-05-26 at 14:22 -0700, Wei Xu wrote:
> > > Changes since v2
> > > ================
> > > * Updated the design and examples to use "rank" instead of device ID
> > > to determine the order between memory tiers for better
> > > flexibility.
> > >
> > > Overview
> > > ========
> > >
> > > The current kernel has the basic memory tiering support: Inactive
> > > pages on a higher tier NUMA node can be migrated (demoted) to a
> > > lower tier NUMA node to make room for new allocations on the higher
> > > tier NUMA node. Frequently accessed pages on a lower tier NUMA
> > > node can be migrated (promoted) to a higher tier NUMA node to
> > > improve the performance.
> > >
> > > In the current kernel, memory tiers are defined implicitly via a
> > > demotion path relationship between NUMA nodes, which is created
> > > during the kernel initialization and updated when a NUMA node is
> > > hot-added or hot-removed. The current implementation puts all
> > > nodes with CPU into the top tier, and builds the tier hierarchy
> > > tier-by-tier by establishing the per-node demotion targets based on
> > > the distances between nodes.
> > >
> > > This current memory tier kernel interface needs to be improved for
> > > several important use cases:
> > >
> > > * The current tier initialization code always initializes
> > > each memory-only NUMA node into a lower tier. But a memory-only
> > > NUMA node may have a high performance memory device (e.g. a DRAM
> > > device attached via CXL.mem or a DRAM-backed memory-only node on
> > > a virtual machine) and should be put into a higher tier.
> > >
> > > * The current tier hierarchy always puts CPU nodes into the top
> > > tier. But on a system with HBM (e.g. GPU memory) devices, these
> > > memory-only HBM NUMA nodes should be in the top tier, and DRAM
> > > nodes with CPUs are better to be placed into the next lower tier.
> > >
> > > * Also because the current tier hierarchy always puts CPU nodes
> > > into the top tier, when a CPU is hot-added (or hot-removed) and
> > > triggers a memory node from CPU-less into a CPU node (or vice
> > > versa), the memory tier hierarchy gets changed, even though no
> > > memory node is added or removed. This can make the tier
> > > hierarchy unstable and make it difficult to support tier-based
> > > memory accounting.
> > >
> > > * A higher tier node can only be demoted to selected nodes on the
> > > next lower tier as defined by the demotion path, not any other
> > > node from any lower tier. This strict, hard-coded demotion order
> > > does not work in all use cases (e.g. some use cases may want to
> > > allow cross-socket demotion to another node in the same demotion
> > > tier as a fallback when the preferred demotion node is out of
> > > space), and has resulted in the feature request for an interface
> > > to override the system-wide, per-node demotion order from the
> > > userspace. This demotion order is also inconsistent with the page
> > > allocation fallback order when all the nodes in a higher tier are
> > > out of space: The page allocation can fall back to any node from
> > > any lower tier, whereas the demotion order doesn't allow that.
> > >
> > > * There are no interfaces for the userspace to learn about the
> > > memory tier hierarchy in order to optimize its memory allocations.
> > >
> > > I'd like to propose revised memory tier kernel interfaces based on
> > > the discussions in the threads:
> > >
> > > -
> > > https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
> > > -
> > > https://lore.kernel.org/linux-mm/[email protected]/t/
> > > -
> > > https://lore.kernel.org/linux-mm/[email protected]/T/
> > > -
> > > https://lore.kernel.org/linux-mm/[email protected]/T/
> > >
> > >
> > > High-level Design Ideas
> > > =======================
> > >
> > > * Define memory tiers explicitly, not implicitly.
> > >
> > > * Memory tiers are defined based on hardware capabilities of memory
> > > nodes, not their relative node distances between each other.
> > >
> > > * The tier assignment of each node is independent from each other.
> > > Moving a node from one tier to another tier doesn't affect the
> > > tier assignment of any other node.
> > >
> > > * The node-tier association is stable. A node can be reassigned to a
> > > different tier only under the specific conditions that don't block
> > > future tier-based memory cgroup accounting.
> > >
> > > * A node can demote its pages to any nodes of any lower tiers. The
> > > demotion target node selection follows the allocation fallback
> > > order of the source node, which is built based on node distances.
> > > The demotion targets are also restricted to only the nodes from the
> > > tiers lower than the source node. We no longer need to maintain a
> > > separate per-node demotion order (node_demotion[]).
> > >
> > >
> > > Sysfs Interfaces
> > > ================
> > >
> > > * /sys/devices/system/memtier/
> > >
> > > This is the directory containing the information about memory
> > > tiers.
> > >
> > > Each memory tier has its own subdirectory.
> > >
> > > The order of memory tiers is determined by their rank values, not
> > > by their memtier device names.
> > >
> > > - /sys/devices/system/memtier/possible
> > >
> > > Format: ordered list of "memtier(rank)"
> > > Example: 0(64), 1(128), 2(192)
> > >
> > > Read-only. When read, list all available memory tiers and their
> > > associated ranks, ordered by the rank values (from the highest
> > > tier to the lowest tier).
> >
> > I like the idea of "possible" file. And I think we can show default
> > tier too. That is, if "1(128)" is the default tier (tier with DRAM),
> > then the list can be,
> >
> > "
> > 0/64 [1/128] 2/192
> > "
> >
> > To make it easier to parse from a shell script, I would prefer
> > something like,
> >
> > "
> > 0 64
> > 1 128 default
> > 2 192
> > "
> >
> > But one line format is OK for me too.
> >
> I wonder if there's a good argument to have this "possible" file at all?
> My thinking is that, 1) all the details can be scripted at
> user-level by reading memtierN/nodeN, offloading some work from the
> kernel side, and 2) the format/numbers are confusing anyway; it could
> get tricky when/if tier device IDs are similar to ranks.

If we don't hide memtiers that have no nodes, we don't need this
"possible" file. I am fine either way. Given that there should not be
too many tiers, it doesn't add much value to hide the empty tiers. We
can go without this "possible" file.
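
To illustrate the point that userspace can rebuild the "possible" view
on its own, here is a minimal sketch. It assumes the (device ID, rank)
pairs have already been read from memtierN/rank; struct tier_info and
format_possible() are hypothetical names, not part of any proposed
interface, and the ascending-rank order matches the examples in this
RFC as currently written (lower rank value = higher tier).

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical userspace view of one memtier device: its ID and rank. */
struct tier_info {
	int id;
	int rank;
};

/* Order by ascending rank, matching the "possible" examples in this RFC. */
static int cmp_rank(const void *a, const void *b)
{
	const struct tier_info *x = a, *y = b;

	return (x->rank > y->rank) - (x->rank < y->rank);
}

/*
 * Sort tiers by rank and format them as "id(rank), id(rank), ...".
 * Returns 0 on success, -1 if buf is too small.
 */
static int format_possible(struct tier_info *tiers, size_t n,
			   char *buf, size_t len)
{
	size_t off = 0;

	if (len == 0)
		return -1;
	buf[0] = '\0';
	qsort(tiers, n, sizeof(*tiers), cmp_rank);
	for (size_t i = 0; i < n; i++) {
		int w = snprintf(buf + off, len - off, "%s%d(%d)",
				 i ? ", " : "", tiers[i].id, tiers[i].rank);

		/* snprintf reports the untruncated length; reject truncation. */
		if (w < 0 || (size_t)w >= len - off)
			return -1;
		off += (size_t)w;
	}
	return 0;
}
```

Fed with the four tiers of Example 5, this yields the same string as
the "possible" file shown there.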

> The other thing is whether we should have a file called "default"
> containing the default tier value for the user to read?

Sure, we can have a default_tier or default_rank file for this.

> > >
> > > * /sys/devices/system/memtier/memtierN/
> > >
> > > This is the directory containing the information about a
> > > particular memory tier, memtierN, where N is the memtier device ID
> > > (e.g. 0, 1).
> > >
> > > The memtier device ID number itself is just an identifier and has
> > > no special meaning, i.e. memtier device ID numbers do not determine
> > > the order of memory tiers.
> > >
> > > - /sys/devices/system/memtier/memtierN/rank
> > >
> > > Format: int
> > > Example: 100
> > >
> > > Read-only. When read, list the "rank" value associated with
> > > memtierN.
> > >
> > > "Rank" is an opaque value. Its absolute value doesn't have any
> > > special meaning. But the rank values of different memtiers can
> > > be compared with each other to determine the memory tier order.
> > > For example, if we have 3 memtiers: memtier0, memtier1,
> > > memtier2, and their rank values are 10, 20, 15, then the memory
> > > tier order is: memtier0 -> memtier2 -> memtier1, where memtier0 is
> > > the highest tier and memtier1 is the lowest tier.
> > >
> > > The rank value of each memtier should be unique.
> > >
> > > - /sys/devices/system/memtier/memtierN/nodelist
> > >
> > > Format: node_list
> > > Example: 1-2
> > >
> > > Read-only. When read, list the memory nodes in the specified
> > > tier.
> > >
> > > If a memory tier has no memory nodes, the kernel can hide the
> > > sysfs directory of this memory tier, though the tier itself can
> > > still be visible from /sys/devices/system/memtier/possible.
> > >
> Is there a good reason why the kernel needs to hide this directory?

It is just to reduce the clutter of empty tiers. Given that there
should not be too many tiers, we can revert this and always show all
tiers.

> > > * /sys/devices/system/node/nodeN/memtier
> > >
> > > where N = 0, 1, ...
> > >
> > > Format: int or empty
> > > Example: 1
> > >
> > > When read, list the device ID of the memory tier that the node
> > > belongs to. Its value is empty for a CPU-only NUMA node.
> > >
> > > When written, the kernel moves the node into the specified memory
> > > tier if the move is allowed. The tier assignment of all other
> > > nodes is not affected.
> > >
> Who decides if the move is allowed or not? Might need to explicitly
> mention that?

The "Memory Tier Reassignment" section discusses the conditions under
which the move is allowed.

> > > Initially, we can make this interface read-only.
> > >
> > >
> > > Kernel Representation
> > > =====================
> > >
> > > * All memory tiering code is guarded by CONFIG_TIERED_MEMORY.
> > >
> > > * #define MAX_MEMORY_TIERS 3
> > >
> > > Support 3 memory tiers for now. This can be a kconfig option.
> > >
> > > * #define MEMORY_DEFAULT_TIER_DEVICE 1
> > >
> > > The default tier device that a memory node is assigned to.
> > >
> > > * struct memtier_dev {
> > > nodemask_t nodelist;
> > > int rank;
> > > int tier;
> > > } memtier_devices[MAX_MEMORY_TIERS]
> > >
> > > Store memory tiers by device IDs.
> > >
> > > * struct memtier_dev *memory_tier(int tier)
> > >
> > > Returns the memtier device for a given memory tier.
> > >
> Might need to define the case where there's no memory tier device for a
> specific tier number. For example, we can return NULL or an error code
> when an invalid tier number is passed (e.g., -1 for CPU-only nodes).

Sure.
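
A rough userspace sketch of what that could look like, assuming NULL is
the chosen error return. The nodemask is modeled as a plain bitmask
here, and the table contents are taken from Example 4; none of this is
meant as the final kernel code.

```c
#include <stddef.h>

#define MAX_MEMORY_TIERS 3

/* Simplified stand-in for struct memtier_dev; nodemask_t is a bitmask here. */
struct memtier_dev {
	unsigned long nodelist;
	int rank;
	int tier;	/* position in the rank order */
};

/* Indexed by device ID. Values follow Example 4 (GPU, DRAM, PMEM). */
static struct memtier_dev memtier_devices[MAX_MEMORY_TIERS] = {
	{ .nodelist = 1UL << 2, .rank = 64,  .tier = 0 },
	{ .nodelist = 1UL << 0, .rank = 128, .tier = 1 },
	{ .nodelist = 1UL << 1, .rank = 192, .tier = 2 },
};

/*
 * Return the device for a tier number, or NULL if there is none,
 * e.g. when -1 (the value used for CPU-only nodes) is passed in.
 */
static struct memtier_dev *memory_tier(int tier)
{
	if (tier < 0 || tier >= MAX_MEMORY_TIERS)
		return NULL;
	for (int i = 0; i < MAX_MEMORY_TIERS; i++)
		if (memtier_devices[i].tier == tier)
			return &memtier_devices[i];
	return NULL;
}
```

Callers such as the demotion pseudo code would then need a NULL check
before dereferencing the result.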

> > > * int node_tier_dev_map[MAX_NUMNODES]
> > >
> > > Map a node to its tier device ID.
> > >
> > > For each CPU-only node c, node_tier_dev_map[c] = -1.
> > >
> > >
> > > Memory Tier Initialization
> > > ==========================
> > >
> > > By default, all memory nodes are assigned to the default tier
> > > (MEMORY_DEFAULT_TIER_DEVICE). The default tier device has a rank
> > > value in the middle of the possible rank value range (e.g. 127 if
> > > the range is [0..255]).
> > >
> > > A device driver can move its memory nodes up or down from the
> > > default tier. For example, PMEM can move its memory nodes below
> > > the default tier, whereas GPU can move its memory nodes above
> > > the default tier.
> > >
> Is "up/down" here still relative after the rank addition?

Good point. I think we should reverse the definition of rank: a higher
rank value means a higher tier, to avoid this kind of confusion.

> > > The kernel initialization code makes the decision on which exact
> > > tier a memory node should be assigned to based on the requests from
> > > the device drivers as well as the memory device hardware information
> > > provided by the firmware.
> > >
> > >
> > > Memory Tier Reassignment
> > > ========================
> > >
> > > After a memory node is hot-removed, it can be hot-added back to a
> > > different memory tier. This is useful for supporting dynamically
> > > provisioned CXL.mem NUMA nodes, which may connect to different
> > > memory devices across hot-plug events. Such tier changes should
> > > be compatible with tier-based memory accounting.
> > >
> > > The userspace may also reassign an existing online memory node to a
> > > different tier. However, this should only be allowed when no pages
> > > are allocated from the memory node or when there are no non-root
> > > memory cgroups (e.g. during the system boot). This restriction is
> > > important for keeping memory tier hierarchy stable enough for
> > > tier-based memory cgroup accounting.
> >
> > One way to do this is hot-remove all memory of a node, change its
> > memtier, then hot-add its memory.
> >
> > Best Regards,
> > Huang, Ying
> >
> > > Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
> > >
> > >
> > > Memory Allocation for Demotion
> > > ==============================
> > >
> > > To allocate a new page as the demotion target for a page, the kernel
> > > calls the allocation function (__alloc_pages_nodemask) with the
> > > source page node as the preferred node and the union of all lower
> > > tier nodes as the allowed nodemask. The actual target node
> > > selection then follows the allocation fallback order that the
> > > kernel has already defined.
> > >
> > > The pseudo code looks like:
> > >
> > > targets = NODE_MASK_NONE;
> > > src_nid = page_to_nid(page);
> > > src_tier = memtier_devices[node_tier_dev_map[src_nid]].tier;
> > > for (i = src_tier + 1; i < MAX_MEMORY_TIERS; i++)
> > > nodes_or(targets, targets, memory_tier(i)->nodelist);
> > > new_page = __alloc_pages_nodemask(gfp, order, src_nid, targets);
> > >
> > > The mempolicy of cpuset, vma and owner task of the source page can
> > > be set to refine the demotion target nodemask, e.g. to prevent
> > > demotion or select a particular allowed node as the demotion target.
> > >
> > >
> > > Memory Allocation for Promotion
> > > ===============================
> > >
> > > The page allocation for promotion is similar to demotion, except
> > > that (1) the target nodemask uses the promotion tiers, (2) the
> > > preferred node can be the accessing CPU node, not the source page
> > > node.
> > >
> > >
> > > Examples
> > > ========
> > >
> > > * Example 1:
> > >
> > > Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
> > >
> > > 20
> > > Node 0 (DRAM) ---- Node 1 (DRAM)
> > > | \ / |
> > > | 30 40 X 40 | 30
> > > | / \ |
> > > Node 2 (PMEM) ---- Node 3 (PMEM)
> > > 40
> > >
> > > node distances:
> > > node 0 1 2 3
> > > 0 10 20 30 40
> > > 1 20 10 40 30
> > > 2 30 40 10 40
> > > 3 40 30 40 10
> > >
> > > $ cat /sys/devices/system/memtier/possible
> > > 0(64), 1(128), 2(192)
> > >
> > > $ grep '' /sys/devices/system/memtier/memtier*/rank
> > > /sys/devices/system/memtier/memtier1/rank:128
> > > /sys/devices/system/memtier/memtier2/rank:192
> > >
> > > $ grep '' /sys/devices/system/memtier/memtier*/nodelist
> > > /sys/devices/system/memtier/memtier1/nodelist:0-1
> > > /sys/devices/system/memtier/memtier2/nodelist:2-3
> > >
> > > $ grep '' /sys/devices/system/node/node*/memtier
> > > /sys/devices/system/node/node0/memtier:1
> > > /sys/devices/system/node/node1/memtier:1
> > > /sys/devices/system/node/node2/memtier:2
> > > /sys/devices/system/node/node3/memtier:2
> > >
> > > Demotion fallback order:
> > > node 0: 2, 3
> > > node 1: 3, 2
> > > node 2: empty
> > > node 3: empty
> > >
> > > To prevent cross-socket demotion and memory access, the user can set
> > > mempolicy, e.g. cpuset.mems=0,2.
> > >
> > >
> > > * Example 2:
> > >
> > > Node 0 & 1 are DRAM nodes.
> > > Node 2 is a PMEM node and closer to node 0.
> > >
> > > 20
> > > Node 0 (DRAM) ---- Node 1 (DRAM)
> > > | /
> > > | 30 / 40
> > > | /
> > > Node 2 (PMEM)
> > >
> > > node distances:
> > > node 0 1 2
> > > 0 10 20 30
> > > 1 20 10 40
> > > 2 30 40 10
> > >
> > > $ cat /sys/devices/system/memtier/possible
> > > 0(64), 1(128), 2(192)
> > >
> > > $ grep '' /sys/devices/system/memtier/memtier*/rank
> > > /sys/devices/system/memtier/memtier1/rank:128
> > > /sys/devices/system/memtier/memtier2/rank:192
> > >
> > > $ grep '' /sys/devices/system/memtier/memtier*/nodelist
> > > /sys/devices/system/memtier/memtier1/nodelist:0-1
> > > /sys/devices/system/memtier/memtier2/nodelist:2
> > >
> > > $ grep '' /sys/devices/system/node/node*/memtier
> > > /sys/devices/system/node/node0/memtier:1
> > > /sys/devices/system/node/node1/memtier:1
> > > /sys/devices/system/node/node2/memtier:2
> > >
> > > Demotion fallback order:
> > > node 0: 2
> > > node 1: 2
> > > node 2: empty
> > >
> > >
> > > * Example 3:
> > >
> > > Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
> > >
> np: PMEM instead of memory-only DRAM?
>
> > > All nodes are in the same tier.
> > >
> > > 20
> > > Node 0 (DRAM) ---- Node 1 (DRAM)
> > > \ /
> > > \ 30 / 30
> > > \ /
> > > Node 2 (PMEM)
> > >
> > > node distances:
> > > node 0 1 2
> > > 0 10 20 30
> > > 1 20 10 30
> > > 2 30 30 10
> > >
> > > $ cat /sys/devices/system/memtier/possible
> > > 0(64), 1(128), 2(192)
> > >
> > > $ grep '' /sys/devices/system/memtier/memtier*/rank
> > > /sys/devices/system/memtier/memtier1/rank:128
> > >
> > > $ grep '' /sys/devices/system/memtier/memtier*/nodelist
> > > /sys/devices/system/memtier/memtier1/nodelist:0-2
> > >
> > > $ grep '' /sys/devices/system/node/node*/memtier
> > > /sys/devices/system/node/node0/memtier:1
> > > /sys/devices/system/node/node1/memtier:1
> > > /sys/devices/system/node/node2/memtier:1
> > >
> > > Demotion fallback order:
> > > node 0: empty
> > > node 1: empty
> > > node 2: empty
> > >
> > >
> > > * Example 4:
> > >
> > > Node 0 is a DRAM node with CPU.
> > > Node 1 is a PMEM node.
> > > Node 2 is a GPU node.
> > >
> > > 50
> > > Node 0 (DRAM) ---- Node 2 (GPU)
> > > \ /
> > > \ 30 / 60
> > > \ /
> > > Node 1 (PMEM)
> > >
> > > node distances:
> > > node 0 1 2
> > > 0 10 30 50
> > > 1 30 10 60
> > > 2 50 60 10
> > >
> > > $ cat /sys/devices/system/memtier/possible
> > > 0(64), 1(128), 2(192)
> > >
> > > $ grep '' /sys/devices/system/memtier/memtier*/rank
> > > /sys/devices/system/memtier/memtier0/rank:64
> > > /sys/devices/system/memtier/memtier1/rank:128
> > > /sys/devices/system/memtier/memtier2/rank:192
> > >
> > > $ grep '' /sys/devices/system/memtier/memtier*/nodelist
> > > /sys/devices/system/memtier/memtier0/nodelist:2
> > > /sys/devices/system/memtier/memtier1/nodelist:0
> > > /sys/devices/system/memtier/memtier2/nodelist:1
> > >
> > > $ grep '' /sys/devices/system/node/node*/memtier
> > > /sys/devices/system/node/node0/memtier:1
> > > /sys/devices/system/node/node1/memtier:2
> > > /sys/devices/system/node/node2/memtier:0
> > >
> > > Demotion fallback order:
> > > node 0: 1
> > > node 1: empty
> > > node 2: 0, 1
> > >
> > >
> > > * Example 5:
> > >
> > > Node 0 is a DRAM node with CPU.
> > > Node 1 is a GPU node.
> > > Node 2 is a PMEM node.
> > > Node 3 is a large, slow DRAM node without CPU.
> > >
> > > 100
> > > Node 0 (DRAM) ---- Node 1 (GPU)
> > > / | / |
> > > /40 |30 120 / | 110
> > > | | / |
> > > | Node 2 (PMEM) ---- /
> > > | \ /
> > > \ 80 \ /
> > > ------- Node 3 (Slow DRAM)
> > >
> > > node distances:
> > > node 0 1 2 3
> > > 0 10 100 30 40
> > > 1 100 10 120 110
> > > 2 30 120 10 80
> > > 3 40 110 80 10
> > >
> > > MAX_MEMORY_TIERS=4 (memtier3 is a memory tier added later).
> > >
> > > $ cat /sys/devices/system/memtier/possible
> > > 0(64), 1(128), 3(160), 2(192)
> > >
> > > $ grep '' /sys/devices/system/memtier/memtier*/rank
> > > /sys/devices/system/memtier/memtier0/rank:64
> > > /sys/devices/system/memtier/memtier1/rank:128
> > > /sys/devices/system/memtier/memtier2/rank:192
> > > /sys/devices/system/memtier/memtier3/rank:160
> > >
> > > $ grep '' /sys/devices/system/memtier/memtier*/nodelist
> > > /sys/devices/system/memtier/memtier0/nodelist:1
> > > /sys/devices/system/memtier/memtier1/nodelist:0
> > > /sys/devices/system/memtier/memtier2/nodelist:2
> > > /sys/devices/system/memtier/memtier3/nodelist:3
> > >
> > > $ grep '' /sys/devices/system/node/node*/memtier
> > > /sys/devices/system/node/node0/memtier:1
> > > /sys/devices/system/node/node1/memtier:0
> > > /sys/devices/system/node/node2/memtier:2
> > > /sys/devices/system/node/node3/memtier:3
> > >
> > > Demotion fallback order:
> > > node 0: 2, 3
> > > node 1: 0, 3, 2
> > > node 2: empty
> > > node 3: 2
> >
> >
>
>
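
For reference, the demotion fallback orders listed in Example 5 can be
reproduced with a small userspace model. This is only a sketch: it
orders candidates purely by node distance, whereas the real allocation
fallback order may take other factors into account, and it uses the
rank convention of the examples as posted (lower rank value = higher
tier). demotion_order() is a hypothetical helper name.

```c
#include <stddef.h>

#define NR_NODES 4

/* Example 5 node distances. */
static const int distance[NR_NODES][NR_NODES] = {
	{  10, 100,  30,  40 },
	{ 100,  10, 120, 110 },
	{  30, 120,  10,  80 },
	{  40, 110,  80,  10 },
};

/* Rank of the tier each node belongs to in Example 5. */
static const int node_rank[NR_NODES] = { 128, 64, 192, 160 };

/*
 * Fill out[] with the demotion fallback order of src: all nodes in
 * tiers below src's tier, nearest first. Returns the number of targets.
 */
static size_t demotion_order(int src, int out[NR_NODES])
{
	size_t n = 0;

	/* Collect every node whose tier is lower than the source tier. */
	for (int nid = 0; nid < NR_NODES; nid++)
		if (node_rank[nid] > node_rank[src])
			out[n++] = nid;

	/* Insertion sort by distance from src; ties keep node ID order. */
	for (size_t i = 1; i < n; i++) {
		int nid = out[i];
		size_t j = i;

		while (j > 0 && distance[src][out[j - 1]] > distance[src][nid]) {
			out[j] = out[j - 1];
			j--;
		}
		out[j] = nid;
	}
	return n;
}
```

With this model, node 1 (GPU) demotes to nodes 0, 3, 2 in that order,
and node 2 (PMEM) has no demotion target, as in the example.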

2022-05-28 18:49:05

by Aneesh Kumar K.V

Subject: Re: RFC: Memory Tiering Kernel Interfaces (v3)

On 5/27/22 2:52 AM, Wei Xu wrote:

> The order of memory tiers is determined by their rank values, not by
> their memtier device names.
>
> - /sys/devices/system/memtier/possible
>
> Format: ordered list of "memtier(rank)"
> Example: 0(64), 1(128), 2(192)
>
> Read-only. When read, list all available memory tiers and their
> associated ranks, ordered by the rank values (from the highest
> tier to the lowest tier).
>

Did we discuss the need for this? I haven't done this in the patch
series I sent across. We do have
/sys/devices/system/memtier/default_rank, which should allow the user to
identify the default rank to which memory would get added via hotplug if
the NUMA node is not part of any memory tier.


-aneesh

2022-05-28 19:29:39

by Aneesh Kumar K.V

Subject: [RFC PATCH v4 3/7] mm/demotion: Build demotion targets based on explicit memory tiers

From: Jagdish Gediya <[email protected]>

This patch switches the demotion target building logic to use memory tiers
instead of NUMA distance. All N_MEMORY NUMA nodes will be placed in the
default tier 1 and additional memory tiers will be added by drivers like
dax kmem.

This patch builds the demotion target for a NUMA node by looking at all
memory tiers below the tier to which the NUMA node belongs. The closest node
in the immediately following memory tier is used as a demotion target.

Since we are now only building demotion targets for N_MEMORY NUMA nodes,
the CPU hotplug calls are removed in this patch.
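
As a rough illustration of the selection rule described above, the
following userspace sketch models it with plain bitmasks instead of
nodemask_t. The helper names node_tier() and preferred_targets() are
hypothetical and the data is Example 1 from the comment added in
mm/migrate.c; it is a simplified model, not the patch code itself.

```c
#define NR_NODES 4
#define MAX_MEMORY_TIERS 3

/* Example 1: nodes 0-1 are CPU + DRAM nodes, nodes 2-3 are PMEM nodes. */
static const int distance[NR_NODES][NR_NODES] = {
	{ 10, 20, 30, 40 },
	{ 20, 10, 40, 30 },
	{ 30, 40, 10, 40 },
	{ 40, 30, 40, 10 },
};

/* Node bitmask per tier; 0 means the tier is empty. */
static const unsigned tier_nodes[MAX_MEMORY_TIERS] = { 0x0, 0x3, 0xc };

/* Return the tier index holding node, or -1 if it is in no tier. */
static int node_tier(int node)
{
	for (int t = 0; t < MAX_MEMORY_TIERS; t++)
		if (tier_nodes[t] & (1u << node))
			return t;
	return -1;
}

/*
 * Return a bitmask of preferred demotion targets for node: the nodes
 * of the next non-empty lower tier that share the smallest distance.
 */
static unsigned preferred_targets(int node)
{
	int tier = node_tier(node);
	int best = -1;
	unsigned preferred = 0;

	if (tier < 0)
		return 0;
	/* Skip empty tiers below the node's own tier. */
	while (++tier < MAX_MEMORY_TIERS && !tier_nodes[tier])
		;
	if (tier >= MAX_MEMORY_TIERS)
		return 0;

	for (int nid = 0; nid < NR_NODES; nid++) {
		if (!(tier_nodes[tier] & (1u << nid)))
			continue;
		if (best < 0 || distance[node][nid] < best) {
			/* Strictly closer node: restart the preferred set. */
			best = distance[node][nid];
			preferred = 1u << nid;
		} else if (distance[node][nid] == best) {
			/* Same best distance: add to the preferred set. */
			preferred |= 1u << nid;
		}
	}
	return preferred;
}
```

In this model node 0 prefers node 2 and node 1 prefers node 3, while
the PMEM nodes in the lowest tier get an empty mask.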

Signed-off-by: Jagdish Gediya <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
include/linux/migrate.h | 8 -
mm/migrate.c | 460 +++++++++++++++-------------------------
mm/vmstat.c | 5 -
3 files changed, 172 insertions(+), 301 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index d37d1d5dee82..cbef71a499c1 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -177,12 +177,6 @@ enum memory_tier_type {
};

int next_demotion_node(int node);
-extern void migrate_on_reclaim_init(void);
-#ifdef CONFIG_HOTPLUG_CPU
-extern void set_migration_target_nodes(void);
-#else
-static inline void set_migration_target_nodes(void) {}
-#endif
int node_get_memory_tier(int node);
int node_set_memory_tier(int node, int tier);
int node_reset_memory_tier(int node, int tier);
@@ -193,8 +187,6 @@ static inline int next_demotion_node(int node)
return NUMA_NO_NODE;
}

-static inline void set_migration_target_nodes(void) {}
-static inline void migrate_on_reclaim_init(void) {}
#endif /* CONFIG_TIERED_MEMORY */

#endif /* _LINUX_MIGRATE_H */
diff --git a/mm/migrate.c b/mm/migrate.c
index 304559ba3372..d819a64db5b1 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2125,6 +2125,10 @@ struct memory_tier {
nodemask_t nodelist;
};

+struct demotion_nodes {
+ nodemask_t preferred;
+};
+
#define to_memory_tier(device) container_of(device, struct memory_tier, dev)

static struct bus_type memory_tier_subsys = {
@@ -2132,9 +2136,73 @@ static struct bus_type memory_tier_subsys = {
.dev_name = "memtier",
};

+static void establish_migration_targets(void);
+
DEFINE_MUTEX(memory_tier_lock);
static struct memory_tier *memory_tiers[MAX_MEMORY_TIERS];

+/*
+ * node_demotion[] examples:
+ *
+ * Example 1:
+ *
+ * Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM nodes.
+ *
+ * node distances:
+ * node 0 1 2 3
+ * 0 10 20 30 40
+ * 1 20 10 40 30
+ * 2 30 40 10 40
+ * 3 40 30 40 10
+ *
+ * memory_tiers[0] = <empty>
+ * memory_tiers[1] = 0-1
+ * memory_tiers[2] = 2-3
+ *
+ * node_demotion[0].preferred = 2
+ * node_demotion[1].preferred = 3
+ * node_demotion[2].preferred = <empty>
+ * node_demotion[3].preferred = <empty>
+ *
+ * Example 2:
+ *
+ * Node 0 & 1 are CPU + DRAM nodes, node 2 is a memory-only DRAM node.
+ *
+ * node distances:
+ * node 0 1 2
+ * 0 10 20 30
+ * 1 20 10 30
+ * 2 30 30 10
+ *
+ * memory_tiers[0] = <empty>
+ * memory_tiers[1] = 0-2
+ * memory_tiers[2] = <empty>
+ *
+ * node_demotion[0].preferred = <empty>
+ * node_demotion[1].preferred = <empty>
+ * node_demotion[2].preferred = <empty>
+ *
+ * Example 3:
+ *
+ * Node 0 is a CPU + DRAM node, node 1 is an HBM node, node 2 is a PMEM node.
+ *
+ * node distances:
+ * node 0 1 2
+ * 0 10 20 30
+ * 1 20 10 40
+ * 2 30 40 10
+ *
+ * memory_tiers[0] = 1
+ * memory_tiers[1] = 0
+ * memory_tiers[2] = 2
+ *
+ * node_demotion[0].preferred = 2
+ * node_demotion[1].preferred = 0
+ * node_demotion[2].preferred = <empty>
+ *
+ */
+static struct demotion_nodes *node_demotion __read_mostly;
+
static ssize_t nodelist_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
@@ -2238,6 +2306,28 @@ static int __node_get_memory_tier(int node)
return -1;
}

+static void node_remove_from_memory_tier(int node)
+{
+ int tier;
+
+ mutex_lock(&memory_tier_lock);
+
+ tier = __node_get_memory_tier(node);
+
+ /*
+ * Remove node from tier, if tier becomes
+ * empty then unregister it to make it invisible
+ * in sysfs.
+ */
+ node_clear(node, memory_tiers[tier]->nodelist);
+ if (nodes_empty(memory_tiers[tier]->nodelist))
+ unregister_memory_tier(tier);
+
+ establish_migration_targets();
+
+ mutex_unlock(&memory_tier_lock);
+}
+
int node_get_memory_tier(int node)
{
int tier;
@@ -2271,6 +2361,7 @@ int __node_set_memory_tier(int node, int tier)
}

node_set(node, memory_tiers[tier]->nodelist);
+ establish_migration_targets();

out:
return ret;
@@ -2328,75 +2419,6 @@ int node_set_memory_tier(int node, int tier)
return ret;
}

-/*
- * node_demotion[] example:
- *
- * Consider a system with two sockets. Each socket has
- * three classes of memory attached: fast, medium and slow.
- * Each memory class is placed in its own NUMA node. The
- * CPUs are placed in the node with the "fast" memory. The
- * 6 NUMA nodes (0-5) might be split among the sockets like
- * this:
- *
- * Socket A: 0, 1, 2
- * Socket B: 3, 4, 5
- *
- * When Node 0 fills up, its memory should be migrated to
- * Node 1. When Node 1 fills up, it should be migrated to
- * Node 2. The migration path start on the nodes with the
- * processors (since allocations default to this node) and
- * fast memory, progress through medium and end with the
- * slow memory:
- *
- * 0 -> 1 -> 2 -> stop
- * 3 -> 4 -> 5 -> stop
- *
- * This is represented in the node_demotion[] like this:
- *
- * { nr=1, nodes[0]=1 }, // Node 0 migrates to 1
- * { nr=1, nodes[0]=2 }, // Node 1 migrates to 2
- * { nr=0, nodes[0]=-1 }, // Node 2 does not migrate
- * { nr=1, nodes[0]=4 }, // Node 3 migrates to 4
- * { nr=1, nodes[0]=5 }, // Node 4 migrates to 5
- * { nr=0, nodes[0]=-1 }, // Node 5 does not migrate
- *
- * Moreover some systems may have multiple slow memory nodes.
- * Suppose a system has one socket with 3 memory nodes, node 0
- * is fast memory type, and node 1/2 both are slow memory
- * type, and the distance between fast memory node and slow
- * memory node is same. So the migration path should be:
- *
- * 0 -> 1/2 -> stop
- *
- * This is represented in the node_demotion[] like this:
- * { nr=2, {nodes[0]=1, nodes[1]=2} }, // Node 0 migrates to node 1 and node 2
- * { nr=0, nodes[0]=-1, }, // Node 1 dose not migrate
- * { nr=0, nodes[0]=-1, }, // Node 2 does not migrate
- */
-
-/*
- * Writes to this array occur without locking. Cycles are
- * not allowed: Node X demotes to Y which demotes to X...
- *
- * If multiple reads are performed, a single rcu_read_lock()
- * must be held over all reads to ensure that no cycles are
- * observed.
- */
-#define DEFAULT_DEMOTION_TARGET_NODES 15
-
-#if MAX_NUMNODES < DEFAULT_DEMOTION_TARGET_NODES
-#define DEMOTION_TARGET_NODES (MAX_NUMNODES - 1)
-#else
-#define DEMOTION_TARGET_NODES DEFAULT_DEMOTION_TARGET_NODES
-#endif
-
-struct demotion_nodes {
- unsigned short nr;
- short nodes[DEMOTION_TARGET_NODES];
-};
-
-static struct demotion_nodes *node_demotion __read_mostly;
-
/**
* next_demotion_node() - Get the next node in the demotion path
* @node: The starting node to lookup the next node
@@ -2409,8 +2431,7 @@ static struct demotion_nodes *node_demotion __read_mostly;
int next_demotion_node(int node)
{
struct demotion_nodes *nd;
- unsigned short target_nr, index;
- int target;
+ int target, nnodes, i;

if (!node_demotion)
return NUMA_NO_NODE;
@@ -2419,61 +2440,46 @@ int next_demotion_node(int node)

/*
* node_demotion[] is updated without excluding this
- * function from running. RCU doesn't provide any
- * compiler barriers, so the READ_ONCE() is required
- * to avoid compiler reordering or read merging.
+ * function from running.
*
* Make sure to use RCU over entire code blocks if
* node_demotion[] reads need to be consistent.
*/
rcu_read_lock();
- target_nr = READ_ONCE(nd->nr);

- switch (target_nr) {
- case 0:
- target = NUMA_NO_NODE;
- goto out;
- case 1:
- index = 0;
- break;
- default:
- /*
- * If there are multiple target nodes, just select one
- * target node randomly.
- *
- * In addition, we can also use round-robin to select
- * target node, but we should introduce another variable
- * for node_demotion[] to record last selected target node,
- * that may cause cache ping-pong due to the changing of
- * last target node. Or introducing per-cpu data to avoid
- * caching issue, which seems more complicated. So selecting
- * target node randomly seems better until now.
- */
- index = get_random_int() % target_nr;
- break;
- }
+ nnodes = nodes_weight(nd->preferred);
+ if (!nnodes)
+ return NUMA_NO_NODE;

- target = READ_ONCE(nd->nodes[index]);
+ /*
+ * If there are multiple target nodes, just select one
+ * target node randomly.
+ *
+ * In addition, we can also use round-robin to select
+ * target node, but we should introduce another variable
+ * for node_demotion[] to record last selected target node,
+ * that may cause cache ping-pong due to the changing of
+ * last target node. Or introducing per-cpu data to avoid
+ * caching issue, which seems more complicated. So selecting
+ * target node randomly seems better until now.
+ */
+ nnodes = get_random_int() % nnodes;
+ target = first_node(nd->preferred);
+ for (i = 0; i < nnodes; i++)
+ target = next_node(target, nd->preferred);

-out:
rcu_read_unlock();
+
return target;
}

-#if defined(CONFIG_HOTPLUG_CPU)
/* Disable reclaim-based migration. */
static void __disable_all_migrate_targets(void)
{
- int node, i;
+ int node;

- if (!node_demotion)
- return;
-
- for_each_online_node(node) {
- node_demotion[node].nr = 0;
- for (i = 0; i < DEMOTION_TARGET_NODES; i++)
- node_demotion[node].nodes[i] = NUMA_NO_NODE;
- }
+ for_each_node_mask(node, node_states[N_MEMORY])
+ node_demotion[node].preferred = NODE_MASK_NONE;
}

static void disable_all_migrate_targets(void)
@@ -2485,173 +2491,70 @@ static void disable_all_migrate_targets(void)
* Readers will see either a combination of before+disable
* state or disable+after. They will never see before and
* after state together.
- *
- * The before+after state together might have cycles and
- * could cause readers to do things like loop until this
- * function finishes. This ensures they can only see a
- * single "bad" read and would, for instance, only loop
- * once.
*/
synchronize_rcu();
}

/*
- * Find an automatic demotion target for 'node'.
- * Failing here is OK. It might just indicate
- * being at the end of a chain.
- */
-static int establish_migrate_target(int node, nodemask_t *used,
- int best_distance)
+* Find an automatic demotion target for all memory
+* nodes. Failing here is OK. It might just indicate
+* being at the end of a chain.
+*/
+static void establish_migration_targets(void)
{
- int migration_target, index, val;
struct demotion_nodes *nd;
+ int tier, target = NUMA_NO_NODE, node;
+ int distance, best_distance;
+ nodemask_t used;

if (!node_demotion)
- return NUMA_NO_NODE;
-
- nd = &node_demotion[node];
-
- migration_target = find_next_best_node(node, used);
- if (migration_target == NUMA_NO_NODE)
- return NUMA_NO_NODE;
-
- /*
- * If the node has been set a migration target node before,
- * which means it's the best distance between them. Still
- * check if this node can be demoted to other target nodes
- * if they have a same best distance.
- */
- if (best_distance != -1) {
- val = node_distance(node, migration_target);
- if (val > best_distance)
- goto out_clear;
- }
-
- index = nd->nr;
- if (WARN_ONCE(index >= DEMOTION_TARGET_NODES,
- "Exceeds maximum demotion target nodes\n"))
- goto out_clear;
-
- nd->nodes[index] = migration_target;
- nd->nr++;
+ return;

- return migration_target;
-out_clear:
- node_clear(migration_target, *used);
- return NUMA_NO_NODE;
-}
+ disable_all_migrate_targets();

-/*
- * When memory fills up on a node, memory contents can be
- * automatically migrated to another node instead of
- * discarded at reclaim.
- *
- * Establish a "migration path" which will start at nodes
- * with CPUs and will follow the priorities used to build the
- * page allocator zonelists.
- *
- * The difference here is that cycles must be avoided. If
- * node0 migrates to node1, then neither node1, nor anything
- * node1 migrates to can migrate to node0. Also one node can
- * be migrated to multiple nodes if the target nodes all have
- * a same best-distance against the source node.
- *
- * This function can run simultaneously with readers of
- * node_demotion[]. However, it can not run simultaneously
- * with itself. Exclusion is provided by memory hotplug events
- * being single-threaded.
- */
-static void __set_migration_target_nodes(void)
-{
- nodemask_t next_pass = NODE_MASK_NONE;
- nodemask_t this_pass = NODE_MASK_NONE;
- nodemask_t used_targets = NODE_MASK_NONE;
- int node, best_distance;
+ for_each_node_mask(node, node_states[N_MEMORY]) {
+ best_distance = -1;
+ nd = &node_demotion[node];

- /*
- * Avoid any oddities like cycles that could occur
- * from changes in the topology. This will leave
- * a momentary gap when migration is disabled.
- */
- disable_all_migrate_targets();
+ tier = __node_get_memory_tier(node);
+ /*
+ * Find next tier to demote.
+ */
+ while (++tier < MAX_MEMORY_TIERS) {
+ if (memory_tiers[tier])
+ break;
+ }

- /*
- * Allocations go close to CPUs, first. Assume that
- * the migration path starts at the nodes with CPUs.
- */
- next_pass = node_states[N_CPU];
-again:
- this_pass = next_pass;
- next_pass = NODE_MASK_NONE;
- /*
- * To avoid cycles in the migration "graph", ensure
- * that migration sources are not future targets by
- * setting them in 'used_targets'. Do this only
- * once per pass so that multiple source nodes can
- * share a target node.
- *
- * 'used_targets' will become unavailable in future
- * passes. This limits some opportunities for
- * multiple source nodes to share a destination.
- */
- nodes_or(used_targets, used_targets, this_pass);
+ if (tier >= MAX_MEMORY_TIERS)
+ continue;

- for_each_node_mask(node, this_pass) {
- best_distance = -1;
+ nodes_andnot(used, node_states[N_MEMORY], memory_tiers[tier]->nodelist);

/*
- * Try to set up the migration path for the node, and the target
- * migration nodes can be multiple, so doing a loop to find all
- * the target nodes if they all have a best node distance.
+ * Find all the nodes in the memory tier node list with the same best
+ * distance and add them to the preferred mask. We randomly select between
+ * nodes in the preferred mask when allocating pages during demotion.
*/
do {
- int target_node =
- establish_migrate_target(node, &used_targets,
- best_distance);
-
- if (target_node == NUMA_NO_NODE)
+ target = find_next_best_node(node, &used);
+ if (target == NUMA_NO_NODE)
break;

- if (best_distance == -1)
- best_distance = node_distance(node, target_node);
-
- /*
- * Visit targets from this pass in the next pass.
- * Eventually, every node will have been part of
- * a pass, and will become set in 'used_targets'.
- */
- node_set(target_node, next_pass);
+ distance = node_distance(node, target);
+ if (distance == best_distance || best_distance == -1) {
+ best_distance = distance;
+ node_set(target, nd->preferred);
+ } else {
+ break;
+ }
} while (1);
}
- /*
- * 'next_pass' contains nodes which became migration
- * targets in this pass. Make additional passes until
- * no more migrations targets are available.
- */
- if (!nodes_empty(next_pass))
- goto again;
}

/*
- * For callers that do not hold get_online_mems() already.
- */
-void set_migration_target_nodes(void)
-{
- get_online_mems();
- __set_migration_target_nodes();
- put_online_mems();
-}
-
-/*
- * This leaves migrate-on-reclaim transiently disabled between
- * the MEM_GOING_OFFLINE and MEM_OFFLINE events. This runs
- * whether reclaim-based migration is enabled or not, which
- * ensures that the user can turn reclaim-based migration at
- * any time without needing to recalculate migration targets.
- *
- * These callbacks already hold get_online_mems(). That is why
- * __set_migration_target_nodes() can be used as opposed to
- * set_migration_target_nodes().
+ * This runs whether reclaim-based migration is enabled or not,
+ * which ensures that the user can turn on reclaim-based migration
+ * at any time without needing to recalculate migration targets.
*/
static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
unsigned long action, void *_arg)
@@ -2660,64 +2563,44 @@ static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,

/*
* Only update the node migration order when a node is
- * changing status, like online->offline. This avoids
- * the overhead of synchronize_rcu() in most cases.
+ * changing status, like online->offline.
*/
if (arg->status_change_nid < 0)
return notifier_from_errno(0);

switch (action) {
- case MEM_GOING_OFFLINE:
- /*
- * Make sure there are not transient states where
- * an offline node is a migration target. This
- * will leave migration disabled until the offline
- * completes and the MEM_OFFLINE case below runs.
- */
- disable_all_migrate_targets();
- break;
case MEM_OFFLINE:
- case MEM_ONLINE:
/*
- * Recalculate the target nodes once the node
- * reaches its final state (online or offline).
+ * In case we are moving out of N_MEMORY. Keep the node
+ * in the memory tier so that when we bring memory online,
+ * they appear in the right memory tier. We still need
+ * to rebuild the demotion order.
*/
- __set_migration_target_nodes();
+ mutex_lock(&memory_tier_lock);
+ establish_migration_targets();
+ mutex_unlock(&memory_tier_lock);
break;
- case MEM_CANCEL_OFFLINE:
+ case MEM_ONLINE:
/*
- * MEM_GOING_OFFLINE disabled all the migration
- * targets. Reenable them.
+ * We ignore the error here. If the node already has the tier
+ * registered, we will continue to use that for the new memory
+ * we are adding here.
*/
- __set_migration_target_nodes();
- break;
- case MEM_GOING_ONLINE:
- case MEM_CANCEL_ONLINE:
+ node_set_memory_tier(arg->status_change_nid, DEFAULT_MEMORY_TIER);
break;
}

return notifier_from_errno(0);
}

-void __init migrate_on_reclaim_init(void)
+static void __init migrate_on_reclaim_init(void)
{
- node_demotion = kmalloc_array(nr_node_ids,
- sizeof(struct demotion_nodes),
- GFP_KERNEL);
+ node_demotion = kcalloc(MAX_NUMNODES, sizeof(struct demotion_nodes),
+ GFP_KERNEL);
WARN_ON(!node_demotion);

hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
- /*
- * At this point, all numa nodes with memory/CPus have their state
- * properly set, so we can build the demotion order now.
- * Let us hold the cpu_hotplug lock just, as we could possibily have
- * CPU hotplug events during boot.
- */
- cpus_read_lock();
- set_migration_target_nodes();
- cpus_read_unlock();
}
-#endif /* CONFIG_HOTPLUG_CPU */

bool numa_demotion_enabled = false;

@@ -2800,6 +2683,7 @@ static int __init memory_tier_init(void)
* CPU only nodes are not part of memoty tiers.
*/
memory_tiers[DEFAULT_MEMORY_TIER]->nodelist = node_states[N_MEMORY];
+ migrate_on_reclaim_init();

return 0;
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b75b1a64b54c..7815d21345a4 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -2053,7 +2053,6 @@ static int vmstat_cpu_online(unsigned int cpu)

if (!node_state(cpu_to_node(cpu), N_CPU)) {
node_set_state(cpu_to_node(cpu), N_CPU);
- set_migration_target_nodes();
}

return 0;
@@ -2078,7 +2077,6 @@ static int vmstat_cpu_dead(unsigned int cpu)
return 0;

node_clear_state(node, N_CPU);
- set_migration_target_nodes();

return 0;
}
@@ -2111,9 +2109,6 @@ void __init init_mm_internals(void)

start_shepherd_timer();
#endif
-#if defined(CONFIG_MIGRATION) && defined(CONFIG_HOTPLUG_CPU)
- migrate_on_reclaim_init();
-#endif
#ifdef CONFIG_PROC_FS
proc_create_seq("buddyinfo", 0444, NULL, &fragmentation_op);
proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op);
--
2.36.1


2022-05-28 19:37:09

by Aneesh Kumar K.V

Subject: [RFC PATCH v4 2/7] mm/demotion: Expose per node memory tier to sysfs

From: Jagdish Gediya <[email protected]>

Add support to read/write the memory tier index for a NUMA node.

/sys/devices/system/node/nodeN/memtier

where N = node id

When read, it lists the memory tier that the node belongs to.

When written, the kernel moves the node into the specified
memory tier. The tier assignment of all other nodes is not
affected.

If the memory tier does not exist, writing to the above file
creates the tier and assigns the NUMA node to that tier.

The mutex memory_tier_lock is introduced to protect memory tier
related changes, as they can happen from sysfs as well as on
hotplug events.
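
The read/write semantics described above can be modeled in user space
roughly as follows. This is only an illustrative sketch, not the kernel
code in this patch: tier_nodes, tier_exists, model_node_get_tier() and
model_node_reset_tier() are hypothetical names, a 64-bit mask stands in
for nodemask_t, and MAX_TIERS is assumed to be 3.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_TIERS 3   /* assumption: mirrors MAX_MEMORY_TIERS */
#define MAX_NODES 64  /* assumption: one bit per node in a 64-bit mask */

uint64_t tier_nodes[MAX_TIERS];  /* stand-in for each tier's nodelist */
bool tier_exists[MAX_TIERS];     /* stand-in for "tier is registered" */

/* Return the tier a node belongs to, or -1 if it is in no tier. */
int model_node_get_tier(int node)
{
	for (int tier = 0; tier < MAX_TIERS; tier++)
		if (tier_exists[tier] && (tier_nodes[tier] & (UINT64_C(1) << node)))
			return tier;
	return -1;
}

/*
 * Move a node into 'tier', creating the tier if it does not exist and
 * dropping the old tier once it becomes empty. Other nodes are untouched.
 * Returns 0 on success, -1 on an out-of-range node or tier.
 */
int model_node_reset_tier(int node, int tier)
{
	int cur;

	if (node < 0 || node >= MAX_NODES || tier < 0 || tier >= MAX_TIERS)
		return -1;

	cur = model_node_get_tier(node);
	if (cur == tier)
		return 0;

	tier_exists[tier] = true;                 /* "register" the new tier */
	tier_nodes[tier] |= UINT64_C(1) << node;

	if (cur != -1) {                          /* node may have had no tier */
		tier_nodes[cur] &= ~(UINT64_C(1) << node);
		if (tier_nodes[cur] == 0)
			tier_exists[cur] = false; /* "unregister" the empty tier */
	}
	return 0;
}
```

The model checks the "node is in no tier yet" case (cur == -1) before it
touches the old tier, and it rejects out-of-range tier values up front.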

Signed-off-by: Jagdish Gediya <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
drivers/base/node.c | 35 ++++++++++++++
include/linux/migrate.h | 4 +-
mm/migrate.c | 103 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 141 insertions(+), 1 deletion(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index ec8bb24a5a22..cf4a58446d8c 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -20,6 +20,7 @@
#include <linux/pm_runtime.h>
#include <linux/swap.h>
#include <linux/slab.h>
+#include <linux/migrate.h>

static struct bus_type node_subsys = {
.name = "node",
@@ -560,11 +561,45 @@ static ssize_t node_read_distance(struct device *dev,
}
static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);

+#ifdef CONFIG_TIERED_MEMORY
+static ssize_t memtier_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ int node = dev->id;
+
+ return sysfs_emit(buf, "%d\n", node_get_memory_tier(node));
+}
+
+static ssize_t memtier_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ unsigned long tier;
+ int node = dev->id;
+
+ int ret = kstrtoul(buf, 10, &tier);
+ if (ret)
+ return ret;
+
+ ret = node_reset_memory_tier(node, tier);
+ if (ret)
+ return ret;
+
+ return count;
+}
+
+static DEVICE_ATTR_RW(memtier);
+#endif
+
static struct attribute *node_dev_attrs[] = {
&dev_attr_meminfo.attr,
&dev_attr_numastat.attr,
&dev_attr_distance.attr,
&dev_attr_vmstat.attr,
+#ifdef CONFIG_TIERED_MEMORY
+ &dev_attr_memtier.attr,
+#endif
NULL
};

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 0ec653623565..d37d1d5dee82 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -177,13 +177,15 @@ enum memory_tier_type {
};

int next_demotion_node(int node);
-
extern void migrate_on_reclaim_init(void);
#ifdef CONFIG_HOTPLUG_CPU
extern void set_migration_target_nodes(void);
#else
static inline void set_migration_target_nodes(void) {}
#endif
+int node_get_memory_tier(int node);
+int node_set_memory_tier(int node, int tier);
+int node_reset_memory_tier(int node, int tier);
#else
#define numa_demotion_enabled false
static inline int next_demotion_node(int node)
diff --git a/mm/migrate.c b/mm/migrate.c
index f28ee93fb017..304559ba3372 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2132,6 +2132,7 @@ static struct bus_type memory_tier_subsys = {
.dev_name = "memtier",
};

+DEFINE_MUTEX(memory_tier_lock);
static struct memory_tier *memory_tiers[MAX_MEMORY_TIERS];

static ssize_t nodelist_show(struct device *dev,
@@ -2225,6 +2226,108 @@ static const struct attribute_group *memory_tier_attr_groups[] = {
NULL,
};

+static int __node_get_memory_tier(int node)
+{
+ int tier;
+
+ for (tier = 0; tier < MAX_MEMORY_TIERS; tier++) {
+ if (memory_tiers[tier] && node_isset(node, memory_tiers[tier]->nodelist))
+ return tier;
+ }
+
+ return -1;
+}
+
+int node_get_memory_tier(int node)
+{
+ int tier;
+
+ /*
+ * Make sure memory tier is not unregistered
+ * while it is being read.
+ */
+ mutex_lock(&memory_tier_lock);
+
+ tier = __node_get_memory_tier(node);
+
+ mutex_unlock(&memory_tier_lock);
+
+ return tier;
+}
+
+int __node_set_memory_tier(int node, int tier)
+{
+ int ret = 0;
+ /*
+ * As register_memory_tier() for new tier can fail,
+ * try it before modifying existing tier. register
+ * tier makes tier visible in sysfs.
+ */
+ if (!memory_tiers[tier]) {
+ ret = register_memory_tier(tier);
+ if (ret) {
+ goto out;
+ }
+ }
+
+ node_set(node, memory_tiers[tier]->nodelist);
+
+out:
+ return ret;
+}
+
+int node_reset_memory_tier(int node, int tier)
+{
+ int current_tier, ret = 0;
+
+ mutex_lock(&memory_tier_lock);
+
+ current_tier = __node_get_memory_tier(node);
+ if (current_tier == tier)
+ goto out;
+
+ if (current_tier != -1)
+ node_clear(node, memory_tiers[current_tier]->nodelist);
+
+ ret = __node_set_memory_tier(node, tier);
+
+ if (!ret) {
+ if (current_tier != -1 && nodes_empty(memory_tiers[current_tier]->nodelist))
+ unregister_memory_tier(current_tier);
+ } else if (current_tier != -1) {
+ /* reset it back to older tier */
+ ret = __node_set_memory_tier(node, current_tier);
+ }
+out:
+ mutex_unlock(&memory_tier_lock);
+
+ return ret;
+}
+
+int node_set_memory_tier(int node, int tier)
+{
+ int current_tier, ret = 0;
+
+ if (tier >= MAX_MEMORY_TIERS)
+ return -EINVAL;
+
+ mutex_lock(&memory_tier_lock);
+ current_tier = __node_get_memory_tier(node);
+ /*
+ * If the node is already part of a tier, proceed with the
+ * current tier value, because we might want to establish
+ * new migration paths now. The node might be added to a tier
+ * before it was made part of N_MEMORY, hence establish_migration_targets()
+ * will have skipped this node.
+ */
+ if (current_tier != -1)
+ tier = current_tier;
+ ret = __node_set_memory_tier(node, tier);
+ mutex_unlock(&memory_tier_lock);
+
+ return ret;
+}
+
/*
* node_demotion[] example:
*
--
2.36.1


2022-05-28 19:51:02

by Huang, Ying

Subject: Re: RFC: Memory Tiering Kernel Interfaces (v3)

On Thu, 2022-05-26 at 14:22 -0700, Wei Xu wrote:
> Changes since v2
> ================
> * Updated the design and examples to use "rank" instead of device ID
>   to determine the order between memory tiers for better flexibility.
>
> Overview
> ========
>
> The current kernel has the basic memory tiering support: Inactive
> pages on a higher tier NUMA node can be migrated (demoted) to a lower
> tier NUMA node to make room for new allocations on the higher tier
> NUMA node. Frequently accessed pages on a lower tier NUMA node can be
> migrated (promoted) to a higher tier NUMA node to improve the
> performance.
>
> In the current kernel, memory tiers are defined implicitly via a
> demotion path relationship between NUMA nodes, which is created during
> the kernel initialization and updated when a NUMA node is hot-added or
> hot-removed. The current implementation puts all nodes with CPU into
> the top tier, and builds the tier hierarchy tier-by-tier by
> establishing the per-node demotion targets based on the distances
> between nodes.
>
> This current memory tier kernel interface needs to be improved for
> several important use cases:
>
> * The current tier initialization code always initializes
>   each memory-only NUMA node into a lower tier. But a memory-only
>   NUMA node may have a high performance memory device (e.g. a DRAM
>   device attached via CXL.mem or a DRAM-backed memory-only node on
>   a virtual machine) and should be put into a higher tier.
>
> * The current tier hierarchy always puts CPU nodes into the top
>   tier. But on a system with HBM (e.g. GPU memory) devices, these
>   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
>   with CPUs are better to be placed into the next lower tier.
>
> * Also because the current tier hierarchy always puts CPU nodes
>   into the top tier, when a CPU is hot-added (or hot-removed) and
>   triggers a memory node from CPU-less into a CPU node (or vice
>   versa), the memory tier hierarchy gets changed, even though no
>   memory node is added or removed. This can make the tier
>   hierarchy unstable and make it difficult to support tier-based
>   memory accounting.
>
> * A higher tier node can only be demoted to selected nodes on the
>   next lower tier as defined by the demotion path, not any other
>   node from any lower tier. This strict, hard-coded demotion order
>   does not work in all use cases (e.g. some use cases may want to
>   allow cross-socket demotion to another node in the same demotion
>   tier as a fallback when the preferred demotion node is out of
>   space), and has resulted in the feature request for an interface to
>   override the system-wide, per-node demotion order from the
>   userspace. This demotion order is also inconsistent with the page
>   allocation fallback order when all the nodes in a higher tier are
>   out of space: The page allocation can fall back to any node from
>   any lower tier, whereas the demotion order doesn't allow that.
>
> * There are no interfaces for the userspace to learn about the memory
>   tier hierarchy in order to optimize its memory allocations.
>
> I'd like to propose revised memory tier kernel interfaces based on
> the discussions in the threads:
>
> - https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
> - https://lore.kernel.org/linux-mm/[email protected]/t/
> - https://lore.kernel.org/linux-mm/[email protected]/T/
> - https://lore.kernel.org/linux-mm/[email protected]/T/
>
>
> High-level Design Ideas
> =======================
>
> * Define memory tiers explicitly, not implicitly.
>
> * Memory tiers are defined based on hardware capabilities of memory
>   nodes, not their relative node distances between each other.
>
> * The tier assignment of each node is independent from each other.
>   Moving a node from one tier to another tier doesn't affect the tier
>   assignment of any other node.
>
> * The node-tier association is stable. A node can be reassigned to a
>   different tier only under the specific conditions that don't block
>   future tier-based memory cgroup accounting.
>
> * A node can demote its pages to any nodes of any lower tiers. The
>   demotion target node selection follows the allocation fallback order
>   of the source node, which is built based on node distances. The
>   demotion targets are also restricted to only the nodes from the tiers
>   lower than the source node. We no longer need to maintain a separate
>   per-node demotion order (node_demotion[]).
>
>
> Sysfs Interfaces
> ================
>
> * /sys/devices/system/memtier/
>
>   This is the directory containing the information about memory tiers.
>
>   Each memory tier has its own subdirectory.
>
>   The order of memory tiers is determined by their rank values, not by
>   their memtier device names.
>
>   - /sys/devices/system/memtier/possible
>
>     Format: ordered list of "memtier(rank)"
>     Example: 0(64), 1(128), 2(192)
>
>     Read-only. When read, list all available memory tiers and their
>     associated ranks, ordered by the rank values (from the highest
>      tier to the lowest tier).

I like the idea of the "possible" file, and I think we can show the default
tier too. That is, if "1(128)" is the default tier (the tier with DRAM),
then the list can be,

"
0/64 [1/128] 2/192
"

To make it easier to parse from a shell, I would prefer something
like,

"
0 64
1 128 default
2 192
"

But a one-line format is OK for me too.
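
As a rough illustration of why the multi-line form is easy to consume, a
single sscanf() call per line is enough to parse it. This is only a
sketch of the format suggested above, which is a proposal and not an
existing kernel interface; struct tier_entry and parse_tier_line() are
hypothetical names.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

struct tier_entry {
	int id;          /* memtier device ID */
	int rank;        /* rank value */
	bool is_default; /* line carried the "default" marker */
};

/*
 * Parse one "<id> <rank> [default]" line.
 * Returns 0 on success, -1 if the line is malformed.
 */
int parse_tier_line(const char *line, struct tier_entry *out)
{
	char tag[16] = "";
	int n = sscanf(line, "%d %d %15s", &out->id, &out->rank, tag);

	if (n < 2)
		return -1;
	if (n == 3 && strcmp(tag, "default") != 0)
		return -1;
	out->is_default = (n == 3);
	return 0;
}
```

A caller would read the file line by line (e.g. with fgets()) and pass
each line to parse_tier_line().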

>
> * /sys/devices/system/memtier/memtierN/
>
>   This is the directory containing the information about a particular
>   memory tier, memtierN, where N is the memtier device ID (e.g. 0, 1).
>
>   The memtier device ID number itself is just an identifier and has no
>   special meaning, i.e. memtier device ID numbers do not determine the
>   order of memory tiers.
>
>   - /sys/devices/system/memtier/memtierN/rank
>
>     Format: int
>     Example: 100
>
>     Read-only. When read, list the "rank" value associated with memtierN.
>
>     "Rank" is an opaque value. Its absolute value doesn't have any
>     special meaning. But the rank values of different memtiers can be
>     compared with each other to determine the memory tier order.
>     For example, if we have 3 memtiers: memtier0, memtier1, memtier2, and
>     their rank values are 10, 20, 15, then the memory tier order is:
>     memtier0 -> memtier2 -> memtier1, where memtier0 is the highest tier
>     and memtier1 is the lowest tier.
>
>     The rank value of each memtier should be unique.
>
>   - /sys/devices/system/memtier/memtierN/nodelist
>
>     Format: node_list
>     Example: 1-2
>
>     Read-only. When read, list the memory nodes in the specified tier.
>
>     If a memory tier has no memory nodes, the kernel can hide the sysfs
>     directory of this memory tier, though the tier itself can still be
>     visible from /sys/devices/system/memtier/possible.
>
> * /sys/devices/system/node/nodeN/memtier
>
>   where N = 0, 1, ...
>
>   Format: int or empty
>   Example: 1
>
>   When read, list the device ID of the memory tier that the node belongs
>   to. Its value is empty for a CPU-only NUMA node.
>
>   When written, the kernel moves the node into the specified memory
>   tier if the move is allowed. The tier assignment of all other nodes
>   is not affected.
>
>   Initially, we can make this interface read-only.
>
>
> Kernel Representation
> =====================
>
> * All memory tiering code is guarded by CONFIG_TIERED_MEMORY.
>
> * #define MAX_MEMORY_TIERS 3
>
>   Support 3 memory tiers for now. This can be a kconfig option.
>
> * #define MEMORY_DEFAULT_TIER_DEVICE 1
>
>   The default tier device that a memory node is assigned to.
>
> * struct memtier_dev {
>       nodemask_t nodelist;
>       int rank;
>       int tier;
>   } memtier_devices[MAX_MEMORY_TIERS]
>
>   Store memory tiers by device IDs.
>
> * struct memtier_dev *memory_tier(int tier)
>
>   Returns the memtier device for a given memory tier.
>
> * int node_tier_dev_map[MAX_NUMNODES]
>
>   Map a node to its tier device ID.
>
>   For each CPU-only node c, node_tier_dev_map[c] = -1.
>
>
> Memory Tier Initialization
> ==========================
>
> By default, all memory nodes are assigned to the default tier
> (MEMORY_DEFAULT_TIER_DEVICE). The default tier device has a rank value
> in the middle of the possible rank value range (e.g. 127 if the range
> is [0..255]).
>
> A device driver can move up or down its memory nodes from the default
> tier. For example, PMEM can move down its memory nodes below the
> default tier, whereas GPU can move up its memory nodes above the
> default tier.
>
> The kernel initialization code makes the decision on which exact tier
> a memory node should be assigned to based on the requests from the
> device drivers as well as the memory device hardware information
> provided by the firmware.
>
>
> Memory Tier Reassignment
> ========================
>
> After a memory node is hot-removed, it can be hot-added back to a
> different memory tier. This is useful for supporting dynamically
> provisioned CXL.mem NUMA nodes, which may connect to different
> memory devices across hot-plug events. Such tier changes should
> be compatible with tier-based memory accounting.
>
> The userspace may also reassign an existing online memory node to a
> different tier. However, this should only be allowed when no pages
> are allocated from the memory node or when there are no non-root
> memory cgroups (e.g. during the system boot). This restriction is
> important for keeping memory tier hierarchy stable enough for
> tier-based memory cgroup accounting.

One way to do this is to hot-remove all memory of a node, change its
memtier, and then hot-add its memory.

Best Regards,
Huang, Ying

> Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
>
>
> Memory Allocation for Demotion
> ==============================
>
> To allocate a new page as the demotion target for a page, the kernel
> calls the allocation function (__alloc_pages_nodemask) with the
> source page node as the preferred node and the union of all lower
> tier nodes as the allowed nodemask. The actual target node selection
> then follows the allocation fallback order that the kernel has
> already defined.
>
> The pseudo code looks like:
>
>     targets = NODE_MASK_NONE;
>     src_nid = page_to_nid(page);
>     src_tier = memtier_devices[node_tier_dev_map[src_nid]].tier;
>     for (i = src_tier + 1; i < MAX_MEMORY_TIERS; i++)
>             nodes_or(targets, targets, memory_tier(i)->nodelist);
>     new_page = __alloc_pages_nodemask(gfp, order, src_nid, targets);
>
> The mempolicy of cpuset, vma and owner task of the source page can
> be set to refine the demotion target nodemask, e.g. to prevent
> demotion or select a particular allowed node as the demotion target.
>
>
> Memory Allocation for Promotion
> ===============================
>
> The page allocation for promotion is similar to demotion, except that (1)
> the target nodemask uses the promotion tiers, (2) the preferred node can
> be the accessing CPU node, not the source page node.
>
>
> Examples
> ========
>
> * Example 1:
>
> Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
>
>                   20
>   Node 0 (DRAM)  ----  Node 1 (DRAM)
>        |        \   /       |
>        | 30    40 X 40      | 30
>        |        /   \       |
>   Node 2 (PMEM)  ----  Node 3 (PMEM)
>                   40
>
> node distances:
> node   0    1    2    3
>    0  10   20   30   40
>    1  20   10   40   30
>    2  30   40   10   40
>    3  40   30   40   10
>
> $ cat /sys/devices/system/memtier/possible
> 0(64), 1(128), 2(192)
>
> $ grep '' /sys/devices/system/memtier/memtier*/rank
> /sys/devices/system/memtier/memtier1/rank:128
> /sys/devices/system/memtier/memtier2/rank:192
>
> $ grep '' /sys/devices/system/memtier/memtier*/nodelist
> /sys/devices/system/memtier/memtier1/nodelist:0-1
> /sys/devices/system/memtier/memtier2/nodelist:2-3
>
> $ grep '' /sys/devices/system/node/node*/memtier
> /sys/devices/system/node/node0/memtier:1
> /sys/devices/system/node/node1/memtier:1
> /sys/devices/system/node/node2/memtier:2
> /sys/devices/system/node/node3/memtier:2
>
> Demotion fallback order:
> node 0: 2, 3
> node 1: 3, 2
> node 2: empty
> node 3: empty
>
> To prevent cross-socket demotion and memory access, the user can set
> mempolicy, e.g. cpuset.mems=0,2.
>
>
> * Example 2:
>
> Node 0 & 1 are DRAM nodes.
> Node 2 is a PMEM node and closer to node 0.
>
>                   20
>   Node 0 (DRAM)  ----  Node 1 (DRAM)
>        |            /
>        | 30        / 40
>        |          /
>   Node 2 (PMEM)
>
> node distances:
> node   0    1    2
>    0  10   20   30
>    1  20   10   40
>    2  30   40   10
>
> $ cat /sys/devices/system/memtier/possible
> 0(64), 1(128), 2(192)
>
> $ grep '' /sys/devices/system/memtier/memtier*/rank
> /sys/devices/system/memtier/memtier1/rank:128
> /sys/devices/system/memtier/memtier2/rank:192
>
> $ grep '' /sys/devices/system/memtier/memtier*/nodelist
> /sys/devices/system/memtier/memtier1/nodelist:0-1
> /sys/devices/system/memtier/memtier2/nodelist:2
>
> $ grep '' /sys/devices/system/node/node*/memtier
> /sys/devices/system/node/node0/memtier:1
> /sys/devices/system/node/node1/memtier:1
> /sys/devices/system/node/node2/memtier:2
>
> Demotion fallback order:
> node 0: 2
> node 1: 2
> node 2: empty
>
>
> * Example 3:
>
> Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
>
> All nodes are in the same tier.
>
>                   20
>   Node 0 (DRAM)  ----  Node 1 (DRAM)
>          \                 /
>           \ 30            / 30
>            \             /
>              Node 2 (DRAM)
>
> node distances:
> node   0    1    2
>    0  10   20   30
>    1  20   10   30
>    2  30   30   10
>
> $ cat /sys/devices/system/memtier/possible
> 0(64), 1(128), 2(192)
>
> $ grep '' /sys/devices/system/memtier/memtier*/rank
> /sys/devices/system/memtier/memtier1/rank:128
>
> $ grep '' /sys/devices/system/memtier/memtier*/nodelist
> /sys/devices/system/memtier/memtier1/nodelist:0-2
>
> $ grep '' /sys/devices/system/node/node*/memtier
> /sys/devices/system/node/node0/memtier:1
> /sys/devices/system/node/node1/memtier:1
> /sys/devices/system/node/node2/memtier:1
>
> Demotion fallback order:
> node 0: empty
> node 1: empty
> node 2: empty
>
>
> * Example 4:
>
> Node 0 is a DRAM node with CPU.
> Node 1 is a PMEM node.
> Node 2 is a GPU node.
>
>                   50
>   Node 0 (DRAM)  ----  Node 2 (GPU)
>          \                 /
>           \ 30            / 60
>            \             /
>              Node 1 (PMEM)
>
> node distances:
> node   0    1    2
>    0  10   30   50
>    1  30   10   60
>    2  50   60   10
>
> $ cat /sys/devices/system/memtier/possible
> 0(64), 1(128), 2(192)
>
> $ grep '' /sys/devices/system/memtier/memtier*/rank
> /sys/devices/system/memtier/memtier0/rank:64
> /sys/devices/system/memtier/memtier1/rank:128
> /sys/devices/system/memtier/memtier2/rank:192
>
> $ grep '' /sys/devices/system/memtier/memtier*/nodelist
> /sys/devices/system/memtier/memtier0/nodelist:2
> /sys/devices/system/memtier/memtier1/nodelist:0
> /sys/devices/system/memtier/memtier2/nodelist:1
>
> $ grep '' /sys/devices/system/node/node*/memtier
> /sys/devices/system/node/node0/memtier:1
> /sys/devices/system/node/node1/memtier:2
> /sys/devices/system/node/node2/memtier:0
>
> Demotion fallback order:
> node 0: 1
> node 1: empty
> node 2: 0, 1
>
>
> * Example 5:
>
> Node 0 is a DRAM node with CPU.
> Node 1 is a GPU node.
> Node 2 is a PMEM node.
> Node 3 is a large, slow DRAM node without CPU.
>
>                     100
>      Node 0 (DRAM)  ----  Node 1 (GPU)
>     /     |               /    |
>    /40    |30        120 /     | 110
>   |       |             /      |
>   |  Node 2 (PMEM) ----       /
>   |        \                 /
>    \     80 \               /
>     -------  Node 3 (Slow DRAM)
>
> node distances:
> node    0    1    2    3
>    0   10  100   30   40
>    1  100   10  120  110
>    2   30  120   10   80
>    3   40  110   80   10
>
> MAX_MEMORY_TIERS=4 (memtier3 is a memory tier added later).
>
> $ cat /sys/devices/system/memtier/possible
> 0(64), 1(128), 3(160), 2(192)
>
> $ grep '' /sys/devices/system/memtier/memtier*/rank
> /sys/devices/system/memtier/memtier0/rank:64
> /sys/devices/system/memtier/memtier1/rank:128
> /sys/devices/system/memtier/memtier2/rank:192
> /sys/devices/system/memtier/memtier3/rank:160
>
> $ grep '' /sys/devices/system/memtier/memtier*/nodelist
> /sys/devices/system/memtier/memtier0/nodelist:1
> /sys/devices/system/memtier/memtier1/nodelist:0
> /sys/devices/system/memtier/memtier2/nodelist:2
> /sys/devices/system/memtier/memtier3/nodelist:3
>
> $ grep '' /sys/devices/system/node/node*/memtier
> /sys/devices/system/node/node0/memtier:1
> /sys/devices/system/node/node1/memtier:0
> /sys/devices/system/node/node2/memtier:2
> /sys/devices/system/node/node3/memtier:3
>
> Demotion fallback order:
> node 0: 2, 3
> node 1: 0, 3, 2
> node 2: empty
> node 3: 2
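
The demotion fallback order in Example 5 can be reproduced with a small
user-space sketch. It assumes that the allocation fallback order is
simply "ascending node distance from the source node" and that a lower
tier means a larger rank value; the real zonelist construction in the
kernel is more involved. node_rank, node_dist and demotion_order() are
hypothetical names used only for this illustration.

```c
#include <stddef.h>

#define NR_NODES 4

/* Data copied from Example 5: per-node tier rank and node distances. */
const int node_rank[NR_NODES] = { 128, 64, 192, 160 };
const int node_dist[NR_NODES][NR_NODES] = {
	{  10, 100,  30,  40 },
	{ 100,  10, 120, 110 },
	{  30, 120,  10,  80 },
	{  40, 110,  80,  10 },
};

/*
 * Fill 'out' with the demotion targets of 'src', nearest first, and
 * return how many were found. A target must sit in a lower tier,
 * i.e. have a larger rank value than 'src'.
 */
size_t demotion_order(int src, int out[NR_NODES])
{
	size_t n = 0;

	for (int nid = 0; nid < NR_NODES; nid++) {
		size_t pos;

		if (node_rank[nid] <= node_rank[src])
			continue;

		/* Insertion sort by distance from 'src'. */
		pos = n++;
		while (pos > 0 && node_dist[src][out[pos - 1]] > node_dist[src][nid]) {
			out[pos] = out[pos - 1];
			pos--;
		}
		out[pos] = nid;
	}
	return n;
}
```

For node 1 (the GPU node) this yields 0, 3, 2, and for node 3 it yields
only node 2, matching the table quoted above.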



2022-05-28 20:27:45

by Wei Xu

Subject: Re: RFC: Memory Tiering Kernel Interfaces (v3)

On Fri, May 27, 2022 at 6:41 AM Aneesh Kumar K V
<[email protected]> wrote:
>
> On 5/27/22 2:52 AM, Wei Xu wrote:
>
> > The order of memory tiers is determined by their rank values, not by
> > their memtier device names.
> >
> > - /sys/devices/system/memtier/possible
> >
> > Format: ordered list of "memtier(rank)"
> > Example: 0(64), 1(128), 2(192)
> >
> > Read-only. When read, list all available memory tiers and their
> > associated ranks, ordered by the rank values (from the highest
> > tier to the lowest tier).
> >
>
> Did we discuss the need for this? I haven't done this in the patch
> series I sent across.

The "possible" file is only needed if we decide to hide the
directories of memtiers that have no nodes. We can remove this
interface and always show all memtier directories to keep things
simpler.

> We do have
> /sys/devices/system/memtier/default_rank which should allow user to
> identify the default rank to which memory would get added via hotplug if
> the NUMA node is not part of any memory tier.

Sounds good to me to have it.

>
> -aneesh

2022-05-28 20:29:53

by Aneesh Kumar K.V

Subject: [RFC PATCH v4 5/7] mm/demotion: Add support to associate rank with memory tier

The rank approach allows us to keep memory tier device IDs stable even if there
is a need to change the tier ordering among different memory tiers. For example,
DRAM nodes with CPUs will always be on memtier1, no matter how many tiers are
higher or lower than these nodes. A new memory tier can be inserted into the
tier hierarchy for a new set of nodes without affecting the node assignment of
any existing memtier, provided that there is enough gap in the rank values for
the new memtier.

The absolute value of "rank" of a memtier doesn't necessarily carry any meaning.
Its value relative to other memtiers decides the level of this memtier in the tier
hierarchy.
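
The idea can be illustrated with a small user-space sketch: tiers are
kept in a list sorted by rank, so a new tier slots in between existing
ones without renumbering any device ID. This is not the kernel code from
this patch (which uses struct list_head); struct tier and tier_insert()
are hypothetical names, and the rank 150 used in the usage note is an
assumed value for a tier added later.

```c
#include <stddef.h>

struct tier {
	int id;            /* memtier device ID, stable across insertions */
	int rank;          /* position in the hierarchy; lower rank = higher tier */
	struct tier *next;
};

/* Insert 't' so that the list stays sorted by ascending rank. */
void tier_insert(struct tier **head, struct tier *t)
{
	struct tier **pos = head;

	/* Skip every entry whose rank is not greater than the new one. */
	while (*pos && (*pos)->rank <= t->rank)
		pos = &(*pos)->next;
	t->next = *pos;
	*pos = t;
}
```

With tiers 0, 1 and 2 at ranks 100, 200 and 300, inserting a tier with
ID 3 and rank 150 places it between memtier0 and memtier1, while DRAM
stays on memtier1.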

For now, this patch supports hardcoded rank values of 100, 200, and 300 for
memory tiers 0, 1, and 2 respectively.

The rank value of a memory tier can be read through the sysfs interface below:
/sys/devices/system/memtier/memtierN/rank

This interface is read-only for now. Write support can be added when more than
3 memory tiers with a flexible ordering among them are needed. Rank can be used
for that, because rank, not the memory tier device ID, now decides the memory
tier ordering.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
drivers/base/node.c | 5 +-
drivers/dax/kmem.c | 2 +-
include/linux/migrate.h | 17 ++--
mm/migrate.c | 218 ++++++++++++++++++++++++----------------
4 files changed, 144 insertions(+), 98 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index cf4a58446d8c..892f7c23c94e 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -567,8 +567,11 @@ static ssize_t memtier_show(struct device *dev,
char *buf)
{
int node = dev->id;
+ int tier_index = node_get_memory_tier_id(node);

- return sysfs_emit(buf, "%d\n", node_get_memory_tier(node));
+ if (tier_index != -1)
+ return sysfs_emit(buf, "%d\n", tier_index);
+ return 0;
}

static ssize_t memtier_store(struct device *dev,
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 991782aa2448..79953426ddaf 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -149,7 +149,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
dev_set_drvdata(dev, data);

#ifdef CONFIG_TIERED_MEMORY
- node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
+ node_set_memory_tier_rank(numa_node, MEMORY_RANK_PMEM);
#endif
return 0;

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index cbef71a499c1..fd09fd009a69 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -167,18 +167,19 @@ void migrate_vma_finalize(struct migrate_vma *migrate);
#ifdef CONFIG_TIERED_MEMORY

extern bool numa_demotion_enabled;
-#define DEFAULT_MEMORY_TIER 1
-
enum memory_tier_type {
- MEMORY_TIER_HBM_GPU,
- MEMORY_TIER_DRAM,
- MEMORY_TIER_PMEM,
- MAX_MEMORY_TIERS
+ MEMORY_RANK_HBM_GPU,
+ MEMORY_RANK_DRAM,
+ DEFAULT_MEMORY_RANK = MEMORY_RANK_DRAM,
+ MEMORY_RANK_PMEM
};

+#define DEFAULT_MEMORY_TIER 1
+#define MAX_MEMORY_TIERS 3
+
int next_demotion_node(int node);
-int node_get_memory_tier(int node);
-int node_set_memory_tier(int node, int tier);
+int node_get_memory_tier_id(int node);
+int node_set_memory_tier_rank(int node, int tier);
int node_reset_memory_tier(int node, int tier);
#else
#define numa_demotion_enabled false
diff --git a/mm/migrate.c b/mm/migrate.c
index 59d8558dd2ee..f013d14f77ed 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2121,8 +2121,10 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
#ifdef CONFIG_TIERED_MEMORY

struct memory_tier {
+ struct list_head list;
struct device dev;
nodemask_t nodelist;
+ int rank;
};

struct demotion_nodes {
@@ -2139,7 +2141,7 @@ static struct bus_type memory_tier_subsys = {
static void establish_migration_targets(void);

DEFINE_MUTEX(memory_tier_lock);
-static struct memory_tier *memory_tiers[MAX_MEMORY_TIERS];
+static LIST_HEAD(memory_tiers);

/*
* node_demotion[] examples:
@@ -2206,16 +2208,25 @@ static struct demotion_nodes *node_demotion __read_mostly;
static ssize_t nodelist_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
- int tier = dev->id;
+ struct memory_tier *memtier = to_memory_tier(dev);

return sysfs_emit(buf, "%*pbl\n",
- nodemask_pr_args(&memory_tiers[tier]->nodelist));
-
+ nodemask_pr_args(&memtier->nodelist));
}
static DEVICE_ATTR_RO(nodelist);

+static ssize_t rank_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct memory_tier *memtier = to_memory_tier(dev);
+
+ return sysfs_emit(buf, "%d\n", memtier->rank);
+}
+static DEVICE_ATTR_RO(rank);
+
static struct attribute *memory_tier_dev_attrs[] = {
&dev_attr_nodelist.attr,
+ &dev_attr_rank.attr,
NULL
};

@@ -2235,53 +2246,79 @@ static void memory_tier_device_release(struct device *dev)
kfree(tier);
}

-static int register_memory_tier(int tier)
+static void insert_memory_tier(struct memory_tier *memtier)
+{
+ struct list_head *ent;
+ struct memory_tier *tmp_memtier;
+
+ list_for_each(ent, &memory_tiers) {
+ tmp_memtier = list_entry(ent, struct memory_tier, list);
+ if (tmp_memtier->rank > memtier->rank) {
+ list_add_tail(&memtier->list, ent);
+ return;
+ }
+ }
+ list_add_tail(&memtier->list, &memory_tiers);
+}
+
+static struct memory_tier *register_memory_tier(unsigned int tier)
{
int error;
+ struct memory_tier *memtier;

- memory_tiers[tier] = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
- if (!memory_tiers[tier])
- return -ENOMEM;
+ if (tier >= MAX_MEMORY_TIERS)
+ return NULL;

- memory_tiers[tier]->dev.id = tier;
- memory_tiers[tier]->dev.bus = &memory_tier_subsys;
- memory_tiers[tier]->dev.release = memory_tier_device_release;
- memory_tiers[tier]->dev.groups = memory_tier_dev_groups;
- error = device_register(&memory_tiers[tier]->dev);
+ memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
+ if (!memtier)
+ return NULL;

+ memtier->dev.id = tier;
+ /*
+ * For now we only support hardcoded rank values,
+ * 100, 200 and 300, which have no special meaning.
+ */
+ memtier->rank = 100 + 100 * tier;
+ memtier->dev.bus = &memory_tier_subsys;
+ memtier->dev.release = memory_tier_device_release;
+ memtier->dev.groups = memory_tier_dev_groups;
+
+ insert_memory_tier(memtier);
+
+ error = device_register(&memtier->dev);
if (error) {
- put_device(&memory_tiers[tier]->dev);
- memory_tiers[tier] = NULL;
+ list_del(&memtier->list);
+ put_device(&memtier->dev);
+ return NULL;
}
-
- return error;
+ return memtier;
}

-static void unregister_memory_tier(int tier)
+static void unregister_memory_tier(struct memory_tier *memtier)
{
- device_unregister(&memory_tiers[tier]->dev);
- memory_tiers[tier] = NULL;
+ list_del(&memtier->list);
+ device_unregister(&memtier->dev);
}

static ssize_t
-max_tiers_show(struct device *dev, struct device_attribute *attr, char *buf)
+max_tier_show(struct device *dev, struct device_attribute *attr, char *buf)
{
return sysfs_emit(buf, "%d\n", MAX_MEMORY_TIERS);
}

-static DEVICE_ATTR_RO(max_tiers);
+static DEVICE_ATTR_RO(max_tier);

static ssize_t
-default_tier_show(struct device *dev, struct device_attribute *attr, char *buf)
+default_rank_show(struct device *dev, struct device_attribute *attr, char *buf)
{
- return sysfs_emit(buf, "%d\n", DEFAULT_MEMORY_TIER);
+ return sysfs_emit(buf, "%d\n", 100 + 100 * DEFAULT_MEMORY_TIER);
}

-static DEVICE_ATTR_RO(default_tier);
+static DEVICE_ATTR_RO(default_rank);

static struct attribute *memoty_tier_attrs[] = {
- &dev_attr_max_tiers.attr,
- &dev_attr_default_tier.attr,
+ &dev_attr_max_tier.attr,
+ &dev_attr_default_rank.attr,
NULL
};

@@ -2294,52 +2331,61 @@ static const struct attribute_group *memory_tier_attr_groups[] = {
NULL,
};

-static int __node_get_memory_tier(int node)
+static struct memory_tier *__node_get_memory_tier(int node)
{
- int tier;
+ struct memory_tier *memtier;

- for (tier = 0; tier < MAX_MEMORY_TIERS; tier++) {
- if (memory_tiers[tier] && node_isset(node, memory_tiers[tier]->nodelist))
- return tier;
+ list_for_each_entry(memtier, &memory_tiers, list) {
+ if (node_isset(node, memtier->nodelist))
+ return memtier;
}
+ return NULL;
+}

- return -1;
+static struct memory_tier *__get_memory_tier_from_id(int id)
+{
+ struct memory_tier *memtier;
+
+ list_for_each_entry(memtier, &memory_tiers, list) {
+ if (memtier->dev.id == id)
+ return memtier;
+ }
+ return NULL;
}

+
static void node_remove_from_memory_tier(int node)
{
- int tier;
+ struct memory_tier *memtier;

mutex_lock(&memory_tier_lock);

- tier = __node_get_memory_tier(node);
-
+ memtier = __node_get_memory_tier(node);
/*
* Remove node from tier, if tier becomes
* empty then unregister it to make it invisible
* in sysfs.
*/
- node_clear(node, memory_tiers[tier]->nodelist);
- if (nodes_empty(memory_tiers[tier]->nodelist))
- unregister_memory_tier(tier);
+ node_clear(node, memtier->nodelist);
+ if (nodes_empty(memtier->nodelist))
+ unregister_memory_tier(memtier);

establish_migration_targets();
-
mutex_unlock(&memory_tier_lock);
}

-int node_get_memory_tier(int node)
+int node_get_memory_tier_id(int node)
{
- int tier;
-
+ int tier = -1;
+ struct memory_tier *memtier;
/*
* Make sure memory tier is not unregistered
* while it is being read.
*/
mutex_lock(&memory_tier_lock);
-
- tier = __node_get_memory_tier(node);
-
+ memtier = __node_get_memory_tier(node);
+ if (memtier)
+ tier = memtier->dev.id;
mutex_unlock(&memory_tier_lock);

return tier;
@@ -2348,46 +2394,43 @@ int node_get_memory_tier(int node)
int __node_set_memory_tier(int node, int tier)
{
int ret = 0;
- /*
- * As register_memory_tier() for new tier can fail,
- * try it before modifying existing tier. register
- * tier makes tier visible in sysfs.
- */
- if (!memory_tiers[tier]) {
- ret = register_memory_tier(tier);
- if (ret) {
+ struct memory_tier *memtier;
+
+ memtier = __get_memory_tier_from_id(tier);
+ if (!memtier) {
+ memtier = register_memory_tier(tier);
+ if (!memtier) {
+ ret = -EINVAL;
goto out;
}
}
-
- node_set(node, memory_tiers[tier]->nodelist);
+ node_set(node, memtier->nodelist);
establish_migration_targets();
-
out:
return ret;
}

int node_reset_memory_tier(int node, int tier)
{
- int current_tier, ret = 0;
+ struct memory_tier *current_tier;
+ int ret = 0;

mutex_lock(&memory_tier_lock);

current_tier = __node_get_memory_tier(node);
- if (current_tier == tier)
+ if (!current_tier || current_tier->dev.id == tier)
goto out;

- if (current_tier != -1 )
- node_clear(node, memory_tiers[current_tier]->nodelist);
+ node_clear(node, current_tier->nodelist);

ret = __node_set_memory_tier(node, tier);

if (!ret) {
- if (nodes_empty(memory_tiers[current_tier]->nodelist))
+ if (nodes_empty(current_tier->nodelist))
unregister_memory_tier(current_tier);
} else {
/* reset it back to older tier */
- ret = __node_set_memory_tier(node, current_tier);
+ node_set(node, current_tier->nodelist);
}
out:
mutex_unlock(&memory_tier_lock);
@@ -2395,15 +2438,13 @@ int node_reset_memory_tier(int node, int tier)
return ret;
}

-int node_set_memory_tier(int node, int tier)
+int node_set_memory_tier_rank(int node, int rank)
{
- int current_tier, ret = 0;
-
- if (tier >= MAX_MEMORY_TIERS)
- return -EINVAL;
+ struct memory_tier *memtier;
+ int ret = 0;

mutex_lock(&memory_tier_lock);
- current_tier = __node_get_memory_tier(node);
+ memtier = __node_get_memory_tier(node);
/*
* if node is already part of the tier proceed with the
* current tier value, because we might want to establish
@@ -2411,15 +2452,17 @@ int node_set_memory_tier(int node, int tier)
* before it was made part of N_MEMORY, hence estabilish_migration_targets
* will have skipped this node.
*/
- if (current_tier != -1)
- tier = current_tier;
- ret = __node_set_memory_tier(node, tier);
+ if (memtier)
+ establish_migration_targets();
+ else {
+ /* For now rank value and tier value is same. */
+ ret = __node_set_memory_tier(node, rank);
+ }
mutex_unlock(&memory_tier_lock);

return ret;
}
-EXPORT_SYMBOL_GPL(node_set_memory_tier);
-
+EXPORT_SYMBOL_GPL(node_set_memory_tier_rank);

/**
* next_demotion_node() - Get the next node in the demotion path
@@ -2504,6 +2547,8 @@ static void disable_all_migrate_targets(void)
*/
static void establish_migration_targets(void)
{
+ struct list_head *ent;
+ struct memory_tier *memtier;
struct demotion_nodes *nd;
int tier, target = NUMA_NO_NODE, node;
int distance, best_distance;
@@ -2518,19 +2563,15 @@ static void establish_migration_targets(void)
best_distance = -1;
nd = &node_demotion[node];

- tier = __node_get_memory_tier(node);
+ memtier = __node_get_memory_tier(node);
+ if (!memtier || list_is_last(&memtier->list, &memory_tiers))
+ continue;
/*
- * Find next tier to demote.
+ * Get the next memtier to find the demotion node list.
*/
- while (++tier < MAX_MEMORY_TIERS) {
- if (memory_tiers[tier])
- break;
- }
+ memtier = list_next_entry(memtier, list);

- if (tier >= MAX_MEMORY_TIERS)
- continue;
-
- nodes_andnot(used, node_states[N_MEMORY], memory_tiers[tier]->nodelist);
+ nodes_andnot(used, node_states[N_MEMORY], memtier->nodelist);

/*
* Find all the nodes in the memory tier node list of same best distance.
@@ -2588,7 +2629,7 @@ static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
* registered, we will continue to use that for the new memory
* we are adding here.
*/
- node_set_memory_tier(arg->status_change_nid, DEFAULT_MEMORY_TIER);
+ node_set_memory_tier_rank(arg->status_change_nid, DEFAULT_MEMORY_RANK);
break;
}

@@ -2668,6 +2709,7 @@ subsys_initcall(numa_init_sysfs);
static int __init memory_tier_init(void)
{
int ret;
+ struct memory_tier *memtier;

ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups);
if (ret)
@@ -2677,14 +2719,14 @@ static int __init memory_tier_init(void)
* Register only default memory tier to hide all empty
* memory tier from sysfs.
*/
- ret = register_memory_tier(DEFAULT_MEMORY_TIER);
- if (ret)
+ memtier = register_memory_tier(DEFAULT_MEMORY_TIER);
+ if (!memtier)
panic("%s() failed to register memory tier: %d\n", __func__, ret);

/*
* CPU only nodes are not part of memoty tiers.
*/
- memory_tiers[DEFAULT_MEMORY_TIER]->nodelist = node_states[N_MEMORY];
+ memtier->nodelist = node_states[N_MEMORY];
migrate_on_reclaim_init();

return 0;
--
2.36.1
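
As an illustration of the rank-ordered list that insert_memory_tier()
maintains in the patch above, here is a minimal userspace sketch. It is
not the kernel code: struct tier, tier_insert() and tier_next_lower() are
hypothetical names, and a plain singly linked list stands in for the
kernel's list_head. It assumes, as in this patch, that tiers are kept in
ascending rank order and that demotion moves to the next list entry.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the patch's struct memory_tier (hypothetical). */
struct tier {
	int id;            /* memtier device id */
	int rank;          /* ordering key; the list is sorted ascending */
	struct tier *next;
};

/*
 * Insert 't' so that the list stays sorted by ascending rank.
 * A tier whose rank equals an existing one is placed after it,
 * matching the '>' comparison used by insert_memory_tier().
 */
static void tier_insert(struct tier **head, struct tier *t)
{
	struct tier **pos = head;

	assert(head && t);
	while (*pos && (*pos)->rank <= t->rank)
		pos = &(*pos)->next;
	t->next = *pos;
	*pos = t;
}

/* Tier that follows 't' in rank order, or NULL if 't' is the last one. */
static struct tier *tier_next_lower(const struct tier *t)
{
	assert(t);
	return t->next;
}
```

With ranks 100, 200 and 300 already present, a tier of rank 150 lands
between the first two without changing any device id, which is the kind
of flexibility the rank approach is meant to allow.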


2022-05-28 20:30:51

by Aneesh Kumar K.V

Subject: [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers

From: Jagdish Gediya <[email protected]>

In the current kernel, memory tiers are defined implicitly via a
demotion path relationship between NUMA nodes, which is created
during the kernel initialization and updated when a NUMA node is
hot-added or hot-removed. The current implementation puts all
nodes with CPU into the top tier, and builds the tier hierarchy
tier-by-tier by establishing the per-node demotion targets based
on the distances between nodes.

This current memory tier kernel interface needs to be improved for
several important use cases:

The current tier initialization code always initializes
each memory-only NUMA node into a lower tier. But a memory-only
NUMA node may have a high performance memory device (e.g. a DRAM
device attached via CXL.mem or a DRAM-backed memory-only node on
a virtual machine) and should be put into a higher tier.

The current tier hierarchy always puts CPU nodes into the top
tier. But on a system with HBM or GPU devices, the
memory-only NUMA nodes mapping these devices should be in the
top tier, and DRAM nodes with CPUs are better to be placed into the
next lower tier.

With the current kernel, a higher-tier node can only be demoted to
selected nodes on the next lower tier as defined by the demotion path,
not to any other node from any lower tier. This strict, hard-coded
demotion order does not work in all use cases (e.g. some use cases may
want to allow cross-socket demotion to another node in the same
demotion tier as a fallback when the preferred demotion node is out of
space). This demotion order is also inconsistent with the page
allocation fallback order when all the nodes in a higher tier are
out of space: the page allocation can fall back to any node from
any lower tier, whereas the demotion order doesn't allow that.

The current kernel also doesn't provide any interfaces for
userspace to learn about the memory tier hierarchy in order to
optimize its memory allocations.

This patch series addresses the above by defining memory tiers explicitly.

This patch adds the sysfs interface below, which is read-only and
can be used to read the nodes available in a specific tier.

/sys/devices/system/memtier/memtierN/nodelist

Tier 0 is the highest tier, while tier MAX_MEMORY_TIERS - 1 is the
lowest tier. The absolute value of a tier id number has no specific
meaning. What matters is the relative order of the tier id numbers.

All the tiered memory code is guarded by CONFIG_TIERED_MEMORY.
The default number of memory tiers is MAX_MEMORY_TIERS (3). All
nodes are by default assigned to DEFAULT_MEMORY_TIER (1).

The default memory tier can be read from:
/sys/devices/system/memtier/default_tier

The maximum number of memory tiers can be read from:
/sys/devices/system/memtier/max_tiers

This patch implements the RFC spec sent by Wei Xu <[email protected]> at [1].

[1] https://lore.kernel.org/linux-mm/CAAPL-u-DGLcKRVDnChN9ZhxPkfxQvz9Sb93kVoX_4J2oiJSkUw@mail.gmail.com/

Signed-off-by: Jagdish Gediya <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
include/linux/migrate.h | 38 ++++++++----
mm/Kconfig | 11 ++++
mm/migrate.c | 134 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 170 insertions(+), 13 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 90e75d5a54d6..0ec653623565 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -47,17 +47,8 @@ void folio_migrate_copy(struct folio *newfolio, struct folio *folio);
int folio_migrate_mapping(struct address_space *mapping,
struct folio *newfolio, struct folio *folio, int extra_count);

-extern bool numa_demotion_enabled;
-extern void migrate_on_reclaim_init(void);
-#ifdef CONFIG_HOTPLUG_CPU
-extern void set_migration_target_nodes(void);
-#else
-static inline void set_migration_target_nodes(void) {}
-#endif
#else

-static inline void set_migration_target_nodes(void) {}
-
static inline void putback_movable_pages(struct list_head *l) {}
static inline int migrate_pages(struct list_head *l, new_page_t new,
free_page_t free, unsigned long private, enum migrate_mode mode,
@@ -82,7 +73,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
return -ENOSYS;
}

-#define numa_demotion_enabled false
#endif /* CONFIG_MIGRATION */

#ifdef CONFIG_COMPACTION
@@ -172,15 +162,37 @@ struct migrate_vma {
int migrate_vma_setup(struct migrate_vma *args);
void migrate_vma_pages(struct migrate_vma *migrate);
void migrate_vma_finalize(struct migrate_vma *migrate);
-int next_demotion_node(int node);
+#endif /* CONFIG_MIGRATION */
+
+#ifdef CONFIG_TIERED_MEMORY
+
+extern bool numa_demotion_enabled;
+#define DEFAULT_MEMORY_TIER 1
+
+enum memory_tier_type {
+ MEMORY_TIER_HBM_GPU,
+ MEMORY_TIER_DRAM,
+ MEMORY_TIER_PMEM,
+ MAX_MEMORY_TIERS
+};

-#else /* CONFIG_MIGRATION disabled: */
+int next_demotion_node(int node);

+extern void migrate_on_reclaim_init(void);
+#ifdef CONFIG_HOTPLUG_CPU
+extern void set_migration_target_nodes(void);
+#else
+static inline void set_migration_target_nodes(void) {}
+#endif
+#else
+#define numa_demotion_enabled false
static inline int next_demotion_node(int node)
{
return NUMA_NO_NODE;
}

-#endif /* CONFIG_MIGRATION */
+static inline void set_migration_target_nodes(void) {}
+static inline void migrate_on_reclaim_init(void) {}
+#endif /* CONFIG_TIERED_MEMORY */

#endif /* _LINUX_MIGRATE_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 034d87953600..7bfbddef46ed 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -258,6 +258,17 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
config ARCH_ENABLE_THP_MIGRATION
bool

+config TIERED_MEMORY
+ bool "Support for explicit memory tiers"
+ def_bool y
+ depends on MIGRATION && NUMA
+ help
+ Support to split nodes into memory tiers explicitly and
+ to demote pages on reclaim to lower tiers. This option
+ also exposes sysfs interface to read nodes available in
+ specific tier and to move specific node among different
+ possible tiers.
+
config HUGETLB_PAGE_SIZE_VARIABLE
def_bool n
help
diff --git a/mm/migrate.c b/mm/migrate.c
index 6c31ee1e1c9b..f28ee93fb017 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2118,6 +2118,113 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
#endif /* CONFIG_NUMA_BALANCING */
#endif /* CONFIG_NUMA */

+#ifdef CONFIG_TIERED_MEMORY
+
+struct memory_tier {
+ struct device dev;
+ nodemask_t nodelist;
+};
+
+#define to_memory_tier(device) container_of(device, struct memory_tier, dev)
+
+static struct bus_type memory_tier_subsys = {
+ .name = "memtier",
+ .dev_name = "memtier",
+};
+
+static struct memory_tier *memory_tiers[MAX_MEMORY_TIERS];
+
+static ssize_t nodelist_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ int tier = dev->id;
+
+ return sysfs_emit(buf, "%*pbl\n",
+ nodemask_pr_args(&memory_tiers[tier]->nodelist));
+
+}
+static DEVICE_ATTR_RO(nodelist);
+
+static struct attribute *memory_tier_dev_attrs[] = {
+ &dev_attr_nodelist.attr,
+ NULL
+};
+
+static const struct attribute_group memory_tier_dev_group = {
+ .attrs = memory_tier_dev_attrs,
+};
+
+static const struct attribute_group *memory_tier_dev_groups[] = {
+ &memory_tier_dev_group,
+ NULL
+};
+
+static void memory_tier_device_release(struct device *dev)
+{
+ struct memory_tier *tier = to_memory_tier(dev);
+
+ kfree(tier);
+}
+
+static int register_memory_tier(int tier)
+{
+ int error;
+
+ memory_tiers[tier] = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
+ if (!memory_tiers[tier])
+ return -ENOMEM;
+
+ memory_tiers[tier]->dev.id = tier;
+ memory_tiers[tier]->dev.bus = &memory_tier_subsys;
+ memory_tiers[tier]->dev.release = memory_tier_device_release;
+ memory_tiers[tier]->dev.groups = memory_tier_dev_groups;
+ error = device_register(&memory_tiers[tier]->dev);
+
+ if (error) {
+ put_device(&memory_tiers[tier]->dev);
+ memory_tiers[tier] = NULL;
+ }
+
+ return error;
+}
+
+static void unregister_memory_tier(int tier)
+{
+ device_unregister(&memory_tiers[tier]->dev);
+ memory_tiers[tier] = NULL;
+}
+
+static ssize_t
+max_tiers_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+ return sysfs_emit(buf, "%d\n", MAX_MEMORY_TIERS);
+}
+
+static DEVICE_ATTR_RO(max_tiers);
+
+static ssize_t
+default_tier_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+ return sysfs_emit(buf, "%d\n", DEFAULT_MEMORY_TIER);
+}
+
+static DEVICE_ATTR_RO(default_tier);
+
+static struct attribute *memoty_tier_attrs[] = {
+ &dev_attr_max_tiers.attr,
+ &dev_attr_default_tier.attr,
+ NULL
+};
+
+static const struct attribute_group memory_tier_attr_group = {
+ .attrs = memoty_tier_attrs,
+};
+
+static const struct attribute_group *memory_tier_attr_groups[] = {
+ &memory_tier_attr_group,
+ NULL,
+};
+
/*
* node_demotion[] example:
*
@@ -2569,3 +2676,30 @@ static int __init numa_init_sysfs(void)
}
subsys_initcall(numa_init_sysfs);
#endif
+
+static int __init memory_tier_init(void)
+{
+ int ret;
+
+ ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups);
+ if (ret)
+ panic("%s() failed to register subsystem: %d\n", __func__, ret);
+
+ /*
+ * Register only default memory tier to hide all empty
+ * memory tier from sysfs.
+ */
+ ret = register_memory_tier(DEFAULT_MEMORY_TIER);
+ if (ret)
+ panic("%s() failed to register memory tier: %d\n", __func__, ret);
+
+ /*
+ * CPU only nodes are not part of memoty tiers.
+ */
+ memory_tiers[DEFAULT_MEMORY_TIER]->nodelist = node_states[N_MEMORY];
+
+ return 0;
+}
+subsys_initcall(memory_tier_init);
+
+#endif /* CONFIG_TIERED_MEMORY */
--
2.36.1
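
A userspace tool that reads /sys/devices/system/memtier/memtierN/nodelist
gets the kernel's bitmap-list format produced by "%*pbl" (for example
"0-1,3"). The following is a rough sketch of how such a string could be
decoded. parse_nodelist() is a hypothetical helper, not part of this
series, and it assumes node ids below 64 for brevity.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Parse a bitmap-list string such as "0-1,3\n" into a 64-bit node mask.
 * Returns 0 on success, -1 on malformed input or a node id >= 64.
 * An empty list ("" or "\n") yields an empty mask.
 */
static int parse_nodelist(const char *s, unsigned long long *mask)
{
	assert(s && mask);
	*mask = 0;

	while (*s && *s != '\n') {
		char *end;
		long lo, hi, n;

		/* First number of the range (or a single node id). */
		lo = hi = strtol(s, &end, 10);
		if (end == s)
			return -1;

		/* Optional "-hi" part of a range. */
		if (*end == '-') {
			s = end + 1;
			hi = strtol(s, &end, 10);
			if (end == s)
				return -1;
		}
		if (lo < 0 || hi < lo || hi >= 64)
			return -1;

		for (n = lo; n <= hi; n++)
			*mask |= 1ULL << n;

		/* Ranges are separated by commas. */
		if (*end == ',')
			end++;
		else if (*end && *end != '\n')
			return -1;
		s = end;
	}
	return 0;
}
```

A caller would read the sysfs file into a buffer and pass it to
parse_nodelist(); bit n of the resulting mask is set when node n belongs
to that tier.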


2022-05-29 04:31:49

by Huang, Ying

Subject: Re: RFC: Memory Tiering Kernel Interfaces (v3)

On Fri, 2022-05-27 at 09:30 -0700, Wei Xu wrote:
> On Fri, May 27, 2022 at 6:41 AM Aneesh Kumar K V
> <[email protected]> wrote:
> >
> > On 5/27/22 2:52 AM, Wei Xu wrote:
> >
> > >    The order of memory tiers is determined by their rank values, not by
> > >    their memtier device names.
> > >
> > >    - /sys/devices/system/memtier/possible
> > >
> > >      Format: ordered list of "memtier(rank)"
> > >      Example: 0(64), 1(128), 2(192)
> > >
> > >      Read-only. When read, list all available memory tiers and their
> > >      associated ranks, ordered by the rank values (from the highest
> > >       tier to the lowest tier).
> > >
> >
> > Did we discuss the need for this? I haven't done this in the patch
> > series I sent across.
>
> The "possible" file is only needed if we decide to hide the
> directories of memtiers that have no nodes. We can remove this
> interface and always show all memtier directories to keep things
> simpler.

When we discussed this offline, Tim Chen pointed out that with the proposed
interface, it's inconvenient to find the position of a given memory tier
among all memory tiers. We must sort the "rank" values of all memory tiers
to know that. The "possible" file can be used for that. Although the
"possible" file can be generated with a shell script, it's more convenient
to show it directly.

Another way to address the issue is to add memtierN/pos for each memory
tier, as suggested by Tim. It's read-only and shows the position of
"memtierN" among all memory tiers. It's even better to show the position
relative to the default memory tier (DRAM with CPU). That is, the
position of the DRAM memory tier is 0.

Unlike the memory tier device ID or rank, the position is relative and
dynamic.
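
To make the idea concrete, here is a small userspace sketch of how such
a relative position could be derived from the rank values alone. The
names struct tier_info, tier_index() and tier_pos() are hypothetical.
The sketch assumes that a smaller rank means a higher tier (as in the v4
patches, where the tier list is kept in ascending rank order) and that
tiers above the default get negative positions; both are assumptions,
not something the proposal fixes.

```c
#include <assert.h>
#include <stddef.h>

struct tier_info {
	int id;    /* N in memtierN */
	int rank;  /* value read from memtierN/rank */
};

/* Number of tiers ranked before tier 'id', or -1 if 'id' is unknown. */
static int tier_index(const struct tier_info *t, size_t n, int id)
{
	size_t i, self = n;
	int idx = 0;

	for (i = 0; i < n; i++)
		if (t[i].id == id)
			self = i;
	if (self == n)
		return -1;

	for (i = 0; i < n; i++)
		if (t[i].rank < t[self].rank)
			idx++;
	return idx;
}

/*
 * Store in '*pos' the position of tier 'id' relative to 'default_id':
 * 0 for the default tier, negative for tiers ranked before it,
 * positive for tiers ranked after it. Returns -1 for unknown ids.
 */
static int tier_pos(const struct tier_info *t, size_t n,
		    int id, int default_id, int *pos)
{
	int a, b;

	assert(t && pos);
	a = tier_index(t, n, id);
	b = tier_index(t, n, default_id);
	if (a < 0 || b < 0)
		return -1;
	*pos = a - b;
	return 0;
}
```

With ranks 100, 200 and 300 for memtier0, memtier1 and memtier2 and
memtier1 as the default, the positions come out as -1, 0 and 1. The
values change whenever a tier is added or removed, which is what makes
the position dynamic.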

Best Regards,
Huang, Ying



2022-05-30 05:22:16

by Oliver Sang

Subject: [mm/demotion] 8ebccd60c2: BUG:sleeping_function_called_from_invalid_context_at_mm/compaction.c



Greetings,

FYI, we noticed the following commit (built with gcc-11):

commit: 8ebccd60c2db6beefef2f39b05a95024be0c39eb ("[RFC PATCH v4 3/7] mm/demotion: Build demotion targets based on explicit memory tiers")
url: https://github.com/intel-lab-lkp/linux/commits/Aneesh-Kumar-K-V/mm-demotion-Add-support-for-explicit-memory-tiers/20220527-212536
base: https://git.kernel.org/cgit/linux/kernel/git/gregkh/driver-core.git b232b02bf3c205b13a26dcec08e53baddd8e59ed
patch link: https://lore.kernel.org/linux-mm/[email protected]

in testcase: boot

on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):



If you fix the issue, kindly add the following tag:
Reported-by: kernel test robot <[email protected]>


[ 2.576581][ T1] debug_vm_pgtable: [debug_vm_pgtable ]: Validating architecture page table helpers
[ 2.584367][ T1] BUG: sleeping function called from invalid context at mm/compaction.c:540
[ 2.585275][ T1] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0
[ 2.586166][ T1] preempt_count: 1, expected: 0
[ 2.586668][ T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.18.0-rc5-00059-g8ebccd60c2db #1
[ 2.587562][ T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014
[ 2.588577][ T1] Call Trace:
[ 2.588948][ T1] <TASK>
[ 2.589284][ T1] dump_stack_lvl+0x34/0x44
[ 2.589765][ T1] __might_resched+0x134/0x149
[ 2.590253][ T1] isolate_freepages_block+0xe6/0x2d3
[ 2.590794][ T1] isolate_freepages_range+0xc5/0x118
[ 2.591342][ T1] alloc_contig_range+0x2dd/0x350
[ 2.591858][ T1] ? alloc_contig_pages+0x170/0x194
[ 2.592384][ T1] alloc_contig_pages+0x170/0x194
[ 2.592896][ T1] init_args+0x3d0/0x44e
[ 2.593345][ T1] ? init_args+0x44e/0x44e
[ 2.593816][ T1] debug_vm_pgtable+0x46/0x809
[ 2.594312][ T1] ? alloc_inode+0x37/0x8e
[ 2.594774][ T1] ? init_args+0x44e/0x44e
[ 2.595235][ T1] do_one_initcall+0x83/0x187
[ 2.595729][ T1] do_initcalls+0xc6/0xdf
[ 2.596190][ T1] kernel_init_freeable+0x10d/0x13c
[ 2.596721][ T1] ? rest_init+0xcd/0xcd
[ 2.597170][ T1] kernel_init+0x16/0x11a
[ 2.597636][ T1] ret_from_fork+0x22/0x30
[ 2.598097][ T1] </TASK>
[ 2.626547][ T1] ------------[ cut here ]------------
[ 2.627157][ T1] initcall debug_vm_pgtable+0x0/0x809 returned with preemption imbalance
[ 2.628019][ T1] WARNING: CPU: 0 PID: 1 at init/main.c:1311 do_one_initcall+0x140/0x187
[ 2.628863][ T1] Modules linked in:
[ 2.629280][ T1] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G W 5.18.0-rc5-00059-g8ebccd60c2db #1
[ 2.630295][ T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014
[ 2.631306][ T1] RIP: 0010:do_one_initcall+0x140/0x187
[ 2.631867][ T1] Code: 00 00 48 c7 c6 ca b6 2c 82 48 89 e7 e8 80 ca 44 00 fb 80 3c 24 00 74 14 48 89 e2 48 89 ee 48 c7 c7 df b6 2c 82 e8 b3 d6 a2 00 <0f> 0b 48 8b 44 24 40 65 48 2b 04 25 28 00 00 00 74 05 e8 d8 cd a4
[ 2.633713][ T1] RSP: 0000:ffffc90000013ea8 EFLAGS: 00010286
[ 2.634312][ T1] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000003
[ 2.635123][ T1] RDX: 0000000000000216 RSI: 0000000000000001 RDI: 0000000000000001
[ 2.635932][ T1] RBP: ffffffff82f3b694 R08: 0000000000000000 R09: 0000000000000019
[ 2.636735][ T1] R10: 0000000000000000 R11: 0000000074696e69 R12: 0000000000000000
[ 2.637538][ T1] R13: ffff88810cba0000 R14: 0000000000000000 R15: 0000000000000000
[ 2.638353][ T1] FS: 0000000000000000(0000) GS:ffff88842fc00000(0000) knlGS:0000000000000000
[ 2.639253][ T1] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2.639901][ T1] CR2: ffff88843ffff000 CR3: 0000000002612000 CR4: 00000000000406f0
[ 2.640711][ T1] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2.641526][ T1] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 2.642341][ T1] Call Trace:
[ 2.642707][ T1] <TASK>
[ 2.643051][ T1] do_initcalls+0xc6/0xdf
[ 2.643512][ T1] kernel_init_freeable+0x10d/0x13c
[ 2.644045][ T1] ? rest_init+0xcd/0xcd
[ 2.644498][ T1] kernel_init+0x16/0x11a
[ 2.644956][ T1] ret_from_fork+0x22/0x30
[ 2.645417][ T1] </TASK>
[ 2.645764][ T1] ---[ end trace 0000000000000000 ]---



To reproduce:

# build kernel
cd linux
cp config-5.18.0-rc5-00059-g8ebccd60c2db .config
make HOSTCC=gcc-11 CC=gcc-11 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage modules
make HOSTCC=gcc-11 CC=gcc-11 ARCH=x86_64 INSTALL_MOD_PATH=<mod-install-dir> modules_install
cd <mod-install-dir>
find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz


git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email

# if you come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.



--
0-DAY CI Kernel Test Service
https://01.org/lkp



Attachments:
(No filename) (5.32 kB)
config-5.18.0-rc5-00059-g8ebccd60c2db (125.00 kB)
job-script (4.79 kB)
dmesg.xz (13.55 kB)

2022-06-01 18:35:05

by Aneesh Kumar K.V

Subject: Re: [RFC PATCH v4 4/7] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM

On 6/1/22 11:59 AM, Bharata B Rao wrote:
> On 5/27/2022 5:55 PM, Aneesh Kumar K.V wrote:
>> From: Jagdish Gediya <[email protected]>
>>
>> By default, all nodes are assigned to DEFAULT_MEMORY_TIER which
>> is memory tier 1 which is designated for nodes with DRAM, so it
>> is not the right tier for dax devices.
>>
>> Set dax kmem device node's tier to MEMORY_TIER_PMEM, In future,
>> support should be added to distinguish the dax-devices which should
>> not be MEMORY_TIER_PMEM and right memory tier should be set for them.
>>
>> Signed-off-by: Jagdish Gediya <[email protected]>
>> Signed-off-by: Aneesh Kumar K.V <[email protected]>
>> ---
>> drivers/dax/kmem.c | 4 ++++
>> mm/migrate.c | 2 ++
>> 2 files changed, 6 insertions(+)
>>
>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>> index a37622060fff..991782aa2448 100644
>> --- a/drivers/dax/kmem.c
>> +++ b/drivers/dax/kmem.c
>> @@ -11,6 +11,7 @@
>> #include <linux/fs.h>
>> #include <linux/mm.h>
>> #include <linux/mman.h>
>> +#include <linux/migrate.h>
>> #include "dax-private.h"
>> #include "bus.h"
>>
>> @@ -147,6 +148,9 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>
>> dev_set_drvdata(dev, data);
>>
>> +#ifdef CONFIG_TIERED_MEMORY
>> + node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
>> +#endif
>
> I was experimenting with this patchset and found this behaviour.
> Here's what I did:
>
> Boot a KVM guest with vNVDIMM device which ends up with device_dax
> driver by default.
>
> Use it as RAM by binding it to dax kmem driver. It now appears as
> RAM with a new NUMA node that is put to memtier1 (the existing tier
> where DRAM already exists)
>

That should have placed it in memtier2.

> I can move it to memtier2 (MEMORY_RANK_PMEM) manually, but isn't
> that expected to happen automatically when a node with dax kmem
> device comes up?
>

This can happen if the same NUMA node was added to memtier1 before the
dax kmem driver initialized the pmem memory. Can you check, before the
above node_set_memory_tier_rank() call, whether the specific NUMA node is
already part of any memory tier?
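
The behaviour in question can be modelled in a few lines of userspace C.
This is only a simplified sketch of the "keep the current tier if the
node already has one" rule in node_set_memory_tier(); struct tier_map,
tier_map_init() and set_tier_if_unset() are hypothetical names, and no
locking or demotion-target rebuild is modelled.

```c
#include <assert.h>

#define MAX_NODES 8
#define NO_TIER   (-1)

/* tier_of[n] is the tier that node n belongs to, or NO_TIER. */
struct tier_map {
	int tier_of[MAX_NODES];
};

static void tier_map_init(struct tier_map *m)
{
	int i;

	assert(m);
	for (i = 0; i < MAX_NODES; i++)
		m->tier_of[i] = NO_TIER;
}

/*
 * Assign 'tier' to 'node' only if the node has no tier yet.
 * A node that is already in a tier keeps it, so a later request
 * (e.g. from a driver probe) has no effect on the placement.
 * Returns the tier the node ends up in.
 */
static int set_tier_if_unset(struct tier_map *m, int node, int tier)
{
	assert(m && node >= 0 && node < MAX_NODES);
	if (m->tier_of[node] == NO_TIER)
		m->tier_of[node] = tier;
	return m->tier_of[node];
}
```

If the hotplug path has already put the new node into tier 1, a later
request for tier 2 from the kmem probe path returns 1 in this model,
which matches the memtier1 placement reported above.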

Thank you for testing the patchset.
-aneesh


2022-06-01 18:40:57

by Bharata B Rao

Subject: Re: [RFC PATCH v4 4/7] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM

On 5/27/2022 5:55 PM, Aneesh Kumar K.V wrote:
> From: Jagdish Gediya <[email protected]>
>
> By default, all nodes are assigned to DEFAULT_MEMORY_TIER which
> is memory tier 1 which is designated for nodes with DRAM, so it
> is not the right tier for dax devices.
>
> Set dax kmem device node's tier to MEMORY_TIER_PMEM, In future,
> support should be added to distinguish the dax-devices which should
> not be MEMORY_TIER_PMEM and right memory tier should be set for them.
>
> Signed-off-by: Jagdish Gediya <[email protected]>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> ---
> drivers/dax/kmem.c | 4 ++++
> mm/migrate.c | 2 ++
> 2 files changed, 6 insertions(+)
>
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index a37622060fff..991782aa2448 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -11,6 +11,7 @@
> #include <linux/fs.h>
> #include <linux/mm.h>
> #include <linux/mman.h>
> +#include <linux/migrate.h>
> #include "dax-private.h"
> #include "bus.h"
>
> @@ -147,6 +148,9 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>
> dev_set_drvdata(dev, data);
>
> +#ifdef CONFIG_TIERED_MEMORY
> + node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
> +#endif

I was experimenting with this patchset and found this behaviour.
Here's what I did:

Boot a KVM guest with vNVDIMM device which ends up with device_dax
driver by default.

Use it as RAM by binding it to dax kmem driver. It now appears as
RAM with a new NUMA node that is put to memtier1 (the existing tier
where DRAM already exists)

I can move it to memtier2 (MEMORY_RANK_PMEM) manually, but isn't
that expected to happen automatically when a node with dax kmem
device comes up?

Regards,
Bharata.

2022-06-01 19:54:44

by Huang, Ying

Subject: Re: RFC: Memory Tiering Kernel Interfaces (v3)

On Mon, 2022-05-30 at 13:50 +0100, Jonathan Cameron wrote:
> On Sun, 29 May 2022 12:31:30 +0800
> Ying Huang <[email protected]> wrote:
>
> > On Fri, 2022-05-27 at 09:30 -0700, Wei Xu wrote:
> > > On Fri, May 27, 2022 at 6:41 AM Aneesh Kumar K V
> > > <[email protected]> wrote:
> > > >
> > > > On 5/27/22 2:52 AM, Wei Xu wrote:
> > > >   
> > > >
> > > >
> > > >
> > > > >    The order of memory tiers is determined by their rank values, not by
> > > > >    their memtier device names.
> > > > >
> > > > >    - /sys/devices/system/memtier/possible
> > > > >
> > > > >      Format: ordered list of "memtier(rank)"
> > > > >      Example: 0(64), 1(128), 2(192)
> > > > >
> > > > >      Read-only. When read, list all available memory tiers and their
> > > > >      associated ranks, ordered by the rank values (from the highest
> > > > >       tier to the lowest tier).
> > > > >   
> > > > >
> > > > >
> > > > >
> > > >
> > > > Did we discuss the need for this? I haven't done this in the patch
> > > > series I sent across.
> > >
> > > The "possible" file is only needed if we decide to hide the
> > > directories of memtiers that have no nodes. We can remove this
> > > interface and always show all memtier directories to keep things
> > > simpler.
> >
> > When discussed offline, Tim Chen pointed out that with the proposed
> > interface, it's inconvenient to know the position of a given memory tier
> > in all memory tiers. We must sort "rank" of all memory tiers to know
> > that. "possible" file can be used for that. Although "possible" file
> > can be generated with a shell script, it's more convenient to show it
> > directly.
> >
> > Another way to address the issue is to add memtierN/pos for each memory
> > tier as suggested by Tim. It's readonly and will show position of
> > "memtierN" in all memory tiers. It's even better to show the relative
> > position to the default memory tier (DRAM with CPU). That is, the
> > position of DRAM memory tier is 0.
> >
> > Unlike memory tier device ID or rank, the position is relative and
> > dynamic.
>
> Hi,
>
> I'm unconvinced. This is better done with a shell script than
> by adding ABI we'll have to live with for ever..
>
> I'm no good at shell scripting but this does the job
> grep "" tier*/rank | sort -n -k 2 -t :
>
> tier2/rank:50
> tier0/rank:100
> tier1/rank:200
> tier3/rank:240
>
> I'm sure someone more knowledgeable will do it in a simpler fashion still.

I am OK with leaving this to be added later if we find that it's useful.

Best Regards,
Huang, Ying

> Jonathan
>
> >
> > Best Regards,
> > Huang, Ying
> >
> >
>
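
The sorting one-liner quoted above can be extended to print each tier's
position explicitly, which is roughly what the "possible" file or a
memtierN/pos attribute would report. The following is only a minimal sketch:
it assumes each tier directory holds a readable "rank" file, it orders tiers
by ascending rank as the one-liner does, and the function name
list_tier_positions and its BASE argument are hypothetical, not part of any
proposed interface.

```shell
# list_tier_positions BASE
# Print "<position> <tier> <rank>" for every BASE/*/rank file, ordered by
# ascending rank. BASE is a parameter (an assumption of this sketch) so the
# function can be tried on a scratch directory instead of real sysfs.
list_tier_positions() {
    base=$1
    for f in "$base"/*/rank; do
        # Skip the unexpanded glob pattern when no rank file exists.
        [ -r "$f" ] || continue
        tier=${f%/rank}      # strip the trailing "/rank"
        tier=${tier##*/}     # keep only the tier directory name
        printf '%s %s\n' "$(cat "$f")" "$tier"
    done | sort -n -k 1,1 | awk '{ print NR - 1, $2, $1 }'
}
```

Positions here are absolute and start at 0. Showing them relative to the
default tier, as suggested in the thread, would only require subtracting the
default tier's position from each line.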



2022-06-02 08:26:12

by Huang, Ying

Subject: Re: [RFC PATCH v4 5/7] mm/demotion: Add support to associate rank with memory tier

On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
> The rank approach allows us to keep memory tier device IDs stable even if there
> is a need to change the tier ordering among different memory tiers. e.g. DRAM
> nodes with CPUs will always be on memtier1, no matter how many tiers are higher
> or lower than these nodes. A new memory tier can be inserted into the tier
> hierarchy for a new set of nodes without affecting the node assignment of any
> existing memtier, provided that there is enough gap in the rank values for the
> new memtier.
>
> The absolute value of "rank" of a memtier doesn't necessarily carry any meaning.
> Its value relative to other memtiers decides the level of this memtier in the tier
> hierarchy.
>
> For now, this patch supports hardcoded rank values, which are 100, 200 and 300
> for memory tiers 0, 1 and 2 respectively.
>
> Below is the sysfs interface to read the rank value of a memory tier:
> /sys/devices/system/memtier/memtierN/rank
>
> This interface is read-only for now. Write support can be added when there is
> a need for more memory tiers (> 3) with a flexible ordering requirement among
> them. Rank can be used there, because rank, not the memory tier device ID, now
> decides the memory tier ordering.
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> ---
>  drivers/base/node.c | 5 +-
>  drivers/dax/kmem.c | 2 +-
>  include/linux/migrate.h | 17 ++--
>  mm/migrate.c | 218 ++++++++++++++++++++++++----------------
>  4 files changed, 144 insertions(+), 98 deletions(-)
>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index cf4a58446d8c..892f7c23c94e 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -567,8 +567,11 @@ static ssize_t memtier_show(struct device *dev,
>   char *buf)
>  {
>   int node = dev->id;
> + int tier_index = node_get_memory_tier_id(node);
>  
> - return sysfs_emit(buf, "%d\n", node_get_memory_tier(node));
> + if (tier_index != -1)
> + return sysfs_emit(buf, "%d\n", tier_index);
> + return 0;
>  }
>  
>  static ssize_t memtier_store(struct device *dev,
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 991782aa2448..79953426ddaf 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -149,7 +149,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>   dev_set_drvdata(dev, data);
>  
>  #ifdef CONFIG_TIERED_MEMORY
> - node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
> + node_set_memory_tier_rank(numa_node, MEMORY_RANK_PMEM);

I think that we can work with the memory tier ID inside the kernel?

Best Regards,
Huang, Ying


[snip]
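
The commit message above says a new tier can be inserted between two existing
ones "provided that there is enough gap in the rank values". As a rough
illustration of that condition, the sketch below picks the midpoint between
two neighbouring ranks and fails when no integer rank fits between them. The
helper name rank_between is hypothetical, and choosing the midpoint is only
one possible policy, not something the patch defines.

```shell
# rank_between LOWER UPPER
# Print a rank strictly between LOWER and UPPER (their midpoint), or return 1
# when the two ranks leave no room for a new tier.
rank_between() {
    lower=$1
    upper=$2
    # A gap of at least 2 is needed for an integer strictly in between.
    if [ $((upper - lower)) -lt 2 ]; then
        return 1
    fi
    echo $(( lower + (upper - lower) / 2 ))
}
```

With the hardcoded values 100, 200 and 300, a tier between memtier0 and
memtier1 could take rank 150 without touching the node assignment of any
existing tier.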


2022-06-02 08:53:06

by Bharata B Rao

Subject: Re: [RFC PATCH v4 4/7] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM

On 6/1/2022 7:19 PM, Aneesh Kumar K V wrote:
> On 6/1/22 11:59 AM, Bharata B Rao wrote:
>> I was experimenting with this patchset and found this behaviour.
>> Here's what I did:
>>
>> Boot a KVM guest with vNVDIMM device which ends up with device_dax
>> driver by default.
>>
>> Use it as RAM by binding it to dax kmem driver. It now appears as
>> RAM with a new NUMA node that is put to memtier1 (the existing tier
>> where DRAM already exists)
>>
>
> That should have placed it in memtier2.
>
>> I can move it to memtier2 (MEMORY_RANK_PMEM) manually, but isn't
>> that expected to happen automatically when a node with dax kmem
>> device comes up?
>>
>
> This can happen if we have added the same NUMA node to memtier1 before dax kmem driver initialized the pmem memory. Can you check before the above node_set_memory_tier_rank() whether the specific NUMA node is already part of any memory tier?

When we reach node_set_memory_tier_rank(), node1 (that has the pmem device)
is already part of memtier1 whose nodelist shows 0-1.

Regards,
Bharata.
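
When debugging a placement like the one described above, it helps to find
which tier currently lists a given node. A minimal sketch, assuming the
memtierN/nodelist files use the usual kernel list format (comma-separated
entries, each either "N" or "LO-HI"); the function name tier_of_node and its
BASE argument are hypothetical and exist only for this illustration.

```shell
# tier_of_node BASE NODE
# Print the name of the first BASE/memtier* directory whose nodelist
# contains NODE; return 1 if no tier lists it.
tier_of_node() {
    base=$1
    node=$2
    for dir in "$base"/memtier*; do
        [ -r "$dir/nodelist" ] || continue
        # Split the list on commas; each entry is "N" or "LO-HI".
        for part in $(tr ',' ' ' < "$dir/nodelist"); do
            case $part in
                *-*) lo=${part%-*}; hi=${part#*-} ;;
                *)   lo=$part; hi=$part ;;
            esac
            if [ "$node" -ge "$lo" ] && [ "$node" -le "$hi" ]; then
                echo "${dir##*/}"
                return 0
            fi
        done
    done
    return 1
}
```

For the case reported here, where the memtier1 nodelist shows 0-1, calling
tier_of_node with node 1 would print memtier1.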

2022-06-02 10:23:02

by Huang, Ying

Subject: Re: [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers

On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
> From: Jagdish Gediya <[email protected]>
>
> In the current kernel, memory tiers are defined implicitly via a
> demotion path relationship between NUMA nodes, which is created
> during the kernel initialization and updated when a NUMA node is
> hot-added or hot-removed. The current implementation puts all
> nodes with CPU into the top tier, and builds the tier hierarchy
> tier-by-tier by establishing the per-node demotion targets based
> on the distances between nodes.
>
> This current memory tier kernel interface needs to be improved for
> several important use cases:
>
> The current tier initialization code always initializes
> each memory-only NUMA node into a lower tier. But a memory-only
> NUMA node may have a high performance memory device (e.g. a DRAM
> device attached via CXL.mem or a DRAM-backed memory-only node on
> a virtual machine) and should be put into a higher tier.
>
> The current tier hierarchy always puts CPU nodes into the top
> tier. But on a system with HBM or GPU devices, the
> memory-only NUMA nodes mapping these devices should be in the
> top tier, and DRAM nodes with CPUs are better to be placed into the
> next lower tier.
>
> With the current kernel, a higher tier node can only be demoted to selected
> nodes on the next lower tier as defined by the demotion path, not to any
> other node from any lower tier. This strict, hard-coded demotion order
> does not work in all use cases (e.g. some use cases may want to
> allow cross-socket demotion to another node in the same demotion
> tier as a fallback when the preferred demotion node is out of
> space). This demotion order is also inconsistent with the page
> allocation fallback order when all the nodes in a higher tier are
> out of space: The page allocation can fall back to any node from
> any lower tier, whereas the demotion order doesn't allow that.
>
> The current kernel also doesn't provide any interfaces for
> userspace to learn about the memory tier hierarchy in order to
> optimize its memory allocations.
>
> This patch series addresses the above by defining memory tiers explicitly.
>
> This patch adds the sysfs interface below, which is read-only and
> can be used to read the nodes available in a specific tier.
>
> /sys/devices/system/memtier/memtierN/nodelist
>
> Tier 0 is the highest tier, while tier MAX_MEMORY_TIERS - 1 is the
> lowest tier. The absolute value of a tier id number has no specific
> meaning. What matters is the relative order of the tier id numbers.
>
> All the tiered memory code is guarded by CONFIG_TIERED_MEMORY.
> The default number of memory tiers is MAX_MEMORY_TIERS (3). All the
> nodes are by default assigned to DEFAULT_MEMORY_TIER (1).
>
> The default memory tier can be read from:
> /sys/devices/system/memtier/default_tier
>
> The max memory tier can be read from:
> /sys/devices/system/memtier/max_tiers
>
> This patch implements the RFC spec sent by Wei Xu <[email protected]> at [1].
>
> [1] https://lore.kernel.org/linux-mm/CAAPL-u-DGLcKRVDnChN9ZhxPkfxQvz9Sb93kVoX_4J2oiJSkUw@mail.gmail.com/
>
> Signed-off-by: Jagdish Gediya <[email protected]>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>

IMHO, we should change the kernel internal implementation first, then
implement the kernel/user space interface. That is, make memory tiers
explicit inside the kernel, then expose them to user space.

Best Regards,
Huang, Ying


[snip]


2022-06-02 18:22:55

by Huang, Ying

Subject: Re: [RFC PATCH v4 6/7] mm/demotion: Add support for removing node from demotion memory tiers

On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
> This patch adds the special string "none" as a supported memtier value
> that we can use to remove a specific node from being used as a demotion target.
>
> For example:
> :/sys/devices/system/node/node1# cat memtier
> 1
> :/sys/devices/system/node/node1# cat ../../memtier/memtier1/nodelist
> 1-3
> :/sys/devices/system/node/node1# echo none > memtier
> :/sys/devices/system/node/node1#
> :/sys/devices/system/node/node1# cat memtier
> :/sys/devices/system/node/node1# cat ../../memtier/memtier1/nodelist
> 2-3
> :/sys/devices/system/node/node1#

Why do you need this? Do you have some real users?

Best Regards,
Huang, Ying


[snip]



2022-06-05 11:25:57

by Aneesh Kumar K.V

Subject: Re: [RFC PATCH v4 2/7] mm/demotion: Expose per node memory tier to sysfs

On 5/27/22 7:45 PM, Jonathan Cameron wrote:
> On Fri, 27 May 2022 17:55:23 +0530
> "Aneesh Kumar K.V" <[email protected]> wrote:
>
>> From: Jagdish Gediya <[email protected]>
>>
>> Add support to read/write the memory tier index for a NUMA node.
>>
>> /sys/devices/system/node/nodeN/memtier
>>
>> where N = node id
>>
>> When read, it lists the memory tier that the node belongs to.
>>
>> When written, the kernel moves the node into the specified
>> memory tier; the tier assignments of all other nodes are not
>> affected.
>>
>> If the memory tier does not exist, writing to the above file
>> create the tier and assign the NUMA node to that tier.
> creates
>
> There was some discussion in v2 of Wei Xu's RFC that what matters
> for creation is the rank, not the tier number.
>
> My suggestion is to move to an explicit creation file such as
> memtier/create_tier_from_rank
> to which writing the rank results in a new tier
> with the next device ID and the requested rank.

I think the below workflow is much simpler.

:/sys/devices/system# cat memtier/memtier1/nodelist
1-3
:/sys/devices/system# cat node/node1/memtier
1
:/sys/devices/system# ls memtier/memtier*
nodelist power rank subsystem uevent
:/sys/devices/system# ls memtier/
default_rank max_tier memtier1 power uevent
:/sys/devices/system# echo 2 > node/node1/memtier
:/sys/devices/system#

:/sys/devices/system# ls memtier/
default_rank max_tier memtier1 memtier2 power uevent
:/sys/devices/system# cat memtier/memtier1/nodelist
2-3
:/sys/devices/system# cat memtier/memtier2/nodelist
1
:/sys/devices/system#

i.e., to create a tier we just write the tier id/tier index to the
node/nodeN/memtier file. That will create a new memory tier if needed
and add the node to that specific memory tier. Since for now we have
a 1:1 mapping between tier index and rank value, we can derive the
rank value from the memory tier index.

For dynamic memory tier support, we can assign rank values such that
a newly created memory tier always comes last in the demotion order.

-aneesh
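
The two rank rules described above can be sketched as small helpers. This is
only an illustration under stated assumptions: the formula (index + 1) * 100
is extrapolated from the hardcoded ranks 100, 200 and 300 for tiers 0, 1 and
2, a larger rank is assumed to come later in the demotion order, and the
names RANK_STEP, rank_for_tier and next_last_rank are hypothetical rather
than taken from the patches.

```shell
# Gap between consecutive ranks; 100 matches the hardcoded 100/200/300 values.
RANK_STEP=100

# rank_for_tier INDEX
# Derive a rank from a tier index under the assumed 1:1 mapping.
rank_for_tier() {
    echo $(( ($1 + 1) * RANK_STEP ))
}

# next_last_rank RANK...
# Given the ranks already in use, print a rank larger than all of them, so
# that a dynamically created tier sorts last.
next_last_rank() {
    max=0
    for r in "$@"; do
        if [ "$r" -gt "$max" ]; then
            max=$r
        fi
    done
    echo $(( max + RANK_STEP ))
}
```

In the workflow above, writing 2 to node/node1/memtier would correspond to
rank_for_tier 2, i.e. rank 300 under this assumed mapping.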




2022-06-05 12:50:42

by Aneesh Kumar K.V

Subject: Re: [RFC PATCH v4 4/7] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM

On 6/2/22 12:06 PM, Bharata B Rao wrote:
> On 6/1/2022 7:19 PM, Aneesh Kumar K V wrote:
>> On 6/1/22 11:59 AM, Bharata B Rao wrote:
>>> I was experimenting with this patchset and found this behaviour.
>>> Here's what I did:
>>>
>>> Boot a KVM guest with vNVDIMM device which ends up with device_dax
>>> driver by default.
>>>
>>> Use it as RAM by binding it to dax kmem driver. It now appears as
>>> RAM with a new NUMA node that is put to memtier1 (the existing tier
>>> where DRAM already exists)
>>>
>>
>> That should have placed it in memtier2.
>>
>>> I can move it to memtier2 (MEMORY_RANK_PMEM) manually, but isn't
>>> that expected to happen automatically when a node with dax kmem
>>> device comes up?
>>>
>>
>> This can happen if we have added the same NUMA node to memtier1 before dax kmem driver initialized the pmem memory. Can you check before the above node_set_memory_tier_rank() whether the specific NUMA node is already part of any memory tier?
>
> When we reach node_set_memory_tier_rank(), node1 (that has the pmem device)
> is already part of memtier1 whose nodelist shows 0-1.
>

Can you find out which code path added node1 to memtier1? Do you have
regular memory also appearing on node1?

-aneesh

2022-06-06 05:41:45

by Huang, Ying

Subject: Re: [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers

On Thu, 2022-06-02 at 14:07 +0800, Ying Huang wrote:
> On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
> > [snip]
>
> IMHO, we should change the kernel internal implementation first, then
> implement the kernel/user space interface. That is, make memory tiers
> explicit inside the kernel, then expose them to user space.

Why was this comment ignored for v5? If you don't agree, please respond.

Best Regards,
Huang, Ying

2022-06-06 06:22:36

by Aneesh Kumar K.V

Subject: Re: [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers

On 6/6/22 8:19 AM, Ying Huang wrote:
> On Thu, 2022-06-02 at 14:07 +0800, Ying Huang wrote:
>> On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
>>> [snip]
>>
>> IMHO, we should change the kernel internal implementation first, then
>> implement the kernel/user space interface. That is, make memory tiers
>> explicit inside the kernel, then expose them to user space.
>
> Why was this comment ignored for v5? If you don't agree, please respond.
>

I am not sure what benefit such a rearrangement would bring. Right now I
am writing the series from the point of view of introducing all the
plumbing and then switching the existing demotion logic to use the new
infrastructure. Redoing the code to hide all the userspace sysfs until we
switch the demotion logic to use the new infrastructure doesn't really
bring any additional clarity to patch review and would require me to
redo the series with a lot of conflicts across the patches in the patchset.

-aneesh

2022-06-06 06:28:48

by Huang, Ying

Subject: Re: [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers

On Mon, 2022-06-06 at 09:26 +0530, Aneesh Kumar K V wrote:
> On 6/6/22 8:19 AM, Ying Huang wrote:
> > On Thu, 2022-06-02 at 14:07 +0800, Ying Huang wrote:
> > > On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
> > > > [snip]
> > >
> > > IMHO, we should change the kernel internal implementation first, then
> > > implement the kernel/user space interface. That is, make memory tiers
> > > explicit inside the kernel, then expose them to user space.
> >
> > Why was this comment ignored for v5? If you don't agree, please respond.
> >
>
> I am not sure what benefit such a rearrangement would bring. Right now I
> am writing the series from the point of view of introducing all the
> plumbing and then switching the existing demotion logic to use the new
> infrastructure. Redoing the code to hide all the userspace sysfs until we
> switch the demotion logic to use the new infrastructure doesn't really
> bring any additional clarity to patch review and would require me to
> redo the series with a lot of conflicts across the patches in the patchset.

IMHO, we shouldn't introduce a regression even in the middle of a
patchset. Each step should only rely on previous patches in the series
to work correctly. With your current organization, after patch
[1/7], on a system with 2 memory tiers, the user space interface will
output wrong information (only 1 memory tier). So I think the correct
way is to make it right inside the kernel first, then expose the right
information to user space.

Best Regards,
Huang, Ying

2022-06-06 06:45:24

by Aneesh Kumar K.V

Subject: Re: [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers

On 6/6/22 11:03 AM, Ying Huang wrote:
> On Mon, 2022-06-06 at 09:26 +0530, Aneesh Kumar K V wrote:
>> On 6/6/22 8:19 AM, Ying Huang wrote:
>>> On Thu, 2022-06-02 at 14:07 +0800, Ying Huang wrote:
>>>> On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
>>>>> [snip]
>>>>
>>>> IMHO, we should change the kernel internal implementation first, then
>>>> implement the kernel/user space interface. That is, make memory tiers
>>>> explicit inside the kernel, then expose them to user space.
>>>
>>> Why was this comment ignored for v5? If you don't agree, please respond.
>>>
>>
>> I am not sure what benefit such a rearrangement would bring. Right now I
>> am writing the series from the point of view of introducing all the
>> plumbing and then switching the existing demotion logic to use the new
>> infrastructure. Redoing the code to hide all the userspace sysfs until we
>> switch the demotion logic to use the new infrastructure doesn't really
>> bring any additional clarity to patch review and would require me to
>> redo the series with a lot of conflicts across the patches in the patchset.
>
> IMHO, we shouldn't introduce a regression even in the middle of a
> patchset. Each step should only rely on previous patches in the series
> to work correctly. With your current organization, after patch
> [1/7], on a system with 2 memory tiers, the user space interface will
> output wrong information (only 1 memory tier). So I think the correct
> way is to make it right inside the kernel first, then expose the right
> information to user space.
>

The patchset doesn't add an additional tier until "mm/demotion/dax/kmem:
Set node's memory tier to MEMORY_TIER_PMEM", i.e., no additional
tiers are created until all the demotion logic is in place. So even if the
system has dax/kmem, the support for adding dax/kmem as a memory tier
comes later in the patch series.


-aneesh

2022-06-06 07:03:09

by Aneesh Kumar K.V

Subject: Re: [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers

Aneesh Kumar K V <[email protected]> writes:

> On 6/6/22 11:03 AM, Ying Huang wrote:
>> On Mon, 2022-06-06 at 09:26 +0530, Aneesh Kumar K V wrote:
>>> On 6/6/22 8:19 AM, Ying Huang wrote:
>>>> On Thu, 2022-06-02 at 14:07 +0800, Ying Huang wrote:
>>>>> On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
>>>>>> From: Jagdish Gediya <[email protected]>
>>>>>>
>>>>>> In the current kernel, memory tiers are defined implicitly via a
>>>>>> demotion path relationship between NUMA nodes, which is created
>>>>>> during the kernel initialization and updated when a NUMA node is
>>>>>> hot-added or hot-removed. The current implementation puts all
>>>>>> nodes with CPU into the top tier, and builds the tier hierarchy
>>>>>> tier-by-tier by establishing the per-node demotion targets based
>>>>>> on the distances between nodes.
>>>>>>
>>>>>> This current memory tier kernel interface needs to be improved for
>>>>>> several important use cases,
>>>>>>
>>>>>> The current tier initialization code always initializes
>>>>>> each memory-only NUMA node into a lower tier. But a memory-only
>>>>>> NUMA node may have a high performance memory device (e.g. a DRAM
>>>>>> device attached via CXL.mem or a DRAM-backed memory-only node on
>>>>>> a virtual machine) and should be put into a higher tier.
>>>>>>
>>>>>> The current tier hierarchy always puts CPU nodes into the top
>>>>>> tier. But on a system with HBM or GPU devices, the
>>>>>> memory-only NUMA nodes mapping these devices should be in the
>>>>>> top tier, and DRAM nodes with CPUs are better to be placed into the
>>>>>> next lower tier.
>>>>>>
>>>>>> With the current kernel, a higher tier node can only be demoted to selected nodes on the
>>>>>> next lower tier as defined by the demotion path, not any other
>>>>>> node from any lower tier. This strict, hard-coded demotion order
>>>>>> does not work in all use cases (e.g. some use cases may want to
>>>>>> allow cross-socket demotion to another node in the same demotion
>>>>>> tier as a fallback when the preferred demotion node is out of
>>>>>> space). This demotion order is also inconsistent with the page
>>>>>> allocation fallback order when all the nodes in a higher tier are
>>>>>> out of space: The page allocation can fall back to any node from
>>>>>> any lower tier, whereas the demotion order doesn't allow that.
>>>>>>
>>>>>> The current kernel also doesn't provide any interfaces for the
>>>>>> userspace to learn about the memory tier hierarchy in order to
>>>>>> optimize its memory allocations.
>>>>>>
>>>>>> This patch series addresses the above by defining memory tiers explicitly.
>>>>>>
>>>>>> This patch adds below sysfs interface which is read-only and
>>>>>> can be used to read nodes available in specific tier.
>>>>>>
>>>>>> /sys/devices/system/memtier/memtierN/nodelist
>>>>>>
>>>>>> Tier 0 is the highest tier, while tier MAX_MEMORY_TIERS - 1 is the
>>>>>> lowest tier. The absolute value of a tier id number has no specific
>>>>>> meaning. What matters is the relative order of the tier id numbers.
>>>>>>
>>>>>> All the tiered memory code is guarded by CONFIG_TIERED_MEMORY.
>>>>>> Default number of memory tiers are MAX_MEMORY_TIERS(3). All the
>>>>>> nodes are by default assigned to DEFAULT_MEMORY_TIER(1).
>>>>>>
>>>>>> Default memory tier can be read from,
>>>>>> /sys/devices/system/memtier/default_tier
>>>>>>
>>>>>> Max memory tier can be read from,
>>>>>> /sys/devices/system/memtier/max_tiers
>>>>>>
>>>>>> This patch implements the RFC spec sent by Wei Xu <[email protected]> at [1].
>>>>>>
>>>>>> [1] https://lore.kernel.org/linux-mm/CAAPL-u-DGLcKRVDnChN9ZhxPkfxQvz9Sb93kVoX_4J2oiJSkUw@mail.gmail.com/
>>>>>>
>>>>>> Signed-off-by: Jagdish Gediya <[email protected]>
>>>>>> Signed-off-by: Aneesh Kumar K.V <[email protected]>
>>>>>
>>>>> IMHO, we should change the kernel internal implementation firstly, then
>>>>> implement the kernel/user space interface. That is, make memory tier
>>>>> explicit inside kernel, then expose it to user space.
>>>>
>>>> Why ignore this comment for v5? If you don't agree, please respond to me.
>>>>
>>>
>>> I am not sure what benefit such a rearrange would bring in? Right now I
>>> am writing the series from the point of view of introducing all the
>>> plumbing and then switching the existing demotion logic to use the new
>>> infrastructure. Redoing the code to hide all the userspace sysfs till we
>>> switch the demotion logic to use the new infrastructure doesn't really
>>> bring any additional clarity to patch review and would require me to
>>> redo the series with a lot of conflicts across the patches in the patchset.
>>
>> IMHO, we shouldn't introduce regression even in the middle of a
>> patchset. Each step should only rely on previous patches in the series
>> to work correctly. In your current way of organization, after patch
>> [1/7], on a system with 2 memory tiers, the user space interface will
>> output wrong information (only 1 memory tier). So I think the correct
>> way is to make it right inside the kernel first, then expose the right
>> information to user space.
>>
>
> The patchset doesn't add additional tier until "mm/demotion/dax/kmem:
> Set node's memory tier to MEMORY_TIER_PMEM". ie, there is no additional
> tiers done till all the demotion logic is in place. So even if the
> system got dax/kmem, the support for adding dax/kmem as a memory tier
> comes later in the patch series.

Let me clarify this a bit more. This patchset doesn't change the
existing kernel behavior till "mm/demotion: Build demotion targets
based on explicit memory tiers". So there is no regression till then.
It adds a parallel framework (memory tiers) alongside the existing
demotion logic.

I can move the patch "mm/demotion/dax/kmem: Set node's memory tier to
MEMORY_TIER_PMEM" before switching the demotion logic so that on systems
with two memory tiers (DRAM and pmem) the demotion continues to work
as expected after patch 3 ("mm/demotion: Build demotion targets based on
explicit memory tiers"). With that, there will not be any regression in
the middle of the patch series.

-aneesh

2022-06-06 08:13:17

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers

On Mon, 2022-06-06 at 11:57 +0530, Aneesh Kumar K.V wrote:
> Aneesh Kumar K V <[email protected]> writes:
>
> > On 6/6/22 11:03 AM, Ying Huang wrote:
> > > On Mon, 2022-06-06 at 09:26 +0530, Aneesh Kumar K V wrote:
> > > > On 6/6/22 8:19 AM, Ying Huang wrote:
> > > > > On Thu, 2022-06-02 at 14:07 +0800, Ying Huang wrote:
> > > > > > On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
> > > > > > > From: Jagdish Gediya <[email protected]>
> > > > > > >
> > > > > > > In the current kernel, memory tiers are defined implicitly via a
> > > > > > > demotion path relationship between NUMA nodes, which is created
> > > > > > > during the kernel initialization and updated when a NUMA node is
> > > > > > > hot-added or hot-removed. The current implementation puts all
> > > > > > > nodes with CPU into the top tier, and builds the tier hierarchy
> > > > > > > tier-by-tier by establishing the per-node demotion targets based
> > > > > > > on the distances between nodes.
> > > > > > >
> > > > > > > This current memory tier kernel interface needs to be improved for
> > > > > > > several important use cases,
> > > > > > >
> > > > > > > The current tier initialization code always initializes
> > > > > > > each memory-only NUMA node into a lower tier. But a memory-only
> > > > > > > NUMA node may have a high performance memory device (e.g. a DRAM
> > > > > > > device attached via CXL.mem or a DRAM-backed memory-only node on
> > > > > > > a virtual machine) and should be put into a higher tier.
> > > > > > >
> > > > > > > The current tier hierarchy always puts CPU nodes into the top
> > > > > > > tier. But on a system with HBM or GPU devices, the
> > > > > > > memory-only NUMA nodes mapping these devices should be in the
> > > > > > > top tier, and DRAM nodes with CPUs are better to be placed into the
> > > > > > > next lower tier.
> > > > > > >
> > > > > > > > With the current kernel, a higher tier node can only be demoted to selected nodes on the
> > > > > > > next lower tier as defined by the demotion path, not any other
> > > > > > > node from any lower tier. This strict, hard-coded demotion order
> > > > > > > does not work in all use cases (e.g. some use cases may want to
> > > > > > > allow cross-socket demotion to another node in the same demotion
> > > > > > > tier as a fallback when the preferred demotion node is out of
> > > > > > > > space). This demotion order is also inconsistent with the page
> > > > > > > allocation fallback order when all the nodes in a higher tier are
> > > > > > > out of space: The page allocation can fall back to any node from
> > > > > > > any lower tier, whereas the demotion order doesn't allow that.
> > > > > > >
> > > > > > > > The current kernel also doesn't provide any interfaces for the
> > > > > > > userspace to learn about the memory tier hierarchy in order to
> > > > > > > optimize its memory allocations.
> > > > > > >
> > > > > > > > This patch series addresses the above by defining memory tiers explicitly.
> > > > > > >
> > > > > > > This patch adds below sysfs interface which is read-only and
> > > > > > > can be used to read nodes available in specific tier.
> > > > > > >
> > > > > > > /sys/devices/system/memtier/memtierN/nodelist
> > > > > > >
> > > > > > > Tier 0 is the highest tier, while tier MAX_MEMORY_TIERS - 1 is the
> > > > > > > lowest tier. The absolute value of a tier id number has no specific
> > > > > > > > meaning. What matters is the relative order of the tier id numbers.
> > > > > > >
> > > > > > > All the tiered memory code is guarded by CONFIG_TIERED_MEMORY.
> > > > > > > Default number of memory tiers are MAX_MEMORY_TIERS(3). All the
> > > > > > > nodes are by default assigned to DEFAULT_MEMORY_TIER(1).
> > > > > > >
> > > > > > > Default memory tier can be read from,
> > > > > > > /sys/devices/system/memtier/default_tier
> > > > > > >
> > > > > > > Max memory tier can be read from,
> > > > > > > /sys/devices/system/memtier/max_tiers
> > > > > > >
> > > > > > > This patch implements the RFC spec sent by Wei Xu <[email protected]> at [1].
> > > > > > >
> > > > > > > [1] https://lore.kernel.org/linux-mm/CAAPL-u-DGLcKRVDnChN9ZhxPkfxQvz9Sb93kVoX_4J2oiJSkUw@mail.gmail.com/
> > > > > > >
> > > > > > > Signed-off-by: Jagdish Gediya <[email protected]>
> > > > > > > Signed-off-by: Aneesh Kumar K.V <[email protected]>
> > > > > >
> > > > > > IMHO, we should change the kernel internal implementation firstly, then
> > > > > > > implement the kernel/user space interface. That is, make memory tier
> > > > > > explicit inside kernel, then expose it to user space.
> > > > >
> > > > > > Why ignore this comment for v5? If you don't agree, please respond to me.
> > > > >
> > > >
> > > > I am not sure what benefit such a rearrange would bring in? Right now I
> > > > am writing the series from the point of view of introducing all the
> > > > > plumbing and then switching the existing demotion logic to use the new
> > > > infrastructure. Redoing the code to hide all the userspace sysfs till we
> > > > switch the demotion logic to use the new infrastructure doesn't really
> > > > bring any additional clarity to patch review and would require me to
> > > > redo the series with a lot of conflicts across the patches in the patchset.
> > >
> > > IMHO, we shouldn't introduce regression even in the middle of a
> > > patchset. Each step should only rely on previous patches in the series
> > > to work correctly. In your current way of organization, after patch
> > > [1/7], on a system with 2 memory tiers, the user space interface will
> > > output wrong information (only 1 memory tier). So I think the correct
> > > > way is to make it right inside the kernel first, then expose the right
> > > information to user space.
> > >
> >
> > The patchset doesn't add additional tier until "mm/demotion/dax/kmem:
> > Set node's memory tier to MEMORY_TIER_PMEM". ie, there is no additional
> > tiers done till all the demotion logic is in place. So even if the
> > system got dax/kmem, the support for adding dax/kmem as a memory tier
> > comes later in the patch series.
>
> Let me clarify this a bit more. This patchset doesn't change the
> existing kernel behavior till "mm/demotion: Build demotion targets
> based on explicit memory tiers". So there is no regression till then.
> It adds a parallel framework (memory tiers) alongside the existing
> demotion logic.
>
> I can move the patch "mm/demotion/dax/kmem: Set node's memory tier to
> MEMORY_TIER_PMEM" before switching the demotion logic so that on systems
> with two memory tiers (DRAM and pmem) the demotion continues to work
> as expected after patch 3 ("mm/demotion: Build demotion targets based on
> explicit memory tiers"). With that, there will not be any regression in
> between the patch series.
>

Thanks! Please do that. And I think you can add the sysfs interface after
that patch too. That is, in [1/7]:

+struct memory_tier {
+ nodemask_t nodelist;
+};

And struct device can be added after the kernel has switched the
implementation based on explicit memory tiers.

+struct memory_tier {
+ struct device dev;
+ nodemask_t nodelist;
+};

But I don't think it's a good idea to have "struct device" embedded in
"struct memory_tier". We don't have "struct device" embedded in "struct
pglist_data"...

Best Regards,
Huang, Ying



2022-06-06 08:31:39

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers

On 6/6/22 1:23 PM, Ying Huang wrote:
> On Mon, 2022-06-06 at 11:57 +0530, Aneesh Kumar K.V wrote:
>> Aneesh Kumar K V <[email protected]> writes:
>>
>>> On 6/6/22 11:03 AM, Ying Huang wrote:
>>>> On Mon, 2022-06-06 at 09:26 +0530, Aneesh Kumar K V wrote:
>>>>> On 6/6/22 8:19 AM, Ying Huang wrote:
>>>>>> On Thu, 2022-06-02 at 14:07 +0800, Ying Huang wrote:
>>>>>>> On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
>>>>>>>> From: Jagdish Gediya <[email protected]>
>>>>>>>>
>>>>>>>> In the current kernel, memory tiers are defined implicitly via a
>>>>>>>> demotion path relationship between NUMA nodes, which is created
>>>>>>>> during the kernel initialization and updated when a NUMA node is
>>>>>>>> hot-added or hot-removed. The current implementation puts all
>>>>>>>> nodes with CPU into the top tier, and builds the tier hierarchy
>>>>>>>> tier-by-tier by establishing the per-node demotion targets based
>>>>>>>> on the distances between nodes.
>>>>>>>>
>>>>>>>> This current memory tier kernel interface needs to be improved for
>>>>>>>> several important use cases,
>>>>>>>>
>>>>>>>> The current tier initialization code always initializes
>>>>>>>> each memory-only NUMA node into a lower tier. But a memory-only
>>>>>>>> NUMA node may have a high performance memory device (e.g. a DRAM
>>>>>>>> device attached via CXL.mem or a DRAM-backed memory-only node on
>>>>>>>> a virtual machine) and should be put into a higher tier.
>>>>>>>>
>>>>>>>> The current tier hierarchy always puts CPU nodes into the top
>>>>>>>> tier. But on a system with HBM or GPU devices, the
>>>>>>>> memory-only NUMA nodes mapping these devices should be in the
>>>>>>>> top tier, and DRAM nodes with CPUs are better to be placed into the
>>>>>>>> next lower tier.
>>>>>>>>
>>>>>>>> With the current kernel, a higher tier node can only be demoted to selected nodes on the
>>>>>>>> next lower tier as defined by the demotion path, not any other
>>>>>>>> node from any lower tier. This strict, hard-coded demotion order
>>>>>>>> does not work in all use cases (e.g. some use cases may want to
>>>>>>>> allow cross-socket demotion to another node in the same demotion
>>>>>>>> tier as a fallback when the preferred demotion node is out of
>>>>>>>> space). This demotion order is also inconsistent with the page
>>>>>>>> allocation fallback order when all the nodes in a higher tier are
>>>>>>>> out of space: The page allocation can fall back to any node from
>>>>>>>> any lower tier, whereas the demotion order doesn't allow that.
>>>>>>>>
>>>>>>>> The current kernel also doesn't provide any interfaces for the
>>>>>>>> userspace to learn about the memory tier hierarchy in order to
>>>>>>>> optimize its memory allocations.
>>>>>>>>
>>>>>>>> This patch series addresses the above by defining memory tiers explicitly.
>>>>>>>>
>>>>>>>> This patch adds below sysfs interface which is read-only and
>>>>>>>> can be used to read nodes available in specific tier.
>>>>>>>>
>>>>>>>> /sys/devices/system/memtier/memtierN/nodelist
>>>>>>>>
>>>>>>>> Tier 0 is the highest tier, while tier MAX_MEMORY_TIERS - 1 is the
>>>>>>>> lowest tier. The absolute value of a tier id number has no specific
>>>>>>>> meaning. What matters is the relative order of the tier id numbers.
>>>>>>>>
>>>>>>>> All the tiered memory code is guarded by CONFIG_TIERED_MEMORY.
>>>>>>>> Default number of memory tiers are MAX_MEMORY_TIERS(3). All the
>>>>>>>> nodes are by default assigned to DEFAULT_MEMORY_TIER(1).
>>>>>>>>
>>>>>>>> Default memory tier can be read from,
>>>>>>>> /sys/devices/system/memtier/default_tier
>>>>>>>>
>>>>>>>> Max memory tier can be read from,
>>>>>>>> /sys/devices/system/memtier/max_tiers
>>>>>>>>
>>>>>>>> This patch implements the RFC spec sent by Wei Xu <[email protected]> at [1].
>>>>>>>>
>>>>>>>> [1] https://lore.kernel.org/linux-mm/CAAPL-u-DGLcKRVDnChN9ZhxPkfxQvz9Sb93kVoX_4J2oiJSkUw@mail.gmail.com/
>>>>>>>>
>>>>>>>> Signed-off-by: Jagdish Gediya <[email protected]>
>>>>>>>> Signed-off-by: Aneesh Kumar K.V <[email protected]>
>>>>>>>
>>>>>>> IMHO, we should change the kernel internal implementation firstly, then
>>>>>>> implement the kernel/user space interface. That is, make memory tier
>>>>>>> explicit inside kernel, then expose it to user space.
>>>>>>
>>>>>> Why ignore this comment for v5? If you don't agree, please respond to me.
>>>>>>
>>>>>
>>>>> I am not sure what benefit such a rearrange would bring in? Right now I
>>>>> am writing the series from the point of view of introducing all the
>>>>> plumbing and then switching the existing demotion logic to use the new
>>>>> infrastructure. Redoing the code to hide all the userspace sysfs till we
>>>>> switch the demotion logic to use the new infrastructure doesn't really
>>>>> bring any additional clarity to patch review and would require me to
>>>>> redo the series with a lot of conflicts across the patches in the patchset.
>>>>
>>>> IMHO, we shouldn't introduce regression even in the middle of a
>>>> patchset. Each step should only rely on previous patches in the series
>>>> to work correctly. In your current way of organization, after patch
>>>> [1/7], on a system with 2 memory tiers, the user space interface will
>>>> output wrong information (only 1 memory tier). So I think the correct
>>>> way is to make it right inside the kernel first, then expose the right
>>>> information to user space.
>>>>
>>>
>>> The patchset doesn't add additional tier until "mm/demotion/dax/kmem:
>>> Set node's memory tier to MEMORY_TIER_PMEM". ie, there is no additional
>>> tiers done till all the demotion logic is in place. So even if the
>>> system got dax/kmem, the support for adding dax/kmem as a memory tier
>>> comes later in the patch series.
>>
>> Let me clarify this a bit more. This patchset doesn't change the
>> existing kernel behavior till "mm/demotion: Build demotion targets
>> based on explicit memory tiers". So there is no regression till then.
>> It adds a parallel framework (memory tiers) alongside the existing
>> demotion logic.
>>
>> I can move the patch "mm/demotion/dax/kmem: Set node's memory tier to
>> MEMORY_TIER_PMEM" before switching the demotion logic so that on systems
>> with two memory tiers (DRAM and pmem) the demotion continues to work
>> as expected after patch 3 ("mm/demotion: Build demotion targets based on
>> explicit memory tiers"). With that, there will not be any regression in
>> between the patch series.
>>
>
> Thanks! Please do that. And I think you can add sysfs interface after
> that patch too. That is, in [1/7]
>

I am not sure why you insist on moving the sysfs interfaces later. They are
introduced based on the helpers added. It makes patch review easier to
look at both the helpers and the users of the helpers together in a patch.

> +struct memory_tier {
> + nodemask_t nodelist;
> +};
>
> And struct device can be added after the kernel has switched the
> implementation based on explicit memory tiers.
>
> +struct memory_tier {
> + struct device dev;
> + nodemask_t nodelist;
> +};
>


Can you elaborate on this? Or possibly review the v5 series, indicating
what change you are suggesting here?


> But I don't think it's a good idea to have "struct device" embedded in
> "struct memory_tier". We don't have "struct device" embedded in "struct
> pglist_data"...
>

I avoided creating an array for memory_tier (memory_tier[]) so that we
can keep it dynamic. Keeping dev embedded in struct memory_tier simplifies
the life cycle management of that dynamic list. We free the struct
memory_tier allocation via the device release function
(memtier->dev.release = memory_tier_device_release).

Why do you think it is not a good idea?
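
The release-driven life cycle described above can be sketched in plain
userspace C. This is only an illustration under stated assumptions:
"struct fake_device", fake_get_device() and fake_put_device() are
hypothetical stand-ins for the kernel's struct device, get_device() and
put_device(), and an unsigned long stands in for nodemask_t. It is not
the code from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct device: a refcount plus a release hook. */
struct fake_device {
	int refcount;
	void (*release)(struct fake_device *dev);
};

/* Same idea as the kernel macro: recover the container from a member. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct memory_tier {
	struct fake_device dev;	/* embedded, so it shares the tier's lifetime */
	unsigned long nodelist;	/* stand-in for nodemask_t */
};

static int tiers_freed;	/* counts how many times the release hook ran */

/* Runs when the last reference is dropped; frees the whole memory_tier. */
static void memory_tier_device_release(struct fake_device *dev)
{
	struct memory_tier *memtier = container_of(dev, struct memory_tier, dev);

	free(memtier);
	tiers_freed++;
}

static struct memory_tier *memory_tier_create(unsigned long nodelist)
{
	struct memory_tier *memtier = calloc(1, sizeof(*memtier));

	if (!memtier)
		return NULL;
	memtier->nodelist = nodelist;
	memtier->dev.refcount = 1;
	memtier->dev.release = memory_tier_device_release;
	return memtier;
}

static void fake_get_device(struct fake_device *dev)
{
	dev->refcount++;
}

static void fake_put_device(struct fake_device *dev)
{
	assert(dev->refcount > 0);
	if (--dev->refcount == 0)
		dev->release(dev);
}
```

With this layout nothing frees a tier directly; whoever drops the last
reference triggers the release hook, which is the simplification being
argued for here.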

-aneesh

2022-06-06 09:19:29

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers

On Mon, 2022-06-06 at 13:31 +0530, Aneesh Kumar K V wrote:
> On 6/6/22 1:23 PM, Ying Huang wrote:
> > On Mon, 2022-06-06 at 11:57 +0530, Aneesh Kumar K.V wrote:
> > > Aneesh Kumar K V <[email protected]> writes:
> > >
> > > > On 6/6/22 11:03 AM, Ying Huang wrote:
> > > > > On Mon, 2022-06-06 at 09:26 +0530, Aneesh Kumar K V wrote:
> > > > > > On 6/6/22 8:19 AM, Ying Huang wrote:
> > > > > > > On Thu, 2022-06-02 at 14:07 +0800, Ying Huang wrote:
> > > > > > > > On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
> > > > > > > > > From: Jagdish Gediya <[email protected]>
> > > > > > > > >
> > > > > > > > > In the current kernel, memory tiers are defined implicitly via a
> > > > > > > > > demotion path relationship between NUMA nodes, which is created
> > > > > > > > > during the kernel initialization and updated when a NUMA node is
> > > > > > > > > hot-added or hot-removed. The current implementation puts all
> > > > > > > > > nodes with CPU into the top tier, and builds the tier hierarchy
> > > > > > > > > tier-by-tier by establishing the per-node demotion targets based
> > > > > > > > > on the distances between nodes.
> > > > > > > > >
> > > > > > > > > This current memory tier kernel interface needs to be improved for
> > > > > > > > > several important use cases,
> > > > > > > > >
> > > > > > > > > The current tier initialization code always initializes
> > > > > > > > > each memory-only NUMA node into a lower tier. But a memory-only
> > > > > > > > > NUMA node may have a high performance memory device (e.g. a DRAM
> > > > > > > > > device attached via CXL.mem or a DRAM-backed memory-only node on
> > > > > > > > > a virtual machine) and should be put into a higher tier.
> > > > > > > > >
> > > > > > > > > The current tier hierarchy always puts CPU nodes into the top
> > > > > > > > > tier. But on a system with HBM or GPU devices, the
> > > > > > > > > memory-only NUMA nodes mapping these devices should be in the
> > > > > > > > > top tier, and DRAM nodes with CPUs are better to be placed into the
> > > > > > > > > next lower tier.
> > > > > > > > >
> > > > > > > > > > With the current kernel, a higher tier node can only be demoted to selected nodes on the
> > > > > > > > > next lower tier as defined by the demotion path, not any other
> > > > > > > > > node from any lower tier. This strict, hard-coded demotion order
> > > > > > > > > does not work in all use cases (e.g. some use cases may want to
> > > > > > > > > allow cross-socket demotion to another node in the same demotion
> > > > > > > > > tier as a fallback when the preferred demotion node is out of
> > > > > > > > > > space). This demotion order is also inconsistent with the page
> > > > > > > > > allocation fallback order when all the nodes in a higher tier are
> > > > > > > > > out of space: The page allocation can fall back to any node from
> > > > > > > > > any lower tier, whereas the demotion order doesn't allow that.
> > > > > > > > >
> > > > > > > > > > The current kernel also doesn't provide any interfaces for the
> > > > > > > > > userspace to learn about the memory tier hierarchy in order to
> > > > > > > > > optimize its memory allocations.
> > > > > > > > >
> > > > > > > > > > This patch series addresses the above by defining memory tiers explicitly.
> > > > > > > > >
> > > > > > > > > This patch adds below sysfs interface which is read-only and
> > > > > > > > > can be used to read nodes available in specific tier.
> > > > > > > > >
> > > > > > > > > /sys/devices/system/memtier/memtierN/nodelist
> > > > > > > > >
> > > > > > > > > Tier 0 is the highest tier, while tier MAX_MEMORY_TIERS - 1 is the
> > > > > > > > > lowest tier. The absolute value of a tier id number has no specific
> > > > > > > > > > meaning. What matters is the relative order of the tier id numbers.
> > > > > > > > >
> > > > > > > > > All the tiered memory code is guarded by CONFIG_TIERED_MEMORY.
> > > > > > > > > Default number of memory tiers are MAX_MEMORY_TIERS(3). All the
> > > > > > > > > nodes are by default assigned to DEFAULT_MEMORY_TIER(1).
> > > > > > > > >
> > > > > > > > > Default memory tier can be read from,
> > > > > > > > > /sys/devices/system/memtier/default_tier
> > > > > > > > >
> > > > > > > > > Max memory tier can be read from,
> > > > > > > > > /sys/devices/system/memtier/max_tiers
> > > > > > > > >
> > > > > > > > > This patch implements the RFC spec sent by Wei Xu <[email protected]> at [1].
> > > > > > > > >
> > > > > > > > > [1] https://lore.kernel.org/linux-mm/CAAPL-u-DGLcKRVDnChN9ZhxPkfxQvz9Sb93kVoX_4J2oiJSkUw@mail.gmail.com/
> > > > > > > > >
> > > > > > > > > Signed-off-by: Jagdish Gediya <[email protected]>
> > > > > > > > > Signed-off-by: Aneesh Kumar K.V <[email protected]>
> > > > > > > >
> > > > > > > > IMHO, we should change the kernel internal implementation firstly, then
> > > > > > > > > implement the kernel/user space interface. That is, make memory tier
> > > > > > > > explicit inside kernel, then expose it to user space.
> > > > > > >
> > > > > > > > Why ignore this comment for v5? If you don't agree, please respond to me.
> > > > > > >
> > > > > >
> > > > > > I am not sure what benefit such a rearrange would bring in? Right now I
> > > > > > am writing the series from the point of view of introducing all the
> > > > > > > plumbing and then switching the existing demotion logic to use the new
> > > > > > infrastructure. Redoing the code to hide all the userspace sysfs till we
> > > > > > switch the demotion logic to use the new infrastructure doesn't really
> > > > > > bring any additional clarity to patch review and would require me to
> > > > > > redo the series with a lot of conflicts across the patches in the patchset.
> > > > >
> > > > > IMHO, we shouldn't introduce regression even in the middle of a
> > > > > patchset. Each step should only rely on previous patches in the series
> > > > > to work correctly. In your current way of organization, after patch
> > > > > [1/7], on a system with 2 memory tiers, the user space interface will
> > > > > output wrong information (only 1 memory tier). So I think the correct
> > > > > > way is to make it right inside the kernel first, then expose the right
> > > > > information to user space.
> > > > >
> > > >
> > > > The patchset doesn't add additional tier until "mm/demotion/dax/kmem:
> > > > Set node's memory tier to MEMORY_TIER_PMEM". ie, there is no additional
> > > > tiers done till all the demotion logic is in place. So even if the
> > > > system got dax/kmem, the support for adding dax/kmem as a memory tier
> > > > comes later in the patch series.
> > >
> > > Let me clarify this a bit more. This patchset doesn't change the
> > > existing kernel behavior till "mm/demotion: Build demotion targets
> > > based on explicit memory tiers". So there is no regression till then.
> > > > It adds a parallel framework (memory tiers) alongside the existing
> > > > demotion logic.
> > >
> > > I can move the patch "mm/demotion/dax/kmem: Set node's memory tier to
> > > MEMORY_TIER_PMEM" before switching the demotion logic so that on systems
> > > with two memory tiers (DRAM and pmem) the demotion continues to work
> > > as expected after patch 3 ("mm/demotion: Build demotion targets based on
> > > explicit memory tiers"). With that, there will not be any regression in
> > > between the patch series.
> > >
> >
> > Thanks! Please do that. And I think you can add sysfs interface after
> > that patch too. That is, in [1/7]
> >
>
> I am not sure why you insist on moving sysfs interfaces later. They are
> introduced based on the helper added. It makes patch review easier to
> look at both the helpers and the user of the helper together in a patch.

Yes. We should introduce a function and its user in one patch for
review. But this doesn't mean that we should introduce the user space
interface as the first step. I think the user space interface should
output correct information when we expose it.

> > +struct memory_tier {
> > + nodemask_t nodelist;
> > +};
> >
> > And struct device can be added after the kernel has switched the
> > implementation based on explicit memory tiers.
> >
> > +struct memory_tier {
> > + struct device dev;
> > + nodemask_t nodelist;
> > +};
> >
>
>
> Can you elaborate on this? or possibly review the v5 series indicating
> what change you are suggesting here?
>
>
> > But I don't think it's a good idea to have "struct device" embedded in
> > "struct memory_tier". We don't have "struct device" embedded in "struct
> > pglist_data"...
> >
>
> I avoided creating an array for memory_tier (memory_tier[]) so that we
> can keep it dynamic. Keeping dev embedded in struct memory_tier simplifies
> the life cycle management of that dynamic list. We free the struct
> memory_tier allocation via device release function (memtier->dev.release
> = memory_tier_device_release )
>
> Why do you think it is not a good idea?

I think that we shouldn't tie our kernel internal implementation to the
user space interface too much. Yes, we can expose the kernel internal
implementation to user space in a direct way. But I suggest you follow
the style of "struct pglist_data" and "struct node". If we decouple
"struct memory_tier" and "struct memory_tier_dev" (or some other name),
we can refer to "struct memory_tier" without depending on the whole device
core. Memory tiers should be accessible inside the kernel even without a
user interface. And a memory tier isn't a device in concept.

For life cycle management, I think that we can do that without sysfs
too.
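
A minimal userspace sketch of the decoupling suggested above, assuming
a plain refcount in place of a kref and an unsigned long in place of
nodemask_t. Every function name here is hypothetical; only the two
struct names come from this discussion.

```c
#include <assert.h>
#include <stdlib.h>

/* Core object: usable inside the "kernel" with no device-model dependency. */
struct memory_tier {
	int refcount;		/* stand-in for a kref */
	unsigned long nodelist;	/* stand-in for nodemask_t */
};

/* Optional user-interface wrapper; it only holds a reference on the tier. */
struct memory_tier_dev {
	struct memory_tier *tier;
};

static int live_tiers;	/* number of memory_tier objects not yet freed */

static struct memory_tier *memory_tier_alloc(unsigned long nodelist)
{
	struct memory_tier *tier = calloc(1, sizeof(*tier));

	if (!tier)
		return NULL;
	tier->refcount = 1;
	tier->nodelist = nodelist;
	live_tiers++;
	return tier;
}

/* Drop one reference; the tier is freed when the last one goes away. */
static void memory_tier_put(struct memory_tier *tier)
{
	assert(tier->refcount > 0);
	if (--tier->refcount == 0) {
		free(tier);
		live_tiers--;
	}
}

/* Attach the user-visible wrapper; it takes its own reference. */
static struct memory_tier_dev *memory_tier_dev_register(struct memory_tier *tier)
{
	struct memory_tier_dev *mdev = calloc(1, sizeof(*mdev));

	if (!mdev)
		return NULL;
	tier->refcount++;
	mdev->tier = tier;
	return mdev;
}

/* Detach the wrapper; the tier itself survives if others still hold it. */
static void memory_tier_dev_unregister(struct memory_tier_dev *mdev)
{
	memory_tier_put(mdev->tier);
	free(mdev);
}
```

The point of the split is that removing the user interface wrapper does
not tear down the tier, and code that only needs the node list never
touches the wrapper type.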

Best Regards,
Huang, Ying



2022-06-06 09:29:56

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers

On 6/6/22 2:22 PM, Ying Huang wrote:
....
>>>> I can move the patch "mm/demotion/dax/kmem: Set node's memory tier to
>>>> MEMORY_TIER_PMEM" before switching the demotion logic so that on systems
>>>> with two memory tiers (DRAM and pmem) the demotion continues to work
>>>> as expected after patch 3 ("mm/demotion: Build demotion targets based on
>>>> explicit memory tiers"). With that, there will not be any regression in
>>>> between the patch series.
>>>>
>>>
>>> Thanks! Please do that. And I think you can add sysfs interface after
>>> that patch too. That is, in [1/7]
>>>
>>
>> I am not sure why you insist on moving sysfs interfaces later. They are
>> introduced based on the helper added. It makes patch review easier to
>> look at both the helpers and the user of the helper together in a patch.
>
> Yes. We should introduce a function and its user in one patch for
> review. But this doesn't mean that we should introduce the user space
> interface as the first step. I think the user space interface should
> output correct information when we expose it.
>

If you look at this patchset we are not exposing any wrong information.

patch 1 -> adds the ability to register memory tiers and expose details
of the registered memory tiers. At this point the patchset only supports
the DRAM tier and hence only one tier is shown.

patch 2 -> adds the per-node memtier attribute. Only DRAM nodes show the
details, because the patchset has not yet introduced a slower memory
tier like PMEM.

patch 4 -> introduces demotion. I will make that patch 5.

patch 5 -> adds dax kmem NUMA nodes as a slower memory tier. This now
becomes patch 4, at which point we will correctly show two memory tiers
in the system.
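
To check what is visible at each of these steps, userspace could read
/sys/devices/system/memtier/memtierN/nodelist and parse it. The sketch
below assumes the file uses the usual kernel list format (for example
"0-1,4"); that format is an assumption here, and parse_nodelist() is a
hypothetical helper, not part of the series.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Highest node id + 1 that fits in the unsigned long bitmask used here. */
#define MAX_NODES ((int)(sizeof(unsigned long) * CHAR_BIT))

/*
 * Parse a list such as "0-1,4\n" into a bitmask (bit n set = node n).
 * Returns 0 on success and -1 on malformed or out-of-range input.
 * An empty string yields an empty mask.
 */
static int parse_nodelist(const char *s, unsigned long *mask)
{
	assert(s && mask);
	*mask = 0;

	while (*s && *s != '\n') {
		char *end;
		long lo, hi;

		lo = strtol(s, &end, 10);
		if (end == s)
			return -1;	/* no digits where a number was expected */
		hi = lo;
		if (*end == '-') {	/* range "lo-hi" */
			s = end + 1;
			hi = strtol(s, &end, 10);
			if (end == s)
				return -1;
		}
		if (lo < 0 || hi < lo || hi >= MAX_NODES)
			return -1;
		for (long n = lo; n <= hi; n++)
			*mask |= 1UL << n;

		if (*end == ',')
			end++;		/* move on to the next entry */
		else if (*end && *end != '\n')
			return -1;	/* unexpected trailing character */
		s = end;
	}
	return 0;
}
```

A caller would read one line from each memtierN/nodelist file and count
how many tiers return a non-empty mask.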


>>> +struct memory_tier {
>>> + nodemask_t nodelist;
>>> +};
>>>
>>> And struct device can be added after the kernel has switched the
>>> implementation based on explicit memory tiers.
>>>
>>> +struct memory_tier {
>>> + struct device dev;
>>> + nodemask_t nodelist;
>>> +};
>>>
>>
>>
>> Can you elaborate on this? or possibly review the v5 series indicating
>> what change you are suggesting here?
>>
>>
>>> But I don't think it's a good idea to have "struct device" embedded in
>>> "struct memory_tier". We don't have "struct device" embedded in "struct
>>> pgdata_list"...
>>>
>>
>> I avoided creating an array for memory_tier (memory_tier[]) so that we
>> can keep it dynamic. Keeping dev embedded in struct memory_tier simplify
>> the life cycle management of that dynamic list. We free the struct
>> memory_tier allocation via device release function (memtier->dev.release
>> = memory_tier_device_release )
>>
>> Why do you think it is not a good idea?
>
> I think that we shouldn't bind our kernel internal implementation with
> user space interface too much. Yes. We can expose kernel internal
> implementation to user space in a direct way. I suggest you to follow
> the style of "struct pglist_data" and "struct node". If we decouple
> "struct memory_tier" and "struct memory_tier_dev" (or some other name),
> we can refer to "struct memory_tier" without depending on all device
> core. Memory tier should be accessible inside the kernel even without a
> user interface. And memory tier isn't a device in concept.
>

Memory tiers differ from pglist_data and struct node in that we also
allow them to be created from userspace. That is, the lifetime of a
memory tier is driven from userspace, and it is much easier to manage
it via the sysfs file lifetime mechanism than to invent an independent
and more complex way of doing the same.

> For life cycle management, I think that we can do that without sysfs
> too.
>

Unless there are specific details that you think will be broken by
embedding struct device inside struct memory_tier, I still consider
the embedded implementation much simpler and in line with other
kernel design patterns.
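
To make the lifetime argument concrete, here is a minimal user-space
sketch of the pattern, not the patch itself. Everything prefixed with
toy_ is hypothetical: toy_device stands in for struct device, and
toy_get_device()/toy_put_device() only imitate the reference counting
that the driver core does. The point it illustrates is that the tier
is freed from the release callback of the embedded object, once the
last reference is dropped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for struct device: a reference count plus a release hook. */
struct toy_device {
	int refcount;
	void (*release)(struct toy_device *dev);
};

/* Hypothetical tier object with the device embedded in it. */
struct toy_memory_tier {
	struct toy_device dev;
	unsigned long nodelist;	/* one bit per NUMA node */
};

/* Recover the enclosing structure from a pointer to its member. */
#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static int toy_tiers_freed;	/* bookkeeping for this example only */

/* Called when the last reference to the embedded device goes away. */
static void toy_memory_tier_release(struct toy_device *dev)
{
	struct toy_memory_tier *tier =
		toy_container_of(dev, struct toy_memory_tier, dev);

	free(tier);
	toy_tiers_freed++;
}

/* Allocate a tier holding one initial reference. */
static struct toy_memory_tier *toy_memory_tier_create(void)
{
	struct toy_memory_tier *tier = calloc(1, sizeof(*tier));

	if (!tier)
		return NULL;
	tier->dev.refcount = 1;
	tier->dev.release = toy_memory_tier_release;
	return tier;
}

static void toy_get_device(struct toy_device *dev)
{
	assert(dev->refcount > 0);
	dev->refcount++;
}

/* Drop one reference; the release hook frees the whole tier at zero. */
static void toy_put_device(struct toy_device *dev)
{
	assert(dev->refcount > 0);
	if (--dev->refcount == 0)
		dev->release(dev);
}
```

With this shape, a caller that still holds a reference (for example an
open attribute file) keeps the tier alive, and no separate list-based
lifetime tracking is needed.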

-aneesh

2022-06-06 10:35:26

by Bharata B Rao

Subject: Re: [RFC PATCH v4 4/7] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM

On 6/3/2022 2:34 PM, Aneesh Kumar K V wrote:
> On 6/2/22 12:06 PM, Bharata B Rao wrote:
>> On 6/1/2022 7:19 PM, Aneesh Kumar K V wrote:
>>> On 6/1/22 11:59 AM, Bharata B Rao wrote:
>>>> I was experimenting with this patchset and found this behaviour.
>>>> Here's what I did:
>>>>
>>>> Boot a KVM guest with vNVDIMM device which ends up with device_dax
>>>> driver by default.
>>>>
>>>> Use it as RAM by binding it to dax kmem driver. It now appears as
>>>> RAM with a new NUMA node that is put to memtier1 (the existing tier
>>>> where DRAM already exists)
>>>>
>>>
>>> That should have placed it in memtier2.
>>>
>>>> I can move it to memtier2 (MEMORY_RANK_PMEM) manually, but isn't
>>>> that expected to happen automatically when a node with dax kmem
>>>> device comes up?
>>>>
>>>
>>> This can happen if we have added the same NUMA node to memtier1 before dax kmem driver initialized the pmem memory. Can you check before the above node_set_memory_tier_rank() whether the specific NUMA node is already part of any memory tier?
>>
>> When we reach node_set_memory_tier_rank(), node1 (that has the pmem device)
>> is already part of memtier1 whose nodelist shows 0-1.
>>
>
> can you find out which code path added node1 to memtier1?

node_set_memory_tier_rank+0x63/0x80
migrate_on_reclaim_callback+0x40/0x4d
blocking_notifier_call_chain+0x68/0x90
memory_notify+0x1b/0x20
online_pages+0x257/0x2f0
memory_subsys_online+0x99/0x150
device_online+0x65/0x90
online_memory_block+0x1b/0x20
walk_memory_blocks+0x85/0xc0
? generic_online_page+0x40/0x40
add_memory_resource+0x1fa/0x2d0
add_memory_driver_managed+0x80/0xc0
dev_dax_kmem_probe+0x1af/0x250
dax_bus_probe+0x6e/0xa0

After this the explicit call to node_set_memory_tier_rank(numa_node, MEMORY_RANK_PMEM)
from dev_dax_kmem_probe() finds that the memtier is already set.

> Do you have regular memory also appearing on node1?

No, regular memory is on Node0.

Regards,
Bharata.

2022-06-06 10:45:24

by Aneesh Kumar K.V

Subject: Re: [RFC PATCH v4 4/7] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM

On 6/6/22 3:41 PM, Bharata B Rao wrote:
> On 6/3/2022 2:34 PM, Aneesh Kumar K V wrote:
>> On 6/2/22 12:06 PM, Bharata B Rao wrote:
>>> On 6/1/2022 7:19 PM, Aneesh Kumar K V wrote:
>>>> On 6/1/22 11:59 AM, Bharata B Rao wrote:
>>>>> I was experimenting with this patchset and found this behaviour.
>>>>> Here's what I did:
>>>>>
>>>>> Boot a KVM guest with vNVDIMM device which ends up with device_dax
>>>>> driver by default.
>>>>>
>>>>> Use it as RAM by binding it to dax kmem driver. It now appears as
>>>>> RAM with a new NUMA node that is put to memtier1 (the existing tier
>>>>> where DRAM already exists)
>>>>>
>>>>
>>>> That should have placed it in memtier2.
>>>>
>>>>> I can move it to memtier2 (MEMORY_RANK_PMEM) manually, but isn't
>>>>> that expected to happen automatically when a node with dax kmem
>>>>> device comes up?
>>>>>
>>>>
>>>> This can happen if we have added the same NUMA node to memtier1 before dax kmem driver initialized the pmem memory. Can you check before the above node_set_memory_tier_rank() whether the specific NUMA node is already part of any memory tier?
>>>
>>> When we reach node_set_memory_tier_rank(), node1 (that has the pmem device)
>>> is already part of memtier1 whose nodelist shows 0-1.
>>>
>>
>> can you find out which code path added node1 to memtier1?
>
> node_set_memory_tier_rank+0x63/0x80
> migrate_on_reclaim_callback+0x40/0x4d
> blocking_notifier_call_chain+0x68/0x90
> memory_notify+0x1b/0x20
> online_pages+0x257/0x2f0
> memory_subsys_online+0x99/0x150
> device_online+0x65/0x90
> online_memory_block+0x1b/0x20
> walk_memory_blocks+0x85/0xc0
> ? generic_online_page+0x40/0x40
> add_memory_resource+0x1fa/0x2d0
> add_memory_driver_managed+0x80/0xc0
> dev_dax_kmem_probe+0x1af/0x250
> dax_bus_probe+0x6e/0xa0
>
> After this the explicit call to node_set_memory_tier_rank(numa_node, MEMORY_RANK_PMEM)
> from dev_dax_kmem_probe() finds that the memtier is already set.
>
>> Do you have regular memory also appearing on node1?
>
> No, regular memory is on Node0.
>

Thanks for the stack trace. I was getting the KVM setup on my laptop to
test this. We should call node_set_mem_tier() earlier. You had automatic
onlining enabled on memory hotplug:

	/* online pages if requested */
	if (mhp_default_online_type != MMOP_OFFLINE)
		walk_memory_blocks(start, size, NULL, online_memory_block);


which caused the memory to be onlined before we could do
node_set_mem_tier. That is a bug on my side. I will send you a change
after testing.
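
The ordering problem described above can be reproduced in a small
user-space model. This is only a sketch under assumptions: the toy_
names, the tier numbers and the "set only if unset" behaviour are
illustrative stand-ins for what the stack trace suggests, not the
kernel code.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_NODES	8
#define TOY_TIER_DRAM	1	/* assumed default tier */
#define TOY_TIER_PMEM	2

struct toy_node {
	bool has_tier;
	int tier;
	bool online;
};

static struct toy_node toy_nodes[TOY_MAX_NODES];

/* Assign a tier only if the node has none yet; returns true on success. */
static bool toy_node_set_memory_tier(int node, int tier)
{
	assert(node >= 0 && node < TOY_MAX_NODES);
	if (toy_nodes[node].has_tier)
		return false;
	toy_nodes[node].has_tier = true;
	toy_nodes[node].tier = tier;
	return true;
}

/* Models the hotplug callback: an untiered node lands in the default tier. */
static void toy_online_node(int node)
{
	toy_nodes[node].online = true;
	(void)toy_node_set_memory_tier(node, TOY_TIER_DRAM);
}

/* Probe order that shows the bug: auto-online runs before the tier is set. */
static int toy_probe_tier_after_online(int node)
{
	toy_online_node(node);
	(void)toy_node_set_memory_tier(node, TOY_TIER_PMEM);
	return toy_nodes[node].tier;
}

/* Probe order with the tier chosen before any memory block is onlined. */
static int toy_probe_tier_before_online(int node)
{
	(void)toy_node_set_memory_tier(node, TOY_TIER_PMEM);
	toy_online_node(node);
	return toy_nodes[node].tier;
}
```

In this model toy_probe_tier_after_online() leaves the node in the
default tier, while toy_probe_tier_before_online() leaves it in the
PMEM tier.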

-aneesh

2022-06-06 12:08:39

by Aneesh Kumar K.V

Subject: Re: [RFC PATCH v4 4/7] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM

Aneesh Kumar K V writes:

> On 6/6/22 3:41 PM, Bharata B Rao wrote:
>> On 6/3/2022 2:34 PM, Aneesh Kumar K V wrote:
>>> On 6/2/22 12:06 PM, Bharata B Rao wrote:
>>>> On 6/1/2022 7:19 PM, Aneesh Kumar K V wrote:
>>>>> On 6/1/22 11:59 AM, Bharata B Rao wrote:
>>>>>> I was experimenting with this patchset and found this behaviour.
>>>>>> Here's what I did:
>>>>>>
>>>>>> Boot a KVM guest with vNVDIMM device which ends up with device_dax
>>>>>> driver by default.
>>>>>>
>>>>>> Use it as RAM by binding it to dax kmem driver. It now appears as
>>>>>> RAM with a new NUMA node that is put to memtier1 (the existing tier
>>>>>> where DRAM already exists)
>>>>>>
>>>>>
>>>>> That should have placed it in memtier2.
>>>>>
>>>>>> I can move it to memtier2 (MEMORY_RANK_PMEM) manually, but isn't
>>>>>> that expected to happen automatically when a node with dax kmem
>>>>>> device comes up?
>>>>>>
>>>>>
>>>>> This can happen if we have added the same NUMA node to memtier1 before dax kmem driver initialized the pmem memory. Can you check before the above node_set_memory_tier_rank() whether the specific NUMA node is already part of any memory tier?
>>>>
>>>> When we reach node_set_memory_tier_rank(), node1 (that has the pmem device)
>>>> is already part of memtier1 whose nodelist shows 0-1.
>>>>
>>>
>>> can you find out which code path added node1 to memtier1?
>>
>> node_set_memory_tier_rank+0x63/0x80
>> migrate_on_reclaim_callback+0x40/0x4d
>> blocking_notifier_call_chain+0x68/0x90
>> memory_notify+0x1b/0x20
>> online_pages+0x257/0x2f0
>> memory_subsys_online+0x99/0x150
>> device_online+0x65/0x90
>> online_memory_block+0x1b/0x20
>> walk_memory_blocks+0x85/0xc0
>> ? generic_online_page+0x40/0x40
>> add_memory_resource+0x1fa/0x2d0
>> add_memory_driver_managed+0x80/0xc0
>> dev_dax_kmem_probe+0x1af/0x250
>> dax_bus_probe+0x6e/0xa0
>>
>> After this the explicit call to node_set_memory_tier_rank(numa_node, MEMORY_RANK_PMEM)
>> from dev_dax_kmem_probe() finds that the memtier is already set.
>>
>>> Do you have regular memory also appearing on node1?
>>
>> No, regular memory is on Node0.
>>
>
> Thanks for the stack trace. I was getting the kvm setup on my laptop to
> test this. We should move node_set_mem_tier() early. You had automatic
> online on memory hotplug
>
> /* online pages if requested */
> if (mhp_default_online_type != MMOP_OFFLINE)
> walk_memory_blocks(start, size, NULL, online_memory_block);
>
>
> which caused memory to be onlined before we could do node_set_mem_tier.
> That is a bug on my side. Will send you a change after testing .
>
Can you try this change?

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 7a11c387fbbc..905609260dda 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -94,6 +94,17 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 		goto err_reg_mgid;
 	data->mgid = rc;

+	/*
+	 * This get called before the node is brought online. That
+	 * is because depending on the value of mhp_default_online_type
+	 * the kernel will online the memory along with hotplug
+	 * operation. Add the new memory tier before we try to bring
+	 * memory blocks online. Otherwise new node will get added to
+	 * the default memory tier via hotplug callbacks.
+	 */
+#ifdef CONFIG_TIERED_MEMORY
+	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
+#endif
 	for (i = 0; i < dev_dax->nr_range; i++) {
 		struct resource *res;
 		struct range range;
@@ -148,9 +159,6 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)

 	dev_set_drvdata(dev, data);

-#ifdef CONFIG_TIERED_MEMORY
-	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
-#endif
 	return 0;

 err_request_mem:

2022-06-06 12:13:02

by Bharata B Rao

Subject: Re: [RFC PATCH v4 4/7] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM

On 6/6/2022 5:24 PM, Aneesh Kumar K.V wrote:
> Aneesh Kumar K V <[email protected]> writes:
>>
> Can you try this change?
>
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 7a11c387fbbc..905609260dda 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -94,6 +94,17 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> goto err_reg_mgid;
> data->mgid = rc;
>
> + /*
> + * This get called before the node is brought online. That
> + * is because depending on the value of mhp_default_online_type
> + * the kernel will online the memory along with hotplug
> + * operation. Add the new memory tier before we try to bring
> + * memory blocks online. Otherwise new node will get added to
> + * the default memory tier via hotplug callbacks.
> + */
> +#ifdef CONFIG_TIERED_MEMORY
> + node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
> +#endif
> for (i = 0; i < dev_dax->nr_range; i++) {
> struct resource *res;
> struct range range;
> @@ -148,9 +159,6 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>
> dev_set_drvdata(dev, data);
>
> -#ifdef CONFIG_TIERED_MEMORY
> - node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
> -#endif
> return 0;
>
> err_request_mem:

Yes, this fixes the issue for me. Thanks.

Regards,
Bharata.

2022-06-06 13:07:00

by Aneesh Kumar K.V

Subject: Re: [RFC PATCH v4 4/7] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM

On 6/6/22 5:39 PM, Bharata B Rao wrote:
> On 6/6/2022 5:24 PM, Aneesh Kumar K.V wrote:
>> Aneesh Kumar K V <[email protected]> writes:
>>>
>> Can you try this change?
>>
>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>> index 7a11c387fbbc..905609260dda 100644
>> --- a/drivers/dax/kmem.c
>> +++ b/drivers/dax/kmem.c
>> @@ -94,6 +94,17 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>> goto err_reg_mgid;
>> data->mgid = rc;
>>
>> + /*
>> + * This get called before the node is brought online. That
>> + * is because depending on the value of mhp_default_online_type
>> + * the kernel will online the memory along with hotplug
>> + * operation. Add the new memory tier before we try to bring
>> + * memory blocks online. Otherwise new node will get added to
>> + * the default memory tier via hotplug callbacks.
>> + */
>> +#ifdef CONFIG_TIERED_MEMORY
>> + node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
>> +#endif
>> for (i = 0; i < dev_dax->nr_range; i++) {
>> struct resource *res;
>> struct range range;
>> @@ -148,9 +159,6 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>
>> dev_set_drvdata(dev, data);
>>
>> -#ifdef CONFIG_TIERED_MEMORY
>> - node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
>> -#endif
>> return 0;
>>
>> err_request_mem:
>
> Yes, this fixes the issue for me. Thanks.
>

I might use the change below instead of the one above. In the end, I
guess it is better to add a NUMA node to a memory tier after the node is
brought online rather than before, even though with the current code it
shouldn't matter much.

modified drivers/dax/kmem.c
@@ -147,9 +147,15 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 	}

 	dev_set_drvdata(dev, data);
-
+	/*
+	 * node_reset_memory_tier is used here to ensure we force
+	 * update the NUMA node memory tier. Depending on the value
+	 * of mhp_default_online_type the kernel will online the memory
+	 * blocks along with hotplug operation above. This can result in dax
+	 * kmem memory NUMA node getting added to default memory tier.
+	 */
 #ifdef CONFIG_TIERED_MEMORY
-	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
+	node_reset_memory_tier(numa_node, MEMORY_TIER_PMEM);
 #endif
 	return 0;
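
The difference between the "set" and "reset" variants can be sketched
in user space as follows. This is an illustration under assumptions,
not the kernel implementation: the toy_ names and tier numbers are
made up, "set" is assumed to refuse a node that already has a tier,
and "reset" is assumed to move the node unconditionally.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_NODES	8

enum toy_tier { TOY_TIER_DRAM = 1, TOY_TIER_PMEM = 2 };

struct toy_node {
	bool has_tier;
	enum toy_tier tier;
};

static struct toy_node toy_nodes[TOY_MAX_NODES];

/* "set": assigns a tier only when the node has none; true on success. */
static bool toy_node_set_memory_tier(int node, enum toy_tier tier)
{
	assert(node >= 0 && node < TOY_MAX_NODES);
	if (toy_nodes[node].has_tier)
		return false;
	toy_nodes[node].has_tier = true;
	toy_nodes[node].tier = tier;
	return true;
}

/* "reset": moves the node to the given tier whatever it had before. */
static void toy_node_reset_memory_tier(int node, enum toy_tier tier)
{
	assert(node >= 0 && node < TOY_MAX_NODES);
	toy_nodes[node].has_tier = true;
	toy_nodes[node].tier = tier;
}

/* Probe flow: auto-online picks the default tier, then probe forces PMEM. */
static enum toy_tier toy_probe_with_reset(int node)
{
	(void)toy_node_set_memory_tier(node, TOY_TIER_DRAM);
	toy_node_reset_memory_tier(node, TOY_TIER_PMEM);
	return toy_nodes[node].tier;
}
```

In this model the node ends up in the PMEM tier even though the
hotplug path placed it in the default tier first.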

2022-06-06 16:17:34

by Aneesh Kumar K.V

Subject: Re: [RFC PATCH v4 2/7] mm/demotion: Expose per node memory tier to sysfs

On 6/6/22 8:29 PM, Jonathan Cameron wrote:
> On Fri, 3 Jun 2022 14:10:47 +0530
> Aneesh Kumar K V <[email protected]> wrote:
>
>> On 5/27/22 7:45 PM, Jonathan Cameron wrote:
>>> On Fri, 27 May 2022 17:55:23 +0530
>>> "Aneesh Kumar K.V" <[email protected]> wrote:
>>>
>>>> From: Jagdish Gediya <[email protected]>
>>>>
>>>> Add support to read/write the memory tierindex for a NUMA node.
>>>>
>>>> /sys/devices/system/node/nodeN/memtier
>>>>
>>>> where N = node id
>>>>
>>>> When read, It list the memory tier that the node belongs to.
>>>>
>>>> When written, the kernel moves the node into the specified
>>>> memory tier, the tier assignment of all other nodes are not
>>>> affected.
>>>>
>>>> If the memory tier does not exist, writing to the above file
>>>> create the tier and assign the NUMA node to that tier.
>>> creates
>>>
>>> There was some discussion in v2 of Wei Xu's RFC that what matter
>>> for creation is the rank, not the tier number.
>>>
>>> My suggestion is move to an explicit creation file such as
>>> memtier/create_tier_from_rank
>>> to which writing the rank gives results in a new tier
>>> with the next device ID and requested rank.
>>
>> I think the below workflow is much simpler.
>>
>> :/sys/devices/system# cat memtier/memtier1/nodelist
>> 1-3
>> :/sys/devices/system# cat node/node1/memtier
>> 1
>> :/sys/devices/system# ls memtier/memtier*
>> nodelist power rank subsystem uevent
>> /sys/devices/system# ls memtier/
>> default_rank max_tier memtier1 power uevent
>> :/sys/devices/system# echo 2 > node/node1/memtier
>> :/sys/devices/system#
>>
>> :/sys/devices/system# ls memtier/
>> default_rank max_tier memtier1 memtier2 power uevent
>> :/sys/devices/system# cat memtier/memtier1/nodelist
>> 2-3
>> :/sys/devices/system# cat memtier/memtier2/nodelist
>> 1
>> :/sys/devices/system#
>>
>> ie, to create a tier we just write the tier id/tier index to
>> node/nodeN/memtier file. That will create a new memory tier if needed
>> and add the node to that specific memory tier. Since for now we are
>> having 1:1 mapping between tier index to rank value, we can derive the
>> rank value from the memory tier index.
>>
>> For dynamic memory tier support, we can assign a rank value such that
>> new memory tiers are always created such that it comes last in the
>> demotion order.
>
> I'm not keen on having to pass through an intermediate state where
> the rank may well be wrong, but I guess it's not that harmful even
> if it feels wrong ;)
>

Any newly added memory tier can be given the lowest rank (rank 0) and
hence will appear as the highest memory tier in the demotion order,
that is, it comes last. The user can then assign the right rank value to
the memory tier? Also, the actual demotion target paths are built during
memory block online, which in most cases would happen after we have
verified that the device got assigned to the right memory tier with the
correct rank value?
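
As a rough illustration of how rank, rather than tier ID, decides the
demotion order, here is a small user-space sketch. It assumes that a
higher rank means a faster tier and that demotion walks from higher to
lower rank; struct toy_tier and the function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical tier record: the ID names the tier, the rank orders it. */
struct toy_tier {
	int id;
	int rank;
};

/* qsort() comparator that puts higher ranks first. */
static int toy_cmp_rank_desc(const void *a, const void *b)
{
	const struct toy_tier *ta = a;
	const struct toy_tier *tb = b;

	return (tb->rank > ta->rank) - (tb->rank < ta->rank);
}

/*
 * Sort tiers into demotion order: entry i demotes to entry i + 1, and the
 * last entry (lowest rank) is the final demotion target.
 */
static void toy_sort_demotion_order(struct toy_tier *tiers, size_t n)
{
	assert(tiers != NULL || n == 0);
	qsort(tiers, n, sizeof(*tiers), toy_cmp_rank_desc);
}
```

A tier created with rank 0 therefore sorts to the end until its rank
is updated, whatever ID it was given.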

> Races are potentially a bit of a pain though depending on what we
> expect the usage model to be.
>
> There are patterns (CXL regions for example) of guaranteeing the
> 'right' device is created by doing something like
>
> cat create_tier > temp.txt
> #(temp gets 2 for example on first call then
> # next read of this file gets 3 etc)
>
> cat temp.txt > create_tier
> # will fail if there hasn't been a read of the same value
>
> Assuming all software keeps to the model, then there are no
> race conditions over creation. Otherwise we have two new
> devices turn up very close to each other and userspace scripting
> tries to create two new tiers - if it races they may end up in
> the same tier when that wasn't the intent. Then code to set
> the rank also races and we get two potentially very different
> memories in a tier with a randomly selected rank.
>
> Fun and games... And a fine illustration why sysfs based 'device'
> creation is tricky to get right (and lots of cases in the kernel
> don't).
>

I would expect userspace to be careful and verify the memory tier and
rank value before we online the memory blocks backed by the device. Even
if we race, the result would be two devices that were not intended to be
part of the same memory tier appearing in the same tier. But at that
point we won't be building demotion targets yet. So userspace could
verify this and move the nodes out of the memory tier. Once that is
verified, the memory blocks can be onlined.
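
For reference, the create_tier read/write handshake quoted above can
be modelled in a few lines of user-space C. This is a sketch under
assumptions: the toy_ names, the tier limit and the starting ID of 2
are illustrative, and a real implementation would need locking around
both operations.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_TIERS	8
#define TOY_FIRST_FREE	2	/* the quoted example hands out 2 first */

static int toy_next_id = TOY_FIRST_FREE;
static bool toy_reserved[TOY_MAX_TIERS];
static bool toy_created[TOY_MAX_TIERS];

/* Models a read of create_tier: hand out a fresh ID and remember it. */
static int toy_create_tier_read(void)
{
	int id;

	if (toy_next_id >= TOY_MAX_TIERS)
		return -1;	/* tier IDs exhausted */
	id = toy_next_id++;
	assert(!toy_reserved[id]);	/* each ID is handed out only once */
	toy_reserved[id] = true;
	return id;
}

/* Models a write to create_tier: only a previously read ID is accepted. */
static bool toy_create_tier_write(int id)
{
	if (id < 0 || id >= TOY_MAX_TIERS)
		return false;
	if (!toy_reserved[id] || toy_created[id])
		return false;
	toy_created[id] = true;
	return true;
}
```

Two callers that each read first get different IDs, so neither can
land in the other's tier by accident; a write of an ID that was never
read, or that was already used, is rejected.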

Having said that, can you outline the usage of
memtier/create_tier_from_rank?

-aneesh

2022-06-06 16:56:07

by Aneesh Kumar K.V

Subject: Re: [RFC PATCH v4 2/7] mm/demotion: Expose per node memory tier to sysfs

On 6/6/22 9:46 PM, Jonathan Cameron wrote:
> On Mon, 6 Jun 2022 21:31:16 +0530
> Aneesh Kumar K V <[email protected]> wrote:
>
>> On 6/6/22 8:29 PM, Jonathan Cameron wrote:
>>> On Fri, 3 Jun 2022 14:10:47 +0530
>>> Aneesh Kumar K V <[email protected]> wrote:
>>>
>>>> On 5/27/22 7:45 PM, Jonathan Cameron wrote:
>>>>> On Fri, 27 May 2022 17:55:23 +0530
>>>>> "Aneesh Kumar K.V" <[email protected]> wrote:
>>>>>
>>>>>> From: Jagdish Gediya <[email protected]>
>>>>>>
>>>>>> Add support to read/write the memory tierindex for a NUMA node.
>>>>>>
>>>>>> /sys/devices/system/node/nodeN/memtier
>>>>>>
>>>>>> where N = node id
>>>>>>
>>>>>> When read, It list the memory tier that the node belongs to.
>>>>>>
>>>>>> When written, the kernel moves the node into the specified
>>>>>> memory tier, the tier assignment of all other nodes are not
>>>>>> affected.
>>>>>>
>>>>>> If the memory tier does not exist, writing to the above file
>>>>>> create the tier and assign the NUMA node to that tier.
>>>>> creates
>>>>>
>>>>> There was some discussion in v2 of Wei Xu's RFC that what matter
>>>>> for creation is the rank, not the tier number.
>>>>>
>>>>> My suggestion is move to an explicit creation file such as
>>>>> memtier/create_tier_from_rank
>>>>> to which writing the rank gives results in a new tier
>>>>> with the next device ID and requested rank.
>>>>
>>>> I think the below workflow is much simpler.
>>>>
>>>> :/sys/devices/system# cat memtier/memtier1/nodelist
>>>> 1-3
>>>> :/sys/devices/system# cat node/node1/memtier
>>>> 1
>>>> :/sys/devices/system# ls memtier/memtier*
>>>> nodelist power rank subsystem uevent
>>>> /sys/devices/system# ls memtier/
>>>> default_rank max_tier memtier1 power uevent
>>>> :/sys/devices/system# echo 2 > node/node1/memtier
>>>> :/sys/devices/system#
>>>>
>>>> :/sys/devices/system# ls memtier/
>>>> default_rank max_tier memtier1 memtier2 power uevent
>>>> :/sys/devices/system# cat memtier/memtier1/nodelist
>>>> 2-3
>>>> :/sys/devices/system# cat memtier/memtier2/nodelist
>>>> 1
>>>> :/sys/devices/system#
>>>>
>>>> ie, to create a tier we just write the tier id/tier index to
>>>> node/nodeN/memtier file. That will create a new memory tier if needed
>>>> and add the node to that specific memory tier. Since for now we are
>>>> having 1:1 mapping between tier index to rank value, we can derive the
>>>> rank value from the memory tier index.
>>>>
>>>> For dynamic memory tier support, we can assign a rank value such that
>>>> new memory tiers are always created such that it comes last in the
>>>> demotion order.
>>>
>>> I'm not keen on having to pass through an intermediate state where
>>> the rank may well be wrong, but I guess it's not that harmful even
>>> if it feels wrong ;)
>>>
>>
>> Any new memory tier added can be of lowest rank (rank - 0) and hence
>> will appear as the highest memory tier in demotion order.
>
> Depends on driver interaction - if new memory is CXL attached or
> GPU attached, chances are the driver has an input on which tier
> it is put in by default.
>
>> User can then
>> assign the right rank value to the memory tier? Also the actual demotion
>> target paths are built during memory block online which in most case
>> would happen after we properly verify that the device got assigned to
>> the right memory tier with correct rank value?
>
> Agreed, though that may change the model of how memory is brought online
> somewhat.
>
>>
>>> Races are potentially a bit of a pain though depending on what we
>>> expect the usage model to be.
>>>
>>> There are patterns (CXL regions for example) of guaranteeing the
>>> 'right' device is created by doing something like
>>>
>>> cat create_tier > temp.txt
>>> #(temp gets 2 for example on first call then
>>> # next read of this file gets 3 etc)
>>>
>>> cat temp.txt > create_tier
>>> # will fail if there hasn't been a read of the same value
>>>
>>> Assuming all software keeps to the model, then there are no
>>> race conditions over creation. Otherwise we have two new
>>> devices turn up very close to each other and userspace scripting
>>> tries to create two new tiers - if it races they may end up in
>>> the same tier when that wasn't the intent. Then code to set
>>> the rank also races and we get two potentially very different
>>> memories in a tier with a randomly selected rank.
>>>
>>> Fun and games... And a fine illustration why sysfs based 'device'
>>> creation is tricky to get right (and lots of cases in the kernel
>>> don't).
>>>
>>
>> I would expect userspace to be careful and verify the memory tier and
>> rank value before we online the memory blocks backed by the device. Even
>> if we race, the result would be two device not intended to be part of
>> the same memory tier appearing at the same tier. But then we won't be
>> building demotion targets yet. So userspace could verify this, move the
>> nodes out of the memory tier. Once it is verified, memory blocks can be
>> onlined.
>
> The race is there and not avoidable as far as I can see. Two processes A and B.
>
> A checks for a spare tier number
> B checks for a spare tier number
> A tries to assign node 3 to new tier 2 (new tier created)
> B tries to assign node 4 to new tier 2 (accidentally hits existing tier - as this
> is the same method we'd use to put it in the existing tier we can't tell this
> write was meant to create a new tier).
> A writes rank 100 to tier 2
> A checks rank for tier 2 and finds it is 100 as expected.
> B write rank 200 to tier 2 (it could check if still default but even that is racy)
> B checks rank for tier 2 rank and finds it is 200 as expected.
> A onlines memory.
> B onlines memory.
>
> Both think they got what they wanted, but A definitely didn't.
>
> One work around is the read / write approach and create_tier.
>
> A reads create_tier - gets 2.
> B reads create_tier - gets 3.
> A writes 2 to create_tier as that's what it read.
> B writes 3 to create_tier as that's what it read.
>
> continue with created tiers. Obviously can exhaust tiers, but if this is
> root only, could just create lots anyway so no worse off.
>
>>
>> Having said that can you outline the usage of
>> memtier/create_tier_from_rank ?
>
> There are corner cases to deal with...
>
> A writes 100 to create_tier_from_rank.
> A goes looking for matching tier - finds it: tier2
> B writes 200 to create_tier_from_rank
> B goes looking for matching tier - finds it: tier3
>
> rest is fine as operating on different tiers.
>
> Trickier is
> A writes 100 to create_tier_from_rank - succeed.
> B writes 100 to create_tier_from_rank - Could fail, or could just eat it?
>
> Logically this is same as separate create_tier and then a write
> of rank, but in one operation, but then you need to search
> for the right one. As such, perhaps a create_tier
> that does the read/write pair as above is the best solution.
>

This is all fine once we allow dynamic rank values. But currently we are
restricting ourselves to three rank values, as below:

rank    memtier
300     memtier0
200     memtier1
100     memtier2

Now, with the above, how do we define a write to create_tier_from_rank?
What should the behavior be if the user writes a value other than the
rank values defined above? Also, enforcing the above three rank values
as the only supported ones implies teaching userspace about them. I am
trying to see how to fit in create_tier_from_rank without requiring the
above.

Can we look at implementing create_tier_from_rank when we start
supporting dynamic tiers/rank values? That is:

We still allow node/nodeN/memtier. But with dynamic tiers, a race-free
way to get a new memory tier would be echo rank >
memtier/create_tier_from_rank. We could also say that memtier0/1/2 are
kernel-defined memory tiers, and that writing to
memtier/create_tier_from_rank will create new memory tiers above
memtier2 with the specified rank value?
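
A minimal user-space sketch of that proposal could look like the
following. All toy_ names and the tier limit are hypothetical; the
three pre-populated entries mirror the rank table above. The sketch
simply accepts a rank that is already in use, which is one of the
open questions in this thread rather than a settled behaviour.

```c
#include <assert.h>

#define TOY_MAX_TIERS		16
#define TOY_NR_KERNEL_TIERS	3	/* memtier0, memtier1, memtier2 */

struct toy_tier {
	int in_use;
	int rank;
};

/* Kernel-defined tiers with the fixed ranks from the table. */
static struct toy_tier toy_tiers[TOY_MAX_TIERS] = {
	{ 1, 300 },	/* memtier0 */
	{ 1, 200 },	/* memtier1 */
	{ 1, 100 },	/* memtier2 */
};

/*
 * Models a write of "rank" to memtier/create_tier_from_rank: pick the
 * next free ID above the kernel-defined tiers and set its rank in the
 * same step. Returns the new tier ID, or -1 if no ID is left.
 */
static int toy_create_tier_from_rank(int rank)
{
	int id;

	assert(rank >= 0);
	for (id = TOY_NR_KERNEL_TIERS; id < TOY_MAX_TIERS; id++) {
		if (!toy_tiers[id].in_use) {
			toy_tiers[id].in_use = 1;
			toy_tiers[id].rank = rank;
			return id;
		}
	}
	return -1;
}
```

Because ID allocation and rank assignment happen in one call, two
writers always end up with distinct tiers, and the kernel-defined
tiers are never touched.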

-aneesh



2022-06-06 18:24:38

by Aneesh Kumar K.V

Subject: Re: [RFC PATCH v4 2/7] mm/demotion: Expose per node memory tier to sysfs

Aneesh Kumar K V writes:

> On 6/6/22 9:46 PM, Jonathan Cameron wrote:
>> On Mon, 6 Jun 2022 21:31:16 +0530
>> Aneesh Kumar K V <[email protected]> wrote:
>>
>>> On 6/6/22 8:29 PM, Jonathan Cameron wrote:
>>>> On Fri, 3 Jun 2022 14:10:47 +0530
>>>> Aneesh Kumar K V <[email protected]> wrote:
>>>>
>>>>> On 5/27/22 7:45 PM, Jonathan Cameron wrote:
>>>>>> On Fri, 27 May 2022 17:55:23 +0530
>>>>>> "Aneesh Kumar K.V" <[email protected]> wrote:
>>>>>>
>>>>>>> From: Jagdish Gediya <[email protected]>
>>>>>>>
>>>>>>> Add support to read/write the memory tierindex for a NUMA node.
>>>>>>>
>>>>>>> /sys/devices/system/node/nodeN/memtier
>>>>>>>
>>>>>>> where N = node id
>>>>>>>
>>>>>>> When read, It list the memory tier that the node belongs to.
>>>>>>>
>>>>>>> When written, the kernel moves the node into the specified
>>>>>>> memory tier, the tier assignment of all other nodes are not
>>>>>>> affected.
>>>>>>>
>>>>>>> If the memory tier does not exist, writing to the above file
>>>>>>> create the tier and assign the NUMA node to that tier.
>>>>>> creates
>>>>>>
>>>>>> There was some discussion in v2 of Wei Xu's RFC that what matter
>>>>>> for creation is the rank, not the tier number.
>>>>>>
>>>>>> My suggestion is move to an explicit creation file such as
>>>>>> memtier/create_tier_from_rank
>>>>>> to which writing the rank gives results in a new tier
>>>>>> with the next device ID and requested rank.
>>>>>
>>>>> I think the below workflow is much simpler.
>>>>>
>>>>> :/sys/devices/system# cat memtier/memtier1/nodelist
>>>>> 1-3
>>>>> :/sys/devices/system# cat node/node1/memtier
>>>>> 1
>>>>> :/sys/devices/system# ls memtier/memtier*
>>>>> nodelist power rank subsystem uevent
>>>>> /sys/devices/system# ls memtier/
>>>>> default_rank max_tier memtier1 power uevent
>>>>> :/sys/devices/system# echo 2 > node/node1/memtier
>>>>> :/sys/devices/system#
>>>>>
>>>>> :/sys/devices/system# ls memtier/
>>>>> default_rank max_tier memtier1 memtier2 power uevent
>>>>> :/sys/devices/system# cat memtier/memtier1/nodelist
>>>>> 2-3
>>>>> :/sys/devices/system# cat memtier/memtier2/nodelist
>>>>> 1
>>>>> :/sys/devices/system#
>>>>>
>>>>> ie, to create a tier we just write the tier id/tier index to
>>>>> node/nodeN/memtier file. That will create a new memory tier if needed
>>>>> and add the node to that specific memory tier. Since for now we are
>>>>> having 1:1 mapping between tier index to rank value, we can derive the
>>>>> rank value from the memory tier index.
>>>>>
>>>>> For dynamic memory tier support, we can assign a rank value such that
>>>>> new memory tiers are always created such that it comes last in the
>>>>> demotion order.
>>>>
>>>> I'm not keen on having to pass through an intermediate state where
>>>> the rank may well be wrong, but I guess it's not that harmful even
>>>> if it feels wrong ;)
>>>>
>>>
>>> Any new memory tier added can be of lowest rank (rank - 0) and hence
>>> will appear as the highest memory tier in demotion order.
>>
>> Depends on driver interaction - if new memory is CXL attached or
>> GPU attached, chances are the driver has an input on which tier
>> it is put in by default.
>>
>>> User can then
>>> assign the right rank value to the memory tier? Also the actual demotion
>>> target paths are built during memory block online which in most case
>>> would happen after we properly verify that the device got assigned to
>>> the right memory tier with correct rank value?
>>
>> Agreed, though that may change the model of how memory is brought online
>> somewhat.
>>
>>>
>>>> Races are potentially a bit of a pain though depending on what we
>>>> expect the usage model to be.
>>>>
>>>> There are patterns (CXL regions for example) of guaranteeing the
>>>> 'right' device is created by doing something like
>>>>
>>>> cat create_tier > temp.txt
>>>> #(temp gets 2 for example on first call then
>>>> # next read of this file gets 3 etc)
>>>>
>>>> cat temp.txt > create_tier
>>>> # will fail if there hasn't been a read of the same value
>>>>
>>>> Assuming all software keeps to the model, then there are no
>>>> race conditions over creation. Otherwise we have two new
>>>> devices turn up very close to each other and userspace scripting
>>>> tries to create two new tiers - if it races they may end up in
>>>> the same tier when that wasn't the intent. Then code to set
>>>> the rank also races and we get two potentially very different
>>>> memories in a tier with a randomly selected rank.
>>>>
>>>> Fun and games... And a fine illustration why sysfs based 'device'
>>>> creation is tricky to get right (and lots of cases in the kernel
>>>> don't).
>>>>
>>>
>>> I would expect userspace to be careful and verify the memory tier and
>>> rank value before we online the memory blocks backed by the device. Even
>>> if we race, the result would be two device not intended to be part of
>>> the same memory tier appearing at the same tier. But then we won't be
>>> building demotion targets yet. So userspace could verify this, move the
>>> nodes out of the memory tier. Once it is verified, memory blocks can be
>>> onlined.
>>
>> The race is there and not avoidable as far as I can see. Two processes A and B.
>>
>> A checks for a spare tier number
>> B checks for a spare tier number
>> A tries to assign node 3 to new tier 2 (new tier created)
>> B tries to assign node 4 to new tier 2 (accidentally hits existing tier - as this
>> is the same method we'd use to put it in the existing tier we can't tell this
>> write was meant to create a new tier).
>> A writes rank 100 to tier 2
>> A checks rank for tier 2 and finds it is 100 as expected.
>> B write rank 200 to tier 2 (it could check if still default but even that is racy)
>> B checks rank for tier 2 rank and finds it is 200 as expected.
>> A onlines memory.
>> B onlines memory.
>>
>> Both think they got what they wanted, but A definitely didn't.
>>
>> One work around is the read / write approach and create_tier.
>>
>> A reads create_tier - gets 2.
>> B reads create_tier - gets 3.
>> A writes 2 to create_tier as that's what it read.
>> B writes 3 to create_tier as that's what it read.
>>
>> continue with created tiers. Obviously can exhaust tiers, but if this is
>> root only, could just create lots anyway so no worse off.
>>
>>>
>>> Having said that can you outline the usage of
>>> memtier/create_tier_from_rank ?
>>
>> There are corner cases to deal with...
>>
>> A writes 100 to create_tier_from_rank.
>> A goes looking for matching tier - finds it: tier2
>> B writes 200 to create_tier_from_rank
>> B goes looking for matching tier - finds it: tier3
>>
>> rest is fine as operating on different tiers.
>>
>> Trickier is
>> A writes 100 to create_tier_from_rank - succeed.
>> B writes 100 to create_tier_from_rank - Could fail, or could just eat it?
>>
>> Logically this is same as separate create_tier and then a write
>> of rank, but in one operation, but then you need to search
>> for the right one. As such, perhaps a create_tier
>> that does the read/write pair as above is the best solution.
>>
>
> This all is good when we allow dynamic rank values. But currently we are
> restricting ourselves to three rank value as below:
>
> rank memtier
> 300 memtier0
> 200 memtier1
> 100 memtier2
>
> Now with the above, how do we define a write to create_tier_from_rank.
> What should be the behavior if user write value other than above defined
> rank values? Also enforcing the above three rank values as supported
> implies teaching userspace about them. I am trying to see how to fit
> create_tier_from_rank without requiring the above.
>
> Can we look at implementing create_tier_from_rank when we start
> supporting dynamic tiers/rank values? ie,
>
> we still allow node/nodeN/memtier. But with dynamic tiers a race free
> way to get a new memory tier would be echo rank >
> memtier/create_tier_from_rank. We could also say, memtier0/1/2 are
> kernel defined memory tiers. Writing to memtier/create_tier_from_rank
> will create new memory tiers above memtier2 with the rank value specified?
>

To stay compatible, we could do this: allow creation of just one
additional memory tier (memtier3) via the above interface.


:/sys/devices/system/memtier# ls -al
total 0
drwxr-xr-x 4 root root 0 Jun 6 17:39 .
drwxr-xr-x 10 root root 0 Jun 6 17:39 ..
--w------- 1 root root 4096 Jun 6 17:40 create_tier_from_rank
-r--r--r-- 1 root root 4096 Jun 6 17:40 default_tier
-r--r--r-- 1 root root 4096 Jun 6 17:40 max_tier
drwxr-xr-x 3 root root 0 Jun 6 17:39 memtier1
drwxr-xr-x 2 root root 0 Jun 6 17:40 power
-rw-r--r-- 1 root root 4096 Jun 6 17:39 uevent
:/sys/devices/system/memtier# echo 20 > create_tier_from_rank
:/sys/devices/system/memtier# ls
create_tier_from_rank default_tier max_tier memtier1 memtier3 power uevent
:/sys/devices/system/memtier# cat memtier3/rank
20
:/sys/devices/system/memtier# echo 20 > create_tier_from_rank
bash: echo: write error: No space left on device
:/sys/devices/system/memtier#

is this good?

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 0468af60d427..a4150120ba24 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -13,7 +13,7 @@
#define MEMORY_RANK_PMEM 100

#define DEFAULT_MEMORY_TIER MEMORY_TIER_DRAM
-#define MAX_MEMORY_TIERS 3
+#define MAX_MEMORY_TIERS 4

extern bool numa_demotion_enabled;
extern nodemask_t promotion_mask;
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index c6eb223a219f..7fdee0c4c4ea 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -169,7 +169,8 @@ static void insert_memory_tier(struct memory_tier *memtier)
list_add_tail(&memtier->list, &memory_tiers);
}

-static struct memory_tier *register_memory_tier(unsigned int tier)
+static struct memory_tier *register_memory_tier(unsigned int tier,
+ unsigned int rank)
{
int error;
struct memory_tier *memtier;
@@ -182,7 +183,7 @@ static struct memory_tier *register_memory_tier(unsigned int tier)
return NULL;

memtier->dev.id = tier;
- memtier->rank = get_rank_from_tier(tier);
+ memtier->rank = rank;
memtier->dev.bus = &memory_tier_subsys;
memtier->dev.release = memory_tier_device_release;
memtier->dev.groups = memory_tier_dev_groups;
@@ -218,9 +219,53 @@ default_tier_show(struct device *dev, struct device_attribute *attr, char *buf)
}
static DEVICE_ATTR_RO(default_tier);

+
+static struct memory_tier *__get_memory_tier_from_id(int id);
+static ssize_t create_tier_from_rank_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ int ret, rank;
+ struct memory_tier *memtier;
+
+ ret = kstrtouint(buf, 10, &rank);
+ if (ret)
+ return ret;
+
+ if (rank == MEMORY_RANK_HBM_GPU ||
+ rank == MEMORY_RANK_DRAM ||
+ rank == MEMORY_RANK_PMEM)
+ return -EINVAL;
+
+ mutex_lock(&memory_tier_lock);
+ /*
+ * For now we only support creation of one additional tier via
+ * this interface.
+ */
+ memtier = __get_memory_tier_from_id(3);
+ if (!memtier) {
+ memtier = register_memory_tier(3, rank);
+ if (!memtier) {
+ ret = -EINVAL;
+ goto out;
+ }
+ } else {
+ ret = -ENOSPC;
+ goto out;
+ }
+
+ ret = count;
+out:
+ mutex_unlock(&memory_tier_lock);
+ return ret;
+}
+static DEVICE_ATTR_WO(create_tier_from_rank);
+
+
static struct attribute *memory_tier_attrs[] = {
&dev_attr_max_tier.attr,
&dev_attr_default_tier.attr,
+ &dev_attr_create_tier_from_rank.attr,
NULL
};

@@ -302,7 +347,7 @@ static int __node_set_memory_tier(int node, int tier)

memtier = __get_memory_tier_from_id(tier);
if (!memtier) {
- memtier = register_memory_tier(tier);
+ memtier = register_memory_tier(tier, get_rank_from_tier(tier));
if (!memtier) {
ret = -EINVAL;
goto out;
@@ -651,7 +696,8 @@ static int __init memory_tier_init(void)
* Register only default memory tier to hide all empty
* memory tier from sysfs.
*/
- memtier = register_memory_tier(DEFAULT_MEMORY_TIER);
+ memtier = register_memory_tier(DEFAULT_MEMORY_TIER,
+ get_rank_from_tier(DEFAULT_MEMORY_TIER));
if (!memtier)
panic("%s() failed to register memory tier: %d\n", __func__, ret);

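The behaviour shown in the shell session above (one extra tier can be
created, the three kernel-defined ranks are refused, and a second write
fails with "No space left on device") can be modelled in plain userspace
C. This is only a sketch under stated assumptions: struct tier_model and
model_create_tier_from_rank() are hypothetical names, the reserved ranks
are taken from the 300/200/100 table quoted earlier, and none of this is
the kernel implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Ranks of the three kernel-defined tiers, per the table quoted above. */
#define RANK_HBM_GPU	300
#define RANK_DRAM	200
#define RANK_PMEM	100

#define MAX_TIERS	4	/* MAX_MEMORY_TIERS after the patch */
#define EXTRA_TIER	3	/* the only tier ID this interface creates */

/* Hypothetical userspace stand-in for the list of registered tiers. */
struct tier_model {
	int present[MAX_TIERS];
	unsigned int rank[MAX_TIERS];
};

/*
 * Model of a write to create_tier_from_rank.
 * Returns 0 on success or a negative errno, as a store handler would.
 */
static int model_create_tier_from_rank(struct tier_model *m, unsigned int rank)
{
	assert(m != NULL);

	/* A rank already owned by a kernel-defined tier is rejected. */
	if (rank == RANK_HBM_GPU || rank == RANK_DRAM || rank == RANK_PMEM)
		return -EINVAL;

	/* Only one additional tier is supported for now. */
	if (m->present[EXTRA_TIER])
		return -ENOSPC;

	m->present[EXTRA_TIER] = 1;
	m->rank[EXTRA_TIER] = rank;
	return 0;
}
```

Starting from a zeroed struct tier_model, the first call with rank 20
succeeds and any later call returns -ENOSPC, which matches the second
echo in the session above.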

2022-06-08 05:57:31

by Tim Chen

Subject: Re: RFC: Memory Tiering Kernel Interfaces (v3)

On Mon, 2022-05-30 at 13:50 +0100, Jonathan Cameron wrote:
>
> > When discussed offline, Tim Chen pointed out that with the proposed
> > interface, it's unconvenient to know the position of a given memory tier
> > in all memory tiers. We must sort "rank" of all memory tiers to know
> > that. "possible" file can be used for that. Although "possible" file
> > can be generated with a shell script, it's more convenient to show it
> > directly.
> >
> > Another way to address the issue is to add memtierN/pos for each memory
> > tier as suggested by Tim. It's readonly and will show position of
> > "memtierN" in all memory tiers. It's even better to show the relative
> > postion to the default memory tier (DRAM with CPU). That is, the
> > position of DRAM memory tier is 0.
> >
> > Unlike memory tier device ID or rank, the position is relative and
> > dynamic.
>
> Hi,
>
> I'm unconvinced. This is better done with a shell script than
> by adding ABI we'll have to live with for ever..
>
> I'm no good at shell scripting but this does the job
> grep "" tier*/rank | sort -n -k 2 -t :
>
> tier2/rank:50
> tier0/rank:100
> tier1/rank:200
> tier3/rank:240
>
> I'm sure someone more knowledgeable will do it in a simpler fashion still.
>
>

You can argue that

$ cat /sys/devices/system/cpu/cpu1/topology/core_siblings
f
$ cat /sys/devices/system/cpu/cpu1/topology/core_siblings_list
0-3

provide exactly the same information and we should get rid of
core_siblings_list. I think core_siblings_list exists to make
it easier for a human, so he/she doesn't have to parse the mask,
or write a script to find out the IDs of the CPUs that are siblings.

I think in the same spirit, having an interface that lets a
human quickly see the hierarchical relationship of tiers
relative to each other is helpful.

Tim

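For reference, the proposed "position relative to the default tier" can
be derived from the rank values alone. The following is a rough
userspace sketch, not a proposed kernel interface: struct tier_rank and
tier_relative_pos() are hypothetical names, a higher rank is assumed to
mean a higher tier (as in the 300/200/100 table earlier in the thread),
and the sign convention (negative above the default tier, positive
below it) is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct tier_rank {
	int id;		/* N in memtierN */
	int rank;
};

/* Order by descending rank, so the highest tier comes first. */
static int cmp_rank_desc(const void *a, const void *b)
{
	const struct tier_rank *x = a;
	const struct tier_rank *y = b;

	return (x->rank < y->rank) - (x->rank > y->rank);
}

/*
 * Sort the tiers by rank and store in *pos the position of tier "id"
 * relative to tier "default_id": 0 for the default tier itself,
 * negative for higher tiers, positive for lower tiers.
 * Returns 0 on success, -1 if either ID is not in the array.
 */
static int tier_relative_pos(struct tier_rank *tiers, size_t n,
			     int id, int default_id, int *pos)
{
	long idx = -1, def = -1;
	size_t i;

	assert(tiers != NULL && pos != NULL);

	qsort(tiers, n, sizeof(*tiers), cmp_rank_desc);
	for (i = 0; i < n; i++) {
		if (tiers[i].id == id)
			idx = (long)i;
		if (tiers[i].id == default_id)
			def = (long)i;
	}
	if (idx < 0 || def < 0)
		return -1;

	*pos = (int)(idx - def);
	return 0;
}
```

With the ranks from the grep output quoted above (tier2:50, tier0:100,
tier1:200, tier3:240) and tier1 assumed to be the default tier, this
places tier3 at position -1 and tier2 at position 2.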
2022-06-08 07:23:35

by Huang, Ying

Subject: Re: [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers

On Mon, 2022-06-06 at 14:32 +0530, Aneesh Kumar K V wrote:
> On 6/6/22 2:22 PM, Ying Huang wrote:
> ....
> > > > > I can move the patch "mm/demotion/dax/kmem: Set node's memory tier to
> > > > > MEMORY_TIER_PMEM" before switching the demotion logic so that on systems
> > > > > with two memory tiers (DRAM and pmem) the demotion continues to work
> > > > > as expected after patch 3 ("mm/demotion: Build demotion targets based on
> > > > > explicit memory tiers"). With that, there will not be any regression in
> > > > > between the patch series.
> > > > >
> > > >
> > > > Thanks! Please do that. And I think you can add sysfs interface after
> > > > that patch too. That is, in [1/7]
> > > >
> > >
> > > I am not sure why you insist on moving sysfs interfaces later. They are
> > > introduced based on the helper added. It make patch review easier to
> > > look at both the helpers and the user of the helper together in a patch.
> >
> > Yes. We should introduce a function and its user in one patch for
> > review. But this doesn't mean that we should introduce the user space
> > interface as the first step. I think the user space interface should
> > output correct information when we expose it.
> >
>
> If you look at this patchset we are not exposing any wrong information.
>
> patch 1 -> adds ability to register the memory tiers and expose details
> of registered memory tier. At this point the patchset only support DRAM
> tier and hence only one tier is shown

But inside the kernel, we actually work with 2 tiers and demote/promote
pages between them. With the information from your interface, users
would think that there is no demotion/promotion in the kernel because
there's only 1 tier.

> patch 2 -> adds per node memtier attribute. So only DRAM nodes shows the
> details, because the patchset yet has not introduced a slower memory
> tier like PMEM.
>
> patch 4 -> introducing demotion. Will make that patch 5
>
> patch 5 -> add dax kmem numa nodes as slower memory tier. Now this
> becomes patch 4 at which point we will correctly show two memory tiers
> in the system.
>
>
> > > > +struct memory_tier {
> > > > + nodemask_t nodelist;
> > > > +};
> > > >
> > > > And struct device can be added after the kernel has switched the
> > > > implementation based on explicit memory tiers.
> > > >
> > > > +struct memory_tier {
> > > > + struct device dev;
> > > > + nodemask_t nodelist;
> > > > +};
> > > >
> > >
> > >
> > > Can you elaborate on this? or possibly review the v5 series indicating
> > > what change you are suggesting here?
> > >
> > >
> > > > But I don't think it's a good idea to have "struct device" embedded in
> > > > "struct memory_tier". We don't have "struct device" embedded in "struct
> > > > pgdata_list"...
> > > >
> > >
> > > I avoided creating an array for memory_tier (memory_tier[]) so that we
> > > can keep it dynamic. Keeping dev embedded in struct memory_tier simplify
> > > the life cycle management of that dynamic list. We free the struct
> > > memory_tier allocation via device release function (memtier->dev.release
> > > = memory_tier_device_release )
> > >
> > > Why do you think it is not a good idea?
> >
> > I think that we shouldn't bind our kernel internal implementation with
> > user space interface too much. Yes. We can expose kernel internal
> > implementation to user space in a direct way. I suggest you to follow
> > the style of "struct pglist_data" and "struct node". If we decouple
> > "struct memory_tier" and "struct memory_tier_dev" (or some other name),
> > we can refer to "struct memory_tier" without depending on all device
> > core. Memory tier should be accessible inside the kernel even without a
> > user interface. And memory tier isn't a device in concept.
> >
>
> memory_tiers are different from pglist_data and struct node in that we
> also allow the creation of them from userspace.

I don't think that there's much difference. struct pglist_data and
struct node can be created/destroyed dynamically too. Please take a
look at

__try_online_node()
register_one_node()
try_offline_node()
unregister_one_node()

> That is the life time of
> a memory tier is driven from userspace and it is much easier to manage
> them via sysfs file lifetime mechanism rather than inventing an
> independent and more complex way of doing the same.

You need to manage the lifetime of struct memory_tier in the kernel
too, because there are kernel users. And even if you use the device core
lifetime mechanism, you don't need to embed struct device in struct
memory_tier; you can free a "separate" struct memory_tier in the
"release" callback of struct device.

> > For life cycle management, I think that we can do that without sysfs
> > too.
> >
>
> unless there are specific details that you think will be broken by
> embedding struct device inside struct memory_tier, IMHO I still consider
> the embedded implementation much simpler and in accordance with other
> kernel design patterns.

In concept, struct memory_tier isn't a device, although we expose it as
a device in sysfs. That's just an implementation detail. So I think
it's better to make struct memory_tier independent of struct device if
possible.

By not embedding struct device in struct memory_tier, it's much easier
to dereference struct memory_tier directly in inline functions in ".h"
files. We don't need to introduce one accessor function for each field
of struct memory_tier for that.

Best Regards,
Huang, Ying

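The decoupling suggested above can be illustrated with a small
userspace mock. It is only a sketch under stated assumptions: struct
mock_device and mock_put_device() stand in for the kernel's struct
device and put_device(), struct memory_tier_dev is the placeholder name
mentioned in the discussion, and the unsigned long nodelist replaces
nodemask_t. The point is that the core struct memory_tier carries no
device-core types, while the "release" callback of the wrapper still
frees it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for struct device: a refcount plus a release hook. */
struct mock_device {
	int refcount;
	void (*release)(struct mock_device *dev);
};

/* Core object: usable from inline helpers without the device core. */
struct memory_tier {
	int id;
	int rank;
	unsigned long nodelist;	/* one bit per node (mock nodemask_t) */
};

/* sysfs-facing wrapper; it points at the tier instead of embedding it. */
struct memory_tier_dev {
	struct mock_device dev;
	struct memory_tier *memtier;
};

#define mock_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static int tiers_freed;	/* counts release calls, for demonstration only */

/* Release callback: frees the separately allocated tier and the wrapper. */
static void memory_tier_dev_release(struct mock_device *dev)
{
	struct memory_tier_dev *tdev =
		mock_container_of(dev, struct memory_tier_dev, dev);

	free(tdev->memtier);
	free(tdev);
	tiers_freed++;
}

static struct memory_tier_dev *memory_tier_dev_create(int id, int rank)
{
	struct memory_tier_dev *tdev = calloc(1, sizeof(*tdev));

	if (!tdev)
		return NULL;
	tdev->memtier = calloc(1, sizeof(*tdev->memtier));
	if (!tdev->memtier) {
		free(tdev);
		return NULL;
	}
	tdev->memtier->id = id;
	tdev->memtier->rank = rank;
	tdev->dev.refcount = 1;
	tdev->dev.release = memory_tier_dev_release;
	return tdev;
}

/* Drop one reference; the last put triggers the release callback. */
static void mock_put_device(struct mock_device *dev)
{
	assert(dev != NULL && dev->refcount > 0);
	if (--dev->refcount == 0)
		dev->release(dev);
}

/* Direct field access works because struct memory_tier is fully visible. */
static inline int memory_tier_rank(const struct memory_tier *memtier)
{
	return memtier->rank;
}
```

In this arrangement the lifetime is still driven by the device
reference count, but code that only needs the rank or the node list
never has to include the device definitions.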

2022-06-08 08:42:10

by Huang, Ying

Subject: Re: [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers

On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:

[snip]

>
> +static int __init memory_tier_init(void)
> +{
> + int ret;
> +
> + ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups);
> + if (ret)
> + panic("%s() failed to register subsystem: %d\n", __func__, ret);

I don't think we should panic on failing to register the subsys and
device for memory tiers. Just pr_err() should be enough.

Best Regards,
Huang, Ying

> +
> + /*
> + * Register only default memory tier to hide all empty
> + * memory tier from sysfs.
> + */
> + ret = register_memory_tier(DEFAULT_MEMORY_TIER);
> + if (ret)
> + panic("%s() failed to register memory tier: %d\n", __func__, ret);
> +
> + /*
> + * CPU only nodes are not part of memoty tiers.
> + */
> + memory_tiers[DEFAULT_MEMORY_TIER]->nodelist = node_states[N_MEMORY];
> +
> + return 0;
> +}
> +subsys_initcall(memory_tier_init);
> +
> +#endif /* CONFIG_TIERED_MEMORY */


2022-06-08 08:46:03

by Huang, Ying

Subject: Re: [RFC PATCH v4 2/7] mm/demotion: Expose per node memory tier to sysfs

On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
> From: Jagdish Gediya <[email protected]>
>
> Add support to read/write the memory tierindex for a NUMA node.
>
> /sys/devices/system/node/nodeN/memtier
>
> where N = node id
>
> When read, It list the memory tier that the node belongs to.
>
> When written, the kernel moves the node into the specified
> memory tier, the tier assignment of all other nodes are not
> affected.
>
> If the memory tier does not exist, writing to the above file
> create the tier and assign the NUMA node to that tier.
>
> mutex memory_tier_lock is introduced to protect memory tier
> related chanegs as it can happen from sysfs as well on hot
> plug events.
>
> Signed-off-by: Jagdish Gediya <[email protected]>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> ---
>  drivers/base/node.c | 35 ++++++++++++++
>  include/linux/migrate.h | 4 +-
>  mm/migrate.c | 103 ++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 141 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index ec8bb24a5a22..cf4a58446d8c 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -20,6 +20,7 @@
>  #include <linux/pm_runtime.h>
>  #include <linux/swap.h>
>  #include <linux/slab.h>
> +#include <linux/migrate.h>
>  
>
>
>
>  static struct bus_type node_subsys = {
>   .name = "node",
> @@ -560,11 +561,45 @@ static ssize_t node_read_distance(struct device *dev,
>  }
>  static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);
>  
>
>
>
> +#ifdef CONFIG_TIERED_MEMORY
> +static ssize_t memtier_show(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + int node = dev->id;
> +
> + return sysfs_emit(buf, "%d\n", node_get_memory_tier(node));
> +}
> +
> +static ssize_t memtier_store(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf, size_t count)
> +{
> + unsigned long tier;
> + int node = dev->id;
> +
> + int ret = kstrtoul(buf, 10, &tier);
> + if (ret)
> + return ret;
> +
> + ret = node_reset_memory_tier(node, tier);
> + if (ret)
> + return ret;
> +
> + return count;
> +}
> +
> +static DEVICE_ATTR_RW(memtier);
> +#endif
> +
>  static struct attribute *node_dev_attrs[] = {
>   &dev_attr_meminfo.attr,
>   &dev_attr_numastat.attr,
>   &dev_attr_distance.attr,
>   &dev_attr_vmstat.attr,
> +#ifdef CONFIG_TIERED_MEMORY
> + &dev_attr_memtier.attr,
> +#endif
>   NULL
>  };
>  
>
>
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 0ec653623565..d37d1d5dee82 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -177,13 +177,15 @@ enum memory_tier_type {
>  };
>  
>
>
>
>  int next_demotion_node(int node);
> -
>  extern void migrate_on_reclaim_init(void);
>  #ifdef CONFIG_HOTPLUG_CPU
>  extern void set_migration_target_nodes(void);
>  #else
>  static inline void set_migration_target_nodes(void) {}
>  #endif
> +int node_get_memory_tier(int node);
> +int node_set_memory_tier(int node, int tier);
> +int node_reset_memory_tier(int node, int tier);
>  #else
>  #define numa_demotion_enabled false
>  static inline int next_demotion_node(int node)
> diff --git a/mm/migrate.c b/mm/migrate.c
> index f28ee93fb017..304559ba3372 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2132,6 +2132,7 @@ static struct bus_type memory_tier_subsys = {
>   .dev_name = "memtier",
>  };
>  
>
>
>
> +DEFINE_MUTEX(memory_tier_lock);
>  static struct memory_tier *memory_tiers[MAX_MEMORY_TIERS];
>  
>
>
>
>  static ssize_t nodelist_show(struct device *dev,
> @@ -2225,6 +2226,108 @@ static const struct attribute_group *memory_tier_attr_groups[] = {
>   NULL,
>  };
>  
>
>
>
> +static int __node_get_memory_tier(int node)
> +{
> + int tier;
> +
> + for (tier = 0; tier < MAX_MEMORY_TIERS; tier++) {
> + if (memory_tiers[tier] && node_isset(node, memory_tiers[tier]->nodelist))
> + return tier;
> + }
> +
> + return -1;
> +}
> +
> +int node_get_memory_tier(int node)
> +{
> + int tier;
> +
> + /*
> + * Make sure memory tier is not unregistered
> + * while it is being read.
> + */
> + mutex_lock(&memory_tier_lock);
> +
> + tier = __node_get_memory_tier(node);
> +
> + mutex_unlock(&memory_tier_lock);
> +
> + return tier;
> +}
> +
> +int __node_set_memory_tier(int node, int tier)
> +{
> + int ret = 0;
> + /*
> + * As register_memory_tier() for new tier can fail,
> + * try it before modifying existing tier. register
> + * tier makes tier visible in sysfs.
> + */
> + if (!memory_tiers[tier]) {
> + ret = register_memory_tier(tier);
> + if (ret) {
> + goto out;
> + }
> + }
> +
> + node_set(node, memory_tiers[tier]->nodelist);
> +
> +out:
> + return ret;
> +}
> +
> +int node_reset_memory_tier(int node, int tier)

I think "reset" isn't a good name here. Maybe something like "change"
or "move"?

Best Regards,
Huang, Ying

> +{
> + int current_tier, ret = 0;
> +
> + mutex_lock(&memory_tier_lock);
> +
> + current_tier = __node_get_memory_tier(node);
> + if (current_tier == tier)
> + goto out;
> +
> + if (current_tier != -1 )
> + node_clear(node, memory_tiers[current_tier]->nodelist);
> +
> + ret = __node_set_memory_tier(node, tier);
> +
> + if (!ret) {
> + if (nodes_empty(memory_tiers[current_tier]->nodelist))
> + unregister_memory_tier(current_tier);
> + } else {
> + /* reset it back to older tier */
> + ret = __node_set_memory_tier(node, current_tier);
> + }
> +out:
> + mutex_unlock(&memory_tier_lock);
> +
> + return ret;
> +}
> +
> +int node_set_memory_tier(int node, int tier)
> +{
> + int current_tier, ret = 0;
> +
> + if (tier >= MAX_MEMORY_TIERS)
> + return -EINVAL;
> +
> + mutex_lock(&memory_tier_lock);
> + current_tier = __node_get_memory_tier(node);
> + /*
> + * if node is already part of the tier proceed with the
> + * current tier value, because we might want to establish
> + * new migration paths now. The node might be added to a tier
> + * before it was made part of N_MEMORY, hence estabilish_migration_targets
> + * will have skipped this node.
> + */
> + if (current_tier != -1)
> + tier = current_tier;
> + ret = __node_set_memory_tier(node, tier);
> + mutex_unlock(&memory_tier_lock);
> +
> + return ret;
> +}
> +
>  /*
>   * node_demotion[] example:
>   *


2022-06-08 08:53:17

by Aneesh Kumar K.V

Subject: Re: RFC: Memory Tiering Kernel Interfaces (v3)

On 6/8/22 12:55 AM, Tim Chen wrote:
> On Mon, 2022-05-30 at 13:50 +0100, Jonathan Cameron wrote:
>>
>>> When discussed offline, Tim Chen pointed out that with the proposed
>>> interface, it's unconvenient to know the position of a given memory tier
>>> in all memory tiers. We must sort "rank" of all memory tiers to know
>>> that. "possible" file can be used for that. Although "possible" file
>>> can be generated with a shell script, it's more convenient to show it
>>> directly.
>>>
>>> Another way to address the issue is to add memtierN/pos for each memory
>>> tier as suggested by Tim. It's readonly and will show position of
>>> "memtierN" in all memory tiers. It's even better to show the relative
>>> postion to the default memory tier (DRAM with CPU). That is, the
>>> position of DRAM memory tier is 0.
>>>
>>> Unlike memory tier device ID or rank, the position is relative and
>>> dynamic.
>>
>> Hi,
>>
>> I'm unconvinced. This is better done with a shell script than
>> by adding ABI we'll have to live with for ever..
>>
>> I'm no good at shell scripting but this does the job
>> grep "" tier*/rank | sort -n -k 2 -t :
>>
>> tier2/rank:50
>> tier0/rank:100
>> tier1/rank:200
>> tier3/rank:240
>>
>> I'm sure someone more knowledgeable will do it in a simpler fashion still.
>>
>>
>
> You can argue that
>
> $ cat /sys/devices/system/cpu/cpu1/topology/core_siblings
> f
> $ cat /sys/devices/system/cpu/cpu1/topology/core_siblings_list
> 0-3
>
> provide exactly the same information and we should get rid of
> core_siblings_list. I think core_siblings_list exists to make
> it easier for a human, so he/she doesn't have to parse the mask,
> or write a script to find out the ids of CPUs who are siblings.
>
> I think in the same spirit, having an interface to allow a
> human to quickly see the hierachical relationship of tiers
> relative to each other is helpful.
>

We can add that later if we find applications requiring this. I kind of
have the feeling that we are adding too much based on possible ways
memory tiers could be used in the future. For now we can look at doing
the bare minimum to address the current constraints and drive more
user-visible changes later based on application requirements.

-aneesh

2022-06-08 09:39:30

by Aneesh Kumar K.V

Subject: Re: [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers

On 6/8/22 12:46 PM, Ying Huang wrote:
> On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
>
> [snip]
>
>>
>> +static int __init memory_tier_init(void)
>> +{
>> + int ret;
>> +
>> + ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups);
>> + if (ret)
>> + panic("%s() failed to register subsystem: %d\n", __func__, ret);
>
> I don't think we should go panic for failing to register subsys and
> device for memory tiers. Just pr_err() should be enough.
>

So you are suggesting we continue to work with memory tiers with no
userspace interface?

>> +
>> + /*
>> + * Register only default memory tier to hide all empty
>> + * memory tier from sysfs.
>> + */
>> + ret = register_memory_tier(DEFAULT_MEMORY_TIER);
>> + if (ret)
>> + panic("%s() failed to register memory tier: %d\n", __func__, ret);
>> +
>> + /*
>> + * CPU only nodes are not part of memoty tiers.
>> + */
>> + memory_tiers[DEFAULT_MEMORY_TIER]->nodelist = node_states[N_MEMORY];
>> +
>> + return 0;
>> +}
>> +subsys_initcall(memory_tier_init);
>> +
>> +#endif /* CONFIG_TIERED_MEMORY */
>
>

2022-06-08 09:40:02

by Aneesh Kumar K.V

Subject: Re: [RFC PATCH v4 2/7] mm/demotion: Expose per node memory tier to sysfs

On 6/8/22 12:48 PM, Ying Huang wrote:
> On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
>> From: Jagdish Gediya <[email protected]>
>>
>> Add support to read/write the memory tierindex for a NUMA node.
>>
>> /sys/devices/system/node/nodeN/memtier
>>
>> where N = node id
>>
>> When read, It list the memory tier that the node belongs to.
>>
>> When written, the kernel moves the node into the specified
>> memory tier, the tier assignment of all other nodes are not
>> affected.
>>
>> If the memory tier does not exist, writing to the above file
>> create the tier and assign the NUMA node to that tier.
>>
>> mutex memory_tier_lock is introduced to protect memory tier
>> related chanegs as it can happen from sysfs as well on hot
>> plug events.
>>
>> Signed-off-by: Jagdish Gediya <[email protected]>
>> Signed-off-by: Aneesh Kumar K.V <[email protected]>
>> ---
>>  drivers/base/node.c | 35 ++++++++++++++
>>  include/linux/migrate.h | 4 +-
>>  mm/migrate.c | 103 ++++++++++++++++++++++++++++++++++++++++
>>  3 files changed, 141 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> index ec8bb24a5a22..cf4a58446d8c 100644
>> --- a/drivers/base/node.c
>> +++ b/drivers/base/node.c
>> @@ -20,6 +20,7 @@
>>  #include <linux/pm_runtime.h>
>>  #include <linux/swap.h>
>>  #include <linux/slab.h>
>> +#include <linux/migrate.h>
>>
>>
>>
>>
>>  static struct bus_type node_subsys = {
>>   .name = "node",
>> @@ -560,11 +561,45 @@ static ssize_t node_read_distance(struct device *dev,
>>  }
>>  static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);
>>
>>
>>
>>
>> +#ifdef CONFIG_TIERED_MEMORY
>> +static ssize_t memtier_show(struct device *dev,
>> + struct device_attribute *attr,
>> + char *buf)
>> +{
>> + int node = dev->id;
>> +
>> + return sysfs_emit(buf, "%d\n", node_get_memory_tier(node));
>> +}
>> +
>> +static ssize_t memtier_store(struct device *dev,
>> + struct device_attribute *attr,
>> + const char *buf, size_t count)
>> +{
>> + unsigned long tier;
>> + int node = dev->id;
>> +
>> + int ret = kstrtoul(buf, 10, &tier);
>> + if (ret)
>> + return ret;
>> +
>> + ret = node_reset_memory_tier(node, tier);
>> + if (ret)
>> + return ret;
>> +
>> + return count;
>> +}
>> +
>> +static DEVICE_ATTR_RW(memtier);
>> +#endif
>> +
>>  static struct attribute *node_dev_attrs[] = {
>>   &dev_attr_meminfo.attr,
>>   &dev_attr_numastat.attr,
>>   &dev_attr_distance.attr,
>>   &dev_attr_vmstat.attr,
>> +#ifdef CONFIG_TIERED_MEMORY
>> + &dev_attr_memtier.attr,
>> +#endif
>>   NULL
>>  };
>>
>>
>>
>>
>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
>> index 0ec653623565..d37d1d5dee82 100644
>> --- a/include/linux/migrate.h
>> +++ b/include/linux/migrate.h
>> @@ -177,13 +177,15 @@ enum memory_tier_type {
>>  };
>>
>>
>>
>>
>>  int next_demotion_node(int node);
>> -
>>  extern void migrate_on_reclaim_init(void);
>>  #ifdef CONFIG_HOTPLUG_CPU
>>  extern void set_migration_target_nodes(void);
>>  #else
>>  static inline void set_migration_target_nodes(void) {}
>>  #endif
>> +int node_get_memory_tier(int node);
>> +int node_set_memory_tier(int node, int tier);
>> +int node_reset_memory_tier(int node, int tier);
>>  #else
>>  #define numa_demotion_enabled false
>>  static inline int next_demotion_node(int node)
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index f28ee93fb017..304559ba3372 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -2132,6 +2132,7 @@ static struct bus_type memory_tier_subsys = {
>>   .dev_name = "memtier",
>>  };
>>
>>
>>
>>
>> +DEFINE_MUTEX(memory_tier_lock);
>>  static struct memory_tier *memory_tiers[MAX_MEMORY_TIERS];
>>
>>
>>
>>
>>  static ssize_t nodelist_show(struct device *dev,
>> @@ -2225,6 +2226,108 @@ static const struct attribute_group *memory_tier_attr_groups[] = {
>>   NULL,
>>  };
>>
>>
>>
>>
>> +static int __node_get_memory_tier(int node)
>> +{
>> + int tier;
>> +
>> + for (tier = 0; tier < MAX_MEMORY_TIERS; tier++) {
>> + if (memory_tiers[tier] && node_isset(node, memory_tiers[tier]->nodelist))
>> + return tier;
>> + }
>> +
>> + return -1;
>> +}
>> +
>> +int node_get_memory_tier(int node)
>> +{
>> + int tier;
>> +
>> + /*
>> + * Make sure memory tier is not unregistered
>> + * while it is being read.
>> + */
>> + mutex_lock(&memory_tier_lock);
>> +
>> + tier = __node_get_memory_tier(node);
>> +
>> + mutex_unlock(&memory_tier_lock);
>> +
>> + return tier;
>> +}
>> +
>> +int __node_set_memory_tier(int node, int tier)
>> +{
>> + int ret = 0;
>> + /*
>> + * As register_memory_tier() for new tier can fail,
>> + * try it before modifying existing tier. register
>> + * tier makes tier visible in sysfs.
>> + */
>> + if (!memory_tiers[tier]) {
>> + ret = register_memory_tier(tier);
>> + if (ret) {
>> + goto out;
>> + }
>> + }
>> +
>> + node_set(node, memory_tiers[tier]->nodelist);
>> +
>> +out:
>> + return ret;
>> +}
>> +
>> +int node_reset_memory_tier(int node, int tier)
>
> I think "reset" isn't a good name here. Maybe something like "change"
> or "move"?
>

How about node_update_memory_tier()?

-aneesh

2022-06-08 09:40:23

by Huang, Ying

Subject: Re: [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers

On Wed, 2022-06-08 at 13:54 +0530, Aneesh Kumar K V wrote:
> On 6/8/22 12:46 PM, Ying Huang wrote:
> > On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
> >
> > [snip]
> >
> > >
> > > +static int __init memory_tier_init(void)
> > > +{
> > > + int ret;
> > > +
> > > + ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups);
> > > + if (ret)
> > > + panic("%s() failed to register subsystem: %d\n", __func__, ret);
> >
> > I don't think we should go panic for failing to register subsys and
> > device for memory tiers. Just pr_err() should be enough.
> >
>
> So you are suggesting we continue to work with memory tiers with no
> userspace interface?

Yes. We don't need to panic the system for this.

Best Regards,
Huang, Ying

> > > +
> > > + /*
> > > + * Register only default memory tier to hide all empty
> > > + * memory tier from sysfs.
> > > + */
> > > + ret = register_memory_tier(DEFAULT_MEMORY_TIER);
> > > + if (ret)
> > > + panic("%s() failed to register memory tier: %d\n", __func__, ret);
> > > +
> > > + /*
> > > + * CPU only nodes are not part of memoty tiers.
> > > + */
> > > + memory_tiers[DEFAULT_MEMORY_TIER]->nodelist = node_states[N_MEMORY];
> > > +
> > > + return 0;
> > > +}
> > > +subsys_initcall(memory_tier_init);
> > > +
> > > +#endif /* CONFIG_TIERED_MEMORY */
> >
> >
>
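
The review point in this message (report the failure and carry on, do not panic) can be sketched in a small userspace model. Everything here is an assumption for illustration: the function-pointer parameters stand in for subsys_system_register() and register_memory_tier(), fprintf(stderr, ...) stands in for pr_err(), and the sysfs_available flag is a hypothetical way to record that the userspace interface did not come up.

```c
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical stand-ins for subsys_system_register() and
 * register_memory_tier(); each returns 0 or a negative errno.
 */
typedef int (*init_step_fn)(void);

static int sysfs_available;	/* set to 1 once the interface is registered */

static int model_memory_tier_init(init_step_fn register_subsys,
				  init_step_fn register_default_tier)
{
	int ret;

	ret = register_subsys();
	if (ret) {
		/* Kernel code would call pr_err() here instead of panic(). */
		fprintf(stderr, "%s() failed to register subsystem: %d\n",
			__func__, ret);
		return ret;
	}

	ret = register_default_tier();
	if (ret) {
		fprintf(stderr, "%s() failed to register memory tier: %d\n",
			__func__, ret);
		return ret;
	}

	sysfs_available = 1;
	return 0;
}

/* Test doubles: one step that succeeds, one that fails with -ENOMEM. */
static int step_ok(void) { return 0; }
static int step_enomem(void) { return -ENOMEM; }
```

An initcall that returns an error does not stop the boot, so the caller keeps running without the memory tier sysfs files. How much of the tiering state can still be set up in that case depends on the rest of the patch and is not modeled here.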


2022-06-08 09:40:33

by Huang, Ying

Subject: Re: [RFC PATCH v4 2/7] mm/demotion: Expose per node memory tier to sysfs

On Wed, 2022-06-08 at 13:55 +0530, Aneesh Kumar K V wrote:
> On 6/8/22 12:48 PM, Ying Huang wrote:
> > On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
> > > From: Jagdish Gediya <[email protected]>
> > >
> > > > Add support to read/write the memory tier index for a NUMA node.
> > >
> > > /sys/devices/system/node/nodeN/memtier
> > >
> > > where N = node id
> > >
> > > > When read, it lists the memory tier that the node belongs to.
> > > >
> > > > When written, the kernel moves the node into the specified
> > > > memory tier; the tier assignments of all other nodes are not
> > > > affected.
> > > >
> > > > If the memory tier does not exist, writing to the above file
> > > > creates the tier and assigns the NUMA node to that tier.
> > > >
> > > > The mutex memory_tier_lock is introduced to protect memory tier
> > > > related changes, as they can happen from sysfs as well as on
> > > > hotplug events.
> > >
> > > Signed-off-by: Jagdish Gediya <[email protected]>
> > > Signed-off-by: Aneesh Kumar K.V <[email protected]>
> > > ---
> > >   drivers/base/node.c | 35 ++++++++++++++
> > >   include/linux/migrate.h | 4 +-
> > >   mm/migrate.c | 103 ++++++++++++++++++++++++++++++++++++++++
> > >   3 files changed, 141 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/base/node.c b/drivers/base/node.c
> > > index ec8bb24a5a22..cf4a58446d8c 100644
> > > --- a/drivers/base/node.c
> > > +++ b/drivers/base/node.c
> > > @@ -20,6 +20,7 @@
> > >   #include <linux/pm_runtime.h>
> > >   #include <linux/swap.h>
> > >   #include <linux/slab.h>
> > > +#include <linux/migrate.h>
> > >   
> > >   static struct bus_type node_subsys = {
> > >    .name = "node",
> > > @@ -560,11 +561,45 @@ static ssize_t node_read_distance(struct device *dev,
> > >   }
> > >   static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);
> > >   
> > > +#ifdef CONFIG_TIERED_MEMORY
> > > +static ssize_t memtier_show(struct device *dev,
> > > + struct device_attribute *attr,
> > > + char *buf)
> > > +{
> > > + int node = dev->id;
> > > +
> > > + return sysfs_emit(buf, "%d\n", node_get_memory_tier(node));
> > > +}
> > > +
> > > +static ssize_t memtier_store(struct device *dev,
> > > + struct device_attribute *attr,
> > > + const char *buf, size_t count)
> > > +{
> > > + unsigned long tier;
> > > + int node = dev->id;
> > > +
> > > + int ret = kstrtoul(buf, 10, &tier);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + ret = node_reset_memory_tier(node, tier);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + return count;
> > > +}
> > > +
> > > +static DEVICE_ATTR_RW(memtier);
> > > +#endif
> > > +
> > >   static struct attribute *node_dev_attrs[] = {
> > >    &dev_attr_meminfo.attr,
> > >    &dev_attr_numastat.attr,
> > >    &dev_attr_distance.attr,
> > >    &dev_attr_vmstat.attr,
> > > +#ifdef CONFIG_TIERED_MEMORY
> > > + &dev_attr_memtier.attr,
> > > +#endif
> > >    NULL
> > >   };
> > >   
> > > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > > index 0ec653623565..d37d1d5dee82 100644
> > > --- a/include/linux/migrate.h
> > > +++ b/include/linux/migrate.h
> > > @@ -177,13 +177,15 @@ enum memory_tier_type {
> > >   };
> > >   
> > >   int next_demotion_node(int node);
> > > -
> > >   extern void migrate_on_reclaim_init(void);
> > >   #ifdef CONFIG_HOTPLUG_CPU
> > >   extern void set_migration_target_nodes(void);
> > >   #else
> > >   static inline void set_migration_target_nodes(void) {}
> > >   #endif
> > > +int node_get_memory_tier(int node);
> > > +int node_set_memory_tier(int node, int tier);
> > > +int node_reset_memory_tier(int node, int tier);
> > >   #else
> > >   #define numa_demotion_enabled false
> > >   static inline int next_demotion_node(int node)
> > > diff --git a/mm/migrate.c b/mm/migrate.c
> > > index f28ee93fb017..304559ba3372 100644
> > > --- a/mm/migrate.c
> > > +++ b/mm/migrate.c
> > > @@ -2132,6 +2132,7 @@ static struct bus_type memory_tier_subsys = {
> > >    .dev_name = "memtier",
> > >   };
> > >   
> > > +DEFINE_MUTEX(memory_tier_lock);
> > >   static struct memory_tier *memory_tiers[MAX_MEMORY_TIERS];
> > >   
> > >   static ssize_t nodelist_show(struct device *dev,
> > > @@ -2225,6 +2226,108 @@ static const struct attribute_group *memory_tier_attr_groups[] = {
> > >    NULL,
> > >   };
> > >   
> > > +static int __node_get_memory_tier(int node)
> > > +{
> > > + int tier;
> > > +
> > > + for (tier = 0; tier < MAX_MEMORY_TIERS; tier++) {
> > > + if (memory_tiers[tier] && node_isset(node, memory_tiers[tier]->nodelist))
> > > + return tier;
> > > + }
> > > +
> > > + return -1;
> > > +}
> > > +
> > > +int node_get_memory_tier(int node)
> > > +{
> > > + int tier;
> > > +
> > > + /*
> > > + * Make sure memory tier is not unregistered
> > > + * while it is being read.
> > > + */
> > > + mutex_lock(&memory_tier_lock);
> > > +
> > > + tier = __node_get_memory_tier(node);
> > > +
> > > + mutex_unlock(&memory_tier_lock);
> > > +
> > > + return tier;
> > > +}
> > > +
> > > +int __node_set_memory_tier(int node, int tier)
> > > +{
> > > + int ret = 0;
> > > + /*
> > > + * As register_memory_tier() for new tier can fail,
> > > + * try it before modifying existing tier. register
> > > + * tier makes tier visible in sysfs.
> > > + */
> > > + if (!memory_tiers[tier]) {
> > > + ret = register_memory_tier(tier);
> > > + if (ret) {
> > > + goto out;
> > > + }
> > > + }
> > > +
> > > + node_set(node, memory_tiers[tier]->nodelist);
> > > +
> > > +out:
> > > + return ret;
> > > +}
> > > +
> > > +int node_reset_memory_tier(int node, int tier)
> >
> > I think "reset" isn't a good name here. Maybe something like "change"
> > or "move"?
> >
>
> how about node_update_memory_tier()?

That sounds OK to me.

Best Regards,
Huang, Ying
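
For reference, the per-node file proposed in this patch, /sys/devices/system/node/nodeN/memtier, could be exercised from userspace roughly as follows. This is a sketch against the interface as described in the RFC commit message only: the helper names are hypothetical, and the file will not exist on a kernel without this patch, in which case the read and write helpers simply report an error.

```c
#include <stdio.h>

/* Build "/sys/devices/system/node/node<N>/memtier"; 0 on success, -1 if it does not fit. */
static int memtier_path(int node, char *buf, size_t len)
{
	int n = snprintf(buf, len,
			 "/sys/devices/system/node/node%d/memtier", node);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read the tier index of a node into *tier; 0 on success, -1 on error. */
static int memtier_read(int node, int *tier)
{
	char path[64];
	FILE *f;
	int ok;

	if (memtier_path(node, path, sizeof(path)))
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;

	/* memtier_show() emits the tier as "%d\n". */
	ok = (fscanf(f, "%d", tier) == 1);
	fclose(f);

	return ok ? 0 : -1;
}

/* Ask the kernel to move a node into the given tier; 0 on success, -1 on error. */
static int memtier_write(int node, int tier)
{
	char path[64];
	FILE *f;
	int ok;

	if (memtier_path(node, path, sizeof(path)))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	/* memtier_store() parses a base-10 number with kstrtoul(). */
	ok = (fprintf(f, "%d\n", tier) > 0);
	if (fclose(f) != 0)
		ok = 0;	/* the write error may only surface on flush */

	return ok ? 0 : -1;
}
```

Checking the result of fclose() matters for the write path because stdio buffers the data, so an error returned by the store handler is typically seen only when the buffer is flushed.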