This patchset implements weighted interleave and adds a new sysfs
entry: /sys/devices/system/node/nodeN/accessM/il_weight.
The il_weight of a node is used by mempolicy to implement weighted
interleave when `numactl --interleave=...` is invoked. By default
il_weight for a node is always 1, which preserves the default round
robin interleave behavior.
Interleave weights may be set from 1-100, and denote the number of
pages that should be allocated from the node when interleaving
occurs.
For example, if a node's interleave weight is set to 5, 5 pages
will be allocated from that node before the next node is scheduled
for allocations.
Additionally, "node accessors" (synonymous with cpu nodes) are used
to allow for accessor-relative weighting. The "accessor" for a task
is defined as the node the task is presently running on.
# Set node weight for node0 accessed by tasks on node0 to 5
echo 5 > /sys/devices/system/node/node0/access0/il_weight
# Set node weight for node0 accessed by tasks on node1 to 3
echo 3 > /sys/devices/system/node/node0/access1/il_weight
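The same writes can be done from a program. The following is a minimal
userspace sketch, not part of this patchset: il_weight_path() and
set_il_weight() are hypothetical helper names, and the only thing taken
from the series is the nodeN/accessM/il_weight layout. The sysfs root is
passed in as a parameter so the helper can be pointed at a scratch
directory when experimenting.

```c
#include <assert.h>
#include <stdio.h>

/* Sysfs root for NUMA nodes; il_weight below it is the file proposed here. */
#define NODE_SYSFS_ROOT "/sys/devices/system/node"

/*
 * Hypothetical helper: build the il_weight path for memory node `node`
 * as seen by tasks running on node `accessor`. Returns what snprintf()
 * returns, so callers can detect truncation.
 */
static int il_weight_path(char *buf, size_t len, const char *root,
			  int node, int accessor)
{
	assert(node >= 0 && accessor >= 0);
	return snprintf(buf, len, "%s/node%d/access%d/il_weight",
			root, node, accessor);
}

/*
 * Hypothetical helper: write `weight` to the il_weight file.
 * Returns 0 on success and -1 on any failure (path too long, file
 * missing, write rejected).
 */
static int set_il_weight(const char *root, int node, int accessor,
			 unsigned int weight)
{
	char path[256];
	FILE *f;
	int n, ret;

	n = il_weight_path(path, sizeof(path), root, node, accessor);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	ret = fprintf(f, "%u\n", weight) < 0 ? -1 : 0;
	/* Buffered write errors only surface when the stream is closed. */
	if (fclose(f) != 0)
		ret = -1;
	return ret;
}
```

Calling set_il_weight(NODE_SYSFS_ROOT, 0, 1, 3) would correspond to the
second echo above; it simply fails with -1 on a kernel that does not
carry this series.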
In this way it becomes possible to set an interleaving strategy
that fits the available bandwidth for the devices available on
the system. An example system:
Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket)
Node 1 - CPU+DRAM, 400GB/s BW (200 cross socket)
Node 2 - CXL Memory. 64GB/s BW, on Node 0 root complex
Node 3 - CXL Memory. 64GB/s BW, on Node 1 root complex
In this setup, the effective weights for nodes 0-3 for a task
running on Node 0 may be [60, 20, 10, 10].
This spreads memory out across devices which all have different
latency and bandwidth attributes in a way that can maximize the
available resources.
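One possible way to arrive at such weights is to make each node's
weight proportional to the bandwidth a task on the accessor node can get
from it. The sketch below is an illustration under assumptions, not the
method used in this series: bw_to_weights() is a hypothetical helper,
the clamp to 1..100 is assumed from the il_weight range, and Node 3 is
given its listed 64GB/s even though it sits behind the other socket.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed il_weight bounds; treat these as placeholders. */
#define IL_WEIGHT_MIN 1
#define IL_WEIGHT_MAX 100

/*
 * Hypothetical helper: convert per-node bandwidth (any consistent unit)
 * into interleave weights expressed as a rounded percentage of the
 * total bandwidth, clamped to the assumed valid range.
 */
static void bw_to_weights(const unsigned int *bw, unsigned int *weights,
			  size_t nr_nodes)
{
	unsigned long total = 0;
	size_t i;

	assert(nr_nodes > 0);
	for (i = 0; i < nr_nodes; i++)
		total += bw[i];
	assert(total > 0);

	for (i = 0; i < nr_nodes; i++) {
		/* Round to nearest instead of truncating. */
		unsigned long w = (bw[i] * 100UL + total / 2) / total;

		if (w < IL_WEIGHT_MIN)
			w = IL_WEIGHT_MIN;
		if (w > IL_WEIGHT_MAX)
			w = IL_WEIGHT_MAX;
		weights[i] = (unsigned int)w;
	}
}
```

For a task on Node 0 with bandwidths {400, 200, 64, 64}, this yields
[55, 27, 9, 9]. That is close to, but not the same as, the
[60, 20, 10, 10] quoted above, which is only an example setting; real
values would also need to account for latency and measured cross-socket
CXL bandwidth.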
~Gregory
================================================================
Version Notes:
v3: move weights into node rather than memtiers
some additional fixes to node.c to support this
v1/v2: add weighted-interleave support to mempolicy
= v3 notes
This update effectively removes the connection between mempolicy
and memory-tiers by simply placing the interleave weights directly
in the node accessor information structure.
Node was recommended by Huang, Ying <[email protected]>
Accessor was recommended by Ravi Shankar <[email protected]>
== Move weights into node
Originally this work was done by placing weights in the memory tier.
In this patch set we changed the weights to live in the numa node
accessor structure, which allows for a more natural weighting scheme
and also supports source-node relative weighting.
Interleave weight is located in:
/sys/devices/system/node/nodeN/accessM/il_weight
and is set with a value between 1 and 100:
# Set node weight for node0 accessed by node0 to 5
echo 5 > /sys/devices/system/node/node0/access0/il_weight
By default, il_weight is always set to 1, which mimics the default
interleave behavior (simple round-robin).
== Other Node fixes
Two other updates to node.c were required to support this:
1) The access list must be initialized prior to the node struct
pointer being registered in the node array
2) The accessors in the list must be registered regardless of
whether HMAT/HMEM information is reported. Presently this
results in 0-value information being present in the various
access subgroups
== Weighted interleave
mm/mempolicy: modify interleave mempolicy to use node weights
The node subsystem implements interleave weighting for the purpose
of bandwidth optimization. Each node may have different weights in
relation to each compute node ("access node").
The mempolicy MPOL_INTERLEAVE utilizes the node weights to implement
weighted interleave. By default, since all nodes default to a weight
of 1, the original interleave behavior is retained.
Examples
Weight settings:
echo 4 > node0/access0/il_weight
echo 3 > node1/access0/il_weight
echo 2 > node1/access1/il_weight
echo 1 > node0/access1/il_weight
Results:
Task A:
cpunode: 0
nodemask: [0,1]
weights: [4,3]
allocation result: [0,0,0,0,1,1,1 repeat]
Task B:
cpunode: 1
nodemask: [0,1]
weights: [1,2]
allocation result: [0,1,1 repeat]
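The allocation order in these examples can be reproduced with a small
userspace model. This is a sketch of the weighted round-robin idea only,
not the code in mm/mempolicy.c: struct il_state and il_next_node() are
hypothetical names, the nodemask is modeled as a plain array, and every
weight is assumed to be at least 1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-task interleave cursor. */
struct il_state {
	size_t cur;		/* index into the nodes[] array */
	unsigned int used;	/* pages already taken from nodes[cur] */
};

/*
 * Return the node for the next page. A node keeps being selected until
 * `weights[cur]` pages have come from it, then the cursor moves to the
 * next node in the mask and wraps around at the end.
 */
static int il_next_node(struct il_state *st, const int *nodes,
			const unsigned int *weights, size_t nr_nodes)
{
	assert(nr_nodes > 0 && st->cur < nr_nodes);

	if (st->used >= weights[st->cur]) {
		st->cur = (st->cur + 1) % nr_nodes;
		st->used = 0;
	}
	/* This model does not handle zero weights. */
	assert(weights[st->cur] >= 1);

	st->used++;
	return nodes[st->cur];
}
```

Starting from a zeroed struct il_state, nodes {0, 1} with weights {4, 3}
gives 0,0,0,0,1,1,1 and then repeats (Task A), while weights {1, 2}
gives 0,1,1 and then repeats (Task B). With all weights set to 1 the
model degenerates to plain round-robin.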
=== original RFCs ====
Memory-tier based weights
By: Ravi Shankar <[email protected]>
https://lore.kernel.org/all/[email protected]/
Mempolicy multi-node weighting w/ set_mempolicy2:
By: Gregory Price <[email protected]>
https://lore.kernel.org/all/[email protected]/
Hasan Al Maruf: N:M weighting in mempolicy
https://lore.kernel.org/linux-mm/YqD0%[email protected]/T/
Huang, Ying's presentation in lpc22, 16th slide in
https://lpc.events/event/16/contributions/1209/attachments/1042/1995/\
Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf
Gregory Price (4):
base/node.c: initialize the accessor list before registering
node: add accessors to sysfs when nodes are created
node: add interleave weights to node accessor
mm/mempolicy: modify interleave mempolicy to use node weights
drivers/base/node.c | 120 ++++++++++++++++++++++++++++++++-
include/linux/mempolicy.h | 4 ++
include/linux/node.h | 17 +++++
mm/mempolicy.c | 138 +++++++++++++++++++++++++++++---------
4 files changed, 246 insertions(+), 33 deletions(-)
--
2.39.1
The current code registers the node as available in the node array
before initializing the accessor list. As a result, any code that
walks the accessor list during an allocation made in that window
performs an undefined memory access.
In one example, an extension to access hmat data during interleave
caused this undefined access as a result of a bulk allocation
that occurs during node initialization but before the accessor
list is initialized.
Initialize the accessor list before making the node generally
available to the global system.
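The ordering problem can be modeled in userspace. The sketch below is an
illustration only: it uses a stripped-down list_head, calloc() in place
of kzalloc(), and hypothetical helper names (init_list_head,
list_is_initialized, list_count, register_one_node) that merely mirror
the kernel ones. A zeroed list head has NULL next/prev pointers, so a
walker that reaches the node before initialization would dereference
NULL; initializing first and publishing second avoids that.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Minimal stand-in for the kernel's circular doubly linked list head. */
struct list_head {
	struct list_head *next, *prev;
};

/* An empty list points at itself in both directions. */
static void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* A zero-filled head (fresh from calloc) has not been initialized. */
static int list_is_initialized(const struct list_head *head)
{
	return head->next != NULL && head->prev != NULL;
}

/* Walk the list; only safe once the head has been initialized. */
static size_t list_count(const struct list_head *head)
{
	const struct list_head *pos;
	size_t n = 0;

	assert(list_is_initialized(head));
	for (pos = head->next; pos != head; pos = pos->next)
		n++;
	return n;
}

/* Hypothetical model of the node structure and the global node array. */
struct node {
	struct list_head access_list;
};

#define MAX_NODES 4
static struct node *node_devices[MAX_NODES];

/* Initialize the list first, then publish the node in the array. */
static int register_one_node(int nid)
{
	struct node *node;

	assert(nid >= 0 && nid < MAX_NODES);
	node = calloc(1, sizeof(*node));
	if (!node)
		return -1;

	init_list_head(&node->access_list);
	node_devices[nid] = node;	/* visible only after init */
	return 0;
}
```

With this ordering, anything that finds the node through node_devices[]
sees a valid empty list rather than NULL pointers.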
Signed-off-by: Gregory Price <[email protected]>
---
drivers/base/node.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 493d533f8375..4d588f4658c8 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -868,11 +868,15 @@ int __register_one_node(int nid)
{
int error;
int cpu;
+ struct node *node;
- node_devices[nid] = kzalloc(sizeof(struct node), GFP_KERNEL);
- if (!node_devices[nid])
+ node = kzalloc(sizeof(struct node), GFP_KERNEL);
+ if (!node)
return -ENOMEM;
+ INIT_LIST_HEAD(&node->access_list);
+ node_devices[nid] = node;
+
error = register_node(node_devices[nid], nid);
/* link cpu under this node */
@@ -881,7 +885,6 @@ int __register_one_node(int nid)
register_cpu_under_node(cpu, nid);
}
- INIT_LIST_HEAD(&node_devices[nid]->access_list);
node_init_caches(nid);
return error;
--
2.39.1