Weighted interleave is a new interleave policy intended to make
use of heterogeneous memory environments appearing with CXL.
The existing interleave mechanism does an even round-robin
distribution of memory across all nodes in a nodemask, while
weighted interleave distributes memory across nodes according
to a provided weight. (Weight = # of page allocations per round)
Weighted interleave is intended to reduce average latency when
bandwidth is pressured - therefore increasing total throughput.
In other words: It allows greater use of the total available
bandwidth in a heterogeneous hardware environment (different
hardware provides different bandwidth capacity).
As bandwidth is pressured, latency increases - first linearly
and then exponentially. By keeping bandwidth usage distributed
according to available bandwidth, we therefore can reduce the
average latency of a cacheline fetch.
A good explanation of the bandwidth vs latency response curve:
https://mahmoudhatem.wordpress.com/2017/11/07/memory-bandwidth-vs-latency-response-curve/
From the article:
```
Constant region:
The latency response is fairly constant for the first 40%
of the sustained bandwidth.
Linear region:
In between 40% to 80% of the sustained bandwidth, the
latency response increases almost linearly with the bandwidth
demand of the system due to contention overhead by numerous
memory requests.
Exponential region:
Between 80% to 100% of the sustained bandwidth, the memory
latency is dominated by the contention latency which can be
as much as twice the idle latency or more.
Maximum sustained bandwidth :
Is 65% to 75% of the theoretical maximum bandwidth.
```
As a general rule of thumb:
* If bandwidth usage is low, latency does not increase. It is
optimal to place data in the nearest (lowest latency) device.
* If bandwidth usage is high, latency increases. It is optimal
to place data such that bandwidth use is optimized per-device.
This is the top line goal: Provide a user a mechanism to target using
the "maximum sustained bandwidth" of each hardware component in a
heterogenous memory system.
For example, the stream benchmark demonstrates that 1:1 (default)
interleave is actively harmful, while weighted interleave can be
beneficial. Default interleave distributes data such that too much
pressure is placed on devices with lower available bandwidth.
Stream Benchmark (High level results, 1 Socket + 1 CXL Device)
Default interleave : -78% (slower than DRAM)
Global weighting : -6% to +4% (workload dependant)
Targeted weights : +2.5% to +4% (consistently better than DRAM)
Global means the task-policy was set (set_mempolicy), while targeted
means VMA policies were set (mbind2). We see weighted interleave
is not always beneficial when applied globally, but is always
beneficial when applied to bandwidth-driving memory regions.
We implement sysfs entries for "system global" weights which can be
set by a daemon or administrator.
There are 3 patches in this set:
1) Implement system-global interleave weights as sysfs extension
in mm/mempolicy.c. These weights are RCU protected, and a
default weight set is provided (all weights are 1 by default).
In future work, we intend to expose an interface for HMAT/CDAT
information to be used during boot to set reasonable system
default values based on the memory configuration of the system
discovered at boot or during device hotplug.
2) A mild refactor of some interleave-logic for re-use in the
new weighted interleave logic.
3) MPOL_WEIGHTED_INTERLEAVE extension for set_mempolicy/mbind
Included below are some performance and LTP test information,
and a sample numactl branch which can be used for testing.
= Performance summary =
(tests may have different configurations, see extended info below)
1) MLC (W2) : +38% over DRAM. +264% over default interleave.
MLC (W5) : +40% over DRAM. +226% over default interleave.
2) Stream : -6% to +4% over DRAM, +430% over default interleave.
3) XSBench : +19% over DRAM. +47% over default interleave.
= LTP Testing Summary =
existing mempolicy & mbind tests: pass
mempolicy & mbind + weighted interleave (global weights): pass
= version history
- RCU: This version protects the weight array with RCU.
- ktest fix: proper include (types.h) in uapi header
- doc: mempolicy.c comments in MPOL_WEIGHTED_INTERLEAVE patch
- Dropped task-local weights and syscalls from the proposal
until affirmative use cases for task-local weights appear.
Link: https://lore.kernel.org/linux-mm/[email protected]/
=====================================================================
Performance tests - MLC
From - Ravi Jonnalagadda <[email protected]>
Hardware: Single-socket, multiple CXL memory expanders.
Workload: W2
Data Signature: 2:1 read:write
DRAM only bandwidth (GBps): 298.8
DRAM + CXL (default interleave) (GBps): 113.04
DRAM + CXL (weighted interleave)(GBps): 412.5
Gain over DRAM only: 1.38x
Gain over default interleave: 2.64x
Workload: W5
Data Signature: 1:1 read:write
DRAM only bandwidth (GBps): 273.2
DRAM + CXL (default interleave) (GBps): 117.23
DRAM + CXL (weighted interleave)(GBps): 382.7
Gain over DRAM only: 1.4x
Gain over default interleave: 2.26x
=====================================================================
Performance test - Stream
From - Gregory Price <[email protected]>
Hardware: Single socket, single CXL expander
numactl extension: https://github.com/gmprice/numactl/tree/weighted_interleave_master
Summary: 64 threads, ~18GB workload, 3GB per array, executed 100 times
Default interleave : -78% (slower than DRAM)
Global weighting : -6% to +4% (workload dependant)
mbind2 weights : +2.5% to +4% (consistently better than DRAM)
dram only:
numactl --cpunodebind=1 --membind=1 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
Function Direction BestRateMBs AvgTime MinTime MaxTime
Copy: 0->0 200923.2 0.032662 0.031853 0.033301
Scale: 0->0 202123.0 0.032526 0.031664 0.032970
Add: 0->0 208873.2 0.047322 0.045961 0.047884
Triad: 0->0 208523.8 0.047262 0.046038 0.048414
CXL-only:
numactl --cpunodebind=1 -w --membind=2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
Copy: 0->0 22209.7 0.288661 0.288162 0.289342
Scale: 0->0 22288.2 0.287549 0.287147 0.288291
Add: 0->0 24419.1 0.393372 0.393135 0.393735
Triad: 0->0 24484.6 0.392337 0.392083 0.394331
Based on the above, the optimal weights are ~9:1
echo 9 > /sys/kernel/mm/mempolicy/weighted_interleave/node1
echo 1 > /sys/kernel/mm/mempolicy/weighted_interleave/node2
default interleave:
numactl --cpunodebind=1 --interleave=1,2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
Copy: 0->0 44666.2 0.143671 0.143285 0.144174
Scale: 0->0 44781.6 0.143256 0.142916 0.143713
Add: 0->0 48600.7 0.197719 0.197528 0.197858
Triad: 0->0 48727.5 0.197204 0.197014 0.197439
global weighted interleave:
numactl --cpunodebind=1 -w --interleave=1,2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
Copy: 0->0 190085.9 0.034289 0.033669 0.034645
Scale: 0->0 207677.4 0.031909 0.030817 0.033061
Add: 0->0 202036.8 0.048737 0.047516 0.053409
Triad: 0->0 217671.5 0.045819 0.044103 0.046755
targted regions w/ global weights (modified stream to mbind2 malloc'd regions))
numactl --cpunodebind=1 --membind=1 ./stream_c.exe -b --ntimes 100 --array-size 400M --malloc
Copy: 0->0 205827.0 0.031445 0.031094 0.031984
Scale: 0->0 208171.8 0.031320 0.030744 0.032505
Add: 0->0 217352.0 0.045087 0.044168 0.046515
Triad: 0->0 216884.8 0.045062 0.044263 0.046982
=====================================================================
Performance tests - XSBench
From - Hyeongtak Ji <[email protected]>
Hardware: Single socket, Single CXL memory Expander
NUMA node 0: 56 logical cores, 128 GB memory
NUMA node 2: 96 GB CXL memory
Threads: 56
Lookups: 170,000,000
Summary: +19% over DRAM. +47% over default interleave.
Performance tests - XSBench
1. dram only
$ numactl -m 0 ./XSBench -s XL –p 5000000
Runtime: 36.235 seconds
Lookups/s: 4,691,618
2. default interleave
$ numactl –i 0,2 ./XSBench –s XL –p 5000000
Runtime: 55.243 seconds
Lookups/s: 3,077,293
3. weighted interleave
numactl –w –i 0,2 ./XSBench –s XL –p 5000000
Runtime: 29.262 seconds
Lookups/s: 5,809,513
=====================================================================
LTP Tests: https://github.com/gmprice/ltp/tree/mempolicy2
= Existing tests
set_mempolicy, get_mempolicy, mbind
MPOL_WEIGHTED_INTERLEAVE added manually to test basic functionality
but did not adjust tests for weighting. Basically the weights were
set to 1, which is the default, and it should behavior like standard
MPOL_INTERLEAVE if logic is correct.
== set_mempolicy01 : passed 18, failed 0
== set_mempolicy02 : passed 10, failed 0
== set_mempolicy03 : passed 64, failed 0
== set_mempolicy04 : passed 32, failed 0
== set_mempolicy05 - n/a on non-x86
== set_mempolicy06 : passed 10, failed 0
this is set_mempolicy02 + MPOL_WEIGHTED_INTERLEAVE
== set_mempolicy07 : passed 32, failed 0
set_mempolicy04 + MPOL_WEIGHTED_INTERLEAVE
== get_mempolicy01 : passed 12, failed 0
change: added MPOL_WEIGHTED_INTERLEAVE
== get_mempolicy02 : passed 2, failed 0
== mbind01 : passed 15, failed 0
added MPOL_WEIGHTED_INTERLEAVE
== mbind02 : passed 4, failed 0
added MPOL_WEIGHTED_INTERLEAVE
== mbind03 : passed 16, failed 0
added MPOL_WEIGHTED_INTERLEAVE
== mbind04 : passed 48, failed 0
added MPOL_WEIGHTED_INTERLEAVE
=====================================================================
numactl (set_mempolicy) w/ global weighting test
numactl fork: https://github.com/gmprice/numactl/tree/weighted_interleave_master
command: numactl -w --interleave=0,1 ./eatmem
result (weights 1:1):
0176a000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=32897 N1=32896 kernelpagesize_kB=4
7fceeb9ff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=32768 N1=32769 kernelpagesize_kB=4
50% distribution is correct
result (weights 5:1):
01b14000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=54828 N1=10965 kernelpagesize_kB=4
7f47a1dff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=54614 N1=10923 kernelpagesize_kB=4
16.666% distribution is correct
result (weights 1:5):
01f07000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=10966 N1=54827 kernelpagesize_kB=4
7f17b1dff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=10923 N1=54614 kernelpagesize_kB=4
16.666% distribution is correct
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main (void)
{
char* mem = malloc(1024*1024*256);
memset(mem, 1, 1024*1024*256);
for (int i = 0; i < ((1024*1024*256)/4096); i++)
{
mem = malloc(4096);
mem[0] = 1;
}
printf("done\n");
getchar();
return 0;
}
=====================================================================
Suggested-by: Gregory Price <[email protected]>
Suggested-by: Johannes Weiner <[email protected]>
Suggested-by: Hasan Al Maruf <[email protected]>
Suggested-by: Hao Wang <[email protected]>
Suggested-by: Ying Huang <[email protected]>
Suggested-by: Dan Williams <[email protected]>
Suggested-by: Michal Hocko <[email protected]>
Suggested-by: Zhongkun He <[email protected]>
Suggested-by: Frank van der Linden <[email protected]>
Suggested-by: John Groves <[email protected]>
Suggested-by: Vinicius Tavares Petrucci <[email protected]>
Suggested-by: Srinivasulu Thanneeru <[email protected]>
Suggested-by: Ravi Jonnalagadda <[email protected]>
Suggested-by: Jonathan Cameron <[email protected]>
Suggested-by: Hyeongtak Ji <[email protected]>
Suggested-by: Andi Fleen <[email protected]>
Signed-off-by: Gregory Price <[email protected]>
Gregory Price (2):
mm/mempolicy: refactor a read-once mechanism into a function for
re-use
mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted
interleaving
Rakie Kim (1):
mm/mempolicy: implement the sysfs-based weighted_interleave interface
.../ABI/testing/sysfs-kernel-mm-mempolicy | 4 +
...fs-kernel-mm-mempolicy-weighted-interleave | 26 +
.../admin-guide/mm/numa_memory_policy.rst | 9 +
include/linux/mempolicy.h | 5 +
include/uapi/linux/mempolicy.h | 1 +
mm/mempolicy.c | 491 +++++++++++++++++-
6 files changed, 523 insertions(+), 13 deletions(-)
create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
--
2.39.1
From: Rakie Kim <[email protected]>
This patch provides a way to set interleave weight information under
sysfs at /sys/kernel/mm/mempolicy/weighted_interleave/nodeN
The sysfs structure is designed as follows.
$ tree /sys/kernel/mm/mempolicy/
/sys/kernel/mm/mempolicy/ [1]
└── weighted_interleave [2]
├── node0 [3]
└── node1
Each file above can be explained as follows.
[1] mm/mempolicy: configuration interface for mempolicy subsystem
[2] weighted_interleave/: config interface for weighted interleave policy
[3] weighted_interleave/nodeN: weight for nodeN
Internally, there is a secondary table `default_iw_table`, which holds
kernel-internal default interleave weights for each possible node.
If the value for a node is set to `0`, the default value will be used.
If sysfs is disabled in the config, interleave weights will default
to use `default_iw_table`.
Suggested-by: Huang Ying <[email protected]>
Signed-off-by: Rakie Kim <[email protected]>
Signed-off-by: Honggyu Kim <[email protected]>
Co-developed-by: Gregory Price <[email protected]>
Signed-off-by: Gregory Price <[email protected]>
Co-developed-by: Hyeongtak Ji <[email protected]>
Signed-off-by: Hyeongtak Ji <[email protected]>
---
.../ABI/testing/sysfs-kernel-mm-mempolicy | 4 +
...fs-kernel-mm-mempolicy-weighted-interleave | 26 ++
mm/mempolicy.c | 251 ++++++++++++++++++
3 files changed, 281 insertions(+)
create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
new file mode 100644
index 000000000000..2dcf24f4384a
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
@@ -0,0 +1,4 @@
+What: /sys/kernel/mm/mempolicy/
+Date: December 2023
+Contact: Linux memory management mailing list <[email protected]>
+Description: Interface for Mempolicy
diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
new file mode 100644
index 000000000000..e6a38139bf0f
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
@@ -0,0 +1,26 @@
+What: /sys/kernel/mm/mempolicy/weighted_interleave/
+Date: December 2023
+Contact: Linux memory management mailing list <[email protected]>
+Description: Configuration Interface for the Weighted Interleave policy
+
+What: /sys/kernel/mm/mempolicy/weighted_interleave/nodeN
+Date: December 2023
+Contact: Linux memory management mailing list <[email protected]>
+Description: Weight configuration interface for nodeN
+
+ The interleave weight for a memory node (N). These weights are
+ utilized by processes which have set their mempolicy to
+ MPOL_WEIGHTED_INTERLEAVE and have opted into global weights by
+ omitting a task-local weight array.
+
+ These weights only affect new allocations, and changes at runtime
+ will not cause migrations on already allocated pages.
+
+ The minimum weight for a node is always 1.
+
+ Minimum weight: 1
+ Maximum weight: 255
+
+ Writing an empty string or `0` will reset the weight to the
+ system default. The system default may be set by the kernel
+ or drivers at boot or during hotplug events.
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 10a590ee1c89..5da4fd79fd18 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -131,6 +131,30 @@ static struct mempolicy default_policy = {
static struct mempolicy preferred_node_policy[MAX_NUMNODES];
+struct iw_table {
+ struct rcu_head rcu;
+ u8 weights[MAX_NUMNODES];
+};
+/*
+ * default_iw_table is the kernel-internal default value interleave
+ * weight table. It is to be set by driver code capable of reading
+ * HMAT/CDAT information, and to provide mempolicy a sane set of
+ * default weight values for WEIGHTED_INTERLEAVE mode.
+ *
+ * By default, prior to HMAT/CDAT information being consumed, the
+ * default weight of all nodes is 1. The default weight of any
+ * node can only be in the range 1-255. A 0-weight is not allowed.
+ */
+static struct iw_table default_iw_table;
+/*
+ * iw_table is the sysfs-set interleave weight table, a value of 0
+ * denotes that the default_iw_table value should be used.
+ *
+ * iw_table is RCU protected
+ */
+static struct iw_table __rcu *iw_table;
+static DEFINE_MUTEX(iw_table_mtx);
+
/**
* numa_nearest_node - Find nearest node by state
* @node: Node id to start the search
@@ -3067,3 +3091,230 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
p += scnprintf(p, buffer + maxlen - p, ":%*pbl",
nodemask_pr_args(&nodes));
}
+
+#ifdef CONFIG_SYSFS
+struct iw_node_attr {
+ struct kobj_attribute kobj_attr;
+ int nid;
+};
+
+static ssize_t node_show(struct kobject *kobj, struct kobj_attribute *attr,
+ char *buf)
+{
+ struct iw_node_attr *node_attr;
+ u8 weight;
+ struct iw_table __rcu *table;
+
+ node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
+
+ rcu_read_lock();
+ table = rcu_dereference(iw_table);
+ weight = table->weights[node_attr->nid];
+ rcu_read_unlock();
+
+ return sysfs_emit(buf, "%d\n", weight);
+}
+
+static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct iw_node_attr *node_attr;
+ struct iw_table __rcu *new;
+ struct iw_table __rcu *old;
+ u8 weight = 0;
+
+ node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
+ if (count == 0 || sysfs_streq(buf, ""))
+ weight = 0;
+ else if (kstrtou8(buf, 0, &weight))
+ return -EINVAL;
+
+ new = kmalloc(sizeof(*new), GFP_KERNEL);
+ if (!new)
+ return -ENOMEM;
+
+ mutex_lock(&iw_table_mtx);
+ old = rcu_dereference_protected(iw_table,
+ lockdep_is_held(&iw_table_mtx));
+ /* If value is 0, revert to default weight */
+ weight = weight ? weight : default_iw_table.weights[node_attr->nid];
+ memcpy(&new->weights, &old->weights, sizeof(new->weights));
+ new->weights[node_attr->nid] = weight;
+ rcu_assign_pointer(iw_table, new);
+ mutex_unlock(&iw_table_mtx);
+ kfree_rcu(old, rcu);
+ return count;
+}
+
+static struct iw_node_attr *node_attrs[MAX_NUMNODES];
+
+static void sysfs_wi_node_release(struct iw_node_attr *node_attr,
+ struct kobject *parent)
+{
+ if (!node_attr)
+ return;
+ sysfs_remove_file(parent, &node_attr->kobj_attr.attr);
+ kfree(node_attr->kobj_attr.attr.name);
+ kfree(node_attr);
+}
+
+static void sysfs_wi_release(struct kobject *wi_kobj)
+{
+ int i;
+
+ for (i = 0; i < MAX_NUMNODES; i++)
+ sysfs_wi_node_release(node_attrs[i], wi_kobj);
+ kobject_put(wi_kobj);
+}
+
+static const struct kobj_type wi_ktype = {
+ .sysfs_ops = &kobj_sysfs_ops,
+ .release = sysfs_wi_release,
+};
+
+static int add_weight_node(int nid, struct kobject *wi_kobj)
+{
+ struct iw_node_attr *node_attr;
+ char *name;
+
+ node_attr = kzalloc(sizeof(*node_attr), GFP_KERNEL);
+ if (!node_attr)
+ return -ENOMEM;
+
+ name = kasprintf(GFP_KERNEL, "node%d", nid);
+ if (!name) {
+ kfree(node_attr);
+ return -ENOMEM;
+ }
+
+ sysfs_attr_init(&node_attr->kobj_attr.attr);
+ node_attr->kobj_attr.attr.name = name;
+ node_attr->kobj_attr.attr.mode = 0644;
+ node_attr->kobj_attr.show = node_show;
+ node_attr->kobj_attr.store = node_store;
+ node_attr->nid = nid;
+
+ if (sysfs_create_file(wi_kobj, &node_attr->kobj_attr.attr)) {
+ kfree(node_attr->kobj_attr.attr.name);
+ kfree(node_attr);
+ pr_err("failed to add attribute to weighted_interleave\n");
+ return -ENOMEM;
+ }
+
+ node_attrs[nid] = node_attr;
+ return 0;
+}
+
+static int add_weighted_interleave_group(struct kobject *root_kobj)
+{
+ struct kobject *wi_kobj;
+ int nid, err;
+
+ wi_kobj = kzalloc(sizeof(struct kobject), GFP_KERNEL);
+ if (!wi_kobj)
+ return -ENOMEM;
+
+ err = kobject_init_and_add(wi_kobj, &wi_ktype, root_kobj,
+ "weighted_interleave");
+ if (err) {
+ kfree(wi_kobj);
+ return err;
+ }
+
+ memset(node_attrs, 0, sizeof(node_attrs));
+ for_each_node_state(nid, N_POSSIBLE) {
+ err = add_weight_node(nid, wi_kobj);
+ if (err) {
+ pr_err("failed to add sysfs [node%d]\n", nid);
+ break;
+ }
+ }
+ if (err)
+ kobject_put(wi_kobj);
+ return 0;
+}
+
+static void mempolicy_kobj_release(struct kobject *kobj)
+{
+ if (iw_table != &default_iw_table)
+ kfree(iw_table);
+ kfree(kobj);
+}
+
+static const struct kobj_type mempolicy_ktype = {
+ .release = mempolicy_kobj_release
+};
+
+static struct kobject *mempolicy_kobj;
+static int __init mempolicy_sysfs_init(void)
+{
+ int err;
+ struct kobject *mempolicy_kobj;
+ struct iw_table __rcu *table = NULL;
+
+ /*
+ * If sysfs setup fails, utilize the default weight table
+ * This at least allows mempolicy to continue functioning safely.
+ */
+ memset(&default_iw_table.weights, 1, MAX_NUMNODES);
+ iw_table = &default_iw_table;
+
+ table = kzalloc(sizeof(struct iw_table), GFP_KERNEL);
+ if (!table)
+ return -ENOMEM;
+
+ memcpy(&table->weights, default_iw_table.weights,
+ sizeof(table->weights));
+
+ mempolicy_kobj = kzalloc(sizeof(*mempolicy_kobj), GFP_KERNEL);
+ if (!mempolicy_kobj) {
+ kfree(table);
+ pr_err("failed to add mempolicy kobject to the system\n");
+ return -ENOMEM;
+ }
+ err = kobject_init_and_add(mempolicy_kobj, &mempolicy_ktype, mm_kobj,
+ "mempolicy");
+ if (err) {
+ kfree(table);
+ kfree(mempolicy_kobj);
+ return err;
+ }
+
+ err = add_weighted_interleave_group(mempolicy_kobj);
+
+ if (err) {
+ kobject_put(mempolicy_kobj);
+ return err;
+ }
+
+ iw_table = table;
+ return err;
+}
+
+static void __exit mempolicy_exit(void)
+{
+ if (mempolicy_kobj)
+ kobject_put(mempolicy_kobj);
+}
+
+#else
+static int __init mempolicy_sysfs_init(void)
+{
+ /*
+ * if sysfs is not enabled MPOL_WEIGHTED_INTERLEAVE defaults to
+ * MPOL_INTERLEAVE behavior, but is still defined separately to
+ * allow task-local weighted interleave and system-defaults to
+ * operate as intended.
+ *
+ * In this scenario iw_table cannot (presently) change, so
+ * there's no need to set up RCU / cleanup code.
+ */
+ memset(&default_iw_table.weights, 1, sizeof(default_iw_table));
+ iw_table = default_iw_table;
+ return 0;
+}
+
+static void __exit mempolicy_exit(void) { }
+#endif /* CONFIG_SYSFS */
+late_initcall(mempolicy_sysfs_init);
+module_exit(mempolicy_exit);
--
2.39.1
move the use of barrier() to force policy->nodemask onto the stack into
a function `read_once_policy_nodemask` so that it may be re-used.
Suggested-by: Huang Ying <[email protected]>
Signed-off-by: Gregory Price <[email protected]>
---
mm/mempolicy.c | 26 ++++++++++++++++----------
1 file changed, 16 insertions(+), 10 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 5da4fd79fd18..0abd3a3394ef 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1907,6 +1907,20 @@ unsigned int mempolicy_slab_node(void)
}
}
+static unsigned int read_once_policy_nodemask(struct mempolicy *pol,
+ nodemask_t *mask)
+{
+ /*
+ * barrier stabilizes the nodemask locally so that it can be iterated
+ * over safely without concern for changes. Allocators validate node
+ * selection does not violate mems_allowed, so this is safe.
+ */
+ barrier();
+ __builtin_memcpy(mask, &pol->nodes, sizeof(nodemask_t));
+ barrier();
+ return nodes_weight(*mask);
+}
+
/*
* Do static interleaving for interleave index @ilx. Returns the ilx'th
* node in pol->nodes (starting from ilx=0), wrapping around if ilx
@@ -1914,20 +1928,12 @@ unsigned int mempolicy_slab_node(void)
*/
static unsigned int interleave_nid(struct mempolicy *pol, pgoff_t ilx)
{
- nodemask_t nodemask = pol->nodes;
+ nodemask_t nodemask;
unsigned int target, nnodes;
int i;
int nid;
- /*
- * The barrier will stabilize the nodemask in a register or on
- * the stack so that it will stop changing under the code.
- *
- * Between first_node() and next_node(), pol->nodes could be changed
- * by other threads. So we put pol->nodes in a local stack.
- */
- barrier();
- nnodes = nodes_weight(nodemask);
+ nnodes = read_once_policy_nodemask(pol, &nodemask);
if (!nnodes)
return numa_node_id();
target = ilx % nnodes;
--
2.39.1
When a system has multiple NUMA nodes and it becomes bandwidth hungry,
using the current MPOL_INTERLEAVE could be an wise option.
However, if those NUMA nodes consist of different types of memory such
as socket-attached DRAM and CXL/PCIe attached DRAM, the round-robin
based interleave policy does not optimally distribute data to make use
of their different bandwidth characteristics.
Instead, interleave is more effective when the allocation policy follows
each NUMA nodes' bandwidth weight rather than a simple 1:1 distribution.
This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE,
enabling weighted interleave between NUMA nodes. Weighted interleave
allows for proportional distribution of memory across multiple numa
nodes, preferably apportioned to match the bandwidth of each node.
For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1),
with bandwidth of (100GB/s, 50GB/s) respectively, the appropriate
weight distribution is (2:1).
Weights for each node can be assigned via the new sysfs extension:
/sys/kernel/mm/mempolicy/weighted_interleave/
In addition, the `default_iw_table` is created, which will be extended
in the future to allow defaults to be registered by drivers. For now,
the default value of all nodes will be `1`, which matches the behavior
of standard 1:1 round-robin interleave.
The policy allocates a number of pages equal to the set weights. For
example, if the weights are (2,1), then 2 pages will be allocated on
node0 for every 1 page allocated on node1.
The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2)
and mbind(2).
There are 3 integration points:
weighted_interleave_nodes:
Counts the number of allocations as they occur, and applies the
weight for the current node. When the weight reaches 0, switch
to the next node.
weighted_interleave_nid:
Gets the total weight of the nodemask as well as each individual
node weight, then calculates the node based on the given index.
bulk_array_weighted_interleave:
Gets the total weight of the nodemask as well as each individual
node weight, then calculates the number of "interleave rounds" as
well as any delta ("partial round"). Calculates the number of
pages for each node and allocates them.
If a node was scheduled for interleave via interleave_nodes, the
current weight (pol->cur_weight) will be allocated first, before
the remaining bulk calculation is done.
One piece of complexity is the interaction between a recent refactor
which split the logic to acquire the "ilx" (interleave index) of an
allocation and the actually application of the interleave. The
calculation of the `interleave index` is done by `get_vma_policy()`,
while the actual selection of the node will be later appliex by the
relevant weighted_interleave function.
Suggested-by: Hasan Al Maruf <[email protected]>
Signed-off-by: Gregory Price <[email protected]>
Co-developed-by: Rakie Kim <[email protected]>
Signed-off-by: Rakie Kim <[email protected]>
Co-developed-by: Honggyu Kim <[email protected]>
Signed-off-by: Honggyu Kim <[email protected]>
Co-developed-by: Hyeongtak Ji <[email protected]>
Signed-off-by: Hyeongtak Ji <[email protected]>
Co-developed-by: Srinivasulu Thanneeru <[email protected]>
Signed-off-by: Srinivasulu Thanneeru <[email protected]>
Co-developed-by: Ravi Jonnalagadda <[email protected]>
Signed-off-by: Ravi Jonnalagadda <[email protected]>
---
.../admin-guide/mm/numa_memory_policy.rst | 9 +
include/linux/mempolicy.h | 5 +
include/uapi/linux/mempolicy.h | 1 +
mm/mempolicy.c | 214 +++++++++++++++++-
4 files changed, 226 insertions(+), 3 deletions(-)
diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index eca38fa81e0f..a70f20ce1ffb 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -250,6 +250,15 @@ MPOL_PREFERRED_MANY
can fall back to all existing numa nodes. This is effectively
MPOL_PREFERRED allowed for a mask rather than a single node.
+MPOL_WEIGHTED_INTERLEAVE
+ This mode operates the same as MPOL_INTERLEAVE, except that
+ interleaving behavior is executed based on weights set in
+ /sys/kernel/mm/mempolicy/weighted_interleave/
+
+ Weighted interleave allocates pages on nodes according to a
+ weight. For example if nodes [0,1] are weighted [5,2], 5 pages
+ will be allocated on node0 for every 2 pages allocated on node1.
+
NUMA memory policy supports the following optional mode flags:
MPOL_F_STATIC_NODES
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 931b118336f4..c1a083eb0dd5 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -54,6 +54,11 @@ struct mempolicy {
nodemask_t cpuset_mems_allowed; /* relative to these nodes */
nodemask_t user_nodemask; /* nodemask passed by user */
} w;
+
+ /* Weighted interleave settings */
+ struct {
+ u8 cur_weight;
+ } wil;
};
/*
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index a8963f7ef4c2..1f9bb10d1a47 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -23,6 +23,7 @@ enum {
MPOL_INTERLEAVE,
MPOL_LOCAL,
MPOL_PREFERRED_MANY,
+ MPOL_WEIGHTED_INTERLEAVE,
MPOL_MAX, /* always last member of enum */
};
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0abd3a3394ef..a2b5d64b28e0 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -19,6 +19,13 @@
* for anonymous memory. For process policy an process counter
* is used.
*
+ * weighted interleave
+ * Allocate memory interleaved over a set of nodes based on
+ * a set of weights (per-node), with normal fallback if it
+ * fails. Otherwise operates the same as interleave.
+ * Example: nodeset(0,1) & weights (2,1) - 2 pages allocated
+ * on node 0 for every 1 page allocated on node 1.
+ *
* bind Only allocate memory on a specific set of nodes,
* no fallback.
* FIXME: memory is allocated starting with the first node
@@ -327,6 +334,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
policy->mode = mode;
policy->flags = flags;
policy->home_node = NUMA_NO_NODE;
+ policy->wil.cur_weight = 0;
return policy;
}
@@ -439,6 +447,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
.create = mpol_new_nodemask,
.rebind = mpol_rebind_preferred,
},
+ [MPOL_WEIGHTED_INTERLEAVE] = {
+ .create = mpol_new_nodemask,
+ .rebind = mpol_rebind_nodemask,
+ },
};
static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
@@ -860,7 +872,8 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
old = current->mempolicy;
current->mempolicy = new;
- if (new && new->mode == MPOL_INTERLEAVE)
+ if (new && (new->mode == MPOL_INTERLEAVE ||
+ new->mode == MPOL_WEIGHTED_INTERLEAVE))
current->il_prev = MAX_NUMNODES-1;
task_unlock(current);
mpol_put(old);
@@ -886,6 +899,7 @@ static void get_policy_nodemask(struct mempolicy *pol, nodemask_t *nodes)
case MPOL_INTERLEAVE:
case MPOL_PREFERRED:
case MPOL_PREFERRED_MANY:
+ case MPOL_WEIGHTED_INTERLEAVE:
*nodes = pol->nodes;
break;
case MPOL_LOCAL:
@@ -970,6 +984,13 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
} else if (pol == current->mempolicy &&
pol->mode == MPOL_INTERLEAVE) {
*policy = next_node_in(current->il_prev, pol->nodes);
+ } else if (pol == current->mempolicy &&
+ (pol->mode == MPOL_WEIGHTED_INTERLEAVE)) {
+ if (pol->wil.cur_weight)
+ *policy = current->il_prev;
+ else
+ *policy = next_node_in(current->il_prev,
+ pol->nodes);
} else {
err = -EINVAL;
goto out;
@@ -1799,7 +1820,8 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
pol = __get_vma_policy(vma, addr, ilx);
if (!pol)
pol = get_task_policy(current);
- if (pol->mode == MPOL_INTERLEAVE) {
+ if (pol->mode == MPOL_INTERLEAVE ||
+ pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
*ilx += vma->vm_pgoff >> order;
*ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
}
@@ -1849,6 +1871,28 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
return zone >= dynamic_policy_zone;
}
+static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
+{
+ unsigned int next;
+ struct task_struct *me = current;
+ struct iw_table __rcu *table;
+
+ next = next_node_in(me->il_prev, policy->nodes);
+ if (next == MAX_NUMNODES)
+ return next;
+
+ rcu_read_lock();
+ table = rcu_dereference(iw_table);
+ if (!policy->wil.cur_weight)
+ policy->wil.cur_weight = table->weights[next];
+ rcu_read_unlock();
+
+ policy->wil.cur_weight--;
+ if (!policy->wil.cur_weight)
+ me->il_prev = next;
+ return next;
+}
+
/* Do dynamic interleaving for a process */
static unsigned int interleave_nodes(struct mempolicy *policy)
{
@@ -1883,6 +1927,9 @@ unsigned int mempolicy_slab_node(void)
case MPOL_INTERLEAVE:
return interleave_nodes(policy);
+ case MPOL_WEIGHTED_INTERLEAVE:
+ return weighted_interleave_nodes(policy);
+
case MPOL_BIND:
case MPOL_PREFERRED_MANY:
{
@@ -1921,6 +1968,39 @@ static unsigned int read_once_policy_nodemask(struct mempolicy *pol,
return nodes_weight(*mask);
}
+static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
+{
+ nodemask_t nodemask;
+ unsigned int target, nr_nodes;
+ struct iw_table __rcu *table;
+ unsigned int weight_total = 0;
+ u8 weight;
+ int nid;
+
+ nr_nodes = read_once_policy_nodemask(pol, &nodemask);
+ if (!nr_nodes)
+ return numa_node_id();
+
+ rcu_read_lock();
+ table = rcu_dereference(iw_table);
+ /* calculate the total weight */
+ for_each_node_mask(nid, nodemask)
+ weight_total += table->weights[nid];
+
+ /* Calculate the node offset based on totals */
+ target = ilx % weight_total;
+ nid = first_node(nodemask);
+ while (target) {
+ weight = table->weights[nid];
+ if (target < weight)
+ break;
+ target -= weight;
+ nid = next_node_in(nid, nodemask);
+ }
+ rcu_read_unlock();
+ return nid;
+}
+
/*
* Do static interleaving for interleave index @ilx. Returns the ilx'th
* node in pol->nodes (starting from ilx=0), wrapping around if ilx
@@ -1981,6 +2061,11 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
*nid = (ilx == NO_INTERLEAVE_INDEX) ?
interleave_nodes(pol) : interleave_nid(pol, ilx);
break;
+ case MPOL_WEIGHTED_INTERLEAVE:
+ *nid = (ilx == NO_INTERLEAVE_INDEX) ?
+ weighted_interleave_nodes(pol) :
+ weighted_interleave_nid(pol, ilx);
+ break;
}
return nodemask;
@@ -2042,6 +2127,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
case MPOL_PREFERRED_MANY:
case MPOL_BIND:
case MPOL_INTERLEAVE:
+ case MPOL_WEIGHTED_INTERLEAVE:
*mask = mempolicy->nodes;
break;
@@ -2141,7 +2227,8 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
* If the policy is interleave or does not allow the current
* node in its nodemask, we allocate the standard way.
*/
- if (pol->mode != MPOL_INTERLEAVE &&
+ if ((pol->mode != MPOL_INTERLEAVE &&
+ pol->mode != MPOL_WEIGHTED_INTERLEAVE) &&
(!nodemask || node_isset(nid, *nodemask))) {
/*
* First, try to allocate THP only on local node, but
@@ -2277,6 +2364,114 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
return total_allocated;
}
+static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
+ struct mempolicy *pol, unsigned long nr_pages,
+ struct page **page_array)
+{
+ struct task_struct *me = current;
+ unsigned long total_allocated = 0;
+ unsigned long nr_allocated;
+ unsigned long rounds;
+ unsigned long node_pages, delta;
+ u8 weight;
+ struct iw_table __rcu *table;
+ u8 *weights;
+ unsigned int weight_total = 0;
+ unsigned long rem_pages = nr_pages;
+ nodemask_t nodes;
+ int nnodes, node, weight_nodes;
+ int prev_node = NUMA_NO_NODE;
+ int i;
+
+ nnodes = read_once_policy_nodemask(pol, &nodes);
+ if (!nnodes)
+ return 0;
+
+ /* Continue allocating from most recent node and adjust the nr_pages */
+ if (pol->wil.cur_weight) {
+ node = next_node_in(me->il_prev, nodes);
+ node_pages = pol->wil.cur_weight;
+ if (node_pages > rem_pages)
+ node_pages = rem_pages;
+ nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
+ NULL, page_array);
+ page_array += nr_allocated;
+ total_allocated += nr_allocated;
+ /* if that's all the pages, no need to interleave */
+ if (rem_pages <= pol->wil.cur_weight) {
+ pol->wil.cur_weight -= rem_pages;
+ return total_allocated;
+ }
+ /* Otherwise we adjust nr_pages down, and continue from there */
+ rem_pages -= pol->wil.cur_weight;
+ pol->wil.cur_weight = 0;
+ prev_node = node;
+ }
+
+ /* fetch the weights for this operation and calculate total weight */
+ weights = kmalloc(nnodes, GFP_KERNEL);
+ if (!weights)
+ return total_allocated;
+
+ rcu_read_lock();
+ table = rcu_dereference(iw_table);
+ weight_nodes = 0;
+ for_each_node_mask(node, nodes) {
+ weights[weight_nodes++] = table->weights[node];
+ weight_total += table->weights[node];
+ }
+ rcu_read_unlock();
+
+ if (!weight_total) {
+ kfree(weights);
+ return total_allocated;
+ }
+
+ /* Now we can continue allocating as if from 0 instead of an offset */
+ rounds = rem_pages / weight_total;
+ delta = rem_pages % weight_total;
+ for (i = 0; i < nnodes; i++) {
+ node = next_node_in(prev_node, nodes);
+ weight = weights[i];
+ node_pages = weight * rounds;
+ if (delta) {
+ if (delta > weight) {
+ node_pages += weight;
+ delta -= weight;
+ } else {
+ node_pages += delta;
+ delta = 0;
+ }
+ }
+ nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
+ NULL, page_array);
+ page_array += nr_allocated;
+ total_allocated += nr_allocated;
+ if (total_allocated == nr_pages)
+ break;
+ prev_node = node;
+ }
+
+ /*
+ * Finally, we need to update me->il_prev and pol->wil.cur_weight
+ * if there were overflow pages, but not equivalent to the node
+ * weight, set the cur_weight to node_weight - delta and the
+ * me->il_prev to the previous node. Otherwise if it was perfect
+ * we can simply set il_prev to node and cur_weight to 0
+ */
+ if (node_pages) {
+ me->il_prev = prev_node;
+ node_pages %= weight;
+ pol->wil.cur_weight = weight - node_pages;
+ } else {
+ me->il_prev = node;
+ pol->wil.cur_weight = 0;
+ }
+
+ kfree(weights);
+ return total_allocated;
+}
+
static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
struct mempolicy *pol, unsigned long nr_pages,
struct page **page_array)
@@ -2317,6 +2512,11 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
return alloc_pages_bulk_array_interleave(gfp, pol,
nr_pages, page_array);
+ if (pol->mode == MPOL_WEIGHTED_INTERLEAVE)
+ return alloc_pages_bulk_array_weighted_interleave(gfp, pol,
+ nr_pages,
+ page_array);
+
if (pol->mode == MPOL_PREFERRED_MANY)
return alloc_pages_bulk_array_preferred_many(gfp,
numa_node_id(), pol, nr_pages, page_array);
@@ -2392,6 +2592,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
case MPOL_INTERLEAVE:
case MPOL_PREFERRED:
case MPOL_PREFERRED_MANY:
+ case MPOL_WEIGHTED_INTERLEAVE:
return !!nodes_equal(a->nodes, b->nodes);
case MPOL_LOCAL:
return true;
@@ -2528,6 +2729,10 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
polnid = interleave_nid(pol, ilx);
break;
+ case MPOL_WEIGHTED_INTERLEAVE:
+ polnid = weighted_interleave_nid(pol, ilx);
+ break;
+
case MPOL_PREFERRED:
if (node_isset(curnid, pol->nodes))
goto out;
@@ -2902,6 +3107,7 @@ static const char * const policy_modes[] =
[MPOL_PREFERRED] = "prefer",
[MPOL_BIND] = "bind",
[MPOL_INTERLEAVE] = "interleave",
+ [MPOL_WEIGHTED_INTERLEAVE] = "weighted interleave",
[MPOL_LOCAL] = "local",
[MPOL_PREFERRED_MANY] = "prefer (many)",
};
@@ -2961,6 +3167,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
}
break;
case MPOL_INTERLEAVE:
+ case MPOL_WEIGHTED_INTERLEAVE:
/*
* Default to online nodes with memory if no nodelist
*/
@@ -3071,6 +3278,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
case MPOL_PREFERRED_MANY:
case MPOL_BIND:
case MPOL_INTERLEAVE:
+ case MPOL_WEIGHTED_INTERLEAVE:
nodes = pol->nodes;
break;
default:
--
2.39.1
Gregory Price <[email protected]> writes:
> From: Rakie Kim <[email protected]>
>
> This patch provides a way to set interleave weight information under
> sysfs at /sys/kernel/mm/mempolicy/weighted_interleave/nodeN
>
> The sysfs structure is designed as follows.
>
> $ tree /sys/kernel/mm/mempolicy/
> /sys/kernel/mm/mempolicy/ [1]
> └── weighted_interleave [2]
> ├── node0 [3]
> └── node1
>
> Each file above can be explained as follows.
>
> [1] mm/mempolicy: configuration interface for mempolicy subsystem
>
> [2] weighted_interleave/: config interface for weighted interleave policy
>
> [3] weighted_interleave/nodeN: weight for nodeN
>
> Internally, there is a secondary table `default_iw_table`, which holds
> kernel-internal default interleave weights for each possible node.
>
> If the value for a node is set to `0`, the default value will be used.
>
> If sysfs is disabled in the config, interleave weights will default
> to use `default_iw_table`.
>
> Suggested-by: Huang Ying <[email protected]>
> Signed-off-by: Rakie Kim <[email protected]>
> Signed-off-by: Honggyu Kim <[email protected]>
> Co-developed-by: Gregory Price <[email protected]>
> Signed-off-by: Gregory Price <[email protected]>
> Co-developed-by: Hyeongtak Ji <[email protected]>
> Signed-off-by: Hyeongtak Ji <[email protected]>
> ---
> .../ABI/testing/sysfs-kernel-mm-mempolicy | 4 +
> ...fs-kernel-mm-mempolicy-weighted-interleave | 26 ++
> mm/mempolicy.c | 251 ++++++++++++++++++
> 3 files changed, 281 insertions(+)
> create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
> create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
>
> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
> new file mode 100644
> index 000000000000..2dcf24f4384a
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
> @@ -0,0 +1,4 @@
> +What: /sys/kernel/mm/mempolicy/
> +Date: December 2023
> +Contact: Linux memory management mailing list <[email protected]>
> +Description: Interface for Mempolicy
> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> new file mode 100644
> index 000000000000..e6a38139bf0f
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> @@ -0,0 +1,26 @@
> +What: /sys/kernel/mm/mempolicy/weighted_interleave/
> +Date: December 2023
> +Contact: Linux memory management mailing list <[email protected]>
> +Description: Configuration Interface for the Weighted Interleave policy
> +
> +What: /sys/kernel/mm/mempolicy/weighted_interleave/nodeN
> +Date: December 2023
> +Contact: Linux memory management mailing list <[email protected]>
> +Description: Weight configuration interface for nodeN
> +
> + The interleave weight for a memory node (N). These weights are
> + utilized by processes which have set their mempolicy to
> + MPOL_WEIGHTED_INTERLEAVE and have opted into global weights by
> + omitting a task-local weight array.
> +
> + These weights only affect new allocations, and changes at runtime
> + will not cause migrations on already allocated pages.
> +
> + The minimum weight for a node is always 1.
> +
> + Minimum weight: 1
> + Maximum weight: 255
> +
> + Writing an empty string or `0` will reset the weight to the
> + system default. The system default may be set by the kernel
> + or drivers at boot or during hotplug events.
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 10a590ee1c89..5da4fd79fd18 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -131,6 +131,30 @@ static struct mempolicy default_policy = {
>
> static struct mempolicy preferred_node_policy[MAX_NUMNODES];
>
> +struct iw_table {
> + struct rcu_head rcu;
> + u8 weights[MAX_NUMNODES];
> +};
> +/*
> + * default_iw_table is the kernel-internal default value interleave
> + * weight table. It is to be set by driver code capable of reading
> + * HMAT/CDAT information, and to provide mempolicy a sane set of
> + * default weight values for WEIGHTED_INTERLEAVE mode.
> + *
> + * By default, prior to HMAT/CDAT information being consumed, the
> + * default weight of all nodes is 1. The default weight of any
> + * node can only be in the range 1-255. A 0-weight is not allowed.
> + */
> +static struct iw_table default_iw_table;
> +/*
> + * iw_table is the sysfs-set interleave weight table, a value of 0
> + * denotes that the default_iw_table value should be used.
> + *
> + * iw_table is RCU protected
> + */
> +static struct iw_table __rcu *iw_table;
> +static DEFINE_MUTEX(iw_table_mtx);
I greped "mtx" in kernel/*.c and mm/*.c and found nothing. To following
the existing coding convention, better to name this as iw_table_mutex or
iw_table_lock?
And, I think this is used to protect both iw_table and default_iw_table?
If so, it deserves some comments.
> +
> /**
> * numa_nearest_node - Find nearest node by state
> * @node: Node id to start the search
> @@ -3067,3 +3091,230 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
> p += scnprintf(p, buffer + maxlen - p, ":%*pbl",
> nodemask_pr_args(&nodes));
> }
> +
> +#ifdef CONFIG_SYSFS
> +struct iw_node_attr {
> + struct kobj_attribute kobj_attr;
> + int nid;
> +};
> +
> +static ssize_t node_show(struct kobject *kobj, struct kobj_attribute *attr,
> + char *buf)
> +{
> + struct iw_node_attr *node_attr;
> + u8 weight;
> + struct iw_table __rcu *table;
> +
> + node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
> +
> + rcu_read_lock();
> + table = rcu_dereference(iw_table);
> + weight = table->weights[node_attr->nid];
> + rcu_read_unlock();
> +
> + return sysfs_emit(buf, "%d\n", weight);
> +}
> +
> +static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
> + const char *buf, size_t count)
> +{
> + struct iw_node_attr *node_attr;
> + struct iw_table __rcu *new;
> + struct iw_table __rcu *old;
> + u8 weight = 0;
> +
> + node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
> + if (count == 0 || sysfs_streq(buf, ""))
> + weight = 0;
> + else if (kstrtou8(buf, 0, &weight))
> + return -EINVAL;
> +
> + new = kmalloc(sizeof(*new), GFP_KERNEL);
> + if (!new)
> + return -ENOMEM;
> +
> + mutex_lock(&iw_table_mtx);
> + old = rcu_dereference_protected(iw_table,
> + lockdep_is_held(&iw_table_mtx));
> + /* If value is 0, revert to default weight */
> + weight = weight ? weight : default_iw_table.weights[node_attr->nid];
If we change the default weight in default_iw_table.weights[], how do we
identify whether the weight has been customized by users via sysfs? So,
I suggest to use 0 in iw_table for not-customized weight.
And if so, we need to use RCU to access default_iw_table too.
> + memcpy(&new->weights, &old->weights, sizeof(new->weights));
> + new->weights[node_attr->nid] = weight;
> + rcu_assign_pointer(iw_table, new);
> + mutex_unlock(&iw_table_mtx);
> + kfree_rcu(old, rcu);
synchronize_rcu() should be OK here. It's fast enough in this cold
path. This make it good to define iw_table as
u8 __rcu *iw_table;
Then, we only need to allocate nr_node_ids elements now.
> + return count;
> +}
> +
> +static struct iw_node_attr *node_attrs[MAX_NUMNODES];
> +
> +static void sysfs_wi_node_release(struct iw_node_attr *node_attr,
> + struct kobject *parent)
> +{
> + if (!node_attr)
> + return;
> + sysfs_remove_file(parent, &node_attr->kobj_attr.attr);
> + kfree(node_attr->kobj_attr.attr.name);
> + kfree(node_attr);
> +}
> +
> +static void sysfs_wi_release(struct kobject *wi_kobj)
> +{
> + int i;
> +
> + for (i = 0; i < MAX_NUMNODES; i++)
> + sysfs_wi_node_release(node_attrs[i], wi_kobj);
> + kobject_put(wi_kobj);
> +}
> +
> +static const struct kobj_type wi_ktype = {
> + .sysfs_ops = &kobj_sysfs_ops,
> + .release = sysfs_wi_release,
> +};
> +
> +static int add_weight_node(int nid, struct kobject *wi_kobj)
> +{
> + struct iw_node_attr *node_attr;
> + char *name;
> +
> + node_attr = kzalloc(sizeof(*node_attr), GFP_KERNEL);
> + if (!node_attr)
> + return -ENOMEM;
> +
> + name = kasprintf(GFP_KERNEL, "node%d", nid);
> + if (!name) {
> + kfree(node_attr);
> + return -ENOMEM;
> + }
> +
> + sysfs_attr_init(&node_attr->kobj_attr.attr);
> + node_attr->kobj_attr.attr.name = name;
> + node_attr->kobj_attr.attr.mode = 0644;
> + node_attr->kobj_attr.show = node_show;
> + node_attr->kobj_attr.store = node_store;
> + node_attr->nid = nid;
> +
> + if (sysfs_create_file(wi_kobj, &node_attr->kobj_attr.attr)) {
> + kfree(node_attr->kobj_attr.attr.name);
> + kfree(node_attr);
> + pr_err("failed to add attribute to weighted_interleave\n");
> + return -ENOMEM;
> + }
> +
> + node_attrs[nid] = node_attr;
> + return 0;
> +}
> +
> +static int add_weighted_interleave_group(struct kobject *root_kobj)
> +{
> + struct kobject *wi_kobj;
> + int nid, err;
> +
> + wi_kobj = kzalloc(sizeof(struct kobject), GFP_KERNEL);
> + if (!wi_kobj)
> + return -ENOMEM;
> +
> + err = kobject_init_and_add(wi_kobj, &wi_ktype, root_kobj,
> + "weighted_interleave");
> + if (err) {
> + kfree(wi_kobj);
> + return err;
> + }
> +
> + memset(node_attrs, 0, sizeof(node_attrs));
> + for_each_node_state(nid, N_POSSIBLE) {
> + err = add_weight_node(nid, wi_kobj);
> + if (err) {
> + pr_err("failed to add sysfs [node%d]\n", nid);
> + break;
> + }
> + }
> + if (err)
> + kobject_put(wi_kobj);
> + return 0;
> +}
> +
> +static void mempolicy_kobj_release(struct kobject *kobj)
> +{
> + if (iw_table != &default_iw_table)
> + kfree(iw_table);
> + kfree(kobj);
> +}
> +
> +static const struct kobj_type mempolicy_ktype = {
> + .release = mempolicy_kobj_release
> +};
> +
> +static struct kobject *mempolicy_kobj;
> +static int __init mempolicy_sysfs_init(void)
> +{
> + int err;
> + struct kobject *mempolicy_kobj;
> + struct iw_table __rcu *table = NULL;
> +
> + /*
> + * If sysfs setup fails, utilize the default weight table
> + * This at least allows mempolicy to continue functioning safely.
> + */
> + memset(&default_iw_table.weights, 1, MAX_NUMNODES);
> + iw_table = &default_iw_table;
> +
> + table = kzalloc(sizeof(struct iw_table), GFP_KERNEL);
> + if (!table)
> + return -ENOMEM;
> +
> + memcpy(&table->weights, default_iw_table.weights,
> + sizeof(table->weights));
> +
> + mempolicy_kobj = kzalloc(sizeof(*mempolicy_kobj), GFP_KERNEL);
> + if (!mempolicy_kobj) {
> + kfree(table);
> + pr_err("failed to add mempolicy kobject to the system\n");
> + return -ENOMEM;
> + }
> + err = kobject_init_and_add(mempolicy_kobj, &mempolicy_ktype, mm_kobj,
> + "mempolicy");
> + if (err) {
> + kfree(table);
> + kfree(mempolicy_kobj);
> + return err;
> + }
> +
> + err = add_weighted_interleave_group(mempolicy_kobj);
> +
> + if (err) {
> + kobject_put(mempolicy_kobj);
> + return err;
> + }
> +
> + iw_table = table;
> + return err;
> +}
> +
> +static void __exit mempolicy_exit(void)
> +{
> + if (mempolicy_kobj)
> + kobject_put(mempolicy_kobj);
> +}
> +
> +#else
> +static int __init mempolicy_sysfs_init(void)
> +{
> + /*
> + * if sysfs is not enabled MPOL_WEIGHTED_INTERLEAVE defaults to
> + * MPOL_INTERLEAVE behavior, but is still defined separately to
> + * allow task-local weighted interleave and system-defaults to
> + * operate as intended.
> + *
> + * In this scenario iw_table cannot (presently) change, so
> + * there's no need to set up RCU / cleanup code.
> + */
> + memset(&default_iw_table.weights, 1, sizeof(default_iw_table));
This depends on sizeof(default_iw_table.weights[0]) == 1, I think it's
better to use explicit loop here to make the code more robust a little.
> + iw_table = default_iw_table;
iw_table = &default_iw_table;
?
> + return 0;
> +}
> +
> +static void __exit mempolicy_exit(void) { }
> +#endif /* CONFIG_SYSFS */
> +late_initcall(mempolicy_sysfs_init);
> +module_exit(mempolicy_exit);
--
Best Regards,
Huang, Ying
Gregory Price <[email protected]> writes:
> move the use of barrier() to force policy->nodemask onto the stack into
> a function `read_once_policy_nodemask` so that it may be re-used.
>
> Suggested-by: Huang Ying <[email protected]>
> Signed-off-by: Gregory Price <[email protected]>
> ---
> mm/mempolicy.c | 26 ++++++++++++++++----------
> 1 file changed, 16 insertions(+), 10 deletions(-)
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 5da4fd79fd18..0abd3a3394ef 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1907,6 +1907,20 @@ unsigned int mempolicy_slab_node(void)
> }
> }
>
> +static unsigned int read_once_policy_nodemask(struct mempolicy *pol,
> + nodemask_t *mask)
It may be more useful if we define this as memcpy_once(). That can be
used not only for nodemask, but also other data structure.
> +{
> + /*
> + * barrier stabilizes the nodemask locally so that it can be iterated
> + * over safely without concern for changes. Allocators validate node
> + * selection does not violate mems_allowed, so this is safe.
> + */
> + barrier();
> + __builtin_memcpy(mask, &pol->nodes, sizeof(nodemask_t));
We don't use __builtin_memcpy() in kernel itself directly. Although it
is used in kernel tools. So, I think it's better to use memcpy() here.
> + barrier();
> + return nodes_weight(*mask);
> +}
> +
> /*
> * Do static interleaving for interleave index @ilx. Returns the ilx'th
> * node in pol->nodes (starting from ilx=0), wrapping around if ilx
> @@ -1914,20 +1928,12 @@ unsigned int mempolicy_slab_node(void)
> */
> static unsigned int interleave_nid(struct mempolicy *pol, pgoff_t ilx)
> {
> - nodemask_t nodemask = pol->nodes;
> + nodemask_t nodemask;
> unsigned int target, nnodes;
> int i;
> int nid;
> - /*
> - * The barrier will stabilize the nodemask in a register or on
> - * the stack so that it will stop changing under the code.
> - *
> - * Between first_node() and next_node(), pol->nodes could be changed
> - * by other threads. So we put pol->nodes in a local stack.
> - */
> - barrier();
>
> - nnodes = nodes_weight(nodemask);
> + nnodes = read_once_policy_nodemask(pol, &nodemask);
> if (!nnodes)
> return numa_node_id();
> target = ilx % nnodes;
--
Best Regards,
Huang, Ying
Gregory Price <[email protected]> writes:
> When a system has multiple NUMA nodes and it becomes bandwidth hungry,
> using the current MPOL_INTERLEAVE could be an wise option.
>
> However, if those NUMA nodes consist of different types of memory such
> as socket-attached DRAM and CXL/PCIe attached DRAM, the round-robin
> based interleave policy does not optimally distribute data to make use
> of their different bandwidth characteristics.
>
> Instead, interleave is more effective when the allocation policy follows
> each NUMA nodes' bandwidth weight rather than a simple 1:1 distribution.
>
> This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE,
> enabling weighted interleave between NUMA nodes. Weighted interleave
> allows for proportional distribution of memory across multiple numa
> nodes, preferably apportioned to match the bandwidth of each node.
>
> For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1),
> with bandwidth of (100GB/s, 50GB/s) respectively, the appropriate
> weight distribution is (2:1).
>
> Weights for each node can be assigned via the new sysfs extension:
> /sys/kernel/mm/mempolicy/weighted_interleave/
>
> In addition, the `default_iw_table` is created, which will be extended
> in the future to allow defaults to be registered by drivers. For now,
> the default value of all nodes will be `1`, which matches the behavior
> of standard 1:1 round-robin interleave.
>
> The policy allocates a number of pages equal to the set weights. For
> example, if the weights are (2,1), then 2 pages will be allocated on
> node0 for every 1 page allocated on node1.
>
> The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2)
> and mbind(2).
Not necessary in this series, just a reminder. Don't forget to update
the man pages of set_mempolicy(2) and mbind(2) after the patchset is
merged.
> There are 3 integration points:
>
> weighted_interleave_nodes:
> Counts the number of allocations as they occur, and applies the
> weight for the current node. When the weight reaches 0, switch
> to the next node.
>
> weighted_interleave_nid:
> Gets the total weight of the nodemask as well as each individual
> node weight, then calculates the node based on the given index.
>
> bulk_array_weighted_interleave:
> Gets the total weight of the nodemask as well as each individual
> node weight, then calculates the number of "interleave rounds" as
> well as any delta ("partial round"). Calculates the number of
> pages for each node and allocates them.
>
> If a node was scheduled for interleave via interleave_nodes, the
> current weight (pol->cur_weight) will be allocated first, before
> the remaining bulk calculation is done.
>
> One piece of complexity is the interaction between a recent refactor
> which split the logic to acquire the "ilx" (interleave index) of an
> allocation and the actually application of the interleave. The
> calculation of the `interleave index` is done by `get_vma_policy()`,
> while the actual selection of the node will be later appliex by the
> relevant weighted_interleave function.
>
> Suggested-by: Hasan Al Maruf <[email protected]>
> Signed-off-by: Gregory Price <[email protected]>
> Co-developed-by: Rakie Kim <[email protected]>
> Signed-off-by: Rakie Kim <[email protected]>
> Co-developed-by: Honggyu Kim <[email protected]>
> Signed-off-by: Honggyu Kim <[email protected]>
> Co-developed-by: Hyeongtak Ji <[email protected]>
> Signed-off-by: Hyeongtak Ji <[email protected]>
> Co-developed-by: Srinivasulu Thanneeru <[email protected]>
> Signed-off-by: Srinivasulu Thanneeru <[email protected]>
> Co-developed-by: Ravi Jonnalagadda <[email protected]>
> Signed-off-by: Ravi Jonnalagadda <[email protected]>
> ---
> .../admin-guide/mm/numa_memory_policy.rst | 9 +
> include/linux/mempolicy.h | 5 +
> include/uapi/linux/mempolicy.h | 1 +
> mm/mempolicy.c | 214 +++++++++++++++++-
> 4 files changed, 226 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
> index eca38fa81e0f..a70f20ce1ffb 100644
> --- a/Documentation/admin-guide/mm/numa_memory_policy.rst
> +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
> @@ -250,6 +250,15 @@ MPOL_PREFERRED_MANY
> can fall back to all existing numa nodes. This is effectively
> MPOL_PREFERRED allowed for a mask rather than a single node.
>
> +MPOL_WEIGHTED_INTERLEAVE
> + This mode operates the same as MPOL_INTERLEAVE, except that
> + interleaving behavior is executed based on weights set in
> + /sys/kernel/mm/mempolicy/weighted_interleave/
> +
> + Weighted interleave allocates pages on nodes according to a
> + weight. For example if nodes [0,1] are weighted [5,2], 5 pages
> + will be allocated on node0 for every 2 pages allocated on node1.
> +
> NUMA memory policy supports the following optional mode flags:
>
> MPOL_F_STATIC_NODES
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index 931b118336f4..c1a083eb0dd5 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -54,6 +54,11 @@ struct mempolicy {
> nodemask_t cpuset_mems_allowed; /* relative to these nodes */
> nodemask_t user_nodemask; /* nodemask passed by user */
> } w;
> +
> + /* Weighted interleave settings */
> + struct {
> + u8 cur_weight;
> + } wil;
> };
>
> /*
> diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
> index a8963f7ef4c2..1f9bb10d1a47 100644
> --- a/include/uapi/linux/mempolicy.h
> +++ b/include/uapi/linux/mempolicy.h
> @@ -23,6 +23,7 @@ enum {
> MPOL_INTERLEAVE,
> MPOL_LOCAL,
> MPOL_PREFERRED_MANY,
> + MPOL_WEIGHTED_INTERLEAVE,
> MPOL_MAX, /* always last member of enum */
> };
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 0abd3a3394ef..a2b5d64b28e0 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -19,6 +19,13 @@
> * for anonymous memory. For process policy an process counter
> * is used.
> *
> + * weighted interleave
> + * Allocate memory interleaved over a set of nodes based on
> + * a set of weights (per-node), with normal fallback if it
> + * fails. Otherwise operates the same as interleave.
> + * Example: nodeset(0,1) & weights (2,1) - 2 pages allocated
> + * on node 0 for every 1 page allocated on node 1.
> + *
> * bind Only allocate memory on a specific set of nodes,
> * no fallback.
> * FIXME: memory is allocated starting with the first node
> @@ -327,6 +334,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
> policy->mode = mode;
> policy->flags = flags;
> policy->home_node = NUMA_NO_NODE;
> + policy->wil.cur_weight = 0;
>
> return policy;
> }
> @@ -439,6 +447,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
> .create = mpol_new_nodemask,
> .rebind = mpol_rebind_preferred,
> },
> + [MPOL_WEIGHTED_INTERLEAVE] = {
> + .create = mpol_new_nodemask,
> + .rebind = mpol_rebind_nodemask,
> + },
> };
>
> static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
> @@ -860,7 +872,8 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
>
> old = current->mempolicy;
> current->mempolicy = new;
> - if (new && new->mode == MPOL_INTERLEAVE)
> + if (new && (new->mode == MPOL_INTERLEAVE ||
> + new->mode == MPOL_WEIGHTED_INTERLEAVE))
> current->il_prev = MAX_NUMNODES-1;
> task_unlock(current);
> mpol_put(old);
> @@ -886,6 +899,7 @@ static void get_policy_nodemask(struct mempolicy *pol, nodemask_t *nodes)
> case MPOL_INTERLEAVE:
> case MPOL_PREFERRED:
> case MPOL_PREFERRED_MANY:
> + case MPOL_WEIGHTED_INTERLEAVE:
> *nodes = pol->nodes;
> break;
> case MPOL_LOCAL:
> @@ -970,6 +984,13 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
> } else if (pol == current->mempolicy &&
> pol->mode == MPOL_INTERLEAVE) {
> *policy = next_node_in(current->il_prev, pol->nodes);
> + } else if (pol == current->mempolicy &&
> + (pol->mode == MPOL_WEIGHTED_INTERLEAVE)) {
> + if (pol->wil.cur_weight)
> + *policy = current->il_prev;
> + else
> + *policy = next_node_in(current->il_prev,
> + pol->nodes);
> } else {
> err = -EINVAL;
> goto out;
> @@ -1799,7 +1820,8 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> pol = __get_vma_policy(vma, addr, ilx);
> if (!pol)
> pol = get_task_policy(current);
> - if (pol->mode == MPOL_INTERLEAVE) {
> + if (pol->mode == MPOL_INTERLEAVE ||
> + pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
> *ilx += vma->vm_pgoff >> order;
> *ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
> }
> @@ -1849,6 +1871,28 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
> return zone >= dynamic_policy_zone;
> }
>
> +static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
> +{
> + unsigned int next;
> + struct task_struct *me = current;
> + struct iw_table __rcu *table;
> +
> + next = next_node_in(me->il_prev, policy->nodes);
> + if (next == MAX_NUMNODES)
> + return next;
> +
> + rcu_read_lock();
> + table = rcu_dereference(iw_table);
> + if (!policy->wil.cur_weight)
> + policy->wil.cur_weight = table->weights[next];
> + rcu_read_unlock();
> +
> + policy->wil.cur_weight--;
> + if (!policy->wil.cur_weight)
> + me->il_prev = next;
> + return next;
> +}
> +
> /* Do dynamic interleaving for a process */
> static unsigned int interleave_nodes(struct mempolicy *policy)
> {
> @@ -1883,6 +1927,9 @@ unsigned int mempolicy_slab_node(void)
> case MPOL_INTERLEAVE:
> return interleave_nodes(policy);
>
> + case MPOL_WEIGHTED_INTERLEAVE:
> + return weighted_interleave_nodes(policy);
> +
> case MPOL_BIND:
> case MPOL_PREFERRED_MANY:
> {
> @@ -1921,6 +1968,39 @@ static unsigned int read_once_policy_nodemask(struct mempolicy *pol,
> return nodes_weight(*mask);
> }
>
> +static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
> +{
> + nodemask_t nodemask;
> + unsigned int target, nr_nodes;
> + struct iw_table __rcu *table;
> + unsigned int weight_total = 0;
> + u8 weight;
> + int nid;
> +
> + nr_nodes = read_once_policy_nodemask(pol, &nodemask);
> + if (!nr_nodes)
> + return numa_node_id();
> +
> + rcu_read_lock();
> + table = rcu_dereference(iw_table);
> + /* calculate the total weight */
> + for_each_node_mask(nid, nodemask)
> + weight_total += table->weights[nid];
> +
> + /* Calculate the node offset based on totals */
> + target = ilx % weight_total;
> + nid = first_node(nodemask);
> + while (target) {
> + weight = table->weights[nid];
> + if (target < weight)
> + break;
> + target -= weight;
> + nid = next_node_in(nid, nodemask);
> + }
> + rcu_read_unlock();
> + return nid;
> +}
> +
> /*
> * Do static interleaving for interleave index @ilx. Returns the ilx'th
> * node in pol->nodes (starting from ilx=0), wrapping around if ilx
> @@ -1981,6 +2061,11 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
> *nid = (ilx == NO_INTERLEAVE_INDEX) ?
> interleave_nodes(pol) : interleave_nid(pol, ilx);
> break;
> + case MPOL_WEIGHTED_INTERLEAVE:
> + *nid = (ilx == NO_INTERLEAVE_INDEX) ?
> + weighted_interleave_nodes(pol) :
> + weighted_interleave_nid(pol, ilx);
> + break;
> }
>
> return nodemask;
> @@ -2042,6 +2127,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
> case MPOL_PREFERRED_MANY:
> case MPOL_BIND:
> case MPOL_INTERLEAVE:
> + case MPOL_WEIGHTED_INTERLEAVE:
> *mask = mempolicy->nodes;
> break;
>
> @@ -2141,7 +2227,8 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
> * If the policy is interleave or does not allow the current
> * node in its nodemask, we allocate the standard way.
> */
> - if (pol->mode != MPOL_INTERLEAVE &&
> + if ((pol->mode != MPOL_INTERLEAVE &&
> + pol->mode != MPOL_WEIGHTED_INTERLEAVE) &&
> (!nodemask || node_isset(nid, *nodemask))) {
> /*
> * First, try to allocate THP only on local node, but
> @@ -2277,6 +2364,114 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
> return total_allocated;
> }
>
> +static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
> + struct mempolicy *pol, unsigned long nr_pages,
> + struct page **page_array)
> +{
> + struct task_struct *me = current;
> + unsigned long total_allocated = 0;
> + unsigned long nr_allocated;
> + unsigned long rounds;
> + unsigned long node_pages, delta;
> + u8 weight;
> + struct iw_table __rcu *table;
> + u8 *weights;
> + unsigned int weight_total = 0;
> + unsigned long rem_pages = nr_pages;
> + nodemask_t nodes;
> + int nnodes, node, weight_nodes;
> + int prev_node = NUMA_NO_NODE;
> + int i;
> +
> + nnodes = read_once_policy_nodemask(pol, &nodes);
> + if (!nnodes)
> + return 0;
> +
> + /* Continue allocating from most recent node and adjust the nr_pages */
> + if (pol->wil.cur_weight) {
> + node = next_node_in(me->il_prev, nodes);
> + node_pages = pol->wil.cur_weight;
> + if (node_pages > rem_pages)
> + node_pages = rem_pages;
> + nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
> + NULL, page_array);
> + page_array += nr_allocated;
> + total_allocated += nr_allocated;
> + /* if that's all the pages, no need to interleave */
> + if (rem_pages <= pol->wil.cur_weight) {
> + pol->wil.cur_weight -= rem_pages;
> + return total_allocated;
> + }
> + /* Otherwise we adjust nr_pages down, and continue from there */
> + rem_pages -= pol->wil.cur_weight;
> + pol->wil.cur_weight = 0;
> + prev_node = node;
> + }
> +
> + /* fetch the weights for this operation and calculate total weight */
> + weights = kmalloc(nnodes, GFP_KERNEL);
> + if (!weights)
> + return total_allocated;
> +
> + rcu_read_lock();
> + table = rcu_dereference(iw_table);
> + weight_nodes = 0;
> + for_each_node_mask(node, nodes) {
> + weights[weight_nodes++] = table->weights[node];
> + weight_total += table->weights[node];
> + }
> + rcu_read_unlock();
> +
> + if (!weight_total) {
> + kfree(weights);
> + return total_allocated;
> + }
> +
> + /* Now we can continue allocating as if from 0 instead of an offset */
> + rounds = rem_pages / weight_total;
> + delta = rem_pages % weight_total;
> + for (i = 0; i < nnodes; i++) {
> + node = next_node_in(prev_node, nodes);
> + weight = weights[i];
> + node_pages = weight * rounds;
> + if (delta) {
> + if (delta > weight) {
> + node_pages += weight;
> + delta -= weight;
> + } else {
> + node_pages += delta;
> + delta = 0;
> + }
> + }
> + nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
> + NULL, page_array);
Should we check nr_allocated here? Allocation may fail anyway.
> + page_array += nr_allocated;
> + total_allocated += nr_allocated;
> + if (total_allocated == nr_pages)
> + break;
> + prev_node = node;
> + }
> +
> + /*
> + * Finally, we need to update me->il_prev and pol->wil.cur_weight
> + * if there were overflow pages, but not equivalent to the node
> + * weight, set the cur_weight to node_weight - delta and the
> + * me->il_prev to the previous node. Otherwise if it was perfect
> + * we can simply set il_prev to node and cur_weight to 0
> + */
> + if (node_pages) {
> + me->il_prev = prev_node;
> + node_pages %= weight;
> + pol->wil.cur_weight = weight - node_pages;
> + } else {
> + me->il_prev = node;
> + pol->wil.cur_weight = 0;
> + }
> +
> + kfree(weights);
> + return total_allocated;
> +}
> +
> static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
> struct mempolicy *pol, unsigned long nr_pages,
> struct page **page_array)
> @@ -2317,6 +2512,11 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
> return alloc_pages_bulk_array_interleave(gfp, pol,
> nr_pages, page_array);
>
> + if (pol->mode == MPOL_WEIGHTED_INTERLEAVE)
> + return alloc_pages_bulk_array_weighted_interleave(gfp, pol,
> + nr_pages,
> + page_array);
> +
Just nit-pick, may be better to be
return alloc_pages_bulk_array_weighted_interleave(
gfp, pol, nr_pages, page_array);
> if (pol->mode == MPOL_PREFERRED_MANY)
> return alloc_pages_bulk_array_preferred_many(gfp,
> numa_node_id(), pol, nr_pages, page_array);
> @@ -2392,6 +2592,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
> case MPOL_INTERLEAVE:
> case MPOL_PREFERRED:
> case MPOL_PREFERRED_MANY:
> + case MPOL_WEIGHTED_INTERLEAVE:
> return !!nodes_equal(a->nodes, b->nodes);
> case MPOL_LOCAL:
> return true;
> @@ -2528,6 +2729,10 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
> polnid = interleave_nid(pol, ilx);
> break;
>
> + case MPOL_WEIGHTED_INTERLEAVE:
> + polnid = weighted_interleave_nid(pol, ilx);
> + break;
> +
> case MPOL_PREFERRED:
> if (node_isset(curnid, pol->nodes))
> goto out;
> @@ -2902,6 +3107,7 @@ static const char * const policy_modes[] =
> [MPOL_PREFERRED] = "prefer",
> [MPOL_BIND] = "bind",
> [MPOL_INTERLEAVE] = "interleave",
> + [MPOL_WEIGHTED_INTERLEAVE] = "weighted interleave",
> [MPOL_LOCAL] = "local",
> [MPOL_PREFERRED_MANY] = "prefer (many)",
> };
> @@ -2961,6 +3167,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
> }
> break;
> case MPOL_INTERLEAVE:
> + case MPOL_WEIGHTED_INTERLEAVE:
> /*
> * Default to online nodes with memory if no nodelist
> */
> @@ -3071,6 +3278,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
> case MPOL_PREFERRED_MANY:
> case MPOL_BIND:
> case MPOL_INTERLEAVE:
> + case MPOL_WEIGHTED_INTERLEAVE:
> nodes = pol->nodes;
> break;
> default:
--
Best Regards,
Huang, Ying
On Mon, Jan 15, 2024 at 11:18:00AM +0800, Huang, Ying wrote:
> Gregory Price <[email protected]> writes:
>
> > +static struct iw_table default_iw_table;
> > +/*
> > + * iw_table is the sysfs-set interleave weight table, a value of 0
> > + * denotes that the default_iw_table value should be used.
> > + *
> > + * iw_table is RCU protected
> > + */
> > +static struct iw_table __rcu *iw_table;
> > +static DEFINE_MUTEX(iw_table_mtx);
>
> I greped "mtx" in kernel/*.c and mm/*.c and found nothing. To following
> the existing coding convention, better to name this as iw_table_mutex or
> iw_table_lock?
>
ack.
> And, I think this is used to protect both iw_table and default_iw_table?
> If so, it deserves some comments.
>
Right now default_iw_table cannot be updated, and so it is neither
protected nor requires protection.
I planned to add the protection comment in the next patch series, which
would implement the kernel-side interface for updating the default
weights during boot/hotplug.
We haven't had the discussion on how/when this should happen yet,
though, and there's some research to be done. (i.e. when should DRAM
weights be set? should the entire table be reweighted on hotplug? etc)
> > +static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
> > + const char *buf, size_t count)
> > +{
> > + struct iw_node_attr *node_attr;
> > + struct iw_table __rcu *new;
> > + struct iw_table __rcu *old;
> > + u8 weight = 0;
> > +
> > + node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
> > + if (count == 0 || sysfs_streq(buf, ""))
> > + weight = 0;
> > + else if (kstrtou8(buf, 0, &weight))
> > + return -EINVAL;
> > +
> > + new = kmalloc(sizeof(*new), GFP_KERNEL);
> > + if (!new)
> > + return -ENOMEM;
> > +
> > + mutex_lock(&iw_table_mtx);
> > + old = rcu_dereference_protected(iw_table,
> > + lockdep_is_held(&iw_table_mtx));
> > + /* If value is 0, revert to default weight */
> > + weight = weight ? weight : default_iw_table.weights[node_attr->nid];
>
> If we change the default weight in default_iw_table.weights[], how do we
> identify whether the weight has been customized by users via sysfs? So,
> I suggest to use 0 in iw_table for not-customized weight.
>
> And if so, we need to use RCU to access default_iw_table too.
>
Dumb simplification on my part, I'll walk this back and add the
if (!weight) weight = default_iw_table[node]
logic back into the allocator paths accordinly.
> > + memcpy(&new->weights, &old->weights, sizeof(new->weights));
> > + new->weights[node_attr->nid] = weight;
> > + rcu_assign_pointer(iw_table, new);
> > + mutex_unlock(&iw_table_mtx);
> > + kfree_rcu(old, rcu);
>
> synchronize_rcu() should be OK here. It's fast enough in this cold
> path. This make it good to define iw_table as
>
I'll take a look.
> u8 __rcu *iw_table;
>
> Then, we only need to allocate nr_node_ids elements now.
>
We need nr_possible_nodes to handle hotplug correctly.
I decided to simplify this down to MAX_NUMNODES *juuuuuust in case*
"true node hotplug" ever becomes a reality. If that happens, then
only allocating space for possible nodes creates a much bigger
headache on hotplug.
For the sake of that simplification, it seemed better to just eat the
1KB. If you really want me to do that, I will, but the MAX_NUMNODES
choice was an explicitly defensive choice.
> > +static int __init mempolicy_sysfs_init(void)
> > +{
> > + /*
> > + * if sysfs is not enabled MPOL_WEIGHTED_INTERLEAVE defaults to
> > + * MPOL_INTERLEAVE behavior, but is still defined separately to
> > + * allow task-local weighted interleave and system-defaults to
> > + * operate as intended.
> > + *
> > + * In this scenario iw_table cannot (presently) change, so
> > + * there's no need to set up RCU / cleanup code.
> > + */
> > + memset(&default_iw_table.weights, 1, sizeof(default_iw_table));
>
> This depends on sizeof(default_iw_table.weights[0]) == 1, I think it's
> better to use explicit loop here to make the code more robust a little.
>
oh hm, you're right. rookie mistake on my part.
Thanks,
Gregory
On Mon, Jan 15, 2024 at 12:13:06PM +0800, Huang, Ying wrote:
> Gregory Price <[email protected]> writes:
>
> >
> > +static unsigned int read_once_policy_nodemask(struct mempolicy *pol,
> > + nodemask_t *mask)
>
> It may be more useful if we define this as memcpy_once(). That can be
> used not only for nodemask, but also other data structure.
>
Seemed better to do this is an entirely separate patch line to avoid
scope creep on reviews and such.
> > + barrier();
> > + __builtin_memcpy(mask, &pol->nodes, sizeof(nodemask_t));
>
> We don't use __builtin_memcpy() in kernel itself directly. Although it
> is used in kernel tools. So, I think it's better to use memcpy() here.
>
ack.
~Gregory
On Mon, Jan 15, 2024 at 01:47:31PM +0800, Huang, Ying wrote:
> Gregory Price <[email protected]> writes:
>
> > + /* Continue allocating from most recent node and adjust the nr_pages */
> > + if (pol->wil.cur_weight) {
> > + node = next_node_in(me->il_prev, nodes);
> > + node_pages = pol->wil.cur_weight;
> > + if (node_pages > rem_pages)
> > + node_pages = rem_pages;
> > + nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
> > + NULL, page_array);
.. snip ...
> > + if (delta > weight) {
> > + node_pages += weight;
> > + delta -= weight;
> > + } else {
> > + node_pages += delta;
> > + delta = 0;
> > + }
> > + }
> > + nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
> > + NULL, page_array);
>
> Should we check nr_allocated here? Allocation may fail anyway.
>
I thought about this briefly in both situations.
If you look at alloc_pages_bulk_array_interleave(), it does not fail if
__alloc_pages_bulk() fails, instead it continues and attempts to
allocate from the remaining nodes.
Presumably, this is because the caller of the bulk allocator can accept
a partial-failure and will go ahead and allocate the remaining pages on
an extra slow path.
Since alloc_pages_bulk_array_interleave() appears to be capable of
failing in the exact same way, I considered this safe.
> > + if (pol->mode == MPOL_WEIGHTED_INTERLEAVE)
> > + return alloc_pages_bulk_array_weighted_interleave(gfp, pol,
> > + nr_pages,
> > + page_array);
> > +
>
> Just nit-pick, may be better to be
>
> return alloc_pages_bulk_array_weighted_interleave(
> gfp, pol, nr_pages, page_array);
>
Wasn't sure on style when names get this long lol, will make the change
:]
Probably v2 thursday or friday
Regards
~Gregory
Gregory Price <[email protected]> writes:
> On Mon, Jan 15, 2024 at 11:18:00AM +0800, Huang, Ying wrote:
>> Gregory Price <[email protected]> writes:
>>
>> > +static struct iw_table default_iw_table;
>> > +/*
>> > + * iw_table is the sysfs-set interleave weight table, a value of 0
>> > + * denotes that the default_iw_table value should be used.
>> > + *
>> > + * iw_table is RCU protected
>> > + */
>> > +static struct iw_table __rcu *iw_table;
>> > +static DEFINE_MUTEX(iw_table_mtx);
>>
>> I greped "mtx" in kernel/*.c and mm/*.c and found nothing. To following
>> the existing coding convention, better to name this as iw_table_mutex or
>> iw_table_lock?
>>
>
> ack.
>
>> And, I think this is used to protect both iw_table and default_iw_table?
>> If so, it deserves some comments.
>>
>
> Right now default_iw_table cannot be updated, and so it is neither
> protected nor requires protection.
>
> I planned to add the protection comment in the next patch series, which
> would implement the kernel-side interface for updating the default
> weights during boot/hotplug.
>
> We haven't had the discussion on how/when this should happen yet,
> though, and there's some research to be done. (i.e. when should DRAM
> weights be set? should the entire table be reweighted on hotplug? etc)
Before that, I'm OK to remove default_iw_table and use hard coded "1" as
default weight for now.
>> > +static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
>> > + const char *buf, size_t count)
>> > +{
>> > + struct iw_node_attr *node_attr;
>> > + struct iw_table __rcu *new;
>> > + struct iw_table __rcu *old;
>> > + u8 weight = 0;
>> > +
>> > + node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
>> > + if (count == 0 || sysfs_streq(buf, ""))
>> > + weight = 0;
>> > + else if (kstrtou8(buf, 0, &weight))
>> > + return -EINVAL;
>> > +
>> > + new = kmalloc(sizeof(*new), GFP_KERNEL);
>> > + if (!new)
>> > + return -ENOMEM;
>> > +
>> > + mutex_lock(&iw_table_mtx);
>> > + old = rcu_dereference_protected(iw_table,
>> > + lockdep_is_held(&iw_table_mtx));
>> > + /* If value is 0, revert to default weight */
>> > + weight = weight ? weight : default_iw_table.weights[node_attr->nid];
>>
>> If we change the default weight in default_iw_table.weights[], how do we
>> identify whether the weight has been customized by users via sysfs? So,
>> I suggest to use 0 in iw_table for not-customized weight.
>>
>> And if so, we need to use RCU to access default_iw_table too.
>>
>
> Dumb simplification on my part, I'll walk this back and add the
>
> if (!weight) weight = default_iw_table[node]
>
> logic back into the allocator paths accordinly.
>
>> > + memcpy(&new->weights, &old->weights, sizeof(new->weights));
>> > + new->weights[node_attr->nid] = weight;
>> > + rcu_assign_pointer(iw_table, new);
>> > + mutex_unlock(&iw_table_mtx);
>> > + kfree_rcu(old, rcu);
>>
>> synchronize_rcu() should be OK here. It's fast enough in this cold
>> path. This make it good to define iw_table as
>>
> I'll take a look.
>
>> u8 __rcu *iw_table;
>>
>> Then, we only need to allocate nr_node_ids elements now.
>>
>
> We need nr_possible_nodes to handle hotplug correctly.
nr_node_ids >= num_possible_nodes(). It's larger than any possible node
ID.
> I decided to simplify this down to MAX_NUMNODES *juuuuuust in case*
> "true node hotplug" ever becomes a reality. If that happens, then
> only allocating space for possible nodes creates a much bigger
> headache on hotplug.
>
> For the sake of that simplification, it seemed better to just eat the
> 1KB. If you really want me to do that, I will, but the MAX_NUMNODES
> choice was an explicitly defensive choice.
When "true node hotplug" becomes reality, we can make nr_node_ids ==
MAX_NUMNODES. So, it's safe to use it. Please take a look at
setup_nr_node_ids().
>> > +static int __init mempolicy_sysfs_init(void)
>> > +{
>> > + /*
>> > + * if sysfs is not enabled MPOL_WEIGHTED_INTERLEAVE defaults to
>> > + * MPOL_INTERLEAVE behavior, but is still defined separately to
>> > + * allow task-local weighted interleave and system-defaults to
>> > + * operate as intended.
>> > + *
>> > + * In this scenario iw_table cannot (presently) change, so
>> > + * there's no need to set up RCU / cleanup code.
>> > + */
>> > + memset(&default_iw_table.weights, 1, sizeof(default_iw_table));
>>
>> This depends on sizeof(default_iw_table.weights[0]) == 1, I think it's
>> better to use explicit loop here to make the code more robust a little.
>>
>
> oh hm, you're right. rookie mistake on my part.
>
--
Best Regards,
Huang, Ying
On Wed, Jan 17, 2024 at 02:58:08PM +0800, Huang, Ying wrote:
> Gregory Price <[email protected]> writes:
>
> > We haven't had the discussion on how/when this should happen yet,
> > though, and there's some research to be done. (i.e. when should DRAM
> > weights be set? should the entire table be reweighted on hotplug? etc)
>
> Before that, I'm OK to remove default_iw_table and use hard coded "1" as
> default weight for now.
>
Can't quite do that. default_iw_table is a static structure because we
need a reliable default structure not subject to module initialization
failure. Otherwise we can end up in a situation where iw_table is NULL
during some allocation path if the sysfs structure fails to setup fully.
There's no good reason to fail allocations just because sysfs failed to
initialization for some reason. I'll leave default_iw_table with a size
of MAX_NUMNODES for now (nr_node_ids is set up at runtime per your
reference to `setup_nr_node_ids` below, so we can't use it for this).
> >
> >> u8 __rcu *iw_table;
> >>
> >> Then, we only need to allocate nr_node_ids elements now.
> >>
> >
> > We need nr_possible_nodes to handle hotplug correctly.
>
> nr_node_ids >= num_possible_nodes(). It's larger than any possible node
> ID.
>
nr_node_ids gets setup at runtime, while the default_iw_table needs
to be a static structure (see above). I can make default_iw_table
MAX_NUMNODES and subsequent allocations of iw_table be nr_node_ids,
but that makes iw_table a different size at any given time.
This *will* break if "true hotplug" ever shows up and possible_nodes !=
MAX_NUMNODES. But I can write it up if it's a sticking point for you.
Ultimately we're squabbling over, at most, about ~3kb of memory, just
keep that in mind. (I guess if you spawn 3000 threads and each tries a
concurrent write to sysfs/node1, you'd eat 3MB view briefly, but that
is a truly degenerate case and I can think of more denegerate things).
>
> When "true node hotplug" becomes reality, we can make nr_node_ids ==
> MAX_NUMNODES. So, it's safe to use it. Please take a look at
> setup_nr_node_ids().
>
~Gregory
Gregory Price <[email protected]> writes:
> On Mon, Jan 15, 2024 at 01:47:31PM +0800, Huang, Ying wrote:
>> Gregory Price <[email protected]> writes:
>>
>> > + /* Continue allocating from most recent node and adjust the nr_pages */
>> > + if (pol->wil.cur_weight) {
>> > + node = next_node_in(me->il_prev, nodes);
>> > + node_pages = pol->wil.cur_weight;
>> > + if (node_pages > rem_pages)
>> > + node_pages = rem_pages;
>> > + nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
>> > + NULL, page_array);
> ... snip ...
>> > + if (delta > weight) {
>> > + node_pages += weight;
>> > + delta -= weight;
>> > + } else {
>> > + node_pages += delta;
>> > + delta = 0;
>> > + }
>> > + }
>> > + nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
>> > + NULL, page_array);
>>
>> Should we check nr_allocated here? Allocation may fail anyway.
>>
>
> I thought about this briefly in both situations.
>
> If you look at alloc_pages_bulk_array_interleave(), it does not fail if
> __alloc_pages_bulk() fails, instead it continues and attempts to
> allocate from the remaining nodes.
>
> Presumably, this is because the caller of the bulk allocator can accept
> a partial-failure and will go ahead and allocate the remaining pages on
> an extra slow path.
>
> Since alloc_pages_bulk_array_interleave() appears to be capable of
> failing in the exact same way, I considered this safe.
You are right. We should proceed with next node here.
--
Best Regards,
Huang, Ying
>> > + if (pol->mode == MPOL_WEIGHTED_INTERLEAVE)
>> > + return alloc_pages_bulk_array_weighted_interleave(gfp, pol,
>> > + nr_pages,
>> > + page_array);
>> > +
>>
>> Just nit-pick, may be better to be
>>
>> return alloc_pages_bulk_array_weighted_interleave(
>> gfp, pol, nr_pages, page_array);
>>
>
> Wasn't sure on style when names get this long lol, will make the change
> :]
>
>
>
> Probably v2 thursday or friday
>
> Regards
> ~Gregory
Gregory Price <[email protected]> writes:
> When a system has multiple NUMA nodes and it becomes bandwidth hungry,
> using the current MPOL_INTERLEAVE could be an wise option.
>
> However, if those NUMA nodes consist of different types of memory such
> as socket-attached DRAM and CXL/PCIe attached DRAM, the round-robin
> based interleave policy does not optimally distribute data to make use
> of their different bandwidth characteristics.
>
> Instead, interleave is more effective when the allocation policy follows
> each NUMA nodes' bandwidth weight rather than a simple 1:1 distribution.
>
> This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE,
> enabling weighted interleave between NUMA nodes. Weighted interleave
> allows for proportional distribution of memory across multiple numa
> nodes, preferably apportioned to match the bandwidth of each node.
>
> For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1),
> with bandwidth of (100GB/s, 50GB/s) respectively, the appropriate
> weight distribution is (2:1).
>
> Weights for each node can be assigned via the new sysfs extension:
> /sys/kernel/mm/mempolicy/weighted_interleave/
>
> In addition, the `default_iw_table` is created, which will be extended
> in the future to allow defaults to be registered by drivers. For now,
> the default value of all nodes will be `1`, which matches the behavior
> of standard 1:1 round-robin interleave.
>
> The policy allocates a number of pages equal to the set weights. For
> example, if the weights are (2,1), then 2 pages will be allocated on
> node0 for every 1 page allocated on node1.
>
> The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2)
> and mbind(2).
>
> There are 3 integration points:
>
> weighted_interleave_nodes:
> Counts the number of allocations as they occur, and applies the
> weight for the current node. When the weight reaches 0, switch
> to the next node.
>
> weighted_interleave_nid:
> Gets the total weight of the nodemask as well as each individual
> node weight, then calculates the node based on the given index.
>
> bulk_array_weighted_interleave:
> Gets the total weight of the nodemask as well as each individual
> node weight, then calculates the number of "interleave rounds" as
> well as any delta ("partial round"). Calculates the number of
> pages for each node and allocates them.
>
> If a node was scheduled for interleave via interleave_nodes, the
> current weight (pol->cur_weight) will be allocated first, before
> the remaining bulk calculation is done.
>
> One piece of complexity is the interaction between a recent refactor
> which split the logic to acquire the "ilx" (interleave index) of an
> allocation and the actually application of the interleave. The
> calculation of the `interleave index` is done by `get_vma_policy()`,
> while the actual selection of the node will be later appliex by the
> relevant weighted_interleave function.
>
> Suggested-by: Hasan Al Maruf <[email protected]>
> Signed-off-by: Gregory Price <[email protected]>
> Co-developed-by: Rakie Kim <[email protected]>
> Signed-off-by: Rakie Kim <[email protected]>
> Co-developed-by: Honggyu Kim <[email protected]>
> Signed-off-by: Honggyu Kim <[email protected]>
> Co-developed-by: Hyeongtak Ji <[email protected]>
> Signed-off-by: Hyeongtak Ji <[email protected]>
> Co-developed-by: Srinivasulu Thanneeru <[email protected]>
> Signed-off-by: Srinivasulu Thanneeru <[email protected]>
> Co-developed-by: Ravi Jonnalagadda <[email protected]>
> Signed-off-by: Ravi Jonnalagadda <[email protected]>
> ---
> .../admin-guide/mm/numa_memory_policy.rst | 9 +
> include/linux/mempolicy.h | 5 +
> include/uapi/linux/mempolicy.h | 1 +
> mm/mempolicy.c | 214 +++++++++++++++++-
> 4 files changed, 226 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
> index eca38fa81e0f..a70f20ce1ffb 100644
> --- a/Documentation/admin-guide/mm/numa_memory_policy.rst
> +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
> @@ -250,6 +250,15 @@ MPOL_PREFERRED_MANY
> can fall back to all existing numa nodes. This is effectively
> MPOL_PREFERRED allowed for a mask rather than a single node.
>
> +MPOL_WEIGHTED_INTERLEAVE
> + This mode operates the same as MPOL_INTERLEAVE, except that
> + interleaving behavior is executed based on weights set in
> + /sys/kernel/mm/mempolicy/weighted_interleave/
> +
> + Weighted interleave allocates pages on nodes according to a
> + weight. For example if nodes [0,1] are weighted [5,2], 5 pages
> + will be allocated on node0 for every 2 pages allocated on node1.
> +
> NUMA memory policy supports the following optional mode flags:
>
> MPOL_F_STATIC_NODES
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index 931b118336f4..c1a083eb0dd5 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -54,6 +54,11 @@ struct mempolicy {
> nodemask_t cpuset_mems_allowed; /* relative to these nodes */
> nodemask_t user_nodemask; /* nodemask passed by user */
> } w;
> +
> + /* Weighted interleave settings */
> + struct {
> + u8 cur_weight;
> + } wil;
> };
>
> /*
> diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
> index a8963f7ef4c2..1f9bb10d1a47 100644
> --- a/include/uapi/linux/mempolicy.h
> +++ b/include/uapi/linux/mempolicy.h
> @@ -23,6 +23,7 @@ enum {
> MPOL_INTERLEAVE,
> MPOL_LOCAL,
> MPOL_PREFERRED_MANY,
> + MPOL_WEIGHTED_INTERLEAVE,
> MPOL_MAX, /* always last member of enum */
> };
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 0abd3a3394ef..a2b5d64b28e0 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -19,6 +19,13 @@
> * for anonymous memory. For process policy an process counter
> * is used.
> *
> + * weighted interleave
> + * Allocate memory interleaved over a set of nodes based on
> + * a set of weights (per-node), with normal fallback if it
> + * fails. Otherwise operates the same as interleave.
> + * Example: nodeset(0,1) & weights (2,1) - 2 pages allocated
> + * on node 0 for every 1 page allocated on node 1.
> + *
> * bind Only allocate memory on a specific set of nodes,
> * no fallback.
> * FIXME: memory is allocated starting with the first node
> @@ -327,6 +334,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
> policy->mode = mode;
> policy->flags = flags;
> policy->home_node = NUMA_NO_NODE;
> + policy->wil.cur_weight = 0;
>
> return policy;
> }
> @@ -439,6 +447,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
> .create = mpol_new_nodemask,
> .rebind = mpol_rebind_preferred,
> },
> + [MPOL_WEIGHTED_INTERLEAVE] = {
> + .create = mpol_new_nodemask,
> + .rebind = mpol_rebind_nodemask,
> + },
> };
>
> static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
> @@ -860,7 +872,8 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
>
> old = current->mempolicy;
> current->mempolicy = new;
> - if (new && new->mode == MPOL_INTERLEAVE)
> + if (new && (new->mode == MPOL_INTERLEAVE ||
> + new->mode == MPOL_WEIGHTED_INTERLEAVE))
> current->il_prev = MAX_NUMNODES-1;
> task_unlock(current);
> mpol_put(old);
> @@ -886,6 +899,7 @@ static void get_policy_nodemask(struct mempolicy *pol, nodemask_t *nodes)
> case MPOL_INTERLEAVE:
> case MPOL_PREFERRED:
> case MPOL_PREFERRED_MANY:
> + case MPOL_WEIGHTED_INTERLEAVE:
> *nodes = pol->nodes;
> break;
> case MPOL_LOCAL:
> @@ -970,6 +984,13 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
> } else if (pol == current->mempolicy &&
> pol->mode == MPOL_INTERLEAVE) {
> *policy = next_node_in(current->il_prev, pol->nodes);
> + } else if (pol == current->mempolicy &&
> + (pol->mode == MPOL_WEIGHTED_INTERLEAVE)) {
> + if (pol->wil.cur_weight)
> + *policy = current->il_prev;
> + else
> + *policy = next_node_in(current->il_prev,
> + pol->nodes);
> } else {
> err = -EINVAL;
> goto out;
> @@ -1799,7 +1820,8 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> pol = __get_vma_policy(vma, addr, ilx);
> if (!pol)
> pol = get_task_policy(current);
> - if (pol->mode == MPOL_INTERLEAVE) {
> + if (pol->mode == MPOL_INTERLEAVE ||
> + pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
> *ilx += vma->vm_pgoff >> order;
> *ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
> }
> @@ -1849,6 +1871,28 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
> return zone >= dynamic_policy_zone;
> }
>
> +static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
> +{
> + unsigned int next;
> + struct task_struct *me = current;
> + struct iw_table __rcu *table;
> +
> + next = next_node_in(me->il_prev, policy->nodes);
> + if (next == MAX_NUMNODES)
> + return next;
> +
> + rcu_read_lock();
> + table = rcu_dereference(iw_table);
> + if (!policy->wil.cur_weight)
> + policy->wil.cur_weight = table->weights[next];
> + rcu_read_unlock();
> +
> + policy->wil.cur_weight--;
> + if (!policy->wil.cur_weight)
> + me->il_prev = next;
> + return next;
> +}
> +
> /* Do dynamic interleaving for a process */
> static unsigned int interleave_nodes(struct mempolicy *policy)
> {
> @@ -1883,6 +1927,9 @@ unsigned int mempolicy_slab_node(void)
> case MPOL_INTERLEAVE:
> return interleave_nodes(policy);
>
> + case MPOL_WEIGHTED_INTERLEAVE:
> + return weighted_interleave_nodes(policy);
> +
> case MPOL_BIND:
> case MPOL_PREFERRED_MANY:
> {
> @@ -1921,6 +1968,39 @@ static unsigned int read_once_policy_nodemask(struct mempolicy *pol,
> return nodes_weight(*mask);
> }
>
> +static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
> +{
> + nodemask_t nodemask;
> + unsigned int target, nr_nodes;
> + struct iw_table __rcu *table;
> + unsigned int weight_total = 0;
> + u8 weight;
> + int nid;
> +
> + nr_nodes = read_once_policy_nodemask(pol, &nodemask);
> + if (!nr_nodes)
> + return numa_node_id();
> +
> + rcu_read_lock();
> + table = rcu_dereference(iw_table);
> + /* calculate the total weight */
> + for_each_node_mask(nid, nodemask)
> + weight_total += table->weights[nid];
> +
> + /* Calculate the node offset based on totals */
> + target = ilx % weight_total;
> + nid = first_node(nodemask);
> + while (target) {
> + weight = table->weights[nid];
> + if (target < weight)
> + break;
> + target -= weight;
> + nid = next_node_in(nid, nodemask);
> + }
> + rcu_read_unlock();
> + return nid;
> +}
> +
> /*
> * Do static interleaving for interleave index @ilx. Returns the ilx'th
> * node in pol->nodes (starting from ilx=0), wrapping around if ilx
> @@ -1981,6 +2061,11 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
> *nid = (ilx == NO_INTERLEAVE_INDEX) ?
> interleave_nodes(pol) : interleave_nid(pol, ilx);
> break;
> + case MPOL_WEIGHTED_INTERLEAVE:
> + *nid = (ilx == NO_INTERLEAVE_INDEX) ?
> + weighted_interleave_nodes(pol) :
> + weighted_interleave_nid(pol, ilx);
> + break;
> }
>
> return nodemask;
> @@ -2042,6 +2127,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
> case MPOL_PREFERRED_MANY:
> case MPOL_BIND:
> case MPOL_INTERLEAVE:
> + case MPOL_WEIGHTED_INTERLEAVE:
> *mask = mempolicy->nodes;
> break;
>
> @@ -2141,7 +2227,8 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
> * If the policy is interleave or does not allow the current
> * node in its nodemask, we allocate the standard way.
> */
> - if (pol->mode != MPOL_INTERLEAVE &&
> + if ((pol->mode != MPOL_INTERLEAVE &&
> + pol->mode != MPOL_WEIGHTED_INTERLEAVE) &&
> (!nodemask || node_isset(nid, *nodemask))) {
> /*
> * First, try to allocate THP only on local node, but
> @@ -2277,6 +2364,114 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
> return total_allocated;
> }
>
> +static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
> + struct mempolicy *pol, unsigned long nr_pages,
> + struct page **page_array)
> +{
> + struct task_struct *me = current;
> + unsigned long total_allocated = 0;
> + unsigned long nr_allocated;
> + unsigned long rounds;
> + unsigned long node_pages, delta;
> + u8 weight;
> + struct iw_table __rcu *table;
> + u8 *weights;
> + unsigned int weight_total = 0;
> + unsigned long rem_pages = nr_pages;
> + nodemask_t nodes;
> + int nnodes, node, weight_nodes;
> + int prev_node = NUMA_NO_NODE;
> + int i;
> +
> + nnodes = read_once_policy_nodemask(pol, &nodes);
> + if (!nnodes)
> + return 0;
> +
> + /* Continue allocating from most recent node and adjust the nr_pages */
> + if (pol->wil.cur_weight) {
> + node = next_node_in(me->il_prev, nodes);
> + node_pages = pol->wil.cur_weight;
> + if (node_pages > rem_pages)
> + node_pages = rem_pages;
> + nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
> + NULL, page_array);
> + page_array += nr_allocated;
> + total_allocated += nr_allocated;
> + /* if that's all the pages, no need to interleave */
> + if (rem_pages <= pol->wil.cur_weight) {
> + pol->wil.cur_weight -= rem_pages;
> + return total_allocated;
> + }
> + /* Otherwise we adjust nr_pages down, and continue from there */
> + rem_pages -= pol->wil.cur_weight;
> + pol->wil.cur_weight = 0;
> + prev_node = node;
> + }
> +
> + /* fetch the weights for this operation and calculate total weight */
> + weights = kmalloc(nnodes, GFP_KERNEL);
> + if (!weights)
> + return total_allocated;
> +
> + rcu_read_lock();
> + table = rcu_dereference(iw_table);
> + weight_nodes = 0;
> + for_each_node_mask(node, nodes) {
> + weights[weight_nodes++] = table->weights[node];
> + weight_total += table->weights[node];
> + }
> + rcu_read_unlock();
> +
> + if (!weight_total) {
> + kfree(weights);
> + return total_allocated;
> + }
> +
> + /* Now we can continue allocating as if from 0 instead of an offset */
> + rounds = rem_pages / weight_total;
> + delta = rem_pages % weight_total;
> + for (i = 0; i < nnodes; i++) {
> + node = next_node_in(prev_node, nodes);
> + weight = weights[i];
> + node_pages = weight * rounds;
> + if (delta) {
> + if (delta > weight) {
> + node_pages += weight;
> + delta -= weight;
> + } else {
> + node_pages += delta;
> + delta = 0;
> + }
> + }
> + nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
> + NULL, page_array);
> + page_array += nr_allocated;
> + total_allocated += nr_allocated;
> + if (total_allocated == nr_pages)
> + break;
> + prev_node = node;
> + }
> +
> + /*
> + * Finally, we need to update me->il_prev and pol->wil.cur_weight
> + * if there were overflow pages, but not equivalent to the node
> + * weight, set the cur_weight to node_weight - delta and the
> + * me->il_prev to the previous node. Otherwise if it was perfect
> + * we can simply set il_prev to node and cur_weight to 0
> + */
> + if (node_pages) {
> + me->il_prev = prev_node;
> + node_pages %= weight;
> + pol->wil.cur_weight = weight - node_pages;
> + } else {
> + me->il_prev = node;
> + pol->wil.cur_weight = 0;
> + }
It appears that we should set me->il_prev and pol->wil.cur_weight when
delta becomes 0? That is, following allocation should start from there?
--
Best Regards,
Huang, Ying
> +
> + kfree(weights);
> + return total_allocated;
> +}
> +
> static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
> struct mempolicy *pol, unsigned long nr_pages,
> struct page **page_array)
> @@ -2317,6 +2512,11 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
> return alloc_pages_bulk_array_interleave(gfp, pol,
> nr_pages, page_array);
>
> + if (pol->mode == MPOL_WEIGHTED_INTERLEAVE)
> + return alloc_pages_bulk_array_weighted_interleave(gfp, pol,
> + nr_pages,
> + page_array);
> +
> if (pol->mode == MPOL_PREFERRED_MANY)
> return alloc_pages_bulk_array_preferred_many(gfp,
> numa_node_id(), pol, nr_pages, page_array);
> @@ -2392,6 +2592,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
> case MPOL_INTERLEAVE:
> case MPOL_PREFERRED:
> case MPOL_PREFERRED_MANY:
> + case MPOL_WEIGHTED_INTERLEAVE:
> return !!nodes_equal(a->nodes, b->nodes);
> case MPOL_LOCAL:
> return true;
> @@ -2528,6 +2729,10 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
> polnid = interleave_nid(pol, ilx);
> break;
>
> + case MPOL_WEIGHTED_INTERLEAVE:
> + polnid = weighted_interleave_nid(pol, ilx);
> + break;
> +
> case MPOL_PREFERRED:
> if (node_isset(curnid, pol->nodes))
> goto out;
> @@ -2902,6 +3107,7 @@ static const char * const policy_modes[] =
> [MPOL_PREFERRED] = "prefer",
> [MPOL_BIND] = "bind",
> [MPOL_INTERLEAVE] = "interleave",
> + [MPOL_WEIGHTED_INTERLEAVE] = "weighted interleave",
> [MPOL_LOCAL] = "local",
> [MPOL_PREFERRED_MANY] = "prefer (many)",
> };
> @@ -2961,6 +3167,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
> }
> break;
> case MPOL_INTERLEAVE:
> + case MPOL_WEIGHTED_INTERLEAVE:
> /*
> * Default to online nodes with memory if no nodelist
> */
> @@ -3071,6 +3278,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
> case MPOL_PREFERRED_MANY:
> case MPOL_BIND:
> case MPOL_INTERLEAVE:
> + case MPOL_WEIGHTED_INTERLEAVE:
> nodes = pol->nodes;
> break;
> default:
On Thu, Jan 18, 2024 at 11:05:52AM +0800, Huang, Ying wrote:
> Gregory Price <[email protected]> writes:
> > +static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
> > + struct mempolicy *pol, unsigned long nr_pages,
> > + struct page **page_array)
> > +{
> > + struct task_struct *me = current;
> > + unsigned long total_allocated = 0;
> > + unsigned long nr_allocated;
> > + unsigned long rounds;
> > + unsigned long node_pages, delta;
> > + u8 weight;
> > + struct iw_table __rcu *table;
> > + u8 *weights;
> > + unsigned int weight_total = 0;
> > + unsigned long rem_pages = nr_pages;
> > + nodemask_t nodes;
> > + int nnodes, node, weight_nodes;
> > + int prev_node = NUMA_NO_NODE;
> > + int i;
> > +
> > + nnodes = read_once_policy_nodemask(pol, &nodes);
> > + if (!nnodes)
> > + return 0;
> > +
> > + /* Continue allocating from most recent node and adjust the nr_pages */
> > + if (pol->wil.cur_weight) {
> > + node = next_node_in(me->il_prev, nodes);
> > + node_pages = pol->wil.cur_weight;
> > + if (node_pages > rem_pages)
> > + node_pages = rem_pages;
> > + nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
> > + NULL, page_array);
> > + page_array += nr_allocated;
> > + total_allocated += nr_allocated;
> > + /* if that's all the pages, no need to interleave */
> > + if (rem_pages <= pol->wil.cur_weight) {
> > + pol->wil.cur_weight -= rem_pages;
> > + return total_allocated;
> > + }
> > + /* Otherwise we adjust nr_pages down, and continue from there */
> > + rem_pages -= pol->wil.cur_weight;
> > + pol->wil.cur_weight = 0;
> > + prev_node = node;
> > + }
> > +
> > + /* fetch the weights for this operation and calculate total weight */
> > + weights = kmalloc(nnodes, GFP_KERNEL);
> > + if (!weights)
> > + return total_allocated;
> > +
> > + rcu_read_lock();
> > + table = rcu_dereference(iw_table);
> > + weight_nodes = 0;
> > + for_each_node_mask(node, nodes) {
> > + weights[weight_nodes++] = table->weights[node];
> > + weight_total += table->weights[node];
> > + }
> > + rcu_read_unlock();
> > +
> > + if (!weight_total) {
> > + kfree(weights);
> > + return total_allocated;
> > + }
> > +
> > + /* Now we can continue allocating as if from 0 instead of an offset */
> > + rounds = rem_pages / weight_total;
> > + delta = rem_pages % weight_total;
> > + for (i = 0; i < nnodes; i++) {
> > + node = next_node_in(prev_node, nodes);
> > + weight = weights[i];
> > + node_pages = weight * rounds;
> > + if (delta) {
> > + if (delta > weight) {
> > + node_pages += weight;
> > + delta -= weight;
> > + } else {
> > + node_pages += delta;
> > + delta = 0;
> > + }
> > + }
> > + nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
> > + NULL, page_array);
> > + page_array += nr_allocated;
> > + total_allocated += nr_allocated;
> > + if (total_allocated == nr_pages)
> > + break;
> > + prev_node = node;
> > + }
> > +
> > + /*
> > + * Finally, we need to update me->il_prev and pol->wil.cur_weight
> > + * if there were overflow pages, but not equivalent to the node
> > + * weight, set the cur_weight to node_weight - delta and the
> > + * me->il_prev to the previous node. Otherwise if it was perfect
> > + * we can simply set il_prev to node and cur_weight to 0
> > + */
> > + if (node_pages) {
> > + me->il_prev = prev_node;
> > + node_pages %= weight;
> > + pol->wil.cur_weight = weight - node_pages;
> > + } else {
> > + me->il_prev = node;
> > + pol->wil.cur_weight = 0;
> > + }
>
>
> It appears that we should set me->il_prev and pol->wil.cur_weight when
> delta becomes 0? That is, following allocation should start from there?
>
So the observation is that when delta reaches 0, we know what the prior
node should be. The only corner case being that delta is 0 when we
enter the loop (in which case current prev_node is the correct
prev_node).
Eyeballing it, this seems correct, but I'll do some additional
validation tomorrow. That should clean up the last block a bit.
Thanks!
~Gregory
Gregory Price <[email protected]> writes:
> On Wed, Jan 17, 2024 at 02:58:08PM +0800, Huang, Ying wrote:
>> Gregory Price <[email protected]> writes:
>>
>> > We haven't had the discussion on how/when this should happen yet,
>> > though, and there's some research to be done. (i.e. when should DRAM
>> > weights be set? should the entire table be reweighted on hotplug? etc)
>>
>> Before that, I'm OK to remove default_iw_table and use hard coded "1" as
>> default weight for now.
>>
>
> Can't quite do that. default_iw_table is a static structure because we
> need a reliable default structure not subject to module initialization
> failure. Otherwise we can end up in a situation where iw_table is NULL
> during some allocation path if the sysfs structure fails to setup fully.
As the first simplest implementation, we can avoid default_iw_table[].
Becuse it's constant.
> There's no good reason to fail allocations just because sysfs failed to
> initialization for some reason. I'll leave default_iw_table with a size
> of MAX_NUMNODES for now (nr_node_ids is set up at runtime per your
> reference to `setup_nr_node_ids` below, so we can't use it for this).
We allocate memory during module initialization all over the places in
kernel. I don't think it will cause any issue in practice. Just some
additional checking for "default_iw_table == NULL".
And, we cannot make it just static, because we need to use RCU to keep
it consistent. Otherwise, it may be changed during reading.
>> >
>> >> u8 __rcu *iw_table;
>> >>
>> >> Then, we only need to allocate nr_node_ids elements now.
>> >>
>> >
>> > We need nr_possible_nodes to handle hotplug correctly.
>>
>> nr_node_ids >= num_possible_nodes(). It's larger than any possible node
>> ID.
>>
>
> nr_node_ids gets setup at runtime, while the default_iw_table needs
> to be a static structure (see above). I can make default_iw_table
> MAX_NUMNODES and subsequent allocations of iw_table be nr_node_ids,
> but that makes iw_table a different size at any given time.
>
> This *will* break if "true hotplug" ever shows up and possible_nodes !=
> MAX_NUMNODES. But I can write it up if it's a sticking point for you.
I don't think it is an issue for "true hotplug". Because we can set
nr_node_ids = MAX_NUMNODES even if there is something called "true
hotplug".
> Ultimately we're squabbling over, at most, about ~3kb of memory, just
> keep that in mind. (I guess if you spawn 3000 threads and each tries a
> concurrent write to sysfs/node1, you'd eat 3MB view briefly, but that
> is a truly degenerate case and I can think of more denegerate things).
Not just for memory wastage, it's about proper API too.
>>
>> When "true node hotplug" becomes reality, we can make nr_node_ids ==
>> MAX_NUMNODES. So, it's safe to use it. Please take a look at
>> setup_nr_node_ids().
>>
--
Best Regards,
Huang, Ying