Hi Andrew - this version added a fix for a stale weight
issue that can occur on cgroup migration, and some fixups
recommended by Ying Huang. There is an additional patch
about the stale weight due to the introduction of atomics
that may warrant some explicit scrutiny before pulling.
v3: stale value fix, bulk allocator fixups, sysfs simplification
---
Weighted interleave is a new interleave policy intended to make
use of heterogeneous memory environments appearing with CXL.
The existing interleave mechanism does an even round-robin
distribution of memory across all nodes in a nodemask, while
weighted interleave distributes memory across nodes according
to a provided weight. (Weight = # of page allocations per round)
Weighted interleave is intended to reduce average latency when
bandwidth is pressured - therefore increasing total throughput.
In other words: It allows greater use of the total available
bandwidth in a heterogeneous hardware environment (different
hardware provides different bandwidth capacity).
As bandwidth is pressured, latency increases - first linearly
and then exponentially. By keeping bandwidth usage distributed
according to available bandwidth, we therefore can reduce the
average latency of a cacheline fetch.
A good explanation of the bandwidth vs latency response curve:
https://mahmoudhatem.wordpress.com/2017/11/07/memory-bandwidth-vs-latency-response-curve/
From the article:
```
Constant region:
The latency response is fairly constant for the first 40%
of the sustained bandwidth.
Linear region:
In between 40% to 80% of the sustained bandwidth, the
latency response increases almost linearly with the bandwidth
demand of the system due to contention overhead by numerous
memory requests.
Exponential region:
Between 80% to 100% of the sustained bandwidth, the memory
latency is dominated by the contention latency which can be
as much as twice the idle latency or more.
Maximum sustained bandwidth :
Is 65% to 75% of the theoretical maximum bandwidth.
```
As a general rule of thumb:
* If bandwidth usage is low, latency does not increase. It is
optimal to place data in the nearest (lowest latency) device.
* If bandwidth usage is high, latency increases. It is optimal
to place data such that bandwidth use is optimized per-device.
This is the top line goal: Provide a user a mechanism to target using
the "maximum sustained bandwidth" of each hardware component in a
heterogenous memory system.
For example, the stream benchmark demonstrates that 1:1 (default)
interleave is actively harmful, while weighted interleave can be
beneficial. Default interleave distributes data such that too much
pressure is placed on devices with lower available bandwidth.
Stream Benchmark (High level results, 1 Socket + 1 CXL Device)
Default interleave : -78% (slower than DRAM)
Global weighting : -6% to +4% (workload dependant)
Targeted weights : +2.5% to +4% (consistently better than DRAM)
Global means the task-policy was set (set_mempolicy), while targeted
means VMA policies were set (mbind2). We see weighted interleave
is not always beneficial when applied globally, but is always
beneficial when applied to bandwidth-driving memory regions.
We implement sysfs entries for "system global" weights which can be
set by a daemon or administrator.
There are 3 patches in this set:
1) Implement system-global interleave weights as sysfs extension
in mm/mempolicy.c. These weights are RCU protected, and a
default weight set is provided (all weights are 1 by default).
In future work, we intend to expose an interface for HMAT/CDAT
information to be used during boot to set reasonable system
default values based on the memory configuration of the system
discovered at boot or during device hotplug.
2) A mild refactor of some interleave-logic for re-use in the
new weighted interleave logic.
3) MPOL_WEIGHTED_INTERLEAVE extension for set_mempolicy/mbind
Included below are some performance and LTP test information,
and a sample numactl branch which can be used for testing.
= Performance summary =
(tests may have different configurations, see extended info below)
1) MLC (W2) : +38% over DRAM. +264% over default interleave.
MLC (W5) : +40% over DRAM. +226% over default interleave.
2) Stream : -6% to +4% over DRAM, +430% over default interleave.
3) XSBench : +19% over DRAM. +47% over default interleave.
= LTP Testing Summary =
existing mempolicy & mbind tests: pass
mempolicy & mbind + weighted interleave (global weights): pass
= version history
v3:
- MAJOR: Changes cur_weight to be an atomic and to carry the
current interleave node+weight, rather than just weight
this fixes a stale-weight issue when a rebind occurs.
- minor doc updates
- sysfs: remove module_exit path, not needed
- sysfs: allocate node_attrs rather than static MAX_NUMNODES array
- interleave_nodes: handle cur_weight=0 conditions explicitly
- bulk allocator: prev_node should be initialized to me->il_prev
- bulk alloactor: weight collection logic style fixes
- bulk allocator: bulk allocator for loop style fixes
- bulk allocator: corner case fixes for resume_node/weight
v2:
- MAJOR: Torture tested bulk allocator, fixed edge conditions
tracking the next me->il_node. Added documentation.
Prior version was stable, but the resulting me->il_node
could be wrong under certain circumstances.
- naming: iw_table_mtx -> iw_table_lock
- RCU: use synchronize+kfree and simplify the weight structure
- default: remove default table, since it's static for now
- sysfs setup: simplify setup, if table==NULL presume 1's
- node_store: only allocate (sizeof(u8) * nr_node_ids)
- allocators: update to deal with NULL table pointer
- read_once: __builtin_memcpy -> memcpy
- formatting
v1:
- RCU: This version protects the weight array with RCU.
- ktest fix: proper include (types.h) in uapi header
- doc: make mpol_params in docs reflect definition
- doc: mempolicy.c comments in MPOL_WEIGHTED_INTERLEAVE patch
- Dropped task-local weights and syscalls from the proposal
until affirmative use cases for task-local weights appear.
Link: https://lore.kernel.org/linux-mm/[email protected]/
=====================================================================
Performance tests - MLC
From - Ravi Jonnalagadda <[email protected]>
Hardware: Single-socket, multiple CXL memory expanders.
Workload: W2
Data Signature: 2:1 read:write
DRAM only bandwidth (GBps): 298.8
DRAM + CXL (default interleave) (GBps): 113.04
DRAM + CXL (weighted interleave)(GBps): 412.5
Gain over DRAM only: 1.38x
Gain over default interleave: 2.64x
Workload: W5
Data Signature: 1:1 read:write
DRAM only bandwidth (GBps): 273.2
DRAM + CXL (default interleave) (GBps): 117.23
DRAM + CXL (weighted interleave)(GBps): 382.7
Gain over DRAM only: 1.4x
Gain over default interleave: 2.26x
=====================================================================
Performance test - Stream
From - Gregory Price <[email protected]>
Hardware: Single socket, single CXL expander
numactl extension: https://github.com/gmprice/numactl/tree/weighted_interleave_master
Summary: 64 threads, ~18GB workload, 3GB per array, executed 100 times
Default interleave : -78% (slower than DRAM)
Global weighting : -6% to +4% (workload dependant)
mbind2 weights : +2.5% to +4% (consistently better than DRAM)
dram only:
numactl --cpunodebind=1 --membind=1 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
Function Direction BestRateMBs AvgTime MinTime MaxTime
Copy: 0->0 200923.2 0.032662 0.031853 0.033301
Scale: 0->0 202123.0 0.032526 0.031664 0.032970
Add: 0->0 208873.2 0.047322 0.045961 0.047884
Triad: 0->0 208523.8 0.047262 0.046038 0.048414
CXL-only:
numactl --cpunodebind=1 -w --membind=2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
Copy: 0->0 22209.7 0.288661 0.288162 0.289342
Scale: 0->0 22288.2 0.287549 0.287147 0.288291
Add: 0->0 24419.1 0.393372 0.393135 0.393735
Triad: 0->0 24484.6 0.392337 0.392083 0.394331
Based on the above, the optimal weights are ~9:1
echo 9 > /sys/kernel/mm/mempolicy/weighted_interleave/node1
echo 1 > /sys/kernel/mm/mempolicy/weighted_interleave/node2
default interleave:
numactl --cpunodebind=1 --interleave=1,2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
Copy: 0->0 44666.2 0.143671 0.143285 0.144174
Scale: 0->0 44781.6 0.143256 0.142916 0.143713
Add: 0->0 48600.7 0.197719 0.197528 0.197858
Triad: 0->0 48727.5 0.197204 0.197014 0.197439
global weighted interleave:
numactl --cpunodebind=1 -w --interleave=1,2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
Copy: 0->0 190085.9 0.034289 0.033669 0.034645
Scale: 0->0 207677.4 0.031909 0.030817 0.033061
Add: 0->0 202036.8 0.048737 0.047516 0.053409
Triad: 0->0 217671.5 0.045819 0.044103 0.046755
targted regions w/ global weights (modified stream to mbind2 malloc'd regions))
numactl --cpunodebind=1 --membind=1 ./stream_c.exe -b --ntimes 100 --array-size 400M --malloc
Copy: 0->0 205827.0 0.031445 0.031094 0.031984
Scale: 0->0 208171.8 0.031320 0.030744 0.032505
Add: 0->0 217352.0 0.045087 0.044168 0.046515
Triad: 0->0 216884.8 0.045062 0.044263 0.046982
=====================================================================
Performance tests - XSBench
From - Hyeongtak Ji <[email protected]>
Hardware: Single socket, Single CXL memory Expander
NUMA node 0: 56 logical cores, 128 GB memory
NUMA node 2: 96 GB CXL memory
Threads: 56
Lookups: 170,000,000
Summary: +19% over DRAM. +47% over default interleave.
Performance tests - XSBench
1. dram only
$ numactl -m 0 ./XSBench -s XL –p 5000000
Runtime: 36.235 seconds
Lookups/s: 4,691,618
2. default interleave
$ numactl –i 0,2 ./XSBench –s XL –p 5000000
Runtime: 55.243 seconds
Lookups/s: 3,077,293
3. weighted interleave
numactl –w –i 0,2 ./XSBench –s XL –p 5000000
Runtime: 29.262 seconds
Lookups/s: 5,809,513
=====================================================================
LTP Tests: https://github.com/gmprice/ltp/tree/mempolicy2
= Existing tests
set_mempolicy, get_mempolicy, mbind
MPOL_WEIGHTED_INTERLEAVE added manually to test basic functionality
but did not adjust tests for weighting. Basically the weights were
set to 1, which is the default, and it should behavior like standard
MPOL_INTERLEAVE if logic is correct.
== set_mempolicy01 : passed 18, failed 0
== set_mempolicy02 : passed 10, failed 0
== set_mempolicy03 : passed 64, failed 0
== set_mempolicy04 : passed 32, failed 0
== set_mempolicy05 - n/a on non-x86
== set_mempolicy06 : passed 10, failed 0
this is set_mempolicy02 + MPOL_WEIGHTED_INTERLEAVE
== set_mempolicy07 : passed 32, failed 0
set_mempolicy04 + MPOL_WEIGHTED_INTERLEAVE
== get_mempolicy01 : passed 12, failed 0
change: added MPOL_WEIGHTED_INTERLEAVE
== get_mempolicy02 : passed 2, failed 0
== mbind01 : passed 15, failed 0
added MPOL_WEIGHTED_INTERLEAVE
== mbind02 : passed 4, failed 0
added MPOL_WEIGHTED_INTERLEAVE
== mbind03 : passed 16, failed 0
added MPOL_WEIGHTED_INTERLEAVE
== mbind04 : passed 48, failed 0
added MPOL_WEIGHTED_INTERLEAVE
=====================================================================
numactl (set_mempolicy) w/ global weighting test
numactl fork: https://github.com/gmprice/numactl/tree/weighted_interleave_master
command: numactl -w --interleave=0,1 ./eatmem
result (weights 1:1):
0176a000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=32897 N1=32896 kernelpagesize_kB=4
7fceeb9ff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=32768 N1=32769 kernelpagesize_kB=4
50% distribution is correct
result (weights 5:1):
01b14000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=54828 N1=10965 kernelpagesize_kB=4
7f47a1dff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=54614 N1=10923 kernelpagesize_kB=4
16.666% distribution is correct
result (weights 1:5):
01f07000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=10966 N1=54827 kernelpagesize_kB=4
7f17b1dff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=10923 N1=54614 kernelpagesize_kB=4
16.666% distribution is correct
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main (void)
{
char* mem = malloc(1024*1024*256);
memset(mem, 1, 1024*1024*256);
for (int i = 0; i < ((1024*1024*256)/4096); i++)
{
mem = malloc(4096);
mem[0] = 1;
}
printf("done\n");
getchar();
return 0;
}
=====================================================================
Suggested-by: Gregory Price <[email protected]>
Suggested-by: Johannes Weiner <[email protected]>
Suggested-by: Hasan Al Maruf <[email protected]>
Suggested-by: Hao Wang <[email protected]>
Suggested-by: Ying Huang <[email protected]>
Suggested-by: Dan Williams <[email protected]>
Suggested-by: Michal Hocko <[email protected]>
Suggested-by: Zhongkun He <[email protected]>
Suggested-by: Frank van der Linden <[email protected]>
Suggested-by: John Groves <[email protected]>
Suggested-by: Vinicius Tavares Petrucci <[email protected]>
Suggested-by: Srinivasulu Thanneeru <[email protected]>
Suggested-by: Ravi Jonnalagadda <[email protected]>
Suggested-by: Jonathan Cameron <[email protected]>
Suggested-by: Hyeongtak Ji <[email protected]>
Suggested-by: Andi Kleen <[email protected]>
Signed-off-by: Gregory Price <[email protected]>
Gregory Price (3):
mm/mempolicy: refactor a read-once mechanism into a function for
re-use
mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted
interleaving
mm/mempolicy: change cur_il_weight to atomic and carry the node with
it
Rakie Kim (1):
mm/mempolicy: implement the sysfs-based weighted_interleave interface
.../ABI/testing/sysfs-kernel-mm-mempolicy | 4 +
...fs-kernel-mm-mempolicy-weighted-interleave | 25 +
.../admin-guide/mm/numa_memory_policy.rst | 9 +
include/linux/mempolicy.h | 3 +
include/uapi/linux/mempolicy.h | 1 +
mm/mempolicy.c | 551 +++++++++++++++++-
6 files changed, 579 insertions(+), 14 deletions(-)
create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
--
2.39.1
move the use of barrier() to force policy->nodemask onto the stack into
a function `read_once_policy_nodemask` so that it may be re-used.
Suggested-by: Huang Ying <[email protected]>
Signed-off-by: Gregory Price <[email protected]>
---
mm/mempolicy.c | 26 ++++++++++++++++----------
1 file changed, 16 insertions(+), 10 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index f1627d45b0c8..b13c45a0bfcb 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1894,6 +1894,20 @@ unsigned int mempolicy_slab_node(void)
}
}
+static unsigned int read_once_policy_nodemask(struct mempolicy *pol,
+ nodemask_t *mask)
+{
+ /*
+ * barrier stabilizes the nodemask locally so that it can be iterated
+ * over safely without concern for changes. Allocators validate node
+ * selection does not violate mems_allowed, so this is safe.
+ */
+ barrier();
+ memcpy(mask, &pol->nodes, sizeof(nodemask_t));
+ barrier();
+ return nodes_weight(*mask);
+}
+
/*
* Do static interleaving for interleave index @ilx. Returns the ilx'th
* node in pol->nodes (starting from ilx=0), wrapping around if ilx
@@ -1901,20 +1915,12 @@ unsigned int mempolicy_slab_node(void)
*/
static unsigned int interleave_nid(struct mempolicy *pol, pgoff_t ilx)
{
- nodemask_t nodemask = pol->nodes;
+ nodemask_t nodemask;
unsigned int target, nnodes;
int i;
int nid;
- /*
- * The barrier will stabilize the nodemask in a register or on
- * the stack so that it will stop changing under the code.
- *
- * Between first_node() and next_node(), pol->nodes could be changed
- * by other threads. So we put pol->nodes in a local stack.
- */
- barrier();
- nnodes = nodes_weight(nodemask);
+ nnodes = read_once_policy_nodemask(pol, &nodemask);
if (!nnodes)
return numa_node_id();
target = ilx % nnodes;
--
2.39.1
In the prior patch, we carry only the current weight for a weighted
interleave round with us across calls through the allocator path.
node = next_node_in(current->il_prev, pol->nodemask)
pol->cur_il_weight <--- this weight applies to the above node
This separation of data can cause a race condition.
If a cgroup-initiated task migration or mems_allowed change occurs
from outside the context of the task, this can cause the weight to
become stale, meaning we may end using that weight to allocate
memory on the wrong node.
Example:
1) task A sets (cur_il_weight = 8) and (current->il_prev) to
node0. node1 is the next set bit in pol->nodemask
2) rebind event occurs, removing node1 from the nodemask.
node2 is now the next set bit in pol->nodemask
cur_il_weight is now stale.
3) allocation occurs, next_node_in(il_prev, nodes) returns
node2. cur_il_weight is now applied to the wrong node.
The upper level allocator logic must still enforce mems_allowed,
so this isn't dangerous, but it is innaccurate.
Just clearing the weight is insufficient, as it creates two more
race conditions. The root of the issue is the separation of weight
and node data between nodemask and cur_il_weight.
To solve this, update cur_il_weight to be an atomic_t, and place the
node that the weight applies to in the upper bits of the field:
atomic_t cur_il_weight
node bits 32:8
weight bits 7:0
Now retrieving or clearing the active interleave node and weight
is a single atomic operation, and we are not dependent on the
potentially changing state of (pol->nodemask) to determine what
node the weight applies to.
Two special observations:
- if the weight is non-zero, cur_il_weight must *always* have a
valid node number, e.g. it cannot be NUMA_NO_NODE (-1).
This is because we steal the top byte for the weight.
- MAX_NUMNODES is presently limited to 1024 or less on every
architecture. This would permanently limit MAX_NUMNODES to
an absolute maximum of (1 << 24) to avoid overflows.
Per some reading and discussion, it appears that max nodes is
limited to 1024 so that zone type still fits in page flags, so
this method seemed preferable compared to the alternatives of
trying to make all or part of mempolicy RCU protected (which
may not be possible, since it is often referenced during code
chunks which call operations that may sleep).
Signed-off-by: Gregory Price <[email protected]>
---
include/linux/mempolicy.h | 2 +-
mm/mempolicy.c | 93 +++++++++++++++++++++++++--------------
2 files changed, 61 insertions(+), 34 deletions(-)
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index c644d7bbd396..8108fc6e96ca 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -56,7 +56,7 @@ struct mempolicy {
} w;
/* Weighted interleave settings */
- u8 cur_il_weight;
+ atomic_t cur_il_weight;
};
/*
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 5a517511658e..41b5fef0a6f5 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -321,7 +321,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
policy->mode = mode;
policy->flags = flags;
policy->home_node = NUMA_NO_NODE;
- policy->cur_il_weight = 0;
+ atomic_set(&policy->cur_il_weight, 0);
return policy;
}
@@ -356,6 +356,7 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
tmp = *nodes;
pol->nodes = tmp;
+ atomic_set(&pol->cur_il_weight, 0);
}
static void mpol_rebind_preferred(struct mempolicy *pol,
@@ -973,8 +974,10 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
*policy = next_node_in(current->il_prev, pol->nodes);
} else if (pol == current->mempolicy &&
(pol->mode == MPOL_WEIGHTED_INTERLEAVE)) {
- if (pol->cur_il_weight)
- *policy = current->il_prev;
+ int cweight = atomic_read(&pol->cur_il_weight);
+
+ if (cweight & 0xFF)
+ *policy = cweight >> 8;
else
*policy = next_node_in(current->il_prev,
pol->nodes);
@@ -1864,36 +1867,48 @@ static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
unsigned int node, next;
struct task_struct *me = current;
u8 __rcu *table;
+ int cur_weight;
u8 weight;
- node = next_node_in(me->il_prev, policy->nodes);
- if (node == MAX_NUMNODES)
- return node;
+ cur_weight = atomic_read(&policy->cur_il_weight);
+ node = cur_weight >> 8;
+ weight = cur_weight & 0xff;
- /* on first alloc after setting mempolicy, acquire first weight */
- if (unlikely(!policy->cur_il_weight)) {
+ /* If nodemask was rebound, just fetch the next node */
+ if (!weight || !node_isset(node, policy->nodes)) {
+ node = next_node_in(me->il_prev, policy->nodes);
+ /* can only happen if nodemask has become invalid */
+ if (node == MAX_NUMNODES)
+ return node;
rcu_read_lock();
table = rcu_dereference(iw_table);
/* detect system-default values */
weight = table ? table[node] : 1;
- policy->cur_il_weight = weight ? weight : 1;
+ weight = weight ? weight : 1;
rcu_read_unlock();
}
/* account for this allocation call */
- policy->cur_il_weight--;
+ weight--;
/* if now at 0, move to next node and set up that node's weight */
- if (unlikely(!policy->cur_il_weight)) {
+ if (unlikely(!weight)) {
me->il_prev = node;
next = next_node_in(node, policy->nodes);
- rcu_read_lock();
- table = rcu_dereference(iw_table);
- /* detect system-default values */
- weight = table ? table[next] : 1;
- policy->cur_il_weight = weight ? weight : 1;
- rcu_read_unlock();
- }
+ if (next != MAX_NUMNODES) {
+ rcu_read_lock();
+ table = rcu_dereference(iw_table);
+ /* detect system-default values */
+ weight = table ? table[next] : 1;
+ weight = weight ? weight : 1;
+ rcu_read_unlock();
+ cur_weight = (next << 8) | weight;
+ } else /* policy->nodes became invalid */
+ cur_weight = 0;
+ } else
+ cur_weight = (node << 8) | weight;
+
+ atomic_set(&policy->cur_il_weight, cur_weight);
return node;
}
@@ -2385,6 +2400,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
nodemask_t nodes;
int nnodes, node, resume_node, next_node;
int prev_node = me->il_prev;
+ int cur_node_and_weight = atomic_read(&pol->cur_il_weight);
int i;
if (!nr_pages)
@@ -2394,10 +2410,11 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
if (!nnodes)
return 0;
+ node = cur_node_and_weight >> 8;
+ weight = cur_node_and_weight & 0xff;
/* Continue allocating from most recent node and adjust the nr_pages */
- if (pol->cur_il_weight) {
- node = next_node_in(prev_node, nodes);
- node_pages = pol->cur_il_weight;
+ if (weight && node_isset(node, nodes)) {
+ node_pages = weight;
if (node_pages > rem_pages)
node_pages = rem_pages;
nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
@@ -2408,27 +2425,36 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
* if that's all the pages, no need to interleave, otherwise
* we need to set up the next interleave node/weight correctly.
*/
- if (rem_pages < pol->cur_il_weight) {
+ if (rem_pages < weight) {
/* stay on current node, adjust cur_il_weight */
- pol->cur_il_weight -= rem_pages;
+ weight -= rem_pages;
+ atomic_set(&pol->cur_il_weight, ((node << 8) | weight));
return total_allocated;
- } else if (rem_pages == pol->cur_il_weight) {
+ } else if (rem_pages == weight) {
/* move to next node / weight */
me->il_prev = node;
next_node = next_node_in(node, nodes);
- rcu_read_lock();
- table = rcu_dereference(iw_table);
- weight = table ? table[next_node] : 1;
- /* detect system-default usage */
- pol->cur_il_weight = weight ? weight : 1;
- rcu_read_unlock();
+ if (next_node == MAX_NUMNODES) {
+ next_node = 0;
+ weight = 0;
+ } else {
+ rcu_read_lock();
+ table = rcu_dereference(iw_table);
+ weight = table ? table[next_node] : 1;
+ /* detect system-default usage */
+ weight = weight ? weight : 1;
+ rcu_read_unlock();
+ }
+ atomic_set(&pol->cur_il_weight,
+ ((next_node << 8) | weight));
return total_allocated;
}
/* Otherwise we adjust nr_pages down, and continue from there */
- rem_pages -= pol->cur_il_weight;
- pol->cur_il_weight = 0;
+ rem_pages -= weight;
prev_node = node;
}
+ /* clear cur_il_weight in case of an allocation failure */
+ atomic_set(&pol->cur_il_weight, 0);
/* create a local copy of node weights to operate on outside rcu */
weights = kmalloc(nr_node_ids, GFP_KERNEL);
@@ -2513,7 +2539,8 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
}
/* resume allocating from the calculated node and weight */
me->il_prev = resume_node;
- pol->cur_il_weight = resume_weight;
+ resume_node = next_node_in(resume_node, nodes);
+ atomic_set(&pol->cur_il_weight, ((resume_node << 8) | resume_weight));
kfree(weights);
return total_allocated;
}
--
2.39.1
From: Rakie Kim <[email protected]>
This patch provides a way to set interleave weight information under
sysfs at /sys/kernel/mm/mempolicy/weighted_interleave/nodeN
The sysfs structure is designed as follows.
$ tree /sys/kernel/mm/mempolicy/
/sys/kernel/mm/mempolicy/ [1]
└── weighted_interleave [2]
├── node0 [3]
└── node1
Each file above can be explained as follows.
[1] mm/mempolicy: configuration interface for mempolicy subsystem
[2] weighted_interleave/: config interface for weighted interleave policy
[3] weighted_interleave/nodeN: weight for nodeN
If a node value is set to `0`, the system-default value will be used.
As of this patch, the system-default for all nodes is always 1.
Suggested-by: Huang Ying <[email protected]>
Signed-off-by: Rakie Kim <[email protected]>
Signed-off-by: Honggyu Kim <[email protected]>
Co-developed-by: Gregory Price <[email protected]>
Signed-off-by: Gregory Price <[email protected]>
Co-developed-by: Hyeongtak Ji <[email protected]>
Signed-off-by: Hyeongtak Ji <[email protected]>
---
.../ABI/testing/sysfs-kernel-mm-mempolicy | 4 +
...fs-kernel-mm-mempolicy-weighted-interleave | 25 ++
mm/mempolicy.c | 224 ++++++++++++++++++
3 files changed, 253 insertions(+)
create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
new file mode 100644
index 000000000000..2dcf24f4384a
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
@@ -0,0 +1,4 @@
+What: /sys/kernel/mm/mempolicy/
+Date: December 2023
+Contact: Linux memory management mailing list <[email protected]>
+Description: Interface for Mempolicy
diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
new file mode 100644
index 000000000000..0062b02703ff
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
@@ -0,0 +1,25 @@
+What: /sys/kernel/mm/mempolicy/weighted_interleave/
+Date: January 2024
+Contact: Linux memory management mailing list <[email protected]>
+Description: Configuration Interface for the Weighted Interleave policy
+
+What: /sys/kernel/mm/mempolicy/weighted_interleave/nodeN
+Date: January 2024
+Contact: Linux memory management mailing list <[email protected]>
+Description: Weight configuration interface for nodeN
+
+ The interleave weight for a memory node (N). These weights are
+ utilized by taskss which have set their mempolicy to
+ MPOL_WEIGHTED_INTERLEAVE.
+
+ These weights only affect new allocations, and changes at runtime
+ will not cause migrations on already allocated pages.
+
+ The minimum weight for a node is always 1.
+
+ Minimum weight: 1
+ Maximum weight: 255
+
+ Writing an empty string or `0` will reset the weight to the
+ system default. The system default may be set by the kernel
+ or drivers at boot or during hotplug events.
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 10a590ee1c89..f1627d45b0c8 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -131,6 +131,17 @@ static struct mempolicy default_policy = {
static struct mempolicy preferred_node_policy[MAX_NUMNODES];
+/*
+ * iw_table is the sysfs-set interleave weight table, a value of 0 denotes
+ * system-default value should be used. A NULL iw_table also denotes that
+ * system-default values should be used. Until the system-default table
+ * is implemented, the system-default is always 1.
+ *
+ * iw_table is RCU protected
+ */
+static u8 __rcu *iw_table;
+static DEFINE_MUTEX(iw_table_lock);
+
/**
* numa_nearest_node - Find nearest node by state
* @node: Node id to start the search
@@ -3067,3 +3078,216 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
p += scnprintf(p, buffer + maxlen - p, ":%*pbl",
nodemask_pr_args(&nodes));
}
+
+#ifdef CONFIG_SYSFS
+struct iw_node_attr {
+ struct kobj_attribute kobj_attr;
+ int nid;
+};
+
+static ssize_t node_show(struct kobject *kobj, struct kobj_attribute *attr,
+ char *buf)
+{
+ struct iw_node_attr *node_attr;
+ u8 weight;
+ u8 __rcu *table;
+
+ node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
+
+ rcu_read_lock();
+ table = rcu_dereference(iw_table);
+ weight = table ? table[node_attr->nid] : 1;
+ rcu_read_unlock();
+
+ return sysfs_emit(buf, "%d\n", weight);
+}
+
+static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct iw_node_attr *node_attr;
+ u8 __rcu *new;
+ u8 __rcu *old;
+ u8 weight = 0;
+
+ node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
+ if (count == 0 || sysfs_streq(buf, ""))
+ weight = 0;
+ else if (kstrtou8(buf, 0, &weight))
+ return -EINVAL;
+
+ /*
+ * The default weight is 1, for now. When the kernel-internal
+ * default weight array is implemented, 0 will be a directive to
+ * the allocators to use the system-default weight instead.
+ */
+ if (!weight)
+ weight = 1;
+
+ new = kmalloc(nr_node_ids, GFP_KERNEL);
+ if (!new)
+ return -ENOMEM;
+
+ mutex_lock(&iw_table_lock);
+ old = rcu_dereference_protected(iw_table,
+ lockdep_is_held(&iw_table_lock));
+ if (old)
+ memcpy(new, old, nr_node_ids);
+ else
+ memset(new, 1, nr_node_ids);
+ new[node_attr->nid] = weight;
+ rcu_assign_pointer(iw_table, new);
+ mutex_unlock(&iw_table_lock);
+ synchronize_rcu();
+ kfree(old);
+ return count;
+}
+
+static struct iw_node_attr **node_attrs;
+
+static void sysfs_wi_node_release(struct iw_node_attr *node_attr,
+ struct kobject *parent)
+{
+ if (!node_attr)
+ return;
+ sysfs_remove_file(parent, &node_attr->kobj_attr.attr);
+ kfree(node_attr->kobj_attr.attr.name);
+ kfree(node_attr);
+}
+
+static void sysfs_wi_release(struct kobject *wi_kobj)
+{
+ int i;
+
+ for (i = 0; i < nr_node_ids; i++)
+ sysfs_wi_node_release(node_attrs[i], wi_kobj);
+ kobject_put(wi_kobj);
+}
+
+static const struct kobj_type wi_ktype = {
+ .sysfs_ops = &kobj_sysfs_ops,
+ .release = sysfs_wi_release,
+};
+
+static int add_weight_node(int nid, struct kobject *wi_kobj)
+{
+ struct iw_node_attr *node_attr;
+ char *name;
+
+ node_attr = kzalloc(sizeof(*node_attr), GFP_KERNEL);
+ if (!node_attr)
+ return -ENOMEM;
+
+ name = kasprintf(GFP_KERNEL, "node%d", nid);
+ if (!name) {
+ kfree(node_attr);
+ return -ENOMEM;
+ }
+
+ sysfs_attr_init(&node_attr->kobj_attr.attr);
+ node_attr->kobj_attr.attr.name = name;
+ node_attr->kobj_attr.attr.mode = 0644;
+ node_attr->kobj_attr.show = node_show;
+ node_attr->kobj_attr.store = node_store;
+ node_attr->nid = nid;
+
+ if (sysfs_create_file(wi_kobj, &node_attr->kobj_attr.attr)) {
+ kfree(node_attr->kobj_attr.attr.name);
+ kfree(node_attr);
+ pr_err("failed to add attribute to weighted_interleave\n");
+ return -ENOMEM;
+ }
+
+ node_attrs[nid] = node_attr;
+ return 0;
+}
+
+static int add_weighted_interleave_group(struct kobject *root_kobj)
+{
+ struct kobject *wi_kobj;
+ int nid, err;
+
+ wi_kobj = kzalloc(sizeof(struct kobject), GFP_KERNEL);
+ if (!wi_kobj)
+ return -ENOMEM;
+
+ err = kobject_init_and_add(wi_kobj, &wi_ktype, root_kobj,
+ "weighted_interleave");
+ if (err) {
+ kfree(wi_kobj);
+ return err;
+ }
+
+ for_each_node_state(nid, N_POSSIBLE) {
+ err = add_weight_node(nid, wi_kobj);
+ if (err) {
+ pr_err("failed to add sysfs [node%d]\n", nid);
+ break;
+ }
+ }
+ if (err)
+ kobject_put(wi_kobj);
+ return 0;
+}
+
+static void mempolicy_kobj_release(struct kobject *kobj)
+{
+ u8 __rcu *old;
+
+ mutex_lock(&iw_table_lock);
+ old = rcu_dereference_protected(iw_table,
+ lockdep_is_held(&iw_table_lock));
+ rcu_assign_pointer(iw_table, NULL);
+ mutex_unlock(&iw_table_lock);
+ synchronize_rcu();
+ kfree(old);
+ kfree(node_attrs);
+ kfree(kobj);
+}
+
+static const struct kobj_type mempolicy_ktype = {
+ .release = mempolicy_kobj_release
+};
+
+static int __init mempolicy_sysfs_init(void)
+{
+ int err;
+ static struct kobject *mempolicy_kobj;
+
+ mempolicy_kobj = kzalloc(sizeof(*mempolicy_kobj), GFP_KERNEL);
+ if (!mempolicy_kobj) {
+ err = -ENOMEM;
+ goto err_out;
+ }
+
+ node_attrs = kcalloc(nr_node_ids, sizeof(struct iw_node_attr *),
+ GFP_KERNEL);
+ if (!node_attrs) {
+ err = -ENOMEM;
+ goto mempol_out;
+ }
+
+ err = kobject_init_and_add(mempolicy_kobj, &mempolicy_ktype, mm_kobj,
+ "mempolicy");
+ if (err)
+ goto node_out;
+
+ err = add_weighted_interleave_group(mempolicy_kobj);
+ if (err) {
+ pr_err("mempolicy sysfs structure failed to initialize\n");
+ kobject_put(mempolicy_kobj);
+ return err;
+ }
+
+ return err;
+node_out:
+ kfree(node_attrs);
+mempol_out:
+ kfree(mempolicy_kobj);
+err_out:
+ pr_err("failed to add mempolicy kobject to the system\n");
+ return err;
+}
+
+late_initcall(mempolicy_sysfs_init);
+#endif /* CONFIG_SYSFS */
--
2.39.1
When a system has multiple NUMA nodes and it becomes bandwidth hungry,
using the current MPOL_INTERLEAVE could be an wise option.
However, if those NUMA nodes consist of different types of memory such
as socket-attached DRAM and CXL/PCIe attached DRAM, the round-robin
based interleave policy does not optimally distribute data to make use
of their different bandwidth characteristics.
Instead, interleave is more effective when the allocation policy follows
each NUMA nodes' bandwidth weight rather than a simple 1:1 distribution.
This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE,
enabling weighted interleave between NUMA nodes. Weighted interleave
allows for proportional distribution of memory across multiple numa
nodes, preferably apportioned to match the bandwidth of each node.
For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1),
with bandwidth of (100GB/s, 50GB/s) respectively, the appropriate
weight distribution is (2:1).
Weights for each node can be assigned via the new sysfs extension:
/sys/kernel/mm/mempolicy/weighted_interleave/
For now, the default value of all nodes will be `1`, which matches
the behavior of standard 1:1 round-robin interleave. An extension
will be added in the future to allow default values to be registered
at kernel and device bringup time.
The policy allocates a number of pages equal to the set weights. For
example, if the weights are (2,1), then 2 pages will be allocated on
node0 for every 1 page allocated on node1.
The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2)
and mbind(2).
There are 3 integration points:
weighted_interleave_nodes:
Counts the number of allocations as they occur, and applies the
weight for the current node. When the weight reaches 0, switch
to the next node.
weighted_interleave_nid:
Gets the total weight of the nodemask as well as each individual
node weight, then calculates the node based on the given index.
bulk_array_weighted_interleave:
Gets the total weight of the nodemask as well as each individual
node weight, then calculates the number of "interleave rounds" as
well as any delta ("partial round"). Calculates the number of
pages for each node and allocates them.
If a node was scheduled for interleave via interleave_nodes, the
current weight (pol->cur_il_weight) will be allocated first, before
the remaining bulk calculation is done.
One piece of complexity is the interaction between a recent refactor
which split the logic to acquire the "ilx" (interleave index) of an
allocation and the actually application of the interleave. The
calculation of the `interleave index` is done by `get_vma_policy()`,
while the actual selection of the node will be later appliex by the
relevant weighted_interleave function.
Suggested-by: Hasan Al Maruf <[email protected]>
Signed-off-by: Gregory Price <[email protected]>
Co-developed-by: Rakie Kim <[email protected]>
Signed-off-by: Rakie Kim <[email protected]>
Co-developed-by: Honggyu Kim <[email protected]>
Signed-off-by: Honggyu Kim <[email protected]>
Co-developed-by: Hyeongtak Ji <[email protected]>
Signed-off-by: Hyeongtak Ji <[email protected]>
Co-developed-by: Srinivasulu Thanneeru <[email protected]>
Signed-off-by: Srinivasulu Thanneeru <[email protected]>
Co-developed-by: Ravi Jonnalagadda <[email protected]>
Signed-off-by: Ravi Jonnalagadda <[email protected]>
---
.../admin-guide/mm/numa_memory_policy.rst | 9 +
include/linux/mempolicy.h | 3 +
include/uapi/linux/mempolicy.h | 1 +
mm/mempolicy.c | 274 +++++++++++++++++-
4 files changed, 283 insertions(+), 4 deletions(-)
diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index eca38fa81e0f..a70f20ce1ffb 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -250,6 +250,15 @@ MPOL_PREFERRED_MANY
can fall back to all existing numa nodes. This is effectively
MPOL_PREFERRED allowed for a mask rather than a single node.
+MPOL_WEIGHTED_INTERLEAVE
+ This mode operates the same as MPOL_INTERLEAVE, except that
+ interleaving behavior is executed based on weights set in
+ /sys/kernel/mm/mempolicy/weighted_interleave/
+
+ Weighted interleave allocates pages on nodes according to a
+ weight. For example if nodes [0,1] are weighted [5,2], 5 pages
+ will be allocated on node0 for every 2 pages allocated on node1.
+
NUMA memory policy supports the following optional mode flags:
MPOL_F_STATIC_NODES
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 931b118336f4..c644d7bbd396 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -54,6 +54,9 @@ struct mempolicy {
nodemask_t cpuset_mems_allowed; /* relative to these nodes */
nodemask_t user_nodemask; /* nodemask passed by user */
} w;
+
+ /* Weighted interleave settings */
+ u8 cur_il_weight;
};
/*
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index a8963f7ef4c2..1f9bb10d1a47 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -23,6 +23,7 @@ enum {
MPOL_INTERLEAVE,
MPOL_LOCAL,
MPOL_PREFERRED_MANY,
+ MPOL_WEIGHTED_INTERLEAVE,
MPOL_MAX, /* always last member of enum */
};
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index b13c45a0bfcb..5a517511658e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -19,6 +19,13 @@
* for anonymous memory. For process policy an process counter
* is used.
*
+ * weighted interleave
+ * Allocate memory interleaved over a set of nodes based on
+ * a set of weights (per-node), with normal fallback if it
+ * fails. Otherwise operates the same as interleave.
+ * Example: nodeset(0,1) & weights (2,1) - 2 pages allocated
+ * on node 0 for every 1 page allocated on node 1.
+ *
* bind Only allocate memory on a specific set of nodes,
* no fallback.
* FIXME: memory is allocated starting with the first node
@@ -314,6 +321,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
policy->mode = mode;
policy->flags = flags;
policy->home_node = NUMA_NO_NODE;
+ policy->cur_il_weight = 0;
return policy;
}
@@ -426,6 +434,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
.create = mpol_new_nodemask,
.rebind = mpol_rebind_preferred,
},
+ [MPOL_WEIGHTED_INTERLEAVE] = {
+ .create = mpol_new_nodemask,
+ .rebind = mpol_rebind_nodemask,
+ },
};
static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
@@ -847,7 +859,8 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
old = current->mempolicy;
current->mempolicy = new;
- if (new && new->mode == MPOL_INTERLEAVE)
+ if (new && (new->mode == MPOL_INTERLEAVE ||
+ new->mode == MPOL_WEIGHTED_INTERLEAVE))
current->il_prev = MAX_NUMNODES-1;
task_unlock(current);
mpol_put(old);
@@ -873,6 +886,7 @@ static void get_policy_nodemask(struct mempolicy *pol, nodemask_t *nodes)
case MPOL_INTERLEAVE:
case MPOL_PREFERRED:
case MPOL_PREFERRED_MANY:
+ case MPOL_WEIGHTED_INTERLEAVE:
*nodes = pol->nodes;
break;
case MPOL_LOCAL:
@@ -957,6 +971,13 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
} else if (pol == current->mempolicy &&
pol->mode == MPOL_INTERLEAVE) {
*policy = next_node_in(current->il_prev, pol->nodes);
+ } else if (pol == current->mempolicy &&
+ (pol->mode == MPOL_WEIGHTED_INTERLEAVE)) {
+ if (pol->cur_il_weight)
+ *policy = current->il_prev;
+ else
+ *policy = next_node_in(current->il_prev,
+ pol->nodes);
} else {
err = -EINVAL;
goto out;
@@ -1769,7 +1790,8 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
* @vma: virtual memory area whose policy is sought
* @addr: address in @vma for shared policy lookup
* @order: 0, or appropriate huge_page_order for interleaving
- * @ilx: interleave index (output), for use only when MPOL_INTERLEAVE
+ * @ilx: interleave index (output), for use only when MPOL_INTERLEAVE or
+ * MPOL_WEIGHTED_INTERLEAVE
*
* Returns effective policy for a VMA at specified address.
* Falls back to current->mempolicy or system default policy, as necessary.
@@ -1786,7 +1808,8 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
pol = __get_vma_policy(vma, addr, ilx);
if (!pol)
pol = get_task_policy(current);
- if (pol->mode == MPOL_INTERLEAVE) {
+ if (pol->mode == MPOL_INTERLEAVE ||
+ pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
*ilx += vma->vm_pgoff >> order;
*ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
}
@@ -1836,6 +1859,44 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
return zone >= dynamic_policy_zone;
}
+static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
+{
+ unsigned int node, next;
+ struct task_struct *me = current;
+ u8 __rcu *table;
+ u8 weight;
+
+ node = next_node_in(me->il_prev, policy->nodes);
+ if (node == MAX_NUMNODES)
+ return node;
+
+ /* on first alloc after setting mempolicy, acquire first weight */
+ if (unlikely(!policy->cur_il_weight)) {
+ rcu_read_lock();
+ table = rcu_dereference(iw_table);
+ /* detect system-default values */
+ weight = table ? table[node] : 1;
+ policy->cur_il_weight = weight ? weight : 1;
+ rcu_read_unlock();
+ }
+
+ /* account for this allocation call */
+ policy->cur_il_weight--;
+
+ /* if now at 0, move to next node and set up that node's weight */
+ if (unlikely(!policy->cur_il_weight)) {
+ me->il_prev = node;
+ next = next_node_in(node, policy->nodes);
+ rcu_read_lock();
+ table = rcu_dereference(iw_table);
+ /* detect system-default values */
+ weight = table ? table[next] : 1;
+ policy->cur_il_weight = weight ? weight : 1;
+ rcu_read_unlock();
+ }
+ return node;
+}
+
/* Do dynamic interleaving for a process */
static unsigned int interleave_nodes(struct mempolicy *policy)
{
@@ -1870,6 +1931,9 @@ unsigned int mempolicy_slab_node(void)
case MPOL_INTERLEAVE:
return interleave_nodes(policy);
+ case MPOL_WEIGHTED_INTERLEAVE:
+ return weighted_interleave_nodes(policy);
+
case MPOL_BIND:
case MPOL_PREFERRED_MANY:
{
@@ -1908,6 +1972,39 @@ static unsigned int read_once_policy_nodemask(struct mempolicy *pol,
return nodes_weight(*mask);
}
+static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
+{
+ nodemask_t nodemask;
+ unsigned int target, nr_nodes;
+ u8 __rcu *table;
+ unsigned int weight_total = 0;
+ u8 weight;
+ int nid;
+
+ nr_nodes = read_once_policy_nodemask(pol, &nodemask);
+ if (!nr_nodes)
+ return numa_node_id();
+
+ rcu_read_lock();
+ table = rcu_dereference(iw_table);
+ /* calculate the total weight */
+ for_each_node_mask(nid, nodemask)
+ weight_total += table ? table[nid] : 1;
+
+ /* Calculate the node offset based on totals */
+ target = ilx % weight_total;
+ nid = first_node(nodemask);
+ while (target) {
+ weight = table ? table[nid] : 1;
+ if (target < weight)
+ break;
+ target -= weight;
+ nid = next_node_in(nid, nodemask);
+ }
+ rcu_read_unlock();
+ return nid;
+}
+
/*
* Do static interleaving for interleave index @ilx. Returns the ilx'th
* node in pol->nodes (starting from ilx=0), wrapping around if ilx
@@ -1968,6 +2065,11 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
*nid = (ilx == NO_INTERLEAVE_INDEX) ?
interleave_nodes(pol) : interleave_nid(pol, ilx);
break;
+ case MPOL_WEIGHTED_INTERLEAVE:
+ *nid = (ilx == NO_INTERLEAVE_INDEX) ?
+ weighted_interleave_nodes(pol) :
+ weighted_interleave_nid(pol, ilx);
+ break;
}
return nodemask;
@@ -2029,6 +2131,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
case MPOL_PREFERRED_MANY:
case MPOL_BIND:
case MPOL_INTERLEAVE:
+ case MPOL_WEIGHTED_INTERLEAVE:
*mask = mempolicy->nodes;
break;
@@ -2128,7 +2231,8 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
* If the policy is interleave or does not allow the current
* node in its nodemask, we allocate the standard way.
*/
- if (pol->mode != MPOL_INTERLEAVE &&
+ if ((pol->mode != MPOL_INTERLEAVE &&
+ pol->mode != MPOL_WEIGHTED_INTERLEAVE) &&
(!nodemask || node_isset(nid, *nodemask))) {
/*
* First, try to allocate THP only on local node, but
@@ -2264,6 +2368,156 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
return total_allocated;
}
+static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
+ struct mempolicy *pol, unsigned long nr_pages,
+ struct page **page_array)
+{
+ struct task_struct *me = current;
+ unsigned long total_allocated = 0;
+ unsigned long nr_allocated;
+ unsigned long rounds;
+ unsigned long node_pages, delta;
+ u8 weight, resume_weight;
+ u8 __rcu *table;
+ u8 *weights;
+ unsigned int weight_total = 0;
+ unsigned long rem_pages = nr_pages;
+ nodemask_t nodes;
+ int nnodes, node, resume_node, next_node;
+ int prev_node = me->il_prev;
+ int i;
+
+ if (!nr_pages)
+ return 0;
+
+ nnodes = read_once_policy_nodemask(pol, &nodes);
+ if (!nnodes)
+ return 0;
+
+ /* Continue allocating from most recent node and adjust the nr_pages */
+ if (pol->cur_il_weight) {
+ node = next_node_in(prev_node, nodes);
+ node_pages = pol->cur_il_weight;
+ if (node_pages > rem_pages)
+ node_pages = rem_pages;
+ nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
+ NULL, page_array);
+ page_array += nr_allocated;
+ total_allocated += nr_allocated;
+ /*
+ * if that's all the pages, no need to interleave, otherwise
+ * we need to set up the next interleave node/weight correctly.
+ */
+ if (rem_pages < pol->cur_il_weight) {
+ /* stay on current node, adjust cur_il_weight */
+ pol->cur_il_weight -= rem_pages;
+ return total_allocated;
+ } else if (rem_pages == pol->cur_il_weight) {
+ /* move to next node / weight */
+ me->il_prev = node;
+ next_node = next_node_in(node, nodes);
+ rcu_read_lock();
+ table = rcu_dereference(iw_table);
+ weight = table ? table[next_node] : 1;
+ /* detect system-default usage */
+ pol->cur_il_weight = weight ? weight : 1;
+ rcu_read_unlock();
+ return total_allocated;
+ }
+ /* Otherwise we adjust nr_pages down, and continue from there */
+ rem_pages -= pol->cur_il_weight;
+ pol->cur_il_weight = 0;
+ prev_node = node;
+ }
+
+ /* create a local copy of node weights to operate on outside rcu */
+ weights = kmalloc(nr_node_ids, GFP_KERNEL);
+ if (!weights)
+ return total_allocated;
+
+ rcu_read_lock();
+ table = rcu_dereference(iw_table);
+ /* If table is not registered, use system defaults */
+ if (table)
+ memcpy(weights, iw_table, nr_node_ids);
+ else
+ memset(weights, 1, nr_node_ids);
+ rcu_read_unlock();
+
+ /* calculate total, detect system default usage */
+ for_each_node_mask(node, nodes) {
+ /* detect system-default usage */
+ if (!weights[node])
+ weights[node] = 1;
+ weight_total += weights[node];
+ }
+
+ /*
+ * Now we can continue allocating from 0 instead of an offset
+ * We calculate the number of rounds and any partial rounds so
+ * that we minimize the number of calls to __alloc_pages_bulk
+ * This requires us to track which node we should resume from.
+ *
+ * if (rounds > 0) and (delta == 0), resume_node will always be
+ * the current value of prev_node, which may be NUMA_NO_NODE if
+ * this is the first allocation after a policy is replaced. The
+ * resume weight will be the weight of the next node.
+ *
+ * if (delta > 0) and delta is depleted exactly on a node-weight
+ * boundary, resume node will be the node last allocated from when
+ * delta reached 0.
+ *
+ * if (delta > 0) and delta is not depleted on a node-weight boundary,
+ * resume node will be the node prior to the node last allocated from.
+ *
+ * (rounds == 0) and (delta == 0) is not possible (earlier exit)
+ */
+ rounds = rem_pages / weight_total;
+ delta = rem_pages % weight_total;
+ resume_node = prev_node;
+ resume_weight = weights[next_node_in(prev_node, nodes)];
+ /* If no delta, we'll resume from current prev_node and first weight */
+ for (i = 0; i < nnodes; i++) {
+ node = next_node_in(prev_node, nodes);
+ weight = weights[node];
+ node_pages = weight * rounds;
+ /* If a delta exists, add this node's portion of the delta */
+ if (delta > weight) {
+ node_pages += weight;
+ delta -= weight;
+ resume_node = node;
+ } else if (delta) {
+ node_pages += delta;
+ if (delta == weight) {
+ /* resume from next node with its weight */
+ resume_node = node;
+ next_node = next_node_in(node, nodes);
+ resume_weight = weights[next_node];
+ } else {
+ /* resume from this node w/ remaining weight */
+ resume_node = prev_node;
+ resume_weight = weight - (node_pages % weight);
+ }
+ delta = 0;
+ }
+ /* node_pages can be 0 if an allocation fails and rounds == 0 */
+ if (!node_pages)
+ break;
+ nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
+ NULL, page_array);
+ page_array += nr_allocated;
+ total_allocated += nr_allocated;
+ if (total_allocated == nr_pages)
+ break;
+ prev_node = node;
+ }
+ /* resume allocating from the calculated node and weight */
+ me->il_prev = resume_node;
+ pol->cur_il_weight = resume_weight;
+ kfree(weights);
+ return total_allocated;
+}
+
static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
struct mempolicy *pol, unsigned long nr_pages,
struct page **page_array)
@@ -2304,6 +2558,10 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
return alloc_pages_bulk_array_interleave(gfp, pol,
nr_pages, page_array);
+ if (pol->mode == MPOL_WEIGHTED_INTERLEAVE)
+ return alloc_pages_bulk_array_weighted_interleave(
+ gfp, pol, nr_pages, page_array);
+
if (pol->mode == MPOL_PREFERRED_MANY)
return alloc_pages_bulk_array_preferred_many(gfp,
numa_node_id(), pol, nr_pages, page_array);
@@ -2379,6 +2637,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
case MPOL_INTERLEAVE:
case MPOL_PREFERRED:
case MPOL_PREFERRED_MANY:
+ case MPOL_WEIGHTED_INTERLEAVE:
return !!nodes_equal(a->nodes, b->nodes);
case MPOL_LOCAL:
return true;
@@ -2515,6 +2774,10 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
polnid = interleave_nid(pol, ilx);
break;
+ case MPOL_WEIGHTED_INTERLEAVE:
+ polnid = weighted_interleave_nid(pol, ilx);
+ break;
+
case MPOL_PREFERRED:
if (node_isset(curnid, pol->nodes))
goto out;
@@ -2889,6 +3152,7 @@ static const char * const policy_modes[] =
[MPOL_PREFERRED] = "prefer",
[MPOL_BIND] = "bind",
[MPOL_INTERLEAVE] = "interleave",
+ [MPOL_WEIGHTED_INTERLEAVE] = "weighted interleave",
[MPOL_LOCAL] = "local",
[MPOL_PREFERRED_MANY] = "prefer (many)",
};
@@ -2948,6 +3212,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
}
break;
case MPOL_INTERLEAVE:
+ case MPOL_WEIGHTED_INTERLEAVE:
/*
* Default to online nodes with memory if no nodelist
*/
@@ -3058,6 +3323,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
case MPOL_PREFERRED_MANY:
case MPOL_BIND:
case MPOL_INTERLEAVE:
+ case MPOL_WEIGHTED_INTERLEAVE:
nodes = pol->nodes;
break;
default:
--
2.39.1
Gregory Price <[email protected]> writes:
> When a system has multiple NUMA nodes and it becomes bandwidth hungry,
> using the current MPOL_INTERLEAVE could be an wise option.
>
> However, if those NUMA nodes consist of different types of memory such
> as socket-attached DRAM and CXL/PCIe attached DRAM, the round-robin
> based interleave policy does not optimally distribute data to make use
> of their different bandwidth characteristics.
>
> Instead, interleave is more effective when the allocation policy follows
> each NUMA nodes' bandwidth weight rather than a simple 1:1 distribution.
>
> This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE,
> enabling weighted interleave between NUMA nodes. Weighted interleave
> allows for proportional distribution of memory across multiple numa
> nodes, preferably apportioned to match the bandwidth of each node.
>
> For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1),
> with bandwidth of (100GB/s, 50GB/s) respectively, the appropriate
> weight distribution is (2:1).
>
> Weights for each node can be assigned via the new sysfs extension:
> /sys/kernel/mm/mempolicy/weighted_interleave/
>
> For now, the default value of all nodes will be `1`, which matches
> the behavior of standard 1:1 round-robin interleave. An extension
> will be added in the future to allow default values to be registered
> at kernel and device bringup time.
>
> The policy allocates a number of pages equal to the set weights. For
> example, if the weights are (2,1), then 2 pages will be allocated on
> node0 for every 1 page allocated on node1.
>
> The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2)
> and mbind(2).
>
> There are 3 integration points:
>
> weighted_interleave_nodes:
> Counts the number of allocations as they occur, and applies the
> weight for the current node. When the weight reaches 0, switch
> to the next node.
>
> weighted_interleave_nid:
> Gets the total weight of the nodemask as well as each individual
> node weight, then calculates the node based on the given index.
>
> bulk_array_weighted_interleave:
> Gets the total weight of the nodemask as well as each individual
> node weight, then calculates the number of "interleave rounds" as
> well as any delta ("partial round"). Calculates the number of
> pages for each node and allocates them.
>
> If a node was scheduled for interleave via interleave_nodes, the
> current weight (pol->cur_il_weight) will be allocated first, before
> the remaining bulk calculation is done.
>
> One piece of complexity is the interaction between a recent refactor
> which split the logic to acquire the "ilx" (interleave index) of an
> allocation and the actually application of the interleave. The
> calculation of the `interleave index` is done by `get_vma_policy()`,
> while the actual selection of the node will be later appliex by the
> relevant weighted_interleave function.
>
> Suggested-by: Hasan Al Maruf <[email protected]>
> Signed-off-by: Gregory Price <[email protected]>
> Co-developed-by: Rakie Kim <[email protected]>
> Signed-off-by: Rakie Kim <[email protected]>
> Co-developed-by: Honggyu Kim <[email protected]>
> Signed-off-by: Honggyu Kim <[email protected]>
> Co-developed-by: Hyeongtak Ji <[email protected]>
> Signed-off-by: Hyeongtak Ji <[email protected]>
> Co-developed-by: Srinivasulu Thanneeru <[email protected]>
> Signed-off-by: Srinivasulu Thanneeru <[email protected]>
> Co-developed-by: Ravi Jonnalagadda <[email protected]>
> Signed-off-by: Ravi Jonnalagadda <[email protected]>
> ---
> .../admin-guide/mm/numa_memory_policy.rst | 9 +
> include/linux/mempolicy.h | 3 +
> include/uapi/linux/mempolicy.h | 1 +
> mm/mempolicy.c | 274 +++++++++++++++++-
> 4 files changed, 283 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
> index eca38fa81e0f..a70f20ce1ffb 100644
> --- a/Documentation/admin-guide/mm/numa_memory_policy.rst
> +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
> @@ -250,6 +250,15 @@ MPOL_PREFERRED_MANY
> can fall back to all existing numa nodes. This is effectively
> MPOL_PREFERRED allowed for a mask rather than a single node.
>
> +MPOL_WEIGHTED_INTERLEAVE
> + This mode operates the same as MPOL_INTERLEAVE, except that
> + interleaving behavior is executed based on weights set in
> + /sys/kernel/mm/mempolicy/weighted_interleave/
> +
> + Weighted interleave allocates pages on nodes according to a
> + weight. For example if nodes [0,1] are weighted [5,2], 5 pages
> + will be allocated on node0 for every 2 pages allocated on node1.
> +
> NUMA memory policy supports the following optional mode flags:
>
> MPOL_F_STATIC_NODES
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index 931b118336f4..c644d7bbd396 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -54,6 +54,9 @@ struct mempolicy {
> nodemask_t cpuset_mems_allowed; /* relative to these nodes */
> nodemask_t user_nodemask; /* nodemask passed by user */
> } w;
> +
> + /* Weighted interleave settings */
> + u8 cur_il_weight;
> };
>
> /*
> diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
> index a8963f7ef4c2..1f9bb10d1a47 100644
> --- a/include/uapi/linux/mempolicy.h
> +++ b/include/uapi/linux/mempolicy.h
> @@ -23,6 +23,7 @@ enum {
> MPOL_INTERLEAVE,
> MPOL_LOCAL,
> MPOL_PREFERRED_MANY,
> + MPOL_WEIGHTED_INTERLEAVE,
> MPOL_MAX, /* always last member of enum */
> };
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index b13c45a0bfcb..5a517511658e 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -19,6 +19,13 @@
> * for anonymous memory. For process policy an process counter
> * is used.
> *
> + * weighted interleave
> + * Allocate memory interleaved over a set of nodes based on
> + * a set of weights (per-node), with normal fallback if it
> + * fails. Otherwise operates the same as interleave.
> + * Example: nodeset(0,1) & weights (2,1) - 2 pages allocated
> + * on node 0 for every 1 page allocated on node 1.
> + *
> * bind Only allocate memory on a specific set of nodes,
> * no fallback.
> * FIXME: memory is allocated starting with the first node
> @@ -314,6 +321,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
> policy->mode = mode;
> policy->flags = flags;
> policy->home_node = NUMA_NO_NODE;
> + policy->cur_il_weight = 0;
>
> return policy;
> }
> @@ -426,6 +434,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
> .create = mpol_new_nodemask,
> .rebind = mpol_rebind_preferred,
> },
> + [MPOL_WEIGHTED_INTERLEAVE] = {
> + .create = mpol_new_nodemask,
> + .rebind = mpol_rebind_nodemask,
> + },
> };
>
> static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
> @@ -847,7 +859,8 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
>
> old = current->mempolicy;
> current->mempolicy = new;
> - if (new && new->mode == MPOL_INTERLEAVE)
> + if (new && (new->mode == MPOL_INTERLEAVE ||
> + new->mode == MPOL_WEIGHTED_INTERLEAVE))
> current->il_prev = MAX_NUMNODES-1;
> task_unlock(current);
> mpol_put(old);
> @@ -873,6 +886,7 @@ static void get_policy_nodemask(struct mempolicy *pol, nodemask_t *nodes)
> case MPOL_INTERLEAVE:
> case MPOL_PREFERRED:
> case MPOL_PREFERRED_MANY:
> + case MPOL_WEIGHTED_INTERLEAVE:
> *nodes = pol->nodes;
> break;
> case MPOL_LOCAL:
> @@ -957,6 +971,13 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
> } else if (pol == current->mempolicy &&
> pol->mode == MPOL_INTERLEAVE) {
> *policy = next_node_in(current->il_prev, pol->nodes);
> + } else if (pol == current->mempolicy &&
> + (pol->mode == MPOL_WEIGHTED_INTERLEAVE)) {
> + if (pol->cur_il_weight)
> + *policy = current->il_prev;
> + else
> + *policy = next_node_in(current->il_prev,
> + pol->nodes);
It appears that my previous comments about this is ignored.
https://lore.kernel.org/linux-mm/[email protected]/
Please correct me if I am wrong.
> } else {
> err = -EINVAL;
> goto out;
> @@ -1769,7 +1790,8 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
> * @vma: virtual memory area whose policy is sought
> * @addr: address in @vma for shared policy lookup
> * @order: 0, or appropriate huge_page_order for interleaving
> - * @ilx: interleave index (output), for use only when MPOL_INTERLEAVE
> + * @ilx: interleave index (output), for use only when MPOL_INTERLEAVE or
> + * MPOL_WEIGHTED_INTERLEAVE
> *
> * Returns effective policy for a VMA at specified address.
> * Falls back to current->mempolicy or system default policy, as necessary.
> @@ -1786,7 +1808,8 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> pol = __get_vma_policy(vma, addr, ilx);
> if (!pol)
> pol = get_task_policy(current);
> - if (pol->mode == MPOL_INTERLEAVE) {
> + if (pol->mode == MPOL_INTERLEAVE ||
> + pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
> *ilx += vma->vm_pgoff >> order;
> *ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
> }
> @@ -1836,6 +1859,44 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
> return zone >= dynamic_policy_zone;
> }
>
> +static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
> +{
> + unsigned int node, next;
> + struct task_struct *me = current;
> + u8 __rcu *table;
> + u8 weight;
> +
> + node = next_node_in(me->il_prev, policy->nodes);
> + if (node == MAX_NUMNODES)
> + return node;
> +
> + /* on first alloc after setting mempolicy, acquire first weight */
> + if (unlikely(!policy->cur_il_weight)) {
> + rcu_read_lock();
> + table = rcu_dereference(iw_table);
> + /* detect system-default values */
> + weight = table ? table[node] : 1;
> + policy->cur_il_weight = weight ? weight : 1;
> + rcu_read_unlock();
> + }
> +
> + /* account for this allocation call */
> + policy->cur_il_weight--;
> +
> + /* if now at 0, move to next node and set up that node's weight */
> + if (unlikely(!policy->cur_il_weight)) {
> + me->il_prev = node;
> + next = next_node_in(node, policy->nodes);
> + rcu_read_lock();
> + table = rcu_dereference(iw_table);
> + /* detect system-default values */
> + weight = table ? table[next] : 1;
> + policy->cur_il_weight = weight ? weight : 1;
> + rcu_read_unlock();
> + }
It appears that the code could be more concise if we allow
policy->cur_il_weight == 0. Duplicated code are in
alloc_pages_bulk_array_weighted_interleave() too. Anyway, can we define
some function to reduce duplicated code.
> + return node;
> +}
> +
> /* Do dynamic interleaving for a process */
> static unsigned int interleave_nodes(struct mempolicy *policy)
> {
> @@ -1870,6 +1931,9 @@ unsigned int mempolicy_slab_node(void)
> case MPOL_INTERLEAVE:
> return interleave_nodes(policy);
>
> + case MPOL_WEIGHTED_INTERLEAVE:
> + return weighted_interleave_nodes(policy);
> +
> case MPOL_BIND:
> case MPOL_PREFERRED_MANY:
> {
> @@ -1908,6 +1972,39 @@ static unsigned int read_once_policy_nodemask(struct mempolicy *pol,
> return nodes_weight(*mask);
> }
>
> +static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
> +{
> + nodemask_t nodemask;
> + unsigned int target, nr_nodes;
> + u8 __rcu *table;
> + unsigned int weight_total = 0;
> + u8 weight;
> + int nid;
> +
> + nr_nodes = read_once_policy_nodemask(pol, &nodemask);
> + if (!nr_nodes)
> + return numa_node_id();
> +
> + rcu_read_lock();
> + table = rcu_dereference(iw_table);
> + /* calculate the total weight */
> + for_each_node_mask(nid, nodemask)
> + weight_total += table ? table[nid] : 1;
> +
> + /* Calculate the node offset based on totals */
> + target = ilx % weight_total;
> + nid = first_node(nodemask);
> + while (target) {
> + weight = table ? table[nid] : 1;
> + if (target < weight)
> + break;
> + target -= weight;
> + nid = next_node_in(nid, nodemask);
> + }
> + rcu_read_unlock();
> + return nid;
> +}
> +
> /*
> * Do static interleaving for interleave index @ilx. Returns the ilx'th
> * node in pol->nodes (starting from ilx=0), wrapping around if ilx
> @@ -1968,6 +2065,11 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
> *nid = (ilx == NO_INTERLEAVE_INDEX) ?
> interleave_nodes(pol) : interleave_nid(pol, ilx);
> break;
> + case MPOL_WEIGHTED_INTERLEAVE:
> + *nid = (ilx == NO_INTERLEAVE_INDEX) ?
> + weighted_interleave_nodes(pol) :
> + weighted_interleave_nid(pol, ilx);
> + break;
> }
>
> return nodemask;
> @@ -2029,6 +2131,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
> case MPOL_PREFERRED_MANY:
> case MPOL_BIND:
> case MPOL_INTERLEAVE:
> + case MPOL_WEIGHTED_INTERLEAVE:
> *mask = mempolicy->nodes;
> break;
>
> @@ -2128,7 +2231,8 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
> * If the policy is interleave or does not allow the current
> * node in its nodemask, we allocate the standard way.
> */
> - if (pol->mode != MPOL_INTERLEAVE &&
> + if ((pol->mode != MPOL_INTERLEAVE &&
> + pol->mode != MPOL_WEIGHTED_INTERLEAVE) &&
> (!nodemask || node_isset(nid, *nodemask))) {
> /*
> * First, try to allocate THP only on local node, but
> @@ -2264,6 +2368,156 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
> return total_allocated;
> }
>
> +static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
> + struct mempolicy *pol, unsigned long nr_pages,
> + struct page **page_array)
> +{
> + struct task_struct *me = current;
> + unsigned long total_allocated = 0;
> + unsigned long nr_allocated;
> + unsigned long rounds;
> + unsigned long node_pages, delta;
> + u8 weight, resume_weight;
> + u8 __rcu *table;
> + u8 *weights;
> + unsigned int weight_total = 0;
> + unsigned long rem_pages = nr_pages;
> + nodemask_t nodes;
> + int nnodes, node, resume_node, next_node;
> + int prev_node = me->il_prev;
> + int i;
> +
> + if (!nr_pages)
> + return 0;
> +
> + nnodes = read_once_policy_nodemask(pol, &nodes);
> + if (!nnodes)
> + return 0;
> +
> + /* Continue allocating from most recent node and adjust the nr_pages */
> + if (pol->cur_il_weight) {
> + node = next_node_in(prev_node, nodes);
> + node_pages = pol->cur_il_weight;
> + if (node_pages > rem_pages)
> + node_pages = rem_pages;
> + nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
> + NULL, page_array);
> + page_array += nr_allocated;
> + total_allocated += nr_allocated;
> + /*
> + * if that's all the pages, no need to interleave, otherwise
> + * we need to set up the next interleave node/weight correctly.
> + */
> + if (rem_pages < pol->cur_il_weight) {
> + /* stay on current node, adjust cur_il_weight */
> + pol->cur_il_weight -= rem_pages;
> + return total_allocated;
> + } else if (rem_pages == pol->cur_il_weight) {
> + /* move to next node / weight */
> + me->il_prev = node;
> + next_node = next_node_in(node, nodes);
> + rcu_read_lock();
> + table = rcu_dereference(iw_table);
> + weight = table ? table[next_node] : 1;
> + /* detect system-default usage */
> + pol->cur_il_weight = weight ? weight : 1;
> + rcu_read_unlock();
> + return total_allocated;
> + }
> + /* Otherwise we adjust nr_pages down, and continue from there */
> + rem_pages -= pol->cur_il_weight;
> + pol->cur_il_weight = 0;
This break the rule to keep pol->cur_il_weight != 0 except after initial
setup. Is it OK?
> + prev_node = node;
> + }
> +
> + /* create a local copy of node weights to operate on outside rcu */
> + weights = kmalloc(nr_node_ids, GFP_KERNEL);
> + if (!weights)
> + return total_allocated;
> +
> + rcu_read_lock();
> + table = rcu_dereference(iw_table);
> + /* If table is not registered, use system defaults */
> + if (table)
> + memcpy(weights, iw_table, nr_node_ids);
> + else
> + memset(weights, 1, nr_node_ids);
> + rcu_read_unlock();
> +
> + /* calculate total, detect system default usage */
> + for_each_node_mask(node, nodes) {
> + /* detect system-default usage */
> + if (!weights[node])
> + weights[node] = 1;
> + weight_total += weights[node];
> + }
> +
> + /*
> + * Now we can continue allocating from 0 instead of an offset
> + * We calculate the number of rounds and any partial rounds so
> + * that we minimize the number of calls to __alloc_pages_bulk
> + * This requires us to track which node we should resume from.
> + *
> + * if (rounds > 0) and (delta == 0), resume_node will always be
> + * the current value of prev_node, which may be NUMA_NO_NODE if
> + * this is the first allocation after a policy is replaced. The
> + * resume weight will be the weight of the next node.
> + *
> + * if (delta > 0) and delta is depleted exactly on a node-weight
> + * boundary, resume node will be the node last allocated from when
> + * delta reached 0.
> + *
> + * if (delta > 0) and delta is not depleted on a node-weight boundary,
> + * resume node will be the node prior to the node last allocated from.
> + *
> + * (rounds == 0) and (delta == 0) is not possible (earlier exit)
> + */
> + rounds = rem_pages / weight_total;
> + delta = rem_pages % weight_total;
> + resume_node = prev_node;
> + resume_weight = weights[next_node_in(prev_node, nodes)];
> + /* If no delta, we'll resume from current prev_node and first weight */
> + for (i = 0; i < nnodes; i++) {
> + node = next_node_in(prev_node, nodes);
> + weight = weights[node];
> + node_pages = weight * rounds;
> + /* If a delta exists, add this node's portion of the delta */
> + if (delta > weight) {
> + node_pages += weight;
> + delta -= weight;
> + resume_node = node;
> + } else if (delta) {
> + node_pages += delta;
> + if (delta == weight) {
> + /* resume from next node with its weight */
> + resume_node = node;
> + next_node = next_node_in(node, nodes);
> + resume_weight = weights[next_node];
> + } else {
> + /* resume from this node w/ remaining weight */
> + resume_node = prev_node;
> + resume_weight = weight - (node_pages % weight);
resume_weight = weight - delta; ?
> + }
> + delta = 0;
> + }
> + /* node_pages can be 0 if an allocation fails and rounds == 0 */
> + if (!node_pages)
> + break;
> + nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
> + NULL, page_array);
> + page_array += nr_allocated;
> + total_allocated += nr_allocated;
> + if (total_allocated == nr_pages)
> + break;
> + prev_node = node;
> + }
> + /* resume allocating from the calculated node and weight */
> + me->il_prev = resume_node;
> + pol->cur_il_weight = resume_weight;
> + kfree(weights);
> + return total_allocated;
> +}
> +
> static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
> struct mempolicy *pol, unsigned long nr_pages,
> struct page **page_array)
> @@ -2304,6 +2558,10 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
> return alloc_pages_bulk_array_interleave(gfp, pol,
> nr_pages, page_array);
>
> + if (pol->mode == MPOL_WEIGHTED_INTERLEAVE)
> + return alloc_pages_bulk_array_weighted_interleave(
> + gfp, pol, nr_pages, page_array);
> +
> if (pol->mode == MPOL_PREFERRED_MANY)
> return alloc_pages_bulk_array_preferred_many(gfp,
> numa_node_id(), pol, nr_pages, page_array);
> @@ -2379,6 +2637,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
> case MPOL_INTERLEAVE:
> case MPOL_PREFERRED:
> case MPOL_PREFERRED_MANY:
> + case MPOL_WEIGHTED_INTERLEAVE:
> return !!nodes_equal(a->nodes, b->nodes);
> case MPOL_LOCAL:
> return true;
> @@ -2515,6 +2774,10 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
> polnid = interleave_nid(pol, ilx);
> break;
>
> + case MPOL_WEIGHTED_INTERLEAVE:
> + polnid = weighted_interleave_nid(pol, ilx);
> + break;
> +
> case MPOL_PREFERRED:
> if (node_isset(curnid, pol->nodes))
> goto out;
> @@ -2889,6 +3152,7 @@ static const char * const policy_modes[] =
> [MPOL_PREFERRED] = "prefer",
> [MPOL_BIND] = "bind",
> [MPOL_INTERLEAVE] = "interleave",
> + [MPOL_WEIGHTED_INTERLEAVE] = "weighted interleave",
> [MPOL_LOCAL] = "local",
> [MPOL_PREFERRED_MANY] = "prefer (many)",
> };
> @@ -2948,6 +3212,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
> }
> break;
> case MPOL_INTERLEAVE:
> + case MPOL_WEIGHTED_INTERLEAVE:
> /*
> * Default to online nodes with memory if no nodelist
> */
> @@ -3058,6 +3323,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
> case MPOL_PREFERRED_MANY:
> case MPOL_BIND:
> case MPOL_INTERLEAVE:
> + case MPOL_WEIGHTED_INTERLEAVE:
> nodes = pol->nodes;
> break;
> default:
--
Best Regards,
Huang, Ying
Gregory Price <[email protected]> writes:
> In the prior patch, we carry only the current weight for a weighted
> interleave round with us across calls through the allocator path.
>
> node = next_node_in(current->il_prev, pol->nodemask)
> pol->cur_il_weight <--- this weight applies to the above node
>
> This separation of data can cause a race condition.
>
> If a cgroup-initiated task migration or mems_allowed change occurs
> from outside the context of the task, this can cause the weight to
> become stale, meaning we may end using that weight to allocate
> memory on the wrong node.
>
> Example:
> 1) task A sets (cur_il_weight = 8) and (current->il_prev) to
> node0. node1 is the next set bit in pol->nodemask
> 2) rebind event occurs, removing node1 from the nodemask.
> node2 is now the next set bit in pol->nodemask
> cur_il_weight is now stale.
> 3) allocation occurs, next_node_in(il_prev, nodes) returns
> node2. cur_il_weight is now applied to the wrong node.
>
> The upper level allocator logic must still enforce mems_allowed,
> so this isn't dangerous, but it is innaccurate.
>
> Just clearing the weight is insufficient, as it creates two more
> race conditions. The root of the issue is the separation of weight
> and node data between nodemask and cur_il_weight.
>
> To solve this, update cur_il_weight to be an atomic_t, and place the
> node that the weight applies to in the upper bits of the field:
>
> atomic_t cur_il_weight
> node bits 32:8
> weight bits 7:0
>
> Now retrieving or clearing the active interleave node and weight
> is a single atomic operation, and we are not dependent on the
> potentially changing state of (pol->nodemask) to determine what
> node the weight applies to.
>
> Two special observations:
> - if the weight is non-zero, cur_il_weight must *always* have a
> valid node number, e.g. it cannot be NUMA_NO_NODE (-1).
IIUC, we don't need that, "MAX_NUMNODES-1" is used instead.
> This is because we steal the top byte for the weight.
>
> - MAX_NUMNODES is presently limited to 1024 or less on every
> architecture. This would permanently limit MAX_NUMNODES to
> an absolute maximum of (1 << 24) to avoid overflows.
>
> Per some reading and discussion, it appears that max nodes is
> limited to 1024 so that zone type still fits in page flags, so
> this method seemed preferable compared to the alternatives of
> trying to make all or part of mempolicy RCU protected (which
> may not be possible, since it is often referenced during code
> chunks which call operations that may sleep).
>
> Signed-off-by: Gregory Price <[email protected]>
> ---
> include/linux/mempolicy.h | 2 +-
> mm/mempolicy.c | 93 +++++++++++++++++++++++++--------------
> 2 files changed, 61 insertions(+), 34 deletions(-)
>
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index c644d7bbd396..8108fc6e96ca 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -56,7 +56,7 @@ struct mempolicy {
> } w;
>
> /* Weighted interleave settings */
> - u8 cur_il_weight;
> + atomic_t cur_il_weight;
If we use this field for node and weight, why not change the field name?
For example, cur_wil_node_weight.
> };
>
> /*
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 5a517511658e..41b5fef0a6f5 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -321,7 +321,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
> policy->mode = mode;
> policy->flags = flags;
> policy->home_node = NUMA_NO_NODE;
> - policy->cur_il_weight = 0;
> + atomic_set(&policy->cur_il_weight, 0);
>
> return policy;
> }
> @@ -356,6 +356,7 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
> tmp = *nodes;
>
> pol->nodes = tmp;
> + atomic_set(&pol->cur_il_weight, 0);
> }
>
> static void mpol_rebind_preferred(struct mempolicy *pol,
> @@ -973,8 +974,10 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
> *policy = next_node_in(current->il_prev, pol->nodes);
> } else if (pol == current->mempolicy &&
> (pol->mode == MPOL_WEIGHTED_INTERLEAVE)) {
> - if (pol->cur_il_weight)
> - *policy = current->il_prev;
> + int cweight = atomic_read(&pol->cur_il_weight);
> +
> + if (cweight & 0xFF)
> + *policy = cweight >> 8;
Please define some helper functions or macros instead of operate on bits
directly.
> else
> *policy = next_node_in(current->il_prev,
> pol->nodes);
If we record current node in pol->cur_il_weight, why do we still need
curren->il_prev. Can we only use pol->cur_il_weight? And if so, we can
even make current->il_prev a union.
--
Best Regards,
Huang, Ying
[snip]
On Fri, Jan 26, 2024 at 03:10:49PM +0800, Huang, Ying wrote:
> Gregory Price <[email protected]> writes:
>
> > + } else if (pol == current->mempolicy &&
> > + (pol->mode == MPOL_WEIGHTED_INTERLEAVE)) {
> > + if (pol->cur_il_weight)
> > + *policy = current->il_prev;
> > + else
> > + *policy = next_node_in(current->il_prev,
> > + pol->nodes);
>
> It appears that my previous comments about this is ignored.
>
> https://lore.kernel.org/linux-mm/[email protected]/
>
> Please correct me if I am wrong.
>
The fix is in the following patch. I'd originally planned to squash the
atomic patch into this one, but decided against it because it probably
warranted isolated scrutiny.
@@ -973,8 +974,10 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
*policy = next_node_in(current->il_prev, pol->nodes);
} else if (pol == current->mempolicy &&
(pol->mode == MPOL_WEIGHTED_INTERLEAVE)) {
- if (pol->cur_il_weight)
- *policy = current->il_prev;
+ int cweight = atomic_read(&pol->cur_il_weight);
+
+ if (cweight & 0xFF)
+ *policy = cweight >> 8;
in this we return the node the weight applies to, otherwise we return
whatever is after il_prev.
I can pull this fix ahead.
> > + /* if now at 0, move to next node and set up that node's weight */
> > + if (unlikely(!policy->cur_il_weight)) {
> > + me->il_prev = node;
> > + next = next_node_in(node, policy->nodes);
> > + rcu_read_lock();
> > + table = rcu_dereference(iw_table);
> > + /* detect system-default values */
> > + weight = table ? table[next] : 1;
> > + policy->cur_il_weight = weight ? weight : 1;
> > + rcu_read_unlock();
> > + }
>
> It appears that the code could be more concise if we allow
> policy->cur_il_weight == 0. Duplicated code are in
> alloc_pages_bulk_array_weighted_interleave() too. Anyway, can we define
> some function to reduce duplicated code.
>
This is kind of complicated by the next patch, which places the node and
the weight into the same field to resolve the stale weight issue.
In that patch (cur_il_weight = 0) means "cur_il_weight invalid",
because the weight part can only be 0 when:
a) an error occuring during bulk allocation
b) a rebind event
I'll take some time to think about whether we can do away with
task->il_prev (as your next patch notes mentioned).
> > + /* Otherwise we adjust nr_pages down, and continue from there */
> > + rem_pages -= pol->cur_il_weight;
> > + pol->cur_il_weight = 0;
>
> This break the rule to keep pol->cur_il_weight != 0 except after initial
> setup. Is it OK?
>
The only way cur_il_weight can leave this function 0 at this point is if
an error occurs (specifically the failure to kmalloc immediately next).
If we don't clear cur_il_weight here, then we have a stale weight, and
the next allocation pass will over-allocate on the current node.
This semantic also changes a bit in the next patch, but is basically the
same. If il_weight is 0, then either an error occurred or a rebind
event occured.
> > + /* resume from this node w/ remaining weight */
> > + resume_node = prev_node;
> > + resume_weight = weight - (node_pages % weight);
>
> resume_weight = weight - delta; ?
>
ack
~Gregory
On Fri, Jan 26, 2024 at 03:40:27PM +0800, Huang, Ying wrote:
> Gregory Price <[email protected]> writes:
>
> > Two special observations:
> > - if the weight is non-zero, cur_il_weight must *always* have a
> > valid node number, e.g. it cannot be NUMA_NO_NODE (-1).
>
> IIUC, we don't need that, "MAX_NUMNODES-1" is used instead.
>
Correct, I just thought it pertinent to call this out explicitly since
I'm stealing the top byte, but the node value has traditionally been a
full integer.
This may be relevant should anyone try to carry, a random node value
into this field. For example, if someone tried to copy policy->home_node
into cur_il_weight for whatever reason.
It's worth breaking out a function to defend against this - plus to hide
the bit operations directly as you recommend below.
> > /* Weighted interleave settings */
> > - u8 cur_il_weight;
> > + atomic_t cur_il_weight;
>
> If we use this field for node and weight, why not change the field name?
> For example, cur_wil_node_weight.
>
ack.
> > + if (cweight & 0xFF)
> > + *policy = cweight >> 8;
>
> Please define some helper functions or macros instead of operate on bits
> directly.
>
ack.
> > else
> > *policy = next_node_in(current->il_prev,
> > pol->nodes);
>
> If we record current node in pol->cur_il_weight, why do we still need
> curren->il_prev. Can we only use pol->cur_il_weight? And if so, we can
> even make current->il_prev a union.
>
I just realized that there's a problem here for shared memory policies.
from weighted_interleave_nodes, I do this:
cur_weight = atomic_read(&policy->cur_il_weight);
..
weight--;
..
atomic_set(&policy->cur_il_weight, cur_weight);
On a shared memory policy, this is a race condition.
I don't think we can combine il_prev and cur_wil_node_weight because
the task policy may be different than the current policy.
i.e. it's totally valid to do the following:
1) set_mempolicy(MPOL_INTERLEAVE)
2) mbind(..., MPOL_WEIGHTED_INTERLEAVE)
Using current->il_prev between these two policies, is just plain incorrect,
so I will need to rethink this, and the existing code will need to be
updated such that weighted_interleave does not use current->il_prev.
~Gregory
Gregory Price <[email protected]> writes:
> On Fri, Jan 26, 2024 at 03:40:27PM +0800, Huang, Ying wrote:
>> Gregory Price <[email protected]> writes:
>>
>> > Two special observations:
>> > - if the weight is non-zero, cur_il_weight must *always* have a
>> > valid node number, e.g. it cannot be NUMA_NO_NODE (-1).
>>
>> IIUC, we don't need that, "MAX_NUMNODES-1" is used instead.
>>
>
> Correct, I just thought it pertinent to call this out explicitly since
> I'm stealing the top byte, but the node value has traditionally been a
> full integer.
>
> This may be relevant should anyone try to carry, a random node value
> into this field. For example, if someone tried to copy policy->home_node
> into cur_il_weight for whatever reason.
>
> It's worth breaking out a function to defend against this - plus to hide
> the bit operations directly as you recommend below.
>
>> > /* Weighted interleave settings */
>> > - u8 cur_il_weight;
>> > + atomic_t cur_il_weight;
>>
>> If we use this field for node and weight, why not change the field name?
>> For example, cur_wil_node_weight.
>>
>
> ack.
>
>> > + if (cweight & 0xFF)
>> > + *policy = cweight >> 8;
>>
>> Please define some helper functions or macros instead of operate on bits
>> directly.
>>
>
> ack.
>
>> > else
>> > *policy = next_node_in(current->il_prev,
>> > pol->nodes);
>>
>> If we record current node in pol->cur_il_weight, why do we still need
>> curren->il_prev. Can we only use pol->cur_il_weight? And if so, we can
>> even make current->il_prev a union.
>>
>
> I just realized that there's a problem here for shared memory policies.
>
> from weighted_interleave_nodes, I do this:
>
> cur_weight = atomic_read(&policy->cur_il_weight);
> ...
> weight--;
> ...
> atomic_set(&policy->cur_il_weight, cur_weight);
>
> On a shared memory policy, this is a race condition.
>
>
> I don't think we can combine il_prev and cur_wil_node_weight because
> the task policy may be different than the current policy.
>
> i.e. it's totally valid to do the following:
>
> 1) set_mempolicy(MPOL_INTERLEAVE)
> 2) mbind(..., MPOL_WEIGHTED_INTERLEAVE)
>
> Using current->il_prev between these two policies, is just plain incorrect,
> so I will need to rethink this, and the existing code will need to be
> updated such that weighted_interleave does not use current->il_prev.
IIUC, weighted_interleave_nodes() is only used for mempolicy of tasks
(set_mempolicy()), as in the following code.
+ *nid = (ilx == NO_INTERLEAVE_INDEX) ?
+ weighted_interleave_nodes(pol) :
+ weighted_interleave_nid(pol, ilx);
But, in contrast, it's bad to put task-local "current weight" in
mempolicy. So, I think that it's better to move cur_il_weight to
task_struct. And maybe combine it with current->il_prev.
--
Best Regards,
Huang, Ying
On Mon, Jan 29, 2024 at 04:17:46PM +0800, Huang, Ying wrote:
> Gregory Price <[email protected]> writes:
>
> > Using current->il_prev between these two policies, is just plain incorrect,
> > so I will need to rethink this, and the existing code will need to be
> > updated such that weighted_interleave does not use current->il_prev.
>
> IIUC, weighted_interleave_nodes() is only used for mempolicy of tasks
> (set_mempolicy()), as in the following code.
>
> + *nid = (ilx == NO_INTERLEAVE_INDEX) ?
> + weighted_interleave_nodes(pol) :
> + weighted_interleave_nid(pol, ilx);
>
Was digging through this the past couple of days. It does look like
this is true - because if (pol) comes from a vma, ilx will not be
NO_INTERLEAVE_INDEX. If this changes in the future, however,
weighted_interleave_nodes may begin to miscount under heavy contention.
It may be worth documenting this explicitly, because this is incredibly
non-obvious. I will add a comment to this chunk here.
> But, in contrast, it's bad to put task-local "current weight" in
> mempolicy. So, I think that it's better to move cur_il_weight to
> task_struct. And maybe combine it with current->il_prev.
>
Given all of this, I think is reasonably. That is effectively what is
happening anyway for anyone that just uses `numactl -w --interleave=...`
Style question: is it preferable add an anonymous union into task_struct:
union {
short il_prev;
atomic_t wil_node_weight;
};
Or should I break out that union explicitly in mempolicy.h?
The latter involves additional code updates in mempolicy.c for the union
name (current->___.il_prev) but it lets us add documentation to mempolicy.h
~Gregory
On Mon, Jan 29, 2024 at 10:48:47AM -0500, Gregory Price wrote:
> On Mon, Jan 29, 2024 at 04:17:46PM +0800, Huang, Ying wrote:
> > Gregory Price <[email protected]> writes:
> >
> > But, in contrast, it's bad to put task-local "current weight" in
> > mempolicy. So, I think that it's better to move cur_il_weight to
> > task_struct. And maybe combine it with current->il_prev.
> >
> Style question: is it preferable add an anonymous union into task_struct:
>
> union {
> short il_prev;
> atomic_t wil_node_weight;
> };
>
> Or should I break out that union explicitly in mempolicy.h?
>
Having attempted this, it looks like including mempolicy.h into sched.h
is a non-starter. There are build issues likely associated from the
nested include of uapi/linux/mempolicy.h
So I went ahead and did the following. Style-wise If it's better to just
integrate this as an anonymous union in task_struct, let me know, but it
seemed better to add some documentation here.
I also added static get/set functions to mempolicy.c to touch these
values accordingly.
As suggested, I changed things to allow 0-weight in il_prev.node_weight
adjusted the logic accordingly. Will be testing this for a day or so
before sending out new patches.
~Gregory
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ffe8f618ab86..f0d2af3bbc69 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -745,6 +745,29 @@ struct kmap_ctrl {
#endif
};
+
+/*
+ * Describes task_struct interleave settings
+ *
+ * Interleave uses mpol_interleave.node
+ * last node allocated from
+ * intended for use in next_node_in() on the next allocation
+ *
+ * Weighted interleave uses mpol_interleave.node_weight
+ * node is the value of the current node to allocate from
+ * weight is the number of allocations left on that node
+ * when weight is 0, next_node_in(node) will be invoked
+ */
+union mpol_interleave {
+ struct {
+ short node;
+ short resv;
+ };
+ /* structure: short node; u8 resv; u8 weight; */
+ atomic_t node_weight;
+};
+
+
struct task_struct {
#ifdef CONFIG_THREAD_INFO_IN_TASK
/*
@@ -1258,7 +1281,7 @@ struct task_struct {
#ifdef CONFIG_NUMA
/* Protected by alloc_lock: */
struct mempolicy *mempolicy;
- short il_prev;
+ union mpol_interleave il_prev;
short pref_node_fork;
#endif
#ifdef CONFIG_NUMA_BALANCING
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 92740b8f0eb5..48e365b507b2 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -149,6 +149,66 @@ static struct mempolicy preferred_node_policy[MAX_NUMNODES];
static u8 __rcu *iw_table;
static DEFINE_MUTEX(iw_table_lock);
+static u8 get_il_weight(int node)
+{
+ u8 __rcu *table;
+ u8 weight;
+
+ rcu_read_lock();
+ table = rcu_dereference(iw_table);
+ /* if no iw_table, use system default */
+ weight = table ? table[node] : 1;
+ /* if value in iw_table is 0, use system default */
+ weight = weight ? weight : 1;
+ rcu_read_unlock();
+ return weight;
+}
+
+/* Clear any interleave values from task->il_prev */
+static void clear_il_prev(void)
+{
+ int node_weight;
+
+ node_weight = MAKE_WIL_PREV(MAX_NUMNODES - 1, 0);
+ atomic_set(¤t->il_prev.node_weight, node_weight);
+}
+
+/* get the next value for weighted interleave */
+static void get_wil_prev(int *node, u8 *weight)
+{
+ int node_weight;
+
+ node_weight = atomic_read(¤t->il_prev.node_weight);
+ *node = WIL_NODE(node_weight);
+ *weight = WIL_WEIGHT(node_weight);
+}
+
+/* set the next value for weighted interleave */
+static void set_wil_prev(int node, u8 weight)
+{
+ int node_weight;
+
+ if (node == MAX_NUMNODES)
+ node -= 1;
+ node_weight = MAKE_WIL_PREV(node, weight);
+ atomic_set(¤t->il_prev.node_weight, node_weight);
+}
+
+/* get the previous interleave node */
+static short get_il_prev(void)
+{
+ return current->il_prev.node;
+}
+
+/* set the previous interleave node */
+static void set_il_prev(int node)
+{
+ if (unlikely(node >= MAX_NUMNODES))
+ node = MAX_NUMNODES - 1;
+
+ current->il_prev.node = node;
+}
+
Gregory Price <[email protected]> writes:
> On Mon, Jan 29, 2024 at 10:48:47AM -0500, Gregory Price wrote:
>> On Mon, Jan 29, 2024 at 04:17:46PM +0800, Huang, Ying wrote:
>> > Gregory Price <[email protected]> writes:
>> >
>> > But, in contrast, it's bad to put task-local "current weight" in
>> > mempolicy. So, I think that it's better to move cur_il_weight to
>> > task_struct. And maybe combine it with current->il_prev.
>> >
>> Style question: is it preferable add an anonymous union into task_struct:
>>
>> union {
>> short il_prev;
>> atomic_t wil_node_weight;
>> };
>>
>> Or should I break out that union explicitly in mempolicy.h?
>>
>
> Having attempted this, it looks like including mempolicy.h into sched.h
> is a non-starter. There are build issues likely associated from the
> nested include of uapi/linux/mempolicy.h
>
> So I went ahead and did the following. Style-wise If it's better to just
> integrate this as an anonymous union in task_struct, let me know, but it
> seemed better to add some documentation here.
>
> I also added static get/set functions to mempolicy.c to touch these
> values accordingly.
>
> As suggested, I changed things to allow 0-weight in il_prev.node_weight
> adjusted the logic accordingly. Will be testing this for a day or so
> before sending out new patches.
>
Thanks about this again. It seems that we don't need to touch
task->il_prev and task->il_weight during rebinding for weighted
interleave too.
For weighted interleaving, il_prev is the node used for previous
allocation, il_weight is the weight after previous allocation. So
weighted_interleave_nodes() could be as follows,
unsigned int weighted_interleave_nodes(struct mempolicy *policy)
{
unsigned int nid;
struct task_struct *me = current;
nid = me->il_prev;
if (!me->il_weight || !node_isset(nid, policy->nodes)) {
nid = next_node_in(...);
me->il_prev = nid;
me->il_weight = weights[nid];
}
me->il_weight--;
return nid;
}
If this works, we can just add il_weight into task_struct.
--
Best Regards,
Huang, Ying
On Tue, Jan 30, 2024 at 11:15:35AM +0800, Huang, Ying wrote:
> Gregory Price <[email protected]> writes:
>
> > On Mon, Jan 29, 2024 at 10:48:47AM -0500, Gregory Price wrote:
> >> On Mon, Jan 29, 2024 at 04:17:46PM +0800, Huang, Ying wrote:
> >> > Gregory Price <[email protected]> writes:
> >> >
> >> > But, in contrast, it's bad to put task-local "current weight" in
> >> > mempolicy. So, I think that it's better to move cur_il_weight to
> >> > task_struct. And maybe combine it with current->il_prev.
> >> >
> >> Style question: is it preferable add an anonymous union into task_struct:
> >>
> >> union {
> >> short il_prev;
> >> atomic_t wil_node_weight;
> >> };
> >>
> >> Or should I break out that union explicitly in mempolicy.h?
> >>
> >
> > Having attempted this, it looks like including mempolicy.h into sched.h
> > is a non-starter. There are build issues likely associated from the
> > nested include of uapi/linux/mempolicy.h
> >
> > So I went ahead and did the following. Style-wise If it's better to just
> > integrate this as an anonymous union in task_struct, let me know, but it
> > seemed better to add some documentation here.
> >
> > I also added static get/set functions to mempolicy.c to touch these
> > values accordingly.
> >
> > As suggested, I changed things to allow 0-weight in il_prev.node_weight
> > adjusted the logic accordingly. Will be testing this for a day or so
> > before sending out new patches.
> >
>
> Thanks about this again. It seems that we don't need to touch
> task->il_prev and task->il_weight during rebinding for weighted
> interleave too.
>
It's not clear to me this is the case. cpusets takes the task_lock to
change mems_allowed and rebind task->mempolicy, but I do not see the
task lock access blocking allocations.
Comments from cpusets suggest allocations can happen in parallel.
/*
* cpuset_change_task_nodemask - change task's mems_allowed and mempolicy
* @tsk: the task to change
* @newmems: new nodes that the task will be set
*
* We use the mems_allowed_seq seqlock to safely update both tsk->mems_allowed
* and rebind an eventual tasks' mempolicy. If the task is allocating in
* parallel, it might temporarily see an empty intersection, which results in
* a seqlock check and retry before OOM or allocation failure.
*/
For normal interleave, this isn't an issue because it always proceeds to
the next node. The same is not true of weighted interleave, which may
have a hanging weight in task->il_weight.
That is why I looked to combine the two, so at least node/weight were
carried together.
> unsigned int weighted_interleave_nodes(struct mempolicy *policy)
> {
> unsigned int nid;
> struct task_struct *me = current;
>
> nid = me->il_prev;
> if (!me->il_weight || !node_isset(nid, policy->nodes)) {
> nid = next_node_in(...);
> me->il_prev = nid;
> me->il_weight = weights[nid];
> }
> me->il_weight--;
>
> return nid;
> }
I ended up with this:
static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
{
unsigned int node;
u8 weight;
get_wil_prev(&node, &weight);
/* If nodemask was rebound, just fetch the next node */
if (!weight) {
node = next_node_in(node, policy->nodes);
/* can only happen if nodemask has become invalid */
if (node == MAX_NUMNODES)
return node;
weight = get_il_weight(node);
}
weight--;
set_wil_prev(node, weight);
return node;
}
~Gregory
Gregory Price <[email protected]> writes:
> On Tue, Jan 30, 2024 at 11:15:35AM +0800, Huang, Ying wrote:
>> Gregory Price <[email protected]> writes:
>>
>> > On Mon, Jan 29, 2024 at 10:48:47AM -0500, Gregory Price wrote:
>> >> On Mon, Jan 29, 2024 at 04:17:46PM +0800, Huang, Ying wrote:
>> >> > Gregory Price <[email protected]> writes:
>> >> >
>> >> > But, in contrast, it's bad to put task-local "current weight" in
>> >> > mempolicy. So, I think that it's better to move cur_il_weight to
>> >> > task_struct. And maybe combine it with current->il_prev.
>> >> >
>> >> Style question: is it preferable add an anonymous union into task_struct:
>> >>
>> >> union {
>> >> short il_prev;
>> >> atomic_t wil_node_weight;
>> >> };
>> >>
>> >> Or should I break out that union explicitly in mempolicy.h?
>> >>
>> >
>> > Having attempted this, it looks like including mempolicy.h into sched.h
>> > is a non-starter. There are build issues likely associated from the
>> > nested include of uapi/linux/mempolicy.h
>> >
>> > So I went ahead and did the following. Style-wise If it's better to just
>> > integrate this as an anonymous union in task_struct, let me know, but it
>> > seemed better to add some documentation here.
>> >
>> > I also added static get/set functions to mempolicy.c to touch these
>> > values accordingly.
>> >
>> > As suggested, I changed things to allow 0-weight in il_prev.node_weight
>> > adjusted the logic accordingly. Will be testing this for a day or so
>> > before sending out new patches.
>> >
>>
>> Thanks about this again. It seems that we don't need to touch
>> task->il_prev and task->il_weight during rebinding for weighted
>> interleave too.
>>
>
> It's not clear to me this is the case. cpusets takes the task_lock to
> change mems_allowed and rebind task->mempolicy, but I do not see the
> task lock access blocking allocations.
>
> Comments from cpusets suggest allocations can happen in parallel.
>
> /*
> * cpuset_change_task_nodemask - change task's mems_allowed and mempolicy
> * @tsk: the task to change
> * @newmems: new nodes that the task will be set
> *
> * We use the mems_allowed_seq seqlock to safely update both tsk->mems_allowed
> * and rebind an eventual tasks' mempolicy. If the task is allocating in
> * parallel, it might temporarily see an empty intersection, which results in
> * a seqlock check and retry before OOM or allocation failure.
> */
>
>
> For normal interleave, this isn't an issue because it always proceeds to
> the next node. The same is not true of weighted interleave, which may
> have a hanging weight in task->il_weight.
So, I added a check as follows,
node_isset(current->il_prev, policy->nodes)
If prev node is removed from nodemask, allocation will proceed to the
next node. Otherwise, it's safe to use current->il_weight.
--
Best Regards,
Huang, Ying
> That is why I looked to combine the two, so at least node/weight were
> carried together.
>
>> unsigned int weighted_interleave_nodes(struct mempolicy *policy)
>> {
>> unsigned int nid;
>> struct task_struct *me = current;
>>
>> nid = me->il_prev;
>> if (!me->il_weight || !node_isset(nid, policy->nodes)) {
>> nid = next_node_in(...);
>> me->il_prev = nid;
>> me->il_weight = weights[nid];
>> }
>> me->il_weight--;
>>
>> return nid;
>> }
>
> I ended up with this:
>
> static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
> {
> unsigned int node;
> u8 weight;
>
> get_wil_prev(&node, &weight);
> /* If nodemask was rebound, just fetch the next node */
> if (!weight) {
> node = next_node_in(node, policy->nodes);
> /* can only happen if nodemask has become invalid */
> if (node == MAX_NUMNODES)
> return node;
> weight = get_il_weight(node);
> }
> weight--;
> set_wil_prev(node, weight);
> return node;
> }
>
> ~Gregory
On Tue, Jan 30, 2024 at 01:18:30PM +0800, Huang, Ying wrote:
> Gregory Price <[email protected]> writes:
>
> > For normal interleave, this isn't an issue because it always proceeds to
> > the next node. The same is not true of weighted interleave, which may
> > have a hanging weight in task->il_weight.
>
> So, I added a check as follows,
>
> node_isset(current->il_prev, policy->nodes)
>
> If prev node is removed from nodemask, allocation will proceed to the
> next node. Otherwise, it's safe to use current->il_weight.
>
Funny enough I have this on one of my branches and dropped it, but after
digging through everything - this should be sufficient.
I'll just add il_weight next to il_prev and have a new set of patches
out today. Code is already there, just needs one last cleanup pass.
~Gregory