Hi Andrew,
This is *hopefully* the final major update to this line.
Full version nodes at the end of initial cover letter chunk.
(v4: style, task->il_weight, uninitialized values, docs)
---
Weighted interleave is a new interleave policy intended to make
use of heterogeneous memory environments appearing with CXL.
The existing interleave mechanism does an even round-robin
distribution of memory across all nodes in a nodemask, while
weighted interleave distributes memory across nodes according
to a provided weight. (Weight = # of page allocations per round)
Weighted interleave is intended to reduce average latency when
bandwidth is pressured - therefore increasing total throughput.
In other words: It allows greater use of the total available
bandwidth in a heterogeneous hardware environment (different
hardware provides different bandwidth capacity).
As bandwidth is pressured, latency increases - first linearly
and then exponentially. By keeping bandwidth usage distributed
according to available bandwidth, we therefore can reduce the
average latency of a cacheline fetch.
A good explanation of the bandwidth vs latency response curve:
https://mahmoudhatem.wordpress.com/2017/11/07/memory-bandwidth-vs-latency-response-curve/
From the article:
```
Constant region:
The latency response is fairly constant for the first 40%
of the sustained bandwidth.
Linear region:
In between 40% to 80% of the sustained bandwidth, the
latency response increases almost linearly with the bandwidth
demand of the system due to contention overhead by numerous
memory requests.
Exponential region:
Between 80% to 100% of the sustained bandwidth, the memory
latency is dominated by the contention latency which can be
as much as twice the idle latency or more.
Maximum sustained bandwidth :
Is 65% to 75% of the theoretical maximum bandwidth.
```
As a general rule of thumb:
* If bandwidth usage is low, latency does not increase. It is
optimal to place data in the nearest (lowest latency) device.
* If bandwidth usage is high, latency increases. It is optimal
to place data such that bandwidth use is optimized per-device.
This is the top line goal: Provide a user a mechanism to target using
the "maximum sustained bandwidth" of each hardware component in a
heterogenous memory system.
For example, the stream benchmark demonstrates that 1:1 (default)
interleave is actively harmful, while weighted interleave can be
beneficial. Default interleave distributes data such that too much
pressure is placed on devices with lower available bandwidth.
Stream Benchmark (vs DRAM, 1 Socket + 1 CXL Device)
Default interleave : -78% (slower than DRAM)
Global weighting : -6% to +4% (workload dependant)
Targeted weights : +2.5% to +4% (consistently better than DRAM)
Global means the task-policy was set (set_mempolicy), while targeted
means VMA policies were set (mbind2). We see weighted interleave
is not always beneficial when applied globally, but is always
beneficial when applied to bandwidth-driving memory regions.
There are 3 patches in this set:
1) Implement system-global interleave weights as sysfs extension
in mm/mempolicy.c. These weights are RCU protected, and a
default weight set is provided (all weights are 1 by default).
In future work, we intend to expose an interface for HMAT/CDAT
code to set reasonable default values based on the memory
configuration of the system discovered at boot/hotplug.
2) A mild refactor of some interleave-logic for re-use in the
new weighted interleave logic.
3) MPOL_WEIGHTED_INTERLEAVE extension for set_mempolicy/mbind
Included below are some performance and LTP test information,
and a sample numactl branch which can be used for testing.
= Performance summary =
(tests may have different configurations, see extended info below)
1) MLC (W2) : +38% over DRAM. +264% over default interleave.
MLC (W5) : +40% over DRAM. +226% over default interleave.
2) Stream : -6% to +4% over DRAM, +430% over default interleave.
3) XSBench : +19% over DRAM. +47% over default interleave.
= LTP Testing Summary =
existing mempolicy & mbind tests: pass
mempolicy & mbind + weighted interleave (global weights): pass
= version history
v4:
- style fixes, code deduplication, simplifications, comments
- moved mempolicy->il_weight to task_struct->il_weight
- changed logic to simply move forward on il_weight=0 rather than
treat il_weight=0 as a special value
- detect when il_prev is no longer in the nodemask and move forward
- missed weighted interleave check in alloc_pages_mpol()
- uninitialized value of nr_allocated in bulk allocator
=====================================================================
Performance tests - MLC
From - Ravi Jonnalagadda <[email protected]>
Hardware: Single-socket, multiple CXL memory expanders.
Workload: W2
Data Signature: 2:1 read:write
DRAM only bandwidth (GBps): 298.8
DRAM + CXL (default interleave) (GBps): 113.04
DRAM + CXL (weighted interleave)(GBps): 412.5
Gain over DRAM only: 1.38x
Gain over default interleave: 2.64x
Workload: W5
Data Signature: 1:1 read:write
DRAM only bandwidth (GBps): 273.2
DRAM + CXL (default interleave) (GBps): 117.23
DRAM + CXL (weighted interleave)(GBps): 382.7
Gain over DRAM only: 1.4x
Gain over default interleave: 2.26x
=====================================================================
Performance test - Stream
From - Gregory Price <[email protected]>
Hardware: Single socket, single CXL expander
numactl extension: https://github.com/gmprice/numactl/tree/weighted_interleave_master
Summary: 64 threads, ~18GB workload, 3GB per array, executed 100 times
Default interleave : -78% (slower than DRAM)
Global weighting : -6% to +4% (workload dependant)
mbind2 weights : +2.5% to +4% (consistently better than DRAM)
dram only:
numactl --cpunodebind=1 --membind=1 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
Function Direction BestRateMBs AvgTime MinTime MaxTime
Copy: 0->0 200923.2 0.032662 0.031853 0.033301
Scale: 0->0 202123.0 0.032526 0.031664 0.032970
Add: 0->0 208873.2 0.047322 0.045961 0.047884
Triad: 0->0 208523.8 0.047262 0.046038 0.048414
CXL-only:
numactl --cpunodebind=1 -w --membind=2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
Copy: 0->0 22209.7 0.288661 0.288162 0.289342
Scale: 0->0 22288.2 0.287549 0.287147 0.288291
Add: 0->0 24419.1 0.393372 0.393135 0.393735
Triad: 0->0 24484.6 0.392337 0.392083 0.394331
Based on the above, the optimal weights are ~9:1
echo 9 > /sys/kernel/mm/mempolicy/weighted_interleave/node1
echo 1 > /sys/kernel/mm/mempolicy/weighted_interleave/node2
default interleave:
numactl --cpunodebind=1 --interleave=1,2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
Copy: 0->0 44666.2 0.143671 0.143285 0.144174
Scale: 0->0 44781.6 0.143256 0.142916 0.143713
Add: 0->0 48600.7 0.197719 0.197528 0.197858
Triad: 0->0 48727.5 0.197204 0.197014 0.197439
global weighted interleave:
numactl --cpunodebind=1 -w --interleave=1,2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
Copy: 0->0 190085.9 0.034289 0.033669 0.034645
Scale: 0->0 207677.4 0.031909 0.030817 0.033061
Add: 0->0 202036.8 0.048737 0.047516 0.053409
Triad: 0->0 217671.5 0.045819 0.044103 0.046755
targted regions w/ global weights (modified stream to mbind2 malloc'd regions))
numactl --cpunodebind=1 --membind=1 ./stream_c.exe -b --ntimes 100 --array-size 400M --malloc
Copy: 0->0 205827.0 0.031445 0.031094 0.031984
Scale: 0->0 208171.8 0.031320 0.030744 0.032505
Add: 0->0 217352.0 0.045087 0.044168 0.046515
Triad: 0->0 216884.8 0.045062 0.044263 0.046982
=====================================================================
Performance tests - XSBench
From - Hyeongtak Ji <[email protected]>
Hardware: Single socket, Single CXL memory Expander
NUMA node 0: 56 logical cores, 128 GB memory
NUMA node 2: 96 GB CXL memory
Threads: 56
Lookups: 170,000,000
Summary: +19% over DRAM. +47% over default interleave.
Performance tests - XSBench
1. dram only
$ numactl -m 0 ./XSBench -s XL –p 5000000
Runtime: 36.235 seconds
Lookups/s: 4,691,618
2. default interleave
$ numactl –i 0,2 ./XSBench –s XL –p 5000000
Runtime: 55.243 seconds
Lookups/s: 3,077,293
3. weighted interleave
numactl –w –i 0,2 ./XSBench –s XL –p 5000000
Runtime: 29.262 seconds
Lookups/s: 5,809,513
=====================================================================
LTP Tests: https://github.com/gmprice/ltp/tree/mempolicy2
= Existing tests
set_mempolicy, get_mempolicy, mbind
MPOL_WEIGHTED_INTERLEAVE added manually to test basic functionality
but did not adjust tests for weighting. Basically the weights were
set to 1, which is the default, and it should behave the same as
MPOL_INTERLEAVE if logic is correct.
== set_mempolicy01 : passed 18, failed 0
== set_mempolicy02 : passed 10, failed 0
== set_mempolicy03 : passed 64, failed 0
== set_mempolicy04 : passed 32, failed 0
== set_mempolicy05 - n/a on non-x86
== set_mempolicy06 : passed 10, failed 0
this is set_mempolicy02 + MPOL_WEIGHTED_INTERLEAVE
== set_mempolicy07 : passed 32, failed 0
set_mempolicy04 + MPOL_WEIGHTED_INTERLEAVE
== get_mempolicy01 : passed 12, failed 0
change: added MPOL_WEIGHTED_INTERLEAVE
== get_mempolicy02 : passed 2, failed 0
== mbind01 : passed 15, failed 0
added MPOL_WEIGHTED_INTERLEAVE
== mbind02 : passed 4, failed 0
added MPOL_WEIGHTED_INTERLEAVE
== mbind03 : passed 16, failed 0
added MPOL_WEIGHTED_INTERLEAVE
== mbind04 : passed 48, failed 0
added MPOL_WEIGHTED_INTERLEAVE
=====================================================================
numactl (set_mempolicy) w/ global weighting test
numactl fork: https://github.com/gmprice/numactl/tree/weighted_interleave_master
command: numactl -w --interleave=0,1 ./eatmem
result (weights 1:1):
0176a000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=32897 N1=32896 kernelpagesize_kB=4
7fceeb9ff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=32768 N1=32769 kernelpagesize_kB=4
50% distribution is correct
result (weights 5:1):
01b14000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=54828 N1=10965 kernelpagesize_kB=4
7f47a1dff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=54614 N1=10923 kernelpagesize_kB=4
16.666% distribution is correct
result (weights 1:5):
01f07000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=10966 N1=54827 kernelpagesize_kB=4
7f17b1dff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=10923 N1=54614 kernelpagesize_kB=4
16.666% distribution is correct
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main (void)
{
char* mem = malloc(1024*1024*256);
memset(mem, 1, 1024*1024*256);
for (int i = 0; i < ((1024*1024*256)/4096); i++)
{
mem = malloc(4096);
mem[0] = 1;
}
printf("done\n");
getchar();
return 0;
}
=====================================================================
Suggested-by: Gregory Price <[email protected]>
Suggested-by: Johannes Weiner <[email protected]>
Suggested-by: Hasan Al Maruf <[email protected]>
Suggested-by: Hao Wang <[email protected]>
Suggested-by: Ying Huang <[email protected]>
Suggested-by: Dan Williams <[email protected]>
Suggested-by: Michal Hocko <[email protected]>
Suggested-by: Zhongkun He <[email protected]>
Suggested-by: Frank van der Linden <[email protected]>
Suggested-by: John Groves <[email protected]>
Suggested-by: Vinicius Tavares Petrucci <[email protected]>
Suggested-by: Srinivasulu Thanneeru <[email protected]>
Suggested-by: Ravi Jonnalagadda <[email protected]>
Suggested-by: Jonathan Cameron <[email protected]>
Suggested-by: Hyeongtak Ji <[email protected]>
Suggested-by: Andi Kleen <[email protected]>
Signed-off-by: Gregory Price <[email protected]>
Gregory Price (2):
mm/mempolicy: refactor a read-once mechanism into a function for
re-use
mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted
interleaving
Rakie Kim (1):
mm/mempolicy: implement the sysfs-based weighted_interleave interface
.../ABI/testing/sysfs-kernel-mm-mempolicy | 4 +
...fs-kernel-mm-mempolicy-weighted-interleave | 25 +
.../admin-guide/mm/numa_memory_policy.rst | 9 +
include/linux/sched.h | 1 +
include/uapi/linux/mempolicy.h | 1 +
mm/mempolicy.c | 480 +++++++++++++++++-
6 files changed, 506 insertions(+), 14 deletions(-)
create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
--
2.39.1
move the use of barrier() to force policy->nodemask onto the stack into
a function `read_once_policy_nodemask` so that it may be re-used.
Suggested-by: Huang Ying <[email protected]>
Signed-off-by: Gregory Price <[email protected]>
---
mm/mempolicy.c | 26 ++++++++++++++++----------
1 file changed, 16 insertions(+), 10 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 440128a398ef..3bdfaf03b660 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1909,6 +1909,20 @@ unsigned int mempolicy_slab_node(void)
}
}
+static unsigned int read_once_policy_nodemask(struct mempolicy *pol,
+ nodemask_t *mask)
+{
+ /*
+ * barrier stabilizes the nodemask locally so that it can be iterated
+ * over safely without concern for changes. Allocators validate node
+ * selection does not violate mems_allowed, so this is safe.
+ */
+ barrier();
+ memcpy(mask, &pol->nodes, sizeof(nodemask_t));
+ barrier();
+ return nodes_weight(*mask);
+}
+
/*
* Do static interleaving for interleave index @ilx. Returns the ilx'th
* node in pol->nodes (starting from ilx=0), wrapping around if ilx
@@ -1916,20 +1930,12 @@ unsigned int mempolicy_slab_node(void)
*/
static unsigned int interleave_nid(struct mempolicy *pol, pgoff_t ilx)
{
- nodemask_t nodemask = pol->nodes;
+ nodemask_t nodemask;
unsigned int target, nnodes;
int i;
int nid;
- /*
- * The barrier will stabilize the nodemask in a register or on
- * the stack so that it will stop changing under the code.
- *
- * Between first_node() and next_node(), pol->nodes could be changed
- * by other threads. So we put pol->nodes in a local stack.
- */
- barrier();
- nnodes = nodes_weight(nodemask);
+ nnodes = read_once_policy_nodemask(pol, &nodemask);
if (!nnodes)
return numa_node_id();
target = ilx % nnodes;
--
2.39.1
When a system has multiple NUMA nodes and it becomes bandwidth hungry,
using the current MPOL_INTERLEAVE could be an wise option.
However, if those NUMA nodes consist of different types of memory such
as socket-attached DRAM and CXL/PCIe attached DRAM, the round-robin
based interleave policy does not optimally distribute data to make use
of their different bandwidth characteristics.
Instead, interleave is more effective when the allocation policy follows
each NUMA nodes' bandwidth weight rather than a simple 1:1 distribution.
This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE,
enabling weighted interleave between NUMA nodes. Weighted interleave
allows for proportional distribution of memory across multiple numa
nodes, preferably apportioned to match the bandwidth of each node.
For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1),
with bandwidth of (100GB/s, 50GB/s) respectively, the appropriate
weight distribution is (2:1).
Weights for each node can be assigned via the new sysfs extension:
/sys/kernel/mm/mempolicy/weighted_interleave/
For now, the default value of all nodes will be `1`, which matches
the behavior of standard 1:1 round-robin interleave. An extension
will be added in the future to allow default values to be registered
at kernel and device bringup time.
The policy allocates a number of pages equal to the set weights. For
example, if the weights are (2,1), then 2 pages will be allocated on
node0 for every 1 page allocated on node1.
The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2)
and mbind(2).
Some high level notes about the pieces of weighted interleave:
current->il_prev:
Default interleave uses this to track the last used node.
Weighted interleave uses this to track the *current* node, and
when weight reaches 0 it will be used to acquire the next node.
current->il_weight:
The active weight of the current node (current->il_prev)
When this reaches 0, current->il_prev is set to the next node
and current->il_weight is set to the next weight.
weighted_interleave_nodes:
Counts the number of allocations as they occur, and applies the
weight for the current node. When the weight reaches 0, switch
to the next node. Operates only on task->mempolicy.
weighted_interleave_nid:
Gets the total weight of the nodemask as well as each individual
node weight, then calculates the node based on the given index.
Operates on VMA policies.
bulk_array_weighted_interleave:
Gets the total weight of the nodemask as well as each individual
node weight, then calculates the number of "interleave rounds" as
well as any delta ("partial round"). Calculates the number of
pages for each node and allocates them.
If a node was scheduled for interleave via interleave_nodes, the
current weight will be allocated first.
Operates only on the task->mempolicy.
One piece of complexity is the interaction between a recent refactor
which split the logic to acquire the "ilx" (interleave index) of an
allocation and the actual application of the interleave. If a call
to alloc_pages_mpol() were made with a weighted-interleave policy and
ilx set to NO_INTERLEAVE_INDEX, weighted_interleave_nodes() would
operate on a VMA policy - violating the description above.
An inspection of all callers of alloc_pages_mpol() shows that all
external callers set ilx to `0`, an index value, or will call
get_vma_policy() to acquire the ilx.
For example, mm/shmem.c may call into alloc_pages_mpol. The call stacks
all set (pgoff_t ilx) or end up in `get_vma_policy()`. This enforces
the `weighted_interleave_nodes()` and `weighted_interleave_nid()`
policy requirements (task/vma respectively).
Suggested-by: Hasan Al Maruf <[email protected]>
Signed-off-by: Gregory Price <[email protected]>
Co-developed-by: Rakie Kim <[email protected]>
Signed-off-by: Rakie Kim <[email protected]>
Co-developed-by: Honggyu Kim <[email protected]>
Signed-off-by: Honggyu Kim <[email protected]>
Co-developed-by: Hyeongtak Ji <[email protected]>
Signed-off-by: Hyeongtak Ji <[email protected]>
Co-developed-by: Srinivasulu Thanneeru <[email protected]>
Signed-off-by: Srinivasulu Thanneeru <[email protected]>
Co-developed-by: Ravi Jonnalagadda <[email protected]>
Signed-off-by: Ravi Jonnalagadda <[email protected]>
---
.../admin-guide/mm/numa_memory_policy.rst | 9 +
include/linux/sched.h | 1 +
include/uapi/linux/mempolicy.h | 1 +
mm/mempolicy.c | 231 +++++++++++++++++-
4 files changed, 238 insertions(+), 4 deletions(-)
diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index eca38fa81e0f..a70f20ce1ffb 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -250,6 +250,15 @@ MPOL_PREFERRED_MANY
can fall back to all existing numa nodes. This is effectively
MPOL_PREFERRED allowed for a mask rather than a single node.
+MPOL_WEIGHTED_INTERLEAVE
+ This mode operates the same as MPOL_INTERLEAVE, except that
+ interleaving behavior is executed based on weights set in
+ /sys/kernel/mm/mempolicy/weighted_interleave/
+
+ Weighted interleave allocates pages on nodes according to a
+ weight. For example if nodes [0,1] are weighted [5,2], 5 pages
+ will be allocated on node0 for every 2 pages allocated on node1.
+
NUMA memory policy supports the following optional mode flags:
MPOL_F_STATIC_NODES
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ffe8f618ab86..b9ce285d8c9c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1259,6 +1259,7 @@ struct task_struct {
/* Protected by alloc_lock: */
struct mempolicy *mempolicy;
short il_prev;
+ u8 il_weight;
short pref_node_fork;
#endif
#ifdef CONFIG_NUMA_BALANCING
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index a8963f7ef4c2..1f9bb10d1a47 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -23,6 +23,7 @@ enum {
MPOL_INTERLEAVE,
MPOL_LOCAL,
MPOL_PREFERRED_MANY,
+ MPOL_WEIGHTED_INTERLEAVE,
MPOL_MAX, /* always last member of enum */
};
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 3bdfaf03b660..7cd92f4ec0d7 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -19,6 +19,13 @@
* for anonymous memory. For process policy an process counter
* is used.
*
+ * weighted interleave
+ * Allocate memory interleaved over a set of nodes based on
+ * a set of weights (per-node), with normal fallback if it
+ * fails. Otherwise operates the same as interleave.
+ * Example: nodeset(0,1) & weights (2,1) - 2 pages allocated
+ * on node 0 for every 1 page allocated on node 1.
+ *
* bind Only allocate memory on a specific set of nodes,
* no fallback.
* FIXME: memory is allocated starting with the first node
@@ -441,6 +448,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
.create = mpol_new_nodemask,
.rebind = mpol_rebind_preferred,
},
+ [MPOL_WEIGHTED_INTERLEAVE] = {
+ .create = mpol_new_nodemask,
+ .rebind = mpol_rebind_nodemask,
+ },
};
static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
@@ -862,8 +873,11 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
old = current->mempolicy;
current->mempolicy = new;
- if (new && new->mode == MPOL_INTERLEAVE)
+ if (new && (new->mode == MPOL_INTERLEAVE ||
+ new->mode == MPOL_WEIGHTED_INTERLEAVE)) {
current->il_prev = MAX_NUMNODES-1;
+ current->il_weight = 0;
+ }
task_unlock(current);
mpol_put(old);
ret = 0;
@@ -888,6 +902,7 @@ static void get_policy_nodemask(struct mempolicy *pol, nodemask_t *nodes)
case MPOL_INTERLEAVE:
case MPOL_PREFERRED:
case MPOL_PREFERRED_MANY:
+ case MPOL_WEIGHTED_INTERLEAVE:
*nodes = pol->nodes;
break;
case MPOL_LOCAL:
@@ -972,6 +987,13 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
} else if (pol == current->mempolicy &&
pol->mode == MPOL_INTERLEAVE) {
*policy = next_node_in(current->il_prev, pol->nodes);
+ } else if (pol == current->mempolicy &&
+ pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
+ if (current->il_weight)
+ *policy = current->il_prev;
+ else
+ *policy = next_node_in(current->il_prev,
+ pol->nodes);
} else {
err = -EINVAL;
goto out;
@@ -1336,7 +1358,8 @@ static long do_mbind(unsigned long start, unsigned long len,
* VMAs, the nodes will still be interleaved from the targeted
* nodemask, but one by one may be selected differently.
*/
- if (new->mode == MPOL_INTERLEAVE) {
+ if (new->mode == MPOL_INTERLEAVE ||
+ new->mode == MPOL_WEIGHTED_INTERLEAVE) {
struct page *page;
unsigned int order;
unsigned long addr = -EFAULT;
@@ -1784,7 +1807,8 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
* @vma: virtual memory area whose policy is sought
* @addr: address in @vma for shared policy lookup
* @order: 0, or appropriate huge_page_order for interleaving
- * @ilx: interleave index (output), for use only when MPOL_INTERLEAVE
+ * @ilx: interleave index (output), for use only when MPOL_INTERLEAVE or
+ * MPOL_WEIGHTED_INTERLEAVE
*
* Returns effective policy for a VMA at specified address.
* Falls back to current->mempolicy or system default policy, as necessary.
@@ -1801,7 +1825,8 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
pol = __get_vma_policy(vma, addr, ilx);
if (!pol)
pol = get_task_policy(current);
- if (pol->mode == MPOL_INTERLEAVE) {
+ if (pol->mode == MPOL_INTERLEAVE ||
+ pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
*ilx += vma->vm_pgoff >> order;
*ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
}
@@ -1851,6 +1876,22 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
return zone >= dynamic_policy_zone;
}
+static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
+{
+ unsigned int node = current->il_prev;
+
+ if (!current->il_weight || !node_isset(node, policy->nodes)) {
+ node = next_node_in(node, policy->nodes);
+ /* can only happen if nodemask is being rebound */
+ if (node == MAX_NUMNODES)
+ return node;
+ current->il_prev = node;
+ current->il_weight = get_il_weight(node);
+ }
+ current->il_weight--;
+ return node;
+}
+
/* Do dynamic interleaving for a process */
static unsigned int interleave_nodes(struct mempolicy *policy)
{
@@ -1885,6 +1926,9 @@ unsigned int mempolicy_slab_node(void)
case MPOL_INTERLEAVE:
return interleave_nodes(policy);
+ case MPOL_WEIGHTED_INTERLEAVE:
+ return weighted_interleave_nodes(policy);
+
case MPOL_BIND:
case MPOL_PREFERRED_MANY:
{
@@ -1923,6 +1967,45 @@ static unsigned int read_once_policy_nodemask(struct mempolicy *pol,
return nodes_weight(*mask);
}
+static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
+{
+ nodemask_t nodemask;
+ unsigned int target, nr_nodes;
+ u8 __rcu *table;
+ unsigned int weight_total = 0;
+ u8 weight;
+ int nid;
+
+ nr_nodes = read_once_policy_nodemask(pol, &nodemask);
+ if (!nr_nodes)
+ return numa_node_id();
+
+ rcu_read_lock();
+ table = rcu_dereference(iw_table);
+ /* calculate the total weight */
+ for_each_node_mask(nid, nodemask) {
+ /* detect system default usage */
+ weight = table ? table[nid] : 1;
+ weight = weight ? weight : 1;
+ weight_total += weight;
+ }
+
+ /* Calculate the node offset based on totals */
+ target = ilx % weight_total;
+ nid = first_node(nodemask);
+ while (target) {
+ /* detect system default usage */
+ weight = table ? table[nid] : 1;
+ weight = weight ? weight : 1;
+ if (target < weight)
+ break;
+ target -= weight;
+ nid = next_node_in(nid, nodemask);
+ }
+ rcu_read_unlock();
+ return nid;
+}
+
/*
* Do static interleaving for interleave index @ilx. Returns the ilx'th
* node in pol->nodes (starting from ilx=0), wrapping around if ilx
@@ -1983,6 +2066,11 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
*nid = (ilx == NO_INTERLEAVE_INDEX) ?
interleave_nodes(pol) : interleave_nid(pol, ilx);
break;
+ case MPOL_WEIGHTED_INTERLEAVE:
+ *nid = (ilx == NO_INTERLEAVE_INDEX) ?
+ weighted_interleave_nodes(pol) :
+ weighted_interleave_nid(pol, ilx);
+ break;
}
return nodemask;
@@ -2044,6 +2132,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
case MPOL_PREFERRED_MANY:
case MPOL_BIND:
case MPOL_INTERLEAVE:
+ case MPOL_WEIGHTED_INTERLEAVE:
*mask = mempolicy->nodes;
break;
@@ -2144,6 +2233,7 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
* node in its nodemask, we allocate the standard way.
*/
if (pol->mode != MPOL_INTERLEAVE &&
+ pol->mode != MPOL_WEIGHTED_INTERLEAVE &&
(!nodemask || node_isset(nid, *nodemask))) {
/*
* First, try to allocate THP only on local node, but
@@ -2279,6 +2369,127 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
return total_allocated;
}
+static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
+ struct mempolicy *pol, unsigned long nr_pages,
+ struct page **page_array)
+{
+ struct task_struct *me = current;
+ unsigned long total_allocated = 0;
+ unsigned long nr_allocated = 0;
+ unsigned long rounds;
+ unsigned long node_pages, delta;
+ u8 __rcu *table, *weights, weight;
+ unsigned int weight_total = 0;
+ unsigned long rem_pages = nr_pages;
+ nodemask_t nodes;
+ int nnodes, node, next_node;
+ int resume_node = MAX_NUMNODES - 1;
+ u8 resume_weight = 0;
+ int prev_node;
+ int i;
+
+ if (!nr_pages)
+ return 0;
+
+ nnodes = read_once_policy_nodemask(pol, &nodes);
+ if (!nnodes)
+ return 0;
+
+ /* Continue allocating from most recent node and adjust the nr_pages */
+ node = me->il_prev;
+ weight = me->il_weight;
+ if (weight && node_isset(node, nodes)) {
+ node_pages = min(rem_pages, weight);
+ nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
+ NULL, page_array);
+ page_array += nr_allocated;
+ total_allocated += nr_allocated;
+ /* if that's all the pages, no need to interleave */
+ if (rem_pages < weight) {
+ /* stay on current node, adjust il_weight */
+ me->il_weight -= rem_pages;
+ return total_allocated;
+ } else if (rem_pages == weight) {
+ /* move to next node / weight */
+ me->il_prev = next_node_in(node, nodes);
+ me->il_weight = get_il_weight(next_node);
+ return total_allocated;
+ }
+ /* Otherwise we adjust remaining pages, continue from there */
+ rem_pages -= weight;
+ }
+ /* clear active weight in case of an allocation failure */
+ me->il_weight = 0;
+ prev_node = node;
+
+ /* create a local copy of node weights to operate on outside rcu */
+ weights = kzalloc(nr_node_ids, GFP_KERNEL);
+ if (!weights)
+ return total_allocated;
+
+ rcu_read_lock();
+ table = rcu_dereference(iw_table);
+ if (table)
+ memcpy(weights, table, nr_node_ids);
+ rcu_read_unlock();
+
+ /* calculate total, detect system default usage */
+ for_each_node_mask(node, nodes) {
+ if (!weights[node])
+ weights[node] = 1;
+ weight_total += weights[node];
+ }
+
+ /*
+ * Calculate rounds/partial rounds to minimize __alloc_pages_bulk calls.
+ * Track which node weighted interleave should resume from.
+ *
+ * if (rounds > 0) and (delta == 0), resume_node will always be
+ * the node following prev_node and its weight.
+ */
+ rounds = rem_pages / weight_total;
+ delta = rem_pages % weight_total;
+ resume_node = next_node_in(prev_node, nodes);
+ resume_weight = weights[resume_node];
+ for (i = 0; i < nnodes; i++) {
+ node = next_node_in(prev_node, nodes);
+ weight = weights[node];
+ node_pages = weight * rounds;
+ /* If a delta exists, add this node's portion of the delta */
+ if (delta > weight) {
+ node_pages += weight;
+ delta -= weight;
+ } else if (delta) {
+ node_pages += delta;
+ /* delta may deplete on a boundary or w/ a remainder */
+ if (delta == weight) {
+ /* boundary: resume from next node/weight */
+ resume_node = next_node_in(node, nodes);
+ resume_weight = weights[resume_node];
+ } else {
+ /* remainder: resume this node w/ remainder */
+ resume_node = node;
+ resume_weight = weight - delta;
+ }
+ delta = 0;
+ }
+ /* node_pages can be 0 if an allocation fails and rounds == 0 */
+ if (!node_pages)
+ break;
+ nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
+ NULL, page_array);
+ page_array += nr_allocated;
+ total_allocated += nr_allocated;
+ if (total_allocated == nr_pages)
+ break;
+ prev_node = node;
+ }
+ me->il_prev = resume_node;
+ me->il_weight = resume_weight;
+ kfree(weights);
+ return total_allocated;
+}
+
static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
struct mempolicy *pol, unsigned long nr_pages,
struct page **page_array)
@@ -2319,6 +2530,10 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
return alloc_pages_bulk_array_interleave(gfp, pol,
nr_pages, page_array);
+ if (pol->mode == MPOL_WEIGHTED_INTERLEAVE)
+ return alloc_pages_bulk_array_weighted_interleave(
+ gfp, pol, nr_pages, page_array);
+
if (pol->mode == MPOL_PREFERRED_MANY)
return alloc_pages_bulk_array_preferred_many(gfp,
numa_node_id(), pol, nr_pages, page_array);
@@ -2394,6 +2609,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
case MPOL_INTERLEAVE:
case MPOL_PREFERRED:
case MPOL_PREFERRED_MANY:
+ case MPOL_WEIGHTED_INTERLEAVE:
return !!nodes_equal(a->nodes, b->nodes);
case MPOL_LOCAL:
return true;
@@ -2530,6 +2746,10 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
polnid = interleave_nid(pol, ilx);
break;
+ case MPOL_WEIGHTED_INTERLEAVE:
+ polnid = weighted_interleave_nid(pol, ilx);
+ break;
+
case MPOL_PREFERRED:
if (node_isset(curnid, pol->nodes))
goto out;
@@ -2904,6 +3124,7 @@ static const char * const policy_modes[] =
[MPOL_PREFERRED] = "prefer",
[MPOL_BIND] = "bind",
[MPOL_INTERLEAVE] = "interleave",
+ [MPOL_WEIGHTED_INTERLEAVE] = "weighted interleave",
[MPOL_LOCAL] = "local",
[MPOL_PREFERRED_MANY] = "prefer (many)",
};
@@ -2963,6 +3184,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
}
break;
case MPOL_INTERLEAVE:
+ case MPOL_WEIGHTED_INTERLEAVE:
/*
* Default to online nodes with memory if no nodelist
*/
@@ -3073,6 +3295,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
case MPOL_PREFERRED_MANY:
case MPOL_BIND:
case MPOL_INTERLEAVE:
+ case MPOL_WEIGHTED_INTERLEAVE:
nodes = pol->nodes;
break;
default:
--
2.39.1
From: Rakie Kim <[email protected]>
This patch provides a way to set interleave weight information under
sysfs at /sys/kernel/mm/mempolicy/weighted_interleave/nodeN
The sysfs structure is designed as follows.
$ tree /sys/kernel/mm/mempolicy/
/sys/kernel/mm/mempolicy/ [1]
└── weighted_interleave [2]
├── node0 [3]
└── node1
Each file above can be explained as follows.
[1] mm/mempolicy: configuration interface for mempolicy subsystem
[2] weighted_interleave/: config interface for weighted interleave policy
[3] weighted_interleave/nodeN: weight for nodeN
If a node value is set to `0`, the system-default value will be used.
As of this patch, the system-default for all nodes is always 1.
Suggested-by: Huang Ying <[email protected]>
Signed-off-by: Rakie Kim <[email protected]>
Signed-off-by: Honggyu Kim <[email protected]>
Co-developed-by: Gregory Price <[email protected]>
Signed-off-by: Gregory Price <[email protected]>
Co-developed-by: Hyeongtak Ji <[email protected]>
Signed-off-by: Hyeongtak Ji <[email protected]>
---
.../ABI/testing/sysfs-kernel-mm-mempolicy | 4 +
...fs-kernel-mm-mempolicy-weighted-interleave | 25 ++
mm/mempolicy.c | 223 ++++++++++++++++++
3 files changed, 252 insertions(+)
create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
new file mode 100644
index 000000000000..8ac327fd7fb6
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
@@ -0,0 +1,4 @@
+What: /sys/kernel/mm/mempolicy/
+Date: January 2024
+Contact: Linux memory management mailing list <[email protected]>
+Description: Interface for Mempolicy
diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
new file mode 100644
index 000000000000..0b7972de04e9
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
@@ -0,0 +1,25 @@
+What: /sys/kernel/mm/mempolicy/weighted_interleave/
+Date: January 2024
+Contact: Linux memory management mailing list <[email protected]>
+Description: Configuration Interface for the Weighted Interleave policy
+
+What: /sys/kernel/mm/mempolicy/weighted_interleave/nodeN
+Date: January 2024
+Contact: Linux memory management mailing list <[email protected]>
+Description: Weight configuration interface for nodeN
+
+ The interleave weight for a memory node (N). These weights are
+ utilized by tasks which have set their mempolicy to
+ MPOL_WEIGHTED_INTERLEAVE.
+
+ These weights only affect new allocations, and changes at runtime
+ will not cause migrations on already allocated pages.
+
+ The minimum weight for a node is always 1.
+
+ Minimum weight: 1
+ Maximum weight: 255
+
+ Writing an empty string or `0` will reset the weight to the
+ system default. The system default may be set by the kernel
+ or drivers at boot or during hotplug events.
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 10a590ee1c89..440128a398ef 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -131,6 +131,32 @@ static struct mempolicy default_policy = {
static struct mempolicy preferred_node_policy[MAX_NUMNODES];
+/*
+ * iw_table is the sysfs-set interleave weight table, a value of 0 denotes
+ * system-default value should be used. A NULL iw_table also denotes that
+ * system-default values should be used. Until the system-default table
+ * is implemented, the system-default is always 1.
+ *
+ * iw_table is RCU protected
+ */
+static u8 __rcu *iw_table;
+static DEFINE_MUTEX(iw_table_lock);
+
+static u8 get_il_weight(int node)
+{
+ u8 __rcu *table;
+ u8 weight;
+
+ rcu_read_lock();
+ table = rcu_dereference(iw_table);
+ /* if no iw_table, use system default */
+ weight = table ? table[node] : 1;
+ /* if value in iw_table is 0, use system default */
+ weight = weight ? weight : 1;
+ rcu_read_unlock();
+ return weight;
+}
+
/**
* numa_nearest_node - Find nearest node by state
* @node: Node id to start the search
@@ -3067,3 +3093,200 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
p += scnprintf(p, buffer + maxlen - p, ":%*pbl",
nodemask_pr_args(&nodes));
}
+
+#ifdef CONFIG_SYSFS
+struct iw_node_attr {
+ struct kobj_attribute kobj_attr;
+ int nid;
+};
+
+static ssize_t node_show(struct kobject *kobj, struct kobj_attribute *attr,
+ char *buf)
+{
+ struct iw_node_attr *node_attr;
+ u8 weight;
+
+ node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
+ weight = get_il_weight(node_attr->nid);
+ return sysfs_emit(buf, "%d\n", weight);
+}
+
+static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct iw_node_attr *node_attr;
+ u8 __rcu *new;
+ u8 __rcu *old;
+ u8 weight = 0;
+
+ node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
+ if (count == 0 || sysfs_streq(buf, ""))
+ weight = 0;
+ else if (kstrtou8(buf, 0, &weight))
+ return -EINVAL;
+
+ new = kzalloc(nr_node_ids, GFP_KERNEL);
+ if (!new)
+ return -ENOMEM;
+
+ mutex_lock(&iw_table_lock);
+ old = rcu_dereference_protected(iw_table,
+ lockdep_is_held(&iw_table_lock));
+ if (old)
+ memcpy(new, old, nr_node_ids);
+ new[node_attr->nid] = weight;
+ rcu_assign_pointer(iw_table, new);
+ mutex_unlock(&iw_table_lock);
+ synchronize_rcu();
+ kfree(old);
+ return count;
+}
+
+static struct iw_node_attr **node_attrs;
+
+static void sysfs_wi_node_release(struct iw_node_attr *node_attr,
+ struct kobject *parent)
+{
+ if (!node_attr)
+ return;
+ sysfs_remove_file(parent, &node_attr->kobj_attr.attr);
+ kfree(node_attr->kobj_attr.attr.name);
+ kfree(node_attr);
+}
+
+static void sysfs_wi_release(struct kobject *wi_kobj)
+{
+ int i;
+
+ for (i = 0; i < nr_node_ids; i++)
+ sysfs_wi_node_release(node_attrs[i], wi_kobj);
+ kobject_put(wi_kobj);
+}
+
+static const struct kobj_type wi_ktype = {
+ .sysfs_ops = &kobj_sysfs_ops,
+ .release = sysfs_wi_release,
+};
+
+static int add_weight_node(int nid, struct kobject *wi_kobj)
+{
+ struct iw_node_attr *node_attr;
+ char *name;
+
+ node_attr = kzalloc(sizeof(*node_attr), GFP_KERNEL);
+ if (!node_attr)
+ return -ENOMEM;
+
+ name = kasprintf(GFP_KERNEL, "node%d", nid);
+ if (!name) {
+ kfree(node_attr);
+ return -ENOMEM;
+ }
+
+ sysfs_attr_init(&node_attr->kobj_attr.attr);
+ node_attr->kobj_attr.attr.name = name;
+ node_attr->kobj_attr.attr.mode = 0644;
+ node_attr->kobj_attr.show = node_show;
+ node_attr->kobj_attr.store = node_store;
+ node_attr->nid = nid;
+
+ if (sysfs_create_file(wi_kobj, &node_attr->kobj_attr.attr)) {
+ kfree(node_attr->kobj_attr.attr.name);
+ kfree(node_attr);
+ pr_err("failed to add attribute to weighted_interleave\n");
+ return -ENOMEM;
+ }
+
+ node_attrs[nid] = node_attr;
+ return 0;
+}
+
+static int add_weighted_interleave_group(struct kobject *root_kobj)
+{
+ struct kobject *wi_kobj;
+ int nid, err;
+
+ wi_kobj = kzalloc(sizeof(struct kobject), GFP_KERNEL);
+ if (!wi_kobj)
+ return -ENOMEM;
+
+ err = kobject_init_and_add(wi_kobj, &wi_ktype, root_kobj,
+ "weighted_interleave");
+ if (err) {
+ kfree(wi_kobj);
+ return err;
+ }
+
+ for_each_node_state(nid, N_POSSIBLE) {
+ err = add_weight_node(nid, wi_kobj);
+ if (err) {
+ pr_err("failed to add sysfs [node%d]\n", nid);
+ break;
+ }
+ }
+ if (err)
+ kobject_put(wi_kobj);
+ return 0;
+}
+
+static void mempolicy_kobj_release(struct kobject *kobj)
+{
+ u8 __rcu *old;
+
+ mutex_lock(&iw_table_lock);
+ old = rcu_dereference_protected(iw_table,
+ lockdep_is_held(&iw_table_lock));
+ rcu_assign_pointer(iw_table, NULL);
+ mutex_unlock(&iw_table_lock);
+ synchronize_rcu();
+ kfree(old);
+ kfree(node_attrs);
+ kfree(kobj);
+}
+
+static const struct kobj_type mempolicy_ktype = {
+ .release = mempolicy_kobj_release
+};
+
+static int __init mempolicy_sysfs_init(void)
+{
+ int err;
+ static struct kobject *mempolicy_kobj;
+
+ mempolicy_kobj = kzalloc(sizeof(*mempolicy_kobj), GFP_KERNEL);
+ if (!mempolicy_kobj) {
+ err = -ENOMEM;
+ goto err_out;
+ }
+
+ node_attrs = kcalloc(nr_node_ids, sizeof(struct iw_node_attr *),
+ GFP_KERNEL);
+ if (!node_attrs) {
+ err = -ENOMEM;
+ goto mempol_out;
+ }
+
+ err = kobject_init_and_add(mempolicy_kobj, &mempolicy_ktype, mm_kobj,
+ "mempolicy");
+ if (err)
+ goto node_out;
+
+ err = add_weighted_interleave_group(mempolicy_kobj);
+ if (err) {
+ pr_err("mempolicy sysfs structure failed to initialize\n");
+ kobject_put(mempolicy_kobj);
+ return err;
+ }
+
+ return err;
+node_out:
+ kfree(node_attrs);
+mempol_out:
+ kfree(mempolicy_kobj);
+err_out:
+ pr_err("failed to add mempolicy kobject to the system\n");
+ return err;
+}
+
+late_initcall(mempolicy_sysfs_init);
+#endif /* CONFIG_SYSFS */
--
2.39.1
On Tue, Jan 30, 2024 at 01:20:46PM -0500, Gregory Price wrote:
> + /* Continue allocating from most recent node and adjust the nr_pages */
> + node = me->il_prev;
> + weight = me->il_weight;
> + if (weight && node_isset(node, nodes)) {
> + node_pages = min(rem_pages, weight);
> + nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
> + NULL, page_array);
> + page_array += nr_allocated;
> + total_allocated += nr_allocated;
> + /* if that's all the pages, no need to interleave */
> + if (rem_pages < weight) {
> + /* stay on current node, adjust il_weight */
> + me->il_weight -= rem_pages;
> + return total_allocated;
> + } else if (rem_pages == weight) {
> + /* move to next node / weight */
> + me->il_prev = next_node_in(node, nodes);
> + me->il_weight = get_il_weight(next_node);
Sigh, I managed to miss a small update that killed next_node in favor of
operating directly on il_prev. Can you squash this fix into the patch?
Otherwise I can submit a separate patch.
~Gregory
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 7cd92f4ec0d7..2c1aef8eab70 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2382,7 +2382,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
unsigned int weight_total = 0;
unsigned long rem_pages = nr_pages;
nodemask_t nodes;
- int nnodes, node, next_node;
+ int nnodes, node;
int resume_node = MAX_NUMNODES - 1;
u8 resume_weight = 0;
int prev_node;
@@ -2412,7 +2412,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
} else if (rem_pages == weight) {
/* move to next node / weight */
me->il_prev = next_node_in(node, nodes);
- me->il_weight = get_il_weight(next_node);
+ me->il_weight = get_il_weight(me->il_prev);
return total_allocated;
}
/* Otherwise we adjust remaining pages, continue from there */
Gregory Price <[email protected]> writes:
> When a system has multiple NUMA nodes and it becomes bandwidth hungry,
> using the current MPOL_INTERLEAVE could be an wise option.
>
> However, if those NUMA nodes consist of different types of memory such
> as socket-attached DRAM and CXL/PCIe attached DRAM, the round-robin
> based interleave policy does not optimally distribute data to make use
> of their different bandwidth characteristics.
>
> Instead, interleave is more effective when the allocation policy follows
> each NUMA nodes' bandwidth weight rather than a simple 1:1 distribution.
>
> This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE,
> enabling weighted interleave between NUMA nodes. Weighted interleave
> allows for proportional distribution of memory across multiple numa
> nodes, preferably apportioned to match the bandwidth of each node.
>
> For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1),
> with bandwidth of (100GB/s, 50GB/s) respectively, the appropriate
> weight distribution is (2:1).
>
> Weights for each node can be assigned via the new sysfs extension:
> /sys/kernel/mm/mempolicy/weighted_interleave/
>
> For now, the default value of all nodes will be `1`, which matches
> the behavior of standard 1:1 round-robin interleave. An extension
> will be added in the future to allow default values to be registered
> at kernel and device bringup time.
>
> The policy allocates a number of pages equal to the set weights. For
> example, if the weights are (2,1), then 2 pages will be allocated on
> node0 for every 1 page allocated on node1.
>
> The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2)
> and mbind(2).
>
> Some high level notes about the pieces of weighted interleave:
>
> current->il_prev:
> Default interleave uses this to track the last used node.
> Weighted interleave uses this to track the *current* node, and
> when weight reaches 0 it will be used to acquire the next node.
>
> current->il_weight:
> The active weight of the current node (current->il_prev)
> When this reaches 0, current->il_prev is set to the next node
> and current->il_weight is set to the next weight.
I still think that my description of the 2 fields above is easier to be
understood. For weighted interleave,
current->il_prev is the node from which we allocated page in previous
allocation.
current->il_weight is the remaining weight for current->il_prev after
previous allocation.
But I will not force you to use this. Use it only if you think that
they are better.
> weighted_interleave_nodes:
> Counts the number of allocations as they occur, and applies the
> weight for the current node. When the weight reaches 0, switch
> to the next node. Operates only on task->mempolicy.
>
> weighted_interleave_nid:
> Gets the total weight of the nodemask as well as each individual
> node weight, then calculates the node based on the given index.
> Operates on VMA policies.
>
> bulk_array_weighted_interleave:
> Gets the total weight of the nodemask as well as each individual
> node weight, then calculates the number of "interleave rounds" as
> well as any delta ("partial round"). Calculates the number of
> pages for each node and allocates them.
>
> If a node was scheduled for interleave via interleave_nodes, the
> current weight will be allocated first.
>
> Operates only on the task->mempolicy.
>
> One piece of complexity is the interaction between a recent refactor
> which split the logic to acquire the "ilx" (interleave index) of an
> allocation and the actual application of the interleave. If a call
> to alloc_pages_mpol() were made with a weighted-interleave policy and
> ilx set to NO_INTERLEAVE_INDEX, weighted_interleave_nodes() would
> operate on a VMA policy - violating the description above.
>
> An inspection of all callers of alloc_pages_mpol() shows that all
> external callers set ilx to `0`, an index value, or will call
> get_vma_policy() to acquire the ilx.
>
> For example, mm/shmem.c may call into alloc_pages_mpol. The call stacks
> all set (pgoff_t ilx) or end up in `get_vma_policy()`. This enforces
> the `weighted_interleave_nodes()` and `weighted_interleave_nid()`
> policy requirements (task/vma respectively).
>
> Suggested-by: Hasan Al Maruf <[email protected]>
> Signed-off-by: Gregory Price <[email protected]>
> Co-developed-by: Rakie Kim <[email protected]>
> Signed-off-by: Rakie Kim <[email protected]>
> Co-developed-by: Honggyu Kim <[email protected]>
> Signed-off-by: Honggyu Kim <[email protected]>
> Co-developed-by: Hyeongtak Ji <[email protected]>
> Signed-off-by: Hyeongtak Ji <[email protected]>
> Co-developed-by: Srinivasulu Thanneeru <[email protected]>
> Signed-off-by: Srinivasulu Thanneeru <[email protected]>
> Co-developed-by: Ravi Jonnalagadda <[email protected]>
> Signed-off-by: Ravi Jonnalagadda <[email protected]>
> ---
> .../admin-guide/mm/numa_memory_policy.rst | 9 +
> include/linux/sched.h | 1 +
> include/uapi/linux/mempolicy.h | 1 +
> mm/mempolicy.c | 231 +++++++++++++++++-
> 4 files changed, 238 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
> index eca38fa81e0f..a70f20ce1ffb 100644
> --- a/Documentation/admin-guide/mm/numa_memory_policy.rst
> +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
> @@ -250,6 +250,15 @@ MPOL_PREFERRED_MANY
> can fall back to all existing numa nodes. This is effectively
> MPOL_PREFERRED allowed for a mask rather than a single node.
>
> +MPOL_WEIGHTED_INTERLEAVE
> + This mode operates the same as MPOL_INTERLEAVE, except that
> + interleaving behavior is executed based on weights set in
> + /sys/kernel/mm/mempolicy/weighted_interleave/
> +
> + Weighted interleave allocates pages on nodes according to a
> + weight. For example if nodes [0,1] are weighted [5,2], 5 pages
> + will be allocated on node0 for every 2 pages allocated on node1.
> +
> NUMA memory policy supports the following optional mode flags:
>
> MPOL_F_STATIC_NODES
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index ffe8f618ab86..b9ce285d8c9c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1259,6 +1259,7 @@ struct task_struct {
> /* Protected by alloc_lock: */
> struct mempolicy *mempolicy;
> short il_prev;
> + u8 il_weight;
> short pref_node_fork;
> #endif
> #ifdef CONFIG_NUMA_BALANCING
> diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
> index a8963f7ef4c2..1f9bb10d1a47 100644
> --- a/include/uapi/linux/mempolicy.h
> +++ b/include/uapi/linux/mempolicy.h
> @@ -23,6 +23,7 @@ enum {
> MPOL_INTERLEAVE,
> MPOL_LOCAL,
> MPOL_PREFERRED_MANY,
> + MPOL_WEIGHTED_INTERLEAVE,
> MPOL_MAX, /* always last member of enum */
> };
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 3bdfaf03b660..7cd92f4ec0d7 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -19,6 +19,13 @@
> * for anonymous memory. For process policy an process counter
> * is used.
> *
> + * weighted interleave
> + * Allocate memory interleaved over a set of nodes based on
> + * a set of weights (per-node), with normal fallback if it
> + * fails. Otherwise operates the same as interleave.
> + * Example: nodeset(0,1) & weights (2,1) - 2 pages allocated
> + * on node 0 for every 1 page allocated on node 1.
> + *
> * bind Only allocate memory on a specific set of nodes,
> * no fallback.
> * FIXME: memory is allocated starting with the first node
> @@ -441,6 +448,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
> .create = mpol_new_nodemask,
> .rebind = mpol_rebind_preferred,
> },
> + [MPOL_WEIGHTED_INTERLEAVE] = {
> + .create = mpol_new_nodemask,
> + .rebind = mpol_rebind_nodemask,
> + },
> };
>
> static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
> @@ -862,8 +873,11 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
>
> old = current->mempolicy;
> current->mempolicy = new;
> - if (new && new->mode == MPOL_INTERLEAVE)
> + if (new && (new->mode == MPOL_INTERLEAVE ||
> + new->mode == MPOL_WEIGHTED_INTERLEAVE)) {
> current->il_prev = MAX_NUMNODES-1;
> + current->il_weight = 0;
> + }
> task_unlock(current);
> mpol_put(old);
> ret = 0;
> @@ -888,6 +902,7 @@ static void get_policy_nodemask(struct mempolicy *pol, nodemask_t *nodes)
> case MPOL_INTERLEAVE:
> case MPOL_PREFERRED:
> case MPOL_PREFERRED_MANY:
> + case MPOL_WEIGHTED_INTERLEAVE:
> *nodes = pol->nodes;
> break;
> case MPOL_LOCAL:
> @@ -972,6 +987,13 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
> } else if (pol == current->mempolicy &&
> pol->mode == MPOL_INTERLEAVE) {
> *policy = next_node_in(current->il_prev, pol->nodes);
> + } else if (pol == current->mempolicy &&
> + pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
> + if (current->il_weight)
> + *policy = current->il_prev;
> + else
> + *policy = next_node_in(current->il_prev,
> + pol->nodes);
> } else {
> err = -EINVAL;
> goto out;
> @@ -1336,7 +1358,8 @@ static long do_mbind(unsigned long start, unsigned long len,
> * VMAs, the nodes will still be interleaved from the targeted
> * nodemask, but one by one may be selected differently.
> */
> - if (new->mode == MPOL_INTERLEAVE) {
> + if (new->mode == MPOL_INTERLEAVE ||
> + new->mode == MPOL_WEIGHTED_INTERLEAVE) {
> struct page *page;
> unsigned int order;
> unsigned long addr = -EFAULT;
> @@ -1784,7 +1807,8 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
> * @vma: virtual memory area whose policy is sought
> * @addr: address in @vma for shared policy lookup
> * @order: 0, or appropriate huge_page_order for interleaving
> - * @ilx: interleave index (output), for use only when MPOL_INTERLEAVE
> + * @ilx: interleave index (output), for use only when MPOL_INTERLEAVE or
> + * MPOL_WEIGHTED_INTERLEAVE
> *
> * Returns effective policy for a VMA at specified address.
> * Falls back to current->mempolicy or system default policy, as necessary.
> @@ -1801,7 +1825,8 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> pol = __get_vma_policy(vma, addr, ilx);
> if (!pol)
> pol = get_task_policy(current);
> - if (pol->mode == MPOL_INTERLEAVE) {
> + if (pol->mode == MPOL_INTERLEAVE ||
> + pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
> *ilx += vma->vm_pgoff >> order;
> *ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
> }
> @@ -1851,6 +1876,22 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
> return zone >= dynamic_policy_zone;
> }
>
> +static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
> +{
> + unsigned int node = current->il_prev;
> +
> + if (!current->il_weight || !node_isset(node, policy->nodes)) {
> + node = next_node_in(node, policy->nodes);
> + /* can only happen if nodemask is being rebound */
> + if (node == MAX_NUMNODES)
> + return node;
I feel a little unsafe to read policy->nodes at same time of writing in
rebound. Is it better to use a seqlock to guarantee its consistency?
It's unnecessary to be a part of this series though.
> + current->il_prev = node;
> + current->il_weight = get_il_weight(node);
> + }
> + current->il_weight--;
> + return node;
> +}
> +
> /* Do dynamic interleaving for a process */
> static unsigned int interleave_nodes(struct mempolicy *policy)
> {
> @@ -1885,6 +1926,9 @@ unsigned int mempolicy_slab_node(void)
> case MPOL_INTERLEAVE:
> return interleave_nodes(policy);
>
> + case MPOL_WEIGHTED_INTERLEAVE:
> + return weighted_interleave_nodes(policy);
> +
> case MPOL_BIND:
> case MPOL_PREFERRED_MANY:
> {
> @@ -1923,6 +1967,45 @@ static unsigned int read_once_policy_nodemask(struct mempolicy *pol,
> return nodes_weight(*mask);
> }
>
> +static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
> +{
> + nodemask_t nodemask;
> + unsigned int target, nr_nodes;
> + u8 __rcu *table;
> + unsigned int weight_total = 0;
> + u8 weight;
> + int nid;
> +
> + nr_nodes = read_once_policy_nodemask(pol, &nodemask);
> + if (!nr_nodes)
> + return numa_node_id();
> +
> + rcu_read_lock();
> + table = rcu_dereference(iw_table);
> + /* calculate the total weight */
> + for_each_node_mask(nid, nodemask) {
> + /* detect system default usage */
> + weight = table ? table[nid] : 1;
> + weight = weight ? weight : 1;
> + weight_total += weight;
> + }
> +
> + /* Calculate the node offset based on totals */
> + target = ilx % weight_total;
> + nid = first_node(nodemask);
> + while (target) {
> + /* detect system default usage */
> + weight = table ? table[nid] : 1;
> + weight = weight ? weight : 1;
I found duplicated pattern as above in this patch. Can we define a
function like below to remove the duplication?
u8 __get_il_weight(u8 *table, int nid)
{
u8 weight;
weight = table ? table[nid] : 1;
return weight ? : 1;
}
This can be used in alloc_pages_bulk_array_weighted_interleave() to copy
from global to local weights array too.
But this isn't a big deal. I will leave it to you to decide.
> + if (target < weight)
> + break;
> + target -= weight;
> + nid = next_node_in(nid, nodemask);
> + }
> + rcu_read_unlock();
> + return nid;
> +}
> +
> /*
> * Do static interleaving for interleave index @ilx. Returns the ilx'th
> * node in pol->nodes (starting from ilx=0), wrapping around if ilx
> @@ -1983,6 +2066,11 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
> *nid = (ilx == NO_INTERLEAVE_INDEX) ?
> interleave_nodes(pol) : interleave_nid(pol, ilx);
> break;
> + case MPOL_WEIGHTED_INTERLEAVE:
> + *nid = (ilx == NO_INTERLEAVE_INDEX) ?
> + weighted_interleave_nodes(pol) :
> + weighted_interleave_nid(pol, ilx);
> + break;
> }
>
> return nodemask;
> @@ -2044,6 +2132,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
> case MPOL_PREFERRED_MANY:
> case MPOL_BIND:
> case MPOL_INTERLEAVE:
> + case MPOL_WEIGHTED_INTERLEAVE:
> *mask = mempolicy->nodes;
> break;
>
> @@ -2144,6 +2233,7 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
> * node in its nodemask, we allocate the standard way.
> */
> if (pol->mode != MPOL_INTERLEAVE &&
> + pol->mode != MPOL_WEIGHTED_INTERLEAVE &&
> (!nodemask || node_isset(nid, *nodemask))) {
> /*
> * First, try to allocate THP only on local node, but
> @@ -2279,6 +2369,127 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
> return total_allocated;
> }
>
> +static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
> + struct mempolicy *pol, unsigned long nr_pages,
> + struct page **page_array)
> +{
> + struct task_struct *me = current;
> + unsigned long total_allocated = 0;
> + unsigned long nr_allocated = 0;
> + unsigned long rounds;
> + unsigned long node_pages, delta;
> + u8 __rcu *table, *weights, weight;
> + unsigned int weight_total = 0;
> + unsigned long rem_pages = nr_pages;
> + nodemask_t nodes;
> + int nnodes, node, next_node;
> + int resume_node = MAX_NUMNODES - 1;
> + u8 resume_weight = 0;
> + int prev_node;
> + int i;
> +
> + if (!nr_pages)
> + return 0;
> +
> + nnodes = read_once_policy_nodemask(pol, &nodes);
> + if (!nnodes)
> + return 0;
> +
> + /* Continue allocating from most recent node and adjust the nr_pages */
> + node = me->il_prev;
> + weight = me->il_weight;
> + if (weight && node_isset(node, nodes)) {
> + node_pages = min(rem_pages, weight);
> + nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
> + NULL, page_array);
> + page_array += nr_allocated;
> + total_allocated += nr_allocated;
> + /* if that's all the pages, no need to interleave */
> + if (rem_pages < weight) {
> + /* stay on current node, adjust il_weight */
> + me->il_weight -= rem_pages;
> + return total_allocated;
> + } else if (rem_pages == weight) {
> + /* move to next node / weight */
> + me->il_prev = next_node_in(node, nodes);
> + me->il_weight = get_il_weight(next_node);
> + return total_allocated;
> + }
> + /* Otherwise we adjust remaining pages, continue from there */
> + rem_pages -= weight;
> + }
> + /* clear active weight in case of an allocation failure */
> + me->il_weight = 0;
> + prev_node = node;
> +
> + /* create a local copy of node weights to operate on outside rcu */
> + weights = kzalloc(nr_node_ids, GFP_KERNEL);
> + if (!weights)
> + return total_allocated;
> +
> + rcu_read_lock();
> + table = rcu_dereference(iw_table);
> + if (table)
> + memcpy(weights, table, nr_node_ids);
> + rcu_read_unlock();
> +
> + /* calculate total, detect system default usage */
> + for_each_node_mask(node, nodes) {
> + if (!weights[node])
> + weights[node] = 1;
> + weight_total += weights[node];
> + }
> +
> + /*
> + * Calculate rounds/partial rounds to minimize __alloc_pages_bulk calls.
> + * Track which node weighted interleave should resume from.
> + *
> + * if (rounds > 0) and (delta == 0), resume_node will always be
> + * the node following prev_node and its weight.
> + */
> + rounds = rem_pages / weight_total;
> + delta = rem_pages % weight_total;
> + resume_node = next_node_in(prev_node, nodes);
> + resume_weight = weights[resume_node];
> + for (i = 0; i < nnodes; i++) {
> + node = next_node_in(prev_node, nodes);
> + weight = weights[node];
> + node_pages = weight * rounds;
> + /* If a delta exists, add this node's portion of the delta */
> + if (delta > weight) {
> + node_pages += weight;
> + delta -= weight;
> + } else if (delta) {
> + node_pages += delta;
> + /* delta may deplete on a boundary or w/ a remainder */
> + if (delta == weight) {
> + /* boundary: resume from next node/weight */
> + resume_node = next_node_in(node, nodes);
> + resume_weight = weights[resume_node];
> + } else {
> + /* remainder: resume this node w/ remainder */
> + resume_node = node;
> + resume_weight = weight - delta;
> + }
If we are comfortable to leave resume_weight == 0, then the above
branch can be simplified to.
resume_node = node;
resume_weight = weight - delta;
But, this is a style issue again. I will leave it to you to decide.
So, except the issue you pointed out already. All series looks good to
me! Thanks! Feel free to add
Reviewed-by: "Huang, Ying" <[email protected]>
to the whole series.
> + delta = 0;
> + }
> + /* node_pages can be 0 if an allocation fails and rounds == 0 */
> + if (!node_pages)
> + break;
> + nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
> + NULL, page_array);
> + page_array += nr_allocated;
> + total_allocated += nr_allocated;
> + if (total_allocated == nr_pages)
> + break;
> + prev_node = node;
> + }
> + me->il_prev = resume_node;
> + me->il_weight = resume_weight;
> + kfree(weights);
> + return total_allocated;
> +}
> +
> static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
> struct mempolicy *pol, unsigned long nr_pages,
> struct page **page_array)
> @@ -2319,6 +2530,10 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
> return alloc_pages_bulk_array_interleave(gfp, pol,
> nr_pages, page_array);
>
> + if (pol->mode == MPOL_WEIGHTED_INTERLEAVE)
> + return alloc_pages_bulk_array_weighted_interleave(
> + gfp, pol, nr_pages, page_array);
> +
> if (pol->mode == MPOL_PREFERRED_MANY)
> return alloc_pages_bulk_array_preferred_many(gfp,
> numa_node_id(), pol, nr_pages, page_array);
> @@ -2394,6 +2609,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
> case MPOL_INTERLEAVE:
> case MPOL_PREFERRED:
> case MPOL_PREFERRED_MANY:
> + case MPOL_WEIGHTED_INTERLEAVE:
> return !!nodes_equal(a->nodes, b->nodes);
> case MPOL_LOCAL:
> return true;
> @@ -2530,6 +2746,10 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
> polnid = interleave_nid(pol, ilx);
> break;
>
> + case MPOL_WEIGHTED_INTERLEAVE:
> + polnid = weighted_interleave_nid(pol, ilx);
> + break;
> +
> case MPOL_PREFERRED:
> if (node_isset(curnid, pol->nodes))
> goto out;
> @@ -2904,6 +3124,7 @@ static const char * const policy_modes[] =
> [MPOL_PREFERRED] = "prefer",
> [MPOL_BIND] = "bind",
> [MPOL_INTERLEAVE] = "interleave",
> + [MPOL_WEIGHTED_INTERLEAVE] = "weighted interleave",
> [MPOL_LOCAL] = "local",
> [MPOL_PREFERRED_MANY] = "prefer (many)",
> };
> @@ -2963,6 +3184,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
> }
> break;
> case MPOL_INTERLEAVE:
> + case MPOL_WEIGHTED_INTERLEAVE:
> /*
> * Default to online nodes with memory if no nodelist
> */
> @@ -3073,6 +3295,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
> case MPOL_PREFERRED_MANY:
> case MPOL_BIND:
> case MPOL_INTERLEAVE:
> + case MPOL_WEIGHTED_INTERLEAVE:
> nodes = pol->nodes;
> break;
> default:
--
Best Regards,
Huang, Ying
On Wed, Jan 31, 2024 at 02:43:12PM +0800, Huang, Ying wrote:
> Gregory Price <[email protected]> writes:
> >
> > +static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
> > +{
> > + unsigned int node = current->il_prev;
> > +
> > + if (!current->il_weight || !node_isset(node, policy->nodes)) {
> > + node = next_node_in(node, policy->nodes);
> > + /* can only happen if nodemask is being rebound */
> > + if (node == MAX_NUMNODES)
> > + return node;
>
> I feel a little unsafe to read policy->nodes at same time of writing in
> rebound. Is it better to use a seqlock to guarantee its consistency?
> It's unnecessary to be a part of this series though.
>
I think this is handled already? It is definitely an explicit race
condition that is documented elsewhere:
/*
* mpol_rebind_policy - Migrate a policy to a different set of nodes
*
* Per-vma policies are protected by mmap_lock. Allocations using per-task
* policies are protected by task->mems_allowed_seq to prevent a premature
* OOM/allocation failure due to parallel nodemask modification.
*/
example from slub:
do {
cpuset_mems_cookie = read_mems_allowed_begin();
zonelist = node_zonelist(mempolicy_slab_node(), pc->flags);
...
} while (read_mems_allowed_retry(cpuset_mems_cookie));
quick perusal through other allocators, show similar checks.
page_alloc.c - check_retry_cpusetset()
filemap.c - filemap_alloc_folio()
If we ever want mempolicy to be swappable from outside the current task
context, this will have to change most likely - but that's another
feature for another day.
> > + while (target) {
> > + /* detect system default usage */
> > + weight = table ? table[nid] : 1;
> > + weight = weight ? weight : 1;
>
> I found duplicated pattern as above in this patch. Can we define a
> function like below to remove the duplication?
>
> u8 __get_il_weight(u8 *table, int nid)
> {
> u8 weight;
>
> weight = table ? table[nid] : 1;
> return weight ? : 1;
> }
>
When we implement the system-default array, this will change to:
weight = sysfs_table ? sysfs_table[nid] : default_table[nid];
This cleanup will get picked up in that patch set since this code is
going to change anyway.
> > + if (delta == weight) {
> > + /* boundary: resume from next node/weight */
> > + resume_node = next_node_in(node, nodes);
> > + resume_weight = weights[resume_node];
> > + } else {
> > + /* remainder: resume this node w/ remainder */
> > + resume_node = node;
> > + resume_weight = weight - delta;
> > + }
>
> If we are comfortable to leave resume_weight == 0, then the above
> branch can be simplified to.
>
> resume_node = node;
> resume_weight = weight - delta;
>
> But, this is a style issue again. I will leave it to you to decide.
Good point, and in fact there's a similar branch in the first half of
the function that can be simplified. Will follow up with a style patch.
mm/mempolicy.c | 21 ++++-----------------
1 file changed, 4 insertions(+), 17 deletions(-)
My favorite style of patch :D
Andrew if you happen to be monitoring, this is the patch (not tested
yet, but it's pretty obvious, otherwise i'll submit individually
tomorrow).
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 2c1aef8eab70..b0ca9bcdd64c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2405,15 +2405,9 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
page_array += nr_allocated;
total_allocated += nr_allocated;
/* if that's all the pages, no need to interleave */
- if (rem_pages < weight) {
- /* stay on current node, adjust il_weight */
+ if (rem_pages <= weight) {
me->il_weight -= rem_pages;
return total_allocated;
- } else if (rem_pages == weight) {
- /* move to next node / weight */
- me->il_prev = next_node_in(node, nodes);
- me->il_weight = get_il_weight(me->il_prev);
- return total_allocated;
}
/* Otherwise we adjust remaining pages, continue from there */
rem_pages -= weight;
@@ -2460,17 +2454,10 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
node_pages += weight;
delta -= weight;
} else if (delta) {
+ /* when delta is deleted, resume from that node */
node_pages += delta;
- /* delta may deplete on a boundary or w/ a remainder */
- if (delta == weight) {
- /* boundary: resume from next node/weight */
- resume_node = next_node_in(node, nodes);
- resume_weight = weights[resume_node];
- } else {
- /* remainder: resume this node w/ remainder */
- resume_node = node;
- resume_weight = weight - delta;
- }
+ resume_node = node;
+ resume_weight = weight - delta;
delta = 0;
}
/* node_pages can be 0 if an allocation fails and rounds == 0 */
>
> So, except the issue you pointed out already. All series looks good to
> me! Thanks! Feel free to add
>
> Reviewed-by: "Huang, Ying" <[email protected]>
>
> to the whole series.
>
Thank you so much for your patience with me! I appreciate all the help.
I am looking forward to this feature very much!
~Gregory
Gregory Price <[email protected]> writes:
> On Wed, Jan 31, 2024 at 02:43:12PM +0800, Huang, Ying wrote:
>> Gregory Price <[email protected]> writes:
>> >
>> > +static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
>> > +{
>> > + unsigned int node = current->il_prev;
>> > +
>> > + if (!current->il_weight || !node_isset(node, policy->nodes)) {
>> > + node = next_node_in(node, policy->nodes);
>> > + /* can only happen if nodemask is being rebound */
>> > + if (node == MAX_NUMNODES)
>> > + return node;
>>
>> I feel a little unsafe to read policy->nodes at same time of writing in
>> rebound. Is it better to use a seqlock to guarantee its consistency?
>> It's unnecessary to be a part of this series though.
>>
>
> I think this is handled already? It is definitely an explicit race
> condition that is documented elsewhere:
>
> /*
> * mpol_rebind_policy - Migrate a policy to a different set of nodes
> *
> * Per-vma policies are protected by mmap_lock. Allocations using per-task
> * policies are protected by task->mems_allowed_seq to prevent a premature
> * OOM/allocation failure due to parallel nodemask modification.
> */
Thanks for pointing this out!
If we use task->mems_allowed_seq reader side in
weighted_interleave_nodes() we can guarantee the consistency of
policy->nodes. That may be not deserved, because it's not a big deal to
allocate 1 page in a wrong node.
It makes more sense to do that in
alloc_pages_bulk_array_weighted_interleave(), because a lot of pages may
be allocated there.
> example from slub:
>
> do {
> cpuset_mems_cookie = read_mems_allowed_begin();
> zonelist = node_zonelist(mempolicy_slab_node(), pc->flags);
> ...
> } while (read_mems_allowed_retry(cpuset_mems_cookie));
>
> quick perusal through other allocators, show similar checks.
>
> page_alloc.c - check_retry_cpusetset()
> filemap.c - filemap_alloc_folio()
>
> If we ever want mempolicy to be swappable from outside the current task
> context, this will have to change most likely - but that's another
> feature for another day.
>
--
Best Regards,
Huang, Ying
On Wed, Jan 31, 2024 at 05:19:51PM +0800, Huang, Ying wrote:
> Gregory Price <[email protected]> writes:
>
> > On Wed, Jan 31, 2024 at 02:43:12PM +0800, Huang, Ying wrote:
> >> Gregory Price <[email protected]> writes:
> >> >
> >> > +static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
> >> > +{
> >> > + unsigned int node = current->il_prev;
> >> > +
> >> > + if (!current->il_weight || !node_isset(node, policy->nodes)) {
> >> > + node = next_node_in(node, policy->nodes);
> >> > + /* can only happen if nodemask is being rebound */
> >> > + if (node == MAX_NUMNODES)
> >> > + return node;
> >>
> >> I feel a little unsafe to read policy->nodes at same time of writing in
> >> rebound. Is it better to use a seqlock to guarantee its consistency?
> >> It's unnecessary to be a part of this series though.
> >>
> >
> > I think this is handled already? It is definitely an explicit race
> > condition that is documented elsewhere:
> >
> > /*
> > * mpol_rebind_policy - Migrate a policy to a different set of nodes
> > *
> > * Per-vma policies are protected by mmap_lock. Allocations using per-task
> > * policies are protected by task->mems_allowed_seq to prevent a premature
> > * OOM/allocation failure due to parallel nodemask modification.
> > */
>
> Thanks for pointing this out!
>
> If we use task->mems_allowed_seq reader side in
> weighted_interleave_nodes() we can guarantee the consistency of
> policy->nodes. That may be not deserved, because it's not a big deal to
> allocate 1 page in a wrong node.
>
> It makes more sense to do that in
> alloc_pages_bulk_array_weighted_interleave(), because a lot of pages may
> be allocated there.
>
That's probably worth just adding now, I'll do it and squash the style
updates into the branch. Sorry Andrew, I guess 1 last version is
inbound :]
I'll pick up the reviewed tags along the way.
~Gregory
On Wed, Jan 31, 2024 at 05:19:51PM +0800, Huang, Ying wrote:
> Gregory Price <[email protected]> writes:
>
> >
> > I think this is handled already? It is definitely an explicit race
> > condition that is documented elsewhere:
> >
> > /*
> > * mpol_rebind_policy - Migrate a policy to a different set of nodes
> > *
> > * Per-vma policies are protected by mmap_lock. Allocations using per-task
> > * policies are protected by task->mems_allowed_seq to prevent a premature
> > * OOM/allocation failure due to parallel nodemask modification.
> > */
>
> Thanks for pointing this out!
>
> If we use task->mems_allowed_seq reader side in
> weighted_interleave_nodes() we can guarantee the consistency of
> policy->nodes. That may be not deserved, because it's not a big deal to
> allocate 1 page in a wrong node.
>
> It makes more sense to do that in
> alloc_pages_bulk_array_weighted_interleave(), because a lot of pages may
> be allocated there.
>
To save the versioning if there are issues, here are the 3 diffs that
I have left. If you are good with these changes, I'll squash the first
2 into the third commit, keep the last one as a separate commit (it
changes the interleave_nodes() logic too), and submit v5 w/ your
reviewed tag on all of them.
Fix one (pedantic?) warning from syzbot:
----------------------------------------
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index b1437396c357..dfd097009606 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2391,7 +2391,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
unsigned long nr_allocated = 0;
unsigned long rounds;
unsigned long node_pages, delta;
- u8 __rcu *table, *weights, weight;
+ u8 __rcu *table, __rcu *weights, weight;
unsigned int weight_total = 0;
unsigned long rem_pages = nr_pages;
nodemask_t nodes;
Simplifying resume_node/weight logic:
-------------------------------------
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 2c1aef8eab70..b0ca9bcdd64c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2405,15 +2405,9 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
page_array += nr_allocated;
total_allocated += nr_allocated;
/* if that's all the pages, no need to interleave */
- if (rem_pages < weight) {
- /* stay on current node, adjust il_weight */
+ if (rem_pages <= weight) {
me->il_weight -= rem_pages;
return total_allocated;
- } else if (rem_pages == weight) {
- /* move to next node / weight */
- me->il_prev = next_node_in(node, nodes);
- me->il_weight = get_il_weight(me->il_prev);
- return total_allocated;
}
/* Otherwise we adjust remaining pages, continue from there */
rem_pages -= weight;
@@ -2460,17 +2454,10 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
node_pages += weight;
delta -= weight;
} else if (delta) {
+ /* when delta is deleted, resume from that node */
node_pages += delta;
- /* delta may deplete on a boundary or w/ a remainder */
- if (delta == weight) {
- /* boundary: resume from next node/weight */
- resume_node = next_node_in(node, nodes);
- resume_weight = weights[resume_node];
- } else {
- /* remainder: resume this node w/ remainder */
- resume_node = node;
- resume_weight = weight - delta;
- }
+ resume_node = node;
+ resume_weight = weight - delta;
delta = 0;
}
/* node_pages can be 0 if an allocation fails and rounds == 0 */
task->mems_allowed_seq protection (added as 4th patch)
------------------------------------------------------
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index b0ca9bcdd64c..b1437396c357 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1879,10 +1879,15 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
{
unsigned int node = current->il_prev;
+ unsigned int cpuset_mems_cookie;
+retry:
+ /* to prevent miscount use tsk->mems_allowed_seq to detect rebind */
+ cpuset_mems_cookie = read_mems_allowed_begin();
if (!current->il_weight || !node_isset(node, policy->nodes)) {
node = next_node_in(node, policy->nodes);
- /* can only happen if nodemask is being rebound */
+ if (read_mems_allowed_retry(cpuset_mems_cookie))
+ goto retry;
if (node == MAX_NUMNODES)
return node;
current->il_prev = node;
@@ -1896,10 +1901,17 @@ static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
static unsigned int interleave_nodes(struct mempolicy *policy)
{
unsigned int nid;
+ unsigned int cpuset_mems_cookie;
+
+ /* to prevent miscount, use tsk->mems_allowed_seq to detect rebind */
+ do {
+ cpuset_mems_cookie = read_mems_allowed_begin();
+ nid = next_node_in(current->il_prev, policy->nodes);
+ } while (read_mems_allowed_retry(cpuset_mems_cookie));
- nid = next_node_in(current->il_prev, policy->nodes);
if (nid < MAX_NUMNODES)
current->il_prev = nid;
+
return nid;
}
@@ -2374,6 +2386,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
struct page **page_array)
{
struct task_struct *me = current;
+ unsigned int cpuset_mems_cookie;
unsigned long total_allocated = 0;
unsigned long nr_allocated = 0;
unsigned long rounds;
@@ -2388,10 +2401,17 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
int prev_node;
int i;
+
if (!nr_pages)
return 0;
- nnodes = read_once_policy_nodemask(pol, &nodes);
+ /* read the nodes onto the stack, retry if done during rebind */
+ do {
+ cpuset_mems_cookie = read_mems_allowed_begin();
+ nnodes = read_once_policy_nodemask(pol, &nodes);
+ } while (read_mems_allowed_retry(cpuset_mems_cookie));
+
+ /* if the nodemask has become invalid, we cannot do anything */
if (!nnodes)
return 0;
Gregory Price <[email protected]> writes:
> On Wed, Jan 31, 2024 at 05:19:51PM +0800, Huang, Ying wrote:
>> Gregory Price <[email protected]> writes:
>>
>> >
>> > I think this is handled already? It is definitely an explicit race
>> > condition that is documented elsewhere:
>> >
>> > /*
>> > * mpol_rebind_policy - Migrate a policy to a different set of nodes
>> > *
>> > * Per-vma policies are protected by mmap_lock. Allocations using per-task
>> > * policies are protected by task->mems_allowed_seq to prevent a premature
>> > * OOM/allocation failure due to parallel nodemask modification.
>> > */
>>
>> Thanks for pointing this out!
>>
>> If we use task->mems_allowed_seq reader side in
>> weighted_interleave_nodes() we can guarantee the consistency of
>> policy->nodes. That may be not deserved, because it's not a big deal to
>> allocate 1 page in a wrong node.
>>
>> It makes more sense to do that in
>> alloc_pages_bulk_array_weighted_interleave(), because a lot of pages may
>> be allocated there.
>>
>
> To save the versioning if there are issues, here are the 3 diffs that
> I have left. If you are good with these changes, I'll squash the first
> 2 into the third commit, keep the last one as a separate commit (it
> changes the interleave_nodes() logic too), and submit v5 w/ your
> reviewed tag on all of them.
>
>
> Fix one (pedantic?) warning from syzbot:
> ----------------------------------------
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index b1437396c357..dfd097009606 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2391,7 +2391,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
> unsigned long nr_allocated = 0;
> unsigned long rounds;
> unsigned long node_pages, delta;
> - u8 __rcu *table, *weights, weight;
> + u8 __rcu *table, __rcu *weights, weight;
The __rcu usage can be checked with `sparse` directly. For example,
make C=1 mm/mempolicy.o
More details can be found in
https://www.kernel.org/doc/html/latest/dev-tools/sparse.html
Per my understanding, we shouldn't use "__rcu" here. Please search
"__rcu" in the following document.
https://www.kernel.org/doc/html/latest/RCU/checklist.html
> unsigned int weight_total = 0;
> unsigned long rem_pages = nr_pages;
> nodemask_t nodes;
>
>
>
> Simplifying resume_node/weight logic:
> -------------------------------------
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 2c1aef8eab70..b0ca9bcdd64c 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2405,15 +2405,9 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
> page_array += nr_allocated;
> total_allocated += nr_allocated;
> /* if that's all the pages, no need to interleave */
> - if (rem_pages < weight) {
> - /* stay on current node, adjust il_weight */
> + if (rem_pages <= weight) {
> me->il_weight -= rem_pages;
> return total_allocated;
> - } else if (rem_pages == weight) {
> - /* move to next node / weight */
> - me->il_prev = next_node_in(node, nodes);
> - me->il_weight = get_il_weight(me->il_prev);
> - return total_allocated;
> }
> /* Otherwise we adjust remaining pages, continue from there */
> rem_pages -= weight;
> @@ -2460,17 +2454,10 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
> node_pages += weight;
> delta -= weight;
> } else if (delta) {
> + /* when delta is deleted, resume from that node */
~~~~~~~
depleted?
> node_pages += delta;
> - /* delta may deplete on a boundary or w/ a remainder */
> - if (delta == weight) {
> - /* boundary: resume from next node/weight */
> - resume_node = next_node_in(node, nodes);
> - resume_weight = weights[resume_node];
> - } else {
> - /* remainder: resume this node w/ remainder */
> - resume_node = node;
> - resume_weight = weight - delta;
> - }
> + resume_node = node;
> + resume_weight = weight - delta;
> delta = 0;
> }
> /* node_pages can be 0 if an allocation fails and rounds == 0 */
>
>
>
>
>
> task->mems_allowed_seq protection (added as 4th patch)
> ------------------------------------------------------
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index b0ca9bcdd64c..b1437396c357 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1879,10 +1879,15 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
> static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
> {
> unsigned int node = current->il_prev;
> + unsigned int cpuset_mems_cookie;
>
> +retry:
> + /* to prevent miscount use tsk->mems_allowed_seq to detect rebind */
> + cpuset_mems_cookie = read_mems_allowed_begin();
> if (!current->il_weight || !node_isset(node, policy->nodes)) {
> node = next_node_in(node, policy->nodes);
node will be changed in the loop. So we need to change the logic here.
> - /* can only happen if nodemask is being rebound */
> + if (read_mems_allowed_retry(cpuset_mems_cookie))
> + goto retry;
> if (node == MAX_NUMNODES)
> return node;
> current->il_prev = node;
> @@ -1896,10 +1901,17 @@ static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
> static unsigned int interleave_nodes(struct mempolicy *policy)
> {
> unsigned int nid;
> + unsigned int cpuset_mems_cookie;
> +
> + /* to prevent miscount, use tsk->mems_allowed_seq to detect rebind */
> + do {
> + cpuset_mems_cookie = read_mems_allowed_begin();
> + nid = next_node_in(current->il_prev, policy->nodes);
> + } while (read_mems_allowed_retry(cpuset_mems_cookie));
>
> - nid = next_node_in(current->il_prev, policy->nodes);
> if (nid < MAX_NUMNODES)
> current->il_prev = nid;
> +
> return nid;
> }
>
> @@ -2374,6 +2386,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
> struct page **page_array)
> {
> struct task_struct *me = current;
> + unsigned int cpuset_mems_cookie;
> unsigned long total_allocated = 0;
> unsigned long nr_allocated = 0;
> unsigned long rounds;
> @@ -2388,10 +2401,17 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
> int prev_node;
> int i;
>
> +
Change by accident?
> if (!nr_pages)
> return 0;
>
> - nnodes = read_once_policy_nodemask(pol, &nodes);
> + /* read the nodes onto the stack, retry if done during rebind */
> + do {
> + cpuset_mems_cookie = read_mems_allowed_begin();
> + nnodes = read_once_policy_nodemask(pol, &nodes);
> + } while (read_mems_allowed_retry(cpuset_mems_cookie));
> +
> + /* if the nodemask has become invalid, we cannot do anything */
> if (!nnodes)
> return 0;
--
Best Regards,
Huang, Ying
On Thu, Feb 01, 2024 at 09:55:07AM +0800, Huang, Ying wrote:
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index b1437396c357..dfd097009606 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -2391,7 +2391,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
> > unsigned long nr_allocated = 0;
> > unsigned long rounds;
> > unsigned long node_pages, delta;
> > - u8 __rcu *table, *weights, weight;
> > + u8 __rcu *table, __rcu *weights, weight;
>
> The __rcu usage can be checked with `sparse` directly. For example,
>
> make C=1 mm/mempolicy.o
>
> More details can be found in
>
> https://www.kernel.org/doc/html/latest/dev-tools/sparse.html
>
> Per my understanding, we shouldn't use "__rcu" here. Please search
> "__rcu" in the following document.
>
> https://www.kernel.org/doc/html/latest/RCU/checklist.html
>
Thanks for this, I will sort this out and respond here with changes
before v5.
> > @@ -2460,17 +2454,10 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
> > node_pages += weight;
> > delta -= weight;
> > } else if (delta) {
> > + /* when delta is deleted, resume from that node */
> ~~~~~~~
> depleted?
ack.
> > +retry:
> > + /* to prevent miscount use tsk->mems_allowed_seq to detect rebind */
> > + cpuset_mems_cookie = read_mems_allowed_begin();
> > if (!current->il_weight || !node_isset(node, policy->nodes)) {
> > node = next_node_in(node, policy->nodes);
>
> node will be changed in the loop. So we need to change the logic here.
>
Good catch, stupid mistake. ack.
> > @@ -2388,10 +2401,17 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
> > int prev_node;
> > int i;
> >
> > +
>
> Change by accident?
>
ack.
~Gregory
On Thu, Feb 01, 2024 at 09:55:07AM +0800, Huang, Ying wrote:
> Gregory Price <[email protected]> writes:
> > - u8 __rcu *table, *weights, weight;
> > + u8 __rcu *table, __rcu *weights, weight;
>
> The __rcu usage can be checked with `sparse` directly. For example,
>
> make C=1 mm/mempolicy.o
>
fixed and squashed, all the __rcu usage i had except the global pointer
have been used. Thanks for the reference material, was struggling to
understand that.
> > task->mems_allowed_seq protection (added as 4th patch)
> > ------------------------------------------------------
> >
> > + cpuset_mems_cookie = read_mems_allowed_begin();
> > if (!current->il_weight || !node_isset(node, policy->nodes)) {
> > node = next_node_in(node, policy->nodes);
>
> node will be changed in the loop. So we need to change the logic here.
>
new patch, if it all looks good i'll ship it in v5
~Gregory
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d8cc3a577986..4e5a640d10b8 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1878,11 +1878,17 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
{
- unsigned int node = current->il_prev;
-
- if (!current->il_weight || !node_isset(node, policy->nodes)) {
- node = next_node_in(node, policy->nodes);
- /* can only happen if nodemask is being rebound */
+ unsigned int node;
+ unsigned int cpuset_mems_cookie;
+
+retry:
+ /* to prevent miscount use tsk->mems_allowed_seq to detect rebind */
+ cpuset_mems_cookie = read_mems_allowed_begin();
+ if (!current->il_weight ||
+ !node_isset(current->il_prev, policy->nodes)) {
+ node = next_node_in(current->il_prev, policy->nodes);
+ if (read_mems_allowed_retry(cpuset_mems_cookie))
+ goto retry;
if (node == MAX_NUMNODES)
return node;
current->il_prev = node;
@@ -1896,8 +1902,14 @@ static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
static unsigned int interleave_nodes(struct mempolicy *policy)
{
unsigned int nid;
+ unsigned int cpuset_mems_cookie;
+
+ /* to prevent miscount, use tsk->mems_allowed_seq to detect rebind */
+ do {
+ cpuset_mems_cookie = read_mems_allowed_begin();
+ nid = next_node_in(current->il_prev, policy->nodes);
+ } while (read_mems_allowed_retry(cpuset_mems_cookie));
- nid = next_node_in(current->il_prev, policy->nodes);
if (nid < MAX_NUMNODES)
current->il_prev = nid;
return nid;
@@ -2374,6 +2386,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
struct page **page_array)
{
struct task_struct *me = current;
+ unsigned int cpuset_mems_cookie;
unsigned long total_allocated = 0;
unsigned long nr_allocated = 0;
unsigned long rounds;
@@ -2391,7 +2404,13 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
if (!nr_pages)
return 0;
- nnodes = read_once_policy_nodemask(pol, &nodes);
+ /* read the nodes onto the stack, retry if done during rebind */
+ do {
+ cpuset_mems_cookie = read_mems_allowed_begin();
+ nnodes = read_once_policy_nodemask(pol, &nodes);
+ } while (read_mems_allowed_retry(cpuset_mems_cookie));
+
+ /* if the nodemask has become invalid, we cannot do anything */
if (!nnodes)
return 0;
Gregory Price <[email protected]> writes:
> On Thu, Feb 01, 2024 at 09:55:07AM +0800, Huang, Ying wrote:
>> Gregory Price <[email protected]> writes:
>> > - u8 __rcu *table, *weights, weight;
>> > + u8 __rcu *table, __rcu *weights, weight;
>>
>> The __rcu usage can be checked with `sparse` directly. For example,
>>
>> make C=1 mm/mempolicy.o
>>
>
> fixed and squashed, all the __rcu usage i had except the global pointer
> have been used. Thanks for the reference material, was struggling to
> understand that.
>
>> > task->mems_allowed_seq protection (added as 4th patch)
>> > ------------------------------------------------------
>> >
>> > + cpuset_mems_cookie = read_mems_allowed_begin();
>> > if (!current->il_weight || !node_isset(node, policy->nodes)) {
>> > node = next_node_in(node, policy->nodes);
>>
>> node will be changed in the loop. So we need to change the logic here.
>>
>
> new patch, if it all looks good i'll ship it in v5
>
> ~Gregory
>
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index d8cc3a577986..4e5a640d10b8 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1878,11 +1878,17 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
>
> static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
> {
> - unsigned int node = current->il_prev;
> -
> - if (!current->il_weight || !node_isset(node, policy->nodes)) {
> - node = next_node_in(node, policy->nodes);
> - /* can only happen if nodemask is being rebound */
> + unsigned int node;
IIUC, "node" may be used without initialization.
--
Best Regards,
Huang, Ying
> + unsigned int cpuset_mems_cookie;
> +
> +retry:
> + /* to prevent miscount use tsk->mems_allowed_seq to detect rebind */
> + cpuset_mems_cookie = read_mems_allowed_begin();
> + if (!current->il_weight ||
> + !node_isset(current->il_prev, policy->nodes)) {
> + node = next_node_in(current->il_prev, policy->nodes);
> + if (read_mems_allowed_retry(cpuset_mems_cookie))
> + goto retry;
> if (node == MAX_NUMNODES)
> return node;
> current->il_prev = node;
> @@ -1896,8 +1902,14 @@ static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
> static unsigned int interleave_nodes(struct mempolicy *policy)
> {
> unsigned int nid;
> + unsigned int cpuset_mems_cookie;
> +
> + /* to prevent miscount, use tsk->mems_allowed_seq to detect rebind */
> + do {
> + cpuset_mems_cookie = read_mems_allowed_begin();
> + nid = next_node_in(current->il_prev, policy->nodes);
> + } while (read_mems_allowed_retry(cpuset_mems_cookie));
>
> - nid = next_node_in(current->il_prev, policy->nodes);
> if (nid < MAX_NUMNODES)
> current->il_prev = nid;
> return nid;
> @@ -2374,6 +2386,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
> struct page **page_array)
> {
> struct task_struct *me = current;
> + unsigned int cpuset_mems_cookie;
> unsigned long total_allocated = 0;
> unsigned long nr_allocated = 0;
> unsigned long rounds;
> @@ -2391,7 +2404,13 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
> if (!nr_pages)
> return 0;
>
> - nnodes = read_once_policy_nodemask(pol, &nodes);
> + /* read the nodes onto the stack, retry if done during rebind */
> + do {
> + cpuset_mems_cookie = read_mems_allowed_begin();
> + nnodes = read_once_policy_nodemask(pol, &nodes);
> + } while (read_mems_allowed_retry(cpuset_mems_cookie));
> +
> + /* if the nodemask has become invalid, we cannot do anything */
> if (!nnodes)
> return 0;
On Thu, Feb 01, 2024 at 11:02:47AM +0800, Huang, Ying wrote:
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index d8cc3a577986..4e5a640d10b8 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -1878,11 +1878,17 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
> >
> > static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
> > {
> > - unsigned int node = current->il_prev;
> > -
> > - if (!current->il_weight || !node_isset(node, policy->nodes)) {
> > - node = next_node_in(node, policy->nodes);
> > - /* can only happen if nodemask is being rebound */
> > + unsigned int node;
>
> IIUC, "node" may be used without initialization.
>
ok i should slow down lol. This should take care of it.
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d8cc3a577986..ed0d5d2d456a 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1878,11 +1878,17 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
{
- unsigned int node = current->il_prev;
-
- if (!current->il_weight || !node_isset(node, policy->nodes)) {
+ unsigned int node;
+ unsigned int cpuset_mems_cookie;
+
+retry:
+ /* to prevent miscount use tsk->mems_allowed_seq to detect rebind */
+ cpuset_mems_cookie = read_mems_allowed_begin();
+ node = current->il_prev;
+ if (!node || !node_isset(node, policy->nodes)) {
node = next_node_in(node, policy->nodes);
- /* can only happen if nodemask is being rebound */
+ if (read_mems_allowed_retry(cpuset_mems_cookie))
+ goto retry;
if (node == MAX_NUMNODES)
return node;
current->il_prev = node;
@@ -1896,8 +1902,14 @@ static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
static unsigned int interleave_nodes(struct mempolicy *policy)
{
unsigned int nid;
+ unsigned int cpuset_mems_cookie;
+
+ /* to prevent miscount, use tsk->mems_allowed_seq to detect rebind */
+ do {
+ cpuset_mems_cookie = read_mems_allowed_begin();
+ nid = next_node_in(current->il_prev, policy->nodes);
+ } while (read_mems_allowed_retry(cpuset_mems_cookie));
- nid = next_node_in(current->il_prev, policy->nodes);
if (nid < MAX_NUMNODES)
current->il_prev = nid;
return nid;
@@ -2374,6 +2386,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
struct page **page_array)
{
struct task_struct *me = current;
+ unsigned int cpuset_mems_cookie;
unsigned long total_allocated = 0;
unsigned long nr_allocated = 0;
unsigned long rounds;
@@ -2391,7 +2404,13 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
if (!nr_pages)
return 0;
- nnodes = read_once_policy_nodemask(pol, &nodes);
+ /* read the nodes onto the stack, retry if done during rebind */
+ do {
+ cpuset_mems_cookie = read_mems_allowed_begin();
+ nnodes = read_once_policy_nodemask(pol, &nodes);
+ } while (read_mems_allowed_retry(cpuset_mems_cookie));
+
+ /* if the nodemask has become invalid, we cannot do anything */
if (!nnodes)
return 0;