LinuxLists.cc - [PATCH RESEND 0/9] sched: cpumask: improve on cpumask_local

[permalink] [raw]

Subject: [PATCH 1/9] lib/find: introduce find_nth_and_andnot_bit

In the following patches the function is used to implement in-place bitmaps
traversing without storing intermediate result in temporary bitmaps.

Signed-off-by: Yury Norov <[email protected]>
Acked-by: Tariq Toukan <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Reviewed-by: Peter Lafreniere <[email protected]>
---
include/linux/find.h | 33 +++++++++++++++++++++++++++++++++
lib/find_bit.c | 9 +++++++++
2 files changed, 42 insertions(+)

diff --git a/include/linux/find.h b/include/linux/find.h
index ccaf61a0f5fd..4647864a5ffd 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -22,6 +22,9 @@ unsigned long __find_nth_and_bit(const unsigned long *addr1, const unsigned long
unsigned long size, unsigned long n);
unsigned long __find_nth_andnot_bit(const unsigned long *addr1, const unsigned long *addr2,
unsigned long size, unsigned long n);
+unsigned long __find_nth_and_andnot_bit(const unsigned long *addr1, const unsigned long *addr2,
+ const unsigned long *addr3, unsigned long size,
+ unsigned long n);
extern unsigned long _find_first_and_bit(const unsigned long *addr1,
const unsigned long *addr2, unsigned long size);
extern unsigned long _find_first_zero_bit(const unsigned long *addr, unsigned long size);
@@ -255,6 +258,36 @@ unsigned long find_nth_andnot_bit(const unsigned long *addr1, const unsigned lon
return __find_nth_andnot_bit(addr1, addr2, size, n);
}

+/**
+ * find_nth_and_andnot_bit - find N'th set bit in 2 memory regions,
+ * excluding those set in 3rd region
+ * @addr1: The 1st address to start the search at
+ * @addr2: The 2nd address to start the search at
+ * @addr3: The 3rd address to start the search at
+ * @size: The maximum number of bits to search
+ * @n: The number of set bit, which position is needed, counting from 0
+ *
+ * Returns the bit number of the N'th set bit.
+ * If no such, returns @size.
+ */
+static __always_inline
+unsigned long find_nth_and_andnot_bit(const unsigned long *addr1,
+ const unsigned long *addr2,
+ const unsigned long *addr3,
+ unsigned long size, unsigned long n)
+{
+ if (n >= size)
+ return size;
+
+ if (small_const_nbits(size)) {
+ unsigned long val = *addr1 & *addr2 & (~*addr3) & GENMASK(size - 1, 0);
+
+ return val ? fns(val, n) : size;
+ }
+
+ return __find_nth_and_andnot_bit(addr1, addr2, addr3, size, n);
+}
+
#ifndef find_first_and_bit
/**
* find_first_and_bit - find the first set bit in both memory regions
diff --git a/lib/find_bit.c b/lib/find_bit.c
index 18bc0a7ac8ee..c10920e66788 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -155,6 +155,15 @@ unsigned long __find_nth_andnot_bit(const unsigned long *addr1, const unsigned l
}
EXPORT_SYMBOL(__find_nth_andnot_bit);

+unsigned long __find_nth_and_andnot_bit(const unsigned long *addr1,
+ const unsigned long *addr2,
+ const unsigned long *addr3,
+ unsigned long size, unsigned long n)
+{
+ return FIND_NTH_BIT(addr1[idx] & addr2[idx] & ~addr3[idx], size, n);
+}
+EXPORT_SYMBOL(__find_nth_and_andnot_bit);
+
#ifndef find_next_and_bit
unsigned long _find_next_and_bit(const unsigned long *addr1, const unsigned long *addr2,
unsigned long nbits, unsigned long start)
--
2.34.1

2023-01-21 04:26:28

[permalink] [raw]

Subject: [PATCH 2/9] cpumask: introduce cpumask_nth_and_andnot

Introduce cpumask_nth_and_andnot() based on find_nth_and_andnot_bit().
It's used in the following patch to traverse cpumasks without storing
intermediate result in temporary cpumask.

Signed-off-by: Yury Norov <[email protected]>
Acked-by: Tariq Toukan <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Reviewed-by: Peter Lafreniere <[email protected]>
---
include/linux/cpumask.h | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index c2aa0aa26b45..7b16aede7ac5 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -391,6 +391,26 @@ unsigned int cpumask_nth_andnot(unsigned int cpu, const struct cpumask *srcp1,
nr_cpumask_bits, cpumask_check(cpu));
}

+/**
+ * cpumask_nth_and_andnot - get the Nth cpu set in 1st and 2nd cpumask, and clear in 3rd.
+ * @srcp1: the cpumask pointer
+ * @srcp2: the cpumask pointer
+ * @srcp3: the cpumask pointer
+ * @cpu: the N'th cpu to find, starting from 0
+ *
+ * Returns >= nr_cpu_ids if such cpu doesn't exist.
+ */
+static __always_inline
+unsigned int cpumask_nth_and_andnot(unsigned int cpu, const struct cpumask *srcp1,
+ const struct cpumask *srcp2,
+ const struct cpumask *srcp3)
+{
+ return find_nth_and_andnot_bit(cpumask_bits(srcp1),
+ cpumask_bits(srcp2),
+ cpumask_bits(srcp3),
+ nr_cpumask_bits, cpumask_check(cpu));
+}
+
#define CPU_BITS_NONE \
{ \
[0 ... BITS_TO_LONGS(NR_CPUS)-1] = 0UL \
--
2.34.1

2023-01-21 04:26:38

[permalink] [raw]

Subject: [PATCH 4/9] cpumask: improve on cpumask_local_spread() locality

Switch cpumask_local_spread() to use newly added sched_numa_find_nth_cpu(),
which takes into account distances to each node in the system.

For the following NUMA configuration:

root@debian:~# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3
node 0 size: 3869 MB
node 0 free: 3740 MB
node 1 cpus: 4 5
node 1 size: 1969 MB
node 1 free: 1937 MB
node 2 cpus: 6 7
node 2 size: 1967 MB
node 2 free: 1873 MB
node 3 cpus: 8 9 10 11 12 13 14 15
node 3 size: 7842 MB
node 3 free: 7723 MB
node distances:
node 0 1 2 3
0: 10 50 30 70
1: 50 10 70 30
2: 30 70 10 50
3: 70 30 50 10

The new cpumask_local_spread() traverses cpus for each node like this:

node 0: 0 1 2 3 6 7 4 5 8 9 10 11 12 13 14 15
node 1: 4 5 8 9 10 11 12 13 14 15 0 1 2 3 6 7
node 2: 6 7 0 1 2 3 8 9 10 11 12 13 14 15 4 5
node 3: 8 9 10 11 12 13 14 15 4 5 6 7 0 1 2 3

Signed-off-by: Yury Norov <[email protected]>
Acked-by: Tariq Toukan <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Reviewed-by: Peter Lafreniere <[email protected]>
---
lib/cpumask.c | 12 ++----------
1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/lib/cpumask.c b/lib/cpumask.c
index c7c392514fd3..255974cd6734 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -110,7 +110,7 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
#endif

/**
- * cpumask_local_spread - select the i'th cpu with local numa cpu's first
+ * cpumask_local_spread - select the i'th cpu based on NUMA distances
* @i: index number
* @node: local numa_node
*
@@ -132,15 +132,7 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
if (cpu < nr_cpu_ids)
return cpu;
} else {
- /* NUMA first. */
- cpu = cpumask_nth_and(i, cpu_online_mask, cpumask_of_node(node));
- if (cpu < nr_cpu_ids)
- return cpu;
-
- i -= cpumask_weight_and(cpu_online_mask, cpumask_of_node(node));
-
- /* Skip NUMA nodes, done above. */
- cpu = cpumask_nth_andnot(i, cpu_online_mask, cpumask_of_node(node));
+ cpu = sched_numa_find_nth_cpu(cpu_online_mask, i, node);
if (cpu < nr_cpu_ids)
return cpu;
}
--
2.34.1

2023-01-21 04:27:03

[permalink] [raw]

Subject: [PATCH 3/9] sched: add sched_numa_find_nth_cpu()

The function finds Nth set CPU in a given cpumask starting from a given
node.

Leveraging the fact that each hop in sched_domains_numa_masks includes the
same or greater number of CPUs than the previous one, we can use binary
search on hops instead of linear walk, which makes the overall complexity
of O(log n) in terms of number of cpumask_weight() calls.

Signed-off-by: Yury Norov <[email protected]>
Acked-by: Tariq Toukan <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Reviewed-by: Peter Lafreniere <[email protected]>
---
include/linux/topology.h | 8 ++++++
kernel/sched/topology.c | 57 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 65 insertions(+)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 4564faafd0e1..72f264575698 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -245,5 +245,13 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
return cpumask_of_node(cpu_to_node(cpu));
}

+#ifdef CONFIG_NUMA
+int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node);
+#else
+static __always_inline int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
+{
+ return cpumask_nth(cpu, cpus);
+}
+#endif /* CONFIG_NUMA */

#endif /* _LINUX_TOPOLOGY_H */
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8739c2a5a54e..2bf89186a10f 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -3,6 +3,8 @@
* Scheduler topology setup/handling methods
*/

+#include <linux/bsearch.h>
+
DEFINE_MUTEX(sched_domains_mutex);

/* Protected by sched_domains_mutex: */
@@ -2067,6 +2069,61 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
return found;
}

+struct __cmp_key {
+ const struct cpumask *cpus;
+ struct cpumask ***masks;
+ int node;
+ int cpu;
+ int w;
+};
+
+static int hop_cmp(const void *a, const void *b)
+{
+ struct cpumask **prev_hop = *((struct cpumask ***)b - 1);
+ struct cpumask **cur_hop = *(struct cpumask ***)b;
+ struct __cmp_key *k = (struct __cmp_key *)a;
+
+ if (cpumask_weight_and(k->cpus, cur_hop[k->node]) <= k->cpu)
+ return 1;
+
+ k->w = (b == k->masks) ? 0 : cpumask_weight_and(k->cpus, prev_hop[k->node]);
+ if (k->w <= k->cpu)
+ return 0;
+
+ return -1;
+}
+
+/*
+ * sched_numa_find_nth_cpu() - given the NUMA topology, find the Nth next cpu
+ * closest to @cpu from @cpumask.
+ * cpumask: cpumask to find a cpu from
+ * cpu: Nth cpu to find
+ *
+ * returns: cpu, or nr_cpu_ids when nothing found.
+ */
+int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
+{
+ struct __cmp_key k = { .cpus = cpus, .node = node, .cpu = cpu };
+ struct cpumask ***hop_masks;
+ int hop, ret = nr_cpu_ids;
+
+ rcu_read_lock();
+
+ k.masks = rcu_dereference(sched_domains_numa_masks);
+ if (!k.masks)
+ goto unlock;
+
+ hop_masks = bsearch(&k, k.masks, sched_domains_numa_levels, sizeof(k.masks[0]), hop_cmp);
+ hop = hop_masks - k.masks;
+
+ ret = hop ?
+ cpumask_nth_and_andnot(cpu - k.w, cpus, k.masks[hop][node], k.masks[hop-1][node]) :
+ cpumask_nth_and(cpu, cpus, k.masks[0][node]);
+unlock:
+ rcu_read_unlock();
+ return ret;
+}
+EXPORT_SYMBOL_GPL(sched_numa_find_nth_cpu);
#endif /* CONFIG_NUMA */

static int __sdt_alloc(const struct cpumask *cpu_map)
--
2.34.1

2023-01-21 04:27:04

[permalink] [raw]

Subject: [PATCH 8/9] net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints

From: Tariq Toukan <[email protected]>

In the IRQ affinity hints, replace the binary NUMA preference (local /
remote) with the improved for_each_numa_hop_cpu() API that minds the
actual distances, so that remote NUMAs with short distance are preferred
over farther ones.

This has significant performance implications when using NUMA-aware
allocated memory (follow [1] and derivatives for example).

[1]
drivers/net/ethernet/mellanox/mlx5/core/en_main.c :: mlx5e_open_channel()
int cpu = cpumask_first(mlx5_comp_irq_get_affinity_mask(priv->mdev, ix));

Performance tests:

TCP multi-stream, using 16 iperf3 instances pinned to 16 cores (with aRFS on).
Active cores: 64,65,72,73,80,81,88,89,96,97,104,105,112,113,120,121

+-------------------------+-----------+------------------+------------------+
| | BW (Gbps) | TX side CPU util | RX side CPU util |
+-------------------------+-----------+------------------+------------------+
| Baseline | 52.3 | 6.4 % | 17.9 % |
+-------------------------+-----------+------------------+------------------+
| Applied on TX side only | 52.6 | 5.2 % | 18.5 % |
+-------------------------+-----------+------------------+------------------+
| Applied on RX side only | 94.9 | 11.9 % | 27.2 % |
+-------------------------+-----------+------------------+------------------+
| Applied on both sides | 95.1 | 8.4 % | 27.3 % |
+-------------------------+-----------+------------------+------------------+

Bottleneck in RX side is released, reached linerate (~1.8x speedup).
~30% less cpu util on TX.

* CPU util on active cores only.

Setups details (similar for both sides):

NIC: ConnectX6-DX dual port, 100 Gbps each.
Single port used in the tests.

$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 16
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7763 64-Core Processor
Stepping: 1
CPU MHz: 2594.804
BogoMIPS: 4890.73
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 32768K
NUMA node0 CPU(s): 0-7,128-135
NUMA node1 CPU(s): 8-15,136-143
NUMA node2 CPU(s): 16-23,144-151
NUMA node3 CPU(s): 24-31,152-159
NUMA node4 CPU(s): 32-39,160-167
NUMA node5 CPU(s): 40-47,168-175
NUMA node6 CPU(s): 48-55,176-183
NUMA node7 CPU(s): 56-63,184-191
NUMA node8 CPU(s): 64-71,192-199
NUMA node9 CPU(s): 72-79,200-207
NUMA node10 CPU(s): 80-87,208-215
NUMA node11 CPU(s): 88-95,216-223
NUMA node12 CPU(s): 96-103,224-231
NUMA node13 CPU(s): 104-111,232-239
NUMA node14 CPU(s): 112-119,240-247
NUMA node15 CPU(s): 120-127,248-255
..

$ numactl -H
..
node distances:
node 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0: 10 11 11 11 12 12 12 12 32 32 32 32 32 32 32 32
1: 11 10 11 11 12 12 12 12 32 32 32 32 32 32 32 32
2: 11 11 10 11 12 12 12 12 32 32 32 32 32 32 32 32
3: 11 11 11 10 12 12 12 12 32 32 32 32 32 32 32 32
4: 12 12 12 12 10 11 11 11 32 32 32 32 32 32 32 32
5: 12 12 12 12 11 10 11 11 32 32 32 32 32 32 32 32
6: 12 12 12 12 11 11 10 11 32 32 32 32 32 32 32 32
7: 12 12 12 12 11 11 11 10 32 32 32 32 32 32 32 32
8: 32 32 32 32 32 32 32 32 10 11 11 11 12 12 12 12
9: 32 32 32 32 32 32 32 32 11 10 11 11 12 12 12 12
10: 32 32 32 32 32 32 32 32 11 11 10 11 12 12 12 12
11: 32 32 32 32 32 32 32 32 11 11 11 10 12 12 12 12
12: 32 32 32 32 32 32 32 32 12 12 12 12 10 11 11 11
13: 32 32 32 32 32 32 32 32 12 12 12 12 11 10 11 11
14: 32 32 32 32 32 32 32 32 12 12 12 12 11 11 10 11
15: 32 32 32 32 32 32 32 32 12 12 12 12 11 11 11 10

$ cat /sys/class/net/ens5f0/device/numa_node
14

Affinity hints (127 IRQs):
Before:
331: 00000000,00000000,00000000,00000000,00010000,00000000,00000000,00000000
332: 00000000,00000000,00000000,00000000,00020000,00000000,00000000,00000000
333: 00000000,00000000,00000000,00000000,00040000,00000000,00000000,00000000
334: 00000000,00000000,00000000,00000000,00080000,00000000,00000000,00000000
335: 00000000,00000000,00000000,00000000,00100000,00000000,00000000,00000000
336: 00000000,00000000,00000000,00000000,00200000,00000000,00000000,00000000
337: 00000000,00000000,00000000,00000000,00400000,00000000,00000000,00000000
338: 00000000,00000000,00000000,00000000,00800000,00000000,00000000,00000000
339: 00010000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
340: 00020000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
341: 00040000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
342: 00080000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
343: 00100000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
344: 00200000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
345: 00400000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
346: 00800000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
347: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
348: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000002
349: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000004
350: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000008
351: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000010
352: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000020
353: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000040
354: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000080
355: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000100
356: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000200
357: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000400
358: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000800
359: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00001000
360: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00002000
361: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00004000
362: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00008000
363: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00010000
364: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00020000
365: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00040000
366: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00080000
367: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00100000
368: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00200000
369: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00400000
370: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00800000
371: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,01000000
372: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,02000000
373: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,04000000
374: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,08000000
375: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,10000000
376: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,20000000
377: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,40000000
378: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,80000000
379: 00000000,00000000,00000000,00000000,00000000,00000000,00000001,00000000
380: 00000000,00000000,00000000,00000000,00000000,00000000,00000002,00000000
381: 00000000,00000000,00000000,00000000,00000000,00000000,00000004,00000000
382: 00000000,00000000,00000000,00000000,00000000,00000000,00000008,00000000
383: 00000000,00000000,00000000,00000000,00000000,00000000,00000010,00000000
384: 00000000,00000000,00000000,00000000,00000000,00000000,00000020,00000000
385: 00000000,00000000,00000000,00000000,00000000,00000000,00000040,00000000
386: 00000000,00000000,00000000,00000000,00000000,00000000,00000080,00000000
387: 00000000,00000000,00000000,00000000,00000000,00000000,00000100,00000000
388: 00000000,00000000,00000000,00000000,00000000,00000000,00000200,00000000
389: 00000000,00000000,00000000,00000000,00000000,00000000,00000400,00000000
390: 00000000,00000000,00000000,00000000,00000000,00000000,00000800,00000000
391: 00000000,00000000,00000000,00000000,00000000,00000000,00001000,00000000
392: 00000000,00000000,00000000,00000000,00000000,00000000,00002000,00000000
393: 00000000,00000000,00000000,00000000,00000000,00000000,00004000,00000000
394: 00000000,00000000,00000000,00000000,00000000,00000000,00008000,00000000
395: 00000000,00000000,00000000,00000000,00000000,00000000,00010000,00000000
396: 00000000,00000000,00000000,00000000,00000000,00000000,00020000,00000000
397: 00000000,00000000,00000000,00000000,00000000,00000000,00040000,00000000
398: 00000000,00000000,00000000,00000000,00000000,00000000,00080000,00000000
399: 00000000,00000000,00000000,00000000,00000000,00000000,00100000,00000000
400: 00000000,00000000,00000000,00000000,00000000,00000000,00200000,00000000
401: 00000000,00000000,00000000,00000000,00000000,00000000,00400000,00000000
402: 00000000,00000000,00000000,00000000,00000000,00000000,00800000,00000000
403: 00000000,00000000,00000000,00000000,00000000,00000000,01000000,00000000
404: 00000000,00000000,00000000,00000000,00000000,00000000,02000000,00000000
405: 00000000,00000000,00000000,00000000,00000000,00000000,04000000,00000000
406: 00000000,00000000,00000000,00000000,00000000,00000000,08000000,00000000
407: 00000000,00000000,00000000,00000000,00000000,00000000,10000000,00000000
408: 00000000,00000000,00000000,00000000,00000000,00000000,20000000,00000000
409: 00000000,00000000,00000000,00000000,00000000,00000000,40000000,00000000
410: 00000000,00000000,00000000,00000000,00000000,00000000,80000000,00000000
411: 00000000,00000000,00000000,00000000,00000000,00000001,00000000,00000000
412: 00000000,00000000,00000000,00000000,00000000,00000002,00000000,00000000
413: 00000000,00000000,00000000,00000000,00000000,00000004,00000000,00000000
414: 00000000,00000000,00000000,00000000,00000000,00000008,00000000,00000000
415: 00000000,00000000,00000000,00000000,00000000,00000010,00000000,00000000
416: 00000000,00000000,00000000,00000000,00000000,00000020,00000000,00000000
417: 00000000,00000000,00000000,00000000,00000000,00000040,00000000,00000000
418: 00000000,00000000,00000000,00000000,00000000,00000080,00000000,00000000
419: 00000000,00000000,00000000,00000000,00000000,00000100,00000000,00000000
420: 00000000,00000000,00000000,00000000,00000000,00000200,00000000,00000000
421: 00000000,00000000,00000000,00000000,00000000,00000400,00000000,00000000
422: 00000000,00000000,00000000,00000000,00000000,00000800,00000000,00000000
423: 00000000,00000000,00000000,00000000,00000000,00001000,00000000,00000000
424: 00000000,00000000,00000000,00000000,00000000,00002000,00000000,00000000
425: 00000000,00000000,00000000,00000000,00000000,00004000,00000000,00000000
426: 00000000,00000000,00000000,00000000,00000000,00008000,00000000,00000000
427: 00000000,00000000,00000000,00000000,00000000,00010000,00000000,00000000
428: 00000000,00000000,00000000,00000000,00000000,00020000,00000000,00000000
429: 00000000,00000000,00000000,00000000,00000000,00040000,00000000,00000000
430: 00000000,00000000,00000000,00000000,00000000,00080000,00000000,00000000
431: 00000000,00000000,00000000,00000000,00000000,00100000,00000000,00000000
432: 00000000,00000000,00000000,00000000,00000000,00200000,00000000,00000000
433: 00000000,00000000,00000000,00000000,00000000,00400000,00000000,00000000
434: 00000000,00000000,00000000,00000000,00000000,00800000,00000000,00000000
435: 00000000,00000000,00000000,00000000,00000000,01000000,00000000,00000000
436: 00000000,00000000,00000000,00000000,00000000,02000000,00000000,00000000
437: 00000000,00000000,00000000,00000000,00000000,04000000,00000000,00000000
438: 00000000,00000000,00000000,00000000,00000000,08000000,00000000,00000000
439: 00000000,00000000,00000000,00000000,00000000,10000000,00000000,00000000
440: 00000000,00000000,00000000,00000000,00000000,20000000,00000000,00000000
441: 00000000,00000000,00000000,00000000,00000000,40000000,00000000,00000000
442: 00000000,00000000,00000000,00000000,00000000,80000000,00000000,00000000
443: 00000000,00000000,00000000,00000000,00000001,00000000,00000000,00000000
444: 00000000,00000000,00000000,00000000,00000002,00000000,00000000,00000000
445: 00000000,00000000,00000000,00000000,00000004,00000000,00000000,00000000
446: 00000000,00000000,00000000,00000000,00000008,00000000,00000000,00000000
447: 00000000,00000000,00000000,00000000,00000010,00000000,00000000,00000000
448: 00000000,00000000,00000000,00000000,00000020,00000000,00000000,00000000
449: 00000000,00000000,00000000,00000000,00000040,00000000,00000000,00000000
450: 00000000,00000000,00000000,00000000,00000080,00000000,00000000,00000000
451: 00000000,00000000,00000000,00000000,00000100,00000000,00000000,00000000
452: 00000000,00000000,00000000,00000000,00000200,00000000,00000000,00000000
453: 00000000,00000000,00000000,00000000,00000400,00000000,00000000,00000000
454: 00000000,00000000,00000000,00000000,00000800,00000000,00000000,00000000
455: 00000000,00000000,00000000,00000000,00001000,00000000,00000000,00000000
456: 00000000,00000000,00000000,00000000,00002000,00000000,00000000,00000000
457: 00000000,00000000,00000000,00000000,00004000,00000000,00000000,00000000

After:
331: 00000000,00000000,00000000,00000000,00010000,00000000,00000000,00000000
332: 00000000,00000000,00000000,00000000,00020000,00000000,00000000,00000000
333: 00000000,00000000,00000000,00000000,00040000,00000000,00000000,00000000
334: 00000000,00000000,00000000,00000000,00080000,00000000,00000000,00000000
335: 00000000,00000000,00000000,00000000,00100000,00000000,00000000,00000000
336: 00000000,00000000,00000000,00000000,00200000,00000000,00000000,00000000
337: 00000000,00000000,00000000,00000000,00400000,00000000,00000000,00000000
338: 00000000,00000000,00000000,00000000,00800000,00000000,00000000,00000000
339: 00010000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
340: 00020000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
341: 00040000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
342: 00080000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
343: 00100000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
344: 00200000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
345: 00400000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
346: 00800000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
347: 00000000,00000000,00000000,00000000,00000001,00000000,00000000,00000000
348: 00000000,00000000,00000000,00000000,00000002,00000000,00000000,00000000
349: 00000000,00000000,00000000,00000000,00000004,00000000,00000000,00000000
350: 00000000,00000000,00000000,00000000,00000008,00000000,00000000,00000000
351: 00000000,00000000,00000000,00000000,00000010,00000000,00000000,00000000
352: 00000000,00000000,00000000,00000000,00000020,00000000,00000000,00000000
353: 00000000,00000000,00000000,00000000,00000040,00000000,00000000,00000000
354: 00000000,00000000,00000000,00000000,00000080,00000000,00000000,00000000
355: 00000000,00000000,00000000,00000000,00000100,00000000,00000000,00000000
356: 00000000,00000000,00000000,00000000,00000200,00000000,00000000,00000000
357: 00000000,00000000,00000000,00000000,00000400,00000000,00000000,00000000
358: 00000000,00000000,00000000,00000000,00000800,00000000,00000000,00000000
359: 00000000,00000000,00000000,00000000,00001000,00000000,00000000,00000000
360: 00000000,00000000,00000000,00000000,00002000,00000000,00000000,00000000
361: 00000000,00000000,00000000,00000000,00004000,00000000,00000000,00000000
362: 00000000,00000000,00000000,00000000,00008000,00000000,00000000,00000000
363: 00000000,00000000,00000000,00000000,01000000,00000000,00000000,00000000
364: 00000000,00000000,00000000,00000000,02000000,00000000,00000000,00000000
365: 00000000,00000000,00000000,00000000,04000000,00000000,00000000,00000000
366: 00000000,00000000,00000000,00000000,08000000,00000000,00000000,00000000
367: 00000000,00000000,00000000,00000000,10000000,00000000,00000000,00000000
368: 00000000,00000000,00000000,00000000,20000000,00000000,00000000,00000000
369: 00000000,00000000,00000000,00000000,40000000,00000000,00000000,00000000
370: 00000000,00000000,00000000,00000000,80000000,00000000,00000000,00000000
371: 00000001,00000000,00000000,00000000,00000000,00000000,00000000,00000000
372: 00000002,00000000,00000000,00000000,00000000,00000000,00000000,00000000
373: 00000004,00000000,00000000,00000000,00000000,00000000,00000000,00000000
374: 00000008,00000000,00000000,00000000,00000000,00000000,00000000,00000000
375: 00000010,00000000,00000000,00000000,00000000,00000000,00000000,00000000
376: 00000020,00000000,00000000,00000000,00000000,00000000,00000000,00000000
377: 00000040,00000000,00000000,00000000,00000000,00000000,00000000,00000000
378: 00000080,00000000,00000000,00000000,00000000,00000000,00000000,00000000
379: 00000100,00000000,00000000,00000000,00000000,00000000,00000000,00000000
380: 00000200,00000000,00000000,00000000,00000000,00000000,00000000,00000000
381: 00000400,00000000,00000000,00000000,00000000,00000000,00000000,00000000
382: 00000800,00000000,00000000,00000000,00000000,00000000,00000000,00000000
383: 00001000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
384: 00002000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
385: 00004000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
386: 00008000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
387: 01000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
388: 02000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
389: 04000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
390: 08000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
391: 10000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
392: 20000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
393: 40000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
394: 80000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
395: 00000000,00000000,00000000,00000000,00000000,00000001,00000000,00000000
396: 00000000,00000000,00000000,00000000,00000000,00000002,00000000,00000000
397: 00000000,00000000,00000000,00000000,00000000,00000004,00000000,00000000
398: 00000000,00000000,00000000,00000000,00000000,00000008,00000000,00000000
399: 00000000,00000000,00000000,00000000,00000000,00000010,00000000,00000000
400: 00000000,00000000,00000000,00000000,00000000,00000020,00000000,00000000
401: 00000000,00000000,00000000,00000000,00000000,00000040,00000000,00000000
402: 00000000,00000000,00000000,00000000,00000000,00000080,00000000,00000000
403: 00000000,00000000,00000000,00000000,00000000,00000100,00000000,00000000
404: 00000000,00000000,00000000,00000000,00000000,00000200,00000000,00000000
405: 00000000,00000000,00000000,00000000,00000000,00000400,00000000,00000000
406: 00000000,00000000,00000000,00000000,00000000,00000800,00000000,00000000
407: 00000000,00000000,00000000,00000000,00000000,00001000,00000000,00000000
408: 00000000,00000000,00000000,00000000,00000000,00002000,00000000,00000000
409: 00000000,00000000,00000000,00000000,00000000,00004000,00000000,00000000
410: 00000000,00000000,00000000,00000000,00000000,00008000,00000000,00000000
411: 00000000,00000000,00000000,00000000,00000000,00010000,00000000,00000000
412: 00000000,00000000,00000000,00000000,00000000,00020000,00000000,00000000
413: 00000000,00000000,00000000,00000000,00000000,00040000,00000000,00000000
414: 00000000,00000000,00000000,00000000,00000000,00080000,00000000,00000000
415: 00000000,00000000,00000000,00000000,00000000,00100000,00000000,00000000
416: 00000000,00000000,00000000,00000000,00000000,00200000,00000000,00000000
417: 00000000,00000000,00000000,00000000,00000000,00400000,00000000,00000000
418: 00000000,00000000,00000000,00000000,00000000,00800000,00000000,00000000
419: 00000000,00000000,00000000,00000000,00000000,01000000,00000000,00000000
420: 00000000,00000000,00000000,00000000,00000000,02000000,00000000,00000000
421: 00000000,00000000,00000000,00000000,00000000,04000000,00000000,00000000
422: 00000000,00000000,00000000,00000000,00000000,08000000,00000000,00000000
423: 00000000,00000000,00000000,00000000,00000000,10000000,00000000,00000000
424: 00000000,00000000,00000000,00000000,00000000,20000000,00000000,00000000
425: 00000000,00000000,00000000,00000000,00000000,40000000,00000000,00000000
426: 00000000,00000000,00000000,00000000,00000000,80000000,00000000,00000000
427: 00000000,00000001,00000000,00000000,00000000,00000000,00000000,00000000
428: 00000000,00000002,00000000,00000000,00000000,00000000,00000000,00000000
429: 00000000,00000004,00000000,00000000,00000000,00000000,00000000,00000000
430: 00000000,00000008,00000000,00000000,00000000,00000000,00000000,00000000
431: 00000000,00000010,00000000,00000000,00000000,00000000,00000000,00000000
432: 00000000,00000020,00000000,00000000,00000000,00000000,00000000,00000000
433: 00000000,00000040,00000000,00000000,00000000,00000000,00000000,00000000
434: 00000000,00000080,00000000,00000000,00000000,00000000,00000000,00000000
435: 00000000,00000100,00000000,00000000,00000000,00000000,00000000,00000000
436: 00000000,00000200,00000000,00000000,00000000,00000000,00000000,00000000
437: 00000000,00000400,00000000,00000000,00000000,00000000,00000000,00000000
438: 00000000,00000800,00000000,00000000,00000000,00000000,00000000,00000000
439: 00000000,00001000,00000000,00000000,00000000,00000000,00000000,00000000
440: 00000000,00002000,00000000,00000000,00000000,00000000,00000000,00000000
441: 00000000,00004000,00000000,00000000,00000000,00000000,00000000,00000000
442: 00000000,00008000,00000000,00000000,00000000,00000000,00000000,00000000
443: 00000000,00010000,00000000,00000000,00000000,00000000,00000000,00000000
444: 00000000,00020000,00000000,00000000,00000000,00000000,00000000,00000000
445: 00000000,00040000,00000000,00000000,00000000,00000000,00000000,00000000
446: 00000000,00080000,00000000,00000000,00000000,00000000,00000000,00000000
447: 00000000,00100000,00000000,00000000,00000000,00000000,00000000,00000000
448: 00000000,00200000,00000000,00000000,00000000,00000000,00000000,00000000
449: 00000000,00400000,00000000,00000000,00000000,00000000,00000000,00000000
450: 00000000,00800000,00000000,00000000,00000000,00000000,00000000,00000000
451: 00000000,01000000,00000000,00000000,00000000,00000000,00000000,00000000
452: 00000000,02000000,00000000,00000000,00000000,00000000,00000000,00000000
453: 00000000,04000000,00000000,00000000,00000000,00000000,00000000,00000000
454: 00000000,08000000,00000000,00000000,00000000,00000000,00000000,00000000
455: 00000000,10000000,00000000,00000000,00000000,00000000,00000000,00000000
456: 00000000,20000000,00000000,00000000,00000000,00000000,00000000,00000000
457: 00000000,40000000,00000000,00000000,00000000,00000000,00000000,00000000

Signed-off-by: Tariq Toukan <[email protected]>
[Tweaked API use]
Suggested-by: Yury Norov <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
Reviewed-by: Yury Norov <[email protected]>
Signed-off-by: Yury Norov <[email protected]>
---
drivers/net/ethernet/mellanox/mlx5/core/eq.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 8f7580fec193..c111e8657d54 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -817,9 +817,12 @@ static void comp_irqs_release(struct mlx5_core_dev *dev)
static int comp_irqs_request(struct mlx5_core_dev *dev)
{
struct mlx5_eq_table *table = dev->priv.eq_table;
+ const struct cpumask *prev = cpu_none_mask;
+ const struct cpumask *mask;
int ncomp_eqs = table->num_comp_eqs;
u16 *cpus;
int ret;
+ int cpu;
int i;

ncomp_eqs = table->num_comp_eqs;
@@ -838,8 +841,19 @@ static int comp_irqs_request(struct mlx5_core_dev *dev)
ret = -ENOMEM;
goto free_irqs;
}
- for (i = 0; i < ncomp_eqs; i++)
- cpus[i] = cpumask_local_spread(i, dev->priv.numa_node);
+
+ i = 0;
+ rcu_read_lock();
+ for_each_numa_hop_mask(mask, dev->priv.numa_node) {
+ for_each_cpu_andnot(cpu, mask, prev) {
+ cpus[i] = cpu;
+ if (++i == ncomp_eqs)
+ goto spread_done;
+ }
+ prev = mask;
+ }
+spread_done:
+ rcu_read_unlock();
ret = mlx5_irqs_request_vectors(dev, cpus, ncomp_eqs, table->comp_irqs);
kfree(cpus);
if (ret < 0)
--
2.34.1

2023-01-21 04:27:13

[permalink] [raw]

Subject: [PATCH 5/9] lib/cpumask: reorganize cpumask_local_spread() logic

Now after moving all NUMA logic into sched_numa_find_nth_cpu(),
else-branch of cpumask_local_spread() is just a function call, and
we can simplify logic by using ternary operator.

While here, replace BUG() with WARN_ON().

Signed-off-by: Yury Norov <[email protected]>
Acked-by: Tariq Toukan <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Reviewed-by: Peter Lafreniere <[email protected]>
---
lib/cpumask.c | 16 ++++++----------
1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/lib/cpumask.c b/lib/cpumask.c
index 255974cd6734..10aa15715c0d 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -127,16 +127,12 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
/* Wrap: we always want a cpu. */
i %= num_online_cpus();

- if (node == NUMA_NO_NODE) {
- cpu = cpumask_nth(i, cpu_online_mask);
- if (cpu < nr_cpu_ids)
- return cpu;
- } else {
- cpu = sched_numa_find_nth_cpu(cpu_online_mask, i, node);
- if (cpu < nr_cpu_ids)
- return cpu;
- }
- BUG();
+ cpu = (node == NUMA_NO_NODE) ?
+ cpumask_nth(i, cpu_online_mask) :
+ sched_numa_find_nth_cpu(cpu_online_mask, i, node);
+
+ WARN_ON(cpu >= nr_cpu_ids);
+ return cpu;
}
EXPORT_SYMBOL(cpumask_local_spread);

--
2.34.1

2023-01-21 04:27:17

[permalink] [raw]

Subject: [PATCH 7/9] sched/topology: Introduce for_each_numa_hop_mask()

From: Valentin Schneider <[email protected]>

The recently introduced sched_numa_hop_mask() exposes cpumasks of CPUs
reachable within a given distance budget, wrap the logic for iterating over
all (distance, mask) values inside an iterator macro.

Signed-off-by: Valentin Schneider <[email protected]>
Reviewed-by: Yury Norov <[email protected]>
Signed-off-by: Yury Norov <[email protected]>
---
include/linux/topology.h | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 344c2362755a..fea32377f7c7 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -261,4 +261,22 @@ sched_numa_hop_mask(unsigned int node, unsigned int hops)
}
#endif /* CONFIG_NUMA */

+/**
+ * for_each_numa_hop_mask - iterate over cpumasks of increasing NUMA distance
+ * from a given node.
+ * @mask: the iteration variable.
+ * @node: the NUMA node to start the search from.
+ *
+ * Requires rcu_lock to be held.
+ *
+ * Yields cpu_online_mask for @node == NUMA_NO_NODE.
+ */
+#define for_each_numa_hop_mask(mask, node) \
+ for (unsigned int __hops = 0; \
+ mask = (node != NUMA_NO_NODE || __hops) ? \
+ sched_numa_hop_mask(node, __hops) : \
+ cpu_online_mask, \
+ !IS_ERR_OR_NULL(mask); \
+ __hops++)
+
#endif /* _LINUX_TOPOLOGY_H */
--
2.34.1

2023-01-21 04:27:19

[permalink] [raw]

Subject: [PATCH 6/9] sched/topology: Introduce sched_numa_hop_mask()

From: Valentin Schneider <[email protected]>

Tariq has pointed out that drivers allocating IRQ vectors would benefit
from having smarter NUMA-awareness - cpumask_local_spread() only knows
about the local node and everything outside is in the same bucket.

sched_domains_numa_masks is pretty much what we want to hand out (a cpumask
of CPUs reachable within a given distance budget), introduce
sched_numa_hop_mask() to export those cpumasks.

Link: http://lore.kernel.org/r/[email protected]
Signed-off-by: Valentin Schneider <[email protected]>
Reviewed-by: Yury Norov <[email protected]>
Signed-off-by: Yury Norov <[email protected]>
---
include/linux/topology.h | 7 +++++++
kernel/sched/topology.c | 33 +++++++++++++++++++++++++++++++++
2 files changed, 40 insertions(+)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 72f264575698..344c2362755a 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -247,11 +247,18 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)

#ifdef CONFIG_NUMA
int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node);
+extern const struct cpumask *sched_numa_hop_mask(unsigned int node, unsigned int hops);
#else
static __always_inline int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
{
return cpumask_nth(cpu, cpus);
}
+
+static inline const struct cpumask *
+sched_numa_hop_mask(unsigned int node, unsigned int hops)
+{
+ return ERR_PTR(-EOPNOTSUPP);
+}
#endif /* CONFIG_NUMA */

#endif /* _LINUX_TOPOLOGY_H */
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 2bf89186a10f..1233affc106c 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2124,6 +2124,39 @@ int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
return ret;
}
EXPORT_SYMBOL_GPL(sched_numa_find_nth_cpu);
+
+/**
+ * sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops away from
+ * @node
+ * @node: The node to count hops from.
+ * @hops: Include CPUs up to that many hops away. 0 means local node.
+ *
+ * Return: On success, a pointer to a cpumask of CPUs at most @hops away from
+ * @node, an error value otherwise.
+ *
+ * Requires rcu_lock to be held. Returned cpumask is only valid within that
+ * read-side section, copy it if required beyond that.
+ *
+ * Note that not all hops are equal in distance; see sched_init_numa() for how
+ * distances and masks are handled.
+ * Also note that this is a reflection of sched_domains_numa_masks, which may change
+ * during the lifetime of the system (offline nodes are taken out of the masks).
+ */
+const struct cpumask *sched_numa_hop_mask(unsigned int node, unsigned int hops)
+{
+ struct cpumask ***masks;
+
+ if (node >= nr_node_ids || hops >= sched_domains_numa_levels)
+ return ERR_PTR(-EINVAL);
+
+ masks = rcu_dereference(sched_domains_numa_masks);
+ if (!masks)
+ return ERR_PTR(-EBUSY);
+
+ return masks[hops][node];
+}
+EXPORT_SYMBOL_GPL(sched_numa_hop_mask);
+
#endif /* CONFIG_NUMA */

static int __sdt_alloc(const struct cpumask *cpu_map)
--
2.34.1

2023-01-21 04:27:54

[permalink] [raw]

Subject: [PATCH 9/9] lib/cpumask: update comment for cpumask_local_spread()

Now that we have an iterator-based alternative for a very common case
of using cpumask_local_spread for all cpus in a row, it's worth to
mention that in comment to cpumask_local_spread().

Signed-off-by: Yury Norov <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
Reviewed-by: Tariq Toukan <[email protected]>
---
lib/cpumask.c | 26 ++++++++++++++++++++++----
1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/lib/cpumask.c b/lib/cpumask.c
index 10aa15715c0d..98291b07c756 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -114,11 +114,29 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
* @i: index number
* @node: local numa_node
*
- * This function selects an online CPU according to a numa aware policy;
- * local cpus are returned first, followed by non-local ones, then it
- * wraps around.
+ * Returns online CPU according to a numa aware policy; local cpus are returned
+ * first, followed by non-local ones, then it wraps around.
*
- * It's not very efficient, but useful for setup.
+ * For those who wants to enumerate all CPUs based on their NUMA distances,
+ * i.e. call this function in a loop, like:
+ *
+ * for (i = 0; i < num_online_cpus(); i++) {
+ * cpu = cpumask_local_spread(i, node);
+ * do_something(cpu);
+ * }
+ *
+ * There's a better alternative based on for_each()-like iterators:
+ *
+ * for_each_numa_hop_mask(mask, node) {
+ * for_each_cpu_andnot(cpu, mask, prev)
+ * do_something(cpu);
+ * prev = mask;
+ * }
+ *
+ * It's simpler and more verbose than above. Complexity of iterator-based
+ * enumeration is O(sched_domains_numa_levels * nr_cpu_ids), while
+ * cpumask_local_spread() when called for each cpu is
+ * O(sched_domains_numa_levels * nr_cpu_ids * log(nr_cpu_ids)).
*/
unsigned int cpumask_local_spread(unsigned int i, int node)
{
--
2.34.1

2023-01-22 12:57:11

by Tariq Toukan

[permalink] [raw]

Subject: Re: [PATCH RESEND 0/9] sched: cpumask: improve on cpumask_local_spread() locality

On 21/01/2023 6:24, Yury Norov wrote:
> cpumask_local_spread() currently checks local node for presence of i'th
> CPU, and then if it finds nothing makes a flat search among all non-local
> CPUs. We can do it better by checking CPUs per NUMA hops.
>
> This has significant performance implications on NUMA machines, for example
> when using NUMA-aware allocated memory together with NUMA-aware IRQ
> affinity hints.
>
> Performance tests from patch 8 of this series for mellanox network
> driver show:
>
> TCP multi-stream, using 16 iperf3 instances pinned to 16 cores (with aRFS on).
> Active cores: 64,65,72,73,80,81,88,89,96,97,104,105,112,113,120,121
>
> +-------------------------+-----------+------------------+------------------+
> | | BW (Gbps) | TX side CPU util | RX side CPU util |
> +-------------------------+-----------+------------------+------------------+
> | Baseline | 52.3 | 6.4 % | 17.9 % |
> +-------------------------+-----------+------------------+------------------+
> | Applied on TX side only | 52.6 | 5.2 % | 18.5 % |
> +-------------------------+-----------+------------------+------------------+
> | Applied on RX side only | 94.9 | 11.9 % | 27.2 % |
> +-------------------------+-----------+------------------+------------------+
> | Applied on both sides | 95.1 | 8.4 % | 27.3 % |
> +-------------------------+-----------+------------------+------------------+
>
> Bottleneck in RX side is released, reached linerate (~1.8x speedup).
> ~30% less cpu util on TX.
>
> This series was supposed to be included in v6.2, but that didn't happen. It
> spent enough in -next without any issues, so I hope we'll finally see it
> in v6.3.
>
> I believe, the best way would be moving it with scheduler patches, but I'm
> OK to try again with bitmap branch as well.

Now that Yury dropped several controversial bitmap patches form the PR,
the rest are mostly in sched, or new API that's used by sched.

Valentin, what do you think? Can you take it to your sched branch?

>
> Tariq Toukan (1):
> net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity
> hints
>
> Valentin Schneider (2):
> sched/topology: Introduce sched_numa_hop_mask()
> sched/topology: Introduce for_each_numa_hop_mask()
>
> Yury Norov (6):
> lib/find: introduce find_nth_and_andnot_bit
> cpumask: introduce cpumask_nth_and_andnot
> sched: add sched_numa_find_nth_cpu()
> cpumask: improve on cpumask_local_spread() locality
> lib/cpumask: reorganize cpumask_local_spread() logic
> lib/cpumask: update comment for cpumask_local_spread()
>
> drivers/net/ethernet/mellanox/mlx5/core/eq.c | 18 +++-
> include/linux/cpumask.h | 20 +++++
> include/linux/find.h | 33 +++++++
> include/linux/topology.h | 33 +++++++
> kernel/sched/topology.c | 90 ++++++++++++++++++++
> lib/cpumask.c | 52 ++++++-----
> lib/find_bit.c | 9 ++
> 7 files changed, 230 insertions(+), 25 deletions(-)
>

2023-01-23 09:58:53

by Valentin Schneider

[permalink] [raw]

Subject: Re: [PATCH RESEND 0/9] sched: cpumask: improve on cpumask_local_spread() locality

On 22/01/23 14:57, Tariq Toukan wrote:
> On 21/01/2023 6:24, Yury Norov wrote:
>>
>> This series was supposed to be included in v6.2, but that didn't happen. It
>> spent enough in -next without any issues, so I hope we'll finally see it
>> in v6.3.
>>
>> I believe, the best way would be moving it with scheduler patches, but I'm
>> OK to try again with bitmap branch as well.
>
> Now that Yury dropped several controversial bitmap patches form the PR,
> the rest are mostly in sched, or new API that's used by sched.
>
> Valentin, what do you think? Can you take it to your sched branch?
>

I would if I had one :-)

Peter/Ingo, any objections to stashing this in tip/sched/core?

2023-01-29 08:08:11

by Tariq Toukan

[permalink] [raw]

Subject: Re: [PATCH RESEND 0/9] sched: cpumask: improve on cpumask_local_spread() locality

On 23/01/2023 11:57, Valentin Schneider wrote:
> On 22/01/23 14:57, Tariq Toukan wrote:
>> On 21/01/2023 6:24, Yury Norov wrote:
>>>
>>> This series was supposed to be included in v6.2, but that didn't happen. It
>>> spent enough in -next without any issues, so I hope we'll finally see it
>>> in v6.3.
>>>
>>> I believe, the best way would be moving it with scheduler patches, but I'm
>>> OK to try again with bitmap branch as well.
>>
>> Now that Yury dropped several controversial bitmap patches form the PR,
>> the rest are mostly in sched, or new API that's used by sched.
>>
>> Valentin, what do you think? Can you take it to your sched branch?
>>
>
> I would if I had one :-)
>

Oh I see :)

> Peter/Ingo, any objections to stashing this in tip/sched/core?
>

Hi Peter and Ingo,

Can you please look into it? So we'll have enough time to act (in
case...) during this kernel.

We already missed one kernel...

Thanks,
Tariq

2023-01-30 20:22:15

[permalink] [raw]

Subject: Re: [PATCH RESEND 0/9] sched: cpumask: improve on cpumask_local_spread() locality

On Sun, 29 Jan 2023 10:07:58 +0200 Tariq Toukan wrote:
> > Peter/Ingo, any objections to stashing this in tip/sched/core?
>
> Can you please look into it? So we'll have enough time to act (in
> case...) during this kernel.
>
> We already missed one kernel...

We really need this in linux-next by the end of the week. PTAL.

2023-02-02 17:33:46

[permalink] [raw]

Subject: Re: [PATCH RESEND 0/9] sched: cpumask: improve on cpumask_local_spread() locality

On Mon, 30 Jan 2023 12:22:06 -0800 Jakub Kicinski wrote:
> On Sun, 29 Jan 2023 10:07:58 +0200 Tariq Toukan wrote:
> > > Peter/Ingo, any objections to stashing this in tip/sched/core?
> >
> > Can you please look into it? So we'll have enough time to act (in
> > case...) during this kernel.
> >
> > We already missed one kernel...
>
> We really need this in linux-next by the end of the week. PTAL.

Peter, could you please take a look? Linux doesn't have an API for
basic, common sense IRQ distribution on AMD systems. It's important :(

2023-02-02 17:38:12

[permalink] [raw]

Subject: Re: [PATCH RESEND 0/9] sched: cpumask: improve on cpumask_local_spread() locality

On Thu, Feb 2, 2023 at 9:33 AM Jakub Kicinski <[email protected]> wrote:
>
> On Mon, 30 Jan 2023 12:22:06 -0800 Jakub Kicinski wrote:
> > On Sun, 29 Jan 2023 10:07:58 +0200 Tariq Toukan wrote:
> > > > Peter/Ingo, any objections to stashing this in tip/sched/core?
> > >
> > > Can you please look into it? So we'll have enough time to act (in
> > > case...) during this kernel.
> > >
> > > We already missed one kernel...
> >
> > We really need this in linux-next by the end of the week. PTAL.
>
> Peter, could you please take a look? Linux doesn't have an API for
> basic, common sense IRQ distribution on AMD systems. It's important :(

FWIW, it's already been in linux-next since mid-December through the
bitmap branch, and no issues were reported so far.

Thanks,
Yury

2023-02-03 00:59:13

by Chen Yu

[permalink] [raw]

Subject: Re: [PATCH 3/9] sched: add sched_numa_find_nth_cpu()

On 2023-01-20 at 20:24:30 -0800, Yury Norov wrote:
> The function finds Nth set CPU in a given cpumask starting from a given
> node.
>
> Leveraging the fact that each hop in sched_domains_numa_masks includes the
> same or greater number of CPUs than the previous one, we can use binary
> search on hops instead of linear walk, which makes the overall complexity
> of O(log n) in terms of number of cpumask_weight() calls.
>
> Signed-off-by: Yury Norov <[email protected]>
> Acked-by: Tariq Toukan <[email protected]>
> Reviewed-by: Jacob Keller <[email protected]>
> Reviewed-by: Peter Lafreniere <[email protected]>
> ---
> include/linux/topology.h | 8 ++++++
> kernel/sched/topology.c | 57 ++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 65 insertions(+)
>
[snip]
> + * sched_numa_find_nth_cpu() - given the NUMA topology, find the Nth next cpu
> + * closest to @cpu from @cpumask.
Just minor question, the @cpu below is used as the index, right? What does "close to @cpu"
mean above?
> + * cpumask: cpumask to find a cpu from
> + * cpu: Nth cpu to find
Maybe also add description for @node?

thanks,
Chenyu

2023-02-07 05:10:50

[permalink] [raw]

Subject: Re: [PATCH 3/9] sched: add sched_numa_find_nth_cpu()

On Fri, 20 Jan 2023 20:24:30 -0800 Yury Norov wrote:
> The function finds Nth set CPU in a given cpumask starting from a given
> node.
>
> Leveraging the fact that each hop in sched_domains_numa_masks includes the
> same or greater number of CPUs than the previous one, we can use binary
> search on hops instead of linear walk, which makes the overall complexity
> of O(log n) in terms of number of cpumask_weight() calls.

Valentin, would you be willing to give us a SoB or Review tag for
this one? We'd like to take the whole series via networking, if
that's okay.

2023-02-07 10:30:34

by Valentin Schneider

[permalink] [raw]

Subject: Re: [PATCH 3/9] sched: add sched_numa_find_nth_cpu()

On 06/02/23 21:09, Jakub Kicinski wrote:
> On Fri, 20 Jan 2023 20:24:30 -0800 Yury Norov wrote:
>> The function finds Nth set CPU in a given cpumask starting from a given
>> node.
>>
>> Leveraging the fact that each hop in sched_domains_numa_masks includes the
>> same or greater number of CPUs than the previous one, we can use binary
>> search on hops instead of linear walk, which makes the overall complexity
>> of O(log n) in terms of number of cpumask_weight() calls.
>
> Valentin, would you be willing to give us a SoB or Review tag for
> this one? We'd like to take the whole series via networking, if
> that's okay.

Sure, feel free to add

Reviewed-by: Valentin Schneider <[email protected]>

to patches that don't already have it.

2023-02-08 02:25:41

[permalink] [raw]

Subject: Re: [PATCH RESEND 0/9] sched: cpumask: improve on cpumask_local_spread() locality

On Mon, 23 Jan 2023 09:57:43 +0000 Valentin Schneider wrote:
> On 22/01/23 14:57, Tariq Toukan wrote:
> > On 21/01/2023 6:24, Yury Norov wrote:
> >>
> >> This series was supposed to be included in v6.2, but that didn't happen. It
> >> spent enough in -next without any issues, so I hope we'll finally see it
> >> in v6.3.
> >>
> >> I believe, the best way would be moving it with scheduler patches, but I'm
> >> OK to try again with bitmap branch as well.
> >
> > Now that Yury dropped several controversial bitmap patches form the PR,
> > the rest are mostly in sched, or new API that's used by sched.
> >
> > Valentin, what do you think? Can you take it to your sched branch?
>
> I would if I had one :-)
>
> Peter/Ingo, any objections to stashing this in tip/sched/core?

No replies... so let me take it via networking.

2023-02-08 04:20:29

by patchwork-bot+netdevbpf

[permalink] [raw]

Subject: Re: [PATCH RESEND 0/9] sched: cpumask: improve on cpumask_local_spread() locality

Hello:

This series was applied to netdev/net-next.git (master)
by Jakub Kicinski <[email protected]>:

On Fri, 20 Jan 2023 20:24:27 -0800 you wrote:
> cpumask_local_spread() currently checks local node for presence of i'th
> CPU, and then if it finds nothing makes a flat search among all non-local
> CPUs. We can do it better by checking CPUs per NUMA hops.
>
> This has significant performance implications on NUMA machines, for example
> when using NUMA-aware allocated memory together with NUMA-aware IRQ
> affinity hints.
>
> [...]

Here is the summary with links:
- [1/9] lib/find: introduce find_nth_and_andnot_bit
https://git.kernel.org/netdev/net-next/c/43245117806f
- [2/9] cpumask: introduce cpumask_nth_and_andnot
https://git.kernel.org/netdev/net-next/c/62f4386e564d
- [3/9] sched: add sched_numa_find_nth_cpu()
https://git.kernel.org/netdev/net-next/c/cd7f55359c90
- [4/9] cpumask: improve on cpumask_local_spread() locality
https://git.kernel.org/netdev/net-next/c/406d394abfcd
- [5/9] lib/cpumask: reorganize cpumask_local_spread() logic
https://git.kernel.org/netdev/net-next/c/b1beed72b8b7
- [6/9] sched/topology: Introduce sched_numa_hop_mask()
https://git.kernel.org/netdev/net-next/c/9feae65845f7
- [7/9] sched/topology: Introduce for_each_numa_hop_mask()
https://git.kernel.org/netdev/net-next/c/06ac01721f7d
- [8/9] net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints
https://git.kernel.org/netdev/net-next/c/2acda57736de
- [9/9] lib/cpumask: update comment for cpumask_local_spread()
https://git.kernel.org/netdev/net-next/c/2ac4980c57f5

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html

2023-02-08 15:41:28

by Pawel Chmielewski

[permalink] [raw]

Subject: [PATCH 1/1] ice: Change assigning method of the CPU affinity masks

With the introduction of sched_numa_hop_mask() and
for_each_numa_hop_mask(), the affinity masks for queue vectors can be
conveniently set by preferring the CPUs that are closest to the NUMA node
of the parent PCI device.

Signed-off-by: Pawel Chmielewski <[email protected]>
---
drivers/net/ethernet/intel/ice/ice_base.c | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
index e864634d66bc..fd3550d15c9e 100644
--- a/drivers/net/ethernet/intel/ice/ice_base.c
+++ b/drivers/net/ethernet/intel/ice/ice_base.c
@@ -122,8 +122,6 @@ static int ice_vsi_alloc_q_vector(struct ice_vsi *vsi, u16 v_idx)
if (vsi->type == ICE_VSI_VF)
goto out;
/* only set affinity_mask if the CPU is online */
- if (cpu_online(v_idx))
- cpumask_set_cpu(v_idx, &q_vector->affinity_mask);

/* This will not be called in the driver load path because the netdev
* will not be created yet. All other cases with register the NAPI
@@ -659,8 +657,10 @@ int ice_vsi_wait_one_rx_ring(struct ice_vsi *vsi, bool ena, u16 rxq_idx)
*/
int ice_vsi_alloc_q_vectors(struct ice_vsi *vsi)
{
+ cpumask_t *aff_mask, *last_aff_mask = cpu_none_mask;
struct device *dev = ice_pf_to_dev(vsi->back);
- u16 v_idx;
+ int numa_node = dev->numa_node;
+ u16 v_idx, cpu = 0;
int err;

if (vsi->q_vectors[0]) {
@@ -674,6 +674,17 @@ int ice_vsi_alloc_q_vectors(struct ice_vsi *vsi)
goto err_out;
}

+ v_idx = 0;
+ for_each_numa_hop_mask(aff_mask, numa_node) {
+ for_each_cpu_andnot(cpu, aff_mask, last_aff_mask)
+ if (v_idx < vsi->num_q_vectors) {
+ if (cpu_online(cpu))
+ cpumask_set_cpu(cpu, &vsi->q_vectors[v_idx]->affinity_mask);
+ v_idx++;
+ }
+ last_aff_mask = aff_mask;
+ }
+
return 0;

err_out:
--
2.37.3

2023-02-08 16:08:49

[permalink] [raw]

Subject: Re: [PATCH 1/1] ice: Change assigning method of the CPU affinity masks

On Wed, Feb 08, 2023 at 09:18:14PM +0530, Kalesh Anakkur Purayil wrote:
> On Wed, Feb 8, 2023 at 9:11 PM Pawel Chmielewski <
> [email protected]> wrote:

...

> > + u16 v_idx, cpu = 0;
> >
> [Kalesh]: if you initialize v_idx to 0 here, you can avoid the assignment
> below

I would avoid doing this.

The problem is that it will become harder to maintain and more error prone
during development.

So, please leave as is.

--
With Best Regards,
Andy Shevchenko

2023-02-08 16:39:35

[permalink] [raw]

Subject: Re: [PATCH 1/1] ice: Change assigning method of the CPU affinity masks

On Wed, Feb 08, 2023 at 04:39:05PM +0100, Pawel Chmielewski wrote:
> With the introduction of sched_numa_hop_mask() and
> for_each_numa_hop_mask(), the affinity masks for queue vectors can be
> conveniently set by preferring the CPUs that are closest to the NUMA node
> of the parent PCI device.
>
> Signed-off-by: Pawel Chmielewski <[email protected]>
> ---
> drivers/net/ethernet/intel/ice/ice_base.c | 17 ++++++++++++++---
> 1 file changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
> index e864634d66bc..fd3550d15c9e 100644
> --- a/drivers/net/ethernet/intel/ice/ice_base.c
> +++ b/drivers/net/ethernet/intel/ice/ice_base.c
> @@ -122,8 +122,6 @@ static int ice_vsi_alloc_q_vector(struct ice_vsi *vsi, u16 v_idx)
> if (vsi->type == ICE_VSI_VF)
> goto out;
> /* only set affinity_mask if the CPU is online */
> - if (cpu_online(v_idx))
> - cpumask_set_cpu(v_idx, &q_vector->affinity_mask);
>
> /* This will not be called in the driver load path because the netdev
> * will not be created yet. All other cases with register the NAPI
> @@ -659,8 +657,10 @@ int ice_vsi_wait_one_rx_ring(struct ice_vsi *vsi, bool ena, u16 rxq_idx)
> */
> int ice_vsi_alloc_q_vectors(struct ice_vsi *vsi)
> {
> + cpumask_t *aff_mask, *last_aff_mask = cpu_none_mask;
> struct device *dev = ice_pf_to_dev(vsi->back);
> - u16 v_idx;
> + int numa_node = dev->numa_node;
> + u16 v_idx, cpu = 0;
> int err;
>
> if (vsi->q_vectors[0]) {
> @@ -674,6 +674,17 @@ int ice_vsi_alloc_q_vectors(struct ice_vsi *vsi)
> goto err_out;
> }
>
> + v_idx = 0;
> + for_each_numa_hop_mask(aff_mask, numa_node) {
> + for_each_cpu_andnot(cpu, aff_mask, last_aff_mask)
> + if (v_idx < vsi->num_q_vectors) {
> + if (cpu_online(cpu))
> + cpumask_set_cpu(cpu, &vsi->q_vectors[v_idx]->affinity_mask);
> + v_idx++;
> + }

else
goto out;

> + last_aff_mask = aff_mask;
> + }
> +
out:

> return 0;
>
> err_out:
> --
> 2.37.3

Would it make sense to increment v_idx only if matched CPU is online?
It will create a less sparse array of vectors...

Thanks,
Yury

2023-02-08 16:59:03

[permalink] [raw]

Subject: Re: [PATCH 1/1] ice: Change assigning method of the CPU affinity masks

On Wed, Feb 08, 2023 at 08:39:20AM -0800, Yury Norov wrote:
> On Wed, Feb 08, 2023 at 04:39:05PM +0100, Pawel Chmielewski wrote:

...

> > + v_idx = 0;
> > + for_each_numa_hop_mask(aff_mask, numa_node) {
> > + for_each_cpu_andnot(cpu, aff_mask, last_aff_mask)
> > + if (v_idx < vsi->num_q_vectors) {
> > + if (cpu_online(cpu))
> > + cpumask_set_cpu(cpu, &vsi->q_vectors[v_idx]->affinity_mask);
> > + v_idx++;
> > + }

> else
> goto out;

In this case the inverted conditional makes more sense:

if (v_idx >= vsi->num_q_vectors)
goto out;

if (cpu_online(cpu))
cpumask_set_cpu(cpu, &vsi->q_vectors[v_idx]->affinity_mask);
v_idx++;

(indentation level will be decreased).

> > + last_aff_mask = aff_mask;
> > + }
> > +
> out:
>
> > return 0;

--
With Best Regards,
Andy Shevchenko

2023-02-08 19:12:23

by kernel test robot

[permalink] [raw]

Subject: Re: [PATCH 1/1] ice: Change assigning method of the CPU affinity masks

Hi Pawel,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on tnguy-next-queue/dev-queue]
[also build test WARNING on linus/master v6.2-rc7]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Pawel-Chmielewski/ice-Change-assigning-method-of-the-CPU-affinity-masks/20230208-234144
base: https://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue.git dev-queue
patch link: https://lore.kernel.org/r/20230208153905.109912-1-pawel.chmielewski%40intel.com
patch subject: [PATCH 1/1] ice: Change assigning method of the CPU affinity masks
config: x86_64-allyesconfig (https://download.01.org/0day-ci/archive/20230209/[email protected]/config)
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
# https://github.com/intel-lab-lkp/linux/commit/33971c3245ae75900dbc4cc9aa2b76ff9cdb534c
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Pawel-Chmielewski/ice-Change-assigning-method-of-the-CPU-affinity-masks/20230208-234144
git checkout 33971c3245ae75900dbc4cc9aa2b76ff9cdb534c
# save the config file
mkdir build_dir && cp config build_dir/.config
make W=1 O=build_dir ARCH=x86_64 olddefconfig
make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash drivers/net/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-kbuild-all/[email protected]

All warnings (new ones prefixed by >>):

drivers/net/ethernet/intel/ice/ice_base.c: In function 'ice_vsi_alloc_q_vectors':
drivers/net/ethernet/intel/ice/ice_base.c:678:9: error: implicit declaration of function 'for_each_numa_hop_mask'; did you mean 'for_each_node_mask'? [-Werror=implicit-function-declaration]
678 | for_each_numa_hop_mask(aff_mask, numa_node) {
| ^~~~~~~~~~~~~~~~~~~~~~
| for_each_node_mask
drivers/net/ethernet/intel/ice/ice_base.c:678:52: error: expected ';' before '{' token
678 | for_each_numa_hop_mask(aff_mask, numa_node) {
| ^~
| ;
>> drivers/net/ethernet/intel/ice/ice_base.c:663:20: warning: unused variable 'cpu' [-Wunused-variable]
663 | u16 v_idx, cpu = 0;
| ^~~
>> drivers/net/ethernet/intel/ice/ice_base.c:660:31: warning: unused variable 'last_aff_mask' [-Wunused-variable]
660 | cpumask_t *aff_mask, *last_aff_mask = cpu_none_mask;
| ^~~~~~~~~~~~~
cc1: some warnings being treated as errors

vim +/cpu +663 drivers/net/ethernet/intel/ice/ice_base.c

650
651 /**
652 * ice_vsi_alloc_q_vectors - Allocate memory for interrupt vectors
653 * @vsi: the VSI being configured
654 *
655 * We allocate one q_vector per queue interrupt. If allocation fails we
656 * return -ENOMEM.
657 */
658 int ice_vsi_alloc_q_vectors(struct ice_vsi *vsi)
659 {
> 660 cpumask_t *aff_mask, *last_aff_mask = cpu_none_mask;
661 struct device *dev = ice_pf_to_dev(vsi->back);
662 int numa_node = dev->numa_node;
> 663 u16 v_idx, cpu = 0;
664 int err;
665
666 if (vsi->q_vectors[0]) {
667 dev_dbg(dev, "VSI %d has existing q_vectors\n", vsi->vsi_num);
668 return -EEXIST;
669 }
670
671 for (v_idx = 0; v_idx < vsi->num_q_vectors; v_idx++) {
672 err = ice_vsi_alloc_q_vector(vsi, v_idx);
673 if (err)
674 goto err_out;
675 }
676
677 v_idx = 0;
> 678 for_each_numa_hop_mask(aff_mask, numa_node) {
679 for_each_cpu_andnot(cpu, aff_mask, last_aff_mask)
680 if (v_idx < vsi->num_q_vectors) {
681 if (cpu_online(cpu))
682 cpumask_set_cpu(cpu, &vsi->q_vectors[v_idx]->affinity_mask);
683 v_idx++;
684 }
685 last_aff_mask = aff_mask;
686 }
687
688 return 0;
689
690 err_out:
691 while (v_idx--)
692 ice_free_q_vector(vsi, v_idx);
693
694 dev_err(dev, "Failed to allocate %d q_vector for VSI %d, ret=%d\n",
695 vsi->num_q_vectors, vsi->vsi_num, err);
696 vsi->num_q_vectors = 0;
697 return err;
698 }
699

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

2023-02-08 19:23:27

by kernel test robot

[permalink] [raw]

Subject: Re: [PATCH 1/1] ice: Change assigning method of the CPU affinity masks

Hi Pawel,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tnguy-next-queue/dev-queue]
[also build test ERROR on linus/master v6.2-rc7]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Pawel-Chmielewski/ice-Change-assigning-method-of-the-CPU-affinity-masks/20230208-234144
base: https://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue.git dev-queue
patch link: https://lore.kernel.org/r/20230208153905.109912-1-pawel.chmielewski%40intel.com
patch subject: [PATCH 1/1] ice: Change assigning method of the CPU affinity masks
config: arm-randconfig-r046-20230209 (https://download.01.org/0day-ci/archive/20230209/[email protected]/config)
compiler: arm-linux-gnueabi-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/33971c3245ae75900dbc4cc9aa2b76ff9cdb534c
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Pawel-Chmielewski/ice-Change-assigning-method-of-the-CPU-affinity-masks/20230208-234144
git checkout 33971c3245ae75900dbc4cc9aa2b76ff9cdb534c
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arm olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arm SHELL=/bin/bash drivers/net/ethernet/intel/ice/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-kbuild-all/[email protected]

All errors (new ones prefixed by >>):

drivers/net/ethernet/intel/ice/ice_base.c: In function 'ice_vsi_alloc_q_vectors':
>> drivers/net/ethernet/intel/ice/ice_base.c:662:28: error: 'struct device' has no member named 'numa_node'
662 | int numa_node = dev->numa_node;
| ^~
>> drivers/net/ethernet/intel/ice/ice_base.c:678:9: error: implicit declaration of function 'for_each_numa_hop_mask'; did you mean 'for_each_node_mask'? [-Werror=implicit-function-declaration]
678 | for_each_numa_hop_mask(aff_mask, numa_node) {
| ^~~~~~~~~~~~~~~~~~~~~~
| for_each_node_mask
>> drivers/net/ethernet/intel/ice/ice_base.c:678:52: error: expected ';' before '{' token
678 | for_each_numa_hop_mask(aff_mask, numa_node) {
| ^~
| ;
drivers/net/ethernet/intel/ice/ice_base.c:663:20: warning: unused variable 'cpu' [-Wunused-variable]
663 | u16 v_idx, cpu = 0;
| ^~~
drivers/net/ethernet/intel/ice/ice_base.c:660:31: warning: unused variable 'last_aff_mask' [-Wunused-variable]
660 | cpumask_t *aff_mask, *last_aff_mask = cpu_none_mask;
| ^~~~~~~~~~~~~
cc1: some warnings being treated as errors

vim +662 drivers/net/ethernet/intel/ice/ice_base.c

650
651 /**
652 * ice_vsi_alloc_q_vectors - Allocate memory for interrupt vectors
653 * @vsi: the VSI being configured
654 *
655 * We allocate one q_vector per queue interrupt. If allocation fails we
656 * return -ENOMEM.
657 */
658 int ice_vsi_alloc_q_vectors(struct ice_vsi *vsi)
659 {
660 cpumask_t *aff_mask, *last_aff_mask = cpu_none_mask;
661 struct device *dev = ice_pf_to_dev(vsi->back);
> 662 int numa_node = dev->numa_node;
663 u16 v_idx, cpu = 0;
664 int err;
665
666 if (vsi->q_vectors[0]) {
667 dev_dbg(dev, "VSI %d has existing q_vectors\n", vsi->vsi_num);
668 return -EEXIST;
669 }
670
671 for (v_idx = 0; v_idx < vsi->num_q_vectors; v_idx++) {
672 err = ice_vsi_alloc_q_vector(vsi, v_idx);
673 if (err)
674 goto err_out;
675 }
676
677 v_idx = 0;
> 678 for_each_numa_hop_mask(aff_mask, numa_node) {
679 for_each_cpu_andnot(cpu, aff_mask, last_aff_mask)
680 if (v_idx < vsi->num_q_vectors) {
681 if (cpu_online(cpu))
682 cpumask_set_cpu(cpu, &vsi->q_vectors[v_idx]->affinity_mask);
683 v_idx++;
684 }
685 last_aff_mask = aff_mask;
686 }
687
688 return 0;
689
690 err_out:
691 while (v_idx--)
692 ice_free_q_vector(vsi, v_idx);
693
694 dev_err(dev, "Failed to allocate %d q_vector for VSI %d, ret=%d\n",
695 vsi->num_q_vectors, vsi->vsi_num, err);
696 vsi->num_q_vectors = 0;
697 return err;
698 }
699

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

2023-02-08 23:21:18

[permalink] [raw]

Subject: Re: [PATCH 1/1] ice: Change assigning method of the CPU affinity masks

On Wed, 8 Feb 2023 16:39:05 +0100 Pawel Chmielewski wrote:
> With the introduction of sched_numa_hop_mask() and
> for_each_numa_hop_mask(), the affinity masks for queue vectors can be
> conveniently set by preferring the CPUs that are closest to the NUMA node
> of the parent PCI device.

Damn, you had this one locked and ready, didn't you.. :)

2023-02-09 02:46:36

by Philip Li

[permalink] [raw]

Subject: Re: [PATCH 1/1] ice: Change assigning method of the CPU affinity masks

On Thu, Feb 09, 2023 at 03:11:55AM +0800, kernel test robot wrote:
> Hi Pawel,
>
> Thank you for the patch! Perhaps something to improve:
>
> [auto build test WARNING on tnguy-next-queue/dev-queue]
> [also build test WARNING on linus/master v6.2-rc7]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Pawel-Chmielewski/ice-Change-assigning-method-of-the-CPU-affinity-masks/20230208-234144
> base: https://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue.git dev-queue
> patch link: https://lore.kernel.org/r/20230208153905.109912-1-pawel.chmielewski%40intel.com
> patch subject: [PATCH 1/1] ice: Change assigning method of the CPU affinity masks
> config: x86_64-allyesconfig (https://download.01.org/0day-ci/archive/20230209/[email protected]/config)
> compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
> reproduce (this is a W=1 build):
> # https://github.com/intel-lab-lkp/linux/commit/33971c3245ae75900dbc4cc9aa2b76ff9cdb534c
> git remote add linux-review https://github.com/intel-lab-lkp/linux
> git fetch --no-tags linux-review Pawel-Chmielewski/ice-Change-assigning-method-of-the-CPU-affinity-masks/20230208-234144
> git checkout 33971c3245ae75900dbc4cc9aa2b76ff9cdb534c
> # save the config file
> mkdir build_dir && cp config build_dir/.config
> make W=1 O=build_dir ARCH=x86_64 olddefconfig
> make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash drivers/net/
>
> If you fix the issue, kindly add following tag where applicable
> | Reported-by: kernel test robot <[email protected]>
> | Link: https://lore.kernel.org/oe-kbuild-all/[email protected]

Sorry that the link is wrong, which should be

Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/

>
> All warnings (new ones prefixed by >>):
>
> drivers/net/ethernet/intel/ice/ice_base.c: In function 'ice_vsi_alloc_q_vectors':
> drivers/net/ethernet/intel/ice/ice_base.c:678:9: error: implicit declaration of function 'for_each_numa_hop_mask'; did you mean 'for_each_node_mask'? [-Werror=implicit-function-declaration]
> 678 | for_each_numa_hop_mask(aff_mask, numa_node) {
> | ^~~~~~~~~~~~~~~~~~~~~~
> | for_each_node_mask
> drivers/net/ethernet/intel/ice/ice_base.c:678:52: error: expected ';' before '{' token
> 678 | for_each_numa_hop_mask(aff_mask, numa_node) {
> | ^~
> | ;
> >> drivers/net/ethernet/intel/ice/ice_base.c:663:20: warning: unused variable 'cpu' [-Wunused-variable]
> 663 | u16 v_idx, cpu = 0;
> | ^~~
> >> drivers/net/ethernet/intel/ice/ice_base.c:660:31: warning: unused variable 'last_aff_mask' [-Wunused-variable]
> 660 | cpumask_t *aff_mask, *last_aff_mask = cpu_none_mask;
> | ^~~~~~~~~~~~~
> cc1: some warnings being treated as errors
>
>
> vim +/cpu +663 drivers/net/ethernet/intel/ice/ice_base.c
>
> 650
> 651 /**
> 652 * ice_vsi_alloc_q_vectors - Allocate memory for interrupt vectors
> 653 * @vsi: the VSI being configured
> 654 *
> 655 * We allocate one q_vector per queue interrupt. If allocation fails we
> 656 * return -ENOMEM.
> 657 */
> 658 int ice_vsi_alloc_q_vectors(struct ice_vsi *vsi)
> 659 {
> > 660 cpumask_t *aff_mask, *last_aff_mask = cpu_none_mask;
> 661 struct device *dev = ice_pf_to_dev(vsi->back);
> 662 int numa_node = dev->numa_node;
> > 663 u16 v_idx, cpu = 0;
> 664 int err;
> 665
> 666 if (vsi->q_vectors[0]) {
> 667 dev_dbg(dev, "VSI %d has existing q_vectors\n", vsi->vsi_num);
> 668 return -EEXIST;
> 669 }
> 670
> 671 for (v_idx = 0; v_idx < vsi->num_q_vectors; v_idx++) {
> 672 err = ice_vsi_alloc_q_vector(vsi, v_idx);
> 673 if (err)
> 674 goto err_out;
> 675 }
> 676
> 677 v_idx = 0;
> > 678 for_each_numa_hop_mask(aff_mask, numa_node) {
> 679 for_each_cpu_andnot(cpu, aff_mask, last_aff_mask)
> 680 if (v_idx < vsi->num_q_vectors) {
> 681 if (cpu_online(cpu))
> 682 cpumask_set_cpu(cpu, &vsi->q_vectors[v_idx]->affinity_mask);
> 683 v_idx++;
> 684 }
> 685 last_aff_mask = aff_mask;
> 686 }
> 687
> 688 return 0;
> 689
> 690 err_out:
> 691 while (v_idx--)
> 692 ice_free_q_vector(vsi, v_idx);
> 693
> 694 dev_err(dev, "Failed to allocate %d q_vector for VSI %d, ret=%d\n",
> 695 vsi->num_q_vectors, vsi->vsi_num, err);
> 696 vsi->num_q_vectors = 0;
> 697 return err;
> 698 }
> 699
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests
>

2023-02-09 05:14:36

by kernel test robot

[permalink] [raw]

Subject: Re: [PATCH 1/1] ice: Change assigning method of the CPU affinity masks

Hi Pawel,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tnguy-next-queue/dev-queue]
[also build test ERROR on linus/master v6.2-rc7 next-20230208]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Pawel-Chmielewski/ice-Change-assigning-method-of-the-CPU-affinity-masks/20230208-234144
base: https://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue.git dev-queue
patch link: https://lore.kernel.org/r/20230208153905.109912-1-pawel.chmielewski%40intel.com
patch subject: [PATCH 1/1] ice: Change assigning method of the CPU affinity masks
config: i386-randconfig-a013 (https://download.01.org/0day-ci/archive/20230209/[email protected]/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/33971c3245ae75900dbc4cc9aa2b76ff9cdb534c
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Pawel-Chmielewski/ice-Change-assigning-method-of-the-CPU-affinity-masks/20230208-234144
git checkout 33971c3245ae75900dbc4cc9aa2b76ff9cdb534c
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 SHELL=/bin/bash drivers/net/ethernet/intel/ice/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-kbuild-all/[email protected]

All errors (new ones prefixed by >>):

>> drivers/net/ethernet/intel/ice/ice_base.c:662:23: error: no member named 'numa_node' in 'struct device'
int numa_node = dev->numa_node;
~~~ ^
>> drivers/net/ethernet/intel/ice/ice_base.c:678:2: error: implicit declaration of function 'for_each_numa_hop_mask' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
for_each_numa_hop_mask(aff_mask, numa_node) {
^
>> drivers/net/ethernet/intel/ice/ice_base.c:678:45: error: expected ';' after expression
for_each_numa_hop_mask(aff_mask, numa_node) {
^
;
3 errors generated.

vim +662 drivers/net/ethernet/intel/ice/ice_base.c

650
651 /**
652 * ice_vsi_alloc_q_vectors - Allocate memory for interrupt vectors
653 * @vsi: the VSI being configured
654 *
655 * We allocate one q_vector per queue interrupt. If allocation fails we
656 * return -ENOMEM.
657 */
658 int ice_vsi_alloc_q_vectors(struct ice_vsi *vsi)
659 {
660 cpumask_t *aff_mask, *last_aff_mask = cpu_none_mask;
661 struct device *dev = ice_pf_to_dev(vsi->back);
> 662 int numa_node = dev->numa_node;
663 u16 v_idx, cpu = 0;
664 int err;
665
666 if (vsi->q_vectors[0]) {
667 dev_dbg(dev, "VSI %d has existing q_vectors\n", vsi->vsi_num);
668 return -EEXIST;
669 }
670
671 for (v_idx = 0; v_idx < vsi->num_q_vectors; v_idx++) {
672 err = ice_vsi_alloc_q_vector(vsi, v_idx);
673 if (err)
674 goto err_out;
675 }
676
677 v_idx = 0;
> 678 for_each_numa_hop_mask(aff_mask, numa_node) {
679 for_each_cpu_andnot(cpu, aff_mask, last_aff_mask)
680 if (v_idx < vsi->num_q_vectors) {
681 if (cpu_online(cpu))
682 cpumask_set_cpu(cpu, &vsi->q_vectors[v_idx]->affinity_mask);
683 v_idx++;
684 }
685 last_aff_mask = aff_mask;
686 }
687
688 return 0;
689
690 err_out:
691 while (v_idx--)
692 ice_free_q_vector(vsi, v_idx);
693
694 dev_err(dev, "Failed to allocate %d q_vector for VSI %d, ret=%d\n",
695 vsi->num_q_vectors, vsi->vsi_num, err);
696 vsi->num_q_vectors = 0;
697 return err;
698 }
699

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

2023-02-16 14:56:14

by Pawel Chmielewski

[permalink] [raw]

Subject: [PATCH v2 1/1] ice: Change assigning method of the CPU affinity masks

With the introduction of sched_numa_hop_mask() and for_each_numa_hop_mask(),
the affinity masks for queue vectors can be conveniently set by preferring the
CPUs that are closest to the NUMA node of the parent PCI device.

Signed-off-by: Pawel Chmielewski <[email protected]>
---

Changes since v1:
* Removed obsolete comment
* Inverted condition for loop escape
* Incrementing v_idx only in case of available cpu
---
drivers/net/ethernet/intel/ice/ice_base.c | 24 +++++++++++++++++++----
1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
index 9e36f01dfa4f..27b00d224c5d 100644
--- a/drivers/net/ethernet/intel/ice/ice_base.c
+++ b/drivers/net/ethernet/intel/ice/ice_base.c
@@ -121,9 +121,6 @@ static int ice_vsi_alloc_q_vector(struct ice_vsi *vsi, u16 v_idx)

if (vsi->type == ICE_VSI_VF)
goto out;
- /* only set affinity_mask if the CPU is online */
- if (cpu_online(v_idx))
- cpumask_set_cpu(v_idx, &q_vector->affinity_mask);

/* This will not be called in the driver load path because the netdev
* will not be created yet. All other cases with register the NAPI
@@ -659,8 +656,10 @@ int ice_vsi_wait_one_rx_ring(struct ice_vsi *vsi, bool ena, u16 rxq_idx)
*/
int ice_vsi_alloc_q_vectors(struct ice_vsi *vsi)
{
+ cpumask_t *aff_mask, *last_aff_mask = cpu_none_mask;
struct device *dev = ice_pf_to_dev(vsi->back);
- u16 v_idx;
+ int numa_node = dev->numa_node;
+ u16 v_idx, cpu = 0;
int err;

if (vsi->q_vectors[0]) {
@@ -674,6 +673,23 @@ int ice_vsi_alloc_q_vectors(struct ice_vsi *vsi)
goto err_out;
}

+ v_idx = 0;
+
+ for_each_numa_hop_mask(aff_mask, numa_node) {
+ for_each_cpu_andnot(cpu, aff_mask, last_aff_mask) {
+ if (v_idx >= vsi->num_q_vectors)
+ goto out;
+
+ if (cpu_online(cpu)) {
+ cpumask_set_cpu(cpu, &vsi->q_vectors[v_idx]->affinity_mask);
+ v_idx++;
+ }
+ }
+
+ last_aff_mask = aff_mask;
+ }
+
+out:
return 0;

err_out:
--
2.37.3

2023-02-16 15:16:29

[permalink] [raw]

Subject: Re: [PATCH v2 1/1] ice: Change assigning method of the CPU affinity masks

On Thu, Feb 16, 2023 at 05:14:33PM +0200, Andy Shevchenko wrote:
> On Thu, Feb 16, 2023 at 03:54:55PM +0100, Pawel Chmielewski wrote:

...

> > + for_each_cpu_andnot(cpu, aff_mask, last_aff_mask) {
> > + if (v_idx >= vsi->num_q_vectors)
>
> > + goto out;
>
> Useless. You can return 0; here.
>
> > + }

Btw, I briefly checked the other functions nearby in the code and none of them
is using goto for OK cases.

--
With Best Regards,
Andy Shevchenko

2023-02-16 15:17:25

[permalink] [raw]

Subject: Re: [PATCH v2 1/1] ice: Change assigning method of the CPU affinity masks

On Thu, Feb 16, 2023 at 03:54:55PM +0100, Pawel Chmielewski wrote:
> With the introduction of sched_numa_hop_mask() and for_each_numa_hop_mask(),
> the affinity masks for queue vectors can be conveniently set by preferring the
> CPUs that are closest to the NUMA node of the parent PCI device.

...

> + v_idx = 0;

> +

Redundant blank line.

> + for_each_numa_hop_mask(aff_mask, numa_node) {
> + for_each_cpu_andnot(cpu, aff_mask, last_aff_mask) {
> + if (v_idx >= vsi->num_q_vectors)

> + goto out;

Useless. You can return 0; here.

> + if (cpu_online(cpu)) {
> + cpumask_set_cpu(cpu, &vsi->q_vectors[v_idx]->affinity_mask);
> + v_idx++;
> + }
> + }
> +
> + last_aff_mask = aff_mask;
> + }
> +
> +out:
> return 0;

--
With Best Regards,
Andy Shevchenko

2023-02-17 01:39:15

[permalink] [raw]

Subject: Re: [PATCH 3/9] sched: add sched_numa_find_nth_cpu()

Hi Jakub,

Can you please fold-in the following patch?

Thanks,
Yury

From: Yury Norov <[email protected]>
Date: Thu, 16 Feb 2023 17:03:30 -0800
Subject: [PATCH] sched/topology: fix KASAN warning in hop_cmp()

Despite that prev_hop is used conditionally on curr_hop is not the
first hop, it's initialized unconditionally.

Because initialization implies dereferencing, it might happen that
the code dereferences uninitialized memory, which has been spotted by
KASAN. Fix it by reorganizing hop_cmp() logic.

Reported-by: Bruno Goncalves <[email protected]>
Signed-off-by: Yury Norov <[email protected]>
---
kernel/sched/topology.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 48838a05c008..c21b8b1f3537 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2081,14 +2081,19 @@ struct __cmp_key {

static int hop_cmp(const void *a, const void *b)
{
- struct cpumask **prev_hop = *((struct cpumask ***)b - 1);
- struct cpumask **cur_hop = *(struct cpumask ***)b;
+ struct cpumask **prev_hop, **cur_hop = *(struct cpumask ***)b;
struct __cmp_key *k = (struct __cmp_key *)a;

if (cpumask_weight_and(k->cpus, cur_hop[k->node]) <= k->cpu)
return 1;

- k->w = (b == k->masks) ? 0 : cpumask_weight_and(k->cpus, prev_hop[k->node]);
+ if (b == k->masks) {
+ k->w = 0;
+ return 0;
+ }
+
+ prev_hop = *((struct cpumask ***)b - 1);
+ k->w = cpumask_weight_and(k->cpus, prev_hop[k->node]);
if (k->w <= k->cpu)
return 0;

--
2.34.1

2023-02-17 11:13:36

[permalink] [raw]

Subject: Re: [PATCH 3/9] sched: add sched_numa_find_nth_cpu()

On Thu, Feb 16, 2023 at 05:39:08PM -0800, Yury Norov wrote:

> From: Yury Norov <[email protected]>
> Date: Thu, 16 Feb 2023 17:03:30 -0800
> Subject: [PATCH] sched/topology: fix KASAN warning in hop_cmp()
>
> Despite that prev_hop is used conditionally on curr_hop is not the

curr --> cur

> first hop, it's initialized unconditionally.
>
> Because initialization implies dereferencing, it might happen that
> the code dereferences uninitialized memory, which has been spotted by
> KASAN. Fix it by reorganizing hop_cmp() logic.

Nice catch! I guess it deserves for a comment inside the code
(IIRC I was puzzled of the logic behind and it was changed due
to lack of this knowledge.)

--
With Best Regards,
Andy Shevchenko

2023-02-20 19:47:04