2022-12-08 18:32:10

by Yury Norov

[permalink] [raw]
Subject: [PATCH v3 0/5] cpumask: improve on cpumask_local_spread() locality

cpumask_local_spread() currently checks local node for presence of i'th
CPU, and then if it finds nothing makes a flat search among all non-local
CPUs. We can do it better by checking CPUs per NUMA hops.

This series is inspired by Tariq Toukan and Valentin Schneider's
"net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity
hints"

https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/

According to their measurements, for mlx5e:

Bottleneck in RX side is released, reached linerate (~1.8x speedup).
~30% less cpu util on TX.

This patch makes cpumask_local_spread() traversing CPUs based on NUMA
distance, just as well, and I expect comparable improvement for its
users, as in case of mlx5e.

I tested new behavior on my VM with the following NUMA configuration:

root@debian:~# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3
node 0 size: 3869 MB
node 0 free: 3740 MB
node 1 cpus: 4 5
node 1 size: 1969 MB
node 1 free: 1937 MB
node 2 cpus: 6 7
node 2 size: 1967 MB
node 2 free: 1873 MB
node 3 cpus: 8 9 10 11 12 13 14 15
node 3 size: 7842 MB
node 3 free: 7723 MB
node distances:
node 0 1 2 3
0: 10 50 30 70
1: 50 10 70 30
2: 30 70 10 50
3: 70 30 50 10

And the cpumask_local_spread() for each node and offset traversing looks
like this:

node 0: 0 1 2 3 6 7 4 5 8 9 10 11 12 13 14 15
node 1: 4 5 8 9 10 11 12 13 14 15 0 1 2 3 6 7
node 2: 6 7 0 1 2 3 8 9 10 11 12 13 14 15 4 5
node 3: 8 9 10 11 12 13 14 15 4 5 6 7 0 1 2 3

v1: https://lore.kernel.org/lkml/[email protected]/T/
v2: https://lore.kernel.org/all/[email protected]/T/
v3:
- fix typo in find_nth_and_andnot_bit();
- add 5th patch that simplifies cpumask_local_spread();
- address various coding style nits.

Yury Norov (5):
lib/find: introduce find_nth_and_andnot_bit
cpumask: introduce cpumask_nth_and_andnot
sched: add sched_numa_find_nth_cpu()
cpumask: improve on cpumask_local_spread() locality
lib/cpumask: reorganize cpumask_local_spread() logic

include/linux/cpumask.h | 20 ++++++++++++++
include/linux/find.h | 33 +++++++++++++++++++++++
include/linux/topology.h | 8 ++++++
kernel/sched/topology.c | 57 ++++++++++++++++++++++++++++++++++++++++
lib/cpumask.c | 26 +++++-------------
lib/find_bit.c | 9 +++++++
6 files changed, 134 insertions(+), 19 deletions(-)

--
2.34.1


2022-12-08 18:32:12

by Yury Norov

[permalink] [raw]
Subject: [PATCH v3 1/5] lib/find: introduce find_nth_and_andnot_bit

The function is used to implement in-place bitmaps traversing without
storing intermediate result in temporary bitmaps, in the following patches.

Signed-off-by: Yury Norov <[email protected]>
---
include/linux/find.h | 33 +++++++++++++++++++++++++++++++++
lib/find_bit.c | 9 +++++++++
2 files changed, 42 insertions(+)

diff --git a/include/linux/find.h b/include/linux/find.h
index 3f59c2fbe438..b594207a0010 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -23,6 +23,9 @@ unsigned long __find_nth_and_bit(const unsigned long *addr1, const unsigned long
unsigned long size, unsigned long n);
unsigned long __find_nth_andnot_bit(const unsigned long *addr1, const unsigned long *addr2,
unsigned long size, unsigned long n);
+unsigned long __find_nth_and_andnot_bit(const unsigned long *addr1, const unsigned long *addr2,
+ const unsigned long *addr3, unsigned long size,
+ unsigned long n);
extern unsigned long _find_first_and_bit(const unsigned long *addr1,
const unsigned long *addr2, unsigned long size);
extern unsigned long _find_first_zero_bit(const unsigned long *addr, unsigned long size);
@@ -244,6 +247,36 @@ unsigned long find_nth_andnot_bit(const unsigned long *addr1, const unsigned lon
return __find_nth_andnot_bit(addr1, addr2, size, n);
}

+/**
+ * find_nth_and_andnot_bit - find N'th set bit in 2 memory regions,
+ * excluding those set in 3rd region
+ * @addr1: The 1st address to start the search at
+ * @addr2: The 2nd address to start the search at
+ * @addr3: The 3rd address to start the search at
+ * @size: The maximum number of bits to search
+ * @n: The number of set bit, which position is needed, counting from 0
+ *
+ * Returns the bit number of the N'th set bit.
+ * If no such, returns @size.
+ */
+static __always_inline
+unsigned long find_nth_and_andnot_bit(const unsigned long *addr1,
+ const unsigned long *addr2,
+ const unsigned long *addr3,
+ unsigned long size, unsigned long n)
+{
+ if (n >= size)
+ return size;
+
+ if (small_const_nbits(size)) {
+ unsigned long val = *addr1 & *addr2 & (~*addr3) & GENMASK(size - 1, 0);
+
+ return val ? fns(val, n) : size;
+ }
+
+ return __find_nth_and_andnot_bit(addr1, addr2, addr3, size, n);
+}
+
#ifndef find_first_and_bit
/**
* find_first_and_bit - find the first set bit in both memory regions
diff --git a/lib/find_bit.c b/lib/find_bit.c
index 18bc0a7ac8ee..c10920e66788 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -155,6 +155,15 @@ unsigned long __find_nth_andnot_bit(const unsigned long *addr1, const unsigned l
}
EXPORT_SYMBOL(__find_nth_andnot_bit);

+unsigned long __find_nth_and_andnot_bit(const unsigned long *addr1,
+ const unsigned long *addr2,
+ const unsigned long *addr3,
+ unsigned long size, unsigned long n)
+{
+ return FIND_NTH_BIT(addr1[idx] & addr2[idx] & ~addr3[idx], size, n);
+}
+EXPORT_SYMBOL(__find_nth_and_andnot_bit);
+
#ifndef find_next_and_bit
unsigned long _find_next_and_bit(const unsigned long *addr1, const unsigned long *addr2,
unsigned long nbits, unsigned long start)
--
2.34.1

2022-12-08 18:32:14

by Yury Norov

[permalink] [raw]
Subject: [PATCH v3 2/5] cpumask: introduce cpumask_nth_and_andnot

Introduce cpumask_nth_and_andnot() based on find_nth_and_andnot_bit().
It's used in the following patch to traverse cpumasks without storing
intermediate result in temporary cpumask.

Signed-off-by: Yury Norov <[email protected]>
---
include/linux/cpumask.h | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 9543b22d6dc2..5c4905108d1b 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -391,6 +391,26 @@ unsigned int cpumask_nth_andnot(unsigned int cpu, const struct cpumask *srcp1,
nr_cpumask_bits, cpumask_check(cpu));
}

+/**
+ * cpumask_nth_and_andnot - get the Nth cpu set in 1st and 2nd cpumask, and clear in 3rd.
+ * @srcp1: the cpumask pointer
+ * @srcp2: the cpumask pointer
+ * @srcp3: the cpumask pointer
+ * @cpu: the N'th cpu to find, starting from 0
+ *
+ * Returns >= nr_cpu_ids if such cpu doesn't exist.
+ */
+static __always_inline
+unsigned int cpumask_nth_and_andnot(unsigned int cpu, const struct cpumask *srcp1,
+ const struct cpumask *srcp2,
+ const struct cpumask *srcp3)
+{
+ return find_nth_and_andnot_bit(cpumask_bits(srcp1),
+ cpumask_bits(srcp2),
+ cpumask_bits(srcp3),
+ nr_cpumask_bits, cpumask_check(cpu));
+}
+
#define CPU_BITS_NONE \
{ \
[0 ... BITS_TO_LONGS(NR_CPUS)-1] = 0UL \
--
2.34.1

2022-12-08 18:32:17

by Yury Norov

[permalink] [raw]
Subject: [PATCH v3 3/5] sched: add sched_numa_find_nth_cpu()

The function finds Nth set CPU in a given cpumask starting from a given
node.

Leveraging the fact that each hop in sched_domains_numa_masks includes the
same or greater number of CPUs than the previous one, we can use binary
search on hops instead of linear walk, which makes the overall complexity
of O(log n) in terms of number of cpumask_weight() calls.

Signed-off-by: Yury Norov <[email protected]>
---
include/linux/topology.h | 8 ++++++
kernel/sched/topology.c | 57 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 65 insertions(+)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 4564faafd0e1..72f264575698 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -245,5 +245,13 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
return cpumask_of_node(cpu_to_node(cpu));
}

+#ifdef CONFIG_NUMA
+int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node);
+#else
+static __always_inline int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
+{
+ return cpumask_nth(cpu, cpus);
+}
+#endif /* CONFIG_NUMA */

#endif /* _LINUX_TOPOLOGY_H */
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8739c2a5a54e..e515dcf44816 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1764,6 +1764,8 @@ bool find_numa_distance(int distance)
* there is an intermediary node C, which is < N hops away from both
* nodes A and B, the system is a glueless mesh.
*/
+#include <linux/bsearch.h>
+
static void init_numa_topology_type(int offline_node)
{
int a, b, c, n;
@@ -2067,6 +2069,61 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
return found;
}

+struct __cmp_key {
+ const struct cpumask *cpus;
+ struct cpumask ***masks;
+ int node;
+ int cpu;
+ int w;
+};
+
+static int cmp(const void *a, const void *b)
+{
+ struct cpumask **prev_hop = *((struct cpumask ***)b - 1);
+ struct cpumask **cur_hop = *(struct cpumask ***)b;
+ struct __cmp_key *k = (struct __cmp_key *)a;
+
+ if (cpumask_weight_and(k->cpus, cur_hop[k->node]) <= k->cpu)
+ return 1;
+
+ k->w = (b == k->masks) ? 0 : cpumask_weight_and(k->cpus, prev_hop[k->node]);
+ if (k->w <= k->cpu)
+ return 0;
+
+ return -1;
+}
+
+/*
+ * sched_numa_find_nth_cpu() - given the NUMA topology, find the Nth next cpu
+ * closest to @cpu from @cpumask.
+ * cpumask: cpumask to find a cpu from
+ * cpu: Nth cpu to find
+ *
+ * returns: cpu, or nr_cpu_ids when nothing found.
+ */
+int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
+{
+ struct __cmp_key k = { .cpus = cpus, .node = node, .cpu = cpu };
+ struct cpumask ***hop_masks;
+ int hop, ret = nr_cpu_ids;
+
+ rcu_read_lock();
+
+ k.masks = rcu_dereference(sched_domains_numa_masks);
+ if (!k.masks)
+ goto unlock;
+
+ hop_masks = bsearch(&k, k.masks, sched_domains_numa_levels, sizeof(k.masks[0]), cmp);
+ hop = hop_masks - k.masks;
+
+ ret = hop ?
+ cpumask_nth_and_andnot(cpu - k.w, cpus, k.masks[hop][node], k.masks[hop-1][node]) :
+ cpumask_nth_and(cpu, cpus, k.masks[0][node]);
+unlock:
+ rcu_read_unlock();
+ return ret;
+}
+EXPORT_SYMBOL_GPL(sched_numa_find_nth_cpu);
#endif /* CONFIG_NUMA */

static int __sdt_alloc(const struct cpumask *cpu_map)
--
2.34.1

2022-12-08 18:32:49

by Yury Norov

[permalink] [raw]
Subject: [PATCH v3 4/5] cpumask: improve on cpumask_local_spread() locality

Switch cpumask_local_spread() to use newly added sched_numa_find_nth_cpu(),
which takes into account distances to each node in the system.

For the following NUMA configuration:

root@debian:~# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3
node 0 size: 3869 MB
node 0 free: 3740 MB
node 1 cpus: 4 5
node 1 size: 1969 MB
node 1 free: 1937 MB
node 2 cpus: 6 7
node 2 size: 1967 MB
node 2 free: 1873 MB
node 3 cpus: 8 9 10 11 12 13 14 15
node 3 size: 7842 MB
node 3 free: 7723 MB
node distances:
node 0 1 2 3
0: 10 50 30 70
1: 50 10 70 30
2: 30 70 10 50
3: 70 30 50 10

The new cpumask_local_spread() traverses cpus for each node like this:

node 0: 0 1 2 3 6 7 4 5 8 9 10 11 12 13 14 15
node 1: 4 5 8 9 10 11 12 13 14 15 0 1 2 3 6 7
node 2: 6 7 0 1 2 3 8 9 10 11 12 13 14 15 4 5
node 3: 8 9 10 11 12 13 14 15 4 5 6 7 0 1 2 3

Signed-off-by: Yury Norov <[email protected]>
---
lib/cpumask.c | 12 ++----------
1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/lib/cpumask.c b/lib/cpumask.c
index c7c392514fd3..255974cd6734 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -110,7 +110,7 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
#endif

/**
- * cpumask_local_spread - select the i'th cpu with local numa cpu's first
+ * cpumask_local_spread - select the i'th cpu based on NUMA distances
* @i: index number
* @node: local numa_node
*
@@ -132,15 +132,7 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
if (cpu < nr_cpu_ids)
return cpu;
} else {
- /* NUMA first. */
- cpu = cpumask_nth_and(i, cpu_online_mask, cpumask_of_node(node));
- if (cpu < nr_cpu_ids)
- return cpu;
-
- i -= cpumask_weight_and(cpu_online_mask, cpumask_of_node(node));
-
- /* Skip NUMA nodes, done above. */
- cpu = cpumask_nth_andnot(i, cpu_online_mask, cpumask_of_node(node));
+ cpu = sched_numa_find_nth_cpu(cpu_online_mask, i, node);
if (cpu < nr_cpu_ids)
return cpu;
}
--
2.34.1

2022-12-08 18:33:12

by Yury Norov

[permalink] [raw]
Subject: [PATCH v3 5/5] lib/cpumask: reorganize cpumask_local_spread() logic

Now after moving all NUMA logic into sched_numa_find_nth_cpu(),
else-branch of cpumask_local_spread() is just a function call, and
we can simplify logic by using ternary operator.

While here, replace BUG() with WARN().

Signed-off-by: Yury Norov <[email protected]>
---
lib/cpumask.c | 16 ++++++----------
1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/lib/cpumask.c b/lib/cpumask.c
index 255974cd6734..c7029fb3c372 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -127,16 +127,12 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
/* Wrap: we always want a cpu. */
i %= num_online_cpus();

- if (node == NUMA_NO_NODE) {
- cpu = cpumask_nth(i, cpu_online_mask);
- if (cpu < nr_cpu_ids)
- return cpu;
- } else {
- cpu = sched_numa_find_nth_cpu(cpu_online_mask, i, node);
- if (cpu < nr_cpu_ids)
- return cpu;
- }
- BUG();
+ cpu = node == NUMA_NO_NODE ?
+ cpumask_nth(i, cpu_online_mask) :
+ sched_numa_find_nth_cpu(cpu_online_mask, i, node);
+
+ WARN_ON(cpu >= nr_cpu_ids);
+ return cpu;
}
EXPORT_SYMBOL(cpumask_local_spread);

--
2.34.1

2022-12-08 19:00:32

by Keller, Jacob E

[permalink] [raw]
Subject: Re: [PATCH v3 0/5] cpumask: improve on cpumask_local_spread() locality



On 12/8/2022 10:30 AM, Yury Norov wrote:
> cpumask_local_spread() currently checks local node for presence of i'th
> CPU, and then if it finds nothing makes a flat search among all non-local
> CPUs. We can do it better by checking CPUs per NUMA hops.
>
> This series is inspired by Tariq Toukan and Valentin Schneider's
> "net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity
> hints"
>
> https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/
>
> According to their measurements, for mlx5e:
>
> Bottleneck in RX side is released, reached linerate (~1.8x speedup).
> ~30% less cpu util on TX.
>
> This patch makes cpumask_local_spread() traversing CPUs based on NUMA
> distance, just as well, and I expect comparable improvement for its
> users, as in case of mlx5e.
>
> I tested new behavior on my VM with the following NUMA configuration:
>
> root@debian:~# numactl -H
> available: 4 nodes (0-3)
> node 0 cpus: 0 1 2 3
> node 0 size: 3869 MB
> node 0 free: 3740 MB
> node 1 cpus: 4 5
> node 1 size: 1969 MB
> node 1 free: 1937 MB
> node 2 cpus: 6 7
> node 2 size: 1967 MB
> node 2 free: 1873 MB
> node 3 cpus: 8 9 10 11 12 13 14 15
> node 3 size: 7842 MB
> node 3 free: 7723 MB
> node distances:
> node 0 1 2 3
> 0: 10 50 30 70
> 1: 50 10 70 30
> 2: 30 70 10 50
> 3: 70 30 50 10
>
> And the cpumask_local_spread() for each node and offset traversing looks
> like this:
>
> node 0: 0 1 2 3 6 7 4 5 8 9 10 11 12 13 14 15
> node 1: 4 5 8 9 10 11 12 13 14 15 0 1 2 3 6 7
> node 2: 6 7 0 1 2 3 8 9 10 11 12 13 14 15 4 5
> node 3: 8 9 10 11 12 13 14 15 4 5 6 7 0 1 2 3
>
> v1: https://lore.kernel.org/lkml/[email protected]/T/
> v2: https://lore.kernel.org/all/[email protected]/T/
> v3:
> - fix typo in find_nth_and_andnot_bit();
> - add 5th patch that simplifies cpumask_local_spread();
> - address various coding style nits.
>

The whole series look reasonable to me!

Reviewed-by: Jacob Keller <[email protected]>

2022-12-08 20:23:25

by Peter Lafreniere

[permalink] [raw]
Subject: Re: [PATCH v3 5/5] lib/cpumask: reorganize cpumask_local_spread() logic

> Now after moving all NUMA logic into sched_numa_find_nth_cpu(),
> else-branch of cpumask_local_spread() is just a function call, and
> we can simplify logic by using ternary operator.
>
> While here, replace BUG() with WARN().
Why make this change? It's still as bad to hit the WARN_ON as it was before.

> Signed-off-by: Yury Norov [email protected]
>
> ---
> lib/cpumask.c | 16 ++++++----------
> 1 file changed, 6 insertions(+), 10 deletions(-)
>
> diff --git a/lib/cpumask.c b/lib/cpumask.c
> index 255974cd6734..c7029fb3c372 100644
> --- a/lib/cpumask.c
> +++ b/lib/cpumask.c
> @@ -127,16 +127,12 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
> /* Wrap: we always want a cpu. */
> i %= num_online_cpus();
>
> - if (node == NUMA_NO_NODE) {
> - cpu = cpumask_nth(i, cpu_online_mask);
> - if (cpu < nr_cpu_ids)
> - return cpu;
> - } else {
> - cpu = sched_numa_find_nth_cpu(cpu_online_mask, i, node);
> - if (cpu < nr_cpu_ids)
> - return cpu;
> - }
> - BUG();
> + cpu = node == NUMA_NO_NODE ?
> + cpumask_nth(i, cpu_online_mask) :
> + sched_numa_find_nth_cpu(cpu_online_mask, i, node);
I find the if version clearer, and cleaner too if you drop the brackets.

For the ternary version it would be nice to parenthesize the equality
like you did in cmp() in 3/5.

> +
> + WARN_ON(cpu >= nr_cpu_ids);
>
> + return cpu;
> }
> EXPORT_SYMBOL(cpumask_local_spread);
>
> --
> 2.34.1

Minor nit: cmp() in 3/5 could use a longer name. The file's long, and
cmp() doesn't explain _what_ it's comparing. How about cmp_cpumask() or
something related to the function using it?

Other than the above particularities, the whole series looks good to me.
Reviewed-by: Peter Lafreniere <[email protected]>

2022-12-08 20:25:02

by Tariq Toukan

[permalink] [raw]
Subject: Re: [PATCH v3 0/5] cpumask: improve on cpumask_local_spread() locality



On 12/8/2022 8:30 PM, Yury Norov wrote:
> cpumask_local_spread() currently checks local node for presence of i'th
> CPU, and then if it finds nothing makes a flat search among all non-local
> CPUs. We can do it better by checking CPUs per NUMA hops.
>
> This series is inspired by Tariq Toukan and Valentin Schneider's
> "net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity
> hints"
>
> https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/
>
> According to their measurements, for mlx5e:
>
> Bottleneck in RX side is released, reached linerate (~1.8x speedup).
> ~30% less cpu util on TX.
>
> This patch makes cpumask_local_spread() traversing CPUs based on NUMA
> distance, just as well, and I expect comparable improvement for its
> users, as in case of mlx5e.
>
> I tested new behavior on my VM with the following NUMA configuration:
>
> root@debian:~# numactl -H
> available: 4 nodes (0-3)
> node 0 cpus: 0 1 2 3
> node 0 size: 3869 MB
> node 0 free: 3740 MB
> node 1 cpus: 4 5
> node 1 size: 1969 MB
> node 1 free: 1937 MB
> node 2 cpus: 6 7
> node 2 size: 1967 MB
> node 2 free: 1873 MB
> node 3 cpus: 8 9 10 11 12 13 14 15
> node 3 size: 7842 MB
> node 3 free: 7723 MB
> node distances:
> node 0 1 2 3
> 0: 10 50 30 70
> 1: 50 10 70 30
> 2: 30 70 10 50
> 3: 70 30 50 10
>
> And the cpumask_local_spread() for each node and offset traversing looks
> like this:
>
> node 0: 0 1 2 3 6 7 4 5 8 9 10 11 12 13 14 15
> node 1: 4 5 8 9 10 11 12 13 14 15 0 1 2 3 6 7
> node 2: 6 7 0 1 2 3 8 9 10 11 12 13 14 15 4 5
> node 3: 8 9 10 11 12 13 14 15 4 5 6 7 0 1 2 3
>
> v1: https://lore.kernel.org/lkml/[email protected]/T/
> v2: https://lore.kernel.org/all/[email protected]/T/
> v3:
> - fix typo in find_nth_and_andnot_bit();
> - add 5th patch that simplifies cpumask_local_spread();
> - address various coding style nits.
>
> Yury Norov (5):
> lib/find: introduce find_nth_and_andnot_bit
> cpumask: introduce cpumask_nth_and_andnot
> sched: add sched_numa_find_nth_cpu()
> cpumask: improve on cpumask_local_spread() locality
> lib/cpumask: reorganize cpumask_local_spread() logic
>
> include/linux/cpumask.h | 20 ++++++++++++++
> include/linux/find.h | 33 +++++++++++++++++++++++
> include/linux/topology.h | 8 ++++++
> kernel/sched/topology.c | 57 ++++++++++++++++++++++++++++++++++++++++
> lib/cpumask.c | 26 +++++-------------
> lib/find_bit.c | 9 +++++++
> 6 files changed, 134 insertions(+), 19 deletions(-)
>

Acked-by: Tariq Toukan <[email protected]>

2022-12-08 20:48:59

by Yury Norov

[permalink] [raw]
Subject: Re: [PATCH v3 5/5] lib/cpumask: reorganize cpumask_local_spread() logic

On Thu, Dec 08, 2022 at 08:17:22PM +0000, Peter Lafreniere wrote:
> > Now after moving all NUMA logic into sched_numa_find_nth_cpu(),
> > else-branch of cpumask_local_spread() is just a function call, and
> > we can simplify logic by using ternary operator.
> >
> > While here, replace BUG() with WARN().
> Why make this change? It's still as bad to hit the WARN_ON as it was before.

For example, because of this:

> Greg, please don't do this
>
> > [email protected], 2002-09-05 08:33:20-07:00, [email protected]
> > USB: storage driver: replace show_trace() with BUG()
>
> that BUG() thing is _way_ out of line, and has killed a few of my machines
> several times for no good reason. It actively hurts debuggability, because
> the machine is totally dead after it, and the whole and ONLY point of
> BUG() messages is to help debugging and make it clear that we can't handle
> something.
>
> In this case, we _can_ handle it, and we're much better off with a machine
> that works and that you can look up the messages with than killing it.
>
> Rule of thumb: BUG() is only good for something that never happens and
> that we really have no other option for (ie state is so corrupt that
> continuing is deadly).
>
> Linus

2022-12-08 20:58:33

by Yury Norov

[permalink] [raw]
Subject: Re: [PATCH v3 0/5] cpumask: improve on cpumask_local_spread() locality

On Thu, Dec 08, 2022 at 10:22:22PM +0200, Tariq Toukan wrote:
>
>
> On 12/8/2022 8:30 PM, Yury Norov wrote:
> > cpumask_local_spread() currently checks local node for presence of i'th
> > CPU, and then if it finds nothing makes a flat search among all non-local
> > CPUs. We can do it better by checking CPUs per NUMA hops.
> >
> > This series is inspired by Tariq Toukan and Valentin Schneider's
> > "net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity
> > hints"
> >
> > https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/
> >
> > According to their measurements, for mlx5e:
> >
> > Bottleneck in RX side is released, reached linerate (~1.8x speedup).
> > ~30% less cpu util on TX.
> >
> > This patch makes cpumask_local_spread() traversing CPUs based on NUMA
> > distance, just as well, and I expect comparable improvement for its
> > users, as in case of mlx5e.
> >
> > I tested new behavior on my VM with the following NUMA configuration:
> >
> > root@debian:~# numactl -H
> > available: 4 nodes (0-3)
> > node 0 cpus: 0 1 2 3
> > node 0 size: 3869 MB
> > node 0 free: 3740 MB
> > node 1 cpus: 4 5
> > node 1 size: 1969 MB
> > node 1 free: 1937 MB
> > node 2 cpus: 6 7
> > node 2 size: 1967 MB
> > node 2 free: 1873 MB
> > node 3 cpus: 8 9 10 11 12 13 14 15
> > node 3 size: 7842 MB
> > node 3 free: 7723 MB
> > node distances:
> > node 0 1 2 3
> > 0: 10 50 30 70
> > 1: 50 10 70 30
> > 2: 30 70 10 50
> > 3: 70 30 50 10
> >
> > And the cpumask_local_spread() for each node and offset traversing looks
> > like this:
> >
> > node 0: 0 1 2 3 6 7 4 5 8 9 10 11 12 13 14 15
> > node 1: 4 5 8 9 10 11 12 13 14 15 0 1 2 3 6 7
> > node 2: 6 7 0 1 2 3 8 9 10 11 12 13 14 15 4 5
> > node 3: 8 9 10 11 12 13 14 15 4 5 6 7 0 1 2 3
> >
> > v1: https://lore.kernel.org/lkml/[email protected]/T/
> > v2: https://lore.kernel.org/all/[email protected]/T/
> > v3:
> > - fix typo in find_nth_and_andnot_bit();
> > - add 5th patch that simplifies cpumask_local_spread();
> > - address various coding style nits.
> >
> > Yury Norov (5):
> > lib/find: introduce find_nth_and_andnot_bit
> > cpumask: introduce cpumask_nth_and_andnot
> > sched: add sched_numa_find_nth_cpu()
> > cpumask: improve on cpumask_local_spread() locality
> > lib/cpumask: reorganize cpumask_local_spread() logic
> >
> > include/linux/cpumask.h | 20 ++++++++++++++
> > include/linux/find.h | 33 +++++++++++++++++++++++
> > include/linux/topology.h | 8 ++++++
> > kernel/sched/topology.c | 57 ++++++++++++++++++++++++++++++++++++++++
> > lib/cpumask.c | 26 +++++-------------
> > lib/find_bit.c | 9 +++++++
> > 6 files changed, 134 insertions(+), 19 deletions(-)
> >
>
> Acked-by: Tariq Toukan <[email protected]>

Thanks Tariq, Jacob and Peter for review. I'll add the series in
bitmap-for-next for testing. Still, I think that sched/numa branches
would be more suitable.

Thanks,
Yury

2022-12-08 21:00:16

by Peter Lafreniere

[permalink] [raw]
Subject: Re: [PATCH v3 5/5] lib/cpumask: reorganize cpumask_local_spread() logic

> On Thu, Dec 08, 2022 at 08:17:22PM +0000, Peter Lafreniere wrote:
> > > Now after moving all NUMA logic into sched_numa_find_nth_cpu(),
> > > else-branch of cpumask_local_spread() is just a function call, and
> > > we can simplify logic by using ternary operator.
> > >
> > > While here, replace BUG() with WARN().
> > Why make this change? It's still as bad to hit the WARN_ON as it was before.
>
> For example, because of this:
>
> > Greg, please don't do this
> >
> > > [email protected], 2002-09-05 08:33:20-07:00, [email protected]
> > > USB: storage driver: replace show_trace() with BUG()
> >
> > that BUG() thing is _way_ out of line, and has killed a few of my machines
> > several times for no good reason. It actively hurts debuggability, because
> > the machine is totally dead after it, and the whole and ONLY point of
> > BUG() messages is to help debugging and make it clear that we can't handle
> > something.
> >
> > In this case, we _can_ handle it, and we're much better off with a machine
> > that works and that you can look up the messages with than killing it.
> >
> > Rule of thumb: BUG() is only good for something that never happens and
> > that we really have no other option for (ie state is so corrupt that
> > continuing is deadly).
> >
> > Linus

Fair enough. It's not like it'll be hit anyway. My concern was for if
any of the 23 callers get an invalid result. I guess that if that causes
a crash, then so be it. We have the warning to track down the cause.

Thanks for the explanation,
Peter Lafreniere <[email protected]>