2023-04-20 05:20:21

by Yury Norov

Subject: [PATCH v2 0/8] sched/topology: add for_each_numa_cpu() macro

for_each_cpu() is widely used in the kernel, and it's beneficial to create
a NUMA-aware version of the macro.

Recently added for_each_numa_hop_mask() works, but switching the existing
codebase to it is not an easy process.

This series adds for_each_numa_cpu(), which is designed to be similar to
for_each_cpu(). It allows converting existing code to a NUMA-aware version
as simply as adding a hop iterator variable and passing it to the new macro.
for_each_numa_cpu() takes care of the rest.

At the moment, we have two users of NUMA-aware enumerators. One is
Mellanox's in-tree driver, and another is Intel's in-review driver:

https://lore.kernel.org/lkml/[email protected]/

Both real-life examples follow the same pattern:

for_each_numa_hop_mask(cpus, prev, node) {
	for_each_cpu_andnot(cpu, cpus, prev) {
		if (cnt++ == max_num)
			goto out;
		do_something(cpu);
	}
	prev = cpus;
}

With the new macro, it has a more standard look, like this:

for_each_numa_cpu(cpu, hop, node, cpu_possible_mask) {
	if (cnt++ == max_num)
		break;
	do_something(cpu);
}

Straight conversion of the existing for_each_cpu() codebase to a NUMA-aware
version with for_each_numa_hop_mask() is difficult because the latter doesn't
take a user-provided cpu mask and eventually ends up with an open-coded
double loop. With for_each_numa_cpu() it shouldn't be a brainteaser.
Consider the NUMA-ignorant example:

cpumask_t cpus = get_mask();
int cnt = 0, cpu;

for_each_cpu(cpu, cpus) {
	if (cnt++ == max_num)
		break;
	do_something(cpu);
}

Converting it to the NUMA-aware version would be as simple as:

cpumask_t cpus = get_mask();
int node = get_node();
int cnt = 0, hop, cpu;

for_each_numa_cpu(cpu, hop, node, cpus) {
	if (cnt++ == max_num)
		break;
	do_something(cpu);
}

The latter is slightly more verbose but avoids open-coding that annoying
double loop. Another advantage is that it works with a 'hop' parameter,
which has the clear meaning of NUMA distance, and doesn't force people
unfamiliar with the enumerator internals to bother with the current and
previous masks machinery.

v2:
- rebase on top of master;
- clean up comments and tweak them to comply with kernel-doc;
- remove RFC from patch #8 as there are no objections.

Yury Norov (8):
lib/find: add find_next_and_andnot_bit()
sched/topology: introduce sched_numa_find_next_cpu()
sched/topology: add for_each_numa_cpu() macro
net: mlx5: switch comp_irqs_request() to using for_each_numa_cpu
lib/cpumask: update comment to cpumask_local_spread()
sched/topology: export sched_domains_numa_levels
lib: add test for for_each_numa_{cpu,hop_mask}()
sched: drop for_each_numa_hop_mask()

drivers/net/ethernet/mellanox/mlx5/core/eq.c | 16 ++----
include/linux/find.h | 43 ++++++++++++++
include/linux/topology.h | 37 ++++++------
kernel/sched/topology.c | 59 +++++++++++---------
lib/cpumask.c | 7 +--
lib/find_bit.c | 12 ++++
lib/test_bitmap.c | 16 ++++++
7 files changed, 133 insertions(+), 57 deletions(-)

--
2.34.1


2023-04-20 05:20:48

by Yury Norov

Subject: [PATCH v2 1/8] lib/find: add find_next_and_andnot_bit()

Similar to find_nth_and_andnot_bit(), find_next_and_andnot_bit() is
a convenient helper that allows traversing bitmaps without storing
intermediate results in a temporary bitmap.

In the following patches, the function is used to implement NUMA-aware
CPU enumeration.
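
For illustration, a minimal sketch of what the helper saves (addr1/addr2/addr3,
start and the NR_CPUS-sized bitmaps are placeholders). Without the helper, the
same search needs an intermediate bitmap:

	DECLARE_BITMAP(tmp, NR_CPUS);
	unsigned long bit;

	/* materialize addr1 & addr2 & ~addr3 first, then search it */
	bitmap_and(tmp, addr1, addr2, NR_CPUS);
	bitmap_andnot(tmp, tmp, addr3, NR_CPUS);
	bit = find_next_bit(tmp, NR_CPUS, start);

With find_next_and_andnot_bit(), the intermediate bitmap goes away:

	bit = find_next_and_andnot_bit(addr1, addr2, addr3, NR_CPUS, start);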

Signed-off-by: Yury Norov <[email protected]>
---
include/linux/find.h | 43 +++++++++++++++++++++++++++++++++++++++++++
lib/find_bit.c | 12 ++++++++++++
2 files changed, 55 insertions(+)

diff --git a/include/linux/find.h b/include/linux/find.h
index 5e4f39ef2e72..90b68d76c073 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -16,6 +16,9 @@ unsigned long _find_next_andnot_bit(const unsigned long *addr1, const unsigned l
unsigned long nbits, unsigned long start);
unsigned long _find_next_or_bit(const unsigned long *addr1, const unsigned long *addr2,
unsigned long nbits, unsigned long start);
+unsigned long _find_next_and_andnot_bit(const unsigned long *addr1, const unsigned long *addr2,
+ const unsigned long *addr3, unsigned long nbits,
+ unsigned long start);
unsigned long _find_next_zero_bit(const unsigned long *addr, unsigned long nbits,
unsigned long start);
extern unsigned long _find_first_bit(const unsigned long *addr, unsigned long size);
@@ -159,6 +162,40 @@ unsigned long find_next_or_bit(const unsigned long *addr1,
}
#endif

+#ifndef find_next_and_andnot_bit
+/**
+ * find_next_and_andnot_bit - find the next bit set in *addr1 and *addr2,
+ * excluding all the bits in *addr3
+ * @addr1: The first address to base the search on
+ * @addr2: The second address to base the search on
+ * @addr3: The third address to base the search on
+ * @size: The bitmap size in bits
+ * @offset: The bitnumber to start searching at
+ *
+ * Return: the bit number for the next set bit
+ * If no bits are set, returns @size.
+ */
+static __always_inline
+unsigned long find_next_and_andnot_bit(const unsigned long *addr1,
+ const unsigned long *addr2,
+ const unsigned long *addr3,
+ unsigned long size,
+ unsigned long offset)
+{
+ if (small_const_nbits(size)) {
+ unsigned long val;
+
+ if (unlikely(offset >= size))
+ return size;
+
+ val = *addr1 & *addr2 & ~*addr3 & GENMASK(size - 1, offset);
+ return val ? __ffs(val) : size;
+ }
+
+ return _find_next_and_andnot_bit(addr1, addr2, addr3, size, offset);
+}
+#endif
+
#ifndef find_next_zero_bit
/**
* find_next_zero_bit - find the next cleared bit in a memory region
@@ -568,6 +605,12 @@ unsigned long find_next_bit_le(const void *addr, unsigned
(bit) = find_next_andnot_bit((addr1), (addr2), (size), (bit)), (bit) < (size);\
(bit)++)

+#define for_each_and_andnot_bit(bit, addr1, addr2, addr3, size) \
+ for ((bit) = 0; \
+ (bit) = find_next_and_andnot_bit((addr1), (addr2), (addr3), (size), (bit)),\
+ (bit) < (size); \
+ (bit)++)
+
#define for_each_or_bit(bit, addr1, addr2, size) \
for ((bit) = 0; \
(bit) = find_next_or_bit((addr1), (addr2), (size), (bit)), (bit) < (size);\
diff --git a/lib/find_bit.c b/lib/find_bit.c
index 32f99e9a670e..4403e00890b1 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -182,6 +182,18 @@ unsigned long _find_next_andnot_bit(const unsigned long *addr1, const unsigned l
EXPORT_SYMBOL(_find_next_andnot_bit);
#endif

+#ifndef find_next_and_andnot_bit
+unsigned long _find_next_and_andnot_bit(const unsigned long *addr1,
+ const unsigned long *addr2,
+ const unsigned long *addr3,
+ unsigned long nbits,
+ unsigned long start)
+{
+ return FIND_NEXT_BIT(addr1[idx] & addr2[idx] & ~addr3[idx], /* nop */, nbits, start);
+}
+EXPORT_SYMBOL(_find_next_and_andnot_bit);
+#endif
+
#ifndef find_next_or_bit
unsigned long _find_next_or_bit(const unsigned long *addr1, const unsigned long *addr2,
unsigned long nbits, unsigned long start)
--
2.34.1

2023-04-20 05:21:02

by Yury Norov

Subject: [PATCH v2 3/8] sched/topology: add for_each_numa_cpu() macro

for_each_cpu() is widely used in the kernel, and it's beneficial to
create a NUMA-aware version of the macro.

Recently added for_each_numa_hop_mask() works, but switching the existing
codebase to it is not an easy process.

The new for_each_numa_cpu() is designed to be similar to for_each_cpu().
It allows converting existing code to a NUMA-aware version as simply as
adding a hop iterator variable and passing it to the new macro.
for_each_numa_cpu() takes care of the rest.

At the moment, we have two users of NUMA-aware enumerators. One is
Mellanox's in-tree driver, and another is Intel's in-review driver:

https://lore.kernel.org/lkml/[email protected]/

Both real-life examples follow the same pattern:

for_each_numa_hop_mask(cpus, prev, node) {
	for_each_cpu_andnot(cpu, cpus, prev) {
		if (cnt++ == max_num)
			goto out;
		do_something(cpu);
	}
	prev = cpus;
}

With the new macro, it would look like this:

for_each_numa_cpu(cpu, hop, node, cpu_possible_mask) {
	if (cnt++ == max_num)
		break;
	do_something(cpu);
}

Straight conversion of the existing for_each_cpu() codebase to a NUMA-aware
version with for_each_numa_hop_mask() is difficult because the latter doesn't
take a user-provided cpu mask and eventually ends up with an open-coded
double loop. With for_each_numa_cpu() it shouldn't be a brainteaser.
Consider the NUMA-ignorant example:

cpumask_t cpus = get_mask();
int cnt = 0, cpu;

for_each_cpu(cpu, cpus) {
	if (cnt++ == max_num)
		break;
	do_something(cpu);
}

Converting it to the NUMA-aware version would be as simple as:

cpumask_t cpus = get_mask();
int node = get_node();
int cnt = 0, hop, cpu;

for_each_numa_cpu(cpu, hop, node, cpus) {
	if (cnt++ == max_num)
		break;
	do_something(cpu);
}

The latter is slightly more verbose but avoids open-coding that annoying
double loop. Another advantage is that it works with a 'hop' parameter,
which has the clear meaning of NUMA distance, and doesn't force people
unfamiliar with the enumerator internals to bother with the current and
previous masks machinery.

Signed-off-by: Yury Norov <[email protected]>
---
include/linux/topology.h | 16 ++++++++++++++++
1 file changed, 16 insertions(+)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 13209095d6e2..01fb3a55d7ce 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -286,4 +286,20 @@ sched_numa_hop_mask(unsigned int node, unsigned int hops)
!IS_ERR_OR_NULL(mask); \
__hops++)

+/**
+ * for_each_numa_cpu - iterate over cpus in increasing order taking into account
+ * NUMA distances from a given node.
+ * @cpu: the (optionally unsigned) integer iterator
+ * @hop: the iterator variable, must be initialized to a desired minimal hop.
+ * @node: the NUMA node to start the search from.
+ * @mask: the cpumask pointer
+ *
+ * Requires rcu_lock to be held.
+ */
+#define for_each_numa_cpu(cpu, hop, node, mask) \
+ for ((cpu) = 0, (hop) = 0; \
+ (cpu) = sched_numa_find_next_cpu((mask), (cpu), (node), &(hop)),\
+ (cpu) < nr_cpu_ids; \
+ (cpu)++)
+
#endif /* _LINUX_TOPOLOGY_H */
--
2.34.1

2023-04-20 05:21:07

by Yury Norov

Subject: [PATCH v2 4/8] net: mlx5: switch comp_irqs_request() to using for_each_numa_cpu

for_each_numa_cpu() is a more straightforward alternative to
for_each_numa_hop_mask() + for_each_cpu_andnot().

Signed-off-by: Yury Norov <[email protected]>
---
drivers/net/ethernet/mellanox/mlx5/core/eq.c | 16 +++++-----------
1 file changed, 5 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 38b32e98f3bd..80368952e9b1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -817,12 +817,10 @@ static void comp_irqs_release(struct mlx5_core_dev *dev)
static int comp_irqs_request(struct mlx5_core_dev *dev)
{
struct mlx5_eq_table *table = dev->priv.eq_table;
- const struct cpumask *prev = cpu_none_mask;
- const struct cpumask *mask;
int ncomp_eqs = table->num_comp_eqs;
u16 *cpus;
int ret;
- int cpu;
+ int cpu, hop;
int i;

ncomp_eqs = table->num_comp_eqs;
@@ -844,15 +842,11 @@ static int comp_irqs_request(struct mlx5_core_dev *dev)

i = 0;
rcu_read_lock();
- for_each_numa_hop_mask(mask, dev->priv.numa_node) {
- for_each_cpu_andnot(cpu, mask, prev) {
- cpus[i] = cpu;
- if (++i == ncomp_eqs)
- goto spread_done;
- }
- prev = mask;
+ for_each_numa_cpu(cpu, hop, dev->priv.numa_node, cpu_possible_mask) {
+ cpus[i] = cpu;
+ if (++i == ncomp_eqs)
+ break;
}
-spread_done:
rcu_read_unlock();
ret = mlx5_irqs_request_vectors(dev, cpus, ncomp_eqs, table->comp_irqs);
kfree(cpus);
--
2.34.1

2023-04-20 05:21:18

by Yury Norov

Subject: [PATCH v2 2/8] sched/topology: introduce sched_numa_find_next_cpu()

The function searches for the next CPU in a given cpumask according to
the NUMA topology, traversing CPUs hop by hop.

If the CPU is the last one in a given hop, sched_numa_find_next_cpu()
switches to the next hop and picks the first CPU from there, excluding
those already traversed.
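
A minimal sketch of how a caller might drive it (mask, node and do_something()
are placeholders; the next patch wraps essentially this pattern into the
for_each_numa_cpu() macro):

	unsigned int hop = 0;
	int cpu;

	rcu_read_lock();
	for (cpu = sched_numa_find_next_cpu(mask, 0, node, &hop);
	     cpu < nr_cpu_ids;
	     cpu = sched_numa_find_next_cpu(mask, cpu + 1, node, &hop))
		do_something(cpu);
	rcu_read_unlock();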

Signed-off-by: Yury Norov <[email protected]>
---
include/linux/topology.h | 7 +++++++
kernel/sched/topology.c | 39 +++++++++++++++++++++++++++++++++++++++
2 files changed, 46 insertions(+)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index fea32377f7c7..13209095d6e2 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -247,6 +247,7 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)

#ifdef CONFIG_NUMA
int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node);
+int sched_numa_find_next_cpu(const struct cpumask *cpus, int cpu, int node, unsigned int *hop);
extern const struct cpumask *sched_numa_hop_mask(unsigned int node, unsigned int hops);
#else
static __always_inline int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
@@ -254,6 +255,12 @@ static __always_inline int sched_numa_find_nth_cpu(const struct cpumask *cpus, i
return cpumask_nth(cpu, cpus);
}

+static __always_inline
+int sched_numa_find_next_cpu(const struct cpumask *cpus, int cpu, int node, unsigned int *hop)
+{
+ return find_next_bit(cpumask_bits(cpus), small_cpumask_bits, cpu);
+}
+
static inline const struct cpumask *
sched_numa_hop_mask(unsigned int node, unsigned int hops)
{
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 051aaf65c749..fc163e4181e6 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2130,6 +2130,45 @@ int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
}
EXPORT_SYMBOL_GPL(sched_numa_find_nth_cpu);

+/*
+ * sched_numa_find_next_cpu() - given the NUMA topology, find the next cpu
+ * cpumask: cpumask to find a cpu from
+ * cpu: current cpu
+ * node: local node
+ * hop: (in/out) indicates distance order of current CPU to a local node
+ *
+ * The function searches for next cpu at a given NUMA distance, indicated
+ * by hop, and if nothing found, tries to find CPUs at a greater distance,
+ * starting from the beginning.
+ *
+ * Return: cpu, or >= nr_cpu_ids when nothing found.
+ */
+int sched_numa_find_next_cpu(const struct cpumask *cpus, int cpu, int node, unsigned int *hop)
+{
+ unsigned long *cur, *prev;
+ struct cpumask ***masks;
+ unsigned int ret;
+
+ if (*hop >= sched_domains_numa_levels)
+ return nr_cpu_ids;
+
+ masks = rcu_dereference(sched_domains_numa_masks);
+ cur = cpumask_bits(masks[*hop][node]);
+ if (*hop == 0)
+ ret = find_next_and_bit(cpumask_bits(cpus), cur, nr_cpu_ids, cpu);
+ else {
+ prev = cpumask_bits(masks[*hop - 1][node]);
+ ret = find_next_and_andnot_bit(cpumask_bits(cpus), cur, prev, nr_cpu_ids, cpu);
+ }
+
+ if (ret < nr_cpu_ids)
+ return ret;
+
+ *hop += 1;
+ return sched_numa_find_next_cpu(cpus, 0, node, hop);
+}
+EXPORT_SYMBOL_GPL(sched_numa_find_next_cpu);
+
/**
* sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops away from
* @node
--
2.34.1

2023-04-20 05:21:25

by Yury Norov

Subject: [PATCH v2 5/8] lib/cpumask: update comment to cpumask_local_spread()

Now that we have for_each_numa_cpu(), which is a more straightforward
replacement for cpumask_local_spread() when it comes to enumeration of
CPUs with respect to NUMA topology, it's worth updating the comment.
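
For comparison, a sketch of both spreading styles (ncpus, node and
do_something() are placeholders); every cpumask_local_spread() call restarts
the search from scratch, while the iterator keeps its position:

	unsigned int i, cpu, hop;

	/* index-based: one full search per call */
	for (i = 0; i < ncpus; i++)
		do_something(cpumask_local_spread(i, node));

	/* iterator-based: a single incremental walk */
	rcu_read_lock();
	for_each_numa_cpu(cpu, hop, node, cpu_online_mask)
		do_something(cpu);
	rcu_read_unlock();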

Signed-off-by: Yury Norov <[email protected]>
---
lib/cpumask.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/lib/cpumask.c b/lib/cpumask.c
index e7258836b60b..151d1dc5c593 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -127,11 +127,8 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
*
* There's a better alternative based on for_each()-like iterators:
*
- * for_each_numa_hop_mask(mask, node) {
- * for_each_cpu_andnot(cpu, mask, prev)
- * do_something(cpu);
- * prev = mask;
- * }
+ * for_each_numa_cpu(cpu, hop, node, cpu_online_mask)
+ * do_something(cpu);
*
* It's simpler and more verbose than above. Complexity of iterator-based
* enumeration is O(sched_domains_numa_levels * nr_cpu_ids), while
--
2.34.1

2023-04-20 05:21:38

by Yury Norov

Subject: [PATCH v2 6/8] sched/topology: export sched_domains_numa_levels

The following patch adds a test for NUMA-aware CPU enumerators, and it
requires access to sched_domains_numa_levels.

Signed-off-by: Yury Norov <[email protected]>
---
include/linux/topology.h | 7 +++++++
kernel/sched/topology.c | 10 ++++++----
2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 01fb3a55d7ce..7ebcc886dc76 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -43,6 +43,13 @@
for_each_online_node(node) \
if (nr_cpus_node(node))

+#ifdef CONFIG_NUMA
+extern int __sched_domains_numa_levels;
+#define sched_domains_numa_levels ((const int)__sched_domains_numa_levels)
+#else
+#define sched_domains_numa_levels (1)
+#endif
+
int arch_update_cpu_topology(void);

/* Conform to ACPI 2.0 SLIT distance definitions */
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index fc163e4181e6..56daa279c411 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1508,7 +1508,9 @@ static void claim_allocations(int cpu, struct sched_domain *sd)
#ifdef CONFIG_NUMA
enum numa_topology_type sched_numa_topology_type;

-static int sched_domains_numa_levels;
+int __sched_domains_numa_levels;
+EXPORT_SYMBOL_GPL(__sched_domains_numa_levels);
+
static int sched_domains_curr_level;

int sched_max_numa_distance;
@@ -1872,7 +1874,7 @@ void sched_init_numa(int offline_node)
*
* We reset it to 'nr_levels' at the end of this function.
*/
- sched_domains_numa_levels = 0;
+ __sched_domains_numa_levels = 0;

masks = kzalloc(sizeof(void *) * nr_levels, GFP_KERNEL);
if (!masks)
@@ -1948,7 +1950,7 @@ void sched_init_numa(int offline_node)
sched_domain_topology_saved = sched_domain_topology;
sched_domain_topology = tl;

- sched_domains_numa_levels = nr_levels;
+ __sched_domains_numa_levels = nr_levels;
WRITE_ONCE(sched_max_numa_distance, sched_domains_numa_distance[nr_levels - 1]);

init_numa_topology_type(offline_node);
@@ -1961,7 +1963,7 @@ static void sched_reset_numa(void)
struct cpumask ***masks;

nr_levels = sched_domains_numa_levels;
- sched_domains_numa_levels = 0;
+ __sched_domains_numa_levels = 0;
sched_max_numa_distance = 0;
sched_numa_topology_type = NUMA_DIRECT;
distances = sched_domains_numa_distance;
--
2.34.1

2023-04-20 05:22:15

by Yury Norov

Subject: [PATCH v2 7/8] lib: add test for for_each_numa_{cpu,hop_mask}()

The test ensures that the enumerators' output is consistent with
cpumask_local_spread().

Signed-off-by: Yury Norov <[email protected]>
---
lib/test_bitmap.c | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)

diff --git a/lib/test_bitmap.c b/lib/test_bitmap.c
index a8005ad3bd58..1b5f805f6879 100644
--- a/lib/test_bitmap.c
+++ b/lib/test_bitmap.c
@@ -12,6 +12,7 @@
#include <linux/printk.h>
#include <linux/slab.h>
#include <linux/string.h>
+#include <linux/topology.h>
#include <linux/uaccess.h>

#include "../tools/testing/selftests/kselftest_module.h"
@@ -751,6 +752,33 @@ static void __init test_for_each_set_bit_wrap(void)
}
}

+static void __init test_for_each_numa(void)
+{
+ unsigned int cpu, node;
+
+ for (node = 0; node < sched_domains_numa_levels; node++) {
+ const struct cpumask *m, *p = cpu_none_mask;
+ unsigned int c = 0;
+
+ rcu_read_lock();
+ for_each_numa_hop_mask(m, node) {
+ for_each_cpu_andnot(cpu, m, p)
+ expect_eq_uint(cpumask_local_spread(c++, node), cpu);
+ p = m;
+ }
+ rcu_read_unlock();
+ }
+
+ for (node = 0; node < sched_domains_numa_levels; node++) {
+ unsigned int hop, c = 0;
+
+ rcu_read_lock();
+ for_each_numa_cpu(cpu, hop, node, cpu_online_mask)
+ expect_eq_uint(cpumask_local_spread(c++, node), cpu);
+ rcu_read_unlock();
+ }
+}
+
static void __init test_for_each_set_bit(void)
{
DECLARE_BITMAP(orig, 500);
@@ -1237,6 +1265,7 @@ static void __init selftest(void)
test_for_each_clear_bitrange_from();
test_for_each_set_clump8();
test_for_each_set_bit_wrap();
+ test_for_each_numa();
}

KSTM_MODULE_LOADERS(test_bitmap);
--
2.34.1

2023-04-20 05:23:10

by Yury Norov

Subject: [PATCH v2 8/8] sched: drop for_each_numa_hop_mask()

Now that we have for_each_numa_cpu(), for_each_numa_hop_mask()
and all related code are dead code. Drop it.

Signed-off-by: Yury Norov <[email protected]>
---
include/linux/topology.h | 25 -------------------------
kernel/sched/topology.c | 32 --------------------------------
lib/test_bitmap.c | 13 -------------
3 files changed, 70 deletions(-)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 7ebcc886dc76..1225ade33053 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -255,7 +255,6 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
#ifdef CONFIG_NUMA
int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node);
int sched_numa_find_next_cpu(const struct cpumask *cpus, int cpu, int node, unsigned int *hop);
-extern const struct cpumask *sched_numa_hop_mask(unsigned int node, unsigned int hops);
#else
static __always_inline int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
{
@@ -267,32 +266,8 @@ int sched_numa_find_next_cpu(const struct cpumask *cpus, int cpu, int node, unsi
{
return find_next_bit(cpumask_bits(cpus), small_cpumask_bits, cpu);
}
-
-static inline const struct cpumask *
-sched_numa_hop_mask(unsigned int node, unsigned int hops)
-{
- return ERR_PTR(-EOPNOTSUPP);
-}
#endif /* CONFIG_NUMA */

-/**
- * for_each_numa_hop_mask - iterate over cpumasks of increasing NUMA distance
- * from a given node.
- * @mask: the iteration variable.
- * @node: the NUMA node to start the search from.
- *
- * Requires rcu_lock to be held.
- *
- * Yields cpu_online_mask for @node == NUMA_NO_NODE.
- */
-#define for_each_numa_hop_mask(mask, node) \
- for (unsigned int __hops = 0; \
- mask = (node != NUMA_NO_NODE || __hops) ? \
- sched_numa_hop_mask(node, __hops) : \
- cpu_online_mask, \
- !IS_ERR_OR_NULL(mask); \
- __hops++)
-
/**
* for_each_numa_cpu - iterate over cpus in increasing order taking into account
* NUMA distances from a given node.
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 56daa279c411..9d08ffdbd2d8 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2171,38 +2171,6 @@ int sched_numa_find_next_cpu(const struct cpumask *cpus, int cpu, int node, unsi
}
EXPORT_SYMBOL_GPL(sched_numa_find_next_cpu);

-/**
- * sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops away from
- * @node
- * @node: The node to count hops from.
- * @hops: Include CPUs up to that many hops away. 0 means local node.
- *
- * Return: On success, a pointer to a cpumask of CPUs at most @hops away from
- * @node, an error value otherwise.
- *
- * Requires rcu_lock to be held. Returned cpumask is only valid within that
- * read-side section, copy it if required beyond that.
- *
- * Note that not all hops are equal in distance; see sched_init_numa() for how
- * distances and masks are handled.
- * Also note that this is a reflection of sched_domains_numa_masks, which may change
- * during the lifetime of the system (offline nodes are taken out of the masks).
- */
-const struct cpumask *sched_numa_hop_mask(unsigned int node, unsigned int hops)
-{
- struct cpumask ***masks;
-
- if (node >= nr_node_ids || hops >= sched_domains_numa_levels)
- return ERR_PTR(-EINVAL);
-
- masks = rcu_dereference(sched_domains_numa_masks);
- if (!masks)
- return ERR_PTR(-EBUSY);
-
- return masks[hops][node];
-}
-EXPORT_SYMBOL_GPL(sched_numa_hop_mask);
-
#endif /* CONFIG_NUMA */

static int __sdt_alloc(const struct cpumask *cpu_map)
diff --git a/lib/test_bitmap.c b/lib/test_bitmap.c
index 1b5f805f6879..6becb044a66f 100644
--- a/lib/test_bitmap.c
+++ b/lib/test_bitmap.c
@@ -756,19 +756,6 @@ static void __init test_for_each_numa(void)
{
unsigned int cpu, node;

- for (node = 0; node < sched_domains_numa_levels; node++) {
- const struct cpumask *m, *p = cpu_none_mask;
- unsigned int c = 0;
-
- rcu_read_lock();
- for_each_numa_hop_mask(m, node) {
- for_each_cpu_andnot(cpu, m, p)
- expect_eq_uint(cpumask_local_spread(c++, node), cpu);
- p = m;
- }
- rcu_read_unlock();
- }
-
for (node = 0; node < sched_domains_numa_levels; node++) {
unsigned int hop, c = 0;

--
2.34.1

2023-04-20 08:36:01

by Tariq Toukan

Subject: Re: [PATCH v2 4/8] net: mlx5: switch comp_irqs_request() to using for_each_numa_cpu



On 20/04/2023 8:19, Yury Norov wrote:
> for_each_numa_cpu() is a more straightforward alternative to
> for_each_numa_hop_mask() + for_each_cpu_andnot().
>
> Signed-off-by: Yury Norov <[email protected]>
> ---
> drivers/net/ethernet/mellanox/mlx5/core/eq.c | 16 +++++-----------
> 1 file changed, 5 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> index 38b32e98f3bd..80368952e9b1 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> @@ -817,12 +817,10 @@ static void comp_irqs_release(struct mlx5_core_dev *dev)
> static int comp_irqs_request(struct mlx5_core_dev *dev)
> {
> struct mlx5_eq_table *table = dev->priv.eq_table;
> - const struct cpumask *prev = cpu_none_mask;
> - const struct cpumask *mask;
> int ncomp_eqs = table->num_comp_eqs;
> u16 *cpus;
> int ret;
> - int cpu;
> + int cpu, hop;
> int i;
>
> ncomp_eqs = table->num_comp_eqs;
> @@ -844,15 +842,11 @@ static int comp_irqs_request(struct mlx5_core_dev *dev)
>
> i = 0;
> rcu_read_lock();
> - for_each_numa_hop_mask(mask, dev->priv.numa_node) {
> - for_each_cpu_andnot(cpu, mask, prev) {
> - cpus[i] = cpu;
> - if (++i == ncomp_eqs)
> - goto spread_done;
> - }
> - prev = mask;
> + for_each_numa_cpu(cpu, hop, dev->priv.numa_node, cpu_possible_mask) {

I like this clean API.

nit:
Previously cpu_online_mask was used here. Is this change intentional?
We can fix it in a followup patch if this is the only comment on the series.

Reviewed-by: Tariq Toukan <[email protected]>

> + cpus[i] = cpu;
> + if (++i == ncomp_eqs)
> + break;
> }
> -spread_done:
> rcu_read_unlock();
> ret = mlx5_irqs_request_vectors(dev, cpus, ncomp_eqs, table->comp_irqs);
> kfree(cpus);

2023-04-20 22:49:56

by Yury Norov

Subject: Re: [PATCH v2 4/8] net: mlx5: switch comp_irqs_request() to using for_each_numa_cpu

On Thu, Apr 20, 2023 at 11:27:26AM +0300, Tariq Toukan wrote:
> I like this clean API.

Thanks :)

> nit:
> Previously cpu_online_mask was used here. Is this change intentional?
> We can fix it in a followup patch if this is the only comment on the series.
>
> Reviewed-by: Tariq Toukan <[email protected]>

The only CPUs listed in sched_domains_numa_masks are 'available',
i.e. online, CPUs. for_each_numa_cpu() ANDs the user-provided cpumask
with the mask associated with the hop, and that means that if we AND
with the possible mask, we'll eventually walk online CPUs only.
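
Conceptually (a sketch in bitmap terms, mirroring sched_numa_find_next_cpu()
from patch 2; 'masks' stands for the dereferenced sched_domains_numa_masks):

	/* hop 0 */
	search_set = cpus & masks[0][node];
	/* hop > 0: drop CPUs already covered by closer hops */
	search_set = cpus & masks[hop][node] & ~masks[hop - 1][node];

Since the hop masks contain online CPUs only, ANDing with cpu_possible_mask
cannot bring offline CPUs into the walk.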

To make sure, I experimented with the modified test:

diff --git a/lib/test_bitmap.c b/lib/test_bitmap.c
index 6becb044a66f..c8d557731080 100644
--- a/lib/test_bitmap.c
+++ b/lib/test_bitmap.c
@@ -760,8 +760,13 @@ static void __init test_for_each_numa(void)
unsigned int hop, c = 0;

rcu_read_lock();
- for_each_numa_cpu(cpu, hop, node, cpu_online_mask)
+ pr_err("Node %d:\t", node);
+ for_each_numa_cpu(cpu, hop, node, cpu_possible_mask) {
expect_eq_uint(cpumask_local_spread(c++, node), cpu);
+ pr_cont("%3d", cpu);
+
+ }
+ pr_err("\n");
rcu_read_unlock();
}
}

This is the NUMA topology of my test machine after the boot:

root@debian:~# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3
node 0 size: 1861 MB
node 0 free: 1792 MB
node 1 cpus: 4 5
node 1 size: 1914 MB
node 1 free: 1823 MB
node 2 cpus: 6 7
node 2 size: 1967 MB
node 2 free: 1915 MB
node 3 cpus: 8 9 10 11 12 13 14 15
node 3 size: 7862 MB
node 3 free: 7259 MB
node distances:
node 0 1 2 3
0: 10 50 30 70
1: 50 10 70 30
2: 30 70 10 50
3: 70 30 50 10

And this is what test prints:

root@debian:~# insmod test_bitmap.ko
test_bitmap: loaded.
test_bitmap: parselist: 14: input is '0-2047:128/256' OK, Time: 472
test_bitmap: bitmap_print_to_pagebuf: input is '0-32767
', Time: 2665
test_bitmap: Node 0: 0 1 2 3 6 7 4 5 8 9 10 11 12 13 14 15
test_bitmap:
test_bitmap: Node 1: 4 5 8 9 10 11 12 13 14 15 0 1 2 3 6 7
test_bitmap:
test_bitmap: Node 2: 6 7 0 1 2 3 8 9 10 11 12 13 14 15 4 5
test_bitmap:
test_bitmap: Node 3: 8 9 10 11 12 13 14 15 4 5 6 7 0 1 2 3
test_bitmap:
test_bitmap: all 6614 tests passed

Now, disable a couple of CPUs:

root@debian:~# chcpu -d 1-2
smpboot: CPU 1 is now offline
CPU 1 disabled
smpboot: CPU 2 is now offline
CPU 2 disabled

And try again:

root@debian:~# rmmod test_bitmap
rmmod: ERROR: ../libkmod/libkmod[ 320.275904] test_bitmap: unloaded.
root@debian:~# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 3
node 0 size: 1861 MB
node 0 free: 1792 MB
node 1 cpus: 4 5
node 1 size: 1914 MB
node 1 free: 1823 MB
node 2 cpus: 6 7
node 2 size: 1967 MB
node 2 free: 1915 MB
node 3 cpus: 8 9 10 11 12 13 14 15
node 3 size: 7862 MB
node 3 free: 7259 MB
node distances:
node 0 1 2 3
0: 10 50 30 70
1: 50 10 70 30
2: 30 70 10 50
3: 70 30 50 10
root@debian:~# insmod test_bitmap.ko
test_bitmap: loaded.
test_bitmap: parselist: 14: input is '0-2047:128/256' OK, Time: 491
test_bitmap: bitmap_print_to_pagebuf: input is '0-32767
', Time: 2174
test_bitmap: Node 0: 0 3 6 7 4 5 8 9 10 11 12 13 14 15
test_bitmap:
test_bitmap: Node 1: 4 5 8 9 10 11 12 13 14 15 0 3 6 7
test_bitmap:
test_bitmap: Node 2: 6 7 0 3 8 9 10 11 12 13 14 15 4 5
test_bitmap:
test_bitmap: Node 3: 8 9 10 11 12 13 14 15 4 5 6 7 0 3
test_bitmap:
test_bitmap: all 6606 tests passed

I used cpu_possible_mask because I wanted to keep the patch
consistent: before, we traversed NUMA hop masks; now we traverse the
same hop masks ANDed with a user-provided mask, so the latter should
include all possible CPUs.

If you think it's better to have cpu_online_mask in the driver, let's
make it in a separate patch?

Thanks,
Yury

2023-04-24 17:15:46

by Valentin Schneider

Subject: Re: [PATCH v2 7/8] lib: add test for for_each_numa_{cpu,hop_mask}()

On 19/04/23 22:19, Yury Norov wrote:
> + for (node = 0; node < sched_domains_numa_levels; node++) {
> + unsigned int hop, c = 0;
> +
> + rcu_read_lock();
> + for_each_numa_cpu(cpu, hop, node, cpu_online_mask)
> + expect_eq_uint(cpumask_local_spread(c++, node), cpu);
> + rcu_read_unlock();
> + }

I'm not fond of the export of sched_domains_numa_levels, especially
considering it's just there for tests.

Furthermore, is there any value in testing parity with
cpumask_local_spread()? Rather, shouldn't we check that using this API does
yield CPUs of increasing NUMA distance?

Something like

for_each_node(node) {
	unsigned int prev_cpu, hop = 0;

	cpu = cpumask_first(cpumask_of_node(node));
	prev_cpu = cpu;

	rcu_read_lock();

	/* Assert distance is monotonically increasing */
	for_each_numa_cpu(cpu, hop, node, cpu_online_mask) {
		expect_ge_uint(cpu_to_node(cpu), cpu_to_node(prev_cpu));
		prev_cpu = cpu;
	}

	rcu_read_unlock();
}

2023-04-25 10:00:27

by Valentin Schneider

Subject: Re: [PATCH v2 2/8] sched/topology: introduce sched_numa_find_next_cpu()

On 19/04/23 22:19, Yury Norov wrote:
> +/*
> + * sched_numa_find_next_cpu() - given the NUMA topology, find the next cpu
> + * cpumask: cpumask to find a cpu from
> + * cpu: current cpu
> + * node: local node
> + * hop: (in/out) indicates distance order of current CPU to a local node
> + *
> + * The function searches for next cpu at a given NUMA distance, indicated
> + * by hop, and if nothing found, tries to find CPUs at a greater distance,
> + * starting from the beginning.
> + *
> + * Return: cpu, or >= nr_cpu_ids when nothing found.
> + */
> +int sched_numa_find_next_cpu(const struct cpumask *cpus, int cpu, int node, unsigned int *hop)
> +{
> + unsigned long *cur, *prev;
> + struct cpumask ***masks;
> + unsigned int ret;
> +
> + if (*hop >= sched_domains_numa_levels)
> + return nr_cpu_ids;
> +
> + masks = rcu_dereference(sched_domains_numa_masks);
> + cur = cpumask_bits(masks[*hop][node]);
> + if (*hop == 0)
> + ret = find_next_and_bit(cpumask_bits(cpus), cur, nr_cpu_ids, cpu);
> + else {
> + prev = cpumask_bits(masks[*hop - 1][node]);
> + ret = find_next_and_andnot_bit(cpumask_bits(cpus), cur, prev, nr_cpu_ids, cpu);
> + }
> +
> + if (ret < nr_cpu_ids)
> + return ret;
> +
> + *hop += 1;
> + return sched_numa_find_next_cpu(cpus, 0, node, hop);

sched_domains_numa_levels is a fairly small number, so the recursion depth
isn't something we really need to worry about - still, the iterative
variant of this is fairly straightforward to get to:

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index e850f16c003ae..4c9a9e48fef6d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2151,23 +2151,27 @@ int sched_numa_find_next_cpu(const struct cpumask *cpus, int cpu, int node, unsi
struct cpumask ***masks;
unsigned int ret;

- if (*hop >= sched_domains_numa_levels)
- return nr_cpu_ids;
+ /*
+ * Reset @cpu to 0 when increasing @hop, since CPU numbering has no
+ * relationship with NUMA distance: a search at @hop+1 may yield CPUs
+ * of lower ID than previously seen!
+ */
+ for (; *hop < sched_domains_numa_levels; *hop += 1, cpu = 0) {
+ masks = rcu_dereference(sched_domains_numa_masks);
+ cur = cpumask_bits(masks[*hop][node]);
+
+ if (*hop == 0) {
+ ret = find_next_and_bit(cpumask_bits(cpus), cur, nr_cpu_ids, cpu);
+ } else {
+ prev = cpumask_bits(masks[*hop - 1][node]);
+ ret = find_next_and_andnot_bit(cpumask_bits(cpus), cur, prev, nr_cpu_ids, cpu);
+ }

- masks = rcu_dereference(sched_domains_numa_masks);
- cur = cpumask_bits(masks[*hop][node]);
- if (*hop == 0)
- ret = find_next_and_bit(cpumask_bits(cpus), cur, nr_cpu_ids, cpu);
- else {
- prev = cpumask_bits(masks[*hop - 1][node]);
- ret = find_next_and_andnot_bit(cpumask_bits(cpus), cur, prev, nr_cpu_ids, cpu);
+ if (ret < nr_cpu_ids)
+ return ret;
}

- if (ret < nr_cpu_ids)
- return ret;
-
- *hop += 1;
- return sched_numa_find_next_cpu(cpus, 0, node, hop);
+ return nr_cpu_ids;
}
EXPORT_SYMBOL_GPL(sched_numa_find_next_cpu);


2023-04-25 10:01:13

by Valentin Schneider

Subject: Re: [PATCH v2 3/8] sched/topology: add for_each_numa_cpu() macro

On 19/04/23 22:19, Yury Norov wrote:
> +/**
> + * for_each_numa_cpu - iterate over cpus in increasing order taking into account
> + * NUMA distances from a given node.
> + * @cpu: the (optionally unsigned) integer iterator
> + * @hop: the iterator variable, must be initialized to a desired minimal hop.
> + * @node: the NUMA node to start the search from.
> + * @mask: the cpumask pointer
> + *
> + * Requires rcu_lock to be held.
> + */
> +#define for_each_numa_cpu(cpu, hop, node, mask) \
> + for ((cpu) = 0, (hop) = 0; \
> + (cpu) = sched_numa_find_next_cpu((mask), (cpu), (node), &(hop)),\
> + (cpu) < nr_cpu_ids; \
> + (cpu)++)
> +

I think we can keep sched_numa_find_next_cpu() as-is, but could we make
that macro use cpu_possible_mask by default? We can always add a variant
if/when we need to feed in a different mask.

> #endif /* _LINUX_TOPOLOGY_H */
> --
> 2.34.1

2023-04-26 05:48:25

by Yury Norov

Subject: Re: [PATCH v2 2/8] sched/topology: introduce sched_numa_find_next_cpu()

On Tue, Apr 25, 2023 at 10:54:56AM +0100, Valentin Schneider wrote:
> On 19/04/23 22:19, Yury Norov wrote:
> > +/*
> > + * sched_numa_find_next_cpu() - given the NUMA topology, find the next cpu
> > + * cpumask: cpumask to find a cpu from
> > + * cpu: current cpu
> > + * node: local node
> > + * hop: (in/out) indicates distance order of current CPU to a local node
> > + *
> > + * The function searches for next cpu at a given NUMA distance, indicated
> > + * by hop, and if nothing found, tries to find CPUs at a greater distance,
> > + * starting from the beginning.
> > + *
> > + * Return: cpu, or >= nr_cpu_ids when nothing found.
> > + */
> > +int sched_numa_find_next_cpu(const struct cpumask *cpus, int cpu, int node, unsigned int *hop)
> > +{
> > + unsigned long *cur, *prev;
> > + struct cpumask ***masks;
> > + unsigned int ret;
> > +
> > + if (*hop >= sched_domains_numa_levels)
> > + return nr_cpu_ids;
> > +
> > + masks = rcu_dereference(sched_domains_numa_masks);
> > + cur = cpumask_bits(masks[*hop][node]);
> > + if (*hop == 0)
> > + ret = find_next_and_bit(cpumask_bits(cpus), cur, nr_cpu_ids, cpu);
> > + else {
> > + prev = cpumask_bits(masks[*hop - 1][node]);
> > + ret = find_next_and_andnot_bit(cpumask_bits(cpus), cur, prev, nr_cpu_ids, cpu);
> > + }
> > +
> > + if (ret < nr_cpu_ids)
> > + return ret;
> > +
> > + *hop += 1;
> > + return sched_numa_find_next_cpu(cpus, 0, node, hop);
>
> sched_domains_numa_levels is a fairly small number, so the recursion depth
> isn't something we really need to worry about - still, the iterative
> variant of this is fairly straightforward to get to:

This is a tail recursion. Compilers normally convert it into a loop just
as well. At least, my GCC does.
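
For illustration only (not the kernel code, just the transformation): a
self-recursive tail call like

	static int sum(int n, int acc)
	{
		if (!n)
			return acc;
		return sum(n - 1, acc + n);	/* tail call */
	}

is typically emitted at -O2 as the equivalent loop:

	static int sum(int n, int acc)
	{
		while (n) {
			acc += n;
			n--;
		}
		return acc;
	}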

2023-04-26 05:49:31

by Yury Norov

Subject: Re: [PATCH v2 3/8] sched/topology: add for_each_numa_cpu() macro

On Tue, Apr 25, 2023 at 10:54:48AM +0100, Valentin Schneider wrote:
> On 19/04/23 22:19, Yury Norov wrote:
> > +/**
> > + * for_each_numa_cpu - iterate over cpus in increasing order taking into account
> > + * NUMA distances from a given node.
> > + * @cpu: the (optionally unsigned) integer iterator
> > + * @hop: the iterator variable, must be initialized to a desired minimal hop.
> > + * @node: the NUMA node to start the search from.
> > + * @mask: the cpumask pointer
> > + *
> > + * Requires rcu_lock to be held.
> > + */
> > +#define for_each_numa_cpu(cpu, hop, node, mask) \
> > + for ((cpu) = 0, (hop) = 0; \
> > + (cpu) = sched_numa_find_next_cpu((mask), (cpu), (node), &(hop)),\
> > + (cpu) < nr_cpu_ids; \
> > + (cpu)++)
> > +
>
> I think we can keep sched_numa_find_next_cpu() as-is, but could we make
> that macro use cpu_possible_mask by default? We can always add a variant
> if/when we need to feed in a different mask.

As mentioned in the discussion of the driver's patch, all these NUMA things
imply only online CPUs, so cpu_possible_mask may be misleading to some extent.

Anyways, can you elaborate what you exactly want? Like this?

#define for_each_numa_online_cpu(cpu, hop, node) \
	for_each_numa_cpu(cpu, hop, node, cpu_online_mask)

2023-04-26 05:52:51

by Yury Norov

Subject: Re: [PATCH v2 7/8] lib: add test for for_each_numa_{cpu,hop_mask}()

Hi Valentin,

Thanks for review!

On Mon, Apr 24, 2023 at 06:09:52PM +0100, Valentin Schneider wrote:
> On 19/04/23 22:19, Yury Norov wrote:
> > + for (node = 0; node < sched_domains_numa_levels; node++) {
> > + unsigned int hop, c = 0;
> > +
> > + rcu_read_lock();
> > + for_each_numa_cpu(cpu, hop, node, cpu_online_mask)
> > + expect_eq_uint(cpumask_local_spread(c++, node), cpu);
> > + rcu_read_unlock();
> > + }
>
> I'm not fond of the export of sched_domains_numa_levels, especially
> considering it's just there for tests.
>
> Furthermore, is there any value in testing parity with
> cpumask_local_spread()?

I wanted to emphasize that the new NUMA-aware functions are coherent with
each other, just like find_nth_bit() is coherent with find_next_bit().

But all that coherence looks important only in the non-NUMA case, because
client code may depend on the fact that the next CPU is never less than
the current one. This doesn't hold for NUMA iterators anyway...

> Rather, shouldn't we check that using this API does
> yield CPUs of increasing NUMA distance?
>
> Something like
>
> for_each_node(node) {
> 	unsigned int prev_cpu, hop = 0;
>
> 	cpu = cpumask_first(cpumask_of_node(node));
> 	prev_cpu = cpu;
>
> 	rcu_read_lock();
>
> 	/* Assert distance is monotonically increasing */
> 	for_each_numa_cpu(cpu, hop, node, cpu_online_mask) {
> 		expect_ge_uint(cpu_to_node(cpu), cpu_to_node(prev_cpu));
> 		prev_cpu = cpu;
> 	}
>
> 	rcu_read_unlock();
> }

Your version of the test looks more straightforward. I need to think
for more, but it looks like I can take it in v3.

Thanks,
Yury

2023-04-26 09:25:59

by Valentin Schneider

Subject: Re: [PATCH v2 2/8] sched/topology: introduce sched_numa_find_next_cpu()

On 25/04/23 22:26, Yury Norov wrote:
> On Tue, Apr 25, 2023 at 10:54:56AM +0100, Valentin Schneider wrote:
>> On 19/04/23 22:19, Yury Norov wrote:
>> > +/*
>> > + * sched_numa_find_next_cpu() - given the NUMA topology, find the next cpu
>> > + * cpumask: cpumask to find a cpu from
>> > + * cpu: current cpu
>> > + * node: local node
>> > + * hop: (in/out) indicates distance order of current CPU to a local node
>> > + *
>> > + * The function searches for next cpu at a given NUMA distance, indicated
>> > + * by hop, and if nothing found, tries to find CPUs at a greater distance,
>> > + * starting from the beginning.
>> > + *
>> > + * Return: cpu, or >= nr_cpu_ids when nothing found.
>> > + */
>> > +int sched_numa_find_next_cpu(const struct cpumask *cpus, int cpu, int node, unsigned int *hop)
>> > +{
>> > + unsigned long *cur, *prev;
>> > + struct cpumask ***masks;
>> > + unsigned int ret;
>> > +
>> > + if (*hop >= sched_domains_numa_levels)
>> > + return nr_cpu_ids;
>> > +
>> > + masks = rcu_dereference(sched_domains_numa_masks);
>> > + cur = cpumask_bits(masks[*hop][node]);
>> > + if (*hop == 0)
>> > + ret = find_next_and_bit(cpumask_bits(cpus), cur, nr_cpu_ids, cpu);
>> > + else {
>> > + prev = cpumask_bits(masks[*hop - 1][node]);
>> > + ret = find_next_and_andnot_bit(cpumask_bits(cpus), cur, prev, nr_cpu_ids, cpu);
>> > + }
>> > +
>> > + if (ret < nr_cpu_ids)
>> > + return ret;
>> > +
>> > + *hop += 1;
>> > + return sched_numa_find_next_cpu(cpus, 0, node, hop);
>>
>> sched_domains_numa_levels is a fairly small number, so the recursion depth
>> isn't something we really need to worry about - still, the iterative
>> variant of this is fairly straightforward to get to:
>
> This is a tail recursion. Compilers normally convert it into a loop just
> as well. At least, my GCC does.

I'd hope so in 2023! I still prefer the iterative approach as I find it
more readable, but I'm not /too/ strongly attached to it.

2023-04-26 09:26:05

by Valentin Schneider

Subject: Re: [PATCH v2 3/8] sched/topology: add for_each_numa_cpu() macro

On 25/04/23 22:32, Yury Norov wrote:
> On Tue, Apr 25, 2023 at 10:54:48AM +0100, Valentin Schneider wrote:
>> On 19/04/23 22:19, Yury Norov wrote:
>> > +/**
>> > + * for_each_numa_cpu - iterate over cpus in increasing order taking into account
>> > + * NUMA distances from a given node.
>> > + * @cpu: the (optionally unsigned) integer iterator
>> > + * @hop: the iterator variable, must be initialized to a desired minimal hop.
>> > + * @node: the NUMA node to start the search from.
>> > + * @mask: the cpumask pointer
>> > + *
>> > + * Requires rcu_lock to be held.
>> > + */
>> > +#define for_each_numa_cpu(cpu, hop, node, mask) \
>> > + for ((cpu) = 0, (hop) = 0; \
>> > + (cpu) = sched_numa_find_next_cpu((mask), (cpu), (node), &(hop)),\
>> > + (cpu) < nr_cpu_ids; \
>> > + (cpu)++)
>> > +
>>
>> I think we can keep sched_numa_find_next_cpu() as-is, but could we make
>> that macro use cpu_possible_mask by default? We can always add a variant
>> if/when we need to feed in a different mask.
>
> As mentioned in the discussion of the driver's patch, all these NUMA things
> imply only online CPUs, so cpu_possible_mask may be misleading to some extent.
>
> Anyways, can you elaborate what you exactly want? Like this?
>
> #define for_each_numa_online_cpu(cpu, hop, node) \
> 	for_each_numa_cpu(cpu, hop, node, cpu_online_mask)

Yeah, something like that. Like you said, the NUMA cpumasks built by the
scheduler reflect the online topology, so s/possible/online/ shouldn't
change much here.

2023-04-26 09:26:10

by Valentin Schneider

Subject: Re: [PATCH v2 7/8] lib: add test for for_each_numa_{cpu,hop_mask}()

On 25/04/23 22:50, Yury Norov wrote:
> Hi Valentin,
>
> Thanks for review!
>
> On Mon, Apr 24, 2023 at 06:09:52PM +0100, Valentin Schneider wrote:
>> On 19/04/23 22:19, Yury Norov wrote:
>> > + for (node = 0; node < sched_domains_numa_levels; node++) {
>> > + unsigned int hop, c = 0;
>> > +
>> > + rcu_read_lock();
>> > + for_each_numa_cpu(cpu, hop, node, cpu_online_mask)
>> > + expect_eq_uint(cpumask_local_spread(c++, node), cpu);
>> > + rcu_read_unlock();
>> > + }
>>
>> I'm not fond of the export of sched_domains_numa_levels, especially
>> considering it's just there for tests.
>>
>> Furthermore, is there any value in testing parity with
>> cpumask_local_spread()?
>
> I wanted to emphasize that the new NUMA-aware functions are coherent with
> each other, just like find_nth_bit() is coherent with find_next_bit().
>
> But all that coherence looks important only in the non-NUMA case, because
> client code may depend on the fact that the next CPU is never less than
> the current one. This doesn't hold for NUMA iterators anyway...
>

Ah right, I see your point. But yes, distance-ordered walks break this
assumption.

>> Rather, shouldn't we check that using this API does
>> yield CPUs of increasing NUMA distance?
>>
>> Something like
>>
>> for_each_node(node) {
>> 	unsigned int prev_cpu, hop = 0;
>>
>> 	cpu = cpumask_first(cpumask_of_node(node));
>> 	prev_cpu = cpu;
>>
>> 	rcu_read_lock();
>>
>> 	/* Assert distance is monotonically increasing */
>> 	for_each_numa_cpu(cpu, hop, node, cpu_online_mask) {
>> 		expect_ge_uint(cpu_to_node(cpu), cpu_to_node(prev_cpu));
>> 		prev_cpu = cpu;
>> 	}
>>
>> 	rcu_read_unlock();
>> }
>
> Your version of the test looks more straightforward. I need to think
> for more, but it looks like I can take it in v3.
>

I realized I only wrote half the relevant code - comparing node IDs is
meaningless, I meant to compare distances as we walk through the
CPUs... I tested the below against a few NUMA topologies and it seems to be
sane:

diff --git a/lib/test_bitmap.c b/lib/test_bitmap.c
index 6becb044a66f0..8f8512d139d58 100644
--- a/lib/test_bitmap.c
+++ b/lib/test_bitmap.c
@@ -174,11 +174,23 @@ __check_eq_str(const char *srcfile, unsigned int line,
return eq;
}

-#define __expect_eq(suffix, ...) \
+static bool __init
+__check_ge_uint(const char *srcfile, unsigned int line,
+ const unsigned int a, unsigned int b)
+{
+ if (a < b) {
+ pr_err("[%s:%u] expected a(%u) >= b(%u)\n",
+ srcfile, line, a, b);
+ return false;
+ }
+ return true;
+}
+
+#define __expect_op(op, suffix, ...) \
({ \
int result = 0; \
total_tests++; \
- if (!__check_eq_ ## suffix(__FILE__, __LINE__, \
+ if (!__check_## op ## _ ## suffix(__FILE__, __LINE__, \
##__VA_ARGS__)) { \
failed_tests++; \
result = 1; \
@@ -186,6 +198,9 @@ __check_eq_str(const char *srcfile, unsigned int line,
result; \
})

+#define __expect_eq(suffix, ...) __expect_op(eq, suffix, ##__VA_ARGS__)
+#define __expect_ge(suffix, ...) __expect_op(ge, suffix, ##__VA_ARGS__)
+
#define expect_eq_uint(...) __expect_eq(uint, ##__VA_ARGS__)
#define expect_eq_bitmap(...) __expect_eq(bitmap, ##__VA_ARGS__)
#define expect_eq_pbl(...) __expect_eq(pbl, ##__VA_ARGS__)
@@ -193,6 +208,8 @@ __check_eq_str(const char *srcfile, unsigned int line,
#define expect_eq_clump8(...) __expect_eq(clump8, ##__VA_ARGS__)
#define expect_eq_str(...) __expect_eq(str, ##__VA_ARGS__)

+#define expect_ge_uint(...) __expect_ge(uint, ##__VA_ARGS__)
+
static void __init test_zero_clear(void)
{
DECLARE_BITMAP(bmap, 1024);
@@ -756,12 +773,23 @@ static void __init test_for_each_numa(void)
{
unsigned int cpu, node;

- for (node = 0; node < sched_domains_numa_levels; node++) {
- unsigned int hop, c = 0;
+ for_each_node(node) {
+ unsigned int start_cpu, prev_dist, hop = 0;
+
+ cpu = cpumask_first(cpumask_of_node(node));
+ prev_dist = node_distance(node, node);
+ start_cpu = cpu;

rcu_read_lock();
- for_each_numa_cpu(cpu, hop, node, cpu_online_mask)
- expect_eq_uint(cpumask_local_spread(c++, node), cpu);
+
+ /* Assert distance is monotonically increasing */
+ for_each_numa_cpu(cpu, hop, node, cpu_online_mask) {
+ unsigned int dist = node_distance(cpu_to_node(cpu), cpu_to_node(start_cpu));
+
+ expect_ge_uint(dist, prev_dist);
+ prev_dist = dist;
+ }
+
rcu_read_unlock();
}
}

2023-04-26 20:54:17

by Yury Norov

Subject: Re: [PATCH v2 7/8] lib: add test for for_each_numa_{cpu,hop_mask}()

> I realized I only wrote half the relevant code - comparing node IDs is
> meaningless, I meant to compare distances as we walk through the
> CPUs... I tested the below against a few NUMA topologies and it seems to be
> sane:
>
> @@ -756,12 +773,23 @@ static void __init test_for_each_numa(void)
> {
> unsigned int cpu, node;
>
> - for (node = 0; node < sched_domains_numa_levels; node++) {
> - unsigned int hop, c = 0;
> + for_each_node(node) {
> + unsigned int start_cpu, prev_dist, hop = 0;
> +
> + cpu = cpumask_first(cpumask_of_node(node));
> + prev_dist = node_distance(node, node);
> + start_cpu = cpu;
>
> rcu_read_lock();
> - for_each_numa_cpu(cpu, hop, node, cpu_online_mask)
> - expect_eq_uint(cpumask_local_spread(c++, node), cpu);
> +
> + /* Assert distance is monotonically increasing */
> + for_each_numa_cpu(cpu, hop, node, cpu_online_mask) {
> + unsigned int dist = node_distance(cpu_to_node(cpu), cpu_to_node(start_cpu));

Interestingly, node_distance() is an arch-specific function. The generic
implementation is quite useless:

#define node_distance(from,to) ((from) == (to) ? LOCAL_DISTANCE : REMOTE_DISTANCE)

Particularly, arm64 takes the above. With node_distance() implemented
like that, we can barely test anything...

Taking that into account, I think it's better to test the iterator against
cpumask_local_spread(), like in v2. I'll add a comment about that in v3.

> +
> + expect_ge_uint(dist, prev_dist);
> + prev_dist = dist;
> + }
> +
> rcu_read_unlock();
> }
> }

2023-04-27 09:39:45

by Valentin Schneider

Subject: Re: [PATCH v2 7/8] lib: add test for for_each_numa_{cpu,hop_mask}()

On 26/04/23 13:51, Yury Norov wrote:
>> I realized I only wrote half the relevant code - comparing node IDs is
>> meaningless, I meant to compare distances as we walk through the
>> CPUs... I tested the below against a few NUMA topologies and it seems to be
>> sane:
>>
>> @@ -756,12 +773,23 @@ static void __init test_for_each_numa(void)
>> {
>> unsigned int cpu, node;
>>
>> - for (node = 0; node < sched_domains_numa_levels; node++) {
>> - unsigned int hop, c = 0;
>> + for_each_node(node) {
>> + unsigned int start_cpu, prev_dist, hop = 0;
>> +
>> + cpu = cpumask_first(cpumask_of_node(node));
>> + prev_dist = node_distance(node, node);
>> + start_cpu = cpu;
>>
>> rcu_read_lock();
>> - for_each_numa_cpu(cpu, hop, node, cpu_online_mask)
>> - expect_eq_uint(cpumask_local_spread(c++, node), cpu);
>> +
>> + /* Assert distance is monotonically increasing */
>> + for_each_numa_cpu(cpu, hop, node, cpu_online_mask) {
>> + unsigned int dist = node_distance(cpu_to_node(cpu), cpu_to_node(start_cpu));
>
> Interestingly, node_distance() is an arch-specific function. The generic
> implementation is quite useless:
>
> #define node_distance(from,to) ((from) == (to) ? LOCAL_DISTANCE : REMOTE_DISTANCE)
>
> Particularly, arm64 takes the above. With node_distance() implemented
> like that, we can barely test anything...
>

riscv and arm64 rely on drivers/base/arch_numa.c to provide
__node_distance() (cf. CONFIG_GENERIC_ARCH_NUMA).

x86, sparc, powerpc and ia64 define __node_distance();
loongarch and mips define their own node_distance().

So all of those archs will have a usable node_distance(); the others won't,
and that means the scheduler can't do anything about it - the scheduler
relies on node_distance() to understand the topology!