2023-04-30 17:19:07

by Yury Norov

Subject: [PATCH v2 0/8] sched/topology: add for_each_numa_cpu() macro

for_each_cpu() is widely used in the kernel, and it's beneficial to create
a NUMA-aware version of the macro.

The recently added for_each_numa_hop_mask() works, but switching the
existing codebase to it is not an easy process.

This series adds for_each_numa_cpu(), which is designed to be similar to
for_each_cpu(). It makes converting existing code NUMA-aware as simple as
adding a hop iterator variable and passing it to the new macro;
for_each_numa_cpu() takes care of the rest.

At the moment, we have two users of NUMA-aware enumerators. One is
Mellanox's in-tree driver, and another is Intel's in-review driver:

https://lore.kernel.org/lkml/[email protected]/

Both real-life examples follow the same pattern:

for_each_numa_hop_mask(cpus, prev, node) {
for_each_cpu_andnot(cpu, cpus, prev) {
if (cnt++ == max_num)
goto out;
do_something(cpu);
}
prev = cpus;
}

With the new macro, it has a more standard look, like this:

for_each_numa_cpu(cpu, hop, node, cpu_possible_mask) {
if (cnt++ == max_num)
break;
do_something(cpu);
}

Straight conversion of the existing for_each_cpu() codebase to a NUMA-aware
version with for_each_numa_hop_mask() is difficult because the latter
doesn't take a user-provided cpu mask, and eventually ends up as an
open-coded double loop. With for_each_numa_cpu() it shouldn't be a
brainteaser.
Consider the NUMA-ignorant example:

cpumask_t cpus = get_mask();
int cnt = 0, cpu;

for_each_cpu(cpu, cpus) {
if (cnt++ == max_num)
break;
do_something(cpu);
}

Converting it to a NUMA-aware version would be as simple as:

cpumask_t cpus = get_mask();
int node = get_node();
int cnt = 0, hop, cpu;

for_each_numa_cpu(cpu, hop, node, cpus) {
if (cnt++ == max_num)
break;
do_something(cpu);
}

The latter is slightly more verbose, but avoids open-coding that annoying
double loop. Another advantage is that it exposes a 'hop' parameter with
the clear meaning of NUMA distance, and doesn't force people unfamiliar
with the enumerator internals to bother with the current and previous mask
machinery.

v2: https://lore.kernel.org/netdev/ZD3l6FBnUh9vTIGc@yury-ThinkPad/T/
v3:
- fix sched_numa_find_{next,nth}_cpu() when CONFIG_NUMA is off to
only traverse online CPUs;
- don't export sched_domains_numa_levels for testing purposes. In
the test, use for_each_node() macro;
- extend the test for for_each_node();
- in comments, mention that only online CPUs are traversed;
- rebase on top of 6.3.

Yury Norov (8):
sched: fix sched_numa_find_nth_cpu() in non-NUMA case
lib/find: add find_next_and_andnot_bit()
sched/topology: introduce sched_numa_find_next_cpu()
sched/topology: add for_each_numa_{,online}_cpu() macro
net: mlx5: switch comp_irqs_request() to using for_each_numa_cpu
lib/cpumask: update comment to cpumask_local_spread()
sched: drop for_each_numa_hop_mask()
lib: test for_each_numa_cpus()

drivers/net/ethernet/mellanox/mlx5/core/eq.c | 16 ++---
include/linux/find.h | 43 ++++++++++++
include/linux/topology.h | 40 ++++++-----
kernel/sched/topology.c | 53 ++++++++-------
lib/cpumask.c | 7 +-
lib/find_bit.c | 12 ++++
lib/test_bitmap.c | 70 +++++++++++++++++++-
7 files changed, 183 insertions(+), 58 deletions(-)

--
2.37.2


2023-04-30 17:19:18

by Yury Norov

Subject: [PATCH v3 2/8] lib/find: add find_next_and_andnot_bit()

Similarly to find_nth_and_andnot_bit(), find_next_and_andnot_bit() is
a convenient helper that allows traversing bitmaps without storing
intermediate results in a temporary bitmap.

In the following patches the function is used to implement NUMA-aware
CPU enumeration.

Signed-off-by: Yury Norov <[email protected]>
---
include/linux/find.h | 43 +++++++++++++++++++++++++++++++++++++++++++
lib/find_bit.c | 12 ++++++++++++
2 files changed, 55 insertions(+)

diff --git a/include/linux/find.h b/include/linux/find.h
index 5e4f39ef2e72..90b68d76c073 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -16,6 +16,9 @@ unsigned long _find_next_andnot_bit(const unsigned long *addr1, const unsigned l
unsigned long nbits, unsigned long start);
unsigned long _find_next_or_bit(const unsigned long *addr1, const unsigned long *addr2,
unsigned long nbits, unsigned long start);
+unsigned long _find_next_and_andnot_bit(const unsigned long *addr1, const unsigned long *addr2,
+ const unsigned long *addr3, unsigned long nbits,
+ unsigned long start);
unsigned long _find_next_zero_bit(const unsigned long *addr, unsigned long nbits,
unsigned long start);
extern unsigned long _find_first_bit(const unsigned long *addr, unsigned long size);
@@ -159,6 +162,40 @@ unsigned long find_next_or_bit(const unsigned long *addr1,
}
#endif

+#ifndef find_next_and_andnot_bit
+/**
+ * find_next_and_andnot_bit - find the next bit set in *addr1 and *addr2,
+ * excluding all the bits in *addr3
+ * @addr1: The first address to base the search on
+ * @addr2: The second address to base the search on
+ * @addr3: The third address to base the search on
+ * @size: The bitmap size in bits
+ * @offset: The bitnumber to start searching at
+ *
+ * Return: the bit number for the next set bit
+ * If no bits are set, returns @size.
+ */
+static __always_inline
+unsigned long find_next_and_andnot_bit(const unsigned long *addr1,
+ const unsigned long *addr2,
+ const unsigned long *addr3,
+ unsigned long size,
+ unsigned long offset)
+{
+ if (small_const_nbits(size)) {
+ unsigned long val;
+
+ if (unlikely(offset >= size))
+ return size;
+
+ val = *addr1 & *addr2 & ~*addr3 & GENMASK(size - 1, offset);
+ return val ? __ffs(val) : size;
+ }
+
+ return _find_next_and_andnot_bit(addr1, addr2, addr3, size, offset);
+}
+#endif
+
#ifndef find_next_zero_bit
/**
* find_next_zero_bit - find the next cleared bit in a memory region
@@ -568,6 +605,12 @@ unsigned long find_next_bit_le(const void *addr, unsigned
(bit) = find_next_andnot_bit((addr1), (addr2), (size), (bit)), (bit) < (size);\
(bit)++)

+#define for_each_and_andnot_bit(bit, addr1, addr2, addr3, size) \
+ for ((bit) = 0; \
+ (bit) = find_next_and_andnot_bit((addr1), (addr2), (addr3), (size), (bit)),\
+ (bit) < (size); \
+ (bit)++)
+
#define for_each_or_bit(bit, addr1, addr2, size) \
for ((bit) = 0; \
(bit) = find_next_or_bit((addr1), (addr2), (size), (bit)), (bit) < (size);\
diff --git a/lib/find_bit.c b/lib/find_bit.c
index 32f99e9a670e..4403e00890b1 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -182,6 +182,18 @@ unsigned long _find_next_andnot_bit(const unsigned long *addr1, const unsigned l
EXPORT_SYMBOL(_find_next_andnot_bit);
#endif

+#ifndef find_next_and_andnot_bit
+unsigned long _find_next_and_andnot_bit(const unsigned long *addr1,
+ const unsigned long *addr2,
+ const unsigned long *addr3,
+ unsigned long nbits,
+ unsigned long start)
+{
+ return FIND_NEXT_BIT(addr1[idx] & addr2[idx] & ~addr3[idx], /* nop */, nbits, start);
+}
+EXPORT_SYMBOL(_find_next_and_andnot_bit);
+#endif
+
#ifndef find_next_or_bit
unsigned long _find_next_or_bit(const unsigned long *addr1, const unsigned long *addr2,
unsigned long nbits, unsigned long start)
--
2.37.2

2023-04-30 17:19:26

by Yury Norov

Subject: [PATCH v3 1/8] sched: fix sched_numa_find_nth_cpu() in non-NUMA case

When CONFIG_NUMA is enabled, sched_numa_find_nth_cpu() searches for a
CPU in sched_domains_numa_masks. The masks include only online CPUs, so
offline CPUs are effectively skipped.

When CONFIG_NUMA is disabled, the fallback function should be consistent
and skip offline CPUs as well.

Fixes: cd7f55359c90 ("sched: add sched_numa_find_nth_cpu()")
Signed-off-by: Yury Norov <[email protected]>
---
include/linux/topology.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index fea32377f7c7..52f5850730b3 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -251,7 +251,7 @@ extern const struct cpumask *sched_numa_hop_mask(unsigned int node, unsigned int
#else
static __always_inline int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
{
- return cpumask_nth(cpu, cpus);
+ return cpumask_nth_and(cpu, cpus, cpu_online_mask);
}

static inline const struct cpumask *
--
2.37.2

2023-04-30 17:19:49

by Yury Norov

Subject: [PATCH v3 4/8] sched/topology: add for_each_numa_{,online}_cpu() macro

for_each_cpu() is widely used in the kernel, and it's beneficial to
create a NUMA-aware version of the macro.

The recently added for_each_numa_hop_mask() works, but switching the
existing codebase to it is not an easy process.

The new for_each_numa_cpu() is designed to be similar to for_each_cpu().
It makes converting existing code NUMA-aware as simple as adding a hop
iterator variable and passing it to the new macro; for_each_numa_cpu()
takes care of the rest.

At the moment, we have two users of NUMA-aware enumerators. One is
Mellanox's in-tree driver, and another is Intel's in-review driver:

https://lore.kernel.org/lkml/[email protected]/

Both real-life examples follow the same pattern:

for_each_numa_hop_mask(cpus, prev, node) {
for_each_cpu_andnot(cpu, cpus, prev) {
if (cnt++ == max_num)
goto out;
do_something(cpu);
}
prev = cpus;
}

With the new macro, it would look like this:

for_each_numa_online_cpu(cpu, hop, node) {
if (cnt++ == max_num)
break;
do_something(cpu);
}

Straight conversion of the existing for_each_cpu() codebase to a NUMA-aware
version with for_each_numa_hop_mask() is difficult because the latter
doesn't take a user-provided cpu mask, and eventually ends up as an
open-coded double loop. With for_each_numa_cpu() it shouldn't be a
brainteaser.
Consider the NUMA-ignorant example:

cpumask_t cpus = get_mask();
int cnt = 0, cpu;

for_each_cpu(cpu, cpus) {
if (cnt++ == max_num)
break;
do_something(cpu);
}

Converting it to a NUMA-aware version would be as simple as:

cpumask_t cpus = get_mask();
int node = get_node();
int cnt = 0, hop, cpu;

for_each_numa_cpu(cpu, hop, node, cpus) {
if (cnt++ == max_num)
break;
do_something(cpu);
}

The latter is slightly more verbose, but avoids open-coding that annoying
double loop. Another advantage is that it exposes a 'hop' parameter with
the clear meaning of NUMA distance, and doesn't force people unfamiliar
with the enumerator internals to bother with the current and previous mask
machinery.

Signed-off-by: Yury Norov <[email protected]>
---
include/linux/topology.h | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index da92fea38585..6ed01962878c 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -291,4 +291,23 @@ sched_numa_hop_mask(unsigned int node, unsigned int hops)
!IS_ERR_OR_NULL(mask); \
__hops++)

+/**
+ * for_each_numa_cpu - iterate over cpus in increasing order taking into account
+ * NUMA distances from a given node.
+ * @cpu: the (optionally unsigned) integer iterator
+ * @hop: the iterator variable, must be initialized to a desired minimal hop.
+ * @node: the NUMA node to start the search from.
+ * @mask: the cpumask pointer
+ *
+ * Requires rcu_lock to be held.
+ */
+#define for_each_numa_cpu(cpu, hop, node, mask) \
+ for ((cpu) = 0, (hop) = 0; \
+ (cpu) = sched_numa_find_next_cpu((mask), (cpu), (node), &(hop)),\
+ (cpu) < nr_cpu_ids; \
+ (cpu)++)
+
+#define for_each_numa_online_cpu(cpu, hop, node) \
+ for_each_numa_cpu(cpu, hop, node, cpu_online_mask)
+
#endif /* _LINUX_TOPOLOGY_H */
--
2.37.2

2023-04-30 17:19:49

by Yury Norov

Subject: [PATCH v3 6/8] lib/cpumask: update comment to cpumask_local_spread()

Now that we have for_each_numa_online_cpu(), which is a more
straightforward replacement for cpumask_local_spread() when it comes to
enumerating CPUs with respect to NUMA topology, it's worth updating the
comment.

Signed-off-by: Yury Norov <[email protected]>
---
lib/cpumask.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/lib/cpumask.c b/lib/cpumask.c
index e7258836b60b..774966483ca9 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -127,11 +127,8 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
*
* There's a better alternative based on for_each()-like iterators:
*
- * for_each_numa_hop_mask(mask, node) {
- * for_each_cpu_andnot(cpu, mask, prev)
- * do_something(cpu);
- * prev = mask;
- * }
+ * for_each_numa_online_cpu(cpu, hop, node)
+ * do_something(cpu);
*
* It's simpler and more verbose than above. Complexity of iterator-based
* enumeration is O(sched_domains_numa_levels * nr_cpu_ids), while
--
2.37.2

2023-04-30 17:19:53

by Yury Norov

Subject: [PATCH v3 5/8] net: mlx5: switch comp_irqs_request() to using for_each_numa_cpu

for_each_numa_online_cpu() is a more straightforward alternative to
for_each_numa_hop_mask() + for_each_cpu_andnot().

Signed-off-by: Yury Norov <[email protected]>
Reviewed-by: Tariq Toukan <[email protected]>
---
drivers/net/ethernet/mellanox/mlx5/core/eq.c | 16 +++++-----------
1 file changed, 5 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 38b32e98f3bd..d3511e45f121 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -817,12 +817,10 @@ static void comp_irqs_release(struct mlx5_core_dev *dev)
static int comp_irqs_request(struct mlx5_core_dev *dev)
{
struct mlx5_eq_table *table = dev->priv.eq_table;
- const struct cpumask *prev = cpu_none_mask;
- const struct cpumask *mask;
int ncomp_eqs = table->num_comp_eqs;
u16 *cpus;
int ret;
- int cpu;
+ int cpu, hop;
int i;

ncomp_eqs = table->num_comp_eqs;
@@ -844,15 +842,11 @@ static int comp_irqs_request(struct mlx5_core_dev *dev)

i = 0;
rcu_read_lock();
- for_each_numa_hop_mask(mask, dev->priv.numa_node) {
- for_each_cpu_andnot(cpu, mask, prev) {
- cpus[i] = cpu;
- if (++i == ncomp_eqs)
- goto spread_done;
- }
- prev = mask;
+ for_each_numa_online_cpu(cpu, hop, dev->priv.numa_node) {
+ cpus[i] = cpu;
+ if (++i == ncomp_eqs)
+ break;
}
-spread_done:
rcu_read_unlock();
ret = mlx5_irqs_request_vectors(dev, cpus, ncomp_eqs, table->comp_irqs);
kfree(cpus);
--
2.37.2

2023-04-30 17:20:18

by Yury Norov

Subject: [PATCH v3 3/8] sched/topology: introduce sched_numa_find_next_cpu()

The function searches for the next CPU in a given cpumask according to
NUMA topology, so that it traverses CPUs hop by hop.

If the CPU is the last CPU in a given hop, sched_numa_find_next_cpu()
switches to the next hop and picks the first CPU from there, excluding
those already traversed.

Because only online CPUs are present in the NUMA topology masks, offline
CPUs are skipped even if they are set in the 'cpus' mask provided in the
arguments.

Signed-off-by: Yury Norov <[email protected]>
---
include/linux/topology.h | 12 ++++++++++++
kernel/sched/topology.c | 39 +++++++++++++++++++++++++++++++++++++++
2 files changed, 51 insertions(+)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 52f5850730b3..da92fea38585 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -245,8 +245,13 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
return cpumask_of_node(cpu_to_node(cpu));
}

+/*
+ * sched_numa_find_*_cpu() functions family traverses only accessible CPUs,
+ * i.e. those listed in cpu_online_mask.
+ */
#ifdef CONFIG_NUMA
int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node);
+int sched_numa_find_next_cpu(const struct cpumask *cpus, int cpu, int node, unsigned int *hop);
extern const struct cpumask *sched_numa_hop_mask(unsigned int node, unsigned int hops);
#else
static __always_inline int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
@@ -254,6 +259,13 @@ static __always_inline int sched_numa_find_nth_cpu(const struct cpumask *cpus, i
return cpumask_nth_and(cpu, cpus, cpu_online_mask);
}

+static __always_inline
+int sched_numa_find_next_cpu(const struct cpumask *cpus, int cpu, int node, unsigned int *hop)
+{
+ return find_next_and_bit(cpumask_bits(cpus), cpumask_bits(cpu_online_mask),
+ small_cpumask_bits, cpu);
+}
+
static inline const struct cpumask *
sched_numa_hop_mask(unsigned int node, unsigned int hops)
{
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 051aaf65c749..fc163e4181e6 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2130,6 +2130,45 @@ int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
}
EXPORT_SYMBOL_GPL(sched_numa_find_nth_cpu);

+/*
+ * sched_numa_find_next_cpu() - given the NUMA topology, find the next cpu
+ * cpumask: cpumask to find a CPU from
+ * cpu: current CPU
+ * node: local node
+ * hop: (in/out) indicates distance order of current CPU to a local node
+ *
+ * The function searches for a next CPU at a given NUMA distance, indicated
+ * by hop, and if nothing found, tries to find CPUs at a greater distance,
+ * starting from the beginning.
+ *
+ * Return: cpu, or >= nr_cpu_ids when nothing found.
+ */
+int sched_numa_find_next_cpu(const struct cpumask *cpus, int cpu, int node, unsigned int *hop)
+{
+ unsigned long *cur, *prev;
+ struct cpumask ***masks;
+ unsigned int ret;
+
+ if (*hop >= sched_domains_numa_levels)
+ return nr_cpu_ids;
+
+ masks = rcu_dereference(sched_domains_numa_masks);
+ cur = cpumask_bits(masks[*hop][node]);
+ if (*hop == 0)
+ ret = find_next_and_bit(cpumask_bits(cpus), cur, nr_cpu_ids, cpu);
+ else {
+ prev = cpumask_bits(masks[*hop - 1][node]);
+ ret = find_next_and_andnot_bit(cpumask_bits(cpus), cur, prev, nr_cpu_ids, cpu);
+ }
+
+ if (ret < nr_cpu_ids)
+ return ret;
+
+ *hop += 1;
+ return sched_numa_find_next_cpu(cpus, 0, node, hop);
+}
+EXPORT_SYMBOL_GPL(sched_numa_find_next_cpu);
+
/**
* sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops away from
* @node
--
2.37.2

2023-04-30 17:20:25

by Yury Norov

Subject: [PATCH v3 7/8] sched: drop for_each_numa_hop_mask()

Now that we have for_each_numa_cpu(), for_each_numa_hop_mask() and all
related code is dead code. Drop it.

Signed-off-by: Yury Norov <[email protected]>
---
include/linux/topology.h | 25 -------------------------
kernel/sched/topology.c | 32 --------------------------------
2 files changed, 57 deletions(-)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 6ed01962878c..808b5dcf6e36 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -252,7 +252,6 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
#ifdef CONFIG_NUMA
int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node);
int sched_numa_find_next_cpu(const struct cpumask *cpus, int cpu, int node, unsigned int *hop);
-extern const struct cpumask *sched_numa_hop_mask(unsigned int node, unsigned int hops);
#else
static __always_inline int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
{
@@ -265,32 +264,8 @@ int sched_numa_find_next_cpu(const struct cpumask *cpus, int cpu, int node, unsi
return find_next_and_bit(cpumask_bits(cpus), cpumask_bits(cpu_online_mask),
small_cpumask_bits, cpu);
}
-
-static inline const struct cpumask *
-sched_numa_hop_mask(unsigned int node, unsigned int hops)
-{
- return ERR_PTR(-EOPNOTSUPP);
-}
#endif /* CONFIG_NUMA */

-/**
- * for_each_numa_hop_mask - iterate over cpumasks of increasing NUMA distance
- * from a given node.
- * @mask: the iteration variable.
- * @node: the NUMA node to start the search from.
- *
- * Requires rcu_lock to be held.
- *
- * Yields cpu_online_mask for @node == NUMA_NO_NODE.
- */
-#define for_each_numa_hop_mask(mask, node) \
- for (unsigned int __hops = 0; \
- mask = (node != NUMA_NO_NODE || __hops) ? \
- sched_numa_hop_mask(node, __hops) : \
- cpu_online_mask, \
- !IS_ERR_OR_NULL(mask); \
- __hops++)
-
/**
* for_each_numa_cpu - iterate over cpus in increasing order taking into account
* NUMA distances from a given node.
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index fc163e4181e6..bb5ba2c5589a 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2169,38 +2169,6 @@ int sched_numa_find_next_cpu(const struct cpumask *cpus, int cpu, int node, unsi
}
EXPORT_SYMBOL_GPL(sched_numa_find_next_cpu);

-/**
- * sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops away from
- * @node
- * @node: The node to count hops from.
- * @hops: Include CPUs up to that many hops away. 0 means local node.
- *
- * Return: On success, a pointer to a cpumask of CPUs at most @hops away from
- * @node, an error value otherwise.
- *
- * Requires rcu_lock to be held. Returned cpumask is only valid within that
- * read-side section, copy it if required beyond that.
- *
- * Note that not all hops are equal in distance; see sched_init_numa() for how
- * distances and masks are handled.
- * Also note that this is a reflection of sched_domains_numa_masks, which may change
- * during the lifetime of the system (offline nodes are taken out of the masks).
- */
-const struct cpumask *sched_numa_hop_mask(unsigned int node, unsigned int hops)
-{
- struct cpumask ***masks;
-
- if (node >= nr_node_ids || hops >= sched_domains_numa_levels)
- return ERR_PTR(-EINVAL);
-
- masks = rcu_dereference(sched_domains_numa_masks);
- if (!masks)
- return ERR_PTR(-EBUSY);
-
- return masks[hops][node];
-}
-EXPORT_SYMBOL_GPL(sched_numa_hop_mask);
-
#endif /* CONFIG_NUMA */

static int __sdt_alloc(const struct cpumask *cpu_map)
--
2.37.2

2023-04-30 17:21:20

by Yury Norov

Subject: [PATCH v3 8/8] lib: test for_each_numa_cpus()

Test for_each_numa_cpu() output to ensure that:
- all CPUs are picked from NUMA nodes with non-decreasing distances to the
original node;
- only online CPUs are enumerated;
- the macro enumerates each online CPU only once;
- enumeration order is consistent with cpumask_local_spread().

The latter is implementation-defined behavior. If cpumask_local_spread()
or for_each_numa_cpu() changes in the future, the subtest may need to be
adjusted or even removed, as appropriate.

It's useful now because some architectures don't implement node_distance(),
and the generic implementation only distinguishes local and remote nodes,
which doesn't allow testing for_each_numa_cpu() properly.

Suggested-by: Valentin Schneider <[email protected]> (for node_distance() test)
Signed-off-by: Yury Norov <[email protected]>
---
lib/test_bitmap.c | 70 +++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 68 insertions(+), 2 deletions(-)

diff --git a/lib/test_bitmap.c b/lib/test_bitmap.c
index a8005ad3bd58..ac4fe621d37b 100644
--- a/lib/test_bitmap.c
+++ b/lib/test_bitmap.c
@@ -12,6 +12,7 @@
#include <linux/printk.h>
#include <linux/slab.h>
#include <linux/string.h>
+#include <linux/topology.h>
#include <linux/uaccess.h>

#include "../tools/testing/selftests/kselftest_module.h"
@@ -71,6 +72,16 @@ __check_eq_uint(const char *srcfile, unsigned int line,
return true;
}

+static bool __init
+__check_ge_uint(const char *srcfile, unsigned int line,
+ const unsigned int exp_uint, unsigned int x)
+{
+ if (exp_uint >= x)
+ return true;
+
+ pr_err("[%s:%u] expected >= %u, got %u\n", srcfile, line, exp_uint, x);
+ return false;
+}

static bool __init
__check_eq_bitmap(const char *srcfile, unsigned int line,
@@ -86,6 +97,18 @@ __check_eq_bitmap(const char *srcfile, unsigned int line,
return true;
}

+static bool __init
+__check_eq_cpumask(const char *srcfile, unsigned int line,
+ const struct cpumask *exp_cpumask, const struct cpumask *cpumask)
+{
+ if (cpumask_equal(exp_cpumask, cpumask))
+ return true;
+
+ pr_warn("[%s:%u] cpumasks contents differ: expected \"%*pbl\", got \"%*pbl\"\n",
+ srcfile, line, cpumask_pr_args(exp_cpumask), cpumask_pr_args(cpumask));
+ return false;
+}
+
static bool __init
__check_eq_pbl(const char *srcfile, unsigned int line,
const char *expected_pbl,
@@ -173,11 +196,11 @@ __check_eq_str(const char *srcfile, unsigned int line,
return eq;
}

-#define __expect_eq(suffix, ...) \
+#define __expect(suffix, ...) \
({ \
int result = 0; \
total_tests++; \
- if (!__check_eq_ ## suffix(__FILE__, __LINE__, \
+ if (!__check_ ## suffix(__FILE__, __LINE__, \
##__VA_ARGS__)) { \
failed_tests++; \
result = 1; \
@@ -185,13 +208,19 @@ __check_eq_str(const char *srcfile, unsigned int line,
result; \
})

+#define __expect_eq(suffix, ...) __expect(eq_ ## suffix, ##__VA_ARGS__)
+#define __expect_ge(suffix, ...) __expect(ge_ ## suffix, ##__VA_ARGS__)
+
#define expect_eq_uint(...) __expect_eq(uint, ##__VA_ARGS__)
#define expect_eq_bitmap(...) __expect_eq(bitmap, ##__VA_ARGS__)
+#define expect_eq_cpumask(...) __expect_eq(cpumask, ##__VA_ARGS__)
#define expect_eq_pbl(...) __expect_eq(pbl, ##__VA_ARGS__)
#define expect_eq_u32_array(...) __expect_eq(u32_array, ##__VA_ARGS__)
#define expect_eq_clump8(...) __expect_eq(clump8, ##__VA_ARGS__)
#define expect_eq_str(...) __expect_eq(str, ##__VA_ARGS__)

+#define expect_ge_uint(...) __expect_ge(uint, ##__VA_ARGS__)
+
static void __init test_zero_clear(void)
{
DECLARE_BITMAP(bmap, 1024);
@@ -751,6 +780,42 @@ static void __init test_for_each_set_bit_wrap(void)
}
}

+static void __init test_for_each_numa_cpu(void)
+{
+ unsigned int node, cpu, hop;
+ cpumask_var_t mask;
+
+ if (!alloc_cpumask_var(&mask, GFP_KERNEL)) {
+ pr_err("Can't allocate cpumask. Skipping for_each_numa_cpu() test");
+ return;
+ }
+
+ for_each_node(node) {
+ unsigned int c = 0, dist, old_dist = node_distance(node, node);
+
+ cpumask_clear(mask);
+
+ rcu_read_lock();
+ for_each_numa_cpu(cpu, hop, node, cpu_possible_mask) {
+ dist = node_distance(cpu_to_node(cpu), node);
+
+ /* Distance between nodes must never decrease */
+ expect_ge_uint(dist, old_dist);
+
+ /* Test for coherence with cpumask_local_spread() */
+ expect_eq_uint(cpumask_local_spread(c++, node), cpu);
+
+ cpumask_set_cpu(cpu, mask);
+ old_dist = dist;
+ }
+ rcu_read_unlock();
+
+ /* Each online CPU must be visited exactly once */
+ expect_eq_uint(c, num_online_cpus());
+ expect_eq_cpumask(mask, cpu_online_mask);
+ }
+}
+
static void __init test_for_each_set_bit(void)
{
DECLARE_BITMAP(orig, 500);
@@ -1237,6 +1302,7 @@ static void __init selftest(void)
test_for_each_clear_bitrange_from();
test_for_each_set_clump8();
test_for_each_set_bit_wrap();
+ test_for_each_numa_cpu();
}

KSTM_MODULE_LOADERS(test_bitmap);
--
2.37.2

2023-05-02 17:07:01

by Valentin Schneider

Subject: Re: [PATCH v2 0/8] sched/topology: add for_each_numa_cpu() macro

On 30/04/23 10:18, Yury Norov wrote:
> for_each_cpu() is widely used in kernel, and it's beneficial to create
> a NUMA-aware version of the macro.
>
> Recently added for_each_numa_hop_mask() works, but switching existing
> codebase to it is not an easy process.
>
> This series adds for_each_numa_cpu(), which is designed to be similar to
> the for_each_cpu(). It allows to convert existing code to NUMA-aware as
> simple as adding a hop iterator variable and passing it inside new macro.
> for_each_numa_cpu() takes care of the rest.
>
> At the moment, we have 2 users of NUMA-aware enumerators. One is
> Melanox's in-tree driver, and another is Intel's in-review driver:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> Both real-life examples follow the same pattern:
>
> for_each_numa_hop_mask(cpus, prev, node) {
> for_each_cpu_andnot(cpu, cpus, prev) {
> if (cnt++ == max_num)
> goto out;
> do_something(cpu);
> }
> prev = cpus;
> }
>
> With the new macro, it has a more standard look, like this:
>
> for_each_numa_cpu(cpu, hop, node, cpu_possible_mask) {
> if (cnt++ == max_num)
> break;
> do_something(cpu);
> }
>
> Straight conversion of existing for_each_cpu() codebase to NUMA-aware
> version with for_each_numa_hop_mask() is difficult because it doesn't
> take a user-provided cpu mask, and eventually ends up with open-coded
> double loop. With for_each_numa_cpu() it shouldn't be a brainteaser.
> Consider the NUMA-ignorant example:
>
> cpumask_t cpus = get_mask();
> int cnt = 0, cpu;
>
> for_each_cpu(cpu, cpus) {
> if (cnt++ == max_num)
> break;
> do_something(cpu);
> }
>
> Converting it to NUMA-aware version would be as simple as:
>
> cpumask_t cpus = get_mask();
> int node = get_node();
> int cnt = 0, hop, cpu;
>
> for_each_numa_cpu(cpu, hop, node, cpus) {
> if (cnt++ == max_num)
> break;
> do_something(cpu);
> }
>
> The latter looks more verbose and avoids from open-coding that annoying
> double loop. Another advantage is that it works with a 'hop' parameter with
> the clear meaning of NUMA distance, and doesn't make people not familiar
> to enumerator internals bothering with current and previous masks machinery.
>

LGTM, I ran the tests on a few NUMA topologies and that all seems to behave
as expected. Thanks for working on this!

Reviewed-by: Valentin Schneider <[email protected]>

2023-05-02 22:04:05

by Yury Norov

Subject: Re: [PATCH v2 0/8] sched/topology: add for_each_numa_cpu() macro

> LGTM, I ran the tests on a few NUMA topologies and that all seems to behave
> as expected. Thanks for working on this!
>
> Reviewed-by: Valentin Schneider <[email protected]>

Thank you Valentin. If you spent time testing the series, why
don't you add your Tested-by?

2023-05-03 10:15:42

by Valentin Schneider

Subject: Re: [PATCH v2 0/8] sched/topology: add for_each_numa_cpu() macro

On 02/05/23 14:58, Yury Norov wrote:
>> LGTM, I ran the tests on a few NUMA topologies and that all seems to behave
>> as expected. Thanks for working on this!
>>
>> Reviewed-by: Valentin Schneider <[email protected]>
>
> Thank you Valentin. If you spent time testing the series, why
> don't you add your Tested-by?

Well, I only ran the test_bitmap stuff and checked the output of the
iterator then, I didn't get to test on actual hardware with a mellanox card
:-)

But yeah, I suppose that does count for the rest, so feel free to add to
all patches but #5:

Tested-by: Valentin Schneider <[email protected]>

2023-05-31 15:48:45

by Yury Norov

Subject: Re: [PATCH v2 0/8] sched/topology: add for_each_numa_cpu() macro

On Sun, Apr 30, 2023 at 10:18:01AM -0700, Yury Norov wrote:
> for_each_cpu() is widely used in kernel, and it's beneficial to create
> a NUMA-aware version of the macro.
>
> Recently added for_each_numa_hop_mask() works, but switching existing
> codebase to it is not an easy process.
>
> This series adds for_each_numa_cpu(), which is designed to be similar to
> the for_each_cpu(). It allows to convert existing code to NUMA-aware as
> simple as adding a hop iterator variable and passing it inside new macro.
> for_each_numa_cpu() takes care of the rest.

Hi Jakub,

Now that the series is reviewed, can you consider taking it into the
sched tree?

Thanks,
Yury

2023-05-31 17:03:34

by Jakub Kicinski

Subject: Re: [PATCH v2 0/8] sched/topology: add for_each_numa_cpu() macro

On Wed, 31 May 2023 08:43:46 -0700 Yury Norov wrote:
> On Sun, Apr 30, 2023 at 10:18:01AM -0700, Yury Norov wrote:
> > for_each_cpu() is widely used in kernel, and it's beneficial to create
> > a NUMA-aware version of the macro.
> >
> > Recently added for_each_numa_hop_mask() works, but switching existing
> > codebase to it is not an easy process.
> >
> > This series adds for_each_numa_cpu(), which is designed to be similar to
> > the for_each_cpu(). It allows to convert existing code to NUMA-aware as
> > simple as adding a hop iterator variable and passing it inside new macro.
> > for_each_numa_cpu() takes care of the rest.
>
> Hi Jakub,
>
> Now that the series reviewed, can you consider taking it in sched
> tree?

Do you mean someone else or did you mean the net-next tree?

2023-05-31 17:16:07

by Yury Norov

Subject: Re: [PATCH v2 0/8] sched/topology: add for_each_numa_cpu() macro

On Wed, May 31, 2023 at 10:01:25AM -0700, Jakub Kicinski wrote:
> On Wed, 31 May 2023 08:43:46 -0700 Yury Norov wrote:
> > On Sun, Apr 30, 2023 at 10:18:01AM -0700, Yury Norov wrote:
> > > for_each_cpu() is widely used in the kernel, and it's beneficial to have
> > > a NUMA-aware version of the macro.
> > >
> > > The recently added for_each_numa_hop_mask() works, but switching the
> > > existing codebase to it is not an easy process.
> > >
> > > This series adds for_each_numa_cpu(), which is designed to be similar to
> > > for_each_cpu(). It lets existing code be converted to a NUMA-aware
> > > version as simply as adding a hop iterator variable and passing it to
> > > the new macro. for_each_numa_cpu() takes care of the rest.
> >
> > Hi Jakub,
> >
> > Now that the series is reviewed, can you consider taking it into the
> > sched tree?
>
> Do you mean someone else or did you mean the net-next tree?

Sorry, net-next.

2023-05-31 17:46:11

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH v2 0/8] sched/topology: add for_each_numa_cpu() macro

On Wed, 31 May 2023 10:08:58 -0700 Yury Norov wrote:
> On Wed, May 31, 2023 at 10:01:25AM -0700, Jakub Kicinski wrote:
> > On Wed, 31 May 2023 08:43:46 -0700 Yury Norov wrote:
> > > Now that the series is reviewed, can you consider taking it into the
> > > sched tree?
> >
> > Do you mean someone else or did you mean the net-next tree?
>
> Sorry, net-next.

I'm a bit of a coward. I don't trust my ability to judge this code,
and it seems Linus has opinions about it :( The mlx5 patch looks like
a small refactoring which can wait until 6.6. I don't feel like net-next
is the best path downstream for this series :(

2023-07-22 16:51:27

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH v3 8/8] lib: test for_each_numa_cpus()

Hi,

On Sun, Apr 30, 2023 at 10:18:09AM -0700, Yury Norov wrote:
> Test for_each_numa_cpus() output to ensure that:
> - all CPUs are picked from NUMA nodes with non-decreasing distances to
>   the original node;
> - only online CPUs are enumerated;
> - the macro enumerates each online CPU only once;
> - the enumeration order is consistent with cpumask_local_spread().
>
> The latter is implementation-defined behavior. If cpumask_local_spread()
> or for_each_numa_cpu() changes in the future, the subtest may need to be
> adjusted or even removed, as appropriate.
>
> It's useful now because some architectures don't implement
> numa_distance(), and the generic implementation only distinguishes local
> and remote nodes, which doesn't allow testing for_each_numa_cpu()
> properly.
>

This patch results in a crash when testing sparc64 images with qemu.

[ 4.178301] Unable to handle kernel NULL pointer dereference
[ 4.178836] tsk->{mm,active_mm}->context = 0000000000000000
[ 4.179280] tsk->{mm,active_mm}->pgd = fffff80000402000
[ 4.179710] \|/ ____ \|/
[ 4.179710] "@'/ .. \`@"
[ 4.179710] /_| \__/ |_\
[ 4.179710] \__U_/
[ 4.180307] swapper/0(1): Oops [#1]
[ 4.181070] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G N 6.5.0-rc2+ #1
[ 4.181720] TSTATE: 0000000011001600 TPC: 000000000094d800 TNPC: 000000000094d804 Y: 00000000 Tainted: G N
[ 4.182324] TPC: <_find_next_and_bit+0x20/0xa0>
[ 4.183136] g0: 0000000000530b90 g1: 0000000000000000 g2: 0000000000000000 g3: ffffffffffffffff
[ 4.183611] g4: fffff80004200020 g5: fffff8001dc42000 g6: fffff80004168000 g7: 0000000000000000
[ 4.184080] o0: 0000000000000001 o1: 0000000001a28190 o2: 0000000000000009 o3: 00000000020e9e28
[ 4.184549] o4: 0000000000000200 o5: 0000000000000001 sp: fffff8000416af11 ret_pc: 0000000000f6529c
[ 4.185020] RPC: <lock_is_held_type+0xbc/0x180>
[ 4.185477] l0: 0000000001bbfa58 l1: 0000000000000000 l2: 00000000020ea228 l3: fffff80004200aa0
[ 4.185950] l4: 81b8e1e5a4e0c637 l5: 000000000192b000 l6: 00000000023b3800 l7: 00000000020e9e28
[ 4.186417] i0: 000000000192a3f8 i1: 0000000000000000 i2: 0000000000000001 i3: 0000000000000000
[ 4.186885] i4: 0000000000000001 i5: fffff80004200aa0 i6: fffff8000416afc1 i7: 00000000004c79bc
[ 4.187356] I7: <sched_numa_find_next_cpu+0x13c/0x180>
[ 4.187821] Call Trace:
[ 4.188274] [<00000000004c79bc>] sched_numa_find_next_cpu+0x13c/0x180
[ 4.188762] [<0000000001b77c10>] test_for_each_numa_cpu+0x164/0x37c
[ 4.189196] [<0000000001b7878c>] test_bitmap_init+0x964/0x9f4
[ 4.189637] [<0000000000427f40>] do_one_initcall+0x60/0x340
[ 4.190069] [<0000000001b56f34>] kernel_init_freeable+0x1bc/0x228
[ 4.190496] [<0000000000f66aa4>] kernel_init+0x18/0x134
[ 4.190911] [<00000000004060e8>] ret_from_fork+0x1c/0x2c
[ 4.191326] [<0000000000000000>] 0x0
[ 4.191827] Disabling lock debugging due to kernel taint
[ 4.192363] Caller[00000000004c79bc]: sched_numa_find_next_cpu+0x13c/0x180
[ 4.192825] Caller[0000000001b77c10]: test_for_each_numa_cpu+0x164/0x37c
[ 4.193255] Caller[0000000001b7878c]: test_bitmap_init+0x964/0x9f4
[ 4.193681] Caller[0000000000427f40]: do_one_initcall+0x60/0x340
[ 4.194097] Caller[0000000001b56f34]: kernel_init_freeable+0x1bc/0x228
[ 4.194516] Caller[0000000000f66aa4]: kernel_init+0x18/0x134
[ 4.194922] Caller[00000000004060e8]: ret_from_fork+0x1c/0x2c
[ 4.195326] Caller[0000000000000000]: 0x0
[ 4.195728] Instruction DUMP:
[ 4.195762] 8328b003
[ 4.196237] 8728d01b
[ 4.196600] d05e0001
[ 4.196956] <ce5e4001>
[ 4.197311] 900a0007
[ 4.197669] 900a0003
[ 4.198024] 0ac20013
[ 4.198379] bb28b006
[ 4.198733] 8400a001
[ 4.199093]
[ 4.199873] note: swapper/0[1] exited with preempt_count 1
[ 4.200539] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009

Bisect log attached.

Guenter

---
# bad: [ae867bc97b713121b2a7f5fcac68378a0774739b] Add linux-next specific files for 20230721
# good: [fdf0eaf11452d72945af31804e2a1048ee1b574c] Linux 6.5-rc2
git bisect start 'HEAD' 'v6.5-rc2'
# good: [f09bf8f6c8cbbff6f52523abcda88c86db72e31c] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git
git bisect good f09bf8f6c8cbbff6f52523abcda88c86db72e31c
# good: [86374a6210aeebceb927204d80f9e65739134bc3] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound.git
git bisect good 86374a6210aeebceb927204d80f9e65739134bc3
# good: [d588c93cae9e3dff15d125e755edcba5d842f41a] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
git bisect good d588c93cae9e3dff15d125e755edcba5d842f41a
# good: [3c3990d304820b12a07c77a6e091d6711b31f8e5] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/soundwire.git
git bisect good 3c3990d304820b12a07c77a6e091d6711b31f8e5
# good: [b80a945fabd7acc5984d421c73fa0b667195adb7] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git
git bisect good b80a945fabd7acc5984d421c73fa0b667195adb7
# good: [22c343fad503564a2ef5c6aff1dcb1ec0640006e] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching
git bisect good 22c343fad503564a2ef5c6aff1dcb1ec0640006e
# good: [bf05130eebc3265314f14c1314077f500a5c8d98] Merge branch 'mhi-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mani/mhi.git
git bisect good bf05130eebc3265314f14c1314077f500a5c8d98
# good: [18eea171e03cc2b30fe7c11d6e94521d905026f0] Merge branch 'rust-next' of https://github.com/Rust-for-Linux/linux.git
git bisect good 18eea171e03cc2b30fe7c11d6e94521d905026f0
# bad: [94b1547668965e1fde8bde3638845ab582b40034] lib: test for_each_numa_cpus()
git bisect bad 94b1547668965e1fde8bde3638845ab582b40034
# good: [310ae5d9d46b65fdbd18ac1e5bd03681fbc19ae8] sched/topology: introduce sched_numa_find_next_cpu()
git bisect good 310ae5d9d46b65fdbd18ac1e5bd03681fbc19ae8
# good: [a4be5fa84bb269886310f563e9095e8164f82c8c] net: mlx5: switch comp_irqs_request() to using for_each_numa_cpu
git bisect good a4be5fa84bb269886310f563e9095e8164f82c8c
# good: [b9833b80d87030b0def7aeda88471ac7f6acd3cb] sched: drop for_each_numa_hop_mask()
git bisect good b9833b80d87030b0def7aeda88471ac7f6acd3cb
# first bad commit: [94b1547668965e1fde8bde3638845ab582b40034] lib: test for_each_numa_cpus()

2023-07-27 02:49:22

by Yury Norov

[permalink] [raw]
Subject: Re: [PATCH v3 8/8] lib: test for_each_numa_cpus()

On Sat, Jul 22, 2023 at 08:47:16AM -0700, Guenter Roeck wrote:
> Hi,
>
> On Sun, Apr 30, 2023 at 10:18:09AM -0700, Yury Norov wrote:
> > Test for_each_numa_cpus() output to ensure that:
> > - all CPUs are picked from NUMA nodes with non-decreasing distances to
> >   the original node;
> > - only online CPUs are enumerated;
> > - the macro enumerates each online CPU only once;
> > - the enumeration order is consistent with cpumask_local_spread().
> >
> > The latter is implementation-defined behavior. If cpumask_local_spread()
> > or for_each_numa_cpu() changes in the future, the subtest may need to be
> > adjusted or even removed, as appropriate.
> >
> > It's useful now because some architectures don't implement
> > numa_distance(), and the generic implementation only distinguishes local
> > and remote nodes, which doesn't allow testing for_each_numa_cpu()
> > properly.
> >
>
> This patch results in a crash when testing sparc64 images with qemu.

Thanks, Guenter, for the report. I'll drop the series until the issue is
fixed.

Thanks,
Yury