2022-08-25 18:21:49

by Valentin Schneider

[permalink] [raw]
Subject: [PATCH v3 0/9] sched, net: NUMA-aware CPU spreading interface

Hi folks,

Tariq pointed out in [1] that drivers allocating IRQ vectors would benefit
from having smarter NUMA-awareness (cpumask_local_spread() doesn't quite cut
it).

The proposed interface involved an array of CPUs and a temporary cpumask, and
being my difficult self what I'm proposing here is an interface that doesn't
require any temporary storage other than some stack variables (at the cost of
one wild macro).

Patch 9/9 is just there to showcase how the thing would be used. If this doesn't
get hated on, I'll let Tariq pick this up and push it with his networking driver
changes (with actual changelogs).
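
For reference, driver-side usage ends up looking roughly like the below (a
rough sketch only; 'irq_cpus', 'nvec' and 'node' are made-up driver
variables, see 9/9 for the actual showcase):

	int i = 0, cpu;

	rcu_read_lock();
	for_each_numa_hop_cpu(cpu, node) {
		irq_cpus[i] = cpu;
		if (++i == nvec)
			goto done; /* a bare 'break' only exits the inner loop */
	}
done:
	rcu_read_unlock();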

[1]: https://lore.kernel.org/all/[email protected]/

A note on treewide use of for_each_cpu_andnot()
===============================================

I've used the below coccinelle script to find places that could be patched (I
couldn't figure out the valid syntax to patch from coccinelle itself):

,-----
@tmpandnot@
expression tmpmask;
iterator for_each_cpu;
position p;
statement S;
@@
cpumask_andnot(tmpmask, ...);

...

(
for_each_cpu@p(..., tmpmask, ...)
S
|
for_each_cpu@p(..., tmpmask, ...)
{
...
}
)

@script:python depends on tmpandnot@
p << tmpandnot.p;
@@
coccilib.report.print_report(p[0], "andnot loop here")
'-----

Which yields (against c40e8341e3b3):

.//arch/powerpc/kernel/smp.c:1587:1-13: andnot loop here
.//arch/powerpc/kernel/smp.c:1530:1-13: andnot loop here
.//arch/powerpc/kernel/smp.c:1440:1-13: andnot loop here
.//arch/powerpc/platforms/powernv/subcore.c:306:2-14: andnot loop here
.//arch/x86/kernel/apic/x2apic_cluster.c:62:1-13: andnot loop here
.//drivers/acpi/acpi_pad.c:110:1-13: andnot loop here
.//drivers/cpufreq/armada-8k-cpufreq.c:148:1-13: andnot loop here
.//drivers/cpufreq/powernv-cpufreq.c:931:1-13: andnot loop here
.//drivers/net/ethernet/sfc/efx_channels.c:73:1-13: andnot loop here
.//drivers/net/ethernet/sfc/siena/efx_channels.c:73:1-13: andnot loop here
.//kernel/sched/core.c:345:1-13: andnot loop here
.//kernel/sched/core.c:366:1-13: andnot loop here
.//net/core/dev.c:3058:1-13: andnot loop here

A lot of those are actually of the shape

for_each_cpu(cpu, mask) {
...
cpumask_andnot(mask, ...);
}

I think *some* of the powerpc ones would be a match for for_each_cpu_andnot(),
but I decided to just stick to the one obvious one in __sched_core_flip().

Revisions
=========

v2 -> v3
++++++++

o Added for_each_cpu_and() and for_each_cpu_andnot() tests (Yury)
o New patches to fix issues raised by running the above

o New patch to use for_each_cpu_andnot() in sched/core.c (Yury)

v1 -> v2
++++++++

o Split _find_next_bit() @invert into @invert1 and @invert2 (Yury)
o Rebase onto v6.0-rc1

Cheers,
Valentin

Valentin Schneider (9):
cpumask: Make cpumask_full() check for nr_cpu_ids bits
lib/test_cpumask: Make test_cpumask_last check for nr_cpu_ids bits
bitops: Introduce find_next_andnot_bit()
cpumask: Introduce for_each_cpu_andnot()
lib/test_cpumask: Add for_each_cpu_and(not) tests
sched/core: Merge cpumask_andnot()+for_each_cpu() into
for_each_cpu_andnot()
sched/topology: Introduce sched_numa_hop_mask()
sched/topology: Introduce for_each_numa_hop_cpu()
SHOWCASE: net/mlx5e: Leverage for_each_numa_hop_cpu()

drivers/net/ethernet/mellanox/mlx5/core/eq.c | 12 ++++-
include/linux/cpumask.h | 41 ++++++++++++++++-
include/linux/find.h | 44 ++++++++++++++++---
include/linux/topology.h | 46 ++++++++++++++++++++
kernel/sched/core.c | 5 +--
kernel/sched/topology.c | 28 ++++++++++++
lib/find_bit.c | 23 +++++-----
lib/test_cpumask.c | 23 +++++++++-
8 files changed, 196 insertions(+), 26 deletions(-)

--
2.31.1


2022-08-25 18:22:49

by Valentin Schneider

[permalink] [raw]
Subject: [PATCH v3 1/9] cpumask: Make cpumask_full() check for nr_cpu_ids bits

Consider a system with 4 CPUs and:
CONFIG_NR_CPUS=64
CONFIG_CPUMASK_OFFSTACK=n

In this situation, we have:
nr_cpumask_bits == NR_CPUS == 64
nr_cpu_ids = 4

Per smp.c::setup_nr_cpu_ids(), nr_cpu_ids <= NR_CPUS, so we want
cpumask_full() to check for nr_cpu_ids bits set.

This issue is currently pointed out by the cpumask KUnit tests:

[ 14.072028] # test_cpumask_weight: EXPECTATION FAILED at lib/test_cpumask.c:57
[ 14.072028] Expected cpumask_full(((const struct cpumask *)&__cpu_possible_mask)) to be true, but is false
[ 14.079333] not ok 1 - test_cpumask_weight

Signed-off-by: Valentin Schneider <[email protected]>
---
include/linux/cpumask.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index bd047864c7ac..1414ce8cd003 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -574,7 +574,7 @@ static inline bool cpumask_empty(const struct cpumask *srcp)
*/
static inline bool cpumask_full(const struct cpumask *srcp)
{
- return bitmap_full(cpumask_bits(srcp), nr_cpumask_bits);
+ return bitmap_full(cpumask_bits(srcp), nr_cpu_ids);
}

/**
--
2.31.1

2022-08-25 18:23:54

by Valentin Schneider

[permalink] [raw]
Subject: [PATCH v3 4/9] cpumask: Introduce for_each_cpu_andnot()

for_each_cpu_and() is very convenient as it saves having to allocate a
temporary cpumask to store the result of cpumask_and(). The same issue
applies to cpumask_andnot() which doesn't actually need temporary storage
for iteration purposes.

Following what has been done for for_each_cpu_and(), introduce
for_each_cpu_andnot().
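
As an illustration (just a sketch, not lifted from an in-tree user;
do_something() and the masks are placeholders), a pattern such as:

	cpumask_var_t tmp;

	if (!alloc_cpumask_var(&tmp, GFP_KERNEL))
		return -ENOMEM;

	cpumask_andnot(tmp, mask1, mask2);
	for_each_cpu(cpu, tmp)
		do_something(cpu);

	free_cpumask_var(tmp);

can then become:

	for_each_cpu_andnot(cpu, mask1, mask2)
		do_something(cpu);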

Signed-off-by: Valentin Schneider <[email protected]>
---
include/linux/cpumask.h | 39 +++++++++++++++++++++++++++++++++++++++
1 file changed, 39 insertions(+)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 1414ce8cd003..372a642bf9ba 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -238,6 +238,25 @@ unsigned int cpumask_next_and(int n, const struct cpumask *src1p,
nr_cpumask_bits, n + 1);
}

+/**
+ * cpumask_next_andnot - get the next cpu in *src1p & ~*src2p
+ * @n: the cpu prior to the place to search (ie. return will be > @n)
+ * @src1p: the first cpumask pointer
+ * @src2p: the second cpumask pointer
+ *
+ * Returns >= nr_cpu_ids if no further cpus set in *src1p & ~*src2p
+ */
+static inline
+unsigned int cpumask_next_andnot(int n, const struct cpumask *src1p,
+ const struct cpumask *src2p)
+{
+ /* -1 is a legal arg here. */
+ if (n != -1)
+ cpumask_check(n);
+ return find_next_andnot_bit(cpumask_bits(src1p), cpumask_bits(src2p),
+ nr_cpumask_bits, n + 1);
+}
+
/**
* for_each_cpu - iterate over every cpu in a mask
* @cpu: the (optionally unsigned) integer iterator
@@ -317,6 +336,26 @@ unsigned int __pure cpumask_next_wrap(int n, const struct cpumask *mask, int sta
(cpu) = cpumask_next_and((cpu), (mask1), (mask2)), \
(cpu) < nr_cpu_ids;)

+/**
+ * for_each_cpu_andnot - iterate over every cpu present in one mask, excluding
+ * those present in another.
+ * @cpu: the (optionally unsigned) integer iterator
+ * @mask1: the first cpumask pointer
+ * @mask2: the second cpumask pointer
+ *
+ * This saves a temporary CPU mask in many places. It is equivalent to:
+ * struct cpumask tmp;
+ * cpumask_andnot(&tmp, &mask1, &mask2);
+ * for_each_cpu(cpu, &tmp)
+ * ...
+ *
+ * After the loop, cpu is >= nr_cpu_ids.
+ */
+#define for_each_cpu_andnot(cpu, mask1, mask2) \
+ for ((cpu) = -1; \
+ (cpu) = cpumask_next_andnot((cpu), (mask1), (mask2)), \
+ (cpu) < nr_cpu_ids;)
+
/**
* cpumask_any_but - return a "random" in a cpumask, but not this one.
* @mask: the cpumask to search
--
2.31.1

2022-08-25 18:26:18

by Valentin Schneider

[permalink] [raw]
Subject: [PATCH v3 6/9] sched/core: Merge cpumask_andnot()+for_each_cpu() into for_each_cpu_andnot()

This removes the second use of the sched_core_mask temporary mask.

Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/core.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ee28253c9ac0..b4c3112b0095 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -360,10 +360,7 @@ static void __sched_core_flip(bool enabled)
/*
* Toggle the offline CPUs.
*/
- cpumask_copy(&sched_core_mask, cpu_possible_mask);
- cpumask_andnot(&sched_core_mask, &sched_core_mask, cpu_online_mask);
-
- for_each_cpu(cpu, &sched_core_mask)
+ for_each_cpu_andnot(cpu, cpu_possible_mask, cpu_online_mask)
cpu_rq(cpu)->core_enabled = enabled;

cpus_read_unlock();
--
2.31.1

2022-08-25 18:26:30

by Valentin Schneider

[permalink] [raw]
Subject: [PATCH v3 9/9] SHOWCASE: net/mlx5e: Leverage for_each_numa_hop_cpu()

Not-signed-off-by: Valentin Schneider <[email protected]>
---
drivers/net/ethernet/mellanox/mlx5/core/eq.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 229728c80233..0a5432903edd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -812,6 +812,7 @@ static int comp_irqs_request(struct mlx5_core_dev *dev)
int ncomp_eqs = table->num_comp_eqs;
u16 *cpus;
int ret;
+ int cpu;
int i;

ncomp_eqs = table->num_comp_eqs;
@@ -830,8 +831,15 @@ static int comp_irqs_request(struct mlx5_core_dev *dev)
ret = -ENOMEM;
goto free_irqs;
}
- for (i = 0; i < ncomp_eqs; i++)
- cpus[i] = cpumask_local_spread(i, dev->priv.numa_node);
+ i = 0;
+ rcu_read_lock();
+ for_each_numa_hop_cpu(cpu, dev->priv.numa_node) {
+ cpus[i] = cpu;
+ if (++i == ncomp_eqs)
+ goto spread_done;
+ }
+spread_done:
+ rcu_read_unlock();
ret = mlx5_irqs_request_vectors(dev, cpus, ncomp_eqs, table->comp_irqs);
kfree(cpus);
if (ret < 0)
--
2.31.1

2022-08-25 18:39:49

by Valentin Schneider

[permalink] [raw]
Subject: [PATCH v3 5/9] lib/test_cpumask: Add for_each_cpu_and(not) tests

Following the recent introduction of for_each_cpu_andnot(), add some tests
to ensure for_each_cpu_and(not)() yields the same CPUs as iterating over the
result of cpumask_and(not)().

Signed-off-by: Valentin Schneider <[email protected]>
---
lib/test_cpumask.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)

diff --git a/lib/test_cpumask.c b/lib/test_cpumask.c
index 81b17563fcb3..62d499394d8a 100644
--- a/lib/test_cpumask.c
+++ b/lib/test_cpumask.c
@@ -29,6 +29,19 @@
KUNIT_EXPECT_EQ((test), nr_cpu_ids - mask_weight, iter); \
} while (0)

+#define EXPECT_FOR_EACH_CPU_OP_EQ(test, op, mask1, mask2) \
+ do { \
+ const cpumask_t *m1 = (mask1); \
+ const cpumask_t *m2 = (mask2); \
+ int weight; \
+ int cpu, iter = 0; \
+ cpumask_##op(&mask_tmp, m1, m2); \
+ weight = cpumask_weight(&mask_tmp); \
+ for_each_cpu_##op(cpu, mask1, mask2) \
+ iter++; \
+ KUNIT_EXPECT_EQ((test), weight, iter); \
+ } while (0)
+
#define EXPECT_FOR_EACH_CPU_WRAP_EQ(test, mask) \
do { \
const cpumask_t *m = (mask); \
@@ -50,6 +63,7 @@

static cpumask_t mask_empty;
static cpumask_t mask_all;
+static cpumask_t mask_tmp;

static void test_cpumask_weight(struct kunit *test)
{
@@ -91,10 +105,15 @@ static void test_cpumask_iterators(struct kunit *test)
EXPECT_FOR_EACH_CPU_EQ(test, &mask_empty);
EXPECT_FOR_EACH_CPU_NOT_EQ(test, &mask_empty);
EXPECT_FOR_EACH_CPU_WRAP_EQ(test, &mask_empty);
+ EXPECT_FOR_EACH_CPU_OP_EQ(test, and, &mask_empty, &mask_empty);
+ EXPECT_FOR_EACH_CPU_OP_EQ(test, and, cpu_possible_mask, &mask_empty);
+ EXPECT_FOR_EACH_CPU_OP_EQ(test, andnot, &mask_empty, &mask_empty);

EXPECT_FOR_EACH_CPU_EQ(test, cpu_possible_mask);
EXPECT_FOR_EACH_CPU_NOT_EQ(test, cpu_possible_mask);
EXPECT_FOR_EACH_CPU_WRAP_EQ(test, cpu_possible_mask);
+ EXPECT_FOR_EACH_CPU_OP_EQ(test, and, cpu_possible_mask, cpu_possible_mask);
+ EXPECT_FOR_EACH_CPU_OP_EQ(test, andnot, cpu_possible_mask, &mask_empty);
}

static void test_cpumask_iterators_builtin(struct kunit *test)
--
2.31.1

2022-08-25 18:41:08

by Valentin Schneider

[permalink] [raw]
Subject: [PATCH v3 7/9] sched/topology: Introduce sched_numa_hop_mask()

Tariq has pointed out that drivers allocating IRQ vectors would benefit
from having smarter NUMA-awareness - cpumask_local_spread() only knows
about the local node and everything outside is in the same bucket.

sched_domains_numa_masks is pretty much what we want to hand out (a cpumask
of CPUs reachable within a given distance budget), introduce
sched_numa_hop_mask() to export those cpumasks.
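
A rough usage sketch (not from an in-tree user; 'node' is whatever node the
caller cares about):

	const struct cpumask *mask;
	int hops, cpu;

	rcu_read_lock();
	for (hops = 0; ; hops++) {
		mask = sched_numa_hop_mask(node, hops);
		if (IS_ERR_OR_NULL(mask))
			break;
		/* CPUs at most 'hops' hops away from 'node' */
		for_each_cpu(cpu, mask)
			pr_info("cpu%d is within %d hop(s) of node %d\n",
				cpu, hops, node);
	}
	rcu_read_unlock();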

Link: http://lore.kernel.org/r/[email protected]
Signed-off-by: Valentin Schneider <[email protected]>
---
include/linux/topology.h | 9 +++++++++
kernel/sched/topology.c | 28 ++++++++++++++++++++++++++++
2 files changed, 37 insertions(+)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 4564faafd0e1..13b82b83e547 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -245,5 +245,14 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
return cpumask_of_node(cpu_to_node(cpu));
}

+#ifdef CONFIG_NUMA
+extern const struct cpumask *sched_numa_hop_mask(int node, int hops);
+#else
+static inline const struct cpumask *sched_numa_hop_mask(int node, int hops)
+{
+ return ERR_PTR(-EOPNOTSUPP);
+}
+#endif /* CONFIG_NUMA */
+

#endif /* _LINUX_TOPOLOGY_H */
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8739c2a5a54e..f0236a0ae65c 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2067,6 +2067,34 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
return found;
}

+/**
+ * sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops away.
+ * @node: The node to count hops from.
+ * @hops: Include CPUs up to that many hops away. 0 means local node.
+ *
+ * Requires rcu_lock to be held. Returned cpumask is only valid within that
+ * read-side section, copy it if required beyond that.
+ *
+ * Note that not all hops are equal in size; see sched_init_numa() for how
+ * distances and masks are handled.
+ *
+ * Also note that this is a reflection of sched_domains_numa_masks, which may change
+ * during the lifetime of the system (offline nodes are taken out of the masks).
+ */
+const struct cpumask *sched_numa_hop_mask(int node, int hops)
+{
+ struct cpumask ***masks = rcu_dereference(sched_domains_numa_masks);
+
+ if (node >= nr_node_ids || hops >= sched_domains_numa_levels)
+ return ERR_PTR(-EINVAL);
+
+ if (!masks)
+ return NULL;
+
+ return masks[hops][node];
+}
+EXPORT_SYMBOL_GPL(sched_numa_hop_mask);
+
#endif /* CONFIG_NUMA */

static int __sdt_alloc(const struct cpumask *cpu_map)
--
2.31.1

2022-08-25 18:41:24

by Valentin Schneider

[permalink] [raw]
Subject: [PATCH v3 2/9] lib/test_cpumask: Make test_cpumask_last check for nr_cpu_ids bits

test_cpumask_last() currently fails on a system with
CONFIG_NR_CPUS=64
CONFIG_CPUMASK_OFFSTACK=n
nr_cpu_ids < NR_CPUS

[ 14.088853] # test_cpumask_last: EXPECTATION FAILED at lib/test_cpumask.c:77
[ 14.088853] Expected ((unsigned int)64) - 1 == cpumask_last(((const struct cpumask *)&__cpu_possible_mask)), but
[ 14.088853] ((unsigned int)64) - 1 == 63
[ 14.088853] cpumask_last(((const struct cpumask *)&__cpu_possible_mask)) == 3
[ 14.090435] not ok 3 - test_cpumask_last

Per smp.c::setup_nr_cpu_ids(), nr_cpu_ids <= NR_CPUS, so we want
the test to use nr_cpu_ids rather than nr_cpumask_bits.

Signed-off-by: Valentin Schneider <[email protected]>
---
lib/test_cpumask.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/test_cpumask.c b/lib/test_cpumask.c
index a31a1622f1f6..81b17563fcb3 100644
--- a/lib/test_cpumask.c
+++ b/lib/test_cpumask.c
@@ -73,8 +73,8 @@ static void test_cpumask_first(struct kunit *test)

static void test_cpumask_last(struct kunit *test)
{
- KUNIT_EXPECT_LE(test, nr_cpumask_bits, cpumask_last(&mask_empty));
- KUNIT_EXPECT_EQ(test, nr_cpumask_bits - 1, cpumask_last(cpu_possible_mask));
+ KUNIT_EXPECT_LE(test, nr_cpu_ids, cpumask_last(&mask_empty));
+ KUNIT_EXPECT_EQ(test, nr_cpu_ids - 1, cpumask_last(cpu_possible_mask));
}

static void test_cpumask_next(struct kunit *test)
--
2.31.1

2022-08-25 19:26:23

by Valentin Schneider

[permalink] [raw]
Subject: [PATCH v3 8/9] sched/topology: Introduce for_each_numa_hop_cpu()

The recently introduced sched_numa_hop_mask() exposes cpumasks of CPUs
reachable within a given distance budget, but this means each successive
cpumask is a superset of the previous one.

Code wanting to allocate one item per CPU (e.g. IRQs) at increasing
distances would thus need to allocate a temporary cpumask to note which
CPUs have already been visited. This can be prevented by leveraging
for_each_cpu_andnot() - package all that logic into one ugl^D fancy macro.
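
Conceptually (sketch only; do_something() is a placeholder), the macro walks
the hops like so, with the caller holding rcu_read_lock():

	const struct cpumask *prev = cpu_none_mask;
	const struct cpumask *curr = sched_numa_hop_mask(node, 0);
	int hops = 0, cpu;

	while (!IS_ERR_OR_NULL(curr)) {
		/* CPUs exactly 'hops' away: in curr but not in any smaller hop */
		for_each_cpu_andnot(cpu, curr, prev)
			do_something(cpu);
		prev = curr;
		curr = sched_numa_hop_mask(node, ++hops);
	}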

Signed-off-by: Valentin Schneider <[email protected]>
---
include/linux/topology.h | 37 +++++++++++++++++++++++++++++++++++++
1 file changed, 37 insertions(+)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 13b82b83e547..6c671dc3252c 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -254,5 +254,42 @@ static inline const struct cpumask *sched_numa_hop_mask(int node, int hops)
}
#endif /* CONFIG_NUMA */

+/**
+ * for_each_numa_hop_cpu - iterate over CPUs by increasing NUMA distance,
+ * starting from a given node.
+ * @cpu: the iteration variable.
+ * @node: the NUMA node to start the search from.
+ *
+ * Requires rcu_lock to be held.
+ * Careful: this is a double loop, 'break' won't work as expected.
+ *
+ *
+ * Implementation notes:
+ *
+ * Providing it is valid, the mask returned by
+ * sched_numa_hop_mask(node, hops+1)
+ * is a superset of the one returned by
+ * sched_numa_hop_mask(node, hops)
+ * which may not be that useful for drivers that try to spread things out and
+ * want to visit a CPU not more than once.
+ *
+ * To accommodate that, we use for_each_cpu_andnot() to iterate over the CPUs
+ * of sched_numa_hop_mask(node, hops+1) with the CPUs of
+ * sched_numa_hop_mask(node, hops) removed, IOW we only iterate over CPUs
+ * a given distance away (rather than *up to* a given distance).
+ *
+ * hops=0 forces us to play silly games: we pass cpu_none_mask to
+ * for_each_cpu_andnot(), which turns it into for_each_cpu().
+ */
+#define for_each_numa_hop_cpu(cpu, node) \
+ for (struct { const struct cpumask *curr, *prev; int hops; } __v = \
+ { sched_numa_hop_mask(node, 0), NULL, 0 }; \
+ !IS_ERR_OR_NULL(__v.curr); \
+ __v.hops++, \
+ __v.prev = __v.curr, \
+ __v.curr = sched_numa_hop_mask(node, __v.hops)) \
+ for_each_cpu_andnot(cpu, \
+ __v.curr, \
+ __v.hops ? __v.prev : cpu_none_mask)

#endif /* _LINUX_TOPOLOGY_H */
--
2.31.1

2022-08-25 19:29:15

by Valentin Schneider

[permalink] [raw]
Subject: [PATCH v3 3/9] bitops: Introduce find_next_andnot_bit()

In preparation of introducing for_each_cpu_andnot(), add a variant of
find_next_bit() that negates the bits in @addr2 when ANDing them with the
bits in @addr1.

Note that the _find_next_bit() @invert argument now gets split into two:
@invert1 for words in @addr1, @invert2 for words in @addr2. The only
current users of _find_next_bit() with @invert set are:
o find_next_zero_bit()
o find_next_zero_bit_le()
and neither of these pass an @addr2, so the conversion is straightforward.
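
As a small usage sketch of the new helper (the bitmaps below are made up):

	DECLARE_BITMAP(a, 8) = { 0xF0 };	/* bits 4-7 set */
	DECLARE_BITMAP(b, 8) = { 0x50 };	/* bits 4 and 6 set */

	/* a & ~b has bits 5 and 7 set, so this returns 5 */
	unsigned long bit = find_next_andnot_bit(a, b, 8, 0);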

Signed-off-by: Valentin Schneider <[email protected]>
---
include/linux/find.h | 44 ++++++++++++++++++++++++++++++++++++++------
lib/find_bit.c | 23 ++++++++++++-----------
2 files changed, 50 insertions(+), 17 deletions(-)

diff --git a/include/linux/find.h b/include/linux/find.h
index 424ef67d4a42..a195cf0a8bab 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -10,7 +10,8 @@

extern unsigned long _find_next_bit(const unsigned long *addr1,
const unsigned long *addr2, unsigned long nbits,
- unsigned long start, unsigned long invert, unsigned long le);
+ unsigned long start, unsigned long invert1, unsigned long invert2,
+ unsigned long le);
extern unsigned long _find_first_bit(const unsigned long *addr, unsigned long size);
extern unsigned long _find_first_and_bit(const unsigned long *addr1,
const unsigned long *addr2, unsigned long size);
@@ -41,7 +42,7 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
return val ? __ffs(val) : size;
}

- return _find_next_bit(addr, NULL, size, offset, 0UL, 0);
+ return _find_next_bit(addr, NULL, size, offset, 0UL, 0UL, 0);
}
#endif

@@ -71,7 +72,38 @@ unsigned long find_next_and_bit(const unsigned long *addr1,
return val ? __ffs(val) : size;
}

- return _find_next_bit(addr1, addr2, size, offset, 0UL, 0);
+ return _find_next_bit(addr1, addr2, size, offset, 0UL, 0UL, 0);
+}
+#endif
+
+#ifndef find_next_andnot_bit
+/**
+ * find_next_andnot_bit - find the next bit in *addr1 excluding all the bits
+ * in *addr2
+ * @addr1: The first address to base the search on
+ * @addr2: The second address to base the search on
+ * @size: The bitmap size in bits
+ * @offset: The bitnumber to start searching at
+ *
+ * Returns the bit number for the next set bit
+ * If no bits are set, returns @size.
+ */
+static inline
+unsigned long find_next_andnot_bit(const unsigned long *addr1,
+ const unsigned long *addr2, unsigned long size,
+ unsigned long offset)
+{
+ if (small_const_nbits(size)) {
+ unsigned long val;
+
+ if (unlikely(offset >= size))
+ return size;
+
+ val = *addr1 & ~*addr2 & GENMASK(size - 1, offset);
+ return val ? __ffs(val) : size;
+ }
+
+ return _find_next_bit(addr1, addr2, size, offset, 0UL, ~0UL, 0);
}
#endif

@@ -99,7 +131,7 @@ unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
return val == ~0UL ? size : ffz(val);
}

- return _find_next_bit(addr, NULL, size, offset, ~0UL, 0);
+ return _find_next_bit(addr, NULL, size, offset, ~0UL, 0UL, 0);
}
#endif

@@ -247,7 +279,7 @@ unsigned long find_next_zero_bit_le(const void *addr, unsigned
return val == ~0UL ? size : ffz(val);
}

- return _find_next_bit(addr, NULL, size, offset, ~0UL, 1);
+ return _find_next_bit(addr, NULL, size, offset, ~0UL, 0UL, 1);
}
#endif

@@ -266,7 +298,7 @@ unsigned long find_next_bit_le(const void *addr, unsigned
return val ? __ffs(val) : size;
}

- return _find_next_bit(addr, NULL, size, offset, 0UL, 1);
+ return _find_next_bit(addr, NULL, size, offset, 0UL, 0UL, 1);
}
#endif

diff --git a/lib/find_bit.c b/lib/find_bit.c
index 1b8e4b2a9cba..c46b66d7d2b4 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -21,27 +21,29 @@

#if !defined(find_next_bit) || !defined(find_next_zero_bit) || \
!defined(find_next_bit_le) || !defined(find_next_zero_bit_le) || \
- !defined(find_next_and_bit)
+ !defined(find_next_and_bit) || !defined(find_next_andnot_bit)
/*
* This is a common helper function for find_next_bit, find_next_zero_bit, and
* find_next_and_bit. The differences are:
- * - The "invert" argument, which is XORed with each fetched word before
- * searching it for one bits.
* - The optional "addr2", which is anded with "addr1" if present.
+ * - The "invert" arguments, which are XORed with each fetched word (invert1
+ * for words in addr1, invert2 for those in addr2) before searching it for
+ * one bits.
*/
unsigned long _find_next_bit(const unsigned long *addr1,
- const unsigned long *addr2, unsigned long nbits,
- unsigned long start, unsigned long invert, unsigned long le)
+ const unsigned long *addr2,
+ unsigned long nbits, unsigned long start,
+ unsigned long invert1, unsigned long invert2,
+ unsigned long le)
{
unsigned long tmp, mask;

if (unlikely(start >= nbits))
return nbits;

- tmp = addr1[start / BITS_PER_LONG];
+ tmp = addr1[start / BITS_PER_LONG] ^ invert1;
if (addr2)
- tmp &= addr2[start / BITS_PER_LONG];
- tmp ^= invert;
+ tmp &= addr2[start / BITS_PER_LONG] ^ invert2;

/* Handle 1st word. */
mask = BITMAP_FIRST_WORD_MASK(start);
@@ -57,10 +59,9 @@ unsigned long _find_next_bit(const unsigned long *addr1,
if (start >= nbits)
return nbits;

- tmp = addr1[start / BITS_PER_LONG];
+ tmp = addr1[start / BITS_PER_LONG] ^ invert1;
if (addr2)
- tmp &= addr2[start / BITS_PER_LONG];
- tmp ^= invert;
+ tmp &= addr2[start / BITS_PER_LONG] ^ invert2;
}

if (le)
--
2.31.1

2022-08-25 21:03:00

by Yury Norov

[permalink] [raw]
Subject: Re: [PATCH v3 1/9] cpumask: Make cpumask_full() check for nr_cpu_ids bits

+ Sander Vanheule

On Thu, Aug 25, 2022 at 07:12:02PM +0100, Valentin Schneider wrote:
> Consider a system with 4 CPUs and:
> CONFIG_NR_CPUS=64
> CONFIG_CPUMASK_OFFSTACK=n
>
> In this situation, we have:
> nr_cpumask_bits == NR_CPUS == 64
> nr_cpu_ids = 4
>
> Per smp.c::setup_nr_cpu_ids(), nr_cpu_ids <= NR_CPUS, so we want
> cpumask_full() to check for nr_cpu_ids bits set.
>
> This issue is currently pointed out by the cpumask KUnit tests:
>
> [ 14.072028] # test_cpumask_weight: EXPECTATION FAILED at lib/test_cpumask.c:57
> [ 14.072028] Expected cpumask_full(((const struct cpumask *)&__cpu_possible_mask)) to be true, but is false
> [ 14.079333] not ok 1 - test_cpumask_weight
>
> Signed-off-by: Valentin Schneider <[email protected]>

It's really a puzzle, and some of my thoughts are below. So.

This is a question of why we need nr_cpumask_bits while we already
have nr_cpu_ids. When OFFSTACK is ON, they are obviously the same.
When it's off, nr_cpumask_bits is an alias for NR_CPUS.

I tried to wire the nr_cpumask_bits to nr_cpu_ids unconditionally, and
it works even when OFFSTACK is OFF, no surprises.

I didn't find any discussions describing why we need nr_cpumask_bits,
and the code adding it dates back a very long time.

If I alias nr_cpumask_bits to nr_cpu_ids unconditionally on my VM with
NR_CPUS == 256 and nr_cpu_ids == 4, there's obviously a clear win in
performance, but the Image size gets 2.5K bigger. Probably that's the
reason nr_cpumask_bits was needed...

There's also a very old misleading comment in cpumask.h:

* If HOTPLUG is enabled, then cpu_possible_mask is forced to have
* all NR_CPUS bits set, otherwise it is just the set of CPUs that
* ACPI reports present at boot.

It lies, and I checked with x86_64 that cpu_possible_mask is populated
during boot time with 0b1111, if I create a 4-cpu VM. Hence, the
nr_cpu_ids is 4, while NR_CPUS == 256.

Interestingly, there's not a single user of cpumask_full(),
obviously because it's broken. This is really broken dead code.

Now that we have a test that checks sanity of cpumasks, this mess
popped up.

Your fix doesn't look correct, because it fixes one function and
doesn't touch others. For example, cpumask_subset() may fail
if src1p has set bits after nr_cpu_ids, while cpumask_full()
will be returning true.

In -next, there is an update from Sander for the cpumask test that
removes this check, and probably if you rebase on top of -next, you
can drop this and 2nd patch of your series.

What about a proper fix? I think that a long time ago we didn't have
ACPI tables for possible cpus, and didn't populate cpumask_possible
from that, so the

#define nr_cpumask_bits NR_CPUS

worked well. Now that we have cpumask_possible partially filled,
we have to always

#define nr_cpumask_bits nr_cpu_ids

and pay +2.5K price in size even if OFFSTACK is OFF. At least, it wins
at runtime...

Any thoughts?

---
diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 5e2b10fb4975..0f044d93ad01 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -41,13 +41,7 @@ typedef struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t;
extern unsigned int nr_cpu_ids;
#endif

-#ifdef CONFIG_CPUMASK_OFFSTACK
-/* Assuming NR_CPUS is huge, a runtime limit is more efficient. Also,
- * not all bits may be allocated. */
#define nr_cpumask_bits nr_cpu_ids
-#else
-#define nr_cpumask_bits ((unsigned int)NR_CPUS)
-#endif

/*
* The following particular system cpumasks and operations manage
@@ -67,10 +61,6 @@ extern unsigned int nr_cpu_ids;
* cpu_online_mask is the dynamic subset of cpu_present_mask,
* indicating those CPUs available for scheduling.
*
- * If HOTPLUG is enabled, then cpu_possible_mask is forced to have
- * all NR_CPUS bits set, otherwise it is just the set of CPUs that
- * ACPI reports present at boot.
- *
* If HOTPLUG is enabled, then cpu_present_mask varies dynamically,
* depending on what ACPI reports as currently plugged in, otherwise
* cpu_present_mask is just a copy of cpu_possible_mask.

2022-08-25 21:33:43

by Yury Norov

[permalink] [raw]
Subject: Re: [PATCH v3 4/9] cpumask: Introduce for_each_cpu_andnot()

On Thu, Aug 25, 2022 at 07:12:05PM +0100, Valentin Schneider wrote:
> for_each_cpu_and() is very convenient as it saves having to allocate a
> temporary cpumask to store the result of cpumask_and(). The same issue
> applies to cpumask_andnot() which doesn't actually need temporary storage
> for iteration purposes.
>
> Following what has been done for for_each_cpu_and(), introduce
> for_each_cpu_andnot().
>
> Signed-off-by: Valentin Schneider <[email protected]>
> ---
> include/linux/cpumask.h | 39 +++++++++++++++++++++++++++++++++++++++
> 1 file changed, 39 insertions(+)
>
> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> index 1414ce8cd003..372a642bf9ba 100644
> --- a/include/linux/cpumask.h
> +++ b/include/linux/cpumask.h
> @@ -238,6 +238,25 @@ unsigned int cpumask_next_and(int n, const struct cpumask *src1p,
> nr_cpumask_bits, n + 1);
> }
>
> +/**
> + * cpumask_next_andnot - get the next cpu in *src1p & ~*src2p
> + * @n: the cpu prior to the place to search (ie. return will be > @n)
> + * @src1p: the first cpumask pointer
> + * @src2p: the second cpumask pointer
> + *
> + * Returns >= nr_cpu_ids if no further cpus set in *src1p & ~*src2p
> + */
> +static inline
> +unsigned int cpumask_next_andnot(int n, const struct cpumask *src1p,
> + const struct cpumask *src2p)
> +{
> + /* -1 is a legal arg here. */
> + if (n != -1)
> + cpumask_check(n);
> + return find_next_andnot_bit(cpumask_bits(src1p), cpumask_bits(src2p),
> + nr_cpumask_bits, n + 1);
> +}
> +
> /**
> * for_each_cpu - iterate over every cpu in a mask
> * @cpu: the (optionally unsigned) integer iterator
> @@ -317,6 +336,26 @@ unsigned int __pure cpumask_next_wrap(int n, const struct cpumask *mask, int sta
> (cpu) = cpumask_next_and((cpu), (mask1), (mask2)), \
> (cpu) < nr_cpu_ids;)
>
> +/**
> + * for_each_cpu_andnot - iterate over every cpu present in one mask, excluding
> + * those present in another.
> + * @cpu: the (optionally unsigned) integer iterator
> + * @mask1: the first cpumask pointer
> + * @mask2: the second cpumask pointer
> + *
> + * This saves a temporary CPU mask in many places. It is equivalent to:
> + * struct cpumask tmp;
> + * cpumask_andnot(&tmp, &mask1, &mask2);
> + * for_each_cpu(cpu, &tmp)
> + * ...
> + *
> + * After the loop, cpu is >= nr_cpu_ids.
> + */
> +#define for_each_cpu_andnot(cpu, mask1, mask2) \
> + for ((cpu) = -1; \
> + (cpu) = cpumask_next_andnot((cpu), (mask1), (mask2)), \
> + (cpu) < nr_cpu_ids;)

The standard doesn't guarantee the order of execution of last 2 lines,
so you might end up with unreliable code. Can you do it in a more
conventional style:
#define for_each_cpu_andnot(cpu, mask1, mask2) \
for ((cpu) = cpumask_next_andnot(-1, (mask1), (mask2)); \
(cpu) < nr_cpu_ids; \
(cpu) = cpumask_next_andnot((cpu), (mask1), (mask2)))

> +
> /**
> * cpumask_any_but - return a "random" in a cpumask, but not this one.
> * @mask: the cpumask to search
> --
> 2.31.1

2022-08-25 21:45:52

by Yury Norov

[permalink] [raw]
Subject: Re: [PATCH v3 6/9] sched/core: Merge cpumask_andnot()+for_each_cpu() into for_each_cpu_andnot()

On Thu, Aug 25, 2022 at 07:12:07PM +0100, Valentin Schneider wrote:
> This removes the second use of the sched_core_mask temporary mask.
>
> Signed-off-by: Valentin Schneider <[email protected]>

Suggested-by: Yury Norov <[email protected]>

> ---
> kernel/sched/core.c | 5 +----
> 1 file changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ee28253c9ac0..b4c3112b0095 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -360,10 +360,7 @@ static void __sched_core_flip(bool enabled)
> /*
> * Toggle the offline CPUs.
> */
> - cpumask_copy(&sched_core_mask, cpu_possible_mask);
> - cpumask_andnot(&sched_core_mask, &sched_core_mask, cpu_online_mask);
> -
> - for_each_cpu(cpu, &sched_core_mask)
> + for_each_cpu_andnot(cpu, cpu_possible_mask, cpu_online_mask)
> cpu_rq(cpu)->core_enabled = enabled;
>
> cpus_read_unlock();
> --
> 2.31.1

2022-08-25 21:46:33

by Yury Norov

[permalink] [raw]
Subject: Re: [PATCH v3 3/9] bitops: Introduce find_next_andnot_bit()

On Thu, Aug 25, 2022 at 07:12:04PM +0100, Valentin Schneider wrote:
> In preparation of introducing for_each_cpu_andnot(), add a variant of
> find_next_bit() that negate the bits in @addr2 when ANDing them with the
> bits in @addr1.
>
> Note that the _find_next_bit() @invert argument now gets split into two:
> @invert1 for words in @addr1, @invert2 for words in @addr2. The only
> current users of _find_next_bit() with @invert set are:
> o find_next_zero_bit()
> o find_next_zero_bit_le()
> and neither of these pass an @addr2, so the conversion is straightforward.
>
> Signed-off-by: Valentin Schneider <[email protected]>

Have you seen this series?
https://lore.kernel.org/lkml/YwaXvphVpy5A7fSs@yury-laptop/t/

It will significantly simplify your work for adding the
find_next_andnot_bit(). If you wait a week or so, you'll most likely
find it in -next.

Thanks,
Yury

> ---
> include/linux/find.h | 44 ++++++++++++++++++++++++++++++++++++++------
> lib/find_bit.c | 23 ++++++++++++-----------
> 2 files changed, 50 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/find.h b/include/linux/find.h
> index 424ef67d4a42..a195cf0a8bab 100644
> --- a/include/linux/find.h
> +++ b/include/linux/find.h
> @@ -10,7 +10,8 @@
>
> extern unsigned long _find_next_bit(const unsigned long *addr1,
> const unsigned long *addr2, unsigned long nbits,
> - unsigned long start, unsigned long invert, unsigned long le);
> + unsigned long start, unsigned long invert1, unsigned long invert2,
> + unsigned long le);
> extern unsigned long _find_first_bit(const unsigned long *addr, unsigned long size);
> extern unsigned long _find_first_and_bit(const unsigned long *addr1,
> const unsigned long *addr2, unsigned long size);
> @@ -41,7 +42,7 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
> return val ? __ffs(val) : size;
> }
>
> - return _find_next_bit(addr, NULL, size, offset, 0UL, 0);
> + return _find_next_bit(addr, NULL, size, offset, 0UL, 0UL, 0);
> }
> #endif
>
> @@ -71,7 +72,38 @@ unsigned long find_next_and_bit(const unsigned long *addr1,
> return val ? __ffs(val) : size;
> }
>
> - return _find_next_bit(addr1, addr2, size, offset, 0UL, 0);
> + return _find_next_bit(addr1, addr2, size, offset, 0UL, 0UL, 0);
> +}
> +#endif
> +
> +#ifndef find_next_andnot_bit
> +/**
> + * find_next_andnot_bit - find the next bit in *addr1 excluding all the bits
> + * in *addr2
> + * @addr1: The first address to base the search on
> + * @addr2: The second address to base the search on
> + * @size: The bitmap size in bits
> + * @offset: The bitnumber to start searching at
> + *
> + * Returns the bit number for the next set bit
> + * If no bits are set, returns @size.
> + */
> +static inline
> +unsigned long find_next_andnot_bit(const unsigned long *addr1,
> + const unsigned long *addr2, unsigned long size,
> + unsigned long offset)
> +{
> + if (small_const_nbits(size)) {
> + unsigned long val;
> +
> + if (unlikely(offset >= size))
> + return size;
> +
> + val = *addr1 & ~*addr2 & GENMASK(size - 1, offset);
> + return val ? __ffs(val) : size;
> + }
> +
> + return _find_next_bit(addr1, addr2, size, offset, 0UL, ~0UL, 0);
> }
> #endif
>
> @@ -99,7 +131,7 @@ unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
> return val == ~0UL ? size : ffz(val);
> }
>
> - return _find_next_bit(addr, NULL, size, offset, ~0UL, 0);
> + return _find_next_bit(addr, NULL, size, offset, ~0UL, 0UL, 0);
> }
> #endif
>
> @@ -247,7 +279,7 @@ unsigned long find_next_zero_bit_le(const void *addr, unsigned
> return val == ~0UL ? size : ffz(val);
> }
>
> - return _find_next_bit(addr, NULL, size, offset, ~0UL, 1);
> + return _find_next_bit(addr, NULL, size, offset, ~0UL, 0UL, 1);
> }
> #endif
>
> @@ -266,7 +298,7 @@ unsigned long find_next_bit_le(const void *addr, unsigned
> return val ? __ffs(val) : size;
> }
>
> - return _find_next_bit(addr, NULL, size, offset, 0UL, 1);
> + return _find_next_bit(addr, NULL, size, offset, 0UL, 0UL, 1);
> }
> #endif
>
> diff --git a/lib/find_bit.c b/lib/find_bit.c
> index 1b8e4b2a9cba..c46b66d7d2b4 100644
> --- a/lib/find_bit.c
> +++ b/lib/find_bit.c
> @@ -21,27 +21,29 @@
>
> #if !defined(find_next_bit) || !defined(find_next_zero_bit) || \
> !defined(find_next_bit_le) || !defined(find_next_zero_bit_le) || \
> - !defined(find_next_and_bit)
> + !defined(find_next_and_bit) || !defined(find_next_andnot_bit)
> /*
> * This is a common helper function for find_next_bit, find_next_zero_bit, and
> * find_next_and_bit. The differences are:
> - * - The "invert" argument, which is XORed with each fetched word before
> - * searching it for one bits.
> * - The optional "addr2", which is anded with "addr1" if present.
> + * - The "invert" arguments, which are XORed with each fetched word (invert1
> + * for words in addr1, invert2 for those in addr2) before searching it for
> + * one bits.
> */
> unsigned long _find_next_bit(const unsigned long *addr1,
> - const unsigned long *addr2, unsigned long nbits,
> - unsigned long start, unsigned long invert, unsigned long le)
> + const unsigned long *addr2,
> + unsigned long nbits, unsigned long start,
> + unsigned long invert1, unsigned long invert2,
> + unsigned long le)
> {
> unsigned long tmp, mask;
>
> if (unlikely(start >= nbits))
> return nbits;
>
> - tmp = addr1[start / BITS_PER_LONG];
> + tmp = addr1[start / BITS_PER_LONG] ^ invert1;
> if (addr2)
> - tmp &= addr2[start / BITS_PER_LONG];
> - tmp ^= invert;
> + tmp &= addr2[start / BITS_PER_LONG] ^ invert2;
>
> /* Handle 1st word. */
> mask = BITMAP_FIRST_WORD_MASK(start);
> @@ -57,10 +59,9 @@ unsigned long _find_next_bit(const unsigned long *addr1,
> if (start >= nbits)
> return nbits;
>
> - tmp = addr1[start / BITS_PER_LONG];
> + tmp = addr1[start / BITS_PER_LONG] ^ invert1;
> if (addr2)
> - tmp &= addr2[start / BITS_PER_LONG];
> - tmp ^= invert;
> + tmp &= addr2[start / BITS_PER_LONG] ^ invert2;
> }
>
> if (le)
> --
> 2.31.1

2022-08-25 23:24:08

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH v3 6/9] sched/core: Merge cpumask_andnot()+for_each_cpu() into for_each_cpu_andnot()

On 25/08/22 14:16, Yury Norov wrote:
> On Thu, Aug 25, 2022 at 07:12:07PM +0100, Valentin Schneider wrote:
>> This removes the second use of the sched_core_mask temporary mask.
>>
>> Signed-off-by: Valentin Schneider <[email protected]>
>
> Suggested-by: Yury Norov <[email protected]>
>

Indeed, forgot that one, sorry!

2022-08-26 00:09:03

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH v3 3/9] bitops: Introduce find_next_andnot_bit()

On 25/08/22 14:05, Yury Norov wrote:
> On Thu, Aug 25, 2022 at 07:12:04PM +0100, Valentin Schneider wrote:
>> In preparation of introducing for_each_cpu_andnot(), add a variant of
>> find_next_bit() that negate the bits in @addr2 when ANDing them with the
>> bits in @addr1.
>>
>> Note that the _find_next_bit() @invert argument now gets split into two:
>> @invert1 for words in @addr1, @invert2 for words in @addr2. The only
>> current users of _find_next_bit() with @invert set are:
>> o find_next_zero_bit()
>> o find_next_zero_bit_le()
>> and neither of these pass an @addr2, so the conversion is straightforward.
>>
>> Signed-off-by: Valentin Schneider <[email protected]>
>
> Have you seen this series?
> https://lore.kernel.org/lkml/YwaXvphVpy5A7fSs@yury-laptop/t/
>
> It will significantly simplify your work for adding the
> find_next_andnot_bit(). If you wait a week or so, you'll most likely
> find it in -next.
>

I hadn't seen it, it does look promising. I'm about to disappear for a
week, so I'll have a look once I'm back.

> Thanks,
> Yury

2022-08-28 08:56:48

by Sander Vanheule

[permalink] [raw]
Subject: Re: [PATCH v3 1/9] cpumask: Make cpumask_full() check for nr_cpu_ids bits

Hi Yury, Valentin,

On Thu, 2022-08-25 at 13:49 -0700, Yury Norov wrote:
> + Sander Vanheule
>
> On Thu, Aug 25, 2022 at 07:12:02PM +0100, Valentin Schneider wrote:
> > Consider a system with 4 CPUs and:
> >   CONFIG_NR_CPUS=64
> >   CONFIG_CPUMASK_OFFSTACK=n
> >
> > In this situation, we have:
> >   nr_cpumask_bits == NR_CPUS == 64
> >   nr_cpu_ids = 4
> >
> > Per smp.c::setup_nr_cpu_ids(), nr_cpu_ids <= NR_CPUS, so we want
> > cpumask_full() to check for nr_cpu_ids bits set.
> >
> > This issue is currently pointed out by the cpumask KUnit tests:
> >
> >   [   14.072028]     # test_cpumask_weight: EXPECTATION FAILED at
> > lib/test_cpumask.c:57
> >   [   14.072028]     Expected cpumask_full(((const struct cpumask
> > *)&__cpu_possible_mask)) to be true, but is false
> >   [   14.079333]     not ok 1 - test_cpumask_weight
> >
> > Signed-off-by: Valentin Schneider <[email protected]>
>
> It's really a puzzle, and some of my thoughts are below. So.
>
> This is a question what for we need nr_cpumask_bits while we already
> have nr_cpu_ids. When OFFSTACK is ON, they are obviously the same.
> When it's of - the nr_cpumask_bits is an alias to NR_CPUS.
>
> I tried to wire the nr_cpumask_bits to nr_cpu_ids unconditionally, and
> it works even when OFFSTACK is OFF, no surprises.
>
> I didn't find any discussions describing what for we need nr_cpumask_bits,
> and the code adding it dates to a very long ago.
>
> If I alias nr_cpumask_bits to nr_cpu_ids unconditionally on my VM with
> NR_CPUs == 256 and nr_cpu_ids == 4, there's obviously a clear win in
> performance, but the Image size gets 2.5K bigger. Probably that's the
> reason for what nr_cpumask_bits was needed...

I think it makes sense to have a compile-time-constant value for nr_cpumask_bits
in some cases. For example on embedded platforms, where every opportunity to
save a few kB should be used, or cases where NR_CPUS <= BITS_PER_LONG.

>
> There's also a very old misleading comment in cpumask.h:
>
>  *  If HOTPLUG is enabled, then cpu_possible_mask is forced to have
>  *  all NR_CPUS bits set, otherwise it is just the set of CPUs that
>  *  ACPI reports present at boot.
>
> It lies, and I checked with x86_64 that cpu_possible_mask is populated
> during boot time with 0b1111, if I create a 4-cpu VM. Hence, the
> nr_cpu_ids is 4, while NR_CPUS == 256.
>
> Interestingly, there's no a single user of the cpumask_full(),
> obviously, because it's broken. This is really a broken dead code.
>
> Now that we have a test that checks sanity of cpumasks, this mess
> popped up.
>
> Your fix doesn't look correct, because it fixes one function, and
> doesn't touch others. For example, the cpumask subset() may fail
> if src1p will have set bits after nr_cpu_ids, while cpumask_full()
> will be returning true.

It appears the documentation for cpumask_full() is also incorrect, because it
claims to check if all CPUs < nr_cpu_ids are set. Meanwhile, the implementation
checks if all CPUs < nr_cpumask_bits are set.

cpumask_weight() has a similar issue, and maybe also other cpumask_*() functions
(I didn't check in detail yet).

>
> In -next, there is an update from Sander for the cpumask test that
> removes this check, and probably if you rebase on top of -next, you
> can drop this and 2nd patch of your series.
>
> What about proper fix? I think that a long time ago we didn't have
> ACPI tables for possible cpus, and didn't populate cpumask_possible
> from that, so the
>
>         #define nr_cpumask_bits NR_CPUS
>
> worked well. Now that we have cpumask_possible partially filled,
> we have to always
>
>         #define nr_cpumask_bits nr_cpu_ids
>
> and pay +2.5K price in size even if OFFSTACK is OFF. At least, it wins
> at runtime...
>
> Any thoughts?

It looks like both nr_cpumask_bits and nr_cpu_ids are used in a number of places
outside of lib/cpumask.c. Documentation for cpumask_*() functions almost always
refers to nr_cpu_ids as a highest valid value.

Perhaps nr_cpumask_bits should become a variable for internal cpumask usage,
and external users should only use nr_cpu_ids? The changes in 6.0 are my first
real interaction with cpumask, so it's possible that there are things I'm
missing here.

That being said, some of the cpumask tests compare results to nr_cpumask_bits,
so those should then probably be fixed to compare against nr_cpu_ids instead.

Best,
Sander

>
> ---
> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> index 5e2b10fb4975..0f044d93ad01 100644
> --- a/include/linux/cpumask.h
> +++ b/include/linux/cpumask.h
> @@ -41,13 +41,7 @@ typedef struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); }
> cpumask_t;
>  extern unsigned int nr_cpu_ids;
>  #endif
>  
> -#ifdef CONFIG_CPUMASK_OFFSTACK
> -/* Assuming NR_CPUS is huge, a runtime limit is more efficient.  Also,
> - * not all bits may be allocated. */
>  #define nr_cpumask_bits        nr_cpu_ids
> -#else
> -#define nr_cpumask_bits        ((unsigned int)NR_CPUS)
> -#endif
>  
>  /*
>   * The following particular system cpumasks and operations manage
> @@ -67,10 +61,6 @@ extern unsigned int nr_cpu_ids;
>   *  cpu_online_mask is the dynamic subset of cpu_present_mask,
>   *  indicating those CPUs available for scheduling.
>   *
> - *  If HOTPLUG is enabled, then cpu_possible_mask is forced to have
> - *  all NR_CPUS bits set, otherwise it is just the set of CPUs that
> - *  ACPI reports present at boot.
> - *
>   *  If HOTPLUG is enabled, then cpu_present_mask varies dynamically,
>   *  depending on what ACPI reports as currently plugged in, otherwise
>   *  cpu_present_mask is just a copy of cpu_possible_mask.

2022-08-28 16:47:04

by Yury Norov

[permalink] [raw]
Subject: Re: [PATCH v3 1/9] cpumask: Make cpumask_full() check for nr_cpu_ids bits

On Sun, Aug 28, 2022 at 10:35:38AM +0200, Sander Vanheule wrote:

...

> > It's really a puzzle, and some of my thoughts are below. So.
> >
> > This is a question what for we need nr_cpumask_bits while we already
> > have nr_cpu_ids. When OFFSTACK is ON, they are obviously the same.
> > When it's of - the nr_cpumask_bits is an alias to NR_CPUS.
> >
> > I tried to wire the nr_cpumask_bits to nr_cpu_ids unconditionally, and
> > it works even when OFFSTACK is OFF, no surprises.
> >
> > I didn't find any discussions describing what for we need nr_cpumask_bits,
> > and the code adding it dates to a very long ago.
> >
> > If I alias nr_cpumask_bits to nr_cpu_ids unconditionally on my VM with
> > NR_CPUs == 256 and nr_cpu_ids == 4, there's obviously a clear win in
> > performance, but the Image size gets 2.5K bigger. Probably that's the
> > reason for what nr_cpumask_bits was needed...
>
> I think it makes sense to have a compile-time-constant value for nr_cpumask_bits
> in some cases. For example on embedded platforms, where every opportunity to
> save a few kB should be used, or cases where NR_CPUS <= BITS_PER_LONG.
>
> >
> > There's also a very old misleading comment in cpumask.h:
> >
> >  *  If HOTPLUG is enabled, then cpu_possible_mask is forced to have
> >  *  all NR_CPUS bits set, otherwise it is just the set of CPUs that
> >  *  ACPI reports present at boot.
> >
> > It lies, and I checked with x86_64 that cpu_possible_mask is populated
> > during boot time with 0b1111, if I create a 4-cpu VM. Hence, the
> > nr_cpu_ids is 4, while NR_CPUS == 256.
> >
> > Interestingly, there's no a single user of the cpumask_full(),
> > obviously, because it's broken. This is really a broken dead code.
> >
> > Now that we have a test that checks sanity of cpumasks, this mess
> > popped up.
> >
> > Your fix doesn't look correct, because it fixes one function, and
> > doesn't touch others. For example, the cpumask subset() may fail
> > if src1p will have set bits after nr_cpu_ids, while cpumask_full()
> > will be returning true.
>
> It appears the documentation for cpumask_full() is also incorrect, because it
> claims to check if all CPUs < nr_cpu_ids are set. Meanwhile, the implementation
> checks if all CPUs < nr_cpumask_bits are set.
>
> cpumask_weight() has a similar issue, and maybe also other cpumask_*() functions
> (I didn't check in detail yet).
>
> >
> > In -next, there is an update from Sander for the cpumask test that
> > removes this check, and probably if you rebase on top of -next, you
> > can drop this and 2nd patch of your series.
> >
> > What about proper fix? I think that a long time ago we didn't have
> > ACPI tables for possible cpus, and didn't populate cpumask_possible
> > from that, so the
> >
> >         #define nr_cpumask_bits NR_CPUS
> >
> > worked well. Now that we have cpumask_possible partially filled,
> > we have to always
> >
> >         #define nr_cpumask_bits nr_cpu_ids
> >
> > and pay +2.5K price in size even if OFFSTACK is OFF. At least, it wins
> > at runtime...
> >
> > Any thoughts?
>
> It looks like both nr_cpumask_bits and nr_cpu_ids are used in a number of places
> outside of lib/cpumask.c. Documentation for cpumask_*() functions almost always
> refers to nr_cpu_ids as a highest valid value.
>
> Perhaps nr_cpumask_bits should become an variable for internal cpumask usage,
> and external users should only use nr_cpu_ids? The changes in 6.0 are my first
> real interaction with cpumask, so it's possible that there are things I'm
> missing here.
>
> That being said, some of the cpumask tests compare results to nr_cpumask_bits,
> so those should then probably be fixed to compare against nr_cpu_ids instead.

Aha, and it kills me how we have such a mess in a very core subsystem.

We have 3 problems here:
- mess with nr_cpumask_bits and nr_cpu_ids;
- inefficiency of cpumask routines when nr_cpumask_bits > nr_cpu_ids;
- runtime nature of nr_cpu_ids, even for those embedded systems with
tight memory constraints. So that if we just drop nr_cpumask_bits,
it will add 2.5K to the Image.

I think that dropping nr_cpumask_bits is our only choice, and to avoid
Image bloating for embedded users, we can hint the kernel that NR_CPUS
is an exact number, so that it will skip setting it at runtime.

I added an EXACT_NR_CPUS option for this, which works like this:

#if (NR_CPUS == 1) || defined(CONFIG_EXACT_NR_CPUS)
#define nr_cpu_ids ((unsigned int)NR_CPUS)
#else
extern unsigned int nr_cpu_ids;
#endif

/* Deprecated */
#define nr_cpumask_bits nr_cpu_ids

I tried it with an arm64 4-CPU build. When EXACT_NR_CPUS is enabled,
the difference is:
add/remove: 3/4 grow/shrink: 46/729 up/down: 652/-46952 (-46300)
Total: Before=25670945, After=25624645, chg -0.18%

Looks quite impressive to me. I'll send a patch soon.

Thanks,
Yury

2022-09-05 10:07:19

by Tariq Toukan

[permalink] [raw]
Subject: Re: [PATCH v3 8/9] sched/topology: Introduce for_each_numa_hop_cpu()



On 8/25/2022 9:12 PM, Valentin Schneider wrote:
> The recently introduced sched_numa_hop_mask() exposes cpumasks of CPUs
> reachable within a given distance budget, but this means each successive
> cpumask is a superset of the previous one.
>
> Code wanting to allocate one item per CPU (e.g. IRQs) at increasing
> distances would thus need to allocate a temporary cpumask to note which
> CPUs have already been visited. This can be prevented by leveraging
> for_each_cpu_andnot() - package all that logic into one ugl^D fancy macro.
>
> Signed-off-by: Valentin Schneider <[email protected]>
> ---
> include/linux/topology.h | 37 +++++++++++++++++++++++++++++++++++++
> 1 file changed, 37 insertions(+)
>
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index 13b82b83e547..6c671dc3252c 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -254,5 +254,42 @@ static inline const struct cpumask *sched_numa_hop_mask(int node, int hops)
> }
> #endif /* CONFIG_NUMA */
>
> +/**
> + * for_each_numa_hop_cpu - iterate over CPUs by increasing NUMA distance,
> + * starting from a given node.
> + * @cpu: the iteration variable.
> + * @node: the NUMA node to start the search from.
> + *
> + * Requires rcu_lock to be held.
> + * Careful: this is a double loop, 'break' won't work as expected.
> + *
> + *
> + * Implementation notes:
> + *
> + * Providing it is valid, the mask returned by
> + * sched_numa_hop_mask(node, hops+1)
> + * is a superset of the one returned by
> + * sched_numa_hop_mask(node, hops)
> + * which may not be that useful for drivers that try to spread things out and
> + * want to visit a CPU not more than once.
> + *
> + * To accommodate for that, we use for_each_cpu_andnot() to iterate over the cpus
> + * of sched_numa_hop_mask(node, hops+1) with the CPUs of
> + * sched_numa_hop_mask(node, hops) removed, IOW we only iterate over CPUs
> + * a given distance away (rather than *up to* a given distance).
> + *
> + * hops=0 forces us to play silly games: we pass cpu_none_mask to
> + * for_each_cpu_andnot(), which turns it into for_each_cpu().
> + */
> +#define for_each_numa_hop_cpu(cpu, node) \
> + for (struct { const struct cpumask *curr, *prev; int hops; } __v = \
> + { sched_numa_hop_mask(node, 0), NULL, 0 }; \
> + !IS_ERR_OR_NULL(__v.curr); \
> + __v.hops++, \
> + __v.prev = __v.curr, \
> + __v.curr = sched_numa_hop_mask(node, __v.hops)) \
> + for_each_cpu_andnot(cpu, \
> + __v.curr, \
> + __v.hops ? __v.prev : cpu_none_mask)
>

Hiding two nested loops together in one for_each_* macro leads to
unexpected behavior for the standard usage of 'break/continue'.

for_each_numa_hop_cpu(cpu, node) {
if (condition)
break; <== will terminate the inner loop only, but it's
invisible to the human developer/reviewer.
}

These bugs will not be easy to spot in code review.

> #endif /* _LINUX_TOPOLOGY_H */

2022-09-05 17:03:40

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH v3 7/9] sched/topology: Introduce sched_numa_hop_mask()

On 26/08/22 16:14, Yicong Yang wrote:
> On 2022/8/26 2:12, Valentin Schneider wrote:
>> Tariq has pointed out that drivers allocating IRQ vectors would benefit
>> from having smarter NUMA-awareness - cpumask_local_spread() only knows
>> about the local node and everything outside is in the same bucket.
>>
>> sched_domains_numa_masks is pretty much what we want to hand out (a cpumask
>> of CPUs reachable within a given distance budget), introduce
>> sched_numa_hop_mask() to export those cpumasks.
>>
>> Link: http://lore.kernel.org/r/[email protected]
>> Signed-off-by: Valentin Schneider <[email protected]>
>> ---
>> include/linux/topology.h | 9 +++++++++
>> kernel/sched/topology.c | 28 ++++++++++++++++++++++++++++
>> 2 files changed, 37 insertions(+)
>>
>> diff --git a/include/linux/topology.h b/include/linux/topology.h
>> index 4564faafd0e1..13b82b83e547 100644
>> --- a/include/linux/topology.h
>> +++ b/include/linux/topology.h
>> @@ -245,5 +245,14 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
>> return cpumask_of_node(cpu_to_node(cpu));
>> }
>>
>> +#ifdef CONFIG_NUMA
>> +extern const struct cpumask *sched_numa_hop_mask(int node, int hops);
>> +#else
>> +static inline const struct cpumask *sched_numa_hop_mask(int node, int hops)
>> +{
>> + return ERR_PTR(-EOPNOTSUPP);
>> +}
>> +#endif /* CONFIG_NUMA */
>> +
>>
>
> I think it should be better to return cpu_online_mask() if CONFIG_NUMA=n and hop is 0. Then we
> can keep the behaviour consistent with cpumask_local_spread() which for_each_numa_hop_cpu is
> going to replace.
>

That's a good point, thanks.

> The macro checking maybe unnecessary, check whether node is NUMA_NO_NODE will handle the case
> where NUMA is not configured.
>
> Thanks.

2022-09-05 17:09:41

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH v3 4/9] cpumask: Introduce for_each_cpu_andnot()

On 25/08/22 14:14, Yury Norov wrote:
> On Thu, Aug 25, 2022 at 07:12:05PM +0100, Valentin Schneider wrote:
>> +#define for_each_cpu_andnot(cpu, mask1, mask2) \
>> + for ((cpu) = -1; \
>> + (cpu) = cpumask_next_andnot((cpu), (mask1), (mask2)), \
>> + (cpu) < nr_cpu_ids;)
>
> The standard doesn't guarantee the order of execution of last 2 lines,
> so you might end up with unreliable code. Can you do it in a more
> conventional style:
> #define for_each_cpu_andnot(cpu, mask1, mask2) \
> for ((cpu) = cpumask_next_andnot(-1, (mask1), (mask2)); \
> (cpu) < nr_cpu_ids; \
> (cpu) = cpumask_next_andnot((cpu), (mask1), (mask2)))
>

IIUC the order of execution *is* guaranteed as this is a comma operator,
not argument passing:

6.5.17 Comma operator

The left operand of a comma operator is evaluated as a void expression;
there is a sequence point after its evaluation. Then the right operand is
evaluated; the result has its type and value.

for_each_cpu{_and}() uses the same pattern (which I simply copied here).

Still, I'd be up for making this a bit more readable. I did a bit of
digging to figure out how we ended up with that pattern, and found

7baac8b91f98 ("cpumask: make for_each_cpu_mask a bit smaller")

so this appears to have been done to save up on generated instructions.
*if* it is actually OK standard-wise, I'd vote to leave it as-is.

>> +
>> /**
>> * cpumask_any_but - return a "random" in a cpumask, but not this one.
>> * @mask: the cpumask to search
>> --
>> 2.31.1

2022-09-05 17:11:19

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH v3 8/9] sched/topology: Introduce for_each_numa_hop_cpu()

On 05/09/22 12:46, Tariq Toukan wrote:
> On 8/25/2022 9:12 PM, Valentin Schneider wrote:
>> The recently introduced sched_numa_hop_mask() exposes cpumasks of CPUs
>> reachable within a given distance budget, but this means each successive
>> cpumask is a superset of the previous one.
>>
>> Code wanting to allocate one item per CPU (e.g. IRQs) at increasing
>> distances would thus need to allocate a temporary cpumask to note which
>> CPUs have already been visited. This can be prevented by leveraging
>> for_each_cpu_andnot() - package all that logic into one ugl^D fancy macro.
>>
>> Signed-off-by: Valentin Schneider <[email protected]>
>> ---
>> include/linux/topology.h | 37 +++++++++++++++++++++++++++++++++++++
>> 1 file changed, 37 insertions(+)
>>
>> diff --git a/include/linux/topology.h b/include/linux/topology.h
>> index 13b82b83e547..6c671dc3252c 100644
>> --- a/include/linux/topology.h
>> +++ b/include/linux/topology.h
>> @@ -254,5 +254,42 @@ static inline const struct cpumask *sched_numa_hop_mask(int node, int hops)
>> }
>> #endif /* CONFIG_NUMA */
>>
>> +/**
>> + * for_each_numa_hop_cpu - iterate over CPUs by increasing NUMA distance,
>> + * starting from a given node.
>> + * @cpu: the iteration variable.
>> + * @node: the NUMA node to start the search from.
>> + *
>> + * Requires rcu_lock to be held.
>> + * Careful: this is a double loop, 'break' won't work as expected.
>> + *
>> + *
>> + * Implementation notes:
>> + *
>> + * Providing it is valid, the mask returned by
>> + * sched_numa_hop_mask(node, hops+1)
>> + * is a superset of the one returned by
>> + * sched_numa_hop_mask(node, hops)
>> + * which may not be that useful for drivers that try to spread things out and
>> + * want to visit a CPU not more than once.
>> + *
>> + * To accommodate for that, we use for_each_cpu_andnot() to iterate over the cpus
>> + * of sched_numa_hop_mask(node, hops+1) with the CPUs of
>> + * sched_numa_hop_mask(node, hops) removed, IOW we only iterate over CPUs
>> + * a given distance away (rather than *up to* a given distance).
>> + *
>> + * hops=0 forces us to play silly games: we pass cpu_none_mask to
>> + * for_each_cpu_andnot(), which turns it into for_each_cpu().
>> + */
>> +#define for_each_numa_hop_cpu(cpu, node) \
>> + for (struct { const struct cpumask *curr, *prev; int hops; } __v = \
>> + { sched_numa_hop_mask(node, 0), NULL, 0 }; \
>> + !IS_ERR_OR_NULL(__v.curr); \
>> + __v.hops++, \
>> + __v.prev = __v.curr, \
>> + __v.curr = sched_numa_hop_mask(node, __v.hops)) \
>> + for_each_cpu_andnot(cpu, \
>> + __v.curr, \
>> + __v.hops ? __v.prev : cpu_none_mask)
>>
>
> Hiding two nested loops together in one for_each_* macro leads to
> unexpected behavior for the standard usage of 'break/continue'.
>
> for_each_numa_hop_cpu(cpu, node) {
> if (condition)
> break; <== will terminate the inner loop only, but it's
> invisible to the human developer/reviewer.
> }
>
> These bugs will not be easy to spot in code review.
>

Yeah, it's not ideal, but I *did* attach a comment warning about
it. Spreading this pattern for sure isn't great, but there *is* a precedent
with for_each_process_thread().

>> #endif /* _LINUX_TOPOLOGY_H */

2022-09-05 18:37:10

by Yury Norov

[permalink] [raw]
Subject: Re: [PATCH v3 4/9] cpumask: Introduce for_each_cpu_andnot()

On Mon, Sep 5, 2022 at 9:44 AM Valentin Schneider <[email protected]> wrote:
>
> On 25/08/22 14:14, Yury Norov wrote:
> > On Thu, Aug 25, 2022 at 07:12:05PM +0100, Valentin Schneider wrote:
> >> +#define for_each_cpu_andnot(cpu, mask1, mask2) \
> >> + for ((cpu) = -1; \
> >> + (cpu) = cpumask_next_andnot((cpu), (mask1), (mask2)), \
> >> + (cpu) < nr_cpu_ids;)
> >
> > The standard doesn't guarantee the order of execution of last 2 lines,
> > so you might end up with unreliable code. Can you do it in a more
> > conventional style:
> > #define for_each_cpu_andnot(cpu, mask1, mask2) \
> > for ((cpu) = cpumask_next_andnot(-1, (mask1), (mask2)); \
> > (cpu) < nr_cpu_ids; \
> > (cpu) = cpumask_next_andnot((cpu), (mask1), (mask2)))
> >
>
> IIUC the order of execution *is* guaranteed as this is a comma operator,
> not argument passing:
>
> 6.5.17 Comma operator
>
> The left operand of a comma operator is evaluated as a void expression;
> there is a sequence point after its evaluation. Then the right operand is
> evaluated; the result has its type and value.
>
> for_each_cpu{_and}() uses the same pattern (which I simply copied here).
>
> Still, I'd be up for making this a bit more readable. I did a bit of
> digging to figure out how we ended up with that pattern, and found
>
> 7baac8b91f98 ("cpumask: make for_each_cpu_mask a bit smaller")
>
> so this appears to have been done to save up on generated instructions.
> *if* it is actually OK standard-wise, I'd vote to leave it as-is.

Indeed. I probably messed with ANSI C.

Sorry for the noise.