2024-01-20 02:51:05

by Yury Norov

Subject: [PATCH v5 0/9] lib/group_cpus: rework grp_spread_init_one() and make it O(1)

The grp_spread_init_one() implementation is sub-optimal because it
traverses the bitmaps from the beginning on every iteration, instead of
continuing from the position reached on the previous iteration.

Fix that and use the find_bit API where appropriate. While here, optimize
cpumask allocation and drop an unneeded cpumask_empty() call.

---
v1: https://lore.kernel.org/all/ZW5MI3rKQueLM0Bz@yury-ThinkPad/T/
v2: https://lore.kernel.org/lkml/ZXKNVRu3AfvjaFhK@fedora/T/
v3: https://lore.kernel.org/lkml/[email protected]/T/
v4: https://lore.kernel.org/lkml/[email protected]/T/
v5: add CPUMASK_NULL macro and use it to initialize cpumask_var_t
variables properly.

On cpumask_var_t initialization issue:

The idea of having different types behind the same typedef has been
considered nasty for quite a while. See a comment in include/linux/cpumask.h
for example.

Now that I'm trying to adapt the kernel cleanup machinery to cpumasks,
it reveals another disadvantage of this approach: there's no way to
assign a cpumask_var_t variable at declaration time, which is required
by the cleanup implementation.

To fix that, in v5 I added a CPUMASK_NULL macro as a workaround. This
CPUMASK_NULL would also be useful for those converting existing code to
use cleanup variables.
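
To illustrate, a minimal sketch of the intended usage (assuming the
CPUMASK_NULL macro and the free_cpumask_var cleanup added later in this
series; the function name is made up for illustration):

static int example_init(void)
{
        /* must be initialized at declaration, or the destructor may run on garbage */
        cpumask_var_t msk __free(free_cpumask_var) = CPUMASK_NULL;

        if (!alloc_cpumask_var(&msk, GFP_KERNEL))
                return -ENOMEM;

        cpumask_copy(msk, cpu_online_mask);
        /* ... use msk; it is freed automatically on every return path ... */
        return 0;
}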

In the long term, it's better to drop CPUMASK_OFFSTACK entirely. Moreover,
it's only used on Power and x86 machines when NR_CPUS >= 8K (unless people
enable it explicitly, and nobody bothers doing that in real life). But
that requires some more discussion with the Power and x86 people...

Meanwhile, I'm going to submit a patchset that deprecates cpumask_var_t
and adds a new set of allocators that support initialization at
declaration time.

Yury Norov (9):
cpumask: introduce for_each_cpu_and_from()
lib/group_cpus: optimize inner loop in grp_spread_init_one()
lib/group_cpus: relax atomicity requirement in grp_spread_init_one()
lib/group_cpus: optimize outer loop in grp_spread_init_one()
lib/group_cpus: don't zero cpumasks in group_cpus_evenly() on
allocation
lib/group_cpus: drop unneeded cpumask_empty() call in
__group_cpus_evenly()
cpumask: define cleanup function for cpumasks
lib/group_cpus: rework group_cpus_evenly()
lib/group_cpus: simplify group_cpus_evenly() for more

include/linux/cpumask.h | 16 ++++++
include/linux/find.h | 3 ++
lib/group_cpus.c | 110 ++++++++++++++++------------------------
3 files changed, 62 insertions(+), 67 deletions(-)

--
2.40.1



2024-01-20 02:51:23

by Yury Norov

Subject: [PATCH 1/9] cpumask: introduce for_each_cpu_and_from()

Similarly to for_each_cpu_and(), introduce a for_each_cpu_and_from(),
which is handy when one needs to traverse two cpumasks or bitmaps
starting from a given position.

Signed-off-by: Yury Norov <[email protected]>
---
include/linux/cpumask.h | 11 +++++++++++
include/linux/find.h | 3 +++
2 files changed, 14 insertions(+)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index cfb545841a2c..73ff2e0ef090 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -332,6 +332,17 @@ unsigned int __pure cpumask_next_wrap(int n, const struct cpumask *mask, int sta
#define for_each_cpu_and(cpu, mask1, mask2) \
for_each_and_bit(cpu, cpumask_bits(mask1), cpumask_bits(mask2), small_cpumask_bits)

+/**
+ * for_each_cpu_and_from - iterate over every cpu in both masks starting from a given cpu
+ * @cpu: the (optionally unsigned) integer iterator
+ * @mask1: the first cpumask pointer
+ * @mask2: the second cpumask pointer
+ *
+ * After the loop, cpu is >= nr_cpu_ids.
+ */
+#define for_each_cpu_and_from(cpu, mask1, mask2) \
+ for_each_and_bit_from(cpu, cpumask_bits(mask1), cpumask_bits(mask2), small_cpumask_bits)
+
/**
* for_each_cpu_andnot - iterate over every cpu present in one mask, excluding
* those present in another.
diff --git a/include/linux/find.h b/include/linux/find.h
index 5e4f39ef2e72..dfd3d51ff590 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -563,6 +563,9 @@ unsigned long find_next_bit_le(const void *addr, unsigned
(bit) = find_next_and_bit((addr1), (addr2), (size), (bit)), (bit) < (size);\
(bit)++)

+#define for_each_and_bit_from(bit, addr1, addr2, size) \
+ for (; (bit) = find_next_and_bit((addr1), (addr2), (size), (bit)), (bit) < (size); (bit)++)
+
#define for_each_andnot_bit(bit, addr1, addr2, size) \
for ((bit) = 0; \
(bit) = find_next_andnot_bit((addr1), (addr2), (size), (bit)), (bit) < (size);\
--
2.40.1
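
A minimal usage sketch of the new iterator (not part of the patch; the
masks 'a' and 'b' and the starting position are made up for illustration):

        unsigned int cpu = 4;

        /*
         * Walk the CPUs set in both 'a' and 'b', starting from CPU 4.
         * Unlike for_each_cpu_and(), the iterator variable must be
         * initialized before entering the loop.
         */
        for_each_cpu_and_from(cpu, a, b)
                pr_info("cpu %u is set in both masks\n", cpu);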


2024-01-20 02:51:29

by Yury Norov

Subject: [PATCH 2/9] lib/group_cpus: optimize inner loop in grp_spread_init_one()

The loop starts from the beginning every time we switch to the next
sibling mask. This is Schlemiel the Painter's style of coding, because
we know for sure that nmsk is clear up to the current CPU, and we can
just continue from the next CPU.

Also, we can do it more nicely if we leverage the dedicated for_each()
iterator, and simplify the logic of clearing a bit in nmsk.

Signed-off-by: Yury Norov <[email protected]>
---
lib/group_cpus.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index ee272c4cefcc..063ed9ae1b8d 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -30,14 +30,14 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,

/* If the cpu has siblings, use them first */
siblmsk = topology_sibling_cpumask(cpu);
- for (sibl = -1; cpus_per_grp > 0; ) {
- sibl = cpumask_next(sibl, siblmsk);
- if (sibl >= nr_cpu_ids)
- break;
- if (!cpumask_test_and_clear_cpu(sibl, nmsk))
- continue;
+ sibl = cpu + 1;
+
+ for_each_cpu_and_from(sibl, siblmsk, nmsk) {
+ if (cpus_per_grp-- == 0)
+ return;
+
+ cpumask_clear_cpu(sibl, nmsk);
cpumask_set_cpu(sibl, irqmsk);
- cpus_per_grp--;
}
}
}
--
2.40.1


2024-01-20 02:51:46

by Yury Norov

Subject: [PATCH 3/9] lib/group_cpus: relax atomicity requirement in grp_spread_init_one()

Because nmsk and irqmsk are stable, extra atomicity is not required.

Signed-off-by: Yury Norov <[email protected]>
NAKed-by: Ming Lei <[email protected]>
---

Regarding the NAK discussion:

> > > > I think this kind of change should be avoided, here the code is
> > > > absolutely in slow path, and we care code cleanness and readability
> > > > much more than the saved cycle from non atomicity.
> > >
> > > Atomic ops have special meaning and special function. This 'atomic' way
> > > of moving a bit from one bitmap to another looks completely non-trivial
> > > and puzzling to me.
> > >
> > > A sequence of atomic ops is not atomic itself. Normally it's a sign of
> > > a bug. But in this case, both masks are stable, and we don't need
> > > atomicity at all.
> >
> > Here we don't care about atomicity.
> >
> > >
> > > It's not about performance, it's about readability.
> >
> > __cpumask_clear_cpu() and __cpumask_set_cpu() are more like private
> > helpers, and are harder to follow.
>
> No, that's not true. The non-atomic version of the function is not a
> private helper, of course.
>
> > [@linux]$ git grep -n -w -E "cpumask_clear_cpu|cpumask_set_cpu" ./ | wc
> > 674 2055 53954
> > [@linux]$ git grep -n -w -E "__cpumask_clear_cpu|__cpumask_set_cpu" ./ | wc
> > 21 74 1580
> >
> > I don't object to comment the current usage, but NAK for this change.
>
> No problem, I'll add your NAK.

You can add the following words in the meantime:

__cpumask_clear_cpu() and __cpumask_set_cpu() were added in commit 6c8557bdb28d
("smp, cpumask: Use non-atomic cpumask_{set,clear}_cpu()") for a fast code path
(smp_call_function_many()).

We have ~670 users of cpumask_clear_cpu & cpumask_set_cpu, and lots of them
fall into the same category as group_cpus.c (atomicity doesn't matter, not in a
fast code path), and needn't be changed to __cpumask_clear_cpu() and
__cpumask_set_cpu(). Otherwise, this change may encourage updating the others
to the __cpumask_* versions.
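
For reference, the two flavors differ only in whether the underlying bit
operation is atomic; roughly (a simplified sketch, not a verbatim copy of
include/linux/cpumask.h):

static __always_inline void cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
{
        set_bit(cpumask_check(cpu), cpumask_bits(dstp));        /* atomic RMW */
}

static __always_inline void __cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
{
        __set_bit(cpumask_check(cpu), cpumask_bits(dstp));      /* plain, non-atomic store */
}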

lib/group_cpus.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index 063ed9ae1b8d..0a8ac7cb1a5d 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -24,8 +24,8 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
if (cpu >= nr_cpu_ids)
return;

- cpumask_clear_cpu(cpu, nmsk);
- cpumask_set_cpu(cpu, irqmsk);
+ __cpumask_clear_cpu(cpu, nmsk);
+ __cpumask_set_cpu(cpu, irqmsk);
cpus_per_grp--;

/* If the cpu has siblings, use them first */
@@ -36,8 +36,8 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
if (cpus_per_grp-- == 0)
return;

- cpumask_clear_cpu(sibl, nmsk);
- cpumask_set_cpu(sibl, irqmsk);
+ __cpumask_clear_cpu(sibl, nmsk);
+ __cpumask_set_cpu(sibl, irqmsk);
}
}
}
--
2.40.1


2024-01-20 02:51:51

by Yury Norov

Subject: [PATCH 4/9] lib/group_cpus: optimize outer loop in grp_spread_init_one()

Similarly to the inner loop, in the outer loop we can use the for_each_cpu()
macro and skip CPUs that have already been moved.

With this patch, the function becomes O(1), despite being a
double loop.

While here, add a comment on why we can't merge the outer logic into the
inner loop.

Signed-off-by: Yury Norov <[email protected]>
---
lib/group_cpus.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index 0a8ac7cb1a5d..952aac9eaa81 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -17,16 +17,17 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
const struct cpumask *siblmsk;
int cpu, sibl;

- for ( ; cpus_per_grp > 0; ) {
- cpu = cpumask_first(nmsk);
-
- /* Should not happen, but I'm too lazy to think about it */
- if (cpu >= nr_cpu_ids)
+ for_each_cpu(cpu, nmsk) {
+ if (cpus_per_grp-- == 0)
return;

+ /*
+ * If a caller wants to spread IRQs on offline CPUs, we need to
+ * take care of it explicitly because those offline CPUs are not
+ * included in the siblings cpumask.
+ */
__cpumask_clear_cpu(cpu, nmsk);
__cpumask_set_cpu(cpu, irqmsk);
- cpus_per_grp--;

/* If the cpu has siblings, use them first */
siblmsk = topology_sibling_cpumask(cpu);
@@ -38,6 +39,7 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,

__cpumask_clear_cpu(sibl, nmsk);
__cpumask_set_cpu(sibl, irqmsk);
+ cpu = sibl + 1;
}
}
}
--
2.40.1
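
For context, for_each_cpu() re-evaluates the search position at the top of
every iteration, which is why bumping 'cpu' inside the body skips the
siblings that were just consumed. Roughly (a simplified expansion, with
small_cpumask_bits approximated as nr_cpu_ids):

        /* for_each_cpu(cpu, nmsk) expands, roughly, to: */
        for ((cpu) = 0;
             (cpu) = find_next_bit(cpumask_bits(nmsk), nr_cpu_ids, (cpu)), (cpu) < nr_cpu_ids;
             (cpu)++) {
                /* assigning cpu = sibl + 1 here moves the next search past those siblings */
        }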


2024-01-20 02:52:05

by Yury Norov

Subject: [PATCH 5/9] lib/group_cpus: don't zero cpumasks in group_cpus_evenly() on allocation

nmsk and npresmsk are both allocated with zalloc_cpumask_var(), but they
are initialized by copying later in the code, and so may be allocated
uninitialized with plain alloc_cpumask_var().

Signed-off-by: Yury Norov <[email protected]>
---
lib/group_cpus.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index 952aac9eaa81..72c308f8c322 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -354,10 +354,10 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
int ret = -ENOMEM;
struct cpumask *masks = NULL;

- if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
+ if (!alloc_cpumask_var(&nmsk, GFP_KERNEL))
return NULL;

- if (!zalloc_cpumask_var(&npresmsk, GFP_KERNEL))
+ if (!alloc_cpumask_var(&npresmsk, GFP_KERNEL))
goto fail_nmsk;

node_to_cpumask = alloc_node_to_cpumask();
--
2.40.1


2024-01-20 02:52:23

by Yury Norov

Subject: [PATCH 6/9] lib/group_cpus: drop unneeded cpumask_empty() call in __group_cpus_evenly()

The function is called twice. The first time it's called with a copy of
cpu_present_mask as a parameter, which can't be empty. The second time it's
called with a mask created by cpumask_andnot(), which returns false if
the result is an empty mask.

We can safely drop the redundant cpumask_empty() call from
__group_cpus_evenly() and save a few cycles.

Signed-off-by: Yury Norov <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
---
lib/group_cpus.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index 72c308f8c322..b8c0c3ae2bbd 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -259,9 +259,6 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
nodemask_t nodemsk = NODE_MASK_NONE;
struct node_groups *node_groups;

- if (cpumask_empty(cpu_mask))
- return 0;
-
nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_mask, &nodemsk);

/*
@@ -401,9 +398,14 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
curgrp = 0;
else
curgrp = nr_present;
- cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk);
- ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
- npresmsk, nmsk, masks);
+
+ if (cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk))
+ /* If npresmsk is not empty */
+ ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
+ npresmsk, nmsk, masks);
+ else
+ ret = 0;
+
if (ret >= 0)
nr_others = ret;

--
2.40.1


2024-01-20 02:52:36

by Yury Norov

Subject: [PATCH 7/9] cpumask: define cleanup function for cpumasks

Now we can simplify code that allocates cpumasks for local needs.

Automatic variables have to be initialized at declaration, or at least
before any possibility for the logic to return, so that the compiler
won't call the associated destructor function on a random stack value.

Because cpumask_var_t, depending on the CPUMASK_OFFSTACK config, is
either a pointer or an array, we need a macro for the initializer.

So define a CPUMASK_NULL macro, which initializes a struct cpumask
pointer with NULL when CPUMASK_OFFSTACK is enabled, and is effectively
a no-op when CPUMASK_OFFSTACK is disabled.

Signed-off-by: Yury Norov <[email protected]>
---
include/linux/cpumask.h | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 73ff2e0ef090..0dd8e810200f 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -7,6 +7,7 @@
* set of CPUs in a system, one bit position per CPU number. In general,
* only nr_cpu_ids (<= NR_CPUS) bits are valid.
*/
+#include <linux/cleanup.h>
#include <linux/kernel.h>
#include <linux/threads.h>
#include <linux/bitmap.h>
@@ -898,6 +899,7 @@ typedef struct cpumask *cpumask_var_t;

#define this_cpu_cpumask_var_ptr(x) this_cpu_read(x)
#define __cpumask_var_read_mostly __read_mostly
+#define CPUMASK_NULL NULL

bool alloc_cpumask_var_node(cpumask_var_t *mask, gfp_t flags, int node);

@@ -945,6 +947,7 @@ typedef struct cpumask cpumask_var_t[1];

#define this_cpu_cpumask_var_ptr(x) this_cpu_ptr(x)
#define __cpumask_var_read_mostly
+#define CPUMASK_NULL {}

static inline bool alloc_cpumask_var(cpumask_var_t *mask, gfp_t flags)
{
@@ -988,6 +991,8 @@ static inline bool cpumask_available(cpumask_var_t mask)
}
#endif /* CONFIG_CPUMASK_OFFSTACK */

+DEFINE_FREE(free_cpumask_var, struct cpumask *, if (_T) free_cpumask_var(_T));
+
/* It's common to want to use cpu_all_mask in struct member initializers,
* so it has to refer to an address rather than a pointer. */
extern const DECLARE_BITMAP(cpu_all_bits, NR_CPUS);
--
2.40.1
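
For context, the cleanup helpers used here come from include/linux/cleanup.h.
Roughly (a simplified sketch, not a verbatim copy), DEFINE_FREE() and
__free() expand to:

#define DEFINE_FREE(_name, _type, _free) \
        static inline void __free_##_name(void *p) { _type _T = *(_type *)p; _free; }

/* attaches __attribute__((cleanup(...))) so the destructor runs at scope exit */
#define __free(_name)   __cleanup(__free_##_name)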


2024-01-20 02:52:48

by Yury Norov

Subject: [PATCH 8/9] lib/group_cpus: rework group_cpus_evenly()

Leverage the cleanup machinery and drop most of the housekeeping code.
In particular, drop the unneeded 'ret' variable, which was erroneously
initialized to -ENOMEM.

Signed-off-by: Yury Norov <[email protected]>
---
lib/group_cpus.c | 79 +++++++++++++++---------------------------------
1 file changed, 25 insertions(+), 54 deletions(-)

diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index b8c0c3ae2bbd..4c09df9eb886 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -76,6 +76,8 @@ static void free_node_to_cpumask(cpumask_var_t *masks)
kfree(masks);
}

+DEFINE_FREE(free_node_to_cpumask, cpumask_var_t *, if (_T) free_node_to_cpumask(_T));
+
static void build_node_to_cpumask(cpumask_var_t *masks)
{
int cpu;
@@ -345,26 +347,16 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
*/
struct cpumask *group_cpus_evenly(unsigned int numgrps)
{
- unsigned int curgrp = 0, nr_present = 0, nr_others = 0;
- cpumask_var_t *node_to_cpumask;
- cpumask_var_t nmsk, npresmsk;
- int ret = -ENOMEM;
- struct cpumask *masks = NULL;
-
- if (!alloc_cpumask_var(&nmsk, GFP_KERNEL))
+ cpumask_var_t *node_to_cpumask __free(free_node_to_cpumask) = alloc_node_to_cpumask();
+ struct cpumask *masks __free(kfree) = kcalloc(numgrps, sizeof(*masks), GFP_KERNEL);
+ cpumask_var_t npresmsk __free(free_cpumask_var) = CPUMASK_NULL;
+ cpumask_var_t nmsk __free(free_cpumask_var) = CPUMASK_NULL;
+ int curgrp, nr_present, nr_others;
+
+ if (!masks || !node_to_cpumask || !alloc_cpumask_var(&nmsk, GFP_KERNEL)
+ || !alloc_cpumask_var(&npresmsk, GFP_KERNEL))
return NULL;

- if (!alloc_cpumask_var(&npresmsk, GFP_KERNEL))
- goto fail_nmsk;
-
- node_to_cpumask = alloc_node_to_cpumask();
- if (!node_to_cpumask)
- goto fail_npresmsk;
-
- masks = kcalloc(numgrps, sizeof(*masks), GFP_KERNEL);
- if (!masks)
- goto fail_node_to_cpumask;
-
build_node_to_cpumask(node_to_cpumask);

/*
@@ -382,11 +374,15 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
cpumask_copy(npresmsk, data_race(cpu_present_mask));

/* grouping present CPUs first */
- ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
- npresmsk, nmsk, masks);
- if (ret < 0)
- goto fail_build_affinity;
- nr_present = ret;
+ nr_present = __group_cpus_evenly(0, numgrps, node_to_cpumask, npresmsk, nmsk, masks);
+ if (nr_present < 0)
+ return NULL;
+
+ /* If npresmsk is empty */
+ if (!cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk))
+ return_ptr(masks);
+
+ curgrp = nr_present < numgrps ? nr_present : 0;

/*
* Allocate non present CPUs starting from the next group to be
@@ -394,38 +390,13 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
* group space, assign the non present CPUs to the already
* allocated out groups.
*/
- if (nr_present >= numgrps)
- curgrp = 0;
- else
- curgrp = nr_present;
-
- if (cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk))
- /* If npresmsk is not empty */
- ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
- npresmsk, nmsk, masks);
- else
- ret = 0;
-
- if (ret >= 0)
- nr_others = ret;
-
- fail_build_affinity:
- if (ret >= 0)
- WARN_ON(nr_present + nr_others < numgrps);
-
- fail_node_to_cpumask:
- free_node_to_cpumask(node_to_cpumask);
-
- fail_npresmsk:
- free_cpumask_var(npresmsk);
-
- fail_nmsk:
- free_cpumask_var(nmsk);
- if (ret < 0) {
- kfree(masks);
+ nr_others = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
+ npresmsk, nmsk, masks);
+ if (nr_others < 0)
return NULL;
- }
- return masks;
+
+ WARN_ON(nr_present + nr_others < numgrps);
+ return_ptr(masks);
}
#else /* CONFIG_SMP */
struct cpumask *group_cpus_evenly(unsigned int numgrps)
--
2.40.1
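
For readers new to the cleanup machinery: __free() runs the named destructor
when the variable goes out of scope, and return_ptr() is roughly
'return no_free_ptr(masks)', i.e. it disarms the automatic free and hands
ownership to the caller. A minimal sketch of the same pattern, with a
hypothetical fill_masks() helper standing in for the real work:

/* needs <linux/cleanup.h>, <linux/slab.h>, <linux/cpumask.h> */
static struct cpumask *make_masks(unsigned int n)
{
        struct cpumask *masks __free(kfree) = kcalloc(n, sizeof(*masks), GFP_KERNEL);

        if (!masks)
                return NULL;                    /* nothing allocated, nothing to free */

        if (fill_masks(masks, n) < 0)           /* hypothetical helper */
                return NULL;                    /* __free(kfree) releases masks here */

        return_ptr(masks);                      /* success: suppress cleanup, transfer ownership */
}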


2024-01-20 02:53:02

by Yury Norov

Subject: [PATCH 9/9] lib/group_cpus: simplify group_cpus_evenly() for more

The nmsk parameter is used only in the helper function, so move it there.

Suggested-by: Ming Lei <[email protected]>
Signed-off-by: Yury Norov <[email protected]>
---
lib/group_cpus.c | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index 4c09df9eb886..71e802fca35f 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -253,13 +253,17 @@ static void alloc_nodes_groups(unsigned int numgrps,
static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
cpumask_var_t *node_to_cpumask,
const struct cpumask *cpu_mask,
- struct cpumask *nmsk, struct cpumask *masks)
+ struct cpumask *masks)
{
unsigned int i, n, nodes, cpus_per_grp, extra_grps, done = 0;
unsigned int last_grp = numgrps;
unsigned int curgrp = startgrp;
nodemask_t nodemsk = NODE_MASK_NONE;
struct node_groups *node_groups;
+ cpumask_var_t nmsk __free(free_cpumask_var) = CPUMASK_NULL;
+
+ if (!alloc_cpumask_var(&nmsk, GFP_KERNEL))
+ return -ENOMEM;

nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_mask, &nodemsk);

@@ -350,11 +354,9 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
cpumask_var_t *node_to_cpumask __free(free_node_to_cpumask) = alloc_node_to_cpumask();
struct cpumask *masks __free(kfree) = kcalloc(numgrps, sizeof(*masks), GFP_KERNEL);
cpumask_var_t npresmsk __free(free_cpumask_var) = CPUMASK_NULL;
- cpumask_var_t nmsk __free(free_cpumask_var) = CPUMASK_NULL;
int curgrp, nr_present, nr_others;

- if (!masks || !node_to_cpumask || !alloc_cpumask_var(&nmsk, GFP_KERNEL)
- || !alloc_cpumask_var(&npresmsk, GFP_KERNEL))
+ if (!masks || !node_to_cpumask || !alloc_cpumask_var(&npresmsk, GFP_KERNEL))
return NULL;

build_node_to_cpumask(node_to_cpumask);
@@ -374,7 +376,7 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
cpumask_copy(npresmsk, data_race(cpu_present_mask));

/* grouping present CPUs first */
- nr_present = __group_cpus_evenly(0, numgrps, node_to_cpumask, npresmsk, nmsk, masks);
+ nr_present = __group_cpus_evenly(0, numgrps, node_to_cpumask, npresmsk, masks);
if (nr_present < 0)
return NULL;

@@ -390,8 +392,7 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
* group space, assign the non present CPUs to the already
* allocated out groups.
*/
- nr_others = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
- npresmsk, nmsk, masks);
+ nr_others = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask, npresmsk, masks);
if (nr_others < 0)
return NULL;

--
2.40.1


2024-01-20 03:04:04

by Ming Lei

Subject: Re: [PATCH 1/9] cpumask: introduce for_each_cpu_and_from()

On Fri, Jan 19, 2024 at 06:50:45PM -0800, Yury Norov wrote:
> Similarly to for_each_cpu_and(), introduce a for_each_cpu_and_from(),
> which is handy when it's needed to traverse 2 cpumasks or bitmaps,
> starting from a given position.

The new helper is useless, see

https://lore.kernel.org/lkml/ZZNgDb6bzOscrNmk@fedora/


Thanks,
Ming


2024-01-20 03:17:26

by Ming Lei

Subject: Re: [PATCH 2/9] lib/group_cpus: optimize inner loop in grp_spread_init_one()

On Fri, Jan 19, 2024 at 06:50:46PM -0800, Yury Norov wrote:
> The loop starts from the beginning every time we switch to the next
> sibling mask. This is the Schlemiel the Painter's style of coding
> because we know for sure that nmsk is clear up to current CPU, and we
> can just continue from the next CPU.
>
> Also, we can do it nicer if leverage the dedicated for_each() iterator,
> and simplify the logic of clearing a bit in nmsk.
>
> Signed-off-by: Yury Norov <[email protected]>
> ---
> lib/group_cpus.c | 14 +++++++-------
> 1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/lib/group_cpus.c b/lib/group_cpus.c
> index ee272c4cefcc..063ed9ae1b8d 100644
> --- a/lib/group_cpus.c
> +++ b/lib/group_cpus.c
> @@ -30,14 +30,14 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
>
> /* If the cpu has siblings, use them first */
> siblmsk = topology_sibling_cpumask(cpu);
> - for (sibl = -1; cpus_per_grp > 0; ) {
> - sibl = cpumask_next(sibl, siblmsk);
> - if (sibl >= nr_cpu_ids)
> - break;
> - if (!cpumask_test_and_clear_cpu(sibl, nmsk))
> - continue;
> + sibl = cpu + 1;

No, it is silly to let 'sibl' point to 'cpu + 1', because we just
want to iterate over 'siblmsk & nmsk', which has nothing to do with
the next cpu ('cpu + 1').

> +
> + for_each_cpu_and_from(sibl, siblmsk, nmsk) {
> + if (cpus_per_grp-- == 0)
> + return;
> +
> + cpumask_clear_cpu(sibl, nmsk);
> cpumask_set_cpu(sibl, irqmsk);
> - cpus_per_grp--;

Andrew, please replace the 1st two patches with the following one:

From 7a983ee5e1b4f05e5ae26c025dffd801b909e2f3 Mon Sep 17 00:00:00 2001
From: Ming Lei <[email protected]>
Date: Sat, 20 Jan 2024 11:07:26 +0800
Subject: [PATCH] lib/group_cpus.c: simplify grp_spread_init_one()

What the inner loop needs to do is iterate over `siblmsk & nmsk`,
clear the cpu in 'nmsk' and set it in 'irqmsk'.

Clean it up with for_each_cpu_and().

This is based on Yury Norov's patch, which needs an extra
for_each_cpu_and_from() that is really not necessary.

Signed-off-by: Ming Lei <[email protected]>
---
lib/group_cpus.c | 11 ++++-------
1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index ee272c4cefcc..564d8e817f65 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -30,14 +30,11 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,

/* If the cpu has siblings, use them first */
siblmsk = topology_sibling_cpumask(cpu);
- for (sibl = -1; cpus_per_grp > 0; ) {
- sibl = cpumask_next(sibl, siblmsk);
- if (sibl >= nr_cpu_ids)
- break;
- if (!cpumask_test_and_clear_cpu(sibl, nmsk))
- continue;
+ for_each_cpu_and(sibl, siblmsk, nmsk) {
+ cpumask_clear_cpu(sibl, nmsk);
cpumask_set_cpu(sibl, irqmsk);
- cpus_per_grp--;
+ if (--cpus_per_grp == 0)
+ return;
}
}
}
--
2.42.0





Thanks,
Ming


2024-01-20 03:52:23

by Ming Lei

Subject: Re: [PATCH 4/9] lib/group_cpus: optimize outer loop in grp_spread_init_one()

On Fri, Jan 19, 2024 at 06:50:48PM -0800, Yury Norov wrote:
> Similarly to the inner loop, in the outer loop we can use for_each_cpu()
> macro, and skip CPUs that have been moved.
>
> With this patch, the function becomes O(1), despite that it's a
> double-loop.
>
> While here, add a comment why we can't merge outer logic into the inner
> loop.
>
> Signed-off-by: Yury Norov <[email protected]>
> ---
> lib/group_cpus.c | 14 ++++++++------
> 1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/lib/group_cpus.c b/lib/group_cpus.c
> index 0a8ac7cb1a5d..952aac9eaa81 100644
> --- a/lib/group_cpus.c
> +++ b/lib/group_cpus.c
> @@ -17,16 +17,17 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
> const struct cpumask *siblmsk;
> int cpu, sibl;
>
> - for ( ; cpus_per_grp > 0; ) {
> - cpu = cpumask_first(nmsk);
> -
> - /* Should not happen, but I'm too lazy to think about it */
> - if (cpu >= nr_cpu_ids)
> + for_each_cpu(cpu, nmsk) {
> + if (cpus_per_grp-- == 0)
> return;
>
> + /*
> + * If a caller wants to spread IRQa on offline CPUs, we need to
> + * take care of it explicitly because those offline CPUS are not
> + * included in siblings cpumask.
> + */
> __cpumask_clear_cpu(cpu, nmsk);
> __cpumask_set_cpu(cpu, irqmsk);
> - cpus_per_grp--;
>
> /* If the cpu has siblings, use them first */
> siblmsk = topology_sibling_cpumask(cpu);
> @@ -38,6 +39,7 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
>
> __cpumask_clear_cpu(sibl, nmsk);
> __cpumask_set_cpu(sibl, irqmsk);
> + cpu = sibl + 1;

It is tricky enough to update the condition variable of for_each_cpu()
(that kind of pattern can't even be built in Rust), and the above line is
arguably even trickier.

You can get O(1) (not sure it matters here) by using cpumask_next(), which
is more readable, isn't it?

diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index 564d8e817f65..e0ce878ac4c4 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -15,10 +15,10 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
unsigned int cpus_per_grp)
{
const struct cpumask *siblmsk;
- int cpu, sibl;
+ int cpu = -1;

- for ( ; cpus_per_grp > 0; ) {
- cpu = cpumask_first(nmsk);
+ while (cpus_per_grp > 0) {
+ cpu = cpumask_next(cpu, nmsk);

/* Should not happen, but I'm too lazy to think about it */
if (cpu >= nr_cpu_ids)
@@ -30,9 +30,9 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,

/* If the cpu has siblings, use them first */
siblmsk = topology_sibling_cpumask(cpu);
- for_each_cpu_and(sibl, siblmsk, nmsk) {
- cpumask_clear_cpu(sibl, nmsk);
- cpumask_set_cpu(sibl, irqmsk);
+ for_each_cpu_and(cpu, siblmsk, nmsk) {
+ cpumask_clear_cpu(cpu, nmsk);
+ cpumask_set_cpu(cpu, irqmsk);
if (--cpus_per_grp == 0)
return;
}




Thanks,
Ming


2024-01-20 06:18:03

by Ming Lei

Subject: Re: [PATCH 4/9] lib/group_cpus: optimize outer loop in grp_spread_init_one()

On Sat, Jan 20, 2024 at 11:51:58AM +0800, Ming Lei wrote:
> On Fri, Jan 19, 2024 at 06:50:48PM -0800, Yury Norov wrote:
> > Similarly to the inner loop, in the outer loop we can use for_each_cpu()
> > macro, and skip CPUs that have been moved.
> >
> > With this patch, the function becomes O(1), despite that it's a
> > double-loop.
> >
> > While here, add a comment why we can't merge outer logic into the inner
> > loop.
> >
> > Signed-off-by: Yury Norov <[email protected]>
> > ---
> > lib/group_cpus.c | 14 ++++++++------
> > 1 file changed, 8 insertions(+), 6 deletions(-)
> >
> > diff --git a/lib/group_cpus.c b/lib/group_cpus.c
> > index 0a8ac7cb1a5d..952aac9eaa81 100644
> > --- a/lib/group_cpus.c
> > +++ b/lib/group_cpus.c
> > @@ -17,16 +17,17 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
> > const struct cpumask *siblmsk;
> > int cpu, sibl;
> >
> > - for ( ; cpus_per_grp > 0; ) {
> > - cpu = cpumask_first(nmsk);
> > -
> > - /* Should not happen, but I'm too lazy to think about it */
> > - if (cpu >= nr_cpu_ids)
> > + for_each_cpu(cpu, nmsk) {
> > + if (cpus_per_grp-- == 0)
> > return;
> >
> > + /*
> > + * If a caller wants to spread IRQa on offline CPUs, we need to
> > + * take care of it explicitly because those offline CPUS are not
> > + * included in siblings cpumask.
> > + */
> > __cpumask_clear_cpu(cpu, nmsk);
> > __cpumask_set_cpu(cpu, irqmsk);
> > - cpus_per_grp--;
> >
> > /* If the cpu has siblings, use them first */
> > siblmsk = topology_sibling_cpumask(cpu);
> > @@ -38,6 +39,7 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
> >
> > __cpumask_clear_cpu(sibl, nmsk);
> > __cpumask_set_cpu(sibl, irqmsk);
> > + cpu = sibl + 1;
>
> It has been tricky enough to update condition variable of for_each_cpu()
> (such kind of pattern can't build in Rust at all), and the above line could
> be more tricky actually.

Not only is the above line tricky, it is also wrong, because the 'cpu'
local variable should always point to the first set bit in 'nmsk'. However, if
you set it to 'sibl + 1', some bits in 'nmsk' are skipped in the loop,
aren't they?


Thanks,
Ming


2024-01-20 07:03:27

by Ming Lei

Subject: Re: [PATCH 2/9] lib/group_cpus: optimize inner loop in grp_spread_init_one()

On Sat, Jan 20, 2024 at 11:17:00AM +0800, Ming Lei wrote:
> On Fri, Jan 19, 2024 at 06:50:46PM -0800, Yury Norov wrote:
> > The loop starts from the beginning every time we switch to the next
> > sibling mask. This is the Schlemiel the Painter's style of coding
> > because we know for sure that nmsk is clear up to current CPU, and we
> > can just continue from the next CPU.
> >
> > Also, we can do it nicer if leverage the dedicated for_each() iterator,
> > and simplify the logic of clearing a bit in nmsk.
> >
> > Signed-off-by: Yury Norov <[email protected]>
> > ---
> > lib/group_cpus.c | 14 +++++++-------
> > 1 file changed, 7 insertions(+), 7 deletions(-)
> >
> > diff --git a/lib/group_cpus.c b/lib/group_cpus.c
> > index ee272c4cefcc..063ed9ae1b8d 100644
> > --- a/lib/group_cpus.c
> > +++ b/lib/group_cpus.c
> > @@ -30,14 +30,14 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
> >
> > /* If the cpu has siblings, use them first */
> > siblmsk = topology_sibling_cpumask(cpu);
> > - for (sibl = -1; cpus_per_grp > 0; ) {
> > - sibl = cpumask_next(sibl, siblmsk);
> > - if (sibl >= nr_cpu_ids)
> > - break;
> > - if (!cpumask_test_and_clear_cpu(sibl, nmsk))
> > - continue;
> > + sibl = cpu + 1;
>
> No, it is silly to let 'sibl' point to 'cpu + 1', cause we just
> want to iterate over 'siblmsk & nmsk', and nothing to do with
> the next cpu('cpu + 1').
>
> > +
> > + for_each_cpu_and_from(sibl, siblmsk, nmsk) {
> > + if (cpus_per_grp-- == 0)
> > + return;
> > +
> > + cpumask_clear_cpu(sibl, nmsk);
> > cpumask_set_cpu(sibl, irqmsk);
> > - cpus_per_grp--;
>
> Andrew, please replace the 1st two patches with the following one:
>
> From 7a983ee5e1b4f05e5ae26c025dffd801b909e2f3 Mon Sep 17 00:00:00 2001
> From: Ming Lei <[email protected]>
> Date: Sat, 20 Jan 2024 11:07:26 +0800
> Subject: [PATCH] lib/group_cpus.c: simplify grp_spread_init_one()
>
> What the inner loop needs to do is to iterate over `siblmsk & nmsk`, and
> clear the cpu in 'nmsk' and set it in 'irqmsk'.
>
> Clean it by for_each_cpu_and().
>
> This is based on Yury Norov's patch, which needs one extra
> for_each_cpu_and_from(), which is really not necessary.
>
> Signed-off-by: Ming Lei <[email protected]>
> ---
> lib/group_cpus.c | 11 ++++-------
> 1 file changed, 4 insertions(+), 7 deletions(-)
>
> diff --git a/lib/group_cpus.c b/lib/group_cpus.c
> index ee272c4cefcc..564d8e817f65 100644
> --- a/lib/group_cpus.c
> +++ b/lib/group_cpus.c
> @@ -30,14 +30,11 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
>
> /* If the cpu has siblings, use them first */
> siblmsk = topology_sibling_cpumask(cpu);
> - for (sibl = -1; cpus_per_grp > 0; ) {
> - sibl = cpumask_next(sibl, siblmsk);
> - if (sibl >= nr_cpu_ids)
> - break;
> - if (!cpumask_test_and_clear_cpu(sibl, nmsk))
> - continue;
> + for_each_cpu_and(sibl, siblmsk, nmsk) {
> + cpumask_clear_cpu(sibl, nmsk);
> cpumask_set_cpu(sibl, irqmsk);
> - cpus_per_grp--;
> + if (--cpus_per_grp == 0)
> + return;

The iterator variable over 'nmsk' is updated inside the loop, and it is still
tricky, so please ignore it; I just sent a formal & revised patch:

https://lkml.org/lkml/2024/1/20/43


Thanks,
Ming


2024-01-21 19:50:14

by Yury Norov

Subject: Re: [PATCH 1/9] cpumask: introduce for_each_cpu_and_from()

On Sat, Jan 20, 2024 at 11:03:37AM +0800, Ming Lei wrote:
> On Fri, Jan 19, 2024 at 06:50:45PM -0800, Yury Norov wrote:
> > Similarly to for_each_cpu_and(), introduce a for_each_cpu_and_from(),
> > which is handy when it's needed to traverse 2 cpumasks or bitmaps,
> > starting from a given position.
>
> The new helper is useless, see
>
> https://lore.kernel.org/lkml/ZZNgDb6bzOscrNmk@fedora/

Let's consider the following configuration.

CPUs: 0b1111
Sibling groups: 0b0011 and 0b1100
nmsk: 0b1111

As the complexity measure, we take the number of accesses to nmsk in
the outer loop, and to (nmsk & siblmsk) in the inner loop, made by the
search routines, so that

cpumask_first(1111)

requires 1 access to find the first set bit, and

cpumask_first(1000)

requires 4 such accesses.

Actual find_bit() ops work better than this by using __ffs(), but on long
bitmaps the performance will be as described above.

Now, look at the code. This is yours:

static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
                                unsigned int cpus_per_grp)
{
        const struct cpumask *siblmsk;
        int cpu, sibl;

        for ( ; cpus_per_grp > 0; ) {
                cpu = cpumask_first(nmsk);

                /* Should not happen, but I'm too lazy to think about it */
                if (cpu >= nr_cpu_ids)
                        return;

                cpumask_clear_cpu(cpu, nmsk);
                cpumask_set_cpu(cpu, irqmsk);
                cpus_per_grp--;

                /* If the cpu has siblings, use them first */
                siblmsk = topology_sibling_cpumask(cpu);
                for (sibl = -1; cpus_per_grp > 0; ) {
                        sibl = cpumask_next(sibl, siblmsk);
                        if (sibl >= nr_cpu_ids)
                                break;
                        if (!cpumask_test_and_clear_cpu(sibl, nmsk))
                                continue;
                        cpumask_set_cpu(sibl, irqmsk);
                        cpus_per_grp--;
                }
        }
}

This is your code step-by-step:

 #   loop   cpu  match  siblmsk  nmsk  irqmsk
 0   outer   0    yes            1110   0001
 1   inner   0    no             1110   0001
 2   inner   1    yes    0011    1100   0011
 3   inner   2    no             1100   0011
 4   inner   3    no             1100   0011
 5   outer   0    no             1100   0011
 6   outer   1    no             1100   0011
 7   outer   2    yes            1000   0111
 8   inner   0    no     1100    1000   0111
 9   inner   1    no     1100    1000   0111
10   inner   2    no     1100    1000   0111
11   inner   3    yes    1100    0000   1111
12   outer   0    no             0000   1111
13   outer   1    no             0000   1111
14   outer   2    no             0000   1111
15   outer   3    no             0000   1111

This is mine:

static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
                                unsigned int cpus_per_grp)
{
        const struct cpumask *siblmsk;
        int cpu, sibl;

        for_each_cpu(cpu, nmsk) {
                if (cpus_per_grp-- == 0)
                        return;

                /*
                 * If a caller wants to spread IRQs on offline CPUs, we need to
                 * take care of it explicitly because those offline CPUs are not
                 * included in the siblings cpumask.
                 */
                __cpumask_clear_cpu(cpu, nmsk);
                __cpumask_set_cpu(cpu, irqmsk);

                /* If the cpu has siblings, use them first */
                siblmsk = topology_sibling_cpumask(cpu);
                sibl = cpu + 1;

                for_each_cpu_and_from(sibl, siblmsk, nmsk) {
                        if (cpus_per_grp-- == 0)
                                return;

                        __cpumask_clear_cpu(sibl, nmsk);
                        __cpumask_set_cpu(sibl, irqmsk);
                        cpu = sibl + 1;
                }
        }
}

Step-by-step:

 #   loop   cpu  match  siblmsk  nmsk  irqmsk
 0   outer   0    yes            1110   0001
 1   inner   1    yes    0011    1100   0011
 2   inner   2    no     0011    1100   0011
 3   inner   3    no     0011    1100   0011
 4   outer   2    yes            1000   0111
 5   inner   3    yes    1100    0000   1111

Your code works worse because it's a Schlemiel the Painter's algorithm.
I mentioned it twice in the commit messages and at least 3 times in
replies to your comments.

Here I'll stop and will not reply to your emails, including the rest of
that Friday's night mailbombing, unless you at least admit you're wrong
in this case and for_each_cpu_and_from() is useful here.

I'd also recommend you to learn more about atomic operations basics and
revoke your NAK from the patch #3.

Thanks,
Yury

PS: There's a typo in the series name; I meant that the series makes the
function O(N), of course. But even that is overly optimistic. It's O(N*S),
where S is the number of sibling groups. A couple more patches are needed to
make it a true O(N). Still, much better.

2024-01-22 02:41:55

by Ming Lei

Subject: Re: [PATCH 1/9] cpumask: introduce for_each_cpu_and_from()

On Sun, Jan 21, 2024 at 11:50:02AM -0800, Yury Norov wrote:
> On Sat, Jan 20, 2024 at 11:03:37AM +0800, Ming Lei wrote:
> > On Fri, Jan 19, 2024 at 06:50:45PM -0800, Yury Norov wrote:
> > > Similarly to for_each_cpu_and(), introduce a for_each_cpu_and_from(),
> > > which is handy when it's needed to traverse 2 cpumasks or bitmaps,
> > > starting from a given position.
> >
> > The new helper is useless, see
> >
> > https://lore.kernel.org/lkml/ZZNgDb6bzOscrNmk@fedora/
>
> Let's consider the following configuration.
> Step-by-step:

..

>
> # loop cpu match siblmsk nmsk irqmsk
> 0 outer 0 yes 1110 0001
> 1 inner 1 yes 0011 1100 0011
> 2 inner 2 no 0011 1100 0011
> 3 inner 3 no 0011 1100 0011
> 4 outer 2 yes 1000 0111
> 5 inner 3 yes 1100 0000 1111
>
> Your code works worse because it's a Schlemiel the Painter's algorithm.
> I mentioned it twice in the commit messages and at least 3 times in
> replies to your comments.

Does it really matter here in reality? What kind of user-visible improvement
can be observed?

I have mentioned several times that for control/management code paths, we care
much more about maintainability and correctness than about efficiency.

You are _wasting_ resources in the wrong place. If you are really interested in
optimization, please do it in fast code paths, such as (but not limited to)
irq handling, io handling, memory allocation, ...

Unfortunately, your v5 still has an obvious bug, and as you mentioned,
the patchset title is wrong too.

>
> Here I'll stop and will not reply to your emails, including the rest of
> that Friday's night mailbombing, unless you at least admit you're wrong
> in this case and for_each_cpu_and_from() is useful here.

It is easy to get the same result without adding for_each_cpu_and_from();
see the patch I sent:

https://lore.kernel.org/lkml/[email protected]/

in which we needn't update the iterator variable inside the loop; it fixes
the bug in patch 4 of your v5, and it is still O(N). Meanwhile, it is
simpler and easier to prove correct.

Here your use of for_each_cpu_and_from() is tricky too, because the
loop condition variable (part of the iterator state) of the cpu mask is
updated inside the loop. And we can get the same result with
cpumask_next_and() without playing that trick.

>
> I'd also recommend you to learn more about atomic operations basics and
> revoke your NAK from the patch #3.

If you think my comment on the NAK is wrong, please reply to the comment
directly.

>
> Thanks,
> Yury
>
> PS: There's a typo in the series name, I meant that the series makes the
> function O(N) of course. But even that is overly optimistic. It's O(N*S),
> where S is the number of sibling groups. A couple more patches needed to
> make it a true O(N). Still, much better.

Either O(1) or O(N) isn't a big deal here, because it is a one-shot
slow code path, and nr_cpu_ids is not big enough in reality for it to matter.

You can't even make it a real O(N), because your patch 4 has a logic
mistake; see my comment:

https://lore.kernel.org/lkml/ZatlggW%2F8SH6od9O@fedora/



Thanks,
Ming