2023-12-12 04:21:34

by Yury Norov

Subject: [PATCH v3 0/7] lib/group_cpus: rework grp_spread_init_one() and make it O(1)

The grp_spread_init_one() implementation is sub-optimal because it
traverses the bitmaps from the beginning on every iteration, instead of
continuing from the position reached in the previous one.

Fix that and use the find_bit API where appropriate. While here,
optimize cpumask allocation and drop an unneeded cpumask_empty() call.
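
To illustrate the core of the rework, here is a simplified sketch of the
pattern being replaced (not the exact kernel code):

	/* Before: every step rescans nmsk from bit 0 */
	for ( ; cpus_per_grp > 0; ) {
		cpu = cpumask_first(nmsk);	/* O(n) scan every time */
		...
	}

	/* After: the iterator resumes from the previous position */
	for_each_cpu(cpu, nmsk) {
		...
	}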

---
v1: https://lore.kernel.org/all/ZW5MI3rKQueLM0Bz@yury-ThinkPad/T/
v2: https://lore.kernel.org/lkml/ZXKNVRu3AfvjaFhK@fedora/T/
v3:
- swap patches #2 and #3, as suggested by Ming Lei;
- add patch #7, which simplifies the function further.


Yury Norov (7):
cpumask: introduce for_each_cpu_and_from()
lib/group_cpus: optimize inner loop in grp_spread_init_one()
lib/group_cpus: relax atomicity requirement in grp_spread_init_one()
lib/group_cpus: optimize outer loop in grp_spread_init_one()
lib/group_cpus.c: don't zero cpumasks in group_cpus_evenly() on
allocation
lib/group_cpus.c: drop unneeded cpumask_empty() call in
__group_cpus_evenly()
lib/group_cpus: simplify grp_spread_init_one() for more

include/linux/cpumask.h | 11 ++++++++++
include/linux/find.h | 3 +++
lib/group_cpus.c | 47 +++++++++++++++++------------------------
3 files changed, 33 insertions(+), 28 deletions(-)

--
2.40.1


2023-12-12 04:21:41

by Yury Norov

Subject: [PATCH v3 1/7] cpumask: introduce for_each_cpu_and_from()

Similarly to for_each_cpu_and(), introduce for_each_cpu_and_from(),
which is handy when one needs to traverse two cpumasks or bitmaps
starting from a given position.
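
A minimal usage sketch (mask names and the handle_cpu() callee are
hypothetical; note that the iterator must be initialized to the starting
position):

	unsigned int cpu = start;

	/* visit every CPU set in both mask1 and mask2, from 'start' on */
	for_each_cpu_and_from(cpu, mask1, mask2)
		handle_cpu(cpu);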

Signed-off-by: Yury Norov <[email protected]>
---
include/linux/cpumask.h | 11 +++++++++++
include/linux/find.h | 3 +++
2 files changed, 14 insertions(+)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index cfb545841a2c..73ff2e0ef090 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -332,6 +332,17 @@ unsigned int __pure cpumask_next_wrap(int n, const struct cpumask *mask, int sta
#define for_each_cpu_and(cpu, mask1, mask2) \
for_each_and_bit(cpu, cpumask_bits(mask1), cpumask_bits(mask2), small_cpumask_bits)

+/**
+ * for_each_cpu_and_from - iterate over every cpu in both masks starting from a given cpu
+ * @cpu: the (optionally unsigned) integer iterator, set to the starting CPU
+ * @mask1: the first cpumask pointer
+ * @mask2: the second cpumask pointer
+ *
+ * After the loop, cpu is >= nr_cpu_ids.
+ */
+#define for_each_cpu_and_from(cpu, mask1, mask2) \
+ for_each_and_bit_from(cpu, cpumask_bits(mask1), cpumask_bits(mask2), small_cpumask_bits)
+
/**
* for_each_cpu_andnot - iterate over every cpu present in one mask, excluding
* those present in another.
diff --git a/include/linux/find.h b/include/linux/find.h
index 5e4f39ef2e72..dfd3d51ff590 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -563,6 +563,9 @@ unsigned long find_next_bit_le(const void *addr, unsigned
(bit) = find_next_and_bit((addr1), (addr2), (size), (bit)), (bit) < (size);\
(bit)++)

+#define for_each_and_bit_from(bit, addr1, addr2, size) \
+ for (; (bit) = find_next_and_bit((addr1), (addr2), (size), (bit)), (bit) < (size); (bit)++)
+
#define for_each_andnot_bit(bit, addr1, addr2, size) \
for ((bit) = 0; \
(bit) = find_next_andnot_bit((addr1), (addr2), (size), (bit)), (bit) < (size);\
--
2.40.1

2023-12-12 04:21:48

by Yury Norov

Subject: [PATCH v3 5/7] lib/group_cpus: don't zero cpumasks in group_cpus_evenly() on allocation

nmsk and npresmsk are both allocated with zalloc_cpumask_var(), but they
are initialized by copying later in the code, so they can safely be
allocated uninitialized.
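
For reference, the only difference is the zeroing; a sketch of the
semantics (assuming CONFIG_CPUMASK_OFFSTACK, where the mask is really
heap-allocated):

	alloc_cpumask_var(&nmsk, GFP_KERNEL);	/* bits start undefined */
	zalloc_cpumask_var(&nmsk, GFP_KERNEL);	/* bits start cleared */
	cpumask_copy(nmsk, src);		/* overwrites every bit anyway */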

Signed-off-by: Yury Norov <[email protected]>
---
lib/group_cpus.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index cded3c8ea63b..c7fcd04c87bf 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -347,10 +347,10 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
int ret = -ENOMEM;
struct cpumask *masks = NULL;

- if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
+ if (!alloc_cpumask_var(&nmsk, GFP_KERNEL))
return NULL;

- if (!zalloc_cpumask_var(&npresmsk, GFP_KERNEL))
+ if (!alloc_cpumask_var(&npresmsk, GFP_KERNEL))
goto fail_nmsk;

node_to_cpumask = alloc_node_to_cpumask();
--
2.40.1

2023-12-12 04:21:59

by Yury Norov

Subject: [PATCH v3 6/7] lib/group_cpus: drop unneeded cpumask_empty() call in __group_cpus_evenly()

__group_cpus_evenly() is called twice. The first time it's called with a
copy of cpu_present_mask as the parameter, which can't be empty. The
second time it's called with a mask created with cpumask_andnot(), which
returns false if the result is an empty mask.

We can safely drop the redundant cpumask_empty() call from
__group_cpus_evenly() and save a few cycles.
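
The property relied upon here, spelled out as a sketch:

	/*
	 * cpumask_andnot(dst, src1, src2) computes dst = src1 & ~src2 and
	 * returns false iff the resulting dst is empty.
	 */
	if (cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk)) {
		/* npresmsk is non-empty: there are non-present possible CPUs */
	}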

Signed-off-by: Yury Norov <[email protected]>
---
lib/group_cpus.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index c7fcd04c87bf..664a56171a1b 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -252,9 +252,6 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
nodemask_t nodemsk = NODE_MASK_NONE;
struct node_groups *node_groups;

- if (cpumask_empty(cpu_mask))
- return 0;
-
nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_mask, &nodemsk);

/*
@@ -394,9 +391,14 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
curgrp = 0;
else
curgrp = nr_present;
- cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk);
- ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
- npresmsk, nmsk, masks);
+
+ if (cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk))
+ /* If npresmsk is not empty */
+ ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
+ npresmsk, nmsk, masks);
+ else
+ ret = 0;
+
if (ret >= 0)
nr_others = ret;

--
2.40.1

2023-12-12 04:22:04

by Yury Norov

Subject: [PATCH v3 3/7] lib/group_cpus: relax atomicity requirement in grp_spread_init_one()

Because nmsk and irqmsk are stable, extra atomicity is not required.
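
For context, the two flavors differ only in the underlying bitop
(simplified, per include/linux/cpumask.h):

	cpumask_set_cpu(cpu, mask);	/* set_bit(): atomic RMW */
	__cpumask_set_cpu(cpu, mask);	/* __set_bit(): plain, non-atomic */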

Signed-off-by: Yury Norov <[email protected]>
---
lib/group_cpus.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index 10dead3ab0e0..7ac94664230f 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -24,8 +24,8 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
if (cpu >= nr_cpu_ids)
return;

- cpumask_clear_cpu(cpu, nmsk);
- cpumask_set_cpu(cpu, irqmsk);
+ __cpumask_clear_cpu(cpu, nmsk);
+ __cpumask_set_cpu(cpu, irqmsk);
cpus_per_grp--;

/* If the cpu has siblings, use them first */
@@ -33,8 +33,8 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
sibl = cpu + 1;

for_each_cpu_and_from(sibl, siblmsk, nmsk) {
- cpumask_clear_cpu(sibl, nmsk);
- cpumask_set_cpu(sibl, irqmsk);
+ __cpumask_clear_cpu(sibl, nmsk);
+ __cpumask_set_cpu(sibl, irqmsk);
if (cpus_per_grp-- == 0)
return;
}
--
2.40.1

2023-12-12 04:22:05

by Yury Norov

Subject: [PATCH v3 7/7] lib/group_cpus: simplify grp_spread_init_one() for more

The outer and inner loops of grp_spread_init_one() do the same thing:
move a bit from nmsk to irqmsk.

The inner loop iterates over the sibling group, which includes the CPU
picked by the outer loop. That means we can drop the part that moves the
bit in the outer loop.
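
Spelled out (this mirrors the hunk below; the assumption is that 'cpu'
is set in its own topology_sibling_cpumask()):

	sibl = cpu;		/* start the inner scan at cpu itself */

	for_each_cpu_and_from(sibl, siblmsk, nmsk) {
		/* first iteration: sibl == cpu, so cpu's own bit moves here */
		__cpumask_clear_cpu(sibl, nmsk);
		__cpumask_set_cpu(sibl, irqmsk);
		...
	}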

Signed-off-by: Yury Norov <[email protected]>
---
lib/group_cpus.c | 8 +-------
1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index 664a56171a1b..7aa7a6289355 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -18,14 +18,8 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
int cpu, sibl;

for_each_cpu(cpu, nmsk) {
- __cpumask_clear_cpu(cpu, nmsk);
- __cpumask_set_cpu(cpu, irqmsk);
- if (cpus_per_grp-- == 0)
- return;
-
- /* If the cpu has siblings, use them first */
siblmsk = topology_sibling_cpumask(cpu);
- sibl = cpu + 1;
+ sibl = cpu;

for_each_cpu_and_from(sibl, siblmsk, nmsk) {
__cpumask_clear_cpu(sibl, nmsk);
--
2.40.1

2023-12-12 04:22:29

by Yury Norov

Subject: [PATCH v3 4/7] lib/group_cpus: optimize outer loop in grp_spread_init_one()

Similarly to the inner loop, in the outer loop we can use the
for_each_cpu() macro and skip CPUs that have already been copied.

With this patch, the function becomes O(1), despite being a double loop.
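
This works because for_each_cpu() is driven by find_next_bit(), so
updating 'cpu' inside the loop body shifts where the next lookup
resumes; a simplified expansion (per include/linux/find.h):

	for (cpu = 0;
	     cpu = find_next_bit(cpumask_bits(nmsk), small_cpumask_bits, cpu),
	     cpu < small_cpumask_bits;
	     cpu++) {
		/* the body may advance 'cpu' past CPUs already moved as siblings */
	}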

Signed-off-by: Yury Norov <[email protected]>
---
lib/group_cpus.c | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index 7ac94664230f..cded3c8ea63b 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -17,16 +17,11 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
const struct cpumask *siblmsk;
int cpu, sibl;

- for ( ; cpus_per_grp > 0; ) {
- cpu = cpumask_first(nmsk);
-
- /* Should not happen, but I'm too lazy to think about it */
- if (cpu >= nr_cpu_ids)
- return;
-
+ for_each_cpu(cpu, nmsk) {
__cpumask_clear_cpu(cpu, nmsk);
__cpumask_set_cpu(cpu, irqmsk);
- cpus_per_grp--;
+ if (cpus_per_grp-- == 0)
+ return;

/* If the cpu has siblings, use them first */
siblmsk = topology_sibling_cpumask(cpu);
@@ -37,6 +32,7 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
__cpumask_set_cpu(sibl, irqmsk);
if (cpus_per_grp-- == 0)
return;
+ cpu = sibl + 1;
}
}
}
--
2.40.1

2023-12-12 09:51:15

by Ming Lei

Subject: Re: [PATCH v3 3/7] lib/group_cpus: relax atomicity requirement in grp_spread_init_one()

On Mon, Dec 11, 2023 at 08:21:03PM -0800, Yury Norov wrote:
> Because nmsk and irqmsk are stable, extra atomicity is not required.
>
> Signed-off-by: Yury Norov <[email protected]>
> ---
> lib/group_cpus.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/lib/group_cpus.c b/lib/group_cpus.c
> index 10dead3ab0e0..7ac94664230f 100644
> --- a/lib/group_cpus.c
> +++ b/lib/group_cpus.c
> @@ -24,8 +24,8 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
> if (cpu >= nr_cpu_ids)
> return;
>
> - cpumask_clear_cpu(cpu, nmsk);
> - cpumask_set_cpu(cpu, irqmsk);
> + __cpumask_clear_cpu(cpu, nmsk);
> + __cpumask_set_cpu(cpu, irqmsk);
> cpus_per_grp--;
>
> /* If the cpu has siblings, use them first */
> @@ -33,8 +33,8 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
> sibl = cpu + 1;
>
> for_each_cpu_and_from(sibl, siblmsk, nmsk) {
> - cpumask_clear_cpu(sibl, nmsk);
> - cpumask_set_cpu(sibl, irqmsk);
> + __cpumask_clear_cpu(sibl, nmsk);
> + __cpumask_set_cpu(sibl, irqmsk);

I think this kind of change should be avoided; here the code is
absolutely in a slow path, and we care about code cleanness and
readability much more than the cycles saved by dropping atomicity.


Thanks,
Ming

2023-12-12 16:52:35

by Yury Norov

Subject: Re: [PATCH v3 3/7] lib/group_cpus: relax atomicity requirement in grp_spread_init_one()

On Tue, Dec 12, 2023 at 05:50:04PM +0800, Ming Lei wrote:
> On Mon, Dec 11, 2023 at 08:21:03PM -0800, Yury Norov wrote:
> > Because nmsk and irqmsk are stable, extra atomicity is not required.
> >
> > Signed-off-by: Yury Norov <[email protected]>
> > ---
> > lib/group_cpus.c | 8 ++++----
> > 1 file changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/lib/group_cpus.c b/lib/group_cpus.c
> > index 10dead3ab0e0..7ac94664230f 100644
> > --- a/lib/group_cpus.c
> > +++ b/lib/group_cpus.c
> > @@ -24,8 +24,8 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
> > if (cpu >= nr_cpu_ids)
> > return;
> >
> > - cpumask_clear_cpu(cpu, nmsk);
> > - cpumask_set_cpu(cpu, irqmsk);
> > + __cpumask_clear_cpu(cpu, nmsk);
> > + __cpumask_set_cpu(cpu, irqmsk);
> > cpus_per_grp--;
> >
> > /* If the cpu has siblings, use them first */
> > @@ -33,8 +33,8 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
> > sibl = cpu + 1;
> >
> > for_each_cpu_and_from(sibl, siblmsk, nmsk) {
> > - cpumask_clear_cpu(sibl, nmsk);
> > - cpumask_set_cpu(sibl, irqmsk);
> > + __cpumask_clear_cpu(sibl, nmsk);
> > + __cpumask_set_cpu(sibl, irqmsk);
>
> I think this kind of change should be avoided; here the code is
> absolutely in a slow path, and we care about code cleanness and
> readability much more than the cycles saved by dropping atomicity.

Atomic ops have special meaning and special function. This 'atomic' way
of moving a bit from one bitmap to another looks completely non-trivial
and puzzling to me.

A sequence of atomic ops is not atomic itself. Normally it's a sign of
a bug. But in this case, both masks are stable, and we don't need
atomicity at all.

It's not about performance, it's about readability.

Thanks,
Yury

2023-12-13 00:16:28

by Ming Lei

Subject: Re: [PATCH v3 3/7] lib/group_cpus: relax atomicity requirement in grp_spread_init_one()

On Tue, Dec 12, 2023 at 08:52:14AM -0800, Yury Norov wrote:
> On Tue, Dec 12, 2023 at 05:50:04PM +0800, Ming Lei wrote:
> > On Mon, Dec 11, 2023 at 08:21:03PM -0800, Yury Norov wrote:
> > > Because nmsk and irqmsk are stable, extra atomicity is not required.
> > >
> > > Signed-off-by: Yury Norov <[email protected]>
> > > ---
> > > lib/group_cpus.c | 8 ++++----
> > > 1 file changed, 4 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/lib/group_cpus.c b/lib/group_cpus.c
> > > index 10dead3ab0e0..7ac94664230f 100644
> > > --- a/lib/group_cpus.c
> > > +++ b/lib/group_cpus.c
> > > @@ -24,8 +24,8 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
> > > if (cpu >= nr_cpu_ids)
> > > return;
> > >
> > > - cpumask_clear_cpu(cpu, nmsk);
> > > - cpumask_set_cpu(cpu, irqmsk);
> > > + __cpumask_clear_cpu(cpu, nmsk);
> > > + __cpumask_set_cpu(cpu, irqmsk);
> > > cpus_per_grp--;
> > >
> > > /* If the cpu has siblings, use them first */
> > > @@ -33,8 +33,8 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
> > > sibl = cpu + 1;
> > >
> > > for_each_cpu_and_from(sibl, siblmsk, nmsk) {
> > > - cpumask_clear_cpu(sibl, nmsk);
> > > - cpumask_set_cpu(sibl, irqmsk);
> > > + __cpumask_clear_cpu(sibl, nmsk);
> > > + __cpumask_set_cpu(sibl, irqmsk);
> >
> > I think this kind of change should be avoided; here the code is
> > absolutely in a slow path, and we care about code cleanness and
> > readability much more than the cycles saved by dropping atomicity.
>
> Atomic ops have special meaning and special function. This 'atomic' way
> of moving a bit from one bitmap to another looks completely non-trivial
> and puzzling to me.
>
> A sequence of atomic ops is not atomic itself. Normally it's a sign of
> a bug. But in this case, both masks are stable, and we don't need
> atomicity at all.

Here we don't care about the atomicity.

>
> It's not about performance, it's about readability.

__cpumask_clear_cpu() and __cpumask_set_cpu() are more like private
helpers, and are harder to follow.

[@linux]$ git grep -n -w -E "cpumask_clear_cpu|cpumask_set_cpu" ./ | wc
674 2055 53954
[@linux]$ git grep -n -w -E "__cpumask_clear_cpu|__cpumask_set_cpu" ./ | wc
21 74 1580

I don't object to commenting on the current usage, but NAK for this change.


Thanks,
Ming

2023-12-13 00:56:54

by Ming Lei

Subject: Re: [PATCH v3 5/7] lib/group_cpus: don't zero cpumasks in group_cpus_evenly() on allocation

On Mon, Dec 11, 2023 at 08:21:05PM -0800, Yury Norov wrote:
> nmsk and npresmsk are both allocated with zalloc_cpumask_var(), but they
> are initialized by copying later in the code, so they can safely be
> allocated uninitialized.
>
> Signed-off-by: Yury Norov <[email protected]>
> ---
> lib/group_cpus.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/lib/group_cpus.c b/lib/group_cpus.c
> index cded3c8ea63b..c7fcd04c87bf 100644
> --- a/lib/group_cpus.c
> +++ b/lib/group_cpus.c
> @@ -347,10 +347,10 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
> int ret = -ENOMEM;
> struct cpumask *masks = NULL;
>
> - if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
> + if (!alloc_cpumask_var(&nmsk, GFP_KERNEL))
> return NULL;

`nmsk` is actually used by __group_cpus_evenly() only, and it should be
a local variable of __group_cpus_evenly(). Can you move its allocation
into __group_cpus_evenly()?
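
Something like the following (just a sketch of the idea, untested, with
the rest of the function elided):

static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
			       const struct cpumask *node_to_cpumask,
			       const struct cpumask *cpu_mask,
			       struct cpumask *masks)
{
	cpumask_var_t nmsk;
	int ret;

	if (!alloc_cpumask_var(&nmsk, GFP_KERNEL))
		return -ENOMEM;
	...
	free_cpumask_var(nmsk);
	return ret;
}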

>
> - if (!zalloc_cpumask_var(&npresmsk, GFP_KERNEL))
> + if (!alloc_cpumask_var(&npresmsk, GFP_KERNEL))
> goto fail_nmsk;

The above one looks fine, especially since `npresmsk` is initialized
explicitly in group_cpus_evenly().


Thanks,
Ming

2023-12-13 01:00:03

by Ming Lei

Subject: Re: [PATCH v3 6/7] lib/group_cpus: drop unneeded cpumask_empty() call in __group_cpus_evenly()

On Mon, Dec 11, 2023 at 08:21:06PM -0800, Yury Norov wrote:
> __group_cpus_evenly() is called twice. The first time it's called with
> a copy of cpu_present_mask as the parameter, which can't be empty. The
> second time it's called with a mask created with cpumask_andnot(),
> which returns false if the result is an empty mask.
>
> We can safely drop the redundant cpumask_empty() call from
> __group_cpus_evenly() and save a few cycles.
>
> Signed-off-by: Yury Norov <[email protected]>
> ---
> lib/group_cpus.c | 14 ++++++++------
> 1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/lib/group_cpus.c b/lib/group_cpus.c
> index c7fcd04c87bf..664a56171a1b 100644
> --- a/lib/group_cpus.c
> +++ b/lib/group_cpus.c
> @@ -252,9 +252,6 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
> nodemask_t nodemsk = NODE_MASK_NONE;
> struct node_groups *node_groups;
>
> - if (cpumask_empty(cpu_mask))
> - return 0;
> -
> nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_mask, &nodemsk);
>
> /*
> @@ -394,9 +391,14 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
> curgrp = 0;
> else
> curgrp = nr_present;
> - cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk);
> - ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
> - npresmsk, nmsk, masks);
> +
> + if (cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk))
> + /* If npresmsk is not empty */
> + ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
> + npresmsk, nmsk, masks);
> + else
> + ret = 0;
> +
> if (ret >= 0)
> nr_others = ret;

Reviewed-by: Ming Lei <[email protected]>

Thanks,
Ming

2023-12-13 01:06:38

by Ming Lei

Subject: Re: [PATCH v3 7/7] lib/group_cpus: simplify grp_spread_init_one() for more

On Mon, Dec 11, 2023 at 08:21:07PM -0800, Yury Norov wrote:
> The outer and inner loops of grp_spread_init_one() do the same thing:
> move a bit from nmsk to irqmsk.
>
> The inner loop iterates over the sibling group, which includes the CPU
> picked by the outer loop. That means we can drop the part that moves
> the bit in the outer loop.
>
> Signed-off-by: Yury Norov <[email protected]>
> ---
> lib/group_cpus.c | 8 +-------
> 1 file changed, 1 insertion(+), 7 deletions(-)
>
> diff --git a/lib/group_cpus.c b/lib/group_cpus.c
> index 664a56171a1b..7aa7a6289355 100644
> --- a/lib/group_cpus.c
> +++ b/lib/group_cpus.c
> @@ -18,14 +18,8 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
> int cpu, sibl;
>
> for_each_cpu(cpu, nmsk) {
> - __cpumask_clear_cpu(cpu, nmsk);
> - __cpumask_set_cpu(cpu, irqmsk);
> - if (cpus_per_grp-- == 0)
> - return;
> -
> - /* If the cpu has siblings, use them first */
> siblmsk = topology_sibling_cpumask(cpu);
> - sibl = cpu + 1;
> + sibl = cpu;
>
> for_each_cpu_and_from(sibl, siblmsk, nmsk) {
> __cpumask_clear_cpu(sibl, nmsk);

Correctness of the above change requires that 'cpu' be included in
topology_sibling_cpumask(cpu); however, I'm not sure that is always
true. See the following comment in Documentation/arch/x86/topology.rst:

`
- topology_sibling_cpumask():

The cpumask contains all online threads in the core to which a thread
belongs.
`
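
If the simplification is kept, the assumption could at least be made
explicit in grp_spread_init_one(), something like (untested):

	/* the reworked loop relies on this holding for every cpu in nmsk */
	WARN_ON_ONCE(!cpumask_test_cpu(cpu, topology_sibling_cpumask(cpu)));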

Thanks,
Ming

2023-12-13 17:03:35

by Yury Norov

Subject: Re: [PATCH v3 3/7] lib/group_cpus: relax atomicity requirement in grp_spread_init_one()

On Wed, Dec 13, 2023 at 08:14:45AM +0800, Ming Lei wrote:
> On Tue, Dec 12, 2023 at 08:52:14AM -0800, Yury Norov wrote:
> > On Tue, Dec 12, 2023 at 05:50:04PM +0800, Ming Lei wrote:
> > > On Mon, Dec 11, 2023 at 08:21:03PM -0800, Yury Norov wrote:
> > > > Because nmsk and irqmsk are stable, extra atomicity is not required.
> > > >
> > > > Signed-off-by: Yury Norov <[email protected]>
> > > > ---
> > > > lib/group_cpus.c | 8 ++++----
> > > > 1 file changed, 4 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/lib/group_cpus.c b/lib/group_cpus.c
> > > > index 10dead3ab0e0..7ac94664230f 100644
> > > > --- a/lib/group_cpus.c
> > > > +++ b/lib/group_cpus.c
> > > > @@ -24,8 +24,8 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
> > > > if (cpu >= nr_cpu_ids)
> > > > return;
> > > >
> > > > - cpumask_clear_cpu(cpu, nmsk);
> > > > - cpumask_set_cpu(cpu, irqmsk);
> > > > + __cpumask_clear_cpu(cpu, nmsk);
> > > > + __cpumask_set_cpu(cpu, irqmsk);
> > > > cpus_per_grp--;
> > > >
> > > > /* If the cpu has siblings, use them first */
> > > > @@ -33,8 +33,8 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
> > > > sibl = cpu + 1;
> > > >
> > > > for_each_cpu_and_from(sibl, siblmsk, nmsk) {
> > > > - cpumask_clear_cpu(sibl, nmsk);
> > > > - cpumask_set_cpu(sibl, irqmsk);
> > > > + __cpumask_clear_cpu(sibl, nmsk);
> > > > + __cpumask_set_cpu(sibl, irqmsk);
> > >
> > > I think this kind of change should be avoided; here the code is
> > > absolutely in a slow path, and we care about code cleanness and
> > > readability much more than the cycles saved by dropping atomicity.
> >
> > Atomic ops have special meaning and special function. This 'atomic' way
> > of moving a bit from one bitmap to another looks completely non-trivial
> > and puzzling to me.
> >
> > A sequence of atomic ops is not atomic itself. Normally it's a sign of
> > a bug. But in this case, both masks are stable, and we don't need
> > atomicity at all.
>
> Here we don't care about the atomicity.
>
> >
> > It's not about performance, it's about readability.
>
> __cpumask_clear_cpu() and __cpumask_set_cpu() are more like private
> helpers, and are harder to follow.

No, that's not true. The non-atomic version of the function is of
course not a private helper.

> [@linux]$ git grep -n -w -E "cpumask_clear_cpu|cpumask_set_cpu" ./ | wc
> 674 2055 53954
> [@linux]$ git grep -n -w -E "__cpumask_clear_cpu|__cpumask_set_cpu" ./ | wc
> 21 74 1580
>
> I don't object to commenting on the current usage, but NAK for this change.

No problem, I'll add your NAK.

2023-12-14 00:44:27

by Ming Lei

Subject: Re: [PATCH v3 3/7] lib/group_cpus: relax atomicity requirement in grp_spread_init_one()

On Wed, Dec 13, 2023 at 09:03:17AM -0800, Yury Norov wrote:
> On Wed, Dec 13, 2023 at 08:14:45AM +0800, Ming Lei wrote:
> > On Tue, Dec 12, 2023 at 08:52:14AM -0800, Yury Norov wrote:
> > > On Tue, Dec 12, 2023 at 05:50:04PM +0800, Ming Lei wrote:
> > > > On Mon, Dec 11, 2023 at 08:21:03PM -0800, Yury Norov wrote:
> > > > > Because nmsk and irqmsk are stable, extra atomicity is not required.
> > > > >
> > > > > Signed-off-by: Yury Norov <[email protected]>
> > > > > ---
> > > > > lib/group_cpus.c | 8 ++++----
> > > > > 1 file changed, 4 insertions(+), 4 deletions(-)
> > > > >
> > > > > diff --git a/lib/group_cpus.c b/lib/group_cpus.c
> > > > > index 10dead3ab0e0..7ac94664230f 100644
> > > > > --- a/lib/group_cpus.c
> > > > > +++ b/lib/group_cpus.c
> > > > > @@ -24,8 +24,8 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
> > > > > if (cpu >= nr_cpu_ids)
> > > > > return;
> > > > >
> > > > > - cpumask_clear_cpu(cpu, nmsk);
> > > > > - cpumask_set_cpu(cpu, irqmsk);
> > > > > + __cpumask_clear_cpu(cpu, nmsk);
> > > > > + __cpumask_set_cpu(cpu, irqmsk);
> > > > > cpus_per_grp--;
> > > > >
> > > > > /* If the cpu has siblings, use them first */
> > > > > @@ -33,8 +33,8 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
> > > > > sibl = cpu + 1;
> > > > >
> > > > > for_each_cpu_and_from(sibl, siblmsk, nmsk) {
> > > > > - cpumask_clear_cpu(sibl, nmsk);
> > > > > - cpumask_set_cpu(sibl, irqmsk);
> > > > > + __cpumask_clear_cpu(sibl, nmsk);
> > > > > + __cpumask_set_cpu(sibl, irqmsk);
> > > >
> > > > I think this kind of change should be avoided; here the code is
> > > > absolutely in a slow path, and we care about code cleanness and
> > > > readability much more than the cycles saved by dropping atomicity.
> > >
> > > Atomic ops have special meaning and special function. This 'atomic' way
> > > of moving a bit from one bitmap to another looks completely non-trivial
> > > and puzzling to me.
> > >
> > > A sequence of atomic ops is not atomic itself. Normally it's a sign of
> > > a bug. But in this case, both masks are stable, and we don't need
> > > atomicity at all.
> >
> > Here we don't care about the atomicity.
> >
> > >
> > > It's not about performance, it's about readability.
> >
> > __cpumask_clear_cpu() and __cpumask_set_cpu() are more like private
> > helpers, and are harder to follow.
>
> No, that's not true. The non-atomic version of the function is of
> course not a private helper.
>
> > [@linux]$ git grep -n -w -E "cpumask_clear_cpu|cpumask_set_cpu" ./ | wc
> > 674 2055 53954
> > [@linux]$ git grep -n -w -E "__cpumask_clear_cpu|__cpumask_set_cpu" ./ | wc
> > 21 74 1580
> >
> > I don't object to commenting on the current usage, but NAK for this change.
>
> No problem, I'll add your NAK.

You can add the following words in the meantime:

__cpumask_clear_cpu() and __cpumask_set_cpu() were added in commit
6c8557bdb28d ("smp, cpumask: Use non-atomic cpumask_{set,clear}_cpu()")
for a fast code path (smp_call_function_many()).

We have ~670 users of cpumask_clear_cpu & cpumask_set_cpu, and lots of
them fall into the same category as group_cpus.c (atomicity not needed,
not in a fast code path), so they needn't be changed to
__cpumask_clear_cpu() and __cpumask_set_cpu(). Otherwise, this change
may encourage updating the others to the __cpumask_* version as well.


Thanks,
Ming

2023-12-25 18:03:23

by Yury Norov

Subject: Re: [PATCH v3 7/7] lib/group_cpus: simplify grp_spread_init_one() for more

On Wed, Dec 13, 2023 at 09:06:12AM +0800, Ming Lei wrote:
> On Mon, Dec 11, 2023 at 08:21:07PM -0800, Yury Norov wrote:
> > The outer and inner loops of grp_spread_init_one() do the same thing:
> > move a bit from nmsk to irqmsk.
> >
> > The inner loop iterates over the sibling group, which includes the CPU
> > picked by the outer loop. That means we can drop the part that moves
> > the bit in the outer loop.
> >
> > Signed-off-by: Yury Norov <[email protected]>
> > ---
> > lib/group_cpus.c | 8 +-------
> > 1 file changed, 1 insertion(+), 7 deletions(-)
> >
> > diff --git a/lib/group_cpus.c b/lib/group_cpus.c
> > index 664a56171a1b..7aa7a6289355 100644
> > --- a/lib/group_cpus.c
> > +++ b/lib/group_cpus.c
> > @@ -18,14 +18,8 @@ static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
> > int cpu, sibl;
> >
> > for_each_cpu(cpu, nmsk) {
> > - __cpumask_clear_cpu(cpu, nmsk);
> > - __cpumask_set_cpu(cpu, irqmsk);
> > - if (cpus_per_grp-- == 0)
> > - return;
> > -
> > - /* If the cpu has siblings, use them first */
> > siblmsk = topology_sibling_cpumask(cpu);
> > - sibl = cpu + 1;
> > + sibl = cpu;
> >
> > for_each_cpu_and_from(sibl, siblmsk, nmsk) {
> > __cpumask_clear_cpu(sibl, nmsk);
>
> Correctness of the above change requires that 'cpu' be included in
> topology_sibling_cpumask(cpu); however, I'm not sure that is always
> true. See the following comment in Documentation/arch/x86/topology.rst:
>
> `
> - topology_sibling_cpumask():
>
> The cpumask contains all online threads in the core to which a thread
> belongs.
> `

It's kind of nontrivial to spread IRQs over offline CPUs, but
technically the concern above seems valid. I'll drop the patch then.