LinuxLists.cc - [PATCH] sched, cpumask: don't leak impossible cpus via for_each_cpu

2022-08-02 22:10:19

Subject: [PATCH] sched, cpumask: don't leak impossible cpus via for_each_cpu_wrap().

The value of 'nr_cpumask_bits' is dependent on CONFIG_CPUMASK_OFFSTACK.
This in turn can change the set of cpus visited by for_each_cpu_wrap()
with a mask that has bits set in the range [nr_cpu_ids, NR_CPUS).

Specifically on !CONFIG_CPUMASK_OFFSTACK kernels the API can iterate
over cpus outside the 'cpu_possible_mask'.

Fix this to make its behavior match for_each_cpu() which always limits
the iteration to the range [0, nr_cpu_ids).

Signed-off-by: Neel Natu <[email protected]>
---
include/linux/cpumask.h | 2 +-
lib/cpumask.c | 4 ++--
2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index fe29ac7cc469..2a308cfc43da 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -303,7 +303,7 @@ extern int cpumask_next_wrap(int n, const struct cpumask *mask, int start, bool
*/
#define for_each_cpu_wrap(cpu, mask, start) \
for ((cpu) = cpumask_next_wrap((start)-1, (mask), (start), false); \
- (cpu) < nr_cpumask_bits; \
+ (cpu) < nr_cpu_ids; \
(cpu) = cpumask_next_wrap((cpu), (mask), (start), true))

/**
diff --git a/lib/cpumask.c b/lib/cpumask.c
index a971a82d2f43..d47937fb49eb 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -82,9 +82,9 @@ int cpumask_next_wrap(int n, const struct cpumask *mask, int start, bool wrap)
next = cpumask_next(n, mask);

if (wrap && n < start && next >= start) {
- return nr_cpumask_bits;
+ return nr_cpu_ids;

- } else if (next >= nr_cpumask_bits) {
+ } else if (next >= nr_cpu_ids) {
wrap = true;
n = -1;
goto again;
--
2.37.1.455.g008518b4e5-goog

2022-08-03 01:58:31

by Yury Norov

[permalink] [raw]

Subject: Re: [PATCH] sched, cpumask: don't leak impossible cpus via for_each_cpu_wrap().

On Tue, Aug 2, 2022 at 2:41 PM Neel Natu <[email protected]> wrote:
>
> The value of 'nr_cpumask_bits' is dependent on CONFIG_CPUMASK_OFFSTACK.
> This in turn can change the set of cpus visited by for_each_cpu_wrap()
> with a mask that has bits set in the range [nr_cpu_ids, NR_CPUS).
>
> Specifically on !CONFIG_CPUMASK_OFFSTACK kernels the API can iterate
> over cpus outside the 'cpu_possible_mask'.
>
> Fix this to make its behavior match for_each_cpu() which always limits
> the iteration to the range [0, nr_cpu_ids).
>
> Signed-off-by: Neel Natu <[email protected]>

The patch itself doesn't look correct because it randomly switches a piece
of cpumask API from nr_cpumask_bits to nr_cpu_ids, and doesn't touch
others.

However...

I don't know the story behind having 2 variables holding the max possible
number of cpus, and it looks like it dates back to prehistoric times. In
modern kernel, there are 2 cases where nr_cpumask_bits == nr_cpu_ids
for sure: it's CONFIG_CPUMASK_OFFSTACK=y and
CONFIG_HOTPLUG_CPU=y. At least one of those is enabled in defconfig
of every popular architecture.

In case of HOTPLUG is off, I don't understand why we should have nr_cpu_ids
and nr_cpumask_bits different - what case should it cover?... Interestingly, in
comments to cpumask functions and in the code those two are referred
interchangeably.

Even more interestingly, we have a function bitmap_setall() that sets all bits
up to nr_cpumask_bits, and it could trigger the problem that you described,
so that someone would complain. (Are there any other valid reasons to set
bits behind nr_cpu_ids intentionally?)

Can you share more details about how you triggered that? If you observe
those bits set, something else is probably already wrong...

So, if there is no real case in modern kernel to have nr_cpumask_bits and
nr_cpu_ids different, the proper fix would be to just drop the first.

If there is such a case, this is probably your case, and we'd know more
details to understand how to deal with that.

Thanks,
Yury

2022-08-03 17:51:06

by Neel Natu

[permalink] [raw]

Subject: Re: [PATCH] sched, cpumask: don't leak impossible cpus via for_each_cpu_wrap().

On Tue, Aug 2, 2022 at 6:22 PM Yury Norov <[email protected]> wrote:
>
> On Tue, Aug 2, 2022 at 2:41 PM Neel Natu <[email protected]> wrote:
> >
> > The value of 'nr_cpumask_bits' is dependent on CONFIG_CPUMASK_OFFSTACK.
> > This in turn can change the set of cpus visited by for_each_cpu_wrap()
> > with a mask that has bits set in the range [nr_cpu_ids, NR_CPUS).
> >
> > Specifically on !CONFIG_CPUMASK_OFFSTACK kernels the API can iterate
> > over cpus outside the 'cpu_possible_mask'.
> >
> > Fix this to make its behavior match for_each_cpu() which always limits
> > the iteration to the range [0, nr_cpu_ids).
> >
> > Signed-off-by: Neel Natu <[email protected]>
>
> The patch itself doesn't look correct because it randomly switches a piece
> of cpumask API from nr_cpumask_bits to nr_cpu_ids, and doesn't touch
> others.
>
> However...
>
> I don't know the story behind having 2 variables holding the max possible
> number of cpus, and it looks like it dates back to prehistoric times. In
> modern kernel, there are 2 cases where nr_cpumask_bits == nr_cpu_ids
> for sure: it's CONFIG_CPUMASK_OFFSTACK=y and
> CONFIG_HOTPLUG_CPU=y. At least one of those is enabled in defconfig
> of every popular architecture.
>

Hmm, in a kernel with CONFIG_HOTPLUG_CPU=y but not CONFIG_CPUMASK_OFFSTACK
I see "nr_cpu_ids = 20, nr_cpumask_bits = 512". FYI since it doesn't
match the observation
above that nr_cpumask_bits == nr_cpu_ids when CONFIG_HOTPLUG_CPU=y.

> In case of HOTPLUG is off, I don't understand why we should have nr_cpu_ids
> and nr_cpumask_bits different - what case should it cover?... Interestingly, in
> comments to cpumask functions and in the code those two are referred
> interchangeably.
>
> Even more interestingly, we have a function bitmap_setall() that sets all bits
> up to nr_cpumask_bits, and it could trigger the problem that you described,

I think you mean cpumask_setall() that in turn calls
bitmap_fill(nr_cpumask_bits)?

> so that someone would complain. (Are there any other valid reasons to set
> bits behind nr_cpu_ids intentionally?)
>

I don't know of any although this wasn't the case that trigger in my case.

> Can you share more details about how you triggered that? If you observe
> those bits set, something else is probably already wrong...

The non-intuitive behavior of for_each_cpu_wrap() was triggered when iterating
over a cpumask passed by userspace that set a bit in the [nr_cpu_ids,
nr_cpumask_bits)
range.

The kernel code is out of tree but open source so happy to provide a
pointer if needed.

I considered ANDing the user supplied mask with 'cpu_possible_mask'
but that felt
like working around an inconsistency in the kernel API (hence the proposed fix).

> So, if there is no real case in modern kernel to have nr_cpumask_bits and
> nr_cpu_ids different, the proper fix would be to just drop the first.
>

I'll let other people more knowledgable than me in this area chime in.
I'll be happy either
way if that fixes the problem at hand.

best
Neel

> If there is such a case, this is probably your case, and we'd know more
> details to understand how to deal with that.
>
> Thanks,
> Yury

2022-08-03 19:04:18

by Yury Norov

[permalink] [raw]

Subject: Re: [PATCH] sched, cpumask: don't leak impossible cpus via for_each_cpu_wrap().

On Wed, Aug 03, 2022 at 10:49:57AM -0700, Neel Natu wrote:
> On Tue, Aug 2, 2022 at 6:22 PM Yury Norov <[email protected]> wrote:
> >
> > On Tue, Aug 2, 2022 at 2:41 PM Neel Natu <[email protected]> wrote:
> > >
> > > The value of 'nr_cpumask_bits' is dependent on CONFIG_CPUMASK_OFFSTACK.
> > > This in turn can change the set of cpus visited by for_each_cpu_wrap()
> > > with a mask that has bits set in the range [nr_cpu_ids, NR_CPUS).
> > >
> > > Specifically on !CONFIG_CPUMASK_OFFSTACK kernels the API can iterate
> > > over cpus outside the 'cpu_possible_mask'.
> > >
> > > Fix this to make its behavior match for_each_cpu() which always limits
> > > the iteration to the range [0, nr_cpu_ids).
> > >
> > > Signed-off-by: Neel Natu <[email protected]>
> >
> > The patch itself doesn't look correct because it randomly switches a piece
> > of cpumask API from nr_cpumask_bits to nr_cpu_ids, and doesn't touch
> > others.
> >
> > However...
> >
> > I don't know the story behind having 2 variables holding the max possible
> > number of cpus, and it looks like it dates back to prehistoric times. In
> > modern kernel, there are 2 cases where nr_cpumask_bits == nr_cpu_ids
> > for sure: it's CONFIG_CPUMASK_OFFSTACK=y and
> > CONFIG_HOTPLUG_CPU=y. At least one of those is enabled in defconfig
> > of every popular architecture.
> >
>
> Hmm, in a kernel with CONFIG_HOTPLUG_CPU=y but not CONFIG_CPUMASK_OFFSTACK
> I see "nr_cpu_ids = 20, nr_cpumask_bits = 512". FYI since it doesn't
> match the observation
> above that nr_cpumask_bits == nr_cpu_ids when CONFIG_HOTPLUG_CPU=y.

I said this because the comment in include/linux/cpumaks.h says:
* If HOTPLUG is enabled, then cpu_possible_mask is forced to have
* all NR_CPUS bits set, otherwise it is just the set of CPUs that
* ACPI reports present at boot.

And because of the code that sets nr_cpu_ids:

void __init setup_nr_cpu_ids(void)
{
nr_cpu_ids = find_last_bit(cpumask_bits(cpu_possible_mask),NR_CPUS) + 1;
}

Some arches override it, so it may be an issue. Are you running x86,
or maybe you have "nr_cpus" boot parameter?

But anyways, I feel like this should be investigated for more... Can you
please share the config of your system and boot params?

> > In case of HOTPLUG is off, I don't understand why we should have nr_cpu_ids
> > and nr_cpumask_bits different - what case should it cover?... Interestingly, in
> > comments to cpumask functions and in the code those two are referred
> > interchangeably.
> >
> > Even more interestingly, we have a function bitmap_setall() that sets all bits
> > up to nr_cpumask_bits, and it could trigger the problem that you described,
>
> I think you mean cpumask_setall() that in turn calls
> bitmap_fill(nr_cpumask_bits)?

Yes, sorry.

> > so that someone would complain. (Are there any other valid reasons to set
> > bits behind nr_cpu_ids intentionally?)
> >
>
> I don't know of any although this wasn't the case that trigger in my case.
>
> > Can you share more details about how you triggered that? If you observe
> > those bits set, something else is probably already wrong...
>
> The non-intuitive behavior of for_each_cpu_wrap() was triggered when iterating
> over a cpumask passed by userspace that set a bit in the [nr_cpu_ids,
> nr_cpumask_bits)
> range.

If you receive bitmap from userspace, you need to copy it with
bitmap_copy_clear_tail(), or bitmap_from_arr{32,64}, as appropriate.
That will clear unneeded bits.

For user-to-kernel communications, I'd recommend passing bitmaps in a
human-readable formats - hex string or bitmap list. Check the functions
cpumask_parse_user() and cpumask_parselist_user(). This would help to
avoid all possible issues related to endianness and 32/64 word length
mismatch.

> The kernel code is out of tree but open source so happy to provide a
> pointer if needed.

Yes please

> I considered ANDing the user supplied mask with 'cpu_possible_mask'
> but that felt
> like working around an inconsistency in the kernel API (hence the proposed fix).
>
> > So, if there is no real case in modern kernel to have nr_cpumask_bits and
> > nr_cpu_ids different, the proper fix would be to just drop the first.
> >
>
> I'll let other people more knowledgable than me in this area chime in.
> I'll be happy either
> way if that fixes the problem at hand.
>
> best
> Neel
>
> > If there is such a case, this is probably your case, and we'd know more
> > details to understand how to deal with that.
> >
> > Thanks,
> > Yury

2022-08-03 21:33:46

by Neel Natu

[permalink] [raw]

Subject: Re: [PATCH] sched, cpumask: don't leak impossible cpus via for_each_cpu_wrap().

On Wed, Aug 3, 2022 at 11:52 AM Yury Norov <[email protected]> wrote:
>
> On Wed, Aug 03, 2022 at 10:49:57AM -0700, Neel Natu wrote:
> > On Tue, Aug 2, 2022 at 6:22 PM Yury Norov <[email protected]> wrote:
> > >
> > > On Tue, Aug 2, 2022 at 2:41 PM Neel Natu <[email protected]> wrote:
> > > >
> > > > The value of 'nr_cpumask_bits' is dependent on CONFIG_CPUMASK_OFFSTACK.
> > > > This in turn can change the set of cpus visited by for_each_cpu_wrap()
> > > > with a mask that has bits set in the range [nr_cpu_ids, NR_CPUS).
> > > >
> > > > Specifically on !CONFIG_CPUMASK_OFFSTACK kernels the API can iterate
> > > > over cpus outside the 'cpu_possible_mask'.
> > > >
> > > > Fix this to make its behavior match for_each_cpu() which always limits
> > > > the iteration to the range [0, nr_cpu_ids).
> > > >
> > > > Signed-off-by: Neel Natu <[email protected]>
> > >
> > > The patch itself doesn't look correct because it randomly switches a piece
> > > of cpumask API from nr_cpumask_bits to nr_cpu_ids, and doesn't touch
> > > others.
> > >
> > > However...
> > >
> > > I don't know the story behind having 2 variables holding the max possible
> > > number of cpus, and it looks like it dates back to prehistoric times. In
> > > modern kernel, there are 2 cases where nr_cpumask_bits == nr_cpu_ids
> > > for sure: it's CONFIG_CPUMASK_OFFSTACK=y and
> > > CONFIG_HOTPLUG_CPU=y. At least one of those is enabled in defconfig
> > > of every popular architecture.
> > >
> >
> > Hmm, in a kernel with CONFIG_HOTPLUG_CPU=y but not CONFIG_CPUMASK_OFFSTACK
> > I see "nr_cpu_ids = 20, nr_cpumask_bits = 512". FYI since it doesn't
> > match the observation
> > above that nr_cpumask_bits == nr_cpu_ids when CONFIG_HOTPLUG_CPU=y.
>
> I said this because the comment in include/linux/cpumaks.h says:
> * If HOTPLUG is enabled, then cpu_possible_mask is forced to have
> * all NR_CPUS bits set, otherwise it is just the set of CPUs that
> * ACPI reports present at boot.
>
> And because of the code that sets nr_cpu_ids:
>
> void __init setup_nr_cpu_ids(void)
> {
> nr_cpu_ids = find_last_bit(cpumask_bits(cpu_possible_mask),NR_CPUS) + 1;
> }
>
> Some arches override it, so it may be an issue. Are you running x86,
> or maybe you have "nr_cpus" boot parameter?
>
> But anyways, I feel like this should be investigated for more... Can you
> please share the config of your system and boot params?
>

Yeah, this is a stock defconfig compiled on an x86_64 host and booted
inside a 20 vcpu virtual machine (virtme). There are no command line
params to modify the number of cpus.

I think everything is working as expected based on some debug prints I
added during boot:
[ 0.641798] smp: setup_nr_cpu_ids: nr_cpu_ids 20, cpu_possible_mask 0-19
[ 0.648424] setup_percpu: NR_CPUS:64 nr_cpumask_bits:64
nr_cpu_ids:20 nr_node_ids:2

The first one is from setup_nr_cpu_ids() in kernel/smp.c. The second
one is from setup_per_cpu_areas() from arch/x86/setup_percpu.c.

> > > In case of HOTPLUG is off, I don't understand why we should have nr_cpu_ids
> > > and nr_cpumask_bits different - what case should it cover?... Interestingly, in
> > > comments to cpumask functions and in the code those two are referred
> > > interchangeably.
> > >
> > > Even more interestingly, we have a function bitmap_setall() that sets all bits
> > > up to nr_cpumask_bits, and it could trigger the problem that you described,
> >
> > I think you mean cpumask_setall() that in turn calls
> > bitmap_fill(nr_cpumask_bits)?
>
> Yes, sorry.
>
> > > so that someone would complain. (Are there any other valid reasons to set
> > > bits behind nr_cpu_ids intentionally?)
> > >
> >
> > I don't know of any although this wasn't the case that trigger in my case.
> >
> > > Can you share more details about how you triggered that? If you observe
> > > those bits set, something else is probably already wrong...
> >
> > The non-intuitive behavior of for_each_cpu_wrap() was triggered when iterating
> > over a cpumask passed by userspace that set a bit in the [nr_cpu_ids,
> > nr_cpumask_bits)
> > range.
>
> If you receive bitmap from userspace, you need to copy it with
> bitmap_copy_clear_tail(), or bitmap_from_arr{32,64}, as appropriate.
> That will clear unneeded bits.
>
> For user-to-kernel communications, I'd recommend passing bitmaps in a
> human-readable formats - hex string or bitmap list. Check the functions
> cpumask_parse_user() and cpumask_parselist_user(). This would help to
> avoid all possible issues related to endianness and 32/64 word length
> mismatch.
>
> > The kernel code is out of tree but open source so happy to provide a
> > pointer if needed.
>
> Yes please
>

Here is where we copy the user supplied cpumask using the
get_user_cpu_mask() helper:
https://github.com/google/ghost-kernel/blob/c21b36f87663efa2189876b2caa701fe74e72adf/kernel/sched/ghost.c#L5729

For performance reasons we cannot use human readable cpu masks in this
code path.

Note that the helper copies up to 'nr_cpumask_bits' which in some
kernels (!CONFIG_CPUMASK_OFFSTACK) can copy bits beyond 'nr_cpu_ids'.
A possible option could be to fix this helper but I do feel that
for_each_cpu() and for_each_cpu_wrap() should visit the same set of
cpus given the same cpumask (ordering can be different but the set of
cpus should remain the same).

What do you think?

best
Neel

> > I considered ANDing the user supplied mask with 'cpu_possible_mask'
> > but that felt
> > like working around an inconsistency in the kernel API (hence the proposed fix).
> >
> > > So, if there is no real case in modern kernel to have nr_cpumask_bits and
> > > nr_cpu_ids different, the proper fix would be to just drop the first.
> > >
> >
> > I'll let other people more knowledgable than me in this area chime in.
> > I'll be happy either
> > way if that fixes the problem at hand.
> >
> > best
> > Neel
> >
> > > If there is such a case, this is probably your case, and we'd know more
> > > details to understand how to deal with that.
> > >
> > > Thanks,
> > > Yury

2022-08-03 23:52:56

by Yury Norov

[permalink] [raw]

Subject: Re: [PATCH] sched, cpumask: don't leak impossible cpus via for_each_cpu_wrap().

+ Andy Lutomirski <[email protected]>
+ Thomas Gleixner <[email protected]>

On Wed, Aug 03, 2022 at 02:21:46PM -0700, Neel Natu wrote:
> ject: Re: [PATCH] sched, cpumask: don't leak impossible cpus via for_each_cpu_wrap().
> To: Yury Norov <[email protected]>
> Cc: Andy Shevchenko <[email protected]>,
> Rasmus Villemoes <[email protected]>, Peter Zijlstra <[email protected]>,
> [email protected]
> Content-Type: text/plain; charset="UTF-8"
> Status: O
> Content-Length: 6037
> Lines: 152
>
> On Wed, Aug 3, 2022 at 11:52 AM Yury Norov <[email protected]> wrote:
> >
> > On Wed, Aug 03, 2022 at 10:49:57AM -0700, Neel Natu wrote:
> > > On Tue, Aug 2, 2022 at 6:22 PM Yury Norov <[email protected]> wrote:
> > > >
> > > > On Tue, Aug 2, 2022 at 2:41 PM Neel Natu <[email protected]> wrote:
> > > > >
> > > > > The value of 'nr_cpumask_bits' is dependent on CONFIG_CPUMASK_OFFSTACK.
> > > > > This in turn can change the set of cpus visited by for_each_cpu_wrap()
> > > > > with a mask that has bits set in the range [nr_cpu_ids, NR_CPUS).
> > > > >
> > > > > Specifically on !CONFIG_CPUMASK_OFFSTACK kernels the API can iterate
> > > > > over cpus outside the 'cpu_possible_mask'.
> > > > >
> > > > > Fix this to make its behavior match for_each_cpu() which always limits
> > > > > the iteration to the range [0, nr_cpu_ids).
> > > > >
> > > > > Signed-off-by: Neel Natu <[email protected]>
> > > >
> > > > The patch itself doesn't look correct because it randomly switches a piece
> > > > of cpumask API from nr_cpumask_bits to nr_cpu_ids, and doesn't touch
> > > > others.
> > > >
> > > > However...
> > > >
> > > > I don't know the story behind having 2 variables holding the max possible
> > > > number of cpus, and it looks like it dates back to prehistoric times. In
> > > > modern kernel, there are 2 cases where nr_cpumask_bits == nr_cpu_ids
> > > > for sure: it's CONFIG_CPUMASK_OFFSTACK=y and
> > > > CONFIG_HOTPLUG_CPU=y. At least one of those is enabled in defconfig
> > > > of every popular architecture.
> > > >
> > >
> > > Hmm, in a kernel with CONFIG_HOTPLUG_CPU=y but not CONFIG_CPUMASK_OFFSTACK
> > > I see "nr_cpu_ids = 20, nr_cpumask_bits = 512". FYI since it doesn't
> > > match the observation
> > > above that nr_cpumask_bits == nr_cpu_ids when CONFIG_HOTPLUG_CPU=y.
> >
> > I said this because the comment in include/linux/cpumaks.h says:
> > * If HOTPLUG is enabled, then cpu_possible_mask is forced to have
> > * all NR_CPUS bits set, otherwise it is just the set of CPUs that
> > * ACPI reports present at boot.
> >
> > And because of the code that sets nr_cpu_ids:
> >
> > void __init setup_nr_cpu_ids(void)
> > {
> > nr_cpu_ids = find_last_bit(cpumask_bits(cpu_possible_mask),NR_CPUS) + 1;
> > }
> >
> > Some arches override it, so it may be an issue. Are you running x86,
> > or maybe you have "nr_cpus" boot parameter?
> >
> > But anyways, I feel like this should be investigated for more... Can you
> > please share the config of your system and boot params?
> >
>
> Yeah, this is a stock defconfig compiled on an x86_64 host and booted
> inside a 20 vcpu virtual machine (virtme). There are no command line
> params to modify the number of cpus.
>
> I think everything is working as expected based on some debug prints I
> added during boot:
> [ 0.641798] smp: setup_nr_cpu_ids: nr_cpu_ids 20, cpu_possible_mask 0-19
> [ 0.648424] setup_percpu: NR_CPUS:64 nr_cpumask_bits:64
> nr_cpu_ids:20 nr_node_ids:2
>
> The first one is from setup_nr_cpu_ids() in kernel/smp.c. The second
> one is from setup_per_cpu_areas() from arch/x86/setup_percpu.c.

So it doesn't look correct. When HOTPLUG is ON, than cpu_possible_mask must
be 0-NR_CPUS, as it's mentioned in include/linux/cpumaks.h, because
user may add up to NR_CPUS...

Adding Thomas, Peter and Andy.

Guys, can you please clarify:
1. What for do we have both nr_cpu_ids and nr_cpumask_bits variables? In
case of CPUMSAK_OFFSTACK, nr_cpumask_bits is aliased to nr_cpu_ids, and
in case of HOTPLUG both should be equal to NR_CPUS. In HOTPLUG=off,
everything is set up at boot time and never changes, so again - why do
we need 2 variables?
2. Neel observes that with HOTPLUG_CPU=y, nr_cpu_ids differs from
nr_cpumask_bits because cpu_possible_mask is not set completely, despite
the comment in cpumask.h:
* If HOTPLUG is enabled, then cpu_possible_mask is forced to have
* all NR_CPUS bits set, otherwise it is just the set of CPUs that
* ACPI reports present at boot.
Is that a bug, or just incorrect comment? If not a bug, what code
increases nr_cpu_ids when a cpu goes up? From what I see, nr_cpu_ids
is set only once in start_kernel(), and then never changes.

> > > > In case of HOTPLUG is off, I don't understand why we should have nr_cpu_ids
> > > > and nr_cpumask_bits different - what case should it cover?... Interestingly, in
> > > > comments to cpumask functions and in the code those two are referred
> > > > interchangeably.
> > > >
> > > > Even more interestingly, we have a function bitmap_setall() that sets all bits
> > > > up to nr_cpumask_bits, and it could trigger the problem that you described,
> > >
> > > I think you mean cpumask_setall() that in turn calls
> > > bitmap_fill(nr_cpumask_bits)?
> >
> > Yes, sorry.
> >
> > > > so that someone would complain. (Are there any other valid reasons to set
> > > > bits behind nr_cpu_ids intentionally?)
> > > >
> > >
> > > I don't know of any although this wasn't the case that trigger in my case.
> > >
> > > > Can you share more details about how you triggered that? If you observe
> > > > those bits set, something else is probably already wrong...
> > >
> > > The non-intuitive behavior of for_each_cpu_wrap() was triggered when iterating
> > > over a cpumask passed by userspace that set a bit in the [nr_cpu_ids,
> > > nr_cpumask_bits)
> > > range.
> >
> > If you receive bitmap from userspace, you need to copy it with
> > bitmap_copy_clear_tail(), or bitmap_from_arr{32,64}, as appropriate.
> > That will clear unneeded bits.
> >
> > For user-to-kernel communications, I'd recommend passing bitmaps in a
> > human-readable formats - hex string or bitmap list. Check the functions
> > cpumask_parse_user() and cpumask_parselist_user(). This would help to
> > avoid all possible issues related to endianness and 32/64 word length
> > mismatch.
> >
> > > The kernel code is out of tree but open source so happy to provide a
> > > pointer if needed.
> >
> > Yes please
> >
>
> Here is where we copy the user supplied cpumask using the
> get_user_cpu_mask() helper:
> https://github.com/google/ghost-kernel/blob/c21b36f87663efa2189876b2caa701fe74e72adf/kernel/sched/ghost.c#L5729
>
> For performance reasons we cannot use human readable cpu masks in this
> code path.

If nr_cpu_ids == 20, I can't imagine a case where copying 4 bytes
string to bitmap_parse() instead of 3 bytes of plain bitmap would
affect your performance.

Even if you're going to scale, it's hard to believe that code that
calls alloc() functions 3 times without GFP_ATOMIC is so critical...
I didn't look at the code deeply, but it looks like it sends many
IPIs, which also doesn't look deterministic in terms of high-perf
loads.

> Note that the helper copies up to 'nr_cpumask_bits' which in some
> kernels (!CONFIG_CPUMASK_OFFSTACK) can copy bits beyond 'nr_cpu_ids'.
> A possible option could be to fix this helper but I do feel that
> for_each_cpu() and for_each_cpu_wrap() should visit the same set of
> cpus given the same cpumask (ordering can be different but the set of
> cpus should remain the same).
>
> What do you think?

I think that for_each_cpu{,_wrap} is correct here because you have
HOTPLUG=y, and any cpu may be present.

Let's see what Peter, Andy and Tomas will say. I have a feeling that
we need to drop nr_cpumask_bits, and always rely on nr_cpu_ids.

For your problem, because you are passed with the mask from userspace,
you shouldn't rely on it anyways and need some sanity checks. Maybe
like this:
if (!cpumask_subset(user_mask, cpu_possible_mask))
return -EINVAL;

Thanks,
Yury

> best
> Neel
>
> > > I considered ANDing the user supplied mask with 'cpu_possible_mask'
> > > but that felt
> > > like working around an inconsistency in the kernel API (hence the proposed fix).
> > >
> > > > So, if there is no real case in modern kernel to have nr_cpumask_bits and
> > > > nr_cpu_ids different, the