Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
Date:   Wed, 3 Aug 2022 16:51:07 -0700
From:   Yury Norov <yury.norov@gmail.com>
To:     Neel Natu <neelnatu@google.com>
Cc:     Andy Shevchenko <andriy.shevchenko@linux.intel.com>,
        Rasmus Villemoes <linux@rasmusvillemoes.dk>,
        Peter Zijlstra <peterz@infradead.org>,
        linux-kernel@vger.kernel.org, Thomas Gleixner <tglx@linutronix.de>,
        Andy Lutomirski <luto@kernel.org>
Subject: Re: [PATCH] sched, cpumask: don't leak impossible cpus via
 for_each_cpu_wrap().
Message-ID: <YusJ6/q/rXclCn8F@yury-laptop>
References: <20220802214041.2656586-1-neelnatu@google.com>
 <CAAH8bW_z3TjjafPRJPZmzDdOFjmU7xTLxOinw8D4nsZ2B6WN4w@mail.gmail.com>
 <CAJDe-O+8G=hyqWcaNk8_YP9S-R9-Xnn2_cLt8Loz9ru8V24K_Q@mail.gmail.com>
 <YurEAU5SlQaZ85vZ@yury-laptop>
 <CAJDe-O+7mqqrGvWimztYgEqDDqJGH=EA1NgJyiJjnurLASOQ8w@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CAJDe-O+7mqqrGvWimztYgEqDDqJGH=EA1NgJyiJjnurLASOQ8w@mail.gmail.com>
Precedence: bulk

+ Andy Lutomirski <luto@kernel.org>
+ Thomas Gleixner <tglx@linutronix.de>

On Wed, Aug 03, 2022 at 02:21:46PM -0700, Neel Natu wrote:
> ject: Re: [PATCH] sched, cpumask: don't leak impossible cpus via for_each_cpu_wrap().
> To: Yury Norov <yury.norov@gmail.com>
> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>, 
> 	Rasmus Villemoes <linux@rasmusvillemoes.dk>, Peter Zijlstra <peterz@infradead.org>, 
> 	linux-kernel@vger.kernel.org
> Content-Type: text/plain; charset="UTF-8"
> Status: O
> Content-Length: 6037
> Lines: 152
> 
> On Wed, Aug 3, 2022 at 11:52 AM Yury Norov <yury.norov@gmail.com> wrote:
> >
> > On Wed, Aug 03, 2022 at 10:49:57AM -0700, Neel Natu wrote:
> > > On Tue, Aug 2, 2022 at 6:22 PM Yury Norov <yury.norov@gmail.com> wrote:
> > > >
> > > > On Tue, Aug 2, 2022 at 2:41 PM Neel Natu <neelnatu@google.com> wrote:
> > > > >
> > > > > The value of 'nr_cpumask_bits' is dependent on CONFIG_CPUMASK_OFFSTACK.
> > > > > This in turn can change the set of cpus visited by for_each_cpu_wrap()
> > > > > with a mask that has bits set in the range [nr_cpu_ids, NR_CPUS).
> > > > >
> > > > > Specifically on !CONFIG_CPUMASK_OFFSTACK kernels the API can iterate
> > > > > over cpus outside the 'cpu_possible_mask'.
> > > > >
> > > > > Fix this to make its behavior match for_each_cpu() which always limits
> > > > > the iteration to the range [0, nr_cpu_ids).
> > > > >
> > > > > Signed-off-by: Neel Natu <neelnatu@google.com>
> > > >
> > > > The patch itself doesn't look correct because it randomly switches a piece
> > > > of cpumask API from nr_cpumask_bits to nr_cpu_ids, and doesn't touch
> > > > others.
> > > >
> > > > However...
> > > >
> > > > I don't know the story behind having 2 variables holding the max possible
> > > > number of cpus, and it looks like it dates back to prehistoric times.  In
> > > > modern kernel, there are 2 cases where nr_cpumask_bits == nr_cpu_ids
> > > > for sure: it's CONFIG_CPUMASK_OFFSTACK=y and
> > > > CONFIG_HOTPLUG_CPU=y. At least one of those is enabled in defconfig
> > > > of every popular architecture.
> > > >
> > >
> > > Hmm, in a kernel with CONFIG_HOTPLUG_CPU=y but not CONFIG_CPUMASK_OFFSTACK
> > > I see "nr_cpu_ids = 20, nr_cpumask_bits = 512". FYI since it doesn't
> > > match the observation
> > > above that nr_cpumask_bits == nr_cpu_ids when CONFIG_HOTPLUG_CPU=y.
> >
> > I said this because the comment in include/linux/cpumaks.h says:
> >  *  If HOTPLUG is enabled, then cpu_possible_mask is forced to have
> >  *  all NR_CPUS bits set, otherwise it is just the set of CPUs that
> >  *  ACPI reports present at boot.
> >
> > And because of the code that sets nr_cpu_ids:
> >
> > void __init setup_nr_cpu_ids(void)
> > {
> >         nr_cpu_ids = find_last_bit(cpumask_bits(cpu_possible_mask),NR_CPUS) + 1;
> > }
> >
> > Some arches override it, so it may be an issue. Are you running x86,
> > or maybe you have "nr_cpus" boot parameter?
> >
> > But anyways, I feel like this should be investigated for more... Can you
> > please share the config of your system and boot params?
> >
> 
> Yeah, this is a stock defconfig compiled on an x86_64 host and booted
> inside a 20 vcpu virtual machine (virtme). There are no command line
> params to modify the number of cpus.
> 
> I think everything is working as expected based on some debug prints I
> added during boot:
> [    0.641798] smp: setup_nr_cpu_ids: nr_cpu_ids 20, cpu_possible_mask 0-19
> [    0.648424] setup_percpu: NR_CPUS:64 nr_cpumask_bits:64
> nr_cpu_ids:20 nr_node_ids:2
> 
> The first one is from setup_nr_cpu_ids() in kernel/smp.c. The second
> one is from setup_per_cpu_areas() from arch/x86/setup_percpu.c.

So it doesn't look correct. When HOTPLUG is ON, than cpu_possible_mask must
be 0-NR_CPUS, as it's mentioned in include/linux/cpumaks.h, because
user may add up to NR_CPUS...

Adding Thomas, Peter and Andy.

Guys, can you please clarify:
1. What for do we have both nr_cpu_ids and nr_cpumask_bits variables? In
   case of CPUMSAK_OFFSTACK, nr_cpumask_bits is aliased to nr_cpu_ids, and
   in case of HOTPLUG both should be equal to NR_CPUS. In HOTPLUG=off,
   everything is set up at boot time and never changes, so again - why do
   we need 2 variables?
2. Neel observes that with HOTPLUG_CPU=y, nr_cpu_ids differs from
   nr_cpumask_bits because cpu_possible_mask is not set completely, despite
   the comment in cpumask.h:
     *  If HOTPLUG is enabled, then cpu_possible_mask is forced to have
     *  all NR_CPUS bits set, otherwise it is just the set of CPUs that
     *  ACPI reports present at boot.
   Is that a bug, or just incorrect comment? If not a bug, what code
   increases nr_cpu_ids when a cpu goes up? From what I see, nr_cpu_ids
   is set only once in start_kernel(), and then never changes.
 
> > > > In case of HOTPLUG is off, I don't understand why we should have nr_cpu_ids
> > > > and nr_cpumask_bits different - what case should it cover?... Interestingly, in
> > > > comments to cpumask functions and in the code those two are referred
> > > > interchangeably.
> > > >
> > > > Even more interestingly, we have a function bitmap_setall() that sets all bits
> > > > up to nr_cpumask_bits, and it could trigger the problem that you described,
> > >
> > > I think you mean cpumask_setall() that in turn calls
> > > bitmap_fill(nr_cpumask_bits)?
> >
> > Yes, sorry.
> >
> > > > so that someone would complain. (Are there any other valid reasons to set
> > > > bits behind nr_cpu_ids intentionally?)
> > > >
> > >
> > > I don't know of any although this wasn't the case that trigger in my case.
> > >
> > > > Can you share more details about how you triggered that? If you observe
> > > > those bits set, something else is probably already wrong...
> > >
> > > The non-intuitive behavior of for_each_cpu_wrap() was triggered when iterating
> > > over a cpumask passed by userspace that set a bit in the [nr_cpu_ids,
> > > nr_cpumask_bits)
> > > range.
> >
> > If you receive bitmap from userspace, you need to copy it with
> > bitmap_copy_clear_tail(), or bitmap_from_arr{32,64}, as appropriate.
> > That will clear unneeded bits.
> >
> > For user-to-kernel communications, I'd recommend passing bitmaps in a
> > human-readable formats - hex string or bitmap list. Check the functions
> > cpumask_parse_user() and cpumask_parselist_user(). This would help to
> > avoid all possible issues related to endianness and 32/64 word length
> > mismatch.
> >
> > > The kernel code is out of tree but open source so happy to provide a
> > > pointer if needed.
> >
> > Yes please
> >
> 
> Here is where we copy the user supplied cpumask using the
> get_user_cpu_mask() helper:
> https://github.com/google/ghost-kernel/blob/c21b36f87663efa2189876b2caa701fe74e72adf/kernel/sched/ghost.c#L5729
> 
> For performance reasons we cannot use human readable cpu masks in this
> code path.

If nr_cpu_ids == 20, I can't imagine a case where copying 4 bytes
string to bitmap_parse() instead of 3 bytes of plain bitmap would
affect your performance.

Even if you're going to scale, it's hard to believe that code that
calls alloc() functions 3 times without GFP_ATOMIC is so critical...
I didn't look at the code deeply, but it looks like it sends many
IPIs, which also doesn't look deterministic in terms of high-perf
loads.
 
> Note that the helper copies up to 'nr_cpumask_bits' which in some
> kernels (!CONFIG_CPUMASK_OFFSTACK) can copy bits beyond 'nr_cpu_ids'.
> A possible option could be to fix this helper but I do feel that
> for_each_cpu() and for_each_cpu_wrap() should visit the same set of
> cpus given the same cpumask (ordering can be different but the set of
> cpus should remain the same).
> 
> What do you think?

I think that for_each_cpu{,_wrap} is correct here because you have
HOTPLUG=y, and any cpu may be present.

Let's see what Peter, Andy and Tomas will say. I have a feeling that
we need to drop nr_cpumask_bits, and always rely on nr_cpu_ids.

For your problem, because you are passed with the mask from userspace,
you shouldn't rely on it anyways and need some sanity checks. Maybe
like this:
        if (!cpumask_subset(user_mask, cpu_possible_mask))
                return -EINVAL;

Thanks,
Yury

> best
> Neel
> 
> > > I considered ANDing the user supplied mask with 'cpu_possible_mask'
> > > but that felt
> > > like working around an inconsistency in the kernel API (hence the proposed fix).
> > >
> > > > So, if there is no real case in modern kernel to have nr_cpumask_bits and
> > > > nr_cpu_ids different, the