2019-05-22 11:17:10

by Andrew Murray

[permalink] [raw]
Subject: [PATCH] smp,cpumask: Don't call functions on offline CPUs

When we are able to allocate a cpumask in on_each_cpu_cond_mask
we call functions with on_each_cpu_mask - this masks out offline
cpus via smp_call_function_many.

However when we fail to allocate a cpumask in on_each_cpu_cond_mask
we call functions with smp_call_function_single - this will return
-ENXIO from generic_exec_single if a CPU is offline which will
result in a WARN_ON_ONCE.

Let's avoid the WARN by only calling smp_call_function_single when
the CPU is online and thus making both paths consistent with each
other.

Signed-off-by: Andrew Murray <[email protected]>
---
kernel/smp.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index f4cf1b0bb3b8..10970692f1c0 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -259,6 +259,7 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)

/*
* smp_call_function_single - Run a function on a specific CPU
+ * @cpu: The CPU to run on.
* @func: The function to run. This must be fast and non-blocking.
* @info: An arbitrary pointer to pass to the function.
* @wait: If true, wait until function has completed on other CPUs.
@@ -657,6 +658,8 @@ EXPORT_SYMBOL(on_each_cpu_mask);
* completed on other CPUs.
* @gfp_flags: GFP flags to use when allocating the cpumask
* used internally by the function.
+ * @mask: The set of cpus to run on (only runs on online
+ subset).
*
* The function might sleep if the GFP flags indicates a non
* atomic allocation is allowed.
@@ -690,12 +693,13 @@ void on_each_cpu_cond_mask(bool (*cond_func)(int cpu, void *info),
* just have to IPI them one by one.
*/
preempt_disable();
- for_each_cpu(cpu, mask)
- if (cond_func(cpu, info)) {
+ for_each_cpu(cpu, mask) {
+ if (cpu_online(cpu) && cond_func(cpu, info)) {
ret = smp_call_function_single(cpu, func,
info, wait);
WARN_ON_ONCE(ret);
}
+ }
preempt_enable();
}
}
--
2.21.0


2019-05-22 14:11:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] smp,cpumask: Don't call functions on offline CPUs

On Wed, May 22, 2019 at 12:15:37PM +0100, Andrew Murray wrote:
> When we are able to allocate a cpumask in on_each_cpu_cond_mask
> we call functions with on_each_cpu_mask - this masks out offline
> cpus via smp_call_function_many.
>
> However when we fail to allocate a cpumask in on_each_cpu_cond_mask
> we call functions with smp_call_function_single - this will return
> -ENXIO from generic_exec_single if a CPU is offline which will
> result in a WARN_ON_ONCE.
>
> Let's avoid the WARN by only calling smp_call_function_single when
> the CPU is online and thus making both paths consistent with each
> other.

I'm confused, why are you feeding it offline CPUs to begin with? @mask
shouldn't include them.

Is perhaps the problem that on_each_cpu_cond() uses cpu_onlne_mask
without protection?

Something like so?

diff --git a/kernel/smp.c b/kernel/smp.c
index f4cf1b0bb3b8..a493b3dfa67f 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -705,8 +707,10 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
smp_call_func_t func, void *info, bool wait,
gfp_t gfp_flags)
{
+ cpus_read_lock();
on_each_cpu_cond_mask(cond_func, func, info, wait, gfp_flags,
cpu_online_mask);
+ cpus_read_unlock();
}
EXPORT_SYMBOL(on_each_cpu_cond);


2019-05-22 14:41:24

by Andrew Murray

[permalink] [raw]
Subject: Re: [PATCH] smp,cpumask: Don't call functions on offline CPUs

On Wed, May 22, 2019 at 04:09:21PM +0200, Peter Zijlstra wrote:
> On Wed, May 22, 2019 at 12:15:37PM +0100, Andrew Murray wrote:
> > When we are able to allocate a cpumask in on_each_cpu_cond_mask
> > we call functions with on_each_cpu_mask - this masks out offline
> > cpus via smp_call_function_many.
> >
> > However when we fail to allocate a cpumask in on_each_cpu_cond_mask
> > we call functions with smp_call_function_single - this will return
> > -ENXIO from generic_exec_single if a CPU is offline which will
> > result in a WARN_ON_ONCE.
> >
> > Let's avoid the WARN by only calling smp_call_function_single when
> > the CPU is online and thus making both paths consistent with each
> > other.
>
> I'm confused, why are you feeding it offline CPUs to begin with? @mask
> shouldn't include them.

There are only two users of this function: on_each_cpu_cond which passes
cpu_online_mask, and native_flush_tlb_others from arch/x86/mm/tlb.c. But
I haven't followed all callers to native_flush_tlb_others to determine if
they are only passing in online CPUs.

Whilst trying to understand the code I couldn't help but notice that the
behaviour changes depending on the ability of memory allocation at the
time. This seemed odd. And of course there may be future callers of this
function that for some strange reason do pass in offline CPUs. (Maybe this
is OK and it's something they shouldn't be doing).

I felt that this function should be consistent and document the behaviour
via the @mask doc like some of the other functions in the file.

>
> Is perhaps the problem that on_each_cpu_cond() uses cpu_onlne_mask
> without protection?

Does this prevent racing with a CPU going offline? I guess this prevents
the warning at the expense of a lock - but is only beneficial in the
unlikely path. (In the likely path this prevents new CPUs going offline
but we don't care because we don't WARN if they aren't they when we
attempt to call functions).

At least this is my limited understanding.

Thanks,

Andrew Murray

>
> Something like so?
>
> diff --git a/kernel/smp.c b/kernel/smp.c
> index f4cf1b0bb3b8..a493b3dfa67f 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -705,8 +707,10 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
> smp_call_func_t func, void *info, bool wait,
> gfp_t gfp_flags)
> {
> + cpus_read_lock();
> on_each_cpu_cond_mask(cond_func, func, info, wait, gfp_flags,
> cpu_online_mask);
> + cpus_read_unlock();
> }
> EXPORT_SYMBOL(on_each_cpu_cond);
>
>

2019-05-22 14:50:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] smp,cpumask: Don't call functions on offline CPUs

On Wed, May 22, 2019 at 03:37:11PM +0100, Andrew Murray wrote:
> > Is perhaps the problem that on_each_cpu_cond() uses cpu_onlne_mask
> > without protection?
>
> Does this prevent racing with a CPU going offline? I guess this prevents
> the warning at the expense of a lock - but is only beneficial in the
> unlikely path. (In the likely path this prevents new CPUs going offline
> but we don't care because we don't WARN if they aren't they when we
> attempt to call functions).
>
> At least this is my limited understanding.

Hmm.. I don't think it could matter, we only use the mask when
preempt_disable(), which would already block offline, due to it using
stop_machine().

So the patch is a no-op.

What's the WARN you see? TLB invalidation should pass mm_cpumask(),
which similarly should not contain offline CPUs I'm thinking.

2019-05-22 15:17:06

by Andrew Murray

[permalink] [raw]
Subject: Re: [PATCH] smp,cpumask: Don't call functions on offline CPUs

On Wed, May 22, 2019 at 04:49:18PM +0200, Peter Zijlstra wrote:
> On Wed, May 22, 2019 at 03:37:11PM +0100, Andrew Murray wrote:
> > > Is perhaps the problem that on_each_cpu_cond() uses cpu_onlne_mask
> > > without protection?
> >
> > Does this prevent racing with a CPU going offline? I guess this prevents
> > the warning at the expense of a lock - but is only beneficial in the
> > unlikely path. (In the likely path this prevents new CPUs going offline
> > but we don't care because we don't WARN if they aren't they when we
> > attempt to call functions).
> >
> > At least this is my limited understanding.
>
> Hmm.. I don't think it could matter, we only use the mask when
> preempt_disable(), which would already block offline, due to it using
> stop_machine().

OK, so as long as all arches use stop_machine to bring down a CPU then
preeempt_disable will stop racing with CPUs going offline (I don't know
if they all do or not).

>
> So the patch is a no-op.
>
> What's the WARN you see? TLB invalidation should pass mm_cpumask(),
> which similarly should not contain offline CPUs I'm thinking.

So the only remaining issue is if callers pass offline CPUs to the
function (I guess there is still the chance of a race between calling
on_each_cpu_cond_mask and entering the preempt disabled'd bit). As you
suggest they probably don't.

Given the above, should we add a " @mask: The mask of cpus it can run on."
to on_each_cpu_cond_mask? (Which is different to the wording of other
functions in the same file that mask it with online CPUs, e.g.
" @mask: The set of cpus to run on (only runs on online subset).".

(I haven't seen any WARN, but from looking at the code thought that it
was possible.)

Thanks,

Andrew Murray

2019-05-22 18:26:40

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] smp,cpumask: Don't call functions on offline CPUs

On Wed, 2019-05-22 at 16:49 +0200, Peter Zijlstra wrote:
> On Wed, May 22, 2019 at 03:37:11PM +0100, Andrew Murray wrote:
> > > Is perhaps the problem that on_each_cpu_cond() uses
> > > cpu_onlne_mask
> > > without protection?
> >
> > Does this prevent racing with a CPU going offline? I guess this
> > prevents
> > the warning at the expense of a lock - but is only beneficial in
> > the
> > unlikely path. (In the likely path this prevents new CPUs going
> > offline
> > but we don't care because we don't WARN if they aren't they when we
> > attempt to call functions).
> >
> > At least this is my limited understanding.
>
> Hmm.. I don't think it could matter, we only use the mask when
> preempt_disable(), which would already block offline, due to it using
> stop_machine().
>
> So the patch is a no-op.
>
> What's the WARN you see? TLB invalidation should pass mm_cpumask(),
> which similarly should not contain offline CPUs I'm thinking.

Does the TLB invalidation code have anything in it
to prevent from racing with the CPU offline code?

In other words, could we end up with the TLB
invalidation code building its bitmask, getting
interrupted (eg. hypervisor preemption, NMI),
and not sending out the IPI to that bitmask of
CPUs until after one of the CPUs in the bitmap
has gotten offlined?

--
All Rights Reversed.


Attachments:
signature.asc (499.00 B)
This is a digitally signed message part

2019-05-27 10:54:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] smp,cpumask: Don't call functions on offline CPUs

On Wed, May 22, 2019 at 04:14:23PM +0100, Andrew Murray wrote:
> On Wed, May 22, 2019 at 04:49:18PM +0200, Peter Zijlstra wrote:
> > On Wed, May 22, 2019 at 03:37:11PM +0100, Andrew Murray wrote:
> > > > Is perhaps the problem that on_each_cpu_cond() uses cpu_onlne_mask
> > > > without protection?
> > >
> > > Does this prevent racing with a CPU going offline? I guess this prevents
> > > the warning at the expense of a lock - but is only beneficial in the
> > > unlikely path. (In the likely path this prevents new CPUs going offline
> > > but we don't care because we don't WARN if they aren't they when we
> > > attempt to call functions).
> > >
> > > At least this is my limited understanding.
> >
> > Hmm.. I don't think it could matter, we only use the mask when
> > preempt_disable(), which would already block offline, due to it using
> > stop_machine().
>
> OK, so as long as all arches use stop_machine to bring down a CPU then
> preeempt_disable will stop racing with CPUs going offline (I don't know
> if they all do or not).

kernel/cpu.c:takedown_cpu() is generic code :-)

> > So the patch is a no-op.
> >
> > What's the WARN you see? TLB invalidation should pass mm_cpumask(),
> > which similarly should not contain offline CPUs I'm thinking.
>
> So the only remaining issue is if callers pass offline CPUs to the
> function (I guess there is still the chance of a race between calling
> on_each_cpu_cond_mask and entering the preempt disabled'd bit). As you
> suggest they probably don't.

Yes, it is possible to construct fail that way.

2019-05-27 10:56:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] smp,cpumask: Don't call functions on offline CPUs

On Wed, May 22, 2019 at 02:23:47PM -0400, Rik van Riel wrote:
> On Wed, 2019-05-22 at 16:49 +0200, Peter Zijlstra wrote:
> > On Wed, May 22, 2019 at 03:37:11PM +0100, Andrew Murray wrote:
> > > > Is perhaps the problem that on_each_cpu_cond() uses
> > > > cpu_onlne_mask
> > > > without protection?
> > >
> > > Does this prevent racing with a CPU going offline? I guess this
> > > prevents
> > > the warning at the expense of a lock - but is only beneficial in
> > > the
> > > unlikely path. (In the likely path this prevents new CPUs going
> > > offline
> > > but we don't care because we don't WARN if they aren't they when we
> > > attempt to call functions).
> > >
> > > At least this is my limited understanding.
> >
> > Hmm.. I don't think it could matter, we only use the mask when
> > preempt_disable(), which would already block offline, due to it using
> > stop_machine().
> >
> > So the patch is a no-op.
> >
> > What's the WARN you see? TLB invalidation should pass mm_cpumask(),
> > which similarly should not contain offline CPUs I'm thinking.
>
> Does the TLB invalidation code have anything in it
> to prevent from racing with the CPU offline code?
>
> In other words, could we end up with the TLB
> invalidation code building its bitmask, getting
> interrupted (eg. hypervisor preemption, NMI),
> and not sending out the IPI to that bitmask of
> CPUs until after one of the CPUs in the bitmap
> has gotten offlined?

One possible thing would be if cpu-offline didn't remove the bit from
mm_cpumask() because it entered lazy state earlier or something.

Then, mm_cpumask() would contain an offline CPU and the WARN could
trigger, but I don't _think_ we do that, but I didn't check.