2018-05-09 06:24:52

by Dexuan Cui

[permalink] [raw]
Subject: for_each_cpu() is buggy for UP kernel?

In include/linux/cpumask.h, for_each_cpu is defined like this for UP kernel (CONFIG_NR_CPUS=1):

#define for_each_cpu(cpu, mask) \
for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask)

Here 'mask' is ignored, but what if 'mask' contains 0 CPU? -- in this case, the for loop should not
run at all, but with the current code, we run the loop once with cpu==0.

I think I'm seeing a bug in my UP kernel that is caused by the buggy for_each_cpu():

in kernel/time/tick-broadcast.c: tick_handle_oneshot_broadcast(), tick_broadcast_oneshot_mask
contains 0 CPU, but due to the buggy for_each_cpu(), the variable 'next_event' is changed from
its default value KTIME_MAX to "next_event = td->evtdev->next_event"; as a result,
tick_handle_oneshot_broadcast () -> tick_broadcast_set_event() -> clockevents_program_event()
-> pit_next_event() is programming the PIT timer by accident, causing an interrupt storm of PIT
interrupts in some way: I'm seeing that the kernel is receiving ~8000 PIT interrupts per second for
1~5 minutes when the UP kernel boots, and it looks the kernel hangs, but in 1~5 minutes, finally
somehow the kernel can recover and boot up fine. But, occasionally, the kernel just hangs there
forever, receiving ~8000 PIT timers per second.

With the below change in kernel/time/tick-broadcast.c, the interrupt storm will go away:

+#undef for_each_cpu
+#define for_each_cpu(cpu, mask) \
+ for ((cpu) = 0; (((cpu) < 1) && ((mask)[0].bits[0] & 1)); (cpu)++, (void)mask)

Should we fix the for_each_cpu() in include/linux/cpumask.h for UP?

Thanks,
-- Dexuan



2018-05-09 23:22:36

by Andrew Morton

[permalink] [raw]
Subject: Re: for_each_cpu() is buggy for UP kernel?

On Wed, 9 May 2018 06:24:16 +0000 Dexuan Cui <[email protected]> wrote:

> In include/linux/cpumask.h, for_each_cpu is defined like this for UP kernel (CONFIG_NR_CPUS=1):
>
> #define for_each_cpu(cpu, mask) \
> for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask)
>
> Here 'mask' is ignored, but what if 'mask' contains 0 CPU? -- in this case, the for loop should not
> run at all, but with the current code, we run the loop once with cpu==0.
>
> I think I'm seeing a bug in my UP kernel that is caused by the buggy for_each_cpu():
>
> in kernel/time/tick-broadcast.c: tick_handle_oneshot_broadcast(), tick_broadcast_oneshot_mask
> contains 0 CPU, but due to the buggy for_each_cpu(), the variable 'next_event' is changed from
> its default value KTIME_MAX to "next_event = td->evtdev->next_event"; as a result,
> tick_handle_oneshot_broadcast () -> tick_broadcast_set_event() -> clockevents_program_event()
> -> pit_next_event() is programming the PIT timer by accident, causing an interrupt storm of PIT
> interrupts in some way: I'm seeing that the kernel is receiving ~8000 PIT interrupts per second for
> 1~5 minutes when the UP kernel boots, and it looks the kernel hangs, but in 1~5 minutes, finally
> somehow the kernel can recover and boot up fine. But, occasionally, the kernel just hangs there
> forever, receiving ~8000 PIT timers per second.
>
> With the below change in kernel/time/tick-broadcast.c, the interrupt storm will go away:
>
> +#undef for_each_cpu
> +#define for_each_cpu(cpu, mask) \
> + for ((cpu) = 0; (((cpu) < 1) && ((mask)[0].bits[0] & 1)); (cpu)++, (void)mask)
>
> Should we fix the for_each_cpu() in include/linux/cpumask.h for UP?

I think so, yes. That might reveal new peculiarities, but such is life.

I guess we should use bitmap_empty() rather than open-coding it.

2018-05-13 13:37:51

by Thomas Gleixner

[permalink] [raw]
Subject: Re: for_each_cpu() is buggy for UP kernel?

On Wed, 9 May 2018, Andrew Morton wrote:

> On Wed, 9 May 2018 06:24:16 +0000 Dexuan Cui <[email protected]> wrote:
>
> > In include/linux/cpumask.h, for_each_cpu is defined like this for UP kernel (CONFIG_NR_CPUS=1):
> >
> > #define for_each_cpu(cpu, mask) \
> > for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask)
> >
> > Here 'mask' is ignored, but what if 'mask' contains 0 CPU? -- in this case, the for loop should not
> > run at all, but with the current code, we run the loop once with cpu==0.
> >
> > I think I'm seeing a bug in my UP kernel that is caused by the buggy for_each_cpu():
> >
> > in kernel/time/tick-broadcast.c: tick_handle_oneshot_broadcast(), tick_broadcast_oneshot_mask
> > contains 0 CPU, but due to the buggy for_each_cpu(), the variable 'next_event' is changed from
> > its default value KTIME_MAX to "next_event = td->evtdev->next_event"; as a result,
> > tick_handle_oneshot_broadcast () -> tick_broadcast_set_event() -> clockevents_program_event()
> > -> pit_next_event() is programming the PIT timer by accident, causing an interrupt storm of PIT
> > interrupts in some way: I'm seeing that the kernel is receiving ~8000 PIT interrupts per second for
> > 1~5 minutes when the UP kernel boots, and it looks the kernel hangs, but in 1~5 minutes, finally
> > somehow the kernel can recover and boot up fine. But, occasionally, the kernel just hangs there
> > forever, receiving ~8000 PIT timers per second.
> >
> > With the below change in kernel/time/tick-broadcast.c, the interrupt storm will go away:
> >
> > +#undef for_each_cpu
> > +#define for_each_cpu(cpu, mask) \
> > + for ((cpu) = 0; (((cpu) < 1) && ((mask)[0].bits[0] & 1)); (cpu)++, (void)mask)
> >
> > Should we fix the for_each_cpu() in include/linux/cpumask.h for UP?
>
> I think so, yes. That might reveal new peculiarities, but such is life.
>
> I guess we should use bitmap_empty() rather than open-coding it.

Agreed.

FWIW, this had been discussed before, but there was no real conclusion:

https://lkml.kernel.org/r/alpine.DEB.2.20.1709161850010.2105@nanos

Thanks,

tglx


2018-05-13 18:22:13

by Linus Torvalds

[permalink] [raw]
Subject: Re: for_each_cpu() is buggy for UP kernel?

On Tue, May 8, 2018 at 11:24 PM Dexuan Cui <[email protected]> wrote:

> Should we fix the for_each_cpu() in include/linux/cpumask.h for UP?

As Thomas points out, this has come up before.

One of the issues is historical - we tried very hard to make the SMP code
not cause code generation problems for UP, and part of that was just that
all these loops were literally designed to entirely go away under UP. It
still *looks* syntactically like a loop, but an optimizing compiler will
see that there's nothing there, and "for_each_cpu(...) x" essentially just
turns into "x" on UP. An empty mask simply generally doesn't make sense,
since opn UP you also don't have any masking of CPU ops, so the mask is
ignored, and that helps the code generation immensely.

If you have to load and test the mask, you immediately lose out badly in
code generation.

So honestly, I'd really prefer to keep our current behavior. Perhaps with a
debug option that actually tests (on SMP - because that's what every
developer is actually _using_ these days) that the mask isn't empty. But
I'm not sure that would find this case, since presumably on SMP it might
never be empty.

Now, there is likely a fairly good argument that UP is getting _so_
uninteresting that we shouldn't even worry about code generation. But the
counter-argument to that is that if people are using UP in this day and
age, they probably are using some really crappy hardware that needs all the
help it can get.

At least for now, I'd rather have this inconsistency, because it really
makes a surprisingly *big* difference in code generation. From the little
test I just did, adding that mask testing to a *single* case of
for_each_cpu() added 20 instructions. I didn't look at exactly why that
happened (because the code generation was so radically different), but it
was very noticeable. I used your macro replacement in kernel/taskstats.c in
case you want to try to dig into what happened, but I'm not surprised. It
really turns an unconditional trivial loop into a much more complex thing
that needs to look at and test a value that we didn't care about before.

Maybe we should introduce a "for_each_cpu_maybe_empty()" helper for cases
like this?

Linus

2018-05-14 07:29:05

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: for_each_cpu() is buggy for UP kernel?

On Sun, May 13, 2018 at 8:21 PM, Linus Torvalds
<[email protected]> wrote:
> On Tue, May 8, 2018 at 11:24 PM Dexuan Cui <[email protected]> wrote:
>
>> Should we fix the for_each_cpu() in include/linux/cpumask.h for UP?
>
> As Thomas points out, this has come up before.
>
> One of the issues is historical - we tried very hard to make the SMP code
> not cause code generation problems for UP, and part of that was just that
> all these loops were literally designed to entirely go away under UP. It
> still *looks* syntactically like a loop, but an optimizing compiler will
> see that there's nothing there, and "for_each_cpu(...) x" essentially just
> turns into "x" on UP. An empty mask simply generally doesn't make sense,
> since opn UP you also don't have any masking of CPU ops, so the mask is
> ignored, and that helps the code generation immensely.
>
> If you have to load and test the mask, you immediately lose out badly in
> code generation.
>
> So honestly, I'd really prefer to keep our current behavior. Perhaps with a
> debug option that actually tests (on SMP - because that's what every
> developer is actually _using_ these days) that the mask isn't empty. But
> I'm not sure that would find this case, since presumably on SMP it might
> never be empty.

This looks like the problem automated testing traditionally and
effectively solves. If UP is an important config, there must be
automated pre/post commit checks for this.

2018-05-15 03:03:06

by Dexuan Cui

[permalink] [raw]
Subject: RE: for_each_cpu() is buggy for UP kernel?

> From: Linus Torvalds <[email protected]>
> Sent: Sunday, May 13, 2018 11:22
> On Tue, May 8, 2018 at 11:24 PM Dexuan Cui <[email protected]> wrote:
>
> > Should we fix the for_each_cpu() in include/linux/cpumask.h for UP?
>
> As Thomas points out, this has come up before.
>
> One of the issues is historical - we tried very hard to make the SMP code
> not cause code generation problems for UP, and part of that was just that
> all these loops were literally designed to entirely go away under UP. It
> still *looks* syntactically like a loop, but an optimizing compiler will
> see that there's nothing there, and "for_each_cpu(...) x" essentially just
> turns into "x" on UP. An empty mask simply generally doesn't make sense,
> since opn UP you also don't have any masking of CPU ops, so the mask is
> ignored, and that helps the code generation immensely.
>
> If you have to load and test the mask, you immediately lose out badly in
> code generation.
Thank you all for the insights and the detailed background introduction!

> So honestly, I'd really prefer to keep our current behavior. Perhaps with a
> debug option that actually tests (on SMP - because that's what every
> developer is actually _using_ these days) that the mask isn't empty. But
> I'm not sure that would find this case, since presumably on SMP it might
> never be empty.
I agree.

> Now, there is likely a fairly good argument that UP is getting _so_
> uninteresting that we shouldn't even worry about code generation. But the
> counter-argument to that is that if people are using UP in this day and
> age, they probably are using some really crappy hardware that needs all the
> help it can get.
FWIW, I happened to find this issue in a SMP virtual machine, but the kernel
from a customer was built with CONFIG_SMP disabled. After spending 1 day
debugging the strange boot-up delay, which was caused by the unexpected
PIT interrupt storm, I finally tracked it down to the UP version of for_each_cpu().

The function exposing the issue is kernel/time/tick-broadcast.c:
tick_handle_oneshot_broadcast().

If you're OK with the below fix (not tested yet), I'll submit a patch for it:

--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -616,6 +616,10 @@ static void tick_handle_oneshot_broadcast(struct clock_event_device *dev)
now = ktime_get();
/* Find all expired events */
for_each_cpu(cpu, tick_broadcast_oneshot_mask) {
+#ifndef CONFIG_SMP
+ if (cpumask_empty(tick_broadcast_oneshot_mask))
+ break;
+#endif
td = &per_cpu(tick_cpu_device, cpu);
if (td->evtdev->next_event <= now) {
cpumask_set_cpu(cpu, tmpmask);

> At least for now, I'd rather have this inconsistency, because it really
> makes a surprisingly *big* difference in code generation. From the little
> test I just did, adding that mask testing to a *single* case of
> for_each_cpu() added 20 instructions. I didn't look at exactly why that
> happened (because the code generation was so radically different), but it
> was very noticeable. I used your macro replacement in kernel/taskstats.c in
> case you want to try to dig into what happened, but I'm not surprised. It
> really turns an unconditional trivial loop into a much more complex thing
> that needs to look at and test a value that we didn't care about before.
I agree.


> Maybe we should introduce a "for_each_cpu_maybe_empty()" helper for
> cases like this?
> Linus
Sounds like a good idea.

Thanks,
-- Dexuan

2018-05-15 17:22:40

by Linus Torvalds

[permalink] [raw]
Subject: Re: for_each_cpu() is buggy for UP kernel?

On Mon, May 14, 2018 at 8:02 PM Dexuan Cui <[email protected]> wrote:

> If you're OK with the below fix (not tested yet), I'll submit a patch for
it:

> --- a/kernel/time/tick-broadcast.c
> +++ b/kernel/time/tick-broadcast.c
> @@ -616,6 +616,10 @@ static void tick_handle_oneshot_broadcast(struct
clock_event_device *dev)
> now = ktime_get();
> /* Find all expired events */
> for_each_cpu(cpu, tick_broadcast_oneshot_mask) {
> +#ifndef CONFIG_SMP
> + if (cpumask_empty(tick_broadcast_oneshot_mask))
> + break;
> +#endif

I'm certainly ok with this. It's hacky, but maybe being explicitly hacky is
good to "document" this gotcha.

And I really do agree that this special UP case is nasty nasty and much too
subtle, and I hope that some day we won't care about UP at all, and maybe
kill it, or maybe just make for_each_cpu() generate the extra code to have
the actual same semantics as the SMP case.

Linus

2018-05-15 20:11:06

by Dexuan Cui

[permalink] [raw]
Subject: RE: for_each_cpu() is buggy for UP kernel?

> From: Linus Torvalds <[email protected]>
> Sent: Tuesday, May 15, 2018 10:22
> To: Dexuan Cui <[email protected]>
>
> On Mon, May 14, 2018 at 8:02 PM Dexuan Cui <[email protected]> wrote:
>
> > If you're OK with the below fix (not tested yet), I'll submit a patch for it:
>
> > --- a/kernel/time/tick-broadcast.c
> > +++ b/kernel/time/tick-broadcast.c
> > @@ -616,6 +616,10 @@ static void tick_handle_oneshot_broadcast(struct
> clock_event_device *dev)
> > now = ktime_get();
> > /* Find all expired events */
> > for_each_cpu(cpu, tick_broadcast_oneshot_mask) {
> > +#ifndef CONFIG_SMP
> > + if (cpumask_empty(tick_broadcast_oneshot_mask))
> > + break;
> > +#endif
>
> I'm certainly ok with this. It's hacky, but maybe being explicitly hacky is
> good to "document" this gotcha.
>
> And I really do agree that this special UP case is nasty nasty and much too
> subtle, and I hope that some day we won't care about UP at all, and maybe
> kill it, or maybe just make for_each_cpu() generate the extra code to have
> the actual same semantics as the SMP case.
>
> Linus

Thanks! I submitted the patch just now: https://lkml.org/lkml/2018/5/15/866

Thanks,
-- Dexuan