2020-06-26 21:07:39

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH tick-sched] Clarify "NOHZ: local_softirq_pending" warning

Currently, can_stop_idle_tick() prints "NOHZ: local_softirq_pending HH"
(where "HH" is the hexadecimal softirq vector number) when one or more
non-RCU softirq handlers are still enablded when checking to stop the
scheduler-tick interrupt. This message is not as enlightening as one
might hope, so this commit changes it to "NOHZ tick-stop error: Non-RCU
local softirq work is pending, handler #HH.

Reported-by: Andy Lutomirski <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

---

tick-sched.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index f0199a4..349a25a 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -927,7 +927,7 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)

if (ratelimit < 10 &&
(local_softirq_pending() & SOFTIRQ_STOP_IDLE_MASK)) {
- pr_warn("NOHZ: local_softirq_pending %02x\n",
+ pr_warn("NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #%02x\n",
(unsigned int) local_softirq_pending());
ratelimit++;
}


2020-06-27 21:05:35

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH tick-sched] Clarify "NOHZ: local_softirq_pending" warning

On Fri, Jun 26, 2020 at 2:05 PM Paul E. McKenney <[email protected]> wrote:
>
> Currently, can_stop_idle_tick() prints "NOHZ: local_softirq_pending HH"
> (where "HH" is the hexadecimal softirq vector number) when one or more
> non-RCU softirq handlers are still enablded when checking to stop the
> scheduler-tick interrupt. This message is not as enlightening as one
> might hope, so this commit changes it to "NOHZ tick-stop error: Non-RCU
> local softirq work is pending, handler #HH.

Thank you! It would be even better if it would explain *why* the
problem happened, but I suppose this code doesn't actually know.

--Andy

>
> Reported-by: Andy Lutomirski <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
>
> ---
>
> tick-sched.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index f0199a4..349a25a 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -927,7 +927,7 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
>
> if (ratelimit < 10 &&
> (local_softirq_pending() & SOFTIRQ_STOP_IDLE_MASK)) {
> - pr_warn("NOHZ: local_softirq_pending %02x\n",
> + pr_warn("NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #%02x\n",
> (unsigned int) local_softirq_pending());
> ratelimit++;
> }

2020-06-27 21:48:32

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tick-sched] Clarify "NOHZ: local_softirq_pending" warning

On Sat, Jun 27, 2020 at 02:02:15PM -0700, Andy Lutomirski wrote:
> On Fri, Jun 26, 2020 at 2:05 PM Paul E. McKenney <[email protected]> wrote:
> >
> > Currently, can_stop_idle_tick() prints "NOHZ: local_softirq_pending HH"
> > (where "HH" is the hexadecimal softirq vector number) when one or more
> > non-RCU softirq handlers are still enablded when checking to stop the
> > scheduler-tick interrupt. This message is not as enlightening as one
> > might hope, so this commit changes it to "NOHZ tick-stop error: Non-RCU
> > local softirq work is pending, handler #HH.
>
> Thank you! It would be even better if it would explain *why* the
> problem happened, but I suppose this code doesn't actually know.

Glad to help!

To your point, is it possible to bisect the appearance of this message,
or is it as usual non-reproducible? (Hey, had to ask!)

Thanx, Paul

> --Andy
>
> >
> > Reported-by: Andy Lutomirski <[email protected]>
> > Signed-off-by: Paul E. McKenney <[email protected]>
> >
> > ---
> >
> > tick-sched.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index f0199a4..349a25a 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -927,7 +927,7 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
> >
> > if (ratelimit < 10 &&
> > (local_softirq_pending() & SOFTIRQ_STOP_IDLE_MASK)) {
> > - pr_warn("NOHZ: local_softirq_pending %02x\n",
> > + pr_warn("NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #%02x\n",
> > (unsigned int) local_softirq_pending());
> > ratelimit++;
> > }

2020-06-27 22:16:47

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH tick-sched] Clarify "NOHZ: local_softirq_pending" warning


> On Jun 27, 2020, at 2:46 PM, Paul E. McKenney <[email protected]> wrote:
>
> On Sat, Jun 27, 2020 at 02:02:15PM -0700, Andy Lutomirski wrote:
>>> On Fri, Jun 26, 2020 at 2:05 PM Paul E. McKenney <[email protected]> wrote:
>>>
>>> Currently, can_stop_idle_tick() prints "NOHZ: local_softirq_pending HH"
>>> (where "HH" is the hexadecimal softirq vector number) when one or more
>>> non-RCU softirq handlers are still enablded when checking to stop the
>>> scheduler-tick interrupt. This message is not as enlightening as one
>>> might hope, so this commit changes it to "NOHZ tick-stop error: Non-RCU
>>> local softirq work is pending, handler #HH.
>>
>> Thank you! It would be even better if it would explain *why* the
>> problem happened, but I suppose this code doesn't actually know.
>
> Glad to help!
>
> To your point, is it possible to bisect the appearance of this message,
> or is it as usual non-reproducible? (Hey, had to ask!)
>
>

In this particular case, I tracked it down by good old fashioned sleuthing for bugs, but it’s still unclear to me precisely how NOHZ gets involved. The bug is that we were entering the kernel from usermode, doing nmi_enter(), turning on interrupts, maybe getting a page fault, raising a signal, turning off interrupts, nmi_exit(), and back to usermode, with the signal still queued and undelivered. This is all kinds of bad, but I still don’t understand what softirqs or idle have to do with it.

But I have the bug fixed now!

2020-06-27 23:45:53

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tick-sched] Clarify "NOHZ: local_softirq_pending" warning

On Sat, Jun 27, 2020 at 03:14:14PM -0700, Andy Lutomirski wrote:
>
> > On Jun 27, 2020, at 2:46 PM, Paul E. McKenney <[email protected]> wrote:
> >
> > On Sat, Jun 27, 2020 at 02:02:15PM -0700, Andy Lutomirski wrote:
> >>> On Fri, Jun 26, 2020 at 2:05 PM Paul E. McKenney <[email protected]> wrote:
> >>>
> >>> Currently, can_stop_idle_tick() prints "NOHZ: local_softirq_pending HH"
> >>> (where "HH" is the hexadecimal softirq vector number) when one or more
> >>> non-RCU softirq handlers are still enablded when checking to stop the
> >>> scheduler-tick interrupt. This message is not as enlightening as one
> >>> might hope, so this commit changes it to "NOHZ tick-stop error: Non-RCU
> >>> local softirq work is pending, handler #HH.
> >>
> >> Thank you! It would be even better if it would explain *why* the
> >> problem happened, but I suppose this code doesn't actually know.
> >
> > Glad to help!
> >
> > To your point, is it possible to bisect the appearance of this message,
> > or is it as usual non-reproducible? (Hey, had to ask!)
> >
> >
>
> In this particular case, I tracked it down by good old fashioned sleuthing for bugs, but it’s still unclear to me precisely how NOHZ gets involved. The bug is that we were entering the kernel from usermode, doing nmi_enter(), turning on interrupts, maybe getting a page fault, raising a signal, turning off interrupts, nmi_exit(), and back to usermode, with the signal still queued and undelivered. This is all kinds of bad, but I still don’t understand what softirqs or idle have to do with it.
>
> But I have the bug fixed now!

Glad you found it!

Thanx, Paul

2020-07-06 16:49:47

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH tick-sched] Clarify "NOHZ: local_softirq_pending" warning

On Fri, Jun 26, 2020 at 02:05:06PM -0700, Paul E. McKenney wrote:
> Currently, can_stop_idle_tick() prints "NOHZ: local_softirq_pending HH"
> (where "HH" is the hexadecimal softirq vector number) when one or more
> non-RCU softirq handlers are still enablded when checking to stop the
> scheduler-tick interrupt. This message is not as enlightening as one
> might hope, so this commit changes it to "NOHZ tick-stop error: Non-RCU
> local softirq work is pending, handler #HH.
>
> Reported-by: Andy Lutomirski <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>

CPU hotplug will sometimes trigger this warning on linux-next from NUMA
bare-metal which the commit makes it feel more like errors.

[ 267.734981] smpboot: CPU 42 is now offline
[ 267.991940] smpboot: Booting Node 2 Processor 42 APIC 0x29
[ 267.998370] numa_add_cpu cpu 42 node 2: mask now 8-11,40-43
[ 268.092380] ACPI: \_SB_.SCK0.C015: Found 2 idle states
[ 268.181917] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #282
[ 268.190585] numa_remove_cpu cpu 1 node 0: mask now 0,2-3,32-35
[ 268.194075] smpboot: CPU 1 is now offline
[ 268.415320] x86: Booting SMP configuration:
[ 268.420245] smpboot: Booting Node 0 Processor 1 APIC 0x2

# git clone https://github.com/cailca/linux-mm
# cd linux-mm; make
# ./random 4 (it just soft offline and online each CPU)

The x86.config is also included there. This is only reproducible on x86
so far.

>
> ---
>
> tick-sched.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index f0199a4..349a25a 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -927,7 +927,7 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
>
> if (ratelimit < 10 &&
> (local_softirq_pending() & SOFTIRQ_STOP_IDLE_MASK)) {
> - pr_warn("NOHZ: local_softirq_pending %02x\n",
> + pr_warn("NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #%02x\n",
> (unsigned int) local_softirq_pending());
> ratelimit++;
> }

2020-07-06 17:08:15

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tick-sched] Clarify "NOHZ: local_softirq_pending" warning

On Mon, Jul 06, 2020 at 12:48:16PM -0400, Qian Cai wrote:
> On Fri, Jun 26, 2020 at 02:05:06PM -0700, Paul E. McKenney wrote:
> > Currently, can_stop_idle_tick() prints "NOHZ: local_softirq_pending HH"
> > (where "HH" is the hexadecimal softirq vector number) when one or more
> > non-RCU softirq handlers are still enablded when checking to stop the
> > scheduler-tick interrupt. This message is not as enlightening as one
> > might hope, so this commit changes it to "NOHZ tick-stop error: Non-RCU
> > local softirq work is pending, handler #HH.
> >
> > Reported-by: Andy Lutomirski <[email protected]>
> > Signed-off-by: Paul E. McKenney <[email protected]>
>
> CPU hotplug will sometimes trigger this warning on linux-next from NUMA
> bare-metal which the commit makes it feel more like errors.

One point of the change was to make it clear that this is an error.
So success! ;-)

> [ 267.734981] smpboot: CPU 42 is now offline
> [ 267.991940] smpboot: Booting Node 2 Processor 42 APIC 0x29
> [ 267.998370] numa_add_cpu cpu 42 node 2: mask now 8-11,40-43
> [ 268.092380] ACPI: \_SB_.SCK0.C015: Found 2 idle states
> [ 268.181917] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #282
> [ 268.190585] numa_remove_cpu cpu 1 node 0: mask now 0,2-3,32-35
> [ 268.194075] smpboot: CPU 1 is now offline
> [ 268.415320] x86: Booting SMP configuration:
> [ 268.420245] smpboot: Booting Node 0 Processor 1 APIC 0x2
>
> # git clone https://github.com/cailca/linux-mm
> # cd linux-mm; make
> # ./random 4 (it just soft offline and online each CPU)
>
> The x86.config is also included there. This is only reproducible on x86
> so far.

I must defer to Frederic and Andy on this one. But now that you mention
it, I do see this on some of my runs, which are x86, but not bare metal.
I clearly need to upgrade my console-log parsing.

Thanx, Paul

> > ---
> >
> > tick-sched.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index f0199a4..349a25a 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -927,7 +927,7 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
> >
> > if (ratelimit < 10 &&
> > (local_softirq_pending() & SOFTIRQ_STOP_IDLE_MASK)) {
> > - pr_warn("NOHZ: local_softirq_pending %02x\n",
> > + pr_warn("NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #%02x\n",
> > (unsigned int) local_softirq_pending());
> > ratelimit++;
> > }