Hello,
my kernel reported:
NOHZ: local_softirq_pending 20
I cannot interpret it, but probably this is bad, because before
bc5393a6c9c0e70b4b43fb2fb63e3315e9a15c8f this used to BUG().
This happend while having a high load. Up to now it only happend once
and I cannot reproduce it.
The machine is a cc7ucamry running v2.6.24-rc3-36-g4dd2d66[1].
Config and dmesg is available at
http://www.modarm9.com/~ukleinek/20071123-local_softirq_pending/
Maybe the relevant source file is arch/arm/mach-ns9xxx/time-ns921x.c?
[1] available at
git://http://www.modarm9.com/gitsrc/pub/people/ukleinek/linux-2.6.git
--
Uwe Kleine-K?nig, Software Engineer
Digi International GmbH Branch Breisach, K?ferstrasse 8, 79206 Breisach, Germany
Tax: 315/5781/0242 / VAT: DE153662976 / Reg. Amtsgericht Dortmund HRB 13962
On Fri, 23 Nov 2007, Uwe Kleine-K?nig wrote:
> my kernel reported:
>
> NOHZ: local_softirq_pending 20
Thats TASKLET_SOFTIRQ
> I cannot interpret it, but probably this is bad, because before
> bc5393a6c9c0e70b4b43fb2fb63e3315e9a15c8f this used to BUG().
We removed the BUG, because it's a situation where the kernel can
easily recover. It should never happen that the kernel goes to sleep
with a pending softirq, but it's not a fatal error.
> This happend while having a high load. Up to now it only happend once
> and I cannot reproduce it.
That's hard to tell then. Without a reproducible test case I can not
do much to help debugging this.
tglx
Hello,
Thomas Gleixner wrote:
> On Fri, 23 Nov 2007, Uwe Kleine-K?nig wrote:
> > my kernel reported:
> >
> > NOHZ: local_softirq_pending 20
>
> Thats TASKLET_SOFTIRQ
I suppose you're right, and it's only me that fails to see that. Just
from looking on the code, I'd say TASKLET_SOFTIRQ is 5. Ah, OK, I see,
0x20 == 1 << TASKLET_SOFTIRQ.
> > I cannot interpret it, but probably this is bad, because before
> > bc5393a6c9c0e70b4b43fb2fb63e3315e9a15c8f this used to BUG().
>
> We removed the BUG, because it's a situation where the kernel can
> easily recover. It should never happen that the kernel goes to sleep
> with a pending softirq, but it's not a fatal error.
>
> > This happend while having a high load. Up to now it only happend once
> > and I cannot reproduce it.
>
> That's hard to tell then. Without a reproducible test case I can not
> do much to help debugging this.
Is there something I can do to be able to report more if it reoccurs?
Can you isolate the problem? Has it to do with the arch-specific timing
code? With the hardware?
Best regards and thanks
Uwe
--
Uwe Kleine-K?nig, Software Engineer
Digi International GmbH Branch Breisach, K?ferstrasse 8, 79206 Breisach, Germany
Tax: 315/5781/0242 / VAT: DE153662976 / Reg. Amtsgericht Dortmund HRB 13962
Hello,
Thomas Gleixner wrote:
> On Fri, 23 Nov 2007, Uwe Kleine-K?nig wrote:
> > my kernel reported:
> >
> > NOHZ: local_softirq_pending 20
>
> Thats TASKLET_SOFTIRQ
>
> > I cannot interpret it, but probably this is bad, because before
> > bc5393a6c9c0e70b4b43fb2fb63e3315e9a15c8f this used to BUG().
>
> We removed the BUG, because it's a situation where the kernel can
> easily recover. It should never happen that the kernel goes to sleep
> with a pending softirq, but it's not a fatal error.
>
> > This happend while having a high load. Up to now it only happend once
> > and I cannot reproduce it.
>
> That's hard to tell then. Without a reproducible test case I can not
> do much to help debugging this.
Back then I added some debug code to tick_nohz_stop_sched_tick to get
some more information when this happens again. As this happened just
now I saw:
- tick_nohz_stop_sched_tick was called from irq_exit
Actually this didn't surprise me, because
tick_nohz_stop_sched_tick is only called at two places, namely
irq_exit and cpu_idle. And I cannot see how
local_softirq_pending() != 0 can happen in the latter (without
first happening in irq_exit maybe).
- it happened three times in a row at the following times:
[ 1593.470000] NOHZ: (c003a3ac) local_softirq_pending 20
[ 1593.470000] Tasklet state=1, func=c0046248, data=0
[ 1593.920000] NOHZ: (c003a3ac) local_softirq_pending 20
[ 1593.920000] Tasklet state=1, func=c0046248, data=0
[ 1594.980000] NOHZ: (c003a3ac) local_softirq_pending 20
[ 1594.980000] Tasklet state=1, func=c0046248, data=0
(c003a3ac = irq_exit+0x24/0x94)
- There was a single tasklet in __get_cpu_var(tasklet_vec).list:
state = 1
func = rcu_process_callbacks (= c0046248)
data = 0
- directly afterwards the oom-killer started killing tasks
I think the only user of rcu in my kernel is networking code. Does this
help anyone to further debug my problem here?
Best regards
Uwe
--
Uwe Kleine-K?nig, Software Engineer
Digi International GmbH Branch Breisach, K?ferstrasse 8, 79206 Breisach, Germany
Tax: 315/5781/0242 / VAT: DE153662976 / Reg. Amtsgericht Dortmund HRB 13962