2007-02-20 18:54:15

by Michal Piotrowski

[permalink] [raw]
Subject: 2.6.20-git15 BUG: soft lockup detected on CPU#0! - timers?

Hi,

I have a problem with timers.

NOHZ: local_softirq_pending 20
NOHZ: local_softirq_pending 20
BUG: soft lockup detected on CPU#0!
[<c01050f9>] show_trace_log_lvl+0x1a/0x2f
[<c01057e0>] show_trace+0x12/0x14
[<c0105892>] dump_stack+0x16/0x18
[<c0151214>] softlockup_tick+0xa7/0xb9
[<c012a18f>] run_local_timers+0x12/0x14
[<c012a54b>] update_process_times+0x3e/0x63
[<c0139e34>] tick_sched_timer+0x66/0xaa
[<c01366d2>] hrtimer_interrupt+0x113/0x19d
[<c0114c45>] smp_apic_timer_interrupt+0x77/0x89
[<c0104beb>] apic_timer_interrupt+0x33/0x38
[<c026b945>] scsi_dispatch_cmd+0x1a8/0x1e5
[<c026ff86>] scsi_request_fn+0x224/0x2ec
[<c01e1779>] __generic_unplug_device+0x20/0x23
[<c01e1b49>] generic_unplug_device+0x18/0x26
[<c01e2c48>] blk_backing_dev_unplug+0x76/0x7e
[<c018e289>] sync_buffer+0x2e/0x37
[<c0311041>] __wait_on_bit+0x36/0x5d
[<c03110c3>] out_of_line_wait_on_bit+0x5b/0x63
[<c018e1f9>] __wait_on_buffer+0x27/0x2b
[<c018ec74>] sync_dirty_buffer+0xa0/0xd6
[<c01bc5c1>] journal_commit_transaction+0xca4/0x11d3
[<c01bff86>] kjournald+0xab/0x1e8
[<c0133639>] kthread+0xb5/0xe0
[<c0104d8f>] kernel_thread_helper+0x7/0x10
=======================
Clocksource tsc unstable (delta = 18743203260 ns)
Time: acpi_pm clocksource has been installed.

http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.20-git15/git-dmesg
http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.20-git15/git-config

Might it be 6ba9b346e1e0eca65ec589d32de3a9fe32dc5de6 commit?

Regards,
Michal

--
Michal K. K. Piotrowski
LTG - Linux Testers Group (PL)
(http://www.stardust.webpages.pl/ltg/)
LTG - Linux Testers Group (EN)
(http://www.stardust.webpages.pl/linux_testers_group_en/)


2007-02-20 21:51:25

by Thomas Gleixner

[permalink] [raw]
Subject: Re: 2.6.20-git15 BUG: soft lockup detected on CPU#0! - timers?

On Tue, 2007-02-20 at 19:54 +0100, Michal Piotrowski wrote:
>
> Might it be 6ba9b346e1e0eca65ec589d32de3a9fe32dc5de6 commit?

I doubt that it is, but can you revert it ?

tglx


2007-02-20 22:37:39

by Michal Piotrowski

[permalink] [raw]
Subject: Re: 2.6.20-git15 BUG: soft lockup detected on CPU#0! - timers?

On 20/02/07, Thomas Gleixner <[email protected]> wrote:
> On Tue, 2007-02-20 at 19:54 +0100, Michal Piotrowski wrote:
> >
> > Might it be 6ba9b346e1e0eca65ec589d32de3a9fe32dc5de6 commit?
>
> I doubt that it is, but can you revert it ?

I'm using the latest kernel without this patch since 3 hours.

So far so good.

Regards,
Michal

--
Michal K. K. Piotrowski
LTG - Linux Testers Group (PL)
(http://www.stardust.webpages.pl/ltg/)
LTG - Linux Testers Group (EN)
(http://www.stardust.webpages.pl/linux_testers_group_en/)

2007-02-21 15:28:39

by Thomas Gleixner

[permalink] [raw]
Subject: Re: 2.6.20-git15 BUG: soft lockup detected on CPU#0! - timers?

On Tue, 2007-02-20 at 23:37 +0100, Michal Piotrowski wrote:
> On 20/02/07, Thomas Gleixner <[email protected]> wrote:
> > On Tue, 2007-02-20 at 19:54 +0100, Michal Piotrowski wrote:
> > >
> > > Might it be 6ba9b346e1e0eca65ec589d32de3a9fe32dc5de6 commit?
> >
> > I doubt that it is, but can you revert it ?
>
> I'm using the latest kernel without this patch since 3 hours.
>
> So far so good.

But you still have those softirq pending messages, right ? I think those
are pointing to the root cause of this. Still no idea how to get hold of
them. All my systems refuse to produce that. Hrmpf.

tglx


2007-02-21 15:38:57

by Michal Piotrowski

[permalink] [raw]
Subject: Re: 2.6.20-git15 BUG: soft lockup detected on CPU#0! - timers?

On 21/02/07, Thomas Gleixner <[email protected]> wrote:
> On Tue, 2007-02-20 at 23:37 +0100, Michal Piotrowski wrote:
> > On 20/02/07, Thomas Gleixner <[email protected]> wrote:
> > > On Tue, 2007-02-20 at 19:54 +0100, Michal Piotrowski wrote:
> > > >
> > > > Might it be 6ba9b346e1e0eca65ec589d32de3a9fe32dc5de6 commit?
> > >
> > > I doubt that it is, but can you revert it ?
> >
> > I'm using the latest kernel without this patch since 3 hours.
> >
> > So far so good.
>
> But you still have those softirq pending messages, right ?

Yes

(+ new NOHZ: local_softirq_pending 02)
http://www.ussg.iu.edu/hypermail/linux/kernel/0702.2/1944.html

> I think those
> are pointing to the root cause of this. Still no idea how to get hold of
> them. All my systems refuse to produce that. Hrmpf.
>
> tglx
>

Regards,
Michal

--
Michal K. K. Piotrowski
LTG - Linux Testers Group (PL)
(http://www.stardust.webpages.pl/ltg/)
LTG - Linux Testers Group (EN)
(http://www.stardust.webpages.pl/linux_testers_group_en/)

2007-02-21 19:55:40

by Thomas Gleixner

[permalink] [raw]
Subject: Re: 2.6.20-git15 BUG: soft lockup detected on CPU#0! - timers?

Michal,

On Wed, 2007-02-21 at 16:38 +0100, Michal Piotrowski wrote:
> > But you still have those softirq pending messages, right ?
>
> Yes
>
> (+ new NOHZ: local_softirq_pending 02)

Yike, that's the timer softirq.

Can you add the patch below, maybe it gives us some useful info. Please
enable lockdep (your last config had it already)

Thanks,

tglx

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 512a4a9..cc705c7 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -165,9 +165,11 @@ void tick_nohz_stop_sched_tick(void)
goto end;

cpu = smp_processor_id();
- if (unlikely(local_softirq_pending()))
- printk(KERN_ERR "NOHZ: local_softirq_pending %02x\n",
- local_softirq_pending());
+ if (unlikely(local_softirq_pending())) {
+ print_irqtrace_events(current);
+ printk(KERN_ERR "NOHZ: local_softirq_pending %02x, %08x\n",
+ local_softirq_pending(), preempt_count());
+ }

now = ktime_get();
/*


2007-02-22 01:48:09

by Michal Piotrowski

[permalink] [raw]
Subject: Re: 2.6.20-git15 BUG: soft lockup detected on CPU#0! - timers?

Thomas Gleixner napisa?(a):
> Michal,
>
> On Wed, 2007-02-21 at 16:38 +0100, Michal Piotrowski wrote:
>>> But you still have those softirq pending messages, right ?
>> Yes
>>
>> (+ new NOHZ: local_softirq_pending 02)
>
> Yike, that's the timer softirq.
>
> Can you add the patch below, maybe it gives us some useful info. Please
> enable lockdep (your last config had it already)
>

I hope this helps.

irq event stamp: 103630856
hardirqs last enabled at (103630855): [<c031356a>] _spin_unlock_irq+0x22/0x43
hardirqs last disabled at (103630856): [<c013a00e>] tick_nohz_stop_sched_tick+0x13/0x1fd
softirqs last enabled at (103630824): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (103630819): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 20, 00000001

(gdb) l *0xc031356a
0xc031356a is in _spin_unlock_irq (include2/asm/irqflags.h:48).
43 __asm__ __volatile__("cli" : : : "memory");
44 }
45
46 static inline void raw_local_irq_enable(void)
47 {
48 __asm__ __volatile__("sti" : : : "memory");
49 }
50
51 /*
52 * Used in the idle loop; sti takes one instruction cycle
(gdb) l *0xc013a00e
0xc013a00e is in tick_nohz_stop_sched_tick (/mnt/md0/devel/linux-git/kernel/time/tick-sched.c:158).
153 ktime_t last_update, expires, now, delta;
154 int cpu;
155
156 local_irq_save(flags);
157
158 cpu = smp_processor_id();
159 ts = &per_cpu(tick_cpu_sched, cpu);
160
161 if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
162 goto end;
(gdb) l *0xc01265df
0xc01265df is in __do_softirq (/mnt/md0/devel/linux-git/kernel/softirq.c:251).
246
247 trace_softirq_exit();
248
249 account_system_vtime(current);
250 _local_bh_enable();
251 }
252
253 #ifndef __ARCH_HAS_DO_SOFTIRQ
254
255 asmlinkage void do_softirq(void)
(gdb) l *0xc0106a75
0xc0106a75 is in do_softirq (/mnt/md0/devel/linux-git/arch/i386/kernel/irq.c:222).
217 irqctx->tinfo.previous_esp = current_stack_pointer;
218
219 /* build the stack frame on the softirq stack */
220 isp = (u32*) ((char*)irqctx + sizeof(*irqctx));
221
222 asm volatile(
223 " xchgl %%ebx,%%esp \n"
224 " call __do_softirq \n"
225 " movl %%ebx,%%esp \n"
226 : "=b"(isp)

http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-rc1/git-config
http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-rc1/git-dmesg

Regards,
Michal

--
Michal K. K. Piotrowski
LTG - Linux Testers Group (PL)
(http://www.stardust.webpages.pl/ltg/)
LTG - Linux Testers Group (EN)
(http://www.stardust.webpages.pl/linux_testers_group_en/)

2007-02-22 10:50:50

by Michal Piotrowski

[permalink] [raw]
Subject: Re: 2.6.20-git15 BUG: soft lockup detected on CPU#0! - timers?

Michal Piotrowski napisa?(a):
> Thomas Gleixner napisa?(a):
>> Michal,
>>
>> On Wed, 2007-02-21 at 16:38 +0100, Michal Piotrowski wrote:
>>>> But you still have those softirq pending messages, right ?
>>> Yes
>>>
>>> (+ new NOHZ: local_softirq_pending 02)
>> Yike, that's the timer softirq.
>>
>> Can you add the patch below, maybe it gives us some useful info. Please
>> enable lockdep (your last config had it already)
>>
>
> I hope this helps.
>
> irq event stamp: 103630856
> hardirqs last enabled at (103630855): [<c031356a>] _spin_unlock_irq+0x22/0x43
> hardirqs last disabled at (103630856): [<c013a00e>] tick_nohz_stop_sched_tick+0x13/0x1fd
> softirqs last enabled at (103630824): [<c01265df>] __do_softirq+0xe4/0xea
> softirqs last disabled at (103630819): [<c0106a75>] do_softirq+0x64/0xd1
> NOHZ: local_softirq_pending 20, 00000001
>
> (gdb) l *0xc031356a
> 0xc031356a is in _spin_unlock_irq (include2/asm/irqflags.h:48).
> 43 __asm__ __volatile__("cli" : : : "memory");
> 44 }
> 45
> 46 static inline void raw_local_irq_enable(void)
> 47 {
> 48 __asm__ __volatile__("sti" : : : "memory");
> 49 }
> 50
> 51 /*
> 52 * Used in the idle loop; sti takes one instruction cycle
> (gdb) l *0xc013a00e
> 0xc013a00e is in tick_nohz_stop_sched_tick (/mnt/md0/devel/linux-git/kernel/time/tick-sched.c:158).
> 153 ktime_t last_update, expires, now, delta;
> 154 int cpu;
> 155
> 156 local_irq_save(flags);
> 157
> 158 cpu = smp_processor_id();
> 159 ts = &per_cpu(tick_cpu_sched, cpu);
> 160
> 161 if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
> 162 goto end;
> (gdb) l *0xc01265df
> 0xc01265df is in __do_softirq (/mnt/md0/devel/linux-git/kernel/softirq.c:251).
> 246
> 247 trace_softirq_exit();
> 248
> 249 account_system_vtime(current);
> 250 _local_bh_enable();
> 251 }
> 252
> 253 #ifndef __ARCH_HAS_DO_SOFTIRQ
> 254
> 255 asmlinkage void do_softirq(void)
> (gdb) l *0xc0106a75
> 0xc0106a75 is in do_softirq (/mnt/md0/devel/linux-git/arch/i386/kernel/irq.c:222).
> 217 irqctx->tinfo.previous_esp = current_stack_pointer;
> 218
> 219 /* build the stack frame on the softirq stack */
> 220 isp = (u32*) ((char*)irqctx + sizeof(*irqctx));
> 221
> 222 asm volatile(
> 223 " xchgl %%ebx,%%esp \n"
> 224 " call __do_softirq \n"
> 225 " movl %%ebx,%%esp \n"
> 226 : "=b"(isp)
>

Here is more

BUG: at /mnt/md0/devel/linux-git/kernel/lockdep.c:2427 check_flags()
[<c01050f9>] show_trace_log_lvl+0x1a/0x2f
[<c01057e0>] show_trace+0x12/0x14
[<c0105892>] dump_stack+0x16/0x18
[<c013abf7>] check_flags+0x95/0x143
[<c013d978>] lock_acquire+0x29/0x82
[<c0136974>] down_write+0x3a/0x54
[<c01630f8>] sys_munmap+0x23/0x3f
[<c0104120>] syscall_call+0x7/0xb
=======================
irq event stamp: 30788
hardirqs last enabled at (30787): [<c0104249>] syscall_exit_work+0x11/0x26
hardirqs last disabled at (30788): [<c0103fc9>] ret_from_exception+0x9/0xc
softirqs last enabled at (30202): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (30193): [<c0106a75>] do_softirq+0x64/0xd1
irq event stamp: 214661966
hardirqs last enabled at (214661965): [<c0104168>] restore_nocheck+0x12/0x15
hardirqs last disabled at (214661966): [<c0313174>] _spin_lock_irq+0xf/0x48
softirqs last enabled at (214661782): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (214661755): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 24, 00000001
irq event stamp: 137191105
hardirqs last enabled at (137191105): [<c0102b80>] default_idle+0x3e/0x59
hardirqs last disabled at (137191104): [<c0102451>] cpu_idle+0xba/0xfb
softirqs last enabled at (137190592): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (137190587): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 20, 00000001
irq event stamp: 214661966
hardirqs last enabled at (214661965): [<c0104168>] restore_nocheck+0x12/0x15
hardirqs last disabled at (214661966): [<c0313174>] _spin_lock_irq+0xf/0x48
softirqs last enabled at (214661782): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (214661755): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 24, 00000001
irq event stamp: 137191105
hardirqs last enabled at (137191105): [<c0102b80>] default_idle+0x3e/0x59
hardirqs last disabled at (137191104): [<c0102451>] cpu_idle+0xba/0xfb
softirqs last enabled at (137190592): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (137190587): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 02, 00000001
irq event stamp: 214661966
hardirqs last enabled at (214661965): [<c0104168>] restore_nocheck+0x12/0x15
hardirqs last disabled at (214661966): [<c0313174>] _spin_lock_irq+0xf/0x48
softirqs last enabled at (214661782): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (214661755): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 06, 00000001
irq event stamp: 137191105
hardirqs last enabled at (137191105): [<c0102b80>] default_idle+0x3e/0x59
hardirqs last disabled at (137191104): [<c0102451>] cpu_idle+0xba/0xfb
softirqs last enabled at (137190592): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (137190587): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 20, 00000001
irq event stamp: 137191105
hardirqs last enabled at (137191105): [<c0102b80>] default_idle+0x3e/0x59
hardirqs last disabled at (137191104): [<c0102451>] cpu_idle+0xba/0xfb
softirqs last enabled at (137190592): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (137190587): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 20, 00000001
irq event stamp: 214661966
hardirqs last enabled at (214661965): [<c0104168>] restore_nocheck+0x12/0x15
hardirqs last disabled at (214661966): [<c0313174>] _spin_lock_irq+0xf/0x48
softirqs last enabled at (214661782): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (214661755): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 24, 00000001
irq event stamp: 137191105
hardirqs last enabled at (137191105): [<c0102b80>] default_idle+0x3e/0x59
hardirqs last disabled at (137191104): [<c0102451>] cpu_idle+0xba/0xfb
softirqs last enabled at (137190592): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (137190587): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 02, 00000001
irq event stamp: 137191105
hardirqs last enabled at (137191105): [<c0102b80>] default_idle+0x3e/0x59
hardirqs last disabled at (137191104): [<c0102451>] cpu_idle+0xba/0xfb
softirqs last enabled at (137190592): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (137190587): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 20, 00000001
irq event stamp: 214661966
hardirqs last enabled at (214661965): [<c0104168>] restore_nocheck+0x12/0x15
hardirqs last disabled at (214661966): [<c0313174>] _spin_lock_irq+0xf/0x48
softirqs last enabled at (214661782): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (214661755): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 06, 00000001
irq event stamp: 137191105
hardirqs last enabled at (137191105): [<c0102b80>] default_idle+0x3e/0x59
hardirqs last disabled at (137191104): [<c0102451>] cpu_idle+0xba/0xfb
softirqs last enabled at (137190592): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (137190587): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 02, 00000001
irq event stamp: 137191105
hardirqs last enabled at (137191105): [<c0102b80>] default_idle+0x3e/0x59
hardirqs last disabled at (137191104): [<c0102451>] cpu_idle+0xba/0xfb
softirqs last enabled at (137190592): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (137190587): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 20, 00000001
irq event stamp: 137191105
hardirqs last enabled at (137191105): [<c0102b80>] default_idle+0x3e/0x59
hardirqs last disabled at (137191104): [<c0102451>] cpu_idle+0xba/0xfb
softirqs last enabled at (137190592): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (137190587): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 20, 00000001
irq event stamp: 214661966
hardirqs last enabled at (214661965): [<c0104168>] restore_nocheck+0x12/0x15
hardirqs last disabled at (214661966): [<c0313174>] _spin_lock_irq+0xf/0x48
softirqs last enabled at (214661782): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (214661755): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 06, 00000001
irq event stamp: 137191105
hardirqs last enabled at (137191105): [<c0102b80>] default_idle+0x3e/0x59
hardirqs last disabled at (137191104): [<c0102451>] cpu_idle+0xba/0xfb
softirqs last enabled at (137190592): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (137190587): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 20, 00000001
irq event stamp: 214661966
hardirqs last enabled at (214661965): [<c0104168>] restore_nocheck+0x12/0x15
hardirqs last disabled at (214661966): [<c0313174>] _spin_lock_irq+0xf/0x48
softirqs last enabled at (214661782): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (214661755): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 06, 00000001
irq event stamp: 214661966
hardirqs last enabled at (214661965): [<c0104168>] restore_nocheck+0x12/0x15
hardirqs last disabled at (214661966): [<c0313174>] _spin_lock_irq+0xf/0x48
softirqs last enabled at (214661782): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (214661755): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 06, 00000001
irq event stamp: 137191105
hardirqs last enabled at (137191105): [<c0102b80>] default_idle+0x3e/0x59
hardirqs last disabled at (137191104): [<c0102451>] cpu_idle+0xba/0xfb
softirqs last enabled at (137190592): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (137190587): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 20, 00000001
irq event stamp: 137191105
hardirqs last enabled at (137191105): [<c0102b80>] default_idle+0x3e/0x59
hardirqs last disabled at (137191104): [<c0102451>] cpu_idle+0xba/0xfb
softirqs last enabled at (137190592): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (137190587): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 20, 00000001
irq event stamp: 214661966
hardirqs last enabled at (214661965): [<c0104168>] restore_nocheck+0x12/0x15
hardirqs last disabled at (214661966): [<c0313174>] _spin_lock_irq+0xf/0x48
softirqs last enabled at (214661782): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (214661755): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 24, 00000001
irq event stamp: 214661966
hardirqs last enabled at (214661965): [<c0104168>] restore_nocheck+0x12/0x15
hardirqs last disabled at (214661966): [<c0313174>] _spin_lock_irq+0xf/0x48
softirqs last enabled at (214661782): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (214661755): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 24, 00000001
irq event stamp: 137191105
hardirqs last enabled at (137191105): [<c0102b80>] default_idle+0x3e/0x59
hardirqs last disabled at (137191104): [<c0102451>] cpu_idle+0xba/0xfb
softirqs last enabled at (137190592): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (137190587): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 20, 00000001
irq event stamp: 214661966
hardirqs last enabled at (214661965): [<c0104168>] restore_nocheck+0x12/0x15
hardirqs last disabled at (214661966): [<c0313174>] _spin_lock_irq+0xf/0x48
softirqs last enabled at (214661782): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (214661755): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 24, 00000001
irq event stamp: 214661966
hardirqs last enabled at (214661965): [<c0104168>] restore_nocheck+0x12/0x15
hardirqs last disabled at (214661966): [<c0313174>] _spin_lock_irq+0xf/0x48
softirqs last enabled at (214661782): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (214661755): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 06, 00000001


http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-rc1/git-config
http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-rc1/git-dmesg2

[<c01265df>] __do_softirq+0xe4/0xea
[<c0106a75>] do_softirq+0x64/0xd1

this repeats every time

Regards,
Michal

--
Michal K. K. Piotrowski
LTG - Linux Testers Group (PL)
(http://www.stardust.webpages.pl/ltg/)
LTG - Linux Testers Group (EN)
(http://www.stardust.webpages.pl/linux_testers_group_en/)

2007-02-23 06:13:21

by Ingo Molnar

[permalink] [raw]
Subject: Re: 2.6.20-git15 BUG: soft lockup detected on CPU#0! - timers?


Michal,

* Michal Piotrowski <[email protected]> wrote:

> Here is more
>
> hardirqs last enabled at (30787): [<c0104249>] syscall_exit_work+0x11/0x26
> hardirqs last disabled at (30788): [<c0103fc9>] ret_from_exception+0x9/0xc
> softirqs last enabled at (30202): [<c01265df>] __do_softirq+0xe4/0xea
> softirqs last disabled at (30193): [<c0106a75>] do_softirq+0x64/0xd1

could you please try the patch below? This is pretty much the only
condition under which we can silently 'leak' pending softirqs, and
trigger the new warning: if something does cond_resched_softirq() in
non-runnable state. (which is a no-no, but nothing enforced this, so it
could in theory happen.) So the question is, with this patch applied, do
you get these new warnings from sched.c?

Ingo

-------------------------->
Subject: [patch] add warning to cond_resched_softirq()
From: Ingo Molnar <[email protected]>

make sure that cond_resched_softirq() is always called with a runnable
task - so that we do not leave softirq work pending indefinitely.

Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched.c | 1 +
1 file changed, 1 insertion(+)

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -4689,6 +4689,7 @@ int __sched cond_resched_softirq(void)
BUG_ON(!in_softirq());

if (need_resched() && system_state == SYSTEM_RUNNING) {
+ WARN_ON(current->state != TASK_RUNNING);
raw_local_irq_disable();
_local_bh_enable();
raw_local_irq_enable();

2007-02-23 07:32:12

by Mike Galbraith

[permalink] [raw]
Subject: Re: 2.6.20-git15 BUG: soft lockup detected on CPU#0! - timers?

On Fri, 2007-02-23 at 07:08 +0100, Ingo Molnar wrote:

> Index: linux/kernel/sched.c
> ===================================================================
> --- linux.orig/kernel/sched.c
> +++ linux/kernel/sched.c
> @@ -4689,6 +4689,7 @@ int __sched cond_resched_softirq(void)
> BUG_ON(!in_softirq());
>
> if (need_resched() && system_state == SYSTEM_RUNNING) {
> + WARN_ON(current->state != TASK_RUNNING);
> raw_local_irq_disable();
> _local_bh_enable();
> raw_local_irq_enable();

I'm getting NOHZ: local_softirq_pending 02 or 10 every few seconds if
the box is doing anything, and the WARN_ON() has yet to trigger.

-Mike

2007-02-23 10:15:25

by Ingo Molnar

[permalink] [raw]
Subject: Re: 2.6.20-git15 BUG: soft lockup detected on CPU#0! - timers?


* Ingo Molnar <[email protected]> wrote:

> could you please try the patch below? This is pretty much the only
> condition under which we can silently 'leak' pending softirqs, and
> trigger the new warning: if something does cond_resched_softirq() in
> non-runnable state. (which is a no-no, but nothing enforced this, so
> it could in theory happen.) So the question is, with this patch
> applied, do you get these new warnings from sched.c?

it just triggered on one of my boxes:

BUG: at kernel/sched.c:4692 cond_resched_softirq()
[<c03ce128>] cond_resched_softirq+0x5f/0x7b
[<c0369078>] release_sock+0x42/0x81
[<c03693bc>] sk_wait_data+0x57/0x9d
[<c0129a00>] autoremove_wake_function+0x0/0x33
[<c03942ff>] tcp_recvmsg+0x39c/0x973
[<c0368e39>] sock_common_recvmsg+0x3e/0x54
[<c0367903>] sock_aio_read+0x106/0x112
[<c0159b0c>] do_sync_read+0xc8/0x105
[<c0129a00>] autoremove_wake_function+0x0/0x33
[<c0159e82>] vfs_read+0xc1/0x15a
[<c015a7d2>] sys_read+0x41/0x67
[<c0103c10>] syscall_call+0x7/0xb
=======================

so tcp_recvmsg() definitely gets into this condition.

Ingo

2007-02-24 22:45:56

by Michal Piotrowski

[permalink] [raw]
Subject: Re: 2.6.20-git15 BUG: soft lockup detected on CPU#0! - timers?

Hi Ingo,

On 23/02/07, Ingo Molnar <[email protected]> wrote:
>
> Michal,
>
> * Michal Piotrowski <[email protected]> wrote:
>
> > Here is more
> >
> > hardirqs last enabled at (30787): [<c0104249>] syscall_exit_work+0x11/0x26
> > hardirqs last disabled at (30788): [<c0103fc9>] ret_from_exception+0x9/0xc
> > softirqs last enabled at (30202): [<c01265df>] __do_softirq+0xe4/0xea
> > softirqs last disabled at (30193): [<c0106a75>] do_softirq+0x64/0xd1
>
> could you please try the patch below? This is pretty much the only
> condition under which we can silently 'leak' pending softirqs, and
> trigger the new warning: if something does cond_resched_softirq() in
> non-runnable state. (which is a no-no, but nothing enforced this, so it
> could in theory happen.) So the question is, with this patch applied, do
> you get these new warnings from sched.c?

Here is a dmesg after over 25 hours of uptime
http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-rc1/git-dmesg3
nothing new.

Regards,
Michal

--
Michal K. K. Piotrowski
LTG - Linux Testers Group (PL)
(http://www.stardust.webpages.pl/ltg/)
LTG - Linux Testers Group (EN)
(http://www.stardust.webpages.pl/linux_testers_group_en/)

2007-02-25 09:48:10

by Thomas Gleixner

[permalink] [raw]
Subject: Re: 2.6.20-git15 BUG: soft lockup detected on CPU#0! - timers?

Michal,

On Sat, 2007-02-24 at 23:45 +0100, Michal Piotrowski wrote:
> Here is a dmesg after over 25 hours of uptime
> http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-rc1/git-dmesg3
> nothing new.

can you please turn off CONFIG_SCHED_SMT ? It seems all reports have one
thing in common:

CPU == P4 and CONFIG_SCHED_SMT turned on.

tglx


2007-02-26 13:01:36

by Michal Piotrowski

[permalink] [raw]
Subject: Re: 2.6.20-git15 BUG: soft lockup detected on CPU#0! - timers?

On 25/02/07, Thomas Gleixner <[email protected]> wrote:
> Michal,
>
> On Sat, 2007-02-24 at 23:45 +0100, Michal Piotrowski wrote:
> > Here is a dmesg after over 25 hours of uptime
> > http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-rc1/git-dmesg3
> > nothing new.
>
> can you please turn off CONFIG_SCHED_SMT ? It seems all reports have one
> thing in common:
>
> CPU == P4 and CONFIG_SCHED_SMT turned on.
>
> tglx
>
>
>

Ok, I think that this is a SMT scheduler problem.

System works fine for fifteen hours.

Regards,
Michal

--
Michal K. K. Piotrowski
LTG - Linux Testers Group (PL)
(http://www.stardust.webpages.pl/ltg/)
LTG - Linux Testers Group (EN)
(http://www.stardust.webpages.pl/linux_testers_group_en/)

2007-02-26 13:31:48

by Ingo Molnar

[permalink] [raw]
Subject: [patch] sched: fix SMT scheduler bug


* Michal Piotrowski <[email protected]> wrote:

> Ok, I think that this is a SMT scheduler problem.
>
> System works fine for fifteen hours.

hmmm. I think it's this piece of smt-nice code in sched.c:

if (dependent_sleeper(cpu, rq, next))
next = rq->idle;

this will skip to run this HT CPU if the sibling CPU is running a task
that is more important. And dependent_sleeper() doesnt skip kernel
threads such as ksoftirqd:

/* kernel/rt threads do not participate in dependent sleeping */
if (!p->mm || rt_task(p))
return 0;

but ... what if the highest-prio thread is a user-space task /and/ there
is ksoftirqd also queued, which is now delayed? We schedule the idle
task instead of processing softirqs. Ouch!

To test this theory, could you turn the SMT scheduler back on but also
apply the patch below. Does it still work fine?

Ingo

---------------------->
Subject: [patch] sched: fix SMT scheduler bug
From: Ingo Molnar <[email protected]>

the SMT scheduler incorrectly skips kernel threads even if they
are runnable (but they are preempted by a higher-prio user-space
task which got SMT-delayed by an even higher-priority task
running on a sibling CPU).

fix this for now by only doing the SMT-nice optimization if
the to-be-delayed task is the only runnable task. (This should
cover most of the real-life cases anyway.)

this bug has been in the SMT scheduler since 2.6.17 or so, but has
only been noticed now by the active check in the dynticks code.

Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -3537,7 +3537,7 @@ need_resched_nonpreemptible:
}
}
next->sleep_type = SLEEP_NORMAL;
- if (dependent_sleeper(cpu, rq, next))
+ if (rq->nr_running == 1 && dependent_sleeper(cpu, rq, next))
next = rq->idle;
switch_tasks:
if (next == rq->idle)