2007-02-07 23:21:41

by Lukasz Trabinski

[permalink] [raw]
Subject: 2.6.20 BUG: soft lockup detected on CPU#0!

Hello

On 2.6.19 I had about 60 days uptime, on 2.6.20 2 days :(



oceanic:~$ uname -a
Linux oceanic.wsisiz.edu.pl 2.6.20-oceanic #2 SMP Sun Feb 4 21:55:29 CET
2007 x86_64 x86_64 x86_64 GNU/Linux

Feb 7 22:46:00 oceanic kernel: BUG: soft lockup detected on CPU#0!
Feb 7 22:46:00 oceanic kernel:
Feb 7 22:46:00 oceanic kernel: Call Trace:
Feb 7 22:46:00 oceanic kernel: <IRQ> [<ffffffff80250550>] softlockup_tick+0xdb/0xed
Feb 7 22:46:00 oceanic kernel: [<ffffffff8022ea87>] __do_softirq+0x55/0xc4
Feb 7 22:46:00 oceanic kernel: [<ffffffff8023256c>] update_process_times+0x42/0x68
Feb 7 22:46:00 oceanic kernel: [<ffffffff80213e36>] smp_local_timer_interrupt+0x34/0x55
Feb 7 22:46:00 oceanic kernel: [<ffffffff80214316>] smp_apic_timer_interrupt+0x51/0x68
Feb 7 22:46:00 oceanic kernel: [<ffffffff8020a3a6>] apic_timer_interrupt+0x66/0x70
Feb 7 22:46:00 oceanic kernel: <EOI> [<ffffffff804393c3>] _spin_unlock_irqrestore+0x8/0x9
Feb 7 22:46:00 oceanic kernel: [<ffffffff804393c3>] _spin_unlock_irqrestore+0x8/0x9
Feb 7 22:46:00 oceanic kernel: [<ffffffff8023dddd>] hrtimer_try_to_cancel+0x4a/0x53
Feb 7 22:46:00 oceanic kernel: [<ffffffff8023ddf2>] hrtimer_cancel+0xc/0x16
Feb 7 22:46:00 oceanic kernel: [<ffffffff8023ddf2>] hrtimer_cancel+0xc/0x16
Feb 7 22:46:00 oceanic kernel: [<ffffffff8022cc66>] do_exit+0x1c8/0x800
Feb 7 22:46:00 oceanic kernel: [<ffffffff8022d31d>] sys_exit_group+0x0/0xe
Feb 7 22:46:00 oceanic kernel: [<ffffffff8020995c>] tracesys+0xdc/0xe1
Feb 7 22:46:00 oceanic kernel:
Feb 7 22:46:00 oceanic kernel: BUG: soft lockup detected on CPU#1!
Feb 7 22:46:00 oceanic kernel:
Feb 7 22:46:00 oceanic kernel: Call Trace:
Feb 7 22:46:00 oceanic kernel: <IRQ> [<ffffffff80250550>] softlockup_tick+0xdb/0xed
Feb 7 22:46:00 oceanic kernel: [<ffffffff8023256c>] update_process_times+0x42/0x68
Feb 7 22:46:00 oceanic kernel: [<ffffffff80213e36>] smp_local_timer_interrupt+0x34/0x55
Feb 7 22:46:00 oceanic kernel: [<ffffffff80214316>] smp_apic_timer_interrupt+0x51/0x68
Feb 7 22:46:00 oceanic kernel: [<ffffffff802338c8>] __group_send_sig_info+0x35/0x8b
Feb 7 22:46:00 oceanic kernel: [<ffffffff8022e0f1>] it_real_fn+0x0/0x4f
Feb 7 22:46:00 oceanic kernel: [<ffffffff8020a3a6>] apic_timer_interrupt+0x66/0x70
Feb 7 22:46:00 oceanic kernel: [<ffffffff802164da>] flat_send_IPI_mask+0x0/0x3d
Feb 7 22:46:00 oceanic kernel: [<ffffffff8023dc52>] hrtimer_run_queues+0x105/0x164
Feb 7 22:46:00 oceanic kernel: [<ffffffff80231d0f>] run_timer_softirq+0x21/0x19f
Feb 7 22:46:00 oceanic kernel: [<ffffffff8022eb49>] tasklet_action+0x53/0x9d
Feb 7 22:46:00 oceanic kernel: [<ffffffff8022ea87>] __do_softirq+0x55/0xc4
Feb 7 22:46:00 oceanic kernel: [<ffffffff8020a8fc>] call_softirq+0x1c/0x28
Feb 7 22:46:00 oceanic kernel: [<ffffffff8020c3b1>] do_softirq+0x2c/0x7d
Feb 7 22:46:00 oceanic kernel: [<ffffffff8021431b>] smp_apic_timer_interrupt+0x56/0x68
Feb 7 22:46:00 oceanic kernel: [<ffffffff80208a55>] default_idle+0x0/0x3d
Feb 7 22:46:00 oceanic kernel: [<ffffffff8020a3a6>] apic_timer_interrupt+0x66/0x70
Feb 7 22:46:00 oceanic kernel: <EOI> [<ffffffff80208a7e>] default_idle+0x29/0x3d
Feb 7 22:46:00 oceanic kernel: [<ffffffff80208ae4>] cpu_idle+0x52/0x71
Feb 7 22:46:00 oceanic kernel: [<ffffffff8057fbbc>] start_secondary+0x465/0x474

--
?T


2007-02-08 04:48:10

by Andrew Morton

[permalink] [raw]
Subject: Re: 2.6.20 BUG: soft lockup detected on CPU#0!

On Thu, 8 Feb 2007 00:02:10 +0100 (CET) Lukasz Trabinski <[email protected]> wrote:

> Hello
>
> On 2.6.19 I had about 60 days uptime, on 2.6.20 2 days :(
>

Did the machine actually fail? Or did it just print these messages and
keep going?


>
> oceanic:~$ uname -a
> Linux oceanic.wsisiz.edu.pl 2.6.20-oceanic #2 SMP Sun Feb 4 21:55:29 CET
> 2007 x86_64 x86_64 x86_64 GNU/Linux
>
> Feb 7 22:46:00 oceanic kernel: BUG: soft lockup detected on CPU#0!
> Feb 7 22:46:00 oceanic kernel:
> Feb 7 22:46:00 oceanic kernel: Call Trace:
> Feb 7 22:46:00 oceanic kernel: <IRQ> [<ffffffff80250550>] softlockup_tick+0xdb/0xed
> Feb 7 22:46:00 oceanic kernel: [<ffffffff8022ea87>] __do_softirq+0x55/0xc4
> Feb 7 22:46:00 oceanic kernel: [<ffffffff8023256c>] update_process_times+0x42/0x68
> Feb 7 22:46:00 oceanic kernel: [<ffffffff80213e36>] smp_local_timer_interrupt+0x34/0x55
> Feb 7 22:46:00 oceanic kernel: [<ffffffff80214316>] smp_apic_timer_interrupt+0x51/0x68
> Feb 7 22:46:00 oceanic kernel: [<ffffffff8020a3a6>] apic_timer_interrupt+0x66/0x70
> Feb 7 22:46:00 oceanic kernel: <EOI> [<ffffffff804393c3>] _spin_unlock_irqrestore+0x8/0x9
> Feb 7 22:46:00 oceanic kernel: [<ffffffff804393c3>] _spin_unlock_irqrestore+0x8/0x9
> Feb 7 22:46:00 oceanic kernel: [<ffffffff8023dddd>] hrtimer_try_to_cancel+0x4a/0x53
> Feb 7 22:46:00 oceanic kernel: [<ffffffff8023ddf2>] hrtimer_cancel+0xc/0x16
> Feb 7 22:46:00 oceanic kernel: [<ffffffff8023ddf2>] hrtimer_cancel+0xc/0x16
> Feb 7 22:46:00 oceanic kernel: [<ffffffff8022cc66>] do_exit+0x1c8/0x800
> Feb 7 22:46:00 oceanic kernel: [<ffffffff8022d31d>] sys_exit_group+0x0/0xe
> Feb 7 22:46:00 oceanic kernel: [<ffffffff8020995c>] tracesys+0xdc/0xe1
> Feb 7 22:46:00 oceanic kernel:
> Feb 7 22:46:00 oceanic kernel: BUG: soft lockup detected on CPU#1!
> Feb 7 22:46:00 oceanic kernel:
> Feb 7 22:46:00 oceanic kernel: Call Trace:
> Feb 7 22:46:00 oceanic kernel: <IRQ> [<ffffffff80250550>] softlockup_tick+0xdb/0xed
> Feb 7 22:46:00 oceanic kernel: [<ffffffff8023256c>] update_process_times+0x42/0x68
> Feb 7 22:46:00 oceanic kernel: [<ffffffff80213e36>] smp_local_timer_interrupt+0x34/0x55
> Feb 7 22:46:00 oceanic kernel: [<ffffffff80214316>] smp_apic_timer_interrupt+0x51/0x68
> Feb 7 22:46:00 oceanic kernel: [<ffffffff802338c8>] __group_send_sig_info+0x35/0x8b
> Feb 7 22:46:00 oceanic kernel: [<ffffffff8022e0f1>] it_real_fn+0x0/0x4f
> Feb 7 22:46:00 oceanic kernel: [<ffffffff8020a3a6>] apic_timer_interrupt+0x66/0x70
> Feb 7 22:46:00 oceanic kernel: [<ffffffff802164da>] flat_send_IPI_mask+0x0/0x3d
> Feb 7 22:46:00 oceanic kernel: [<ffffffff8023dc52>] hrtimer_run_queues+0x105/0x164
> Feb 7 22:46:00 oceanic kernel: [<ffffffff80231d0f>] run_timer_softirq+0x21/0x19f
> Feb 7 22:46:00 oceanic kernel: [<ffffffff8022eb49>] tasklet_action+0x53/0x9d
> Feb 7 22:46:00 oceanic kernel: [<ffffffff8022ea87>] __do_softirq+0x55/0xc4
> Feb 7 22:46:00 oceanic kernel: [<ffffffff8020a8fc>] call_softirq+0x1c/0x28
> Feb 7 22:46:00 oceanic kernel: [<ffffffff8020c3b1>] do_softirq+0x2c/0x7d
> Feb 7 22:46:00 oceanic kernel: [<ffffffff8021431b>] smp_apic_timer_interrupt+0x56/0x68
> Feb 7 22:46:00 oceanic kernel: [<ffffffff80208a55>] default_idle+0x0/0x3d
> Feb 7 22:46:00 oceanic kernel: [<ffffffff8020a3a6>] apic_timer_interrupt+0x66/0x70
> Feb 7 22:46:00 oceanic kernel: <EOI> [<ffffffff80208a7e>] default_idle+0x29/0x3d
> Feb 7 22:46:00 oceanic kernel: [<ffffffff80208ae4>] cpu_idle+0x52/0x71
> Feb 7 22:46:00 oceanic kernel: [<ffffffff8057fbbc>] start_secondary+0x465/0x474
>

The softlock detector has a long history of false positives and precious
few true positives, in my experience.

2007-02-08 08:07:22

by Ingo Molnar

[permalink] [raw]
Subject: Re: 2.6.20 BUG: soft lockup detected on CPU#0!


* Andrew Morton <[email protected]> wrote:

> The softlock detector has a long history of false positives and
> precious few true positives, in my experience.

hm, not so the latest & lamest in my experience. The commit that made it
quite robust was 6687a97d4041f996f725902d2990e5de6ef5cbe5, as of March
2006, and first showed up in 2.6.17. (OTOH, since the merge of lockdep
the main source of soft lockups in the field has been quite severely
reduced. Nevertheless it's still good to have it around, occasionally
there happen other types of soft lockups too, in open-coded loops, etc.)

Ingo

2007-02-08 08:16:13

by Andrew Morton

[permalink] [raw]
Subject: Re: 2.6.20 BUG: soft lockup detected on CPU#0!

On Thu, 8 Feb 2007 09:06:44 +0100 Ingo Molnar <[email protected]> wrote:

>
> * Andrew Morton <[email protected]> wrote:
>
> > The softlock detector has a long history of false positives and
> > precious few true positives, in my experience.
>
> hm, not so the latest & lamest in my experience. The commit that made it
> quite robust was 6687a97d4041f996f725902d2990e5de6ef5cbe5, as of March
> 2006, and first showed up in 2.6.17. (OTOH, since the merge of lockdep
> the main source of soft lockups in the field has been quite severely
> reduced. Nevertheless it's still good to have it around, occasionally
> there happen other types of soft lockups too, in open-coded loops, etc.)
>

So... what caused Lukasz's lockup?

2007-02-08 09:16:28

by Eric Dumazet

[permalink] [raw]
Subject: Re: 2.6.20 BUG: soft lockup detected on CPU#0!

On Thursday 08 February 2007 09:06, Ingo Molnar wrote:
> * Andrew Morton <[email protected]> wrote:
> > The softlock detector has a long history of false positives and
> > precious few true positives, in my experience.
>
> hm, not so the latest & lamest in my experience. The commit that made it
> quite robust was 6687a97d4041f996f725902d2990e5de6ef5cbe5, as of March
> 2006, and first showed up in 2.6.17. (OTOH, since the merge of lockdep
> the main source of soft lockups in the field has been quite severely
> reduced. Nevertheless it's still good to have it around, occasionally
> there happen other types of soft lockups too, in open-coded loops, etc.)

This reminds me the current problem in close_files()
code, where we trigger soft lockup quite regularly.

Is there any chance/interest we can solve the issue Andrew had with this
patch ?

http://lkml.org/lkml/2006/5/2/273

Thank you
Eric


2007-02-08 09:56:53

by Ingo Molnar

[permalink] [raw]
Subject: Re: 2.6.20 BUG: soft lockup detected on CPU#0!


* Eric Dumazet <[email protected]> wrote:

> This reminds me the current problem in close_files()
> code, where we trigger soft lockup quite regularly.
>
> Is there any chance/interest we can solve the issue Andrew had with
> this patch ?
>
> http://lkml.org/lkml/2006/5/2/273

yes - the -rt patch included the patch below for more than 2 years.
(note that this one is even more finegrained)

Ingo

Index: linux/kernel/exit.c
===================================================================
--- linux.orig/kernel/exit.c
+++ linux/kernel/exit.c
@@ -431,8 +433,10 @@ static void close_files(struct files_str
while (set) {
if (set & 1) {
struct file * file = xchg(&fdt->fd[i], NULL);
- if (file)
+ if (file) {
filp_close(file, files);
+ cond_resched();
+ }
}
i++;
set >>= 1;

2007-02-08 10:13:07

by Andrew Morton

[permalink] [raw]
Subject: Re: 2.6.20 BUG: soft lockup detected on CPU#0!

On Thu, 8 Feb 2007 10:56:12 +0100 Ingo Molnar <[email protected]> wrote:

>
> * Eric Dumazet <[email protected]> wrote:
>
> > This reminds me the current problem in close_files()
> > code, where we trigger soft lockup quite regularly.
> >
> > Is there any chance/interest we can solve the issue Andrew had with
> > this patch ?
> >
> > http://lkml.org/lkml/2006/5/2/273
>
> yes - the -rt patch included the patch below for more than 2 years.
> (note that this one is even more finegrained)
>
> Ingo
>
> Index: linux/kernel/exit.c
> ===================================================================
> --- linux.orig/kernel/exit.c
> +++ linux/kernel/exit.c
> @@ -431,8 +433,10 @@ static void close_files(struct files_str
> while (set) {
> if (set & 1) {
> struct file * file = xchg(&fdt->fd[i], NULL);
> - if (file)
> + if (file) {
> filp_close(file, files);
> + cond_resched();
> + }
> }
> i++;
> set >>= 1;

That doesn't hang like the other patch did on 2.6.17-rc3.

Very mysterious.

2007-02-08 15:51:42

by Lukasz Trabinski

[permalink] [raw]
Subject: Re: 2.6.20 BUG: soft lockup detected on CPU#0!

On Wed, 7 Feb 2007, Andrew Morton wrote:

> On Thu, 8 Feb 2007 00:02:10 +0100 (CET) Lukasz Trabinski <[email protected]> wrote:
>
>> Hello
>>
>> On 2.6.19 I had about 60 days uptime, on 2.6.20 2 days :(
>>
>
> Did the machine actually fail? Or did it just print these messages and
> keep going?

Was message from tg3 network driver, but i don't remember it. :(
Machine was totaly locked, only ctrl sysrq boot...

--
?T