2007-02-16 20:38:17

by Michal Piotrowski

[permalink] [raw]
Subject: 2.6.20-git13 kernel BUG at /mnt/md0/devel/linux-git/kernel/time/tick-sched.c:168

Hi,

This looks like a tickless stuff

------------[ cut here ]------------
kernel BUG at /mnt/md0/devel/linux-git/kernel/time/tick-sched.c:168!
invalid opcode: 0000 [#1]
PREEMPT SMP
Modules linked in: rtc unix
CPU: 0
EIP: 0060:[<c0139ea0>] Not tainted VLI
EFLAGS: 00010002 (2.6.20 #53)
EIP is at tick_nohz_stop_sched_tick+0x5d/0x1d9
eax: 00000000 ebx: c6021100 ecx: 00000001 edx: c03e74e0
esi: c6022c40 edi: 00000000 ebp: c041dfac esp: c041df78
ds: 007b es: 007b fs: 00d8 gs: 0000 ss: 0068
Process swapper (pid: 0, ti=c041d000 task=c03e74e0 task.ti=c041d000)
Stack: 00000000 c03e7608 c6021b20 00000000 c041dfac 00000296 00000001 c602007b
c01000d8 c0451304 c6021100 c0102b42 00000000 c041dfc0 c01023b4 c602ee80
c0453ae4 c602b864 c041dfc8 c010141d c041dff8 c0421a52 c038d4da c04211ed
Call Trace:
[<c01050f9>] show_trace_log_lvl+0x1a/0x2f
[<c01051ab>] show_stack_log_lvl+0x9d/0xa5
[<c01053a0>] show_registers+0x1ed/0x32c
[<c01055f7>] die+0x118/0x22f
[<c0105787>] do_trap+0x79/0x91
[<c0106017>] do_invalid_op+0x97/0xa1
[<c0311abc>] error_code+0x7c/0x84
[<c01023b4>] cpu_idle+0x1d/0xfb
[<c010141d>] rest_init+0x37/0x3a
[<c0421a52>] start_kernel+0x459/0x461
[<00000000>] 0x0
=======================
Code: f0 ff ff 8b 40 08 a8 08 0f 85 74 01 00 00 e8 74 e7 0b 00 89 c7 bb 80 13 45 c0 e8 68 e7 0b 00 03 1c 85 00 a6 41 c0 83 3b 00 74 04 <0f> 0b eb fe 8d 45 ec e8 bf c0 ff ff 8b 4d ec 8b 5d f0 83 7e 3c
EIP: [<c0139ea0>] tick_nohz_stop_sched_tick+0x5d/0x1d9 SS:ESP 0068:c041df78
Kernel panic - not syncing: Attempted to kill the idle task!
BUG: at /mnt/md0/devel/linux-git/arch/i386/kernel/smp.c:547 smp_call_function()
[<c01050f9>] show_trace_log_lvl+0x1a/0x2f
[<c01057e0>] show_trace+0x12/0x14
[<c0105892>] dump_stack+0x16/0x18
[<c0112792>] smp_call_function+0x61/0x102
[<c0112851>] smp_send_stop+0x1e/0x31
[<c0121ae3>] panic+0x5e/0xfb
[<c012463b>] do_exit+0x7a/0x702
[<c01056e8>] die+0x209/0x22f
[<c0105787>] do_trap+0x79/0x91
[<c0106017>] do_invalid_op+0x97/0xa1
[<c0311abc>] error_code+0x7c/0x84
[<c01023b4>] cpu_idle+0x1d/0xfb
[<c010141d>] rest_init+0x37/0x3a
[<c0421a52>] start_kernel+0x459/0x461
[<00000000>] 0x0
=======================

0xc0139ea0 is in tick_nohz_stop_sched_tick (/mnt/md0/devel/linux-git/kernel/time/tick-sched.c:168).
163
164 if (need_resched())
165 goto end;
166
167 cpu = smp_processor_id();
168 BUG_ON(local_softirq_pending());
169
170 now = ktime_get();
171 /*
172 * When called from irq_exit we need to account the idle sleep time


http://www.stardust.webpages.pl/files/tbf/euridica/2.6.20-git13/boot.log
http://www.stardust.webpages.pl/files/tbf/euridica/2.6.20-git13/git-config

Regards,
Michal

--
Michal K. K. Piotrowski
LTG - Linux Testers Group (PL)
(http://www.stardust.webpages.pl/ltg/)
LTG - Linux Testers Group (EN)
(http://www.stardust.webpages.pl/linux_testers_group_en/)


2007-02-16 21:17:36

by Thomas Gleixner

[permalink] [raw]
Subject: Re: 2.6.20-git13 kernel BUG at /mnt/md0/devel/linux-git/kernel/time/tick-sched.c:168

On Fri, 2007-02-16 at 21:38 +0100, Michal Piotrowski wrote:
> Hi,
>
> This looks like a tickless stuff

Yup.

> 0xc0139ea0 is in tick_nohz_stop_sched_tick (/mnt/md0/devel/linux-git/kernel/time/tick-sched.c:168).
> 163
> 164 if (need_resched())
> 165 goto end;
> 166
> 167 cpu = smp_processor_id();
> 168 BUG_ON(local_softirq_pending());

Hmm, the BUG_ON is inside of an interrupt disabled region, so we should
have bailed out early in the need_resched() check above (because we are
in the idle task context according to the stack trace).

Is this reproducible ?

tglx


2007-02-16 21:24:41

by Michal Piotrowski

[permalink] [raw]
Subject: Re: 2.6.20-git13 kernel BUG at /mnt/md0/devel/linux-git/kernel/time/tick-sched.c:168

On 16/02/07, Thomas Gleixner <[email protected]> wrote:
> On Fri, 2007-02-16 at 21:38 +0100, Michal Piotrowski wrote:
> > Hi,
> >
> > This looks like a tickless stuff
>
> Yup.
>
> > 0xc0139ea0 is in tick_nohz_stop_sched_tick (/mnt/md0/devel/linux-git/kernel/time/tick-sched.c:168).
> > 163
> > 164 if (need_resched())
> > 165 goto end;
> > 166
> > 167 cpu = smp_processor_id();
> > 168 BUG_ON(local_softirq_pending());
>
> Hmm, the BUG_ON is inside of an interrupt disabled region, so we should
> have bailed out early in the need_resched() check above (because we are
> in the idle task context according to the stack trace).
>
> Is this reproducible ?

Yes.

>
> tglx
>
>
>

Regards,
Michal

--
Michal K. K. Piotrowski
LTG - Linux Testers Group (PL)
(http://www.stardust.webpages.pl/ltg/)
LTG - Linux Testers Group (EN)
(http://www.stardust.webpages.pl/linux_testers_group_en/)

2007-02-17 15:09:33

by Thomas Gleixner

[permalink] [raw]
Subject: Re: 2.6.20-git13 kernel BUG at /mnt/md0/devel/linux-git/kernel/time/tick-sched.c:168

On Sat, 2007-02-17 at 15:47 +0100, Alex Riesen wrote:
> > > 164 if (need_resched())
> > > 165 goto end;
> > > 166
> > > 167 cpu = smp_processor_id();
> > > 168 BUG_ON(local_softirq_pending());
> >
> > Hmm, the BUG_ON is inside of an interrupt disabled region, so we should
> > have bailed out early in the need_resched() check above (because we are
> > in the idle task context according to the stack trace).
> >
> > Is this reproducible ?
>
> Seen this too (Ubuntu, P4/ht-SMT, SATA, typed from screen):

Can you please apply the patch below, so we can at least see, which
softirq is pending. This should trigger independently of hrtimers and
dynticks. You can keep it compiled in and disable it at the kernel
commandline with "nohz=off" and / or "highres=off"

tglx

diff --git a/arch/i386/kernel/process.c b/arch/i386/kernel/process.c
index bea304d..deeb90e 100644
--- a/arch/i386/kernel/process.c
+++ b/arch/i386/kernel/process.c
@@ -236,6 +236,11 @@ void cpu_idle(void)
* Otherwise, idle callbacks can misfire.
*/
local_irq_disable();
+ if (local_softirq_pending() && !need_resched())
+ printk(KERN_ERR
+ "Idle: local softirq pending: %04x",
+ local_softirq_pending());
+
enter_idle();
idle();
__exit_idle();
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 95e41f7..91d459f 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -165,7 +165,6 @@ void tick_nohz_stop_sched_tick(void)
goto end;

cpu = smp_processor_id();
- BUG_ON(local_softirq_pending());

now = ktime_get();
/*


2007-02-17 18:20:10

by Thomas Gleixner

[permalink] [raw]
Subject: Re: 2.6.20-git13 kernel BUG at /mnt/md0/devel/linux-git/kernel/time/tick-sched.c:168

On Sat, 2007-02-17 at 17:46 +0100, Alex Riesen wrote:
> > Can you please apply the patch below, so we can at least see, which
> > softirq is pending. This should trigger independently of hrtimers and
> > dynticks. You can keep it compiled in and disable it at the kernel
> > commandline with "nohz=off" and / or "highres=off"
>
> It did, only one time:
>
> Idle: local softirq pending: 0020<6>USB Universal Host Controller Interface driver v3.0

0x20 is the TASKLET_SOFTIRQ. I have no idea yet, how this can happen.

Can you please check, if this happens when you add "nohz=off" to the
kernel command line.

tglx





2007-02-17 22:41:41

by Michal Piotrowski

[permalink] [raw]
Subject: Re: 2.6.20-git13 kernel BUG at /mnt/md0/devel/linux-git/kernel/time/tick-sched.c:168

On 17/02/07, Alex Riesen <[email protected]> wrote:
> Thomas Gleixner, Sat, Feb 17, 2007 16:14:17 +0100:
> > On Sat, 2007-02-17 at 15:47 +0100, Alex Riesen wrote:
> > > > > 164 if (need_resched())
> > > > > 165 goto end;
> > > > > 166
> > > > > 167 cpu = smp_processor_id();
> > > > > 168 BUG_ON(local_softirq_pending());
> > > >
> > > > Hmm, the BUG_ON is inside of an interrupt disabled region, so we should
> > > > have bailed out early in the need_resched() check above (because we are
> > > > in the idle task context according to the stack trace).
> > > >
> > > > Is this reproducible ?
> > >
> > > Seen this too (Ubuntu, P4/ht-SMT, SATA, typed from screen):
> >
> > Can you please apply the patch below, so we can at least see, which
> > softirq is pending. This should trigger independently of hrtimers and
> > dynticks. You can keep it compiled in and disable it at the kernel
> > commandline with "nohz=off" and / or "highres=off"
>
> It did, only one time:
>
> Idle: local softirq pending: 0020<6>USB Universal Host Controller Interface driver v3.0
>

sudo cat /var/log/messages | grep Idle
Feb 17 17:35:34 bitis-gabonica kernel: Idle: local softirq pending:
0020<6>hdd: ATAPI 48X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache,
UDMA(33)
Feb 17 19:20:01 bitis-gabonica kernel: Idle: local softirq pending:
0020<3>Idle: local softirq pending: 0020<3>Idle: local softirq
pending: 0020<7>PM: Removing info for No Bus:vcs7

cat /proc/interrupts
CPU0 CPU1
0: 232838 0 IO-APIC-edge timer
1: 401 0 IO-APIC-edge i8042
7: 0 0 IO-APIC-edge parport0
8: 1 0 IO-APIC-edge rtc
9: 1 0 IO-APIC-fasteoi acpi
12: 4 0 IO-APIC-edge i8042
14: 319 0 IO-APIC-edge ide0
15: 1612 0 IO-APIC-edge ide1
16: 16494 0 IO-APIC-fasteoi uhci_hcd:usb3, libata
17: 4670 0 IO-APIC-fasteoi uhci_hcd:usb1, uhci_hcd:usb4
18: 0 0 IO-APIC-fasteoi uhci_hcd:usb2
19: 2 0 IO-APIC-fasteoi ehci_hcd:usb5
20: 2822 0 IO-APIC-fasteoi Intel ICH5
21: 2881 0 IO-APIC-fasteoi eth1
22: 58 0 IO-APIC-fasteoi eth0
NMI: 0 0
LOC: 232562 232846
ERR: 0
MIS: 0

I can confirm that this is ICH5 SATA controller problem.

Regards,
Michal

--
Michal K. K. Piotrowski
LTG - Linux Testers Group (PL)
(http://www.stardust.webpages.pl/ltg/)
LTG - Linux Testers Group (EN)
(http://www.stardust.webpages.pl/linux_testers_group_en/)

2007-02-18 08:36:51

by Thomas Gleixner

[permalink] [raw]
Subject: Re: 2.6.20-git13 kernel BUG at /mnt/md0/devel/linux-git/kernel/time/tick-sched.c:168

On Sat, 2007-02-17 at 23:41 +0100, Michal Piotrowski wrote:
> sudo cat /var/log/messages | grep Idle
> Feb 17 17:35:34 bitis-gabonica kernel: Idle: local softirq pending:
> 0020<6>hdd: ATAPI 48X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache,
> UDMA(33)
> Feb 17 19:20:01 bitis-gabonica kernel: Idle: local softirq pending:
> 0020<3>Idle: local softirq pending: 0020<3>Idle: local softirq
> pending: 0020<7>PM: Removing info for No Bus:vcs7>
>
> I can confirm that this is ICH5 SATA controller problem.

The arch/i386/kernel/process.c part of the patch should apply to 2.6.20
as well. Can you check if the problem is there too ?

tglx


2007-02-18 12:32:08

by Thomas Gleixner

[permalink] [raw]
Subject: Re: 2.6.20-git13 kernel BUG at /mnt/md0/devel/linux-git/kernel/time/tick-sched.c:168

On Sun, 2007-02-18 at 10:50 +0100, Alex Riesen wrote:
> > The arch/i386/kernel/process.c part of the patch should apply to 2.6.20
> > as well. Can you check if the problem is there too ?
>
> It does not apply and does not look trivially hackable.
> The code for cpu_idle was introduced in 2ff2d3d7 "i386: add idle notifier".

Here you go.

tglx

Index: linux-2.6.20/arch/i386/kernel/process.c
===================================================================
--- linux-2.6.20.orig/arch/i386/kernel/process.c
+++ linux-2.6.20/arch/i386/kernel/process.c
@@ -189,6 +189,13 @@ void cpu_idle(void)
play_dead();

__get_cpu_var(irq_stat).idle_timestamp = jiffies;
+
+ local_irq_disable();
+ if (local_softirq_pending() && !need_resched())
+ printk(KERN_ERR
+ "Idle: local softirq pending: %04x\n",
+ local_softirq_pending());
+ local_irq_enable();
idle();
}
preempt_enable_no_resched();




2007-02-19 01:00:07

by Michal Piotrowski

[permalink] [raw]
Subject: Re: 2.6.20-git13 kernel BUG at /mnt/md0/devel/linux-git/kernel/time/tick-sched.c:168

On 18/02/07, Alex Riesen <[email protected]> wrote:
> Thomas Gleixner, Sun, Feb 18, 2007 13:36:56 +0100:
> > On Sun, 2007-02-18 at 10:50 +0100, Alex Riesen wrote:
> > > > The arch/i386/kernel/process.c part of the patch should apply to 2.6.20
> > > > as well. Can you check if the problem is there too ?
> > >
> > > It does not apply and does not look trivially hackable.
> > > The code for cpu_idle was introduced in 2ff2d3d7 "i386: add idle notifier".
> >
> > Here you go.
> >
>
> Ok, I confirm the message in vanilla v2.6.20
>

dmesg | grep Idle
Idle: local softirq pending: 0020

uname -a
Linux bitis-gabonica.linuxtestersgroup.org 2.6.20 #1 SMP PREEMPT Sun
Feb 18 15:20:27 CET 2007 i686 i686 i386 GNU/Linux

Regards,
Michal

--
Michal K. K. Piotrowski
LTG - Linux Testers Group (PL)
(http://www.stardust.webpages.pl/ltg/)
LTG - Linux Testers Group (EN)
(http://www.stardust.webpages.pl/linux_testers_group_en/)

2007-02-21 12:19:12

by Michal Piotrowski

[permalink] [raw]
Subject: Re: 2.6.20-git13 kernel BUG at /mnt/md0/devel/linux-git/kernel/time/tick-sched.c:168

Michal Piotrowski napisaƂ(a):
> On 17/02/07, Alex Riesen <[email protected]> wrote:
>> Thomas Gleixner, Sat, Feb 17, 2007 16:14:17 +0100:
>> > On Sat, 2007-02-17 at 15:47 +0100, Alex Riesen wrote:
>> > > > > 164 if (need_resched())
>> > > > > 165 goto end;
>> > > > > 166
>> > > > > 167 cpu = smp_processor_id();
>> > > > > 168 BUG_ON(local_softirq_pending());
>> > > >
>> > > > Hmm, the BUG_ON is inside of an interrupt disabled region, so we
>> should
>> > > > have bailed out early in the need_resched() check above (because
>> we are
>> > > > in the idle task context according to the stack trace).
>> > > >
>> > > > Is this reproducible ?
>> > >
>> > > Seen this too (Ubuntu, P4/ht-SMT, SATA, typed from screen):
>> >
>> > Can you please apply the patch below, so we can at least see, which
>> > softirq is pending. This should trigger independently of hrtimers and
>> > dynticks. You can keep it compiled in and disable it at the kernel
>> > commandline with "nohz=off" and / or "highres=off"
>>
>> It did, only one time:
>>
>> Idle: local softirq pending: 0020<6>USB Universal Host Controller
>> Interface driver v3.0
>>
>
> sudo cat /var/log/messages | grep Idle
> Feb 17 17:35:34 bitis-gabonica kernel: Idle: local softirq pending:
> 0020<6>hdd: ATAPI 48X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache,
> UDMA(33)
> Feb 17 19:20:01 bitis-gabonica kernel: Idle: local softirq pending:
> 0020<3>Idle: local softirq pending: 0020<3>Idle: local softirq
> pending: 0020<7>PM: Removing info for No Bus:vcs7
>
> cat /proc/interrupts
> CPU0 CPU1
> 0: 232838 0 IO-APIC-edge timer
> 1: 401 0 IO-APIC-edge i8042
> 7: 0 0 IO-APIC-edge parport0
> 8: 1 0 IO-APIC-edge rtc
> 9: 1 0 IO-APIC-fasteoi acpi
> 12: 4 0 IO-APIC-edge i8042
> 14: 319 0 IO-APIC-edge ide0
> 15: 1612 0 IO-APIC-edge ide1
> 16: 16494 0 IO-APIC-fasteoi uhci_hcd:usb3, libata
> 17: 4670 0 IO-APIC-fasteoi uhci_hcd:usb1, uhci_hcd:usb4
> 18: 0 0 IO-APIC-fasteoi uhci_hcd:usb2
> 19: 2 0 IO-APIC-fasteoi ehci_hcd:usb5
> 20: 2822 0 IO-APIC-fasteoi Intel ICH5
> 21: 2881 0 IO-APIC-fasteoi eth1
> 22: 58 0 IO-APIC-fasteoi eth0
> NMI: 0 0
> LOC: 232562 232846
> ERR: 0
> MIS: 0
>
> I can confirm that this is ICH5 SATA controller problem.

Here is something interesting

cat /var/log/messages | tail -n 300 | grep NOHZ
Feb 20 20:09:39 bitis-gabonica kernel: NOHZ: local_softirq_pending 20
Feb 20 20:09:39 bitis-gabonica kernel: NOHZ: local_softirq_pending 20
Feb 20 21:10:01 bitis-gabonica kernel: NOHZ: local_softirq_pending 20
Feb 20 23:20:01 bitis-gabonica kernel: NOHZ: local_softirq_pending 20
Feb 21 05:10:28 bitis-gabonica kernel: NOHZ: local_softirq_pending 02
Feb 21 05:10:46 bitis-gabonica kernel: NOHZ: local_softirq_pending 02
^^^^^^^^^

Feb 21 05:10:57 bitis-gabonica kernel: NOHZ: local_softirq_pending 20
Feb 21 05:11:48 bitis-gabonica kernel: NOHZ: local_softirq_pending 02
Feb 21 05:12:02 bitis-gabonica kernel: NOHZ: local_softirq_pending 20
Feb 21 05:12:02 bitis-gabonica kernel: NOHZ: local_softirq_pending 20
Feb 21 05:12:23 bitis-gabonica kernel: NOHZ: local_softirq_pending 02
Feb 21 05:12:29 bitis-gabonica kernel: NOHZ: local_softirq_pending 20
Feb 21 05:13:27 bitis-gabonica kernel: NOHZ: local_softirq_pending 20
Feb 21 05:13:49 bitis-gabonica kernel: NOHZ: local_softirq_pending 02
Feb 21 05:13:54 bitis-gabonica kernel: NOHZ: local_softirq_pending 20
Feb 21 05:14:48 bitis-gabonica kernel: NOHZ: local_softirq_pending 02
Feb 21 05:15:11 bitis-gabonica kernel: NOHZ: local_softirq_pending 20
Feb 21 05:15:13 bitis-gabonica kernel: NOHZ: local_softirq_pending 20
Feb 21 05:15:16 bitis-gabonica kernel: NOHZ: local_softirq_pending 02
Feb 21 05:16:18 bitis-gabonica kernel: NOHZ: local_softirq_pending 20
Feb 21 05:16:31 bitis-gabonica kernel: NOHZ: local_softirq_pending 02
Feb 21 05:17:03 bitis-gabonica kernel: NOHZ: local_softirq_pending 02
Feb 21 05:17:06 bitis-gabonica kernel: NOHZ: local_softirq_pending 20
Feb 21 05:17:09 bitis-gabonica kernel: NOHZ: local_softirq_pending 20
Feb 21 05:17:12 bitis-gabonica kernel: NOHZ: local_softirq_pending 02
Feb 21 05:17:17 bitis-gabonica kernel: NOHZ: local_softirq_pending 20
Feb 21 05:17:18 bitis-gabonica kernel: NOHZ: local_softirq_pending 02
^^^^^^^^
23 times in 7 minutes.


Regards,
Michal

--
Michal K. K. Piotrowski
LTG - Linux Testers Group (PL)
(http://www.stardust.webpages.pl/ltg/)
LTG - Linux Testers Group (EN)
(http://www.stardust.webpages.pl/linux_testers_group_en/)