2005-05-04 20:12:24

by Brice Goglin

[permalink] [raw]
Subject: Soft lockup with -mm

0000:00:00.0 Host bridge: Intel Corp. 82865G/PE/P DRAM Controller/Host-Hub Interface (rev 02)
Subsystem: Dell: Unknown device 019d
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ >SERR- <PERR-
Latency: 0
Region 0: Memory at f8000000 (32-bit, prefetchable) [size=64M]
Capabilities: <available only to root>

0000:00:02.0 VGA compatible controller: Intel Corp. 82865G Integrated Graphics Device (rev 02) (prog-if 00 [VGA])
Subsystem: Dell: Unknown device 019d
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Interrupt: pin A routed to IRQ 193
Region 0: Memory at f0000000 (32-bit, prefetchable) [size=128M]
Region 1: Memory at feb80000 (32-bit, non-prefetchable) [size=512K]
Region 2: I/O ports at ed98 [size=8]
Capabilities: <available only to root>

0000:00:1d.0 USB Controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #1 (rev 02) (prog-if 00 [UHCI])
Subsystem: Dell: Unknown device 019d
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Interrupt: pin A routed to IRQ 193
Region 4: I/O ports at ff80 [size=32]

0000:00:1d.1 USB Controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #2 (rev 02) (prog-if 00 [UHCI])
Subsystem: Dell: Unknown device 019d
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Interrupt: pin B routed to IRQ 201
Region 4: I/O ports at ff60 [size=32]

0000:00:1d.3 USB Controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #4 (rev 02) (prog-if 00 [UHCI])
Subsystem: Dell: Unknown device 019d
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Interrupt: pin A routed to IRQ 193
Region 4: I/O ports at ff20 [size=32]

0000:00:1d.7 USB Controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) USB2 EHCI Controller (rev 02) (prog-if 20 [EHCI])
Subsystem: Dell: Unknown device 019d
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Interrupt: pin D routed to IRQ 209
Region 0: Memory at ffa80800 (32-bit, non-prefetchable) [size=1K]
Capabilities: <available only to root>

0000:00:1e.0 PCI bridge: Intel Corp. 82801 PCI Bridge (rev c2) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Bus: primary=00, secondary=01, subordinate=01, sec-latency=32
I/O behind bridge: 0000d000-0000dfff
Memory behind bridge: fea00000-feafffff
BridgeCtl: Parity- SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-

0000:00:1f.0 ISA bridge: Intel Corp. 82801EB/ER (ICH5/ICH5R) LPC Interface Bridge (rev 02)
Control: I/O+ Mem+ BusMaster+ SpecCycle+ MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0

0000:00:1f.1 IDE interface: Intel Corp. 82801EB/ER (ICH5/ICH5R) IDE Controller (rev 02) (prog-if 8a [Master SecP PriP])
Subsystem: Dell: Unknown device 019d
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Interrupt: pin A routed to IRQ 177
Region 0: I/O ports at <ignored>
Region 1: I/O ports at <ignored>
Region 2: I/O ports at <ignored>
Region 3: I/O ports at <ignored>
Region 4: I/O ports at ffa0 [size=16]
Region 5: Memory at feb7fc00 (32-bit, non-prefetchable) [size=1K]

0000:00:1f.3 SMBus: Intel Corp. 82801EB/ER (ICH5/ICH5R) SMBus Controller (rev 02)
Subsystem: Dell: Unknown device 019d
Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Interrupt: pin B routed to IRQ 3
Region 4: I/O ports at eda0 [size=32]

0000:00:1f.5 Multimedia audio controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) AC'97 Audio Controller (rev 02)
Subsystem: Dell: Unknown device 019d
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Interrupt: pin B routed to IRQ 217
Region 0: I/O ports at ee00 [size=256]
Region 1: I/O ports at edc0 [size=64]
Region 2: Memory at feb7fa00 (32-bit, non-prefetchable) [size=512]
Region 3: Memory at feb7f900 (32-bit, non-prefetchable) [size=256]
Capabilities: <available only to root>

0000:01:01.0 Modem: Intel Corp.: Unknown device 1080 (rev 04) (prog-if 00 [Generic])
Subsystem: Dell: Unknown device 1000
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping+ SERR+ FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 64 (250ns min, 15500ns max), Cache Line Size: 0x10 (64 bytes)
Interrupt: pin A routed to IRQ 169
Region 0: Memory at feafe000 (32-bit, non-prefetchable) [size=4K]
Region 1: I/O ports at de00 [size=256]
Capabilities: <available only to root>

0000:01:08.0 Ethernet controller: Intel Corp. 82562EZ 10/100 Ethernet Controller (rev 02)
Subsystem: Dell: Unknown device 019d
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 64 (2000ns min, 14000ns max), Cache Line Size: 0x10 (64 bytes)
Interrupt: pin A routed to IRQ 185
Region 0: Memory at feaff000 (32-bit, non-prefetchable) [size=4K]
Region 1: I/O ports at ddc0 [size=64]
Capabilities: <available only to root>


Attachments:
config (36.55 kB)
cpuinfo (0.00 B)
lspci (6.31 kB)
Download all attachments

2005-05-04 20:24:19

by Andrew Morton

[permalink] [raw]
Subject: Re: Soft lockup with -mm

Brice Goglin <[email protected]> wrote:
>
> I was seeing a lockup with several -mm releases since 2.6.12-rc2-mm1
> (IIRC). With 2.6.12-rc2-mm1, I remember getting the lockup a few minutes
> after boot time.
> With 2.6.12-rc3-mm1, I waited several days before getting it.
> But, I finally caught this one with netconsole. So here it is:
>
> BUG: soft lockup detected on CPU#0!
> Pid: 0, comm: swapper
> EIP: 0060:[<c02d40a5>] CPU: 0
> EIP is at _spin_unlock_irqrestore+0x5/0x30
> EFLAGS: 00000286 Not tainted (2.6.12-rc3-mm1=Pignouf)
> EAX: c18e8160 EBX: c18e8160 ECX: 00000001 EDX: 00000286
> ESI: c18e0160 EDI: dbf96c64 EBP: ffffffff DS: 007b ES: 007b
> CR0: 8005003b CR2: b6e43000 CR3: 0e912000 CR4: 00000690
> [<c012a635>] __mod_timer+0xc5/0xf0

It could be the timer bug. Can you try it with Oleg's fix?


From: Oleg Nesterov <[email protected]>

The bug was identified by Maneesh Soni.

When __mod_timer() changes timer's base it waits for the completion of
timer->function. It is just stupid: the caller of __mod_timer() can held
locks which would prevent completion of the timer's handler.

Solution: do not change the base of the currently running timer.

Side effect: __mod_timer() doesn't garantees anymore that timer will run on
the local cpu.

Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

kernel/timer.c | 42 ++++++++++++++++++++----------------------
1 files changed, 20 insertions(+), 22 deletions(-)

diff -puN kernel/timer.c~timers-fix-__mod_timer-vs-__run_timers-deadlock kernel/timer.c
--- 25/kernel/timer.c~timers-fix-__mod_timer-vs-__run_timers-deadlock 2005-05-01 02:20:28.415889280 -0700
+++ 25-akpm/kernel/timer.c 2005-05-01 02:20:28.420888520 -0700
@@ -211,41 +211,39 @@ int __mod_timer(struct timer_list *timer
timer_base_t *base;
tvec_base_t *new_base;
unsigned long flags;
- int ret = -1;
+ int ret;

BUG_ON(!timer->function);
check_timer(timer);

- do {
- base = lock_timer_base(timer, &flags);
- new_base = &__get_cpu_var(tvec_bases);
+ base = lock_timer_base(timer, &flags);

- /* Ensure the timer is serialized. */
- if (base != &new_base->t_base
- && base->running_timer == timer)
- goto unlock;
+ ret = 0;
+ if (timer_pending(timer)) {
+ detach_timer(timer, 0);
+ ret = 1;
+ }

- ret = 0;
- if (timer_pending(timer)) {
- detach_timer(timer, 0);
- ret = 1;
- }
+ new_base = &__get_cpu_var(tvec_bases);

- if (base != &new_base->t_base) {
+ if (base != &new_base->t_base) {
+ if (unlikely(base->running_timer == timer))
+ /* Don't change timer's base while it is running.
+ * Needed for serialization of timer wrt itself. */
+ new_base = container_of(base, tvec_base_t, t_base);
+ else {
timer->base = NULL;
/* Safe: the timer can't be seen via ->entry,
* and lock_timer_base checks ->base != 0. */
spin_unlock(&base->lock);
- base = &new_base->t_base;
- spin_lock(&base->lock);
- timer->base = base;
+ spin_lock(&new_base->t_base.lock);
+ timer->base = &new_base->t_base;
}
+ }

- timer->expires = expires;
- internal_add_timer(new_base, timer);
-unlock:
- spin_unlock_irqrestore(&base->lock, flags);
- } while (ret < 0);
+ timer->expires = expires;
+ internal_add_timer(new_base, timer);
+ spin_unlock_irqrestore(&new_base->t_base.lock, flags);

return ret;
}
_

2005-05-04 20:53:44

by Brice Goglin

[permalink] [raw]
Subject: Re: Soft lockup with -mm

Andrew Morton a ?crit :
> Brice Goglin <[email protected]> wrote:
>
>>I was seeing a lockup with several -mm releases since 2.6.12-rc2-mm1
>>(IIRC). With 2.6.12-rc2-mm1, I remember getting the lockup a few minutes
>>after boot time.
>>With 2.6.12-rc3-mm1, I waited several days before getting it.
>>But, I finally caught this one with netconsole. So here it is:
>>
>>BUG: soft lockup detected on CPU#0!
>>Pid: 0, comm: swapper
>>EIP: 0060:[<c02d40a5>] CPU: 0
>>EIP is at _spin_unlock_irqrestore+0x5/0x30
>> EFLAGS: 00000286 Not tainted (2.6.12-rc3-mm1=Pignouf)
>>EAX: c18e8160 EBX: c18e8160 ECX: 00000001 EDX: 00000286
>>ESI: c18e0160 EDI: dbf96c64 EBP: ffffffff DS: 007b ES: 007b
>>CR0: 8005003b CR2: b6e43000 CR3: 0e912000 CR4: 00000690
>> [<c012a635>] __mod_timer+0xc5/0xf0
>
>
> It could be the timer bug. Can you try it with Oleg's fix?

Ok, thanks.
I applied it on top of mm2 and started praying for a few days
until something happens or not.

Brice

2005-05-16 12:35:38

by Brice Goglin

[permalink] [raw]
Subject: Re: Soft lockup with -mm

Brice Goglin a ?crit :
> Andrew Morton a ?crit :
>
>>Brice Goglin <[email protected]> wrote:
>>
>>>I was seeing a lockup with several -mm releases since 2.6.12-rc2-mm1
>>>(IIRC). With 2.6.12-rc2-mm1, I remember getting the lockup a few minutes
>>>after boot time.
>>>With 2.6.12-rc3-mm1, I waited several days before getting it.
>>>But, I finally caught this one with netconsole. So here it is:
>>>
>>>BUG: soft lockup detected on CPU#0!
>>>Pid: 0, comm: swapper
>>>EIP: 0060:[<c02d40a5>] CPU: 0
>>>EIP is at _spin_unlock_irqrestore+0x5/0x30
>>>EFLAGS: 00000286 Not tainted (2.6.12-rc3-mm1=Pignouf)
>>>EAX: c18e8160 EBX: c18e8160 ECX: 00000001 EDX: 00000286
>>>ESI: c18e0160 EDI: dbf96c64 EBP: ffffffff DS: 007b ES: 007b
>>>CR0: 8005003b CR2: b6e43000 CR3: 0e912000 CR4: 00000690
>>>[<c012a635>] __mod_timer+0xc5/0xf0
>>
>>
>>It could be the timer bug. Can you try it with Oleg's fix?
>
>
> Ok, thanks.
> I applied it on top of mm2 and started praying for a few days
> until something happens or not.

Hi Andrew,

This box is running 2.6.12-rc3-mm2 with Oleg's fix for 12 days.
Looks like the soft-lockup I was seeing is fixed.
I'll try to keep this box running -mm kernels to be sure the lockup
won't occur anymore.

Thanks,
Brice