2008-02-15 15:19:46

by Denys Fedoryshchenko

Subject: BUG/ spinlock lockup, 2.6.24

The server crashed (it stopped responding over the network). The last line received over netconsole was:

Feb 15 15:50:17 217.151.X.X [1521315.068984] BUG: spinlock lockup on CPU#1, ksoftirqd/1/7, f0551180
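
For readers decoding that message: the fields appear to follow the layout printed by the kernel's spinlock debugging code (CPU number, task name, PID, lock address), so here the task would be ksoftirqd/1 with PID 7 and the lock would be at f0551180. A minimal Python sketch for pulling those fields out of a captured line is below. The name parse_lockup and the regular expression are illustrative assumptions, not part of any kernel tool.

```python
import re

# Assumed layout: "BUG: spinlock lockup on CPU#<cpu>, <comm>/<pid>, <lock address>".
# The task name (comm) may itself contain "/", as in "ksoftirqd/1", so the
# greedy ".+" is used to leave only the final "/<digits>" for the PID.
LOCKUP_RE = re.compile(
    r"BUG: spinlock lockup on CPU#(?P<cpu>\d+),\s+"
    r"(?P<comm>.+)/(?P<pid>\d+),\s+(?P<lock>[0-9a-fA-F]+)\s*$"
)


def parse_lockup(line):
    """Return the fields of a spinlock lockup message, or None if absent."""
    match = LOCKUP_RE.search(line)
    if match is None:
        return None
    return {
        "cpu": int(match.group("cpu")),
        "comm": match.group("comm"),
        "pid": int(match.group("pid")),
        # The address is printed in hex without a "0x" prefix.
        "lock": int(match.group("lock"), 16),
    }
```

The lock address on its own does not say which lock it is; it is only useful together with a backtrace or a matching symbol dump from the same boot.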

I get random crashes, at least once a week. It is very difficult to catch the
error message, and I only recently set up netconsole. Now I have caught a
crash, but there is no traceback: only the single line quoted above came over
netconsole.
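
Since netconsole delivers kernel messages as plain UDP datagrams, the receiving side can be very small. The following Python sketch shows one possible receiver that timestamps each datagram and flushes it to a file immediately, so nothing is lost if the sender dies mid-report. The function names and the log file name are hypothetical; port 6666 is netconsole's documented default target port, and the setup actually used on this server is not stated in this thread.

```python
import socket
import time


def format_record(payload, sender, when):
    """Render one netconsole datagram as a timestamped log line."""
    text = payload.decode("utf-8", errors="replace").rstrip("\r\n")
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(when))
    return f"{stamp} {sender} {text}"


def serve(sock, out, max_messages=None):
    """Write every datagram received on sock to out, flushing each one.

    Runs until max_messages datagrams were handled (forever if None) and
    returns the number of datagrams written.
    """
    count = 0
    while max_messages is None or count < max_messages:
        payload, (host, _port) = sock.recvfrom(65535)
        out.write(format_record(payload, host, time.time()) + "\n")
        out.flush()
        count += 1
    return count


# Example use (blocks until interrupted):
#   sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
#   sock.bind(("0.0.0.0", 6666))
#   with open("netconsole.log", "a") as log:
#       serve(sock, log)
```

A receiver like this only records what the crashing kernel manages to send; it cannot recover a traceback that never left the machine.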

.config file
http://www.nuclearcat.com/files/config_qos

The kernel is 2.6.24 with the epoll patch (it is from mainline) applied.
cat /proc/version
Linux version 2.6.24-devel (root@visp-1) (gcc version 4.1.1 (Gentoo 4.1.1-r3)) #1 SMP Sat Jan 26 17:26:54 EET 2008

visp-1 ~ # cat /proc/interrupts
        CPU0    CPU1    CPU2    CPU3
  0:   18361   17785   17471   17748  IO-APIC-edge     timer
  1:       2       0       0       0  IO-APIC-edge     i8042
  8:       5       4       3       4  IO-APIC-edge     rtc
  9:       0       0       0       0  IO-APIC-fasteoi  acpi
 12:       1       0       1       2  IO-APIC-edge     i8042
 14:      14      17      17      15  IO-APIC-edge     libata
 15:       0       0       0       0  IO-APIC-edge     libata
 17:     269     259     256     259  IO-APIC-fasteoi  ioc0
 18:       5       5       6       7  IO-APIC-fasteoi  ehci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb4
 19:       0       0       0       0  IO-APIC-fasteoi  uhci_hcd:usb3
 66:       1       0       0       0  none-<NULL>
212:      27      32      35      32  PCI-MSI-edge     eth1
213:   36818   36995   37307   37029  PCI-MSI-edge     eth0
214:       0       1       1       1  PCI-MSI-edge
NMI:   71107   70983   70962   70962  Non-maskable interrupts
LOC:   53005   53178   53490   53214  Local timer interrupts
RES:     414     434     363     378  Rescheduling interrupts
CAL:      52      46      56      47  function call interrupts
TLB:     398     288     403     264  TLB shootdowns
TRM:       0       0       0       0  Thermal event interrupts
SPU:       0       0       0       0  Spurious interrupts
ERR:       0
MIS:       0

visp-1 ~ # cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 6
model name : Intel(R) Xeon(TM) CPU 3.20GHz
stepping : 4
cpu MHz : 3192.163
cache size : 2048 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 6
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm
constant_tsc pebs bts sync_rdtsc pni monitor ds_cpl vmx cid cx16 xtpr lahf_lm
bogomips : 6390.17
clflush size : 64

processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 6
model name : Intel(R) Xeon(TM) CPU 3.20GHz
stepping : 4
cpu MHz : 3192.163
cache size : 2048 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 6
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm
constant_tsc pebs bts sync_rdtsc pni monitor ds_cpl vmx cid cx16 xtpr lahf_lm
bogomips : 6383.72
clflush size : 64

processor : 2
vendor_id : GenuineIntel
cpu family : 15
model : 6
model name : Intel(R) Xeon(TM) CPU 3.20GHz
stepping : 4
cpu MHz : 3192.163
cache size : 2048 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 2
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 6
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm
constant_tsc pebs bts sync_rdtsc pni monitor ds_cpl vmx cid cx16 xtpr lahf_lm
bogomips : 6383.75
clflush size : 64

processor : 3
vendor_id : GenuineIntel
cpu family : 15
model : 6
model name : Intel(R) Xeon(TM) CPU 3.20GHz
stepping : 4
cpu MHz : 3192.163
cache size : 2048 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 2
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 6
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm
constant_tsc pebs bts sync_rdtsc pni monitor ds_cpl vmx cid cx16 xtpr lahf_lm
bogomips : 6383.76
clflush size : 64


--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.


2008-02-15 15:25:22

by Bart Van Assche

Subject: Re: BUG/ spinlock lockup, 2.6.24

2008/2/15 Denys Fedoryshchenko <[email protected]>:
> I have random crashes, at least once per week. It is very difficult to catch
> error message, and only recently i setup netconsole. Now i got crash, but
> there is no traceback and only single line came over netconsole, mentioned
> before.

Have you already run memtest? You can run memtest by booting from the
Knoppix CD-ROM or DVD. Most Linux distributions also include memtest on
their bootable distribution CDs/DVDs.

Bart Van Assche.

2008-02-15 19:43:23

by Denys Fedoryshchenko

Subject: Re: BUG/ spinlock lockup, 2.6.24

This server worked fine under load with FreeBSD, and it worked fine before
with other tasks under Linux. I don't think it is the RAM.
Additionally, it is server hardware (Dell PowerEdge) with ECC, MCE and other
layers that would most probably report any hardware issue, and I think even
better than memtest would.
It is also very difficult to run a test on it, because it is in another
country and I have limited access to it (I don't have a network KVM).

I have similar crashes on completely different hardware doing the same job
(QoS), so I think it is actually some nasty bug in networking.


On Fri, 15 Feb 2008 16:24:56 +0100, Bart Van Assche wrote
> 2008/2/15 Denys Fedoryshchenko <[email protected]>:
> > I have random crashes, at least once per week. It is very difficult to catch
> > error message, and only recently i setup netconsole. Now i got crash, but
> > there is no traceback and only single line came over netconsole, mentioned
> > before.
>
> Did you already run memtest ? You can run memtest by booting from the
> Knoppix CD-ROM or DVD. Most Linux distributions also have included
> memtest on their bootable distribution CD's/DVD's.
>
> Bart Van Assche.


--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.

2008-02-15 20:18:49

by Jarek Poplawski

Subject: Re: BUG/ spinlock lockup, 2.6.24

Denys Fedoryshchenko wrote, On 02/15/2008 08:42 PM:
...

> I have similar crashes on completely different hardware with same job (QOS),
> so i think it is actually some nasty bug in networking.

Maybe you could try some other debugging options? E.g. since lockdep
doesn't help, turn it off and try some others instead, like these:

> # CONFIG_DEBUG_SPINLOCK_SLEEP is not set
> # CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
> # CONFIG_DEBUG_KOBJECT is not set
> # CONFIG_DEBUG_HIGHMEM is not set
> # CONFIG_DEBUG_VM is not set
> # CONFIG_DEBUG_LIST is not set
> # CONFIG_DEBUG_SG is not set
> # CONFIG_BOOT_PRINTK_DELAY is not set
> # CONFIG_DEBUG_STACKOVERFLOW is not set
> # CONFIG_DEBUG_STACK_USAGE is not set
> # CONFIG_DEBUG_RODATA is not set
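
To see which of the options listed above are still disabled in a given .config, something like the following Python sketch could be used. The helper names are hypothetical, and it only understands the two line forms found in a kernel .config: "CONFIG_X=value" and "# CONFIG_X is not set".

```python
import re

NOT_SET_RE = re.compile(r"^# (CONFIG_[A-Za-z0-9_]+) is not set$")
SET_RE = re.compile(r"^(CONFIG_[A-Za-z0-9_]+)=(.*)$")


def parse_config(text):
    """Map option name to its value; None means "is not set"."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        match = NOT_SET_RE.match(line)
        if match:
            options[match.group(1)] = None
            continue
        match = SET_RE.match(line)
        if match:
            options[match.group(1)] = match.group(2)
    return options


def disabled(options, wanted):
    """Return the wanted options that are unset, set to "n", or missing."""
    return [name for name in wanted if options.get(name) in (None, "n")]


# Example use, with a few of the option names quoted in this message:
#   with open(".config") as fh:
#       opts = parse_config(fh.read())
#   print(disabled(opts, ["CONFIG_DEBUG_LIST", "CONFIG_DEBUG_VM",
#                         "CONFIG_DEBUG_STACKOVERFLOW"]))
```

An option that does not appear in the file at all is reported as disabled too, which is usually what is wanted when checking an older .config against newer option names.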

Regards,
Jarek P.

2008-02-15 21:00:50

by Jarek Poplawski

Subject: Re: BUG/ spinlock lockup, 2.6.24

Jarek Poplawski wrote, On 02/15/2008 09:21 PM:

> Denys Fedoryshchenko wrote, On 02/15/2008 08:42 PM:
> ...
>
>> I have similar crashes on completely different hardware with same job (QOS),
>> so i think it is actually some nasty bug in networking.
>
> Maybe you could try with some other debugging options? E.g. since lockdep
> doesn't help - turn this off. Instead try some others, like these:

...On the other hand this:

> Feb 15 15:50:17 217.151.X.X [1521315.068984] BUG: spinlock lockup on CPU#1,
> ksoftirqd/1/7, f0551180

seems to point at just a spinlock lockup, so it's more about getting the full
report. I wonder if this patch to printk could help here:

author Ingo Molnar <mingo at elte.hu>
Fri, 25 Jan 2008 20:07:58 +0000 (21:07 +0100)
printk: make printk more robust by not allowing recursion

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=32a76006683f7b28ae3cc491da37716e002f198e

Jarek P.

2008-02-15 22:57:29

by Jarek Poplawski

Subject: Re: BUG/ spinlock lockup, 2.6.24

Jarek Poplawski wrote, On 02/15/2008 10:03 PM:
...

> ...On the other hand this:
>
>> Feb 15 15:50:17 217.151.X.X [1521315.068984] BUG: spinlock lockup on CPU#1,
>> ksoftirqd/1/7, f0551180
>
> seems to point just at spinlock lockup, so it's more about the full report.
> I wonder if this patch to printk could help here:
>
> author Ingo Molnar <mingo at elte.hu>
> Fri, 25 Jan 2008 20:07:58 +0000 (21:07 +0100)
> printk: make printk more robust by not allowing recursion
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=32a76006683f7b28ae3cc491da37716e002f198e


...or maybe a patch like this attached here?

Jarek P.


Attachments:
spinlock_debug.diff (733.00 B)