2003-07-21 13:57:52

by Udo A. Steinberg

Subject: CPU Lockup with 2.4.21 and 2.4.22-pre


Hi all,

We have a Dual-Xeon machine with Hyperthreading which keeps locking up hard,
so that not even Sysrq works anymore. I have captured such a lockup using the
NMI oopser. Below you'll find the lockup fed through ksymoops. Note that
after CPU3 locked up, CPU2 did too. But that lockup couldn't be captured
anymore. Kernel is a monolithic 2.4.22-pre6. Problem also happened on
plain 2.4.21. I can provide more information wrt. hardware, config etc.
on request.

Regards,
-Udo.


ksymoops 2.4.9 on i686 2.4.22-pre6. Options used
-V (default)
-K (specified)
-l /proc/modules (default)
-o /lib/modules/2.4.22-pre6/ (default)
-m /boot/System.map-2.4.21 (specified)

No modules in ksyms, skipping objects
No ksyms, skipping lsmod
NMI Watchdog detected LOCKUP on CPU3, eip c01f8364, registers:
CPU: 3
EIP: 0010:[<c01f8364>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00000082
eax: 00000006 ebx: 00000202 ecx: f7ee3400 edx: c01f5d20
esi: f7ee3400 edi: 00000180 ebp: 00000003 esp: f7efff64
ds: 0018 es: 0018 ss: 0018
Process ksoftirqd_CPU3 (pid: 6, stackpage=f7eff000)
Stack: f7ee3428 c02d03e6 c0105cfa f6dae000 f7e4ac80 f7efff88 f7efff88 c011f8da
f7ee3400 f7ee34ac f7ee34ac c0393434 00000000 c0122cfd c02ebe1c c011f7f5
c011f6a3 00000009 00000001 c0367a00 fffffffe c011f456 c0367a00 00000246
Call Trace: [<c0105cfa>] [<c011f8da>] [<c0122cfd>] [<c011f7f5>] [<c011f6a3>]
[<c011f456>] [<c011f9b5>] [<c0105000>] [<c01058ee>] [<c011f8f0>]
Code: 7e f5 e9 e8 d9 ff ff 80 3d 40 be 2e c0 00 f3 90 7e f5 e9 db


>>EIP; c01f8364 <pcibios_lookup_irq+194/370> <=====

>>edx; c01f5d20 <restore_i387+70/1a0>

Trace; c0105cfa <ext2_file_operations+3a/60>
Trace; c011f8da <unix_stream_ops+1a/60>
Trace; c0122cfd <init_tss+7fd/2000>
Trace; c011f7f5 <arpt_sockopts+15/40>
Trace; c011f6a3 <required_len.1+23/60>
Trace; c011f456 <info.0+76/140>
Trace; c011f9b5 <unix_table+15/60>
Trace; c0105000 <proc_mem_inode_operations+20/60>
Trace; c01058ee <nibblemap+e/40>
Trace; c011f8f0 <unix_stream_ops+30/60>

Code; c01f8364 <pcibios_lookup_irq+194/370>
00000000 <_EIP>:
Code; c01f8364 <pcibios_lookup_irq+194/370> <=====
0: 7e f5 jle fffffff7 <_EIP+0xfffffff7> <=====
Code; c01f8366 <pcibios_lookup_irq+196/370>
2: e9 e8 d9 ff ff jmp ffffd9ef <_EIP+0xffffd9ef>
Code; c01f836b <pcibios_lookup_irq+19b/370>
7: 80 3d 40 be 2e c0 00 cmpb $0x0,0xc02ebe40
Code; c01f8372 <pcibios_lookup_irq+1a2/370>
e: f3 90 repz nop
Code; c01f8374 <pcibios_lookup_irq+1a4/370>
10: 7e f5 jle 7 <_EIP+0x7>
Code; c01f8376 <pcibios_lookup_irq+1a6/370>
12: e9 db 00 00 00 jmp f2 <_EIP+0xf2>

NMI Watchdog detected LOCKUP on CPU2, eip c01062cd, registers:



2003-07-21 14:03:08

by Udo A. Steinberg

Subject: Re: CPU Lockup with 2.4.21 and 2.4.22-pre

On Mon, 21 Jul 2003 16:12:26 +0200 Udo A. Steinberg (UAS) wrote:

UAS> We have a Dual-Xeon machine with Hyperthreading which keeps locking up hard,
UAS> so that not even Sysrq works anymore. I have captured such a lockup using the
UAS> NMI oopser. Below you'll find the lockup fed through ksymoops. Note that
UAS> after CPU3 locked up, CPU2 did too. But that lockup couldn't be captured
UAS> anymore. Kernel is a monolithic 2.4.22-pre6. Problem also happened on
UAS> plain 2.4.21. I can provide more information wrt. hardware, config etc.
UAS> on request.

Sorry, I used the wrong System.map. Below is the fixed decode. Looks like
the lockup is caused by the 3rd party Compushack FDDI driver.

Regards,
-Udo.


ksymoops 2.4.9 on i686 2.4.22-pre6. Options used
-V (default)
-K (specified)
-l /proc/modules (default)
-o /lib/modules/2.4.22-pre6/ (default)
-m /boot/System.map-2.4.22 (specified)

No modules in ksyms, skipping objects
No ksyms, skipping lsmod
NMI Watchdog detected LOCKUP on CPU3, eip c01f8364, registers:
CPU: 3
EIP: 0010:[<c01f8364>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00000082
eax: 00000006 ebx: 00000202 ecx: f7ee3400 edx: c01f5d20
esi: f7ee3400 edi: 00000180 ebp: 00000003 esp: f7efff64
ds: 0018 es: 0018 ss: 0018
Process ksoftirqd_CPU3 (pid: 6, stackpage=f7eff000)
Stack: f7ee3428 c02d03e6 c0105cfa f6dae000 f7e4ac80 f7efff88 f7efff88 c011f8da
f7ee3400 f7ee34ac f7ee34ac c0393434 00000000 c0122cfd c02ebe1c c011f7f5
c011f6a3 00000009 00000001 c0367a00 fffffffe c011f456 c0367a00 00000246
Call Trace: [<c0105cfa>] [<c011f8da>] [<c0122cfd>] [<c011f7f5>] [<c011f6a3>]
[<c011f456>] [<c011f9b5>] [<c0105000>] [<c01058ee>] [<c011f8f0>]
Code: 7e f5 e9 e8 d9 ff ff 80 3d 40 be 2e c0 00 f3 90 7e f5 e9 db


>>EIP; c01f8364 <.text.lock.csfddi+39/55> <=====

>>edx; c01f5d20 <csfddi_timer_work+0/e0>

Trace; c0105cfa <__switch_to+ca/d0>
Trace; c011f8da <__run_task_queue+6a/80>
Trace; c0122cfd <immediate_bh+1d/20>
Trace; c011f7f5 <bh_action+45/70>
Trace; c011f6a3 <tasklet_hi_action+63/b0>
Trace; c011f456 <do_softirq+d6/e0>
Trace; c011f9b5 <ksoftirqd+c5/f0>
Trace; c0105000 <_stext+0/0>
Trace; c01058ee <arch_kernel_thread+2e/40>
Trace; c011f8f0 <ksoftirqd+0/f0>

Code; c01f8364 <.text.lock.csfddi+39/55>
00000000 <_EIP>:
Code; c01f8364 <.text.lock.csfddi+39/55> <=====
0: 7e f5 jle fffffff7 <_EIP+0xfffffff7> <=====
Code; c01f8366 <.text.lock.csfddi+3b/55>
2: e9 e8 d9 ff ff jmp ffffd9ef <_EIP+0xffffd9ef>
Code; c01f836b <.text.lock.csfddi+40/55>
7: 80 3d 40 be 2e c0 00 cmpb $0x0,0xc02ebe40
Code; c01f8372 <.text.lock.csfddi+47/55>
e: f3 90 repz nop
Code; c01f8374 <.text.lock.csfddi+49/55>
10: 7e f5 jle 7 <_EIP+0x7>
Code; c01f8376 <.text.lock.csfddi+4b/55>
12: e9 db 00 00 00 jmp f2 <_EIP+0xf2>

NMI Watchdog detected LOCKUP on CPU2, eip c01062cd, registers:



2003-07-22 10:09:32

by Michael Troß

Subject: Re: CPU Lockup with 2.4.21 and 2.4.22-pre

On Mon, 2003-07-21 at 16:17, Udo A. Steinberg wrote:
> On Mon, 21 Jul 2003 16:12:26 +0200 Udo A. Steinberg (UAS) wrote:
>
> UAS> We have a Dual-Xeon machine with Hyperthreading which keeps locking up hard,
> UAS> so that not even Sysrq works anymore. I have captured such a lockup using the
> UAS> NMI oopser. Below you'll find the lockup fed through ksymoops. Note that
> UAS> after CPU3 locked up, CPU2 did too. But that lockup couldn't be captured
> UAS> anymore. Kernel is a monolithic 2.4.22-pre6. Problem also happened on
> UAS> plain 2.4.21. I can provide more information wrt. hardware, config etc.
> UAS> on request.

Would be really useful if you do so.

> Sorry, I used the wrong System.map. Below is the fixed decode. Looks like
> the lockup is caused by the 3rd party Compushack FDDI driver.

What makes you believe this? There is no matching code sequence like the
one from your dump in the driver, to be exact: in a driver compiled with
gcc 3.3 and kernel 2.4.21.

> Regards,
> -Udo.

Regards,
Michael

2003-07-22 11:36:49

by Udo A. Steinberg

Subject: Re: CPU Lockup with 2.4.21 and 2.4.22-pre

On 22 Jul 2003 12:24:24 +0200 Michael Troß (MT) wrote:

UAS> I can provide more information wrt. hardware, config etc.
UAS> on request.

MT> Would be really useful if you do so.

I have put the following information at: http://www.wh8.tu-dresden.de/fddi/

* My .config for 2.4.22-pre6
* dmesg output of 2.4.22-pre6 (both 2.4.21 and 2.4.22-pre6 behave the same)
* the ksymoops output of the lockup
* the output of lspci -v
* the fddi patch I used (applies cleanly to 2.4.21 and with fuzz to -pre6)

Note that the fddi patch includes a patch you've previously sent me, which
isn't present in the driver on your website.

If you need more information, let me know. Also if you have any tips or
patches that would help in debugging the issue, I'm happy to try them.

MT> What makes you believe this? There is no matching code sequence like the
MT> one from your dump in the driver, to be exact: in a driver compiled with
MT> gcc 3.3 and kernel 2.4.21.

The fact that the backtrace in the decoded oops looks like the lockup
happened in the fddi driver led me to the conclusion that this may be
the culprit. I have compiled the 2.4.22-pre6 kernel with gcc-3.3 also.

Regards,
-Udo.



2003-07-22 12:40:09

by Michael Troß

Subject: Re: CPU Lockup with 2.4.21 and 2.4.22-pre

On Tue, 2003-07-22 at 13:51, Udo A. Steinberg wrote:

> Note that the fddi patch includes a patch you've previously sent me, which
> isn't present in the driver on your website.

As you might know, the Compu-Shack fddi products reached end-of-life
last year.

> If you need more information, let me know. Also if you have any tips or
> patches that would help in debugging the issue, I'm happy to try them.

As I can't locate the code sequence in my driver module, please check it
with your compiled kernel:
objdump -d vmlinux | grep -A 20 "7e f5" | grep csfddi
or module:
hexdump -e '32/1 "%02x " "\n"' csf.o | grep "7e f5 e9 e8"
Do you get a result like the code line from your oops, which eip is
referring to?
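
Both commands above look for the byte sequence from the "Code:" line of the oops. As a minimal sketch of the same search, assuming Python 3 is available and that the kernel image or module is passed as a file path (the function names are illustrative and not part of any kernel tool), the bytes can be matched against the raw file directly. Unlike grepping hexdump output, this cannot miss a match that straddles one of the 32-byte output lines.

```python
import sys

# First bytes of the "Code:" line from the oops.
OOPS_CODE = "7e f5 e9 e8"

def find_all(haystack: bytes, needle: bytes) -> list:
    """Return every file offset at which needle occurs in haystack."""
    offsets = []
    pos = haystack.find(needle)
    while pos != -1:
        offsets.append(pos)
        pos = haystack.find(needle, pos + 1)
    return offsets

def search_file(path: str, hex_bytes: str = OOPS_CODE) -> list:
    """Search a binary file for a space-separated hex byte string."""
    with open(path, "rb") as handle:
        return find_all(handle.read(), bytes.fromhex(hex_bytes))

# Only runs when a file name is given, e.g.: python3 findcode.py vmlinux
if __name__ == "__main__" and len(sys.argv) == 2:
    for offset in search_file(sys.argv[1]):
        print(f"match at file offset 0x{offset:x}")
```

The offsets printed are positions in the file, not kernel virtual addresses, so they only show whether the sequence is present at all.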

> MT> What makes you believe this? There is no matching code sequence like the
> MT> one from your dump in the driver, to be exact: in a driver compiled with
> MT> gcc 3.3 and kernel 2.4.21.
>
> The fact that the backtrace in the decoded oops looks like the lockup
> happened in the fddi driver led me to the conclusion that this may be
> the culprit.

But you got two different decoding results, didn't you ?!

> Regards,
> -Udo.

Regards,
Michael

2003-07-22 14:09:33

by Udo A. Steinberg

Subject: Re: CPU Lockup with 2.4.21 and 2.4.22-pre

On 22 Jul 2003 14:55:05 +0200 Michael Troß (MT) wrote:

MT> As you might know, the Compu-Shack fddi products reached end-of-life
MT> last year.

Yes. Just thought I'd let you know that we aren't using the same
patch as on the website, but one that has been rediffed for 2.4.21 and
has an additional fix from you in it.

MT> As I can't locate the code sequence in my driver module, please check it
MT> with your compiled kernel:
MT> objdump -d vmlinux | grep -A 20 "7e f5" | grep csfddi

c01f8334: 7e f5 jle c01f832b <.text.lock.csfddi>
c01f8336: e9 87 d1 ff ff jmp c01f54c2 <csfddi_transmit+0x22>
c01f8344: 7e f5 jle c01f833b <.text.lock.csfddi+0x10>
c01f8346: e9 b2 d2 ff ff jmp c01f55fd <csfddi_transmit_timeout+0x1d>
c01f8354: 7e f5 jle c01f834b <.text.lock.csfddi+0x20>
c01f8356: e9 02 d7 ff ff jmp c01f5a5d <csfddi_interrupt+0xd>
c01f8364: 7e f5 jle c01f835b <.text.lock.csfddi+0x30>
c01f8366: e9 e8 d9 ff ff jmp c01f5d53 <csfddi_timer_work+0x33>
c01f8374: 7e f5 jle c01f836b <.text.lock.csfddi+0x40>
c01f8376: e9 db da ff ff jmp c01f5e56 <csfddi_timer+0x56>

MT> Do you get a result like the code line from your oops, which eip is
MT> referring to?

It's referring to EIP c01f8364. Here is the disassembly of the code fragment.

c01f832b <.text.lock.csfddi>:
c01f832b: 80 bb 94 00 00 00 00 cmpb $0x0,0x94(%ebx)
c01f8332: f3 90 repz nop
c01f8334: 7e f5 jle c01f832b <.text.lock.csfddi>
c01f8336: e9 87 d1 ff ff jmp c01f54c2 <csfddi_transmit+0x22>
c01f833b: 80 be 94 00 00 00 00 cmpb $0x0,0x94(%esi)
c01f8342: f3 90 repz nop
c01f8344: 7e f5 jle c01f833b <.text.lock.csfddi+0x10>
c01f8346: e9 b2 d2 ff ff jmp c01f55fd <csfddi_transmit_timeout+0x1d>
c01f834b: 80 be 94 00 00 00 00 cmpb $0x0,0x94(%esi)
c01f8352: f3 90 repz nop
c01f8354: 7e f5 jle c01f834b <.text.lock.csfddi+0x20>
c01f8356: e9 02 d7 ff ff jmp c01f5a5d <csfddi_interrupt+0xd>
c01f835b: 80 be 94 00 00 00 00 cmpb $0x0,0x94(%esi)
c01f8362: f3 90 repz nop
c01f8364: 7e f5 jle c01f835b <.text.lock.csfddi+0x30>
c01f8366: e9 e8 d9 ff ff jmp c01f5d53 <csfddi_timer_work+0x33>
c01f836b: 80 3d 40 be 2e c0 00 cmpb $0x0,0xc02ebe40
c01f8372: f3 90 repz nop
c01f8374: 7e f5 jle c01f836b <.text.lock.csfddi+0x40>
c01f8376: e9 db da ff ff jmp c01f5e56 <csfddi_timer+0x56>
c01f837b: 90 nop
c01f837c: 90 nop
c01f837d: 90 nop
c01f837e: 90 nop
c01f837f: 90 nop

I've also put up the vmlinux image at the URL from my previous message,
in case it's of any help.

MT> But you got two different decoding results, didn't you ?!

The first posting, which was only sent to LKML and not to you, had the
lockup output misdecoded because I used the wrong System.map.
The second posting (the one I cc'd to you) and the decoded lockup output
(lockup.txt) on the website are the correct ones.
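
How a mismatched System.map yields plausible but wrong names can be sketched as follows. This is a minimal illustration, not ksymoops itself: it assumes the usual "address type name" line layout, and the two tiny maps are reconstructed from the symbol offsets in the two decodes (the "t" type letters and all function and variable names in the code are assumptions). An address is resolved to the nearest symbol at or below it, so the same EIP lands in whatever symbol happens to occupy that range in the map being used.

```python
import bisect

def parse_system_map(text):
    """Parse 'address type name' lines into a sorted list of (address, name)."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            symbols.append((int(parts[0], 16), parts[2]))
    return sorted(symbols)

def resolve(symbols, addr):
    """Return 'name+0xoffset' for the nearest symbol at or below addr."""
    starts = [start for start, _ in symbols]
    index = bisect.bisect_right(starts, addr) - 1
    if index < 0:
        return None
    start, name = symbols[index]
    return f"{name}+0x{addr - start:x}"

# Illustrative excerpts only; start addresses derived from the two decodes.
MAP_2_4_21 = """\
c01f5cb0 t restore_i387
c01f81d0 t pcibios_lookup_irq
"""
MAP_2_4_22_PRE6 = """\
c01f5d20 t csfddi_timer_work
c01f832b t .text.lock.csfddi
"""

EIP = 0xC01F8364
print(resolve(parse_system_map(MAP_2_4_21), EIP))       # pcibios_lookup_irq+0x194
print(resolve(parse_system_map(MAP_2_4_22_PRE6), EIP))  # .text.lock.csfddi+0x39
```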

Regards,
-Udo.



2003-07-22 15:11:10

by Michael Troß

Subject: Re: CPU Lockup with 2.4.21 and 2.4.22-pre

On Tue, 2003-07-22 at 16:24, Udo A. Steinberg wrote:

> MT> As you might know, the Compu-Shack fddi products reached end-of-life
> MT> last year.
>
> Yes. Just thought I'd let you know that we aren't using the same
> patch as on the website, but one that has been rediffed for 2.4.21 and
> has an additional fix from you in it.

I mentioned it just to let you know that the company no longer
provides new drivers for new kernels. You had probably better stay with
2.4.18.

[snip]

Seems that a spin lock is already held. Do you get this oops right after
opening the device? Then please try NoSelfTest.

Regards,
Michael

2003-07-22 15:15:39

by Udo A. Steinberg

Subject: Re: CPU Lockup with 2.4.21 and 2.4.22-pre

On 22 Jul 2003 17:26:10 +0200 Michael Troß (MT) wrote:

MT> Seems that a spin lock is already held. Do you get this oops right after
MT> opening the device? Then please try NoSelfTest.

No, the lockup happens during operation. Sometimes the kernel runs only for
about one hour, sometimes for a day, but never longer before the lockups
happen.

I don't think going back to 2.4.18 will make a difference for this case,
or do you think it will?

Regards,
-Udo.

