LinuxLists.cc - <=2.4.17 deadlock (RedHat 7.2, SMP)

2001-12-23 22:55:15

Subject: <=2.4.17 deadlock (RedHat 7.2, SMP)

The problem occurs only when I disconnect a PPP dialup to an ISP. It
results in a box which only listens to alt-sysrq or a power cycle. The
userland setup is a bit complicated (two pppd instances, a zebra routing
daemon) but below there should be a clue about the cause. This is all
on a RedHat 7.2 distro running on a dual PIII box. My brain decoded the
following JPEG picture I took from the screen after typing alt-sysrq-p
3 times:

Pid: 5613, comm pppd
EIP: 0010: [<c0403068>] CPU: 1 EFLAGS 00000286 not tainted
Call Trace: c0273cae c028ee04 c0293ac8 c0155879 c013f063
c0150bc6 c028eac8 c01076cb

Pid: 5613, comm pppd
EIP: 0010: [<c040306f>] CPU: 1 EFLAGS 00000286 not tainted
Call Trace: c0273cae c028ee04 c0293ac8 c0155879 c013f063
c0150bc6 c028eac8 c01076cb

Pid: 18, comm kjournald
EIP: 0010: [<c03fca5b>] CPU: 0 EFLAGS 00000286 not tainted
Call Trace: c02ad897 c011f1ea c0140af5 c0179dde c011560c
c017ca4d c017c7e0 c010590f

cat /proc/version says:
Linux version 2.4.17-b65 ([email protected]) (gcc version egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)) #1 SMP Sun Dec 23 22:16:27 CET 2001

ksymoops says:
ksymoops 2.4.1 on i686 2.4.17-b65. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.17-b65/ (default)
-m /boot/System.map-2.4.17-b65 (default)

Warning: You did not tell me where to find symbol information. I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

No modules in ksyms, skipping objects
Warning (read_lsmod): no symbols in lsmod, is /proc/modules a valid lsmod file?
EIP: 0010: [<c0403068>] CPU: 1 EFLAGS 00000286 not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
Call Trace: c0273cae c028ee04 c0293ac8 c0155879 c013f063
c0150bc6 c028eac8 c01076cb
EIP: 0010: [<c040306f>] CPU: 1 EFLAGS 00000286 not tainted
Call Trace: c0273cae c028ee04 c0293ac8 c0155879 c013f063
c0150bc6 c028eac8 c01076cb
EIP: 0010: [<c03fca5b>] CPU: 0 EFLAGS 00000286 not tainted
Call Trace: c02ad897 c011f1ea c0140af5 c0179dde c011560c
c017ca4d c017c7e0 c010590f
Warning (Oops_read): Code line not seen, dumping what data is available

>>EIP; c0403068 <stext_lock+6c30/e9ac> <=====
Trace; c0273cae <rs_wait_until_sent+b6/c4>
Trace; c028ee04 <ppp_ioctl+33c/d84>
Trace; c0293ac8 <ppp_destroy_channel+1bc/1c4>
Trace; c0155879 <dput+19/214>
Trace; c013f063 <filp_close+133/140>
>>EIP; c040306f <stext_lock+6c37/e9ac> <=====
Trace; c0273cae <rs_wait_until_sent+b6/c4>
Trace; c028ee04 <ppp_ioctl+33c/d84>
Trace; c0293ac8 <ppp_destroy_channel+1bc/1c4>
Trace; c0155879 <dput+19/214>
Trace; c013f063 <filp_close+133/140>
>>EIP; c03fca5b <stext_lock+623/e9ac> <=====
Trace; c02ad897 <do_ide_request+f/14>
Trace; c011f1ea <__run_task_queue+d2/e0>
Trace; c0140af5 <__wait_on_buffer+55/9c>
Trace; c0179dde <journal_commit_transaction+114a/181c>
Trace; c011560c <schedule+640/818>

3 warnings issued. Results may not be reliable.

--
Frank

2001-12-24 19:49:32

by Frank van Maarseveen

[permalink] [raw]

Subject: <=2.4.17 deadlock (RedHat 7.2, SMP, ext3 related?) (2)

Below a second case, again it occured right at the moment a PPP
connection was terminated. For some unknown reason it seems to occur
frequently enough now for doing some experimenting. Tell me if more info
is needed. This is the same kernel as before. When the problem first
emerged some weeks ago after installing RedHat 7.2 I switched to kgcc
(gcc version egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)) instead
of gcc 2.96 but it didn't make a difference.

Following two alt-sysrq-p have been decoded from a JPEG image I took from
the screen. Even alt-sysrq-l leaves most processes intact according to
alt-sysrq-t so to me it looks like a genuine deadlock:

Pid: 4465, comm: hotplug
EIP: 0010:[<c03fe0ba>] CPU: 0 EFLAGS 00000286 not tainted
Call Trace: c0140134 c013eba2 c013eab2 c013ee23 c01076cb

Pid: 2930, comm: pppd
EIP: 0010:[<c0403068>] CPU: 1 EFLAGS 00000286 not tainted
Call Trace: c016d320 c028ee04 c0159e01 c0156e68 c013f063 c0150bc6 c028eac8 c01076cb

Now ksymoops at work:
ksymoops 2.4.1 on i686 2.4.17-b65. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.17-b65/ (default)
-m /boot/System.map-2.4.17-b65 (default)

Warning: You did not tell me where to find symbol information. I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

No modules in ksyms, skipping objects
Warning (read_lsmod): no symbols in lsmod, is /proc/modules a valid lsmod file?
EIP: 0010:[<c03fe0ba>] CPU: 0 EFLAGS 00000286 not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
Call Trace: c0140134 c013eba2 c013eab2 c013ee23 c01076cb
EIP: 0010:[<c0403068>] CPU: 1 EFLAGS 00000286 not tainted
Call Trace: c016d320 c028ee04 c0159e01 c0156e68 c013f063 c0150bc6 c028eac8 c01076cb
Warning (Oops_read): Code line not seen, dumping what data is available

>>EIP; c03fe0ba <stext_lock+1c82/e9ac> <=====
Trace; c0140134 <chrdev_open+28/124>
Trace; c013eba2 <dentry_open+e6/194>
Trace; c013eab2 <filp_open+52/5c>
Trace; c013ee23 <sys_open+4b/140>
Trace; c01076cb <system_call+33/38>
>>EIP; c0403068 <stext_lock+6c30/e9ac> <=====
Trace; c016d320 <ext3_delete_inode+0/270>
Trace; c028ee04 <ppp_ioctl+33c/d84>
Trace; c0159e01 <iput+2c5/2f0>
Trace; c0156e68 <d_delete+dc/20c>
Trace; c013f063 <filp_close+133/140>
Trace; c0150bc6 <sys_ioctl+24a/2e4>
Trace; c028eac8 <ppp_ioctl+0/d84>
Trace; c01076cb <system_call+33/38>

3 warnings issued. Results may not be reliable.

--
Frank

2001-12-24 20:17:15

by Manfred Spraul

[permalink] [raw]

Subject: Re: <=2.4.17 deadlock (RedHat 7.2, SMP, ext3 related?) (2)

Frank,
could you open your vmlinux file with gdb, and figure out where the 2
stext_lock references lead to?
stext_lock+1c82 just means that it spins trying to acquire a spinlock,
but which one?
IIRC
gdb vmlinux
$x/30i 0xc03fe0a0
$x/30i 0xc0403050
should be enough.

>>EIP; c03fe0ba <stext_lock+1c82/e9ac> <=====
Trace; c0140134 <chrdev_open+28/124>
Trace; c013eba2 <dentry_open+e6/194>
Trace; c013eab2 <filp_open+52/5c>
Trace; c013ee23 <sys_open+4b/140>
Trace; c01076cb <system_call+33/38>

Probably this cpu is spinning in get_chrfops(), trying to acquire the big kernel
lock.

>>EIP; c0403068 <stext_lock+6c30/e9ac> <=====
Trace; c016d320 <ext3_delete_inode+0/270>
Trace; c028ee04 <ppp_ioctl+33c/d84>
Trace; c0159e01 <iput+2c5/2f0>
Trace; c0156e68 <d_delete+dc/20c>
Trace; c013f063 <filp_close+133/140>
Trace; c0150bc6 <sys_ioctl+24a/2e4>
Trace; c028eac8 <ppp_ioctl+0/d84>
Trace; c01076cb <system_call+33/38>

I think the call chain is
system_call
-> calls sys_ioctl.
acquires the big kernel lock.
puts ppp_ioctl on the stack (+0: it's a function pointer, not a return value.)
-> calls ppp_ioctl
and that one locks up.
The references to close/iput/delete_inode are just stale stack values from a previous
syscall.

Please check where stext_lock+6c30 leads: I think the deadlock is somewhere within
ppp_ioctl, and then the system locks up because the big kernel lock is blocked.

--
Manfred

2001-12-24 20:56:08

by Frank van Maarseveen

[permalink] [raw]

Subject: Re: <=2.4.17 deadlock (RedHat 7.2, SMP, ext3 related?) (2)

On Mon, Dec 24, 2001 at 09:17:00PM +0100, Manfred Spraul wrote:
> Frank,
> could you open your vmlinux file with gdb, and figure out where the 2
> stext_lock references lead to?
> stext_lock+1c82 just means that it spins trying to acquire a spinlock,
> but which one?
> IIRC
> gdb vmlinux
> $x/30i 0xc03fe0a0
> $x/30i 0xc0403050
> should be enough.

(gdb) x/30i 0xc03fe0a0
0xc03fe0a0 <stext_lock+7272>: lea 0xc04e79c0,%eax
0xc03fe0a6 <stext_lock+7278>: call 0xc01062f4 <__read_lock_failed>
0xc03fe0ab <stext_lock+7283>: pop %eax
0xc03fe0ac <stext_lock+7284>: jmp 0xc013fd3e <get_chrfops+78>
0xc03fe0b1 <stext_lock+7289>: cmpb $0x0,0xc05dd400
0xc03fe0b8 <stext_lock+7296>: repz nop
0xc03fe0ba <stext_lock+7298>: jle 0xc03fe0b1 <stext_lock+7289>
0xc03fe0bc <stext_lock+7300>: jmp 0xc013fde0 <get_chrfops+240>
0xc03fe0c1 <stext_lock+7305>: push %eax
0xc03fe0c2 <stext_lock+7306>: lea 0xc04e79c0,%eax
0xc03fe0c8 <stext_lock+7312>: call 0xc01062f4 <__read_lock_failed>
0xc03fe0cd <stext_lock+7317>: pop %eax
0xc03fe0ce <stext_lock+7318>: jmp 0xc013feee <get_chrfops+510>
0xc03fe0d3 <stext_lock+7323>: push %eax
0xc03fe0d4 <stext_lock+7324>: lea 0xc04e79c0,%eax
0xc03fe0da <stext_lock+7330>: call 0xc01062d4 <__write_lock_failed>
0xc03fe0df <stext_lock+7335>: pop %eax
0xc03fe0e0 <stext_lock+7336>: jmp 0xc013ff78 <register_chrdev+68>
0xc03fe0e5 <stext_lock+7341>: push %eax
0xc03fe0e6 <stext_lock+7342>: lea 0xc04e79c0,%eax
0xc03fe0ec <stext_lock+7348>: call 0xc01062d4 <__write_lock_failed>
0xc03fe0f1 <stext_lock+7353>: pop %eax
0xc03fe0f2 <stext_lock+7354>: jmp 0xc013fff3 <register_chrdev+191>
0xc03fe0f7 <stext_lock+7359>: push %eax
0xc03fe0f8 <stext_lock+7360>: lea 0xc04e79c0,%eax
0xc03fe0fe <stext_lock+7366>: call 0xc01062d4 <__write_lock_failed>
0xc03fe103 <stext_lock+7371>: pop %eax
0xc03fe104 <stext_lock+7372>: jmp 0xc01400a3 <unregister_chrdev+79>
0xc03fe109 <stext_lock+7377>: cmpb $0x0,0xc05dd400
0xc03fe110 <stext_lock+7384>: repz nop
(gdb) x/30i 0xc0403050
0xc0403050 <stext_lock+27672>: nop
0xc0403051 <stext_lock+27673>: jle 0xc0403048 <stext_lock+27664>
0xc0403053 <stext_lock+27675>: jmp 0xc0292971 <find_compressor+53>
0xc0403058 <stext_lock+27680>: cmpb $0x0,0xc057d540
0xc040305f <stext_lock+27687>: repz nop
0xc0403061 <stext_lock+27689>: jle 0xc0403058 <stext_lock+27680>
0xc0403063 <stext_lock+27691>: jmp 0xc0292ac0 <ppp_create_interface+72>
0xc0403068 <stext_lock+27696>: cmpb $0x0,0xc057d540
0xc040306f <stext_lock+27703>: repz nop
0xc0403071 <stext_lock+27705>: jle 0xc0403068 <stext_lock+27696>
0xc0403073 <stext_lock+27707>: jmp 0xc0292df0 <ppp_destroy_interface+68>
0xc0403078 <stext_lock+27712>: cmpb $0x0,(%ebx)
0xc040307b <stext_lock+27715>: repz nop
0xc040307d <stext_lock+27717>: jle 0xc0403078 <stext_lock+27712>
0xc040307f <stext_lock+27719>: jmp 0xc0292e60 <ppp_destroy_interface+180>
0xc0403084 <stext_lock+27724>: cmpb $0x0,(%ebx)
0xc0403087 <stext_lock+27727>: repz nop
0xc0403089 <stext_lock+27729>: jle 0xc0403084 <stext_lock+27724>
0xc040308b <stext_lock+27731>: jmp 0xc0292eb0 <ppp_destroy_interface+260>
0xc0403090 <stext_lock+27736>: cmpb $0x0,(%ebx)
0xc0403093 <stext_lock+27739>: repz nop
0xc0403095 <stext_lock+27741>: jle 0xc0403090 <stext_lock+27736>
0xc0403097 <stext_lock+27743>: jmp 0xc0292f40 <ppp_destroy_interface+404>
0xc040309c <stext_lock+27748>: cmpb $0x0,(%ebx)
0xc040309f <stext_lock+27751>: repz nop
0xc04030a1 <stext_lock+27753>: jle 0xc040309c <stext_lock+27748>
0xc04030a3 <stext_lock+27755>: jmp 0xc0293014 <ppp_destroy_interface+616>
0xc04030a8 <stext_lock+27760>: push %eax
0xc04030a9 <stext_lock+27761>: push %ecx
0xc04030aa <stext_lock+27762>: push %edx
(gdb) p stext_lock
$1 = {<text variable, no debug info>} 0xc03fc438 <stext_lock>
(gdb) p stext_lock+0x1c82
$1 = (<text variable, no debug info> *) 0xc03fe0ba <stext_lock+7298>
(gdb) p stext_lock+0x6c30
$2 = (<text variable, no debug info> *) 0xc0403068 <stext_lock+27696>

--
Frank

2001-12-24 21:23:05

by Manfred Spraul

[permalink] [raw]

Subject: Re: <=2.4.17 deadlock (RedHat 7.2, SMP, ext3 related?) (2)

From: "Frank van Maarseveen" <[email protected]>
>
> 0xc03fe0b1 <stext_lock+7289>: cmpb $0x0,0xc05dd400
> 0xc03fe0b8 <stext_lock+7296>: repz nop
> 0xc03fe0ba <stext_lock+7298>: jle 0xc03fe0b1 <stext_lock+7289>
> 0xc03fe0bc <stext_lock+7300>: jmp 0xc013fde0 <get_chrfops+240>

I assume 0xd05dd400 is kernel_flag.

> 0xc0403068 <stext_lock+27696>: cmpb $0x0,0xc057d540
> 0xc040306f <stext_lock+27703>: repz nop
> 0xc0403071 <stext_lock+27705>: jle 0xc0403068 <stext_lock+27696>
> 0xc0403073 <stext_lock+27707>: jmp 0xc0292df0 <ppp_destroy_interface+68>

And 0xc057d540 is all_ppp_lock.

The only obvious abuse is that both ppp_destroy_interface and ppp_create_interface
call rtnl_lock (that's a semaphore) with the spinlock acquired.

--
Manfred

2001-12-25 12:42:27

by Frank van Maarseveen

[permalink] [raw]

Subject: Re: <=2.4.17 deadlock (RedHat 7.2, SMP, ext3 related?) (2)

On Mon, Dec 24, 2001 at 10:36:11PM +0100, Manfred Spraul wrote:
> If you are brave, you can try the attached patch.

Your assuptions about the address of kernel_flag and
all_ppp_lock were correct (except for an obvious typo).

The patch seems to work, no deadlocks so far but I'll keep
an eye on it.

Thanks,

--
Frank