2002-01-17 16:51:54

by Fabien Ribes

Subject: Oops in sock_poll

Hi all,

I have a kernel Oops on a ppc 2.4.5 kernel with an application listening
to a high throughput of incoming messages on a netlink socket. The
application runs select() on the netlink socket file descriptor,
followed by a recvmsg() call, in a forever loop.
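For anyone trying to reproduce the access pattern, one pass of that loop can be sketched in user space as follows. This is a minimal sketch under stated assumptions: the report does not say which netlink protocol or traffic source the application uses, so the sketch assumes a NETLINK_ROUTE socket fed by an RTM_GETLINK dump request, and the helpers open_netlink(), request_link_dump() and select_then_recvmsg() are hypothetical names.

```c
#include <assert.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Hypothetical helper: open and bind a NETLINK_ROUTE socket (assumed
 * protocol). Returns the descriptor, or -1 on failure. */
static int open_netlink(void)
{
    struct sockaddr_nl addr;
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.nl_family = AF_NETLINK; /* nl_pid 0: let the kernel assign it */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical traffic source: ask the kernel to dump all links so that
 * replies become readable on fd. Returns 0 on success, -1 on failure. */
static int request_link_dump(int fd)
{
    struct {
        struct nlmsghdr nlh;
        struct rtgenmsg gen;
    } req;
    struct sockaddr_nl kernel;

    memset(&req, 0, sizeof(req));
    req.nlh.nlmsg_len = NLMSG_LENGTH(sizeof(struct rtgenmsg));
    req.nlh.nlmsg_type = RTM_GETLINK;
    req.nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
    req.nlh.nlmsg_seq = 1;
    req.gen.rtgen_family = AF_UNSPEC;

    memset(&kernel, 0, sizeof(kernel));
    kernel.nl_family = AF_NETLINK; /* nl_pid 0 addresses the kernel */
    if (sendto(fd, &req, req.nlh.nlmsg_len, 0,
               (struct sockaddr *)&kernel, sizeof(kernel)) < 0)
        return -1;
    return 0;
}

/* One pass of the application's loop: select() until fd is readable,
 * then recvmsg(). Returns the byte count, 0 on timeout, -1 on error. */
static ssize_t select_then_recvmsg(int fd, char *buf, size_t len, int timeout_sec)
{
    fd_set rfds;
    struct timeval tv = { .tv_sec = timeout_sec, .tv_usec = 0 };
    struct sockaddr_nl from;
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg;
    int ready;

    assert(fd >= 0 && fd < FD_SETSIZE); /* select() cannot watch larger fds */
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    ready = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (ready <= 0)
        return ready; /* 0: timeout, -1: error */

    memset(&msg, 0, sizeof(msg));
    msg.msg_name = &from;
    msg.msg_namelen = sizeof(from);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    return recvmsg(fd, &msg, 0);
}
```

The real application would call select_then_recvmsg() in an endless loop; each pass performs one select() and one recvmsg(), the two system calls whose f_count changes are traced step by step below.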

The Oops (not saved, and hard to reproduce) showed a crash in the
sock_poll function (kernel/net/socket.c). After investigation, the crash
turned out to be due to a NULL f_dentry member in the file structure.
This pointer is set to NULL in the fput function (kernel/fs/file_table.c).
The backtraces show that the calling function is sys_recvmsg
(kernel/net/socket.c).

My understanding of the problem is the following:

- When everything goes right:

A/ When the netlink socket is opened, its associated file structure is
initialised with f_count set to 1 and a dentry;

B/ When select is executed, f_count is increased to 2;

C/ When select ends, f_count is decreased to 1;

D/ When recvmsg is executed, f_count is increased to 2;

E/ When recvmsg ends, f_count is decreased to 1;

F/ Loop forever to B/
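The balanced sequence A/ to F/ can be modelled in a few lines of user-space C. This is not kernel code: file_model, dentry_model, model_get() and model_put() are hypothetical stand-ins for struct file, struct dentry, fget() and fput(), reduced to the two fields this analysis relies on (the real 2.4 d_inode is a pointer to an inode, modelled here as a plain integer).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for struct dentry and struct file. */
struct dentry_model { int d_inode_id; };

struct file_model {
    int f_count;                   /* reference count */
    struct dentry_model *f_dentry; /* cleared when the last reference goes */
};

/* Take a reference, as done on entry to select() or recvmsg(). */
static void model_get(struct file_model *f)
{
    f->f_count++;
}

/* Drop a reference, as fput() does: the last put clears f_dentry. */
static void model_put(struct file_model *f)
{
    assert(f->f_count > 0); /* a put must match an earlier get */
    if (--f->f_count == 0)
        f->f_dentry = NULL;
}

/* Run n iterations of steps B/ to E/ and return the final f_count. */
static int run_balanced_loop(struct file_model *f, int n)
{
    for (int i = 0; i < n; i++) {
        model_get(f); /* B/ select() starts:  1 -> 2 */
        model_put(f); /* C/ select() ends:    2 -> 1 */
        model_get(f); /* D/ recvmsg() starts: 1 -> 2 */
        model_put(f); /* E/ recvmsg() ends:   2 -> 1 */
    }
    return f->f_count;
}
```

Starting from step A/ (f_count 1, f_dentry set), any number of iterations leaves f_count at 1 and f_dentry untouched, because every get is paired with a put.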

- When the problem occurs:

A/ When the netlink socket is opened, its associated file structure is
initialised with f_count set to 1 and a dentry;

B/ When select is executed, f_count is increased to 2;

C/ When select ends, f_count is decreased to 1;

D/ When recvmsg is executed, f_count is increased to 2;

????/ SOMETHING decreases f_count to 1;

E/ When recvmsg ends, f_count is decreased to 0, AND THEREFORE the f_dentry
member of the file is set to NULL (since the file is considered unused);

F/ When select is executed, f_count is incremented to 1, but f_dentry is
NULL and therefore the following code in the sock_poll function crashes:
    sock = socki_lookup(file->f_dentry->d_inode);
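The failing sequence can be replayed with the same kind of hypothetical model (file_model, model_get(), model_put() and model_poll() are illustrative stand-ins, not kernel APIs). A single unexplained put between D/ and E/ is enough to drive f_count to 0, clear f_dentry, and leave the next select() holding a file whose f_dentry is NULL; model_poll() reports that case instead of dereferencing the pointer as sock_poll() does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for struct dentry and struct file. */
struct dentry_model { int d_inode_id; };

struct file_model {
    int f_count;
    struct dentry_model *f_dentry;
};

static void model_get(struct file_model *f)
{
    f->f_count++;
}

/* Like fput(): the last put clears f_dentry. */
static void model_put(struct file_model *f)
{
    assert(f->f_count > 0); /* a put must match an earlier get */
    if (--f->f_count == 0)
        f->f_dentry = NULL;
}

/* Stand-in for sock_poll(): the real code evaluates
 * socki_lookup(file->f_dentry->d_inode) with no NULL check, so a NULL
 * f_dentry faults. This model returns -1 instead of crashing. */
static int model_poll(const struct file_model *f)
{
    if (f->f_dentry == NULL)
        return -1; /* the real kernel would oops here */
    return f->f_dentry->d_inode_id;
}

/* Replay B/ to F/ with one unbalanced put between D/ and E/.
 * Returns model_poll()'s result at step F/. */
static int replay_failure(struct file_model *f)
{
    model_get(f); /* B/ 1 -> 2 */
    model_put(f); /* C/ 2 -> 1 */
    model_get(f); /* D/ 1 -> 2 */
    model_put(f); /* ????/ unexplained extra put: 2 -> 1 */
    model_put(f); /* E/ 1 -> 0, f_dentry cleared */
    model_get(f); /* F/ next select(): 0 -> 1 on a released file */
    return model_poll(f);
}
```

Counting gets and puts this way only shows that one unmatched put explains the symptom; it does not identify which code path issued it.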

Do you have an idea of what event could have decreased the f_count
member between D/ and E/?
Could you give me some pointers to continue my investigation?

Thanks a lot for your help,
Fabien


2002-01-17 21:14:00

by David Miller

Subject: Re: Oops in sock_poll


Can you reproduce this with a more recent kernel? Anything
>=2.4.9 (this includes all Red Hat errata kernels therefore)
would be sufficient.

And also please provide a full decoded OOPS log as well, thanks.

2002-01-18 09:02:15

by Fabien Ribes

Subject: Re: Oops in sock_poll

Hi,

"David S. Miller" wrote:
>
> Can you reproduce this with a more recent kernel? Anything
> >=2.4.9 (this includes all Red Hat errata kernels therefore)
> would be sufficient.
The kernel used is customized in many ways, it is a long work to upgrade
...

> And also please provide a full decoded OOPS log as well, thanks.
Here it is:
ksymoops 2.3.7 on i686 2.4.3. Options used
-v vmlinux (specified)
-K (specified)
-L (specified)
-O (specified)
-m System.map (specified)
-t elf_powerpc -a powerpc:common

Oops: kernel access of bad area, sig: 11
NIP: C00A0EB4 XER: 00000000 LR: C0046B20 SP: C1981E60 REGS: c1981db0 TRAP: 0300
MSR: 00009230 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11
TASK = c1980000[148] 'feemond' Last syscall: 142
last math 00000000 last altivec 00000000
GPR00: C0046B20 C1981E60 C1980000 C1AEFBA0 C1981E78 C1981E78 C1BEB780 00000000
GPR08: 00000000 00000000 00000000 C1BEB800 C1BEB780 1001D8B8 00000000 00000000
GPR16: 00000000 00000000 00000000 00000000 C1981EE8 00000005 000000B4 00000000
GPR24: C1981E78 00000004 00000145 C1981EC8 00000000 00000000 00000010 C1AEFBA0
Call backtrace:
C0046884 C0046B20 C0046FC4 C0007E1C C000266C 10001888 100016F8
10000B30 0FEF6A6C 00000000
Warning (Oops_read): Code line not seen, dumping what data is available

>>NIP; c00a0eb4 <sock_poll+14/3c> <=====
Trace; c0046884 <poll_freewait+54/70>
Trace; c0046b20 <do_select+e4/208>
Trace; c0046fc4 <sys_select+330/470>
Trace; c0007e1c <ppc_select+a0/b0>
Trace; c000266c <ret_from_syscall_1+0/b4>
Trace; 10001888 Before first symbol
Trace; 100016f8 Before first symbol
Trace; 10000b30 Before first symbol
Trace; 0fef6a6c Before first symbol
Trace; 00000000 Before first symbol


1 warning issued. Results may not be reliable.

2002-01-18 10:57:33

by David Miller

Subject: Re: Oops in sock_poll

> From: Fabien Ribes
> Date: Fri, 18 Jan 2002 09:01:32 +0000
>
> "David S. Miller" wrote:
> >
> > Can you reproduce this with a more recent kernel? Anything
> > >=2.4.9 (this includes all Red Hat errata kernels therefore)
> > would be sufficient.
>
> The kernel used is customized in many ways, it is a long work to upgrade

Then I can't help you... there have probably been many
networking bugs fixed since 2.4.9

2002-01-19 18:41:45

by Alexey Kuznetsov

Subject: Re: Oops in sock_poll

Hello!

> The kernel used is customized in many ways, it is a long work to upgrade
>
> Then I can't help you... there have probably been many
> networking bugs fixed since 2.4.9

I do not remember us _ever_ having problems with leaking f_count.
And it is so far from networking... :-)

A "customized in many ways" bug sounds like a better candidate to be fixed.

Alexey