2004-03-27 13:07:39

by Frank Denis

Subject: nfsd oops with 2.6.5-rc2-mm4

Hello.

I got a reproducible oops after a few minutes with a 2.6.5-rc2-mm4 kernel.

/etc/exports :
/mnt/data 10.42.42.0/24(rw,async,no_subtree_check,root_squash,anonuid=10000,anongid=10000)

Clients are 2.6.5-rc2-mm2 kernels, filesystem is ReiserFS 3, data=writeback.
Exports are mounted with tcp,nolock,soft,timeo=600,retrans=2,actimeo=30,
rsize=32768,wsize=32768.

Once the oops has happened, no client can access the mount point any more.

Unable to handle kernel NULL pointer dereference at virtual address 00000004
printing eip:
c029fd35
*pde = 00000000
Oops: 0002 [#1]
SMP
CPU: 0
EIP: 0060:[<c029fd35>] Not tainted VLI
EFLAGS: 00010287 (2.6.5-rc2-mm4)
EIP is at do_tcp_sendpages+0x197/0xa79
eax: d1d24108 ebx: f5e3fd80 ecx: 00000008 edx: 00000000
esi: 00000001 edi: d1d24100 ebp: f72391ec esp: f6283e34
ds: 007b es: 007b ss: 0068
Process nfsd (pid: 3330, threadinfo=f6283000 task=f62962b0)
Stack: 000000d0 000000d0 00000000 00000000 15270000 c01e6a8d d1d24110 f7239064
00000008 00000000 00000000 00000000 000005b4 00007530 00000000 f7239000
00000008 00000000 c02a069f f7239000 f6283eac 00000000 00000008 00000000
Call Trace:
[<c01e6a8d>] nfsd_readdir+0x69/0xe8
[<c02a069f>] tcp_sendpage+0x88/0x96
[<c02d8ed4>] svc_sendto+0x16a/0x29e
[<c01ed0d5>] encode_post_op_attr+0x1c9/0x241
[<c02d9f40>] svc_tcp_sendto+0x53/0xa8
[<c02da6f8>] svc_send+0xb9/0xfc
[<c02dc384>] svcauth_unix_release+0x57/0x59
[<c02d838c>] svc_process+0x187/0x611
[<c01e0de5>] nfsd+0x1ea/0x3b6
[<c01e0bfb>] nfsd+0x0/0x3b6
[<c0104e01>] kernel_thread_helper+0x5/0xb

Code: 4c 24 20 85 f6 74 17 8d 04 f7 8d 50 08 89 54 24 18 8b 54 24 28 3b 50 08 0f 84 80 08 00 00 83 fe 11 0f 87 25 04 00 00 8b 54 24 28 <f0> ff 42 04 8b 7c 24 28 8b 83 98 00 00 00 8d 04 f0 89 78 10 8d
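
For reference, the byte marked <f0> in the Code: line above can be decoded by hand. The sketch below is only an illustration: it handles that single encoding (LOCK prefix, opcode FF /0, ModRM with an 8-bit displacement) and is not a general disassembler. The function names are made up for this example. Fed the edx value from the register dump, it gives an address that can be compared with the one in the first line of the oops.

```python
# Bytes from the "Code:" line, starting at the marked byte <f0>.
FAULT_BYTES = bytes.fromhex("f0ff4204")

# 32-bit register numbering used by the x86 ModRM byte.
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_lock_inc(code):
    """Decode only: LOCK prefix, opcode FF /0 (inc r/m32), [reg + disp8]."""
    if code[0] != 0xF0:
        raise ValueError("expected LOCK prefix (0xF0)")
    if code[1] != 0xFF:
        raise ValueError("expected opcode 0xFF")
    modrm = code[2]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    # mod == 1 means an 8-bit displacement follows; rm == 4 would need a SIB byte.
    if mod != 1 or reg != 0 or rm == 4:
        raise ValueError("only 'inc dword [reg + disp8]' is handled here")
    disp = code[3] if code[3] < 0x80 else code[3] - 0x100  # sign-extend disp8
    return REG32[rm], disp


def fault_address(registers, code=FAULT_BYTES):
    """Compute the memory address the instruction touches, given register values."""
    base, disp = decode_lock_inc(code)
    return (registers[base] + disp) & 0xFFFFFFFF


base, disp = decode_lock_inc(FAULT_BYTES)
print(f"lock incl {disp:#x}(%{base})")
print(f"address touched: {fault_address({'edx': 0x00000000}):08x}")
```

The instruction is a locked increment of the 32-bit word at offset 4 from edx. With edx = 00000000 this is a write to address 00000004, consistent with the "NULL pointer dereference at virtual address 00000004" line.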

Best regards,

-Frank.


2004-03-27 19:10:25

by Dieter Stüken

Subject: Re: nfsd oops with 2.6.5-rc2-mm4

Frank wrote:
> I got a reproducible oops after a few minutes with a 2.6.5-rc2-mm4 kernel.
> ...
> [<c01e6a8d>] nfsd_readdir+0x69/0xe8

Was one of the exported directories quite big? Try running a "find /mnt/..."
on the client to see when exactly it fails. I observe similar behavior
when reading huge directories of some 1000 entries. What NFS version
are you using? You may try "mount -o nfsvers=2 ..." or 3.
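
A minimal sketch of the directory walk suggested above, written in Python instead of find(1). It prints each directory name before reading it, so the last "reading ..." line shows which directory was being listed when the server stopped answering. The function name and the example path in the final comment are assumptions for illustration, not anything taken from the setup described in this thread.

```python
import os
import sys


def walk_and_report(root, out=sys.stdout):
    """Read every directory under root, announcing each one before listing it.

    Returns the total number of directory entries seen.
    """
    total = 0
    stack = [root]
    while stack:
        path = stack.pop()
        # Flush right away so the line is visible even if the next call hangs.
        print(f"reading {path}", file=out, flush=True)
        count = 0
        with os.scandir(path) as entries:
            for entry in entries:
                count += 1
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
        print(f"  {count} entries", file=out, flush=True)
        total += count
    return total


# Example call on the client (hypothetical mount point):
# walk_and_report("/mnt/nfs-test")
```

On a soft mount the failing call should eventually return an I/O error instead of hanging forever, in which case the traceback also names the directory.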

My own oops seems to be reproducible only when using a Sun (Solaris 2.8)
as client. It did not occur when using NFSv2. I also failed
to reproduce the bug when mounting from another Linux client.
So maybe we are observing two different bugs here.

With that instability observed, I won't/can't switch to 2.6.x for
my production system :-( Can I help somehow? Make a tcpdump capture?
Try some patches?

Dieter.
--
Dieter Stüken, con terra GmbH, Münster
[email protected]
http://www.conterra.de/
(0)251-7474-501

2004-03-27 20:02:51

by Frank Denis

Subject: Re: nfsd oops with 2.6.5-rc2-mm4

Dieter Stueken wrote:
> was some exported directory quite big?

Yes, some exported directories contain a lot of small files.

> What nfs version
> are you using? You may try "mount -o nfsvers=2 ..." or 3.

Version 3. I can give version 2 a try but there will probably be a
significant loss of performance :(

> My own Oops seems to be reproducible when using a Sun (2.8) as
> client, only.

There are actually 10 clients: 9 are Linux 2.6.5-rc2-mm2, 1 is indeed a
Solaris 2.8 box.

> It did not occur when using nfsV2. I also failed
> to reproduce the bug when mounting by an other Linux client.

Unfortunately this is a production environment and I can hardly switch
the Solaris box off to make sure that it is what triggers the bug.

> So may be we observe two different bugs here.

Your situation looks similar.

I reverted to 2.6.5-rc2-mm3, the server didn't crash after 6 hours.
Crossing fingers...

2004-03-27 20:39:49

by Frank Denis

Subject: Re: nfsd oops with 2.6.5-rc2-mm4

Frank Denis wrote:
> I reverted to 2.6.5-rc2-mm3, the server didn't crash after 6 hours.
> Crossing fingers...

Pointless.

2.6.5-rc2-mm3 just crashed the same way.


Unable to handle kernel NULL pointer dereference at virtual address 00000004
printing eip:
c029fb79
*pde = 00000000
Oops: 0002 [#1]
SMP
CPU: 0
EIP: 0060:[<c029fb79>] Not tainted VLI
EFLAGS: 00010287 (2.6.5-rc2-mm3)
EIP is at do_tcp_sendpages+0x197/0xa79
eax: f5b75108 ebx: f42c0980 ecx: 00000008 edx: 00000000
esi: 00000001 edi: f5b75100 ebp: f61bddec esp: f62c5e34
ds: 007b es: 007b ss: 0068
Process nfsd (pid: 3325, threadinfo=f62c5000 task=f7727670)
Stack: 000000d0 000000d0 f62c5e60 00000000 15270000 c01e6991 f5b75110 f61bdc64
00000008 00000000 00000000 00000000 000005b4 00007530 00000000 f61bdc00
00000008 00000000 c02a04e3 f61bdc00 f62c5eac 00000000 00000008 00000000
Call Trace:
[<c01e6991>] nfsd_readdir+0x69/0xe8
[<c02a04e3>] tcp_sendpage+0x88/0x96
[<c02d8ccd>] svc_sendto+0x16a/0x29e
[<c01ecfd9>] encode_post_op_attr+0x1c9/0x241
[<c02d9d39>] svc_tcp_sendto+0x53/0xa8
[<c02da4f1>] svc_send+0xb9/0xfc
[<c02dc17c>] svcauth_unix_release+0x57/0x59
[<c02d8189>] svc_process+0x184/0x60e
[<c02e2d9c>] common_interrupt+0x18/0x20
[<c01e0ce9>] nfsd+0x1ea/0x3b6
[<c01e0aff>] nfsd+0x0/0x3b6
[<c0104e01>] kernel_thread_helper+0x5/0xb

2004-03-27 23:17:35

by Dieter Stüken

Subject: Re: nfsd oops with 2.6.5-rc2-mm4

Frank Denis wrote:

> Version 3. I can give version 2 a try but there will probably
> a significant loss of performance :(

Probably better than a total loss of performance :-)

> Unfortunately this is a production environment and I can hardly
> switch the Solaris box off in order to make it sure that it is
> triggering the bug.

Same here. I set up an NFS server for testing and let the Sun
mount the data on a temporary directory to verify the situation.
The test server may crash, but the Sun will survive. After you
have stopped the test on the Sun, you might have to reboot your
test server to be able to unmount the disk on the Sun again.

>> I reverted to 2.6.5-rc2-mm3, the server didn't crash after 6 hours.
>> Crossing fingers...
>
> Pointless.
>
> 2.6.5-rc2-mm3 just crashed the same way.

Yes, all 2.6.x seem to be affected. My good old 2.4.19 has been running
for 6 months. I have started to scan the 2.5.x patches for nfsd changes,
to find which one caused this kind of trouble....

Dieter.