2007-06-13 12:07:41

by Maciej Sołtysiak

Subject: 2.6.21.14 NFS related oops

Hi,

If anyone is interested, I got this oops while running a torrent application
(btdownloadcurses) writing directly to a NAS mounted via NFSv3.

The client machine runs 2.6.21.14, and the share is mounted with these options:
wsize=8192,rsize=8192,hard,intr,tcp

After that, the application hung. I am unable to cd into the mounted NFS
directory, unmount it (busy), or kill the app (kill -9 fails, process in D state).

Best regards,
Maciej

BUG: unable to handle kernel paging request at virtual address 5018f248
printing eip:
f0a93c94
*pde = 00000000
Oops: 0002 [#1]
Modules linked in: binfmt_misc sit nfs lockd nfs_acl sunrpc w83627ehf
i2c_isa i2c_viapro i2c_core via_agp agpgart rtc
CPU: 0
EIP: 0060:[<f0a93c94>] Not tainted VLI
EFLAGS: 00010206 (2.6.20.14-cks1 #15)
EIP is at rpcauth_checkverf+0x34/0x70 [sunrpc]
eax: d2f4447c ebx: c655d584 ecx: 00000000 edx: f0aa9f60
esi: e91ea640 edi: d2f44474 ebp: ede2f228 esp: e64b5eec
ds: 007b es: 007b ss: 0068
Process rpciod/0 (pid: 1005, ti=e64b4000 task=efe95a90 task.ti=e64b4000)
Stack: 00000286 ede2f8a0 ede2f8a0 00000286 c655d584 121d0da3 00000820 f0a8d7fd
       f0a93d60 f08bae07 00000286 c655d5cc 00000286 00000286 f08c0520 c655d584
       00000000 c655d5ec f0a93260 f0a9306f efe95a90 ee2d5740 e092ffb0 c034e11c
Call Trace:
[<f0a8d7fd>] call_decode+0x27d/0x5e0 [sunrpc]
[<f0a93d60>] rpcauth_unbindcred+0x20/0x60 [sunrpc]
[<f08bae07>] nfs_readpage_result_full+0xf7/0x120 [nfs]
[<f08c0520>] nfs3_xdr_readres+0x0/0x160 [nfs]
[<f0a93260>] rpc_async_schedule+0x0/0x10 [sunrpc]
[<f0a9306f>] __rpc_execute+0x5f/0x250 [sunrpc]
[<c034e11c>] schedule+0x21c/0x450
[<c01283aa>] run_workqueue+0x7a/0x110
[<c0128a07>] worker_thread+0x137/0x160
[<c01176b0>] default_wake_function+0x0/0x10
[<c01288d0>] worker_thread+0x0/0x160
[<c012b329>] kthread+0xa9/0xe0
[<c012b280>] kthread+0x0/0xe0
[<c0103a97>] kernel_thread_helper+0x7/0x10
=======================
Code: 10 89 5c 24 10 89 c3 89 7c 24 18 89 d7 89 74 24 14 8b 70 28 75 1a 8b
4e 08 89 fa 89 d8 ff 51 18 8b 5c 24 10 83 74 24 14 8b 7c 24 <18> 83 c4 1c c3
89 74 24 0c 8b 40 10 8b 40 24 8b 40 10 8b 40 08
EIP: [<f0a93c94>] rpcauth_checkverf+0x34/0x70 [sunrpc] SS:ESP 0068:e64b5eec
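
For readers decoding the report above: the value in the "Oops: 0002" line is the x86 page-fault error code, printed in hexadecimal. A minimal Python sketch of how its low three bits can be read follows. The function and key names are illustrative, not kernel identifiers.

```python
# Sketch: decode the low bits of an x86 page-fault error code as printed
# in an "Oops: NNNN" line. Names here are illustrative, not kernel symbols.

def decode_oops_error_code(code: int) -> dict:
    return {
        # bit 0: 0 = no page found, 1 = protection fault on a present page
        "page_present": bool(code & 0x1),
        # bit 1: 0 = read access, 1 = write access
        "write_access": bool(code & 0x2),
        # bit 2: 0 = fault raised in kernel mode, 1 = in user mode
        "user_mode": bool(code & 0x4),
    }

# The oops prints the code as hex digits, so parse it with base 16.
decoded = decode_oops_error_code(int("0002", 16))
print(decoded)
```

For this report the sketch gives a write to a not-present page from kernel mode, which agrees with the "*pde = 00000000" line.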


2007-06-13 19:17:18

by Trond Myklebust

Subject: Re: 2.6.21.14 NFS related oops

On Wed, 2007-06-13 at 14:00 +0200, Maciej Soltysiak wrote:
> Hi,
>
> If anyone is interested I got this OOPS while running a torrent
> (btdownloadcurses)
> application writing directly to a NAS mounted via nfs3.
>
> The client machine is 2.6.21.14 and it is mounted with options:
> wsize=8192,rsize=8192,hard,intr,tcp

Hmm. The Oops says '2.6.20.14-cks1'

Firstly, does that have any extra out-of-tree patches?
Secondly, is it reproducible with 2.6.21 or a more recent kernel?

> After that, the application hung and i am unable to cd into the mounted
> nfs directory
> nor unmount it (busy), nor kill the app (kill -9 fails, process in D state)
>
> Best regards,
> Maciej
>
> BUG: unable to handle kernel paging request at virtual address 5018f248
> printing eip:
> f0a93c94
> *pde = 00000000
> Oops: 0002 [#1]
> Modules linked in: binfmt_misc sit nfs lockd nfs_acl sunrpc w83627ehf
> i2c_isa i2c_viapro i2c_core via_agp agpgart rtc
> CPU: 0
> EIP: 0060:[<f0a93c94>] Not tainted VLI
> EFLAGS: 00010206 (2.6.20.14-cks1 #15)
> EIP is at rpcauth_checkverf+0x34/0x70 [sunrpc]
> eax: d2f4447c ebx: c655d584 ecx: 00000000 edx: f0aa9f60
> esi: e91ea640 edi: d2f44474 ebp: ede2f228 esp: e64b5eec
> ds: 007b es: 007b ss: 0068
> Process rpciod/0 (pid: 1005, ti=e64b4000 task=efe95a90 task.ti=e64b4000)
> Stack: 00000286 ede2f8a0 ede2f8a0 00000286 c655d584 121d0da3 00000820 f0a8d7fd
>        f0a93d60 f08bae07 00000286 c655d5cc 00000286 00000286 f08c0520 c655d584
>        00000000 c655d5ec f0a93260 f0a9306f efe95a90 ee2d5740 e092ffb0 c034e11c
> Call Trace:
> [<f0a8d7fd>] call_decode+0x27d/0x5e0 [sunrpc]
> [<f0a93d60>] rpcauth_unbindcred+0x20/0x60 [sunrpc]
> [<f08bae07>] nfs_readpage_result_full+0xf7/0x120 [nfs]
> [<f08c0520>] nfs3_xdr_readres+0x0/0x160 [nfs]
> [<f0a93260>] rpc_async_schedule+0x0/0x10 [sunrpc]
> [<f0a9306f>] __rpc_execute+0x5f/0x250 [sunrpc]
> [<c034e11c>] schedule+0x21c/0x450
> [<c01283aa>] run_workqueue+0x7a/0x110
> [<c0128a07>] worker_thread+0x137/0x160
> [<c01176b0>] default_wake_function+0x0/0x10
> [<c01288d0>] worker_thread+0x0/0x160
> [<c012b329>] kthread+0xa9/0xe0
> [<c012b280>] kthread+0x0/0xe0
> [<c0103a97>] kernel_thread_helper+0x7/0x10
> =======================
> Code: 10 89 5c 24 10 89 c3 89 7c 24 18 89 d7 89 74 24 14 8b 70 28 75 1a 8b
> 4e 08 89 fa 89 d8 ff 51 18 8b 5c 24 10 83 74 24 14 8b 7c 24 <18> 83 c4 1c c3
> 89 74 24 0c 8b 40 10 8b 40 24 8b 40 10 8b 40 08
> EIP: [<f0a93c94>] rpcauth_checkverf+0x34/0x70 [sunrpc] SS:ESP 0068:e64b5eec

At a first guess, it looks as though something has scribbled over your
credential. Have you tried running this kernel with slab debugging
enabled?

Cheers
Trond

2007-06-13 20:35:30

by Chuck Ebbert

Subject: Re: 2.6.21.14 NFS related oops

On 06/13/2007 03:17 PM, Trond Myklebust wrote:
> On Wed, 2007-06-13 at 14:00 +0200, Maciej Soltysiak wrote:
>> =======================
>> Code: 10 89 5c 24 10 89 c3 89 7c 24 18 89 d7 89 74 24 14 8b 70 28 75 1a 8b
>> 4e 08 89 fa 89 d8 ff 51 18 8b 5c 24 10 83 74 24 14 8b 7c 24 <18> 83 c4 1c c3
>> 89 74 24 0c 8b 40 10 8b 40 24 8b 40 10 8b 40 08
>> EIP: [<f0a93c94>] rpcauth_checkverf+0x34/0x70 [sunrpc] SS:ESP 0068:e64b5eec
>
> At a first guess, it looks as though something has scribbled over your
> credential. Have you tried running this kernel with slab debugging
> enabled?
>

Disassembly of this code yields gibberish, like a bit got flipped
somewhere:

  1c:   ff 51 18                call   *0x18(%ecx)
  1f:   8b 5c 24 10             mov    0x10(%esp),%ebx
  23:   83 74 24 14 8b          xorl   $0xffffff8b,0x14(%esp)
  28:   7c 24                   jl     4e <_EIP+0x4e>
   0:   18 83 c4 1c c3 89       sbb    %al,0x89c31cc4(%ebx)   <=====
   6:   74 24                   je     2c <_EIP+0x2c>
   8:   0c 8b                   or     $0x8b,%al
   a:   40                      inc    %eax
   b:   10 8b 40 24 8b 40       adc    %cl,0x408b2440(%ebx)
  11:   10                      .byte 0x10
  12:   8b 40 08                mov    0x8(%eax),%eax

Somewhere around 23: things went horribly wrong.
At 12: it starts to make sense again.
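
The bit-flip reading can be checked with a little arithmetic. The sketch below assumes (this is an assumption, not something stated in the thread) that the byte at 23: was meant to be 0x8b, giving "8b 74 24 14", a restore of %esi that would mirror the "89 74 24 14" save earlier in the Code bytes. It also computes the address the marked sbb instruction would write to, using ebx from the register dump.

```python
# Sketch: test the single-bit-flip hypothesis against values from the oops.
# ASSUMPTION: the opcode byte at offset 0x23 was originally 0x8b, so that
# "8b 74 24 14" (mov 0x14(%esp),%esi) mirrors the earlier "89 74 24 14".

observed_opcode = 0x83  # byte seen in the Code: dump
assumed_opcode = 0x8B   # hypothetical intended byte (assumption)

diff = observed_opcode ^ assumed_opcode
flipped_bits = bin(diff).count("1")

# The byte marked <18> decodes as: sbb %al,0x89c31cc4(%ebx)
ebx = 0xC655D584           # from the register dump
displacement = 0x89C31CC4  # from the disassembly
effective_address = (ebx + displacement) & 0xFFFFFFFF  # 32-bit wraparound

print(f"differing bits: {flipped_bits} (mask {diff:#04x})")
print(f"effective address: {effective_address:08x}")
```

Under that assumption the two opcodes differ in exactly one bit, and the computed address equals the faulting address 5018f248 reported at the top of the oops.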

2007-06-14 15:34:49

by Maciej Sołtysiak

Subject: Re: 2.6.21.14 NFS related oops

Trond Myklebust wrote:
> On Wed, 2007-06-13 at 14:00 +0200, Maciej Soltysiak wrote:
>
>> Hi,
>>
>> If anyone is interested I got this OOPS while running a torrent
>> (btdownloadcurses)
>> application writing directly to a NAS mounted via nfs3.
>>
>> The client machine is 2.6.21.14 and it is mounted with options:
>> wsize=8192,rsize=8192,hard,intr,tcp
>>
>
> Hmm. The Oops says '2.6.20.14-cks1'
>
> Firstly, does that have any extra out-of-tree patches?
> Secondly, is it reproducible with 2.6.21 or a more recent kernel?
>
>
Ah, yes, it is 2.6.20.14, not 2.6.21.14, and it does contain two extra things:
- Con Kolivas' -cks1 patchset (server version)
- reiser4 code (one filesystem mounted)
>> After that, the application hung and i am unable to cd into the mounted
>> nfs directory
>> nor unmount it (busy), nor kill the app (kill -9 fails, process in D state)
>>
>> Best regards,
>> Maciej
>>
>> BUG: unable to handle kernel paging request at virtual address 5018f248
>> printing eip:
>> f0a93c94
>> *pde = 00000000
>> Oops: 0002 [#1]
>> Modules linked in: binfmt_misc sit nfs lockd nfs_acl sunrpc w83627ehf
>> i2c_isa i2c_viapro i2c_core via_agp agpgart rtc
>> CPU: 0
>> EIP: 0060:[<f0a93c94>] Not tainted VLI
>> EFLAGS: 00010206 (2.6.20.14-cks1 #15)
>> EIP is at rpcauth_checkverf+0x34/0x70 [sunrpc]
>> eax: d2f4447c ebx: c655d584 ecx: 00000000 edx: f0aa9f60
>> esi: e91ea640 edi: d2f44474 ebp: ede2f228 esp: e64b5eec
>> ds: 007b es: 007b ss: 0068
>> Process rpciod/0 (pid: 1005, ti=e64b4000 task=efe95a90 task.ti=e64b4000)
>> Stack: 00000286 ede2f8a0 ede2f8a0 00000286 c655d584 121d0da3 00000820 f0a8d7fd
>>        f0a93d60 f08bae07 00000286 c655d5cc 00000286 00000286 f08c0520 c655d584
>>        00000000 c655d5ec f0a93260 f0a9306f efe95a90 ee2d5740 e092ffb0 c034e11c
>> Call Trace:
>> [<f0a8d7fd>] call_decode+0x27d/0x5e0 [sunrpc]
>> [<f0a93d60>] rpcauth_unbindcred+0x20/0x60 [sunrpc]
>> [<f08bae07>] nfs_readpage_result_full+0xf7/0x120 [nfs]
>> [<f08c0520>] nfs3_xdr_readres+0x0/0x160 [nfs]
>> [<f0a93260>] rpc_async_schedule+0x0/0x10 [sunrpc]
>> [<f0a9306f>] __rpc_execute+0x5f/0x250 [sunrpc]
>> [<c034e11c>] schedule+0x21c/0x450
>> [<c01283aa>] run_workqueue+0x7a/0x110
>> [<c0128a07>] worker_thread+0x137/0x160
>> [<c01176b0>] default_wake_function+0x0/0x10
>> [<c01288d0>] worker_thread+0x0/0x160
>> [<c012b329>] kthread+0xa9/0xe0
>> [<c012b280>] kthread+0x0/0xe0
>> [<c0103a97>] kernel_thread_helper+0x7/0x10
>> =======================
>> Code: 10 89 5c 24 10 89 c3 89 7c 24 18 89 d7 89 74 24 14 8b 70 28 75 1a 8b
>> 4e 08 89 fa 89 d8 ff 51 18 8b 5c 24 10 83 74 24 14 8b 7c 24 <18> 83 c4 1c c3
>> 89 74 24 0c 8b 40 10 8b 40 24 8b 40 10 8b 40 08
>> EIP: [<f0a93c94>] rpcauth_checkverf+0x34/0x70 [sunrpc] SS:ESP 0068:e64b5eec
>>
>
> At a first guess, it looks as though something has scribbled over your
> credential. Have you tried running this kernel with slab debugging
> enabled?
>
>
No, but I will turn it on. The server crashes under heavy NFS traffic
(e.g. the nightly rsync backup).
It crashed again today, but the oops was not written to kern.log.
> Cheers
> Trond
>
Thanks for your reply and best regards,
Maciej

2007-06-16 09:32:24

by Maciej Sołtysiak

Subject: Re: 2.6.21.14 NFS related oops

>> =======================
>> Code: 10 89 5c 24 10 89 c3 89 7c 24 18 89 d7 89 74 24 14 8b 70 28 75 1a 8b
>> 4e 08 89 fa 89 d8 ff 51 18 8b 5c 24 10 83 74 24 14 8b 7c 24 <18> 83 c4 1c c3
>> 89 74 24 0c 8b 40 10 8b 40 24 8b 40 10 8b 40 08
>> EIP: [<f0a93c94>] rpcauth_checkverf+0x34/0x70 [sunrpc] SS:ESP 0068:e64b5eec
>
> At a first guess, it looks as though something has scribbled over your
> credential. Have you tried running this kernel with slab debugging
> enabled?

I'm now running 2.6.21.5 with slab debugging on. Here is what it reported
about slab corruption:

Slab corruption: skbuff_head_cache start=ef287b78, len=164
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c031710c>](kfree_skbmem+0x3c/0x90)
090: 6b 6b 6b 6b 6b 63 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Single bit error detected. Probably bad RAM.
Run memtest86+ or a similar memory test tool.
Prev obj: start=ef287ac8, len=164
Redzone: 0x170fc2a5/0x170fc2a5.
Last user: [<c031798b>](__alloc_skb+0x2b/0x100)
000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
010: 00 00 00 00 e0 71 e6 ef 00 00 00 00 00 00 00 00
Next obj: start=ef287c28, len=164
Redzone: 0x170fc2a5/0x170fc2a5.
Last user: [<c031798b>](__alloc_skb+0x2b/0x100)
000: 84 d0 85 c5 84 d0 85 c5 04 d0 85 c5 2c 0a 73 46
010: 6f cd 09 00 00 00 00 00 01 00 00 00 08 e5 72 ee

How likely is it that this is really a bad memory issue?
Does this report say anything about which RAM chip I should
investigate or replace? I have 1x512MB + 1x256MB.

Best Regards,
Maciej
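
The "Single bit error" verdict can be reproduced by hand from the dumped line. A minimal sketch, assuming the freed object should be filled entirely with the kernel's free-poison byte 0x6b (POISON_FREE); the helper name is hypothetical:

```python
# Sketch: compare a dumped slab line against the free-poison byte.
# ASSUMPTION: every byte of the freed object should equal POISON_FREE (0x6b).
# find_poison_deviations is an illustrative helper, not a kernel function.

POISON_FREE = 0x6B

def find_poison_deviations(hex_line: str):
    """Return (index, byte, differing bit count) for each non-poison byte."""
    data = bytes.fromhex(hex_line)  # fromhex skips the spaces between bytes
    return [
        (i, b, bin(b ^ POISON_FREE).count("1"))
        for i, b in enumerate(data)
        if b != POISON_FREE
    ]

# Bytes from the line labelled 090: in the slab corruption report.
dump_line = "6b 6b 6b 6b 6b 63 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b"
deviations = find_poison_deviations(dump_line)
print(deviations)
```

The one deviating byte is at index 5 of the line labelled 090:, and it differs from the poison value in a single bit.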

2007-06-16 15:08:33

by Trond Myklebust

Subject: Re: 2.6.21.14 NFS related oops

On Sat, 2007-06-16 at 11:26 +0200, Maciej Sołtysiak wrote:
> >> =======================
> >> Code: 10 89 5c 24 10 89 c3 89 7c 24 18 89 d7 89 74 24 14 8b 70 28 75 1a 8b
> >> 4e 08 89 fa 89 d8 ff 51 18 8b 5c 24 10 83 74 24 14 8b 7c 24 <18> 83 c4 1c c3
> >> 89 74 24 0c 8b 40 10 8b 40 24 8b 40 10 8b 40 08
> >> EIP: [<f0a93c94>] rpcauth_checkverf+0x34/0x70 [sunrpc] SS:ESP 0068:e64b5eec
> >
> > At a first guess, it looks as though something has scribbled over your
> > credential. Have you tried running this kernel with slab debugging
> > enabled?
>
> I'm running 2.6.21.5 now with slab debugging on, here's what I got about
> slab corruption:
>
> Slab corruption: skbuff_head_cache start=ef287b78, len=164
> Redzone: 0x5a2cf071/0x5a2cf071.
> Last user: [<c031710c>](kfree_skbmem+0x3c/0x90)
> 090: 6b 6b 6b 6b 6b 63 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> Single bit error detected. Probably bad RAM.
> Run memtest86+ or a similar memory test tool.
> Prev obj: start=ef287ac8, len=164
> Redzone: 0x170fc2a5/0x170fc2a5.
> Last user: [<c031798b>](__alloc_skb+0x2b/0x100)
> 000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 010: 00 00 00 00 e0 71 e6 ef 00 00 00 00 00 00 00 00
> Next obj: start=ef287c28, len=164
> Redzone: 0x170fc2a5/0x170fc2a5.
> Last user: [<c031798b>](__alloc_skb+0x2b/0x100)
> 000: 84 d0 85 c5 84 d0 85 c5 04 d0 85 c5 2c 0a 73 46
> 010: 6f cd 09 00 00 00 00 00 01 00 00 00 08 e5 72 ee
>
> How probable is that it is really a bad memory issue?
> Does this report say anything about which RAM chip I should
> investigate/replace ? I have 1x512MB+1x256MB
>
> Best Regards,
> Maciej

I'd try doing as suggested above: run memtest86 on the computer for a
couple of hours and see what it tells you. That should hopefully give
you enough information to figure out which chips need replacing.

Cheers
Trond

2007-06-20 10:42:45

by Maciej Sołtysiak

Subject: Re: 2.6.21.14 NFS related oops

> > I'm running 2.6.21.5 now with slab debugging on, here's what I got about
> > slab corruption:
> >
> > Slab corruption: skbuff_head_cache start=ef287b78, len=164
> > Redzone: 0x5a2cf071/0x5a2cf071.
> > Last user: [<c031710c>](kfree_skbmem+0x3c/0x90)
> > 090: 6b 6b 6b 6b 6b 63 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> > Single bit error detected. Probably bad RAM.
> > Run memtest86+ or a similar memory test tool.
> > Prev obj: start=ef287ac8, len=164
> > Redzone: 0x170fc2a5/0x170fc2a5.
> > Last user: [<c031798b>](__alloc_skb+0x2b/0x100)
> > 000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > 010: 00 00 00 00 e0 71 e6 ef 00 00 00 00 00 00 00 00
> > Next obj: start=ef287c28, len=164
> > Redzone: 0x170fc2a5/0x170fc2a5.
> > Last user: [<c031798b>](__alloc_skb+0x2b/0x100)
> > 000: 84 d0 85 c5 84 d0 85 c5 04 d0 85 c5 2c 0a 73 46
> > 010: 6f cd 09 00 00 00 00 00 01 00 00 00 08 e5 72 ee
> >
> > How probable is that it is really a bad memory issue?
> > Does this report say anything about which RAM chip I should
> > investigate/replace ? I have 1x512MB+1x256MB
> >
> > Best Regards,
> > Maciej
>
> I'd try doing as suggested above: run memtest86 on the computer for a
> couple of hours and see what it tells you. That should hopefully give
> you enough information to figure out which chips need replacing.

I am also getting BAD CRC errors on the disk that holds my swap partition.
Could slab debugging be reporting slab corruption not because my RAM chips
are bad, but because the swap partition has bad blocks? In that case the
whole problem might be related to the swap disk rather than the RAM.

> Cheers
> Trond
Regards,
Maciej