Hello,
I got no answer to the original message (probably lost in the holiday
spam), so I am reposting it (we got another such oops this Sunday):
-----Original Message-----
From: "Peter Lojkin" <[email protected]>
To: [email protected]
Date: Thu, 26 Dec 2002 17:48:03 +0300
Subject: [NFS] Another NFS related oops on smp servers
>
> Hello,
>
> We had 3 NFS-related oopses on our servers in 2 days; such oopses
> never happened before. The kernel on the oopsed servers is basically
> 2.4.20aa1 plus Trond's waitq fix; one of the oopsed servers also had
> Trond's fix for another oops problem of ours, discussed earlier.
> The setup is:
> - Intel 4-way SMP general-purpose servers running Debian 3.0
> - Intel and SPARC file servers running Solaris 8
> - Intel workstations running Solaris 7/8, Red Hat 7.2/7.3, and Debian 3.0
>
> The workload is mostly software development: developers run
> simultaneous builds on our general-purpose servers, accessing a
> multitude of files exported from the file servers and workstations in
> parallel. There is no nfsd running on the work servers. We use autofs
> with no special mount options, so we get
> rw,nosuid,v3,rsize=8192,wsize=8192,hard,intr,udp,lock
> for the Linux exports and
> rw,nosuid,v3,rsize=32768,wsize=32768,hard,intr,udp,lock
> for the Solaris exports.
>
> After the oops, any process accessing NFS hangs.
>
> ksymoops parsed kern.log:
> ==================================================================
> Dec 24 16:43:36 raven kernel: 3136MB HIGHMEM available.
> Dec 24 16:43:36 raven kernel: cpu: 0, clocks: 1002304, slice: 200460
> Dec 24 16:43:36 raven kernel: cpu: 1, clocks: 1002304, slice: 200460
> Dec 24 16:43:36 raven kernel: cpu: 2, clocks: 1002304, slice: 200460
> Dec 24 16:43:36 raven kernel: cpu: 3, clocks: 1002304, slice: 200460
> Dec 24 16:43:36 raven kernel: Receiver lock-up bug exists -- enabling work-around.
> Dec 24 16:43:36 raven kernel: e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex
> Dec 26 16:20:14 raven kernel: kernel BUG at inode.c:1100!
> Dec 26 16:20:14 raven kernel: invalid operand: 0000 2.4.20aa1 #1 SMP Tue Dec 24 16:00:13 MSK 2002
> Dec 26 16:20:14 raven kernel: CPU: 2
> Dec 26 16:20:14 raven kernel: EIP: 0010:[iput+32/508] Not tainted
> Dec 26 16:20:14 raven kernel: EFLAGS: 00010246
> Dec 26 16:20:14 raven kernel: eax: 00000001 ebx: c7f15120 ecx: 00000000 edx: c7f15201
> Dec 26 16:20:14 raven kernel: esi: c5767c00 edi: 00000000 ebp: e1924a60 esp: ccf6df3c
> Dec 26 16:20:14 raven kernel: ds: 0018 es: 0018 ss: 0018
> Dec 26 16:20:14 raven kernel: Process rpciod (pid: 406, stackpage=ccf6d000)
> Dec 26 16:20:14 raven kernel: Stack: c7f15120 efa8dbc0 c7f152e8 c0191a08 c7f15120 e1924ab4 ccf6c000 c038d208
> Dec 26 16:20:14 raven kernel: 00000000 e1924b90 e1924b08 e1924aec e1924a60 c02786f6 e1924a60 e1924a60
> Dec 26 16:20:14 raven kernel: ccf6c000 c038d208 00000000 ccf6c000 ccf6c000 c0355c00 ccf6dfdc c44fe000
> Dec 26 16:20:14 raven kernel: Call Trace: [nfs_writeback_done+764/1340] [__rpc_execute+726/880] [__rpc_schedule+231/364] [rpciod+245/584] [kernel_thread+40/56]
> Dec 26 16:20:14 raven kernel: Code: 0f 0b 4c 04 92 49 29 c0 85 f6 74 03 8b 7e 20 85 ff 74 0d 8b
>
>
> >>ebx; c7f15120 <END_OF_CODE+23df381/????>
> >>edx; c7f15201 <END_OF_CODE+23df462/????>
> >>esi; c5767c00 <_end+53897e4/5711be4>
> >>ebp; e1924a60 <END_OF_CODE+1bdeecc1/????>
> >>esp; ccf6df3c <END_OF_CODE+743819d/????>
>
> Code; 00000000 Before first symbol
> 00000000 <_EIP>:
> Code; 00000000 Before first symbol
> 0: 0f 0b ud2a
> Code; 00000002 Before first symbol
> 2: 4c dec %esp
> Code; 00000003 Before first symbol
> 3: 04 92 add $0x92,%al
> Code; 00000005 Before first symbol
> 5: 49 dec %ecx
> Code; 00000006 Before first symbol
> 6: 29 c0 sub %eax,%eax
> Code; 00000008 Before first symbol
> 8: 85 f6 test %esi,%esi
> Code; 0000000a Before first symbol
> a: 74 03 je f <_EIP+0xf> 0000000f Before first symbol
> Code; 0000000c Before first symbol
> c: 8b 7e 20 mov 0x20(%esi),%edi
> Code; 0000000f Before first symbol
> f: 85 ff test %edi,%edi
> Code; 00000011 Before first symbol
> 11: 74 0d je 20 <_EIP+0x20> 00000020 Before first symbol
> Code; 00000013 Before first symbol
> 13: 8b 00 mov (%eax),%eax
>
> ==================================================================
-------------------------------------------------------
This SF.NET email is sponsored by:
SourceForge Enterprise Edition + IBM + LinuxWorld = Something 2 See!
http://www.vasoftware.com
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs
>>>>> " " == Peter Lojkin <[email protected]> writes:
>>
>> Hello,
>>
> >> We had 3 NFS-related oopses on our servers in 2 days; such
> >> oopses never happened before. The kernel on the oopsed servers is
> >> basically 2.4.20aa1 plus Trond's waitq fix; one of the oopsed
> >> servers also had Trond's fix for another oops problem of ours,
> >> discussed earlier. The setup is:
I suggest you first try to reproduce that Oops on a stock kernel from
ftp.kernel.org: 2.4.20aa1 is very unlikely to be used by too many of
the developers (I certainly don't use it).
Cheers,
Trond
Hello,
I understand your point, and you are right: if I find a way to
reproduce this oops, I will check it against a stock kernel. But since
a stock kernel cannot survive our workload for more than 2 hours
(because of VM and scheduler issues), I cannot initiate a kernel
upgrade on the servers just to check whether it oopses a few weeks
later, and right now we have no free servers for long-term testing :(
Do you have any suggestions on how to test quickly enough that I
could trigger the oops on our kernel, then boot the stock kernel and
check it there within a couple of hours? How can I simulate RPC load
(the oops was in rpciod)? The NFS load we used last time does not
trigger this oops...
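For lack of a better idea, here is the sort of brute-force load I could
script up. This is a hedged sketch only: MNT, NPROC, and ROUNDS are
placeholders, not our real setup. It hammers a mount with parallel 8k
writes, which should at least keep rpciod busy with writeback RPCs:

```shell
#!/bin/sh
# Hedged sketch of a parallel write load.  MNT, NPROC and ROUNDS are
# placeholders -- point MNT at a scratch directory on a test NFS mount.
MNT=${MNT:-/mnt/nfs-test}
NPROC=${NPROC:-16}
ROUNDS=${ROUNDS:-100}

i=0
while [ "$i" -lt "$NPROC" ]; do
    (
        r=0
        while [ "$r" -lt "$ROUNDS" ]; do
            # 512 x 8k writes per round matches the 8k wsize on the
            # Linux exports; deleting the file forces fresh writeback
            # each round.
            dd if=/dev/zero of="$MNT/load.$i" bs=8k count=512 2>/dev/null
            rm -f "$MNT/load.$i"
            r=$((r + 1))
        done
    ) &
    i=$((i + 1))
done
wait
echo done
```

Running several copies of this against different exports at once would
roughly approximate our parallel-build traffic, though obviously it is
not guaranteed to tickle the same code path.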
By the way, Neil posted a few RPC-related patches about three weeks
ago; could this be related?
-----Original Message-----
From: Trond Myklebust <[email protected]>
To: "Peter Lojkin" <[email protected]>
Date: 03 Feb 2003 21:47:06 +0100
Subject: Re: [NFS] [repost] Another NFS related oops on smp servers
> I suggest you first try to reproduce that Oops on a stock kernel from
> ftp.kernel.org: 2.4.20aa1 is very unlikely to be used by too many of
> the developers (I certainly don't use it).
>
> Cheers,
> Trond