2006-08-14 08:01:48

by Allard Hoeve

Subject: Kernel oops (NLM?)

_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs


Attachments:
config-2.6.16.13-byte (40.04 kB)
(No filename) (373.00 B)
(No filename) (140.00 B)

2006-08-14 13:52:33

by Wendy Cheng

Subject: Re: Kernel oops (NLM?)

Allard Hoeve wrote:

>
> Hello all,
>
> Some time ago, I encountered a kernel oops that resulted in a crashed
> lock daemon. nfsd never stopped working, but all client lock requests
> stalled.


Are there specific steps we can follow to reproduce this issue?

-- Wendy


-------------------------------------------------------------------------
Using Tomcat but need to do more? Need to support web services, security?
Get stuff done quickly with pre-integrated technology to make your job easier
Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642

2006-08-14 14:09:36

by Allard Hoeve

Subject: Re: Kernel oops (NLM?)


Hello Wendy,

>> Some time ago, I encountered a kernel oops that resulted in a crashed lock
>> daemon. nfsd never stopped working, but all client lock requests stalled.

> Are there certain steps we can follow to recreate this issue ?

I'm currently unaware of a way to reproduce the oops, sorry. It is a
production server under heavy load, so I can't fiddle around much. I was
actually hoping there was something blindingly obvious you could deduce
from the voodoo in the oops...

Is there anything I can do to generate more/better logs for you in case
this occurs again on this system?

PS: All clients are configured to use nfsv3. nlm4svc_retrieve_args looks
like an NFS4 codepath, doesn't it?

Regards,

Allard Hoeve



2006-08-14 14:25:58

by Wendy Cheng

Subject: Re: Kernel oops (NLM?)

Allard Hoeve wrote:

>I'm currently unaware of a way to reproduce the oops, sorry. It is a
>production server under heavy load, so I can't fiddle around much.
>
Understood. We're going to do some stress tests on NLM too - so don't
worry about it.

>PS: All clients are configured to use nfsv3. nlm4svc_retrieve_args looks
> like an NFS4 codepath, doesn't it?
>
>
>
No - that's an NLM version number, not NFS version number.

-- Wendy




2006-08-14 14:34:29

by Trond Myklebust

Subject: Re: Kernel oops (NLM?)

On Mon, 2006-08-14 at 10:01 +0200, Allard Hoeve wrote:
> Hello all,
>
> Some time ago, I encountered a kernel oops that resulted in a crashed lock
> daemon. nfsd never stopped working, but all client lock requests stalled.
>
> The machine is a PowerEdge 2850 and had a vanilla 2.6.16.13 kernel
> running.

I'd suggest upgrading. We've eliminated a lot of races in NLM since
then, and have (among other things) completely removed the codepath that
appears to have Oopsed below.

Cheers,
Trond

> I'm not sure how to debug this, so I thought I might just send it to you.
>
> If you need any more information, please ask. I'm on the list.
>
> Regards,
>
> Allard Hoeve
>
> PS: Kernel config attached
>
>
>
> Unable to handle kernel paging request at virtual address 6e6f6974
> printing eip:
> c0176469
> *pde = 00000000
> Oops: 0000 [#1]
> SMP
> Modules linked in: nfsd ipv6 shpchp pci_hotplug ehci_hcd uhci_hcd usbcore
> ide_cd cdrom
> CPU: 0
> EIP: 0060:[<c0176469>] Not tainted VLI
> EFLAGS: 00010246 (2.6.16.13-byte #2)
> EIP is at posix_locks_deadlock+0x3b/0x5e
> eax: 00000000 ebx: 6e6f6974 ecx: f7fb74b8 edx: 00000000
> esi: f7fb74b8 edi: f73622b8 ebp: f7362200 esp: f7247efc
> ds: 007b es: 007b ss: 0068
> Process lockd (pid: 2478, threadinfo=f7246000 task=f7d89030)
> Stack: <0>f77634c8 f7fb74b8 f7fb74b8 f7362600 f7362224 c022446d f73622b8 f7fb74b8
> 00000000 d22430c0 f7362224 f7362200 00000000 c0229148 f7735400 f7247f40
> f7477c00 f73622b8 f7362200 f7362600 f7735400 f7735464 c02293a9 f7735400
> Call Trace:
> [<c022446d>] nlmsvc_lock+0x9a/0x414
> [<c0229148>] nlm4svc_retrieve_args+0x7c/0xe9
> [<c02293a9>] nlm4svc_proc_lock+0xe2/0x13d
> [<c045ce98>] svc_process+0x3f0/0x684
> [<c0223a6a>] lockd+0x133/0x23a
> [<c0223937>] lockd+0x0/0x23a
> [<c0101141>] kernel_thread_helper+0x5/0xb
> Code: 24 04 89 3c 24 e8 f7 fa ff ff 85 c0 75 39 8b 1d 5c 47 4e c0 eb 15 8d
> 43 fc 89 74 24 04 89 04 24 e8 dc fa ff ff 85 c0 75 19 8b 1b <8b> 03 0f 18
> 00 90 81 fb 5c 47 4e c0 75 dd 31 c0 83 c4 08 5b 5e
> <4>lockd_down: lockd failed to exit, clearing pid
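
A note on the faulting address in the oops above: 6e6f6974 is also the value in ebx, and all four of its bytes are printable ASCII. The sketch below is not part of the original thread, and the helper name is hypothetical. It decodes the value in the byte order a little-endian i386 machine stores it in memory. Printable text sitting where a pointer should be is often a hint that a list entry was overwritten or reused after being freed, which would fit the NLM races mentioned above, although the oops by itself does not prove that.

```python
import struct


def decode_pointer_as_ascii(value: int) -> str:
    """Render a 32-bit value as the 4 bytes it occupies in little-endian memory.

    Non-printable bytes are shown as '.' so the result is always 4 characters.
    """
    raw = struct.pack("<I", value)  # little-endian, as on i386
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


# Faulting virtual address (and ebx) taken from the oops report.
fault_address = 0x6E6F6974
print(hex(fault_address), "->", repr(decode_pointer_as_ascii(fault_address)))
```

A real kernel pointer on this machine would normally look like the other register values in the report (f7fb74b8, f7362200 and so on), so a value that decodes to text stands out.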



2006-08-14 15:39:54

by Allard Hoeve

Subject: Re: Kernel oops (NLM?)


Hello Trond,

> I'd suggest upgrading. We've eliminated a lot of races in NLM since
> then, and have (among other things) completely removed the codepath that
> appears to have Oopsed below.

Thanks for your reply.

We've upgraded to vanilla 2.6.17.8 since (for other reasons as well). It
looks like the codepath still exists there? Is the mainstream kernel in
sync? What version do you recommend (NFS-wise)?



Ironically, we experienced an entirely different NFS crash on 2.6.17.8
kernel as well :)

Short description:

* All nfsd threads hanging (state D)
* All clients stalled
* Restarting the nfs-kernel-server didn't help
* No syslog entries
* No dmesg entries

Due to the lack of logs, I can't be more verbose...

Regards,

Allard


2006-08-14 17:29:29

by J. Bruce Fields

Subject: Re: Kernel oops (NLM?)

On Mon, Aug 14, 2006 at 05:39:45PM +0200, Allard Hoeve wrote:
> Ironically, we experienced an entirely different NFS crash on 2.6.17.8
> kernel as well :)
>
> Short description:
>
> * All nfsd threads hanging (state D)
> * All clients stalled
> * Restarting the nfs-kernel-server didn't help
> * No syslog entries
> * No dmesg entries
>
> Due to the lack of logs, I can't be more verbose...

You could try alt-sysrq-T (or "echo t >/proc/sysrq-trigger"). That'll
dump a stack for every process to the logs. (The nfsd ones are the
most likely to be interesting.) So we might be able to figure out
where the nfsd threads are hanging.

--b.
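
The sysrq suggestion above could be scripted roughly as follows. This is a minimal sketch, not something posted in the thread: the function names are hypothetical, it assumes the kernel was built with magic sysrq support and that the script runs as root, and the filter only picks out log lines that mention the task name rather than whole stack traces.

```python
def request_task_dump(trigger_path="/proc/sysrq-trigger"):
    """Ask the kernel to dump every task's stack (same as alt-sysrq-T).

    Writing to the real trigger file requires root. The path is a
    parameter only so the function can be tried against a scratch file.
    """
    with open(trigger_path, "w") as trigger:
        trigger.write("t")


def lines_mentioning(log_text, task_name="nfsd"):
    """Return the log lines that mention the given task name."""
    return [line for line in log_text.splitlines() if task_name in line]


# Typical use on the server while the nfsd threads are stuck (as root):
#   request_task_dump()
#   then read the kernel log (dmesg or syslog) and pass its text to
#   lines_mentioning() to locate the nfsd entries.
```

The full traces still need to be read from the kernel log itself; the filter is only a quick way to find where the nfsd entries start.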
