2006-10-18 09:30:50

by Helge Bahmann

Subject: NFSv3 server: lockd hangs

Hello,

I have problems with lockd on an NFS server (stock 2.6.17 kernel); I have
no easy test case to reproduce the problem, but after one to two weeks of
uptime the lockd process is simply stuck. It fails to shut down
("lockd_down: lockd failed to exit"; afterwards the "[lockd]" thread is
still around and stuck in D state) and is otherwise unresponsive. No other
kernel messages are logged.

Only a full reboot solves the problem.

Can you tell me
- if there is a known lockd problem? I could not find anything in the
archives, but maybe I have missed something
- if it may be a file-system-specific problem (reiserfs in the present
case)?
- what else can I do to provide more useful diagnostics?

Thanks and best regards
--
Helge Bahmann <[email protected]>
The past: Smart users in front of dumb terminals
Those who sit in the finally block should not
throw exceptions.


-------------------------------------------------------------------------
Using Tomcat but need to do more? Need to support web services, security?
Get stuff done quickly with pre-integrated technology to make your job easier
Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs


2006-10-18 18:21:18

by Trond Myklebust

Subject: Re: NFSv3 server: lockd hangs

On Wed, 2006-10-18 at 11:30 +0200, Helge Bahmann wrote:
> Hello,
>
> I have problems with lockd on an NFS server (stock 2.6.17 kernel); I have
> no easy test case to reproduce the problem, but after one to two weeks of
> uptime the lockd process is simply stuck. It fails to shut down
> ("lockd_down: lockd failed to exit"; afterwards the "[lockd]" thread is
> still around and stuck in D state) and is otherwise unresponsive. No other
> kernel messages are logged.
>
> Only a full reboot solves the problem.
>
> Can you tell me
> - if there is a known lockd problem? I could not find anything in the
> archives, but maybe I have missed something
> - if it may be a file-system-specific problem (reiserfs in the present
> case)?
> - what else can I do to provide more useful diagnostics?

There is a known deadlock in lockd for stock 2.6.17. See the fix at

http://client.linux-nfs.org/Linux-2.6.x/2.6.18-rc4/linux-2.6.18-011-fix_nlm_traverse_files_deadlock.dif


This patch will be included in the next stable release for 2.6.17.

Cheers,
Trond



2006-10-19 07:56:40

by Helge Bahmann

Subject: Re: NFSv3 server: lockd hangs

Trond Myklebust wrote:
> There is a known deadlock in lockd for stock 2.6.17. See the fix at
>
> http://client.linux-nfs.org/Linux-2.6.x/2.6.18-rc4/linux-2.6.18-011-fix_nlm_traverse_files_deadlock.dif
>
>
> This patch will be included in the next stable release for 2.6.17.

Okay, thanks; I upgraded to 2.6.18, which apparently contains the fix (and
hopefully no regressions).

Is there also a known problem with NFSv3 file locking and sec=krb5 mounts?
Sometimes a process on the NFS client hangs trying to lock a file which is
definitely *not* locked on any other client -- strange thing is, it
happens only if the same process does more than one lock/unlock cycle, and
it only happens after multiple ticket expiries. However, once it has happened,
it becomes quite reproducible. Rebooting the client fixes the
problem. Rebooting the server does as well.
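
The access pattern described above (one process doing several lock/unlock cycles on the same file) can be exercised with a small test program. The following is only a minimal sketch: the function name and its parameters are hypothetical, the path is assumed to point at a file on the sec=krb5 NFSv3 mount, and nothing in this thread confirms that this loop alone triggers the hang. It uses POSIX byte-range locks via fcntl.lockf and records how long each lock request waited.

```python
import fcntl
import os
import time
from typing import List


def lock_unlock_cycles(path: str, cycles: int = 5, pause: float = 0.0) -> List[float]:
    """Lock and unlock `path` repeatedly from one process.

    Returns the time in seconds that each lock request took.
    A call that never returns is the symptom being looked for.
    """
    waits = []
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        for _ in range(cycles):
            started = time.monotonic()
            fcntl.lockf(fd, fcntl.LOCK_EX)   # blocking exclusive POSIX lock
            waits.append(time.monotonic() - started)
            fcntl.lockf(fd, fcntl.LOCK_UN)   # release before the next cycle
            time.sleep(pause)                # hypothetical knob: gap between cycles
    finally:
        os.close(fd)
    return waits
```

Run against a file on the affected mount, a healthy setup should return quickly with small wait times for every cycle.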

Is there something I could do to diagnose the problem once the problem
starts appearing?

Thanks again and best regards
--
Helge Bahmann <[email protected]>
The past: Smart users in front of dumb terminals
Those who sit in the finally block should not
throw exceptions.



2006-10-19 13:30:47

by Trond Myklebust

Subject: Re: NFSv3 server: lockd hangs

On Thu, 2006-10-19 at 09:56 +0200, Helge Bahmann wrote:
> Trond Myklebust wrote:
> > There is a known deadlock in lockd for stock 2.6.17. See the fix at
> >
> > http://client.linux-nfs.org/Linux-2.6.x/2.6.18-rc4/linux-2.6.18-011-fix_nlm_traverse_files_deadlock.dif
> >
> >
> > This patch will be included in the next stable release for 2.6.17.
>
> Okay, thanks; I upgraded to 2.6.18, which apparently contains the fix (and
> hopefully no regressions).

2.6.17.14 should also contain it.

> Is there also a known problem with NFSv3 file locking and sec=krb5 mounts?
> Sometimes a process on the NFS client hangs trying to lock a file which is
> definitely *not* locked on any other client -- strange thing is, it
> happens only if the same process does more than one lock/unlock cycle, and
> it only happens after multiple ticket expiries. However, once it has happened,
> it becomes quite reproducible. Rebooting the client fixes the
> problem. Rebooting the server does as well.
>
> Is there something I could do to diagnose the problem once the problem
> starts appearing?

cat /proc/locks on both the client and server. The 5th column there
should be in the form 'device_major:device_minor:inode_number'. See if
you can find a match for the file that is not supposed to be locked.
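
As a rough illustration of that check, the sketch below parses /proc/locks text and keeps the entries for one inode. The names LockEntry, parse_lock_line and locks_on_inode are hypothetical helpers, not part of any existing tool. It assumes the usual eight whitespace-separated fields per line (with the leading "N:" index counted, the device:inode field is the sixth), hexadecimal major and minor numbers, and a decimal inode number.

```python
from typing import List, NamedTuple, Optional


class LockEntry(NamedTuple):
    blocked: bool   # True for "->" lines: a waiter queued behind a held lock
    kind: str       # e.g. POSIX
    access: str     # READ or WRITE
    pid: int
    major: int
    minor: int
    inode: int


def parse_lock_line(line: str) -> Optional[LockEntry]:
    """Parse one /proc/locks line; return None if it does not look like one."""
    fields = line.split()
    blocked = "->" in fields
    if blocked:
        fields.remove("->")
    if len(fields) != 8:
        return None
    try:
        major, minor, inode = fields[5].split(":")
        return LockEntry(blocked, fields[1], fields[3], int(fields[4]),
                         int(major, 16), int(minor, 16), int(inode))
    except ValueError:
        return None


def locks_on_inode(text: str, inode: int) -> List[LockEntry]:
    """Return all parsed entries in `text` that refer to `inode`."""
    entries = (parse_lock_line(line) for line in text.splitlines())
    return [e for e in entries if e is not None and e.inode == inode]
```

The inode number of the file can be obtained with os.stat(path).st_ino. Device numbers generally differ between the NFS client and the server, so the match is done on the inode number only.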

Cheers,
Trond



2006-10-19 14:14:52

by Helge Bahmann

Subject: Re: NFSv3 server: lockd hangs

Trond Myklebust wrote:

> > Is there also a known problem with NFSv3 file locking and sec=krb5 mounts?
[snip]
> > Is there something I could do to diagnose the problem once the problem
> > starts appearing?
>
> cat /proc/locks on both the client and server. The 5th column there
> should be in the form 'device_major:device_minor:inode_number'. See if
> you can find a match for the file that is not supposed to be locked.

It just happened again. Output from /proc/locks on the NFS client (only
for the relevant file):
1: POSIX ADVISORY WRITE 30962 00:16:976311 0 EOF

and again on the server:
9: POSIX ADVISORY READ 30183 fe:09:976311 0 EOF
9: -> POSIX ADVISORY WRITE 30962 fe:09:976311 0 EOF

Sure enough, 30183 was the pid of one of the processes on the NFS client
that previously had the file locked (there is only one client machine ever
accessing that file); however, there is no longer any process (or thread,
for that matter) with pid 30183 on the client.
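
The manual comparison above could be scripted roughly as follows. This is a sketch under stated assumptions: orphaned_holders and pid_exists are hypothetical names, the server's /proc/locks text is assumed to have been copied to the client, and the pid column on the server is assumed to carry the client-side pid, as observed here. It lists pids that hold a lock on the given inode according to the server but no longer exist on the client.

```python
import os
from typing import Callable, List


def pid_exists(pid: int) -> bool:
    """True if /proc/<pid> exists on this machine (Linux only)."""
    return os.path.isdir(f"/proc/{pid}")


def orphaned_holders(server_locks: str, inode: int,
                     alive: Callable[[int], bool] = pid_exists) -> List[int]:
    """Return holder pids from the server's lock list that `alive` rejects."""
    orphans = []
    for line in server_locks.splitlines():
        fields = line.split()
        if "->" in fields:       # a blocked waiter, not a holder
            continue
        if len(fields) != 8:     # not a regular /proc/locks entry
            continue
        try:
            pid = int(fields[4])
            ino = int(fields[5].rsplit(":", 1)[1])
        except (ValueError, IndexError):
            continue
        if ino == inode and not alive(pid):
            orphans.append(pid)
    return orphans
```

Run on the client with the server's output as input, a non-empty result would point at locks the server still holds on behalf of processes that have already exited.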

It seems there is one program that reliably triggers the problem
(kcminit); I think I can now reproduce it. What kind of traces
should I take to produce useful information?

Thanks and best regards
--
Helge Bahmann <[email protected]>
The past: Smart users in front of dumb terminals
Those who sit in the finally block should not
throw exceptions.

