2006-02-17 04:46:21

by asha yr

Subject: NFS lock reclaiming not working on SLES9 SP2

Hi,

NFS lock reclaiming is not working on SLES9 SP2. After the server reboot, sm-notify sends reboot notifications but the clients fail to reclaim locks.

My NFS server and NFS client both run SLES9 SP2. The NFS server is
sgmlx1 (15.70.191.172) and the NFS client is sgmlx2 (15.70.191.173). I
mounted the NFS file system on the client and then, on the client,
acquired a lock on a file in the NFS share using fcntl. On the NFS
server, an entry for the client was created in the sm directory. I
restarted the NFS server and then started the sm-notify application. The
client failed to reclaim the lock, and I could acquire the lock on the
same file from another client.
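
For readers who want to reproduce the locking step, here is a minimal sketch of taking a whole-file exclusive lock through fcntl. It is an illustration only, not the test program used in this report; the function names are made up for the example, and the path you pass in would be a file on the NFS mount.

```python
import fcntl
import os


def acquire_whole_file_lock(path):
    """Open path and take a non-blocking exclusive POSIX lock on the whole file.

    Returns the open file descriptor; the lock is held until the descriptor
    is unlocked or closed. Raises OSError if another process already holds
    a conflicting lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # lockf() is implemented with fcntl() record locks; the default
        # length of 0 means "from the start offset to the end of the file".
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        raise
    return fd


def release_lock(fd):
    """Drop the lock taken by acquire_whole_file_lock() and close the file."""
    fcntl.lockf(fd, fcntl.LOCK_UN)
    os.close(fd)
```

Running this on the first client, keeping the descriptor open across the server restart, and then running it on a second client is one way to observe whether the lock was reclaimed: if reclaim worked, the second client should get an OSError rather than the lock.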

The reclaiming of locks was working fine on base SLES9. It started failing after I updated SLES9 with SP2.

I have captured lockd debugging messages and a tethereal network trace on the client and am attaching both.

Thanks for your help in advance.

Regards,
Asha



Attachments:
(No filename) (890.00 B)
(No filename) (1.26 kB)
debugging_messages.txt (12.29 kB)

2006-02-17 09:26:01

by Olaf Kirch

Subject: Re: NFS lock reclaiming not working on SLES9 SP2

Hi,

On Fri, Feb 17, 2006 at 04:44:48AM -0000, asha yr wrote:
> The reclaiming of locks was working fine on base SLES9. It started failing after I updated SLES9 with SP2.

Did you try SLES9 SP3? I remember we had some problems with lock reclaim,
but I thought they were fixed in SP2.

It would be useful to get a lockd trace on the server side, what it
receives and what it sends back. The ethereal traces aren't very useful
though; it seems the snaplen is too small (and binary dumps are usually
much more helpful than the "helpful" ASCII packet representation that
ethereal or tcpdump generate)

Thanks,
Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
[email protected] | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax


-------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc. Do you grep through log files
for problems? Stop! Download the new AJAX search engine that makes
searching your log files as easy as surfing the web. DOWNLOAD SPLUNK!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=103432&bid=230486&dat=121642
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2006-02-17 09:37:26

by Greg Banks

Subject: Re: NFS lock reclaiming not working on SLES9 SP2

On Fri, 2006-02-17 at 20:25, Olaf Kirch wrote:
> Hi,
>
> On Fri, Feb 17, 2006 at 04:44:48AM -0000, asha yr wrote:
> > The reclaiming of locks was working fine on base SLES9. It started failing after I updated SLES9 with SP2.
>
> Did you try SLES9 SP3? I remember we had some problems with lock reclaim,
> but I thought they were fixed in SP2.
>
> It would be useful to get a lockd trace on the server side, what it
> receives and what it sends back. The ethereal traces aren't very useful
> though; it seems the snaplen is too small (and binary dumps are usually
> much more helpful than the "helpful" ASCII packet representation that
> ethereal or tcpdump generate)

In his trace, the client wasn't sending any LOCK calls at all
for multiple minutes after receiving the NOTIFY.

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.





2006-02-17 09:56:23

by Olaf Kirch

Subject: Re: NFS lock reclaiming not working on SLES9 SP2

On Fri, Feb 17, 2006 at 08:37:05PM +1100, Greg Banks wrote:
> In his trace, the client wasn't sending any LOCK calls at all
> for multiple minutes after receiving the NOTIFY.

Ah, you're right. I wasn't paying attention to the time stamps.

It seems the problem is that we're now using hostnames to identify lockd
peers, but you mounted the file system using the ipaddr:/path.

Could you please try the attached patch?

Thanks
Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
[email protected] | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax


Attachments:
(No filename) (572.00 B)
statd-hostname-fix (2.95 kB)

2006-02-17 10:04:24

by Greg Banks

Subject: Re: NFS lock reclaiming not working on SLES9 SP2

On Fri, 2006-02-17 at 20:56, Olaf Kirch wrote:
> It seems the problem is that we're now using hostnames to identify lockd
> peers, but you mounted the file system using the ipaddr:/path.

Indeed, well caught. I missed the significance of name=...

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.





2006-02-17 10:49:18

by asha yr

Subject: Re: Re: NFS lock reclaiming not working on SLES9 SP2

Hi,

I am comparing how the NFS lock recovery feature behaves across different updates of SLES9 and Redhat4, so I don't want to test the feature with any patch applied. Thanks for the patch anyway.

I tried mounting the NFS file system by hostname on the client. After sending the notifications, sm-notify clears the sm directory. The sm directory should then be repopulated with the client information, but this is not happening.

The debugging messages of lockd on server and client are:

on client side
Feb 17 16:10:07 sgmlx2 kernel: NFS lockd/statd started (ver 0.5).
Feb 17 16:10:23 sgmlx2 kernel: lockd: request from 0f46bfac
Feb 17 16:10:23 sgmlx2 kernel: lockd: nlm_host_rebooted("sgmlx1")
Feb 17 16:10:30 sgmlx2 kernel: lockd: nlm_lookup_host(0f46bfac, p=6, v=4, my role=client, name=sgmlx1)
Feb 17 16:10:30 sgmlx2 kernel: lockd: host garbage collection
Feb 17 16:10:30 sgmlx2 kernel: lockd: nlmsvc_mark_resources
Feb 17 16:10:30 sgmlx2 kernel: lockd: nlm_bind_host(0f46bfac)
Feb 17 16:11:00 sgmlx2 last message repeated 3 times
Feb 17 16:11:00 sgmlx2 kernel: lockd: release host sgmlx1
Feb 17 16:11:00 sgmlx2 kernel: lockd: get host sgmlx1
Feb 17 16:11:48 sgmlx2 kernel: lockd: request from 0f46bfac
Feb 17 16:11:48 sgmlx2 kernel: lockd: nlm_host_rebooted("sgmlx1")
Feb 17 16:11:48 sgmlx2 kernel: lockd: rebind host sgmlx1
Feb 17 16:11:48 sgmlx2 kernel: lockd: get host sgmlx1
Feb 17 16:11:48 sgmlx2 kernel: lockd: release host sgmlx1
Feb 17 16:11:48 sgmlx2 kernel: nlmsvc_retry_blocked(00000000, when=0)
Feb 17 16:11:48 sgmlx2 kernel: lockd: nlm_bind_host(0f46bfac)
Feb 17 16:11:48 sgmlx2 kernel: lockd: release host sgmlx1


on server side
Feb 17 21:47:34 sgmlx1 kernel: NFS lockd/statd started (ver 0.5).
Feb 17 21:47:52 sgmlx1 rpc.mountd: authenticated mount request from sgmlx2.india.hp.com:970 for /test (/test)
Feb 17 21:48:15 sgmlx1 kernel: lockd: request from 0f46bfad
Feb 17 21:48:30 sgmlx1 kernel: lockd: request from 0f46bfad
Feb 17 21:48:30 sgmlx1 kernel: nlmsvc_retry_blocked(00000000, when=0)
Feb 17 21:48:45 sgmlx1 kernel: lockd: request from 0f46bfad
Feb 17 21:48:45 sgmlx1 kernel: lockd: nlm_lookup_host(0f46bfad, p=6, v=4, my role=server, name=sgmlx2)
Feb 17 21:48:45 sgmlx1 kernel: lockd: host garbage collection
Feb 17 21:48:45 sgmlx1 kernel: lockd: nlmsvc_mark_resources
Feb 17 21:48:45 sgmlx1 kernel: lockd: nlm_file_lookup(06000001 01006800 00025489 00012f4e 00025489 0002da54)
Feb 17 21:48:45 sgmlx1 kernel: lockd: creating file for (06000001 01006800 00025489 00012f4e 00025489 0002da54)
Feb 17 21:48:45 sgmlx1 kernel: lockd: found file f5070e00 (count 0)
Feb 17 21:48:45 sgmlx1 kernel: lockd: nlmsvc_lock(cciss/c0d0p1/77646, ty=1, pi=6733, 0-9223372036854775807, bl=0)
Feb 17 21:48:45 sgmlx1 kernel: lockd: nlmsvc_lookup_block f=f5070e00 pd=6733 0-9223372036854775807 ty=1
Feb 17 21:48:45 sgmlx1 kernel: lockd: posix_lock_file returned 0
Feb 17 21:48:45 sgmlx1 kernel: lockd: release host sgmlx2
Feb 17 21:48:45 sgmlx1 kernel: lockd: nlm_release_file(f5070e00, ct = 1)
Feb 17 21:48:45 sgmlx1 kernel: nlmsvc_retry_blocked(00000000, when=0)
Feb 17 21:48:45 sgmlx1 kernel: nlmsvc_retry_blocked(00000000, when=0)
Feb 17 21:49:27 sgmlx1 kernel: lockd: nlmsvc_free_host_resources
Feb 17 21:49:27 sgmlx1 kernel: lockd: release host sgmlx2
Feb 17 21:49:33 sgmlx1 kernel: lockd: request from 0f46bfad
Feb 17 21:49:33 sgmlx1 kernel: lockd: nlm_lookup_host(0f46bfad, p=6, v=4, my role=server, name=sgmlx2)
Feb 17 21:49:33 sgmlx1 kernel: lockd: nlm_file_lookup(06000001 01006800 00025489 00012f4e 00025489 0002da54)
Feb 17 21:49:33 sgmlx1 kernel: lockd: creating file for (06000001 01006800 00025489 00012f4e 00025489 0002da54)
Feb 17 21:49:33 sgmlx1 kernel: lockd: found file f5070000 (count 0)
Feb 17 21:49:33 sgmlx1 kernel: lockd: nlmsvc_lock(cciss/c0d0p1/77646, ty=1, pi=6733, 0-9223372036854775807, bl=0)
Feb 17 21:49:33 sgmlx1 kernel: lockd: nlmsvc_lookup_block f=f5070000 pd=6733 0-9223372036854775807 ty=1
Feb 17 21:49:33 sgmlx1 kernel: lockd: posix_lock_file returned 0
Feb 17 21:49:33 sgmlx1 kernel: lockd: release host sgmlx2
Feb 17 21:49:33 sgmlx1 kernel: lockd: nlm_release_file(f5070000, ct = 1)
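
A note for reading the traces above: the eight hex digits in lines such as "lockd: request from 0f46bfac" appear to be the peer's IPv4 address, one byte per pair of digits. That reading is an assumption, but it matches the addresses given in the first post. A small helper (the function name is made up for this example) makes the conversion:

```python
import ipaddress


def lockd_hex_to_ip(hex_addr):
    """Convert an 8-digit hex address from a lockd debug line to dotted-quad form."""
    return str(ipaddress.IPv4Address(int(hex_addr, 16)))


print(lockd_hex_to_ip("0f46bfac"))  # 15.70.191.172, the server sgmlx1
print(lockd_hex_to_ip("0f46bfad"))  # 15.70.191.173, the client sgmlx2
```

So the client log shows requests arriving from the server, and the server log shows requests arriving from the client, as expected.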

Regards,
Asha

On Fri, 17 Feb 2006 Greg Banks wrote :
>On Fri, 2006-02-17 at 20:56, Olaf Kirch wrote:
> > It seems the problem is that we're now using hostnames to identify lockd
> > peers, but you mounted the file system using the ipaddr:/path.
>
>Indeed, well caught. I missed the significance of name=...
>
>Greg.
>--
>Greg Banks, R&D Software Engineer, SGI Australian Software Group.
>I don't speak for SGI.


Attachments:
(No filename) (4.92 kB)
(No filename) (5.72 kB)

2006-02-17 11:08:02

by Olaf Kirch

Subject: Re: Re: NFS lock reclaiming not working on SLES9 SP2

On Fri, Feb 17, 2006 at 10:48:42AM -0000, asha yr wrote:
> I am comparing the working of NFS lock recovery feature on different
> updates of SLES9 and Redhat4. So I dont want to test the feature by
> applying any patch. Anyway, thanks for the patch.

Hm, I thought posting to a developer list normally indicates willingness
to try out patches to narrow down the problem. That's why it is a free
service, unlike the business support we offer (the difference being that
they would build a test kernel for you, using this exact same patch :-)

> Feb 17 16:10:07 sgmlx2 kernel: NFS lockd/statd started (ver 0.5).
> Feb 17 16:10:23 sgmlx2 kernel: lockd: request from 0f46bfac
> Feb 17 16:10:23 sgmlx2 kernel: lockd: nlm_host_rebooted("sgmlx1")
> Feb 17 16:10:30 sgmlx2 kernel: lockd: nlm_lookup_host(0f46bfac, p=6, v=4, my role=client, name=sgmlx1)
> Feb 17 16:10:30 sgmlx2 kernel: lockd: host garbage collection
> Feb 17 16:10:30 sgmlx2 kernel: lockd: nlmsvc_mark_resources

Hm, it still doesn't find the host. Please provide the line in
/proc/mounts for this NFS mount.
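
For anyone following along, a rough sketch of pulling the NFS entries out of /proc/mounts and checking whether each was mounted by hostname or by IP address. The function names and the sample line are hypothetical, and the parsing assumes the usual "server:/path" device field with an IPv4 address or a hostname.

```python
import ipaddress


def nfs_mounts(mounts_text):
    """Yield (server, export, mountpoint) for each NFS entry in /proc/mounts-style text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields are: device, mount point, fs type, options, dump, pass.
        if len(fields) < 3 or fields[2] not in ("nfs", "nfs4"):
            continue
        server, sep, export = fields[0].partition(":")
        if not sep:
            continue
        yield server, export, fields[1]


def mounted_by_ip(server):
    """Return True if the server part of the device field is an IP literal."""
    try:
        ipaddress.ip_address(server)
        return True
    except ValueError:
        return False


# Hypothetical sample line; the mount point and options are made up.
sample = "15.70.191.172:/test /mnt/test nfs rw,v3,addr=15.70.191.172 0 0\n"
for server, export, mountpoint in nfs_mounts(sample):
    kind = "IP address" if mounted_by_ip(server) else "hostname"
    print(f"{mountpoint}: {server}:{export} (mounted by {kind})")
```

On a real client, the text would come from reading /proc/mounts instead of the sample string.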

Thanks,
Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
[email protected] | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax



2006-02-17 19:08:39

by Marc Eshel

Subject: Re: NFS lock reclaiming not working on SLES9 SP2

Hi Olaf,
This fix will not work for HA-NFS which allows a failover node to send the
SM_NOTIFY on behalf of the failed node.
The -v option was added for this purpose.
Marc.

[email protected] wrote on 02/17/2006 01:56:18 AM:

> On Fri, Feb 17, 2006 at 08:37:05PM +1100, Greg Banks wrote:
> > In his trace, the client wasn't sending any LOCK calls at all
> > for multiple minutes after receiving the NOTIFY.
>
> Ah, you're right. I wasn't paying attention to the time stamps.
>
> It seems the problem is that we're now using hostnames to identify lockd
> peers, but you mounted the file system using the ipaddr:/path.
>
> Could you please try the attached patch?
>
> Thanks
> Olaf
> --
> Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
> [email protected] | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax
> [attachment "statd-hostname-fix" deleted by Marc Eshel/Almaden/IBM]



2006-02-20 11:00:26

by Greg Banks

Subject: Re: Re: NFS lock reclaiming not working on SLES9 SP2

On Fri, 2006-02-17 at 21:48, asha yr wrote:
> I tried mounting NFS file system using hostname on client. After
> sending notifications, sm-notify clears sm directory. Then the sm
> directory should get updated freshly with the client information. This
> is not happening.
>
> The debugging messages of lockd on server and client are:
>
> on client side
> [...]
> Feb 17 16:10:23 sgmlx2 kernel: lockd: request from 0f46bfac
> Feb 17 16:10:23 sgmlx2 kernel: lockd: nlm_host_rebooted("sgmlx1")
> Feb 17 16:10:30 sgmlx2 kernel: lockd: nlm_lookup_host(0f46bfac, p=6,
> v=4, my role=client, name=sgmlx1)
> [...]

Works fine on SP3 (if you mount by hostname):

Feb 20 01:43:57 4A:cocky kernel: lockd: request from 860e364f
Feb 20 01:43:57 4A:cocky kernel: statd: NOTIFY called
Feb 20 01:43:57 4A:cocky kernel: lockd: nlm_host_rebooted("rosella")
Feb 20 01:43:57 4A:cocky kernel: lockd: rebind host rosella
Feb 20 01:43:57 4A:cocky kernel: NLM: reclaiming locks for host rosellalockd: get host rosella
Feb 20 01:43:57 4A:cocky kernel: lockd: release host rosella
Feb 20 01:43:57 4A:cocky kernel: nlmsvc_retry_blocked(0000000000000000, when=0)
Feb 20 01:43:57 4A:cocky kernel: nlmsvc_retry_blocked(0000000000000000, when=0)
Feb 20 01:43:57 4A:cocky kernel: lockd: call procedure 2 on rosella
Feb 20 01:43:57 4A:cocky kernel: lockd: nlm_bind_host(860e364f)
Feb 20 01:43:57 4A:cocky kernel: lockd: server returns status 0
Feb 20 01:43:57 4A:cocky kernel: lockd: release host rosella

Mounting by IP address is (unsurprisingly) as broken as SP2.
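
The visible difference between the two client traces in this thread is that the SP3 log has a "reclaiming locks for host" line after nlm_host_rebooted(), while the SP2 log does not. A rough sketch of checking a captured log for that pattern follows; the function name is made up, and treating that one message as the sign that reclaim started is an assumption based only on the logs shown here.

```python
import re

REBOOTED = re.compile(r'nlm_host_rebooted\("([^"]+)"\)')


def reclaim_started(log_lines):
    """Map each host named in nlm_host_rebooted() to whether a reclaim line followed."""
    result = {}
    for line in log_lines:
        match = REBOOTED.search(line)
        if match:
            result.setdefault(match.group(1), False)
            continue
        for host in result:
            # Prefix match: in the SP3 log above, the host name runs straight
            # into the next message ("rosellalockd: get host rosella").
            if "reclaiming locks for host " + host in line:
                result[host] = True
    return result


sp3 = [
    'Feb 20 01:43:57 4A:cocky kernel: lockd: nlm_host_rebooted("rosella")',
    "Feb 20 01:43:57 4A:cocky kernel: lockd: rebind host rosella",
    "Feb 20 01:43:57 4A:cocky kernel: NLM: reclaiming locks for host rosellalockd: get host rosella",
]
sp2 = [
    'Feb 17 16:11:48 sgmlx2 kernel: lockd: nlm_host_rebooted("sgmlx1")',
    "Feb 17 16:11:48 sgmlx2 kernel: lockd: rebind host sgmlx1",
    "Feb 17 16:11:48 sgmlx2 kernel: lockd: get host sgmlx1",
]
print(reclaim_started(sp3))  # {'rosella': True}
print(reclaim_started(sp2))  # {'sgmlx1': False}
```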

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.



