2005-01-27 16:46:06

by Ara.T.Howard

Subject: debugging failed lock recovery


i've been out sick all week but am about to dig back into our failing lockd
recovery issues. if you recall, we have this issue:

~ client > get_and_hold_lock

~ server > cat /proc/locks # shows correct lock pid

~ client > reboot

~ client > get_and_hold_lock # fails!

~ server > cat /proc/locks # shows old lock pid (before reboot) still there!
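The thread does not show the source of get_and_hold_lock, so the following is only a minimal sketch of what such a test program might do, assuming it takes an exclusive POSIX (fcntl) lock, which is the kind of lock that lockd handles on an NFS mount. The function name is borrowed from the listing above; its parameters are hypothetical.

```python
import fcntl
import os
import time


def get_and_hold_lock(path, hold_seconds=0.0):
    """Take an exclusive, non-blocking POSIX lock on `path` and hold it.

    Returns the open file descriptor; the lock is held until the
    descriptor is closed or the process exits. Raises OSError if another
    process already holds a conflicting lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # lockf() issues an fcntl-style byte-range lock on the whole file.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        raise
    # Keep the lock for a while so it can be inspected on the server.
    time.sleep(hold_seconds)
    return fd
```

Run against a file on the NFS mount (for example a hypothetical /mnt/nfs/test.lock) with a long hold time, this should leave an entry visible in /proc/locks on the server for as long as the process lives.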

our clients AND servers are multihomed and each has iptables running. we have
opened ALL traffic between client and server in the firewalls - NOTHING is
blocked. still no go. we're still seeing lots of SM_UNMON errors in
/var/log/messages and i'm suspicious these are related to some multi-homed
issue that's contributing to our failing lock recovery - but this is an
un-educated hunch.

at this point we are talking about starting a tcpdump in the nfslock script
so we can see what's going on during boot/lock-recovery. my question is

- what should i expect to see? eg. what exactly should the client be doing
to recover its locks? which messages - which ports - etc. i'm after a
high level description here.

- any other tips (someone has suggested using ethereal) to determine the
problem?

sorry to drag this through the mud but our system is too fragile without
proper lock recovery to ignore and, unfortunately i'm not versed in debugging
these sorts of things.


kind regards.

-a
--
===============================================================================
| EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
| PHONE :: 303.497.6469
| When you do something, you should burn yourself completely, like a good
| bonfire, leaving no trace of yourself. --Shunryu Suzuki
===============================================================================


-------------------------------------------------------
This SF.Net email is sponsored by: IntelliVIEW -- Interactive Reporting
Tool for open source databases. Create drag-&-drop reports. Save time
by over 75%! Publish reports on the web. Export to DOC, XLS, RTF, etc.
Download a FREE copy at http://www.intelliview.com/go/osdn_nl
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs


2005-01-28 20:54:43

by Ara.T.Howard

Subject: RE: debugging failed lock recovery

On Fri, 28 Jan 2005, Lever, Charles wrote:

> ara-
>
> the client must have a fully qualified nodename (uname -n) which matches
> the result of a DNS lookup, in order for lock recovery to work.
>
> the nodename is set in /etc/sysconfig/network on red hat systems.

charles-

thanks for the lead!

yes. they all have that and they are in dns. however, our nfs runs on the
'backdoor': each client/server has two cards, one mated to

client.domain

and one to

client.b.domain

both are in dns.

*however* the output of uname -n is (of course) the frontdoor. so it seems we
have a mismatch. so uname -n *might* match the result of a dns lookup
depending on which interface is looked up and, i'm guessing, this will be the
backdoor interface and therefore fail.
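One way to test this suspicion is to compare the nodename with whatever the resolver returns for each interface address. The sketch below is an illustration only, not something from the thread: the helper names are hypothetical, and the resolver is passed in as a parameter so the check can also be tried without touching DNS.

```python
import socket


def names_for_address(addr, reverse_lookup=socket.gethostbyaddr):
    """Return every name the resolver reports for `addr`, lowercased."""
    primary, aliases, _ = reverse_lookup(addr)
    return {primary.lower(), *(alias.lower() for alias in aliases)}


def nodename_matches(addr, nodename=None, reverse_lookup=socket.gethostbyaddr):
    """True if the nodename (uname -n) is one of the names for `addr`."""
    if nodename is None:
        # gethostname() returns the same value that uname -n prints.
        nodename = socket.gethostname()
    return nodename.lower() in names_for_address(addr, reverse_lookup)
```

Calling nodename_matches() with the backdoor interface's address on a host whose nodename is the frontdoor name would be expected to return False, which is the mismatch described above.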

anyone have ideas on what to do about this? can nfs be run on a multihomed
server on an interface NOT named the same as 'uname -n'??

kind regards.

-a

2005-01-28 19:10:29

by Lever, Charles

Subject: RE: debugging failed lock recovery

hardwiring the results of gethostbyname on the client and server by
seeding entries in /etc/hosts might be a start.

i believe there is also an option on statd to always use a particular
nodename (-n ?).


2005-01-28 22:27:39

by Ara.T.Howard

Subject: RE: debugging failed lock recovery

On Fri, 28 Jan 2005, Lever, Charles wrote:

> hardwiring the results of gethostbyname on the client and server by
> seeding entries in /etc/hosts might be a start.

hmmm. seems like mapping the backdoor to the frontdoor name (uname -n) would
cause other problems wouldn't it?

> i believe there is also an option on statd to always use a particular
> nodename (-n ?).

this sounds promising. i'll look into it.

thanks for all the help!

-a

2005-02-01 22:42:16

by Ara.T.Howard

Subject: [SOLVED] RE: debugging failed lock recovery

On Fri, 28 Jan 2005, Lever, Charles wrote:

> i believe there is also an option on statd to always use a particular
> nodename (-n ?).

as it turns out this was indeed the issue.

summary of our problem:

- client obtains lock

- client reboots

- client cannot re-obtain lock (lockd recovery failure)


summary of setup:

- all nfs clients and servers were multi-homed, with front-door and back-door
interfaces like client.domain and client.b.domain.

- all nfs clients and servers were running iptables. holes must be open for
rpc.statd, etc. in general we allowed all traffic between client and
server on the backdoor.

summary of solution:

- rpc.statd reports to the server during lock recovery. it uses the output of
gethostname (uname -n) by default. in this case the client would attempt
lock recovery using this hostname and the server would refuse since it
expected to see the name of the backdoor interface (client.b.domain).

rpc.statd needs to be started using the '-n' (name) option to override the
output of gethostname. in our case (redhat) this is done by putting
something like the following

STATD_HOSTNAME=client.b.ngdc.noaa.gov

into the file /etc/sysconfig/nfs, which is itself sourced by
/etc/init.d/nfslock.
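To make the effect of that setting concrete, here is a rough Python sketch of the step the init script performs: read STATD_HOSTNAME from a sysconfig-style file and, if it is set, pass it to rpc.statd with '-n'. The real init script is a shell script and its exact logic is not shown in this thread, so the function names and parsing rules below are assumptions for illustration only.

```python
import shlex


def read_sysconfig(path):
    """Parse simple NAME=value lines, skipping blank lines and comments."""
    settings = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            name, _, value = line.partition("=")
            # shlex strips shell-style quoting around the value.
            parts = shlex.split(value)
            settings[name.strip()] = parts[0] if parts else ""
    return settings


def statd_command(settings):
    """Build the rpc.statd argument list, adding -n when a name is set."""
    command = ["rpc.statd"]
    hostname = settings.get("STATD_HOSTNAME")
    if hostname:
        command += ["-n", hostname]
    return command
```

With the line shown above in the file, statd_command(read_sysconfig("/etc/sysconfig/nfs")) would yield rpc.statd -n client.b.ngdc.noaa.gov; with the variable unset, rpc.statd starts with its default behaviour and falls back to gethostname.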

lockd recovery now operates correctly. thanks to all who helped!

kind regards.

-a

2005-02-01 22:57:06

by Lever, Charles

Subject: RE: [SOLVED] RE: debugging failed lock recovery

for future reference: this is generically described in the NFS FAQ.

http://nfs.sourceforge.net/index.php#faq_d7


