2004-03-09 04:39:44

by Lever, Charles

Subject: NSM lock recovery fails too often

the way things work today, NLM is in-kernel on most Linux systems,
and it uses an in-kernel equivalent of gethostname(3) to determine
the client's hostname for use when making NLM requests. NSM,
though, is still in user-land, and uses gethostbyname(3) to
determine the client's hostname. very often this results in
NSM using a different client hostname string than NLM, thus
causing lock recovery to fail.

NLM and NSM must use the same hostname string.
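
As a rough illustration of the mismatch described above (not taken from either patch), the sketch below compares the two lookups from user space. It assumes Python's socket.gethostname() is a fair stand-in for what the in-kernel lockd sees, and that the canonical name returned by socket.gethostbyname_ex() corresponds to what rpc.statd obtains from gethostbyname(3). The helper names are hypothetical, and the exact string comparison is an assumption about what lock recovery requires.

```python
import socket


def kernel_style_name() -> str:
    """Name as returned by gethostname(); assumed close to what in-kernel lockd uses."""
    return socket.gethostname()


def resolver_style_name(name: str) -> str:
    """Canonical name from a gethostbyname()-style lookup of `name`.

    Falls back to the input unchanged when the lookup fails
    (for example, when no resolver is reachable).
    """
    try:
        canonical, _aliases, _addrs = socket.gethostbyname_ex(name)
    except OSError:
        return name
    return canonical


def names_agree(nlm_name: str, nsm_name: str) -> bool:
    """Exact string comparison (an assumption about what recovery needs)."""
    return nlm_name == nsm_name


if __name__ == "__main__":
    nlm = kernel_style_name()
    nsm = resolver_style_name(nlm)
    print(f"gethostname():        {nlm}")
    print(f"gethostbyname() name: {nsm}")
    print("match" if names_agree(nlm, nsm) else "MISMATCH")
```

On a client whose resolver returns a fully qualified name while gethostname() returns the short name, the last line would report a mismatch.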

this is a real bug that many of NetApp's customers hit all
the time. the problem is exposed only after a client crashes
and recovers, not when it shuts down normally and reboots.

i attach two patches that accomplish a solution in different
ways.

first is a patch by Olaf Kirch against nfs-utils-1.0.1 that
adds an option to disable the extra gethostbyname(3) call in
rpc.statd. second is a reductionist approach -- just excise
that call entirely. the first patch allows backwards
compatibility with the user-level lockd, which nfs-utils still
contains. the second makes rpc.statd match the behavior of
the in-kernel lockd unconditionally.

perhaps the best solution is to use an option as Olaf's patch
does, but to make the default behavior match the in-kernel
lockd's behavior, not the user-level lockd's behavior. or,
maybe we use the second patch and simply remove the user
level lockd from nfs-utils.
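
A minimal sketch of the "keep the option, flip the default" idea, under stated assumptions: `match_kernel_lockd` is a hypothetical switch (not the option name from Olaf's patch), and `canonicalize` is an injected callable standing in for the gethostbyname(3) step in rpc.statd.

```python
from typing import Callable


def statd_client_name(
    local_name: str,
    canonicalize: Callable[[str], str],
    match_kernel_lockd: bool = True,
) -> str:
    """Pick the hostname string statd would use for this client.

    With the default, the gethostname()-style name is used untouched,
    so it matches what the in-kernel lockd sends. Turning the switch
    off restores the extra canonicalization step that the user-level
    lockd expects.
    """
    if match_kernel_lockd:
        return local_name
    return canonicalize(local_name)
```

Dropping the `match_kernel_lockd` parameter and always returning `local_name` would correspond to the second, reductionist patch.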

comments?


Attachments:
nfs-utils-1.0.1-local-hostname.patch (2.91 kB)
nfs-utils-1.0.6-no-ghbn.patch (703.00 B)

2004-03-09 14:31:10

by Olaf Kirch

Subject: Re: NSM lock recovery fails too often

On Tue, Mar 09, 2004 at 06:15:13AM -0800, Lever, Charles wrote:
> i thought the purpose of moving NSM into the kernel was
> to eliminate the need for any user-level programs because
> sometimes they don't get run at reboot..?

Yes and no. I think there's a difference, security-wise, between starting
an RPC service by default (such as statd) and running a small utility
that sends out NSM notifications and exits when it's done.

I also think these RPC upcalls from kernel to user land are awfully ugly,
and the entire NSM protocol is completely overengineered. For a kernel
statd, all you need is the NULL procedure and the ability to process
SM_NOTIFY messages. The rest is simply not implemented.
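
To make the "NULL plus SM_NOTIFY" point concrete, here is a hedged sketch of such a reduced dispatcher. The procedure numbers (0 for the NULL procedure, 6 for SM_NOTIFY) and the accept-status values follow the NSM and ONC RPC protocol definitions; `handle_nsm_call` and the `on_notify` callback are hypothetical names, and this is not the code from the kernel-statd patch.

```python
from typing import Callable, Optional, Tuple

# NSM procedure numbers
SM_NULL = 0
SM_NOTIFY = 6

# ONC RPC accept_stat values
RPC_SUCCESS = 0
RPC_PROC_UNAVAIL = 3


def handle_nsm_call(
    proc: int,
    args: Optional[Tuple[str, int]],
    on_notify: Callable[[str, int], None],
) -> int:
    """Dispatch one NSM call and return an RPC accept status.

    Only the NULL procedure and SM_NOTIFY are served; every other
    procedure (SM_MON, SM_UNMON, ...) is reported as unavailable.
    """
    if proc == SM_NULL:
        return RPC_SUCCESS
    if proc == SM_NOTIFY and args is not None:
        mon_name, state = args  # rebooted host's name and its new state number
        on_notify(mon_name, state)
        return RPC_SUCCESS
    return RPC_PROC_UNAVAIL
```

In a real server the `on_notify` hook would hand the rebooted host's name to lockd so it can start lock recovery for that peer.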

> how does your recovery program handle this case?

It's fairly stubborn. As long as it can open a socket, it will retry
notification for as long as you tell it to (15 minutes by default,
but you can change that)
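
The retry behaviour described above can be sketched roughly as follows. This is not the logic from sm-notify.c: the fixed `interval` between attempts is an assumption, `send_once` is a hypothetical callable that returns True once the peer has acknowledged the notification, and the clock and sleep functions are injectable only to make the loop easy to test.

```python
import time
from typing import Callable

DEFAULT_TIMEOUT = 15 * 60  # seconds; the 15-minute default mentioned above


def notify_until_deadline(
    send_once: Callable[[], bool],
    timeout: float = DEFAULT_TIMEOUT,
    interval: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry send_once() until it succeeds or `timeout` seconds have elapsed.

    Returns True on success and False if the deadline passed first.
    At least one attempt is always made.
    """
    deadline = clock() + timeout
    while True:
        if send_once():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A loop like this keeps trying through a slow DHCP configuration, as long as the network comes up before the deadline.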

I forgot to mention in my previous message that the kernel-statd patch
requires Andreas Gruenbacher's sunrpc patch that lets you register
several RPC services on a single socket (originally written for his
nfsacl implementation)

Olaf
--
Olaf Kirch | Stop wasting entropy - start using predictable
[email protected] | tempfile names today!
---------------+


-------------------------------------------------------
This SF.Net email is sponsored by: IBM Linux Tutorials
Free Linux tutorial presented by Daniel Robbins, President and CEO of
GenToo technologies. Learn everything from fundamentals to system
administration.http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2004-03-09 15:13:52

by Trond Myklebust

Subject: Re: NSM lock recovery fails too often

On Tue, 09/03/2004 at 09:22, Olaf Kirch wrote:

> I forgot to mention in my previous message that the kernel-statd patch
> requires Andreas Gruenbacher's sunrpc patch that lets you register
> several RPC services on a single socket (originally written for his
> nfsacl implementation)

For 2.6.x, see the rpc_clone_client() function...

Cheers,
Trond



2004-03-09 15:20:05

by Olaf Kirch

Subject: Re: NSM lock recovery fails too often

On Tue, Mar 09, 2004 at 10:04:32AM -0500, Trond Myklebust wrote:
> On Tue, 09/03/2004 at 09:22, Olaf Kirch wrote:
>
> > I forgot to mention in my previous message that the kernel-statd patch
> > requires Andreas Gruenbacher's sunrpc patch that lets you register
> > several RPC services on a single socket (originally written for his
> > nfsacl implementation)
>
> For 2.6.x, see the rpc_clone_client() function...

That's something different; rpc_clone_client is client side.

What I was referring to was the ability to have several RPC programs on a
single svc_sock server side; e.g. NFS and NFSACL, or NLM and NSM. Pretty
much the way svc_register from the good ole sunrpc code works.
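
The idea of several RPC programs sharing one server socket can be sketched as a dispatch table keyed by (program, version). The program numbers used in the usage example below (100021 for NLM, 100024 for NSM) are the standard assignments; the `SharedSocketServer` class and its methods are hypothetical and do not represent the kernel's svc_sock interface or Andreas Gruenbacher's patch.

```python
from typing import Callable, Dict, Tuple

# A handler takes (procedure number, raw argument bytes) and returns reply bytes.
Handler = Callable[[int, bytes], bytes]


class SharedSocketServer:
    """Routes calls arriving on one transport to the matching RPC program."""

    def __init__(self) -> None:
        self._programs: Dict[Tuple[int, int], Handler] = {}

    def register(self, prog: int, vers: int, handler: Handler) -> None:
        """Attach a handler for one (program, version) pair."""
        key = (prog, vers)
        if key in self._programs:
            raise ValueError(f"program {prog} version {vers} already registered")
        self._programs[key] = handler

    def dispatch(self, prog: int, vers: int, proc: int, payload: bytes) -> bytes:
        """Pass one decoded call to its program's handler.

        Raises LookupError when no handler is registered for the pair.
        """
        handler = self._programs.get((prog, vers))
        if handler is None:
            raise LookupError(f"program {prog} version {vers} unavailable")
        return handler(proc, payload)
```

For example, registering an NLM handler under (100021, 4) and an NSM handler under (100024, 1) on the same instance lets both be served from a single listening socket.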

Olaf
--
Olaf Kirch | Stop wasting entropy - start using predictable
[email protected] | tempfile names today!
---------------+



2004-03-09 15:57:57

by Trond Myklebust

Subject: Re: NSM lock recovery fails too often

On Tue, 09/03/2004 at 10:10, Olaf Kirch wrote:
> What I was referring to was the ability to have several RPC programs on a
> single svc_sock server side; e.g. NFS and NFSACL, or NLM and NSM. Pretty
> much the way svc_register from the good ole sunrpc code works.

Oh, sorry...

Hmm... Any reason why you couldn't put them all on port 2049 with this
patch? Would that be less efficient than having two sets of threads?

Cheers,
Trond



2004-03-09 16:08:44

by Olaf Kirch

Subject: Re: NSM lock recovery fails too often

On Tue, Mar 09, 2004 at 10:47:42AM -0500, Trond Myklebust wrote:
> On Tue, 09/03/2004 at 10:10, Olaf Kirch wrote:
> > What I was referring to was the ability to have several RPC programs on a
> > single svc_sock server side; e.g. NFS and NFSACL, or NLM and NSM. Pretty
> > much the way svc_register from the good ole sunrpc code works.
>
> Oh, sorry...
>
> Hmm... Any reason why you couldn't put them all on port 2049 with this
> patch? Would that be less efficient than having two sets of threads?

No, you could put all of them on the same port. The problem is that
you want lockd and nfsd to be two different processes, so you can
shut down nfsd without disrupting lockd.

Olaf
--
Olaf Kirch | Stop wasting entropy - start using predictable
[email protected] | tempfile names today!
---------------+



2004-03-09 11:05:14

by Olaf Kirch

Subject: Re: NSM lock recovery fails too often

Hi,

On Mon, Mar 08, 2004 at 08:30:45PM -0800, Lever, Charles wrote:
> perhaps the best solution is to use an option as Olaf's patch
> does, but to make the default behavior match the in-kernel
> lockd's behavior, not the user-level lockd's behavior. or,
> maybe we use the second patch and simply remove the user
> level lockd from nfs-utils.

I have continued working on the kernel statd, and it seems to be
reasonably functional now. I'm attaching my current kernel patch
and a user land utility for sending out the SM_NOTIFY calls
at reboot.

The kernel patch isn't 100% clean yet, as it breaks the non-
CONFIG_STATD case.

Olaf
--
Olaf Kirch | Stop wasting entropy - start using predictable
[email protected] | tempfile names today!
---------------+


Attachments:
(No filename) (761.00 B)
kernel-statd (26.75 kB)

2004-03-09 11:06:28

by Olaf Kirch

Subject: Re: NSM lock recovery fails too often

Here's the promised sm-notify utility.

Olaf
--
Olaf Kirch | Stop wasting entropy - start using predictable
[email protected] | tempfile names today!
---------------+


Attachments:
(No filename) (172.00 B)
sm-notify.c (11.76 kB)

2004-03-09 14:24:32

by Lever, Charles

Subject: RE: NSM lock recovery fails too often

hi olaf-

> Here's the promised sm-notify utility.

looks clean!

i thought the purpose of moving NSM into the kernel was
to eliminate the need for any user-level programs because
sometimes they don't get run at reboot..?

one of the issues we've observed with reboot notification
is that it fails entirely if the network stack on the client
hasn't been initialized when NSM starts. for example, if a
client is DHCP configured, and the DHCP service is slow,
the NSM reboot notification will run and fail before the
client's network has been configured. it also means that
the success of NSM reboot recovery is entirely dependent
on exactly when this program is run during start up.

how does your recovery program handle this case?

