2006-06-12 05:25:43

by Wendy Cheng

Subject: [RFC] NLM lock failover admin interface

NFS v2/v3 active-active NLM lock failover has been an issue for our
cluster suite. With the current implementation, the cluster suite
carries the workaround as far as it can with user-mode scripts that,
upon failover, on the taken-over server:

1. Tear down the virtual IP.
2. Unexport the subject NFS export.
3. Signal lockd to drop the locks.
4. Unmount the filesystem if needed.

There are many other issues (such as the /var/lib/nfs/statd/sm file,
etc.), but this particular post is about further refining step 3 to
avoid the 50-second global (default) grace period for all NFS exports;
i.e., we would like to be able to selectively drop only the locks
associated with the requested exports, without disrupting other NFS
services.

We've done some prototype coding but would like to seek community
consensus on the admin interface if possible. We've tried out the
following:

1. A /proc interface: writing the fsid into a /proc directory entry
would drop all NLM locks associated with the NFS export that carries
that fsid in its /etc/exports entry.

2. Adding a new flag into "exportfs" command, say "h", such that

"exportfs -uh *:/export_path"

would un-export the entry and drop the NLM locks associated with the
entry.

3. Adding a new nfsctl call that reuses a 2.4 kernel flag
(NFSCTL_FOLOCKS) and takes:

struct nfsctl_folocks {
	int		type;
	unsigned int	fsid;
	unsigned int	devno;
};

as the input argument. Depending on "type", the kernel call would drop
the locks associated with either the fsid or the devno, along the lines
of the sketch below.
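
For illustration, here is a minimal sketch of how the kernel side might
dispatch on "type". The NLM_FO_* constants and the nlmsvc_failover_*()
helpers are assumptions made up for this sketch, not existing kernel
interfaces:

#define NLM_FO_FSID	1	/* match locks by export fsid   */
#define NLM_FO_DEVNO	2	/* match locks by device number */

static int nfsctl_folocks(struct nfsctl_folocks *arg)
{
	switch (arg->type) {
	case NLM_FO_FSID:	/* drop locks by export fsid */
		return nlmsvc_failover_fsid(arg->fsid);
	case NLM_FO_DEVNO:	/* drop locks by device number */
		return nlmsvc_failover_devno(arg->devno);
	default:
		return -EINVAL;
	}
}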

The core of the implementation is a new, cloned version of
nlm_traverse_files() that searches the "nlm_files" list entry by
entry, comparing the fsid (or devno) derived from the
nlm_file.f_handle field. A helper function is also implemented to
extract the fsid (or devno) from f_handle.

The new function is planned to let the failover abort if a file can't
be closed. We may also put the file locks back if an abort occurs.
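
As a rough sketch (not the posted patch), the cloned traversal could
look something like the following. nlm_fh_to_fsid() and
nlm_release_locks() are assumed helpers, and the list walk is
simplified - the real nlm_files table is hashed and also has to deal
with blocked and share locks:

/* assumed helper: knfsd encodes the fsid inside fsid-type file
 * handles, so it can be pulled back out here; a real version must
 * check the handle version and fsid type first */
static unsigned int nlm_fh_to_fsid(struct nfs_fh *fh)
{
	struct knfsd_fh *kfh = (struct knfsd_fh *)fh->data;

	return kfh->fh_fsid[0];
}

int nlmsvc_failover_fsid(unsigned int fsid)
{
	struct nlm_file *file;
	int i;

	down(&nlm_file_sema);	/* lock name varies across 2.6 versions */
	for (i = 0; i < FILE_NRHASH; i++) {
		for (file = nlm_files[i]; file; file = file->f_next) {
			if (nlm_fh_to_fsid(&file->f_handle) != fsid)
				continue;
			/* drop every lock held on this file; abort the
			 * failover if the file can't be closed */
			if (nlm_release_locks(file)) {	/* assumed */
				up(&nlm_file_sema);
				return -EIO;
			}
		}
	}
	up(&nlm_file_sema);
	return 0;
}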

We would appreciate comments on the above admin interfaces. As soon as
the external interface is finalized, the code will be submitted for
review.

-- Wendy


2006-06-14 11:36:05

by Christoph Hellwig

Subject: Re: [NFS] [RFC] NLM lock failover admin interface

On Wed, Jun 14, 2006 at 02:54:51AM -0400, Wendy Cheng wrote:
> Hi,
>
> KABI (kernel application binary interface) commitment is a big thing
> from our end - so I would like to focus more on the interface agreement
> before jumping into coding and implementation details.

Please stop this crap now. If you don't get that there is no
kernel-internal ABI and there never will be, get a different job ASAP.

2006-06-14 13:39:04

by Wendy Cheng

Subject: Re: [NFS] [RFC] NLM lock failover admin interface

On Wed, 2006-06-14 at 12:36 +0100, Christoph Hellwig wrote:
> On Wed, Jun 14, 2006 at 02:54:51AM -0400, Wendy Cheng wrote:
> > Hi,
> >
> > KABI (kernel application binary interface) commitment is a big thing
> > from our end - so I would like to focus more on the interface agreement
> > before jumping into coding and implementation details.
>
> Please stop this crap now. If you don't get that there is no
> kernel-internal ABI and there never will be, get a different job ASAP.

Actually, I don't quite understand this statement (sorry! English is
not my native language), but it is OK. People are entitled to different
opinions, and I respect yours.

On the technical side, just a precaution: in case we need to touch some
kernel export symbols, it would be nice to have the external (and
admin) interfaces decided before we start to code.

So I'll not talk about this, and I assume we can keep focusing on the
NLM issues. No more noise from each other. Fair?

-- Wendy

2006-06-14 14:00:54

by Wendy Cheng

Subject: Re: Re: [NFS] [RFC] NLM lock failover admin interface

On Wed, 2006-06-14 at 02:54 -0400, Wendy Cheng wrote:

>
> Assume we still have this on the table.... Could I expect the admin
> interface to go through the rpc.lockd command (man page and nfs-utils
> code changes)? The modified command would take options similar to
> rpc.statd's; more specifically, -n, -o, and -p (see "man rpc.statd").
> To pass the individual IP (socket address) to the kernel, we'll need
> nfsctl with struct nfsctl_svc modified.

I want to make sure people catch this. Here we're talking about NFS
system-call interface changes. We need either a new NFS syscall or an
alteration to the existing nfsctl_svc structure.
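
For concreteness, one possible shape for that change is sketched below;
the svc_ipaddr field is the hypothetical addition, while the first two
fields are the existing struct nfsctl_svc from
include/linux/nfsd/syscall.h:

struct nfsctl_svc {
	unsigned short	svc_port;	/* existing: port to listen on */
	int		svc_nthreads;	/* existing: number of threads */
	unsigned int	svc_ipaddr;	/* proposed: IPv4 address this
					 * lockd instance should serve */
};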

-- Wendy

>
> For the kernel piece, since we're there anyway, could we have the
> individual lockd IP interface passed to SM (statd) in the SM_MON call?
> This would allow statd to structure its SM files based on each lockd
> IP address, an important part of lock recovery.
>
> > One is to register a callback when an interface is shut down.
>
> Haven't checked out the (Linux) socket interface yet. I'm very fuzzy
> on how this can be done. Anyone have good ideas?
>
> > Another (possibly the best) is to arrange a new signal for lockd
> > which says "Drop any locks which were sent to IP addresses that are
> > no longer valid local addresses".
>
> Very appealing - but the devil's always in the details. How do we
> decide which IP addresses are no longer valid? Or how does lockd know
> about these IP addresses? And how do we associate one particular IP
> address with the "struct nlm_file" entries within the nlm_files list?
> I need a few more days to sort this out (or does anyone already have
> ideas in mind?).
>
> -- Wendy

by William A.(Andy) Adamson

Subject: Re: [NFS] Re: [RFC] NLM lock failover admin interface

This discussion has centered around removing the locks of an export.
We also want the interface to be able to remove the locks owned by a
single client. This is needed to enable client migration between
replicas or between nodes in a cluster filesystem. It is not acceptable
to place an entire export in grace just to move a small number of
clients.
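
A rough sketch of that per-client operation, assuming an address-based
host lookup (nlm_find_host_by_addr() is invented for illustration,
while nlm_traverse_files() and NLM_ACT_UNLOCK are the existing fs/lockd
pieces):

static int nlmsvc_failover_client(__u32 client_addr)
{
	struct nlm_host *host;

	/* assumed lookup: find the nlm_host entry for this client */
	host = nlm_find_host_by_addr(client_addr);
	if (host == NULL)
		return -ENOENT;

	/* walk nlm_files and release only this host's locks */
	return nlm_traverse_files(host, NLM_ACT_UNLOCK);
}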

-->Andy

[email protected] said:
> On Wed, 2006-06-14 at 02:54 -0400, Wendy Cheng wrote:
>
> Assume we still have this on the table.... Could I expect the admin
> interface to go through the rpc.lockd command (man page and nfs-utils
> code changes)? The modified command would take options similar to
> rpc.statd's; more specifically, -n, -o, and -p (see "man rpc.statd").
> To pass the individual IP (socket address) to the kernel, we'll need
> nfsctl with struct nfsctl_svc modified.
>
> I want to make sure people catch this. Here we're talking about NFS
> system-call interface changes. We need either a new NFS syscall or an
> alteration to the existing nfsctl_svc structure.

> -- Wendy

2006-06-15 15:09:41

by Wendy Cheng

[permalink] [raw]
Subject: Re: [NFS] Re: [RFC] NLM lock failover admin interface

William A.(Andy) Adamson wrote:

>This discussion has centered around removing the locks of an export.
>We also want the interface to be able to remove the locks owned by a
>single client. This is needed to enable client migration between
>replicas or between nodes in a cluster filesystem. It is not
>acceptable to place an entire export in grace just to move a small
>number of clients.

Andy,

Gotcha ... I forgot about NFS v4. BTW, the discussion has moved back to
the /proc interface. I agree we need to add one more layer of
granularity to it. Glad you caught this flaw.
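
To make the granularity concrete, here is a sketch of a 2.6-era /proc
write handler that accepts both forms - an export's fsid and a single
client's address. The file name, the command syntax, and the
nlmsvc_failover_*() helpers are all assumptions:

/* e.g. echo "fsid 1234"       > /proc/fs/nfsd/nlm_drop_locks
 *      echo "client 10.0.0.5" > /proc/fs/nfsd/nlm_drop_locks */
static int nlm_drop_write(struct file *file, const char __user *buffer,
			  unsigned long count, void *data)
{
	char kbuf[64];
	int err;

	if (count >= sizeof(kbuf))
		return -EINVAL;
	if (copy_from_user(kbuf, buffer, count))
		return -EFAULT;
	kbuf[count] = '\0';

	if (strncmp(kbuf, "fsid ", 5) == 0)
		err = nlmsvc_failover_fsid(simple_strtoul(kbuf + 5, NULL, 0));
	else if (strncmp(kbuf, "client ", 7) == 0)
		err = nlmsvc_failover_client(in_aton(kbuf + 7));
	else
		err = -EINVAL;
	return err ? err : count;
}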

-- Wendy

2006-06-12 06:11:04

by Wendy Cheng

Subject: Re: [RFC] NLM lock failover admin interface

On Mon, 2006-06-12 at 01:25 -0400, Wendy Cheng wrote:
> NFS v2/v3 active-active NLM lock failover has been an issue for our
> cluster suite. With the current implementation, the cluster suite
> carries the workaround as far as it can with user-mode scripts that,
> upon failover, on the taken-over server:
>
> 1. Tear down the virtual IP.
> 2. Unexport the subject NFS export.
> 3. Signal lockd to drop the locks.
> 4. Unmount the filesystem if needed.
>
> There are many other issues (such as the /var/lib/nfs/statd/sm file,
> etc.), but this particular post is about further refining step 3 to
> avoid the 50-second global (default) grace period for all NFS exports;
> i.e., we would like to be able to selectively drop only the locks
> associated with the requested exports, without disrupting other NFS
> services.
>
> We've done some prototype coding but would like to seek community
> consensus on the admin interface if possible.

While ping-ponging emails with our base kernel folks to choose
between /proc, exportfs, and nfsctl (internally within the company,
mostly with steved and staubach), Peter suggested trying out multiple
lockd(s) to handle different NFS exports. In that case, we might have
to change a big portion of the lockd kernel code. I prefer not to go
that far, since lockd failover is our cluster suite's immediate issue.
However, if this approach can get everyone's vote, we'll comply.

-- Wendy

2006-06-12 15:00:57

by J. Bruce Fields

Subject: Re: [Linux-cluster] [RFC] NLM lock failover admin interface

On Mon, Jun 12, 2006 at 01:25:43AM -0400, Wendy Cheng wrote:
> 2. Adding a new flag into "exportfs" command, say "h", such that
>
> "exportfs -uh *:/export_path"
>
> would un-export the entry and drop the NLM locks associated with the
> entry.

What does the kernel interface end up looking like in that case?

--b.



2006-06-12 16:21:45

by Madhan P

Subject: Re: [Linux-cluster] [RFC] NLM lock failover admin interface

For what it's worth, I would second this approach of using a flag to
unexport and associating the cleanup with that. Another quick hack we
used was to store the NSM entries in a standard location on the
respective exported filesystem, so that notification is sent once the
filesystem comes back online on the destination server and is exported
again. BTW, this was not on Linux. It was a simple solution providing
the necessary active/active and active/passive cluster support.

- Madhan

>>> On 6/12/2006 at 9:14:55 pm, in message
<[email protected]>, Wendy Cheng
<[email protected]> wrote:
> J. Bruce Fields wrote:
>
>> On Mon, Jun 12, 2006 at 01:25:43AM -0400, Wendy Cheng wrote:
>>
>>> 2. Adding a new flag into "exportfs" command, say "h", such that
>>>
>>>    "exportfs -uh *:/export_path"
>>>
>>> would un-export the entry and drop the NLM locks associated with the
>>> entry.
>>
>> What does the kernel interface end up looking like in that case?
>
> Happy to see this new exportfs command gets a positive response - it
> was our original pick too.
>
> Uploaded is part of a draft version of the 2.4 base kernel patch -
> we're cleaning up the 2.6 patches at this moment. It basically adds a
> new export flag (NFSEXP_FOLOCK - note that ex_flags is an int but is
> currently only defined up to 16 bits) so nfs-utils and the kernel can
> communicate.
>
> The nice thing about this approach is the recovery part - the
> take-over server can use the counterpart command to export and set the
> grace period for one particular interface within the same system call.
>
> -- Wendy
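
Reading the quoted description, the flag plumbing would presumably look
roughly like the sketch below; the bit value and the hook into the
unexport path are illustrative guesses, not the posted patch:

#define NFSEXP_FOLOCK	0x10000	/* proposed: first bit above the 16
				 * currently defined export flags */

static int exp_unexport_drop_locks(struct nfsctl_export *nxp)
{
	int err = exp_unexport(nxp);	/* existing unexport path */

	/* if user space set the new flag, also drop the NLM locks */
	if (err == 0 && (nxp->ex_flags & NFSEXP_FOLOCK))
		err = nlmsvc_failover_devno(nxp->ex_dev); /* assumed */
	return err;
}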




2006-06-12 17:23:17

by Steve Dickson

Subject: Re: [Linux-cluster] [RFC] NLM lock failover admin interface

Wendy Cheng wrote:
> The nice thing about this approach is the recovery part - the
> take-over server can use the counterpart command to export and set the
> grace period for one particular interface within the same system call.
Actually this is a pretty clean and simple interface... imho..
The only issue I had was adding a flag to an older version and then
having to carry that flag forward... So if this interface is
accepted and added to the mainline nfs-utils (which it should be..
imho), the fact that it is so clean and simple would make the
back-porting fairly trivial...

steved.



2006-06-12 17:27:18

by James Yarbrough

Subject: Re: [RFC] NLM lock failover admin interface

> 2. Adding a new flag into "exportfs" command, say "h", such that
>
> "exportfs -uh *:/export_path"
>
> would un-export the entry and drop the NLM locks associated with the
> entry.

This is fine for releasing the locks, but how do you plan to re-enter
the grace period for reclaiming the locks when you relocate the export?
And how do you intend to segregate the exports for which reclaims are
valid from the ones for which they are not? How do you plan to support
the sending of SM_NOTIFY? This might be where a lockd per export has an
advantage.

--
[email protected]
650 933 3124

Why is there a snake in my Coke?



2006-06-12 18:09:30

by Wendy Cheng

Subject: Re: [NFS] [RFC] NLM lock failover admin interface

Madhan P wrote:

>For what it's worth, I would second this approach of using a flag to
>unexport and associating the cleanup with that.
>

Happy to have another vote :) ! It is appreciated.

> Another quick hack we
>used was to store the NSM entries in a standard location on the
>respective exported filesystem, so that notification is sent once the
>filesystem comes back online on the destination server and is exported
>again. BTW, this was not on Linux. It was a simple solution providing
>the necessary active/active and active/passive cluster support.
>
>

Lon Hohberger (from our cluster suite team) has been working on a
similar setup too (to structure the NSM file directory). We'll submit
the associated kernel patch when it is ready ("rpc.statd -H" needs some
bandaids). Future reviews and comments are also appreciated.

-- Wendy

2006-06-12 19:07:23

by Wendy Cheng

Subject: Re: [NFS] [RFC] NLM lock failover admin interface

James Yarbrough wrote:

>>2. Adding a new flag into "exportfs" command, say "h", such that
>>
>> "exportfs -uh *:/export_path"
>>
>>would un-export the entry and drop the NLM locks associated with the
>>entry.
>>
>>
>
>This is fine for releasing the locks, but how do you plan to re-enter
>the grace period for reclaiming the locks when you relocate the export?
>And how do you intend to segregate the exports for which reclaims are
>valid from the ones for which they are not? How do you plan to support
>the sending of SM_NOTIFY? This might be where a lockd per export has an
>advantage.
>
>
>
Yeah, that's why Peter's idea (multiple lockd(s)) is also attractive.
However, on the practical side, we don't plan to introduce kernel
patches aggressively. The approach is to stay away from the mainline
NLM code base until we have enough QA cycles to make sure things work.
The unexport part would leave the other NFS services on the taken-over
server uninterrupted. On the take-over server side, we currently do a
global grace period. The plan has been to delay a little before fixing
the take-over server's logic, due to other NLM/POSIX lock issues - for
example, the current (Linux) NLM doesn't bother to call the
filesystem's lock method (which virtually disables any cluster
filesystem's NFS locking across different NFS servers). However, if we
have enough resources and/or volunteers, we may do these things in
parallel. The following are planned:

Take-over server logic:
1. Set up the statd sm files (currently /var/lib/nfs/statd/sm, or the
   equivalent configured directory) properly.
2. Dispatch rpc.statd with the "--ha-callout" option.
3. Implement the ha-callout user-mode program to create separate
   statd sm files for each exported IP.
4. Export the target filesystem and set up a grace period based on
   fsid (or devno). NLM procedure calls will use it by extracting
   the fsid (or devno) from the NFS file handle to decide whether
   to accept or reject non-reclaim requests (see the sketch below).
5. Bring up the failover IP address.
6. Send SM_NOTIFY to the client machines using the configured sm
   directory created by the ha-callout program (rpc.statd -N -P).

Step 4 will be the counterpart of our unexport flag.
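
A sketch of how step 4's check might sit in the NLM lock path;
nlm_fh_to_fsid() and nlm_fsid_in_grace() are assumed helpers, while
argp->reclaim and the nlm_lck_* status codes are existing lockd pieces:

static u32 nlmsvc_grace_check(struct nlm_args *argp)
{
	unsigned int fsid = nlm_fh_to_fsid(&argp->lock.fh);

	/* per-fsid grace: only reclaims are allowed through while the
	 * re-exported filesystem is in its grace period */
	if (nlm_fsid_in_grace(fsid) && !argp->reclaim)
		return nlm_lck_denied_grace_period;
	return nlm_granted;
}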

-- Wendy



2006-06-13 03:18:03

by NeilBrown

Subject: Re: [RFC] NLM lock failover admin interface

On Monday June 12, [email protected] wrote:
> NFS v2/v3 active-active NLM lock failover has been an issue for our
> cluster suite. With the current implementation, the cluster suite
> carries the workaround as far as it can with user-mode scripts that,
> upon failover, on the taken-over server:
>
> 1. Tear down the virtual IP.
> 2. Unexport the subject NFS export.
> 3. Signal lockd to drop the locks.
> 4. Unmount the filesystem if needed.
>
...
> we would
> like to be able to selectively drop only the locks associated with
> the requested exports, without disrupting other NFS services.

There seems to be an unstated assumption here that there is one
virtual IP per exported filesystem. Is that true?

Assuming it is and that I understand properly what you want to do....

I think that maybe the right thing to do is *not* to drop the locks on
a particular filesystem, but to drop the locks made to a particular
virtual IP.

Then it would make a lot of sense to have one lockd thread per IP, and
signal the lockd in order to drop the locks.
True: that might be more code. But if it is the right thing to do,
then it should be done that way.

On the other hand, I can see value in removing all the locks for a
particular filesystem quite independent of failover requirements.
If I want to force-unmount a filesystem, I need to unexport it, and I
need to kill all the locks. Currently you can only remove locks from
all filesystems at once, which might not be ideal.

I'm not at all keen on the NFSEXP_FOLOCK flag to exp_unexport, as that
is an interface that I would like to discard eventually. The
preferred mechanism for exporting filesystems is to flush the
appropriate 'cache', and allow it to be repopulated with whatever is
still valid via upcalls to mountd.

So:
I think if we really want to "remove all NFS locks on a filesystem",
we could probably tie it into umount - maybe have lockd register some
callback which gets called just before s_op->umount_begin.

If we want to remove all locks that arrived on a particular
interface, then we should arrange to do exactly that. There are a
number of different options here.
One is the multiple-lockd-threads idea.
One is to register a callback when an interface is shut down.
Another (possibly the best) is to arrange a new signal for lockd
which says "Drop any locks which were sent to IP addresses that are
no longer valid local addresses".
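
A rough sketch of that last option, assuming lockd recorded the
destination (server-side) address on each nlm_host - the h_dst field,
is_local_addr(), and the hosts walk are invented for illustration,
since lockd does not currently keep that bookkeeping:

static void nlmsvc_drop_stale_addrs(void)
{
	struct nlm_host *host;
	int i;

	for (i = 0; i < NLM_HOST_NRHASH; i++) {
		for (host = nlm_hosts[i]; host; host = host->h_next) {
			/* is_local_addr() stands in for a real check
			 * against the current local address list */
			if (!is_local_addr(host->h_dst))
				nlm_traverse_files(host, NLM_ACT_UNLOCK);
		}
	}
}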

So those are my thoughts. Do any of them seem reasonable to you?

NeilBrown




2006-06-13 07:00:11

by Wendy Cheng

Subject: Re: [NFS] [RFC] NLM lock failover admin interface

On Tue, 2006-06-13 at 13:17 +1000, Neil Brown wrote:

> So:
> I think if we really want to "remove all NFS locks on a filesystem",
> we could probably tie it into umount - maybe have lockd register some
> callback which gets called just before s_op->umount_begin.

The "umount_begin" idea was one time on my list but got discarded. The
thought was that nfsd was not a filesystem, neither was lockd. How to
register something with VFS umount for non-filesystem kernel modules ?
Invent another autofs-like pseudo filesystem ? Mostly, not every
filesystem would like to get un-mounted upon failover (GFS, for example,
does not get un-mounted by our cluster suite upon failover).

> If we want to remove all locks that arrived on a particular
> interface, then we should arrange to do exactly that. There are a
> number of different options here.
> One is the multiple-lockd-threads idea.

Certainly a good option. To make it happen, we still need an admin
interface. How do we pass the IP address from user mode into the
kernel - care to give some suggestions if you have them handy? Should
socket ports get dynamically assigned? Will we have scalability issues?

> One is to register a callback when an interface is shut down.
> Another (possibly the best) is to arrange a new signal for lockd
> which says "Drop any locks which were sent to IP addresses that are
> no longer valid local addresses".

These, again, give individual filesystems no freedom to adjust what
they need upon failover. But I'll check them out this week - maybe
there are good socket-layer hooks that I have overlooked.

>
> So those are my thoughts. Do any of them seem reasonable to you?
>

The comments are greatly appreciated. And hopefully we can reach
agreement soon.

-- Wendy

2006-06-13 07:08:25

by NeilBrown

Subject: Re: [RFC] NLM lock failover admin interface

On Tuesday June 13, [email protected] wrote:
> > One is to register a callback when an interface is shut down.
> > Another (possibly the best) is to arrange a new signal for lockd
> > which says "Drop any locks which were sent to IP addresses that are
> > no longer valid local addresses".
>
> These, again, give individual filesystems no freedom to adjust what
> they need upon failover. But I'll check them out this week - maybe
> there are good socket-layer hooks that I have overlooked.
>

Can you say more about what sort of adjustments an individual filesystem
might want the freedom to make? It might help me understand the
issues better.

Thanks,
NeilBrown



2006-06-14 06:54:51

by Wendy Cheng

Subject: Re: [NFS] [RFC] NLM lock failover admin interface

Hi,

KABI (kernel application binary interface) commitment is a big thing
from our end - so I would like to focus more on the interface agreement
before jumping into coding and implementation details.

> One is the multiple-lockd-threads idea.

Assume we still have this on the table.... Could I expect the admin
interface to go through the rpc.lockd command (man page and nfs-utils
code changes)? The modified command would take options similar to
rpc.statd's; more specifically, -n, -o, and -p (see "man rpc.statd").
To pass the individual IP (socket address) to the kernel, we'll need
nfsctl with struct nfsctl_svc modified.

For the kernel piece, since we're there anyway, could we have the
individual lockd IP interface passed to SM (statd) in the SM_MON call?
This would allow statd to structure its SM files based on each lockd IP
address, an important part of lock recovery.
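
One hedged guess at what that could look like: the existing fields
below mirror the 2.6-era struct nsm_args in fs/lockd/mon.c, and
my_addr is the hypothetical addition:

struct nsm_args {
	u32	addr;		/* existing: client address to monitor */
	u32	prog;		/* existing: NLM callback program */
	u32	vers;
	u32	proc;
	char	*mon_name;	/* existing: client's caller name */
	u32	my_addr;	/* proposed: lockd server IP the client
				 * holds locks against, so statd can keep
				 * per-IP sm directories */
};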

> One is to register a callback when an interface is shut down.

Haven't checked out the (Linux) socket interface yet. I'm very fuzzy on
how this can be done. Anyone have good ideas?

> Another (possibly the best) is to arrange a new signal for lockd
> which says "Drop any locks which were sent to IP addresses that are
> no longer valid local addresses".

Very appealing - but the devil's always in the details. How do we
decide which IP addresses are no longer valid? Or how does lockd know
about these IP addresses? And how do we associate one particular IP
address with the "struct nlm_file" entries within the nlm_files list?
I need a few more days to sort this out (or does anyone already have
ideas in mind?).

-- Wendy