2007-04-19 07:05:01

by NeilBrown

Subject: Re: [Cluster-devel] [PATCH 0/4 Revised] NLM - lock failover

On Tuesday April 17, [email protected] wrote:
>
> In short, my vote is taking this (NLM) patch set and let people try it
> out while we switch our gear to look into other NFS V3 failover issues
> (nfsd in particular). Neil ?

I agree with Christoph that we should do it properly.
That doesn't mean that we need a complete solution. But we do want to
make sure to avoid any design decisions that we might not want to be
stuck with. Sometimes that's unavoidable, but let's try a little
harder for the moment.

One thing that has been bothering me is that sometimes the
"filesystem" (in the guise of an fsid) is used to talk to the kernel
about failover issues (when flushing locks or restarting the grace
period) and sometimes the local network address is used (when talking
with statd).

I would rather use a single identifier. In my previous email I was
leaning towards using the filesystem as the single identifier. Today
I'm leaning the other way - to using the local network address.

It works like this:

We have a module parameter for lockd something like
"virtual_server".
If that is set to 0, none of the following changes are effective.
If it is set to 1:

The destination address for any lockd request becomes part of the
key to find the nsm_handle.
The my_name field in SM_MON requests and SM_UNMON requests is set
to a textual representation of that destination address.
The reply to SM_MON (currently completely ignored by all versions
of Linux) has an extra value which indicates how many more seconds
of grace period there is to go. This can be stuffed into res_stat
maybe.
Places where we currently check 'nlmsvc_grace_period', get moved to
*after* the nlmsvc_retrieve_args call, and the grace_period value
is extracted from host->nsm.

This is the full extent of the kernel changes.
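A rough user-space model of the lookup change may make it concrete. All names here (nsm_key, virtual_server, nsm_set_grace) are illustrative stand-ins, not the real kernel symbols:

```c
#include <string.h>
#include <time.h>

/* Stand-in for the proposed lockd module parameter. */
static int virtual_server = 1;

struct nsm_key {
    char client_addr[64];   /* peer (NFS client) address */
    char dest_addr[64];     /* local address the request arrived on */
};

/* With virtual_server=1 the destination address participates in the
 * lookup, so the same client reached via two virtual IPs gets two
 * independent nsm_handles (and two independent grace periods). */
int nsm_key_match(const struct nsm_key *a, const struct nsm_key *b)
{
    if (strcmp(a->client_addr, b->client_addr) != 0)
        return 0;
    if (!virtual_server)
        return 1;
    return strcmp(a->dest_addr, b->dest_addr) == 0;
}

/* The SM_MON reply carries "seconds of grace remaining"; lockd would
 * store an absolute expiry in the nsm_handle and consult it after
 * nlmsvc_retrieve_args, per request rather than globally. */
struct nsm_state { time_t grace_expiry; };

void nsm_set_grace(struct nsm_state *nsm, long secs_remaining, time_t now)
{
    nsm->grace_expiry = now + secs_remaining;
}

int nsm_in_grace(const struct nsm_state *nsm, time_t now)
{
    return now < nsm->grace_expiry;
}
```

The point of the per-handle expiry is that one virtual address can be in grace while the others serve normally.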

To remove old locks, we arrange for the callbacks registered with
statd for the relevant clients to be called.
To set the grace period, we make sure statd knows about it and it
will return the relevant information to lockd.
    To notify clients of the need to reclaim locks, we simply use the
information stored by statd, which contains the local network
address.
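Sketched as a user-space model (the struct and function names are hypothetical, but the idea is that statd's monitor records keep my_name as the local address the client mounted through, so one failed-over address selects exactly its clients):

```c
#include <string.h>

/* One SM_MON record as statd would keep it under this scheme. */
struct mon_record {
    const char *mon_name;  /* client to monitor/notify */
    const char *my_name;   /* local (virtual) address the client used */
};

/* Collect the clients whose locks arrived via 'moved_addr': these are
 * the ones to send SM_NOTIFY to, and whose server-side locks to drop
 * via the registered callbacks.  Returns the number found. */
int clients_for_address(const struct mon_record *tab, int n,
                        const char *moved_addr,
                        const char **out, int max)
{
    int found = 0;
    for (int i = 0; i < n && found < max; i++)
        if (strcmp(tab[i].my_name, moved_addr) == 0)
            out[found++] = tab[i].mon_name;
    return found;
}
```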

The only aspect of this that gives me any cause for concern is
overloading the return value for SM_MON. Possibly it might be cleaner
to define an SM_MON2 with different args or whatever.
As this interface is entirely local to the one machine, and as it can
quite easily be kept back-compatible, I think the concept is fine.

Statd would need to pass the my_name field to the ha callout rather
than replacing it with "127.0.0.1", but other than that I don't think
any changes are needed to statd (though I haven't thought through that
fully yet).

Comments?

NeilBrown

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs


2007-04-24 03:20:36

by Wendy Cheng

Subject: Re: [Cluster-devel] [PATCH 0/4 Revised] NLM - lock failover

Neil Brown wrote:

>One thing that has been bothering me is that sometimes the
>"filesystem" (in the guise of an fsid) is used to talk to the kernel
>about failover issues (when flushing locks or restarting the grace
>period) and sometimes the local network address is used (when talking
>with statd).
>
>

This is a perception issue - it depends on how the design is described.
More on this later.

>I would rather use a single identifier. In my previous email I was
>leaning towards using the filesystem as the single identifier. Today
>I'm leaning the other way - to using the local network address.
>
>
I guess you're juggling too many things and have forgotten why we came
down this route? We started the discussion using the network interface
(to drop the locks) but found it wouldn't work well on local filesystems
such as ext3. There is really no control over which local (server-side)
interface NFS clients will use (though it shouldn't be hard to implement
one). When the fail-over server starts to remove the locks, it needs a
way to find *all* of the locks associated with the will-be-moved
partition. This is to allow umount to succeed. The server IP address
alone can't guarantee that. That was the reason we switched to fsid.
Also remember this is NFS v2/v3 - clients have no knowledge of server
migration.

Now, let's move back to the first paragraph. An active-active failover
can be described as a five-step process:

Step 1. Quiesce the floating network address.
Step 2. Move the exported filesystem directories from Server A to Server B.
Step 3. Re-enable the network interface.
Step 4. Inform clients about the changes via the NSM (Network Status
Monitor) protocol.
Step 5. Grace period.

I was told last week that, independent of lockd, some cluster
filesystems have their own implementation of the grace period. It is on
the wish list that this feature be taken into consideration. IMHO, the
overall process should be viewed as a collaboration between the
filesystem, the network interface, and the NFS protocol itself. Mixing
filesystem and network operations is unavoidable.

On the other hand, the currently proposed interface is extensible - say,
prefix a non-numerical string "DEV" or "UUID" to ask for dropping locks,
as in:
shell> echo "DEV12390" > /proc/fs/nfsd/nlm_unlock

or allow an individual grace period of 10 seconds, as in:
shell> echo "1234@10" > nlm_set_grace_for_fsid
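A parser for such strings could look like the sketch below. The formats ("DEV<num>", "UUID<num>", "<fsid>@<secs>") follow the examples above; nothing here is an existing kernel interface, and the names are made up:

```c
#include <stdlib.h>
#include <string.h>

enum target_kind { TK_FSID, TK_DEV, TK_UUID, TK_BAD };

/* Classify an nlm_unlock string: bare number = fsid, or a "DEV"/"UUID"
 * prefix followed by a number.  Stores the numeric part in *id. */
enum target_kind parse_unlock_target(const char *s, unsigned long *id)
{
    char *end;
    enum target_kind kind = TK_FSID;

    if (strncmp(s, "DEV", 3) == 0)       { kind = TK_DEV;  s += 3; }
    else if (strncmp(s, "UUID", 4) == 0) { kind = TK_UUID; s += 4; }

    *id = strtoul(s, &end, 10);
    return (end == s || *end != '\0') ? TK_BAD : kind;
}

/* "<fsid>@<secs>": per-filesystem grace period of <secs> seconds.
 * Returns 0 on success, -1 on a malformed string. */
int parse_grace(const char *s, unsigned long *fsid, unsigned long *secs)
{
    char *end;
    *fsid = strtoul(s, &end, 10);
    if (end == s || *end != '@')
        return -1;
    s = end + 1;
    *secs = strtoul(s, &end, 10);
    return (end == s || *end != '\0') ? -1 : 0;
}
```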

With the above said, some of the following flow confuses me; comments
are inlined below.

>It works like this:
>
> We have a module parameter for lockd something like
> "virtual_server".
> If that is set to 0, none of the following changes are effective.
> If it is set to 1:
>
>
ok with me ...

> The destination address for any lockd request becomes part of the
> key to find the nsm_handle.
>
>

As explained above, the address alone can't guarantee that the
associated locks get cleaned up for one particular filesystem.

> The my_name field in SM_MON requests and SM_UNMON requests is set
> to a textual representation of that destination address.
>
>

That's what the current patch does.

> The reply to SM_MON (currently completely ignored by all versions
> of Linux) has an extra value which indicates how many more seconds
> of grace period there is to go. This can be stuffed into res_stat
> maybe.
> Places where we currently check 'nlmsvc_grace_period', get moved to
> *after* the nlmsvc_retrieve_args call, and the grace_period value
> is extracted from host->nsm.
>
>
ok with me, though I don't see the advantage?

> This is the full extent of the kernel changes.
>
> To remove old locks, we arrange for the callbacks registered with
> statd for the relevant clients to be called.
> To set the grace period, we make sure statd knows about it and it
> will return the relevant information to lockd.
> To notify clients of the need to reclaim locks, we simply use the
> information stored by statd, which contains the local network
> address.
>
>

I'm lost here... help ?

>The only aspect of this that gives me any cause for concern is
>overloading the return value for SM_MON. Possibly it might be cleaner
>to define an SM_MON2 with different args or whatever.
>As this interface is entirely local to the one machine, and as it can
>quite easily be kept back-compatible, I think the concept is fine.
>
>
Agree !

>Statd would need to pass the my_name field to the ha callout rather
>than replacing it with "127.0.0.1", but other than that I don't think
>any changes are needed to statd (though I haven't thought through that
>fully yet).
>
>

That's what the current patch does.

>Comments?
>
>
>
>
I feel we're going in circles again... If there is any way I can shorten
this discussion, please do let me know.

-- Wendy



2007-04-19 15:14:45

by Wendy Cheng

Subject: Re: [Cluster-devel] [PATCH 0/4 Revised] NLM - lock failover

Neil Brown wrote:
> On Tuesday April 17, [email protected] wrote:
>
>> In short, my vote is taking this (NLM) patch set and let people try it
>> out while we switch our gear to look into other NFS V3 failover issues
>> (nfsd in particular). Neil ?
>>
>
> I agree with Christoph that we should do it properly.
> That doesn't mean that we need a complete solution. But we do want to
> make sure to avoid any design decisions that we might not want to be
> stuck with. Sometimes that's unavoidable, but let's try a little
> harder for the moment.
>

As with any code review, once you set personal feelings aside, at the
end of the day you start to appreciate some of the harsh-looking
comments. This is definitely one of those moments. I agree we should try
harder.

NFS failover has been a difficult subject. There is a three-year-old Red
Hat bugzilla asking for this feature, plus a few others marked as
duplicates. Reading through the comments last night, I feel strongly
that we should put restrictions on the implementation to avoid dragging
users through another three years.

> One thing that has been bothering me is that sometimes the
> "filesystem" (in the guise of an fsid) is used to talk to the kernel
> about failover issues (when flushing locks or restarting the grace
> period) and sometimes the local network address is used (when talking
> with statd).
>
> I would rather use a single identifier. In my previous email I was
> leaning towards using the filesystem as the single identifier. Today
> I'm leaning the other way - to using the local network address.
>
> It works like this:
>
> We have a module parameter for lockd something like
> "virtual_server".
> If that is set to 0, none of the following changes are effective.
> If it is set to 1:
>
> The destination address for any lockd request becomes part of the
> key to find the nsm_handle.
> The my_name field in SM_MON requests and SM_UNMON requests is set
> to a textual representation of that destination address.
> The reply to SM_MON (currently completely ignored by all versions
> of Linux) has an extra value which indicates how many more seconds
> of grace period there is to go. This can be stuffed into res_stat
> maybe.
> Places where we currently check 'nlmsvc_grace_period', get moved to
> *after* the nlmsvc_retrieve_args call, and the grace_period value
> is extracted from host->nsm.
>
> This is the full extent of the kernel changes.
>
> To remove old locks, we arrange for the callbacks registered with
> statd for the relevant clients to be called.
> To set the grace period, we make sure statd knows about it and it
> will return the relevant information to lockd.
> To notify clients of the need to reclaim locks, we simply use the
> information stored by statd, which contains the local network
> address.
>
> The only aspect of this that gives me any cause for concern is
> overloading the return value for SM_MON. Possibly it might be cleaner
> to define an SM_MON2 with different args or whatever.
> As this interface is entirely local to the one machine, and as it can
> quite easily be kept back-compatible, I think the concept is fine.
>
> Statd would need to pass the my_name field to the ha callout rather
> than replacing it with "127.0.0.1", but other than that I don't think
> any changes are needed to statd (though I haven't thought through that
> fully yet).
>
> Comments?
>
>
I need some time to look into the ramifications ... comments will follow soon.

-- Wendy

