2006-06-15 04:27:01

by NeilBrown

Subject: Re: [NFS] [RFC] NLM lock failover admin interface

On Wednesday June 14, [email protected] wrote:
> Hi,
>
> KABI (kernel application binary interface) commitment is a big thing
> from our end - so I would like to focus more on the interface agreement
> before jumping into coding and implementation details.
>

Before we can agree on an interface, we need to be clear what
functionality is required.

You started out suggesting that the required functionality was to
"remove all locks that lockd holds on a particular filesystem".

I responded that I suspect a better functionality was "remove all
locks that lockd holds on behalf of a particular IP address".

You replied that such an approach

> give[s] individual filesystem no freedom to adjust what they
> need upon failover.

I asked:
> Can you say more about what sort of adjustments an individual filesystem
> might want the freedom to make? It might help me understand the
> issues better.

and am still waiting for an answer. Without an answer, I still lean
towards an IP-address-based approach, and the reply from James
Yarbrough seems to support that (though I don't want to read too much
into his comments).

Lockd is not currently structured to associate locks with
server-ip-addresses. There is an assumption that one client may talk
to any of the IP addresses that the server supports. This is clearly
not the case for the failover scenario that you are considering, so a
little restructuring might be in order.

Some locks will be held on behalf of a client, no matter what
interface the requests arrive on. Other locks will be held on behalf
of a client and tied to a particular server IP address. Probably the
easiest way to make this distinction is as a new nfsd export flag.

So, maybe something like this:

Add a 'struct sockaddr_in' to 'struct nlm_file'.
If nlm_fopen returns (say) 3, then treat it as success, and
also copy rqstp->rq_addr into that 'sockaddr_in'.
Define a new file in the 'nfsd' filesystem into which an IP
address can be written, and which calls some new lockd
function that releases all locks held for that IP address.
Probably get nlm_lookup_file to insist that if the sockaddr_in
is defined in a lock, it must match the one in rqstp.
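
Very roughly, and just as a sketch (the field and function names here
are illustrative, not an actual patch):

    /* Sketch only: bind an nlm_file to the server address the
     * request arrived on, when the export asks for it. */
    struct nlm_file {
        /* ... existing fields (f_handle, f_file, ...) ... */
        struct sockaddr_in f_saddr;    /* server IP, if bound */
    };

    /* nlm_fopen() returning (say) 3 still means success, but also
     * tells nlm_lookup_file() to record the destination address: */
    static void nlm_bind_file_addr(struct nlm_file *file,
                                   struct svc_rqst *rqstp)
    {
        memcpy(&file->f_saddr, &rqstp->rq_addr, sizeof(file->f_saddr));
    }

    /* New lockd entry point, driven from the new 'nfsd' file: */
    void nlmsvc_release_by_ip(struct sockaddr_in *saddr);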

Does that sound OK?


> > One is the multiple-lockd-threads idea.
>
> Assume we still have this on the table.... Could I expect the admin
> interface to go through the rpc.lockd command (man page and nfs-utils code
> changes)? The modified command would take options similar to rpc.statd's;
> more specifically, -n, -o, and -p (see "man rpc.statd"). To pass the
> individual IP (socket address) to the kernel, we'll need nfsctl with struct
> nfsctl_svc modified.

I'm losing interest in the multiple-lockd-threads approach myself (for
the moment anyway :-)
However I would be against trying to re-use rpc.lockd - that was a
mistake that is best forgotten.
If the above approach were taken, then I don't think you need anything
more than
echo aa.bb.cc.dd > /proc/fs/nfsd/vserver_unlock
(or whatever), though if you really want to wrap that in a shell
script, that might be OK.
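
For what it's worth, the write handler behind such a file could be
quite small (again only a sketch; the file name and the lockd helper
are placeholders):

    /* Sketch of a write handler for /proc/fs/nfsd/vserver_unlock.
     * Parses a dotted-quad address and asks lockd to drop all locks
     * held on its behalf.  nlmsvc_release_by_ip() is the
     * hypothetical helper from the sketch above. */
    static ssize_t write_vserver_unlock(struct file *file, char *buf,
                                        size_t size)
    {
        struct sockaddr_in saddr = { .sin_family = AF_INET };

        if (size == 0 || buf[size - 1] != '\n')
            return -EINVAL;
        buf[size - 1] = '\0';

        saddr.sin_addr.s_addr = in_aton(buf);  /* a.b.c.d -> __be32 */
        nlmsvc_release_by_ip(&saddr);
        return size;
    }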

>
> For the kernel piece, since we're there anyway, could we have the
> individual lockd IP interface passed to SM (statd) (in SM_MON call) ?
> This would allow statd to structure its SM files based on each lockd IP
> address, an important part of lock recovery.
>

Maybe.... but I don't get the scenario.
Surely the SM files are only needed when the server restarts, and in
that case it needs to notify all clients... Or is it that you want to
make sure the notification comes from the right IP address.... I guess
that would make sense. Is that what you are after?


> > One is to register a callback when an interface is shut down.
>
> Haven't checked out the (Linux) socket interface yet. I'm very fuzzy on how
> this can be done. Does anyone have good ideas?

No good ideas, but I have a feeling there is a callback we could use.
However, I think I am going off this idea.

>
> > Another (possibly the best) is to arrange a new signal for lockd
> > which say "Drop any locks which were sent to IP addresses that are
> > no longer valid local addresses".
>
> Very appealing - but the devil's always in the details. How do we decide
> which IP address is no longer valid? Or how does lockd know about these
> IP addresses? And how do we associate one particular IP address with the
> "struct nlm_file" entries within the nlm_files list? I need a few more days
> to sort this out (or does anyone already have ideas in mind?).

See above.

NeilBrown


2006-06-15 06:39:24

by Wendy Cheng

Subject: Re: [NFS] [RFC] NLM lock failover admin interface

On Thu, 2006-06-15 at 14:27 +1000, Neil Brown wrote:

> You started out suggesting that the required functionality was to
> "remove all locks that lockd holds on a particular filesystem".

I didn't make this clear. No, we don't want to "remove all locks
associated with a particular filesystem". We want to "remove all locks
associated with an NFS service" - one NFS service is normally associated
with one NFS export. For example, say in /etc/exports:

/mnt/export_fs/dir_1 *(fsid=1,async,rw)
/mnt/export_fs/dir_2 *(fsid=2,async,rw)

The same filesystem (export_fs) is exported via two entries, each with
its own fsid. The fsid is eventually encoded as part of the filehandle
stored in "struct nlm_file" and linked into the global nlm_files list.

This allows not only active-active failover (for local filesystems
such as ext3) but also load balancing for cluster filesystems (such as
GFS).

In reality, each NFS service is associated with one virtual IP. The
failover and load-balancing tasks are carried out by moving the virtual
IP around - so I'm ok with the idea of "remove all locks that lockd
holds on behalf of a particular IP address".

>
> Lockd is not currently structured to associate locks with
> server-ip-addresses. There is an assumption that one client may talk
> to any of the IP addresses that the server supports. This is clearly
> not the case for the failover scenario that you are considering, so a
> little restructuring might be in order.
>
> Some locks will be held on behalf of a client, no matter what
> interface the requests arrive on. Other locks will be held on behalf
> of a client and tied to a particular server IP address. Probably the
> easiest way to make this distinction is as a new nfsd export flag.

We're very close now - note that I originally proposed adding a new nfsd
export flag (NFSEXP_FOLOCKS) so we can OR it into the export's ex_flags
upon un-export. If the new action flag is set, a new sub-call added to the
kernel unexport routine will walk through the nlm_files list to find the
matching entries (matched by either fsid or devno, taken from the
filehandle within the nlm_file struct) and release their locks.

The ex_flags field is an "int" but currently only the low 16 bits are
used, so my new export flag is defined as: NFSEXP_FOLOCKS 0x00010000.
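
So, concretely (again a sketch - NFSEXP_FOLOCKS is my proposed name,
not an existing flag, and the lockd helper is hypothetical):

    /* Proposed flag: first bit above the 16 currently in use. */
    #define NFSEXP_FOLOCKS 0x00010000

    /* Sketch of the check in the kernel unexport path. */
    static void nfsd_unexport_locks(struct nfsctl_export *nxp)
    {
        if (nxp->ex_flags & NFSEXP_FOLOCKS)
            nlm_release_locks_for_export(nxp);
    }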

>
> So, maybe something like this:
>
> Add a 'struct sockaddr_in' to 'struct nlm_file'.
> If nlm_fopen returns (say) 3, then treat it as success, and
> also copy rqstp->rq_addr into that 'sockaddr_in'.
> Define a new file in the 'nfsd' filesystem into which an IP
> address can be written, and which calls some new lockd
> function that releases all locks held for that IP address.
> Probably get nlm_lookup_file to insist that if the sockaddr_in
> is defined in a lock, it must match the one in rqstp.

Yes, we definitely can do this, but there is a "BUT" from our end. What I
did in my prototype code is take the filehandle from the nlm_file structure
and yank the fsid (or devno) out of it (so we didn't need to know the
socket address). With your approach above, adding a new field to
"struct nlm_file" to hold the socket address, sadly, violates our KABI
policy.
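
For illustration, the prototype does roughly the following (simplified:
the real nlm_files table is hash-based, and the fsid-extraction and
lock-release helpers here are made up):

    /* Sketch of the prototype: walk the nlm_files list and release
     * the locks of every entry whose filehandle carries the given
     * fsid.  fh_fsid_of() stands in for the real filehandle
     * decoding; nlm_release_file_locks() is hypothetical. */
    static void nlm_release_locks_by_fsid(u32 fsid)
    {
        struct nlm_file *file;

        down(&nlm_file_sema);
        list_for_each_entry(file, &nlm_files, f_list) {
            if (fh_fsid_of(&file->f_handle) == fsid)
                nlm_release_file_locks(file);
        }
        up(&nlm_file_sema);
    }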

I learnt my lesson. Forget KABI for now. Let me see what you have in the
next paragraph (so I know how to respond ...)

>
>
> > > One is the multiple-lockd-threads idea.
>
> I'm losing interest in the multiple-lockd-threads approach myself (for
> the moment anyway :-)
> However I would be against trying to re-use rpc.lockd - that was a
> mistake that is best forgotten.
> If the above approach were taken, then I don't think you need anything
> more than
> echo aa.bb.cc.dd > /proc/fs/nfsd/vserver_unlock
> (or whatever), though if you really want to wrap that in a shell
> script, that might be OK.

This is funny - so we go back to /proc. OK with me :) but you may want
to re-think my exportfs command approach. Want me to go over the
unexport flow again? The idea is to add a new user-mode flag, say "-h".
If you unexport the interface as:

shell> exportfs -u *:/export_path // nothing happens, old behavior

but if you do:

shell> exportfs -hu *:/export_path  // the kernel code would walk thru
                                    // the nlm_files list to release
                                    // the locks.

The "-h" "OR" 0x0001000 into ex_flags field of struct nfsctl_export so
kernel can know what to do. With fsid (or devno) in filehandle within
nlm_file, we don't need socket address at all.
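
On the nfs-utils side that would be a one-line change before the
unexport call (sketch):

    /* Sketch: the one change "exportfs -hu" needs before the
     * unexport syscall - OR in the proposed flag so the kernel
     * drops the locks too. */
    static void mark_for_lock_release(struct nfsctl_export *exp)
    {
        exp->ex_flags |= NFSEXP_FOLOCKS;  /* 0x00010000, as above */
    }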

But again, I'm OK with the /proc approach. However, with the /proc
approach we may need the socket address (since not every export uses
fsid, and the devno is not easy to get).

Do we agree now? In a single sentence: I prefer my original "exportfs -hu"
approach, but I'm OK with /proc if you insist.


>
> >
> > For the kernel piece, since we're there anyway, could we have the
> > individual lockd IP interface passed to SM (statd) (in SM_MON call) ?
> > This would allow statd to structure its SM files based on each lockd IP
> > address, an important part of lock recovery.
> >
>
> Maybe.... but I don't get the scenario.
> Surely the SM files are only needed when the server restarts, and in
> that case it needs to notify all clients... Or is it that you want to
> make sure the notification comes from the right IP address.... I guess
> that would make sense. Is that what you are after?

Yes! Right now, lockd doesn't pass the specific server address (that the
client connects to) to statd. I don't know how the "-H" option can ever
work - consider this a bug. If you forget what "rpc.statd -H" does, check
out the man page (man rpc.statd).

Thank you for the patience - I'm grateful.

-- Wendy