2006-08-14 05:48:33

by Wendy Cheng

Subject: [PATCH 2/5] NLM failover - per fs grace period



Attachments:
gfs_nlm_igrace.patch (13.45 kB)

2006-08-18 09:49:34

by Greg Banks

Subject: Re: [PATCH 2/5] NLM failover - per fs grace period

On Mon, 2006-08-14 at 16:00, Wendy Cheng wrote:
> This change enables per NFS-export entry lockd grace period.[...]

>
> +/* Server fsid linked list for NLM lock failover */
> +struct nlm_serv {
> +        struct nlm_serv* s_next;        /* linked list */
> +        unsigned long s_grace_period;   /* per fsid grace period */
> +        int s_fsid;                     /* export fsid */
> +};
> +

The name of this structure appears to be left over from your
previous approach; it doesn't really represent a server anymore.
Giving the structure, the list, and the lock that protects it
similar and appropriate names might be nice.

Also, the s_grace_period field isn't actually a period, it's
the future expiry time expressed in jiffies. The field name
and comment are both confusing.
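
For instance, giving everything a common prefix and making the
expiry explicit (all names here purely illustrative):

        /* Per-fsid grace period entry for NLM lock failover */
        struct nlm_fo_grace {
                struct nlm_fo_grace *g_next;    /* nlm_fo_grace_list */
                unsigned long g_expires;        /* expiry time, in jiffies */
                int g_fsid;                     /* export fsid */
        };

        static struct nlm_fo_grace *nlm_fo_grace_list;
        static DEFINE_SPINLOCK(nlm_fo_grace_lock);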

> +int
> +nlmsvc_fo_setgrace(int fsid)
> +{
> +        struct nlm_serv *per_fsid, *entry;
> +
> +        /* allocate the entry */
> +        per_fsid = kmalloc(sizeof(struct nlm_serv), GFP_KERNEL);
> +        if (per_fsid == NULL) {
> +                printk("lockd: nlmsvc_fo_setgrace kmalloc fails\n");
> +                return(-ENOMEM);
> +        }
> +

These actions:

# echo 1234 > /proc/fs/nfsd/nlm_set_igrace
# echo 1234 > /proc/fs/nfsd/nlm_set_igrace

result in two separate nlm_serv structures being allocated and
stored in the global list. That doesn't make sense; a filesystem
should have at most one grace period.
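
A simple fix would be to scan the list under the lock before
linking the new entry, and just refresh the expiry time if the
fsid is already present. A sketch, using your existing names
(and assuming per_fsid's fields are fully filled in by this
point):

        spin_lock(&nlm_fo_lock);
        for (entry = nlm_servs; entry; entry = entry->s_next) {
                if (entry->s_fsid == fsid) {
                        /* already listed: refresh expiry, drop ours */
                        entry->s_grace_period = per_fsid->s_grace_period;
                        spin_unlock(&nlm_fo_lock);
                        kfree(per_fsid);
                        return 0;
                }
        }
        /* not found: fall through and link per_fsid as before */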

> +        /* link into the global list */
> +        spin_lock(&nlm_fo_lock);
> +
> +        entry = nlm_servs;
> +        per_fsid->s_next = entry;
> +        nlm_servs = per_fsid;

Why use open-coded list manipulation when Linux has an adequate
list package in <linux/list.h>?
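
i.e. something like this (a sketch; it keeps your struct name but
swaps the open-coded pointer for a list_head):

        #include <linux/list.h>

        struct nlm_serv {
                struct list_head s_list;        /* on the global list */
                unsigned long s_grace_period;
                int s_fsid;
        };

        static LIST_HEAD(nlm_servs);

        /* in nlmsvc_fo_setgrace(): */
        spin_lock(&nlm_fo_lock);
        list_add(&per_fsid->s_list, &nlm_servs);
        spin_unlock(&nlm_fo_lock);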

> +
> +        /* done */
> +        spin_unlock(&nlm_fo_lock);
> +        return 0;
> +}
> +
> +/* nlm_servs gargabe collection

s/gargabe/garbage/

> + * - caller should hold nlm_ip_mutex

This comment is stale, you don't have an nlm_ip_mutex anymore.

> +void
> +nlmsvc_fo_reset_servs()
> +{
> [...]
> +        return;

This "return" is superfluous.

> +int
> +nlmsvc_fo_check(struct nfs_fh *fh)
> +{
> +        struct nlm_serv **e_top, *e_this, *e_purge=NULL;
> +        int rc=0, this_fsid, not_found;
> +
> +        spin_lock(&nlm_fo_lock);
> +
> +        /* no failover entry */
> +        if (!(e_this = nlm_servs))
> +                goto nlmsvc_fo_check_out;
> +
> +        /* see if this fh has fsid */
> +        not_found = nlm_fo_get_fsid(fh, &this_fsid);
> +        if (not_found)
> +                goto nlmsvc_fo_check_out;

You could do this outside the critical section.
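
e.g.:

        /* resolve fh -> fsid before taking the lock */
        if (nlm_fo_get_fsid(fh, &this_fsid))
                return 0;       /* no fsid, nothing to check */

        spin_lock(&nlm_fo_lock);
        /* list walk as before */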

> +
> +        /* check to see whether this_fsid is in nlm_servs list */
> +        e_top = &nlm_servs;
> +        while (e_this) {
> +                if (time_before(e_this->s_grace_period, jiffies)) {
> +                        dprintk("lockd: fsid=%d grace period expires\n",
> +                                e_this->s_fsid);
> +                        e_purge = e_this;
> +                        break;
> +                } else if (e_this->s_fsid == this_fsid) {
> +                        dprintk("lockd: fsid=%d in grace period\n",
> +                                e_this->s_fsid);
> +                        rc = 1;
> +                }
> +                e_top = &(e_this->s_next);
> +                e_this = e_this->s_next;
> +        }
> +
> +        /* piggy back nlm_servs garbage collection */
> +        if (e_purge) {
> +                *e_top = NULL;
> +                __nlm_servs_gc(e_purge);
> +        }
> +
> +

So...if you find an expired entry, you delete entries from the global
list starting at that entry and continuing to the end. Presumably the
assumption here is that the list is sorted in decreasing order of expiry
time, because you only prepend to the list in nlmsvc_fo_setgrace().

However, that's wrong: the expiry times returned from set_grace_period()
don't have to monotonically increase. Both nlm_grace_period and
nlm_timeout may be changed by sysctls, so the length of the grace
period can vary. If it varies downwards between two writes to the
set_igrace file, the list may not be in the order you expect.

Also, if your userspace program went berserk and started writing
random fsids into the set_igrace file, they wouldn't be purged
until lockd is shut down or the first NLM call arrives *after*
their grace expires. This has the potential for a kernel-memory
denial-of-service attack. Perhaps, when you add a new entry, you
could also purge expired ones, and/or put an arbitrary limit on
how large the list can grow?
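
For example, with the list.h conversion suggested above, purging
on every insert is a short walk (a sketch; nlm_fo_purge_expired
is a made-up name, and it assumes the entry gained the s_list
member):

        static void nlm_fo_purge_expired(void)
        {
                struct nlm_serv *e, *next;

                /* caller holds nlm_fo_lock */
                list_for_each_entry_safe(e, next, &nlm_servs, s_list) {
                        if (time_before(e->s_grace_period, jiffies)) {
                                list_del(&e->s_list);
                                kfree(e);
                        }
                }
        }

Calling that from nlmsvc_fo_setgrace() bounds the list at the
number of distinct fsids written during one grace window,
regardless of list order.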


> static ssize_t (*write_op[])(struct file *, char *, size_t) = {
>         [NFSD_Svc] = write_svc,
> @@ -104,6 +106,7 @@ static ssize_t (*write_op[])(struct file
>         [NFSD_Getfs] = write_getfs,
>         [NFSD_Fh] = write_filehandle,
>         [NFSD_Nlm_unlock] = do_nlm_fo_unlock,
> +        [NFSD_Nlm_igrace] = do_nlm_fs_grace,
>         [NFSD_Threads] = write_threads,
>         [NFSD_Versions] = write_versions,
> #ifdef CONFIG_NFSD_V4

Same comment as before.

> +extern struct nlm_serv *nlm_servs;
> +
> +static inline int
> +nlmsvc_fo_grace_period(struct nlm_args *argp)
> +{
> +        if (unlikely(nlm_servs))
> +                return(nlmsvc_fo_check(&argp->lock.fh));
> +
> +        return 0;
> +}
> +

You have two nearly identical functions to do this, which seems
superfluous. This is why we have header files.
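
i.e. keep a single copy in a shared header (a sketch; picking
include/linux/lockd/lockd.h is just a guess at the right home):

        /* shared lockd header, e.g. include/linux/lockd/lockd.h */
        extern struct nlm_serv *nlm_servs;

        static inline int
        nlmsvc_fo_grace_period(struct nlm_args *argp)
        {
                if (unlikely(nlm_servs))
                        return nlmsvc_fo_check(&argp->lock.fh);
                return 0;
        }

and #include it from both call sites instead of duplicating the
body.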

> @@ -81,7 +81,6 @@ static unsigned long set_grace_period(vo
>                                 / nlm_timeout) * nlm_timeout * HZ;
>         else
>                 grace_period = nlm_timeout * 5 * HZ;
> -        nlmsvc_grace_period = 1;
>         return grace_period + jiffies;
> }

As set_grace_period() no longer sets anything, and returns
an expiry time rather than a period, its name isn't very
appropriate anymore.
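
E.g. something like this (the new name is only a suggestion; the
body is your patch's, with the first line of the if-branch
reconstructed from the current mainline source since the hunk
elides it):

        static unsigned long grace_period_expiry_time(void)
        {
                unsigned long grace_period;

                if (nlm_grace_period)
                        grace_period = ((nlm_grace_period + nlm_timeout - 1)
                                        / nlm_timeout) * nlm_timeout * HZ;
                else
                        grace_period = nlm_timeout * 5 * HZ;
                return grace_period + jiffies;
        }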

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.




2006-08-14 15:45:21

by Trond Myklebust

Subject: Re: [PATCH 2/5] NLM failover - per fs grace period

On Mon, 2006-08-14 at 02:00 -0400, Wendy Cheng wrote:
> This change enables per NFS-export entry lockd grace period. The
> implementation is based on a global single linked list nlm_servs that
> contains entries of fsid info. It is expected this would not be a
> frequent event. The nlm_servs list should be short and the entries
> expire within a maximum of 50 seconds. The grace period setting follows
> the existing NLM grace period handling logic and is triggered via
> echoing the NFS export filesystem id into /proc/fs/nfsd/nlm_set_igrace
> file as:
>
> shell> echo 1234 > /proc/fs/nfsd/nlm_set_igrace

I still don't find the above interface convincing.

Firstly, as I already told you, the NSM protocol does not allow you to
set only a single filesystem in grace. Clients get notified of server
reboots, not filesystem reboots: if they try to reclaim locks and find
out that some of the filesystems they have mounted will not allow them
to do so, then they _will_ get confused and start dropping locks that
would otherwise be perfectly valid.

Secondly, with the above interface, you have to export the filesystem
first, and then set the grace period. Since that is not an atomic
operation, it is perfectly possible for someone to mount the filesystem
after you export it and then set a new lock before you have managed to
declare it in grace. This makes reclaiming locks impossible.

Cheers,
Trond



2006-08-14 16:00:41

by Wendy Cheng

Subject: Re: [PATCH 2/5] NLM failover - per fs grace period

Trond Myklebust wrote:

>On Mon, 2006-08-14 at 02:00 -0400, Wendy Cheng wrote:
>
>
>>This change enables per NFS-export entry lockd grace period. The
>>implementation is based on a global single linked list nlm_servs that
>>contains entries of fsid info. It is expected this would not be a
>>frequent event. The nlm_servs list should be short and the entries
>>expire within a maximum of 50 seconds. The grace period setting follows
>>the existing NLM grace period handling logic and is triggered via
>>echoing the NFS export filesystem id into /proc/fs/nfsd/nlm_set_igrace
>>file as:
>>
>>shell> echo 1234 > /proc/fs/nfsd/nlm_set_igrace
>>
>>
>
>I still don't find the above interface convincing.
>
>Firstly, as I already told you, the NSM protocol does not allow you to
>set only a single filesystem in grace. Clients get notified of server
>reboots, not filesystem reboots: if they try to reclaim locks and find
>out that some of the filesystems they have mounted will not allow them
>to do so, then they _will_ get confused and start dropping locks that
>would otherwise be perfectly valid.
>
>
I'll check the Linux client code to see what's going on. But please be
aware that the individual filesystem grace period goes with the floating
IP. You notify the (nfs) client by floating IP address, NOT by
filesystem id (but set the grace period on the server via fsid).

Say you expect nfs requests going to floating IP 10.10.1.1, which will
handle exported fsids 1234 and 1235. On the server, you write 1234 and
1235 into the /proc entries, and you notify the clients about 10.10.1.1.

>Secondly, with the above interface, you have to export the filesystem
>first, and then set the grace period.
>
No, you don't. The changes have nothing to do with exporting. The code
just adds the numerical fsid to the global list (nlm_servs). When lock
requests finally arrive (later), it extracts the filesystem id from the
filehandle to compare. It can be invoked before the filesystem is
exported.

-- Wendy



2006-08-15 18:34:12

by Wendy Cheng

Subject: Re: [PATCH 2/5] NLM failover - per fs grace period

A few reply emails didn't arrive at the nfs list last week but got
archived at:

* why switching to fsid
https://www.redhat.com/archives/cluster-devel/2006-August/msg00090.html

* cluster script failover sequence
https://www.redhat.com/archives/cluster-devel/2006-August/msg00087.html

To help people understand the patch logic, the following is a sample
failover flow:

Assume we have server_A and server_B. Server_A is a multi-homed NFS
server with two IP addresses, 10.10.10.1 and 10.10.1.1, where:

shell> cat /etc/exports
/mnt/ext3/exports *(fsid=1234,sync,rw)
/mnt/gfs1/exports *(fsid=1236,sync,rw)

The ext3 filesystem is expected to be served by 10.10.1.1 and the gfs
filesystem by 10.10.10.1. If we ever want to move ext3 over to
server_B, the sequence would be:

On server_A:
1. Tear down 10.10.1.1
2. Un-export ext3 exports
3. echo 1234 > /proc/fs/nfsd/nlm_unlock
4. Umount ext3

On server_B:
1. Mount ext3 fs
2. echo 1234 > /proc/fs/nfsd/nlm_set_igrace
3. Export ext3 exports
4. Bring up 10.10.1.1
5. Send restricted reclaim notifications via 10.10.1.1

Step 5 is implemented based on the ha-callout program described in
"man rpc.statd". What our cluster suite (a user-mode script) will do (if
this patch set gets accepted) is bring up rpc.statd on both servers
with an ha-callout program.
On server_A, the ha-callout program constructs two sm directories
(sm-10.10.10.1 and sm-10.10.1.1) that can be accessed by both servers.
This is normally done by placing the directories on the filesystem that
will get moved over. The original /var/lib/nfs/statd/sm stays in its
default place in case server_A ends up with a real reboot (or crash).
When server_B takes over, it sends out a one-time notice to all the
clients that have entries in the sm-10.10.1.1 directory.

Note that the code of the ha-callout program (to be provided by our
cluster suite) could actually be integrated into statd within the
nfs-utils package. However, I would like to keep the changes to
mainline nfs code to a minimum at this moment.

-- Wendy




