From: Ian Kent
Subject: Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover
Date: Wed, 24 May 2006 21:45:45 +0800
Message-ID: <1148478346.8182.22.camel@raven.themaw.net>
References: <44745972.2010305@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain
Cc: nfs@lists.sourceforge.net, linux-fsdevel, autofs mailing list
Return-path:
To: Peter Staubach
In-Reply-To: <44745972.2010305@redhat.com>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

On Wed, 2006-05-24 at 09:02 -0400, Peter Staubach wrote:
> Ian Kent wrote:
> >
> > I've re-written the server selection code now and I believe it works
> > correctly.
> >
> >> Apart from mount time server selection, read-only replicated servers need
> >> to be able to fail over to another server if the current one becomes
> >> unavailable.
> >>
> >> The questions I have are:
> >>
> >> 1) What is the best place for each part of this process to be
> >>    carried out.
> >>    - mount time selection.
> >>    - read-only mount fail over.
> >
> > I think mount time selection should be done in mount and I believe the
> > failover needs to be done in the kernel against the list established with
> > the user space selection. The list should only change when a umount and
> > then a mount occurs (surely this is the only practical way to do it?).
> >
> > The code that I now have for the selection process can potentially improve
> > the code used by patches to mount for probing NFS servers, and doing this
> > once in one place has to be better than doing it in both automount and mount.
> >
> > The failover is another story.
> >
> > It seems to me that there are two similar ways to do this:
> >
> > 1) Pass a list of address and path entries to NFS at mount time and
> > intercept errors, identify if the host is down and if it is select and
> > mount another server.
> >
> > 2) Mount each member of the list with the best one on top and intercept
> > errors, identify if the host is down and if it is select another from the
> > list of mounts and put it atop the mounts. Maintaining the ordering with
> > this approach could be difficult.
> >
> > With either of these approaches handling open files and held locks appears
> > to be the difficult part.
> >
> > Anyone have anything to contribute on how I could handle this or problems
> > that I will encounter?
>
> It seems to me that there is one other way which is similar to #1, except
> that instead of passing path entries to NFS at mount time, pass in file
> handles. This keeps all of the MOUNT protocol processing at the user
> level and does not require the kernel to learn anything about the MOUNT
> protocol. It also allows a reasonable list to be constructed, with
> checking to ensure that all the servers support the same version of the
> NFS protocol, probably that all of the servers support the same transport
> protocol, etc.

Of course, like #1 but with the benefits of #2 without the clutter.
I guess all I would have to do then is the VFS mount to make it happen.
I've put a rough sketch of the mount data I'm imagining below.

Are we assuming a restriction that all the mounts use the same path exported
from each server? mtab could get a little confused.

> > snip ..
> >
> >> 3) Is there any existing work available that anyone is aware
> >>    of that could be used as a reference.
> >
> > Still wondering about this.

Well, there is the Solaris support.
But I'm not supposed to peek at that am I (cough, splutter, ...)?
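Here's that rough sketch of the per-mount data for passing the replica list
down. It's purely for discussion; none of these names or layouts exist
anywhere, and user space (mount/automount) would fill it in after doing the
MOUNT calls and the version/transport checks you describe.

/*
 * Sketch only -- invented names, not real kernel or nfs-utils interfaces.
 */
#include <netinet/in.h>

#define REPLICA_FHSIZE	64		/* big enough for v2/v3 handles */

/* One replica, resolved and probed in user space before mount(2). */
struct nfs_replica {
	struct sockaddr_in	addr;			/* server address */
	unsigned int		fh_len;			/* root file handle length */
	unsigned char		fh[REPLICA_FHSIZE];	/* root handle from the MOUNT call */
	unsigned int		version;		/* NFS version, same across the list */
	unsigned int		proto;			/* transport, likewise */
};

/* Passed as (part of) the mount data; a single entry is just a normal mount. */
struct nfs_replica_list {
	unsigned int		count;
	struct nfs_replica	replicas[0];	/* ordered best-first by the selection code */
};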
> >> 4) How does NFS v4 fit into this picture as I believe that some
> >>    of this functionality is included within the protocol.
> >
> > And this.
> >
> > NFS v4 appears quite different, so should I be considering this for v2 and
> > v3 only?
> >
> >> Any comments or suggestions or reference code would be very much
> >> appreciated.
>
> The Solaris support works by passing a list of structs containing server
> information down into the kernel at mount time. This makes normal mounting
> just a subset of the replicated support, because a normal mount would just
> contain a list of a single entry.

Cool. That's the way the selection code I have works, except for the kernel
bit of course.

> When the Solaris client gets a timeout from an RPC, it checks to see whether
> this file and mount are failover'able. This checks to see whether there are
> alternate servers in the list and could contain a check to see if there are
> locks existing on the file. If there are locks, then don't fail over. The
> alternative to doing this is to attempt to move the lock, but this could
> be problematic because there would be no guarantee that the new lock could
> be acquired.

Yep. Failing over the locks looks like it could turn into a nightmare really
fast. Sounds like a good simplifying restriction for a first stab at this.

> Anyway, if the file is failover'able, then a new server is chosen from the
> list and the file handle associated with the file is remapped to the
> equivalent file on the new server. This is done by repeating the lookups
> done to get the original file handle. Once the new file handle is acquired,
> some minimal checks are done to try to ensure that the files are the
> "same". This is probably mostly checking to see whether the sizes of the
> two files are the same.
>
> Please note that this approach contains the interesting aspect that
> files are only failed over when they need to be and are not failed over
> proactively. This can lead to the situation where processes using the
> file system can be talking to many of the different underlying
> servers, all at the same time. If a server goes down and then comes back
> up before a process, which was talking to that server, notices, then it
> will just continue to use that server, while another process, which
> noticed the failed server, may have failed over to a new server.

Interesting. This hadn't occurred to me yet. I was still at the stage of
wondering whether the "on demand" approach would work, but the simplifying
restriction above should make it workable (I think ...).

> The key ingredient to this approach, I think, is a list of servers and
> information about them, and then information for each active NFS inode
> that keeps track of the pathname used to discover the file handle and
> also the server which is being currently used by the specific file.

Haven't quite got to the path issues yet. But can't we just get the path
from d_path? It will return the path from a given dentry to the root of the
mount, if I remember correctly, and we have a file handle for the server.
But you're talking about the difficulty of the housekeeping overall, I think.

> Thanx...

Thanks for your comments. Much appreciated and certainly very helpful.

Ian
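P.S. Mostly to check my own understanding of the failover path you describe,
this is how I'm picturing the per-file state and the "on demand" decision,
building on the earlier sketch. Every name here is invented for illustration
(none of the helpers exist anywhere) and it glosses over locking and retry
details; it's a sketch of the scheme, not an implementation.

/* Per active file/inode, per your "key ingredient" paragraph. */
struct nfs_failover_info {
	char			*pathname;	/* path walked to get the original handle */
	struct nfs_replica	*server;	/* replica this file is currently using */
	unsigned int		fh_len;
	unsigned char		fh[REPLICA_FHSIZE];
};

/*
 * Called when an RPC times out.  Returns 0 if we switched servers and the
 * caller should retry the RPC, otherwise the original error.
 */
static int nfs_maybe_failover(struct nfs_replica_list *list,
			      struct nfs_failover_info *fi, int err)
{
	struct nfs_replica *next;
	unsigned char new_fh[REPLICA_FHSIZE];
	unsigned int new_len;

	if (list->count < 2)
		return err;		/* no alternate servers to fail over to */

	if (nfs_file_has_locks(fi))
		return err;		/* simplifying restriction: never move locks */

	next = nfs_pick_next_replica(list, fi->server);

	/* Repeat the lookups on the new server to find the equivalent file. */
	if (nfs_relookup(next, fi->pathname, new_fh, &new_len) != 0)
		return err;

	/* Minimal check that the files look the "same", e.g. matching size. */
	if (!nfs_files_match(fi, next, new_fh, new_len))
		return err;

	/* Remap this file to the new server; caller retries the RPC. */
	fi->server = next;
	fi->fh_len = new_len;
	memcpy(fi->fh, new_fh, new_len);
	return 0;
}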