2006-05-02 05:56:47

by Ian Kent

[permalink] [raw]
Subject: [RFC] Multiple server selection and replicated mount failover


Hi all,

For some time now I have had code in autofs that attempts to select an
appropriate server from a weighted list to satisfy server priority
selection and Replicated Server requirements. The code has been
problematic from the beginning and is still incorrect largely due to me
not merging the original patch well and also not fixing it correctly
afterward.

So I'd like to have this work properly and to do that I also need to
consider read-only NFS mount fail over.

The rules for server selection are, in order of priority (I believe):

1) Hosts on the local subnet.
2) Hosts on the local network.
3) Hosts on other networks.

Each of these proximity groups is made up of the largest number of
servers supporting a given NFS protocol version. For example if there were
5 servers and 4 supported v3 and 2 supported v2 then the candidate group
would be made up of the 4 supporting v3. Within the group of candidate
servers the one with the best response time is selected. Selection
within a proximity group can be further influenced by a zero based weight
associated with each host. The higher the weight (a cost really) the less
likely a server is to be selected. I'm not clear on exactly how the weight
influences the selection, so perhaps someone who is familiar with this
could explain it?
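
As a strawman, the interpretation I'm currently working from is: within a
single proximity group, treat the weight as a multiplier on the measured
response time and take the lowest effective cost. The names and the formula
in the sketch below are mine alone (an assumption, not documented Solaris or
autofs behaviour), so corrections are very welcome.

    /*
     * Hedged sketch: pick a server within one proximity group.
     * Assumption: effective cost = probe time * (weight + 1); lower wins.
     */
    #include <stddef.h>

    struct candidate {
        const char *host;        /* server name from the map entry */
        unsigned int weight;     /* zero-based cost given in the map */
        unsigned long rtt_usec;  /* measured probe response time */
        int vers;                /* NFS version the server offered */
    };

    static unsigned long cost(const struct candidate *c)
    {
        return c->rtt_usec * (c->weight + 1);
    }

    static const struct candidate *
    select_in_group(const struct candidate *group, size_t n, int want_vers)
    {
        const struct candidate *best = NULL;
        size_t i;

        for (i = 0; i < n; i++) {
            if (group[i].vers != want_vers)
                continue;                /* group is built per version */
            if (!best || cost(&group[i]) < cost(best))
                best = &group[i];
        }
        return best;                     /* NULL if nothing matched */
    }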

Apart from mount time server selection read-only replicated servers need
to be able to fail over to another server if the current one becomes
unavailable.

The questions I have are:

1) What is the best place for each part of this process to be
carried out.
- mount time selection.
- read-only mount fail over.

2) What mechanisms would be best to use for the selection process.

3) Is there any existing work available that anyone is aware
of that could be used as a reference.

4) How does NFS v4 fit into this picture as I believe that some
of this functionality is included within the protocol.

Any comments or suggestions or reference code would be very much
appreciated.

Ian




2006-05-24 13:02:42

by Peter Staubach

[permalink] [raw]
Subject: Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover

Ian Kent wrote:

>On Tue, 2 May 2006, Ian Kent wrote:
>
>
>
>>Hi all,
>>
>>For some time now I have had code in autofs that attempts to select an
>>appropriate server from a weighted list to satisfy server priority
>>selection and Replicated Server requirements. The code has been
>>problematic from the beginning and is still incorrect largely due to me
>>not merging the original patch well and also not fixing it correctly
>>afterward.
>>
>>So I'd like to have this work properly and to do that I also need to
>>consider read-only NFS mount fail over.
>>
>>The rules for server selection are, in order of priority (I believe):
>>
>>1) Hosts on the local subnet.
>>2) Hosts on the local network.
>>3) Hosts on other networks.
>>
>>Each of these proximity groups is made up of the largest number of
>>servers supporting a given NFS protocol version. For example if there were
>>5 servers and 4 supported v3 and 2 supported v2 then the candidate group
>>would be made up of the 4 supporting v3. Within the group of candidate
>>servers the one with the best response time is selected. Selection
>>within a proximity group can be further influenced by a zero based weight
>>associated with each host. The higher the weight (a cost really) the less
>>likely a server is to be selected. I'm not clear on exactly how the weight
>>influences the selection, so perhaps someone who is familiar with this
>>could explain it?
>>
>>
>
>I've re-written the server selection code now and I believe it works
>correctly.
>
>
>
>>Apart from mount time server selection read-only replicated servers need
>>to be able to fail over to another server if the current one becomes
>>unavailable.
>>
>>The questions I have are:
>>
>>1) What is the best place for each part of this process to be
>> carried out.
>> - mount time selection.
>> - read-only mount fail over.
>>
>>
>
>I think mount time selection should be done in mount and I believe the
>failover needs to be done in the kernel against the list established with
>the user space selection. The list should only change when a umount
>and then a mount occurs (surely this is the only practical way to do it
>?).
>
>The code that I now have for the selection process can potentially improve
>the code used by patches to mount for probing NFS servers and doing this
>once in one place has to be better than doing it in automount and mount.
>
>The failover is another story.
>
>It seems to me that there are two similar ways to do this:
>
>1) Pass a list of address and path entries to NFS at mount time and
>intercept errors, identify if the host is down and if it is select and
>mount another server.
>
>2) Mount each member of the list with the best one on top and intercept
>errors, identify if the host is down and if it is select another from the
>list of mounts and put it atop the mounts. Maintaining the ordering with
>this approach could be difficult.
>
>With either of these approaches handling open files and held locks appears
>to be the difficult part.
>
>Anyone have anything to contribute on how I could handle this or problems
>that I will encounter?
>
>
>

It seems to me that there is one other way which is similar to #1 except
that instead of passing path entries to NFS at mount time, pass in file
handles. This keeps all of the MOUNT protocol processing at the user
level and does not require the kernel to learn anything about the MOUNT
protocol. It also allows a reasonable list to be constructed, with
checking to ensure that all the servers support the same version of the
NFS protocol, probably that all of the servers support the same transport
protocol, etc.
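
Roughly the sort of thing I have in mind, with every name and field below
purely illustrative (this is not the real nfs_mount_data ABI, nor the Solaris
structure):

    #include <netinet/in.h>
    #include <stdint.h>

    #define REPL_FHSIZE 64              /* covers v2 (32) and v3 (up to 64) handles */

    struct replica_entry {
        struct sockaddr_in addr;        /* server address */
        char hostname[256];             /* kept for error reporting / mtab */
        char export_path[1024];         /* export used on this server */
        uint32_t fh_len;                /* valid bytes in root_fh */
        unsigned char root_fh[REPL_FHSIZE]; /* root handle from the MOUNT call */
    };

    struct replica_mount_info {
        uint32_t vers;                  /* NFS version all entries agreed on */
        uint32_t proto;                 /* transport all entries agreed on */
        uint32_t count;                 /* number of usable entries */
        struct replica_entry entries[8]; /* ordered best first; size arbitrary */
    };

User space would fill this in after doing the probing and the MOUNT calls and
hand it down as part of the mount data, so the kernel never needs to speak
the MOUNT protocol itself.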

>snip ..
>
>
>
>>3) Is there any existing work available that anyone is aware
>> of that could be used as a reference.
>>
>>
>
>Still wondering about this.
>
>
>

Well, there is the Solaris support.

>>4) How does NFS v4 fit into this picture as I believe that some
>> of this functionality is included within the protocol.
>>
>>
>
>And this.
>
>NFS v4 appears quite different so should I be considering this for v2 and
>v3 only?
>
>
>
>>Any comments or suggestions or reference code would be very much
>>appreciated.
>>
>>

The Solaris support works by passing a list of structs containing server
information down into the kernel at mount time. This makes normal mounting
just a subset of the replicated support because a normal mount would just
contain a list of a single entry.

When the Solaris client gets a timeout from an RPC, it checks to see whether
this file and mount are failover'able. This checks to see whether there are
alternate servers in the list and could contain a check to see if there are
locks existing on the file. If there are locks, then don't failover. The
alternative to doing this is to attempt to move the lock, but this could
be problematic because there would be no guarantee that the new lock could
be acquired.

Anyway, if the file is failover'able, then a new server is chosen from the
list and the file handle associated with the file is remapped to the
equivalent file on the new server. This is done by repeating the lookups
done to get the original file handle. Once the new file handle is acquired,
then some minimal checks are done to try to ensure that the files are the
"same". This is probably mostly checking to see whether the sizes of the
two files are the same.
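
In outline, and with every type and helper below invented for illustration
rather than lifted from any real source tree, the failover path looks
something like this:

    #include <stdint.h>

    struct replica;     /* one server from the mount-time list */
    struct rnode;       /* per-file state: relative path, current server, fh */

    extern int file_has_locks(const struct rnode *rp);
    extern struct replica *next_replica(struct replica *cur);
    extern int remap_lookup(struct rnode *rp, struct replica *newsrv,
                            unsigned char *new_fh, uint64_t *new_size);
    extern void rnode_switch(struct rnode *rp, struct replica *newsrv,
                             const unsigned char *new_fh);

    /* Called after an RPC to the current server times out; 0 on success. */
    static int try_failover(struct rnode *rp, struct replica *cur,
                            uint64_t old_size)
    {
        struct replica *alt = next_replica(cur);
        unsigned char fh[64];
        uint64_t size;

        if (!alt)                   /* no alternate servers: not failover'able */
            return -1;
        if (file_has_locks(rp))     /* simplification: never move locks */
            return -1;

        /* replay the lookups that produced the original file handle */
        if (remap_lookup(rp, alt, fh, &size))
            return -1;

        /* minimal "same file" check, e.g. the sizes must match */
        if (size != old_size)
            return -1;

        rnode_switch(rp, alt, fh);  /* file now bound to the new server */
        return 0;
    }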

Please note that this approach contains the interesting aspect that
files are only failed over when they need to be and are not failed over
proactively. This can lead to the situation where processes using the
file system can be talking to many of the different underlying
servers, all at the same time. If a server goes down and then comes back
up before a process, which was talking to that server, notices, then it
will just continue to use that server, while another process, which
noticed the failed server, may have failed over to a new server.

The key ingredient to this approach, I think, is a list of servers and
information about them, and then information for each active NFS inode
that keeps track of the pathname used to discover the file handle and
also the server which is being currently used by the specific file.
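
Concretely, and again with invented names, the per-file record could be as
small as this, sitting alongside a mount-wide server list like the one
sketched above:

    struct replica_inode_info {
        char *rel_path;          /* path under the export used for the
                                  * original lookups, replayed on failover */
        unsigned int cur_server; /* index into the mount's server list */
        unsigned char fh[64];    /* handle on the currently used server */
        unsigned int fh_len;
    };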

Thanx...

ps

2006-05-24 13:45:45

by Ian Kent

[permalink] [raw]
Subject: Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover

On Wed, 2006-05-24 at 09:02 -0400, Peter Staubach wrote:
> Ian Kent wrote:
>
> >
> >I've re-written the server selection code now and I believe it works
> >correctly.
> >
> >
> >
> >>Apart from mount time server selection read-only replicated servers need
> >>to be able to fail over to another server if the current one becomes
> >>unavailable.
> >>
> >>The questions I have are:
> >>
> >>1) What is the best place for each part of this process to be
> >> carried out.
> >> - mount time selection.
> >> - read-only mount fail over.
> >>
> >>
> >
> >I think mount time selection should be done in mount and I believe the
> >failover needs to be done in the kernel against the list established with
> >the user space selection. The list should only change when a umount
> >and then a mount occurs (surely this is the only practical way to do it
> >?).
> >
> >The code that I now have for the selection process can potentially improve
> >the code used by patches to mount for probing NFS servers and doing this
> >once in one place has to be better than doing it in automount and mount.
> >
> >The failover is another story.
> >
> >It seems to me that there are two similar ways to do this:
> >
> >1) Pass a list of address and path entries to NFS at mount time and
> >intercept errors, identify if the host is down and if it is select and
> >mount another server.
> >
> >2) Mount each member of the list with the best one on top and intercept
> >errors, identify if the host is down and if it is select another from the
> >list of mounts and put it atop the mounts. Maintaining the ordering with
> >this approach could be difficult.
> >
> >With either of these approaches handling open files and held locks appears
> >to be the difficult part.
> >
> >Anyone have anything to contribute on how I could handle this or problems
> >that I will encounter?
> >
> >
> >
>
> It seems to me that there is one other way which is similar to #1 except
> that instead of passing path entries to NFS at mount time, pass in file
> handles. This keeps all of the MOUNT protocol processing at the user
> level and does not require the kernel to learn anything about the MOUNT
> protocol. It also allows a reasonable list to be constructed, with
> checking to ensure that all the servers support the same version of the
> NFS protocol, probably that all of the servers support the same transport
> protocol, etc.

Of course, like #1 but with the benefits of #2 without the clutter. I
guess all I would have to do then is the vfs mount to make it happen.
Are we assuming a restriction like all the mounts have the same path
exported from the server? mtab could get a little confused.

>
> >snip ..
> >
> >
> >
> >>3) Is there any existing work available that anyone is aware
> >> of that could be used as a reference.
> >>
> >>
> >
> >Still wondering about this.
> >
> >
> >
>
> Well, there is the Solaris support.

But I'm not supposed to peek at that am I (cough, splutter, ...)?

>
> >>4) How does NFS v4 fit into this picture as I believe that some
> >> of this functionality is included within the protocol.
> >>
> >>
> >
> >And this.
> >
> >NFS v4 appears quite different so should I be considering this for v2 and
> >v3 only?
> >
> >
> >
> >>Any comments or suggestions or reference code would be very much
> >>appreciated.
> >>
> >>
>
> The Solaris support works by passing a list of structs containing server
> information down into the kernel at mount time. This makes normal mounting
> just a subset of the replicated support because a normal mount would just
> contain a list of a single entry.

Cool. That's the way the selection code I have works, except for the
kernel bit of course.

>
> When the Solaris client gets a timeout from an RPC, it checks to see whether
> this file and mount are failover'able. This checks to see whether there are
> alternate servers in the list and could contain a check to see if there are
> locks existing on the file. If there are locks, then don't failover. The
> alternative to doing this is to attempt to move the lock, but this could
> be problematic because there would be no guarantee that the new lock could
> be acquired.

Yep. Failing over the locks looks like it could turn into a nightmare
really fast. Sounds like a good simplifying restriction for a first stab
at this.

>
> Anyway, if the file is failover'able, then a new server is chosen from the
> list and the file handle associated with the file is remapped to the
> equivalent file on the new server. This is done by repeating the lookups
> done to get the original file handle. Once the new file handle is acquired,
> then some minimal checks are done to try to ensure that the files are the
> "same". This is probably mostly checking to see whether the sizes of the
> two files are the same.
>
> Please note that this approach contains the interesting aspect that
> files are only failed over when they need to be and are not failed over
> proactively. This can lead to the situation where processes using the
> file system can be talking to many of the different underlying
> servers, all at the same time. If a server goes down and then comes back
> up before a process, which was talking to that server, notices, then it
> will just continue to use that server, while another process, which
> noticed the failed server, may have failed over to a new server.

Interesting. This hadn't occurred to me yet.

I was still at the stage of wondering whether the "on demand" approach
would work but the simplifying restriction above should make it workable
(I think ....).

>
> The key ingredient to this approach, I think, is a list of servers and
> information about them, and then information for each active NFS inode
> that keeps track of the pathname used to discover the file handle and
> also the server which is being currently used by the specific file.

Haven't quite got to the path issues yet.
But can't we just get the path from d_path?
It will return the path from a given dentry to the root of the mount, if
I remember correctly, and we have a file handle for the server.

But you're talking about the difficulty of the housekeeping overall, I
think.

> Thanx...

Thanks for your comments.
Much appreciated and certainly very helpful.

Ian


2006-05-24 14:04:51

by Peter Staubach

[permalink] [raw]
Subject: Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover

Ian Kent wrote:

>
>Of course, like #1 but with the benefits of #2 without the clutter. I
>guess all I would have to do then is the vfs mount to make it happen.
>Are we assuming a restriction like all the mounts have the same path
>exported from the server? mtab could get a little confused.
>
>
>

I don't think that this needs to be restricted like this. I think that
it would be nice if mtab just showed the arguments which were given to
the mount command, i.e. with all of the server and path combinations listed
just like they were on the command line or in the autofs map.


>But I'm not supposed to peek at that am I (cough, splutter, ...)?
>
>
>

Yes, I know. I am trying to figure out how much of the architecture and
implementation that I can talk about too... :-)

>Cool. That's the way the selection code I have works, except for the
>kernel bit of course.
>
>
>

Good news!

>Yep. Failing over the locks looks like it could turn into a nightmare
>really fast. Sounds like a good simplifying restriction for a first stab
>at this.
>
>
>

Agreed.

>Interesting. This hadn't occurred to me yet.
>
>I was still at the stage of wondering whether the "on demand" approach
>would work but the simplifying restriction above should make it workable
>(I think ....).
>
>
>

I think that simple is good. We can always get more complicated later, if
need be. (Hope not... :-) )

>>The key ingredient to this approach, I think, is a list of servers and
>>information about them, and then information for each active NFS inode
>>that keeps track of the pathname used to discover the file handle and
>>also the server which is being currently used by the specific file.
>>
>>
>
>Haven't quite got to the path issues yet.
>But can't we just get the path from d_path?
>It will return the path from a given dentry to the root of the mount, if
>I remember correctly, and we have a file handle for the server.
>
>But you're talking about the difficulty of the housekeeping overall, I
>think.
>
>
>

Yes, I was specifically trying to avoid talking about how to manage the
information. I think that that is an implementation detail, which is
better left until after the high level architecture is defined.


>> Thanx...
>>
>>
>
>Thanks for your comments.
>Much appreciated and certainly very helpful.
>

You're welcome.

I can try to talk more about the architecture and implementation that I am
familiar with, if you like.

ps

2006-05-24 14:31:48

by Ian Kent

[permalink] [raw]
Subject: Re: Re: [RFC] Multiple server selection and replicated mount failover

On Wed, 2006-05-24 at 10:04 -0400, Peter Staubach wrote:
> Ian Kent wrote:
>
> >
> >Of course, like #1 but with the benefits of #2 without the clutter. I
> >guess all I would have to do then is the vfs mount to make it happen.
> >Are we assuming a restriction like all the mounts have the same path
> >exported from the server? mtab could get a little confused.
> >
> >
> >
>
> I don't think that this needs to be restricted like this. I think that
> it would be nice if mtab just showed the arguments which were given to
> the mount command, i.e. with all of the server and path combinations listed
> just like they were on the command line or in the autofs map.
>
>
> >But I'm not supposed to peek at that am I (cough, splutter, ...)?
> >
> >
> >
>
> Yes, I know. I am trying to figure out how much of the architecture and
> implementation that I can talk about too... :-)
>
> >Cool. That's the way the selection code I have works, except for the
> >kernel bit of course.
> >
> >
> >
>
> Good news!

It's in autofs 5 now. The idea is that it will work for any mount
string, replicated syntax or not. So there's no extra mucking around.

I hope to push it into mount and provide a configure option to disable
it in autofs if mount can do it instead.

>
> >Yep. Failing over the locks looks like it could turn into a nightmare
> >really fast. Sounds like a good simplifying restriction for a first stab
> >at this.
> >
> >
> >
>
> Agreed.
>
> >Interesting. This hadn't occurred to me yet.
> >
> >I was still at the stage of wondering whether the "on demand" approach
> >would work but the simplifying restriction above should make it workable
> >(I think ....).
> >
> >
> >
>
> I think that simple is good. We can always get more complicated later, if
> need be. (Hope not... :-) )
>
> >>The key ingredient to this approach, I think, is a list of servers and
> >>information about them, and then information for each active NFS inode
> >>that keeps track of the pathname used to discover the file handle and
> >>also the server which is being currently used by the specific file.
> >>
> >>
> >
> >Haven't quite got to the path issues yet.
> >But can't we just get the path from d_path?
> >It will return the path from a given dentry to the root of the mount, if
> >I remember correctly, and we have a file handle for the server.
> >
> >But you're talking about the difficulty of the housekeeping overall, I
> >think.
> >
> >
> >
>
> Yes, I was specifically trying to avoid talking about how to manage the
> information. I think that that is an implementation detail, which is
> better left until after the high level architecture is defined.

Yep.

>
>
> >> Thanx...
> >>
> >>
> >
> >Thanks for your comments.
> >Much appreciated and certainly very helpful.
> >
>
> You're welcome.
>
> I can try to talk more about the architecture and implementation that I am
> familiar with, if you like.

Any and all information is good.
Food for thought will give me something to eat!

Ian




2006-05-24 16:29:16

by Trond Myklebust

[permalink] [raw]
Subject: Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover

On Wed, 2006-05-24 at 13:05 +0800, Ian Kent wrote:

> It seems to me that there are two similar ways to do this:
>
> 1) Pass a list of address and path entries to NFS at mount time and
> intercept errors, identify if the host is down and if it is select and
> mount another server.
>
> 2) Mount each member of the list with the best one on top and intercept
> errors, identify if the host is down and if it is select another from the
> list of mounts and put it atop the mounts. Maintaining the ordering with
> this approach could be difficult.

Solaris has implemented option (1). To me, that is the approach that
makes the most sense: why add the overhead of maintaining all these
redundant mounts?

> With either of these approaches handling open files and held locks appears
> to be the difficult part.

Always has been, and always will. We're working on this problem, but
progress is slow. In any case, we'll be concentrating on solving it for
NFSv4 first (since that has native support for migrated/replicated
volumes).

> Anyone have anything to contribute on how I could handle this or problems
> that I will encounter?
>
>
> snip ..
>
> >
> > 3) Is there any existing work available that anyone is aware
> > of that could be used as a reference.
>
> Still wondering about this.
>
> >
> > 4) How does NFS v4 fit into this picture as I believe that some
> > of this functionality is included within the protocol.
>
> And this.
>
> NFS v4 appears quite different so should I be considering this for v2 and
> v3 only?

NFSv4 has full support for migration/replication in the protocol. If a
filesystem fails on a given server, then the server itself will tell the
client where it can find the replicas. There should be no need to
provide that information at mount time.

Cheers,
Trond


2006-05-24 17:58:49

by Jeff Moyer

[permalink] [raw]
Subject: Re: [autofs] Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover

==> Regarding [autofs] Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover; Trond Myklebust <[email protected]> adds:

trond.myklebust> On Wed, 2006-05-24 at 13:05 +0800, Ian Kent wrote:
>> > 4) How does NFS v4 fit into this picture as I believe that some
>> > of this functionality is included within the protocol.
>>
>> And this.
>>
>> NFS v4 appears quite different so should I be considering this for v2 and
>> v3 only?

trond.myklebust> NFSv4 has full support for migration/replication in the
trond.myklebust> protocol. If a filesystem fails on a given server, then
trond.myklebust> the server itself will tell the client where it can find
trond.myklebust> the replicas. There should be no need to provide that
trond.myklebust> information at mount time.

And what happens when the server disappears?

-Jeff

2006-05-24 18:31:41

by Trond Myklebust

[permalink] [raw]
Subject: Re: [autofs] Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover

On Wed, 2006-05-24 at 13:58 -0400, Jeff Moyer wrote:
> ==> Regarding [autofs] Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover; Trond Myklebust <[email protected]> adds:
>
> trond.myklebust> On Wed, 2006-05-24 at 13:05 +0800, Ian Kent wrote:
> >> > 4) How does NFS v4 fit into this picture as I believe that some
> >> > of this functionality is included within the protocol.
> >>
> >> And this.
> >>
> >> NFS v4 appears quite different so should I be considering this for v2 and
> >> v3 only?
>
> trond.myklebust> NFSv4 has full support for migration/replication in the
> trond.myklebust> protocol. If a filesystem fails on a given server, then
> trond.myklebust> the server itself will tell the client where it can find
> trond.myklebust> the replicas. There should be no need to provide that
> trond.myklebust> information at mount time.
>
> And what happens when the server disappears?

There are 2 strategies for dealing with that:

Firstly, we can maintain a cache of the list of replica volumes (we can
request the list of replicas when we mount the original volume).

Secondly, there are plans to add a backup list of failover servers in a
specialised DNS record. This strategy could be made to work for NFSv2/v3
too.

Cheers,
Trond


2006-05-24 19:17:03

by Peter Staubach

[permalink] [raw]
Subject: Re: [autofs] Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover

Trond Myklebust wrote:

>On Wed, 2006-05-24 at 13:58 -0400, Jeff Moyer wrote:
>
>
>>==> Regarding [autofs] Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover; Trond Myklebust <[email protected]> adds:
>>
>>trond.myklebust> On Wed, 2006-05-24 at 13:05 +0800, Ian Kent wrote:
>>
>>
>>>>>4) How does NFS v4 fit into this picture as I believe that some
>>>>> of this functionality is included within the protocol.
>>>>>
>>>>>
>>>>And this.
>>>>
>>>>NFS v4 appears quite different so should I be considering this for v2 and
>>>>v3 only?
>>>>
>>>>
>>trond.myklebust> NFSv4 has full support for migration/replication in the
>>trond.myklebust> protocol. If a filesystem fails on a given server, then
>>trond.myklebust> the server itself will tell the client where it can find
>>trond.myklebust> the replicas. There should be no need to provide that
>>trond.myklebust> information at mount time.
>>
>>And what happens when the server disappears?
>>
>>
>
>There are 2 strategies for dealing with that:
>
>Firstly, we can maintain a cache of the list of replica volumes (we can
>request the list of replicas when we mount the original volume).
>
>
>

This assumes a lot on the part of the server and it doesn't seem to me
that current server implementations are ready with the infrastructure to
be able to make this a reality.

I think that the client should be prepared to handle this sort of scenario
but also be prepared to take a list of servers at mount time too.

>Secondly, there are plans to add a backup list of failover servers in a
>specialised DNS record. This strategy could be made to work for NFSv2/v3
>too.
>

This would seem to be a solution for how to determine the list of replicas,
but not how the NFS client fails over from one replica to the next.

Thanx...

ps

2006-05-24 19:45:54

by Trond Myklebust

[permalink] [raw]
Subject: Re: [autofs] Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover

On Wed, 2006-05-24 at 15:17 -0400, Peter Staubach wrote:

> This would seem to be a solution for how to determine the list of replicas,
> but not how the NFS client fails over from one replica to the next.

I'm fully aware of _that_. As I said earlier, work is in progress.

Cheers,
Trond


2006-05-24 20:45:04

by jtk

[permalink] [raw]
Subject: Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover

>>>>> "PS" == Peter Staubach <[email protected]> writes:

PS> When the Solaris client gets a timeout from an RPC, it checks to see
PS> whether this file and mount are failover'able. This checks to see
PS> whether there are alternate servers in the list and could contain a
PS> check to see if there are locks existing on the file. If there are
PS> locks, then don't failover. The alternative to doing this is to
PS> attempt to move the lock, but this could be problematic because
PS> there would be no guarantee that the new lock could be acquired.

PS> Anyway, if the file is failover'able, then a new server is chosen
PS> from the list and the file handle associated with the file is
PS> remapped to the equivalent file on the new server. This is done by
PS> repeating the lookups done to get the original file handle. Once
PS> the new file handle is acquired, then some minimal checks are done
PS> to try to ensure that the files are the "same". This is probably
PS> mostly checking to see whether the sizes of the two files are the
PS> same.

PS> Please note that this approach contains the interesting aspect that
PS> files are only failed over when they need to be and are not failed over
PS> proactively. This can lead to the situation where processes using the
PS> file system can be talking to many of the different underlying
PS> servers, all at the same time. If a server goes down and then comes back
PS> up before a process, which was talking to that server, notices, then it
PS> will just continue to use that server, while another process, which
PS> noticed the failed server, may have failed over to a new server.

If you have multiple processes talking to different server replicas, can
you then get cases where the processes aren't sharing the same files given
the same name?

Process "A" looks up /mount/a/b/c/file.c (using server 1) opens it and
starts working on it. It then sits around doing nothing for a while.

Process "B" cd's to /mount/a/b, gets a timeout, fails over to server 2,
and then looks up "c/file.c" which will be referencing the object on
server 2 ?

A & B then try locking to cooperate...

Are replicas only useful for read-only copies? If they're read-only, do
locks even make sense?

--
John Kohl
Senior Software Engineer - Rational Software - IBM Software Group
Lexington, Massachusetts, USA
[email protected]
<http://www.ibm.com/software/rational/>

2006-05-24 20:52:48

by Dan Stromberg

[permalink] [raw]
Subject: Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover

On Wed, 2006-05-24 at 16:45 -0400, John T. Kohl wrote:

>
> If you have multiple processes talking to different server replicas, can
> you then get cases where the processes aren't sharing the same files given
> the same name?

Sounds like it to me.

> Process "A" looks up /mount/a/b/c/file.c (using server 1) opens it and
> starts working on it. It then sits around doing nothing for a while.
>
> Process "B" cd's to /mount/a/b, gets a timeout, fails over to server 2,
> and then looks up "c/file.c" which will be referencing the object on
> server 2 ?

Yup.

> A & B then try locking to cooperate...

Yup.

To get good locking semantics, you'll probably need a distributed
filesystem like GFS or Lustre.

> Are replicas only useful for read-only copies? If they're read-only, do
> locks even make sense?

Yes, they can - imagine a software system that wants to make sure only
one process is accessing some data at a time. The lock might be in one
filesystem, but the data might be in another. Or you might be reading
from a device that has really expensive seeks (a DVD comes to mind), so
you want to be sure that only one thing is reading from it at a time.
There are other possible scenarios I imagine, but many folks may be able
to live without any of them. :)






2006-05-25 03:56:26

by Ian Kent

[permalink] [raw]
Subject: Re: [autofs] Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover

On Wed, 2006-05-24 at 14:31 -0400, Trond Myklebust wrote:
> On Wed, 2006-05-24 at 13:58 -0400, Jeff Moyer wrote:
> > ==> Regarding [autofs] Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover; Trond Myklebust <[email protected]> adds:
> >
> > trond.myklebust> On Wed, 2006-05-24 at 13:05 +0800, Ian Kent wrote:
> > >> > 4) How does NFS v4 fit into this picture as I believe that some
> > >> > of this functionality is included within the protocol.
> > >>
> > >> And this.
> > >>
> > >> NFS v4 appears quite different so should I be considering this for v2 and
> > >> v3 only?
> >
> > trond.myklebust> NFSv4 has full support for migration/replication in the
> > trond.myklebust> protocol. If a filesystem fails on a given server, then
> > trond.myklebust> the server itself will tell the client where it can find
> > trond.myklebust> the replicas. There should be no need to provide that
> > trond.myklebust> information at mount time.
> >
> > And what happens when the server disappears?
>
> There are 2 strategies for dealing with that:
>
> Firstly, we can maintain a cache of the list of replica volumes (we can
> request the list of replicas when we mount the original volume).
>
> Secondly, there are plans to add a backup list of failover servers in a
> specialised DNS record. This strategy could be made to work for NFSv2/v3
> too.
>

I see. That would work fine.

Personally, I'm not keen on using DNS for this as it adds another
source, separate from the original source, that needs to be kept up to
date.

Unfortunately, in many environments it's not possible to deploy new
services, often for several years after they become available. So there
is a need to do this for v2 and v3 in the absence of v4. We at least
need to support the mount syntax used in other industry OSs to round out
the v2 and v3 implementation, so using mount seems the logical thing to
do. I think this would also fit in well with v4 in that, as you mention
above, the replica information needs to be gathered at mount time.

I have the opportunity to spend some time on this now.

Ideally I would like to fit in with the work that is being done for v4
as much as possible. For example, I noticed references to a struct
nfs_fs_locations in your patch set which may be useful for the
information I need. However, I haven't spotted anything that relates to
failure detection and fail over itself (OK, I know you said you're working on
it), so perhaps I can contribute to this in a way that could help your v4
work. So what's your plan for this?

Ian



2006-05-29 07:31:59

by Ian Kent

[permalink] [raw]
Subject: Re: [autofs] Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover

On Thu, 24 May 2006, John T. Kohl wrote:

> >>>>> "PS" == Peter Staubach <[email protected]> writes:
>
> PS> When the Solaris client gets a timeout from an RPC, it checks to see
> PS> whether this file and mount are failover'able. This checks to see
> PS> whether there are alternate servers in the list and could contain a
> PS> check to see if there are locks existing on the file. If there are
> PS> locks, then don't failover. The alternative to doing this is to
> PS> attempt to move the lock, but this could be problematic because
> PS> there would be no guarantee that the new lock could be acquired.
>
> PS> Anyway, if the file is failover'able, then a new server is chosen
> PS> from the list and the file handle associated with the file is
> PS> remapped to the equivalent file on the new server. This is done by
> PS> repeating the lookups done to get the original file handle. Once
> PS> the new file handle is acquired, then some minimal checks are done
> PS> to try to ensure that the files are the "same". This is probably
> PS> mostly checking to see whether the sizes of the two files are the
> PS> same.
>
> PS> Please note that this approach contains the interesting aspect that
> PS> files are only failed over when they need to be and are not failed over
> PS> proactively. This can lead to the situation where processes using the
> PS> file system can be talking to many of the different underlying
> PS> servers, all at the same time. If a server goes down and then comes back
> PS> up before a process, which was talking to that server, notices, then it
> PS> will just continue to use that server, while another process, which
> PS> noticed the failed server, may have failed over to a new server.
>
> If you have multiple processes talking to different server replicas, can
> you then get cases where the processes aren't sharing the same files given
> the same name?
>
> Process "A" looks up /mount/a/b/c/file.c (using server 1) opens it and
> starts working on it. It then sits around doing nothing for a while.
>
> Process "B" cd's to /mount/a/b, gets a timeout, fails over to server 2,
> and then looks up "c/file.c" which will be referencing the object on
> server 2 ?
>
> A & B then try locking to cooperate...
>
> Are replicas only useful for read-only copies? If they're read-only, do
> locks even make sense?

Apps will take locks whether it makes sense or not.
So refusing to fail-over if locks are held is likely the best approach.

The case of replica filesystems themselves being updated could give rise
to some interesting difficulties.

Ian


2006-05-30 12:02:03

by Jeff Moyer

[permalink] [raw]
Subject: Re: [autofs] Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover

==> Regarding [autofs] Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover; [email protected] (John T. Kohl) adds:

>>>>>> "PS" == Peter Staubach <[email protected]> writes:
PS> When the Solaris client gets a timeout from an RPC, it checks to see
PS> whether this file and mount are failover'able. This checks to see
PS> whether there are alternate servers in the list and could contain a
PS> check to see if there are locks existing on the file. If there are
PS> locks, then don't failover. The alternative to doing this is to
PS> attempt to move the lock, but this could be problematic because there
PS> would be no guarantee that the new lock could be acquired.

PS> Anyway, if the file is failover'able, then a new server is chosen from
PS> the list and the file handle associated with the file is remapped to
PS> the equivalent file on the new server. This is done by repeating the
PS> lookups done to get the original file handle. Once the new file handle
PS> is acquired, then some minimal checks are done to try to ensure that
PS> the files are the "same". This is probably mostly checking to see
PS> whether the sizes of the two files are the same.

PS> Please note that this approach contains the interesting aspect that
PS> files are only failed over when they need to be and are not failed over
PS> proactively. This can lead to the situation where processes using the
PS> file system can be talking to many of the different underlying
PS> servers, all at the same time. If a server goes down and then comes
PS> back up before a process, which was talking to that server, notices,
PS> then it will just continue to use that server, while another process,
PS> which noticed the failed server, may have failed over to a new server.

jtk> If you have multiple processes talking to different server replicas,
jtk> can you then get cases where the processes aren't sharing the same
jtk> files given the same name?

jtk> Process "A" looks up /mount/a/b/c/file.c (using server 1) opens it and
jtk> starts working on it. It then sits around doing nothing for a while.

jtk> Process "B" cd's to /mount/a/b, gets a timeout, fails over to server
jtk> 2, and then looks up "c/file.c" which will be referencing the object
jtk> on server 2 ?

jtk> A & B then try locking to cooperate...

jtk> Are replicas only useful for read-only copies? If they're read-only,
jtk> do locks even make sense?

In the docs I've read, the replicated failover only works for read-only
file systems. You can have a replicated server entry for read-write file
systems, but only one of those will be mounted by the automounter. To
change servers would require a timeout (unmount) and subsequent lookup
(mount).

I don't think we need to try to kill ourselves by making this too complex.

-Jeff

2006-05-24 05:05:28

by Ian Kent

[permalink] [raw]
Subject: Re: [RFC] Multiple server selection and replicated mount failover

On Tue, 2 May 2006, Ian Kent wrote:

>
> Hi all,
>
> For some time now I have had code in autofs that attempts to select an
> appropriate server from a weighted list to satisfy server priority
> selection and Replicated Server requirements. The code has been
> problematic from the beginning and is still incorrect largely due to me
> not merging the original patch well and also not fixing it correctly
> afterward.
>
> So I'd like to have this work properly and to do that I also need to
> consider read-only NFS mount fail over.
>
> The rules for server selection are, in order of priority (I believe):
>
> 1) Hosts on the local subnet.
> 2) Hosts on the local network.
> 3) Hosts on other networks.
>
> Each of these proximity groups is made up of the largest number of
> servers supporting a given NFS protocol version. For example if there were
> 5 servers and 4 supported v3 and 2 supported v2 then the candidate group
> would be made up of the 4 supporting v3. Within the group of candidate
> servers the one with the best response time is selected. Selection
> within a proximity group can be further influenced by a zero based weight
> associated with each host. The higher the weight (a cost really) the less
> likely a server is to be selected. I'm not clear on exactly how the weight
> influences the selection, so perhaps someone who is familiar with this
> could explain it?

I've re-written the server selection code now and I believe it works
correctly.

>
> Apart from mount time server selection read-only replicated servers need
> to be able to fail over to another server if the current one becomes
> unavailable.
>
> The questions I have are:
>
> 1) What is the best place for each part of this process to be
> carried out.
> - mount time selection.
> - read-only mount fail over.

I think mount time selection should be done in mount and I believe the
failover needs to be done in the kernel against the list established with
the user space selection. The list should only change when a umount
and then a mount occurs (surely this is the only practical way to do it
?).

The code that I now have for the selection process can potentially improve
the code used by patches to mount for probing NFS servers and doing this
once in one place has to be better than doing it in automount and mount.

The failover is another story.

It seems to me that there are two similar ways to do this:

1) Pass a list of address and path entries to NFS at mount time and
intercept errors, identify if the host is down and if it is select and
mount another server.

2) Mount each member of the list with the best one on top and intercept
errors, identify if the host is down and if it is select another from the
list of mounts and put it atop the mounts. Maintaining the ordering with
this approach could be difficult.

With either of these approaches handling open files and held locks appears
to be the difficult part.

Anyone have anything to contribute on how I could handle this or problems
that I will encounter?


snip ..

>
> 3) Is there any existing work available that anyone is aware
> of that could be used as a reference.

Still wondering about this.

>
> 4) How does NFS v4 fit into this picture as I believe that some
> of this functionality is included within the protocol.

And this.

NFS v4 appears quite different so should I be considering this for v2 and
v3 only?

>
> Any comments or suggestions or reference code would be very much
> appreciated.

Still.

Ian


2006-05-02 18:14:16

by Jim Carter

[permalink] [raw]
Subject: Re: [autofs] [RFC] Multiple server selection and replicated mount failover

On Tue, 2 May 2006, Ian Kent wrote:

> For some time now I have had code in autofs that attempts to select an
> appropriate server from a weighted list to satisfy server priority
> selection and Replicated Server requirements. The code has been
> problematic from the beginning and is still incorrect largely due to me
> not merging the original patch well and also not fixing it correctly
> afterward.
>
> So I'd like to have this work properly and to do that I also need to
> consider read-only NFS mount fail over.

I'm glad to hear that there may be progress in server selection. But I'm
not sure if you're looking at the problem from the direction that I am.

First, I don't think it's necessary to replicate the original Sun
behavior exactly, although it would be helpful but not mandatory to
allow something in the automount maps that resembles Solaris syntax, to
ease user (sysop) training.

The current version of mount on Linux (util-linux-2.12) does not know
about picking servers from a list; at least the man page doesn't know.
This means that the whole job of server selection falls to automount. I
think that's the right way to design the system. However, that also
means that automount needs to know something about NFS servers
specifically. The less it knows, the better, in my opinion, so the
design of NFS mount options can be separated from automount.

Your task is to make an ordered list of servers, best to worst, and to
mount from the best one that answers. To my mind a concentration on
"groups" is confusing for the implementor and user, as well as requiring
inside knowledge so you can classify the servers. On the other hand,
the sysop does want to be able to use the same automount map on a
variety of machines (e.g. map served by NIS). So what discriminations
might be made?

Explicit preferences set by the sysop should be able to trump all other
discriminations -- or it should be possible to make them small enough to
be overridden by intrinsic differences. Let's specify that intrinsic
differences are 5 or 10 points, and the explicit preference could be set
to 30 to override, or 1 to subtly influence. For example, you could
give preferences of 0, 1 and 30 to designate a "last choice" server, and
two more preferred servers which would be picked on intrinsic grounds,
the first one being most preferred if all else is equal. (Low score
wins -- say that explicitly in the documentation.)

As for intrinsic differences, let's say that being on a different subnet
costs 10 points. That distinction is important. I'm not too sure what
the "same net" discrimination might mean. At my shop, if 128.97.4.x
(x.math.ucla.edu) is picking servers, then 128.97.70.x (x.math.ucla.edu)
is local, 128.97.12.x (x.pic.ucla.edu) is also local, but 128.97.31.x
(x.ess.ucla.edu) is a different department in another building. If we
have to make this discrimination, let's define it like this: let the
"length" be the number of bytes (including dots) in the client's (not
server's) canonical name. Starting from the end, count the number of
bytes that are equal in the two names. Then the penalty for the server
is 10*(1 - common/length). For example, if the client is
simba.math.ucla.edu and the server is tupelo.math.ucla.edu, length is
19, common is 14, and the penalty is 3 points (rounding up 2.6). It
would be 6 points for a server in PIC or ESS, which would be considered
equally bad.
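
Here is a small self-contained sketch of that scoring rule; the helper names
are mine, but the formula and the example numbers are exactly the ones above
(low score wins, with the explicit preference simply added to the intrinsic
penalty, so 30 swamps any subnet difference while 1 only breaks ties):

    #include <stdio.h>
    #include <string.h>

    /* number of trailing bytes the two canonical names have in common */
    static size_t common_suffix(const char *a, const char *b)
    {
        size_t la = strlen(a), lb = strlen(b), n = 0;

        while (n < la && n < lb && a[la - 1 - n] == b[lb - 1 - n])
            n++;
        return n;
    }

    /* penalty = 10 * (1 - common/length), length taken from the client's
     * canonical name, rounded up as in the example above */
    static int proximity_penalty(const char *client, const char *server)
    {
        size_t len = strlen(client);
        size_t common = common_suffix(client, server);

        return (int)((10 * (len - common) + len - 1) / len);
    }

    int main(void)
    {
        /* simba.math.ucla.edu vs tupelo.math.ucla.edu:
         * length 19, common 14, penalty 10*(1 - 14/19) = 2.6 -> 3 */
        int intrinsic = proximity_penalty("simba.math.ucla.edu",
                                          "tupelo.math.ucla.edu");
        int preference = 1;     /* sysop-assigned explicit preference */

        printf("score = %d\n", preference + intrinsic);  /* score = 4 */
        return 0;
    }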

I'd say to ignore server capabilities, e.g. NFSv4 versus NFSv2, because
that takes too much inside information -- you actually have to talk to
the server. If the client and server can negotiate to make the mount
happen, fine. If not, automount has to go to the next server. (And it
should remember that the back-version server didn't work out, for a
generous but non-infinite time like a few hours.)

NFSv4 has a nice behavior: if the client doesn't use an NFSv4 mount for a
configurable time (default 10 minutes), it will sync and give up its
filehandle, although I believe the client still claims that the
filesystem is mounted. On subsequent use it will [attempt to]
re-acquire the filehandle transparently. This means that if the server
crashes the filehandle will not be stale, although if the using program
wakes up before the server comes back, it will get an I/O error.

There's a lot of good stuff about NFSv4 on http://wiki.linux-nfs.org/. I got
NFSv4 working on SuSE 10.0 (kernel 2.6.13, nfs-utils-1.0.7) as a demo;
notes from this project (which need to be finished) are at
http://www.math.ucla.edu/~jimc/documents/nfsv4-0601.html

You asked where various steps should be implemented. Picking the server:
that's the job of the userspace daemon, and I don't see too much help that
the kernel might give. Readonly failover is another matter -- which I
think is important.

Here's a top of head kludge for failover: Autofs furnishes a synthetic
directory, let's call it /net/warez. The user daemon NFS mounts something
on it, example julia:/m1/warez. The user daemon mounts another
inter-layer, maybe FUSE, on top of the NFS, and client I/O operations go to
that filesystem. When the inter-layer starts getting I/O errors because
the NFS driver has decided that the server is dead, the inter-layer
notifies the automount daemon. It tells the kernel autofs driver to create
a temp name /net/xyz123, and it mounts a different server on it, let's say
sonia:/m2/warez. Then the names are renamed to, respectively, /net/xyz124
and /net/warez (the new one). Finally the automount daemon does a "bind"
or "move" mount to transfer the inter-layer to be mounted on the new
/net/warez. Then the I/O operation has to be re-tried on the new server.
Wrecked directories are cleaned up as circumstances allow.
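
For the last step, mount(2) does have an MS_MOVE flag, so the "move" itself
could be as small as the sketch below (directory names follow the example
above; whether MS_MOVE is actually usable everywhere this scheme needs it is
exactly the open question):

    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <sys/mount.h>

    /* Move the mount currently at 'from' so that it covers 'to' instead. */
    static int move_mount_over(const char *from, const char *to)
    {
        if (mount(from, to, NULL, MS_MOVE, NULL) < 0) {
            fprintf(stderr, "move %s -> %s: %s\n",
                    from, to, strerror(errno));
            return -1;
        }
        return 0;
    }

    int main(void)
    {
        /* The old /net/warez (carrying the inter-layer) has been renamed
         * /net/xyz124; slide the inter-layer across to the new /net/warez. */
        return move_mount_over("/net/xyz124", "/net/warez") ? 1 : 0;
    }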

I like the idea of minimal special case code in the kernel to support
failover. If it's even possible to do the "move" mount like I suggested.

James F. Carter Voice 310 825 2897 FAX 310 206 6673
UCLA-Mathnet; 6115 MSA; 405 Hilgard Ave.; Los Angeles, CA, USA 90095-1555
Email: [email protected] http://www.math.ucla.edu/~jimc (q.v. for PGP key)



2006-05-02 19:08:46

by Jeff Moyer

[permalink] [raw]
Subject: Re: [autofs] [RFC] Multiple server selection and replicated mount failover

==> Regarding Re: [autofs] [RFC] Multiple server selection and replicated mount failover; Jim Carter <[email protected]> adds:

jimc> On Tue, 2 May 2006, Ian Kent wrote:
>> For some time now I have had code in autofs that attempts to select an
>> appropriate server from a weighted list to satisfy server priority
>> selection and Replicated Server requirements. The code has been
>> problematic from the beginning and is still incorrect largely due to me
>> not merging the original patch well and also not fixing it correctly
>> afterward.
>>
>> So I'd like to have this work properly and to do that I also need to
>> consider read-only NFS mount fail over.

jimc> I'm glad to hear that there may be progress in server selection. But
jimc> I'm not sure if you're looking at the problem from the direction that
jimc> I am.

jimc> First, I don't think it's necessary to replicate the original Sun
jimc> behavior exactly, although it would be helpful but not mandatory to
jimc> allow something in the automount maps that resembles Solaris syntax,
jimc> to ease user (sysop) training.

Jim,

In my dealings with the automounter over the past 2.5 years, the major pain
point is that the Linux automounter does not function the same as that of
other UNIXes. Interoperability is a HUGE problem. I agree that the
replicated server selection business is, in some cases, a bit difficult to
follow. However, we can't simply ignore an enormous install base.

jimc> The current version of mount on Linux (util-linux-2.12) does not know
jimc> about picking servers from a list; at least the man page doesn't
jimc> know. This means that the whole job of server selection falls to
jimc> automount. I think that's the right way to design the system.
jimc> However, that also means that automount needs to know something about
jimc> NFS servers specifically. The less it knows, the better, in my
jimc> opinion, so the design of NFS mount options can be separated from
jimc> automount.

When initially thinking about this problem, I came to the same conclusion:
most, if not all of the server selection should be done in the automount
daemon. However, when you take into account the read-only NFS failover,
you realize that the kernel may have to figure a lot of this stuff out
anyway.

I'd really like to hear from someone with NFS expertise how they envision
read-only NFS failover working. We can take the server selection out of
the loop for the moment, and just concentrate on mechanics. Once we have a
good picture of how that will look, we can decide how to implement the
actual policy.

jimc> You asked where various steps should be implemented. Picking the
jimc> server: that's the job of the userspace daemon, and I don't see too
jimc> much help that the kernel might give. Readonly failover is another
jimc> matter -- which I think is important.

jimc> Here's a top of head kludge for failover: Autofs furnishes a
jimc> synthetic directory, let's call it /net/warez. The user daemon NFS
jimc> mounts something on it, example julia:/m1/warez. The user daemon
jimc> mounts another inter-layer, maybe FUSE, on top of the NFS, and client
jimc> I/O operations go to that filesystem. When the inter-layer starts
jimc> getting I/O errors because the NFS driver has decided that the server
jimc> is dead, the inter-layer notifies the automount daemon. It tells the
jimc> kernel autofs driver to create a temp name /net/xyz123, and it mounts
jimc> a different server on it, let's say sonia:/m2/warez. Then the names
jimc> are renamed to, respectively, /net/xyz124 and /net/warez (the new
jimc> one). Finally the automount daemon does a "bind" or "move" mount to
jimc> transfer the inter-layer to be mounted on the new /net/warez. Then
jimc> the I/O operation has to be re-tried on the new server. Wrecked
jimc> directories are cleaned up as circumstances allow.

We can do better than this. Again, I'd like to hear some ideas from Trond
et al. on how this could be accomplished in a clean way.

-Jeff




2006-05-03 03:55:29

by Ian Kent

[permalink] [raw]
Subject: Re: [autofs] [RFC] Multiple server selection and replicated mount failover

On Tue, 2 May 2006, Jim Carter wrote:

> On Tue, 2 May 2006, Ian Kent wrote:
>
> > For some time now I have had code in autofs that attempts to select an
> > appropriate server from a weighted list to satisfy server priority
> > selection and Replicated Server requirements. The code has been
> > problematic from the beginning and is still incorrect largely due to me
> > not merging the original patch well and also not fixing it correctly
> > afterward.
> >
> > So I'd like to have this work properly and to do that I also need to
> > consider read-only NFS mount fail over.
>
> I'm glad to hear that there may be progress in server selection. But I'm
> not sure if you're looking at the problem from the direction that I am.
>
> First, I don't think it's necessary to replicate the original Sun
> behavior exactly, although it would be helpful but not mandatory to
> allow something in the automount maps that resembles Solaris syntax, to
> ease user (sysop) training.

Sure but if it's different it must be a superset of the existing expected
functionality for compatibility reasons. The minimum requirement must be
that existing maps behave as people have come to expect.

>
> The current version of mount on Linux (util-linux-2.12) does not know
> about picking servers from a list; at least the man page doesn't know.
> This means that the whole job of server selection falls to automount. I
> think that's the right way to design the system. However, that also
> means that automount needs to know something about NFS servers
> specifically. The less it knows, the better, in my opinion, so the
> design of NFS mount options can be separated from automount.

Agreed.

mount(8) may be the "right" place to do this.

The Solaris automounter did this before mount knew about server selection,
but yes, we may need to keep this in autofs for a while longer.

>
> Your task is to make an ordered list of servers, best to worst, and to
> mount from the best one that answers. To my mind a concentration on
> "groups" is confusing for the implementor and user, as well as requiring
> inside knowledge so you can classify the servers. On the other hand,
> the sysop does want to be able to use the same automount map on a
> variety of machines (e.g. map served by NIS). So what discriminations
> might be made?

I don't think the user need know anything about the selection internals.

All that a user should need to know is that servers on a local network
will be selected before those on networks farther away. Our challenge is
to define a proximity metric for "local" and "farther away" in a sensible
way.

The reference to "group" is to identify the requirement that all servers
at the same proximity need to be tried before moving on to the next
closest list of servers.
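
Something like the following is what I have in mind (just a sketch with
made-up types and a pretend mount call, not autofs code): exhaust the
closest group before falling back to the next.

/* Sketch only: made-up types and a pretend mount call, not autofs code. */
#include <stdio.h>

enum proximity { PROX_SUBNET, PROX_NET, PROX_OTHER, PROX_LEVELS };

struct server {
    const char *name;
    enum proximity prox;
};

/* Stand-in for a real mount attempt. */
static int try_mount(const struct server *s)
{
    printf("trying %s\n", s->name);
    return -1;    /* pretend it failed so the walk continues */
}

/* Exhaust the closest group before falling back to the next one. */
static int select_and_mount(const struct server *list, int n)
{
    for (int p = PROX_SUBNET; p < PROX_LEVELS; p++)
        for (int i = 0; i < n; i++)
            if (list[i].prox == p && try_mount(&list[i]) == 0)
                return 0;
    return -1;
}

int main(void)
{
    const struct server servers[] = {
        { "far.example.com",   PROX_OTHER  },
        { "near.example.com",  PROX_SUBNET },
        { "other.example.com", PROX_NET    },
    };
    return select_and_mount(servers, 3) == 0 ? 0 : 1;
}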

>
> Explicit preferences set by the sysop should be able to trump all other
> discriminations -- or it should be possible to make them small enough to
> be overridden by intrinsic differences. Let's specify that intrinsic
> differences are 5 or 10 points, and the explicit preference could be set
> to 30 to override, or 1 to subtly influence. For example, you could
> give preferences of 0, 1 and 30 to designate a "last choice" server, and
> two more preferred servers which would be picked on intrinsic grounds,
> the first one being most preferred if all else is equal. (Low score
> wins -- say that explicitly in the documentation.)

This sounds like the weighting that may be attached to a server.

My original, incorrect interpretation was what you describe, and it leads
to incorrect server selection.

To meet the minimum requirement, network proximity needs to take priority
over the weights given to servers.

Of course, by and large, the NFS servers that people use this way are
close, so the trick again comes down to our definition of "local" and
"farther away".

I don't think we need to go as far as allocating point values; they may be
specified as weights, a multiplier of cost, where cost is an as yet
undefined function of proximity. This is probably as simple as nailing
down the "local" and "farther away" definition.

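A rough sketch of that reading, where the (weight + 1) scaling and the
numbers are only my assumption, since the cost function is still
undefined:

/* Sketch: pick the lowest weighted cost within one proximity group.
 * The (weight + 1) scaling is an assumption so that zero-based weights
 * still leave the measured cost as the tie-breaker. */
#include <stdio.h>

struct candidate {
    const char *name;
    unsigned int weight;   /* zero-based weight from the map entry */
    double cost;           /* e.g. measured response time in ms */
};

static const struct candidate *pick(const struct candidate *c, int n)
{
    const struct candidate *best = NULL;
    double best_cost = 0.0;

    for (int i = 0; i < n; i++) {
        double weighted = (c[i].weight + 1) * c[i].cost;
        if (best == NULL || weighted < best_cost) {
            best = &c[i];
            best_cost = weighted;
        }
    }
    return best;
}

int main(void)
{
    const struct candidate group[] = {
        { "fast-but-penalised", 3, 2.0 },  /* weighted cost 8.0 */
        { "slower-no-weight",   0, 5.0 },  /* weighted cost 5.0 -> wins */
    };
    printf("selected %s\n", pick(group, 2)->name);
    return 0;
}
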
>
> As for intrinsic differences, let's say that being on a different subnet
> costs 10 points. That distinction is important. I'm not too sure what
> the "same net" discrimination might mean. At my shop, if 128.97.4.x
> (x.math.ucla.edu) is picking servers, then 128.97.70.x (x.math.ucla.edu)
> is local, 128.97.12.x (x.pic.ucla.edu) is also local, but 128.97.31.x
> (x.ess.ucla.edu) is a different department in another building. If we
> have to make this discrimination, let's define it like this: let the
> "length" be the number of bytes (including dots) in the client's (not
> server's) canonical name. Starting from the end, count the number of
> bytes that are equal in the two names. Then the penalty for the server
> is 10*(1 - common/length). For example, if the client is
> simba.math.ucla.edu and the server is tupelo.math.ucla.edu, length is
> 19, common is 14, and the penalty is 3 points (rounding up 2.6). It
> would be 6 points for a server in PIC or ESS, which would be considered
> equally bad.

Good idea, but equally there are many examples where this would work out
really badly.

Consider a company that has names like:

server.addom.company.com and server.nfsserv.company.com

where the addom subdomain contains their world-wide Active Directory
servers and the nfsserv subdomain contains their world-wide NFS servers.

Don't get me wrong, the exact same problem exists with network addresses
as well, such as a company with a class B network and a bunch of VPN
connections.

Sure, it's a contrived example, but I've seen similar naming schemes.

The only thing we can really rely on is the subnet of the local
interface(s) of the machine we are doing the calculation on.
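
For what it's worth, that check might look something like this (IPv4
only, using getifaddrs(3), purely illustrative, and the server address in
main() is made up):

/* Sketch: is the server on the subnet of one of our own interfaces? */
#include <sys/types.h>
#include <ifaddrs.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>

static int on_local_subnet(struct in_addr server)
{
    struct ifaddrs *ifa, *p;
    int local = 0;

    if (getifaddrs(&ifa) != 0)
        return 0;

    for (p = ifa; p != NULL; p = p->ifa_next) {
        if (p->ifa_addr == NULL || p->ifa_netmask == NULL ||
            p->ifa_addr->sa_family != AF_INET)
            continue;
        struct in_addr addr = ((struct sockaddr_in *)p->ifa_addr)->sin_addr;
        struct in_addr mask = ((struct sockaddr_in *)p->ifa_netmask)->sin_addr;
        if ((addr.s_addr & mask.s_addr) == (server.s_addr & mask.s_addr)) {
            local = 1;
            break;
        }
    }
    freeifaddrs(ifa);
    return local;
}

int main(void)
{
    struct in_addr srv;

    inet_pton(AF_INET, "128.97.4.10", &srv);   /* example server address */
    printf("server is %s\n", on_local_subnet(srv) ? "local" : "not local");
    return 0;
}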

>
> I'd say to ignore server capabilities, e.g. NFSv4 versus NFSv2, because
> that takes too much inside information -- you actually have to talk to
> the server. If the client and server can negotiate to make the mount
> happen, fine. If not, automount has to go to the next server. (And it
> should remember that the back-version server didn't work out, for a
> generous but non-infinite time like a few hours.)

Again, this was my original approach, which is probably contributing to
the incorrect selection in autofs now.

Nevertheless I'm not sure how useful this discrimination is, and NFS v4
probably needs to be considered separately.

I think we will have to connect to the server in some way to establish
cost, or even just to establish availability. We don't want to even
attempt to mount from a server that is down, but at the same time we
can't remove it from our list, as it may come back.
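
A minimal probe could be as little as a timed connect, something like the
sketch below (port 2049 and the two second timeout are assumptions; a
portmap query or an RPC NULL call would be a more faithful test):

/* Sketch: a timed TCP connect as a crude up/down and distance test. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Returns response time in ms, or -1.0 if the server did not answer. */
static double probe(const char *ip, int port)
{
    struct sockaddr_in sa;
    struct timeval start, end, timeout = { 2, 0 };
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1.0;
    /* On Linux, SO_SNDTIMEO also bounds the connect() itself. */
    setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &timeout, sizeof(timeout));

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_port = htons(port);
    inet_pton(AF_INET, ip, &sa.sin_addr);

    gettimeofday(&start, NULL);
    int ret = connect(fd, (struct sockaddr *)&sa, sizeof(sa));
    gettimeofday(&end, NULL);
    close(fd);

    if (ret != 0)
        return -1.0;
    return (end.tv_sec - start.tv_sec) * 1000.0 +
           (end.tv_usec - start.tv_usec) / 1000.0;
}

int main(void)
{
    double ms = probe("128.97.4.10", 2049);    /* example address */

    if (ms < 0)
        printf("server down or unreachable\n");
    else
        printf("responded in %.1f ms\n", ms);
    return 0;
}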

>
> NFSv4 has a nice behavior: if the client doesn't use a NFSv4 mount for a
> configurable time (default 10 minutes), it will sync and give up its
> filehandle, although I believe the client still claims that the
> filesystem is mounted. On subsequent use it will [attempt to]
> re-acquire the filehandle transparently. This means that if the server
> crashes the filehandle will not be stale, although if the using program
> wakes up before the server comes back, it will get an I/O error.
>
> There's a lot of good stuff about NFSv4 on http://wiki.linux-nfs.org/ I got
> NFSv4 working on SuSE 10.0 (kernel 2.6.13, nfs-utils-1.0.7) as a demo;
> notes from this project (which need to be finished) are at
> http://www.math.ucla.edu/~jimc/documents/nfsv4-0601.html
>
> You asked where various steps should be implemented. Picking the server:
> that's the job of the userspace daemon, and I don't see too much help that
> the kernel might give. Readonly failover is another matter -- which I
> think is important.
>
> Here's a top of head kludge for failover: Autofs furnishes a synthetic
> directory, let's call it /net/warez. The user daemon NFS mounts something
> on it, example julia:/m1/warez. The user daemon mounts another
> inter-layer, maybe FUSE, on top of the NFS, and client I/O operations go to
> that filesystem. When the inter-layer starts getting I/O errors because
> the NFS driver has decided that the server is dead, the inter-layer
> notifies the automount daemon. It tells the kernel autofs driver to create
> a temp name /net/xyz123, and it mounts a different server on it, let's say
> sonia:/m2/warez. Then the names are renamed to, respectively, /net/xyz124
> and /net/warez (the new one). Finally the automount daemon does a "bind"
> or "move" mount to transfer the inter-layer to be mounted on the new
> /net/warez. Then the I/O operation has to be re-tried on the new server.
> Wrecked directories are cleaned up as circumstances allow.

This sounds a lot like it would require a stackable filesystem, and that's
probably the only way such an approach would be workable.

Getting a stackable filesystem to work well enough to live in the kernel
is a huge task, but it is an option.

But there must be a simpler way.
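
For reference, the "bind"/"move" step Jim mentions maps directly onto
MS_MOVE in mount(2); a minimal sketch, with made-up paths and assuming
the replacement mount has already been set up:

/* Sketch: just the "move" step - re-point /net/warez at a directory that
 * already has the replacement server mounted on it.  Paths are made up
 * and this needs root. */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* /net/xyz123 would already carry sonia:/m2/warez at this point. */
    if (mount("/net/xyz123", "/net/warez", NULL, MS_MOVE, NULL) != 0) {
        perror("mount(MS_MOVE)");
        return 1;
    }
    return 0;
}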

Ian


