LinuxLists.cc - failover-NFS server problem

2004-04-19 18:14:28

Subject: failover-NFS server problem

Hi all,

I ran into a problem running the Connectathon tests with NFS. The scenario I want to test is whether NFS keeps working when the server fails over back and forth (Linux-HA controls the failover; Linux-HA is at http://www.linux-ha.org). My setup is as follows:

1. Machine A and Machine B are servers; exactly one of them is active at any time. The active machine holds a floating IP that serves as the service IP.
2. Machine C is the client. It mounts a directory via the floating IP.
3. Machine A and Machine B share a Fibre Channel disk, and the exported filesystem resides on it. On both machines,
/var/lib/nfs is a symbolic link to a directory on the shared disk.

All machines run Red Hat 9 with kernel 2.4.26 and nfs-utils-1.0.1-2.9.

The service switches between A and B. In each switch,
the original server stops nfs, stops nfslock, unmounts the shared disk and releases the floating IP;
the backup server then takes over the floating IP, mounts the shared disk, starts nfslock and starts nfs.
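
The switch-over order above can be written down as two ordered command lists. This is a minimal dry-run sketch, not the Linux-HA configuration used here: the device /dev/sdb1, the mount point /export, the address 10.0.0.100/24, the interface eth0 and the helper names are all hypothetical, and the commands are only returned and printed, never executed.

```python
"""Dry-run sketch of the NFS fail-over order described above.

Device, mount point, address and interface are hypothetical placeholders.
Nothing is executed; the commands are only returned (and printed).
"""
from typing import List

SHARED_DEV = "/dev/sdb1"       # hypothetical shared Fibre Channel device
MOUNT_POINT = "/export"        # hypothetical mount point of the shared disk
FLOATING_IP = "10.0.0.100/24"  # hypothetical floating service IP
IFACE = "eth0"                 # hypothetical network interface


def release_steps() -> List[str]:
    """Commands the active server runs when giving up the service."""
    return [
        "service nfs stop",       # stop the NFS server daemons
        "service nfslock stop",   # then stop the lock daemons (rpc.statd)
        f"umount {MOUNT_POINT}",  # release the shared disk
        f"ip addr del {FLOATING_IP} dev {IFACE}",  # give up the floating IP
    ]


def takeover_steps() -> List[str]:
    """Commands the backup server runs when taking over the service."""
    return [
        f"ip addr add {FLOATING_IP} dev {IFACE}",  # grab the floating IP
        f"mount {SHARED_DEV} {MOUNT_POINT}",       # mount the shared disk
        "service nfslock start",  # lock daemons come up before nfs
        "service nfs start",
    ]


if __name__ == "__main__":
    for cmd in release_steps() + takeover_steps():
        print(cmd)
```

Keeping the two lists symmetric makes it easy to check that whatever is stopped last on the old server is started first on the new one.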

During testing, I ran the Connectathon locking test on the client while the servers were switching back and forth.
I usually get an error on the 7th or 8th pass; the error code is 37 (ENOLCK, "No locks available"). So far I have seen several variants
of this failure, all involving two processes (a parent and a child) competing for the same lock:

1. The child acquires the lock and is then killed; the parent then tries to lock the file. This should succeed,
but it returns an error with errno=37.
2. The child locks the file and then unlocks it; the parent then tries to lock the file. This should succeed,
but it returns an error with errno=37.
3. Similar to case 2 with the roles reversed: the parent locks the file and then unlocks it; the child then tries to lock the file. This should succeed,
but it returns an error with errno=37.
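
Case 2 can be reproduced outside Connectathon with a few lines of Python. This is a minimal sketch, not the Connectathon test code: the function name is hypothetical, it uses POSIX record locks through fcntl.lockf, and the target file must already exist and be writable. On a local filesystem the parent's lock should always succeed; pointed at a file on the NFS mount during fail-over, a False return corresponds to the errno 37 (ENOLCK) failure described above.

```python
import errno
import fcntl
import os


def child_lock_then_parent_lock(path: str) -> bool:
    """Reproduce case 2: the child locks and unlocks, then the parent locks.

    Returns True if the parent's non-blocking exclusive lock succeeds and
    False if it fails with ENOLCK (errno 37 on Linux). Any other error is
    re-raised. The file must already exist and be writable.
    """
    pid = os.fork()
    if pid == 0:
        # Child: take and release a POSIX record lock. fcntl.lockf wraps the
        # fcntl(2) locking calls; on an NFSv2/v3 mount these go to the server's
        # lock manager.
        status = 1
        try:
            with open(path, "r+b") as f:
                fcntl.lockf(f, fcntl.LOCK_EX)
                fcntl.lockf(f, fcntl.LOCK_UN)
            status = 0
        finally:
            os._exit(status)  # never fall through into the parent's code path

    # Parent: wait until the child has released the lock and exited.
    _, wait_status = os.waitpid(pid, 0)
    if not os.WIFEXITED(wait_status) or os.WEXITSTATUS(wait_status) != 0:
        raise RuntimeError("child could not lock/unlock the file")

    with open(path, "r+b") as f:
        try:
            fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            if exc.errno == errno.ENOLCK:
                return False
            raise
        fcntl.lockf(f, fcntl.LOCK_UN)
        return True
```

To exercise the fail-over case, call it in a loop on a file under the NFS mount (for example a hypothetical /mnt/nfs/lockfile) while the servers switch back and forth.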

By the way, running bonnie++ instead of Connectathon in the above scenario finishes successfully, which is good :-)

Any comments or suggestions are welcome.
Thanks
-Guochun

-------------------------------------------------------
This SF.Net email is sponsored by: IBM Linux Tutorials
Free Linux tutorial presented by Daniel Robbins, President and CEO of
GenToo technologies. Learn everything from fundamentals to system
administration.http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2004-04-19 23:10:35

by Ragnar Kjørstad

Subject: Re: failover-NFS server problem

Did you modify the init-scripts to supply the "-n" option to rpc.statd?

Statd needs to use the same name on both servers (the one used by your
clients to mount filesystems) for locking to work across failures.

The init-script in nfs-utils in RH9 doesn't support this. Take a look at
the init-script in Fedora Core 1 or the official nfs-utils package.
(Possibly you can just upgrade to FC1's nfs-utils 1.0.6.)

Then you need to change the sysconfig files to set the STATD_HOSTNAME
variable to match the name of the floating IP.
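
What the modified init script has to do can be sketched as follows. This is written in Python for illustration rather than as the shell init script itself, and it is an assumption about the logic, not a copy of the Fedora Core 1 script: read STATD_HOSTNAME from a sysconfig-style file and, if it is set, pass it to rpc.statd with -n. The helper names and the hostname nfs-vip.example.com are hypothetical.

```python
import shlex
from typing import Dict, List


def parse_sysconfig(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines (as in /etc/sysconfig/nfs), skipping comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        # shlex.split removes shell-style quoting around the value.
        values[key.strip()] = " ".join(shlex.split(raw))
    return values


def statd_command(sysconfig_text: str) -> List[str]:
    """Build the rpc.statd command line, adding -n when STATD_HOSTNAME is set."""
    cmd = ["rpc.statd"]
    name = parse_sysconfig(sysconfig_text).get("STATD_HOSTNAME", "")
    if name:
        # Must be identical on both servers: the name clients use to mount,
        # i.e. the name of the floating IP.
        cmd += ["-n", name]
    return cmd
```

In the shell init script the equivalent is simply appending -n "$STATD_HOSTNAME" to the rpc.statd invocation when the variable is non-empty.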

--
Ragnar Kjørstad
Software Engineer
Scali - http://www.scali.com
High Performance Clustering

2004-04-20 16:23:32

by Guochun Shi

Subject: Re: failover-NFS server problem

Thanks for your reply, Ragnar. I modified the init.d/nfslock script to pass STATD_HOSTNAME to rpc.statd and am now re-running the test.
So far so good.

By the way, I got lots of warning messages in /var/log/messages on the client side, which may be related to the problem:

Apr 20 11:08:15 probe02 portmap[18643]: connect from 141.142.61.99 to getport(status): request from unauthorized host
Apr 20 11:15:49 probe02 portmap[18682]: connect from 141.142.61.98 to getport(nlockmgr): request from unauthorized host
Apr 20 11:15:59 probe02 portmap[18683]: connect from 141.142.61.98 to getport(nlockmgr): request from unauthorized host

My floating IP is 141.142.61.111; 141.142.61.99 and 141.142.61.98 are the two server machines' own IPs. Since the client mounts from the floating IP, it seems the client does not accept lock
messages coming from the servers' actual IPs. Could that be the cause?

Thanks
Guochun

2004-04-20 16:58:02

by Guochun Shi

Subject: Re: failover-NFS server problem

At 10:43 AM 4/20/2004 -0600, you wrote:
>Hi Guochun,
>
>This was mentioned in both my article and the one specifically on NFS with Linux-HA. I didn't realize you hadn't done this...

I will go back and read the articles more carefully :)

The test failed again after those modifications :-(
I don't think those warning messages are the cause of the failure, since they appear about once every several seconds.

-Guochun

2004-05-19 00:37:50

by Guochun Shi

Subject: Re: failover-NFS server problem

I added "-n clustername" to /etc/init.d/nfslockd and ran the test again, with the same results.

I also tried tests with multiple clients competing for the same file lock; they all failed. A detailed description is here:
http://wiki.trick.ca/linux-ha/HaNFS

Any comments and suggestions are welcome.

Thanks
Guochun
