I seem to have run across a problem with NFS and fcntl locks. I'm
trying to implement a HA-NFS solution using heartbeat, DRBD, LVM2, etc.
I'm running the following:
2.6.6 kernel
nfs-kernel-server and nfs-common 1.0.6-3 (debian packages)
The underlying filesystem is reiserfs. Essentially what I'm seeing is
that when I try to shut down NFS and unmount the filesystem for a
failover, I'm unable to unmount if I have an fcntl lock on the file.
Here's the C program I used to test the locks (my C coding is not the
best, I copied a lot of this from R. Stevens' book):
-------------------------[snip]------------------------
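#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Declared before main() so the write_lock() macro expands to a known function. */
int lock_reg(int fd, int cmd, int type, off_t offset, int whence, off_t len);

/* Non-blocking request for a write lock on a byte range. */
#define write_lock(fd, offset, whence, len) \
    lock_reg(fd, F_SETLK, F_WRLCK, offset, whence, len)

int main(int argc, char *argv[]) {
    int fd;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* offset 0 from SEEK_SET with len 0 locks the whole file. */
    if (write_lock(fd, 0, SEEK_SET, 0) < 0) {
        printf("Unable to acquire lock!\n");
    } else {
        printf("Got the lock. Sleeping for 300 secs.\n");
        sleep(300);
    }
    close(fd);  /* closing the descriptor also releases the lock */
    return 0;
}

/* Fill in a struct flock and hand it to fcntl(). */
int lock_reg(int fd, int cmd, int type, off_t offset, int whence, off_t len) {
    struct flock lock;

    lock.l_type = type;
    lock.l_start = offset;
    lock.l_whence = whence;
    lock.l_len = len;
    return fcntl(fd, cmd, &lock);
}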
-----------------------[snip]-----------------------------
I run this program against a file on an NFS mounted directory on the
client machine. On the server, I then shut down NFS (using the debian
nfs-common and nfs-kernel-server startup scripts).
I'm then unable to unmount the underlying filesystem. The error message
is (yes, umount prints it twice for some reason):
umount: /services/NFS/home: device is busy
umount: /services/NFS/home: device is busy
If I then start NFS again and kill the locktest program, I'm
able to shut down NFS and unmount the filesystem.
I also did a test where I just opened the file r/w without locking it,
and it didn't seem to have the same problem, so it seems like the
fcntl() lock is what is causing the problem (though I could be wrong
here).
I've been able to replicate this problem with /proc/fs/nfsd mounted and
unmounted on the server.
I also tried applying the exportfs patch that was in the thread:
nfsd, rmtab, failover, and stale filehandles
on this mailing list earlier this month, and it didn't help. Has anyone
else seen this problem?
If there's any other info you need me to provide to help diagnose this,
please don't hesitate to ask!
Thanks,
Jeff
-------------------------------------------------------
This SF.Net email is sponsored by: Oracle 10g
Get certified on the hottest thing ever to hit the market... Oracle 10g.
Take an Oracle 10g class now, and we'll give you the exam FREE.
http://ads.osdn.com/?ad_id=3149&alloc_id=8166&op=click
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs
FWIW, I've also been able to replicate this problem on the 2.4 kernels.
I'm unable to unmount the underlying filesystem of the NFS
server until the POSIX locks that the clients hold have been released.
Needless to say, this is not good for a HA-NFS server, since it prevents
me from reliably failing over. How have others dealt with this problem?
Or perhaps I don't have something configured correctly?
Any help or insight would be much appreciated...
-- Jeff
-------------------------------------------------------
Got a response from the linux-ha list that this is a known bug :-(. The
current workaround for HA setups seems to be to force an immediate
reboot if it occurs (blech!). Any ideas what the issue is, and how it
can be fixed?
Thanks,
Jeff