Hi, the open(2) man page says:
O_EXCL When used with O_CREAT, if the file already exists
it is an error and the open will fail. O_EXCL is
broken on NFS file systems, programs which rely on
it for performing locking tasks will contain a race
condition. The solution for performing atomic file
locking using a lockfile is to create a unique file
on the same fs (e.g., incorporating hostname and
pid), use link(2) to make a link to the lockfile.
If link() returns 0, the lock is successful.
Otherwise, use stat(2) on the unique file to check if
its link count has increased to 2, in which case
the lock is also successful.
I coded this up and tried it here on a cluster of different operating
systems (Linux 2.4.5 server, linux, freebsd, solaris, aix, hpux, irix
clients) and it doesn't work.
2 questions:
a) is it the belief of folks here that this should work?
b) if performance isn't a big issue, is there any portable way to do
locking over NFS with just files?
On Sunday October 14, [email protected] wrote:
> Hi, the open(2) man page says:
>
> O_EXCL When used with O_CREAT, if the file already exists
> it is an error and the open will fail. O_EXCL is
> broken on NFS file systems, programs which rely on
> it for performing locking tasks will contain a race
> condition. The solution for performing atomic file
> locking using a lockfile is to create a unique file
> on the same fs (e.g., incorporating hostname and
> pid), use link(2) to make a link to the lockfile.
> If link() returns 0, the lock is successful.
> Otherwise, use stat(2) on the unique file to check if
> its link count has increased to 2, in which case
> the lock is also successful.
>
> I coded this up and tried it here on a cluster of different operating
> systems (Linux 2.4.5 server, linux, freebsd, solaris, aix, hpux, irix
> clients) and it doesn't work.
>
> 2 questions:
>
> a) is it the belief of folks here that this should work?
No. It is unsupportable with NFSv2.
The NFSv3 protocol does provide support, but I don't think the Linux
NFSv3 client supports it yet because the VFS layer tries to handle all
the exclusion, and doesn't give the file-system a chance.
>
> b) if performance isn't a big issue, is there any portable way to do
> locking over NFS with just files?
Instead of creating a lock file, create a lock symlink.
Have the content of the symlink be something recognisably unique.
e.g. hostname.pid
If the "symlink" syscall succeeds, you have got the lock.
If it fails, issue a readlink and see if the content is what you
tried to create (RPC packet loss and retransmit could have caused
an incorrect failure return). If it is, you have the lock.
If not, you don't.
Similar tricks can be done with hard links if you really want a
file.
i.e. create a file with a unique name and then hard-link it to the
lock-file-name. On apparent failure, check the inode number.
With all these approaches (including O_EXCL) the tricky bit is
cleaning up after a failed application left a lockfile lying
around.
Automatically deleting it is racy unless you guarantee that only
one process could ever consider deleting an old lock file. e.g. a
cron job on the fileserver that runs every 5 minutes and deletes
any lock file older than 10 minutes.
NeilBrown
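The cleanup step described above (a single job that deletes any lock older than some age) could be sketched roughly as follows. This is only an illustration: the function name reap_stale_lock and the use of the lock's mtime as its age are assumptions, not something from the thread. It uses lstat() so that it also works when the lock is a symlink.

```c
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/*
 * Remove `lockfile` if it is older than `max_age` seconds.
 * Returns 1 if it was removed, 0 if it was left alone, -1 on error.
 * Hypothetical helper: only safe if exactly one process (e.g. a cron
 * job on the file server) ever runs it, otherwise the delete is racy.
 */
int
reap_stale_lock(const char *lockfile, time_t max_age)
{
	struct stat	st;

	/* lstat(), not stat(): the lock may be a dangling symlink. */
	if (lstat(lockfile, &st) != 0) return (-1);

	/* Young enough: assume the owner is still alive. */
	if (time(NULL) - st.st_mtime <= max_age) return (0);

	return (unlink(lockfile) == 0 ? 1 : -1);
}
```

With the numbers from the message above, the cron job would call reap_stale_lock(path, 10 * 60) every 5 minutes.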
> a) is it the belief of folks here that this should work?
NFSv2 doesn't have the needed semantics.
> b) if performance isn't a big issue, is there any portable way to do
> locking over NFS with just files?
The classic way is to use link().
> Instead of creating a lock file, create a lock symlink.
> Have the content of the symlink be something recognisably unique.
> e.g. hostname.pid
> If the "symlink" syscall succeeds, you have got the lock.
> If it fails, issue a readlink and see if the content is what you
> tried to create (RPC packet loss and retransmit could have caused
> an incorrect failure return). If it is, you have the lock.
> If not, you don't.
OK, tried that too, here's the code. Doesn't work. Neither does the
link approach. Am I doing something wrong? It seems to me that I'm
completely at the mercy of the client NFS implementation - if it caches
stuff wrong, I'm hosed. There has to be some cute trick to get past this.
--lm
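#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define	unless(e)	if (!(e))
#define	streq(a, b)	(strcmp((a), (b)) == 0)

char	*aprintf(char *fmt, ...);
char	*sccs_gethost(void);
int	mine(char *file);

/*
 * Take the lock by creating a symlink whose content identifies us.
 * Waits up to "seconds" seconds; 0 means wait forever.
 */
int
sccs_lockfile(char *lockfile, int seconds)
{
	char	*s;
	char	buf[300];
	int	n, uslp = 1000, waited = 0;

	s = aprintf("%u %s", (unsigned)getpid(), sccs_gethost());
	for ( ;; ) {
		if (symlink(s, lockfile) == 0) break;
		/* a retransmitted RPC can report failure after success */
		n = readlink(lockfile, buf, sizeof(buf) - 1);
		if (n > 0) {
			buf[n] = 0;
			if (streq(s, buf)) break;
		}
		if (seconds && ((waited / 1000000) >= seconds)) {
			fprintf(stderr, "timed out waiting for %s\n", lockfile);
			free(s);
			return (-1);
		}
		usleep(uslp);
		waited += uslp;
		if (uslp < 20000) uslp <<= 1;	/* back off, capped near 20ms */
	}
	free(s);
	return (0);
}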
/*
 * Usage: a.out iterations lockfile
 */
int
main(int ac, char **av)
{
	int	i, iter;
	int	me = getpid();

	unless (ac == 3) return (1);
	unless ((iter = atoi(av[1])) > 0) return (1);
	printf("%d starts\n", me);
	for (i = 1; i <= iter; ++i) {
		sccs_lockfile(av[2], 0);
		assert(mine(av[2]));	/* nobody else may hold the lock now */
		unlink(av[2]);
		unless (i % 10) printf("%d locked %d times\n", me, i);
	}
	printf("%d done\n", me);
	return (0);
}
/*
 * Return 1 if the lock symlink's content names this process, else 0.
 */
int
mine(char *file)
{
	char	buf[300];
	char	*s;
	int	n;

	/* leave room for the terminating NUL */
	n = readlink(file, buf, sizeof(buf) - 1);
	if (n > 0) {
		s = aprintf("%u %s", (unsigned)getpid(), sccs_gethost());
		buf[n] = 0;
		n = streq(s, buf);
		unless (n) fprintf(stderr, "%s != %s\n", s, buf);
		free(s);
		return (n);
	}
	return (0);
}
/*
 * This function works like sprintf(), except it returns a
 * malloc'ed buffer which the caller should free when done.
 * Returns 0 if memory cannot be allocated.
 */
char *
aprintf(char *fmt, ...)
{
	va_list	ptr;
	int	rc, size = strlen(fmt) + 64;
	char	*buf = malloc(size);

	if (!buf) return (0);
	va_start(ptr, fmt);
	rc = vsnprintf(buf, size, fmt, ptr);
	va_end(ptr);
	/*
	 * On IRIX, it truncates and returns size-1.
	 * We can't assume that that is OK, even though that might be
	 * a perfect fit. We always bump up the size and try again.
	 * This can rarely lead to an extra alloc that we didn't need,
	 * but that's tough.
	 */
	while ((rc < 0) || (rc >= (size-1))) {
		size *= 2;
		free(buf);
		buf = malloc(size);
		if (!buf) return (0);
		va_start(ptr, fmt);
		rc = vsnprintf(buf, size, fmt, ptr);
		va_end(ptr);
	}
	return (buf); /* caller should free */
}
char *
sccs_gethost(void)
{
	static char	host[256];

	if (gethostname(host, sizeof(host)) == -1) return "?";
	host[sizeof(host) - 1] = 0;	/* may be unterminated if truncated */
	return (host);
}
In article <[email protected]>,
Larry McVoy <[email protected]> wrote:
>OK, tried that too, here's the code. Doesn't work. Neither does the
>link approach. Am I doing something wrong? It seems to me that I'm
>completely at the mercy of the client NFS implementation - if it caches
>stuff wrong, I'm hosed. There has to be some cute trick to get past this.
Download ftp://ftp.debian.org/debian/pool/main/libl/liblockfile/liblockfile_1.03.tar.gz
It contains NFS safe locking functions, and it knows how to work around
NFS client caches. And it documents all algorithms in the manpages too.
ALGORITHM
The algorithm that is used to create a lockfile in an
atomic way, even over NFS, is as follows:
1 A unique file is created. In printf format, the
name of the file is .lk%05d%x%s. The first argument
(%05d) is the current process id. The second argument
(%x) consists of the 4 minor bits of the value
returned by time(2). The last argument is the system
hostname.
2 Then the lockfile is created using link(2). The
return value of link is ignored.
3 Now the lockfile is stat()ed. If the stat fails, we
go to step 6.
4 The stat value of the lockfile is compared with
that of the temporary file. If they are the same,
we have the lock. The temporary file is deleted and
a value of 0 (success) is returned to the caller.
5 A check is made to see if the existing lockfile is
a valid one. If it isn't valid, the stale lockfile
is deleted.
6 Before retrying, we sleep for n seconds. n is initially
5 seconds, but after every retry 5 extra
seconds is added up to a maximum of 60 seconds (an
incremental backoff). Then we go to step 2 up to
retries times.
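Steps 1 to 4 above can be sketched in C roughly as follows. This is a minimal illustration and not liblockfile's actual source: the function name try_link_lock is made up, the caller is assumed to supply a unique temporary name on the same file system, and the stale-lock check and the retry loop (steps 5 and 6) are left out. "The same" in step 4 is interpreted here as matching device and inode numbers.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Try once to take `lockfile` by hard-linking a unique temp file to it.
 * Returns 0 if we now hold the lock, -1 otherwise.
 * `tmpname` is a caller-chosen unique name on the same file system
 * (e.g. built from pid, time and hostname); both names are hypothetical.
 */
int
try_link_lock(const char *tmpname, const char *lockfile)
{
	struct stat	st_tmp, st_lock;
	int		fd, got = -1;

	/* Step 1: create the unique file; uniqueness comes from the name. */
	fd = open(tmpname, O_WRONLY | O_CREAT, 0644);
	if (fd < 0) return (-1);
	close(fd);

	/* Step 2: link it to the lock name; the return value is not trusted. */
	(void)link(tmpname, lockfile);

	/* Steps 3-4: we hold the lock only if both names are the same inode. */
	if (stat(tmpname, &st_tmp) == 0 && lstat(lockfile, &st_lock) == 0 &&
	    st_tmp.st_dev == st_lock.st_dev &&
	    st_tmp.st_ino == st_lock.st_ino) {
		got = 0;
	}

	/* The temporary name is no longer needed either way. */
	unlink(tmpname);
	return (got);
}
```

Ignoring the result of link() and checking the inode afterwards is what makes this robust against a retransmitted request that reports failure even though the first attempt succeeded.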
REMOTE FILE SYSTEMS AND THE KERNEL ATTRIBUTE CACHE
If you are using lockfile_create to create a lock on a
file that resides on a remote server, and you already have
that file open, you need to flush the NFS attribute cache
after locking. This is needed to prevent the following
scenario:
o open /var/mail/USERNAME
o attributes, such as size, inode, etc are now cached in
the kernel!
o meanwhile, another remote system appends data to
/var/mail/USERNAME
o grab lock using lockfile_create()
o seek to end of file
o write data
Now the end of the file really isn't the end of the file -
the kernel cached the attributes on open, and st_size is
not the end of the file anymore. So after locking the
file, you need to tell the kernel to flush the NFS file
attribute cache.
The only portable way to do this is the POSIX fcntl() file
locking primitives - locking a file using fcntl() has the
fortunate side-effect of invalidating the NFS file
attribute cache of the kernel.
lockfile_create() cannot do this for you for two reasons.
One, it just creates a lockfile- it doesn't know which
file you are actually trying to lock! Two, even if it
could deduce the file you're locking from the filename, by
just opening and closing it, it would invalidate any
existing POSIX locks the program might already have on
that file (yes, POSIX locking semantics are insane!).
So basically what you need to do is something like this:
fd = open("/var/mail/USER", O_RDWR);
.. program code ..
lockfile_create("/var/mail/USER.lock", x, y);
/* Invalidate NFS attribute cache using POSIX locks */
if (lockf(fd, F_TLOCK, 0) == 0) lockf(fd, F_ULOCK, 0);
You have to be careful with this if you're putting this in
an existing program that might already be using fcntl(),
flock() or lockf() locking- you might invalidate existing
locks.
There is also a non-portable way. A lot of NFS operations
return the updated attributes - and the Linux kernel actually
uses these to update the attribute cache. One of
these operations is chmod(2).
So stat()ing a file and then chmod()ing it to st.st_mode
will not actually change the file, nor will it interfere
with any locks on the file, but it will invalidate the
attribute cache. The equivalent to use from a shell script
would be
chmod u=u /var/mail/USER
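From C, the same stat()-then-chmod() trick might look like the sketch below. The function name refresh_attr_cache is hypothetical, and masking st_mode down to the permission bits before passing it to chmod() is an assumption added here for safety. Whether the attribute cache is really refreshed depends on the NFS client, as noted above.

```c
#include <sys/stat.h>

/*
 * Re-apply a file's current mode so that the NFS client receives fresh
 * attributes in the reply. The file itself is left unchanged.
 * Returns 0 on success, -1 on error.
 */
int
refresh_attr_cache(const char *path)
{
	struct stat	st;

	if (stat(path, &st) != 0) return (-1);

	/* Keep only the permission, setuid/setgid and sticky bits. */
	return (chmod(path, st.st_mode & 07777));
}
```

A caller would take the lock with lockfile_create() first and then call refresh_attr_cache("/var/mail/USER") before seeking to the end of the file.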
Mike.
--
Move sig.