2002-10-22 14:09:52

by Bill Schrier

Subject: NFS Locking Issue - Solaris-Linux


#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>

/* Your basic Stevens cut-and-paste */
static int
lock_reg (int fd, int cmd, int type, off_t offset, int whence, off_t len)
{
  struct flock lock;

  lock.l_type = type;     /* F_RDLCK, F_WRLCK, F_UNLCK */
  lock.l_start = offset;  /* byte offset relative to whence */
  lock.l_whence = whence; /* SEEK_SET, SEEK_CUR, SEEK_END */
  lock.l_len = len;       /* #bytes, 0 for eof */

  return fcntl (fd, cmd, &lock);
}

#define lock_entire_file(fd) \
  lock_reg ((fd), F_SETLK, F_WRLCK, 0, SEEK_SET, 0)
#define unlock_entire_file(fd) \
  lock_reg ((fd), F_SETLK, F_UNLCK, 0, SEEK_SET, 0)

int
main (int argc, char **argv)
{
  int result;
  int fd;

  if (argc != 2)
    {
      fprintf (stderr, "Must pass in a single file to lock\n");
      return 1;
    }

  fd = open (argv[1], O_RDWR);
  if (fd < 0)
    {
      fprintf (stderr, "Failed to open '%s': %s\n",
               argv[1], strerror (errno));
      return 1;
    }

  /* F_SETLK does not wait for a conflicting lock to be released */
  result = lock_entire_file (fd);

  if (result < 0)
    {
      fprintf (stderr, "Failed to lock '%s': %s\n",
               argv[1], strerror (errno));
      close (fd);
      return 1;
    }

  printf ("Successfully locked '%s', unlocking and exiting\n",
          argv[1]);

  /* Release the lock explicitly; close() would also drop it */
  result = unlock_entire_file (fd);

  if (result < 0)
    fprintf (stderr, "Failed to unlock '%s': %s\n",
             argv[1], strerror (errno));

  close (fd);

  return 0;
}


Attachments:
lockit.cc (1.30 kB)

2002-10-22 14:52:47

by Eff Norwood

Subject: RE: NFS Locking Issue - Solaris-Linux

Hi Bill,

There are a bunch of other variables I'd like to know about.

-Are you using automounter
-Are you using NIS (if yes what kind of machine is the master)
-What are your mount options
-What's the version of RedHat that works
-Solaris Sparc or Solaris Intel

Finally, will the Redhat box as a client lock the Raidzone or is it just the
Solaris machines?

Eff Norwood




-------------------------------------------------------
This sf.net email is sponsored by: Influence the future
of Java(TM) technology. Join the Java Community
Process(SM) (JCP(SM)) program now.
http://ad.doubleclick.net/clk;4699841;7576301;v?http://www.sun.com/javavote
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2002-10-22 15:23:10

by Bill Schrier

Subject: Re: NFS Locking Issue - Solaris-Linux

> -Are you using automounter

Yes

> -Are you using NIS (if yes what kind of machine is the master)

No, we are using NIS+ though. The master and replica servers are both Sparc
Solaris - the master is Solaris 2.6, the replica is Solaris 8.

> -What are your mount options

The corresponding line from our auto_direct table follows:
/zone -actimeo=0,vers=2 raidzone:/home

The /etc/exports file on the raidzone box looks as follows:
/home 192.9.200.0/255.255.255.0(rw,no_root_squash,insecure,no_auth_nlm) 10.5.0.0/255.255.255.0(rw,insecure,no_auth_nlm) 10.3.0.0/255.255.255.0(rw,insecure,no_auth_nlm)


> -What's the version of RedHat that works

I don't think I follow you here... Do you mean which Redhat version on the
Raidzone box? If so, I believe that Raidzone would have been testing with
their most recent kernel version - which is what we are also running. If you
mean as a client, I believe any version of RedHat is able to successfully lock
files.

> -Solaris Sparc or Solaris Intel

This is all on Solaris Sparc.

> Finally, will the Redhat box as a client lock the Raidzone or is it just the
> Solaris machines?

Yes - files shared from the Raidzone box can be locked from HP-UX
machines and from other Linux boxes (both Redhat and Debian) on the network.
Only the Solaris machines are unable to lock files.

Thanks for the help! If I can provide any further information, please don't
hesitate to ask.

Bill

--
William J. Schrier Phone: 412.968.5780 x151
Neolinear, Inc. Fax: 412.968.5788
583 Epsilon Drive Email: [email protected]
Pittsburgh, PA 15238






2002-10-22 16:02:13

by Daniel Forrest

Subject: Re: NFS Locking Issue - Solaris-Linux

Bill,

>> We've been having a bit of trouble finding a solution to a problem
>> we've been having between our Solaris machines and our Raidzone
>> machine running Redhat (kernel 2.4.18-12smp). I would appreciate
>> any input on this subject as it is basically keeping us from
>> effectively using the storage space we have in the Raidzone box.
>>
>> The problem arises when we try to lock any file shared from the
>> Redhat machine from any of our Solaris machines. This happens
>> regardless of the Solaris kernel version - 2.6, 8, and multiple
>> kernel patch levels within those OS versions. However, with a
>> clean install of Redhat, we are able to successfully lock shared
>> files - it is just this Raidzone machine.

What is the mode of failure? Does the lock request just hang? I may
have seen this same problem (it was on a Raidzone under 2.4.18-10smp)
and can offer some insight.
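One way to tell a hang apart from an outright failure is to put a timer around the lock call. The following is only a sketch: the helper name try_lock_with_timeout is made up for illustration, and it assumes the mount lets a signal interrupt a stuck lock request (on a hard mount without "intr" the process may simply stay blocked).

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Empty handler: its only job is to interrupt a blocked fcntl() call. */
static void on_alarm(int signo) { (void)signo; }

/*
 * Hypothetical helper: request a write lock on the whole file with
 * F_SETLK, giving up after `seconds`.
 * Returns 0 if the lock was granted, 1 if the call was still blocked when
 * the timer fired (the "hang" case), and -1 on any other error (errno set).
 */
static int try_lock_with_timeout(int fd, unsigned seconds)
{
    struct sigaction sa, old;
    struct flock lock;
    int rc, saved;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;            /* no SA_RESTART, so fcntl() can fail with EINTR */
    if (sigaction(SIGALRM, &sa, &old) < 0)
        return -1;

    memset(&lock, 0, sizeof lock);
    lock.l_type = F_WRLCK;
    lock.l_whence = SEEK_SET;
    lock.l_start = 0;
    lock.l_len = 0;             /* 0 means "through end of file" */

    alarm(seconds);
    rc = fcntl(fd, F_SETLK, &lock);
    saved = errno;
    alarm(0);                   /* cancel the timer if it has not fired */
    sigaction(SIGALRM, &old, NULL);

    if (rc == 0)
        return 0;
    errno = saved;
    return (saved == EINTR) ? 1 : -1;
}
```

A return of 1 would point at a request that never got an answer, while -1 with EACCES or EAGAIN would just mean some other process holds the lock.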

Is the Raidzone offering lock service over TCP? Run "rpcinfo -p" on
the Raidzone to find out.
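If you want to script that check, here is a minimal sketch that picks the nlockmgr TCP registration out of "rpcinfo -p" output. It assumes the usual column order (program, version, protocol, port, service); the function name nlm_tcp_port is made up for illustration. 100021 is the RPC program number for nlockmgr.

```c
#include <stdio.h>
#include <string.h>

#define NLM_PROGRAM 100021UL    /* RPC program number of nlockmgr */

/*
 * Parse one line of "rpcinfo -p" output.  Returns the port number if the
 * line registers nlockmgr over TCP, and 0 for any other line, including
 * the column header.
 */
static unsigned nlm_tcp_port(const char *line)
{
    unsigned long prog, vers;
    unsigned port;
    char proto[16];

    if (sscanf(line, "%lu %lu %15s %u", &prog, &vers, proto, &port) != 4)
        return 0;
    if (prog != NLM_PROGRAM || strcmp(proto, "tcp") != 0)
        return 0;
    return port;
}
```

Feeding it each line read from popen("rpcinfo -p", "r") and keeping the first non-zero result would give the port to watch in the next step.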

If it is, note the port number "nlockmgr" is using. Now, while one of
the Solaris machines is trying to lock a file, run "netstat --ip" on
the Raidzone and look for that port number. Are bytes accumulating in
the "Recv-Q" for that port? When I saw this problem I would see 192
bytes (the size of a lock request) show up every 30 seconds.
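The same receive-queue figure can be read without netstat from /proc/net/tcp. The sketch below assumes the usual layout of that file (hex fields, with tx_queue:rx_queue after the state column) and covers IPv4 sockets only; the function name recvq_for_port is made up for illustration.

```c
#include <stdio.h>

/*
 * Parse one line of /proc/net/tcp.  If the socket's local port equals
 * `port`, store the receive-queue byte count in *rxq and return 1.
 * Returns 0 for the header line, malformed lines, and other ports.
 */
static int recvq_for_port(const char *line, unsigned port, unsigned *rxq)
{
    unsigned local_port, rx;

    /* Fields: sl local_addr:port remote_addr:port state tx_queue:rx_queue ...
       All but the slot number are hexadecimal. */
    if (sscanf(line, " %*u: %*8[0-9A-Fa-f]:%x %*8[0-9A-Fa-f]:%*x %*x %*x:%x",
               &local_port, &rx) != 2)
        return 0;
    if (local_port != port)
        return 0;
    *rxq = rx;
    return 1;
}
```

Polling this for the nlockmgr port while a Solaris client retries should show the count stepping up if requests are piling up unread.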

If this is the case, now do an "echo 256 > /proc/sys/sunrpc/rpc_debug"
(0x0100 hex which is RPCDBG_SVCSOCK) on the Raidzone and wait for the
next lock request to arrive. If it is the same problem I saw, the
"Recv-Q" will empty at this time and the lock request will succeed.

At this point, even if "/proc/sys/sunrpc/rpc_debug" is restored to 0,
all lock requests from the Solaris machine will succeed as long as the
"nlockmgr" connection remains "ESTABLISHED". If the connection is
dropped (occurs after 5 minutes of inactivity) and has to be remade
the same problem occurs.
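For anyone who wants to flip the debug bit from a test program rather than a shell, here is a minimal sketch. The helper name set_rpc_debug is made up for illustration; the value 0x0100 is the RPCDBG_SVCSOCK figure given above, and writing to the real file needs root.

```c
#include <stdio.h>

#define RPCDBG_SVCSOCK 0x0100u  /* 256, the value given in the message above */

/*
 * Write `mask` in decimal to the sysctl file at `path`.
 * Returns 0 on success, -1 if the file cannot be opened or written.
 */
static int set_rpc_debug(const char *path, unsigned mask)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (f == NULL)
        return -1;
    ok = fprintf(f, "%u\n", mask) > 0;
    if (fclose(f) != 0)         /* a write error may only surface on close */
        ok = 0;
    return ok ? 0 : -1;
}
```

Calling set_rpc_debug("/proc/sys/sunrpc/rpc_debug", RPCDBG_SVCSOCK), waiting for the next lock request, and then calling it again with 0 would reproduce the sequence described above.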

There is obviously some sort of timing related bug going on here, but
I have no idea what it is. I discovered the "rpc_debug" trick while
trying to diagnose the problem. The first time I set all of the bits
in "rpc_debug" things suddenly started to work. I then tried setting
single bits until I found that "RPCDBG_SVCSOCK" did the trick.

Obviously, leaving the debug turned on is no solution because the
messages file will grow out of control and performance will suffer.

My solution was to downgrade the kernel to 2.4.16-10smp (which does
not offer "nlockmgr" over TCP) and everything works fine. I don't
know if it is possible to disable "nlockmgr" over TCP on 2.4.18-10smp,
I think it is a compile time option. I suppose you could edit the
output from pmap_dump and then use pmap_set to unregister it.

I would be interested to know if these are the same symptoms you are
seeing and if the "rpc_debug" trick works for you. I would also be
interested in knowing if Raidzone finds a fix for this. Once I got
things working again under 2.4.16-10smp I never found the time to
pursue it any further.

--
+----------------------------------+----------------------------------+
| Daniel K. Forrest | Laboratory for Molecular and |
| [email protected] | Computational Genomics |
| (608)262-9479 | University of Wisconsin, Madison |
+----------------------------------+----------------------------------+



2002-10-23 18:54:23

by Bill Schrier

Subject: Re: NFS Locking Issue - Solaris-Linux

Daniel,

Thanks for the input - it is dead on. We tested your work around, and it
worked exactly as you described.

The one thing that confuses me is that Raidzone said that they were unable
to reproduce the error - but you said you thought that disabling nlockmgr
over TCP was a compile time option. If this is the case, then I would
assume that Raidzone would also be running with nlockmgr over TCP - since
I think we are running the same kernel that they are.

Does anyone know if this is indeed correct? Is there a way to disable
nlockmgr over TCP without a kernel recompile?

We're not too interested in downrevving our kernel on this machine. Since
a stock Redhat install appears to come with nlockmgr disabled on TCP by
default, it seems that it got turned on somewhere along the way with the
Raidzone kernel distributions, and really needs to be turned back off -
especially in an environment with Solaris clients.

Again, any further information is greatly appreciated.

Thanks!

Bill








2002-10-23 22:30:30

by Scott McDermott

Subject: Re: NFS Locking Issue - Solaris-Linux

Bill Schrier on Tue 22/10 11:21 -0400:
> The corresponding line from our auto_direct table follows:
> /zone -actimeo=0,vers=2 raidzone:/home

I thought vers=2 did not support the NLM (so you won't get locking). I
seem to remember eagerly awaiting the v3 code and patching my machines
so I could get file locks. That was a while ago...

Why are you using vers=2 btw? I think you mentioned Solaris 2.6 as the
earliest version; that mounts v3 I think.



2002-10-25 17:53:55

by Daniel Forrest

Subject: Re: NFS Locking Issue - Solaris-Linux

Bill,

>> Thanks for the input - it is dead on. We tested your work around,
>> and it worked exactly as you described.

I'm actually glad that someone else saw this same bug.

>> The one thing that confuses me is that Raidzone said that they were
>> unable to reproduce the error - but you said you thought that
>> disabling nlockmgr over TCP was a compile time option. If this is
>> the case, then I would assume that Raidzone would also be running
>> with nlockmgr over TCP - since I think we are running the same
>> kernel that they are.

After I downgraded to 2.4.16-10, I upgraded another (non-Raidzone)
machine to 2.4.18 and figured I could continue my testing. Bzzzzt.
Everything worked fine with NLM over TCP. I am suspecting that not
only is it a timing related bug, it is probably a hardware specific
bug, and maybe even limited to only a subset of the Raidzone boxes if
they can't reproduce the error themselves. Otherwise they should have
complaints from lots of Sun users, right?

>> Does anyone know if this is indeed correct? Is there a way to
>> disable nlockmgr over TCP without a kernel recompile?

I would be interested in this too in case I run into it again.

>> We're not too interested in downrevving our kernel on this machine.
>> Since a stock Redhat install appears to come with nlockmgr disabled
>> on TCP by default, it seems that it got turned on somewhere along
>> the way with the Raidzone kernel distributions, and really needs to
>> be turned back off - especially in an environment with Solaris
>> clients.

If you turn on NFS over TCP you get NLM over TCP. I assume they are
turning on NFS over TCP for performance reasons.




2002-10-28 14:27:43

by Bill Schrier

Subject: Re: NFS Locking Issue - Solaris-Linux

> >> Thanks for the input - it is dead on. We tested your work around,
> >> and it worked exactly as you described.
>
> I'm actually glad that someone else saw this same bug.
>
> >> The one thing that confuses me is that Raidzone said that they were
> >> unable to reproduce the error - but you said you thought that
> >> disabling nlockmgr over TCP was a compile time option. If this is
> >> the case, then I would assume that Raidzone would also be running
> >> with nlockmgr over TCP - since I think we are running the same
> >> kernel that they are.
>
> After I downgraded to 2.4.16-10, I upgraded another (non-Raidzone)
> machine to 2.4.18 and figured I could continue my testing. Bzzzzt.
> Everything worked fine with NLM over TCP. I am suspecting that not
> only is it a timing related bug, it is probably a hardware specific
> bug, and maybe even limited to only a subset of the Raidzone boxes if
> they can't reproduce the error themselves. Otherwise they should have
> complaints from lots of Sun users, right?

We ended up downgrading to 2.4.16 as well (we went to -7 since that
was what we had on hand). We had attempted to clean out the NLM stuff
on 2.4.18-12, but that ended up messing up our clients pretty badly, so
we just downgraded. Running the 2.4.16 kernel also fixed the NLM problem
for us. Honestly, I was surprised when they told us that they couldn't
reproduce the problem. They sent us output showing that they were running
NLM over TCP, but they said they weren't seeing the issue, so I would
agree that it might be limited to a certain subset of the Raidzone boxes
(lucky us, huh?).

> >> Does anyone know if this is indeed correct? Is there a way to
> >> disable nlockmgr over TCP without a kernel recompile?
>
> I would be interested in this too in case I run into it again.
>
> >> We're not too interested in downrevving our kernel on this machine.
> >> Since a stock Redhat install appears to come with nlockmgr disabled
> >> on TCP by default, it seems that it got turned on somewhere along
> >> the way with the Raidzone kernel distributions, and really needs to
> >> be turned back off - especially in an environment with Solaris
> >> clients.
>
> If you turn on NFS over TCP you get NLM over TCP. I assume they are
> turning on NFS over TCP for performance reasons.

Daniel,

Again, thanks for the info on the workaround. We greatly appreciate the
help!

Bill





