2007-10-11 08:56:43

by matteo.debiaggi

[permalink] [raw]
Subject: Lock problem.

Hi at all,
Situation :

I have a linux client :
debian, kernel 2.6.16
64bit
nfs-common_1%3a1.0.6-3.1_amd64.deb

rpcinfo -p localhosts stats(about nfs..)
100000 2 tcp 111 portmapper
100000 2 udp 111 portmapper
100021 1 udp 34865 nlockmgr
100021 3 udp 34865 nlockmgr
100021 4 udp 34865 nlockmgr
100021 1 tcp 40293 nlockmgr
100021 3 tcp 40293 nlockmgr
100021 4 tcp 40293 nlockmgr

And a soalris 8 with all possible patches( or better, all patches i
found..) relative to nfs.
32bit

rpcinfo -p localhost stats(about nfs..)
program vers proto port service
100011 1 udp 32773 rquotad
100001 2 udp 32777 rstatd
100001 3 udp 32777 rstatd
100001 4 udp 32777 rstatd
100021 1 udp 4045 nlockmgr
100021 2 udp 4045 nlockmgr
100021 3 udp 4045 nlockmgr
100021 4 udp 4045 nlockmgr
100021 1 tcp 4045 nlockmgr
100021 2 tcp 4045 nlockmgr
100021 3 tcp 4045 nlockmgr
100021 4 tcp 4045 nlockmgr
100005 1 udp 36355 mountd
100005 2 udp 36355 mountd
100005 3 udp 36355 mountd
100005 1 tcp 49762 mountd
100005 2 tcp 49762 mountd
100005 3 tcp 49762 mountd
100003 2 udp 2049 nfs
100003 3 udp 2049 nfs
100227 2 udp 2049 nfs_acl
100227 3 udp 2049 nfs_acl
100003 2 tcp 2049 nfs
100003 3 tcp 2049 nfs
100227 2 tcp 2049 nfs_acl
100227 3 tcp 2049 nfs_acl

Client wants work with files located in a dir shared with parameters:
share -F nfs -o rw,root=nb,anon=505 /dir
Client mounts dir in this way :
....type nfs (rw,noexec,nosuid,nodev,sync,actimeo=0,noac,addr=....,user=...)

In the shared dir there are concurrent processes writing their own
content on the same file.
They make files as root but with ugo 666.

In the client side another process has a loop looking for mentioned
file. Whenever it finds file, it tryes to read it.

Problem :
Some times, not every times ,we face a strange lock where reader waits
more or less for the same time( about 30sec,uhm..) lock gets acquired,
at the end it did it.

Here's simplyfied code:

WRITER:

for (;;) {
fd = open(file, O_WRONLY|O_APPEND|O_CREAT, 0666);
lockf(fd, F_LOCK, 0);
write(fd, line, LINE_SIZE);
lockf(fd, F_ULOCK, 0);
close(fd);

usleep(WR_PAUSE);
}


READER:

for (;;) {
while(access(file, F_OK))
usleep(RD_PAUSE);

fd = open(file, O_RDWR);
lockf(fd, F_LOCK, 0);
unlink(file);
lockf(fd, F_ULOCK, 0);
lockf(fd, F_LOCK, 0); /* incriminated lock !!! */

while (read(fd, line, LINE_SIZE) == LINE_SIZE)
++line_numb;

close(fd);
}


Any help would be appreciated.
Thanks in advance.
Matteo.







-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems? Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >> http://get.splunk.com/
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs


2007-10-11 10:09:02

by NeilBrown

[permalink] [raw]
Subject: Re: Lock problem.

On Thursday October 11, [email protected] wrote:
> Hi at all,
> Situation :
>
>
> Problem :
> Some times, not every times ,we face a strange lock where reader waits
> more or less for the same time( about 30sec,uhm..) lock gets acquired,
> at the end it did it.

Weird... you would need a 'tcpdump -s0 ' trace of the lockd requests,
and maybe the nfs requests too. My guess would be that the solaris
server isn't sending a 'GRANT' for some reason, and the Linux client
is timing out and retrying.

>
> Here's simplyfied code:
>
> WRITER:
>
> for (;;) {
> fd = open(file, O_WRONLY|O_APPEND|O_CREAT, 0666);

Suppose the reader wakes up here, opens the file, and gets the lock.

> lockf(fd, F_LOCK, 0);

This then might not until the READER has read all of the file and
closed it. So you could loose a line. Best to check the link-count
on the file at this point, and retry if it is 0.

> write(fd, line, LINE_SIZE);
> lockf(fd, F_ULOCK, 0);
> close(fd);
>
> usleep(WR_PAUSE);
> }
>
>
> READER:
>
> for (;;) {
> while(access(file, F_OK))
> usleep(RD_PAUSE);
>
> fd = open(file, O_RDWR);
> lockf(fd, F_LOCK, 0);
> unlink(file);
> lockf(fd, F_ULOCK, 0);
> lockf(fd, F_LOCK, 0); /* incriminated lock !!! */

Why unlock and relock? To try and avoid the race mentioned above? I
doubt that would be reliable.

NeilBrown


>
> while (read(fd, line, LINE_SIZE) == LINE_SIZE)
> ++line_numb;
>
> close(fd);
> }
>
>
> Any help would be appreciated.
> Thanks in advance.
> Matteo.
>

-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems? Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >> http://get.splunk.com/
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2007-10-11 13:07:41

by matteo.debiaggi

[permalink] [raw]
Subject: Re: Lock problem.

Well, the result of tcpdump of the packets seen in the uncorrect lock
scenario

14:48:56.154531 CLIENT.3542368298 > SERVER.nfs: 108 getattr [|nfs] (DF)
14:48:56.154984 SERVER.nfs > CLIENT.3542368298: reply ok 112 getattr
[|nfs] (DF)
14:48:56.155158 CLIENT.3559145514 > SERVER.nfs: 112 access [|nfs] (DF)
14:48:56.155307 SERVER.nfs > CLIENT.3559145514: reply ok 120 access
[|nfs] (DF)
// FIRST LOCK
14:48:56.155423 CLIENT.1022 > SERVER.lockd: udp 180 (DF)
14:48:56.155713 SERVER.lockd > CLIENT.1022: udp 36 (DF)
14:48:56.163972 SERVER.823 > CLIENT.34865: udp 164 (DF)
14:48:56.164156 CLIENT > SERVER: icmp: CLIENT udp port 34865 unreachable
[tos 0xc0]
14:48:57.096559 arp who-has SERVER tell CLIENT
14:48:57.096618 arp reply SERVER is-at xxxxxxxxxxx
14:49:16.156340 SERVER.756 > CLIENT.sunrpc: udp 84 (DF)
14:49:16.156598 CLIENT.sunrpc > SERVER.756: udp 28 (DF)
14:49:16.156872 SERVER.823 > CLIENT.34865: udp 164 (DF)
14:49:16.157078 CLIENT > SERVER: icmp: CLIENT udp port 34865 unreachable
[tos 0xc0]
14:49:21.157083 arp who-has SERVER tell CLIENT
14:49:21.157136 arp reply SERVER is-at xxxxxxxxxx
14:49:26.157197 CLIENT.1022 > SERVER.lockd: udp 180 (DF)
14:49:26.157637 SERVER.lockd > CLIENT.1022: udp 36 (DF)
14:49:26.157830 CLIENT.3575922730 > SERVER.nfs: 112 access [|nfs] (DF)
14:49:26.158025 SERVER.nfs > CLIENT.3575922730: reply ok 120 access
[|nfs] (DF)
14:49:26.158157 CLIENT.3592699946 > SERVER.nfs: 108 getattr [|nfs] (DF)
14:49:26.158220 SERVER.nfs > CLIENT.3592699946: reply ok 112 getattr
[|nfs] (DF)
14:49:26.158402 CLIENT.3609477162 > SERVER.nfs: 112 access [|nfs] (DF)
14:49:26.158465 SERVER.nfs > CLIENT.3609477162: reply ok 120 access
[|nfs] (DF)
14:49:26.158653 CLIENT.3626254378 > SERVER.nfs: 124 lookup [|nfs] (DF)
14:49:26.158717 SERVER.nfs > CLIENT.3626254378: reply ok 240 lookup
[|nfs] (DF)
14:49:26.158904 CLIENT.3643031594 > SERVER.nfs: 124 lookup [|nfs] (DF)
14:49:26.158960 SERVER.nfs > CLIENT.3643031594: reply ok 240 lookup
[|nfs] (DF)
14:49:26.159156 CLIENT.3659808810 > SERVER.nfs: 124 remove [|nfs] (DF)
14:49:26.165118 SERVER.nfs > CLIENT.3659808810: reply ok 144 remove
[|nfs] (DF)
/////SECOND LOCK
14:49:26.165300 CLIENT.1022 > SERVER.lockd: udp 164 (DF)
14:49:26.165511 SERVER.lockd > CLIENT.1022: udp 36 (DF)
14:49:26.165661 CLIENT.1022 > SERVER.lockd: udp 180 (DF)
14:49:26.165867 SERVER.lockd > CLIENT.1022: udp 36 (DF)
14:49:26.167697 SERVER.905 > CLIENT.34667: udp 164 (DF)
14:49:26.167898 CLIENT > SERVER: icmp: CLIENT udp port 34667 unreachable
[tos 0xc0]
14:49:36.153446 SERVER.755 > CLIENT.sunrpc: udp 84 (DF)
14:49:36.153724 CLIENT.sunrpc > SERVER.755: udp 28 (DF)
14:49:46.162473 SERVER.754 > CLIENT.sunrpc: udp 84 (DF)
14:49:46.173694 CLIENT.sunrpc > SERVER.754: udp 28 (DF)
14:49:46.173900 SERVER.823 > CLIENT.34865: udp 164 (DF)
14:49:46.174070 CLIENT > SERVER: icmp: CLIENT udp port 34865 unreachable
[tos 0xc0]
14:49:51.173721 arp who-has SERVER tell CLIENT
14:49:51.173745 arp reply SERVER is-at 0:1:2:9c:74:c1
14:49:56.165858 CLIENT.34867 > SERVER.sunrpc: udp 88 (DF)
14:49:56.166362 SERVER.sunrpc > CLIENT.34867: udp 28 (DF)
14:49:56.166547 CLIENT.1022 > SERVER.lockd: udp 180 (DF)
14:49:56.166922 SERVER.lockd > CLIENT.1022: udp 36 (DF)
14:49:56.167068 CLIENT.3676586026 > SERVER.nfs: 108 getattr [|nfs] (DF)
14:49:56.167178 SERVER.nfs > CLIENT.3676586026: reply ok 112 getattr
[|nfs] (DF)
14:49:56.167288 CLIENT.3693363242 > SERVER.nfs: 120 read [|nfs] (DF)
14:49:56.167396 SERVER.nfs > CLIENT.3693363242: reply ok 1036 read
[|nfs] (DF)
14:49:56.167703 CLIENT.1022 > SERVER.lockd: udp 164 (DF)
14:49:56.167779 SERVER.lockd > CLIENT.1022: udp 36 (DF)
14:49:56.167949 CLIENT.3710140458 > SERVER.nfs: 112 access [|nfs] (DF)
14:49:56.168016 SERVER.nfs > CLIENT.3710140458: reply ok 120 access
[|nfs] (DF)



I've forgot to mention that some times i've got problem in the first
reader lock too(and it seems the same behaviour).
Do you need a tcpdump more detailed?(launched with wich options?)
Matteo.

P.S. : Yes, unlock followed by lock helps to avoid line reading
mismatches and havy load tests tell me it is reliable.

-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems? Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >> http://get.splunk.com/
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2007-10-11 13:51:50

by Talpey, Thomas

[permalink] [raw]
Subject: Re: Lock problem.

Do you have a firewall enabled on your client? The problem seems
to be the port unreachable below:

At 09:07 AM 10/11/2007, matteo.debiaggi wrote:
>Well, the result of tcpdump of the packets seen in the uncorrect lock
>scenario
>
>14:48:56.154531 CLIENT.3542368298 > SERVER.nfs: 108 getattr [|nfs] (DF)
>14:48:56.154984 SERVER.nfs > CLIENT.3542368298: reply ok 112 getattr
>[|nfs] (DF)
>14:48:56.155158 CLIENT.3559145514 > SERVER.nfs: 112 access [|nfs] (DF)
>14:48:56.155307 SERVER.nfs > CLIENT.3559145514: reply ok 120 access
>[|nfs] (DF)
>// FIRST LOCK
>14:48:56.155423 CLIENT.1022 > SERVER.lockd: udp 180 (DF)
>14:48:56.155713 SERVER.lockd > CLIENT.1022: udp 36 (DF)
>14:48:56.163972 SERVER.823 > CLIENT.34865: udp 164 (DF)

--> here

>14:48:56.164156 CLIENT > SERVER: icmp: CLIENT udp port 34865 unreachable
>[tos 0xc0]

I wish there were some nlm decoding of the lockd packets, but it appears
the server attempted an NLM_GRANTED callback and the client refused
it. Your rpcinfo output indicates this was the right port number, so my best
guess is a firewall rejected the packet.

Tom.


>14:48:57.096559 arp who-has SERVER tell CLIENT
>14:48:57.096618 arp reply SERVER is-at xxxxxxxxxxx
>14:49:16.156340 SERVER.756 > CLIENT.sunrpc: udp 84 (DF)
>14:49:16.156598 CLIENT.sunrpc > SERVER.756: udp 28 (DF)
>14:49:16.156872 SERVER.823 > CLIENT.34865: udp 164 (DF)
>14:49:16.157078 CLIENT > SERVER: icmp: CLIENT udp port 34865 unreachable
>[tos 0xc0]
>14:49:21.157083 arp who-has SERVER tell CLIENT
>14:49:21.157136 arp reply SERVER is-at xxxxxxxxxx
>14:49:26.157197 CLIENT.1022 > SERVER.lockd: udp 180 (DF)
>14:49:26.157637 SERVER.lockd > CLIENT.1022: udp 36 (DF)
>14:49:26.157830 CLIENT.3575922730 > SERVER.nfs: 112 access [|nfs] (DF)
>14:49:26.158025 SERVER.nfs > CLIENT.3575922730: reply ok 120 access
>[|nfs] (DF)
>14:49:26.158157 CLIENT.3592699946 > SERVER.nfs: 108 getattr [|nfs] (DF)
>14:49:26.158220 SERVER.nfs > CLIENT.3592699946: reply ok 112 getattr
>[|nfs] (DF)
>14:49:26.158402 CLIENT.3609477162 > SERVER.nfs: 112 access [|nfs] (DF)
>14:49:26.158465 SERVER.nfs > CLIENT.3609477162: reply ok 120 access
>[|nfs] (DF)
>14:49:26.158653 CLIENT.3626254378 > SERVER.nfs: 124 lookup [|nfs] (DF)
>14:49:26.158717 SERVER.nfs > CLIENT.3626254378: reply ok 240 lookup
>[|nfs] (DF)
>14:49:26.158904 CLIENT.3643031594 > SERVER.nfs: 124 lookup [|nfs] (DF)
>14:49:26.158960 SERVER.nfs > CLIENT.3643031594: reply ok 240 lookup
>[|nfs] (DF)
>14:49:26.159156 CLIENT.3659808810 > SERVER.nfs: 124 remove [|nfs] (DF)
>14:49:26.165118 SERVER.nfs > CLIENT.3659808810: reply ok 144 remove
>[|nfs] (DF)
>/////SECOND LOCK
>14:49:26.165300 CLIENT.1022 > SERVER.lockd: udp 164 (DF)
>14:49:26.165511 SERVER.lockd > CLIENT.1022: udp 36 (DF)
>14:49:26.165661 CLIENT.1022 > SERVER.lockd: udp 180 (DF)
>14:49:26.165867 SERVER.lockd > CLIENT.1022: udp 36 (DF)
>14:49:26.167697 SERVER.905 > CLIENT.34667: udp 164 (DF)
>14:49:26.167898 CLIENT > SERVER: icmp: CLIENT udp port 34667 unreachable
>[tos 0xc0]
>14:49:36.153446 SERVER.755 > CLIENT.sunrpc: udp 84 (DF)
>14:49:36.153724 CLIENT.sunrpc > SERVER.755: udp 28 (DF)
>14:49:46.162473 SERVER.754 > CLIENT.sunrpc: udp 84 (DF)
>14:49:46.173694 CLIENT.sunrpc > SERVER.754: udp 28 (DF)
>14:49:46.173900 SERVER.823 > CLIENT.34865: udp 164 (DF)
>14:49:46.174070 CLIENT > SERVER: icmp: CLIENT udp port 34865 unreachable
>[tos 0xc0]
>14:49:51.173721 arp who-has SERVER tell CLIENT
>14:49:51.173745 arp reply SERVER is-at 0:1:2:9c:74:c1
>14:49:56.165858 CLIENT.34867 > SERVER.sunrpc: udp 88 (DF)
>14:49:56.166362 SERVER.sunrpc > CLIENT.34867: udp 28 (DF)
>14:49:56.166547 CLIENT.1022 > SERVER.lockd: udp 180 (DF)
>14:49:56.166922 SERVER.lockd > CLIENT.1022: udp 36 (DF)
>14:49:56.167068 CLIENT.3676586026 > SERVER.nfs: 108 getattr [|nfs] (DF)
>14:49:56.167178 SERVER.nfs > CLIENT.3676586026: reply ok 112 getattr
>[|nfs] (DF)
>14:49:56.167288 CLIENT.3693363242 > SERVER.nfs: 120 read [|nfs] (DF)
>14:49:56.167396 SERVER.nfs > CLIENT.3693363242: reply ok 1036 read
>[|nfs] (DF)
>14:49:56.167703 CLIENT.1022 > SERVER.lockd: udp 164 (DF)
>14:49:56.167779 SERVER.lockd > CLIENT.1022: udp 36 (DF)
>14:49:56.167949 CLIENT.3710140458 > SERVER.nfs: 112 access [|nfs] (DF)
>14:49:56.168016 SERVER.nfs > CLIENT.3710140458: reply ok 120 access
>[|nfs] (DF)
>
>
>
>I've forgot to mention that some times i've got problem in the first
>reader lock too(and it seems the same behaviour).
>Do you need a tcpdump more detailed?(launched with wich options?)
>Matteo.
>
>P.S. : Yes, unlock followed by lock helps to avoid line reading
>mismatches and havy load tests tell me it is reliable.
>
>-------------------------------------------------------------------------
>This SF.net email is sponsored by: Splunk Inc.
>Still grepping through log files to find problems? Stop.
>Now Search log events and configuration files using AJAX and a browser.
>Download your FREE copy of Splunk now >> http://get.splunk.com/
>_______________________________________________
>NFS maillist - [email protected]
>https://lists.sourceforge.net/lists/listinfo/nfs

-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems? Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >> http://get.splunk.com/
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2007-10-11 14:24:56

by matteo.debiaggi

[permalink] [raw]
Subject: Re: Lock problem.

The 2 machines are in a local net without any kind of firewalls ..
But the problem "seems" to be solved :
I think the problem is in lockd daemon.
Explained 30 sec waiting : 2 locks request with default timeout 15 sec..
And then why lock not gained on server? Maybe due to max concurrent
threads asking fo lock to lockd daemon where the default is 20!My test
implyies 2000 lines writing on file with a pause of 10000 us between
write attempt and reading every 100ms.So 20 lock requests queued are a
possible context.
I restarted lockd daemon with 100 as maximum contemporary lock
requests from threads and test goes fine.
Thanks to everybody.
I hope not to face this problem again after this modify.
Matteo.

p.s. i can't get nlm decoding from tcpdump..

-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems? Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >> http://get.splunk.com/
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2007-10-11 14:38:40

by Talpey, Thomas

[permalink] [raw]
Subject: Re: Lock problem.

At 10:24 AM 10/11/2007, matteo.debiaggi wrote:
>The 2 machines are in a local net without any kind of firewalls ..
>But the problem "seems" to be solved :
>I think the problem is in lockd daemon.
>Explained 30 sec waiting : 2 locks request with default timeout 15 sec..
>And then why lock not gained on server? Maybe due to max concurrent
>threads asking fo lock to lockd daemon where the default is 20!My test
>implyies 2000 lines writing on file with a pause of 10000 us between
>write attempt and reading every 100ms.So 20 lock requests queued are a
>possible context.
>I restarted lockd daemon with 100 as maximum contemporary lock
>requests from threads and test goes fine.

Is this some kind of Solaris server option? Yuck.

Anyway, I still think you need to understand why you got the ICMP port
unreachable from the client. It will come back, I predict...

Tom.

>Thanks to everybody.
>I hope not to face this problem again after this modify.
>Matteo.
>
>p.s. i can't get nlm decoding from tcpdump..
>
>-------------------------------------------------------------------------
>This SF.net email is sponsored by: Splunk Inc.
>Still grepping through log files to find problems? Stop.
>Now Search log events and configuration files using AJAX and a browser.
>Download your FREE copy of Splunk now >> http://get.splunk.com/
>_______________________________________________
>NFS maillist - [email protected]
>https://lists.sourceforge.net/lists/listinfo/nfs

-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems? Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >> http://get.splunk.com/
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2007-10-11 15:23:35

by matteo.debiaggi

[permalink] [raw]
Subject: Re: Lock problem.

I found the parameters 'threads' only in the lockd solaris man.
And i think i'll perform other tests going deeply to be sure...


-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems? Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >> http://get.splunk.com/
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs