2006-05-06 04:56:18

by Marc Eshel

Subject: lockd problem

Hi Trond,
I see a problem testing lockd with 2.6.16-CITI_NFS4_ALL-2. I see that you
made a few changes in lockd, so let me describe it and see if you can recall
any changes that might cause the following problem.

I think that the source of the problem is that 2 clients from 2 different
machines end up with the same fl_pid on the lockd server.

What I see is that client 1 gets a lock, and client 2 requests the same lock
and is blocked. Client 1 unlocks, but in the process of unlocking it finds
the block queued for client 2 and deletes it. By the time
nlmsvc_notify_blocked() is called, the block is gone, so there is no grant
call to client 2. Client 2 only retries after 30 seconds, and gets the lock
if it is still free by then.

May 5 21:01:55 fin20 kernel: lockd: UNLOCK called
May 5 21:01:55 fin20 kernel: lockd: nlm_lookup_host(090148e3, p=6, v=4)
May 5 21:01:55 fin20 kernel: lockd: get host 9.1.72.227
May 5 21:01:55 fin20 kernel: lockd: nlm_file_lookup (02010001 00000000 00013eec 15b2e9e4 00013eea 68b34ef5 00000000 00000000)
May 5 21:01:55 fin20 kernel: lockd: found file f58ea880 (count 1)
May 5 21:01:55 fin20 kernel: lockd: nlmsvc_unlock(sda2/81644, pi=6, 10-29)
May 5 21:01:55 fin20 kernel: lockd: nlmsvc_cancel(sda2/81644, pi=6, 10-29)
May 5 21:01:55 fin20 kernel: lockd: nlmsvc_lookup_block f=f58ea880 pd=6 10-29 ty=2
May 5 21:01:55 fin20 kernel: lockd: check f=f58ea880 pd=6 10-29 ty=1 cookie=97120000
May 5 21:01:55 fin20 kernel: lockd: unlinking block f4d6ca00...
May 5 21:01:55 fin20 kernel: lockd: freeing block f4d6ca00...
May 5 21:01:56 fin20 kernel: lockd: release host 9.1.72.239
May 5 21:01:56 fin20 kernel: lockd: nlm_release_file(f58ea880, ct = 2)
May 5 21:01:56 fin20 kernel: lockd: UNLOCK status 0

Marc.


-------------------------------------------------------
Using Tomcat but need to do more? Need to support web services, security?
Get stuff done quickly with pre-integrated technology to make your job easier
Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs


2006-05-08 23:20:29

by Marc Eshel

Subject: Re: lockd problem

Trond Myklebust <[email protected]> wrote on 05/06/2006 09:38:26 PM:

> On Fri, 2006-05-05 at 22:01 -0700, Marc Eshel wrote:
> > Hi Trond,
> > I see a problem testing lockd with 2.6.16-CITI_NFS4_ALL-2. I see that
> > you made a few changes in lockd, so let me describe it and see if you can
> > recall any changes that might cause the following problem.
> >
> > I think that the source of the problem is that 2 clients from 2
> > different machines end up with the same fl_pid on the lockd server.
>
> It looks as if nlmsvc_lookup_block() fails to check the ip address of
> the request. That is not a new bug: it has been there for a few years...
>
> Cheers,
> Trond
>

The following patch seems to fix the problem. Do you see any reason not to
include fl_owner in the comparison?
Marc.

diff --git a/include/linux/lockd/lockd.h b/include/linux/lockd/lockd.h
index 588c02a..0d44ad9 100644
--- a/include/linux/lockd/lockd.h
+++ b/include/linux/lockd/lockd.h
@@ -232,6 +232,7 @@ static __inline__ int
nlm_compare_locks(const struct file_lock *fl1, const struct file_lock *fl2)
{
return fl1->fl_pid == fl2->fl_pid
+ && fl1->fl_owner == fl2->fl_owner
&& fl1->fl_start == fl2->fl_start
&& fl1->fl_end == fl2->fl_end
&&(fl1->fl_type == fl2->fl_type || fl2->fl_type == F_UNLCK);



2006-05-08 23:17:55

by Trond Myklebust

Subject: Re: lockd problem

On Sat, 2006-05-06 at 22:07 -0700, Marc Eshel wrote:
> Trond Myklebust <[email protected]> wrote on 05/06/2006 09:38:26 PM:
>
> > On Fri, 2006-05-05 at 22:01 -0700, Marc Eshel wrote:
> > > Hi Trond,
> > > I see a problem testing lockd with 2.6.16-CITI_NFS4_ALL-2. I see that
> > > you made a few changes in lockd, so let me describe it and see if you
> > > can recall any changes that might cause the following problem.
> > >
> > > I think that the source of the problem is that 2 clients from 2
> > > different machines end up with the same fl_pid on the lockd server.
> >
> > It looks as if nlmsvc_lookup_block() fails to check the ip address of
> > the request. That is not a new bug: it has been there for a few years...
> >
> > Cheers,
> > Trond
> >
>
> The following patch seems to fix the problem. Do you see any reason not to
> include fl_owner in the comparison?
> Marc.

Since fl_owner is just set to be a pointer to the struct nlm_host, comparing
it should indeed be fully equivalent to checking the IP address of the
client.

Can you resend this patch to me with a changelog description and a
signed-off-by line, please?

Cheers,
Trond


> diff --git a/include/linux/lockd/lockd.h b/include/linux/lockd/lockd.h
> index 588c02a..0d44ad9 100644
> --- a/include/linux/lockd/lockd.h
> +++ b/include/linux/lockd/lockd.h
> @@ -232,6 +232,7 @@ static __inline__ int
> nlm_compare_locks(const struct file_lock *fl1, const struct file_lock *fl2)
> {
> return fl1->fl_pid == fl2->fl_pid
> + && fl1->fl_owner == fl2->fl_owner
> && fl1->fl_start == fl2->fl_start
> && fl1->fl_end == fl2->fl_end
> &&(fl1->fl_type == fl2->fl_type || fl2->fl_type == F_UNLCK);




2006-05-08 17:19:37

by Trond Myklebust

Subject: Re: lockd problem

On Fri, 2006-05-05 at 22:01 -0700, Marc Eshel wrote:
> Hi Trond,
> I see a problem testing lockd with 2.6.16-CITI_NFS4_ALL-2. I see that you
> made a few changes in lockd, so let me describe it and see if you can recall
> any changes that might cause the following problem.
>
> I think that the source of the problem is that 2 clients from 2 different
> machines end up with the same fl_pid on the lockd server.

It looks as if nlmsvc_lookup_block() fails to check the ip address of
the request. That is not a new bug: it has been there for a few years...

Cheers,
Trond


