2009-05-15 14:48:23

by Sachin Prabhu

Subject: Virtual IPs and blocking locks

We have had a few reported cases of problems using blocking locks on NFS shares
mounted using virtual IPs. In these cases, the NFS server was using a floating
IP for clustering purposes.

Please consider the transaction below

NFS client: 10.33.8.75
NFS Server:
Primary IP : 10.33.8.71
Floating IP: 10.33.8.77

$ tshark -r block-virtual.pcap -R 'nlm'
19 2.487622 10.33.8.75 -> 10.33.8.77 NLM V4 LOCK Call FH:0x6176411a svid:4 pos:0-0
22 2.487760 10.33.8.77 -> 10.33.8.75 NLM V4 LOCK Reply (Call In 19) NLM_BLOCKED
33 2.489518 10.33.8.71 -> 10.33.8.75 NLM V4 GRANTED_MSG Call FH:0x6176411a svid:4 pos:0-0
36 2.489635 10.33.8.75 -> 10.33.8.71 NLM V4 GRANTED_MSG Reply (Call In 33)
46 2.489977 10.33.8.75 -> 10.33.8.71 NLM V4 GRANTED_RES Call NLM_DENIED
49 2.490096 10.33.8.71 -> 10.33.8.75 NLM V4 GRANTED_RES Reply (Call In 46)

19 - A lock request is sent from the client to the floating IP.
22 - An NLM_BLOCKED reply is sent back by the floating IP to the client.
33 - The server's primary IP address sends a GRANTED_MSG (the async grant
callback) to the client.
36 - Ack for the GRANTED_MSG in 33.
46 - The client returns NLM_DENIED to the server in a GRANTED_RES call, since
the grant doesn't match the lock it requested.
49 - Ack for the GRANTED_RES in 46.

In this case, the GRANTED_MSG is sent from the primary IP, as determined by the
routing table. This lock grant is rejected by the client, since the IP address of
the server making the call doesn't match the IP address of the server against
which the request was made. The locks are eventually granted after a 30 second
poll timeout on the client.

Similar problems are also seen when NFS shares are exported from GFS filesystems,
since GFS uses deferred locks.

The problem was introduced by commit 5ac5f9d1ce8492163dbde5d357dc5d03becf7e36,
which adds a check for the server IP address. This causes a regression for
clients which mount from a virtual IP address on the server.

A possible fix for this issue is to place the server IP address in the nlm_lock.oh
field used to make the request and compare it to the nlm_lock.oh returned in the
GRANTED_MSG call, instead of checking the IP address of the server making the
GRANTED_MSG call.
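
As a rough sketch of that idea (hypothetical, not a tested patch; the helper
name is made up), the client-side match would become a comparison of the
opaque oh values rather than of source addresses:

/* Match a GRANTED_MSG against the original request by the oh cookie
 * (an xdr_netobj) instead of by the caller's IP address. */
static int nlm_grant_matches_request(const struct nlm_lock *req,
                                     const struct nlm_lock *grant)
{
        return req->oh.len == grant->oh.len &&
               memcmp(req->oh.data, grant->oh.data, req->oh.len) == 0;
}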

Sachin Prabhu


2009-05-28 20:05:23

by J. Bruce Fields

Subject: Re: Huge race in lockd for async lock requests?

On Wed, May 20, 2009 at 10:37:05AM -0600, Rob Gardner wrote:
> Tom Talpey wrote:
>> At 02:55 AM 5/20/2009, Rob Gardner wrote:
>>
>>> Tom Talpey wrote:
>>>
>>>> At 04:43 PM 5/19/2009, Rob Gardner wrote:
>>>>
>>>>> I've got a question about lockd in conjunction with a filesystem
>>>>> that provides its own (async) locking.
>>>>>
>>>>> After nlmsvc_lock() calls vfs_lock_file(), it seems to me that we
>>>>> might get the async callback (nlmsvc_grant_deferred) at any time.
>>>>> What's to stop it from arriving before we even put the block on
>>>>> the nlm_block list? If this happens, then nlmsvc_grant_deferred()
>>>>> will print "grant for unknown block" and then we'll wait forever
>>>>> for a grant that will never come.
>>>>>
>>>> Yes, there's a race but the client will retry every 30 seconds, so it won't
>>>> wait forever.
>>>>
>>> OK, a blocking lock request will get retried in 30 seconds and work
>>> out "ok". But a non-blocking request will get in big trouble. Let's
>>> say the
>>
>> A non-blocking lock doesn't request, and won't get, a callback. So I
>> don't understand...
>>
>>
>
> What do you mean a non-blocking lock doesn't request? Remember that I'm
> dealing with a filesystem that provides its own locking functions via
> file->f_op->lock(). Such a filesystem might easily defer a non-blocking
> lock request and invoke the callback later. At least I don't know of any
> rule that says that it can't do this, and clearly the code expects this
> possibility:
>
> case FILE_LOCK_DEFERRED:
>         if (wait)
>                 break;
>         /* Filesystem lock operation is in progress
>            Add it to the queue waiting for callback */
>         ret = nlmsvc_defer_lock_rqst(rqstp, block);
>
>
>>> callback is invoked immediately after the vfs_lock_file call returns
>>> FILE_LOCK_DEFERRED. At this point, the block is not on the nlm_block
>>> list, so the callback routine will not be able to find it and mark it
>>> as granted. Then nlmsvc_lock() will call nlmsvc_defer_lock_rqst(),
>>> put the block on the nlm_block list, and eventually the request will
>>> time out and the client will get lck_denied. Meanwhile, the lock has
>>> actually been granted, but nobody knows about it.
>>>
>>
>> Yes, this can happen, I've seen it too. Again, it's a bug in the protocol
>> more than a bug in the clients.
> It looks to me like a bug in the server. The server must be able to deal
> with async filesystem callbacks happening at any time, however
> inconvenient.

Absolutely, if that's possible then it's a server bug.

--b.

2009-05-28 21:34:19

by Rob Gardner

Subject: Re: Huge race in lockd for async lock requests?

J. Bruce Fields wrote:
>>>>> At 04:43 PM 5/19/2009, Rob Gardner wrote:
>>>>>
>>>>>
>>>>>> I've got a question about lockd in conjunction with a filesystem
>>>>>> that provides its own (async) locking.
>>>>>>
>>>>>> After nlmsvc_lock() calls vfs_lock_file(), it seems to me that we
>>>>>> might get the async callback (nlmsvc_grant_deferred) at any time.
>>>>>> What's to stop it from arriving before we even put the block on
>>>>>> the nlm_block list? If this happens, then nlmsvc_grant_deferred()
>>>>>> will print "grant for unknown block" and then we'll wait forever
>>>>>> for a grant that will never come.
>>>>>>
>> dealing with a filesystem that provides its own locking functions via
>> file->f_op->lock(). Such a filesystem might easily defer a non-blocking
>> lock request and invoke the callback later. At least I don't know of any
>> rule that says that it can't do this, and clearly the code expects this
>> possibility:
>>
>> case FILE_LOCK_DEFERRED:
>>         if (wait)
>>                 break;
>>         /* Filesystem lock operation is in progress
>>            Add it to the queue waiting for callback */
>>         ret = nlmsvc_defer_lock_rqst(rqstp, block);
>>
>>
>> It looks to me like a bug in the server. The server must be able to deal
>> with async filesystem callbacks happening at any time, however
>> inconvenient.
>>
>
> Absolutely, if that's possible then it's a server bug.
>
> --b.
>

It's definitely possible for the async filesystem callback to occur at
any time. I think at the very least, nlmsvc_lock() ought to put the
block on the nlm_block list *before* calling vfs_lock_file(), and then
remove it immediately if the lock is granted synchronously. I would like
to develop and submit a patch for this, but I am currently working with
a much older kernel and it will take some time before I get to work with
newer bits.
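
The reordering I have in mind is roughly the following (an untested sketch
against nlmsvc_lock() in fs/lockd/svclock.c, not a finished patch; error
handling is reduced to just the cases discussed here):

        /* Make the block findable by nlmsvc_grant_deferred() before
         * the filesystem has any chance to call fl_grant. */
        nlmsvc_insert_block(block, NLM_NEVER);
        error = vfs_lock_file(file->f_file, F_SETLK, &lock->fl, NULL);
        switch (error) {
        case 0:                          /* granted synchronously */
                nlmsvc_unlink_block(block);  /* undo; nothing deferred */
                ret = nlm_granted;
                break;
        case FILE_LOCK_DEFERRED:         /* block is already on the list */
                ret = nlmsvc_defer_lock_rqst(rqstp, block);
                break;
        default:                         /* denied or error, as today */
                nlmsvc_unlink_block(block);
                ret = nlm_lck_denied;
                break;
        }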

Rob Gardner


2009-05-29 00:26:35

by J. Bruce Fields

Subject: Re: Huge race in lockd for async lock requests?

On Thu, May 28, 2009 at 03:34:19PM -0600, Rob Gardner wrote:
> J. Bruce Fields wrote:
>>>>>> At 04:43 PM 5/19/2009, Rob Gardner wrote:
>>>>>>
>>>>>>> I've got a question about lockd in conjunction with a
>>>>>>> filesystem that provides its own (async) locking.
>>>>>>>
>>>>>>> After nlmsvc_lock() calls vfs_lock_file(), it seems to me
>>>>>>> that we might get the async callback (nlmsvc_grant_deferred)
>>>>>>> at any time. What's to stop it from arriving before we even
>>>>>>> put the block on the nlm_block list? If this happens, then
>>>>>>> nlmsvc_grant_deferred() will print "grant for unknown block"
>>>>>>> and then we'll wait forever for a grant that will never come.
>>>>>>>
>>> dealing with a filesystem that provides its own locking functions via
>>> file->f_op->lock(). Such a filesystem might easily defer a
>>> non-blocking lock request and invoke the callback later. At least I
>>> don't know of any rule that says that it can't do this, and clearly
>>> the code expects this possibility:
>>>
>>> case FILE_LOCK_DEFERRED:
>>>         if (wait)
>>>                 break;
>>>         /* Filesystem lock operation is in progress
>>>            Add it to the queue waiting for callback */
>>>         ret = nlmsvc_defer_lock_rqst(rqstp, block);
>>>
>>> It looks to me like a bug in the server. The server must be able
>>> to deal with async filesystem callbacks happening at any time,
>>> however inconvenient.
>>>
>>
>> Absolutely, if that's possible then it's a server bug.
>>
>> --b.
>>
>
> It's definitely possible for the async filesystem callback to occur at
> any time.

Looking at the code.... This is all under the BKL, and as far as I can
tell there aren't any blocking operations anywhere there, so I don't
think this should happen if the filesystem is careful. Have you seen it
happen?

Of course this may be fragile--we'll have to think about what to do when
we eventually remove the BKL.

--b.

> I think at the very least, nlmsvc_lock() ought to put the
> block on the nlm_block list *before* calling vfs_lock_file(), and then
> remove it immediately if the lock is granted synchronously. I would like
> to develop and submit a patch for this, but I am currently working with
> a much older kernel and it will take some time before I get to work with
> newer bits.

2009-05-29 02:59:03

by Rob Gardner

Subject: Re: Huge race in lockd for async lock requests?

J. Bruce Fields wrote:
>
> Looking at the code.... This is all under the BKL, and as far as I can
> tell there aren't any blocking operations anywhere there, so I don't
> think this should happen if the filesystem is careful. Have you seen it
> happen?


Aha, I just figured it out and you were right. The filesystem in this
case was not careful. It broke the rules and actually made the fl_grant
call *before* even returning from nlmsvc_lock's call to vfs_lock_file,
and it did it in the lockd thread! So the BKL was of no use, and I saw
nlmsvc_grant_deferred print "grant for unknown block". So I think
everything is ok; there is no huge race in lockd for async lock requests.
Thank you for clearing this up.

Rob Gardner


2009-05-29 13:23:13

by Tom Talpey

Subject: Re: Huge race in lockd for async lock requests?

At 10:59 PM 5/28/2009, Rob Gardner wrote:
>J. Bruce Fields wrote:
>>
>> Looking at the code.... This is all under the BKL, and as far as I can
>> tell there aren't any blocking operations anywhere there, so I don't
>> think this should happen if the filesystem is careful. Have you seen it
>> happen?
>
>
>Aha, I just figured it out and you were right. The filesystem in this
>case was not careful. It broke the rules and actually made the fl_grant
>call *before* even returning from nlmsvc_lock's call to vfs_lock_file,
>and it did it in the lockd thread! So the BKL was of no use, and I saw
>nlmsvc_grant_deferred print "grant for unknown block". So I think
>everything is ok; there is no huge race in lockd for async lock requests.
>Thank you for clearing this up.

Gack! I'm surprised it worked at all. The fact that the BKL allows itself to
be taken recursively really masked your filesystem bug. If the BKL had
blocked, or asserted, the bug would never have happened.

This is as good a time as any to point out that the BKL's use in the lockd
code is insidious and needs some serious attention. Unfortunately, it's also
wrapped up in the BKL use of the VFS locking layer, but in practice those
are two different things. Using the BKL for upcall/downcall synchronization
and for lockd thread protection is the issue here.

Tom.


2009-05-18 13:41:16

by Sachin Prabhu

Subject: Re: Virtual IPs and blocking locks

Rob Gardner wrote:
> It looks to me like recent kernels have added a "h_srcaddr" field to the
> nlm_host structure, and this should be set to the server's virtual ip
> address. Then when the server sends the GRANTED_MSG call to the client,
> it should appear to be coming from the virtual ip address, not the
> server's primary ip address. So either h_srcaddr isn't getting set up
> correctly with your virtual ip address, or rpc_create() isn't binding it
> as the source address as it should. In our (older kernel) code, we
> explicitly call xprt_set_bindaddr() with the virtual ip address to make
> this happen, but I don't immediately see where this happens in the
> latest kernel source.

You are right, I cannot reproduce this issue with NFS servers based on later
versions of the kernel containing the patches for
'NLM: fix source address in server callbacks'.

However, this still leaves the question of how the client should handle a
callback made from a different IP address. Should the client accept such
callbacks, or is the current behaviour of rejecting them correct?

Sachin Prabhu

2009-05-18 13:46:48

by Trond Myklebust

Subject: Re: Virtual IPs and blocking locks

On Mon, 2009-05-18 at 14:41 +0100, Sachin S. Prabhu wrote:
> Rob Gardner wrote:
> > It looks to me like recent kernels have added a "h_srcaddr" field to the
> > nlm_host structure, and this should be set to the server's virtual ip
> > address. Then when the server sends the GRANTED_MSG call to the client,
> > it should appear to be coming from the virtual ip address, not the
> > server's primary ip address. So either h_srcaddr isn't getting set up
> > correctly with your virtual ip address, or rpc_create() isn't binding it
> > as the source address as it should. In our (older kernel) code, we
> > explicitly call xprt_set_bindaddr() with the virtual ip address to make
> > this happen, but I don't immediately see where this happens in the
> > latest kernel source.
>
> You are right, I cannot reproduce this issue with NFS servers based on later
> versions of the kernel containing the patches for
> 'NLM: fix source address in server callbacks'.
>
> However, this still leaves the question of how the client should handle a
> callback made from a different IP address. Should the client accept such
> callbacks, or is the current behaviour of rejecting them correct?

The decision to reject callbacks from other ip addresses was deliberate.
There is no good justification for a server to send callbacks or replies
using an address that won't be recognised by the client.

Trond


2009-05-18 13:54:23

by Rob Gardner

Subject: Re: Virtual IPs and blocking locks

If the client gets a message from an IP address that he never sent a
request to, he is correct to reject that message.

Rob


Sachin S. Prabhu wrote:
>
> However, this still leaves the question of how the client should handle a
> callback made from a different IP address. Should the client accept such
> callbacks, or is the current behaviour of rejecting them correct?


2009-05-19 20:43:37

by Rob Gardner

Subject: Huge race in lockd for async lock requests?

I've got a question about lockd in conjunction with a filesystem that
provides its own (async) locking.

After nlmsvc_lock() calls vfs_lock_file(), it seems to me that we might
get the async callback (nlmsvc_grant_deferred) at any time. What's to
stop it from arriving before we even put the block on the nlm_block
list? If this happens, then nlmsvc_grant_deferred() will print "grant
for unknown block" and then we'll wait forever for a grant that will
never come.

Seems like we ought to do nlmsvc_insert_block() before vfs_lock_file()
at the very least; but this still leaves a problem where the lock is
granted via the callback while we're still in nlmsvc_lock(), and we
ignore it and tell the client that the lock is blocked. Now the client
has to retry before getting the lock.

Any thoughts on this besides "give up on using lockd"?

Rob Gardner


2009-05-19 21:33:56

by Tom Talpey

Subject: Re: Huge race in lockd for async lock requests?

At 04:43 PM 5/19/2009, Rob Gardner wrote:
>I've got a question about lockd in conjunction with a filesystem that
>provides its own (async) locking.
>
>After nlmsvc_lock() calls vfs_lock_file(), it seems to me that we might
>get the async callback (nlmsvc_grant_deferred) at any time. What's to
>stop it from arriving before we even put the block on the nlm_block
>list? If this happens, then nlmsvc_grant_deferred() will print "grant
>for unknown block" and then we'll wait forever for a grant that will
>never come.

Yes, there's a race but the client will retry every 30 seconds, so it won't
wait forever. Depending on the kernel client version, there are some
improvements we've tried over time to close the raciness a little. What
exact client version are you working with?

>
>Seems like we ought to do nlmsvc_insert_block() before vfs_lock_file()
>at the very least; But this still leaves problems where the lock is
>granted via the callback while we're still in nlmsvc_lock(), and we
>ignore it and tell the client that the lock is blocked; Now they'll have
>to retry before getting the lock.

It's a little worse than that. It's also possible for the client to hold a lock,
but a stray or retried server callback can cause the client to reject it,
releasing the lock at the server. This causes the server to grant the lock
to another client even though the first client still thinks it holds it. It's an
NLM protocol issue, frankly, due to the fact that the server callback is a
completely separate channel.

>
>Any thoughts on this besides "give up on using lockd"?

Use NFSv4? ;-)

Tom.

>
>Rob Gardner


2009-05-20 06:54:22

by Rob Gardner

Subject: Re: Huge race in lockd for async lock requests?

Tom Talpey wrote:
> At 04:43 PM 5/19/2009, Rob Gardner wrote:
> >I've got a question about lockd in conjunction with a filesystem that
> >provides its own (async) locking.
> >
> >After nlmsvc_lock() calls vfs_lock_file(), it seems to me that we might
> >get the async callback (nlmsvc_grant_deferred) at any time. What's to
> >stop it from arriving before we even put the block on the nlm_block
> >list? If this happens, then nlmsvc_grant_deferred() will print "grant
> >for unknown block" and then we'll wait forever for a grant that will
> >never come.
>
> Yes, there's a race but the client will retry every 30 seconds, so it won't
> wait forever.
OK, a blocking lock request will get retried in 30 seconds and work out
"ok". But a non-blocking request will get in big trouble. Let's say the
callback is invoked immediately after the vfs_lock_file call returns
FILE_LOCK_DEFERRED. At this point, the block is not on the nlm_block
list, so the callback routine will not be able to find it and mark it as
granted. Then nlmsvc_lock() will call nlmsvc_defer_lock_rqst(), put the
block on the nlm_block list, and eventually the request will time out and
the client will get lck_denied. Meanwhile, the lock has actually been
granted, but nobody knows about it.

> Depending on the kernel client version, there are some
> improvements we've tried over time to close the raciness a little. What
> exact client version are you working with?
>

I maintain nfs/nlm server code for a NAS product, and so there is no
"exact client" but rather a multitude of clients that I have no control
over. All I can do is hack the server. We have been working around this
by using a semaphore to cover the vfs_lock_file() to
nlmsvc_insert_block() sequence in nlmsvc_lock() and also
nlmsvc_grant_deferred(). So if the callback arrives at a bad time, it
has to wait until the lock actually makes it onto the nlm_block list,
and so the status of the lock gets updated properly.
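
In outline (a simplified sketch of our out-of-tree workaround, shown here
with a mutex; the name is ours and nothing below is mainline code):

static DEFINE_MUTEX(grant_race_mutex);

        /* in nlmsvc_lock(): cover the window between vfs_lock_file()
         * and the block landing on the nlm_block list */
        mutex_lock(&grant_race_mutex);
        error = vfs_lock_file(file->f_file, F_SETLK, &lock->fl, NULL);
        if (error == FILE_LOCK_DEFERRED)
                ret = nlmsvc_defer_lock_rqst(rqstp, block);
        mutex_unlock(&grant_race_mutex);

        /* in nlmsvc_grant_deferred(): take the same mutex before
         * searching the list, so a callback that fires inside the
         * window waits until the block is actually there */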

> Use NFSv4? ;-)
>

I had a feeling you were going to say that. ;-) Unfortunately that
doesn't make NFSv3 and lockd go away.


Rob Gardner

2009-05-20 14:00:41

by Tom Talpey

Subject: Re: Huge race in lockd for async lock requests?

At 02:55 AM 5/20/2009, Rob Gardner wrote:
>Tom Talpey wrote:
>> At 04:43 PM 5/19/2009, Rob Gardner wrote:
>> >I've got a question about lockd in conjunction with a filesystem that
>> >provides its own (async) locking.
>> >
>> >After nlmsvc_lock() calls vfs_lock_file(), it seems to me that we might
>> >get the async callback (nlmsvc_grant_deferred) at any time. What's to
>> >stop it from arriving before we even put the block on the nlm_block
>> >list? If this happens, then nlmsvc_grant_deferred() will print "grant
>> >for unknown block" and then we'll wait forever for a grant that will
>> >never come.
>>
>> Yes, there's a race but the client will retry every 30 seconds, so it won't
>> wait forever.
>OK, a blocking lock request will get retried in 30 seconds and work out
>"ok". But a non-blocking request will get in big trouble. Let's say the

A non-blocking lock doesn't request, and won't get, a callback. So I
don't understand...

>callback is invoked immediately after the vfs_lock_file call returns
>FILE_LOCK_DEFERRED. At this point, the block is not on the nlm_block
>list, so the callback routine will not be able to find it and mark it as
>granted. Then nlmsvc_lock() will call nlmsvc_defer_lock_rqst(), put the
>block on the nlm_block list, and eventually the request will time out and
>the client will get lck_denied. Meanwhile, the lock has actually been
>granted, but nobody knows about it.

Yes, this can happen, I've seen it too. Again, it's a bug in the protocol
more than a bug in the clients. It gets even worse when retries occur.
If the reply cache doesn't catch the duplicates (and it never does), all
heck breaks out.

>
>> Depending on the kernel client version, there are some
>> improvements we've tried over time to close the raciness a little. What
>> exact client version are you working with?
>>
>
>I maintain nfs/nlm server code for a NAS product, and so there is no
>"exact client" but rather a multitude of clients that I have no control
>over. All I can do is hack the server. We have been working around this

I feel for ya (been there, done that) :-)

>by using a semaphore to cover the vfs_lock_file() to
>nlmsvc_insert_block() sequence in nlmsvc_lock() and also
>nlmsvc_grant_deferred(). So if the callback arrives at a bad time, it
>has to wait until the lock actually makes it onto the nlm_block list,
>and so the status of the lock gets updated properly.

Can you explain this further? If you're implementing the server, how do
you know your callback "arrives at a bad time", by the DENIED result
from the client?

Another thing to worry about is the presence of NLM_CANCEL calls
from the client which cross the callbacks.

I sent a patch which improves the situation at the client some time
ago. Basically it made the client more willing to positively acknowledge
a callback which didn't match the nlm_blocked list, by also checking
whether the lock was actually being held. This was only half the
solution, however; it didn't close the protocol race, just the client
one. You want the patch? I'll look for it.

>
>> Use NFSv4? ;-)
>>
>
>I had a feeling you were going to say that. ;-) Unfortunately that
>doesn't make NFSv3 and lockd go away.

Yes, I know. Unfortunately there aren't any elegant solutions to
the NLM protocol's flaws.

Tom.


2009-05-20 14:15:11

by Tom Talpey

Subject: Re: Huge race in lockd for async lock requests?

At 10:00 AM 5/20/2009, Tom Talpey wrote:
>At 02:55 AM 5/20/2009, Rob Gardner wrote:
>>Tom Talpey wrote:
>>> At 04:43 PM 5/19/2009, Rob Gardner wrote:
>>> >I've got a question about lockd in conjunction with a filesystem that
>>> >provides its own (async) locking.
>>> >
>>> >After nlmsvc_lock() calls vfs_lock_file(), it seems to me that we might
>>> >get the async callback (nlmsvc_grant_deferred) at any time. What's to
>>> >stop it from arriving before we even put the block on the nlm_block
>>> >list? If this happens, then nlmsvc_grant_deferred() will print "grant
>>> >for unknown block" and then we'll wait forever for a grant that will
>>> >never come.
>>>
>>> Yes, there's a race but the client will retry every 30 seconds, so it won't
>>> wait forever.
>>OK, a blocking lock request will get retried in 30 seconds and work out
>>"ok". But a non-blocking request will get in big trouble. Let's say the
>
>A non-blocking lock doesn't request, and won't get, a callback. So I
>don't understand...
>
>>callback is invoked immediately after the vfs_lock_file call returns
>>FILE_LOCK_DEFERRED. At this point, the block is not on the nlm_block
>>list, so the callback routine will not be able to find it and mark it as
>>granted. Then nlmsvc_lock() will call nlmsvc_defer_lock_rqst(), put the
>>block on the nlm_block list, and eventually the request will time out and
>>the client will get lck_denied. Meanwhile, the lock has actually been
>>granted, but nobody knows about it.
>
>Yes, this can happen, I've seen it too. Again, it's a bug in the protocol
>more than a bug in the clients. It gets even worse when retries occur.
>If the reply cache doesn't catch the duplicates (and it never does), all
>heck breaks out.
>
>>
>>> Depending on the kernel client version, there are some
>>> improvements we've tried over time to close the raciness a little. What
>>> exact client version are you working with?
>>>
>>
>>I maintain nfs/nlm server code for a NAS product, and so there is no
>>"exact client" but rather a multitude of clients that I have no control
>>over. All I can do is hack the server. We have been working around this
>
>I feel for ya (been there, done that) :-)
>
>>by using a semaphore to cover the vfs_lock_file() to
>>nlmsvc_insert_block() sequence in nlmsvc_lock() and also
>>nlmsvc_grant_deferred(). So if the callback arrives at a bad time, it
>>has to wait until the lock actually makes it onto the nlm_block list,
>>and so the status of the lock gets updated properly.
>
>Can you explain this further? If you're implementing the server, how do
>you know your callback "arrives at a bad time", by the DENIED result
>from the client?
>
>Another thing to worry about is the presence of NLM_CANCEL calls
>from the client which cross the callbacks.
>
>I sent a patch which improves the situation at the client some time
>ago. Basically it made the client more willing to positively acknowledge
>a callback which didn't match the nlm_blocked list, by also checking
>whether the lock was actually being held. This was only half the
>solution, however; it didn't close the protocol race, just the client
>one. You want the patch? I'll look for it.

Found it, on the old nfs list:

http://thread.gmane.org/gmane.linux.nfs/16611

Tom.

>
>>
>>> Use NFSv4? ;-)
>>>
>>
>>I had a feeling you were going to say that. ;-) Unfortunately that
>>doesn't make NFSv3 and lockd go away.
>
>Yes, I know. Unfortunately there aren't any elegant solutions to
>the NLM protocol's flaws.
>
>Tom.


2009-05-20 16:37:05

by Rob Gardner

Subject: Re: Huge race in lockd for async lock requests?

Tom Talpey wrote:
> At 02:55 AM 5/20/2009, Rob Gardner wrote:
>
>> Tom Talpey wrote:
>>
>>> At 04:43 PM 5/19/2009, Rob Gardner wrote:
>>>
>>>> I've got a question about lockd in conjunction with a filesystem that
>>>> provides its own (async) locking.
>>>>
>>>> After nlmsvc_lock() calls vfs_lock_file(), it seems to me that we might
>>>> get the async callback (nlmsvc_grant_deferred) at any time. What's to
>>>> stop it from arriving before we even put the block on the nlm_block
>>>> list? If this happens, then nlmsvc_grant_deferred() will print "grant
>>>> for unknown block" and then we'll wait forever for a grant that will
>>>> never come.
>>>>
>>> Yes, there's a race but the client will retry every 30 seconds, so it won't
>>> wait forever.
>>>
>> OK, a blocking lock request will get retried in 30 seconds and work out
>> "ok". But a non-blocking request will get in big trouble. Let's say the
>>
>
> A non-blocking lock doesn't request, and won't get, a callback. So I
> don't understand...
>
>

What do you mean a non-blocking lock doesn't request? Remember that I'm
dealing with a filesystem that provides its own locking functions via
file->f_op->lock(). Such a filesystem might easily defer a non-blocking
lock request and invoke the callback later. At least I don't know of any
rule that says that it can't do this, and clearly the code expects this
possibility:

case FILE_LOCK_DEFERRED:
        if (wait)
                break;
        /* Filesystem lock operation is in progress
           Add it to the queue waiting for callback */
        ret = nlmsvc_defer_lock_rqst(rqstp, block);
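
For concreteness, here is the kind of filesystem ->lock() op I mean; this is
an entirely hypothetical sketch (the myfs_* helpers are invented), not any
particular filesystem:

/* A non-blocking request can still be deferred if the answer needs,
 * say, a cluster round trip; the grant arrives later when the
 * completion path calls fl->fl_lmops->fl_grant(). */
static int myfs_lock(struct file *filp, int cmd, struct file_lock *fl)
{
        if (myfs_needs_cluster_roundtrip(fl)) {
                myfs_queue_request(filp, fl);
                return FILE_LOCK_DEFERRED;
        }
        return posix_lock_file(filp, fl, NULL); /* synchronous answer */
}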


>> callback is invoked immediately after the vfs_lock_file call returns
>> FILE_LOCK_DEFERRED. At this point, the block is not on the nlm_block
>> list, so the callback routine will not be able to find it and mark it as
>> granted. Then nlmsvc_lock() will call nlmsvc_defer_lock_rqst(), put the
>> block on the nlm_block list, and eventually the request will time out and
>> the client will get lck_denied. Meanwhile, the lock has actually been
>> granted, but nobody knows about it.
>>
>
> Yes, this can happen, I've seen it too. Again, it's a bug in the protocol
> more than a bug in the clients.
It looks to me like a bug in the server. The server must be able to deal
with async filesystem callbacks happening at any time, however inconvenient.



> It gets even worse when retries occur.
> If the reply cache doesn't catch the duplicates (and it never does), all
> heck breaks out.
>
You'll have to explain further what scenario you're talking about. I
don't understand what the reply cache has to do with lockd.

>> by using a semaphore to cover the vfs_lock_file() to
>> nlmsvc_insert_block() sequence in nlmsvc_lock() and also
>> nlmsvc_grant_deferred(). So if the callback arrives at a bad time, it
>> has to wait until the lock actually makes it onto the nlm_block list,
>> and so the status of the lock gets updated properly.
>>
>
> Can you explain this further? If you're implementing the server, how do
> you know your callback "arrives at a bad time", by the DENIED result
> from the client?
>

I sense a little confusion so let me be more precise. By "callback" I am
talking about the callback from the filesystem to lockd via
lock_manager_operations.fl_grant (ie, nlmsvc_grant_deferred). If this
callback is invoked while the lockd thread is executing code in
nlmsvc_lock() between the call to vfs_lock_file() and the call to
nlmsvc_insert_block(), then the callback routine (nlmsvc_grant_deferred)
will not find the block on the nlm_block list because it's not there
yet, and thus the "grant" is effectively lost. We use a semaphore in
nlmsvc_lock() and nlmsvc_grant_deferred() to avoid this race.


Rob Gardner




2009-05-20 23:20:03

by Rob Gardner

Subject: Re: Huge race in lockd for async lock requests?

Tom Talpey wrote:
> Found it, on the old nfs list:
> http://thread.gmane.org/gmane.linux.nfs/16611
>
>

Sorry I didn't see this until now... I realize now that we were
definitely not talking about the same problem. I apologize for not being
clear. My 'callbacks' are file system grant calls to lockd, not rpc
calls to/from clients.

Rob Gardner


2009-05-15 16:50:43

by Rob Gardner

Subject: Re: Virtual IPs and blocking locks

It looks to me like recent kernels have added a "h_srcaddr" field to the
nlm_host structure, and this should be set to the server's virtual ip
address. Then when the server sends the GRANTED_MSG call to the client,
it should appear to be coming from the virtual ip address, not the
server's primary ip address. So either h_srcaddr isn't getting set up
correctly with your virtual ip address, or rpc_create() isn't binding it
as the source address as it should. In our (older kernel) code, we
explicitly call xprt_set_bindaddr() with the virtual ip address to make
this happen, but I don't immediately see where this happens in the
latest kernel source.
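
For what it's worth, I'd expect the binding to happen where lockd creates
the RPC client for the callback. Sketching from memory (field names may
not match the current source exactly):

struct rpc_create_args args = {
        .protocol   = host->h_proto,
        .address    = (struct sockaddr *)&host->h_addr,
        .addrsize   = host->h_addrlen,
        /* bind the floating/virtual IP as the source address so the
         * GRANTED_MSG leaves from the address the client mounted */
        .saddress   = (struct sockaddr *)&host->h_srcaddr,
        .servername = host->h_name,
        .program    = &nlm_program,
        .version    = host->h_version,
        .authflavor = RPC_AUTH_UNIX,
};
clnt = rpc_create(&args);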

Rob Gardner
HP Storage Works / NAS


Sachin S. Prabhu wrote:
> We have had a few reported cases of problems using blocking locks on NFS shares
> mounted using virtual IPs. In these cases, the NFS server was using a floating
> IP for clustering purposes.
>
> Please consider the transaction below
>
> NFS client: 10.33.8.75
> NFS Server:
> Primary IP : 10.33.8.71
> Floating IP: 10.33.8.77
>
> $ tshark -r block-virtual.pcap -R 'nlm'
> 19 2.487622 10.33.8.75 -> 10.33.8.77 NLM V4 LOCK Call FH:0x6176411a svid:4 pos:0-0
> 22 2.487760 10.33.8.77 -> 10.33.8.75 NLM V4 LOCK Reply (Call In 19) NLM_BLOCKED
> 33 2.489518 10.33.8.71 -> 10.33.8.75 NLM V4 GRANTED_MSG Call FH:0x6176411a svid:4 pos:0-0
> 36 2.489635 10.33.8.75 -> 10.33.8.71 NLM V4 GRANTED_MSG Reply (Call In 33)
> 46 2.489977 10.33.8.75 -> 10.33.8.71 NLM V4 GRANTED_RES Call NLM_DENIED
> 49 2.490096 10.33.8.71 -> 10.33.8.75 NLM V4 GRANTED_RES Reply (Call In 46)
>
> 19 - A lock request is sent from the client to the floating IP.
> 22 - An NLM_BLOCKED reply is sent back by the floating IP to the client.
> 33 - The server's primary IP address sends a GRANTED_MSG (the async grant
> callback) to the client.
> 36 - Ack for the GRANTED_MSG in 33.
> 46 - The client returns NLM_DENIED to the server in a GRANTED_RES call, since
> the grant doesn't match the lock it requested.
> 49 - Ack for the GRANTED_RES in 46.
>
> In this case, the GRANTED_MSG is sent from the primary IP, as determined by the
> routing table. This lock grant is rejected by the client, since the IP address of
> the server making the call doesn't match the IP address of the server against
> which the request was made. The locks are eventually granted after a 30 second
> poll timeout on the client.
>
> Similar problems are also seen when NFS shares are exported from GFS filesystems,
> since GFS uses deferred locks.
>
> The problem was introduced by commit 5ac5f9d1ce8492163dbde5d357dc5d03becf7e36,
> which adds a check for the server IP address. This causes a regression for
> clients which mount from a virtual IP address on the server.
>
> A possible fix for this issue is to place the server IP address in the nlm_lock.oh
> field used to make the request and compare it to the nlm_lock.oh returned in the
> GRANTED_MSG call, instead of checking the IP address of the server making the
> GRANTED_MSG call.
>
> Sachin Prabhu