2010-04-09 19:40:19

by David Teigland

Subject: lockd and lock cancellation

Here's what I think was the first time we discussed cancellation and
Bruce's provisional locks: http://marc.info/?t=116538335700005&r=1&w=2
I'm still skeptical of trying to handle cancels; it seems too complex to
become reliable in the lifetime of nfs3.

What I would be interested to see fixed is this oops that's not difficult
to trigger by doing lock/unlock loops on a client:
https://bugzilla.redhat.com/show_bug.cgi?id=502977#c18

But, for all the kernel work on these nfs/gfs/dlm hooks, there's a larger
issue that no one is working on AFAIK: the mechanisms for recovering
client locks on remaining gfs nodes when one gfs node fails. That would
take a lot of work, and until it's done all the kernel apis will be a moot
point since clustered nfs locks on gfs will be unusable.

Dave



2010-04-09 20:50:08

by David Teigland

Subject: Re: lockd and lock cancellation

On Fri, Apr 09, 2010 at 04:25:09PM -0400, Chuck Lever wrote:
> >But, for all the kernel work on these nfs/gfs/dlm hooks, there's a larger
> >issue that no one is working on AFAIK: the mechanisms for recovering
> >client locks on remaining gfs nodes when one gfs node fails. That would
> >take a lot of work, and until it's done all the kernel apis will be a moot
> >point since clustered nfs locks on gfs will be unusable.
>
> To support IPv6, I've studied and modified the NFSv2/v3 lock
> recovery mechanisms quite a bit recently. What kernel APIs do you
> think would be needed to manage cluster lock recovery? Just
> something to release stale locks on a single node?

I only have a general idea of what needs to be done; I think Wendy Cheng
may have written a more detailed TODO list a few years ago. The main
problem is that when a gfs node fails, the other gfs nodes purge all
the posix locks that it held. In the case of nfs that's a problem, of
course, because the plocks being purged didn't actually belong to that
node/server but to the clients connected to it. Those clients are still
alive and either failing over to an alternate gfs/nfs server or waiting
for the failed server to return.

So, when a gfs/nfs node/server fails, the remaining gfs servers need to
reclaim locks from the nfs clients that were connected to it, and insert
these locks into the gfs/dlm posix lock table. That recovery of client
locks needs to happen more or less during the grace period, after the
purging of locks from the failed node and before any locks are granted.

Basically, nfs lock recovery needs to be integrated with gfs/dlm lock
recovery.
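
To make the ordering concrete, here's a rough sketch of the sequence I
have in mind when a gfs/nfs server node fails. None of these helper
functions exist today; the names are invented purely for illustration:

/* hypothetical recovery sequencing -- none of these helpers exist */
void recover_nfs_client_locks(int failed_nodeid)
{
        /* 1. dlm/gfs purge the plocks owned by the failed node */
        purge_plocks_for_node(failed_nodeid);

        /* 2. enter the grace period so no new plocks are granted */
        enter_grace_period();

        /* 3. have statd send SM_NOTIFY to the clients of the failed
         *    server so they resend their locks as reclaim requests */
        notify_clients_of_node(failed_nodeid);

        /* 4. lockd on the remaining servers accepts those reclaims and
         *    reinserts them into the gfs/dlm posix lock table; this is
         *    the piece that doesn't exist */

        /* 5. only then end the grace period and grant new requests */
        end_grace_period();
}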

Dave

2010-04-09 20:26:08

by Chuck Lever III

Subject: Re: lockd and lock cancellation

Hi David-

On 04/09/2010 03:40 PM, David Teigland wrote:
> Here's what I think was the first time we discussed cancellation and
> Bruce's provisional locks: http://marc.info/?t=116538335700005&r=1&w=2
> I'm still skeptical of trying to handle cancels; it seems too complex to
> become reliable in the lifetime of nfs3.
>
> What I would be interested to see fixed is this oops that's not difficult
> to trigger by doing lock/unlock loops on a client:
> https://bugzilla.redhat.com/show_bug.cgi?id=502977#c18
>
> But, for all the kernel work on these nfs/gfs/dlm hooks, there's a larger
> issue that no one is working on AFAIK: the mechanisms for recovering
> client locks on remaining gfs nodes when one gfs node fails. That would
> take a lot of work, and until it's done all the kernel apis will be a moot
> point since clustered nfs locks on gfs will be unusable.

To support IPv6, I've studied and modified the NFSv2/v3 lock recovery
mechanisms quite a bit recently. What kernel APIs do you think would be
needed to manage cluster lock recovery? Just something to release stale
locks on a single node?

--
chuck[dot]lever[at]oracle[dot]com

2010-04-01 14:06:27

by Rob Gardner

Subject: Re: lockd and lock cancellation

Steven Whitehouse wrote:
> Hi,
>
> Thanks for the fast response...
>
> On Thu, 2010-04-01 at 13:40 +0100, Rob Gardner wrote:
>
>> Steven Whitehouse wrote:
>>
>>> Hi,
>>>
>>> I'm trying to find my way around the lockd code and I'm currently a bit
>>> stumped by the code relating to lock cancellation. There is only one
>>> call to vfs_cancel_lock() in lockd/svclock.c and its return value isn't
>>> checked.
>>>
>>> It is used in combination with nlmsvc_unlink_block() which
>>> unconditionally calls posix_unblock_lock(). There are also other places
>>> where the code calls nlmsvc_unlink_block() without first canceling
>>> the lock. The way in which vfs_cancel_lock() is used suggests that
>>> canceling a lock is a synchronous operation, and that it must succeed
>>> before returning.
>>>
>>> I'd have expected to see (pseudo code) something more like the
>>> following:
>>>
>>> ret = vfs_cancel_lock();
>>> if (ret == -ENOENT)      /* never had the lock in the first place */
>>>         do_something_appropriate();
>>> else if (ret == -EINVAL) /* we raced with a grant */
>>>         unlock_lock();
>>> else                     /* lock successfully canceled */
>>>         nlmsvc_unlink_block();
>>>
>>> Is there a reason why that is not required? And indeed, is there a
>>> reason why it's safe to call nlmsvc_unlink_block() in the cases where the
>>> lock isn't canceled first? I'm trying to work out how the underlying fs
>>> can tell that a lock has gone away in those particular cases,
>>>
>>>
>> Steve,
>>
>> I noticed the missing cancel scenario some time ago and reported on it
>> here. Bruce agreed that it was a bug, but I regret that I haven't had
>> time to follow up on it. The problem was that vfs_cancel_lock was not
>> being called in all cases where it should be, possibly resulting in an
>> orphaned lock in the filesystem. See attached message for more detail.
>> (Or http://marc.info/?l=linux-nfs&m=125849395630496&w=2)
>>
>>
> I have one question relating to that message (see below)
>
>
>> By the way, if a lock grant wins a race with a cancel, I do not think it
>> is "safe" to simply unlock the lock at that point.
>>
>>
> Why not? If the cancel has failed, then we are left holding the lock
> just as if we'd requested it and no cancel had been issued. Or another
> way to ask the same question, if that does occur, what would be the
> correct way to dispose of the unwanted lock?
>

If the lock were actually granted, then unlocking in lieu of a cancel
can potentially leave a range unlocked that should be left locked. This
can happen in the case of a lock upgrade or a coalesce operation. For
instance, suppose the client holds a lock on bytes 0-100, then issues
another lock request for bytes 50-150, but sends a cancel just after the
lock is actually granted. If you now simply unlock 50-150, then the
client is left holding only 0-50, and has "lost" the lock on bytes
51-100. In other words, the client will *believe* that he has 0-100
locked, but in reality, only 0-50 are locked.
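
The merging is easy to see with ordinary POSIX locks from user space.
This little stand-alone program (purely illustrative, nothing to do
with lockd itself) locks bytes 0-100, then 50-150, then "undoes" the
second request by unlocking its range; a child process then uses
F_GETLK to show that only the first ~50 bytes are still held:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static void setlk(int fd, short type, off_t start, off_t len)
{
        struct flock fl = { .l_type = type, .l_whence = SEEK_SET,
                            .l_start = start, .l_len = len };
        if (fcntl(fd, F_SETLK, &fl) < 0) {
                perror("F_SETLK");
                exit(1);
        }
}

static void check(int fd, off_t start, off_t len)
{
        struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                            .l_start = start, .l_len = len };
        fcntl(fd, F_GETLK, &fl);
        printf("bytes %ld-%ld: %s\n", (long)start,
               (long)(start + len - 1),
               fl.l_type == F_UNLCK ? "unlocked" : "locked");
}

int main(void)
{
        int fd = open("/tmp/coalesce-demo", O_RDWR | O_CREAT, 0644);

        setlk(fd, F_WRLCK, 0, 101);     /* lock bytes 0-100 */
        setlk(fd, F_WRLCK, 50, 101);    /* lock bytes 50-150: merged */
        setlk(fd, F_UNLCK, 50, 101);    /* "undo" the second request */

        if (fork() == 0) {              /* child sees the parent's locks */
                int cfd = open("/tmp/coalesce-demo", O_RDWR);
                check(cfd, 0, 50);      /* prints "locked"   */
                check(cfd, 50, 51);     /* prints "unlocked" */
                exit(0);
        }
        wait(NULL);
        return 0;
}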

As for what to do in this situation... well it would be nice if the
filesystem treated the cancel request as an "undo" if the grant won the
race. But seriously, I think this is just one of the (many) flaws in the
protocol and we probably have to live with it. My personal feeling is
that it's safer to leave bytes locked rather than have a client believe
it holds a lock when it doesn't.

> [snip]
>
>> Seems reasonable, though it is a bit annoying trying to determine which
>> of these should be called where, so...
>>
>>
>>> Another possibility is to change nlmsvc_unlink_block() to make the call to
>>> vfs_cancel_lock() and then remove the call to vfs_cancel_lock in
>>> nlmsvc_cancel_blocked(). But I don't really like this as most other
>>> calls to nlmsvc_unlink_block() do not require a call to vfs_cancel_lock().
>>>
>> ..yes, I understand why the idea initially appeals, but don't have a
>> better suggestion.
>>
>> --b.
>>
>>
>
> Can we not use a flag to figure out when a cancel needs to be sent? We
> could set the flag when an async request was sent to the underlying fs
> and clear it when the reply arrives. It would thus only be valid to send
> a vfs_cancel_lock() request when the flag was set.
>
We could do all this, but I don't see the point, since there is still a
race window you could sail a boat through. It's the period of time
between when the client sends a cancel request and the time that lockd
sends the cancel request to the filesystem. If a grant happens during
this time, what can be done? The protocol just doesn't have a way to
deal with this.
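
For what it's worth, the flag itself would be simple enough; something
like the sketch below, where b_fs_pending is a made-up field that does
not exist in struct nlm_block today. It just doesn't close the window
described above:

/* illustrative only -- b_fs_pending is not a real field */

/* when handing the blocking request to the underlying fs: */
block->b_fs_pending = 1;
error = vfs_lock_file(file->f_file, F_SETLK, &lock->fl, NULL);

/* when the fs reports the async result back to lockd: */
block->b_fs_pending = 0;

/* on an NLM_CANCEL from the client: */
if (block->b_fs_pending)
        vfs_cancel_lock(block->b_file->f_file,
                        &block->b_call->a_args.lock.fl);
nlmsvc_unlink_block(block);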

> My other thought is whether or not posix_unblock_lock() could be merged
> into vfs_cancel_lock() or whether there are really cases where that
> needs to be called without a cancellation having taken place,

I think that the filesystem should do the posix_unblock_lock call when
it (successfully?) processes a cancel request. After all, the fs is
already calling posix_lock_file when it successfully grants a lock. And
just as vfs_lock_file falls through to posix_lock_file when the fs
doesn't provide a lock function, so should vfs_cancel_lock fall through
to posix_unblock_lock in that situation.
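
In other words, something like this (just a sketch; as far as I can see
the current in-tree vfs_cancel_lock() simply returns 0 when the fs has
no ->lock() method):

int vfs_cancel_lock(struct file *filp, struct file_lock *fl)
{
        if (filp->f_op && filp->f_op->lock)
                return filp->f_op->lock(filp, F_CANCELLK, fl);
        /* no ->lock() op: fall through to the generic posix code,
         * mirroring what vfs_lock_file() does with posix_lock_file() */
        return posix_unblock_lock(filp, fl);
}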


Rob Gardner


2010-04-01 15:53:54

by J. Bruce Fields

Subject: Re: lockd and lock cancellation

On Thu, Apr 01, 2010 at 03:07:00PM +0100, Rob Gardner wrote:
> If the lock were actually granted, then unlocking in lieu of a cancel
> can potentially leave a range unlocked that should be left locked. This
> can happen in the case of a lock upgrade or a coalesce operation. For
> instance, suppose the client holds a lock on bytes 0-100, then issues
> another lock request for bytes 50-150, but sends a cancel just after the
> lock is actually granted. If you now simply unlock 50-150, then the
> client is left holding only 0-50, and has "lost" the lock on bytes
> 51-100. In other words, the client will *believe* that he has 0-100
> locked, but in reality, only 0-50 are locked.
>
> As for what to do in this situation... well it would be nice if the
> filesystem treated the cancel request as an "undo" if the grant won the
> race. But seriously, I think this is just one of the (many) flaws in the
> protocol and we probably have to live with it. My personal feeling is
> that it's safer to leave bytes locked rather than have a client believe
> it holds a lock when it doesn't.

I wrote some code to address this sort of problem a long time ago, and
didn't submit it:

http://git.linux-nfs.org/?p=bfields/linux-topics.git;a=shortlog;h=refs/heads/fair-queueing

The idea was to support correct cancelling by introducing a new type of
lock. Let's call it a "provisional lock". It behaves in every way like
a posix lock, *except* that it doesn't combine with (or downgrade)
existing locks. This allows a provisional lock to be cancelled--by just
removing it from the lock list--or to be "upgraded" to a normal posix
lock. (Note if you're looking at the code above, "provisional locks"
are identified with the FL_BLOCK flag.)
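
Schematically it reduces to something like this (a sketch of the
concept only, not the code from that branch; both helpers here are
made up):

/* a provisional lock conflicts like a posix lock but is never merged
 * with the owner's existing locks, so both outcomes stay simple */
if (cancelled) {
        /* nothing was combined or downgraded: just drop the entry */
        remove_provisional_lock(fl);            /* hypothetical */
} else {
        /* grant: clear the marker and only now let the lock merge
         * with (or downgrade) the owner's existing posix locks */
        fl->fl_flags &= ~FL_BLOCK;
        upgrade_to_posix_lock(fl);              /* hypothetical */
}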

We should probably look back at that idea at some point.

The original motivation was to implement "fair queuing" for NFSv4 locks.
(Since NFSv4 clients wait on blocking locks by polling the server, a
server is supposed to be willing to temporarily hold a lock on a polling
client's behalf. But since posix locks aren't really guaranteed to be
granted to waiters in any particular order, I don't know how much that
matters.)

A side-effect was to eliminate a potential thundering-herd problem by
allowing the lock code to immediately grant a single process a
provisional lock, wake up just that process, and allow it to upgrade the
provisional lock to a normal posix lock, instead of waking up all the
waiters. I was never able to figure out how to measure any benefit from
that, though!
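
That is, roughly (hypothetical helper names, not the actual fs/locks.c
code):

/* on unlock today: every blocked waiter is woken and they all
 * re-contend for the range */
wake_all_waiters(blocker);

/* with provisional locks: grant one waiter in place and wake only
 * that process; it upgrades the lock when it gets scheduled */
waiter = pick_one_waiter(blocker);              /* hypothetical */
make_provisional(waiter);                       /* hypothetical */
wake_up_one_waiter(waiter);                     /* hypothetical */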

You could push the same "provisional lock" idea down into GFS2, and I
suspect it would make it easier to avoid some races. But I haven't
thought that through.

--b.