2008-01-15 20:50:44

by NeilBrown

Subject: Re: [PATCH 1/2] NLM failover unlock commands

On Tuesday January 15, [email protected] wrote:
>
> I don't feel comfortable to change the existing code structure,
> especially a BUG() statement. It would be better to separate lock
> failover function away from lockd code clean-up. This is to make it
> easier for problem isolations (just in case).

Fair enough.

>
> On the other hand, if we view "ret" as a file count that tells us how
> many files fail to get unlocked, it would be great for debugging
> purpose. So the changes I made (currently in the middle of cluster
> testing) end up like this:
>
> 	if (likely(is_failover_file == NULL) ||
> 			is_failover_file(data, file)) {
> 		/*
> 		 * Note that nlm_inspect_file updates f_locks
> 		 * and ret is the number of files that can't
> 		 * be unlocked.
> 		 */
> 		ret += nlm_inspect_file(data, file, match);
> 	} else
> 		file->f_locks = nlm_file_inuse(file);

Looks good.

> >> mutex_lock(&nlm_file_mutex);
> >> file->f_count--;
> >> /* No more references to this file. Let go of it. */
> >> - if (list_empty(&file->f_blocks) && !file->f_locks
> >> + if (!file->f_locks && list_empty(&file->f_blocks)
> >>
> >
> > Is this change actually achieving something? or is it just noise?
> >
> Not really - but I thought checking for f_locks would be faster (tiny
> bit of optimization :))

You don't want to put the fastest operation first. You want the one
that is most likely to fail (as it is an '&&' where failure aborts the
sequence).
So you would need some argument (preferably with measurements) to say
which of the conditions will fail most often, before rearranging as an
optimisation.
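
A toy demonstration of that point (not from the kernel code; the function
names are invented): with '&&' evaluation stops at the first operand that
is false, so a cheap test that is almost always true short-circuits nothing
and the costly one still runs every time.

#include <stdio.h>

static int cheap_calls, costly_calls;

static int cheap_but_rarely_false(void) { cheap_calls++;  return 1; }
static int costly_but_often_false(void) { costly_calls++; return 0; }

int main(void)
{
	int i, taken = 0;

	for (i = 0; i < 1000; i++)
		if (cheap_but_rarely_false() && costly_but_often_false())
			taken++;

	/* prints cheap=1000 costly=1000 taken=0: the cheap test saved no work;
	 * putting the often-false test first would have skipped the other one */
	printf("cheap=%d costly=%d taken=%d\n", cheap_calls, costly_calls, taken);
	return 0;
}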

NeilBrown


2008-01-28 03:46:43

by Felix Blyakher

Subject: Re: Re: [PATCH 1/2] NLM failover unlock commands

Hi Bruce,

On Jan 24, 2008, at 10:00 AM, J. Bruce Fields wrote:

> On Thu, Jan 17, 2008 at 03:23:42PM -0500, J. Bruce Fields wrote:
>> To summarize a phone conversation from today:
>>
>> On Thu, Jan 17, 2008 at 01:07:02PM -0500, Wendy Cheng wrote:
>>> J. Bruce Fields wrote:
>>>> Would there be any advantage to enforcing that requirement in the
>>>> server? (For example, teaching nlm to reject any locking
>>>> request for a
>>>> certain filesystem that wasn't sent to a certain server IP.)
>>>>
>>>> --b.
>>>>
>>> It is doable... could be added into the "resume" patch that is
>>> currently
>>> being tested (since the logic is so similar to the per-ip base grace
>>> period) that should be out for review no later than next Monday.
>>>
>>> However, as any new code added into the system, there are trade-
>>> off(s).
>>> I'm not sure we want to keep enhancing this too much though.
>>
>> Sure. And I don't want to make this terribly complicated. The patch
>> looks good, and solves a clear problem. That said, there are a few
>> related problems we'd like to solve:
>>
>> - We want to be able to move an export to a node with an already
>> active nfs server. Currently that requires restarting all of
>> nfsd on the target node. This is what I understand your next
>> patch fixes.
>> - In the case of a filesystem that may be mounted from multiple
>> nodes at once, we need to make sure we're not leaving a window
>> allowing other applications to claim locks that nfs clients
>> haven't recovered yet.
>> - Ideally we'd like this to be possible without making the
>> filesystem block all lock requests during a 90-second grace
>> period; instead it should only have to block those requests
>> that conflict with to-be-recovered locks.
>> - All this should work for nfsv4, where we want to eventually
>> also allow migration of individual clients, and
>> client-initiated failover.
>>
>> I absolutely don't want to delay solving this particular problem
>> until
>> all the above is figured out, but I would like to be reasonably
>> confident that the new user-interface can be extended naturally to
>> handle the above cases; or at least that it won't unnecessarily
>> complicate their implementation.
>>
>> I'll try to sketch an implementation of most of the above in the next
>> week.
>
> Bah. Apologies, this is taking me longer than it should to figure
> out--I've only barely started writing patches.
>
> The basic idea, though:
>
> In practice, it seems that both the unlock_ip and unlock_pathname
> methods that revoke locks are going to be called together. The two
> separate calls therefore seem a little redundant. The reason we
> *need*
> both is that it's possible that a misconfigured client could grab
> locks
> for a (server ip, export) combination that it isn't supposed to.
>
> So it makes sense to me to restrict locking from the beginning to
> prevent that from happening. Therefore I'd like to add a call at the
> beginning like:
>
> echo "192.168.1.1 /exports/example" > /proc/fs/nfsd/start_grace
>
> before any exports are set up, which both starts a grace period, and
> tells nfs to allow locks on the filesystem /exports/example only if
> they're addressed to the server ip 192.168.1.1. Then on shutdown,
>
> echo "192.168.1.1" >/proc/fs/nfsd/unlock_ip
>
> should be sufficient to guarantee that nfsd/lockd no longer holds
> locks
> on /exports/example.
>
> (I think Wendy's pretty close to that api already after adding the
> second method to start grace?)
>
> The other advantage to having the server-ip from the start is that at
> the time we make lock requests to the cluster filesystem, we can
> tell it
> that the locks associated with 192.168.1.1 are special: they may
> migrate
> as a group to another node, and on node failure they should (if
> possible) be held to give a chance for another node to take them over.
>
> Internally I'd like to have an object like
>
> struct lock_manager {
> char *lm_name;
> ...
> }
>
> for each server ip address. A pointer to this structure would be
> passed
> with each lock request, allowing the filesystem to associate locks to
> lock_manager's. The name would be a string derived from the server ip
> address that the cluster can compare to match reclaim requests with
> the
> locks that they're reclaiming from another node.
>
> (And in the NFSv4 case we would eventually also allow lock_managers
> with
> single nfsv4 client (as opposed to server-ip) granularity.)
>
> Does that seem sane?

It does. Though, I'd like to elaborate on the effect of this change on the
disk filesystem, and in particular on a cluster filesystem.
I know I'm jumping ahead, but I'd like to make sure that it's all
going to work well with cluster filesystems.

As part of processing an "unlock by ip" request (from the above example of
writing into /proc/fs/nfsd/unlock_ip), nfsd would call the underlying
filesystem. In a cluster filesystem we really can't just delete the locks,
as the filesystem is still available and accessible from other nodes in
the cluster. We need to protect the nfs clients' locks until they're
reclaimed by the new nfs server.

Bruce mentioned in another mail that communication to the underlying
filesystem would be through the lock_manager calls:

On Jan 24, 2008, at 10:39 AM, J. Bruce Fields wrote:
> In the case of a cluster filesystem, what I hope we end up with is an
> api with calls to the filesystem like:
>
> lock_manager_start(lock_manager, super_block);
> lock_manager_end_grace(lock_manager, super_block);
> lock_manager_shutdown(lock_manager, super_block);

Would that be part of lock_manager_ops, which any filesystem can
implement, with the generic ops handling the case of two servers
mounting a single ext3 filesystem one at a time?

Back to the cluster filesystem: lock_manager_shutdown would be called as a
result of an unlock_ip request. That should trigger a "grace period" in the
cluster filesystem, such that none of the locks associated with the
lock_manager actually get unlocked; they are instead marked for reclaim.
That period lasts until the new nfs server comes up, or until some
predefined (or tunable) timeout.
The new nfs server will call lock_manager_start() to mark the start of
its new grace period, during which the nfs clients reclaim their
locks. At the end of the grace period lock_manager_end_grace() is
called, signaling the filesystem that the grace period is over and any
remaining unreclaimed locks can be cleaned up.
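
To make the ordering concrete, here is a minimal user-space sketch of that
sequence. The lock_manager_* names are the ones Bruce quoted above; the
types, bodies and printouts are invented here purely to show who calls what
and when - this is not an existing kernel interface.

#include <stdio.h>

struct lock_manager { const char *lm_name; };  /* e.g. derived from "192.168.1.1" */
struct super_block  { const char *fs_name; };  /* stand-in for the exported fs */

/* old node, on the unlock_ip downcall: park the locks instead of dropping them */
static void lock_manager_shutdown(struct lock_manager *lm, struct super_block *sb)
{
	printf("%s on %s: locks marked for reclaim, not dropped\n",
	       lm->lm_name, sb->fs_name);
}

/* new node, on takeover: start a grace period during which clients reclaim */
static void lock_manager_start(struct lock_manager *lm, struct super_block *sb)
{
	printf("%s on %s: grace period starts, reclaims accepted\n",
	       lm->lm_name, sb->fs_name);
}

/* new node, when the grace period ends (all reclaims done, or timeout) */
static void lock_manager_end_grace(struct lock_manager *lm, struct super_block *sb)
{
	printf("%s on %s: grace over, unreclaimed locks may be cleaned up\n",
	       lm->lm_name, sb->fs_name);
}

int main(void)
{
	struct super_block gfs = { "/mnt/gfs" };
	struct lock_manager lm = { "192.168.1.1" };

	lock_manager_shutdown(&lm, &gfs);   /* server A, as part of unlock_ip  */
	lock_manager_start(&lm, &gfs);      /* server B, taking over the IP    */
	lock_manager_end_grace(&lm, &gfs);  /* server B, after reclaim/timeout */
	return 0;
}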

Seems sane to me. Is that how you envision nfsd interaction with the
underlying (clustered) filesystem?

The next step would be to think about recovery of the nfs server on another
node. The key difference is that there is no controlled way to
shut down the filesystem and release the file locks. Though, that will
be the subject of another topic.

Cheers,
Felix

2008-01-28 15:56:31

by Wendy Cheng

Subject: Re: Re: [PATCH 1/2] NLM failover unlock commands

Felix Blyakher wrote:
>>
>> (I think Wendy's pretty close to that api already after adding the
>> second method to start grace?)

For grace-period reclaim issues, maybe we should move to the
https://www.redhat.com/archives/cluster-devel/2008-January/msg00340.html
thread?

I view this (unlock) patch set as mostly done and really don't want to
add more complications to it.

-- Wendy

2008-01-28 17:06:11

by Felix Blyakher

Subject: Re: [Cluster-devel] Re: [PATCH 1/2] NLM failover unlock commands


On Jan 28, 2008, at 9:56 AM, Wendy Cheng wrote:

> Felix Blyakher wrote:
>>>
>>> (I think Wendy's pretty close to that api already after adding the
>>> second method to start grace?)
>
> For grace-period reclaim issues, maybe we should move to the
> https://www.redhat.com/archives/cluster-devel/2008-January/
> msg00340.html thread?
>
> I view this (unlock) patch set as mostly done and really don't want
> to add more complications to it.

Heh, that was rather a reply to Bruce's forward-looking discussion
on related issues. Incidentally, it was attached to your unlock
patch thread.

No problem with your patch, Wendy.

Cheers,
Felix


2008-01-24 21:49:53

by Wendy Cheng

Subject: Re: [PATCH 1/2] NLM failover unlock commands

J. Bruce Fields wrote:
> On Thu, Jan 24, 2008 at 04:06:49PM -0500, Wendy Cheng wrote:
>
>> J. Bruce Fields wrote:
>>
>>> On Thu, Jan 24, 2008 at 02:45:37PM -0500, Wendy Cheng wrote:
>>>
>>>
>>>> J. Bruce Fields wrote:
>>>>
>>>>
>>>>> In practice, it seems that both the unlock_ip and unlock_pathname
>>>>> methods that revoke locks are going to be called together. The two
>>>>> separate calls therefore seem a little redundant. The reason we *need*
>>>>> both is that it's possible that a misconfigured client could grab locks
>>>>> for a (server ip, export) combination that it isn't supposed to.
>>>>>
>>>>>
>>>> That is not a correct assumption. The two commands (unlock_ip and
>>>> unlock_pathname) are not necessarily called together. It is ok for
>>>> local filesystem (ext3) but not for cluster filesystem where the
>>>> very same filesystem (or subtree) can be exported from multiple
>>>> servers using different subtrees.
>>>>
>>>>
>>> Ouch. Are people really doing that, and why? What happens if the
>>> subtrees share files (because of hard links) that are locked from both
>>> nodes?
>>>
>>>
>> It is *more* common than you would expect - say server1 exports
>> "/mnt/gfs/maildir/namea-j" and server2 exports
>> "/mnt/gfs/maildir/namek-z".
>>
>
> I believe it, but how hard would it be for them to just set those up as
> separate partitions?
>
>
The answer that I normally hear is "then why would I bother to use
a cluster filesystem?" .... Wendy

2008-01-24 16:00:30

by J. Bruce Fields

Subject: Re: [PATCH 1/2] NLM failover unlock commands

On Thu, Jan 17, 2008 at 03:23:42PM -0500, J. Bruce Fields wrote:
> To summarize a phone conversation from today:
>
> On Thu, Jan 17, 2008 at 01:07:02PM -0500, Wendy Cheng wrote:
> > J. Bruce Fields wrote:
> >> Would there be any advantage to enforcing that requirement in the
> >> server? (For example, teaching nlm to reject any locking request for a
> >> certain filesystem that wasn't sent to a certain server IP.)
> >>
> >> --b.
> >>
> > It is doable... could be added into the "resume" patch that is currently
> > being tested (since the logic is so similar to the per-ip base grace
> > period) that should be out for review no later than next Monday.
> >
> > However, as any new code added into the system, there are trade-off(s).
> > I'm not sure we want to keep enhancing this too much though.
>
> Sure. And I don't want to make this terribly complicated. The patch
> looks good, and solves a clear problem. That said, there are a few
> related problems we'd like to solve:
>
> - We want to be able to move an export to a node with an already
> active nfs server. Currently that requires restarting all of
> nfsd on the target node. This is what I understand your next
> patch fixes.
> - In the case of a filesystem that may be mounted from multiple
> nodes at once, we need to make sure we're not leaving a window
> allowing other applications to claim locks that nfs clients
> haven't recovered yet.
> - Ideally we'd like this to be possible without making the
> filesystem block all lock requests during a 90-second grace
> period; instead it should only have to block those requests
> that conflict with to-be-recovered locks.
> - All this should work for nfsv4, where we want to eventually
> also allow migration of individual clients, and
> client-initiated failover.
>
> I absolutely don't want to delay solving this particular problem until
> all the above is figured out, but I would like to be reasonably
> confident that the new user-interface can be extended naturally to
> handle the above cases; or at least that it won't unnecessarily
> complicate their implementation.
>
> I'll try to sketch an implementation of most of the above in the next
> week.

Bah. Apologies, this is taking me longer than it should to figure
out--I've only barely started writing patches.

The basic idea, though:

In practice, it seems that both the unlock_ip and unlock_pathname
methods that revoke locks are going to be called together. The two
separate calls therefore seem a little redundant. The reason we *need*
both is that it's possible that a misconfigured client could grab locks
for a (server ip, export) combination that it isn't supposed to.

So it makes sense to me to restrict locking from the beginning to
prevent that from happening. Therefore I'd like to add a call at the
beginning like:

echo "192.168.1.1 /exports/example" > /proc/fs/nfsd/start_grace

before any exports are set up, which both starts a grace period, and
tells nfs to allow locks on the filesystem /exports/example only if
they're addressed to the server ip 192.168.1.1. Then on shutdown,

echo "192.168.1.1" >/proc/fs/nfsd/unlock_ip

should be sufficient to guarantee that nfsd/lockd no longer holds locks
on /exports/example.

(I think Wendy's pretty close to that api already after adding the
second method to start grace?)

The other advantage to having the server-ip from the start is that at
the time we make lock requests to the cluster filesystem, we can tell it
that the locks associated with 192.168.1.1 are special: they may migrate
as a group to another node, and on node failure they should (if
possible) be held to give a chance for another node to take them over.

Internally I'd like to have an object like

struct lock_manager {
	char *lm_name;
	...
}

for each server ip address. A pointer to this structure would be passed
with each lock request, allowing the filesystem to associate locks to
lock_manager's. The name would be a string derived from the server ip
address that the cluster can compare to match reclaim requests with the
locks that they're reclaiming from another node.
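
As a rough illustration of that grouping (a sketch only - the field and
function names below are made up, not taken from the patch): each lock
request carries a pointer to its lock_manager, so the filesystem can keep
one server IP's locks together and compare the manager's name string when
a reclaim arrives from another node.

#include <stdio.h>
#include <string.h>

struct lock_manager {
	char lm_name[32];     /* string derived from the server IP, e.g. "192.168.1.1" */
	int  lm_lock_count;   /* locks the filesystem currently attributes to it */
};

struct lock_request {
	struct lock_manager *lr_owner;   /* passed down with every lock request */
	long long lr_start, lr_end;
};

/* the filesystem can now group locks per lock_manager rather than per node */
static void fs_record_lock(struct lock_request *req)
{
	req->lr_owner->lm_lock_count++;
}

/* does a reclaim tagged with 'name' refer to the locks this manager held? */
static int fs_reclaim_matches(const struct lock_manager *lm, const char *name)
{
	return strcmp(lm->lm_name, name) == 0;
}

int main(void)
{
	struct lock_manager lm = { "192.168.1.1", 0 };
	struct lock_request r1 = { &lm, 0, 4095 }, r2 = { &lm, 8192, 12287 };

	fs_record_lock(&r1);
	fs_record_lock(&r2);
	printf("%s holds %d locks; reclaim match: %d\n",
	       lm.lm_name, lm.lm_lock_count, fs_reclaim_matches(&lm, "192.168.1.1"));
	return 0;
}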

(And in the NFSv4 case we would eventually also allow lock_managers with
single nfsv4 client (as opposed to server-ip) granularity.)

Does that seem sane?

But it's taking me longer than I'd like to get patches that implement
this. Hopefully by next week I can get working code together for people
to look at....

--b.

2008-01-24 16:20:16

by Peter Staubach

Subject: Re: [PATCH 1/2] NLM failover unlock commands

J. Bruce Fields wrote:
> On Thu, Jan 17, 2008 at 03:23:42PM -0500, J. Bruce Fields wrote:
>
>> To summarize a phone conversation from today:
>>
>> On Thu, Jan 17, 2008 at 01:07:02PM -0500, Wendy Cheng wrote:
>>
>>> J. Bruce Fields wrote:
>>>
>>>> Would there be any advantage to enforcing that requirement in the
>>>> server? (For example, teaching nlm to reject any locking request for a
>>>> certain filesystem that wasn't sent to a certain server IP.)
>>>>
>>>> --b.
>>>>
>>>>
>>> It is doable... could be added into the "resume" patch that is currently
>>> being tested (since the logic is so similar to the per-ip base grace
>>> period) that should be out for review no later than next Monday.
>>>
>>> However, as any new code added into the system, there are trade-off(s).
>>> I'm not sure we want to keep enhancing this too much though.
>>>
>> Sure. And I don't want to make this terribly complicated. The patch
>> looks good, and solves a clear problem. That said, there are a few
>> related problems we'd like to solve:
>>
>> - We want to be able to move an export to a node with an already
>> active nfs server. Currently that requires restarting all of
>> nfsd on the target node. This is what I understand your next
>> patch fixes.
>> - In the case of a filesystem that may be mounted from multiple
>> nodes at once, we need to make sure we're not leaving a window
>> allowing other applications to claim locks that nfs clients
>> haven't recovered yet.
>> - Ideally we'd like this to be possible without making the
>> filesystem block all lock requests during a 90-second grace
>> period; instead it should only have to block those requests
>> that conflict with to-be-recovered locks.
>> - All this should work for nfsv4, where we want to eventually
>> also allow migration of individual clients, and
>> client-initiated failover.
>>
>> I absolutely don't want to delay solving this particular problem until
>> all the above is figured out, but I would like to be reasonably
>> confident that the new user-interface can be extended naturally to
>> handle the above cases; or at least that it won't unnecessarily
>> complicate their implementation.
>>
>> I'll try to sketch an implementation of most of the above in the next
>> week.
>>
>
> Bah. Apologies, this is taking me longer than it should to figure
> out--I've only barely started writing patches.
>
> The basic idea, though:
>
> In practice, it seems that both the unlock_ip and unlock_pathname
> methods that revoke locks are going to be called together. The two
> separate calls therefore seem a little redundant. The reason we *need*
> both is that it's possible that a misconfigured client could grab locks
> for a (server ip, export) combination that it isn't supposed to.
>
> So it makes sense to me to restrict locking from the beginning to
> prevent that from happening. Therefore I'd like to add a call at the
> beginning like:
>
> echo "192.168.1.1 /exports/example" > /proc/fs/nfsd/start_grace
>
> before any exports are set up, which both starts a grace period, and
> tells nfs to allow locks on the filesystem /exports/example only if
> they're addressed to the server ip 192.168.1.1. Then on shutdown,
>
> echo "192.168.1.1" >/proc/fs/nfsd/unlock_ip
>
> should be sufficient to guarantee that nfsd/lockd no longer holds locks
> on /exports/example.
>
> (I think Wendy's pretty close to that api already after adding the
> second method to start grace?)
>
> The other advantage to having the server-ip from the start is that at
> the time we make lock requests to the cluster filesystem, we can tell it
> that the locks associated with 192.168.1.1 are special: they may migrate
> as a group to another node, and on node failure they should (if
> possible) be held to give a chance for another node to take them over.
>
> Internally I'd like to have an object like
>
> struct lock_manager {
> char *lm_name;
> ...
> }
>
> for each server ip address. A pointer to this structure would be passed
> with each lock request, allowing the filesystem to associate locks to
> lock_manager's. The name would be a string derived from the server ip
> address that the cluster can compare to match reclaim requests with the
> locks that they're reclaiming from another node.
>
> (And in the NFSv4 case we would eventually also allow lock_managers with
> single nfsv4 client (as opposed to server-ip) granularity.)
>
> Does that seem sane?
>
> But it's taking me longer than I'd like to get patches that implement
> this. Hopefully by next week I can get working code together for people
> to look at....

This seems somewhat less than scalable and not particularly
generally useful except in this specific sort of situation
to me. The ratios between servers and clients tend to be
one to many, where many can be not uncommonly 4 digits.
What happens if the cluster is 1000+ nodes? This strikes me
as tough on a systems administrator.

It seems to me that it would be nice to be able to solve this
problem someplace else like a cluster lock manager and not
clutter up the NFS lock manager.

Perhaps the cluster folks could explain why this problem
can't be solved there? Is this another case of attempting
to create a cluster concept without actually doing all of
the work to make the cluster appear to be a single system?

Or am I misunderstanding the situation here?

Thanx...

ps

2008-01-24 16:39:24

by J. Bruce Fields

Subject: Re: [PATCH 1/2] NLM failover unlock commands

On Thu, Jan 24, 2008 at 11:19:58AM -0500, Peter Staubach wrote:
> J. Bruce Fields wrote:
>> On Thu, Jan 17, 2008 at 03:23:42PM -0500, J. Bruce Fields wrote:
>>
>>> To summarize a phone conversation from today:
>>>
>>> On Thu, Jan 17, 2008 at 01:07:02PM -0500, Wendy Cheng wrote:
>>>
>>>> J. Bruce Fields wrote:
>>>>
>>>>> Would there be any advantage to enforcing that requirement in the
>>>>> server? (For example, teaching nlm to reject any locking request for a
>>>>> certain filesystem that wasn't sent to a certain server IP.)
>>>>>
>>>>> --b.
>>>>>
>>>> It is doable... could be added into the "resume" patch that is
>>>> currently being tested (since the logic is so similar to the
>>>> per-ip base grace period) that should be out for review no later
>>>> than next Monday.
>>>>
>>>> However, as any new code added into the system, there are
>>>> trade-off(s). I'm not sure we want to keep enhancing this too much
>>>> though.
>>>>
>>> Sure. And I don't want to make this terribly complicated. The patch
>>> looks good, and solves a clear problem. That said, there are a few
>>> related problems we'd like to solve:
>>>
>>> - We want to be able to move an export to a node with an already
>>> active nfs server. Currently that requires restarting all of
>>> nfsd on the target node. This is what I understand your next
>>> patch fixes.
>>> - In the case of a filesystem that may be mounted from multiple
>>> nodes at once, we need to make sure we're not leaving a window
>>> allowing other applications to claim locks that nfs clients
>>> haven't recovered yet.
>>> - Ideally we'd like this to be possible without making the
>>> filesystem block all lock requests during a 90-second grace
>>> period; instead it should only have to block those requests
>>> that conflict with to-be-recovered locks.
>>> - All this should work for nfsv4, where we want to eventually
>>> also allow migration of individual clients, and
>>> client-initiated failover.
>>>
>>> I absolutely don't want to delay solving this particular problem until
>>> all the above is figured out, but I would like to be reasonably
>>> confident that the new user-interface can be extended naturally to
>>> handle the above cases; or at least that it won't unnecessarily
>>> complicate their implementation.
>>>
>>> I'll try to sketch an implementation of most of the above in the next
>>> week.
>>>
>>
>> Bah. Apologies, this is taking me longer than it should to figure
>> out--I've only barely started writing patches.
>>
>> The basic idea, though:
>>
>> In practice, it seems that both the unlock_ip and unlock_pathname
>> methods that revoke locks are going to be called together. The two
>> separate calls therefore seem a little redundant. The reason we *need*
>> both is that it's possible that a misconfigured client could grab locks
>> for a (server ip, export) combination that it isn't supposed to.
>>
>> So it makes sense to me to restrict locking from the beginning to
>> prevent that from happening. Therefore I'd like to add a call at the
>> beginning like:
>>
>> echo "192.168.1.1 /exports/example" > /proc/fs/nfsd/start_grace
>>
>> before any exports are set up, which both starts a grace period, and
>> tells nfs to allow locks on the filesystem /exports/example only if
>> they're addressed to the server ip 192.168.1.1. Then on shutdown,
>>
>> echo "192.168.1.1" >/proc/fs/nfsd/unlock_ip
>>
>> should be sufficient to guarantee that nfsd/lockd no longer holds locks
>> on /exports/example.
>>
>> (I think Wendy's pretty close to that api already after adding the
>> second method to start grace?)
>>
>> The other advantage to having the server-ip from the start is that at
>> the time we make lock requests to the cluster filesystem, we can tell it
>> that the locks associated with 192.168.1.1 are special: they may migrate
>> as a group to another node, and on node failure they should (if
>> possible) be held to give a chance for another node to take them over.
>>
>> Internally I'd like to have an object like
>>
>> struct lock_manager {
>> char *lm_name;
>> ...
>> }
>>
>> for each server ip address. A pointer to this structure would be passed
>> with each lock request, allowing the filesystem to associate locks to
>> lock_manager's. The name would be a string derived from the server ip
>> address that the cluster can compare to match reclaim requests with the
>> locks that they're reclaiming from another node.
>>
>> (And in the NFSv4 case we would eventually also allow lock_managers with
>> single nfsv4 client (as opposed to server-ip) granularity.)
>>
>> Does that seem sane?
>>
>> But it's taking me longer than I'd like to get patches that implement
>> this. Hopefully by next week I can get working code together for people
>> to look at....
>
> This seems somewhat less than scalable and not particularly
> generally useful except in this specific sort of situation
> to me. The ratios between servers and clients tend to be
> one to many, where many can be not uncommonly 4 digits.
> What happens if the cluster is 1000+ nodes? This strikes me
> as tough on a systems administrator.

I think that may turn out to be a valid objection to the notion of allowing
migration of individual v4 clients. Does it really apply to the case of
migration by server ip addresses?

Maybe I'm not understanding the exact scalability problem you're
thinking of.

> It seems to me that it would be nice to be able to solve this
> problem someplace else like a cluster lock manager and not
> clutter up the NFS lock manager.

All we're doing in lockd/nfsd is associating individual lock requests with
"lock managers"--sets of locks that may migrate as a group--and making it
possible to individually shut down and start up lock managers.

In the case of a cluster filesystem, what I hope we end up with is an
api with calls to the filesystem like:

lock_manager_start(lock_manager, super_block);
lock_manager_end_grace(lock_manager, super_block);
lock_manager_shutdown(lock_manager, super_block);

that inform the filesystem of the status of each lock manager. If we pass lock
requests (including reclaims) to the filesystem as well, then it can, if it
wants, get complete control over the lock recovery process--so, for example, if
it knows which locks a dead node held, it may decide to allow normal locking
that doesn't conflict with those locks during the grace period. Or it may
choose to do something simpler.
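
A toy sketch of that "block only the conflicting requests" idea - nothing
here is an existing interface, and the ranges are invented just to show the
decision a cluster filesystem could make during another node's grace period:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct lock_range { long long start, end; };

/* locks known to have been held by the dead node, still awaiting reclaim */
static const struct lock_range pending_reclaim[] = {
	{ 0, 4095 }, { 8192, 12287 },
};

static bool overlaps(const struct lock_range *a, const struct lock_range *b)
{
	return a->start <= b->end && b->start <= a->end;
}

/* during the grace period, grant a new (non-reclaim) request only if it
 * cannot collide with a lock that may still be reclaimed */
static bool may_grant_during_grace(const struct lock_range *req)
{
	size_t i;

	for (i = 0; i < sizeof(pending_reclaim) / sizeof(pending_reclaim[0]); i++)
		if (overlaps(req, &pending_reclaim[i]))
			return false;
	return true;
}

int main(void)
{
	struct lock_range safe = { 5000, 6000 }, risky = { 100, 200 };

	/* prints "safe: 1, risky: 0" */
	printf("safe: %d, risky: %d\n",
	       may_grant_during_grace(&safe), may_grant_during_grace(&risky));
	return 0;
}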

So I *think* this allows us to do what you want--it adds the minimal
infrastructure to lockd required to allow the cluster's lock manager to do the
real work.

(Though we'd also want a simple default implementation in the vfs which handles
simpler cases--like two servers that mount a single ext3 filesystem one at a
time from a shared block device.)

--b.

>
> Perhaps the cluster folks could explain why this problem
> can't be solved there? Is this another case of attempting
> to create a cluster concept without actually doing all of
> the work to make the cluster appear to be a single system?
>
> Or am I misunderstanding the situation here?

2008-01-24 19:45:37

by Wendy Cheng

Subject: Re: [PATCH 1/2] NLM failover unlock commands

J. Bruce Fields wrote:
> On Thu, Jan 17, 2008 at 03:23:42PM -0500, J. Bruce Fields wrote:
>
>> To summarize a phone conversation from today:
>>
>> On Thu, Jan 17, 2008 at 01:07:02PM -0500, Wendy Cheng wrote:
>>
>>> J. Bruce Fields wrote:
>>>
>>>> Would there be any advantage to enforcing that requirement in the
>>>> server? (For example, teaching nlm to reject any locking request for a
>>>> certain filesystem that wasn't sent to a certain server IP.)
>>>>
>>>> --b.
>>>>
>>>>
>>> It is doable... could be added into the "resume" patch that is currently
>>> being tested (since the logic is so similar to the per-ip base grace
>>> period) that should be out for review no later than next Monday.
>>>
>>> However, as any new code added into the system, there are trade-off(s).
>>> I'm not sure we want to keep enhancing this too much though.
>>>
>> Sure. And I don't want to make this terribly complicated. The patch
>> looks good, and solves a clear problem. That said, there are a few
>> related problems we'd like to solve:
>>
>> - We want to be able to move an export to a node with an already
>> active nfs server. Currently that requires restarting all of
>> nfsd on the target node. This is what I understand your next
>> patch fixes.
>> - In the case of a filesystem that may be mounted from multiple
>> nodes at once, we need to make sure we're not leaving a window
>> allowing other applications to claim locks that nfs clients
>> haven't recovered yet.
>> - Ideally we'd like this to be possible without making the
>> filesystem block all lock requests during a 90-second grace
>> period; instead it should only have to block those requests
>> that conflict with to-be-recovered locks.
>> - All this should work for nfsv4, where we want to eventually
>> also allow migration of individual clients, and
>> client-initiated failover.
>>
>> I absolutely don't want to delay solving this particular problem until
>> all the above is figured out, but I would like to be reasonably
>> confident that the new user-interface can be extended naturally to
>> handle the above cases; or at least that it won't unnecessarily
>> complicate their implementation.
>>
>> I'll try to sketch an implementation of most of the above in the next
>> week.
>>
>
> Bah. Apologies, this is taking me longer than it should to figure
> out--I've only barely started writing patches.
>
> The basic idea, though:
>
> In practice, it seems that both the unlock_ip and unlock_pathname
> methods that revoke locks are going to be called together. The two
> separate calls therefore seem a little redundant. The reason we *need*
> both is that it's possible that a misconfigured client could grab locks
> for a (server ip, export) combination that it isn't supposed to.
>

That is not a correct assumption. The two commands (unlock_ip and
unlock_pathname) are not necessarily called together. It is ok for local
filesystem (ext3) but not for cluster filesystem where the very same
filesystem (or subtree) can be exported from multiple servers using
different subtrees. Also as we discussed before, it is
"unlock_filesystem", *not* "unlock_pathname" (this implies sub-tree
exports) due to implementation difficulties (see the "Implementation
Notes" from http://people.redhat.com/wcheng/Patches/NFS/NLM/004.txt).
> So it makes sense to me to restrict locking from the beginning to
> prevent that from happening. Therefore I'd like to add a call at the
> beginning like:
>
> echo "192.168.1.1 /exports/example" > /proc/fs/nfsd/start_grace
>

My second patch set is about to be sent out (doing text description at
this moment .. sorry for the delay).

> before any exports are set up, which both starts a grace period, and
> tells nfs to allow locks on the filesystem /exports/example only if
> they're addressed to the server ip 192.168.1.1. Then on shutdown,
>
> echo "192.168.1.1" >/proc/fs/nfsd/unlock_ip
>
> should be sufficient to guarantee that nfsd/lockd no longer holds locks
> on /exports/example.
>
> (I think Wendy's pretty close to that api already after adding the
> second method to start grace?)
>
> The other advantage to having the server-ip from the start is that at
> the time we make lock requests to the cluster filesystem, we can tell it
> that the locks associated with 192.168.1.1 are special: they may migrate
> as a group to another node, and on node failure they should (if
> possible) be held to give a chance for another node to take them over.
>
> Internally I'd like to have an object like
>
> struct lock_manager {
> char *lm_name;
> ...
> }
>
> for each server ip address. A pointer to this structure would be passed
> with each lock request, allowing the filesystem to associate locks to
> lock_manager's. The name would be a string derived from the server ip
> address that the cluster can compare to match reclaim requests with the
> locks that they're reclaiming from another node.
>

I still don't have a warm feeling about adding this (at this moment) -
somehow feel we over-engineer the lock failover issues. Remember lock
failover is just a small portion of the general NFS server failover (for
both HA and load balancing purpose) issues. Maybe we should have
something simple and usable for 2.6 kernel before adding this type of
complication ?
> (And in the NFSv4 case we would eventually also allow lock_managers with
> single nfsv4 client (as opposed to server-ip) granularity.)
>
> Does that seem sane?
>
> But it's taking me longer than I'd like to get patches that implement
> this. Hopefully by next week I can get working code together for people
> to look at....
>
> --b.
>

2008-01-24 20:19:15

by J. Bruce Fields

Subject: Re: [PATCH 1/2] NLM failover unlock commands

On Thu, Jan 24, 2008 at 02:45:37PM -0500, Wendy Cheng wrote:
> J. Bruce Fields wrote:
>> In practice, it seems that both the unlock_ip and unlock_pathname
>> methods that revoke locks are going to be called together. The two
>> separate calls therefore seem a little redundant. The reason we *need*
>> both is that it's possible that a misconfigured client could grab locks
>> for a (server ip, export) combination that it isn't supposed to.
>>
>
> That is not a correct assumption. The two commands (unlock_ip and
> unlock_pathname) are not necessarily called together. It is ok for local
> filesystem (ext3) but not for cluster filesystem where the very same
> filesystem (or subtree) can be exported from multiple servers using
> different subtrees.

Ouch. Are people really doing that, and why? What happens if the
subtrees share files (because of hard links) that are locked from both
nodes?

> Also as we discussed before, it is
> "unlock_filesystem", *not* "unlock_pathname" (this implies sub-tree
> exports) due to implementation difficulties (see the "Implementation
> Notes" from http://people.redhat.com/wcheng/Patches/NFS/NLM/004.txt).

Unless I misread the latest patch, it's actually matching on the
vfsmount, right?

I guess that means we *could* handle the above situation by doing a

mount --bind /path/to/export/point /path/to/export/point

on each export, at which point there will be a separate vfsmount for
each export point?

But I don't think that's what we really want. The goal is to ensure
that the nfs server holds no locks on a disk filesystem so we can
unmount it completely from this machine and mount it elsewhere. So we
should really be removing all locks for the superblock, not just for a
particular mount of that superblock. Otherwise we'll have odd problems
if someone happens to do the unlock_filesystem downcall from a different
namespace or something.
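
For comparison, a hedged sketch of what a superblock-based match might look
like, modeled on nlmsvc_failover_file_ok_path() from the patch quoted later
in this thread (the field chain assumes that era's struct file/vfsmount
layout; this is an illustration, not a tested change):

static int
nlmsvc_failover_file_ok_sb(void *datap, struct nlm_file *file)
{
	struct super_block *sb = datap;

	/* match every mount of the superblock, not one particular vfsmount */
	return file->f_file->f_path.mnt->mnt_sb == sb;
}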

>> So it makes sense to me to restrict locking from the beginning to
>> prevent that from happening. Therefore I'd like to add a call at the
>> beginning like:
>>
>> echo "192.168.1.1 /exports/example" > /proc/fs/nfsd/start_grace
>>
>
> My second patch set is about to be sent out (doing text description at
> this moment .. sorry for the delay).

Good, thanks.

>> before any exports are set up, which both starts a grace period, and
>> tells nfs to allow locks on the filesystem /exports/example only if
>> they're addressed to the server ip 192.168.1.1. Then on shutdown,
>>
>> echo "192.168.1.1" >/proc/fs/nfsd/unlock_ip
>>
>> should be sufficient to guarantee that nfsd/lockd no longer holds locks
>> on /exports/example.
>>
>> (I think Wendy's pretty close to that api already after adding the
>> second method to start grace?)
>>
>> The other advantage to having the server-ip from the start is that at
>> the time we make lock requests to the cluster filesystem, we can tell it
>> that the locks associated with 192.168.1.1 are special: they may migrate
>> as a group to another node, and on node failure they should (if
>> possible) be held to give a chance for another node to take them over.
>>
>> Internally I'd like to have an object like
>>
>> struct lock_manager {
>> char *lm_name;
>> ...
>> }
>>
>> for each server ip address. A pointer to this structure would be passed
>> with each lock request, allowing the filesystem to associate locks to
>> lock_manager's. The name would be a string derived from the server ip
>> address that the cluster can compare to match reclaim requests with the
>> locks that they're reclaiming from another node.
>>
>
> I still don't have a warm feeling about adding this (at this moment) -
> somehow feel we over-engineer the lock failover issues.

I agree that that's a risk.

> Remember lock failover is just a small portion of the general NFS
> server failover (for both HA and load balancing purpose) issues.
> Maybe we should have something simple and usable for 2.6 kernel
> before adding this type of complication ?

Yeah. We should aim to include basic failover functionality in 2.6.26,
one way or another--I think that dealing with the other issues I'm
worried about won't actually be a great deal more complicated, but if
that doesn't pan out then fine. I would like to at least make sure this
is all working for nfsv4 as well, though. (Currently only locks held by
v2/v3 clients are dropped.)

--b.

2008-01-24 21:06:49

by Wendy Cheng

Subject: Re: [PATCH 1/2] NLM failover unlock commands

J. Bruce Fields wrote:
> On Thu, Jan 24, 2008 at 02:45:37PM -0500, Wendy Cheng wrote:
>
>> J. Bruce Fields wrote:
>>
>>> In practice, it seems that both the unlock_ip and unlock_pathname
>>> methods that revoke locks are going to be called together. The two
>>> separate calls therefore seem a little redundant. The reason we *need*
>>> both is that it's possible that a misconfigured client could grab locks
>>> for a (server ip, export) combination that it isn't supposed to.
>>>
>>>
>> That is not a correct assumption. The two commands (unlock_ip and
>> unlock_pathname) are not necessarily called together. It is ok for local
>> filesystem (ext3) but not for cluster filesystem where the very same
>> filesystem (or subtree) can be exported from multiple servers using
>> different subtrees.
>>
>
> Ouch. Are people really doing that, and why? What happens if the
> subtrees share files (because of hard links) that are locked from both
> nodes?
>

It is *more* common than you would expect - say server1 exports
"/mnt/gfs/maildir/namea-j" and server2 exports "/mnt/gfs/maildir/namek-z".

>
>> Also as we discussed before, it is
>> "unlock_filesystem", *not* "unlock_pathname" (this implies sub-tree
>> exports) due to implementation difficulties (see the "Implementation
>> Notes" from http://people.redhat.com/wcheng/Patches/NFS/NLM/004.txt).
>>
>
> Unless I misread the latest patch, it's actually matching on the
> vfsmount, right?
>

Yes.

> I guess that means we *could* handle the above situation by doing a
>
> mount --bind /path/to/export/point /path/to/export/point
>
> on each export, at which point there will be a separate vfsmount for
> each export point?
>

Cluster configuration itself is already cumbersome and error-prone;
a requirement like this will not be well received. On the other hand,
force-unlocking a mount point is a *last* resort - since NFS clients using
another ip interface would lose contact with the server. We should
*not* consider "unlock_filesystem" a frequent event.

> But I don't think that's what we really want. The goal is to ensure
> that the nfs server holds no locks on a disk filesystem so we can
> unmount it completely from this machine and mount it elsewhere. So we
> should really be removing all locks for the superblock, not just for a
> particular mount of that superblock. Otherwise we'll have odd problems
> if someone happens to do the unlock_filesystem downcall from a different
> namespace or something.
>

Oh ... sorry .. didn't read this far... so we agree the "--bind" is not
a good idea :) ..

-- Wendy

2008-01-24 21:40:34

by J. Bruce Fields

Subject: Re: [PATCH 1/2] NLM failover unlock commands

On Thu, Jan 24, 2008 at 04:06:49PM -0500, Wendy Cheng wrote:
> J. Bruce Fields wrote:
>> On Thu, Jan 24, 2008 at 02:45:37PM -0500, Wendy Cheng wrote:
>>
>>> J. Bruce Fields wrote:
>>>
>>>> In practice, it seems that both the unlock_ip and unlock_pathname
>>>> methods that revoke locks are going to be called together. The two
>>>> separate calls therefore seem a little redundant. The reason we *need*
>>>> both is that it's possible that a misconfigured client could grab locks
>>>> for a (server ip, export) combination that it isn't supposed to.
>>>>
>>> That is not a correct assumption. The two commands (unlock_ip and
>>> unlock_pathname) are not necessarily called together. It is ok for
>>> local filesystem (ext3) but not for cluster filesystem where the
>>> very same filesystem (or subtree) can be exported from multiple
>>> servers using different subtrees.
>>>
>>
>> Ouch. Are people really doing that, and why? What happens if the
>> subtrees share files (because of hard links) that are locked from both
>> nodes?
>>
>
> It is *more* common than you would expect - say server1 exports
> "/mnt/gfs/maildir/namea-j" and server2 exports
> "/mnt/gfs/maildir/namek-z".

I believe it, but how hard would it be for them to just set those up as
separate partitions?

I'm really not fond of exports of subdirectories of filesystems, mainly
because I'm worried that many administrators don't understand the
security issue (which is that they probably are exposing the whole
filesystem when they export a subdirectory).

--b.

2008-01-15 20:56:53

by Wendy Cheng

Subject: Re: [PATCH 1/2] NLM failover unlock commands

Neil Brown wrote:
> On Tuesday January 15, [email protected] wrote:
>
>> I don't feel comfortable to change the existing code structure,
>> especially a BUG() statement. It would be better to separate lock
>> failover function away from lockd code clean-up. This is to make it
>> easier for problem isolations (just in case).
>>
>
> Fair enough.
>
>
>> On the other hand, if we view "ret" as a file count that tells us how
>> many files fail to get unlocked, it would be great for debugging
>> purpose. So the changes I made (currently in the middle of cluster
>> testing) end up like this:
>>
>> 	if (likely(is_failover_file == NULL) ||
>> 			is_failover_file(data, file)) {
>> 		/*
>> 		 * Note that nlm_inspect_file updates f_locks
>> 		 * and ret is the number of files that can't
>> 		 * be unlocked.
>> 		 */
>> 		ret += nlm_inspect_file(data, file, match);
>> 	} else
>> 		file->f_locks = nlm_file_inuse(file);
>>
>
> Looks good.
>
>
>>>> mutex_lock(&nlm_file_mutex);
>>>> file->f_count--;
>>>> /* No more references to this file. Let go of it. */
>>>> - if (list_empty(&file->f_blocks) && !file->f_locks
>>>> + if (!file->f_locks && list_empty(&file->f_blocks)
>>>>
>>>>
>>> Is this change actually achieving something? or is it just noise?
>>>
>>>
>> Not really - but I thought checking for f_locks would be faster (tiny
>> bit of optimization :))
>>
>
> You don't want to put the fastest operation first. You want the one
> that is most likely to fail (as it is an '&&' where failure aborts the
> sequence).
> So you would need some argument (preferably with measurements) to say
> which of the conditions will fail most often, before rearranging as an
> optimisation.
>
>
>
yep .. thinking about it more, my bad. Have reset it back ... Wendy

2008-01-15 22:48:00

by Wendy Cheng

Subject: Re: [PATCH 1/2] NLM failover unlock commands

Revised version of the patch:

* based on a comment from Neil Brown, use sscanf() to parse the IP address
(a cool idea imo; see the small sketch after this list).
* the "ret" inside nlm_traverse_files() is now the file count that can't
be unlocked.
* other minor changes from latest round of review.
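
A small user-space sketch of that sscanf() parsing (the actual in-kernel
version is in the patch quoted below; the range check here is an extra
illustration, not something the patch does):

#include <stdio.h>
#include <arpa/inet.h>

int main(void)
{
	const char *fo_path = "10.1.1.2";
	unsigned int b1, b2, b3, b4;
	char c;

	if (sscanf(fo_path, "%u.%u.%u.%u%c", &b1, &b2, &b3, &b4, &c) != 4 ||
	    b1 > 255 || b2 > 255 || b3 > 255 || b4 > 255) {
		fprintf(stderr, "not a dotted-quad IPv4 address\n");
		return 1;
	}

	/* same big-endian value the kernel patch stores in server_ip */
	printf("0x%08x\n",
	       (unsigned int)htonl((((((b1 << 8) | b2) << 8) | b3) << 8) | b4));
	return 0;
}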

Thanks to all for the review !

-- Wendy


Attachments:
unlock_v3.patch (8.14 kB)

2008-01-17 15:10:13

by J. Bruce Fields

Subject: Re: [PATCH 1/2] NLM failover unlock commands

On Tue, Jan 15, 2008 at 05:48:00PM -0500, Wendy Cheng wrote:
> Revised version of the patch:
>
> * based on comment from Neil Brown, use sscanf() to parse IP address (a
> cool idea imo).
> * the "ret" inside nlm_traverse_files() is now the file count that can't
> be unlocked.
> * other minor changes from latest round of review.

Thanks!

Remind me: why do we need both per-ip and per-filesystem methods? In
practice, I assume that we'll always do *both*?

We're migrating clients by moving a server ip address from one node to
another. And I assume we're permitting at most one node to export each
filesystem at a time. So it *should* be the case that the set of locks
held on the filesystem(s) that are moving are the same as the set of
locks held by the virtual ip that is moving.

But presumably in some scenarios clients can get confused, and we need
to ensure that stale locks are not left behind?

We've discussed this before, but we should get the answer into comments
in the code (or on the patches).

--b.

>
> Thanks to all for the review !
>
> -- Wendy
>

> Two new NFSD procfs files are added:
> /proc/fs/nfsd/unlock_ip
> /proc/fs/nfsd/unlock_filesystem
>
> They are intended to allow admin or user mode script to release NLM locks
> based on either a path name or a server in-bound ip address (ipv4 for now)
> as;
>
> shell> echo 10.1.1.2 > /proc/fs/nfsd/unlock_ip
> shell> echo /mnt/sfs1 > /proc/fs/nfsd/unlock_filesystem
>
> Signed-off-by: S. Wendy Cheng <[email protected]>
> Signed-off-by: Lon Hohberger <[email protected]>
> Signed-off-by: Christoph Hellwig <[email protected]>
>
> fs/lockd/svcsubs.c | 66 +++++++++++++++++++++++++++++++++++++++-----
> fs/nfsd/nfsctl.c | 65 +++++++++++++++++++++++++++++++++++++++++++
> include/linux/lockd/lockd.h | 7 ++++
> 3 files changed, 131 insertions(+), 7 deletions(-)
>
> --- linux-o/fs/nfsd/nfsctl.c 2008-01-04 10:01:08.000000000 -0500
> +++ linux/fs/nfsd/nfsctl.c 2008-01-15 11:30:19.000000000 -0500
> @@ -22,6 +22,7 @@
> #include <linux/seq_file.h>
> #include <linux/pagemap.h>
> #include <linux/init.h>
> +#include <linux/inet.h>
> #include <linux/string.h>
> #include <linux/smp_lock.h>
> #include <linux/ctype.h>
> @@ -35,6 +36,7 @@
> #include <linux/nfsd/cache.h>
> #include <linux/nfsd/xdr.h>
> #include <linux/nfsd/syscall.h>
> +#include <linux/lockd/lockd.h>
>
> #include <asm/uaccess.h>
>
> @@ -52,6 +54,8 @@ enum {
> NFSD_Getfs,
> NFSD_List,
> NFSD_Fh,
> + NFSD_FO_UnlockIP,
> + NFSD_FO_UnlockFS,
> NFSD_Threads,
> NFSD_Pool_Threads,
> NFSD_Versions,
> @@ -88,6 +92,9 @@ static ssize_t write_leasetime(struct fi
> static ssize_t write_recoverydir(struct file *file, char *buf, size_t size);
> #endif
>
> +static ssize_t failover_unlock_ip(struct file *file, char *buf, size_t size);
> +static ssize_t failover_unlock_fs(struct file *file, char *buf, size_t size);
> +
> static ssize_t (*write_op[])(struct file *, char *, size_t) = {
> [NFSD_Svc] = write_svc,
> [NFSD_Add] = write_add,
> @@ -97,6 +104,8 @@ static ssize_t (*write_op[])(struct file
> [NFSD_Getfd] = write_getfd,
> [NFSD_Getfs] = write_getfs,
> [NFSD_Fh] = write_filehandle,
> + [NFSD_FO_UnlockIP] = failover_unlock_ip,
> + [NFSD_FO_UnlockFS] = failover_unlock_fs,
> [NFSD_Threads] = write_threads,
> [NFSD_Pool_Threads] = write_pool_threads,
> [NFSD_Versions] = write_versions,
> @@ -288,6 +297,58 @@ static ssize_t write_getfd(struct file *
> return err;
> }
>
> +static ssize_t failover_unlock_ip(struct file *file, char *buf, size_t size)
> +{
> + __be32 server_ip;
> + char *fo_path, c;
> + int b1, b2, b3, b4;
> +
> + /* sanity check */
> + if (size == 0)
> + return -EINVAL;
> +
> + if (buf[size-1] != '\n')
> + return -EINVAL;
> +
> + fo_path = buf;
> + if (qword_get(&buf, fo_path, size) < 0)
> + return -EINVAL;
> +
> + /* get ipv4 address */
> + if (sscanf(fo_path, "%u.%u.%u.%u%c", &b1, &b2, &b3, &b4, &c) != 4)
> + return -EINVAL;
> + server_ip = htonl((((((b1<<8)|b2)<<8)|b3)<<8)|b4);
> +
> + return nlmsvc_failover_ip(server_ip);
> +}
> +
> +static ssize_t failover_unlock_fs(struct file *file, char *buf, size_t size)
> +{
> + struct nameidata nd;
> + char *fo_path;
> + int error;
> +
> + /* sanity check */
> + if (size == 0)
> + return -EINVAL;
> +
> + if (buf[size-1] != '\n')
> + return -EINVAL;
> +
> + fo_path = buf;
> + if (qword_get(&buf, fo_path, size) < 0)
> + return -EINVAL;
> +
> + error = path_lookup(fo_path, 0, &nd);
> + if (error)
> + return error;
> +
> + error = nlmsvc_failover_path(&nd);
> +
> + path_release(&nd);
> + return error;
> +}
> +
> static ssize_t write_filehandle(struct file *file, char *buf, size_t size)
> {
> /* request is:
> @@ -646,6 +707,10 @@ static int nfsd_fill_super(struct super_
> [NFSD_Getfd] = {".getfd", &transaction_ops, S_IWUSR|S_IRUSR},
> [NFSD_Getfs] = {".getfs", &transaction_ops, S_IWUSR|S_IRUSR},
> [NFSD_List] = {"exports", &exports_operations, S_IRUGO},
> + [NFSD_FO_UnlockIP] = {"unlock_ip",
> + &transaction_ops, S_IWUSR|S_IRUSR},
> + [NFSD_FO_UnlockFS] = {"unlock_filesystem",
> + &transaction_ops, S_IWUSR|S_IRUSR},
> [NFSD_Fh] = {"filehandle", &transaction_ops, S_IWUSR|S_IRUSR},
> [NFSD_Threads] = {"threads", &transaction_ops, S_IWUSR|S_IRUSR},
> [NFSD_Pool_Threads] = {"pool_threads", &transaction_ops, S_IWUSR|S_IRUSR},
> --- linux-o/fs/lockd/svcsubs.c 2008-01-04 10:01:08.000000000 -0500
> +++ linux/fs/lockd/svcsubs.c 2008-01-15 11:16:48.000000000 -0500
> @@ -18,6 +18,8 @@
> #include <linux/lockd/lockd.h>
> #include <linux/lockd/share.h>
> #include <linux/lockd/sm_inter.h>
> +#include <linux/module.h>
> +#include <linux/mount.h>
>
> #define NLMDBG_FACILITY NLMDBG_SVCSUBS
>
> @@ -87,7 +89,7 @@ nlm_lookup_file(struct svc_rqst *rqstp,
> unsigned int hash;
> __be32 nfserr;
>
> - nlm_debug_print_fh("nlm_file_lookup", f);
> + nlm_debug_print_fh("nlm_lookup_file", f);
>
> hash = file_hash(f);
>
> @@ -123,6 +125,9 @@ nlm_lookup_file(struct svc_rqst *rqstp,
>
> hlist_add_head(&file->f_list, &nlm_files[hash]);
>
> + /* fill in f_iaddr for nlm lock failover */
> + file->f_iaddr = rqstp->rq_daddr;
> +
> found:
> dprintk("lockd: found file %p (count %d)\n", file, file->f_count);
> *result = file;
> @@ -194,6 +199,12 @@ again:
> return 0;
> }
>
> +static int
> +nlmsvc_always_match(struct nlm_host *dummy1, struct nlm_host *dummy2)
> +{
> + return 1;
> +}
> +
> /*
> * Inspect a single file
> */
> @@ -230,7 +241,8 @@ nlm_file_inuse(struct nlm_file *file)
> * Loop over all files in the file table.
> */
> static int
> -nlm_traverse_files(struct nlm_host *host, nlm_host_match_fn_t match)
> +nlm_traverse_files(void *data, nlm_host_match_fn_t match,
> + int (*is_failover_file)(void *data, struct nlm_file *file))
> {
> struct hlist_node *pos, *next;
> struct nlm_file *file;
> @@ -244,8 +256,17 @@ nlm_traverse_files(struct nlm_host *host
>
> /* Traverse locks, blocks and shares of this file
> * and update file->f_locks count */
> - if (nlm_inspect_file(host, file, match))
> - ret = 1;
> +
> + if (likely(is_failover_file == NULL) ||
> + is_failover_file(data, file)) {
> + /*
> + * Note that nlm_inspect_file updates f_locks
> + * and ret is the number of files that can't
> + * be unlocked.
> + */
> + ret += nlm_inspect_file(data, file, match);
> + } else
> + file->f_locks = nlm_file_inuse(file);
>
> mutex_lock(&nlm_file_mutex);
> file->f_count--;
> @@ -337,7 +358,7 @@ void
> nlmsvc_mark_resources(void)
> {
> dprintk("lockd: nlmsvc_mark_resources\n");
> - nlm_traverse_files(NULL, nlmsvc_mark_host);
> + nlm_traverse_files(NULL, nlmsvc_mark_host, NULL);
> }
>
> /*
> @@ -348,7 +369,7 @@ nlmsvc_free_host_resources(struct nlm_ho
> {
> dprintk("lockd: nlmsvc_free_host_resources\n");
>
> - if (nlm_traverse_files(host, nlmsvc_same_host)) {
> + if (nlm_traverse_files(host, nlmsvc_same_host, NULL)) {
> printk(KERN_WARNING
> "lockd: couldn't remove all locks held by %s\n",
> host->h_name);
> @@ -368,5 +389,36 @@ nlmsvc_invalidate_all(void)
> * turn, which is about as inefficient as it gets.
> * Now we just do it once in nlm_traverse_files.
> */
> - nlm_traverse_files(NULL, nlmsvc_is_client);
> + nlm_traverse_files(NULL, nlmsvc_is_client, NULL);
> +}
> +
> +static int
> +nlmsvc_failover_file_ok_path(void *datap, struct nlm_file *file)
> +{
> + struct nameidata *nd = datap;
> + return nd->mnt == file->f_file->f_path.mnt;
> +}
> +
> +int
> +nlmsvc_failover_path(struct nameidata *nd)
> +{
> + return nlm_traverse_files(nd, nlmsvc_always_match,
> + nlmsvc_failover_file_ok_path);
> +}
> +EXPORT_SYMBOL_GPL(nlmsvc_failover_path);
> +
> +static int
> +nlmsvc_failover_file_ok_ip(void *datap, struct nlm_file *file)
> +{
> + __be32 *server_addr = datap;
> +
> + return file->f_iaddr.addr.s_addr == *server_addr;
> +}
> +
> +int
> +nlmsvc_failover_ip(__be32 server_addr)
> +{
> + return nlm_traverse_files(&server_addr, nlmsvc_always_match,
> + nlmsvc_failover_file_ok_ip);
> }
> +EXPORT_SYMBOL_GPL(nlmsvc_failover_ip);
> --- linux-o/include/linux/lockd/lockd.h 2008-01-04 10:01:08.000000000 -0500
> +++ linux/include/linux/lockd/lockd.h 2008-01-15 11:13:04.000000000 -0500
> @@ -113,6 +113,7 @@ struct nlm_file {
> unsigned int f_locks; /* guesstimate # of locks */
> unsigned int f_count; /* reference count */
> struct mutex f_mutex; /* avoid concurrent access */
> + union svc_addr_u f_iaddr; /* server ip for failover */
> };
>
> /*
> @@ -214,6 +215,12 @@ void nlmsvc_mark_resources(void);
> void nlmsvc_free_host_resources(struct nlm_host *);
> void nlmsvc_invalidate_all(void);
>
> +/*
> + * Cluster failover support
> + */
> +int nlmsvc_failover_path(struct nameidata *nd);
> +int nlmsvc_failover_ip(__be32 server_addr);
> +
> static __inline__ struct inode *
> nlmsvc_file_inode(struct nlm_file *file)
> {


2008-01-17 15:48:56

by Wendy Cheng

Subject: Re: [PATCH 1/2] NLM failover unlock commands

J. Bruce Fields wrote:
> Remind me: why do we need both per-ip and per-filesystem methods? In
> practice, I assume that we'll always do *both*?
>

Failover is normally done via a virtual IP address - so the per-ip based
method should be the core routine. However, for a non-cluster filesystem
such as ext3/4, changing servers also implies an umount. If there are
clients not following the rules and obtaining locks via different ip
interfaces, the umount would fail, which ends up aborting the failover
process. That's where we need the per-filesystem method.

ServerA:
1. Tear down the IP address
2. Unexport the path
3. Write IP to /proc/fs/nfsd/unlock_ip to unlock files
4. If unmount required,
write path name to /proc/fs/nfsd/unlock_filesystem, then unmount.
5. Signal peer to begin take-over.

Sometime ago we were looking at "export name" as the core method (so
per-filesystem method is a subset of that). Unfortunately, the prototype
efforts showed the code would be too intrusive (if filesystem sub-tree
is exported).
> We're migrating clients by moving a server ip address from one node to
> another. And I assume we're permitting at most one node to export each
> filesystem at a time. So it *should* be the case that the set of locks
> held on the filesystem(s) that are moving are the same as the set of
> locks held by the virtual ip that is moving.
>

This is true for non-cluster filesystem. But a cluster filesystem can be
exported from multiple servers.
> But presumably in some scenarios clients can get confused, and we need
> to ensure that stale locks are not left behind?
>

Yes.

> We've discussed this before, but we should get the answer into comments
> in the code (or on the patches).
>
>
ok, working on it. or should we add something into linux/Documentation
to describe the overall logic ?

-- Wendy

2008-01-17 16:12:56

by Wendy Cheng

[permalink] [raw]
Subject: Re: [PATCH 1/2] NLM failover unlock commands

Add a more detailed description into the top of the patch itself. I'm
working on the resume patch now - it will include an overall write-up in
the Documentation directory.

-- Wendy

2008-01-17 16:10:12

by Wendy Cheng

[permalink] [raw]
Subject: Re: [PATCH 1/2] NLM failover unlock commands

Wendy Cheng wrote:
> Add a more detailed description into the top of the patch itself. I'm
> working on the resume patch now - it will include an overall write-up
> in the Documentation directory.

Sorry, forgot to attach the patch. Here it is ... Wendy


Attachments:
unlock_v4.patch (8.41 kB)

2008-01-17 16:14:19

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 1/2] NLM failover unlock commands

On Thu, Jan 17, 2008 at 10:48:56AM -0500, Wendy Cheng wrote:
> J. Bruce Fields wrote:
>> Remind me: why do we need both per-ip and per-filesystem methods? In
>> practice, I assume that we'll always do *both*?
>>
>
> Failover normally is done via virtual IP address - so per-ip base method
> should be the core routine. However, for non-cluster filesystem such as
> ext3/4, changing server also implies umount. If there are clients not
> following rule and obtaining locks via different ip interfaces, umount
> would fail that ends up aborting the failover process. That's the place
> we need the per-filesystem method.
>
> ServerA:
> 1. Tear down the IP address
> 2. Unexport the path
> 3. Write IP to /proc/fs/nfsd/unlock_ip to unlock files
> 4. If unmount required,
> write path name to /proc/fs/nfsd/unlock_filesystem, then unmount.
> 5. Signal peer to begin take-over.
>
> Sometime ago we were looking at "export name" as the core method (so
> per-filesystem method is a subset of that). Unfortunately, the prototype
> efforts showed the code would be too intrusive (if filesystem sub-tree
> is exported).
>> We're migrating clients by moving a server ip address from one node to
>> another. And I assume we're permitting at most one node to export each
>> filesystem at a time. So it *should* be the case that the set of locks
>> held on the filesystem(s) that are moving are the same as the set of
>> locks held by the virtual ip that is moving.
>>
>
> This is true for non-cluster filesystem. But a cluster filesystem can be
> exported from multiple servers.
>> But presumably in some scenarios clients can get confused, and we need
>> to ensure that stale locks are not left behind?
>>
>
> Yes.
>
>> We've discussed this before, but we should get the answer into comments
>> in the code (or on the patches).
>>
>>
> ok, working on it. or should we add something into linux/Documentation
> to describe the overall logic ?

Yeah, sounds good. Maybe under Documentation/filesystems? And it might
also be helpful to leave a reference to it in the code, e.g., in
nfsctl.c:

/*
* The following are used for failover; see
* Documentation/filesystems/nfsd-failover.txt for details.
*/

--b.

2008-01-17 16:17:08

by Wendy Cheng

[permalink] [raw]
Subject: Re: [PATCH 1/2] NLM failover unlock commands

J. Bruce Fields wrote:
> Yeah, sounds good. Maybe under Documentation/filesystems? And it might
> also be helpful to leave a reference to it in the code, e.g., in
> nfsctl.c:
>
> /*
> * The following are used for failover; see
> * Documentation/filesystems/nfsd-failover.txt for details.
> */
>
> --b.
>

ok, looks good ... but let's "fix" this on the resume patch ? or you
want it on the unlock patch *now* ?

-- Wendy

2008-01-17 16:22:20

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 1/2] NLM failover unlock commands

On Thu, Jan 17, 2008 at 11:17:08AM -0500, Wendy Cheng wrote:
> J. Bruce Fields wrote:
>> Yeah, sounds good. Maybe under Documentation/filesystems? And it might
>> also be helpful to leave a reference to it in the code, e.g., in
>> nfsctl.c:
>>
>> /*
>> * The following are used for failover; see
>> * Documentation/filesystems/nfsd-failover.txt for details.
>> */
>>
>> --b.
>>
>
> ok, looks good ... but let's "fix" this on the resume patch ? or you
> want it on the unlock patch *now* ?

It's not urgent.

--b.

2008-01-17 16:31:09

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 1/2] NLM failover unlock commands

On Thu, Jan 17, 2008 at 10:48:56AM -0500, Wendy Cheng wrote:
> J. Bruce Fields wrote:
>> Remind me: why do we need both per-ip and per-filesystem methods? In
>> practice, I assume that we'll always do *both*?
>>
>
> Failover normally is done via virtual IP address - so per-ip base method
> should be the core routine. However, for non-cluster filesystem such as
> ext3/4, changing server also implies umount. If there are clients not
> following rule and obtaining locks via different ip interfaces, umount
> would fail that ends up aborting the failover process. That's the place
> we need the per-filesystem method.
>
> ServerA:
> 1. Tear down the IP address
> 2. Unexport the path
> 3. Write IP to /proc/fs/nfsd/unlock_ip to unlock files
> 4. If unmount required,
> write path name to /proc/fs/nfsd/unlock_filesystem, then unmount.
> 5. Signal peer to begin take-over.
>
> Sometime ago we were looking at "export name" as the core method (so
> per-filesystem method is a subset of that). Unfortunately, the prototype
> efforts showed the code would be too intrusive (if filesystem sub-tree
> is exported).
>> We're migrating clients by moving a server ip address from one node to
>> another. And I assume we're permitting at most one node to export each
>> filesystem at a time. So it *should* be the case that the set of locks
>> held on the filesystem(s) that are moving are the same as the set of
>> locks held by the virtual ip that is moving.
>>
>
> This is true for non-cluster filesystem. But a cluster filesystem can be
> exported from multiple servers.

But that last sentence:

it *should* be the case that the set of locks held on the
filesystem(s) that are moving are the same as the set of locks
held by the virtual ip that is moving.

is still true in the cluster filesystem case, right?

--b.

2008-01-17 16:31:22

by Wendy Cheng

[permalink] [raw]
Subject: Re: [PATCH 1/2] NLM failover unlock commands

J. Bruce Fields wrote:
> On Thu, Jan 17, 2008 at 10:48:56AM -0500, Wendy Cheng wrote:
>
>> J. Bruce Fields wrote:
>>
>>> Remind me: why do we need both per-ip and per-filesystem methods? In
>>> practice, I assume that we'll always do *both*?
>>>
>>>
>> Failover normally is done via virtual IP address - so per-ip base method
>> should be the core routine. However, for non-cluster filesystem such as
>> ext3/4, changing server also implies umount. If there are clients not
>> following rule and obtaining locks via different ip interfaces, umount
>> would fail that ends up aborting the failover process. That's the place
>> we need the per-filesystem method.
>>
>> ServerA:
>> 1. Tear down the IP address
>> 2. Unexport the path
>> 3. Write IP to /proc/fs/nfsd/unlock_ip to unlock files
>> 4. If unmount required,
>> write path name to /proc/fs/nfsd/unlock_filesystem, then unmount.
>> 5. Signal peer to begin take-over.
>>
>> Sometime ago we were looking at "export name" as the core method (so
>> per-filesystem method is a subset of that). Unfortunately, the prototype
>> efforts showed the code would be too intrusive (if filesystem sub-tree
>> is exported).
>>
>>> We're migrating clients by moving a server ip address from one node to
>>> another. And I assume we're permitting at most one node to export each
>>> filesystem at a time. So it *should* be the case that the set of locks
>>> held on the filesystem(s) that are moving are the same as the set of
>>> locks held by the virtual ip that is moving.
>>>
>>>
>> This is true for non-cluster filesystem. But a cluster filesystem can be
>> exported from multiple servers.
>>
>
> But that last sentence:
>
> it *should* be the case that the set of locks held on the
> filesystem(s) that are moving are the same as the set of locks
> held by the virtual ip that is moving.
>
> is still true in the cluster filesystem case, right?
>
> --b.
>
Yes .... Wendy

2008-01-17 16:40:05

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 1/2] NLM failover unlock commands

On Thu, Jan 17, 2008 at 11:31:22AM -0500, Wendy Cheng wrote:
> J. Bruce Fields wrote:
>> On Thu, Jan 17, 2008 at 10:48:56AM -0500, Wendy Cheng wrote:
>>
>>> J. Bruce Fields wrote:
>>>
>>>> Remind me: why do we need both per-ip and per-filesystem methods? In
>>>> practice, I assume that we'll always do *both*?
>>>>
>>> Failover normally is done via virtual IP address - so per-ip base
>>> method should be the core routine. However, for non-cluster
>>> filesystem such as ext3/4, changing server also implies umount. If
>>> there are clients not following rule and obtaining locks via
>>> different ip interfaces, umount would fail that ends up aborting the
>>> failover process. That's the place we need the per-filesystem
>>> method.
>>>
>>> ServerA:
>>> 1. Tear down the IP address
>>> 2. Unexport the path
>>> 3. Write IP to /proc/fs/nfsd/unlock_ip to unlock files
>>> 4. If unmount required,
>>> write path name to /proc/fs/nfsd/unlock_filesystem, then unmount.
>>> 5. Signal peer to begin take-over.
>>>
>>> Sometime ago we were looking at "export name" as the core method (so
>>> per-filesystem method is a subset of that). Unfortunately, the
>>> prototype efforts showed the code would be too intrusive (if
>>> filesystem sub-tree is exported).
>>>
>>>> We're migrating clients by moving a server ip address from one node to
>>>> another. And I assume we're permitting at most one node to export each
>>>> filesystem at a time. So it *should* be the case that the set of locks
>>>> held on the filesystem(s) that are moving are the same as the set of
>>>> locks held by the virtual ip that is moving.
>>>>
>>> This is true for non-cluster filesystem. But a cluster filesystem can
>>> be exported from multiple servers.
>>>
>>
>> But that last sentence:
>>
>> it *should* be the case that the set of locks held on the
>> filesystem(s) that are moving are the same as the set of locks
>> held by the virtual ip that is moving.
>>
>> is still true in the cluster filesystem case, right?
>>
>> --b.
>>
> Yes .... Wendy

In what situations (buggy client? weird network failure?) could that
fail to be the case?

Would there be any advantage to enforcing that requirement in the
server? (For example, teaching nlm to reject any locking request for a
certain filesystem that wasn't sent to a certain server IP.)

--b.

2008-01-17 17:35:47

by Frank Filz

[permalink] [raw]
Subject: Re: [PATCH 1/2] NLM failover unlock commands

On Thu, 2008-01-17 at 11:40 -0500, J. Bruce Fields wrote:
> On Thu, Jan 17, 2008 at 11:31:22AM -0500, Wendy Cheng wrote:
> >> it *should* be the case that the set of locks held on the
> >> filesystem(s) that are moving are the same as the set of locks
> >> held by the virtual ip that is moving.
> >>
> >> is still true in the cluster filesystem case, right?
> >>
> >> --b.
> >>
> > Yes .... Wendy
>
> In what situations (buggy client? weird network failure?) could that
> fail to be the case?
>
> Would there be any advantage to enforcing that requirement in the
> server? (For example, teaching nlm to reject any locking request for a
> certain filesystem that wasn't sent to a certain server IP.)

Trying to dredge up my clustered nfsd/lockd memories from having worked
on an implementation more than 7 years ago...

With a clustered filesystem being exported, it might be the case that
the cluster has a set of IP addresses (probably one per node) that are
used for load balancing clients. Each node exports all file systems. As
nodes fail (and all of this only matters when an interface failing is
the cause of node failure - a node crash need not apply here), ip
addresses are failed-over to other nodes, taking with them the set of
clients that were accessing the cluster via that ip address.

I assume the intent here with this implementation is that the node
taking over will start lock recovery for the ip address? With that
perspective, I guess it would be important that each file system only be
accessed with a single ip address otherwise lock recovery will not work
correctly since another node/ip could accept locks for that filesystem,
possibly "stealing" a lock that is in recovery. As I recall, our
implementation put the entire filesystem cluster-wide into recovery
during fail-over.

Frank



2008-01-17 17:59:38

by Wendy Cheng

[permalink] [raw]
Subject: Re: [PATCH 1/2] NLM failover unlock commands

Frank Filz wrote:
>
> I assume the intent here with this implementation is that the node
> taking over will start lock recovery for the ip address? With that
> perspective, I guess it would be important that each file system only be
> accessed with a single ip address otherwise lock recovery will not work
> correctly since another node/ip could accept locks for that filesystem,
> possibly "stealing" a lock that is in recovery. As I recall, our
> implementation put the entire filesystem cluster-wide into recovery
> during fail-over.
>
>

We have 2 more patches on their way to this mailing list that:

* Set per-ip based grace period
* Notify only relevant clients about reclaim events

-- Wendy

2008-01-17 18:07:02

by Wendy Cheng

[permalink] [raw]
Subject: Re: [PATCH 1/2] NLM failover unlock commands

J. Bruce Fields wrote:
> On Thu, Jan 17, 2008 at 11:31:22AM -0500, Wendy Cheng wrote:
>
>> J. Bruce Fields wrote:
>>
>>> On Thu, Jan 17, 2008 at 10:48:56AM -0500, Wendy Cheng wrote:
>>>
>>>
>>>> J. Bruce Fields wrote:
>>>>
>>>>
>>>>> Remind me: why do we need both per-ip and per-filesystem methods? In
>>>>> practice, I assume that we'll always do *both*?
>>>>>
>>>>>
>>>> Failover normally is done via virtual IP address - so per-ip base
>>>> method should be the core routine. However, for non-cluster
>>>> filesystem such as ext3/4, changing server also implies umount. If
>>>> there are clients not following rule and obtaining locks via
>>>> different ip interfaces, umount would fail that ends up aborting the
>>>> failover process. That's the place we need the per-filesystem
>>>> method.
>>>>
>>>> ServerA:
>>>> 1. Tear down the IP address
>>>> 2. Unexport the path
>>>> 3. Write IP to /proc/fs/nfsd/unlock_ip to unlock files
>>>> 4. If unmount required,
>>>> write path name to /proc/fs/nfsd/unlock_filesystem, then unmount.
>>>> 5. Signal peer to begin take-over.
>>>>
>>>> Sometime ago we were looking at "export name" as the core method (so
>>>> per-filesystem method is a subset of that). Unfortunately, the
>>>> prototype efforts showed the code would be too intrusive (if
>>>> filesystem sub-tree is exported).
>>>>
>>>>
>>>>> We're migrating clients by moving a server ip address from one node to
>>>>> another. And I assume we're permitting at most one node to export each
>>>>> filesystem at a time. So it *should* be the case that the set of locks
>>>>> held on the filesystem(s) that are moving are the same as the set of
>>>>> locks held by the virtual ip that is moving.
>>>>>
>>>>>
>>>> This is true for non-cluster filesystem. But a cluster filesystem can
>>>> be exported from multiple servers.
>>>>
>>>>
>>> But that last sentence:
>>>
>>> it *should* be the case that the set of locks held on the
>>> filesystem(s) that are moving are the same as the set of locks
>>> held by the virtual ip that is moving.
>>>
>>> is still true in the cluster filesystem case, right?
>>>
>>> --b.
>>>
>>>
>> Yes .... Wendy
>>
>
> In what situations (buggy client? weird network failure?) could that
> fail to be the case?
>
> Would there be any advantage to enforcing that requirement in the
> server? (For example, teaching nlm to reject any locking request for a
> certain filesystem that wasn't sent to a certain server IP.)
>
> --b.
>
It is doable... could be added into the "resume" patch that is currently
being tested (since the logic is so similar to the per-ip base grace
period) that should be out for review no later than next Monday.

However, as any new code added into the system, there are trade-off(s).
I'm not sure we want to keep enhancing this too much though. Remember,
locking is about latency. Adding more checking will hurt latency.

-- Wendy

2008-01-17 20:23:48

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 1/2] NLM failover unlock commands

To summarize a phone conversation from today:

On Thu, Jan 17, 2008 at 01:07:02PM -0500, Wendy Cheng wrote:
> J. Bruce Fields wrote:
>> Would there be any advantage to enforcing that requirement in the
>> server? (For example, teaching nlm to reject any locking request for a
>> certain filesystem that wasn't sent to a certain server IP.)
>>
>> --b.
>>
> It is doable... could be added into the "resume" patch that is currently
> being tested (since the logic is so similar to the per-ip base grace
> period) that should be out for review no later than next Monday.
>
> However, as any new code added into the system, there are trade-off(s).
> I'm not sure we want to keep enhancing this too much though.

Sure. And I don't want to make this terribly complicated. The patch
looks good, and solves a clear problem. That said, there are a few
related problems we'd like to solve:

- We want to be able to move an export to a node with an already
active nfs server. Currently that requires restarting all of
nfsd on the target node. This is what I understand your next
patch fixes.
- In the case of a filesystem that may be mounted from multiple
nodes at once, we need to make sure we're not leaving a window
allowing other applications to claim locks that nfs clients
haven't recovered yet.
- Ideally we'd like this to be possible without making the
filesystem block all lock requests during a 90-second grace
period; instead it should only have to block those requests
that conflict with to-be-recovered locks.
- All this should work for nfsv4, where we want to eventually
also allow migration of individual clients, and
client-initiated failover.

I absolutely don't want to delay solving this particular problem until
all the above is figured out, but I would like to be reasonably
confident that the new user-interface can be extended naturally to
handle the above cases; or at least that it won't unnecessarily
complicate their implementation.

I'll try to sketch an implementation of most of the above in the next
week.

Anyway, that together with the fact that 2.6.25 is opening up soon (in a
week or so?) inclines me toward delaying submitting this until 2.6.26.

> Remember,
> locking is about latency. Adding more checking will hurt latency.

Do you have any latency tests that we could use, or latency-sensitive
workloads that you use as benchmarks?

My suspicion is that checks such as these would be dwarfed by the posix
deadlock detection checks, not to mention the roundtrip to the server
for the nlm rpc and (in the gfs2 case) the communication with gfs2's
posix lock manager.

But I'd love any chance to demonstrate lock latency problems--I'm sure
there's good work to be done there.

--b.

2008-01-18 10:03:10

by Frank van Maarseveen

[permalink] [raw]
Subject: Re: [PATCH 1/2] NLM failover unlock commands

On Thu, Jan 17, 2008 at 03:23:42PM -0500, J. Bruce Fields wrote:
> To summarize a phone conversation from today:
>
> On Thu, Jan 17, 2008 at 01:07:02PM -0500, Wendy Cheng wrote:
> > J. Bruce Fields wrote:
> >> Would there be any advantage to enforcing that requirement in the
> >> server? (For example, teaching nlm to reject any locking request for a
> >> certain filesystem that wasn't sent to a certain server IP.)
> >>
> >> --b.
> >>
> > It is doable... could be added into the "resume" patch that is currently
> > being tested (since the logic is so similar to the per-ip base grace
> > period) that should be out for review no later than next Monday.
> >
> > However, as any new code added into the system, there are trade-off(s).
> > I'm not sure we want to keep enhancing this too much though.
>
> Sure. And I don't want to make this terribly complicated. The patch
> looks good, and solves a clear problem. That said, there are a few
> related problems we'd like to solve:
>
> - We want to be able to move an export to a node with an already
> active nfs server. Currently that requires restarting all of
> nfsd on the target node. This is what I understand your next
> patch fixes.

Maybe a silly question but what about using "exportfs -r" for this?

--
Frank

2008-01-18 10:21:09

by Frank van Maarseveen

[permalink] [raw]
Subject: Re: [PATCH 1/2] NLM failover unlock commands


> shell> echo 10.1.1.2 > /proc/fs/nfsd/unlock_ip
> shell> echo /mnt/sfs1 > /proc/fs/nfsd/unlock_filesystem
>
> The expected sequence of events can be:
> 1. Tear down the IP address

You might consider using iptables at this point for dropping outgoing
packets with that source IP address to catch any packet still in
flight. It fixed ESTALE problems for me IIRC (NFSv3, UDP).
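
For example, something along these lines (10.1.1.2 is a placeholder, and the
exact chain/match may need adjusting for the setup):

shell> iptables -I OUTPUT -s 10.1.1.2 -j DROP    # drop anything still leaving with the old source address
shell> # ... remaining failover steps ...
shell> iptables -D OUTPUT -s 10.1.1.2 -j DROP    # remove the rule once the peer has taken over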

> 2. Unexport the path
> 3. Write IP to /proc/fs/nfsd/unlock_ip to unlock files
> 4. If unmount required, write path name to
> /proc/fs/nfsd/unlock_filesystem, then unmount.
> 5. Signal peer to begin take-over.
>

--
Frank

2008-01-18 15:00:44

by Wendy Cheng

[permalink] [raw]
Subject: Re: [PATCH 1/2] NLM failover unlock commands

Frank van Maarseveen wrote:
>
>> - We want to be able to move an export to a node with an already
>> active nfs server. Currently that requires restarting all of
>> nfsd on the target node. This is what I understand your next
>> patch fixes.
>>
>
> Maybe a silly question but what about using "exportfs -r" for this?
>

/me prays we won't go back to our *old* export failover proposal (about
two years ago) :) ...

Anyway, re-export is part of the required steps that the take-over server
needs to do. It does not, however, handle the lockd grace period, so there is
the possibility that another client will steal the locks away. That's
why Bruce and I are working on the second "resume" patch at this moment.

-- Wendy


2008-01-18 15:00:00

by Wendy Cheng

[permalink] [raw]
Subject: Re: [PATCH 1/2] NLM failover unlock commands

Frank van Maarseveen wrote:
>> shell> echo 10.1.1.2 > /proc/fs/nfsd/unlock_ip
>> shell> echo /mnt/sfs1 > /proc/fs/nfsd/unlock_filesystem
>>
>> The expected sequence of events can be:
>> 1. Tear down the IP address
>>
>
> You might consider using iptables at this point for dropping outgoing
> packets with that source IP address to catch any packet still in
> flight. It fixed ESTALE problems for me IIRC (NFSv3, UDP).
>

ok, thanks ... Wendy
>
>> 2. Unexport the path
>> 3. Write IP to /proc/fs/nfsd/unlock_ip to unlock files
>> 4. If unmount required, write path name to
>> /proc/fs/nfsd/unlock_filesystem, then unmount.
>> 5. Signal peer to begin take-over.
>>
>>
>
>