2008-04-25 13:31:22

by Bernd Schubert

Subject: multiple instances of rpc.statd

Hello,

on servers with heartbeat-managed resources it is quite common to export
different directories from different resources.

It may happen that all resources are running on one host, but they can also
run on different hosts. The situation gets even more complicated if the
server is also an NFS client.

In principle having different NFS resources works fine; only the statd state
directory is a problem - or really the statd concept as a whole. We would
actually need several instances of statd running with different state
directories, and these would then have to be migrated from one server to the
other when a resource moves.
However, as far as I understand it, the basic concept for this does not even
exist, does it?


Thanks,
Bernd



--
Bernd Schubert
Q-Leap Networks GmbH


2008-04-25 13:45:30

by Wendy Cheng

Subject: Re: multiple instances of rpc.statd

Bernd Schubert wrote:
> Hello,
>
> on servers with heartbeat managed resources one rather often has the situation
> one exports different directories from different resources.
>
> It now may happen all resources are running on one host, but they can also run
> from different hosts. The situation gets even more complicated if the server
> is also a nfs client.
>
> In principle having different nfs resources works fine, only the statd state
> directory is a problem. Or in principle the statd concept at all. Actually we
> would need to have several instances of statd running using different
> directories. These then would have to be migrated from one server to the
> other on resource movement.
> However, as far I understand it, there does not even exist the basic concept
> for this, doesn't it?
>
>
Efforts have been made to remedy this issue, and a complete set of
patches has been submitted (repeatedly) over the past two years. Patch
acceptance has been very slow (I guess people just don't want to be
bothered with cluster issues?).

Anyway, the kernel side has the basic infrastructure to handle the
problem (it stores the incoming client's IP address as part of its
book-keeping record) - a little bit of tweaking will do the job. However,
the user-side statd directory needs to be restructured. I didn't
publish the user-side directory-structure script during my last round of
submission. Forking statd into multiple threads does not solve all the
issues. Check out:
https://www.redhat.com/archives/cluster-devel/2007-April/msg00028.html
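
Just to illustrate the kind of book-keeping record I mean - this is not
the actual kernel structure, and all names below are made up:

/* Illustrative only - not the real NLM/statd data structures.
 * The point: each monitored client is recorded together with the
 * server-side (floating) IP address it contacted, so that on failover
 * only the entries belonging to the moving address need to be handled. */
#include <netinet/in.h>

struct ha_monitor_entry {
	struct in_addr client_addr;      /* address of the client holding locks */
	struct in_addr server_addr;      /* floating service IP the client used */
	char client_name[256];           /* caller name, as sent by the client  */
	struct ha_monitor_entry *next;   /* simple linked list of entries       */
};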

-- Wendy




2008-04-25 14:30:38

by Bernd Schubert

Subject: Re: multiple instances of rpc.statd

Hello Wendy.

On Friday 25 April 2008 15:47:03 Wendy Cheng wrote:
> Bernd Schubert wrote:
> > Hello,
> >
> > on servers with heartbeat managed resources one rather often has the
> > situation one exports different directories from different resources.
> >
> > It now may happen all resources are running on one host, but they can
> > also run from different hosts. The situation gets even more complicated
> > if the server is also a nfs client.
> >
> > In principle having different nfs resources works fine, only the statd
> > state directory is a problem. Or in principle the statd concept at all.
> > Actually we would need to have several instances of statd running using
> > different directories. These then would have to be migrated from one
> > server to the other on resource movement.
> > However, as far I understand it, there does not even exist the basic
> > concept for this, doesn't it?
>
> The efforts have been attempted (to remedy this issue) and a complete
> set of patches have been (kept) submitting for the past two years. The
> patch acceptance progress is very slow (I guess people just don't want
> to get bothered with cluster issues ?).

Well, I think people are just ignorant. I did see your discussions about NLM
on the NFS mailing list in the past, but I actually didn't understand the
entire point of the discussion ;) I was simply used to active-passive services
(mostly due to heartbeat 1.x), where we just had /var/lib/nfs linked to
the exported directory.

After I started working here, I was confronted with the fact that we do have
working active-active clusters here, but nobody besides me ever cared about
the locking problem :( NFS failovers are simply done ignoring file locks.
It seems nobody has run into a problem so far, but maybe the results were so
obscure that nobody ever bothered to complain...
I'm just afraid most admins will simply keep doing it this way...

>
> Anyway, the kernel side has the basic infrastructure to handle the
> problem (it stores the incoming clients IP address as part of its
> book-keeping record) - just a little bit tweak will do the job. However,
> the user side statd directory needs to get re-structured. I didn't
> publish the user side directory structure script during my last round of
> submission. Forking statd into multiple threads do not solve all the
> issues. Check out:
> https://www.redhat.com/archives/cluster-devel/2007-April/msg00028.html

Thanks, I will read this!


Thanks again,
Bernd

--
Bernd Schubert
Q-Leap Networks GmbH

2008-04-25 15:37:22

by Wendy Cheng

Subject: Re: multiple instances of rpc.statd

Bernd Schubert wrote:
> Hello Wendy.
>
> On Friday 25 April 2008 15:47:03 Wendy Cheng wrote:
>
>> The efforts have been attempted (to remedy this issue) and a complete
>> set of patches have been (kept) submitting for the past two years. The
>> patch acceptance progress is very slow (I guess people just don't want
>> to get bothered with cluster issues ?).
>>
>
> Well, I think people are just ignorant. I did see your discussions about NLM
> in the past on the NFS mailing list, but actually I didn't understand the
> entire point of discussion ;) I was simply used to active-passive services
> (mostly due to heartbeat-1.x) and there we just had /var/lib/nfs linked to
> the exported directory.
>
> After I started to work here, I was confronted with the fact we do have
> working active-active clusters here, but nobody besides me ever cared about
> the locking problem :( NFS failovers just are done ignoring file locks.
> Seems so far also nobody run into a problem, but maybe the result was so
> obscure that nobody ever bothered to complain...
> I'm just afraid most admins will simply do like this...
>

That's an accurate observation :) .. people are just ignorant until they
get bitten by the problem. Then they blurt out nasty words about Linux
servers and go for proprietary solutions.

There is an amazing number of "workarounds" and funny setups to bypass
various Linux problems. Admins normally don't care about the details; they
just know that if they do certain "tricks", things work. Last week I was
looking at a performance issue where clustered mail servers ran miserably
slowly. As a person who doesn't know much about mail servers, I was
surprised to learn that it is common practice for Linux email servers to be
configured to grab an flock, followed by a POSIX lock, and then write a
lock file whenever a "write" occurs - all three mechanisms used concurrently
to protect one single file (?). It was a very interesting conversation.
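
Just to illustrate what that looks like from the application side - a
minimal sketch, not taken from any particular mail server (the file names
and the ordering of the three mechanisms are made up):

/* "Triple locking" as described above: a BSD flock, then a POSIX (fcntl)
 * lock, then a dot-lock file, all guarding writes to one mailbox file.
 * Purely illustrative; the fcntl lock is the kind of lock that NLM/statd
 * deals with on NFS. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

int main(void)
{
	int fd = open("mailbox", O_RDWR | O_CREAT, 0644);
	if (fd < 0) { perror("open"); return 1; }

	/* 1. whole-file BSD lock */
	if (flock(fd, LOCK_EX) < 0) { perror("flock"); return 1; }

	/* 2. whole-file POSIX lock */
	struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
			    .l_start = 0, .l_len = 0 };
	if (fcntl(fd, F_SETLKW, &fl) < 0) { perror("fcntl"); return 1; }

	/* 3. dot-lock file created exclusively as a third "lock" */
	int lockfd = open("mailbox.lock", O_CREAT | O_EXCL | O_WRONLY, 0644);
	if (lockfd < 0) { perror("dot-lock"); return 1; }

	/* ... the actual write to "mailbox" would happen here ... */

	close(lockfd);
	unlink("mailbox.lock");
	fl.l_type = F_UNLCK;
	fcntl(fd, F_SETLK, &fl);
	flock(fd, LOCK_UN);
	close(fd);
	return 0;
}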

-- Wendy


2008-04-25 22:07:31

by J. Bruce Fields

Subject: Re: multiple instances of rpc.statd

On Fri, Apr 25, 2008 at 09:47:03AM -0400, Wendy Cheng wrote:
> Bernd Schubert wrote:
>> Hello,
>>
>> on servers with heartbeat managed resources one rather often has the
>> situation one exports different directories from different resources.
>>
>> It now may happen all resources are running on one host, but they can
>> also run from different hosts. The situation gets even more complicated
>> if the server is also a nfs client.
>>
>> In principle having different nfs resources works fine, only the statd
>> state directory is a problem. Or in principle the statd concept at all.
>> Actually we would need to have several instances of statd running using
>> different directories. These then would have to be migrated from one
>> server to the other on resource movement. However, as far I understand
>> it, there does not even exist the basic concept for this, doesn't it?
>>
>>
> The efforts have been attempted (to remedy this issue) and a complete
> set of patches have been (kept) submitting for the past two years. The
> patch acceptance progress is very slow (I guess people just don't want
> to get bothered with cluster issues ?).

We definitely want to get this all figured out....

> Anyway, the kernel side has the basic infrastructure to handle the
> problem (it stores the incoming clients IP address as part of its
> book-keeping record) - just a little bit tweak will do the job. However,
> the user side statd directory needs to get re-structured. I didn't
> publish the user side directory structure script during my last round of
> submission. Forking statd into multiple threads do not solve all the
> issues. Check out:
> https://www.redhat.com/archives/cluster-devel/2007-April/msg00028.html

So for basic v2/v3 failover, what remains is some statd -H scripts, and
some form of grace period control? Is there anything else we're
missing?
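
(For concreteness, the -H part could be a small callout along these
lines - written in C here purely for illustration; the
add-client/del-client argument convention is from my reading of
rpc.statd(8) and should be double-checked, and the per-resource
directory layout is made up:)

/* Sketch of an rpc.statd -H (--ha-callout) helper: record each monitored
 * client under a per-server-name directory so the monitor list can move
 * together with the floating resource. Not tested. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc < 4)
		return 1;

	const char *action = argv[1];  /* "add-client" or "del-client"        */
	const char *client = argv[2];  /* name of the client                  */
	const char *server = argv[3];  /* server name as known to the client  */

	char path[512];
	snprintf(path, sizeof(path), "/var/lib/nfs/ha/%s/%s", server, client);

	if (strcmp(action, "add-client") == 0) {
		FILE *f = fopen(path, "w");   /* touch a per-client marker */
		if (f)
			fclose(f);
	} else if (strcmp(action, "del-client") == 0) {
		unlink(path);
	}
	return 0;
}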

--b.

2008-04-28 03:59:14

by Wendy Cheng

Subject: Re: multiple instances of rpc.statd

J. Bruce Fields wrote:
> On Fri, Apr 25, 2008 at 09:47:03AM -0400, Wendy Cheng wrote:
>
>> Bernd Schubert wrote:
>>
>>> Hello,
>>>
>>> on servers with heartbeat managed resources one rather often has the
>>> situation one exports different directories from different resources.
>>>
>>> It now may happen all resources are running on one host, but they can
>>> also run from different hosts. The situation gets even more complicated
>>> if the server is also a nfs client.
>>>
>>> In principle having different nfs resources works fine, only the statd
>>> state directory is a problem. Or in principle the statd concept at all.
>>> Actually we would need to have several instances of statd running using
>>> different directories. These then would have to be migrated from one
>>> server to the other on resource movement. However, as far I understand
>>> it, there does not even exist the basic concept for this, doesn't it?
>>>
>>>
>>>
>> The efforts have been attempted (to remedy this issue) and a complete
>> set of patches have been (kept) submitting for the past two years. The
>> patch acceptance progress is very slow (I guess people just don't want
>> to get bothered with cluster issues ?).
>>
>
> We definitely want to get this all figured out....
>
>
>> Anyway, the kernel side has the basic infrastructure to handle the
>> problem (it stores the incoming clients IP address as part of its
>> book-keeping record) - just a little bit tweak will do the job. However,
>> the user side statd directory needs to get re-structured. I didn't
>> publish the user side directory structure script during my last round of
>> submission. Forking statd into multiple threads do not solve all the
>> issues. Check out:
>> https://www.redhat.com/archives/cluster-devel/2007-April/msg00028.html
>>
>
> So for basic v2/v3 failover, what remains is some statd -H scripts, and
> some form of grace period control? Is there anything else we're
> missing?
>
>
>
The submitted patch set is reasonably complete ... .

There was another thought about the statd patches though - mostly because
of concerns over statd's responsiveness, which depends so much on network
status and client participation. I was hoping NFS v4 would catch up by the
time the v2/v3 grace period patches got accepted into the mainline kernel.
Ideally the v2/v3 lock reclaiming logic could use the communication channel
established by v4 servers (or at least a similar implementation) - that is,

1. Enable the grace period on the secondary server, as in the previously
submitted patches.
2. Drop the locks on the primary server (chaining the dropped locks into
a lock-list).
3. Send the lock-list from the primary server to the backup server via the
v4 communication channel (or a similar implementation).
4. Reclaim the locks on the backup server based on the lock-list.

In short, it would be nice to replace the existing statd lock reclaiming
logic with the above steps, if at all possible, during active-active
failover. Reboot, on the other hand, should stay the same as today's
statd logic, without changes. A rough sketch of the intended flow follows.
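
Every function name in the sketch is made up - no prototype code exists
yet - and the stubs only print, just to make the intended ordering
concrete:

/* Hypothetical outline of the four failover steps above. Nothing here
 * corresponds to an existing kernel or nfs-utils interface. */
#include <stdio.h>

struct lock_list { int nr_locks; };  /* would hold the dropped-lock records */

static void enter_grace_period(const char *srv)
{
	printf("1. grace period enabled on %s\n", srv);
}

static struct lock_list *drop_and_collect_locks(const char *srv)
{
	static struct lock_list list = { 0 };
	printf("2. locks dropped and chained into a lock-list on %s\n", srv);
	return &list;
}

static void send_lock_list(const char *from, const char *to,
			   struct lock_list *l)
{
	printf("3. lock-list (%d locks) sent from %s to %s\n",
	       l->nr_locks, from, to);
}

static void reclaim_locks(const char *srv, struct lock_list *l)
{
	printf("4. %d locks reclaimed on %s\n", l->nr_locks, srv);
}

int main(void)
{
	struct lock_list *list;

	enter_grace_period("backup");
	list = drop_and_collect_locks("primary");
	send_lock_list("primary", "backup", list);
	reclaim_locks("backup", list);
	return 0;
}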

-- Wendy



2008-04-28 18:26:14

by J. Bruce Fields

Subject: Re: multiple instances of rpc.statd

On Sun, Apr 27, 2008 at 10:59:11PM -0500, Wendy Cheng wrote:
> J. Bruce Fields wrote:
>> On Fri, Apr 25, 2008 at 09:47:03AM -0400, Wendy Cheng wrote:
>>
>>> Bernd Schubert wrote:
>>>
>>>> Hello,
>>>>
>>>> on servers with heartbeat managed resources one rather often has
>>>> the situation one exports different directories from different
>>>> resources.
>>>>
>>>> It now may happen all resources are running on one host, but they
>>>> can also run from different hosts. The situation gets even more
>>>> complicated if the server is also a nfs client.
>>>>
>>>> In principle having different nfs resources works fine, only the
>>>> statd state directory is a problem. Or in principle the statd
>>>> concept at all. Actually we would need to have several instances of
>>>> statd running using different directories. These then would have to
>>>> be migrated from one server to the other on resource movement.
>>>> However, as far I understand it, there does not even exist the
>>>> basic concept for this, doesn't it?
>>>>
>>>>
>>> The efforts have been attempted (to remedy this issue) and a complete
>>> set of patches have been (kept) submitting for the past two years.
>>> The patch acceptance progress is very slow (I guess people just
>>> don't want to get bothered with cluster issues ?).
>>>
>>
>> We definitely want to get this all figured out....
>>
>>
>>> Anyway, the kernel side has the basic infrastructure to handle the
>>> problem (it stores the incoming clients IP address as part of its
>>> book-keeping record) - just a little bit tweak will do the job.
>>> However, the user side statd directory needs to get re-structured. I
>>> didn't publish the user side directory structure script during my
>>> last round of submission. Forking statd into multiple threads do not
>>> solve all the issues. Check out:
>>> https://www.redhat.com/archives/cluster-devel/2007-April/msg00028.html
>>>
>>
>> So for basic v2/v3 failover, what remains is some statd -H scripts, and
>> some form of grace period control? Is there anything else we're
>> missing?
>>
>>
>>
> The submitted patch set is reasonably complete ... .
>
> There was another thought about statd patches though - mostly because of
> the concerns over statd's responsiveness. It depended so much on network
> status and clients' participations. I was hoping NFS V4 would catch up
> by the time v2/v3 grace period patches got accepted into mainline
> kernel. Ideally the v2/v3 lock reclaiming logic could use (or at least
> did a similar implementation) the communication channel established by
> v4 servers - that is,
>
> 1. Enable grace period as previous submitted patches on secondary server.
> 2. Drop the locks on primary server (and chained the dropped locks into
> a lock-list).

What information exactly would be on that lock list?

> 3. Send the lock-list via v4 communication channel (or similar
> implementation) from primary server to backup server.
> 4. Reclaim the lock base on the lock-list on backup server.

So at this step it's the server itself reclaiming those locks, and
you're talking about a completely transparent migration that doesn't
look to the client like a reboot?

My feeling has been that that's best done after first making sure we can
handle the case where the client reclaims the locks, since the latter is
easier, and is likely to involve at least some of the same work. I
could be wrong.

Exactly which data has to be transferred from the old server to the new?
(Lock types, ranges, fh's, owners, and pid's, for established locks; do
we also need to hand off blocking locks? Statd data still needs to be
transferred. Ideally rpc reply caches. What else?)

> In short, it would be nice to replace the existing statd lock reclaiming
> logic with the above steps if all possible during active-active
> failover. For reboot, on the other hand, should stay same as today's
> statd logic without changes.

--b.

2008-04-28 19:17:26

by Wendy Cheng

Subject: Re: multiple instances of rpc.statd

J. Bruce Fields wrote:
> On Sun, Apr 27, 2008 at 10:59:11PM -0500, Wendy Cheng wrote:
>
>>
>>> So for basic v2/v3 failover, what remains is some statd -H scripts, and
>>> some form of grace period control? Is there anything else we're
>>> missing?
>>>
>> The submitted patch set is reasonably complete ... .
>>
>> There was another thought about statd patches though - mostly because of
>> the concerns over statd's responsiveness. It depended so much on network
>> status and clients' participations. I was hoping NFS V4 would catch up
>> by the time v2/v3 grace period patches got accepted into mainline
>> kernel. Ideally the v2/v3 lock reclaiming logic could use (or at least
>> did a similar implementation) the communication channel established by
>> v4 servers - that is,
>>
>> 1. Enable grace period as previous submitted patches on secondary server.
>> 2. Drop the locks on primary server (and chained the dropped locks into
>> a lock-list).
>>
>
> What information exactly would be on that lock list?
>

Can't believe I get myself into this ... I'm supposed to be a disk
firmware person *now* .. Anyway,

Are the lock states finalized in v4 yet? Can we borrow the concepts (and
saved lock states) from v4? We could certainly define saved state useful
for v3 independently of v4 - say client IP, file path, lock range, lock
type, and user id? I need to re-read the Linux source to make sure it is
doable though.
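
Something along these lines, maybe - all names are made up, just to pin
the fields down; nothing like this exists in the kernel today:

/* Illustrative only - one possible record for a saved v3 lock, carrying
 * the fields listed above. A file handle could be used instead of the
 * path. */
#include <limits.h>
#include <netinet/in.h>
#include <stdint.h>

struct v3_saved_lock {
	struct in_addr client_addr;     /* client IP                       */
	char           path[PATH_MAX];  /* file path (or a file handle)    */
	uint64_t       start;           /* lock range: starting offset     */
	uint64_t       length;          /* lock range: length, 0 = to EOF  */
	uint32_t       type;            /* lock type: read or write        */
	uint32_t       uid;             /* user id (lock owner)            */
};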

>
>> 3. Send the lock-list via v4 communication channel (or similar
>> implementation) from primary server to backup server.
>> 4. Reclaim the lock base on the lock-list on backup server.
>>
>
> So at this step it's the server itself reclaiming those locks, and
> you're talking about a completely transparent migration that doesn't
> look to the client like a reboot?
>

Yes, that's the idea .. I haven't implemented any prototype code yet - so
I'm not sure how feasible it would be.
> My feeling has been that that's best done after first making sure we can
> handle the case where the client reclaims the locks, since the latter is
> easier, and is likely to involve at least some of the same work. I
> could be wrong.
>

Makes sense .. so the steps to take may be:

1. Push the patch sets that we originally submitted. This is to make
sure we have something working.
2. Prototype the new logic in parallel with v4 development, observing and
learning from the results of step 1 based on user feedback.
3. Integrate the new logic, if it turns out to be good.

> Exactly which data has to be transferred from the old server to the new?
> (Lock types, ranges, fh's, owners, and pid's, for established locks; do
> we also need to hand off blocking locks? Statd data still needs to be
> transferred. Ideally rpc reply caches. What else?)
>

All statd has is the client network addresses (which are already part of
the current NLM state anyway). Yes, the rpc reply cache is important (and
that's exactly the motivation for this thread of discussion). Eventually
the rpc reply cache needs to be transferred. As long as the communication
channel is established, there is no reason for the lock states not to take
advantage of it.

>
>> In short, it would be nice to replace the existing statd lock reclaiming
>> logic with the above steps if all possible during active-active
>> failover. For reboot, on the other hand, should stay same as today's
>> statd logic without changes.
>>

As mentioned before, cluster issues are not trivial. Take one step at a
time .. So the next task we should focus on may be the grace period
patch. I'll see what I can do to help out here.

-- Wendy


2008-04-29 16:20:58

by J. Bruce Fields

Subject: Re: multiple instances of rpc.statd

On Mon, Apr 28, 2008 at 03:19:28PM -0400, Wendy Cheng wrote:
> J. Bruce Fields wrote:
>> On Sun, Apr 27, 2008 at 10:59:11PM -0500, Wendy Cheng wrote:
>>
>>>
>>>> So for basic v2/v3 failover, what remains is some statd -H scripts, and
>>>> some form of grace period control? Is there anything else we're
>>>> missing?
>>>>
>>> The submitted patch set is reasonably complete ... .
>>>
>>> There was another thought about statd patches though - mostly because of
>>> the concerns over statd's responsiveness. It depended so much on network
>>> status and clients' participations. I was hoping NFS V4 would catch up
>>> by the time v2/v3 grace period patches got accepted into mainline
>>> kernel. Ideally the v2/v3 lock reclaiming logic could use (or at least
>>> did a similar implementation) the communication channel established by
>>> v4 servers - that is,
>>>
>>> 1. Enable grace period as previous submitted patches on secondary server.
>>> 2. Drop the locks on primary server (and chained the dropped locks into
>>> a lock-list).
>>>
>>
>> What information exactly would be on that lock list?
>>
>
> Can't believe I get myself into this ... I'm supposed to be a disk
> firmware person *now* .. Anyway,
>
> Are the lock state finalized in v4 yet ?

You mean, have we figured out what to send across for a transparent
migration? Somebody did a prototype that I think we set aside for a
while, but I don't recall if it tried to handle truly transparent
migration, or whether it just sent across the v4 equivalent of the statd
data; I'll check.

--b.

> Can we borrow the concepts (and
> saved lock states) from v4 ? We certainly can define the saved state
> useful for v3 independent of v4, say client IP, file path, lock range,
> lock type, and user id ? Need to re-read linux source to make sure it is
> doable though.
>
>>
>>> 3. Send the lock-list via v4 communication channel (or similar
>>> implementation) from primary server to backup server.
>>> 4. Reclaim the lock base on the lock-list on backup server.
>>>
>>
>> So at this step it's the server itself reclaiming those locks, and
>> you're talking about a completely transparent migration that doesn't
>> look to the client like a reboot?
>>
>
> Yes, that's the idea .. never implement any prototype code yet - so not
> sure how feasible it would be.
>> My feeling has been that that's best done after first making sure we can
>> handle the case where the client reclaims the locks, since the latter is
>> easier, and is likely to involve at least some of the same work. I
>> could be wrong.
>>
>
> Makes sense .. so the steps taken may be:
>
> 1. Push the patch sets that we originally submitted. This is to make
> sure we have something working.
> 2. Prototype the new logic, parallel with v4 development, observe and
> learn the results from step 1 based on user feedbacks.
> 3. Integrate the new logic, if it turns out to be good.
>
>> Exactly which data has to be transferred from the old server to the new?
>> (Lock types, ranges, fh's, owners, and pid's, for established locks; do
>> we also need to hand off blocking locks? Statd data still needs to be
>> transferred. Ideally rpc reply caches. What else?)
>>
>
> All statd has is the client network addresses (that is already part of
> current NLM states anyway). Yes, rpc reply cache is important (and
> that's exactly the motivation for this thread of discussion). Eventually
> the rpc reply cache needs to get transferred. As long as the
> communication channel is established, there is no reason for lock states
> not taking this advantages.
>
>>
>>> In short, it would be nice to replace the existing statd lock reclaiming
>>> logic with the above steps if all possible during active-active
>>> failover. For reboot, on the other hand, should stay same as today's
>>> statd logic without changes.
>>>
>
> As mentioned before, cluster issues are not trivial. Take one step at a
> time .. So the next task we should be focusing may be the grace period
> patch. Will see what I can do to help out here.
>
> -- Wendy
>