2015-04-27 06:19:12

by Saso Slavicic

Subject: server_scope v4.1 lock reclaim

Hi,

I'm doing an NFS HA setup for KVM and need lock reclaim to work. I've been
doing a lot of testing and reading in the past week and finally figured out
that for reclaims to work on a 4.1 mount (4.1 is preferable due to
RECLAIM_COMPLETE and thus faster failover), the server hostnames need to be
the same. The RFC specifies that reclaim can succeed if the server scope is
the same, and in fact the client will not even attempt a reclaim if the
server scope does not match.
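
(For reference, the client-side comparison looks roughly like this; I'm
paraphrasing from my reading of fs/nfs/nfs4proc.c rather than quoting it, so
the exact name and placement may be off, but the reclaim path simply gives up
when the scopes don't compare equal:)

static bool
nfs41_same_server_scope(struct nfs41_server_scope *a,
			struct nfs41_server_scope *b)
{
	/* opaque octet strings: equal only if both length and bytes match */
	return a->server_scope_sz == b->server_scope_sz &&
	       memcmp(a->server_scope, b->server_scope,
		      a->server_scope_sz) == 0;
}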

But... there doesn't seem to be any way of setting the server scope other
than changing the server hostname? The RFC states: "The purpose of the
server scope is to allow a group of servers to indicate to clients that a
set of servers sharing the same server scope value has arranged to use
compatible values of otherwise opaque identifiers." The nfsdcltrack
directory is properly handed over during failover, so I'd need some way of
configuring the server scope on this "set of servers"? From the code, the
server scope is simply set to utsname()->nodename in nfs4xdr.c.
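
(On the server side the relevant part of nfsd4_encode_exchange_id() is,
approximately, the following; again paraphrased from fs/nfsd/nfs4xdr.c, not
copied line for line:)

	/* approximately what nfsd4_encode_exchange_id() does */
	major_id = utsname()->nodename;
	major_id_sz = strlen(major_id);
	server_scope = utsname()->nodename;
	server_scope_sz = strlen(server_scope);
	/* both eir_server_owner.so_major_id and eir_server_scope are then
	 * encoded from these buffers, so the client sees the nodename
	 * verbatim */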

What am I missing here? How can this work when Heartbeat needs different
names for the nodes?

Thanks,
Saso Slavicic




2015-04-27 15:19:45

by J. Bruce Fields

Subject: Re: server_scope v4.1 lock reclaim

On Mon, Apr 27, 2015 at 08:07:12AM +0200, Saso Slavicic wrote:
> I'm doing an NFS HA setup for KVM and need lock reclaim to work. I've been
> doing a lot of testing and reading in the past week and finally figured out
> that for reclaims to work on a 4.1 mount (4.1 is preferable due to
> RECLAIM_COMPLETE and thus faster failover), the server hostnames need to be
> the same. The RFC specifies that reclaim can succeed if the server scope is
> the same, and in fact the client will not even attempt a reclaim if the
> server scope does not match.
>
> But... there doesn't seem to be any way of setting the server scope other
> than changing the server hostname? The RFC states: "The purpose of the
> server scope is to allow a group of servers to indicate to clients that a
> set of servers sharing the same server scope value has arranged to use
> compatible values of otherwise opaque identifiers." The nfsdcltrack
> directory is properly handed over during failover, so I'd need some way of
> configuring the server scope on this "set of servers"? From the code, the
> server scope is simply set to utsname()->nodename in nfs4xdr.c.
>
> What am I missing here? How can this work when Heartbeat needs different
> names for the nodes?

So in theory we could add some sort of way to configure the server scope
and then you could set the server scope to the same thing on all your
servers.
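
Just to sketch what I mean (nothing like this exists today, and the parameter
name is invented), it could be as small as a knob that
nfsd4_encode_exchange_id() consults before falling back to the nodename:

	/* hypothetical, not in any kernel */
	static char *scope_override;
	module_param(scope_override, charp, 0644);
	MODULE_PARM_DESC(scope_override, "override the NFSv4.1 server scope");

	/* ...and in nfsd4_encode_exchange_id(): */
	server_scope = (scope_override && *scope_override) ?
				scope_override : utsname()->nodename;
	server_scope_sz = strlen(server_scope);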

But that's not enough to satisfy
https://tools.ietf.org/html/rfc5661#section-2.10.4, which also requires
stateids and the rest to be compatible between the servers.

In practice, given current Linux servers and clients, maybe that could
work, because in your situation the only case where they see each other's
stateids is after a restart, in which case the ids will include a boot
time that will result in a STALE error as long as the server clocks are
roughly synchronized. But that makes some assumptions about how our
servers generate ids and how the clients use them, and I don't think
those assumptions are guaranteed by the spec. It seems fragile.
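
To make the boot-time point concrete, the ids our server hands out are built
roughly like this (paraphrasing fs/nfsd/state.h):

	typedef struct {
		u32	cl_boot;	/* time this nfsd instance started */
		u32	cl_id;		/* counter within that instance */
	} clientid_t;

	typedef struct {
		clientid_t	so_clid;
		u32		so_id;
	} stateid_opaque_t;

	typedef struct {
		u32			si_generation;
		stateid_opaque_t	si_opaque;	/* embeds the clientid */
	} stateid_t;

A server instance that started later than cl_boot doesn't recognize the id
and returns a STALE error, which is what saves us in your scenario; but the
spec only says these values are opaque, it doesn't promise they're structured
this way.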

If it's simple active-to-passive failover then I suppose you could
arrange for the utsname to be the same too.
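
The scope is read from the nodename at EXCHANGE_ID time, so whatever
mechanism the resource agent uses to make both machines report the same name
would cover that part, whether that's hostname(1), sysctl kernel.hostname, or
a trivial helper like this (toy example, name invented):

	/* set-nodename.c: toy illustration only; needs CAP_SYS_ADMIN and
	 * only changes the runtime nodename, not persistent configuration */
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		const char *name = argc > 1 ? argv[1] : "nfs-cluster";

		if (sethostname(name, strlen(name)) != 0) {
			perror("sethostname");
			return 1;
		}
		printf("nodename set to %s\n", name);
		return 0;
	}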

--b.

2015-04-28 16:44:43

by Saso Slavicic

Subject: RE: server_scope v4.1 lock reclaim

> From: J. Bruce Fields
> Sent: Monday, April 27, 2015 5:20 PM

> So in theory we could add some sort of way to configure the server scope
> and then you could set the server scope to the same thing on all your
> servers.
>
> But that's not enough to satisfy
> https://tools.ietf.org/html/rfc5661#section-2.10.4, which also requires
> stateids and the rest to be compatible between the servers.

OK... I have to admit that, given the number of NFS HA tutorials out there
and the improvements NFSv4(.1) brings in the specs, I assumed HA failover
was supported. I apologize if that is not the case.

So, such a config option could be added, but it isn't planned, since it
could be misused in some situations (i.e. not doing active-to-passive
failover)?
Is an active-active setup then totally out of the question?

> In practice, given current Linux servers and clients, maybe that could
> work, because in your situation the only case where they see each other's
> stateids is after a restart, in which case the ids will include a boot
> time that will result in a STALE error as long as the server clocks are
> roughly synchronized. But that makes some assumptions about how our
> servers generate ids and how the clients use them, and I don't think
> those assumptions are guaranteed by the spec. It seems fragile.

I read (part of) the specs; stateids are supposed to hold across sessions
but not across different client IDs.
Looking at a Wireshark dump, the (failover) server sends STALE_CLIENTID
after reconnect, so that should properly invalidate all the IDs?
Am I correct in assuming that this is read from nfsdcltrack? Is there even
a need for this database to be synced on each failover, if the client is
already known from its last failover (only the timestamp would be older)?

> If it's simple active-to-passive failover then I suppose you could
> arrange for the utsname to be the same too.

I could, but then I wouldn't know which server is active when I log in over
ssh :)
What would happen if the 'migration' mount option were modified for v4.1
mounts not to check the server scope when doing reclaims (as opposed to
configuring the server scope)? :)

Thanks,
Saso Slavicic


2015-04-28 18:23:30

by J. Bruce Fields

Subject: Re: server_scope v4.1 lock reclaim

On Tue, Apr 28, 2015 at 06:44:27PM +0200, Saso Slavicic wrote:
> > From: J. Bruce Fields
> > Sent: Monday, April 27, 2015 5:20 PM
>
> > So in theory we could add some sort of way to configure the server scope
> > and then you could set the server scope to the same thing on all your
> > servers.
> >
> > But that's not enough to satisfy
> > https://tools.ietf.org/html/rfc5661#section-2.10.4, which also requires
> > stateids and the rest to be compatible between the servers.
>
> OK... I have to admit that, given the number of NFS HA tutorials out there
> and the improvements NFSv4(.1) brings in the specs, I assumed HA failover
> was supported. I apologize if that is not the case.

I'm afraid you're in the vanguard--I doubt many people have tried HA
with 4.1 and knfsd yet. (And I hadn't noticed the server scope problem,
thanks for bringing it up.)

> So, such a config option could be added, but it isn't planned, since it
> could be misused in some situations (i.e. not doing active-to-passive
> failover)?
> Is an active-active setup then totally out of the question?

I'm not sure what the right fix is yet.

> > In practice, given current Linux servers and clients, maybe that could
> > work, because in your situation the only case where they see each other's
> > stateids is after a restart, in which case the ids will include a boot
> > time that will result in a STALE error as long as the server clocks are
> > roughly synchronized. But that makes some assumptions about how our
> > servers generate ids and how the clients use them, and I don't think
> > those assumptions are guaranteed by the spec. It seems fragile.
>
> I read (part of) the specs; stateids are supposed to hold across sessions
> but not across different client IDs.
> Looking at a Wireshark dump, the (failover) server sends STALE_CLIENTID
> after reconnect, so that should properly invalidate all the IDs?

Since this is 4.1, I guess the first rpc the new server sees will have
either a clientid or a sessionid. So we want to make sure the new
server will handle either of those correctly.

> Am I correct in assuming that this is read from nfsdcltrack? Is there even
> a need for this database to be synced on each failover, if the client is
> already known from its last failover (only the timestamp would be older)?

So, you're thinking of a case where there's a failover from server A to
server B, then back to server A again, and a single client is
continuously active throughout both failovers?

Here's the sort of case that's a concern:

- A->B failover happens
- client gets a file lock from B
- client loses contact with B (network problem or something)
- B->A failover happens.

At this point, should A allow the client to reclaim its lock? B could
have given up on the client, released its lock, and granted a conflicting
lock to another client. Or it might not have. Neither the client nor A
knows; B is the only one that knows what happened, so we need to get that
database from B to find out.
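
In other words, during grace the decision A has to make boils down to
something like the sketch below (conceptual only, not how nfsd is actually
structured, and the helper names are invented); the lookup at the end is
precisely the question only B's database can answer:

	#include <stdbool.h>

	struct cltrack_db;	/* B's tracking database, handed over on failover */

	/* invented helper: was this client recorded by the previous instance? */
	bool cltrack_client_is_recorded(const struct cltrack_db *db,
					const char *client_owner);

	bool allow_reclaim(const struct cltrack_db *db_from_B,
			   const char *client_owner, bool in_grace)
	{
		if (!in_grace)
			return false;	/* reclaims only accepted during grace */
		/*
		 * Without B's database we can't tell whether B expired this
		 * client and handed its lock to someone else; guessing wrong
		 * means either refusing a valid reclaim or granting a lock
		 * that may conflict.
		 */
		return cltrack_client_is_recorded(db_from_B, client_owner);
	}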

--b.

> > If it's simple active-to-passive failover then I suppose you could
> > arrange for the utsname to be the same too.
>
> I could, but then I wouldn't know which server is active when I log in over
> ssh :)
> What would happen if the 'migration' mount option were modified for v4.1
> mounts not to check the server scope when doing reclaims (as opposed to
> configuring the server scope)? :)
>
> Thanks,
> Saso Slavicic