2011-02-11 12:18:31

by Ferenc Wagner

Subject: server does not abort grace period

Hi,

We're running 2.6.32 (Debian squeeze) NFS4 server and clients. The
server boots and runs purely from SAN, so we can start it on different
computers. In case of such "hardware failovers" I'd expect the clients
to quickly reclaim their locks (if any) and thus the server to abort
its 90-second grace period early. However, this does not happen,
ruining our HA like, totally.

So, the questions: is the functionality of aborting the grace period
early missing from version 2.6.32 of the Linux kernel? If yes, is it
present in any kernel version? If it should work, could someone offer
some advice on debugging it? If it isn't supported, what's the
best practice of providing highly available NFSv4 today?
--
Thanks,
Feri.


2011-02-25 16:51:44

by Ferenc Wagner

Subject: Re: server does not abort grace period

"J. Bruce Fields" <[email protected]> writes:

> On Thu, Feb 24, 2011 at 06:06:00PM +0100, Ferenc Wagner wrote:
>> "J. Bruce Fields" <[email protected]> writes:
>>
>>> On Tue, Feb 22, 2011 at 06:05:14PM +0100, Ferenc Wagner wrote:
>>>
>>>> "J. Bruce Fields" <[email protected]> writes:
>>>>
>>>>> - In the NFSv4.1 case there is a "reclaim complete" rpc that
>>>>> clients are required to send. Currently we don't take
>>>>> advantage of that to end the grace period early, but we
>>>>> should. That's no help for 4.0 clients.
>>>>
>>>> /proc/fs/nfsd/versions shows +4.1 on the server, does this mean that
>>>> nfs4 type Linux client mounts should issue "reclaim complete"?
>>>
>>> It means that 4.1 is supported, so a client *could* use 4.1 if it
>>> asked to. And if it did use 4.1, yes, it would be required to issue
>>> reclaim complete. Current linux clients do not use 4.1 unless you
>>> explicitly ask for it on the mount commandline.
>>
>> I can't find any mention of 4.1 in man nfs (nfs-common version 1.2.2),
>> is there an undocumented nfsvers=4.1 mount option or some other means?
>
> -ominorversion=1

Hmm, looks like this feature didn't make it into the squeeze version of
mount.nfs4. And it's disabled in the kernel, anyway.

>>> (Aside: the server really shouldn't have +4.1 by default, as the 4.1
>>> server is not done. We should fix that; which distro are you using?)
>>
>> Debian squeeze. If it's switchable, then it's possible I switched it
>> on, I can't remember. However, 4.1 client support is disabled in the
>> stock kernel config, and 4.1 server support isn't even mentioned:
>
> There's no separate config option, but the kernel keeps it off by
> default. I think nfs-utils is overriding the kernel's default. We
> should fix that.

At least I can't find any occurrence of fs/nfsd/version in my startup and
config scripts.
--
Regards,
Feri.

2011-02-23 19:52:53

by J. Bruce Fields

Subject: Re: server does not abort grace period

On Tue, Feb 22, 2011 at 06:05:14PM +0100, Ferenc Wagner wrote:
> "J. Bruce Fields" <[email protected]> writes:
> > The NFSv4.0 protocol doesn't provide any way for clients to tell the
> > server that they have finished recovering; as long as *any* clients
> > held state on the previous server instance, the new server is stuck
> > waiting out the whole grace period. Some things we could do:
> >
> > - We could at least recognize the case where *no* clients held
> > state before, and end the grace period early in that case.
>
> Would this mean that /var/lib/nfs/v4recovery is empty on the server?

Right.

> Actually, it contains a hex-named empty directory, sometimes two (we're
> running with two clients at the moment).
>
> > - In the NFSv4.1 case there is a "reclaim complete" rpc that
> > clients are required to send. Currently we don't take
> > advantage of that to end the grace period early, but we
> > should. That's no help for 4.0 clients.
>
> /proc/fs/nfsd/versions shows +4.1 on the server, does this mean that
> nfs4 type Linux client mounts should issue "reclaim complete"?

It means that 4.1 is supported, so a client *could* use 4.1 if it
asked to. And if it did use 4.1, yes, it would be required to issue
reclaim complete. Current linux clients do not use 4.1 unless you
explicitly ask for it on the mount commandline.

(Aside: the server really shouldn't have +4.1 by default, as the 4.1
server is not done. We should fix that; which distro are you using?)
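
A minimal sketch for checking what the server advertises, assuming the
versions file holds space-separated tokens such as "+2 +3 +4 +4.1" (a
leading "-" marking a disabled version). The function name and the
optional file argument are illustrative only, not part of nfs-utils:

```shell
#!/bin/sh
# Return 0 if the given NFS version appears as "+<version>" in the
# nfsd versions file, 1 otherwise.
#   $1 = version to look for, e.g. 4.1
#   $2 = versions file (assumed default: /proc/fs/nfsd/versions)
nfsd_version_enabled() {
    want=$1
    file=${2:-/proc/fs/nfsd/versions}
    for tok in $(cat "$file"); do
        if [ "$tok" = "+$want" ]; then
            return 0
        fi
    done
    return 1
}
```

For example, `nfsd_version_enabled 4.1 && echo "4.1 is advertised"` on
the server. An advertised 4.1 only says that a client could ask for it.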

> I see
> that it won't help anyway at the moment, lacking server support, just
> out of interest...
>
> > - We could record a count of all locks/opens held in stable
> > storage and use that to decide when a client is done
> > recovering. That would be complicated and risk slowing down
> > normal opens and locks a lot.
>
> And the "reclaim complete" client RPC seems much better anyway, as the
> server and the client may get out of sync in case of an unclean client
> shutdown.
>
> > I don't think decreasing the lease time would be so terrible. Perhaps
> > the default should even be a little less.
>
> Fine, then. Does the Linux nfs server implementation use the lease time
> of the previous server instance as grace period on startup, or does it
> simply take whatever it finds in /proc/fs/nfsd/nfsv4leasetime?

The latest server has separately tunable "nfsv4gracetime" and
"nfsv4leasetime", and if you want to be careful, you should:

- stop the server
- set nfsv4gracetime to the *previous* lease time
- set nfsv4leasetime to the *new* lease time
- start the server

That gives you the new (lower) lease time while still giving a
sufficiently long grace period for clients who only knew about the old
time to recover. After doing that once, on future restarts you can use
the shorter time for both.
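
The steps above could be sketched roughly as follows, assuming both
files live under /proc/fs/nfsd and take a value in seconds. The function
name, the optional directory argument and the 30-second value in the
comment are illustrative only; how the server is stopped and started
depends on the distribution and is left out:

```shell
#!/bin/sh
# Write the grace and lease times while nfsd is stopped.
#   $1 = previous lease time (becomes the grace period for this restart)
#   $2 = new lease time
#   $3 = nfsd control directory (assumed default: /proc/fs/nfsd)
set_nfsd_times() {
    old_lease=$1
    new_lease=$2
    dir=${3:-/proc/fs/nfsd}
    echo "$old_lease" > "$dir/nfsv4gracetime" || return 1
    echo "$new_lease" > "$dir/nfsv4leasetime" || return 1
}

# Intended sequence:
#   1. stop the NFS server
#   2. set_nfsd_times 90 30     (90 = old lease, 30 = hypothetical new one)
#   3. start the NFS server
```

On later restarts the same function can be called with the shorter value
for both arguments.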

Probably we should write utilities which do this right for you....

--b.

2011-02-22 01:11:45

by J. Bruce Fields

Subject: Re: server does not abort grace period

On Mon, Feb 21, 2011 at 08:54:24PM +0100, Ferenc Wagner wrote:
> Ferenc Wagner <[email protected]> writes:
>
> > We're running 2.6.32 (Debian squeeze) NFS4 server and clients. The
> > server boots and runs purely from SAN, so we can start it on different
> > computers. In case of such "hardware failovers" I'd expect the clients
> > to quickly reclaim their locks (if any) and thus the server to abort
> > its 90-second grace period early. However, this does not happen,
> > ruining our HA like, totally.
> >
> > So, the questions: is the functionality of aborting the grace period
> > early missing from version 2.6.32 of the Linux kernel? If yes, is it
> > present in any kernel version? If it should work, could someone offer
> > some advice on debugging it? If it isn't supported, what's the
> > best practice of providing highly available NFSv4 today?
>
> Hi,
>
> Could somebody please share any related wisdom? Pretty please?
> In short, how to fight grace period in a HA NFS4 setup?
> Decreasing it (of course after cutting the lock lease time) seems a
> rather big hammer, I'd like to avoid using it if reasonably possible.

The NFSv4.0 protocol doesn't provide any way for clients to tell the
server that they have finished recovering; as long as *any* clients held
state on the previous server instance, the new server is stuck waiting
out the whole grace period. Some things we could do:

- We could at least recognize the case where *no* clients held
state before, and end the grace period early in that case.
- In the NFSv4.1 case there is a "reclaim complete" rpc that
clients are required to send. Currently we don't take
advantage of that to end the grace period early, but we
should. That's no help for 4.0 clients.
- We could record a count of all locks/opens held in stable
storage and use that to decide when a client is done
recovering. That would be complicated and risk slowing down
normal opens and locks a lot.

In short, it's hard.

I don't think decreasing the lease time would be so terrible. Perhaps
the default should even be a little less.

--b.

2011-02-24 17:30:15

by J. Bruce Fields

Subject: Re: server does not abort grace period

On Thu, Feb 24, 2011 at 06:06:00PM +0100, Ferenc Wagner wrote:
> "J. Bruce Fields" <[email protected]> writes:
>
> > On Tue, Feb 22, 2011 at 06:05:14PM +0100, Ferenc Wagner wrote:
> >
> >> "J. Bruce Fields" <[email protected]> writes:
> >>
> >>> - In the NFSv4.1 case there is a "reclaim complete" rpc that
> >>> clients are required to send. Currently we don't take
> >>> advantage of that to end the grace period early, but we
> >>> should. That's no help for 4.0 clients.
> >>
> >> /proc/fs/nfsd/versions shows +4.1 on the server, does this mean that
> >> nfs4 type Linux client mounts should issue "reclaim complete"?
> >
> > It means that 4.1 is supported, so a client *could* use 4.1 if it
> > asked to. And if it did use 4.1, yes, it would be required to issue
> > reclaim complete. Current linux clients do not use 4.1 unless you
> > explicitly ask for it on the mount commandline.
>
> I can't find any mention of 4.1 in man nfs (nfs-common version 1.2.2),
> is there an undocumented nfsvers=4.1 mount option or some other means?

-ominorversion=1
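
As a rough illustration of where that option goes, here is a sketch that
only prints a mount command line rather than running it. The server
name, export path and mount point are placeholders, the function name is
made up, and it assumes a client kernel and mount.nfs that actually
support the option:

```shell
#!/bin/sh
# Print (do not run) a mount command asking for NFSv4 minor version 1.
#   $1 = server name, $2 = exported path, $3 = local mount point
nfs41_mount_cmd() {
    server=$1
    export_path=$2
    mountpoint=$3
    echo "mount -t nfs4 -o minorversion=1 $server:$export_path $mountpoint"
}

# Placeholder values, for illustration only.
nfs41_mount_cmd nfs.example.org /export /mnt
```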

> > (Aside: the server really shouldn't have +4.1 by default, as the 4.1
> > server is not done. We should fix that; which distro are you using?)
>
> Debian squeeze. If it's switchable, then it's possible I switched it
> on, I can't remember. However, 4.1 client support is disabled in the
> stock kernel config, and 4.1 server support isn't even mentioned:

There's no separate config option, but the kernel keeps it off by
default. I think nfs-utils is overriding the kernel's default. We
should fix that.

--b.

2011-02-24 17:06:01

by Ferenc Wagner

Subject: Re: server does not abort grace period

"J. Bruce Fields" <[email protected]> writes:

> On Tue, Feb 22, 2011 at 06:05:14PM +0100, Ferenc Wagner wrote:
>
>> "J. Bruce Fields" <[email protected]> writes:
>>
>>> - In the NFSv4.1 case there is a "reclaim complete" rpc that
>>> clients are required to send. Currently we don't take
>>> advantage of that to end the grace period early, but we
>>> should. That's no help for 4.0 clients.
>>
>> /proc/fs/nfsd/versions shows +4.1 on the server, does this mean that
>> nfs4 type Linux client mounts should issue "reclaim complete"?
>
> It means that 4.1 is supported, so a client *could* use 4.1 if it
> asked to. And if it did use 4.1, yes, it would be required to issue
> reclaim complete. Current linux clients do not use 4.1 unless you
> explicitly ask for it on the mount commandline.

I can't find any mention of 4.1 in man nfs (nfs-common version 1.2.2),
is there an undocumented nfsvers=4.1 mount option or some other means?

> (Aside: the server really shouldn't have +4.1 by default, as the 4.1
> server is not done. We should fix that; which distro are you using?)

Debian squeeze. If it's switchable, then it's possible I switched it
on, I can't remember. However, 4.1 client support is disabled in the
stock kernel config, and 4.1 server support isn't even mentioned:

$ fgrep NFS /boot/config-2.6.32-5-686
CONFIG_NFS_FS=m
CONFIG_NFS_V3=y
CONFIG_NFS_V3_ACL=y
CONFIG_NFS_V4=y
# CONFIG_NFS_V4_1 is not set
CONFIG_NFS_FSCACHE=y
CONFIG_NFSD=m
CONFIG_NFSD_V2_ACL=y
CONFIG_NFSD_V3=y
CONFIG_NFSD_V3_ACL=y
CONFIG_NFSD_V4=y
CONFIG_NFS_ACL_SUPPORT=m
CONFIG_NFS_COMMON=y
CONFIG_NCPFS_NFS_NS=y
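
The same check can be scripted. This is a minimal sketch, assuming the
config file uses the usual "OPTION=y" / "OPTION=m" / "# OPTION is not
set" lines shown above; the function name and the default path built
from `uname -r` are assumptions:

```shell
#!/bin/sh
# Return 0 if a kernel config option is built in (=y) or modular (=m).
#   $1 = option name, e.g. CONFIG_NFS_V4_1
#   $2 = config file (assumed default: /boot/config-<running kernel>)
kernel_config_enabled() {
    option=$1
    config=${2:-/boot/config-$(uname -r)}
    grep -Eq "^${option}=(y|m)$" "$config"
}
```

With the listing above, `kernel_config_enabled CONFIG_NFS_V4_1` would
fail, while `kernel_config_enabled CONFIG_NFSD_V4` would succeed.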

>> Does the Linux nfs server implementation use the lease time of the
>> previous server instance as grace period on startup, or does it
>> simply take whatever it finds in /proc/fs/nfsd/nfsv4leasetime?
>
> The latest server has separately tunable "nfsv4gracetime" and
> "nfsv4leasetime", and if you want to be careful, you should:
>
> - stop the server
> - set nfsv4gracetime to the *previous* lease time
> - set nfsv4leasetime to the *new* lease time
> - start the server
>
> That gives you the new (lower) lease time while still giving a
> sufficiently long grace period for clients who only knew about the old
> time to recover. After doing that once, on future restarts you can use
> the shorter time for both.

Yes, this is exactly where I was going to (and what's recommended in the
RFC). Good to hear it's already implemented!

> Probably we should write utilities which do this right for you....

No worries, I won't be changing the lease time that frequently. :)
--
Thanks a lot,
Feri.

2011-02-22 17:05:16

by Ferenc Wagner

Subject: Re: server does not abort grace period

"J. Bruce Fields" <[email protected]> writes:

First of all, thank you very much for the detailed and useful reply!

> On Mon, Feb 21, 2011 at 08:54:24PM +0100, Ferenc Wagner wrote:
>
>> Ferenc Wagner <[email protected]> writes:
>>
>>> We're running 2.6.32 (Debian squeeze) NFS4 server and clients. The
>>> server boots and runs purely from SAN, so we can start it on different
>>> computers. In case of such "hardware failovers" I'd expect the clients
>>> to quickly reclaim their locks (if any) and thus the server to abort
>>> its 90-second grace period early. However, this does not happen,
>>> ruining our HA like, totally.
>>>
>>> So, the questions: is the functionality of aborting the grace period
>>> early missing from version 2.6.32 of the Linux kernel? If yes, is it
>>> present in any kernel version? If it should work, could someone offer
>>> some advice on debugging it? If it isn't supported, what's the
>>> best practice of providing highly available NFSv4 today?
>>
>> Could somebody please share any related wisdom? Pretty please?
>> In short, how to fight grace period in a HA NFS4 setup?
>> Decreasing it (of course after cutting the lock lease time) seems a
>> rather big hammer, I'd like to avoid using it if reasonably possible.
>
> The NFSv4.0 protocol doesn't provide any way for clients to tell the
> server that they have finished recovering; as long as *any* clients
> held state on the previous server instance, the new server is stuck
> waiting out the whole grace period. Some things we could do:
>
> - We could at least recognize the case where *no* clients held
> state before, and end the grace period early in that case.

Would this mean that /var/lib/nfs/v4recovery is empty on the server?
Actually, it contains a hex-named empty directory, sometimes two (we're
running with two clients at the moment).
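
A small sketch of how the directory could be inspected, assuming each
subdirectory stands for one client that held state. The function name
and the optional directory argument are illustrative only:

```shell
#!/bin/sh
# Count the subdirectories (assumed: one per client record) in the
# NFSv4 recovery directory and print the number.
#   $1 = recovery directory (default: /var/lib/nfs/v4recovery)
count_recovery_records() {
    dir=${1:-/var/lib/nfs/v4recovery}
    count=0
    for entry in "$dir"/*; do
        # An unmatched glob stays literal and fails the -d test.
        if [ -d "$entry" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}
```

A result of 0 would correspond to the "no clients held state" case.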

> - In the NFSv4.1 case there is a "reclaim complete" rpc that
> clients are required to send. Currently we don't take
> advantage of that to end the grace period early, but we
> should. That's no help for 4.0 clients.

/proc/fs/nfsd/versions shows +4.1 on the server, does this mean that
nfs4 type Linux client mounts should issue "reclaim complete"? I see
that it won't help anyway at the moment, lacking server support, just
out of interest...

> - We could record a count of all locks/opens held in stable
> storage and use that to decide when a client is done
> recovering. That would be complicated and risk slowing down
> normal opens and locks a lot.

And the "reclaim complete" client RPC seems much better anyway, as the
server and the client may get out of sync in case of an unclean client
shutdown.

> I don't think decreasing the lease time would be so terrible. Perhaps
> the default should even be a little less.

Fine, then. Does the Linux nfs server implementation use the lease time
of the previous server instance as grace period on startup, or does it
simply take whatever it finds in /proc/fs/nfsd/nfsv4leasetime?
--
Thanks for taking time,
Feri.

2011-02-21 19:54:30

by Ferenc Wagner

Subject: Re: server does not abort grace period

Ferenc Wagner <[email protected]> writes:

> We're running 2.6.32 (Debian squeeze) NFS4 server and clients. The
> server boots and runs purely from SAN, so we can start it on different
> computers. In case of such "hardware failovers" I'd expect the clients
> to quickly reclaim their locks (if any) and thus the server to abort
> its 90-second grace period early. However, this does not happen,
> ruining our HA like, totally.
>
> So, the questions: is the functionality of aborting the grace period
> early missing from version 2.6.32 of the Linux kernel? If yes, is it
> present in any kernel version? If it should work, could someone offer
> some advice on debugging it? If it isn't supported, what's the
> best practice of providing highly available NFSv4 today?

Hi,

Could somebody please share any related wisdom? Pretty please?
In short, how to fight grace period in a HA NFS4 setup?
Decreasing it (of course after cutting the lock lease time) seems a
rather big hammer, I'd like to avoid using it if reasonably possible.
--
Thanks,
Feri.