In-Reply-To: <20100309145354.GB21862@fieldses.org>
References: <20100309014624.GF2999@fieldses.org>
	 <ABA6EE88-3584-4DC9-A50B-3B0E06C701FB@netapp.com>
	 <20100309145354.GB21862@fieldses.org>
Date: Tue, 9 Mar 2010 09:55:52 -0500
Message-ID: <89c397151003090655h7f3465f2u36c60cfa0b580516@mail.gmail.com>
Subject: Re: reboot recovery
From: "William A. (Andy) Adamson" <androsadamson@gmail.com>
To: "J. Bruce Fields" <bfields@fieldses.org>
Cc: linux-nfs@vger.kernel.org
Content-Type: text/plain; charset=ISO-8859-1
Sender: linux-nfs-owner@vger.kernel.org
MIME-Version: 1.0

On Tue, Mar 9, 2010 at 9:53 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
> On Tue, Mar 09, 2010 at 09:46:04AM -0500, Andy Adamson wrote:
>>
>> On Mar 8, 2010, at 8:46 PM, J. Bruce Fields wrote:
>>
>>> The Linux server's reboot recovery code has long-standing architectural
>>> problems, fails to adhere to the specifications in some cases, and does
>>> not yet handle NFSv4.1 reboot recovery.  An overhaul has been a
>>> long-standing todo.
>>>
>>> This is my attempt to state the problem and a rough solution.
>>>
>>> Requirements
>>> ^^^^^^^^^^^^
>>>
>>> Requirements, as compared to current code:
>>>
>>>      - Correctly implements the algorithm described in section 8.6.3
>>>        of rfc 3530, and eliminates known race conditions on recovery.
>>>      - Does not attempt to manage files and directories directly from
>>>        inside the kernel.
>>>      - Supports RECLAIM_COMPLETE.
>>>
>>> Requirements, in more detail:
>>>
>>> A "server instance" is the lifetime from start to shutdown of a server;
>>> a reboot ends one server instance and starts another.  Normally a server
>>> instance consists of a grace period followed by a period of normal
>>> operation.  However, a server could go down before the grace period
>>> completes.  Call a server instance that completes the grace period
>>> "full", and one that does not "partial".
>>>
>>> Call a client "active" if it holds unexpired state on the server.
>>> Then:
>>>
>>>      - An NFSv4.0 client becomes active as soon as it successfully
>>>        performs its first OPEN_CONFIRM, or its first reclaim OPEN.
>>>      - An NFSv4.1 client becomes active when it successfully performs
>>>        its first OPEN, or a RECLAIM_COMPLETE.
>>
>> RFC 5661 in section 18.51.3
>>
>>    Whenever a client establishes a new client ID and before it does the
>>    first non-reclaim operation that obtains a lock, it MUST send a
>>    RECLAIM_COMPLETE with rca_one_fs set to FALSE, even if there are no
>>    locks to reclaim.  If non-reclaim locking operations are done before
>>    the RECLAIM_COMPLETE, an NFS4ERR_GRACE error will be returned.
>>
>> So there will never be a 'first OPEN' (except for an OPEN reclaim)
>> without a RECLAIM_COMPLETE.
>
> There will be in the case of an entirely new client, or a client that
> missed the grace period completely.

No, the MUST above applies both to a new client and to a client that
missed the grace period completely. In both cases the client is
establishing a new client ID.

-->Andy


>
> (But I should have specified "first non-reclaim OPEN" in the 4.1 case,
> not just "first OPEN".)
>
> --b.
>
>>
>>
>>>      - Active clients become inactive when they expire.  (Or when
>>>        they are revoked--but the Linux server does not currently
>>>        support revocation.)
>>>      - On startup all clients are initially inactive.
>>>
>>> On startup the server needs access to the list of clients which are
>>> permitted to reclaim state.  That list is exactly the list of clients
>>> that were active at the end of the most recent full server instance.
>>>
>>> To maintain such a list, we need records to be stored in stable storage.
>>> Whenever a client changes from inactive to active, or active to
>>> inactive, stable storage must be updated, and until the update has
>>> completed the server must do nothing that acknowledges the new state.
>>> So:
>>>
>>>      - When a new client becomes active, a record for that client
>>>        must be created in stable storage before responding to the rpc
>>>        in question (OPEN, OPEN_CONFIRM, or RECLAIM_COMPLETE).
>>>      - When a client expires, the record must be removed (or
>>>        otherwise marked expired) before responding to any requests
>>>        for locks or other state which would conflict with state held
>>>        by the expiring client.
>>>
>>> Updates must be made by upcalls to userspace; the kernel will not be
>>> directly involved in managing stable storage.  The upcall interface
>>> should be extensible.
>>>
>>> The records must include the client owner name, to allow identifying
>>> clients on restart.  The protocol allows client owner names to consist
>>> of up to 1024 bytes of binary data.  (This is the client-supplied
>>> long form, not the server-generated shorthand clientid; co_ownerid for
>>> 4.1).
>>>
>>> Also desirable, but not absolutely required in the first
>>> implementation:
>>>
>>>      - We should not take the state lock while waiting for records to
>>>        be stored.  (Doing so blocks all other stateful operations
>>>        while we wait for disk.)
>>>      - The server should be able to end the grace period early when
>>>        the list of clients allowed to reclaim is empty, or when they
>>>        are all 4.1 clients, after all have sent RECLAIM_COMPLETE.
>>>      - Will allow pluggable methods for storage of reboot recovery
>>>        records, as the NFSv2 and NFSv3 code currently does (in order
>>>        to support high-availability).
>>>
>>> Possibly also desirable:
>>>
>>>      - Record the principal that originally created the client, and
>>>        whether it had EXCHGID4_FLAG_BIND_PRINC_STATEID (see rfc 5661
>>>        section 8.4.2.1).
>>>
>>> Draft design
>>> ^^^^^^^^^^^^
>>>
>>> We will modify rpc.statd to manage state in userspace.
>>>
>>> Previous prototype code from CITI will be considered as a starting
>>> point.
>>>
>>> Kernel<->user communication will use four files in the "nfsd"
>>> filesystem.  All of them will use the encoding used for rpc cache
>>> upcalls and downcalls, which consist of whitespace-separated fields
>>> escaped as necessary to allow binary data.
>>>
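
To make the field encoding described above concrete, here is a minimal
sketch in Python.  It assumes that bytes which would break the framing
(whitespace, backslash, anything non-printable) are written as
three-digit octal escapes; the draft does not spell out the escaping
rules, so treat that choice and the helper names (escape_field,
unescape_field, encode_message, decode_message) as illustrative
assumptions, not as the actual rpc cache implementation.

```python
def escape_field(data: bytes) -> str:
    """Escape one binary field so it contains no whitespace (assumed \\ooo octal form)."""
    out = []
    for b in data:
        if b in b" \t\n\\" or b < 0x20 or b > 0x7E:
            out.append("\\%03o" % b)   # e.g. a space becomes \040
        else:
            out.append(chr(b))
    return "".join(out)


def unescape_field(text: str) -> bytes:
    """Inverse of escape_field."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == "\\":
            out.append(int(text[i + 1:i + 4], 8))  # three octal digits follow
            i += 4
        else:
            out.append(ord(text[i]))
            i += 1
    return bytes(out)


def encode_message(fields) -> str:
    """Join escaped fields with single spaces; one message per line."""
    return " ".join(escape_field(f) for f in fields) + "\n"


def decode_message(line: str):
    """Split a line back into its binary fields (empty fields are not handled)."""
    return [unescape_field(tok) for tok in line.rstrip("\n").split(" ") if tok]
```

With this, a client owner containing spaces or NUL bytes round-trips
through a single whitespace-separated line.
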
>>> Three of them will be used for upcalls; statd reads requests from them,
>>> and writes responses back:
>>>
>>> create_client:
>>>      - given a client owner, returns an error.  Does not return until
>>>        a new record has safely been recorded on disk.
>>>
>>> grace_done:
>>>      - request and reply are both empty; rpc.statd returns only after
>>>        it has recorded to disk the fact that the grace period
>>>        completed.
>>>
>>> expire_client:
>>>      - given a client owner, replies with an empty reply.  Replies
>>>        only after it has recorded to disk the fact that the client
>>>        has expired.
>>>
>>> One additional file will be used for a downcall:
>>>
>>> allow_client:
>>>      - before starting the server, statd will open this file, write a
>>>        newline-separated list of client owners permitted to recover,
>>>        then close the file.  If no clients are allowed to recover, it
>>>        will still open and close the file.
>>>
>>> Statd will use the presence of these upcalls to determine whether the
>>> server supports the new recovery mechanism.  nfsd may use rpc.statd's
>>> open of allow_client to decide whether userspace supports the new
>>> mechanism.  This allows a mismatched kernel and userspace to still
>>> maintain reboot recovery records.
>>>
>>> In addition, we could support seamless reboot recovery across the
>>> transition to the new system by making statd convert between on-disk
>>> formats.  However, for simplicity's sake we plan for the server to
>>> refuse all reclaims on the first boot after the transition.
>>>
>>> By default, statd will store records as files in the directory
>>> /var/lib/nfs/v4clients.  The file name will be a hash of the
>>> client_owner, and the contents will consist of two newline-separated
>>> fields:
>>>      - The client owner, encoded as in the upcall.
>>>      - A timestamp.
>>>
>>> More fields may be added in the future.
>>>
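
A rough sketch of what writing one such record could look like, in
Python.  The draft names neither the hash nor the timestamp format, so
SHA-256 and integer unix seconds are assumptions here, and hex stands in
for "encoded as in the upcall" only to keep the example self-contained.
The function names are hypothetical.  The fsync calls reflect the
requirement that the record be on stable storage before the server
replies.

```python
import hashlib
import os
import time


def record_path(directory: str, owner: bytes) -> str:
    # File name is a hash of the client owner (SHA-256 is an assumption).
    return os.path.join(directory, hashlib.sha256(owner).hexdigest())


def write_record(directory: str, owner: bytes, now=None) -> str:
    """Write the two-field record and return only once it is durable."""
    now = int(time.time()) if now is None else int(now)
    os.makedirs(directory, exist_ok=True)
    path = record_path(directory, owner)
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="ascii") as f:
        # Field 1: client owner (hex as a stand-in encoding); field 2: timestamp.
        f.write(owner.hex() + "\n" + str(now) + "\n")
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)              # atomically publish the finished record
    dfd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dfd)                  # make the new directory entry durable too
    finally:
        os.close(dfd)
    return path


def read_record(path: str):
    """Return (owner_bytes, timestamp); any later fields are ignored."""
    with open(path, encoding="ascii") as f:
        owner_hex, stamp = f.read().split("\n")[:2]
    return bytes.fromhex(owner_hex), int(stamp)
```

Reading only the first two fields leaves room for more fields to be
appended later without breaking older readers.
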
>>> Before starting the server, and writing to allow_client, statd will
>>> manage boot times and old clients using files in /var/lib/nfs:
>>>
>>>      If boot_time exists:
>>>              - It will be read, and the contents interpreted as an
>>>                ascii-encoded unix time in seconds.
>>>              - All client records older than that time will be removed.
>>>              - The current boot_time will be recorded to
>>>                new_boot_time (replacing any existing such file).
>>>              - All remaining clients will be written to allow_client.
>>>      If boot_time does not exist, an empty /var/lib/nfs/v4clients/ is
>>>              created if necessary, but nothing else is done.
>>>
>>> Statd will then wait for create_client, expire_client, and grace_done
>>> calls.  On grace_done, it will rename boot_time to old_boot_time, and
>>> new_boot_time to boot_time.
>>>
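
The startup and grace_done steps above can be sketched as follows, in
Python.  Assumptions, none of which come from the draft: each record
file holds an encoded owner line followed by a timestamp line, "older
than" means a strictly smaller timestamp, grace_done quietly skips a
rename whose source file is missing (the draft does not say what happens
on a first boot with no boot_time), and the allow_client path is passed
in by the caller.  Function names are hypothetical, and fsync handling
is left out for brevity.

```python
import os


def startup(state_dir: str, now: int):
    """Prune stale records; return the encoded owners allowed to reclaim."""
    clients = os.path.join(state_dir, "v4clients")
    os.makedirs(clients, exist_ok=True)          # created if necessary
    boot_file = os.path.join(state_dir, "boot_time")
    if not os.path.exists(boot_file):
        return []                                # nothing else is done
    with open(boot_file, encoding="ascii") as f:
        boot_time = int(f.read().strip())        # ascii unix time in seconds
    allowed = []
    for name in sorted(os.listdir(clients)):
        path = os.path.join(clients, name)
        with open(path, encoding="ascii") as f:
            owner, stamp = f.read().split("\n")[:2]
        if int(stamp) < boot_time:
            os.unlink(path)                      # record older than boot_time
        else:
            allowed.append(owner)
    # Record the current boot time, replacing any existing new_boot_time.
    with open(os.path.join(state_dir, "new_boot_time"), "w", encoding="ascii") as f:
        f.write("%d\n" % now)
    return allowed


def write_allow_client(path: str, owners) -> None:
    # Opened and closed even when the list is empty.
    with open(path, "w", encoding="ascii") as f:
        f.write("".join(o + "\n" for o in owners))


def grace_done(state_dir: str) -> None:
    """Rotate boot_time -> old_boot_time, then new_boot_time -> boot_time."""
    for src, dst in (("boot_time", "old_boot_time"), ("new_boot_time", "boot_time")):
        s = os.path.join(state_dir, src)
        if os.path.exists(s):                    # assumption: skip missing files
            os.replace(s, os.path.join(state_dir, dst))
```

Because boot_time is only replaced on grace_done, a server instance that
dies during the grace period leaves boot_time untouched, so the next
startup prunes against the last full instance.
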
>>> --b.
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>