Return-Path:
Received: from mail-yw0-f189.google.com ([209.85.211.189]:46557 "EHLO
        mail-yw0-f189.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1751619Ab0CIOz4 convert rfc822-to-8bit (ORCPT );
        Tue, 9 Mar 2010 09:55:56 -0500
Received: by ywh27 with SMTP id 27so3239385ywh.22 for ;
        Tue, 09 Mar 2010 06:55:56 -0800 (PST)
In-Reply-To: <20100309145354.GB21862@fieldses.org>
References: <20100309014624.GF2999@fieldses.org> <20100309145354.GB21862@fieldses.org>
Date: Tue, 9 Mar 2010 09:55:52 -0500
Message-ID: <89c397151003090655h7f3465f2u36c60cfa0b580516@mail.gmail.com>
Subject: Re: reboot recovery
From: "William A. (Andy) Adamson"
To: "J. Bruce Fields"
Cc: linux-nfs@vger.kernel.org
Content-Type: text/plain; charset=ISO-8859-1
Sender: linux-nfs-owner@vger.kernel.org
List-ID:
MIME-Version: 1.0

On Tue, Mar 9, 2010 at 9:53 AM, J. Bruce Fields wrote:
> On Tue, Mar 09, 2010 at 09:46:04AM -0500, Andy Adamson wrote:
>>
>> On Mar 8, 2010, at 8:46 PM, J. Bruce Fields wrote:
>>
>>> The Linux server's reboot recovery code has long-standing architectural
>>> problems, fails to adhere to the specifications in some cases, and does
>>> not yet handle NFSv4.1 reboot recovery.  An overhaul has been a
>>> long-standing todo.
>>>
>>> This is my attempt to state the problem and a rough solution.
>>>
>>> Requirements
>>> ^^^^^^^^^^^^
>>>
>>> Requirements, as compared to current code:
>>>
>>>      - Correctly implements the algorithm described in section 8.6.3
>>>        of rfc 3530, and eliminates known race conditions on recovery.
>>>      - Does not attempt to manage files and directories directly from
>>>        inside the kernel.
>>>      - Supports RECLAIM_COMPLETE.
>>>
>>> Requirements, in more detail:
>>>
>>> A "server instance" is the lifetime from start to shutdown of a server;
>>> a reboot ends one server instance and starts another.  Normally a server
>>> instance consists of a grace period followed by a period of normal
>>> operation.
>>> However, a server could go down before the grace period
>>> completes.  Call a server instance that completes the grace period
>>> "full", and one that does not "partial".
>>>
>>> Call a client "active" if it holds unexpired state on the server.
>>> Then:
>>>
>>>      - An NFSv4.0 client becomes active as soon as it successfully
>>>        performs its first OPEN_CONFIRM, or its first reclaim OPEN.
>>>      - An NFSv4.1 client becomes active when it successfully performs
>>>        its first OPEN, or a RECLAIM_COMPLETE.
>>
>> RFC 5661 in section 18.51.3
>>
>>    Whenever a client establishes a new client ID and before it does the
>>    first non-reclaim operation that obtains a lock, it MUST send a
>>    RECLAIM_COMPLETE with rca_one_fs set to FALSE, even if there are no
>>    locks to reclaim.  If non-reclaim locking operations are done before
>>    the RECLAIM_COMPLETE, an NFS4ERR_GRACE error will be returned.
>>
>> So there will never be a 'first OPEN' (except for an OPEN reclaim)
>> without a RECLAIM_COMPLETE.
>
> There will be in the case of an entirely new client, or a client that
> missed the grace period completely.

No, the MUST above applies both to a new client and to a client that
missed the grace period completely. In both cases the client is
establishing a new client ID.

-->Andy

>
> (But I should have specified "first non-reclaim OPEN" in the 4.1 case,
> not just "first OPEN".)
>
> --b.
>
>>
>>
>>>      - Active clients become inactive when they expire.  (Or when
>>>        they are revoked--but the Linux server does not currently
>>>        support revocation.)
>>>      - On startup all clients are initially inactive.
>>>
>>> On startup the server needs access to the list of clients which are
>>> permitted to reclaim state.  That list is exactly the list of clients
>>> that were active at the end of the most recent full server instance.
>>>
>>> To maintain such a list, we need records to be stored in stable
>>> storage.
>>> Whenever a client changes from inactive to active, or active to
>>> inactive, stable storage must be updated, and until the update has
>>> completed the server must do nothing that acknowledges the new state.
>>> So:
>>>
>>>      - When a new client becomes active, a record for that client
>>>        must be created in stable storage before responding to the rpc
>>>        in question (OPEN, OPEN_CONFIRM, or RECLAIM_COMPLETE).
>>>      - When a client expires, the record must be removed (or
>>>        otherwise marked expired) before responding to any requests
>>>        for locks or other state which would conflict with state held
>>>        by the expiring client.
>>>
>>> Updates must be made by upcalls to userspace; the kernel will not be
>>> directly involved in managing stable storage.  The upcall interface
>>> should be extensible.
>>>
>>> The records must include the client owner name, to allow identifying
>>> clients on restart.  The protocol allows client owner names to consist
>>> of up to 1024 bytes of binary data.  (This is the client-supplied
>>> long form, not the server-generated shorthand clientid; co_ownerid for
>>> 4.1).
>>>
>>> Also desirable, but not absolutely required in the first
>>> implementation:
>>>
>>>      - We should not take the state lock while waiting for records to
>>>        be stored.  (Doing so blocks all other stateful operations
>>>        while we wait for disk.)
>>>      - The server should be able to end the grace period early when
>>>        the list of clients allowed to reclaim is empty, or when they
>>>        are all 4.1 clients, after all have sent RECLAIM_COMPLETE.
>>>      - We should allow pluggable methods for storage of reboot recovery
>>>        records, as the NFSv2 and NFSv3 code currently does (in order
>>>        to support high-availability).
>>>
>>> Possibly also desirable:
>>>
>>>      - Record the principal that originally created the client, and
>>>        whether it had EXCHGID4_FLAG_BIND_PRINC_STATEID (see rfc 5661
>>>        section 8.4.2.1).
>>>
>>> Draft design
>>> ^^^^^^^^^^^^
>>>
>>> We will modify rpc.statd to manage state in userspace.
>>>
>>> Previous prototype code from CITI will be considered as a starting
>>> point.
>>>
>>> Kernel<->user communication will use four files in the "nfsd"
>>> filesystem.  All of them will use the encoding used for rpc cache
>>> upcalls and downcalls, which consist of whitespace-separated fields
>>> escaped as necessary to allow binary data.
>>>
>>> Three of them will be used for upcalls; statd reads requests from them,
>>> and writes responses back:
>>>
>>> create_client:
>>>      - given a client owner, returns an error.  Does not return until
>>>        a new record has safely been recorded on disk.
>>>
>>> grace_done:
>>>      - request and reply are both empty; rpc.statd returns only after
>>>        it has recorded to disk the fact that the grace period
>>>        completed.
>>>
>>> expire_client:
>>>      - given a client owner, replies with an empty reply.  Replies
>>>        only after it has recorded to disk the fact that the client
>>>        has expired.
>>>
>>> One additional file will be used for a downcall:
>>>
>>> allow_client:
>>>      - before starting the server, statd will open this file, write a
>>>        newline-separated list of client owners permitted to recover,
>>>        then close the file.  If no clients are allowed to recover, it
>>>        will still open and close the file.
>>>
>>> Statd will use the presence of these upcalls to determine whether the
>>> server supports the new recovery mechanism.  nfsd may use rpc.statd's
>>> open of allow_client to decide whether userspace supports the new
>>> mechanism.  This allows a mismatched kernel and userspace to still
>>> maintain reboot recovery records.
>>>
>>> In addition, we could support seamless reboot recovery across the
>>> transition to the new system by making statd convert between on-disk
>>> formats.  However, for simplicity's sake we plan for the server to
>>> refuse all reclaims on the first boot after the transition.
>>>
>>> By default, statd will store records as files in the directory
>>> /var/lib/nfs/v4clients.  The file name will be a hash of the
>>> client_owner, and the contents will consist of two newline-separated
>>> fields:
>>>      - The client owner, encoded as in the upcall.
>>>      - A timestamp.
>>>
>>> More fields may be added in the future.
>>>
>>> Before starting the server, and writing to allow_client, statd will
>>> manage boot times and old clients using files in /var/lib/nfs:
>>>
>>>      If boot_time exists:
>>>              - It will be read, and the contents interpreted as an
>>>                ascii-encoded unix time in seconds.
>>>              - All client records older than that time will be removed.
>>>              - The current boot_time will be recorded to
>>>                new_boot_time (replacing any existing such file).
>>>              - All remaining clients will be written to allow_client.
>>>      If boot_time does not exist, an empty /var/lib/nfs/v4clients/ is
>>>              created if necessary, but nothing else is done.
>>>
>>> Statd will then wait for create_client, expire_client, and grace_done
>>> calls.  On grace_done, it will rename boot_time to old_boot_time, and
>>> new_boot_time to boot_time.
>>>
>>> --b.
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
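
For reference, the statd startup and grace_done handling quoted above can
be sketched roughly as follows. This is only an illustration, not code from
the thread, and it rests on several assumptions the draft does not state:
the function names are hypothetical, SHA-256 stands in for the unspecified
file-name hash, the record timestamp is taken to be an ascii unix time, "the
current boot_time" is read as the time of the current boot, and a rename
whose source file is missing is simply skipped. The function returns the
allow list instead of writing it to allow_client.

```python
import hashlib
import os
from pathlib import Path
from typing import List


def write_record(state_dir: Path, owner: str, stamp: int) -> Path:
    """Store one client record: the owner line, then the timestamp line.

    `owner` is assumed to be already encoded as in the upcall. SHA-256 is
    an assumption; the draft only says the name is a hash of the owner.
    """
    clients_dir = state_dir / "v4clients"
    clients_dir.mkdir(parents=True, exist_ok=True)
    path = clients_dir / hashlib.sha256(owner.encode()).hexdigest()
    with open(path, "w") as f:
        f.write(f"{owner}\n{stamp}\n")
        f.flush()
        os.fsync(f.fileno())  # the draft requires the record to be on disk
    return path


def prepare_allow_list(state_dir: Path, now: int) -> List[str]:
    """Run the pre-start steps; return the owners allowed to reclaim."""
    clients_dir = state_dir / "v4clients"
    clients_dir.mkdir(parents=True, exist_ok=True)
    boot_time = state_dir / "boot_time"
    if not boot_time.exists():
        # No previous boot recorded: nothing else is done.
        return []
    cutoff = int(boot_time.read_text().strip())
    allowed = []
    for record in sorted(clients_dir.iterdir()):
        owner, stamp = record.read_text().split("\n")[:2]
        if int(stamp) < cutoff:
            record.unlink()  # record is older than the recorded boot time
        else:
            allowed.append(owner)
    # Replaces any existing new_boot_time file.
    (state_dir / "new_boot_time").write_text(f"{now}\n")
    return allowed


def grace_done(state_dir: Path) -> None:
    """Rotate boot_time -> old_boot_time and new_boot_time -> boot_time."""
    boot_time = state_dir / "boot_time"
    new_boot_time = state_dir / "new_boot_time"
    if boot_time.exists():  # assumption: skip a rename with no source
        os.replace(boot_time, state_dir / "old_boot_time")
    if new_boot_time.exists():
        os.replace(new_boot_time, boot_time)
```

Read literally, the draft never creates boot_time on a system that has none,
since new_boot_time is only written when boot_time already exists; the
sketch keeps that behaviour rather than guessing at the intended fix.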