Date: Mon, 9 Oct 2023 10:44:38 -0700
From: "Darrick J. Wong"
To: John Garry
Cc: linux-kernel@vger.kernel.org, linux-api@vger.kernel.org, martin.petersen@oracle.com, david@fromorbit.com, himanshu.madhani@oracle.com
Subject: Re: [PATCH 2/4] readv.2: Document RWF_ATOMIC flag

On Fri, Sep 29, 2023 at 09:37:15AM +0000, John Garry wrote:
> From: Himanshu Madhani
>
> Add RWF_ATOMIC flag description for pwritev2().
>
> Signed-off-by: Himanshu Madhani
> #jpg: complete rewrite
> Signed-off-by: John Garry
> ---
>  man2/readv.2 | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 45 insertions(+)
>
> diff --git a/man2/readv.2 b/man2/readv.2
> index fa9b0e4e44a2..ff09f3bc9792 100644
> --- a/man2/readv.2
> +++ b/man2/readv.2
> @@ -193,6 +193,51 @@ which provides lower latency, but may use additional resources.
>  .B O_DIRECT
>  flag.)
>  .TP
> +.BR RWF_ATOMIC " (since Linux 6.7)"
> +Allows block-based filesystems to indicate that write operations will be issued

"Require regular file write operations to be issued with torn write
protection."

> +with torn-write protection. Torn-write protection means that for a power or any
> +other hardware failure, all or none of the data from the write will be stored,
> +but never a mix of old and new data. This flag is meaningful only for
> +.BR pwritev2 (),
> +and its effect applies only to the data range written by the system call.
> +The total write length must be power-of-2 and must be sized between
> +stx_atomic_write_unit_min and stx_atomic_write_unit_max, both inclusive. The
> +write must be at a natural offset within the file with respect to the total

What is a "natural" offset?  That should be defined with more
specificity.

Does that mean that the position of a XX-KiB write must also be aligned
to XX-KiB?  e.g. a 32K untorn write can only start at a multiple of 32K?
What if the device supports untorn writes between 4K and 64K, does that
mean I /cannot/ issue a 32K untorn write at offset 48K?

> +write length. Torn-write protection only works with
> +.B O_DIRECT
> +flag, i.e. buffered writes are not supported. To guarantee consistency from
> +the write between a file's in-core state with the storage device,
> +.BR fdatasync (2)
> +or
> +.BR fsync (2)
> +or
> +.BR open (2)
> +and
> +.B O_SYNC
> +or
> +.B O_DSYNC
> +or
> +.B pwritev2 ()
> +flag
> +.B RWF_SYNC
> +or
> +.B RWF_DSYNC
> +is required.

I'm starting to think that this manpage shouldn't be restating
durability information here.

"Application programs with data or file integrity completion
requirements must configure synchronous writes with the DSYNC or SYNC
flags, as explained above."

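For illustration, here is a minimal sketch of one possible reading of the
length and alignment rules questioned above. It ASSUMES that "natural
offset" means the file offset is a multiple of the total write length; the
quoted patch text does not confirm this. The helper names (is_pow2,
untorn_write_ok) are hypothetical, and unit_min/unit_max merely stand in
for the stx_atomic_write_unit_min and stx_atomic_write_unit_max values
named in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if v is a nonzero power of two. */
static bool is_pow2(uint64_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/*
 * Hypothetical userspace pre-check, not kernel code.
 *
 * ASSUMPTION: "natural offset" means offset % len == 0.
 * unit_min/unit_max stand in for stx_atomic_write_unit_min/max.
 */
static bool untorn_write_ok(uint64_t offset, uint64_t len,
			    uint64_t unit_min, uint64_t unit_max)
{
	assert(unit_min <= unit_max);

	if (!is_pow2(len))
		return false;		/* length must be a power of 2 */
	if (len < unit_min || len > unit_max)
		return false;		/* bounds are inclusive */
	return offset % len == 0;	/* assumed meaning of "natural" */
}
```

Under that assumption, untorn_write_ok(48 * 1024, 32 * 1024, 4096, 65536)
is false, i.e. a 32K write at offset 48K would be rejected, whereas a 16K
write at the same offset would pass. Whether that restriction is actually
intended is exactly the open question.
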
> +For when regular files are opened with
> +.BR open (2)
> +but without
> +.B O_SYNC
> +or
> +.B O_DSYNC
> +and the
> +.BR pwritev2()
> +call is made without
> +.B RWF_SYNC
> +or
> +.BR RWF_DSYNC
> +set, the range metadata must already be flushed to storage and the data range
> +must not be in unwritten state, shared, a preallocation, or a hole.

I think that we can drop all of these flag requirements, since the
contiguous small space allocation requirement means that the fs can
provide all-or-nothing writes even if metadata updates are needed:

If the file range is allocated and marked unwritten (i.e. a
preallocation), the ioend will clear the unwritten bit from the file
mapping atomically.  After a crash, the application sees either zeroes
or all the data that was written.

If the file range is shared, the ioend will map the COW staging extent
into the file atomically.  After a crash, the application sees either
the old contents from the old blocks, or the new contents from the new
blocks.

If the file range is a sparse hole, the directio setup will allocate
space and create an unwritten mapping before issuing the write bio.  The
rest of the process works the same as preallocations and has the same
behaviors.

If the file range is allocated and was previously written, the write is
issued and that's all that's needed from the fs.  After a crash, reads
of the storage device produce the old contents or the new contents.

Summarizing:

An (ATOMIC|SYNC) request provides the strongest guarantees (data
will not be torn, and all file metadata updates are persisted before
the write is returned to userspace).  Programs see either the old data
or the new data, even if there's a crash.

(ATOMIC|DSYNC) is less strong -- data will not be torn, and any file
updates for just that region are persisted before the write is returned.

(ATOMIC) is the least strong -- data will not be torn.
Neither the filesystem nor the device makes guarantees that anything
ended up on stable storage, but if it does, programs see either the old
data or the new data.

Maybe we should rename the whole UAPI s/atomic/untorn/...

--D

> +.TP
>  .BR RWF_SYNC " (since Linux 4.7)"
>  .\" commit e864f39569f4092c2b2bc72c773b6e486c7e3bd9
>  Provide a per-write equivalent of the
> --
> 2.31.1
>
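
The three guarantee levels summarized in the reply can be written down as
flag combinations for the last argument of pwritev2(). This is only a
sketch: the enum and the untorn_flags() helper are hypothetical names, and
the fallback value for RWF_ATOMIC is an ASSUMPTION (the flag was still a
proposal in this thread), so check the installed kernel headers before
relying on it.

```c
#define _GNU_SOURCE
#include <sys/uio.h>

/* Fallbacks for headers that do not expose these flags. */
#ifndef RWF_DSYNC
#define RWF_DSYNC 0x00000002
#endif
#ifndef RWF_SYNC
#define RWF_SYNC 0x00000004
#endif
#ifndef RWF_ATOMIC
#define RWF_ATOMIC 0x00000040	/* ASSUMPTION: verify against linux/fs.h */
#endif

/* Hypothetical names for the three levels described in the reply. */
enum untorn_level {
	UNTORN_ONLY,	/* not torn; no durability promise on return */
	UNTORN_DSYNC,	/* not torn; updates for that region persisted */
	UNTORN_SYNC,	/* not torn; all file metadata updates persisted */
};

/* Build the pwritev2() flags argument for the requested level. */
static int untorn_flags(enum untorn_level level)
{
	switch (level) {
	case UNTORN_SYNC:
		return RWF_ATOMIC | RWF_SYNC;
	case UNTORN_DSYNC:
		return RWF_ATOMIC | RWF_DSYNC;
	case UNTORN_ONLY:
	default:
		return RWF_ATOMIC;
	}
}
```

A caller would pass the result as the flags argument, e.g.
pwritev2(fd, &iov, 1, offset, untorn_flags(UNTORN_DSYNC)), on a file
opened with O_DIRECT as the quoted patch text requires.
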