Date: Tue, 5 Dec 2023 21:59:08 +0800
From: Ming Lei
To: John Garry
Cc: Christoph Hellwig, axboe@kernel.dk, kbusch@kernel.org, sagi@grimberg.me,
 jejb@linux.ibm.com, martin.petersen@oracle.com, djwong@kernel.org,
 viro@zeniv.linux.org.uk, brauner@kernel.org, chandan.babu@oracle.com,
 dchinner@redhat.com, linux-block@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org,
 linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, tytso@mit.edu,
 jbongio@google.com, linux-api@vger.kernel.org
Subject: Re: [PATCH 17/21] fs: xfs: iomap atomic write support
References: <20230929102726.2985188-1-john.g.garry@oracle.com>
 <20230929102726.2985188-18-john.g.garry@oracle.com>
 <20231109152615.GB1521@lst.de> <20231128135619.GA12202@lst.de>
 <20231204134509.GA25834@lst.de>

On Mon, Dec 04, 2023 at 03:19:15PM +0000, John Garry wrote:
> On 04/12/2023 13:45, Christoph Hellwig wrote:
> > On Tue, Nov 28, 2023 at 05:42:10PM +0000, John Garry wrote:
> > > ok, fine, it would not be required for XFS with CoW. Some concerns still:
> > > a. device atomic write boundary, if any
> > > b. other FSes which do not have CoW support. ext4 is already being used for
> > > "atomic writes" in the field - see dubious amazon torn-write prevention.
> >
> > What is the 'dubious amazon torn-write prevention'?
>
> https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/storage-twp.html
>
> AFAICS, this is without any kernel changes, so no guarantee against unwanted
> splitting or merging of bios.
>
> Anyway, there will still be !CoW FSes which people want to support.
>
> >
> > > About b., we could add the pow-of-2 and file offset alignment requirement
> > > for other FSes, but then need to add some method to advertise that
> > > restriction.
> >
> > We really need a better way to communicate I/O limitations anyway.
> > Something like XFS_IOC_DIOINFO on steroids.
> >
> > > Sure, but to me it is a concern that we have 2x paths to make robust: a.
> > > offload via hw, which may involve CoW; b. no HW support, i.e. CoW always
> >
> > Relying just on the hardware seems very limited, especially as there is
> > plenty of hardware that won't guarantee anything larger than 4k, and
> > plenty of NVMe hardware that has some other small limit like 32k
> > because it doesn't support multiple atomicity mode.
>
> So what would you propose as the next step? Would it be to first achieve
> atomic write support for XFS with HW support + CoW to ensure contiguous
> extents (and without XFS forcealign)?
>
> >
> > > And for no HW support, if we don't follow the O_ATOMIC model of committing
> > > nothing until a SYNC is issued, would we allocate, write, and later free a
> > > new extent for each write, right?
> >
> > Yes. Then again if you do data journaling you do that anyway, and as
> > one little project I'm doing right now shows, data journaling is
> > often the fastest thing we can do for very small writes.
>
> Ignoring FSes, then how is this supposed to work for block devices? We just
> always need HW support, right?

It looks like the HW support could be minimized, just like what Google and
Amazon did: a 16KB physical block size with proper queue limit settings.

It now seems easy to make such a device with ublk-loop:

- use one backing disk with a 16KB/32KB/.. physical block size
- expose proper physical block size, chunk_sectors and max sectors queue limits

Then any 16KB-aligned direct WRITE with N*16KB length (N in [1, 8] with
chunk_sectors set to 256) can be an atomic write.

Thanks,
Ming
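
For reference, the arithmetic behind "N in [1, 8]" can be sketched as below.
This is plain Python, not kernel code, and only an illustration under stated
assumptions: sectors are 512 bytes, so 256 chunk_sectors is 128KB = 8 * 16KB;
the max sectors value of 256 is a placeholder, since the mail does not give
one; and the chunk-boundary check reflects my understanding that the block
layer splits I/O crossing a chunk_sectors boundary, which the mail does not
spell out. The function name and parameters are hypothetical.

```python
SECTOR_SIZE = 512  # the block layer counts sectors in 512-byte units


def can_be_atomic(offset, length, phys_bs=16 * 1024,
                  chunk_sectors=256, max_sectors=256):
    """Return True if a direct WRITE at byte `offset` of `length` bytes
    would stay in one piece under the assumed queue limits.

    max_sectors=256 is an assumed value, not one taken from the mail.
    """
    chunk_bytes = chunk_sectors * SECTOR_SIZE  # 256 * 512 = 128KB
    max_bytes = max_sectors * SECTOR_SIZE

    if offset < 0 or length <= 0:
        return False
    # Both start and length must be multiples of the physical block size.
    if offset % phys_bs or length % phys_bs:
        return False
    # Anything larger than max sectors would be split.
    if length > max_bytes:
        return False
    # Assumption: a write crossing a chunk boundary would be split,
    # so first and last byte must fall in the same chunk.
    return offset // chunk_bytes == (offset + length - 1) // chunk_bytes


if __name__ == "__main__":
    KB = 1024
    for off, ln in [(0, 16 * KB), (0, 128 * KB), (4 * KB, 16 * KB),
                    (112 * KB, 32 * KB)]:
        print(off // KB, ln // KB, can_be_atomic(off, ln))
```

With these defaults, lengths from 16KB up to 128KB pass when the write sits
inside one 128KB chunk, which is where the upper bound of N = 8 comes from.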