Date: Thu, 28 Mar 2024 07:31:45 +1100
From: Dave Chinner
To: Matthew Wilcox
Cc: John Garry, axboe@kernel.dk, kbusch@kernel.org, hch@lst.de,
    sagi@grimberg.me, jejb@linux.ibm.com, martin.petersen@oracle.com,
    djwong@kernel.org, viro@zeniv.linux.org.uk, brauner@kernel.org,
    dchinner@redhat.com, jack@suse.cz, linux-block@vger.kernel.org,
    linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org,
    linux-fsdevel@vger.kernel.org, tytso@mit.edu, jbongio@google.com,
    linux-scsi@vger.kernel.org, ojaswin@linux.ibm.com, linux-aio@kvack.org,
    linux-btrfs@vger.kernel.org, io-uring@vger.kernel.org,
    nilay@linux.ibm.com, ritesh.list@gmail.com
Subject: Re: [PATCH v6 00/10] block atomic writes
References: <20240326133813.3224593-1-john.g.garry@oracle.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Mar 27, 2024 at 03:50:07AM +0000, Matthew Wilcox wrote:
> On Tue, Mar 26, 2024 at 01:38:03PM +0000, John Garry wrote:
> > The goal here is to provide an interface that allows applications use
> > application-specific block sizes larger than logical block size
> > reported by the storage device or larger than filesystem block size as
> > reported by stat().
> >
> > With this new interface, application blocks will never be torn or
> > fractured when written. For a power fail, for each individual application
> > block, all or none of the data to be written.
> > A racing atomic write and
> > read will mean that the read sees all the old data or all the new data,
> > but never a mix of old and new.
> >
> > Three new fields are added to struct statx - atomic_write_unit_min,
> > atomic_write_unit_max, and atomic_write_segments_max. For each atomic
> > individual write, the total length of a write must be a between
> > atomic_write_unit_min and atomic_write_unit_max, inclusive, and a
> > power-of-2. The write must also be at a natural offset in the file
> > wrt the write length. For pwritev2, iovcnt is limited by
> > atomic_write_segments_max.
> >
> > There has been some discussion on supporting buffered IO and whether the
> > API is suitable, like:
> > https://lore.kernel.org/linux-nvme/ZeembVG-ygFal6Eb@casper.infradead.org/
> >
> > Specifically the concern is that supporting a range of sizes of atomic IO
> > in the pagecache is complex to support. For this, my idea is that FSes can
> > fix atomic_write_unit_min and atomic_write_unit_max at the same size, the
> > extent alignment size, which should be easier to support. We may need to
> > implement O_ATOMIC to avoid mixing atomic and non-atomic IOs for this. I
> > have no proposed solution for atomic write buffered IO for bdev file
> > operations, but I know of no requirement for this.
>
> The thing is that there's no requirement for an interface as complex as
> the one you're proposing here. I've talked to a few database people
> and all they want is to increase the untorn write boundary from "one
> disc block" to one database block, typically 8kB or 16kB.
>
> So they would be quite happy with a much simpler interface where they
> set the inode block size at inode creation time, and then all writes to
> that inode were guaranteed to be untorn. This would also be simpler to
> implement for buffered writes.

You're conflating filesystem functionality that applications will use
with hardware and block-layer enablement that filesystems and
filesystem utilities need to configure the filesystem in ways that
allow users to make use of atomic write capability of the hardware.

The block layer functionality needs to export everything that the
hardware can do and filesystems will make use of. The actual
application usage and setup of atomic writes at the filesystem/page
cache layer is a separate problem. i.e. The block layer interfaces
need only support direct IO and expose limits for issuing atomic
direct IO, and nothing more. All the more complex stuff to make it
"easy to use" is filesystem level functionality and completely
outside the scope of this patchset....

-Dave.
-- 
Dave Chinner
david@fromorbit.com
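
For reference, the constraints stated in the quoted cover letter (length
between atomic_write_unit_min and atomic_write_unit_max inclusive, length a
power-of-2, offset naturally aligned to the length, iovcnt bounded by
atomic_write_segments_max) can be sketched as a small userspace check. This
is only an illustration under stated assumptions: struct atomic_write_limits
and atomic_write_ok() are hypothetical names, not kernel or libc interfaces,
and the struct merely mirrors the three limits named in the cover letter.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the three limits named in the cover letter. */
struct atomic_write_limits {
	uint32_t unit_min;	/* atomic_write_unit_min, in bytes */
	uint32_t unit_max;	/* atomic_write_unit_max, in bytes */
	uint32_t segments_max;	/* atomic_write_segments_max */
};

static bool is_power_of_2(uint64_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/*
 * Hypothetical helper (not a kernel API): returns true if a write of
 * 'len' bytes at file offset 'offset', split over 'iovcnt' iovecs,
 * satisfies the rules described in the cover letter.
 */
static bool atomic_write_ok(const struct atomic_write_limits *lim,
			    uint64_t offset, uint64_t len, size_t iovcnt)
{
	/* Total length must lie within [unit_min, unit_max]. */
	if (len < lim->unit_min || len > lim->unit_max)
		return false;

	/* Total length must be a power-of-2. */
	if (!is_power_of_2(len))
		return false;

	/* Natural alignment: offset must be a multiple of the length. */
	if (offset & (len - 1))
		return false;

	/* For pwritev2, the iovec count is bounded by segments_max. */
	if (iovcnt == 0 || iovcnt > lim->segments_max)
		return false;

	return true;
}
```

With made-up limits of unit_min = 4096, unit_max = 16384 and
segments_max = 1, an 8kB write at offset 8192 passes this check, while the
same write at offset 4096 is rejected because the offset is not a multiple
of the length.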