Received: by 2002:ab2:23c8:0:b0:1f2:fdbc:cb93 with SMTP id a8csp258550lqe; Wed, 27 Mar 2024 05:08:33 -0700 (PDT) X-Forwarded-Encrypted: i=3; AJvYcCW/ApPIdmjBMKsim26x4Dh+B3hsGjcxeDadVyZnQC0HSCl/XEKPRfzdr/OZkArKaNvKPNg8qvFEDvx01TXsPa22u0nr5pf0X+ai8Xmc9w== X-Google-Smtp-Source: AGHT+IHckzP8hD/B5wXBmUPltgyQY+G0bgViYLUlnuOvz89hWrno81m2/SvDp4J1BEZvFoMicuc+ X-Received: by 2002:a17:90a:b88b:b0:29d:fe93:52ac with SMTP id o11-20020a17090ab88b00b0029dfe9352acmr3609385pjr.13.1711541313297; Wed, 27 Mar 2024 05:08:33 -0700 (PDT) ARC-Seal: i=2; a=rsa-sha256; t=1711541313; cv=pass; d=google.com; s=arc-20160816; b=VL+mWKom3DDazd/toGcd8FNn03Qr52glRAEmmYKExrGSlwPDMt0J/PVwmAW6fV5PFQ KWv+ZTT3rsBQfB1OvXLjyhF2UnxZHZcKavuTZfbdGyvyaVvAjJKvScBTYYMWcCa5O7Mx wFw/QHnJxhTwo/Xt1N4T9+nsa5V+S2jNEzGvl+VSAKxBBS4/ZN816CQkb53u8SgsbM1S pIqH3GFnyW3bPTDon5UceZtYUOwCXWiR6TJTYkhv5W/xrE4+VK/9X/RuA2ddylhDJgNJ HlpZIvft/Z+IFtQkVKYs1tuABUaucu24h+0edVvUMuJQkPv+OYNWyPnokpIwpTCGlmwV RgpA== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=in-reply-to:content-disposition:mime-version:list-unsubscribe :list-subscribe:list-id:precedence:references:message-id:subject:cc :to:from:date:dkim-signature; bh=rk1STaKASJZMOQESaE4YdsbKcHTwjlEtdkr30vV9pdg=; fh=atg69ljbgvAWuHkerXy5tbCQ/g9XvGvDH2BLJLAvRLY=; b=RvUQb++uFQ6NTnBsjhrrlBqJP1A2ePpR0lf/x38O3EsnwrKfEva7hidPBS5d/DIPWU ifgE38oA+K1hrjxVDU4/5s7LQ6UwTTyMSK9zWHCI1fO4bO7fiEMCejl7+l0SPhpPtskc KHd2QIVE3gPjQvI5XFlRZ8d+390VBKUD5aRMUE6FC38I6VATPj584wb7cXyEqpUcCG8o shFRNN2N1n9SGChBr+HlfvQ/khsyyz4OLCoigxxx92giEaV0NcgMB+AP+Xp9464sNToA KTC+0XZzg6VERZNaLyRC46PepltxdJjFIkzA0Jis7LbBv89nH09+kL8GawcHCtsW95QA tHUw==; dara=google.com ARC-Authentication-Results: i=2; mx.google.com; dkim=pass header.i=@infradead.org header.s=casper.20170209 header.b=k6AI2bfn; arc=pass (i=1 dkim=pass dkdomain=infradead.org); spf=pass (google.com: domain of linux-kernel+bounces-120275-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:45e3:2400::1 as permitted sender) smtp.mailfrom="linux-kernel+bounces-120275-linux.lists.archive=gmail.com@vger.kernel.org" Return-Path: Received: from sv.mirrors.kernel.org (sv.mirrors.kernel.org. [2604:1380:45e3:2400::1]) by mx.google.com with ESMTPS id k7-20020a635a47000000b005d8e352afefsi12002931pgm.695.2024.03.27.05.08.33 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 27 Mar 2024 05:08:33 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel+bounces-120275-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:45e3:2400::1 as permitted sender) client-ip=2604:1380:45e3:2400::1; Authentication-Results: mx.google.com; dkim=pass header.i=@infradead.org header.s=casper.20170209 header.b=k6AI2bfn; arc=pass (i=1 dkim=pass dkdomain=infradead.org); spf=pass (google.com: domain of linux-kernel+bounces-120275-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:45e3:2400::1 as permitted sender) smtp.mailfrom="linux-kernel+bounces-120275-linux.lists.archive=gmail.com@vger.kernel.org" Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by sv.mirrors.kernel.org (Postfix) with ESMTPS id 93D192A4669 for ; Wed, 27 Mar 2024 03:50:38 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 96B9D249E5; Wed, 27 Mar 2024 03:50:20 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=infradead.org header.i=@infradead.org header.b="k6AI2bfn" Received: from casper.infradead.org (casper.infradead.org [90.155.50.34]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3F2171BF31; Wed, 27 Mar 2024 03:50:18 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=90.155.50.34 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1711511419; cv=none; b=NKLWOwmyqvHnMNUZcv2SKNcQe0Ti8ZG6xSHa35iIVq3cN6GcKLziCuaKleL3v6cdtT7EXQZmaxwztSl09y563ZWz8Y30EVXMeR2my0FJ/jK9K0lWiBnaSz1km9H6UpwaIj1NUY0XstT8WJ++JoozxjbU0XOGLypaDmTRQ0LXJyo= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1711511419; c=relaxed/simple; bh=yxDuwLY/j7X4qrwh8xDbw8rteroxa22o47tiK/LWHgM=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=rGUWnWSEzO7Db++j1e814O0vouSLqYWj8wLrkftf8KZj7b7KWokiZvb4jMsxjYDO7XLRhqmDQLLw943v8IDSJDFHvIx70d00npWjPXc3uq0mmTnyI/k6HMjAcKaigU9mdwMkQFlldF+AaZl92ymM4Ggk/afAd/BLYOGsZgI4ZtQ= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=infradead.org; spf=none smtp.mailfrom=infradead.org; dkim=pass (2048-bit key) header.d=infradead.org header.i=@infradead.org header.b=k6AI2bfn; arc=none smtp.client-ip=90.155.50.34 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=infradead.org Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=infradead.org DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=casper.20170209; h=In-Reply-To:Content-Type:MIME-Version: References:Message-ID:Subject:Cc:To:From:Date:Sender:Reply-To: Content-Transfer-Encoding:Content-ID:Content-Description; bh=rk1STaKASJZMOQESaE4YdsbKcHTwjlEtdkr30vV9pdg=; b=k6AI2bfnRnr2I3SiqEJhR0ATer L/tOjZtLNhHZu7N988MPkHPWZJoBDo+zKoYivmHKkU31eNPCva+VjDW+1R4Gm2R9QCg9X4JWeyq/e dI4XLAOq1DzhKn4YwIj+lcTmhL1RkuTgK/i1pDSmi9D6Lm5kMFE+Q+1wlg5UMNTm5hOwkCAqbRAJj IEUAQSGMrFpTRQY4p8k8ZrEEaxVf4ciY1E1jJcKZIkIQFhgMyGawFsbfpoeYbqpLUo4LPaD/rKBC3 +6OIEw+02Zj2ueHh3TKcU/mDZsebPI/D73/FkvUZGJNikrmvbubgTIxVumhEwo5Ifk7nNsozpsf7o FJjRTWDA==; Received: from willy by casper.infradead.org with local (Exim 4.97.1 #2 (Red Hat Linux)) id 1rpKIh-00000002xYu-3gF1; Wed, 27 Mar 2024 03:50:07 +0000 Date: Wed, 27 Mar 2024 03:50:07 +0000 From: Matthew Wilcox To: John Garry Cc: axboe@kernel.dk, kbusch@kernel.org, hch@lst.de, sagi@grimberg.me, jejb@linux.ibm.com, martin.petersen@oracle.com, djwong@kernel.org, viro@zeniv.linux.org.uk, brauner@kernel.org, dchinner@redhat.com, jack@suse.cz, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org, linux-fsdevel@vger.kernel.org, tytso@mit.edu, jbongio@google.com, linux-scsi@vger.kernel.org, ojaswin@linux.ibm.com, linux-aio@kvack.org, linux-btrfs@vger.kernel.org, io-uring@vger.kernel.org, nilay@linux.ibm.com, ritesh.list@gmail.com Subject: Re: [PATCH v6 00/10] block atomic writes Message-ID: References: <20240326133813.3224593-1-john.g.garry@oracle.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20240326133813.3224593-1-john.g.garry@oracle.com> On Tue, Mar 26, 2024 at 01:38:03PM +0000, John Garry wrote: > The goal here is to provide an interface that allows applications use > application-specific block sizes larger than logical block size > reported by the storage device or larger than filesystem block size as > reported by stat(). > > With this new interface, application blocks will never be torn or > fractured when written. For a power fail, for each individual application > block, all or none of the data to be written. A racing atomic write and > read will mean that the read sees all the old data or all the new data, > but never a mix of old and new. > > Three new fields are added to struct statx - atomic_write_unit_min, > atomic_write_unit_max, and atomic_write_segments_max. For each atomic > individual write, the total length of a write must be a between > atomic_write_unit_min and atomic_write_unit_max, inclusive, and a > power-of-2. The write must also be at a natural offset in the file > wrt the write length. For pwritev2, iovcnt is limited by > atomic_write_segments_max. > > There has been some discussion on supporting buffered IO and whether the > API is suitable, like: > https://lore.kernel.org/linux-nvme/ZeembVG-ygFal6Eb@casper.infradead.org/ > > Specifically the concern is that supporting a range of sizes of atomic IO > in the pagecache is complex to support. For this, my idea is that FSes can > fix atomic_write_unit_min and atomic_write_unit_max at the same size, the > extent alignment size, which should be easier to support. We may need to > implement O_ATOMIC to avoid mixing atomic and non-atomic IOs for this. I > have no proposed solution for atomic write buffered IO for bdev file > operations, but I know of no requirement for this. The thing is that there's no requirement for an interface as complex as the one you're proposing here. I've talked to a few database people and all they want is to increase the untorn write boundary from "one disc block" to one database block, typically 8kB or 16kB. So they would be quite happy with a much simpler interface where they set the inode block size at inode creation time, and then all writes to that inode were guaranteed to be untorn. This would also be simpler to implement for buffered writes. Who's asking for this more complex interface?