Date: Mon, 22 May 2023 14:27:57 -0400
From: Mike Snitzer
To: Dave Chinner, Joe Thornber
Cc: Jens Axboe, linux-block@vger.kernel.org, Theodore Ts'o,
    Stefan Hajnoczi, "Michael S. Tsirkin", "Darrick J. Wong",
    Jason Wang, Bart Van Assche, linux-kernel@vger.kernel.org,
    Christoph Hellwig, dm-devel@redhat.com, Andreas Dilger,
    Sarthak Kukreti, linux-fsdevel@vger.kernel.org,
    linux-ext4@vger.kernel.org, Brian Foster, Alasdair Kergon
Subject: Re: [PATCH v7 0/5] Introduce provisioning primitives
References: <20230518223326.18744-1-sarthakkukreti@chromium.org>
X-Mailing-List: linux-ext4@vger.kernel.org

On Fri, May 19 2023 at 7:07P -0400,
Dave Chinner wrote:

> On Fri, May 19, 2023 at 10:41:31AM -0400, Mike Snitzer wrote:
> > On Fri, May 19 2023 at 12:09P -0400,
> > Christoph Hellwig wrote:
> >
> > > FYI, I really don't think this primitive is a good idea.
> > > In the concept of non-overwritable storage (NAND, SMR drives) the
> > > entire concept of a one-shot 'provisioning' that will guarantee
> > > later writes are always possible is simply bogus.
> >
> > Valid point for sure, such storage shouldn't advertise support (and
> > will return -EOPNOTSUPP).
> >
> > But the primitive still has utility for other classes of storage.
>
> Yet the thing people are wanting us filesystem developers to use
> this with is thinly provisioned storage that has snapshot
> capability. That, by definition, is non-overwritable storage. These
> are the use cases people are asking filesystems to gracefully handle
> and report errors when the sparse backing store runs out of space.

DM thinp falls into this category but as you detailed it can be made
to work reliably. To carry that forward we need to first establish
the REQ_PROVISION primitive (with this series).

Follow-on associated dm-thinp enhancements can then serve as reference
for how to take advantage of XFS's ability to operate reliably on
thinly provisioned storage.

> e.g. journal writes after a snapshot is taken on a busy filesystem
> are always an overwrite and this requires more space in the storage
> device for the write to succeed. ENOSPC from the backing device for
> journal IO is a -fatal error-. Hence if REQ_PROVISION doesn't
> guarantee space for overwrites after snapshots, then it's not
> actually useful for solving the real world use cases we actually
> need device-level provisioning to solve.
>
> It is not viable for filesystems to have to reprovision space for
> in-place metadata overwrites after every snapshot - the filesystem
> may not even know a snapshot has been taken! And it's not feasible
> for filesystems to provision on demand before they modify metadata
> because we don't know what metadata is going to need to be modified
> before we start modifying metadata in transactions.
> If we get ENOSPC from provisioning in the middle of a dirty
> transaction, it's all over just the same as if we get ENOSPC during
> metadata writeback...
>
> Hence what filesystems actually need is device provisioned space to
> be -always over-writable- without ENOSPC occurring. Ideally, if we
> provision a range of the block device, the block device *must*
> guarantee all future writes to that LBA range succeed. That
> guarantee needs to stand until we discard or unmap the LBA range,
> and for however many writes we do to that LBA range.
>
> e.g. If the device takes a snapshot, it needs to reprovision the
> potential COW ranges that overlap with the provisioned LBA range at
> snapshot time. e.g. by re-reserving the space from the backing pool
> for the provisioned space so if a COW occurs there is space
> guaranteed for it to succeed. If there isn't space in the backing
> pool for the reprovisioning, then whatever operation that triggers
> the COW behaviour should fail with ENOSPC before doing anything
> else....

Happy to implement this in dm-thinp. Each thin block will need a bit
to say if the block must be REQ_PROVISION'd at time of snapshot (and
the resulting block will need the same bit set).

Walking all blocks of a thin device and triggering REQ_PROVISION for
each will obviously make thin snapshot creation take more time.

I think this approach is better than having a dedicated bitmap hooked
off each thin device's metadata (with bitmap being copied and walked
at the time of snapshot). But we'll see... I'll get with Joe to
discuss further.

> Software devices like dm-thin/snapshot should really only need to
> keep a persistent map of the provisioned space and refresh space
> reservations for used space within that map whenever something that
> triggers COW behaviour occurs. i.e. a snapshot needs to reset the
> provisioned ranges back to "all ranges are freshly provisioned"
> before the snapshot is started.
> If that space is not available in the backing pool, then the
> snapshot attempt gets ENOSPC....
>
> That means filesystems only need to provision space for journals and
> fixed metadata at mkfs time, and they only need issue a
> REQ_PROVISION bio when they first allocate over-write in place
> metadata. We already have online discard and/or fstrim for releasing
> provisioned space via discards.
>
> This will require some mods to filesystems like ext4 and XFS to
> issue REQ_PROVISION and fail gracefully during metadata allocation.
> However, doing so means that we can actually harden filesystems
> against sparse block device ENOSPC errors by ensuring they will
> never occur in critical filesystem structures....

Yes, let's finally _do_ this! ;)

Mike
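
For illustration only, the snapshot-time reservation discussed above can
be sketched as a toy userspace model. This is not dm-thinp code and makes
no claim about how dm-thinp will implement it: ThinPool, ThinDevice and
every method and field name below are hypothetical, the per-block
"provisioned" bit is modelled as set membership rather than on-disk
metadata, and the pool is reduced to a free-block counter.

```python
import errno


class ThinPool:
    """Hypothetical backing pool that only counts free blocks."""

    def __init__(self, nr_blocks):
        self.free = nr_blocks

    def alloc(self, count=1):
        # All-or-nothing: fail before touching the counter.
        if count > self.free:
            raise OSError(errno.ENOSPC, "backing pool out of space")
        self.free -= count

    def release(self, count=1):
        self.free += count


class ThinDevice:
    """Hypothetical thin device; each set stands in for per-block metadata."""

    def __init__(self, pool):
        self.pool = pool
        self.mapped = set()        # LBAs backed by a pool block
        self.provisioned = set()   # LBAs carrying the "provisioned" bit
        self.shared = set()        # LBAs whose block is shared with a snapshot
        self.cow_reserved = set()  # LBAs with a block set aside for the next COW

    def provision(self, lbas):
        """Model of a provision request: allocate now and set the bit."""
        lbas = set(lbas)
        unmapped = lbas - self.mapped
        need_cow = (lbas & self.shared) - self.cow_reserved
        self.pool.alloc(len(unmapped) + len(need_cow))
        self.mapped |= lbas
        self.provisioned |= lbas
        self.cow_reserved |= need_cow

    def snapshot(self):
        """Re-reserve COW space for every provisioned block, or fail up front."""
        need = self.provisioned - self.cow_reserved
        self.pool.alloc(len(need))  # raises ENOSPC before any state changes
        self.cow_reserved |= need
        self.shared |= self.mapped

    def write(self, lba):
        if lba in self.shared:
            if lba in self.cow_reserved:
                # Space was set aside at snapshot time, so this cannot fail.
                self.cow_reserved.discard(lba)
            else:
                # Unprovisioned COW competes for free space and may hit ENOSPC.
                self.pool.alloc()
            self.shared.discard(lba)
        elif lba not in self.mapped:
            self.pool.alloc()
            self.mapped.add(lba)

    def discard(self, lba):
        """Drop the mapping, the bit, and any reservation held for the LBA."""
        if lba in self.cow_reserved:
            self.cow_reserved.discard(lba)
            self.pool.release()
        if lba in self.mapped and lba not in self.shared:
            self.pool.release()  # a shared block stays with the snapshot
        self.mapped.discard(lba)
        self.shared.discard(lba)
        self.provisioned.discard(lba)


pool = ThinPool(nr_blocks=8)
dev = ThinDevice(pool)
dev.provision(range(4))      # e.g. a 4-block journal provisioned at mkfs time
dev.snapshot()               # reserves 4 more blocks for the journal's COW
for lba in range(4):
    dev.write(lba)           # overwrites consume the reservations
try:
    dev.snapshot()           # no free blocks left to re-reserve
except OSError as err:
    print("second snapshot refused:", errno.errorcode[err.errno])
```

In this model the ENOSPC surfaces at snapshot creation rather than at a
later overwrite of a provisioned LBA, which is the property asked for in
the quoted text. The model does not cover writes to the snapshot itself
or the cost of walking every block at snapshot time.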