Date: Fri, 26 May 2023 11:56:41 -0400
From: Brian Foster
To: Sarthak Kukreti
Cc: Dave Chinner, Mike Snitzer, Joe Thornber, Jens Axboe,
    linux-block@vger.kernel.org, Theodore Ts'o, Stefan Hajnoczi,
    "Michael S. Tsirkin", "Darrick J. Wong", Bart Van Assche,
    linux-kernel@vger.kernel.org, Christoph Hellwig, dm-devel@redhat.com,
    Andreas Dilger, linux-fsdevel@vger.kernel.org,
    linux-ext4@vger.kernel.org, Jason Wang, Alasdair Kergon
Subject: Re: [PATCH v7 0/5] Introduce provisioning primitives

On Thu, May 25, 2023 at 07:35:14PM -0700, Sarthak Kukreti wrote:
> On Thu, May 25, 2023 at 6:36 PM Dave Chinner wrote:
> >
> > On Thu, May 25, 2023 at 03:47:21PM -0700, Sarthak Kukreti wrote:
> > > On Thu, May 25, 2023 at 9:00 AM Mike Snitzer wrote:
> > > > On Thu, May 25 2023 at 7:39P -0400,
> > > > Dave Chinner wrote:
> > > > > On Wed, May 24, 2023 at 04:02:49PM -0400, Mike Snitzer wrote:
> > > > > > On Tue, May 23 2023 at 8:40P -0400,
> > > > > > Dave Chinner wrote:
> > > > > > > It's worth noting that XFS already has a coarse-grained
> > > > > > > implementation of preferred regions for metadata storage. It will
> > > > > > > currently not use those metadata-preferred regions for user data
> > > > > > > unless all the remaining user data space is full. Hence I'm pretty
> > > > > > > sure that a pre-provisioning enhancement like this can be done
> > > > > > > entirely in-memory without requiring any new on-disk state to be
> > > > > > > added.
> > > > > > >
> > > > > > > Sure, if we crash and remount, then we might choose a different LBA
> > > > > > > region for pre-provisioning.
> > > > > > > But that's not really a huge deal as we
> > > > > > > could also run an internal background post-mount fstrim operation to
> > > > > > > remove any unused pre-provisioning that was left over from when the
> > > > > > > system went down.
> > > > > >
> > > > > > This would be the FITRIM with extension you mention below? Which is a
> > > > > > filesystem interface detail?
> > > > >
> > > > > No. We might reuse some of the internal infrastructure we use to
> > > > > implement FITRIM, but that's about it. It's just something kinda
> > > > > like FITRIM but with different constraints determined by the
> > > > > filesystem rather than the user...
> > > > >
> > > > > As it is, I'm not sure we'd even need it - a periodic userspace
> > > > > FITRIM would achieve the same result, so leaked provisioned spaces
> > > > > would get cleaned up eventually without the filesystem having to do
> > > > > anything specific...
> > > > >
> > > > > > So dm-thinp would _not_ need to have new
> > > > > > state that tracks "provisioned but unused" blocks?
> > > > >
> > > > > No idea - that's your domain. :)
> > > > >
> > > > > dm-snapshot, for certain, will need to track provisioned regions
> > > > > because it has to guarantee that overwrites to provisioned space in
> > > > > the origin device will always succeed. Hence it needs to know how
> > > > > much space breaking sharing in provisioned regions after a snapshot
> > > > > has been taken will be required...
> > > >
> > > > dm-thinp offers its own much more scalable snapshot support (doesn't
> > > > use old dm-snapshot N-way copyout target).
> > > >
> > > > dm-snapshot isn't going to be modified to support this level of
> > > > hardening (dm-snapshot is basically in "maintenance only" now).
> >
> > Ah, of course. Sorry for the confusion, I was kinda using
> > dm-snapshot as shorthand for "dm-thinp + snapshots".
> > > > But I understand your meaning: what you said is 100% applicable to
> > > > dm-thinp's snapshot implementation and needs to be accounted for in
> > > > thinp's metadata (inherent 'provisioned' flag).
> >
> > *nod*
> >
> > > A bit orthogonal: would dm-thinp need to differentiate between
> > > user-triggered provision requests (eg. from fallocate()) vs
> > > fs-triggered requests?
> >
> > Why? How is the guarantee the block device has to provide to
> > provisioned areas different for user vs filesystem internal
> > provisioned space?
> >
> After thinking this through, I stand corrected. I was primarily
> concerned with how this would balloon thin snapshot sizes if users
> potentially provision a large chunk of the filesystem but that's
> putting the cart way before the horse.
> 

I think that's a legitimate concern. At some point to provide full
-ENOSPC protection the filesystem needs to provision space before it
writes to it, whether it be data or metadata, right? At what point does
that turn into a case where pretty much everything the fs wrote is
provisioned, and therefore a snapshot is just a full copy operation?
That might be Ok I guess, but if that's an eventuality then what's the
need to track provision state at the dm-thin block level? Using some
kind of flag you mention below could be a good way to qualify which
blocks you'd want to copy vs. which to share on snapshot and perhaps
mitigate that problem.

> Best
> Sarthak
> 
> > > I would lean towards user provisioned areas not
> > > getting dedup'd on snapshot creation,
> >
> > Snapshotting is a clone operation, not a dedupe operation.
> >
> > Yes, the end result of both is that you have a block shared between
> > multiple indexes that needs COW on the next overwrite, but the two
> > operations that get to that point are very different...
> > > but that would entail tracking
> > > the state of the original request and possibly a provision request
> > > flag (REQ_PROVISION_DEDUP_ON_SNAPSHOT) or an inverse flag
> > > (REQ_PROVISION_NODEDUP). Possibly too convoluted...
> >
> > Let's not try to add everyone's favourite pony to this interface
> > before we've even got it off the ground.
> >
> > It's the simple precision of the API, the lack of cross-layer
> > communication requirements and the ability to implement and optimise
> > the independent layers independently that makes this a very
> > appealing solution.
> >
> > We need to start with getting the simple stuff working and prove the
> > concept. Then once we can observe the behaviour of a working system
> > we can start working on optimising individual layers for efficiency
> > and performance....
> > 

I think proving the concept may not require changes to dm-thin at all.
If you want to guarantee preexisting metadata block writeability, just
scan through and provision all metadata blocks at mount time. Hit the
log, AG bufs, etc.; IIRC XFS already has btree walking code that can be
used for btrees and associated metadata. Maybe online scrub has
something even better to hook into temporarily for this sort of thing?

Mount performance would obviously be bad, but that doesn't matter for
the purposes of a prototype. The goal should really be that once
mounted, you have established expected writeability invariants and have
the ability to test for reliable prevention of -ENOSPC errors from
dm-thin from that point forward. If that ultimately works, then refine
the ideal implementation from there and ask dm to do whatever
writeability tracking and whatnot.

FWIW, that may also help deal with things like the fact that xfs_repair
can basically relocate the entire set of filesystem metadata to
completely different ranges of free space, completely breaking any
writeability guarantees tracked by previous provisions of those ranges.
Brian

> > Cheers,
> > 
> > Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com
> 