From: Sarthak Kukreti
Date: Thu, 29 Dec 2022 00:17:00 -0800
Subject: Re: [PATCH RFC 0/8] Introduce provisioning primitives for thinly provisioned storage
To: Mike Snitzer
Cc: Christoph Hellwig, Daniil Lunev, Jens Axboe,
 linux-block@vger.kernel.org, "Theodore Ts'o", "Michael S. Tsirkin",
 Jason Wang, Bart Van Assche, Mike Snitzer,
 linux-kernel@vger.kernel.org, Gwendal Grignou,
 virtualization@lists.linux-foundation.org, dm-devel@redhat.com,
 Andreas Dilger, Stefan Hajnoczi, Paolo Bonzini,
 linux-ext4@vger.kernel.org, Evan Green, Alasdair Kergon
References: <20220915164826.1396245-1-sarthakkukreti@google.com>
X-Mailing-List: linux-ext4@vger.kernel.org

On Fri, Sep 23, 2022 at 7:08 AM Mike Snitzer wrote:
>
> On Fri, Sep 23 2022 at 4:51P -0400,
> Christoph Hellwig wrote:
>
> > On Wed, Sep 21, 2022 at 07:48:50AM +1000, Daniil Lunev wrote:
> > > > There is no such thing as WRITE UNAVAILABLE in NVMe.
> > > Apologize, that is WRITE UNCORRECTABLE. Chapter 3.2.7 of
> > > NVM Express NVM Command Set Specification 1.0b
> >
> > Write uncorrectable is a very different thing, and the equivalent of the
> > horribly misnamed SCSI WRITE LONG COMMAND. It injects an unrecoverable
> > error, and does not provision anything.
> >
> > > * Each application is potentially allowed to consume the entirety
> > > of the disk space - there is no strict size limit for application
> > > * Applications need to pre-allocate space sometime, for which
> > > they use fallocate. Once the operation succeeded, the application
> > > assumed the space is guaranteed to be there for it.
> > > * Since filesystems on the volumes are independent, filesystem
> > > level enforcement of size constraints is impossible and the only
> > > common level is the thin pool, thus, each fallocate has to find its
> > > representation in thin pool one way or another - otherwise you
> > > may end up in the situation, where FS thinks it has allocated space
> > > but when it tries to actually write it, the thin pool is already
> > > exhausted.
> > > * Hole-Punching fallocate will not reach the thin pool, so the only
> > > solution presently is zero-writing pre-allocate.
> >
> > To me it sounds like you want a non-thin pool in dm-thin and/or
> > guaranteed space reservations for it.
>
> What is implemented in this patchset: enablement for dm-thinp to
> actually provide guarantees which fallocate requires.
>
> Seems you're getting hung up on the finishing details in HW (details
> which are _not_ the point of this patchset).
>
> The proposed changes are in service to _Linux_ code. The patchset
> implements the primitive from top (ext4) to bottom (dm-thinp, loop).
> It stops short of implementing handling everywhere that'd need it
> (e.g. in XFS, etc). But those changes can come as follow-on work once
> the primitive is established top to bottom.
>
> But you know all this ;)
>
> > > * Thus, a provisioning block operation allows an interface specific
> > > operation that guarantees the presence of the block in the
> > > mapped space. LVM Thin-pool itself is the primary target for our
> > > use case but the argument is that this operation maps well to
> > > other interfaces which allow thinly provisioned units.
> >
> > I think where you are trying to go here is badly mistaken. With flash
> > (or hard drive SMR) there is no such thing as provisioning LBAs. Every
> > write is out of place, and a one time space allocation does not help
> > you at all. So fundamentally what you try to do here just goes against
> > the actual physics of modern storage media.
> > While there are some
> > layers that keep up a pretence, trying to do that at an exposed API
> > level is a really bad idea.
>
> This doesn't need to be so feudal. Reserving an LBA in physical HW
> really isn't the point.
>
> Fact remains: an operation that ensures space is actually reserved via
> fallocate is long overdue (just because an FS did its job doesn't mean
> underlying layers reflect that). And certainly useful, even if "only"
> benefiting dm-thinp and the loop driver. Like other block primitives,
> REQ_OP_PROVISION is filtered out by block core if the device doesn't
> support it.
>
> That said, I agree with Brian Foster that we need really solid
> documentation and justification for why fallocate mode=0 cannot be
> used (but the case has been made in this thread).
>
> Also, I do see an issue with the implementation (relative to stacked
> devices): dm_table_supports_provision() is too myopic about DM. It
> needs to go a step further and verify that some layer in the stack
> actually services REQ_OP_PROVISION. Will respond to DM patch too.
>

Thanks all for the suggestions and feedback! I just posted v2 (more
than a bit belatedly) on the various mailing lists with the relevant
fixes, documentation and some benchmarks on performance.

Best
Sarthak