From: Sarthak Kukreti
Date: Wed, 4 Jan 2023 13:22:06 -0800
Subject: Re: [PATCH v2 3/7] fs: Introduce FALLOC_FL_PROVISION
To: "Darrick J. Wong"
Cc: sarthakkukreti@google.com, dm-devel@redhat.com,
    linux-block@vger.kernel.org, linux-ext4@vger.kernel.org,
    linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    Jens Axboe, "Michael S. Tsirkin", Jason Wang, Stefan Hajnoczi,
    Alasdair Kergon, Mike Snitzer, Christoph Hellwig, Brian Foster,
    "Theodore Ts'o", Andreas Dilger, Bart Van Assche, Daniil Lunev
References: <20221229081252.452240-1-sarthakkukreti@chromium.org>
    <20221229081252.452240-4-sarthakkukreti@chromium.org>
X-Mailing-List: linux-ext4@vger.kernel.org

(Resend; the text flow made the last reply unreadable)

On Wed, Jan 4, 2023 at 8:39 AM Darrick J.
Wong wrote:
>
> On Thu, Dec 29, 2022 at 12:12:48AM -0800, Sarthak Kukreti wrote:
> > FALLOC_FL_PROVISION is a new fallocate() allocation mode that
> > sends a hint to (supported) thinly provisioned block devices to
> > allocate space for the given range of sectors via REQ_OP_PROVISION.
> >
> > The man pages for both fallocate(2) and posix_fallocate(3) describe
> > the default allocation mode as:
> >
> > ```
> > The default operation (i.e., mode is zero) of fallocate()
> > allocates the disk space within the range specified by offset and len.
> > ...
> > subsequent writes to bytes in the specified range are guaranteed
> > not to fail because of lack of disk space.
> > ```
> >
> > For thinly provisioned storage constructs (dm-thin, filesystems on sparse
> > files), the term 'disk space' is overloaded and can either mean the apparent
> > disk space in the filesystem/thin logical volume or the true disk
> > space that will be utilized on the underlying non-sparse allocation layer.
> >
> > The use of a separate mode allows us to cleanly disambiguate whether fallocate()
> > causes allocation only at the current layer (default mode) or whether it propagates
> > allocations to underlying layers (provision mode)
>
> Why is it important to make this distinction?  The outcome of fallocate
> is supposed to be that subsequent writes do not fail with ENOSPC.  In my
> (fs developer) mind, REQ_OP_PROVISION simply an extra step to be taken
> after allocating file blocks.
>

Some use cases still benefit from keeping the default mode, e.g. virtual
machines running on massive storage pools that don't expect to hit the
storage limit anytime soon (like most cloud storage providers).
Essentially, if the 'no ENOSPC' guarantee is maintained via other means,
then REQ_OP_PROVISION adds latency that isn't needed (and cloud storage
providers don't need to set aside extra space that may or may not be
used).

> If you *don't* add this API flag and simply bake the REQ_OP_PROVISION
> call into mode 0 fallocate, then the new functionality can be added (or
> even backported) to existing kernels and customers can use it
> immediately.  If you *do*, then you get to wait a few years for
> developers to add it to their codebases only after enough enterprise
> distros pick up a new kernel to make it worth their while.
>
> > for thinly provisioned filesystems/
> > block devices. For devices that do not support REQ_OP_PROVISION, both these
> > allocation modes will be equivalent. Given the performance cost of sending provision
> > requests to the underlying layers, keeping the default mode as-is allows users to
> > preserve existing behavior.
>
> How expensive is this expected to be?  Is this why you wanted a separate
> mode flag?
>

Yes, the exact latency will depend on the stacked block devices and the
fragmentation at the allocation layers. I did a quick test benchmarking
fallocate() with:

A) an ext4 filesystem mounted with 'noprovision'
B) an ext4 filesystem mounted with 'provision' on a dm-thin device
C) an ext4 filesystem mounted with 'provision' on a loop device with a
   sparse backing file on the filesystem in (B)

I tested file sizes from 512M to 8G. The time taken for fallocate() in
(A) stays flat, as expected, at ~0.01-0.02s, but for (B) it scales from
0.03s to 0.4s and for (C) from 0.04s to 0.52s (I captured the exact time
distribution in the cover letter:
https://marc.info/?l=linux-ext4&m=167230113520636&w=2).

+0.5s for an 8G fallocate doesn't sound like a lot, but I think
fragmentation and how the block device is layered can make this worse...
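
For anyone who wants to take a similar measurement, here is a rough
userspace sketch. It is not the harness behind the numbers above: it only
times one fallocate() call on a fresh temporary file in a given
directory. The helper name time_fallocate() and the "falloc-XXXXXX"
template are made up for illustration, and the sketch passes whatever
mode the caller gives it (mode 0 in the usage below), since in this
series the provisioning behaviour for ext4 is selected by the mount
option rather than by the caller.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/*
 * Time a single fallocate() call over [0, len) on a fresh temporary
 * file created in 'dir'. Returns the elapsed time in seconds, or -1.0
 * on failure with errno set. Hypothetical helper, for illustration only.
 */
static double time_fallocate(const char *dir, int mode, off_t len)
{
	char path[4096];
	struct timespec t0, t1;
	int fd, n, ret, saved;

	assert(len > 0);

	n = snprintf(path, sizeof(path), "%s/falloc-XXXXXX", dir);
	if (n < 0 || (size_t)n >= sizeof(path)) {
		errno = ENAMETOOLONG;
		return -1.0;
	}

	fd = mkstemp(path);
	if (fd < 0)
		return -1.0;
	unlink(path);	/* file disappears once the fd is closed */

	clock_gettime(CLOCK_MONOTONIC, &t0);
	ret = fallocate(fd, mode, 0, len);
	saved = errno;
	clock_gettime(CLOCK_MONOTONIC, &t1);
	close(fd);

	if (ret < 0) {
		errno = saved;
		return -1.0;
	}
	return (double)(t1.tv_sec - t0.tv_sec) +
	       (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling time_fallocate("/mnt/test", 0, 8LL << 30) once per setup would
give one sample for an 8G file; /mnt/test is a placeholder mount point,
and a real comparison would want several runs per size.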
> --D
>
> > Signed-off-by: Sarthak Kukreti
> > ---
> >  block/fops.c                | 15 +++++++++++----
> >  include/linux/falloc.h      |  3 ++-
> >  include/uapi/linux/falloc.h |  8 ++++++++
> >  3 files changed, 21 insertions(+), 5 deletions(-)
> >
> > diff --git a/block/fops.c b/block/fops.c
> > index 50d245e8c913..01bde561e1e2 100644
> > --- a/block/fops.c
> > +++ b/block/fops.c
> > @@ -598,7 +598,8 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
> >
> >  #define BLKDEV_FALLOC_FL_SUPPORTED \
> >  		(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE | \
> > -		 FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE)
> > +		 FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE | \
> > +		 FALLOC_FL_PROVISION)
> >
> >  static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> >  			     loff_t len)
> > @@ -634,9 +635,11 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> >  	filemap_invalidate_lock(inode->i_mapping);
> >
> >  	/* Invalidate the page cache, including dirty pages. */
> > -	error = truncate_bdev_range(bdev, file->f_mode, start, end);
> > -	if (error)
> > -		goto fail;
> > +	if (mode != FALLOC_FL_PROVISION) {
> > +		error = truncate_bdev_range(bdev, file->f_mode, start, end);
> > +		if (error)
> > +			goto fail;
> > +	}
> >
> >  	switch (mode) {
> >  	case FALLOC_FL_ZERO_RANGE:
> > @@ -654,6 +657,10 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> >  		error = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT,
> >  					     len >> SECTOR_SHIFT, GFP_KERNEL);
> >  		break;
> > +	case FALLOC_FL_PROVISION:
> > +		error = blkdev_issue_provision(bdev, start >> SECTOR_SHIFT,
> > +					       len >> SECTOR_SHIFT, GFP_KERNEL);
> > +		break;
> >  	default:
> >  		error = -EOPNOTSUPP;
> >  	}
> > diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> > index f3f0b97b1675..b9a40a61a59b 100644
> > --- a/include/linux/falloc.h
> > +++ b/include/linux/falloc.h
> > @@ -30,7 +30,8 @@ struct space_resv {
> >  					 FALLOC_FL_COLLAPSE_RANGE | \
> >  					 FALLOC_FL_ZERO_RANGE | \
> >  					 FALLOC_FL_INSERT_RANGE | \
> > -					 FALLOC_FL_UNSHARE_RANGE)
> > +					 FALLOC_FL_UNSHARE_RANGE | \
> > +					 FALLOC_FL_PROVISION)
> >
> >  /* on ia32 l_start is on a 32-bit boundary */
> >  #if defined(CONFIG_X86_64)
> > diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
> > index 51398fa57f6c..2d323d113eed 100644
> > --- a/include/uapi/linux/falloc.h
> > +++ b/include/uapi/linux/falloc.h
> > @@ -77,4 +77,12 @@
> >   */
> >  #define FALLOC_FL_UNSHARE_RANGE 0x40
> >
> > +/*
> > + * FALLOC_FL_PROVISION acts as a hint for thinly provisioned devices to allocate
> > + * blocks for the range/EOF.
> > + *
> > + * FALLOC_FL_PROVISION can only be used with allocate-mode fallocate.
> > + */
> > +#define FALLOC_FL_PROVISION 0x80
> > +
> >  #endif /* _UAPI_FALLOC_H_ */
> > --
> > 2.37.3
> >
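
To make the mode handling in the quoted block/fops.c hunks easier to
follow, here is a small userspace model of just the checks visible in
the diff. It is a sketch, not kernel code: the model_* helpers and
PROPOSED_FALLOC_FL_PROVISION are names invented here, and 0x80 is only
the value proposed in this series. Installed uapi headers may not define
that bit, or may assign it a different meaning, so the sketch never
passes it to fallocate().

```c
#include <assert.h>
#include <stdbool.h>
#include <linux/falloc.h>	/* existing FALLOC_FL_* flags */

/* Value proposed by this patch only; hypothetical outside this series. */
#define PROPOSED_FALLOC_FL_PROVISION 0x80

/* BLKDEV_FALLOC_FL_SUPPORTED as extended by the quoted hunk. */
#define MODEL_BLKDEV_FALLOC_FL_SUPPORTED \
	(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE | \
	 FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE | \
	 PROPOSED_FALLOC_FL_PROVISION)

/* The proposed bit must not overlap the flags already in the mask. */
static_assert((PROPOSED_FALLOC_FL_PROVISION &
	       (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
		FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE)) == 0,
	      "proposed flag collides with an existing blkdev flag");

/* True when 'mode' contains only bits from the supported mask. */
static bool model_mode_supported(int mode)
{
	return (mode & ~MODEL_BLKDEV_FALLOC_FL_SUPPORTED) == 0;
}

/* The patch skips truncate_bdev_range() only for an exact provision request. */
static bool model_truncates_page_cache(int mode)
{
	return mode != PROPOSED_FALLOC_FL_PROVISION;
}

/*
 * The new switch case matches the flag on its own, so a combination such
 * as PROVISION | KEEP_SIZE passes the mask check but is not treated as a
 * provision request in the quoted hunk.
 */
static bool model_is_provision_request(int mode)
{
	return model_mode_supported(mode) &&
	       mode == PROPOSED_FALLOC_FL_PROVISION;
}
```

In this model a provision request leaves the page cache alone, which
matches the "mode != FALLOC_FL_PROVISION" guard in the diff, while
every other supported mode still invalidates the range first.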