Received: by 2002:a05:6358:9144:b0:117:f937:c515 with SMTP id r4csp604791rwr; Wed, 19 Apr 2023 10:27:44 -0700 (PDT) X-Google-Smtp-Source: AKy350ZPC1mDRBYR6D6ycTB1k/UA9c13kACaUyFvBbViGzMJQYu2Cdy6mO1q7+DY6Hepj6pgFves X-Received: by 2002:a05:6a20:7d83:b0:ee:448b:946e with SMTP id v3-20020a056a207d8300b000ee448b946emr5312915pzj.35.1681925264376; Wed, 19 Apr 2023 10:27:44 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1681925264; cv=none; d=google.com; s=arc-20160816; b=WP4aVVCOte1D1QkKdhB7xFfBvhKeIkJd0m5290Fx7dVd8VfgvDEpTYV1Ex1j46U8Yk 3oFBx7HUh6O6M2RiViCn4HCjNOcfk5EVNcmGQ4viVBQPKF8R1v3ZJeVBkr/F5n5LbdtM P71cihMatV1SAO6NESLe4M9/WoZtzFMRNJB8YpPLnGPIcV30tgLv/iZGib0n3Zu39Ju9 ZcscWOylNNJz3a65lu0G7Zr10lrs3klJljiMUCfKf+r/ViFjxJVYuzHRRymiKQmF5NbH IDVqcFpGh97X5YkrMFAqZyHM+8c76vaQ23a+RyT8BdHF3P3BYWuuIRBdIZc0JW/6SGed vShg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-disposition:mime-version :references:message-id:subject:cc:to:from:date:dkim-signature; bh=hwHVeq8SU0Y29haLuu60uEvBPFI5BdEJwA0M7cOo5oc=; b=AZEwH1S/I8tpNHuPOrbbDPeDi4RS/41b313ylTO90W8aygZ/haoaAMBtVHE/7a7QL3 qZDmSSZDWraHuQk7n9YOeYAY2FrgMFRFb0qB+7a45Tqc3X0O0BSOfwt+OW1xVYO+Nzpm Ze6nAik4ow4D7Pkx2T5YqjncFotQ9mlMxxR7rWQ8lnjwGL5ciU5JY6f6x11oCLXkFwLP ykcplfiDEQWe09WyTpwmf4vMh8C05p0wzNh8+LfWgVltFwYSr9yo99lCqzs9DIG7rtkw iduLKW75wvdw/hTc1gGTpwWawddPaGfFnjI1Hmnn+dLlFpPLUYtkTzZE3MM/wbeDd0C2 fusQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=UycFIy4d; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id y36-20020a634964000000b005138c1f1fcdsi16800491pgk.663.2023.04.19.10.27.30; Wed, 19 Apr 2023 10:27:44 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=UycFIy4d; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232803AbjDSR0H (ORCPT + 99 others); Wed, 19 Apr 2023 13:26:07 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40040 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230346AbjDSR0F (ORCPT ); Wed, 19 Apr 2023 13:26:05 -0400 Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BEDEA659B; Wed, 19 Apr 2023 10:26:03 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 53B6A64122; Wed, 19 Apr 2023 17:26:03 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id AD6ABC433EF; Wed, 19 Apr 2023 17:26:02 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1681925162; bh=VFRws15p+q59RSTrU0VATjC0LQ6eFGIpuOjxTTSlgZA=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=UycFIy4d7bCijEj9Km8Xjqeq9Y7EDL81kXO55cTmYZWaCoWU32Dc670Zrnc7oLZIE zohoopB6vvjwHWAIgZi2HjslXGgNOIJNhWwyycPnyOYwJZEXyC3u+30gOvOxZSPYmz UoBxqtx/hq9Yb4HYkTl/ePo7blbSjZczejMiqCwIcKJRmGoEgpFxw635kJIa5LL6cK fj7fBUFeirx+lzm/NNEwcg+rYsJHgsDOwXoE9yDM1+xQ7Vq0YueKHyzojjw0y6iVIo MzRSXQLJPb3cF4Xqo1ry0NuHAPTGe48TEIN9UMSK0LxFIaOB2wSFR2bO9Zr4jBxkZj 6EldBf3ubjw1Q== Date: Wed, 19 Apr 2023 10:26:02 -0700 From: "Darrick J. Wong" To: Mike Snitzer Cc: Sarthak Kukreti , dm-devel@redhat.com, linux-block@vger.kernel.org, linux-ext4@vger.kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, Jens Axboe , Theodore Ts'o , "Michael S. Tsirkin" , Jason Wang , Bart Van Assche , Christoph Hellwig , Andreas Dilger , Daniil Lunev , Stefan Hajnoczi , Brian Foster , Alasdair Kergon Subject: Re: [PATCH v4 1/4] block: Introduce provisioning primitives Message-ID: <20230419172602.GE360881@frogsfrogsfrogs> References: <20230414000219.92640-1-sarthakkukreti@chromium.org> <20230418221207.244685-1-sarthakkukreti@chromium.org> <20230418221207.244685-2-sarthakkukreti@chromium.org> <20230419153611.GE360885@frogsfrogsfrogs> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Spam-Status: No, score=-7.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_HI, SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Apr 19, 2023 at 12:17:34PM -0400, Mike Snitzer wrote: > On Wed, Apr 19 2023 at 11:36P -0400, > Darrick J. Wong wrote: > > > On Tue, Apr 18, 2023 at 03:12:04PM -0700, Sarthak Kukreti wrote: > > > Introduce block request REQ_OP_PROVISION. The intent of this request > > > is to request underlying storage to preallocate disk space for the given > > > block range. Block devices that support this capability will export > > > a provision limit within their request queues. > > > > > > This patch also adds the capability to call fallocate() in mode 0 > > > on block devices, which will send REQ_OP_PROVISION to the block > > > device for the specified range, > > > > > > Signed-off-by: Sarthak Kukreti > > > --- > > > block/blk-core.c | 5 ++++ > > > block/blk-lib.c | 53 +++++++++++++++++++++++++++++++++++++++ > > > block/blk-merge.c | 18 +++++++++++++ > > > block/blk-settings.c | 19 ++++++++++++++ > > > block/blk-sysfs.c | 8 ++++++ > > > block/bounce.c | 1 + > > > block/fops.c | 25 +++++++++++++----- > > > include/linux/bio.h | 6 +++-- > > > include/linux/blk_types.h | 5 +++- > > > include/linux/blkdev.h | 16 ++++++++++++ > > > 10 files changed, 147 insertions(+), 9 deletions(-) > > > > > > > > first glance, but what do I know... ;)> > > > > > diff --git a/block/fops.c b/block/fops.c > > > index d2e6be4e3d1c..e1775269654a 100644 > > > --- a/block/fops.c > > > +++ b/block/fops.c > > > @@ -611,9 +611,13 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to) > > > return ret; > > > } > > > > > > +#define BLKDEV_FALLOC_FL_TRUNCATE \ > > > > At first I thought from this name that you were defining a new truncate > > mode for fallocate, then I realized that this is mask for deciding if we > > /want/ to truncate the pagecache. > > > > #define BLKDEV_FALLOC_TRUNCATE_MASK ? > > > > > + (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE | \ > > > > Ok, so discarding and writing zeroes truncates the page cache, makes > > sense since we're "writing" directly to the block device. > > > > > + FALLOC_FL_NO_HIDE_STALE) > > > > Here things get tricky -- some of the FALLOC_FL mode bits are really an > > opcode and cannot be specified together, whereas others select optional > > behavior for certain opcodes. > > > > IIRC, the mutually exclusive opcodes are: > > > > PUNCH_HOLE > > ZERO_RANGE > > COLLAPSE_RANGE > > INSERT_RANGE > > (none of the above, for allocation) > > > > and the "variants on a theme are": > > > > KEEP_SIZE > > NO_HIDE_STALE > > UNSHARE_RANGE > > > > not all of which are supported by all the opcodes. > > > > Does it make sense to truncate the page cache if userspace passes in > > mode == NO_HIDE_STALE? There's currently no defined meaning for this > > combination, but I think this means we'll truncate the pagecache before > > deciding if we're actually going to issue any commands. > > > > I think that's just a bug in the existing code -- it should be > > validating that @mode is any of the supported combinations *before* > > truncating the pagecache. > > > > Otherwise you could have a mkfs program that starts writing new fs > > metadata, decides to provision the storage (say for a logging region), > > doesn't realize it's running on an old kernel, and then oops the > > provision attempt fails but have we now shredded the pagecache and lost > > all the writes? > > While that just caused me to have an "oh shit, that's crazy" (in a > scary way) belly laugh... I just tried this and: # xfs_io -c 'pwrite -S 0x58 1m 1m' -c fsync -c 'pwrite -S 0x59 1m 4096' -c 'pread -v 1m 64' -c 'falloc 1m 4096' -c 'pread -v 1m 64' /dev/sda wrote 1048576/1048576 bytes at offset 1048576 1 MiB, 256 ops; 0.0013 sec (723.589 MiB/sec and 185238.7844 ops/sec) wrote 4096/4096 bytes at offset 1048576 4 KiB, 1 ops; 0.0000 sec (355.114 MiB/sec and 90909.0909 ops/sec) 00100000: 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 YYYYYYYYYYYYYYYY 00100010: 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 YYYYYYYYYYYYYYYY 00100020: 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 YYYYYYYYYYYYYYYY 00100030: 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 YYYYYYYYYYYYYYYY read 64/64 bytes at offset 1048576 64.000000 bytes, 1 ops; 0.0000 sec (1.565 MiB/sec and 25641.0256 ops/sec) fallocate: Operation not supported 00100000: 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 XXXXXXXXXXXXXXXX 00100010: 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 XXXXXXXXXXXXXXXX 00100020: 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 XXXXXXXXXXXXXXXX 00100030: 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 XXXXXXXXXXXXXXXX read 64/64 bytes at offset 1048576 64.000000 bytes, 1 ops; 0.0003 sec (176.554 KiB/sec and 2824.8588 ops/sec) (Write 1MB of Xs, flush it to disk, write 4k of Ys, confirm the Y's are in the page cache, fail to fallocate, reread and spot the Xs that we supposedly overwrote.) oops. > (And obviously needs fixing independent of this patchset) > > Shouldn't mkfs first check that the underlying storage supports > REQ_OP_PROVISION by verifying > /sys/block//queue/provision_max_bytes exists and is not 0? > (Just saying, we need to add new features more defensively.. you just > made the case based on this scenario's implications alone) Not for fallocate -- for regular files, there's no way to check if the filesystem actually supports the operation requested other than to try it and see if it succeeds. We probably should've defined a DRY_RUN flag for that purpose back when it was introduced. For fallocate calls to block devices, yes, the program can check the queue limits in sysfs if fstat says the supplied path is a block device, but I don't know that most programs are that thorough. The fallocate(1) CLI program does not check. Then I moved on to fs utilities: ext4: For discard, mke2fs calls BLKDISCARD if it detects a block device via fstat, and falloc(PUNCH|KEEP_SIZE) for anything else. For zeroing, it only uses falloc(ZERO) or falloc(PUNCH|KEEP_SIZE) and does not try to use BLKZEROOUT. It does not check sysfs queue limits at all. XFS: mkfs.xfs issues BLKDISCARD before writing anything to the device, so that's fine. It uses falloc(ZERO) to erase the log, but since xfsprogs provides its own buffer cache and uses O_DIRECT, pagecache coherency problems aren't an issue. btrfs: mkfs.btrfs only issues BLKDISCARD, and only before it starts writing the new fs, so no problems there. --D > Sarthak, please note I said "provision_max_bytes": all other ops > (e.g. DISCARD, WRITE_ZEROES, etc) have _max_bytes exported through > sysfs, not _max_sectors. Please export provision_max_bytes, e.g.: > > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c > index 202aa78f933e..2e5ac7b1ffbd 100644 > --- a/block/blk-sysfs.c > +++ b/block/blk-sysfs.c > @@ -605,12 +605,12 @@ QUEUE_RO_ENTRY(queue_io_min, "minimum_io_size"); > QUEUE_RO_ENTRY(queue_io_opt, "optimal_io_size"); > > QUEUE_RO_ENTRY(queue_max_discard_segments, "max_discard_segments"); > -QUEUE_RO_ENTRY(queue_max_provision_sectors, "max_provision_sectors"); > QUEUE_RO_ENTRY(queue_discard_granularity, "discard_granularity"); > QUEUE_RO_ENTRY(queue_discard_max_hw, "discard_max_hw_bytes"); > QUEUE_RW_ENTRY(queue_discard_max, "discard_max_bytes"); > QUEUE_RO_ENTRY(queue_discard_zeroes_data, "discard_zeroes_data"); > > +QUEUE_RO_ENTRY(queue_provision_max, "provision_max_bytes"); > QUEUE_RO_ENTRY(queue_write_same_max, "write_same_max_bytes"); > QUEUE_RO_ENTRY(queue_write_zeroes_max, "write_zeroes_max_bytes"); > QUEUE_RO_ENTRY(queue_zone_append_max, "zone_append_max_bytes");