Date: Sat, 3 Jun 2023 10:52:14 +1000
From: Dave Chinner <david@fromorbit.com>
To: Sarthak Kukreti
Cc: Mike Snitzer, Joe Thornber, Jens Axboe, Christoph Hellwig,
	Theodore Ts'o, dm-devel@redhat.com, "Michael S. Tsirkin",
	"Darrick J. Wong", Brian Foster, Bart Van Assche,
	linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
	Andreas Dilger, Stefan Hajnoczi, linux-fsdevel@vger.kernel.org,
	linux-ext4@vger.kernel.org, Jason Wang, Alasdair Kergon
Subject: Re: [PATCH v7 0/5] Introduce provisioning primitives

On Fri, Jun 02, 2023 at 11:44:27AM -0700, Sarthak Kukreti wrote:
> On Tue, May 30, 2023 at 8:28 AM Mike Snitzer wrote:
> >
> > On Tue, May 30 2023 at 10:55P -0400,
> > Joe Thornber wrote:
> >
> > > On Tue, May 30, 2023 at 3:02 PM Mike Snitzer wrote:
> > > >
> > > > Also Joe, for your proposed dm-thinp design where you distinguish
> > > > between "provision" and "reserve": Would it make sense for REQ_META
> > > > (e.g. all XFS metadata) with REQ_PROVISION to be treated as an
> > > > LBA-specific hard request? Whereas REQ_PROVISION on its own provides
> > > > more freedom to just reserve the length of blocks? (e.g. for XFS
> > > > delalloc where the LBA range is unknown, but dm-thinp can be asked
> > > > to reserve space to accommodate it).
> > >
> > > My proposal only involves 'reserve'. Provisioning will be done as
> > > part of the usual IO path.
> >
> > OK, I think we'd do well to pin down the top-level block interfaces in
> > question. Because this patchset's block interface patch (2/5) header
> > says:
> >
> > "This patch also adds the capability to call fallocate() in mode 0
> > on block devices, which will send REQ_OP_PROVISION to the block
> > device for the specified range,"
> >
> > So it wires up blkdev_fallocate() to call blkdev_issue_provision(). A
> > user of XFS could then use fallocate() for user data -- which would
> > cause thinp's reserve to _not_ be used for critical metadata.

Mike, I think you might have misunderstood what I have been proposing.
Possibly unintentionally, I didn't call it REQ_OP_PROVISION, but that's
what I intended: the operation does not contain data at all. It's an
operation like REQ_OP_DISCARD or REQ_OP_WRITE_ZEROES - it contains a
range of sectors that need to be provisioned (or discarded), and
nothing else. The write IOs themselves are not tagged with anything
special at all.

i.e. the proposal I made does not use REQ_PROVISION anywhere in the
metadata/data IO path. Provisioned regions are created by separate
operations and must be tracked by the underlying block device, which
then treats any write IO to those regions as "must not fail w/ ENOSPC"
IOs.
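To make the shape of the operation concrete: a provision request can be
built and issued in exactly the same way as a discard. The sketch below
is modeled on blkdev_issue_discard(); the helper name matches what patch
2/5 wires blkdev_fallocate() up to, but the body is purely illustrative,
not the actual patch:

/*
 * Illustrative sketch only, modeled on blkdev_issue_discard(). Note
 * there is no data payload and no REQ_META anywhere - the bio
 * describes a sector range and nothing else.
 */
static int blkdev_issue_provision(struct block_device *bdev,
                                  sector_t sector, sector_t nr_sects,
                                  gfp_t gfp_mask)
{
        struct bio *bio;
        int ret;

        bio = bio_alloc(bdev, 0, REQ_OP_PROVISION, gfp_mask);
        if (!bio)
                return -ENOMEM;

        /*
         * Range-only op, like discard: no pages are attached. A real
         * implementation would chunk large ranges the same way
         * __blkdev_issue_discard() does.
         */
        bio->bi_iter.bi_sector = sector;
        bio->bi_iter.bi_size = nr_sects << SECTOR_SHIFT;

        ret = submit_bio_wait(bio);
        bio_put(bio);
        return ret;
}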
There seems to be a lot of fear about user data requiring provisioning.
This is unfounded - provisioning is only needed for explicitly
provisioned space via fallocate(), not for every byte of user data
written to the filesystem (the model Brian is proposing). Excessive use
of fallocate() is self-correcting - if users and/or their applications
provision too much, they are going to get ENOSPC or have to pay more to
expand the backing pool reserves they need. But that's not a problem
the block device should be trying to solve; that's a problem for the
sysadmin and/or bean counters to address.

> > The only way to distinguish the caller (between on-behalf-of user
> > data vs XFS metadata) would be REQ_META?
> >
> > So should dm-thinp have a REQ_META-based distinction? Or just treat
> > all REQ_OP_PROVISION the same?
>
> I'm in favor of a REQ_META-based distinction.

Why? What *requirement* is driving the need for this distinction?

As the person who proposed this new REQ_OP_PROVISION architecture, I'm
dead set against it. Allowing the block device to provide a set of
poorly defined "conditional guarantee" policies instead of a mechanism
with a single ironclad guarantee defeats the entire purpose of the
proposal.

We have a requirement from the *kernel ABI* that *user data writes*
must not fail with ENOSPC after an fallocate() operation. That's one of
the high-level policies we need to implement. The filesystem is already
capable of guaranteeing it won't give the user ENOSPC after
fallocate(); we now need a guarantee from the filesystem's backing
store that it won't give ENOSPC, either.

The _other thing_ we need to implement is a method of guaranteeing the
filesystem won't shut down when the backing device goes ENOSPC
unexpectedly during metadata writeback. So we also need the backing
device to guarantee that the regions we write metadata to won't give
ENOSPC.

That's the whole point of REQ_OP_PROVISION: from the layers above the
block device, there is -zero- difference between the guarantee we need
for user data writes to avoid ENOSPC and the one we need for metadata
writes to avoid ENOSPC. They are one and the same.

Hence if the block device is going to say "I support provisioning" but
then give different conditional guarantees according to the *type of
data* in the IO request, then it does not provide the functionality the
higher layers actually require from it.

Indeed, what type of data the IO contains is *context dependent*. For
example, sometimes we write metadata with user data IO, but we still
need provisioning guarantees as if it were issued as metadata IO. This
is the case for mkfs initialising the filesystem by writing directly to
the block device. IOWs, filesystem metadata IO issued from kernel
context would be considered metadata IO, but from userspace it would be
considered normal user data IO and hence treated differently. But the
reality is that they both need the same provisioning guarantees to be
provided by the block device.

So how do userspace tools deal with this if the block device requires
REQ_META on user data IOs to do the right thing here? And if we provide
a mechanism to allow this, how do we prevent userspace from always
using it on writes to fallocate() provisioned space?

It's just not practical for the block device to add arbitrary
constraints based on the type of IO, because we then have to add
mechanisms to userspace APIs to allow them to control the IO context so
that the block device will do the right thing.
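To put the mkfs case in concrete terms: under the patch 2/5 wiring
quoted above, a plain mode-0 fallocate() on the block device is all
userspace can issue - there is no interface to tag the writes that
follow as metadata. A hedged sketch (the device path and length are
made up, and error handling is trimmed):

/*
 * Userspace view (e.g. mkfs). Per the quoted patch header, a mode-0
 * fallocate() on a block device sends REQ_OP_PROVISION for the range.
 * Illustrative only.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/dev/mapper/thin-vol", O_RDWR);  /* hypothetical */

        if (fd < 0) {
                perror("open");
                return 1;
        }

        /*
         * Provision the first 1GiB. The metadata mkfs subsequently
         * writes here is plain user data IO as far as the block layer
         * is concerned - there is no way to mark it REQ_META - yet it
         * needs exactly the same no-ENOSPC guarantee as kernel
         * metadata writeback.
         */
        if (fallocate(fd, 0, 0, 1ULL << 30) != 0)
                perror("fallocate");

        close(fd);
        return 0;
}

Whatever guarantee those subsequent writes get has to come from the
provisioned range itself, not from how the IO is tagged.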
Especially considering that we really only need one type of guarantee,
regardless of where the IO originates from or what type of data the IO
contains....

> Does that imply that
> REQ_META also needs to be passed through the block/filesystem stack
> (e.g. REQ_OP_PROVISION + REQ_META on a loop device translates to a
> fallocate() to the underlying file)?

This is exactly the same case as above: the loopback device does user
data IO to the backing file. Hence we have another situation where
metadata IO is issued to fallocate()d user data ranges as user data,
and so would be given a lesser guarantee that would lead to upper
filesystem failure. Both upper and lower filesystem data and metadata
need to be provided the same ENOSPC guarantees by their backing
stores....

The whole point of the REQ_OP_PROVISION proposal I made is that it
doesn't require any special handling in corner cases like this. There
are no cross-layer interactions needed to make everything work
correctly, because the provisioning guarantee is not *data type
dependent*. The entire user IO path code remains untouched and
blissfully unaware of provisioned regions.

And, realistically, if we have to start handling complex corner cases
in the filesystem and IO path layers to make REQ_OP_PROVISION work
correctly because of arbitrary constraints imposed by the block layer
implementations, then we've failed miserably at the design and
architecture stage.

Keep in mind that every attempt made so far to address the problems
with block device ENOSPC errors has failed because of the complexity of
the corner cases that have arisen during design and/or implementation.
It's pretty ironic that now that we have a proposal that is remarkably
simple, free of corner cases and has virtually no cross-layer coupling
at all, the first thing people want to do is add arbitrary
implementation constraints that result in complex cross-layer corner
cases that then need to be handled....

Put simply: if we restrict REQ_OP_PROVISION guarantees to just REQ_META
writes (or any other specific type of write operation), then it's
simply not worth pursuing at the filesystem level, because the
guarantees we actually need just aren't there and the complexity of
discovering and handling those corner cases just isn't worth the
effort.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com