On Thu, Sep 15, 2022 at 09:48:18AM -0700, Sarthak Kukreti wrote:
> From: Sarthak Kukreti
>
> Hi,
>
> This patch series is an RFC of a mechanism to pass through provision
> requests on stacked thinly provisioned storage devices/filesystems.
>
> The linux kernel provides several mechanisms to set up thinly
> provisioned block storage abstractions (eg. dm-thin, loop devices over
> sparse files), either directly as block devices or backing storage for
> filesystems. Currently, short of writing data to either the device or
> filesystem, there is no way for users to pre-allocate space for use in
> such storage setups. Consider the following use-cases:
>
> 1) Suspend-to-disk and resume from a dm-thin device: In order to
>    ensure that the underlying thinpool metadata is not modified during
>    the suspend mechanism, the dm-thin device needs to be fully
>    provisioned.
> 2) If a filesystem uses a loop device over a sparse file, fallocate()
>    on the filesystem will allocate blocks for files but the underlying
>    sparse file will remain intact.
> 3) Another example is virtual machine using a sparse file/dm-thin as a
>    storage device; by default, allocations within the VM boundaries
>    will not affect the host.
> 4) Several storage standards support mechanisms for thin provisioning
>    on real hardware devices. For example:
>    a. The NVMe spec 1.0b section 2.1.1 loosely talks about thin
>       provisioning: "When the THINP bit in the NSFEAT field of the
>       Identify Namespace data structure is set to ‘1’, the controller
>       ... shall track the number of allocated blocks in the Namespace
>       Utilization field"
>    b. The SCSi Block Commands reference - 4 section references "Thin
>       provisioned logical units",
>    c. UFS 3.0 spec section 13.3.3 references "Thin provisioning".

When REQ_OP_PROVISION is sent on an already-allocated range of blocks,
are those blocks zeroed? NVMe Write Zeroes with Deallocate=0 works this
way, for example. That behavior is counterintuitive since the operation
name suggests it just affects the logical block's provisioning state,
not the contents of the blocks.

> In all of the above situations, currently the only way for
> pre-allocating space is to issue writes (or use
> WRITE_ZEROES/WRITE_SAME). However, that does not scale well with
> larger pre-allocation sizes.

What exactly is the issue with WRITE_ZEROES scalability? Are you
referring to cases where the device doesn't support an efficient
WRITE_ZEROES command and actually writes blocks filled with zeroes
instead of updating internal allocation metadata cheaply?

Stefan
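
To make the first question above concrete, here is a toy in-memory model of the two behaviours being contrasted. It is only a sketch: `ToyThinDevice`, `provision()` and `write_zeroes()` are hypothetical names invented for illustration, not kernel interfaces, and the content-preserving `provision()` shown here is an assumption about one possible semantic, not what the patch series is known to implement.

```python
BLOCK_SIZE = 4096  # illustrative block size, not taken from the patch series


class ToyThinDevice:
    """Toy model of a thinly provisioned device (hypothetical, not kernel code)."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.blocks = {}  # block index -> bytes; a missing key means unallocated

    def _check(self, start, count):
        if start < 0 or count < 0 or start + count > self.num_blocks:
            raise ValueError("range outside device")

    def write(self, block, data):
        self._check(block, 1)
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block")
        self.blocks[block] = bytes(data)

    def read(self, block):
        self._check(block, 1)
        # Unallocated blocks read back as zeroes in this model.
        return self.blocks.get(block, bytes(BLOCK_SIZE))

    def provision(self, start, count):
        # Assumed semantic: allocate any missing blocks in the range and
        # leave already-allocated blocks (and their contents) untouched.
        self._check(start, count)
        for b in range(start, start + count):
            self.blocks.setdefault(b, bytes(BLOCK_SIZE))

    def write_zeroes(self, start, count):
        # Allocates the range as well, but also overwrites existing contents.
        self._check(start, count)
        for b in range(start, start + count):
            self.blocks[b] = bytes(BLOCK_SIZE)

    def allocated(self):
        return len(self.blocks)


dev = ToyThinDevice(8)
payload = b"\xab" * BLOCK_SIZE
dev.write(2, payload)

dev.provision(0, 4)
print(dev.allocated())          # 4
print(dev.read(2) == payload)   # True: contents survive provisioning

dev.write_zeroes(0, 4)
print(dev.read(2) == payload)   # False: contents were zeroed
```

In this model both operations leave the same four blocks allocated; the only observable difference is whether block 2 still holds its data afterwards, which is the distinction the question is about.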