2022-12-29 08:16:03

by Sarthak Kukreti

Subject: [PATCH v2 0/8] Introduce provisioning primitives for thinly provisioned storage

Hi,

This patch series adds a mechanism to pass through provision requests on
stacked thinly provisioned storage devices/filesystems.

The Linux kernel provides several mechanisms to set up thinly provisioned
block storage abstractions (e.g. dm-thin, loop devices over sparse files),
either directly as block devices or as backing storage for filesystems. Currently,
short of writing data to either the device or filesystem, there is no way for
users to pre-allocate space for use in such storage setups. Consider the
following use-cases:

1) Suspend-to-disk and resume from a dm-thin device: In order to ensure that
the underlying thinpool metadata is not modified during the suspend
mechanism, the dm-thin device needs to be fully provisioned.
2) If a filesystem uses a loop device over a sparse file, fallocate() on the
filesystem will allocate blocks for files but the underlying sparse file
will remain intact.
3) Another example is a virtual machine using a sparse file/dm-thin as a storage
device; by default, allocations within the VM boundaries will not affect
the host.
4) Several storage standards support mechanisms for thin provisioning on
real hardware devices. For example:
a. The NVMe spec 1.0b section 2.1.1 loosely talks about thin provisioning:
"When the THINP bit in the NSFEAT field of the Identify Namespace data
structure is set to ‘1’, the controller ... shall track the number of
allocated blocks in the Namespace Utilization field"
b. The SCSI Block Commands reference - 4 section references "Thin
provisioned logical units",
c. UFS 3.0 spec section 13.3.3 references "Thin provisioning".

In all the above situations, currently, the only way for pre-allocating space
is to issue writes (or use WRITE_ZEROES/WRITE_SAME). However, that does not
scale well with larger pre-allocation sizes.
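
For illustration, the write-based approach described above might look like the
following minimal sketch. The helper name preallocate_by_writing() and the
64 KiB chunk size are assumptions made for this example and are not part of
the patch series. The cost grows linearly with the requested size because
every byte has to be written.

```c
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: force allocation of the first `len` bytes of `fd`
 * by writing zeroes over them. Note that this overwrites existing data.
 * Returns 0 on success, -1 on failure (errno set by pwrite()/fsync()).
 */
static int preallocate_by_writing(int fd, off_t len)
{
	static char zeros[64 * 1024];	/* zero-initialised chunk buffer */
	off_t done = 0;

	while (done < len) {
		size_t chunk = sizeof(zeros);
		ssize_t n;

		if ((off_t)chunk > len - done)
			chunk = (size_t)(len - done);

		n = pwrite(fd, zeros, chunk, done);
		if (n <= 0)
			return -1;
		done += n;
	}

	/* Make sure the data (and hence the allocation) reaches the device. */
	return fsync(fd);
}
```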

This patchset introduces primitives to support block-level provisioning (note:
the term 'provisioning' is used to prevent overloading the term
'allocations/pre-allocations') requests across filesystems and block devices.
This allows fallocate() and file creation requests to reserve space across
stacked layers of block devices and filesystems. Currently, the patchset covers
a prototype on the device-mapper targets, loop device and ext4, but the same
mechanism can be extended to other filesystems/block devices as well as extended
for use with devices in 4 a-c.

Patch 1 introduces REQ_OP_PROVISION as a new request type.
The provision request acts like the inverse of a discard request; instead
of notifying lower layers that the block range will no longer be used, provision
acts as a request to lower layers to provision disk space for the given block
range. Real hardware storage devices will currently disable the provisioning
capability but for the standards listed in 4a.-c., REQ_OP_PROVISION can be
overloaded for use as the provisioning primitive for future devices.
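
The provision/discard relationship can be pictured with a toy userspace model.
This is only a sketch under stated assumptions: struct thin_dev, the block
counts and the all-or-nothing behaviour of thin_provision() are invented for
illustration and do not reflect how dm-thin or the block layer implement
REQ_OP_PROVISION.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_BLOCKS 8	/* physical blocks in the toy pool */
#define DEV_BLOCKS 16	/* virtual blocks exposed by the thin device */

struct thin_dev {
	bool mapped[DEV_BLOCKS];	/* true if the block has backing space */
	size_t pool_free;		/* unallocated physical blocks */
};

static void thin_init(struct thin_dev *d)
{
	for (size_t i = 0; i < DEV_BLOCKS; i++)
		d->mapped[i] = false;
	d->pool_free = POOL_BLOCKS;
}

/* Provision: make sure every block in [start, start + len) is backed. */
static int thin_provision(struct thin_dev *d, size_t start, size_t len)
{
	size_t needed = 0;

	if (start > DEV_BLOCKS || len > DEV_BLOCKS - start)
		return -EINVAL;
	for (size_t i = start; i < start + len; i++)
		if (!d->mapped[i])
			needed++;
	if (needed > d->pool_free)
		return -ENOSPC;	/* toy model: fail without partial allocation */
	for (size_t i = start; i < start + len; i++) {
		if (!d->mapped[i]) {
			d->mapped[i] = true;
			d->pool_free--;
		}
	}
	return 0;
}

/* Discard: the inverse operation, returning backing space to the pool. */
static int thin_discard(struct thin_dev *d, size_t start, size_t len)
{
	if (start > DEV_BLOCKS || len > DEV_BLOCKS - start)
		return -EINVAL;
	for (size_t i = start; i < start + len; i++) {
		if (d->mapped[i]) {
			d->mapped[i] = false;
			d->pool_free++;
		}
	}
	return 0;
}
```

In this model a provision that would exceed the pool fails up front with
-ENOSPC, and a later discard of other blocks can make the same request succeed.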

Patch 2 implements REQ_OP_PROVISION handling for some of the device-mapper
targets. This additionally adds support for pre-allocating space for thinly
provisioned logical volumes via fallocate().

Patch 3 introduces an fallocate() mode (FALLOC_FL_PROVISION) that sends a
provision request to the underlying block device (and beyond). This acts as the
primary mechanism for file provisioning and also disambiguates the notion of
virtual and true disk space allocations for thinly provisioned storage devices/
filesystems. With patch 3, the 'default' fallocate() mode is preserved to
perform preallocation at the current allocation layer and 'provision' mode
adds the capability to punch through the allocations to the underlying thinly
provisioned storage layers. For regular filesystems, both allocation modes
are equivalent.
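
From userspace, a caller might try the new mode first and fall back to the
default mode when it is unavailable. The sketch below assumes that headers from
a kernel with this series applied define FALLOC_FL_PROVISION, and that an
unsupported mode is reported as EOPNOTSUPP; it deliberately does not hard-code
a flag value. The helper name reserve_range() is hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/*
 * Hypothetical helper: reserve [offset, offset + len) in `fd`.
 * Prefers provisioning through the stacked layers when the flag is
 * available at build time, otherwise uses the default fallocate() mode.
 * Returns 0 on success, -1 on failure with errno set.
 */
static int reserve_range(int fd, off_t offset, off_t len)
{
#ifdef FALLOC_FL_PROVISION	/* only defined by patched headers (assumption) */
	if (fallocate(fd, FALLOC_FL_PROVISION, offset, len) == 0)
		return 0;
	if (errno != EOPNOTSUPP)
		return -1;
	/* Provision mode not supported here: fall through to default mode. */
#endif
	return fallocate(fd, 0, offset, len);
}
```

On a kernel or filesystem without provision support this degrades to the
existing behaviour, which only preallocates at the current allocation layer.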

Patch 4 wires up the loop device handling of REQ_OP_PROVISION.

Patches 5-7 cover a prototype implementation for ext4, which includes wiring up
the fallocate() implementation, introducing a filesystem level option (called
'provision') to control the default allocation behaviour and, finally, a
file-level override to retain current handling, even on filesystems mounted with
'provision'. These options allow users of stacked filesystems to flexibly take
advantage of provisioning.
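
The way the mount option and the per-file override combine can be sketched as
a small pure function that mirrors the logic added to ext4_alloc_file_blocks()
in the last patch of this series. The function name effective_provision() is
hypothetical; file_override stands for the value returned by
ext4_file_provision_support(): 1 for 'y', 0 for 'n', negative when the xattr is
missing or invalid.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: decide whether an allocation should also be
 * provisioned on the underlying device.
 */
static bool effective_provision(bool mount_provision, int file_override)
{
	bool provision = mount_provision;

	/* A valid per-file override is AND-ed with the mount setting. */
	if (file_override >= 0)
		provision = provision && (file_override != 0);

	return provision;
}
```

As written in the patch, the override can therefore only turn provisioning off
for a file on a 'provision' mount; it does not turn it on for a file on a
'noprovision' mount.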

Testing:
--------
- Tested on a VM running a 6.2 kernel.
- The following performance measurements were collected with fallocate(1)
patched to add support for FALLOC_FL_PROVISION via a command line option
`-p/--provision`.

- Preallocation of dm-thin devices:
As expected, avoiding the need to zero out thinly-provisioned block devices to
preallocate space speeds up the provisioning operation significantly:

The following was tested on a dm-thin device set up on top of a dm-thinp with
skip_block_zeroing=true.
A) Zeroout was measured using `fallocate -z ...`
B) Provision was measured using `fallocate -p ...`.

Size  Time  A       B
512M  real  1.093   0.034
      user  0       0
      sys   0.022   0.01
1G    real  2.182   0.048
      user  0       0.01
      sys   0.022   0
2G    real  4.344   0.082
      user  0       0.01
      sys   0.036   0
4G    real  8.679   0.153
      user  0       0.01
      sys   0.073   0
8G    real  17.777  0.318
      user  0       0.01
      sys   0.144   0
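
To put the "real" rows above in perspective, the speedup of provision (B) over
zeroout (A) can be computed directly from the table. The struct and function
names below are made up for this illustration; the numbers are copied from the
measurements above.

```c
struct prealloc_sample {
	const char *size;
	double zeroout_real;	/* column A, "real" row */
	double provision_real;	/* column B, "real" row */
};

static const struct prealloc_sample samples[] = {
	{ "512M",  1.093, 0.034 },
	{ "1G",    2.182, 0.048 },
	{ "2G",    4.344, 0.082 },
	{ "4G",    8.679, 0.153 },
	{ "8G",   17.777, 0.318 },
};

/* Ratio of zeroout time to provision time for one measurement. */
static double speedup(const struct prealloc_sample *s)
{
	return s->zeroout_real / s->provision_real;
}
```

For these measurements the ratio ranges from roughly 32x at 512M to roughly
56x at the larger sizes.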

- Preallocation of files on filesystems
Since fallocate() with FALLOC_FL_PROVISION can now pass down through
filesystems/block devices, this results in an expected slowdown in fallocate()
calls if the provision request is sent to the underlying layers.

The measurements were taken using fallocate() on ext4 filesystems set up with
the following opts/block devices:
A) ext4 filesystem mounted with 'noprovision'
B) ext4 filesystem mounted with 'provision' on a dm-thin device.
C) ext4 filesystem mounted with 'provision' on a loop device with a sparse
backing file on the filesystem in (B).

Size  Time  A      B      C
512M  real  0.011  0.036  0.041
      user  0.02   0.03   0.002
      sys   0      0      0
1G    real  0.011  0.055  0.064
      user  0      0      0.03
      sys   0.003  0.004  0
2G    real  0.011  0.109  0.117
      user  0      0      0.004
      sys   0.003  0.006  0
4G    real  0.011  0.224  0.231
      user  0      0      0.006
      sys   0.004  0.012  0
8G    real  0.017  0.426  0.527
      user  0      0      0.013
      sys   0.009  0.024  0

As expected, the additional provision requests will slow down fallocate() calls
and the degree of slowdown depends on the number of layers that the provision
request is passed through to as well as the complexity of allocation on those
layers.

TODOs:
------
- Xfstests for validating provisioning results in allocation.

Changelog:

V2:
- Fix stacked limit handling.
- Enable provision request handling in dm-snapshot
- Don't call truncate_bdev_range if blkdev_fallocate() is called with
FALLOC_FL_PROVISION.
- Clarify semantics of FALLOC_FL_PROVISION and why it needs to be a separate flag
(as opposed to overloading mode == 0).


2022-12-29 08:16:19

by Sarthak Kukreti

Subject: [PATCH v2 7/7] ext4: Add a per-file provision override xattr

Adds a per-file provision override that allows select files to
override the per-mount setting for provisioning blocks on allocation.

This acts as a mechanism to allow mounts using provision to
replicate the current behavior for fallocate() and only preserve
space at the filesystem level.

Signed-off-by: Sarthak Kukreti <[email protected]>
---
fs/ext4/extents.c | 32 ++++++++++++++++++++++++++++++++
fs/ext4/xattr.h | 1 +
2 files changed, 33 insertions(+)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index a73f44264fe2..9861115681b3 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4428,6 +4428,26 @@ int ext4_ext_truncate(handle_t *handle, struct inode *inode)
return err;
}

+static int ext4_file_provision_support(struct inode *inode)
+{
+ char provision;
+ int ret =
+ ext4_xattr_get(inode, EXT4_XATTR_INDEX_TRUSTED,
+ EXT4_XATTR_NAME_PROVISION_POLICY, &provision, 1);
+
+ if (ret < 0)
+ return ret;
+
+ switch (provision) {
+ case 'y':
+ return 1;
+ case 'n':
+ return 0;
+ default:
+ return -EINVAL;
+ }
+}
+
static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
ext4_lblk_t len, loff_t new_size,
int flags)
@@ -4440,12 +4460,24 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
struct ext4_map_blocks map;
unsigned int credits;
loff_t epos;
+ bool provision = false;
+ int file_provision_override = -1;

/*
* Attempt to provision file blocks if the mount is mounted with
* provision.
*/
if (test_opt2(inode->i_sb, PROVISION))
+ provision = true;
+
+ /*
+ * Use file-specific override, if available.
+ */
+ file_provision_override = ext4_file_provision_support(inode);
+ if (file_provision_override >= 0)
+ provision &= file_provision_override;
+
+ if (provision)
flags |= EXT4_GET_BLOCKS_PROVISION;

BUG_ON(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS));
diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index 824faf0b15a8..69e97f853b0c 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -140,6 +140,7 @@ extern const struct xattr_handler ext4_xattr_security_handler;
extern const struct xattr_handler ext4_xattr_hurd_handler;

#define EXT4_XATTR_NAME_ENCRYPTION_CONTEXT "c"
+#define EXT4_XATTR_NAME_PROVISION_POLICY "provision"

/*
* The EXT4_STATE_NO_EXPAND is overloaded and used for two purposes.
--
2.37.3

2022-12-29 08:17:26

by Sarthak Kukreti

Subject: [PATCH v2 6/7] ext4: Add mount option for provisioning blocks during allocations

Add a mount option that sets the default provisioning mode for
all files within the filesystem.

Signed-off-by: Sarthak Kukreti <[email protected]>
---
fs/ext4/ext4.h | 1 +
fs/ext4/extents.c | 7 +++++++
fs/ext4/super.c | 7 +++++++
3 files changed, 15 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 49832e90b62f..29cab2e2ea20 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1269,6 +1269,7 @@ struct ext4_inode_info {
#define EXT4_MOUNT2_MB_OPTIMIZE_SCAN 0x00000080 /* Optimize group
* scanning in mballoc
*/
+#define EXT4_MOUNT2_PROVISION 0x00000100 /* Provision while allocating file blocks */

#define clear_opt(sb, opt) EXT4_SB(sb)->s_mount_opt &= \
~EXT4_MOUNT_##opt
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 2e64a9211792..a73f44264fe2 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4441,6 +4441,13 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
unsigned int credits;
loff_t epos;

+ /*
+ * Attempt to provision file blocks if the mount is mounted with
+ * provision.
+ */
+ if (test_opt2(inode->i_sb, PROVISION))
+ flags |= EXT4_GET_BLOCKS_PROVISION;
+
BUG_ON(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS));
map.m_lblk = offset;
map.m_len = len;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 260c1b3e3ef2..5bc376f6a6f0 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1591,6 +1591,7 @@ enum {
Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
Opt_no_prefetch_block_bitmaps, Opt_mb_optimize_scan,
Opt_errors, Opt_data, Opt_data_err, Opt_jqfmt, Opt_dax_type,
+ Opt_provision, Opt_noprovision,
#ifdef CONFIG_EXT4_DEBUG
Opt_fc_debug_max_replay, Opt_fc_debug_force
#endif
@@ -1737,6 +1738,8 @@ static const struct fs_parameter_spec ext4_param_specs[] = {
fsparam_flag ("reservation", Opt_removed), /* mount option from ext2/3 */
fsparam_flag ("noreservation", Opt_removed), /* mount option from ext2/3 */
fsparam_u32 ("journal", Opt_removed), /* mount option from ext2/3 */
+ fsparam_flag ("provision", Opt_provision),
+ fsparam_flag ("noprovision", Opt_noprovision),
{}
};

@@ -1826,6 +1829,8 @@ static const struct mount_opts {
{Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
{Opt_no_prefetch_block_bitmaps, EXT4_MOUNT_NO_PREFETCH_BLOCK_BITMAPS,
MOPT_SET},
+ {Opt_provision, EXT4_MOUNT2_PROVISION, MOPT_SET | MOPT_2},
+ {Opt_noprovision, EXT4_MOUNT2_PROVISION, MOPT_CLEAR | MOPT_2},
#ifdef CONFIG_EXT4_DEBUG
{Opt_fc_debug_force, EXT4_MOUNT2_JOURNAL_FAST_COMMIT,
MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY},
@@ -2977,6 +2982,8 @@ static int _ext4_show_options(struct seq_file *seq, struct super_block *sb,
SEQ_OPTS_PUTS("dax=never");
} else if (test_opt2(sb, DAX_INODE)) {
SEQ_OPTS_PUTS("dax=inode");
+ } else if (test_opt2(sb, PROVISION)) {
+ SEQ_OPTS_PUTS("provision");
}

if (sbi->s_groups_count >= MB_DEFAULT_LINEAR_SCAN_THRESHOLD &&
--
2.37.3

2023-01-09 15:09:32

by Brian Foster

Subject: Re: [PATCH v2 6/7] ext4: Add mount option for provisioning blocks during allocations

On Thu, Dec 29, 2022 at 12:12:51AM -0800, Sarthak Kukreti wrote:
> Add a mount option that sets the default provisioning mode for
> all files within the filesystem.
>

There's not much description here to explain what a user should expect
from this mode. Should the user expect -ENOSPC from the lower layers
once out of space? What about files that are provisioned by the fs and
then freed? Should the user/admin know to run fstrim or also enable an
online discard mechanism to ensure freed filesystem blocks are returned
to the free pool in the block/dm layer in order to avoid premature fs
-ENOSPC conditions?

Also, what about dealing with block level snapshots? There is some
discussion on previous patches wrt expectations on how provision
might handle unsharing of cow'd blocks. If the fs only provisions on
initial allocation, then a subsequent snapshot means we run into the
same sort of ENOSPC problem on overwrites of already allocated blocks.
That also doesn't consider things like an internal log, which may have
been physically allocated (provisioned?) at mkfs time and yet is subject
to the same general problem.

So what is the higher level goal with this sort of mode? Is
provision-on-alloc sufficient to provide a practical benefit to users,
or should this perhaps consider other scenarios where a provision might
be warranted before submitting writes to a thinly provisioned device?

FWIW, it seems reasonable to me to introduce this without snapshot
support and work toward it later, but it should be made clear what is
being advertised in the meantime. Unless there's some nice way to
explicitly limit the scope of use, such as preventing snapshots or
something, the fs might want to consider this sort of feature
experimental until it is more fully functional.

Brian

> Signed-off-by: Sarthak Kukreti <[email protected]>
> ---
> fs/ext4/ext4.h | 1 +
> fs/ext4/extents.c | 7 +++++++
> fs/ext4/super.c | 7 +++++++
> 3 files changed, 15 insertions(+)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 49832e90b62f..29cab2e2ea20 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1269,6 +1269,7 @@ struct ext4_inode_info {
> #define EXT4_MOUNT2_MB_OPTIMIZE_SCAN 0x00000080 /* Optimize group
> * scanning in mballoc
> */
> +#define EXT4_MOUNT2_PROVISION 0x00000100 /* Provision while allocating file blocks */
>
> #define clear_opt(sb, opt) EXT4_SB(sb)->s_mount_opt &= \
> ~EXT4_MOUNT_##opt
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 2e64a9211792..a73f44264fe2 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4441,6 +4441,13 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
> unsigned int credits;
> loff_t epos;
>
> + /*
> + * Attempt to provision file blocks if the mount is mounted with
> + * provision.
> + */
> + if (test_opt2(inode->i_sb, PROVISION))
> + flags |= EXT4_GET_BLOCKS_PROVISION;
> +
> BUG_ON(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS));
> map.m_lblk = offset;
> map.m_len = len;
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 260c1b3e3ef2..5bc376f6a6f0 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1591,6 +1591,7 @@ enum {
> Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
> Opt_no_prefetch_block_bitmaps, Opt_mb_optimize_scan,
> Opt_errors, Opt_data, Opt_data_err, Opt_jqfmt, Opt_dax_type,
> + Opt_provision, Opt_noprovision,
> #ifdef CONFIG_EXT4_DEBUG
> Opt_fc_debug_max_replay, Opt_fc_debug_force
> #endif
> @@ -1737,6 +1738,8 @@ static const struct fs_parameter_spec ext4_param_specs[] = {
> fsparam_flag ("reservation", Opt_removed), /* mount option from ext2/3 */
> fsparam_flag ("noreservation", Opt_removed), /* mount option from ext2/3 */
> fsparam_u32 ("journal", Opt_removed), /* mount option from ext2/3 */
> + fsparam_flag ("provision", Opt_provision),
> + fsparam_flag ("noprovision", Opt_noprovision),
> {}
> };
>
> @@ -1826,6 +1829,8 @@ static const struct mount_opts {
> {Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
> {Opt_no_prefetch_block_bitmaps, EXT4_MOUNT_NO_PREFETCH_BLOCK_BITMAPS,
> MOPT_SET},
> + {Opt_provision, EXT4_MOUNT2_PROVISION, MOPT_SET | MOPT_2},
> + {Opt_noprovision, EXT4_MOUNT2_PROVISION, MOPT_CLEAR | MOPT_2},
> #ifdef CONFIG_EXT4_DEBUG
> {Opt_fc_debug_force, EXT4_MOUNT2_JOURNAL_FAST_COMMIT,
> MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY},
> @@ -2977,6 +2982,8 @@ static int _ext4_show_options(struct seq_file *seq, struct super_block *sb,
> SEQ_OPTS_PUTS("dax=never");
> } else if (test_opt2(sb, DAX_INODE)) {
> SEQ_OPTS_PUTS("dax=inode");
> + } else if (test_opt2(sb, PROVISION)) {
> + SEQ_OPTS_PUTS("provision");
> }
>
> if (sbi->s_groups_count >= MB_DEFAULT_LINEAR_SCAN_THRESHOLD &&
> --
> 2.37.3
>