2022-09-15 16:50:10

by Sarthak Kukreti

Subject: [PATCH RFC 0/8] Introduce provisioning primitives for thinly provisioned storage

From: Sarthak Kukreti <[email protected]>

Hi,

This patch series is an RFC of a mechanism to pass through provision requests on stacked thinly provisioned storage devices/filesystems.

The Linux kernel provides several mechanisms to set up thinly provisioned block storage abstractions (e.g. dm-thin, loop devices over sparse files), either directly as block devices or as backing storage for filesystems. Currently, short of writing data to either the device or filesystem, there is no way for users to pre-allocate space for use in such storage setups. Consider the following use-cases:

1) Suspend-to-disk and resume from a dm-thin device: In order to ensure that the underlying thinpool metadata is not modified during the suspend mechanism, the dm-thin device needs to be fully provisioned.
2) If a filesystem uses a loop device over a sparse file, fallocate() on the filesystem will allocate blocks for files but the underlying sparse file will remain unchanged (no space is allocated in it).
3) Another example is a virtual machine using a sparse file/dm-thin as a storage device; by default, allocations within the VM boundaries will not affect the host.
4) Several storage standards support mechanisms for thin provisioning on real hardware devices. For example:
a. The NVMe spec 1.0b section 2.1.1 loosely talks about thin provisioning: "When the THINP bit in the NSFEAT field of the Identify Namespace data structure is set to ‘1’, the controller ... shall track the number of allocated blocks in the Namespace Utilization field"
b. The SCSI Block Commands - 4 reference has a section on "Thin provisioned logical units",
c. UFS 3.0 spec section 13.3.3 references "Thin provisioning".

In all of the above situations, the only way to pre-allocate space today is to issue writes (or use WRITE_ZEROES/WRITE_SAME). However, that does not scale well with larger pre-allocation sizes.

This patchset introduces primitives to support block-level provisioning requests across filesystems and block devices (the term 'provisioning' is used to avoid overloading the terms 'allocation/pre-allocation'). This allows fallocate() and file creation requests to reserve space across stacked layers of block devices and filesystems. Currently, the patchset covers a prototype on the device-mapper targets, loop device and ext4, but the same mechanism can be extended to other filesystems/block devices, as well as to the devices in 4 a-c.

Patch 1 introduces REQ_OP_PROVISION as a new request type. The provision request acts like the inverse of a discard request; instead of notifying lower layers that the block range will no longer be used, provision acts as a request to lower layers to provision disk space for the given block range. Real hardware storage devices will currently disable the provisioning capability, but for the standards listed in 4a.-c., REQ_OP_PROVISION can be overloaded for use as the provisioning primitive for future devices.

Patch 2 implements REQ_OP_PROVISION handling for some of the device-mapper targets. This additionally adds support for pre-allocating space for thinly provisioned logical volumes via fallocate().

Patch 3 implements the handling for virtio-blk.

Patch 4 introduces an fallocate() mode (FALLOC_FL_PROVISION) that sends a provision request to the underlying block device (and beyond). This acts as the primary mechanism for file-level provisioning.

Patch 5 wires up the loop device handling of REQ_OP_PROVISION.

Patches 6-8 cover a prototype implementation for ext4, which includes wiring up the fallocate() implementation, introducing a filesystem-level mount option (called 'provision') to control the default allocation behaviour, and finally a file-level override to retain current handling, even on filesystems mounted with 'provision'.

Testing:
--------
- A backport of this patch series was tested on ChromiumOS using a 5.10 kernel.
- File on ext4 on a thin logical volume: fallocate(FALLOC_FL_PROVISION) : 4.6s, dd if=/dev/zero of=...: 6 mins.
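
For illustration, the fallocate(FALLOC_FL_PROVISION) call timed above could be issued from userspace roughly as follows. This is a minimal sketch, not part of the series: provision_range() is a hypothetical helper name, and the flag value 0x80 is taken from patch 4/8 of this RFC rather than from any released uapi header. On a kernel that does not carry these patches the bit may be rejected or interpreted differently, so the call is only meaningful on a patched kernel.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* for fallocate() in <fcntl.h> */
#endif
#include <errno.h>
#include <fcntl.h>

/* Value proposed by patch 4/8 of this RFC; an assumption outside that context. */
#ifndef FALLOC_FL_PROVISION
#define FALLOC_FL_PROVISION 0x80
#endif

/*
 * Hypothetical helper: ask the filesystem (and the layers below it) to
 * provision the byte range [offset, offset + len) of an open file.
 * Returns 0 on success or a negative errno value.
 */
static int provision_range(int fd, off_t offset, off_t len)
{
	if (fallocate(fd, FALLOC_FL_PROVISION, offset, len) < 0)
		return -errno;
	return 0;
}
```

A caller would typically treat -EOPNOTSUPP as "fall back to the existing behaviour" rather than as a hard failure.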

TODOs:
------
1) The stacked block devices (dm-*, loop etc.) currently unconditionally pass through provision requests. Add support for provision, similar to how discard handling is set up (with options to disable, passdown or passthrough requests).
2) Blktests and Xfstests for validating provisioning.


2022-09-15 16:50:22

by Sarthak Kukreti

Subject: [PATCH RFC 2/8] dm: Add support for block provisioning

From: Sarthak Kukreti <[email protected]>

Add support to dm devices for REQ_OP_PROVISION. The default mode
is to pass through the request and dm-thin will utilize it to provision
blocks.

Signed-off-by: Sarthak Kukreti <[email protected]>
---
drivers/md/dm-crypt.c | 4 +-
drivers/md/dm-linear.c | 1 +
drivers/md/dm-table.c | 17 +++++++
drivers/md/dm-thin.c | 86 +++++++++++++++++++++++++++++++++--
drivers/md/dm.c | 4 ++
include/linux/device-mapper.h | 6 +++
6 files changed, 113 insertions(+), 5 deletions(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 159c6806c19b..357f0899cfb6 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -3081,6 +3081,8 @@ static int crypt_ctr_optional(struct dm_target *ti, unsigned int argc, char **ar
if (ret)
return ret;

+ ti->num_provision_bios = 1;
+
while (opt_params--) {
opt_string = dm_shift_arg(&as);
if (!opt_string) {
@@ -3384,7 +3386,7 @@ static int crypt_map(struct dm_target *ti, struct bio *bio)
* - for REQ_OP_DISCARD caller must use flush if IO ordering matters
*/
if (unlikely(bio->bi_opf & REQ_PREFLUSH ||
- bio_op(bio) == REQ_OP_DISCARD)) {
+ bio_op(bio) == REQ_OP_DISCARD || bio_op(bio) == REQ_OP_PROVISION)) {
bio_set_dev(bio, cc->dev->bdev);
if (bio_sectors(bio))
bio->bi_iter.bi_sector = cc->start +
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 3212ef6aa81b..1aa782149428 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -61,6 +61,7 @@ static int linear_ctr(struct dm_target *ti, unsigned int argc, char **argv)
ti->num_discard_bios = 1;
ti->num_secure_erase_bios = 1;
ti->num_write_zeroes_bios = 1;
+ ti->num_provision_bios = 1;
ti->private = lc;
return 0;

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 332f96b58252..b7f9cb66b7ba 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1853,6 +1853,18 @@ static bool dm_table_supports_write_zeroes(struct dm_table *t)
return true;
}

+static bool dm_table_supports_provision(struct dm_table *t)
+{
+ for (unsigned int i = 0; i < t->num_targets; i++) {
+ struct dm_target *ti = dm_table_get_target(t, i);
+
+ if (ti->num_provision_bios)
+ return true;
+ }
+
+ return false;
+}
+
static int device_not_nowait_capable(struct dm_target *ti, struct dm_dev *dev,
sector_t start, sector_t len, void *data)
{
@@ -1989,6 +2001,11 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
if (!dm_table_supports_write_zeroes(t))
q->limits.max_write_zeroes_sectors = 0;

+ if (dm_table_supports_provision(t))
+ blk_queue_max_provision_sectors(q, UINT_MAX >> 9);
+ else
+ q->limits.max_provision_sectors = 0;
+
dm_table_verify_integrity(t);

/*
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index e76c96c760a9..fd3eb306c823 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -908,7 +908,8 @@ static void __inc_remap_and_issue_cell(void *context,
struct bio *bio;

while ((bio = bio_list_pop(&cell->bios))) {
- if (op_is_flush(bio->bi_opf) || bio_op(bio) == REQ_OP_DISCARD)
+ if (op_is_flush(bio->bi_opf) || bio_op(bio) == REQ_OP_DISCARD ||
+ bio_op(bio) == REQ_OP_PROVISION)
bio_list_add(&info->defer_bios, bio);
else {
inc_all_io_entry(info->tc->pool, bio);
@@ -1012,6 +1013,9 @@ static void process_prepared_mapping(struct dm_thin_new_mapping *m)
goto out;
}

+ if (bio && bio_op(bio) == REQ_OP_PROVISION)
+ return;
+
/*
* Release any bios held while the block was being provisioned.
* If we are processing a write bio that completely covers the block,
@@ -1388,6 +1392,9 @@ static void schedule_zero(struct thin_c *tc, dm_block_t virt_block,
m->data_block = data_block;
m->cell = cell;

+ if (bio && bio_op(bio) == REQ_OP_PROVISION)
+ m->bio = bio;
+
/*
* If the whole block of data is being overwritten or we are not
* zeroing pre-existing data, we can issue the bio immediately.
@@ -1897,7 +1904,7 @@ static void provision_block(struct thin_c *tc, struct bio *bio, dm_block_t block
/*
* Fill read bios with zeroes and complete them immediately.
*/
- if (bio_data_dir(bio) == READ) {
+ if (bio_data_dir(bio) == READ && bio_op(bio) != REQ_OP_PROVISION) {
zero_fill_bio(bio);
cell_defer_no_holder(tc, cell);
bio_endio(bio);
@@ -1980,6 +1987,69 @@ static void process_cell(struct thin_c *tc, struct dm_bio_prison_cell *cell)
}
}

+static void process_provision_cell(struct thin_c *tc, struct dm_bio_prison_cell *cell)
+{
+ int r;
+ struct pool *pool = tc->pool;
+ struct bio *bio = cell->holder;
+ dm_block_t begin, end;
+ struct dm_thin_lookup_result lookup_result;
+
+ if (tc->requeue_mode) {
+ cell_requeue(pool, cell);
+ return;
+ }
+
+ get_bio_block_range(tc, bio, &begin, &end);
+
+ while (begin != end) {
+ r = ensure_next_mapping(pool);
+ if (r)
+ /* we did our best */
+ return;
+
+ r = dm_thin_find_block(tc->td, begin, 1, &lookup_result);
+ switch (r) {
+ case 0:
+ begin++;
+ break;
+ case -ENODATA:
+ provision_block(tc, bio, begin, cell);
+ begin++;
+ break;
+ default:
+ DMERR_LIMIT(
+ "%s: dm_thin_find_block() failed: error = %d",
+ __func__, r);
+ cell_defer_no_holder(tc, cell);
+ bio_io_error(bio);
+ begin++;
+ break;
+ }
+ }
+ bio_endio(bio);
+ cell_defer_no_holder(tc, cell);
+}
+
+static void process_provision_bio(struct thin_c *tc, struct bio *bio)
+{
+ dm_block_t begin, end;
+ struct dm_cell_key virt_key;
+ struct dm_bio_prison_cell *virt_cell;
+
+ get_bio_block_range(tc, bio, &begin, &end);
+ if (begin == end) {
+ bio_endio(bio);
+ return;
+ }
+
+ build_key(tc->td, VIRTUAL, begin, end, &virt_key);
+ if (bio_detain(tc->pool, &virt_key, bio, &virt_cell))
+ return;
+
+ process_provision_cell(tc, virt_cell);
+}
+
static void process_bio(struct thin_c *tc, struct bio *bio)
{
struct pool *pool = tc->pool;
@@ -2024,7 +2094,7 @@ static void __process_bio_read_only(struct thin_c *tc, struct bio *bio,
case -ENODATA:
if (cell)
cell_defer_no_holder(tc, cell);
- if (rw != READ) {
+ if (rw != READ || bio_op(bio) == REQ_OP_PROVISION) {
handle_unserviceable_bio(tc->pool, bio);
break;
}
@@ -2200,6 +2270,8 @@ static void process_thin_deferred_bios(struct thin_c *tc)

if (bio_op(bio) == REQ_OP_DISCARD)
pool->process_discard(tc, bio);
+ else if (bio_op(bio) == REQ_OP_PROVISION)
+ process_provision_bio(tc, bio);
else
pool->process_bio(tc, bio);

@@ -2716,7 +2788,8 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
return DM_MAPIO_SUBMITTED;
}

- if (op_is_flush(bio->bi_opf) || bio_op(bio) == REQ_OP_DISCARD) {
+ if (op_is_flush(bio->bi_opf) || bio_op(bio) == REQ_OP_DISCARD ||
+ bio_op(bio) == REQ_OP_PROVISION) {
thin_defer_bio_with_throttle(tc, bio);
return DM_MAPIO_SUBMITTED;
}
@@ -3353,6 +3426,7 @@ static int pool_ctr(struct dm_target *ti, unsigned argc, char **argv)
pt->low_water_blocks = low_water_blocks;
pt->adjusted_pf = pt->requested_pf = pf;
ti->num_flush_bios = 1;
+ ti->num_provision_bios = 1;

/*
* Only need to enable discards if the pool should pass
@@ -4043,6 +4117,7 @@ static void pool_io_hints(struct dm_target *ti, struct queue_limits *limits)
blk_limits_io_opt(limits, pool->sectors_per_block << SECTOR_SHIFT);
}

+
/*
* pt->adjusted_pf is a staging area for the actual features to use.
* They get transferred to the live pool in bind_control_target()
@@ -4233,6 +4308,8 @@ static int thin_ctr(struct dm_target *ti, unsigned argc, char **argv)
ti->num_discard_bios = 1;
}

+ ti->num_provision_bios = 1;
+
mutex_unlock(&dm_thin_pool_table.mutex);

spin_lock_irq(&tc->pool->lock);
@@ -4447,6 +4524,7 @@ static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits)

limits->discard_granularity = pool->sectors_per_block << SECTOR_SHIFT;
limits->max_discard_sectors = 2048 * 1024 * 16; /* 16G */
+ limits->max_provision_sectors = 2048 * 1024 * 16; /* 16G */
}

static struct target_type thin_target = {
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 60549b65c799..3fe524800f5a 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1600,6 +1600,7 @@ static bool is_abnormal_io(struct bio *bio)
case REQ_OP_DISCARD:
case REQ_OP_SECURE_ERASE:
case REQ_OP_WRITE_ZEROES:
+ case REQ_OP_PROVISION:
return true;
default:
break;
@@ -1624,6 +1625,9 @@ static blk_status_t __process_abnormal_io(struct clone_info *ci,
case REQ_OP_WRITE_ZEROES:
num_bios = ti->num_write_zeroes_bios;
break;
+ case REQ_OP_PROVISION:
+ num_bios = ti->num_provision_bios;
+ break;
default:
break;
}
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 04c6acf7faaa..edeb47195b6f 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -333,6 +333,12 @@ struct dm_target {
*/
unsigned num_write_zeroes_bios;

+ /*
+ * The number of PROVISION bios that will be submitted to the target.
+ * The bio number can be accessed with dm_bio_get_target_bio_nr.
+ */
+ unsigned num_provision_bios;
+
/*
* The minimum number of extra bytes allocated in each io for the
* target to use.
--
2.31.0
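
The walk that process_provision_cell() performs over a block range can be modeled in userspace as below. This is a deliberately simplified sketch under stated assumptions: struct toy_thin and toy_provision_range() are hypothetical names, the pool metadata is reduced to a boolean array, and bio/cell handling is left out entirely. It only shows the core idea of the patch: blocks that already have a mapping are skipped, unmapped blocks get one, and the walk stops early when the pool has no space left.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for thin-pool metadata (hypothetical, not a kernel type). */
struct toy_thin {
	bool *mapped;		/* mapped[i]: virtual block i has a data block */
	size_t nr_blocks;	/* size of the virtual device in blocks */
	size_t free_blocks;	/* data blocks still available in the pool */
};

/*
 * Provision every unmapped block in [begin, end).
 * Returns the number of newly provisioned blocks; stops early when the
 * pool runs out of space, similar to the "we did our best" path.
 */
static size_t toy_provision_range(struct toy_thin *t, size_t begin, size_t end)
{
	size_t provisioned = 0;

	if (end > t->nr_blocks)
		end = t->nr_blocks;

	for (; begin < end; begin++) {
		if (t->mapped[begin])
			continue;		/* lookup succeeded: already backed */
		if (t->free_blocks == 0)
			break;			/* pool exhausted */
		t->mapped[begin] = true;	/* -ENODATA case: allocate a block */
		t->free_blocks--;
		provisioned++;
	}

	return provisioned;
}
```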

2022-09-15 16:50:26

by Sarthak Kukreti

Subject: [PATCH RFC 4/8] fs: Introduce FALLOC_FL_PROVISION

From: Sarthak Kukreti <[email protected]>

FALLOC_FL_PROVISION is a new fallocate() allocation mode that
sends a hint to (supported) thinly provisioned block devices to
allocate space for the given range of sectors via REQ_OP_PROVISION.

Signed-off-by: Sarthak Kukreti <[email protected]>
---
block/fops.c | 7 ++++++-
include/linux/falloc.h | 3 ++-
include/uapi/linux/falloc.h | 8 ++++++++
3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index b90742595317..a436a7596508 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -605,7 +605,8 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)

#define BLKDEV_FALLOC_FL_SUPPORTED \
(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE | \
- FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE)
+ FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE | \
+ FALLOC_FL_PROVISION)

static long blkdev_fallocate(struct file *file, int mode, loff_t start,
loff_t len)
@@ -661,6 +662,10 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
error = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT,
len >> SECTOR_SHIFT, GFP_KERNEL);
break;
+ case FALLOC_FL_PROVISION:
+ error = blkdev_issue_provision(bdev, start >> SECTOR_SHIFT,
+ len >> SECTOR_SHIFT, GFP_KERNEL);
+ break;
default:
error = -EOPNOTSUPP;
}
diff --git a/include/linux/falloc.h b/include/linux/falloc.h
index f3f0b97b1675..a0e506255b20 100644
--- a/include/linux/falloc.h
+++ b/include/linux/falloc.h
@@ -30,7 +30,8 @@ struct space_resv {
FALLOC_FL_COLLAPSE_RANGE | \
FALLOC_FL_ZERO_RANGE | \
FALLOC_FL_INSERT_RANGE | \
- FALLOC_FL_UNSHARE_RANGE)
+ FALLOC_FL_UNSHARE_RANGE | \
+ FALLOC_FL_PROVISION)

/* on ia32 l_start is on a 32-bit boundary */
#if defined(CONFIG_X86_64)
diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
index 51398fa57f6c..2d323d113eed 100644
--- a/include/uapi/linux/falloc.h
+++ b/include/uapi/linux/falloc.h
@@ -77,4 +77,12 @@
*/
#define FALLOC_FL_UNSHARE_RANGE 0x40

+/*
+ * FALLOC_FL_PROVISION acts as a hint for thinly provisioned devices to allocate
+ * blocks for the range/EOF.
+ *
+ * FALLOC_FL_PROVISION can only be used with allocate-mode fallocate.
+ */
+#define FALLOC_FL_PROVISION 0x80
+
#endif /* _UAPI_FALLOC_H_ */
--
2.31.0

2022-09-15 16:50:26

by Sarthak Kukreti

Subject: [PATCH RFC 3/8] virtio_blk: Add support for provision requests

From: Sarthak Kukreti <[email protected]>

Adds support for provision requests. Provision requests act like
the inverse of discards.

Signed-off-by: Sarthak Kukreti <[email protected]>
---
drivers/block/virtio_blk.c | 48 +++++++++++++++++++++++++++++++++
include/uapi/linux/virtio_blk.h | 9 +++++++
2 files changed, 57 insertions(+)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 30255fcaf181..eacc2bffe1d1 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -178,6 +178,39 @@ static int virtblk_setup_discard_write_zeroes(struct request *req, bool unmap)
return 0;
}

+static int virtblk_setup_provision(struct request *req)
+{
+ unsigned short segments = blk_rq_nr_discard_segments(req);
+ unsigned short n = 0;
+
+ struct virtio_blk_discard_write_zeroes *range;
+ struct bio *bio;
+ u32 flags = 0;
+
+ range = kmalloc_array(segments, sizeof(*range), GFP_ATOMIC);
+ if (!range)
+ return -ENOMEM;
+
+ __rq_for_each_bio(bio, req) {
+ u64 sector = bio->bi_iter.bi_sector;
+ u32 num_sectors = bio->bi_iter.bi_size >> SECTOR_SHIFT;
+
+ range[n].flags = cpu_to_le32(flags);
+ range[n].num_sectors = cpu_to_le32(num_sectors);
+ range[n].sector = cpu_to_le64(sector);
+ n++;
+ }
+
+ WARN_ON_ONCE(n != segments);
+
+ req->special_vec.bv_page = virt_to_page(range);
+ req->special_vec.bv_offset = offset_in_page(range);
+ req->special_vec.bv_len = sizeof(*range) * segments;
+ req->rq_flags |= RQF_SPECIAL_PAYLOAD;
+
+ return 0;
+}
+
static void virtblk_unmap_data(struct request *req, struct virtblk_req *vbr)
{
if (blk_rq_nr_phys_segments(req))
@@ -243,6 +276,9 @@ static blk_status_t virtblk_setup_cmd(struct virtio_device *vdev,
case REQ_OP_DRV_IN:
type = VIRTIO_BLK_T_GET_ID;
break;
+ case REQ_OP_PROVISION:
+ type = VIRTIO_BLK_T_PROVISION;
+ break;
default:
WARN_ON_ONCE(1);
return BLK_STS_IOERR;
@@ -256,6 +292,11 @@ static blk_status_t virtblk_setup_cmd(struct virtio_device *vdev,
return BLK_STS_RESOURCE;
}

+ if (type == VIRTIO_BLK_T_PROVISION) {
+ if (virtblk_setup_provision(req))
+ return BLK_STS_RESOURCE;
+ }
+
return 0;
}

@@ -1075,6 +1116,12 @@ static int virtblk_probe(struct virtio_device *vdev)
blk_queue_max_write_zeroes_sectors(q, v ? v : UINT_MAX);
}

+ if (virtio_has_feature(vdev, VIRTIO_BLK_F_PROVISION)) {
+ virtio_cread(vdev, struct virtio_blk_config,
+ max_provision_sectors, &v);
+ q->limits.max_provision_sectors = v ? v : UINT_MAX;
+ }
+
virtblk_update_capacity(vblk, false);
virtio_device_ready(vdev);

@@ -1177,6 +1224,7 @@ static unsigned int features[] = {
VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE,
VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_CONFIG_WCE,
VIRTIO_BLK_F_MQ, VIRTIO_BLK_F_DISCARD, VIRTIO_BLK_F_WRITE_ZEROES,
+ VIRTIO_BLK_F_PROVISION,
};

static struct virtio_driver virtio_blk = {
diff --git a/include/uapi/linux/virtio_blk.h b/include/uapi/linux/virtio_blk.h
index d888f013d9ff..184f8cf6d185 100644
--- a/include/uapi/linux/virtio_blk.h
+++ b/include/uapi/linux/virtio_blk.h
@@ -40,6 +40,7 @@
#define VIRTIO_BLK_F_MQ 12 /* support more than one vq */
#define VIRTIO_BLK_F_DISCARD 13 /* DISCARD is supported */
#define VIRTIO_BLK_F_WRITE_ZEROES 14 /* WRITE ZEROES is supported */
+#define VIRTIO_BLK_F_PROVISION 15 /* provision is supported */

/* Legacy feature bits */
#ifndef VIRTIO_BLK_NO_LEGACY
@@ -120,6 +121,11 @@ struct virtio_blk_config {
*/
__u8 write_zeroes_may_unmap;

+ /*
+ * The maximum number of sectors in a provision request.
+ */
+ __virtio32 max_provision_sectors;
+
__u8 unused1[3];
} __attribute__((packed));

@@ -155,6 +161,9 @@ struct virtio_blk_config {
/* Write zeroes command */
#define VIRTIO_BLK_T_WRITE_ZEROES 13

+/* Provision command */
+#define VIRTIO_BLK_T_PROVISION 14
+
#ifndef VIRTIO_BLK_NO_LEGACY
/* Barrier before this op. */
#define VIRTIO_BLK_T_BARRIER 0x80000000
--
2.31.0

2022-09-15 16:51:37

by Sarthak Kukreti

Subject: [PATCH RFC 5/8] loop: Add support for provision requests

From: Sarthak Kukreti <[email protected]>

Add support for provision requests to loopback devices.
Loop devices will configure provision support based on
whether the underlying block device/file can support
the provision request and upon receiving a provision bio,
will map it to the backing device/storage.

Signed-off-by: Sarthak Kukreti <[email protected]>
---
drivers/block/loop.c | 42 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 42 insertions(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index ad92192c7d61..83f486b9bceb 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -327,6 +327,24 @@ static int lo_fallocate(struct loop_device *lo, struct request *rq, loff_t pos,
return ret;
}

+static int lo_req_provision(struct loop_device *lo, struct request *rq, loff_t pos)
+{
+ struct file *file = lo->lo_backing_file;
+ struct request_queue *q = lo->lo_queue;
+ int ret;
+
+ if (!q->limits.max_provision_sectors) {
+ ret = -EOPNOTSUPP;
+ goto out;
+ }
+
+ ret = file->f_op->fallocate(file, FALLOC_FL_PROVISION, pos, blk_rq_bytes(rq));
+ if (unlikely(ret && ret != -EINVAL && ret != -EOPNOTSUPP))
+ ret = -EIO;
+ out:
+ return ret;
+}
+
static int lo_req_flush(struct loop_device *lo, struct request *rq)
{
int ret = vfs_fsync(lo->lo_backing_file, 0);
@@ -488,6 +506,8 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
FALLOC_FL_PUNCH_HOLE);
case REQ_OP_DISCARD:
return lo_fallocate(lo, rq, pos, FALLOC_FL_PUNCH_HOLE);
+ case REQ_OP_PROVISION:
+ return lo_req_provision(lo, rq, pos);
case REQ_OP_WRITE:
if (cmd->use_aio)
return lo_rw_aio(lo, cmd, pos, WRITE);
@@ -754,6 +774,25 @@ static void loop_sysfs_exit(struct loop_device *lo)
&loop_attribute_group);
}

+static void loop_config_provision(struct loop_device *lo)
+{
+ struct file *file = lo->lo_backing_file;
+ struct inode *inode = file->f_mapping->host;
+
+ /*
+ * If the backing device is a block device, mirror its provisioning
+ * capability.
+ */
+ if (S_ISBLK(inode->i_mode)) {
+ blk_queue_max_provision_sectors(lo->lo_queue,
+ bdev_max_provision_sectors(I_BDEV(inode)));
+ } else if (file->f_op->fallocate) {
+ blk_queue_max_provision_sectors(lo->lo_queue, UINT_MAX >> 9);
+ } else {
+ blk_queue_max_provision_sectors(lo->lo_queue, 0);
+ }
+}
+
static void loop_config_discard(struct loop_device *lo)
{
struct file *file = lo->lo_backing_file;
@@ -1092,6 +1131,7 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
blk_queue_io_min(lo->lo_queue, bsize);

loop_config_discard(lo);
+ loop_config_provision(lo);
loop_update_rotational(lo);
loop_update_dio(lo);
loop_sysfs_init(lo);
@@ -1304,6 +1344,7 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
}

loop_config_discard(lo);
+ loop_config_provision(lo);

/* update dio if lo_offset or transfer is changed */
__loop_update_dio(lo, lo->use_dio);
@@ -1815,6 +1856,7 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
case REQ_OP_FLUSH:
case REQ_OP_DISCARD:
case REQ_OP_WRITE_ZEROES:
+ case REQ_OP_PROVISION:
cmd->use_aio = false;
break;
default:
--
2.31.0
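
The three-way decision made by loop_config_provision() can be restated as a small pure function. This is only a sketch of the policy in the patch; the enum and function names are hypothetical, and the real code reads these inputs from the backing file's inode and file operations.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical classification of the loop device's backing store. */
enum toy_backing { TOY_BACKING_BLKDEV, TOY_BACKING_FILE };

/*
 * Pick the provision limit (in sectors) for a loop device:
 *  - block device backing: mirror the backing device's limit,
 *  - regular file with an fallocate() handler: UINT_MAX >> 9,
 *  - anything else: 0, which disables provisioning.
 */
static unsigned int toy_loop_max_provision_sectors(enum toy_backing kind,
						   unsigned int bdev_limit,
						   bool has_fallocate)
{
	if (kind == TOY_BACKING_BLKDEV)
		return bdev_limit;
	if (has_fallocate)
		return UINT_MAX >> 9;
	return 0;
}
```

Note that a file-backed loop device advertises support whenever the backing filesystem has any fallocate() handler; whether that handler accepts FALLOC_FL_PROVISION is only discovered when a request is issued.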

2022-09-15 16:51:44

by Sarthak Kukreti

Subject: [PATCH RFC 6/8] ext4: Add support for FALLOC_FL_PROVISION

From: Sarthak Kukreti <[email protected]>

Once ext4 is done mapping blocks for an fallocate() request, send
out an FALLOC_FL_PROVISION request to the underlying layer to
ensure that the space is provisioned for the newly allocated extent
or indirect blocks.

Signed-off-by: Sarthak Kukreti <[email protected]>
---
fs/ext4/ext4.h | 2 ++
fs/ext4/extents.c | 15 ++++++++++++++-
fs/ext4/indirect.c | 9 +++++++++
include/linux/blkdev.h | 11 +++++++++++
4 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 9bca5565547b..ec0871e687c1 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -675,6 +675,8 @@ enum {
#define EXT4_GET_BLOCKS_IO_SUBMIT 0x0400
/* Caller is in the atomic contex, find extent if it has been cached */
#define EXT4_GET_BLOCKS_CACHED_NOWAIT 0x0800
+ /* Provision blocks on underlying storage */
+#define EXT4_GET_BLOCKS_PROVISION 0x1000

/*
* The bit position of these flags must not overlap with any of the
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index c148bb97b527..7a096144b7f8 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4356,6 +4356,13 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
}
}

+ /* Attempt to provision blocks on underlying storage */
+ if (flags & EXT4_GET_BLOCKS_PROVISION) {
+ err = sb_issue_provision(inode->i_sb, pblk, ar.len, GFP_NOFS);
+ if (err)
+ goto out;
+ }
+
/*
* Cache the extent and update transaction to commit on fdatasync only
* when it is _not_ an unwritten extent.
@@ -4690,7 +4697,7 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
/* Return error if mode is not supported */
if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE |
- FALLOC_FL_INSERT_RANGE))
+ FALLOC_FL_INSERT_RANGE | FALLOC_FL_PROVISION))
return -EOPNOTSUPP;

inode_lock(inode);
@@ -4750,6 +4757,12 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
if (ret)
goto out;

+ /* Ensure that preallocation provisions the blocks on the underlying
+ * storage device.
+ */
+ if (mode & FALLOC_FL_PROVISION)
+ flags |= EXT4_GET_BLOCKS_PROVISION;
+
ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size, flags);
if (ret)
goto out;
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 860fc5119009..860a2560872b 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -640,6 +640,15 @@ int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
if (err)
goto cleanup;

+ /* Attempt to provision blocks on underlying storage */
+ if (flags & EXT4_GET_BLOCKS_PROVISION) {
+ err = sb_issue_provision(inode->i_sb,
+ le32_to_cpu(chain[depth-1].key),
+ ar.len, GFP_NOFS);
+ if (err)
+ goto out;
+ }
+
map->m_flags |= EXT4_MAP_NEW;

ext4_update_inode_fsync_trans(handle, inode, 1);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index a58496d3f922..26b41a6c12f4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1107,6 +1107,17 @@ static inline int sb_issue_zeroout(struct super_block *sb, sector_t block,
gfp_mask, 0);
}

+static inline int sb_issue_provision(struct super_block *sb, sector_t block,
+ sector_t nr_blocks, gfp_t gfp_mask)
+{
+ return blkdev_issue_provision(sb->s_bdev,
+ block << (sb->s_blocksize_bits -
+ SECTOR_SHIFT),
+ nr_blocks << (sb->s_blocksize_bits -
+ SECTOR_SHIFT),
+ gfp_mask);
+}
+
static inline bool bdev_is_partition(struct block_device *bdev)
{
return bdev->bd_partno;
--
2.31.0
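
The unit conversion inside sb_issue_provision() is easy to get wrong, so here is a standalone sketch of it. The function name is hypothetical; the arithmetic is the same shift the helper applies to both the start block and the block count, converting filesystem blocks to 512-byte sectors.

```c
#include <stdint.h>

#define TOY_SECTOR_SHIFT 9	/* 512-byte sectors */

/*
 * Convert a count (or index) of filesystem blocks to 512-byte sectors.
 * blocksize_bits is log2 of the filesystem block size, e.g. 12 for 4 KiB.
 */
static uint64_t toy_fs_blocks_to_sectors(uint64_t blocks,
					 unsigned int blocksize_bits)
{
	return blocks << (blocksize_bits - TOY_SECTOR_SHIFT);
}
```

For a 4 KiB block size each filesystem block covers 8 sectors, so block 10 starts at sector 80.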

2022-09-15 16:51:49

by Sarthak Kukreti

Subject: [PATCH RFC 7/8] ext4: Add mount option for provisioning blocks during allocations

From: Sarthak Kukreti <[email protected]>

Add a mount option that sets the default provisioning mode for
all files within the filesystem.

Signed-off-by: Sarthak Kukreti <[email protected]>
---
fs/ext4/ext4.h | 1 +
fs/ext4/extents.c | 7 +++++++
fs/ext4/super.c | 7 +++++++
3 files changed, 15 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index ec0871e687c1..75f6e7f2f90b 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1271,6 +1271,7 @@ struct ext4_inode_info {
#define EXT4_MOUNT2_MB_OPTIMIZE_SCAN 0x00000080 /* Optimize group
* scanning in mballoc
*/
+#define EXT4_MOUNT2_PROVISION 0x00000100 /* Provision while allocating file blocks */

#define clear_opt(sb, opt) EXT4_SB(sb)->s_mount_opt &= \
~EXT4_MOUNT_##opt
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 7a096144b7f8..746213b5ec3d 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4437,6 +4437,13 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
unsigned int credits;
loff_t epos;

+ /*
+ * Attempt to provision file blocks if the mount is mounted with
+ * provision.
+ */
+ if (test_opt2(inode->i_sb, PROVISION))
+ flags |= EXT4_GET_BLOCKS_PROVISION;
+
BUG_ON(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS));
map.m_lblk = offset;
map.m_len = len;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 9a66abcca1a8..5ece1868f332 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1596,6 +1596,7 @@ enum {
Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
Opt_no_prefetch_block_bitmaps, Opt_mb_optimize_scan,
Opt_errors, Opt_data, Opt_data_err, Opt_jqfmt, Opt_dax_type,
+ Opt_provision, Opt_noprovision,
#ifdef CONFIG_EXT4_DEBUG
Opt_fc_debug_max_replay, Opt_fc_debug_force
#endif
@@ -1744,6 +1745,8 @@ static const struct fs_parameter_spec ext4_param_specs[] = {
fsparam_flag ("reservation", Opt_removed), /* mount option from ext2/3 */
fsparam_flag ("noreservation", Opt_removed), /* mount option from ext2/3 */
fsparam_u32 ("journal", Opt_removed), /* mount option from ext2/3 */
+ fsparam_flag ("provision", Opt_provision),
+ fsparam_flag ("noprovision", Opt_noprovision),
{}
};

@@ -1840,6 +1843,8 @@ static const struct mount_opts {
{Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
{Opt_no_prefetch_block_bitmaps, EXT4_MOUNT_NO_PREFETCH_BLOCK_BITMAPS,
MOPT_SET},
+ {Opt_provision, EXT4_MOUNT2_PROVISION, MOPT_SET | MOPT_2},
+ {Opt_noprovision, EXT4_MOUNT2_PROVISION, MOPT_CLEAR | MOPT_2},
#ifdef CONFIG_EXT4_DEBUG
{Opt_fc_debug_force, EXT4_MOUNT2_JOURNAL_FAST_COMMIT,
MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY},
@@ -3010,6 +3015,8 @@ static int _ext4_show_options(struct seq_file *seq, struct super_block *sb,
SEQ_OPTS_PUTS("dax=never");
} else if (test_opt2(sb, DAX_INODE)) {
SEQ_OPTS_PUTS("dax=inode");
+ } else if (test_opt2(sb, PROVISION)) {
+ SEQ_OPTS_PUTS("provision");
}

if (sbi->s_groups_count >= MB_DEFAULT_LINEAR_SCAN_THRESHOLD &&
--
2.31.0

2022-09-15 16:52:50

by Sarthak Kukreti

Subject: [PATCH RFC 8/8] ext4: Add a per-file provision override xattr

From: Sarthak Kukreti <[email protected]>

Adds a per-file provision override that allows select files to
override the per-mount setting for provisioning blocks on allocation.

This acts as a mechanism to allow mounts using provision to
replicate the current behavior for fallocate() and only reserve
space at the filesystem level.

Signed-off-by: Sarthak Kukreti <[email protected]>
---
fs/ext4/extents.c | 32 ++++++++++++++++++++++++++++++++
fs/ext4/xattr.h | 1 +
2 files changed, 33 insertions(+)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 746213b5ec3d..a9ed908b2ebe 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4424,6 +4424,26 @@ int ext4_ext_truncate(handle_t *handle, struct inode *inode)
return err;
}

+int ext4_provision_support(struct inode *inode)
+{
+ char provision;
+ int ret =
+ ext4_xattr_get(inode, EXT4_XATTR_INDEX_TRUSTED,
+ EXT4_XATTR_NAME_PROVISION_POLICY, &provision, 1);
+
+ if (ret < 0)
+ return ret;
+
+ switch (provision) {
+ case 'y':
+ return 1;
+ case 'n':
+ return 0;
+ default:
+ return -EINVAL;
+ }
+}
+
static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
ext4_lblk_t len, loff_t new_size,
int flags)
@@ -4436,12 +4456,24 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
struct ext4_map_blocks map;
unsigned int credits;
loff_t epos;
+ bool provision = false;
+ int file_provision_override = -1;

/*
* Attempt to provision file blocks if the mount is mounted with
* provision.
*/
if (test_opt2(inode->i_sb, PROVISION))
+ provision = true;
+
+ /*
+ * Use file-specific override, if available.
+ */
+ file_provision_override = ext4_provision_support(inode);
+ if (file_provision_override >= 0)
+ provision &= file_provision_override;
+
+ if (provision)
flags |= EXT4_GET_BLOCKS_PROVISION;

BUG_ON(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS));
diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index 824faf0b15a8..69e97f853b0c 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -140,6 +140,7 @@ extern const struct xattr_handler ext4_xattr_security_handler;
extern const struct xattr_handler ext4_xattr_hurd_handler;

#define EXT4_XATTR_NAME_ENCRYPTION_CONTEXT "c"
+#define EXT4_XATTR_NAME_PROVISION_POLICY "provision"

/*
* The EXT4_STATE_NO_EXPAND is overloaded and used for two purposes.
--
2.31.0
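
The interaction between the mount option and the per-file xattr in the hunk above can be modelled in userspace. The following is a minimal sketch, not the patch itself: `provision_policy()` and `should_provision()` are hypothetical names, the xattr value is passed in as a plain string, and returning -ENODATA for a missing xattr is an assumption about what the real lookup would report. As written in the patch, the override is combined with `&=`, so it appears able to switch provisioning off for a file on a 'provision' mount but not to switch it on otherwise.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Userspace model of the per-file policy lookup. 'xattr' stands in for the
 * value of the trusted "provision" xattr; NULL models a file without it.
 * Returning -ENODATA in that case is an assumption of this sketch.
 */
int provision_policy(const char *xattr)
{
	if (xattr == NULL)
		return -ENODATA;

	switch (xattr[0]) {
	case 'y':
		return 1;
	case 'n':
		return 0;
	default:
		return -EINVAL;
	}
}

/* Combine the mount-wide default with the optional per-file override. */
bool should_provision(bool mount_opt, const char *xattr)
{
	bool provision = mount_opt;
	int override = provision_policy(xattr);

	/* The policy is tri-state: negative errno, 0 or 1. */
	assert(override <= 1);

	/* Negative values mean "no usable override": keep the mount default. */
	if (override >= 0)
		provision = provision && override;

	return provision;
}
```

With this model, `should_provision(true, "n")` is false and `should_provision(false, "y")` is also false, which is worth keeping in mind when reading the "file level override" description in the cover letter.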

2022-09-16 06:06:20

by Stefan Hajnoczi

Subject: Re: [PATCH RFC 3/8] virtio_blk: Add support for provision requests

On Thu, Sep 15, 2022 at 09:48:21AM -0700, Sarthak Kukreti wrote:
> From: Sarthak Kukreti <[email protected]>
>
> Adds support for provision requests. Provision requests act like
> the inverse of discards.
>
> Signed-off-by: Sarthak Kukreti <[email protected]>
> ---
> drivers/block/virtio_blk.c | 48 +++++++++++++++++++++++++++++++++
> include/uapi/linux/virtio_blk.h | 9 +++++++
> 2 files changed, 57 insertions(+)

Please send a VIRTIO spec patch too:
https://github.com/oasis-tcs/virtio-spec#providing-feedback

Stefan

>
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 30255fcaf181..eacc2bffe1d1 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -178,6 +178,39 @@ static int virtblk_setup_discard_write_zeroes(struct request *req, bool unmap)
> return 0;
> }
>
> +static int virtblk_setup_provision(struct request *req)
> +{
> + unsigned short segments = blk_rq_nr_discard_segments(req);
> + unsigned short n = 0;
> +
> + struct virtio_blk_discard_write_zeroes *range;
> + struct bio *bio;
> + u32 flags = 0;
> +
> + range = kmalloc_array(segments, sizeof(*range), GFP_ATOMIC);
> + if (!range)
> + return -ENOMEM;
> +
> + __rq_for_each_bio(bio, req) {
> + u64 sector = bio->bi_iter.bi_sector;
> + u32 num_sectors = bio->bi_iter.bi_size >> SECTOR_SHIFT;
> +
> + range[n].flags = cpu_to_le32(flags);
> + range[n].num_sectors = cpu_to_le32(num_sectors);
> + range[n].sector = cpu_to_le64(sector);
> + n++;
> + }
> +
> + WARN_ON_ONCE(n != segments);
> +
> + req->special_vec.bv_page = virt_to_page(range);
> + req->special_vec.bv_offset = offset_in_page(range);
> + req->special_vec.bv_len = sizeof(*range) * segments;
> + req->rq_flags |= RQF_SPECIAL_PAYLOAD;
> +
> + return 0;
> +}
> +
> static void virtblk_unmap_data(struct request *req, struct virtblk_req *vbr)
> {
> if (blk_rq_nr_phys_segments(req))
> @@ -243,6 +276,9 @@ static blk_status_t virtblk_setup_cmd(struct virtio_device *vdev,
> case REQ_OP_DRV_IN:
> type = VIRTIO_BLK_T_GET_ID;
> break;
> + case REQ_OP_PROVISION:
> + type = VIRTIO_BLK_T_PROVISION;
> + break;
> default:
> WARN_ON_ONCE(1);
> return BLK_STS_IOERR;
> @@ -256,6 +292,11 @@ static blk_status_t virtblk_setup_cmd(struct virtio_device *vdev,
> return BLK_STS_RESOURCE;
> }
>
> + if (type == VIRTIO_BLK_T_PROVISION) {
> + if (virtblk_setup_provision(req))
> + return BLK_STS_RESOURCE;
> + }
> +
> return 0;
> }
>
> @@ -1075,6 +1116,12 @@ static int virtblk_probe(struct virtio_device *vdev)
> blk_queue_max_write_zeroes_sectors(q, v ? v : UINT_MAX);
> }
>
> + if (virtio_has_feature(vdev, VIRTIO_BLK_F_PROVISION)) {
> + virtio_cread(vdev, struct virtio_blk_config,
> + max_provision_sectors, &v);
> + q->limits.max_provision_sectors = v ? v : UINT_MAX;
> + }
> +
> virtblk_update_capacity(vblk, false);
> virtio_device_ready(vdev);
>
> @@ -1177,6 +1224,7 @@ static unsigned int features[] = {
> VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE,
> VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_CONFIG_WCE,
> VIRTIO_BLK_F_MQ, VIRTIO_BLK_F_DISCARD, VIRTIO_BLK_F_WRITE_ZEROES,
> + VIRTIO_BLK_F_PROVISION,
> };
>
> static struct virtio_driver virtio_blk = {
> diff --git a/include/uapi/linux/virtio_blk.h b/include/uapi/linux/virtio_blk.h
> index d888f013d9ff..184f8cf6d185 100644
> --- a/include/uapi/linux/virtio_blk.h
> +++ b/include/uapi/linux/virtio_blk.h
> @@ -40,6 +40,7 @@
> #define VIRTIO_BLK_F_MQ 12 /* support more than one vq */
> #define VIRTIO_BLK_F_DISCARD 13 /* DISCARD is supported */
> #define VIRTIO_BLK_F_WRITE_ZEROES 14 /* WRITE ZEROES is supported */
> +#define VIRTIO_BLK_F_PROVISION 15 /* provision is supported */
>
> /* Legacy feature bits */
> #ifndef VIRTIO_BLK_NO_LEGACY
> @@ -120,6 +121,11 @@ struct virtio_blk_config {
> */
> __u8 write_zeroes_may_unmap;
>
> + /*
> + * The maximum number of sectors in a provision request.
> + */
> + __virtio32 max_provision_sectors;
> +
> __u8 unused1[3];
> } __attribute__((packed));
>
> @@ -155,6 +161,9 @@ struct virtio_blk_config {
> /* Write zeroes command */
> #define VIRTIO_BLK_T_WRITE_ZEROES 13
>
> +/* Provision command */
> +#define VIRTIO_BLK_T_PROVISION 14
> +
> #ifndef VIRTIO_BLK_NO_LEGACY
> /* Barrier before this op. */
> #define VIRTIO_BLK_T_BARRIER 0x80000000
> --
> 2.31.0
>
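
The payload construction in virtblk_setup_provision() above boils down to turning each segment into a (sector, num_sectors, flags) triple. Below is a minimal userspace sketch of that step, assuming 512-byte sectors; the `sketch_*` names are hypothetical, and the little-endian conversion (cpu_to_le32/cpu_to_le64) and the kernel allocation are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SECTOR_SHIFT 9	/* 512-byte sectors, as in the kernel */

/* One input segment, standing in for a bio: start sector and byte length. */
struct sketch_segment {
	uint64_t sector;
	uint32_t bytes;
};

/* Host-endian stand-in for one range entry of the request payload. */
struct sketch_range {
	uint64_t sector;
	uint32_t num_sectors;
	uint32_t flags;
};

/*
 * Fill 'out' with one range per segment. Returns the number of ranges
 * written, or 0 if 'out' is too small to hold them all.
 */
size_t build_ranges(const struct sketch_segment *segs, size_t nsegs,
		    struct sketch_range *out, size_t out_cap)
{
	size_t n;

	assert(segs != NULL && out != NULL);
	if (nsegs > out_cap)
		return 0;

	for (n = 0; n < nsegs; n++) {
		out[n].sector = segs[n].sector;
		out[n].num_sectors = segs[n].bytes >> SKETCH_SECTOR_SHIFT;
		out[n].flags = 0;	/* the patch passes no flags */
	}
	return n;
}
```

The capacity check plays the role of the WARN_ON_ONCE(n != segments) in the patch: the number of ranges produced has to match the size of the buffer that was allocated for them.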



2022-09-16 06:12:01

by Stefan Hajnoczi

Subject: Re: [PATCH RFC 0/8] Introduce provisioning primitives for thinly provisioned storage

On Thu, Sep 15, 2022 at 09:48:18AM -0700, Sarthak Kukreti wrote:
> From: Sarthak Kukreti <[email protected]>
>
> Hi,
>
> This patch series is an RFC of a mechanism to pass through provision requests on stacked thinly provisioned storage devices/filesystems.
>
> The linux kernel provides several mechanisms to set up thinly provisioned block storage abstractions (eg. dm-thin, loop devices over sparse files), either directly as block devices or backing storage for filesystems. Currently, short of writing data to either the device or filesystem, there is no way for users to pre-allocate space for use in such storage setups. Consider the following use-cases:
>
> 1) Suspend-to-disk and resume from a dm-thin device: In order to ensure that the underlying thinpool metadata is not modified during the suspend mechanism, the dm-thin device needs to be fully provisioned.
> 2) If a filesystem uses a loop device over a sparse file, fallocate() on the filesystem will allocate blocks for files but the underlying sparse file will remain intact.
> 3) Another example is virtual machine using a sparse file/dm-thin as a storage device; by default, allocations within the VM boundaries will not affect the host.
> 4) Several storage standards support mechanisms for thin provisioning on real hardware devices. For example:
> a. The NVMe spec 1.0b section 2.1.1 loosely talks about thin provisioning: "When the THINP bit in the NSFEAT field of the Identify Namespace data structure is set to ‘1’, the controller ... shall track the number of allocated blocks in the Namespace Utilization field"
> b. The SCSi Block Commands reference - 4 section references "Thin provisioned logical units",
> c. UFS 3.0 spec section 13.3.3 references "Thin provisioning".

When REQ_OP_PROVISION is sent on an already-allocated range of blocks,
are those blocks zeroed? NVMe Write Zeroes with Deallocate=0 works this
way, for example. That behavior is counterintuitive since the operation
name suggests it just affects the logical block's provisioning state,
not the contents of the blocks.

> In all of the above situations, currently the only way for pre-allocating space is to issue writes (or use WRITE_ZEROES/WRITE_SAME). However, that does not scale well with larger pre-allocation sizes.

What exactly is the issue with WRITE_ZEROES scalability? Are you
referring to cases where the device doesn't support an efficient
WRITE_ZEROES command and actually writes blocks filled with zeroes
instead of updating internal allocation metadata cheaply?

Stefan



2022-09-16 12:00:39

by Brian Foster

Subject: Re: [PATCH RFC 4/8] fs: Introduce FALLOC_FL_PROVISION

On Thu, Sep 15, 2022 at 09:48:22AM -0700, Sarthak Kukreti wrote:
> From: Sarthak Kukreti <[email protected]>
>
> FALLOC_FL_PROVISION is a new fallocate() allocation mode that
> sends a hint to (supported) thinly provisioned block devices to
> allocate space for the given range of sectors via REQ_OP_PROVISION.
>
> Signed-off-by: Sarthak Kukreti <[email protected]>
> ---
> block/fops.c | 7 ++++++-
> include/linux/falloc.h | 3 ++-
> include/uapi/linux/falloc.h | 8 ++++++++
> 3 files changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/block/fops.c b/block/fops.c
> index b90742595317..a436a7596508 100644
> --- a/block/fops.c
> +++ b/block/fops.c
...
> @@ -661,6 +662,10 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> error = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT,
> len >> SECTOR_SHIFT, GFP_KERNEL);
> break;
> + case FALLOC_FL_PROVISION:
> + error = blkdev_issue_provision(bdev, start >> SECTOR_SHIFT,
> + len >> SECTOR_SHIFT, GFP_KERNEL);
> + break;
> default:
> error = -EOPNOTSUPP;
> }

Hi Sarthak,

Neat mechanism.. I played with something very similar in the past (that
was much more crudely hacked up to target dm-thin) to allow filesystems
to request a thinly provisioned device to allocate blocks and try to do
a better job of avoiding inactivation when overprovisioned.

One thing I'm a little curious about here.. what's the need for a new
fallocate mode? On a cursory glance, the provision mode looks fairly
analogous to normal (mode == 0) allocation mode with the exception of
sending the request down to the bdev. blkdev_fallocate() already maps
some of the logical falloc modes (i.e. punch hole, zero range) to
sending write sames or discards, etc., and it doesn't currently look
like it supports allocation mode, so could it not map such requests to
the underlying REQ_OP_PROVISION op?

I guess the difference would be at the filesystem level where we'd
probably need to rely on a mount option or some such to control whether
traditional fallocate issues provision ops (like you've implemented for
ext4) vs. the specific falloc command, but that seems fairly consistent
with historical punch hole/discard behavior too. Hm? You might want to
cc linux-fsdevel in future posts in any event to get some more feedback
on how other filesystems might want to interact with such a thing.
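
The two options discussed above (a dedicated flag versus mapping plain allocation mode to a provision op) can be put side by side in a small userspace sketch. Everything here is illustrative: the `SK_*` names are hypothetical stand-ins, the numeric values mirror the existing flags in <linux/falloc.h> plus the 0x80 proposed by this RFC, and the mapping is a simplification of what blkdev_fallocate() does rather than a copy of it.

```c
#include <assert.h>

/* Local stand-ins for the fallocate mode bits. */
#define SK_FL_KEEP_SIZE      0x01
#define SK_FL_PUNCH_HOLE     0x02
#define SK_FL_NO_HIDE_STALE  0x04
#define SK_FL_ZERO_RANGE     0x10
#define SK_FL_PROVISION      0x80	/* proposed by this RFC, not upstream */

enum sk_op {
	SK_OP_UNSUPPORTED,
	SK_OP_ZEROOUT,
	SK_OP_DISCARD,
	SK_OP_PROVISION,
};

/*
 * Map a fallocate mode on a block device to the operation sent down.
 * 'plain_alloc_provisions' selects the alternative raised in the reply:
 * treat mode == 0 as a provision request instead of rejecting it.
 */
enum sk_op map_mode(int mode, int plain_alloc_provisions)
{
	assert(mode >= 0);

	switch (mode) {
	case 0:
		return plain_alloc_provisions ? SK_OP_PROVISION
					      : SK_OP_UNSUPPORTED;
	case SK_FL_PROVISION:
		return SK_OP_PROVISION;
	case SK_FL_ZERO_RANGE:
	case SK_FL_ZERO_RANGE | SK_FL_KEEP_SIZE:
	case SK_FL_PUNCH_HOLE | SK_FL_KEEP_SIZE:
		return SK_OP_ZEROOUT;
	case SK_FL_PUNCH_HOLE | SK_FL_KEEP_SIZE | SK_FL_NO_HIDE_STALE:
		return SK_OP_DISCARD;
	default:
		return SK_OP_UNSUPPORTED;
	}
}
```

With `plain_alloc_provisions` set, no new userspace-visible flag is needed at the block device level; the remaining question is how a filesystem would decide whether to forward the request, which is the mount-option point made above.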

BTW another thing that might be useful wrt dm-thin is to support
FALLOC_FL_UNSHARE_RANGE. I.e., it looks like the previous dm-thin patch only
checks that blocks are allocated, but not whether those blocks are
shared (re: lookup_result.shared). It might be useful to do the COW in
such cases if the caller passes down a REQ_UNSHARE or some such flag.

Brian

> diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> index f3f0b97b1675..a0e506255b20 100644
> --- a/include/linux/falloc.h
> +++ b/include/linux/falloc.h
> @@ -30,7 +30,8 @@ struct space_resv {
> FALLOC_FL_COLLAPSE_RANGE | \
> FALLOC_FL_ZERO_RANGE | \
> FALLOC_FL_INSERT_RANGE | \
> - FALLOC_FL_UNSHARE_RANGE)
> + FALLOC_FL_UNSHARE_RANGE | \
> + FALLOC_FL_PROVISION)
>
> /* on ia32 l_start is on a 32-bit boundary */
> #if defined(CONFIG_X86_64)
> diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
> index 51398fa57f6c..2d323d113eed 100644
> --- a/include/uapi/linux/falloc.h
> +++ b/include/uapi/linux/falloc.h
> @@ -77,4 +77,12 @@
> */
> #define FALLOC_FL_UNSHARE_RANGE 0x40
>
> +/*
> + * FALLOC_FL_PROVISION acts as a hint for thinly provisioned devices to allocate
> + * blocks for the range/EOF.
> + *
> + * FALLOC_FL_PROVISION can only be used with allocate-mode fallocate.
> + */
> +#define FALLOC_FL_PROVISION 0x80
> +
> #endif /* _UAPI_FALLOC_H_ */
> --
> 2.31.0
>

2022-09-16 18:54:06

by Sarthak Kukreti

Subject: Re: [PATCH RFC 0/8] Introduce provisioning primitives for thinly provisioned storage

On Thu, Sep 15, 2022 at 11:10 PM Stefan Hajnoczi <[email protected]> wrote:
>
> On Thu, Sep 15, 2022 at 09:48:18AM -0700, Sarthak Kukreti wrote:
> > From: Sarthak Kukreti <[email protected]>
> >
> > Hi,
> >
> > This patch series is an RFC of a mechanism to pass through provision requests on stacked thinly provisioned storage devices/filesystems.
> >
> > The linux kernel provides several mechanisms to set up thinly provisioned block storage abstractions (eg. dm-thin, loop devices over sparse files), either directly as block devices or backing storage for filesystems. Currently, short of writing data to either the device or filesystem, there is no way for users to pre-allocate space for use in such storage setups. Consider the following use-cases:
> >
> > 1) Suspend-to-disk and resume from a dm-thin device: In order to ensure that the underlying thinpool metadata is not modified during the suspend mechanism, the dm-thin device needs to be fully provisioned.
> > 2) If a filesystem uses a loop device over a sparse file, fallocate() on the filesystem will allocate blocks for files but the underlying sparse file will remain intact.
> > 3) Another example is virtual machine using a sparse file/dm-thin as a storage device; by default, allocations within the VM boundaries will not affect the host.
> > 4) Several storage standards support mechanisms for thin provisioning on real hardware devices. For example:
> > a. The NVMe spec 1.0b section 2.1.1 loosely talks about thin provisioning: "When the THINP bit in the NSFEAT field of the Identify Namespace data structure is set to ‘1’, the controller ... shall track the number of allocated blocks in the Namespace Utilization field"
> > b. The SCSi Block Commands reference - 4 section references "Thin provisioned logical units",
> > c. UFS 3.0 spec section 13.3.3 references "Thin provisioning".
>
> When REQ_OP_PROVISION is sent on an already-allocated range of blocks,
> are those blocks zeroed? NVMe Write Zeroes with Deallocate=0 works this
> way, for example. That behavior is counterintuitive since the operation
> name suggests it just affects the logical block's provisioning state,
> not the contents of the blocks.
>
No, the blocks are not zeroed. The current implementation (in the dm
patch) is to indeed look at the provisioned state of the logical block
and provision if it is unmapped. If the block is already allocated,
REQ_OP_PROVISION should have no effect on the contents of the block.
Similarly, in the file semantics, sending a FALLOC_FL_PROVISION
request for extents already mapped should not affect the contents in
the extents.
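
The semantics described above (map what is unmapped, leave mapped blocks and their contents alone) can be illustrated with a toy thin device. This is a minimal sketch under stated assumptions: the `toy_*` names and sizes are hypothetical, there is no pool or metadata, and the sketch takes no position on what a read of a freshly provisioned block returns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_NBLOCKS 8
#define TOY_BLKSZ   16

/* Toy thin device: a block is either unmapped or backed by a data buffer. */
struct toy_thin {
	bool mapped[TOY_NBLOCKS];
	unsigned char data[TOY_NBLOCKS][TOY_BLKSZ];
};

/* A write maps the block (if needed) and stores the payload. */
void toy_write(struct toy_thin *t, size_t blk, unsigned char fill)
{
	assert(blk < TOY_NBLOCKS);
	t->mapped[blk] = true;
	memset(t->data[blk], fill, TOY_BLKSZ);
}

/*
 * Provision [start, start + len): map whatever is unmapped and leave
 * already-mapped blocks, and their contents, untouched.
 * Returns the number of newly mapped blocks.
 */
size_t toy_provision(struct toy_thin *t, size_t start, size_t len)
{
	size_t newly = 0;

	assert(start <= TOY_NBLOCKS && len <= TOY_NBLOCKS - start);
	for (size_t blk = start; blk < start + len; blk++) {
		if (t->mapped[blk])
			continue;	/* already allocated: no effect */
		t->mapped[blk] = true;
		newly++;
	}
	return newly;
}
```

Provisioning a range twice is a no-op the second time, and data written before the provision call survives it, which is the difference from a WRITE_ZEROES-based approach.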

> > In all of the above situations, currently the only way for pre-allocating space is to issue writes (or use WRITE_ZEROES/WRITE_SAME). However, that does not scale well with larger pre-allocation sizes.
>
> What exactly is the issue with WRITE_ZEROES scalability? Are you
> referring to cases where the device doesn't support an efficient
> WRITE_ZEROES command and actually writes blocks filled with zeroes
> instead of updating internal allocation metadata cheaply?
>
Yes. On ChromiumOS, we regularly deal with storage devices that don't
support WRITE_ZEROES or that need to have it disabled, via a quirk,
due to a bug in the vendor's implementation. Using WRITE_ZEROES for
allocation makes the allocation path quite slow for such devices (not
to mention the effect on storage lifetime), so having a separate
provisioning construct is very appealing. Even for devices that do
support an efficient WRITE_ZEROES implementation but don't support
logical provisioning per-se, I suppose that the allocation path might
be a bit faster (the device driver's request queue would report
'max_provision_sectors'=0 and the request would be short circuited
there) although I haven't benchmarked the difference.

Sarthak

> Stefan

2022-09-16 20:04:05

by Bart Van Assche

Subject: Re: [PATCH RFC 0/8] Introduce provisioning primitives for thinly provisioned storage

On 9/16/22 11:48, Sarthak Kukreti wrote:
> Yes. On ChromiumOS, we regularly deal with storage devices that don't
> support WRITE_ZEROES or that need to have it disabled, via a quirk,
> due to a bug in the vendor's implementation. Using WRITE_ZEROES for
> allocation makes the allocation path quite slow for such devices (not
> to mention the effect on storage lifetime), so having a separate
> provisioning construct is very appealing. Even for devices that do
> support an efficient WRITE_ZEROES implementation but don't support
> logical provisioning per-se, I suppose that the allocation path might
> be a bit faster (the device driver's request queue would report
> 'max_provision_sectors'=0 and the request would be short circuited
> there) although I haven't benchmarked the difference.

Some background information about why ChromiumOS uses thin provisioning
instead of a single filesystem across the entire storage device would be
welcome. Although UFS devices support thin provisioning I am not aware
of any use cases in Android that would benefit from UFS thin
provisioning support.

Thanks,

Bart.

2022-09-16 21:06:37

by Sarthak Kukreti

Subject: Re: [PATCH RFC 4/8] fs: Introduce FALLOC_FL_PROVISION

On Fri, Sep 16, 2022 at 4:56 AM Brian Foster <[email protected]> wrote:
>
> On Thu, Sep 15, 2022 at 09:48:22AM -0700, Sarthak Kukreti wrote:
> > From: Sarthak Kukreti <[email protected]>
> >
> > FALLOC_FL_PROVISION is a new fallocate() allocation mode that
> > sends a hint to (supported) thinly provisioned block devices to
> > allocate space for the given range of sectors via REQ_OP_PROVISION.
> >
> > Signed-off-by: Sarthak Kukreti <[email protected]>
> > ---
> > block/fops.c | 7 ++++++-
> > include/linux/falloc.h | 3 ++-
> > include/uapi/linux/falloc.h | 8 ++++++++
> > 3 files changed, 16 insertions(+), 2 deletions(-)
> >
> > diff --git a/block/fops.c b/block/fops.c
> > index b90742595317..a436a7596508 100644
> > --- a/block/fops.c
> > +++ b/block/fops.c
> ...
> > @@ -661,6 +662,10 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> > error = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT,
> > len >> SECTOR_SHIFT, GFP_KERNEL);
> > break;
> > + case FALLOC_FL_PROVISION:
> > + error = blkdev_issue_provision(bdev, start >> SECTOR_SHIFT,
> > + len >> SECTOR_SHIFT, GFP_KERNEL);
> > + break;
> > default:
> > error = -EOPNOTSUPP;
> > }
>
> Hi Sarthak,
>
> Neat mechanism.. I played with something very similar in the past (that
> was much more crudely hacked up to target dm-thin) to allow filesystems
> to request a thinly provisioned device to allocate blocks and try to do
> a better job of avoiding inactivation when overprovisioned.
>
> One thing I'm a little curious about here.. what's the need for a new
> fallocate mode? On a cursory glance, the provision mode looks fairly
> analogous to normal (mode == 0) allocation mode with the exception of
> sending the request down to the bdev. blkdev_fallocate() already maps
> some of the logical falloc modes (i.e. punch hole, zero range) to
> sending write sames or discards, etc., and it doesn't currently look
> like it supports allocation mode, so could it not map such requests to
> the underlying REQ_OP_PROVISION op?
>
> I guess the difference would be at the filesystem level where we'd
> probably need to rely on a mount option or some such to control whether
> traditional fallocate issues provision ops (like you've implemented for
> ext4) vs. the specific falloc command, but that seems fairly consistent
> with historical punch hole/discard behavior too. Hm? You might want to
> cc linux-fsdevel in future posts in any event to get some more feedback
> on how other filesystems might want to interact with such a thing.
>
Thanks for the feedback!
Argh, I completely forgot that I should add linux-fsdevel. Let me
re-send this with linux-fsdevel cc'd.

There's a slight distinction: the current filesystem-level
controls are usually for default handling, but userspace can still
call the relevant functions manually if they need to. For example, for
ext4, the 'discard' mount option dictates whether free blocks are
discarded, but it doesn't set the policy to allow/disallow userspace
from manually punching holes into files even if the mount opt is
'nodiscard'. FALLOC_FL_PROVISION is similar in that regard; it adds a
manual mechanism for users to provision the files' extents, that is
separate from the filesystems' default handling of provisioning files.

> BTW another thing that might be useful wrt dm-thin is to support
> FALLOC_FL_UNSHARE_RANGE. I.e., it looks like the previous dm-thin patch only
> checks that blocks are allocated, but not whether those blocks are
> shared (re: lookup_result.shared). It might be useful to do the COW in
> such cases if the caller passes down a REQ_UNSHARE or some such flag.
>
That's an interesting idea! There are a few more things on the TODO list
for this patch series but I think we can follow up with a patch to
handle that as well.

Sarthak

> Brian
>
> > diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> > index f3f0b97b1675..a0e506255b20 100644
> > --- a/include/linux/falloc.h
> > +++ b/include/linux/falloc.h
> > @@ -30,7 +30,8 @@ struct space_resv {
> > FALLOC_FL_COLLAPSE_RANGE | \
> > FALLOC_FL_ZERO_RANGE | \
> > FALLOC_FL_INSERT_RANGE | \
> > - FALLOC_FL_UNSHARE_RANGE)
> > + FALLOC_FL_UNSHARE_RANGE | \
> > + FALLOC_FL_PROVISION)
> >
> > /* on ia32 l_start is on a 32-bit boundary */
> > #if defined(CONFIG_X86_64)
> > diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
> > index 51398fa57f6c..2d323d113eed 100644
> > --- a/include/uapi/linux/falloc.h
> > +++ b/include/uapi/linux/falloc.h
> > @@ -77,4 +77,12 @@
> > */
> > #define FALLOC_FL_UNSHARE_RANGE 0x40
> >
> > +/*
> > + * FALLOC_FL_PROVISION acts as a hint for thinly provisioned devices to allocate
> > + * blocks for the range/EOF.
> > + *
> > + * FALLOC_FL_PROVISION can only be used with allocate-mode fallocate.
> > + */
> > +#define FALLOC_FL_PROVISION 0x80
> > +
> > #endif /* _UAPI_FALLOC_H_ */
> > --
> > 2.31.0
> >
>

2022-09-16 22:11:00

by Sarthak Kukreti

Subject: Re: [PATCH RFC 0/8] Introduce provisioning primitives for thinly provisioned storage

On Fri, Sep 16, 2022 at 1:01 PM Bart Van Assche <[email protected]> wrote:
>
> On 9/16/22 11:48, Sarthak Kukreti wrote:
> > Yes. On ChromiumOS, we regularly deal with storage devices that don't
> > support WRITE_ZEROES or that need to have it disabled, via a quirk,
> > due to a bug in the vendor's implementation. Using WRITE_ZEROES for
> > allocation makes the allocation path quite slow for such devices (not
> > to mention the effect on storage lifetime), so having a separate
> > provisioning construct is very appealing. Even for devices that do
> > support an efficient WRITE_ZEROES implementation but don't support
> > logical provisioning per-se, I suppose that the allocation path might
> > be a bit faster (the device driver's request queue would report
> > 'max_provision_sectors'=0 and the request would be short circuited
> > there) although I haven't benchmarked the difference.
>
> Some background information about why ChromiumOS uses thin provisioning
> instead of a single filesystem across the entire storage device would be
> welcome. Although UFS devices support thin provisioning I am not aware
> of any use cases in Android that would benefit from UFS thin
> provisioning support.
>
Sure (and I'd be happy to put this in the cover letter, if you prefer;
I didn't include it initially, since it seemed orthogonal to the
discussion of the patchset)!

On ChromiumOS, the primary driving force for using thin provisioning
is to have flexible, segmented block storage, both per-user and for
applications/virtual machines with several useful properties, for
example: block-level encrypted user storage, snapshot based A-B
updates for verified content, on-demand partitioning for short-lived
use cases. Several of the other planned use-cases (like verified
content retention over powerwash) require flexible on-demand block
storage that is decoupled from the primary filesystem(s) so that we
can have cryptographic erase for the user partitions and keep the
on-demand, dm-verity backed executables intact.

Best
Sarthak

> Thanks,
>
> Bart.

2022-09-17 03:11:01

by Darrick J. Wong

Subject: Re: [dm-devel] [PATCH RFC 0/8] Introduce provisioning primitives for thinly provisioned storage

On Thu, Sep 15, 2022 at 09:48:18AM -0700, Sarthak Kukreti wrote:
> From: Sarthak Kukreti <[email protected]>
>
> Hi,
>
> This patch series is an RFC of a mechanism to pass through provision
> requests on stacked thinly provisioned storage devices/filesystems.

[Reflowed text]

> The linux kernel provides several mechanisms to set up thinly
> provisioned block storage abstractions (eg. dm-thin, loop devices over
> sparse files), either directly as block devices or backing storage for
> filesystems. Currently, short of writing data to either the device or
> filesystem, there is no way for users to pre-allocate space for use in
> such storage setups. Consider the following use-cases:
>
> 1) Suspend-to-disk and resume from a dm-thin device: In order to
> ensure that the underlying thinpool metadata is not modified during
> the suspend mechanism, the dm-thin device needs to be fully
> provisioned.
> 2) If a filesystem uses a loop device over a sparse file, fallocate()
> on the filesystem will allocate blocks for files but the underlying
> sparse file will remain intact.
> 3) Another example is virtual machine using a sparse file/dm-thin as a
> storage device; by default, allocations within the VM boundaries will
> not affect the host.
> 4) Several storage standards support mechanisms for thin provisioning
> on real hardware devices. For example:
> a. The NVMe spec 1.0b section 2.1.1 loosely talks about thin
> provisioning: "When the THINP bit in the NSFEAT field of the
> Identify Namespace data structure is set to ‘1’, the controller ...
> shall track the number of allocated blocks in the Namespace
> Utilization field"
> b. The SCSi Block Commands reference - 4 section references "Thin
> provisioned logical units",
> c. UFS 3.0 spec section 13.3.3 references "Thin provisioning".
>
> In all of the above situations, currently the only way for
> pre-allocating space is to issue writes (or use
> WRITE_ZEROES/WRITE_SAME). However, that does not scale well with
> larger pre-allocation sizes.
>
> This patchset introduces primitives to support block-level
> provisioning (note: the term 'provisioning' is used to prevent
> overloading the term 'allocations/pre-allocations') requests across
> filesystems and block devices. This allows fallocate() and file
> creation requests to reserve space across stacked layers of block
> devices and filesystems. Currently, the patchset covers a prototype on
> the device-mapper targets, loop device and ext4, but the same
> mechanism can be extended to other filesystems/block devices as well
> as extended for use with devices in 4 a-c.

If you call REQ_OP_PROVISION on an unmapped LBA range of a block device
and then try to read the provisioned blocks, what do you get? Zeroes?
Random stale disk contents?

I think I saw elsewhere in the thread that any mapped LBAs within the
provisioning range are left alone (i.e. not zeroed) so I'll proceed on
that basis.

> Patch 1 introduces REQ_OP_PROVISION as a new request type. The
> provision request acts like the inverse of a discard request; instead
> of notifying lower layers that the block range will no longer be used,
> provision acts as a request to lower layers to provision disk space
> for the given block range. Real hardware storage devices will
> currently disable the provisioing capability but for the standards
> listed in 4a.-c., REQ_OP_PROVISION can be overloaded for use as the
> provisioing primitive for future devices.
>
> Patch 2 implements REQ_OP_PROVISION handling for some of the
> device-mapper targets. This additionally adds support for
> pre-allocating space for thinly provisioned logical volumes via
> fallocate()
>
> Patch 3 implements the handling for virtio-blk.
>
> Patch 4 introduces an fallocate() mode (FALLOC_FL_PROVISION) that
> sends a provision request to the underlying block device (and beyond).
> This acts as the primary mechanism for file-level provisioing.

Personally, I think it's well within the definition of fallocate mode==0
(aka preallocate) for XFS to call REQ_OP_PROVISION on the blocks that it
preallocates? XFS always sets the unwritten flag on the file mapping,
so it doesn't matter if the device provisions space without zeroing the
contents.

That said, if devices are really allowed to expose stale disk blocks
then for blkdev fallocate I think you could get away with reusing
FALLOC_FL_NO_HIDE_STALE instead of introducing a new fallocate flag.
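
The point about the unwritten flag can be illustrated with a toy extent: as long as the flag is set, reads are answered with zeroes without consulting the device, so stale device content is never exposed. This is a minimal sketch, not XFS code; the `sk_*` names are hypothetical and the model only supports whole-extent writes, whereas a real filesystem would split extents on partial writes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SK_EXT_SIZE 8

/* One file extent plus the device bytes backing it. */
struct sk_extent {
	bool unwritten;			/* set by preallocation */
	unsigned char disk[SK_EXT_SIZE];	/* whatever the device holds */
};

/* Preallocate: device content is left as-is, possibly stale. */
void sk_preallocate(struct sk_extent *e)
{
	assert(e != NULL);
	e->unwritten = true;
}

/* Read the whole extent into 'buf' (SK_EXT_SIZE bytes). */
void sk_read(const struct sk_extent *e, unsigned char *buf)
{
	assert(e != NULL && buf != NULL);
	if (e->unwritten)
		memset(buf, 0, SK_EXT_SIZE);	/* never consults the device */
	else
		memcpy(buf, e->disk, SK_EXT_SIZE);
}

/* Write the whole extent; this clears the unwritten flag. */
void sk_write(struct sk_extent *e, const unsigned char *buf)
{
	assert(e != NULL && buf != NULL);
	memcpy(e->disk, buf, SK_EXT_SIZE);
	e->unwritten = false;
}
```

A raw block device has no such flag in front of it, which is why the question of stale contents matters more for blkdev fallocate than for a filesystem that tracks unwritten extents.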

> Patch 5 wires up the loop device handling of REQ_OP_PROVISION.
>
> Patches 6-8 cover a prototype implementation for ext4, which includes
> wiring up the fallocate() implementation, introducing a filesystem
> level option (called 'provision') to control the default allocation
> behaviour and finally a file level override to retain current
> handling, even on filesystems mounted with 'provision'

Hmm, I'll have a look.

> Testing:
> --------
> - A backport of this patch series was tested on ChromiumOS using a
> 5.10 kernel.
> - File on ext4 on a thin logical volume:
> fallocate(FALLOC_FL_PROVISION) : 4.6s, dd if=/dev/zero of=...: 6 mins.
>
> TODOs:
> ------
> 1) The stacked block devices (dm-*, loop etc.) currently
> unconditionally pass through provision requests. Add support for
> provision, similar to how discard handling is set up (with options to
> disable, passdown or passthrough requests).
> 2) Blktests and Xfstests for validating provisioning.

Yes....

--D

> --
> dm-devel mailing list
> [email protected]
> https://listman.redhat.com/mailman/listinfo/dm-devel

2022-09-20 07:59:40

by Christoph Hellwig

Subject: Re: [PATCH RFC 0/8] Introduce provisioning primitives for thinly provisioned storage

On Fri, Sep 16, 2022 at 11:48:34AM -0700, Sarthak Kukreti wrote:
> Yes. On ChromiumOS, we regularly deal with storage devices that don't
> support WRITE_ZEROES or that need to have it disabled, via a quirk,
> due to a bug in the vendor's implementation.

So bloody punish the vendors for it. Unlike most of the Linux community
you actually have purchasing power and you'd help everyone by making
use of that instead of adding hacks to upstream.

2022-09-20 08:00:44

by Christoph Hellwig

Subject: Re: [PATCH RFC 4/8] fs: Introduce FALLOC_FL_PROVISION

On Thu, Sep 15, 2022 at 09:48:22AM -0700, Sarthak Kukreti wrote:
> From: Sarthak Kukreti <[email protected]>
>
> FALLOC_FL_PROVISION is a new fallocate() allocation mode that
> sends a hint to (supported) thinly provisioned block devices to
> allocate space for the given range of sectors via REQ_OP_PROVISION.

So, how does that "provisioning" actually work in today's world where
storage is usually doing out of place writes in one or more layers,
including the flash storage everyone is using. Does it give you one
write? An unlimited number? Some undecided number in between? How
is it affected by write zeroes to that range or a discard?

2022-09-20 11:34:37

by Christoph Hellwig

Subject: Re: [PATCH RFC 0/8] Introduce provisioning primitives for thinly provisioned storage

On Tue, Sep 20, 2022 at 08:17:10PM +1000, Daniil Lunev wrote:
> to WRITE ZERO command in NVMe, but to WRITE UNAVAILABLE in

There is no such thing as WRITE UNAVAILABLE in NVMe.

> NVME 2.0 spec, and to UNMAP ANCHORED in SCSI spec.

The SCSI anchored LBA state is quite complicated, and in addition
to UNMAP you can also create it using WRITE SAME, which is at least
partially useful, as it allows for sensible initialization pattern.
For the purpose of Linux that would be 0.

That being said you still haven't actually explained what problem
you're even trying to solve.

2022-09-21 06:04:50

by Sarthak Kukreti

Subject: Re: [PATCH RFC 4/8] fs: Introduce FALLOC_FL_PROVISION

On Tue, Sep 20, 2022 at 12:49 AM Christoph Hellwig <[email protected]> wrote:
>
> On Thu, Sep 15, 2022 at 09:48:22AM -0700, Sarthak Kukreti wrote:
> > From: Sarthak Kukreti <[email protected]>
> >
> > FALLOC_FL_PROVISION is a new fallocate() allocation mode that
> > sends a hint to (supported) thinly provisioned block devices to
> > allocate space for the given range of sectors via REQ_OP_PROVISION.
>
> So, how does that "provisioning" actually work in today's world where
> storage is usually doing out of place writes in one or more layers,
> including the flash storage everyone is using. Does it give you one
> write? An unlimited number? Some undecided number in between?

Apologies, the patchset was a bit short on describing the semantics so
I'll expand more in the next revision; I'd say that it's the minimum
of regular mode fallocate() guarantees at each allocation layer. For
example, the guarantees from a contrived storage stack like (left to
right is bottom to top):

[ mmc0blkp1 | ext4(1) | sparse file | loop | dm-thinp | dm-thin | ext4(2) ]

would be predicated on the guarantees of fallocate() per allocation
layer; if ext4(1) was replaced by a filesystem that did not support
fallocate(), then there would be no guarantee that a write to a file
on ext4(2) succeeds.

For dm-thinp, in the current implementation, the provision request
allocates blocks for the range specified and adds the mapping to the
thinpool metadata. All subsequent writes go to the same block, so
you'll be able to write to that block indefinitely. As Brian mentioned
above, one case this doesn't cover is a provision request on a
shared block; the natural extension would be to allocate and
assign a new block and copy the contents of the shared block (kind of
like copy-on-provision).
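
A toy model may make the dm-thinp semantics described above concrete. The
sketch below is not dm-thin code; the struct, the function names and the
pool sizes are all made up for illustration. It only shows the claimed
behaviour: once a block is provisioned, repeated writes to it reuse the
same mapping and cannot fail for lack of space, while a write to an
unprovisioned block can fail when the pool is exhausted.

```c
#include <assert.h>
#include <stdbool.h>

#define POOL_BLOCKS 4   /* physical blocks in the toy pool (arbitrary) */
#define THIN_BLOCKS 16  /* virtual blocks exposed by the toy thin device */
#define UNMAPPED    (-1)

/* Hypothetical stand-in for thin-pool metadata. */
struct toy_thin {
    int  map[THIN_BLOCKS];   /* virtual block -> physical block, or UNMAPPED */
    bool used[POOL_BLOCKS];  /* which physical blocks are allocated */
};

static void toy_init(struct toy_thin *t)
{
    for (int v = 0; v < THIN_BLOCKS; v++)
        t->map[v] = UNMAPPED;
    for (int p = 0; p < POOL_BLOCKS; p++)
        t->used[p] = false;
}

/* Map vblock to a physical block if it is not mapped yet.
 * Returns false when the pool has no free block left. */
static bool toy_provision(struct toy_thin *t, int vblock)
{
    assert(vblock >= 0 && vblock < THIN_BLOCKS);
    if (t->map[vblock] != UNMAPPED)
        return true;                 /* already provisioned */
    for (int p = 0; p < POOL_BLOCKS; p++) {
        if (!t->used[p]) {
            t->used[p] = true;
            t->map[vblock] = p;
            return true;
        }
    }
    return false;                    /* pool exhausted */
}

/* A write to an unmapped block allocates on demand, so it can fail;
 * a write to a mapped block reuses the mapping and always succeeds. */
static bool toy_write(struct toy_thin *t, int vblock)
{
    return toy_provision(t, vblock);
}

/* Discard unmaps the block and returns it to the pool. */
static void toy_discard(struct toy_thin *t, int vblock)
{
    assert(vblock >= 0 && vblock < THIN_BLOCKS);
    if (t->map[vblock] != UNMAPPED) {
        t->used[t->map[vblock]] = false;
        t->map[vblock] = UNMAPPED;
    }
}
```

The model deliberately ignores snapshots, so the shared-block
(copy-on-provision) case is not represented here.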

> How is it affected by write zeroes to that range or a discard?

The current semantics of discards for dm-thinp/ext4/sparse files will
apply as they do today; discards will unmap the dm-thin block/free the
file extent. Write zeroes is more interesting; dm-thinp will treat the
command as usual. ext4_zero_range will mark the extents as unwritten,
so if a user did provision + write to a block, write zeroes to the
block would leave it in the original provisioned state, but ext4
would now show the contents of the block as zero on the next read.
I think, similar to above, the semantics of a request will depend on
each layer that it passes through.

Best
Sarthak

2022-09-21 15:16:46

by Mike Snitzer

[permalink] [raw]
Subject: Re: [PATCH RFC 0/8] Introduce provisioning primitives for thinly provisioned storage

On Tue, Sep 20 2022 at 5:48P -0400,
Daniil Lunev <[email protected]> wrote:

> > There is no such thing as WRITE UNAVAILABLE in NVMe.
> Apologize, that is WRITE UNCORRECTABLE. Chapter 3.2.7 of
> NVM Express NVM Command Set Specification 1.0b
>
> > That being said, you still haven't actually explained what problem
> > you're even trying to solve.
>
> The specific problem is the following:
> * There is a thin pool over a physical device
> * There are multiple logical volumes over the thin pool
> * Each logical volume has an independent file system and an
> independent application running over it
> * Each application is potentially allowed to consume the entirety
> of the disk space - there is no strict size limit for application
> * Applications sometimes need to pre-allocate space, for which
> they use fallocate. Once the operation succeeds, the application
> assumes the space is guaranteed to be there for it.
> * Since filesystems on the volumes are independent, filesystem
> level enforcement of size constraints is impossible and the only
> common level is the thin pool, thus, each fallocate has to find its
> representation in thin pool one way or another - otherwise you
> may end up in the situation, where FS thinks it has allocated space
> but when it tries to actually write it, the thin pool is already
> exhausted.
> * Hole-Punching fallocate will not reach the thin pool, so the only
> solution presently is zero-writing pre-allocate.
> * Not all storage devices support zero-writing efficiently - apart
> from NVMe being or not being capable of doing efficient write
> zero - changing which is easier said than done, and would take
> years - there are also other types of storage devices that do not
> have WRITE ZERO capability in the first place or have it in a
> peculiar way. And adding custom WRITE ZERO to LVM would be
> arguably a much bigger hack.
> * Thus, a provisioning block operation allows an interface specific
> operation that guarantees the presence of the block in the
> mapped space. LVM Thin-pool itself is the primary target for our
> use case but the argument is that this operation maps well to
> other interfaces which allow thinly provisioned units.

Thanks for this overview. Should help level-set others.

Adding fallocate support has been a long-standing dm-thin TODO item
for me. I just never got around to it. So thanks to Sarthak, you and
anyone else who had a hand in developing this.

I had a look at the DM thin implementation and it looks pretty simple
(doesn't require a thin-metadata change, etc). I'll look closer at
the broader implementation (block, etc) but I'm encouraged by what I'm
seeing.

Mike

2022-09-21 15:27:17

by Mike Snitzer

[permalink] [raw]
Subject: Re: [PATCH RFC 4/8] fs: Introduce FALLOC_FL_PROVISION

On Wed, Sep 21 2022 at 1:54P -0400,
Sarthak Kukreti <[email protected]> wrote:

> On Tue, Sep 20, 2022 at 12:49 AM Christoph Hellwig <[email protected]> wrote:
> >
> > On Thu, Sep 15, 2022 at 09:48:22AM -0700, Sarthak Kukreti wrote:
> > > From: Sarthak Kukreti <[email protected]>
> > >
> > > FALLOC_FL_PROVISION is a new fallocate() allocation mode that
> > > sends a hint to (supported) thinly provisioned block devices to
> > > allocate space for the given range of sectors via REQ_OP_PROVISION.
> >
> > So, how does that "provisioning" actually work in today's world, where
> > storage is usually doing out-of-place writes in one or more layers,
> > including the flash storage everyone is using? Does it give you one
> > write? An unlimited number? Some undecided number in between?
>
> Apologies, the patchset was a bit short on describing the semantics so
> I'll expand more in the next revision; I'd say that it's the minimum
> of regular mode fallocate() guarantees at each allocation layer. For
> example, the guarantees from a contrived storage stack like (left to
> right is bottom to top):
>
> [ mmc0blkp1 | ext4(1) | sparse file | loop | dm-thinp | dm-thin | ext4(2) ]
>
> would be predicated on the guarantees of fallocate() per allocation
> layer; if ext4(1) was replaced by a filesystem that did not support
> fallocate(), then there would be no guarantee that a write to a file
> on ext4(2) succeeds.
>
> For dm-thinp, in the current implementation, the provision request
> allocates blocks for the range specified and adds the mapping to the
> thinpool metadata. All subsequent writes are to the same block, so
> you'll be able to write to the same block indefinitely. Brian mentioned
> this above, one case it doesn't cover is if provision is called on a
> shared block, but the natural extension would be to allocate and
> assign a new block and copy the contents of the shared block (kind of
> like copy-on-provision).

It follows that ChromiumOS isn't using dm-thinp's snapshot support?

But please do fold in incremental dm-thinp support to properly handle
shared blocks (dm-thinp already handles breaking sharing, etc.. so
I'll need to see where you're hooking into that you don't get this
"for free").

Mike

2022-09-21 15:50:09

by Brian Foster

[permalink] [raw]
Subject: Re: [PATCH RFC 4/8] fs: Introduce FALLOC_FL_PROVISION

On Fri, Sep 16, 2022 at 02:02:31PM -0700, Sarthak Kukreti wrote:
> On Fri, Sep 16, 2022 at 4:56 AM Brian Foster <[email protected]> wrote:
> >
> > On Thu, Sep 15, 2022 at 09:48:22AM -0700, Sarthak Kukreti wrote:
> > > From: Sarthak Kukreti <[email protected]>
> > >
> > > FALLOC_FL_PROVISION is a new fallocate() allocation mode that
> > > sends a hint to (supported) thinly provisioned block devices to
> > > allocate space for the given range of sectors via REQ_OP_PROVISION.
> > >
> > > Signed-off-by: Sarthak Kukreti <[email protected]>
> > > ---
> > > block/fops.c | 7 ++++++-
> > > include/linux/falloc.h | 3 ++-
> > > include/uapi/linux/falloc.h | 8 ++++++++
> > > 3 files changed, 16 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/block/fops.c b/block/fops.c
> > > index b90742595317..a436a7596508 100644
> > > --- a/block/fops.c
> > > +++ b/block/fops.c
> > ...
> > > @@ -661,6 +662,10 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> > > error = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT,
> > > len >> SECTOR_SHIFT, GFP_KERNEL);
> > > break;
> > > + case FALLOC_FL_PROVISION:
> > > + error = blkdev_issue_provision(bdev, start >> SECTOR_SHIFT,
> > > + len >> SECTOR_SHIFT, GFP_KERNEL);
> > > + break;
> > > default:
> > > error = -EOPNOTSUPP;
> > > }
> >
> > Hi Sarthak,
> >
> > Neat mechanism.. I played with something very similar in the past (that
> > was much more crudely hacked up to target dm-thin) to allow filesystems
> > to request a thinly provisioned device to allocate blocks and try to do
> > a better job of avoiding inactivation when overprovisioned.
> >
> > One thing I'm a little curious about here.. what's the need for a new
> > fallocate mode? On a cursory glance, the provision mode looks fairly
> > analogous to normal (mode == 0) allocation mode with the exception of
> > sending the request down to the bdev. blkdev_fallocate() already maps
> > some of the logical falloc modes (i.e. punch hole, zero range) to
> > sending write sames or discards, etc., and it doesn't currently look
> > like it supports allocation mode, so could it not map such requests to
> > the underlying REQ_OP_PROVISION op?
> >
> > I guess the difference would be at the filesystem level where we'd
> > probably need to rely on a mount option or some such to control whether
> > traditional fallocate issues provision ops (like you've implemented for
> > ext4) vs. the specific falloc command, but that seems fairly consistent
> > with historical punch hole/discard behavior too. Hm? You might want to
> > cc linux-fsdevel in future posts in any event to get some more feedback
> > on how other filesystems might want to interact with such a thing.
> >
> Thanks for the feedback!
> Argh, I completely forgot that I should add linux-fsdevel. Let me
> re-send this with linux-fsdevel cc'd
>
> There's a slight distinction: the current filesystem-level
> controls are usually for default handling, but userspace can still
> call the relevant functions manually if they need to. For example, for
> ext4, the 'discard' mount option dictates whether free blocks are
> discarded, but it doesn't set the policy to allow/disallow userspace
> from manually punching holes into files even if the mount opt is
> 'nodiscard'. FALLOC_FL_PROVISION is similar in that regard; it adds a
> manual mechanism for users to provision the files' extents, that is
> separate from the filesystems' default handling of provisioning files.
>

What I'm trying to understand is why not let blkdev_fallocate() issue a
provision based on the default mode (i.e. mode == 0) of fallocate(),
which is already defined to mean "perform allocation"? It currently
issues discards or write zeroes based on variants of
FALLOC_FL_PUNCH_HOLE without the need for a separate FALLOC_FL_DISCARD
mode, for example.
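
The two options being compared can be sketched as a userspace model of
the mode switch. This is not the kernel function: the X_ constants only
mirror the uapi flag values, X_FL_PROVISION takes its value from the RFC
diff, and the enum names and pick_op() are invented for the example. The
existing cases follow my recollection of the blkdev_fallocate() switch of
that era and should be checked against the tree.

```c
#include <assert.h>

/* Values mirror include/uapi/linux/falloc.h; the X_ prefix avoids
 * clashing with the real macros. */
#define X_FL_KEEP_SIZE      0x01
#define X_FL_PUNCH_HOLE     0x02
#define X_FL_NO_HIDE_STALE  0x04
#define X_FL_ZERO_RANGE     0x10
#define X_FL_PROVISION      0x80  /* from the RFC diff, not a mainline flag */

/* Hypothetical labels for the block-layer operation a mode maps to. */
enum blk_op {
    OP_UNSUPPORTED,    /* -EOPNOTSUPP */
    OP_ZEROOUT,        /* zeroout without unmapping */
    OP_ZEROOUT_UNMAP,  /* zeroout that may unmap */
    OP_DISCARD,
    OP_PROVISION
};

enum policy {
    POLICY_RFC_FLAG,   /* RFC as posted: dedicated FALLOC_FL_PROVISION mode */
    POLICY_MODE_ZERO   /* alternative raised here: mode == 0 provisions */
};

static enum blk_op pick_op(int mode, enum policy policy)
{
    switch (mode) {
    case X_FL_ZERO_RANGE:
    case X_FL_ZERO_RANGE | X_FL_KEEP_SIZE:
        return OP_ZEROOUT;
    case X_FL_PUNCH_HOLE | X_FL_KEEP_SIZE:
        return OP_ZEROOUT_UNMAP;
    case X_FL_PUNCH_HOLE | X_FL_KEEP_SIZE | X_FL_NO_HIDE_STALE:
        return OP_DISCARD;
    case X_FL_PROVISION:
        return policy == POLICY_RFC_FLAG ? OP_PROVISION : OP_UNSUPPORTED;
    case 0:
        return policy == POLICY_MODE_ZERO ? OP_PROVISION : OP_UNSUPPORTED;
    default:
        return OP_UNSUPPORTED;
    }
}
```

Under POLICY_MODE_ZERO no new uapi flag is needed for block devices,
which is the point of the question.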

Brian

> > BTW another thing that might be useful wrt to dm-thin is to support
> > FALLOC_FL_UNSHARE. I.e., it looks like the previous dm-thin patch only
> > checks that blocks are allocated, but not whether those blocks are
> > shared (re: lookup_result.shared). It might be useful to do the COW in
> > such cases if the caller passes down a REQ_UNSHARE or some such flag.
> >
> That's an interesting idea! There's a few more things on the TODO list
> for this patch series but I think we can follow up with a patch to
> handle that as well.
>
> Sarthak
>
> > Brian
> >
> > > diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> > > index f3f0b97b1675..a0e506255b20 100644
> > > --- a/include/linux/falloc.h
> > > +++ b/include/linux/falloc.h
> > > @@ -30,7 +30,8 @@ struct space_resv {
> > > FALLOC_FL_COLLAPSE_RANGE | \
> > > FALLOC_FL_ZERO_RANGE | \
> > > FALLOC_FL_INSERT_RANGE | \
> > > - FALLOC_FL_UNSHARE_RANGE)
> > > + FALLOC_FL_UNSHARE_RANGE | \
> > > + FALLOC_FL_PROVISION)
> > >
> > > /* on ia32 l_start is on a 32-bit boundary */
> > > #if defined(CONFIG_X86_64)
> > > diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
> > > index 51398fa57f6c..2d323d113eed 100644
> > > --- a/include/uapi/linux/falloc.h
> > > +++ b/include/uapi/linux/falloc.h
> > > @@ -77,4 +77,12 @@
> > > */
> > > #define FALLOC_FL_UNSHARE_RANGE 0x40
> > >
> > > +/*
> > > + * FALLOC_FL_PROVISION acts as a hint for thinly provisioned devices to allocate
> > > + * blocks for the range/EOF.
> > > + *
> > > + * FALLOC_FL_PROVISION can only be used with allocate-mode fallocate.
> > > + */
> > > +#define FALLOC_FL_PROVISION 0x80
> > > +
> > > #endif /* _UAPI_FALLOC_H_ */
> > > --
> > > 2.31.0
> > >
> >
>

2022-09-22 08:07:31

by Sarthak Kukreti

[permalink] [raw]
Subject: Re: [PATCH RFC 4/8] fs: Introduce FALLOC_FL_PROVISION

On Wed, Sep 21, 2022 at 8:39 AM Brian Foster <[email protected]> wrote:
>
> On Fri, Sep 16, 2022 at 02:02:31PM -0700, Sarthak Kukreti wrote:
> > On Fri, Sep 16, 2022 at 4:56 AM Brian Foster <[email protected]> wrote:
> > >
> > > On Thu, Sep 15, 2022 at 09:48:22AM -0700, Sarthak Kukreti wrote:
> > > > From: Sarthak Kukreti <[email protected]>
> > > >
> > > > FALLOC_FL_PROVISION is a new fallocate() allocation mode that
> > > > sends a hint to (supported) thinly provisioned block devices to
> > > > allocate space for the given range of sectors via REQ_OP_PROVISION.
> > > >
> > > > Signed-off-by: Sarthak Kukreti <[email protected]>
> > > > ---
> > > > block/fops.c | 7 ++++++-
> > > > include/linux/falloc.h | 3 ++-
> > > > include/uapi/linux/falloc.h | 8 ++++++++
> > > > 3 files changed, 16 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/block/fops.c b/block/fops.c
> > > > index b90742595317..a436a7596508 100644
> > > > --- a/block/fops.c
> > > > +++ b/block/fops.c
> > > ...
> > > > @@ -661,6 +662,10 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> > > > error = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT,
> > > > len >> SECTOR_SHIFT, GFP_KERNEL);
> > > > break;
> > > > + case FALLOC_FL_PROVISION:
> > > > + error = blkdev_issue_provision(bdev, start >> SECTOR_SHIFT,
> > > > + len >> SECTOR_SHIFT, GFP_KERNEL);
> > > > + break;
> > > > default:
> > > > error = -EOPNOTSUPP;
> > > > }
> > >
> > > Hi Sarthak,
> > >
> > > Neat mechanism.. I played with something very similar in the past (that
> > > was much more crudely hacked up to target dm-thin) to allow filesystems
> > > to request a thinly provisioned device to allocate blocks and try to do
> > > a better job of avoiding inactivation when overprovisioned.
> > >
> > > One thing I'm a little curious about here.. what's the need for a new
> > > fallocate mode? On a cursory glance, the provision mode looks fairly
> > > analogous to normal (mode == 0) allocation mode with the exception of
> > > sending the request down to the bdev. blkdev_fallocate() already maps
> > > some of the logical falloc modes (i.e. punch hole, zero range) to
> > > sending write sames or discards, etc., and it doesn't currently look
> > > like it supports allocation mode, so could it not map such requests to
> > > the underlying REQ_OP_PROVISION op?
> > >
> > > I guess the difference would be at the filesystem level where we'd
> > > probably need to rely on a mount option or some such to control whether
> > > traditional fallocate issues provision ops (like you've implemented for
> > > ext4) vs. the specific falloc command, but that seems fairly consistent
> > > with historical punch hole/discard behavior too. Hm? You might want to
> > > cc linux-fsdevel in future posts in any event to get some more feedback
> > > on how other filesystems might want to interact with such a thing.
> > >
> > Thanks for the feedback!
> > Argh, I completely forgot that I should add linux-fsdevel. Let me
> > re-send this with linux-fsdevel cc'd
> >
> > There's a slight distinction: the current filesystem-level
> > controls are usually for default handling, but userspace can still
> > call the relevant functions manually if they need to. For example, for
> > ext4, the 'discard' mount option dictates whether free blocks are
> > discarded, but it doesn't set the policy to allow/disallow userspace
> > from manually punching holes into files even if the mount opt is
> > 'nodiscard'. FALLOC_FL_PROVISION is similar in that regard; it adds a
> > manual mechanism for users to provision the files' extents, that is
> > separate from the filesystems' default handling of provisioning files.
> >
>
> What I'm trying to understand is why not let blkdev_fallocate() issue a
> provision based on the default mode (i.e. mode == 0) of fallocate(),
> which is already defined to mean "perform allocation?" It currently
> issues discards or write zeroes based on variants of
> FALLOC_FL_PUNCH_HOLE without the need for a separate FALLOC_FL_DISCARD
> mode, for example.
>
It's mostly to keep the block device fallocate() semantics in line and
consistent with the file-specific modes: I added the separate
filesystem fallocate() mode under the assumption that we'd want to
keep the traditional handling for filesystems intact with (mode == 0).
And for block devices, I didn't map the requests to mode == 0 because
the result would be more confusing to describe (mode == 0 on block
devices would issue provision; mode == 0 on files would not). It would
also complicate loopback devices, for instance; if the loop device is
backed by a file, it would need to use (mode == FALLOC_FL_PROVISION)
but if the loop device is backed by another block device, then the
fallocate() call would need to switch to (mode == 0).

With the separate mode, we can describe the semantics of fallocate()
modes a bit more cleanly, and they are common to both files and block
devices:

1. mode == 0: allocation at the same layer, will not provision on the
underlying device/filesystem (unsupported for block devices).
2. mode == FALLOC_FL_PROVISION: allocation at the same layer, will
provision on the underlying device/filesystem.

Block devices don't technically need to use a separate mode, but it
makes it much less confusing if filesystems are already using a
separate mode for provision.
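
From userspace, the two modes could be used roughly as follows. This is
a sketch under assumptions: FALLOC_FL_PROVISION and its value 0x80 come
only from the RFC diff and are not a mainline flag under that name, so
the numeric value may mean something else (or nothing) on a given kernel.
The helper name, the result enum and the two fake backends are invented
so that the fallback logic can be exercised without a patched kernel.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>

#ifndef FALLOC_FL_PROVISION
#define FALLOC_FL_PROVISION 0x80  /* value from the RFC diff; an assumption */
#endif

/* Same shape as fallocate(2), so the real call can be passed in. */
typedef int (*falloc_fn)(int fd, int mode, off_t offset, off_t len);

enum prov_result {
    PROV_FAILED      = -1,  /* neither mode worked */
    PROV_PROVISIONED =  0,  /* provision request accepted */
    PROV_LOCAL_ONLY  =  1   /* only same-layer allocation (mode == 0) */
};

/* Try the provision mode first; fall back to plain allocation when the
 * kernel or filesystem rejects the flag. EINVAL is handled defensively. */
static enum prov_result provision_range(falloc_fn do_fallocate, int fd,
                                        off_t offset, off_t len)
{
    if (do_fallocate(fd, FALLOC_FL_PROVISION, offset, len) == 0)
        return PROV_PROVISIONED;
    if (errno != EOPNOTSUPP && errno != EINVAL)
        return PROV_FAILED;
    if (do_fallocate(fd, 0, offset, len) == 0)
        return PROV_LOCAL_ONLY;
    return PROV_FAILED;
}

/* Stand-in for a kernel without the RFC: only mode == 0 is accepted. */
static int fake_old_kernel(int fd, int mode, off_t offset, off_t len)
{
    (void)fd; (void)offset; (void)len;
    if (mode == 0)
        return 0;
    errno = EOPNOTSUPP;
    return -1;
}

/* Stand-in for a kernel with the RFC applied. */
static int fake_rfc_kernel(int fd, int mode, off_t offset, off_t len)
{
    (void)fd; (void)offset; (void)len;
    if (mode == 0 || mode == FALLOC_FL_PROVISION)
        return 0;
    errno = EOPNOTSUPP;
    return -1;
}
```

On Linux the real fallocate() from <fcntl.h> (with _GNU_SOURCE defined)
has the same signature and could be passed as do_fallocate, but only
against a kernel known to carry the RFC.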

Best
Sarthak

> Brian
>
> > > BTW another thing that might be useful wrt to dm-thin is to support
> > > FALLOC_FL_UNSHARE. I.e., it looks like the previous dm-thin patch only
> > > checks that blocks are allocated, but not whether those blocks are
> > > shared (re: lookup_result.shared). It might be useful to do the COW in
> > > such cases if the caller passes down a REQ_UNSHARE or some such flag.
> > >
> > That's an interesting idea! There's a few more things on the TODO list
> > for this patch series but I think we can follow up with a patch to
> > handle that as well.
> >
> > Sarthak
> >
> > > Brian
> > >
> > > > diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> > > > index f3f0b97b1675..a0e506255b20 100644
> > > > --- a/include/linux/falloc.h
> > > > +++ b/include/linux/falloc.h
> > > > @@ -30,7 +30,8 @@ struct space_resv {
> > > > FALLOC_FL_COLLAPSE_RANGE | \
> > > > FALLOC_FL_ZERO_RANGE | \
> > > > FALLOC_FL_INSERT_RANGE | \
> > > > - FALLOC_FL_UNSHARE_RANGE)
> > > > + FALLOC_FL_UNSHARE_RANGE | \
> > > > + FALLOC_FL_PROVISION)
> > > >
> > > > /* on ia32 l_start is on a 32-bit boundary */
> > > > #if defined(CONFIG_X86_64)
> > > > diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
> > > > index 51398fa57f6c..2d323d113eed 100644
> > > > --- a/include/uapi/linux/falloc.h
> > > > +++ b/include/uapi/linux/falloc.h
> > > > @@ -77,4 +77,12 @@
> > > > */
> > > > #define FALLOC_FL_UNSHARE_RANGE 0x40
> > > >
> > > > +/*
> > > > + * FALLOC_FL_PROVISION acts as a hint for thinly provisioned devices to allocate
> > > > + * blocks for the range/EOF.
> > > > + *
> > > > + * FALLOC_FL_PROVISION can only be used with allocate-mode fallocate.
> > > > + */
> > > > +#define FALLOC_FL_PROVISION 0x80
> > > > +
> > > > #endif /* _UAPI_FALLOC_H_ */
> > > > --
> > > > 2.31.0
> > > >
> > >
> >
>

2022-09-22 08:09:57

by Sarthak Kukreti

[permalink] [raw]
Subject: Re: [PATCH RFC 4/8] fs: Introduce FALLOC_FL_PROVISION

On Wed, Sep 21, 2022 at 8:21 AM Mike Snitzer <[email protected]> wrote:
>
> On Wed, Sep 21 2022 at 1:54P -0400,
> Sarthak Kukreti <[email protected]> wrote:
>
> > On Tue, Sep 20, 2022 at 12:49 AM Christoph Hellwig <[email protected]> wrote:
> > >
> > > On Thu, Sep 15, 2022 at 09:48:22AM -0700, Sarthak Kukreti wrote:
> > > > From: Sarthak Kukreti <[email protected]>
> > > >
> > > > FALLOC_FL_PROVISION is a new fallocate() allocation mode that
> > > > sends a hint to (supported) thinly provisioned block devices to
> > > > allocate space for the given range of sectors via REQ_OP_PROVISION.
> > >
> > > So, how does that "provisioning" actually work in today's world, where
> > > storage is usually doing out-of-place writes in one or more layers,
> > > including the flash storage everyone is using? Does it give you one
> > > write? An unlimited number? Some undecided number in between?
> >
> > Apologies, the patchset was a bit short on describing the semantics so
> > I'll expand more in the next revision; I'd say that it's the minimum
> > of regular mode fallocate() guarantees at each allocation layer. For
> > example, the guarantees from a contrived storage stack like (left to
> > right is bottom to top):
> >
> > [ mmc0blkp1 | ext4(1) | sparse file | loop | dm-thinp | dm-thin | ext4(2) ]
> >
> > would be predicated on the guarantees of fallocate() per allocation
> > layer; if ext4(1) was replaced by a filesystem that did not support
> > fallocate(), then there would be no guarantee that a write to a file
> > on ext4(2) succeeds.
> >
> > For dm-thinp, in the current implementation, the provision request
> > allocates blocks for the range specified and adds the mapping to the
> > thinpool metadata. All subsequent writes are to the same block, so
> > you'll be able to write to the same block indefinitely. Brian mentioned
> > this above, one case it doesn't cover is if provision is called on a
> > shared block, but the natural extension would be to allocate and
> > assign a new block and copy the contents of the shared block (kind of
> > like copy-on-provision).
>
> It follows that ChromiumOS isn't using dm-thinp's snapshot support?
>
Not at the moment, but we definitely have ideas to explore re:snapshot
and dm-thinp (like A-B updates with thin volume snapshots), where this
would definitely be useful!

> But please do fold in incremental dm-thinp support to properly handle
> shared blocks (dm-thinp already handles breaking sharing, etc.. so
> I'll need to see where you're hooking into that you don't get this
> "for free").
>
Will do in v2. Thanks for the feedback.

Best
Sarthak

> Mike
>

2022-09-22 18:37:26

by Brian Foster

[permalink] [raw]
Subject: Re: [PATCH RFC 4/8] fs: Introduce FALLOC_FL_PROVISION

On Thu, Sep 22, 2022 at 01:04:33AM -0700, Sarthak Kukreti wrote:
> On Wed, Sep 21, 2022 at 8:39 AM Brian Foster <[email protected]> wrote:
> >
> > On Fri, Sep 16, 2022 at 02:02:31PM -0700, Sarthak Kukreti wrote:
> > > On Fri, Sep 16, 2022 at 4:56 AM Brian Foster <[email protected]> wrote:
> > > >
> > > > On Thu, Sep 15, 2022 at 09:48:22AM -0700, Sarthak Kukreti wrote:
> > > > > From: Sarthak Kukreti <[email protected]>
> > > > >
> > > > > FALLOC_FL_PROVISION is a new fallocate() allocation mode that
> > > > > sends a hint to (supported) thinly provisioned block devices to
> > > > > allocate space for the given range of sectors via REQ_OP_PROVISION.
> > > > >
> > > > > Signed-off-by: Sarthak Kukreti <[email protected]>
> > > > > ---
> > > > > block/fops.c | 7 ++++++-
> > > > > include/linux/falloc.h | 3 ++-
> > > > > include/uapi/linux/falloc.h | 8 ++++++++
> > > > > 3 files changed, 16 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/block/fops.c b/block/fops.c
> > > > > index b90742595317..a436a7596508 100644
> > > > > --- a/block/fops.c
> > > > > +++ b/block/fops.c
> > > > ...
> > > > > @@ -661,6 +662,10 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> > > > > error = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT,
> > > > > len >> SECTOR_SHIFT, GFP_KERNEL);
> > > > > break;
> > > > > + case FALLOC_FL_PROVISION:
> > > > > + error = blkdev_issue_provision(bdev, start >> SECTOR_SHIFT,
> > > > > + len >> SECTOR_SHIFT, GFP_KERNEL);
> > > > > + break;
> > > > > default:
> > > > > error = -EOPNOTSUPP;
> > > > > }
> > > >
> > > > Hi Sarthak,
> > > >
> > > > Neat mechanism.. I played with something very similar in the past (that
> > > > was much more crudely hacked up to target dm-thin) to allow filesystems
> > > > to request a thinly provisioned device to allocate blocks and try to do
> > > > a better job of avoiding inactivation when overprovisioned.
> > > >
> > > > One thing I'm a little curious about here.. what's the need for a new
> > > > fallocate mode? On a cursory glance, the provision mode looks fairly
> > > > analogous to normal (mode == 0) allocation mode with the exception of
> > > > sending the request down to the bdev. blkdev_fallocate() already maps
> > > > some of the logical falloc modes (i.e. punch hole, zero range) to
> > > > sending write sames or discards, etc., and it doesn't currently look
> > > > like it supports allocation mode, so could it not map such requests to
> > > > the underlying REQ_OP_PROVISION op?
> > > >
> > > > I guess the difference would be at the filesystem level where we'd
> > > > probably need to rely on a mount option or some such to control whether
> > > > traditional fallocate issues provision ops (like you've implemented for
> > > > ext4) vs. the specific falloc command, but that seems fairly consistent
> > > > with historical punch hole/discard behavior too. Hm? You might want to
> > > > cc linux-fsdevel in future posts in any event to get some more feedback
> > > > on how other filesystems might want to interact with such a thing.
> > > >
> > > Thanks for the feedback!
> > > Argh, I completely forgot that I should add linux-fsdevel. Let me
> > > re-send this with linux-fsdevel cc'd
> > >
> > > There's a slight distinction: the current filesystem-level
> > > controls are usually for default handling, but userspace can still
> > > call the relevant functions manually if they need to. For example, for
> > > ext4, the 'discard' mount option dictates whether free blocks are
> > > discarded, but it doesn't set the policy to allow/disallow userspace
> > > from manually punching holes into files even if the mount opt is
> > > 'nodiscard'. FALLOC_FL_PROVISION is similar in that regard; it adds a
> > > manual mechanism for users to provision the files' extents, that is
> > > separate from the filesystems' default handling of provisioning files.
> > >
> >
> > What I'm trying to understand is why not let blkdev_fallocate() issue a
> > provision based on the default mode (i.e. mode == 0) of fallocate(),
> > which is already defined to mean "perform allocation?" It currently
> > issues discards or write zeroes based on variants of
> > FALLOC_FL_PUNCH_HOLE without the need for a separate FALLOC_FL_DISCARD
> > mode, for example.
> >
> It's mostly to keep the block device fallocate() semantics in-line and
> consistent with the file-specific modes: I added the separate
> filesystem fallocate() mode under the assumption that we'd want to
> keep the traditional handling for filesystems intact with (mode == 0).
> And for block devices, I didn't map the requests to mode == 0 so that
> it's less confusing to describe (eg. mode == 0 on block devices will
> issue provision; mode == 0 on files will not). It would complicate
> loopback devices, for instance; if the loop device is backed by a
> file, it would need to use (mode == FALLOC_FL_PROVISION) but if the
> loop device is backed by another block device, then the fallocate()
> call would need to switch to (mode == 0).
>

I would expect the loopback scenario for provision to behave similarly to
how discards are handled. I.e., loopback receives a provision request
and translates that to fallocate(mode = 0). If the backing device is
block, blkdev_fallocate(mode = 0) translates that to another provision
request. If the backing device is a file, the associated fallocate
handler allocs/maps, if necessary, and then issues a provision on
allocation, if enabled by the fs.
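
The propagation described above can be modelled with a small recursive
sketch. Everything in it is hypothetical: the layer kinds, the struct
and the provision_on_alloc switch (standing in for whatever mount option
or default a filesystem would use) exist only to show how one provision
request travels down a stack without a dedicated fallocate mode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum layer_kind { LAYER_DISK, LAYER_LOOP, LAYER_FILE };

struct layer {
    enum layer_kind kind;
    struct layer *below;      /* backing layer, NULL at the bottom */
    bool provision_on_alloc;  /* LAYER_FILE only: hypothetical fs option */
    int provisions;           /* provision requests seen by this layer */
    int allocations;          /* mode == 0 allocations done by the fs */
};

static void layer_provision(struct layer *l);

/* fallocate(mode = 0) arriving at a layer. */
static void layer_fallocate_mode0(struct layer *l)
{
    if (l == NULL)
        return;
    if (l->kind == LAYER_FILE) {
        /* The filesystem allocates/maps, then optionally provisions
         * the device it lives on. */
        l->allocations++;
        if (l->provision_on_alloc)
            layer_provision(l->below);
    } else {
        /* Block device: mode 0 is translated to a provision request. */
        layer_provision(l);
    }
}

/* A provision request arriving at a block device. */
static void layer_provision(struct layer *l)
{
    if (l == NULL)
        return;
    assert(l->kind != LAYER_FILE);  /* files receive fallocate, not bios */
    l->provisions++;
    if (l->kind == LAYER_LOOP)
        layer_fallocate_mode0(l->below);  /* loop forwards as mode 0 */
}
```

With a stack of disk, file and loop (bottom to top), a provision sent to
the loop device reaches the disk only when the filesystem in the middle
has provision_on_alloc set.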

AFAICT there's no need for FL_PROVISION at all in that scenario. Is
there a functional purpose to FL_PROVISION? Is the intent to try and
guarantee that a provision request propagates down the I/O stack? If so,
what happens if blocks were already preallocated in the backing file (in
the loopback file example)?

BTW, an unrelated thing I noticed is that blkdev_fallocate()
unconditionally calls truncate_bdev_range(), which probably doesn't make
sense for any sort of alloc mode.

> With the separate mode, we can describe the semantics of fallocate()
> modes a bit more cleanly, and it is common for both files and block
> devices:
>
> 1. mode == 0: allocation at the same layer, will not provision on the
> underlying device/filesystem (unsupported for block devices).
> 2. mode == FALLOC_FL_PROVISION, allocation at the layer, will
> provision on the underlying device/filesystem.
>

I think I see why you make the distinction, since the block layer
doesn't have a "this layer only" mode, but IMO it's also quite confusing
to say that mode == FL_PROVISION can allocate at the current and
underlying layer(s) but mode == 0 to that underlying layer cannot.

Either way, if you want to propose a new falloc mode/modifier, it
probably warrants a more detailed commit log with more explanation of
the purpose, examples of behavior, perhaps some details on how the mode
might be documented in man pages, etc.

Brian

> Block devices don't technically need to use a separate mode, but it
> makes it much less confusing if filesystems are already using a
> separate mode for provision.
>
> Best
> Sarthak
>
> > Brian
> >
> > > > BTW another thing that might be useful wrt to dm-thin is to support
> > > > FALLOC_FL_UNSHARE. I.e., it looks like the previous dm-thin patch only
> > > > checks that blocks are allocated, but not whether those blocks are
> > > > shared (re: lookup_result.shared). It might be useful to do the COW in
> > > > such cases if the caller passes down a REQ_UNSHARE or some such flag.
> > > >
> > > That's an interesting idea! There's a few more things on the TODO list
> > > for this patch series but I think we can follow up with a patch to
> > > handle that as well.
> > >
> > > Sarthak
> > >
> > > > Brian
> > > >
> > > > > diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> > > > > index f3f0b97b1675..a0e506255b20 100644
> > > > > --- a/include/linux/falloc.h
> > > > > +++ b/include/linux/falloc.h
> > > > > @@ -30,7 +30,8 @@ struct space_resv {
> > > > > FALLOC_FL_COLLAPSE_RANGE | \
> > > > > FALLOC_FL_ZERO_RANGE | \
> > > > > FALLOC_FL_INSERT_RANGE | \
> > > > > - FALLOC_FL_UNSHARE_RANGE)
> > > > > + FALLOC_FL_UNSHARE_RANGE | \
> > > > > + FALLOC_FL_PROVISION)
> > > > >
> > > > > /* on ia32 l_start is on a 32-bit boundary */
> > > > > #if defined(CONFIG_X86_64)
> > > > > diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
> > > > > index 51398fa57f6c..2d323d113eed 100644
> > > > > --- a/include/uapi/linux/falloc.h
> > > > > +++ b/include/uapi/linux/falloc.h
> > > > > @@ -77,4 +77,12 @@
> > > > > */
> > > > > #define FALLOC_FL_UNSHARE_RANGE 0x40
> > > > >
> > > > > +/*
> > > > > + * FALLOC_FL_PROVISION acts as a hint for thinly provisioned devices to allocate
> > > > > + * blocks for the range/EOF.
> > > > > + *
> > > > > + * FALLOC_FL_PROVISION can only be used with allocate-mode fallocate.
> > > > > + */
> > > > > +#define FALLOC_FL_PROVISION 0x80
> > > > > +
> > > > > #endif /* _UAPI_FALLOC_H_ */
> > > > > --
> > > > > 2.31.0
> > > > >
> > > >
> > >
> >
>

2022-09-23 08:51:18

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH RFC 4/8] fs: Introduce FALLOC_FL_PROVISION

On Tue, Sep 20, 2022 at 10:54:32PM -0700, Sarthak Kukreti wrote:
> [ mmc0blkp1 | ext4(1) | sparse file | loop | dm-thinp | dm-thin | ext4(2) ]
>
> would be predicated on the guarantees of fallocate() per allocation
> layer; if ext4(1) was replaced by a filesystem that did not support
> fallocate(), then there would be no guarantee that a write to a file
> on ext4(2) succeeds.

A single write, or an unlimited number of writes?

2022-09-23 08:54:00

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH RFC 0/8] Introduce provisioning primitives for thinly provisioned storage

On Wed, Sep 21, 2022 at 07:48:50AM +1000, Daniil Lunev wrote:
> > There is no such thing as WRITE UNAVAILABLE in NVMe.
> Apologize, that is WRITE UNCORRECTABLE. Chapter 3.2.7 of
> NVM Express NVM Command Set Specification 1.0b

Write uncorrectable is a very different thing, and the equivalent of the
horribly misnamed SCSI WRITE LONG COMMAND. It injects an unrecoverable
error, and does not provision anything.

> * Each application is potentially allowed to consume the entirety
> of the disk space - there is no strict size limit for application
> * Applications need to pre-allocate space sometime, for which
> they use fallocate. Once the operation succeeded, the application
> assumed the space is guaranteed to be there for it.
> * Since filesystems on the volumes are independent, filesystem
> level enforcement of size constraints is impossible and the only
> common level is the thin pool, thus, each fallocate has to find its
> representation in thin pool one way or another - otherwise you
> may end up in the situation, where FS thinks it has allocated space
> but when it tries to actually write it, the thin pool is already
> exhausted.
> * Hole-Punching fallocate will not reach the thin pool, so the only
> solution presently is zero-writing pre-allocate.

To me it sounds like you want a non-thin pool in dm-thin and/or
guaranteed space reservations for it.

> * Thus, a provisioning block operation allows an interface specific
> operation that guarantees the presence of the block in the
> mapped space. LVM Thin-pool itself is the primary target for our
> use case but the argument is that this operation maps well to
> other interfaces which allow thinly provisioned units.

I think where you are trying to go here is badly mistaken. With flash
(or hard drive SMR) there is no such thing as provisioning LBAs. Every
write is out of place, and a one-time space allocation does not help
you at all. So fundamentally what you are trying to do here just goes
against the actual physics of modern storage media. While there are some
layers that keep up a pretence, trying to do that at an exposed API
level is a really bad idea.

2022-09-23 14:13:44

by Mike Snitzer

[permalink] [raw]
Subject: Re: [PATCH RFC 0/8] Introduce provisioning primitives for thinly provisioned storage

On Fri, Sep 23 2022 at 4:51P -0400,
Christoph Hellwig <[email protected]> wrote:

> On Wed, Sep 21, 2022 at 07:48:50AM +1000, Daniil Lunev wrote:
> > > There is no such thing as WRITE UNAVAILABLE in NVMe.
> > Apologize, that is WRITE UNCORRECTABLE. Chapter 3.2.7 of
> > NVM Express NVM Command Set Specification 1.0b
>
> Write uncorrectable is a very different thing, and the equivalent of the
> horribly misnamed SCSI WRITE LONG COMMAND. It injects an unrecoverable
> error, and does not provision anything.
>
> > * Each application is potentially allowed to consume the entirety
> > of the disk space - there is no strict size limit for application
> > * Applications need to pre-allocate space sometime, for which
> > they use fallocate. Once the operation succeeded, the application
> > assumed the space is guaranteed to be there for it.
> > * Since filesystems on the volumes are independent, filesystem
> > level enforcement of size constraints is impossible and the only
> > common level is the thin pool, thus, each fallocate has to find its
> > representation in thin pool one way or another - otherwise you
> > may end up in the situation, where FS thinks it has allocated space
> > but when it tries to actually write it, the thin pool is already
> > exhausted.
> > * Hole-Punching fallocate will not reach the thin pool, so the only
> > solution presently is zero-writing pre-allocate.
>
> To me it sounds like you want a non-thin pool in dm-thin and/or
> guaranteed space reservations for it.

What is implemented in this patchset: enablement for dm-thinp to
actually provide guarantees which fallocate requires.

Seems you're getting hung up on the finishing details in HW (details
which are _not_ the point of this patchset).

The proposed changes are in service to _Linux_ code. The patchset
implements the primitive from top (ext4) to bottom (dm-thinp, loop).
It stops short of implementing handling everywhere that'd need it
(e.g. in XFS, etc). But those changes can come as follow-on work once
the primitive is established top to bottom.

But you know all this ;)

> > * Thus, a provisioning block operation allows an interface specific
> > operation that guarantees the presence of the block in the
> > mapped space. LVM Thin-pool itself is the primary target for our
> > use case but the argument is that this operation maps well to
> > other interfaces which allow thinly provisioned units.
>
> I think where you are trying to go here is badly mistaken. With flash
> (or hard drive SMR) there is no such thing as provisioning LBAs. Every
> write is out of place, and a one-time space allocation does not help
> you at all. So fundamentally what you are trying to do here just goes
> against the actual physics of modern storage media. While there are some
> layers that keep up a pretence, trying to do that at an exposed API
> level is a really bad idea.

This doesn't need to be so feudal. Reserving an LBA in physical HW
really isn't the point.

Fact remains: an operation that ensures space is actually reserved via
fallocate is long overdue (just because an FS did its job doesn't mean
underlying layers reflect that). And certainly useful, even if "only"
benefiting dm-thinp and the loop driver. Like other block primitives,
REQ_OP_PROVISION is filtered out by block core if the device doesn't
support it.

That said, I agree with Brian Foster that we need really solid
documentation and justification for why fallocate mode=0 cannot be
used (but the case has been made in this thread).

Also, I do see an issue with the implementation (relative to stacked
devices): dm_table_supports_provision() is too myopic about DM. It
needs to go a step further and verify that some layer in the stack
actually services REQ_OP_PROVISION. Will respond to DM patch too.

2022-09-23 14:32:50

by Mike Snitzer

[permalink] [raw]
Subject: Re: [PATCH RFC 2/8] dm: Add support for block provisioning

On Thu, Sep 15 2022 at 12:48P -0400,
Sarthak Kukreti <[email protected]> wrote:

> From: Sarthak Kukreti <[email protected]>
>
> Add support to dm devices for REQ_OP_PROVISION. The default mode
> is to pass through the request and dm-thin will utilize it to provision
> blocks.
>
> Signed-off-by: Sarthak Kukreti <[email protected]>
> ---
> drivers/md/dm-crypt.c | 4 +-
> drivers/md/dm-linear.c | 1 +
> drivers/md/dm-table.c | 17 +++++++
> drivers/md/dm-thin.c | 86 +++++++++++++++++++++++++++++++++--
> drivers/md/dm.c | 4 ++
> include/linux/device-mapper.h | 6 +++
> 6 files changed, 113 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> index 159c6806c19b..357f0899cfb6 100644
> --- a/drivers/md/dm-crypt.c
> +++ b/drivers/md/dm-crypt.c
> @@ -3081,6 +3081,8 @@ static int crypt_ctr_optional(struct dm_target *ti, unsigned int argc, char **ar
> if (ret)
> return ret;
>
> + ti->num_provision_bios = 1;
> +
> while (opt_params--) {
> opt_string = dm_shift_arg(&as);
> if (!opt_string) {
> @@ -3384,7 +3386,7 @@ static int crypt_map(struct dm_target *ti, struct bio *bio)
> * - for REQ_OP_DISCARD caller must use flush if IO ordering matters
> */
> if (unlikely(bio->bi_opf & REQ_PREFLUSH ||
> - bio_op(bio) == REQ_OP_DISCARD)) {
> + bio_op(bio) == REQ_OP_DISCARD || bio_op(bio) == REQ_OP_PROVISION)) {
> bio_set_dev(bio, cc->dev->bdev);
> if (bio_sectors(bio))
> bio->bi_iter.bi_sector = cc->start +
> diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
> index 3212ef6aa81b..1aa782149428 100644
> --- a/drivers/md/dm-linear.c
> +++ b/drivers/md/dm-linear.c
> @@ -61,6 +61,7 @@ static int linear_ctr(struct dm_target *ti, unsigned int argc, char **argv)
> ti->num_discard_bios = 1;
> ti->num_secure_erase_bios = 1;
> ti->num_write_zeroes_bios = 1;
> + ti->num_provision_bios = 1;
> ti->private = lc;
> return 0;
>
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index 332f96b58252..b7f9cb66b7ba 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -1853,6 +1853,18 @@ static bool dm_table_supports_write_zeroes(struct dm_table *t)
> return true;
> }
>
> +static bool dm_table_supports_provision(struct dm_table *t)
> +{
> + for (unsigned int i = 0; i < t->num_targets; i++) {
> + struct dm_target *ti = dm_table_get_target(t, i);
> +
> + if (ti->num_provision_bios)
> + return true;
> + }
> +
> + return false;
> +}
> +

This needs to go a step further and verify a device in the stack
actually services REQ_OP_PROVISION.

Please see dm_table_supports_discards(): it iterates all devices in
the table and checks that support is advertised.

For discard, DM requires that _all_ devices in a table advertise
support (that is pretty strict and likely could be relaxed to _any_).

You'll need ti->provision_supported (like ->discards_supported) to
advertise actual support is provided by dm-thinp (even if underlying
devices don't support it).

And yeah, dm-thinp passdown support for REQ_OP_PROVISION can follow
later as needed (if there is actual HW that would benefit from
REQ_OP_PROVISION).
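
To illustrate the check being asked for, here is a minimal userspace
sketch, not the kernel implementation. The model_* structures and the
lower_dev_provisions field are hypothetical stand-ins for struct
dm_target, struct dm_table and the per-device queue limits; only
num_provision_bios (from the patch) and provision_supported (suggested
above) come from this thread. It uses the relaxed "any" policy: the
table advertises provision only if at least one target actually
services the request, either itself or through an underlying device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are NOT the real DM structures. */
struct model_target {
	unsigned int num_provision_bios; /* target accepts provision bios */
	bool provision_supported;        /* target services them itself (e.g. thin) */
	bool lower_dev_provisions;       /* assumed: an underlying device advertises support */
};

struct model_table {
	const struct model_target *targets;
	size_t num_targets;
};

/* A target only counts if something behind it really services the request. */
static bool model_target_services_provision(const struct model_target *ti)
{
	if (!ti->num_provision_bios)
		return false;
	return ti->provision_supported || ti->lower_dev_provisions;
}

/* "Any" policy: one servicing target is enough for the table to advertise support. */
static bool model_table_supports_provision(const struct model_table *t)
{
	assert(t != NULL);

	for (size_t i = 0; i < t->num_targets; i++) {
		if (model_target_services_provision(&t->targets[i]))
			return true;
	}
	return false;
}
```

With this shape, a pure pass-through target (num_provision_bios set,
nothing below it servicing the op) no longer makes the whole table
claim support, which is the myopia pointed out above.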

Mike

2022-09-27 21:44:58

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH RFC 3/8] virtio_blk: Add support for provision requests

On Thu, Sep 15, 2022 at 09:48:21AM -0700, Sarthak Kukreti wrote:
> From: Sarthak Kukreti <[email protected]>
>
> Adds support for provision requests. Provision requests act like
> the inverse of discards.
>
> Signed-off-by: Sarthak Kukreti <[email protected]>
> ---
> drivers/block/virtio_blk.c | 48 +++++++++++++++++++++++++++++++++
> include/uapi/linux/virtio_blk.h | 9 +++++++
> 2 files changed, 57 insertions(+)
>
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 30255fcaf181..eacc2bffe1d1 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -178,6 +178,39 @@ static int virtblk_setup_discard_write_zeroes(struct request *req, bool unmap)
> return 0;
> }
>
> +static int virtblk_setup_provision(struct request *req)
> +{
> + unsigned short segments = blk_rq_nr_discard_segments(req);
> + unsigned short n = 0;
> +
> + struct virtio_blk_discard_write_zeroes *range;
> + struct bio *bio;
> + u32 flags = 0;
> +
> + range = kmalloc_array(segments, sizeof(*range), GFP_ATOMIC);
> + if (!range)
> + return -ENOMEM;
> +
> + __rq_for_each_bio(bio, req) {
> + u64 sector = bio->bi_iter.bi_sector;
> + u32 num_sectors = bio->bi_iter.bi_size >> SECTOR_SHIFT;
> +
> + range[n].flags = cpu_to_le32(flags);
> + range[n].num_sectors = cpu_to_le32(num_sectors);
> + range[n].sector = cpu_to_le64(sector);
> + n++;
> + }
> +
> + WARN_ON_ONCE(n != segments);
> +
> + req->special_vec.bv_page = virt_to_page(range);
> + req->special_vec.bv_offset = offset_in_page(range);
> + req->special_vec.bv_len = sizeof(*range) * segments;
> + req->rq_flags |= RQF_SPECIAL_PAYLOAD;
> +
> + return 0;
> +}
> +
> static void virtblk_unmap_data(struct request *req, struct virtblk_req *vbr)
> {
> if (blk_rq_nr_phys_segments(req))
> @@ -243,6 +276,9 @@ static blk_status_t virtblk_setup_cmd(struct virtio_device *vdev,
> case REQ_OP_DRV_IN:
> type = VIRTIO_BLK_T_GET_ID;
> break;
> + case REQ_OP_PROVISION:
> + type = VIRTIO_BLK_T_PROVISION;
> + break;
> default:
> WARN_ON_ONCE(1);
> return BLK_STS_IOERR;
> @@ -256,6 +292,11 @@ static blk_status_t virtblk_setup_cmd(struct virtio_device *vdev,
> return BLK_STS_RESOURCE;
> }
>
> + if (type == VIRTIO_BLK_T_PROVISION) {
> + if (virtblk_setup_provision(req))
> + return BLK_STS_RESOURCE;
> + }
> +
> return 0;
> }
>
> @@ -1075,6 +1116,12 @@ static int virtblk_probe(struct virtio_device *vdev)
> blk_queue_max_write_zeroes_sectors(q, v ? v : UINT_MAX);
> }
>
> + if (virtio_has_feature(vdev, VIRTIO_BLK_F_PROVISION)) {
> + virtio_cread(vdev, struct virtio_blk_config,
> + max_provision_sectors, &v);
> + q->limits.max_provision_sectors = v ? v : UINT_MAX;
> + }
> +
> virtblk_update_capacity(vblk, false);
> virtio_device_ready(vdev);
>
> @@ -1177,6 +1224,7 @@ static unsigned int features[] = {
> VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE,
> VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_CONFIG_WCE,
> VIRTIO_BLK_F_MQ, VIRTIO_BLK_F_DISCARD, VIRTIO_BLK_F_WRITE_ZEROES,
> + VIRTIO_BLK_F_PROVISION,
> };
>
> static struct virtio_driver virtio_blk = {
> diff --git a/include/uapi/linux/virtio_blk.h b/include/uapi/linux/virtio_blk.h
> index d888f013d9ff..184f8cf6d185 100644
> --- a/include/uapi/linux/virtio_blk.h
> +++ b/include/uapi/linux/virtio_blk.h
> @@ -40,6 +40,7 @@
> #define VIRTIO_BLK_F_MQ 12 /* support more than one vq */
> #define VIRTIO_BLK_F_DISCARD 13 /* DISCARD is supported */
> #define VIRTIO_BLK_F_WRITE_ZEROES 14 /* WRITE ZEROES is supported */
> +#define VIRTIO_BLK_F_PROVISION 15 /* provision is supported */
>
> /* Legacy feature bits */
> #ifndef VIRTIO_BLK_NO_LEGACY
> @@ -120,6 +121,11 @@ struct virtio_blk_config {
> */
> __u8 write_zeroes_may_unmap;
>
> + /*
> + * The maximum number of sectors in a provision request.
> + */
> + __virtio32 max_provision_sectors;
> +
> __u8 unused1[3];
> } __attribute__((packed));
>
> @@ -155,6 +161,9 @@ struct virtio_blk_config {
> /* Write zeroes command */
> #define VIRTIO_BLK_T_WRITE_ZEROES 13
>
> +/* Provision command */
> +#define VIRTIO_BLK_T_PROVISION 14
> +
> #ifndef VIRTIO_BLK_NO_LEGACY
> /* Barrier before this op. */
> #define VIRTIO_BLK_T_BARRIER 0x80000000


The feature bit has to be reserved in the virtio spec.
Please do this through the virtio TC mailing list.

> --
> 2.31.0
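
For readers following the quoted virtblk_setup_provision(), the
per-bio range encoding can be sketched in userspace as below. This is
only an illustration under stated assumptions: model_provision_range
mirrors the field layout of struct virtio_blk_discard_write_zeroes
(64-bit sector, 32-bit num_sectors, 32-bit flags, all little-endian),
while model_extent and model_fill_provision_ranges() are hypothetical
names that stand in for the bio list and the kernel helper.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <endian.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_SECTOR_SHIFT 9 /* 512-byte sectors, as with SECTOR_SHIFT */

/* Same field layout as struct virtio_blk_discard_write_zeroes; values are little-endian. */
struct model_provision_range {
	uint64_t sector;
	uint32_t num_sectors;
	uint32_t flags;
};

/* Hypothetical stand-in for one bio: start sector and length in bytes. */
struct model_extent {
	uint64_t sector;
	uint32_t bytes;
};

/* Encode up to n_out extents into wire-format ranges; returns how many were written. */
static size_t model_fill_provision_ranges(const struct model_extent *ext, size_t n_ext,
					  struct model_provision_range *out, size_t n_out)
{
	size_t n = 0;

	assert(ext != NULL && out != NULL);

	for (size_t i = 0; i < n_ext && n < n_out; i++) {
		out[n].sector = htole64(ext[i].sector);
		out[n].num_sectors = htole32(ext[i].bytes >> MODEL_SECTOR_SHIFT);
		out[n].flags = htole32(0); /* the patch sends no flags for provision */
		n++;
	}
	return n;
}
```

Unlike the quoted kernel code, the sketch bounds the loop by the
output array size instead of warning afterwards when the counts
disagree.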

2022-12-29 08:16:59

by Sarthak Kukreti

[permalink] [raw]
Subject: Re: [PATCH RFC 4/8] fs: Introduce FALLOC_FL_PROVISION

On Thu, Sep 22, 2022 at 11:29 AM Brian Foster <[email protected]> wrote:
>
> On Thu, Sep 22, 2022 at 01:04:33AM -0700, Sarthak Kukreti wrote:
> > On Wed, Sep 21, 2022 at 8:39 AM Brian Foster <[email protected]> wrote:
> > >
> > > On Fri, Sep 16, 2022 at 02:02:31PM -0700, Sarthak Kukreti wrote:
> > > > On Fri, Sep 16, 2022 at 4:56 AM Brian Foster <[email protected]> wrote:
> > > > >
> > > > > On Thu, Sep 15, 2022 at 09:48:22AM -0700, Sarthak Kukreti wrote:
> > > > > > From: Sarthak Kukreti <[email protected]>
> > > > > >
> > > > > > FALLOC_FL_PROVISION is a new fallocate() allocation mode that
> > > > > > sends a hint to (supported) thinly provisioned block devices to
> > > > > > allocate space for the given range of sectors via REQ_OP_PROVISION.
> > > > > >
> > > > > > Signed-off-by: Sarthak Kukreti <[email protected]>
> > > > > > ---
> > > > > > block/fops.c | 7 ++++++-
> > > > > > include/linux/falloc.h | 3 ++-
> > > > > > include/uapi/linux/falloc.h | 8 ++++++++
> > > > > > 3 files changed, 16 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/block/fops.c b/block/fops.c
> > > > > > index b90742595317..a436a7596508 100644
> > > > > > --- a/block/fops.c
> > > > > > +++ b/block/fops.c
> > > > > ...
> > > > > > @@ -661,6 +662,10 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> > > > > > error = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT,
> > > > > > len >> SECTOR_SHIFT, GFP_KERNEL);
> > > > > > break;
> > > > > > + case FALLOC_FL_PROVISION:
> > > > > > + error = blkdev_issue_provision(bdev, start >> SECTOR_SHIFT,
> > > > > > + len >> SECTOR_SHIFT, GFP_KERNEL);
> > > > > > + break;
> > > > > > default:
> > > > > > error = -EOPNOTSUPP;
> > > > > > }
> > > > >
> > > > > Hi Sarthak,
> > > > >
> > > > > Neat mechanism.. I played with something very similar in the past (that
> > > > > was much more crudely hacked up to target dm-thin) to allow filesystems
> > > > > to request a thinly provisioned device to allocate blocks and try to do
> > > > > a better job of avoiding inactivation when overprovisioned.
> > > > >
> > > > > One thing I'm a little curious about here.. what's the need for a new
> > > > > fallocate mode? On a cursory glance, the provision mode looks fairly
> > > > > analogous to normal (mode == 0) allocation mode with the exception of
> > > > > sending the request down to the bdev. blkdev_fallocate() already maps
> > > > > some of the logical falloc modes (i.e. punch hole, zero range) to
> > > > > sending write sames or discards, etc., and it doesn't currently look
> > > > > like it supports allocation mode, so could it not map such requests to
> > > > > the underlying REQ_OP_PROVISION op?
> > > > >
> > > > > I guess the difference would be at the filesystem level where we'd
> > > > > probably need to rely on a mount option or some such to control whether
> > > > > traditional fallocate issues provision ops (like you've implemented for
> > > > > ext4) vs. the specific falloc command, but that seems fairly consistent
> > > > > with historical punch hole/discard behavior too. Hm? You might want to
> > > > > cc linux-fsdevel in future posts in any event to get some more feedback
> > > > > on how other filesystems might want to interact with such a thing.
> > > > >
> > > > Thanks for the feedback!
> > > > Argh, I completely forgot that I should add linux-fsdevel. Let me
> > > > re-send this with linux-fsdevel cc'd
> > > >
> > > > There's a slight distinction is that the current filesystem-level
> > > > controls are usually for default handling, but userspace can still
> > > > call the relevant functions manually if they need to. For example, for
> > > > ext4, the 'discard' mount option dictates whether free blocks are
> > > > discarded, but it doesn't set the policy to allow/disallow userspace
> > > > from manually punching holes into files even if the mount opt is
> > > > 'nodiscard'. FALLOC_FL_PROVISION is similar in that regard; it adds a
> > > > manual mechanism for users to provision the files' extents, that is
> > > > separate from the filesystems' default handling of provisioning files.
> > > >
> > >
> > > What I'm trying to understand is why not let blkdev_fallocate() issue a
> > > provision based on the default mode (i.e. mode == 0) of fallocate(),
> > > which is already defined to mean "perform allocation?" It currently
> > > issues discards or write zeroes based on variants of
> > > FALLOC_FL_PUNCH_HOLE without the need for a separate FALLOC_FL_DISCARD
> > > mode, for example.
> > >
> > It's mostly to keep the block device fallocate() semantics in-line and
> > consistent with the file-specific modes: I added the separate
> > filesystem fallocate() mode under the assumption that we'd want to
> > keep the traditional handling for filesystems intact with (mode == 0).
> > And for block devices, I didn't map the requests to mode == 0 so that
> > it's less confusing to describe (eg. mode == 0 on block devices will
> > issue provision; mode == 0 on files will not). It would complicate
> > loopback devices, for instance; if the loop device is backed by a
> > file, it would need to use (mode == FALLOC_FL_PROVISION) but if the
> > loop device is backed by another block device, then the fallocate()
> > call would need to switch to (mode == 0).
> >
>
> I would expect the loopback scenario for provision to behave similar to
> how discards are handled. I.e., loopback receives a provision request
> and translates that to fallocate(mode = 0). If the backing device is
> block, blkdev_fallocate(mode = 0) translates that to another provision
> request. If the backing device is a file, the associated fallocate
> handler allocs/maps, if necessary, and then issues a provision on
> allocation, if enabled by the fs.
>
> AFAICT there's no need for FL_PROVISION at all in that scenario. Is
> there a functional purpose to FL_PROVISION? Is the intent to try and
> guarantee that a provision request propagates down the I/O stack? If so,
> what happens if blocks were already preallocated in the backing file (in
> the loopback file example)?
>
> BTW, an unrelated thing I noticed is that blkdev_fallocate()
> unconditionally calls truncate_bdev_range(), which probably doesn't make
> sense for any sort of alloc mode.
>
Thanks for pointing that out, fixed in v2.

> > With the separate mode, we can describe the semantics of fallocate()
> > modes a bit more cleanly, and it is common for both files and block
> > devices:
> >
> > 1. mode == 0: allocation at the same layer, will not provision on the
> > underlying device/filesystem (unsupported for block devices).
> > 2. mode == FALLOC_FL_PROVISION, allocation at the layer, will
> > provision on the underlying device/filesystem.
> >
>
> I think I see why you make the distinction, since the block layer
> doesn't have a "this layer only" mode, but IMO it's also quite confusing
> to say that mode == FL_PROVISION can allocate at the current and
> underlying layer(s) but mode == 0 to that underlying layer cannot.
>
> Either way, if you want to propose a new falloc mode/modifier, it
> probably warrants a more detailed commit log with more explanation of
> the purpose, examples of behavior, perhaps some details on how the mode
> might be documented in man pages, etc.
>
That's fair. Added more details to the patch commit log in v2.

Thanks
Sarthak

> Brian
>
> > Block devices don't technically need to use a separate mode, but it
> > makes it much less confusing if filesystems are already using a
> > separate mode for provision.
> >
> > Best
> > Sarthak
> >
> > > Brian
> > >
> > > > > BTW another thing that might be useful wrt to dm-thin is to support
> > > > > FALLOC_FL_UNSHARE. I.e., it looks like the previous dm-thin patch only
> > > > > checks that blocks are allocated, but not whether those blocks are
> > > > > shared (re: lookup_result.shared). It might be useful to do the COW in
> > > > > such cases if the caller passes down a REQ_UNSHARE or some such flag.
> > > > >
> > > > That's an interesting idea! There's a few more things on the TODO list
> > > > for this patch series but I think we can follow up with a patch to
> > > > handle that as well.
> > > >
> > > > Sarthak
> > > >
> > > > > Brian
> > > > >
> > > > > > diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> > > > > > index f3f0b97b1675..a0e506255b20 100644
> > > > > > --- a/include/linux/falloc.h
> > > > > > +++ b/include/linux/falloc.h
> > > > > > @@ -30,7 +30,8 @@ struct space_resv {
> > > > > > FALLOC_FL_COLLAPSE_RANGE | \
> > > > > > FALLOC_FL_ZERO_RANGE | \
> > > > > > FALLOC_FL_INSERT_RANGE | \
> > > > > > - FALLOC_FL_UNSHARE_RANGE)
> > > > > > + FALLOC_FL_UNSHARE_RANGE | \
> > > > > > + FALLOC_FL_PROVISION)
> > > > > >
> > > > > > /* on ia32 l_start is on a 32-bit boundary */
> > > > > > #if defined(CONFIG_X86_64)
> > > > > > diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
> > > > > > index 51398fa57f6c..2d323d113eed 100644
> > > > > > --- a/include/uapi/linux/falloc.h
> > > > > > +++ b/include/uapi/linux/falloc.h
> > > > > > @@ -77,4 +77,12 @@
> > > > > > */
> > > > > > #define FALLOC_FL_UNSHARE_RANGE 0x40
> > > > > >
> > > > > > +/*
> > > > > > + * FALLOC_FL_PROVISION acts as a hint for thinly provisioned devices to allocate
> > > > > > + * blocks for the range/EOF.
> > > > > > + *
> > > > > > + * FALLOC_FL_PROVISION can only be used with allocate-mode fallocate.
> > > > > > + */
> > > > > > +#define FALLOC_FL_PROVISION 0x80
> > > > > > +
> > > > > > #endif /* _UAPI_FALLOC_H_ */
> > > > > > --
> > > > > > 2.31.0
> > > > > >
> > > > >
> > > >
> > >
> >
>

2022-12-29 08:19:08

by Sarthak Kukreti

[permalink] [raw]
Subject: Re: [PATCH RFC 4/8] fs: Introduce FALLOC_FL_PROVISION

On Fri, Sep 23, 2022 at 1:45 AM Christoph Hellwig <[email protected]> wrote:
>
> On Tue, Sep 20, 2022 at 10:54:32PM -0700, Sarthak Kukreti wrote:
> > [ mmc0blkp1 | ext4(1) | sparse file | loop | dm-thinp | dm-thin | ext4(2) ]
> >
> > would be predicated on the guarantees of fallocate() per allocation
> > layer; if ext4(1) was replaced by a filesystem that did not support
> > fallocate(), then there would be no guarantee that a write to a file
> > on ext4(2) succeeds.
>
> a write or any unlimited number of writes?

(Apologies for the super late reply!) In this case, even a write won't
be guaranteed if we run out of space on the lower filesystem. Looking
at the fallocate() man page, I think the key part lies in the
following phrase (emphasis mine):

```
After a successful call, subsequent writes into the range
specified by offset and len are guaranteed not to fail _because of
lack of disk space_
```

So, it's not a blanket guarantee that all writes will always succeed,
but that any writes into that range will not fail due to lack of disk
space. As you mentioned, writes may happen out-of-place in one or more
layers. But fallocate(FALLOC_FL_PROVISION) ensures that each layer
reserves enough space that writes into that range do not fail with
ENOSPC (so, e.g., ext4 and dm-thinp will set aside enough extents to
fulfil that promise later on for all writes).
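
As a userspace illustration of the man page guarantee quoted above,
here is a minimal sketch using plain mode == 0 fallocate(), since
FALLOC_FL_PROVISION only exists in this RFC. The helper name
preallocate_then_write() is made up for the example. The point is
that, after a successful fallocate(), the writes into that range
should not fail with ENOSPC at the filesystem layer; whether the
layers underneath honour that is exactly what this series is about.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Preallocate [offset, offset + len) with mode == 0, then fill the range.
 * Returns 0 on success or a negative errno value on failure.
 */
static int preallocate_then_write(int fd, off_t offset, off_t len)
{
	char buf[4096];
	off_t done = 0;

	assert(fd >= 0 && offset >= 0 && len > 0);

	if (fallocate(fd, 0, offset, len) != 0)
		return -errno;

	memset(buf, 0xab, sizeof(buf));
	while (done < len) {
		size_t chunk = (len - done) < (off_t)sizeof(buf) ?
			       (size_t)(len - done) : sizeof(buf);
		ssize_t n = pwrite(fd, buf, chunk, offset + done);

		if (n < 0)
			return -errno; /* ENOSPC here would break the guarantee */
		if (n == 0)
			return -EIO;
		done += n;
	}
	return 0;
}
```

On a filesystem without fallocate() support the helper returns
-EOPNOTSUPP, which callers should treat as "no guarantee available"
rather than as a hard error.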

Best

Sarthak

2022-12-29 08:19:39

by Sarthak Kukreti

[permalink] [raw]
Subject: Re: [PATCH RFC 0/8] Introduce provisioning primitives for thinly provisioned storage

On Fri, Sep 23, 2022 at 7:08 AM Mike Snitzer <[email protected]> wrote:
>
> On Fri, Sep 23 2022 at 4:51P -0400,
> Christoph Hellwig <[email protected]> wrote:
>
> > On Wed, Sep 21, 2022 at 07:48:50AM +1000, Daniil Lunev wrote:
> > > > There is no such thing as WRITE UNAVAILABLE in NVMe.
> > > Apologize, that is WRITE UNCORRECTABLE. Chapter 3.2.7 of
> > > NVM Express NVM Command Set Specification 1.0b
> >
> > Write uncorrectable is a very different thing, and the equivalent of the
> > horribly misnamed SCSI WRITE LONG COMMAND. It injects an unrecoverable
> > error, and does not provision anything.
> >
> > > * Each application is potentially allowed to consume the entirety
> > > of the disk space - there is no strict size limit for application
> > > * Applications need to pre-allocate space sometime, for which
> > > they use fallocate. Once the operation succeeded, the application
> > > assumed the space is guaranteed to be there for it.
> > > * Since filesystems on the volumes are independent, filesystem
> > > level enforcement of size constraints is impossible and the only
> > > common level is the thin pool, thus, each fallocate has to find its
> > > representation in thin pool one way or another - otherwise you
> > > may end up in the situation, where FS thinks it has allocated space
> > > but when it tries to actually write it, the thin pool is already
> > > exhausted.
> > > * Hole-Punching fallocate will not reach the thin pool, so the only
> > > solution presently is zero-writing pre-allocate.
> >
> > To me it sounds like you want a non-thin pool in dm-thin and/or
> > guaranteed space reservations for it.
>
> What is implemented in this patchset: enablement for dm-thinp to
> actually provide guarantees which fallocate requires.
>
> Seems you're getting hung up on the finishing details in HW (details
> which are _not_ the point of this patchset).
>
> The proposed changes are in service to _Linux_ code. The patchset
> implements the primitive from top (ext4) to bottom (dm-thinp, loop).
> It stops short of implementing handling everywhere that'd need it
> (e.g. in XFS, etc). But those changes can come as follow-on work once
> the primitive is established top to bottom.
>
> But you know all this ;)
>
> > > * Thus, a provisioning block operation allows an interface specific
> > > operation that guarantees the presence of the block in the
> > > mapped space. LVM Thin-pool itself is the primary target for our
> > > use case but the argument is that this operation maps well to
> > > other interfaces which allow thinly provisioned units.
> >
> > I think where you are trying to go here is badly mistaken. With flash
> > (or hard drive SMR) there is no such thing as provisioning LBAs. Every
> > write is out of place, and a one-time space allocation does not help
> > you at all. So fundamentally what you are trying to do here just goes
> > against the actual physics of modern storage media. While there are some
> > layers that keep up a pretence, trying to do that at an exposed API
> > level is a really bad idea.
>
> This doesn't need to be so feudal. Reserving an LBA in physical HW
> really isn't the point.
>
> Fact remains: an operation that ensures space is actually reserved via
> fallocate is long overdue (just because an FS did its job doesn't mean
> underlying layers reflect that). And certainly useful, even if "only"
> benefiting dm-thinp and the loop driver. Like other block primitives,
> REQ_OP_PROVISION is filtered out by block core if the device doesn't
> support it.
>
> That said, I agree with Brian Foster that we need really solid
> documentation and justification for why fallocate mode=0 cannot be
> used (but the case has been made in this thread).
>
> Also, I do see an issue with the implementation (relative to stacked
> devices): dm_table_supports_provision() is too myopic about DM. It
> needs to go a step further and verify that some layer in the stack
> actually services REQ_OP_PROVISION. Will respond to DM patch too.
>
Thanks all for the suggestions and feedback! I just posted v2 (more
than a bit belatedly) on the various mailing lists with the relevant
fixes, documentation and some benchmarks on performance.

Best
Sarthak

2022-12-29 08:28:27

by Sarthak Kukreti

[permalink] [raw]
Subject: Re: [PATCH RFC 2/8] dm: Add support for block provisioning

On Fri, Sep 23, 2022 at 7:23 AM Mike Snitzer <[email protected]> wrote:
>
> On Thu, Sep 15 2022 at 12:48P -0400,
> Sarthak Kukreti <[email protected]> wrote:
>
> > From: Sarthak Kukreti <[email protected]>
> >
> > Add support to dm devices for REQ_OP_PROVISION. The default mode
> > is to pass through the request and dm-thin will utilize it to provision
> > blocks.
> >
> > Signed-off-by: Sarthak Kukreti <[email protected]>
> > ---
> > drivers/md/dm-crypt.c | 4 +-
> > drivers/md/dm-linear.c | 1 +
> > drivers/md/dm-table.c | 17 +++++++
> > drivers/md/dm-thin.c | 86 +++++++++++++++++++++++++++++++++--
> > drivers/md/dm.c | 4 ++
> > include/linux/device-mapper.h | 6 +++
> > 6 files changed, 113 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> > index 159c6806c19b..357f0899cfb6 100644
> > --- a/drivers/md/dm-crypt.c
> > +++ b/drivers/md/dm-crypt.c
> > @@ -3081,6 +3081,8 @@ static int crypt_ctr_optional(struct dm_target *ti, unsigned int argc, char **ar
> > if (ret)
> > return ret;
> >
> > + ti->num_provision_bios = 1;
> > +
> > while (opt_params--) {
> > opt_string = dm_shift_arg(&as);
> > if (!opt_string) {
> > @@ -3384,7 +3386,7 @@ static int crypt_map(struct dm_target *ti, struct bio *bio)
> > * - for REQ_OP_DISCARD caller must use flush if IO ordering matters
> > */
> > if (unlikely(bio->bi_opf & REQ_PREFLUSH ||
> > - bio_op(bio) == REQ_OP_DISCARD)) {
> > + bio_op(bio) == REQ_OP_DISCARD || bio_op(bio) == REQ_OP_PROVISION)) {
> > bio_set_dev(bio, cc->dev->bdev);
> > if (bio_sectors(bio))
> > bio->bi_iter.bi_sector = cc->start +
> > diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
> > index 3212ef6aa81b..1aa782149428 100644
> > --- a/drivers/md/dm-linear.c
> > +++ b/drivers/md/dm-linear.c
> > @@ -61,6 +61,7 @@ static int linear_ctr(struct dm_target *ti, unsigned int argc, char **argv)
> > ti->num_discard_bios = 1;
> > ti->num_secure_erase_bios = 1;
> > ti->num_write_zeroes_bios = 1;
> > + ti->num_provision_bios = 1;
> > ti->private = lc;
> > return 0;
> >
> > diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> > index 332f96b58252..b7f9cb66b7ba 100644
> > --- a/drivers/md/dm-table.c
> > +++ b/drivers/md/dm-table.c
> > @@ -1853,6 +1853,18 @@ static bool dm_table_supports_write_zeroes(struct dm_table *t)
> > return true;
> > }
> >
> > +static bool dm_table_supports_provision(struct dm_table *t)
> > +{
> > + for (unsigned int i = 0; i < t->num_targets; i++) {
> > + struct dm_target *ti = dm_table_get_target(t, i);
> > +
> > + if (ti->num_provision_bios)
> > + return true;
> > + }
> > +
> > + return false;
> > +}
> > +
>
> This needs to go a step further and verify a device in the stack
> actually services REQ_OP_PROVISION.
>
> Please see dm_table_supports_discards(): it iterates all devices in
> the table and checks that support is advertised.
>
> For discard, DM requires that _all_ devices in a table advertise
> support (that is pretty strict and likely could be relaxed to _any_).
>
> You'll need ti->provision_supported (like ->discards_supported) to
> advertise actual support is provided by dm-thinp (even if underlying
> devices don't support it).
>
> And yeah, dm-thinp passdown support for REQ_OP_PROVISION can follow
> later as needed (if there is actual HW that would benefit from
> REQ_OP_PROVISION).
>
Done, thanks (the provision support, not the passdown)! I think the
one case where passdown might help is to build images with dm-thinp
already set up on one of the partitions (I have something in the works
for ChromiumOS images to do VM tests with preset state :)). That would
allow us to preallocate space for thin logical volumes inside the
image file.

> Mike
>