This patchset adds support for TP 4065a ("Simple Copy Command"),
v2020.05.04 ("Ratified").
The specification can be found at the following link:
https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip
The simple copy command is a copy offloading operation used to copy
multiple contiguous ranges (source_ranges) of LBAs to a single destination
LBA within the device, reducing traffic between host and device.
This implementation doesn't add native copy offload support for stacked
devices; instead, copy offload is done through emulation. Possible use
cases are F2FS gc and BTRFS relocation/balance.
*blkdev_issue_copy* takes a source bdev, the number of sources, an array
of source ranges (values in bytes), a destination bdev and a destination
offset (in bytes). If the source and destination block devices are the
same and copy_offload = 1, the copy is done through native copy
offloading. Copy emulation is used in all other cases.
As SCSI XCOPY can take two different block devices with the number of
source ranges equal to 1, this interface can be extended in the future to
support SCSI XCOPY.
For devices supporting native simple copy, attach the control information
as payload to the bio and submit it to the device. For devices without
native copy support, copy emulation is done by reading each source range
into memory and writing it to the destination.
The following limits are added to the queue limits and are exposed to
userspace via sysfs:
- *copy_offload* controls copy offload. Set to 0 to disable copy
offload, 1 to enable native copy offloading support.
- *max_copy_sectors* limits the sum of all source_range lengths.
- *max_copy_nr_ranges* limits the number of source ranges.
- *max_copy_range_sectors* limits the maximum number of sectors
that can constitute a single source range.
max_copy_sectors = 0 indicates the device doesn't support copy
offloading.
The *copy_offload* sysfs entry is configurable and can be used to toggle
between emulation and native support depending upon the use case.
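The checks these limits imply can be sketched as a small standalone function. This is illustrative only; the names (copy_limits, check_copy) are hypothetical and the real validation lives in blk_check_copy() against struct queue_limits.

```c
#include <assert.h>

struct copy_limits {
	unsigned int max_copy_sectors;		/* sum of all range lengths */
	unsigned short max_copy_range_sectors;	/* one range's length */
	unsigned short max_copy_nr_ranges;	/* number of ranges */
};

/*
 * Return 0 when a copy of nr ranges with the given per-range lengths
 * (in 512-byte sectors) fits the advertised limits, -1 otherwise.
 */
int check_copy(const struct copy_limits *lim,
	       const unsigned int *range_sectors, unsigned short nr)
{
	unsigned int total = 0, i;

	if (!lim->max_copy_sectors)	/* 0 => device has no copy support */
		return -1;
	if (nr > lim->max_copy_nr_ranges)
		return -1;
	for (i = 0; i < nr; i++) {
		if (range_sectors[i] > lim->max_copy_range_sectors)
			return -1;
		total += range_sectors[i];
	}
	return total <= lim->max_copy_sectors ? 0 : -1;
}
```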
Changes from v3
1. gfp_flag fixes.
2. Export bio_map_kern() and use it to allocate and add pages to bio.
3. Move copy offload, reading to buf, writing from buf to separate functions.
4. Send read bio of copy offload by chaining them and submit asynchronously.
5. Add gendisk->part0 and part->bd_start_sect changes to blk_check_copy().
6. Move single source range limit check to blk_check_copy()
7. Rename __blkdev_issue_copy() to blkdev_issue_copy and remove old helper.
8. Make the blkdev_issue_copy() interface generic so it accepts a
destination bdev, to support XCOPY as well.
9. Add invalidate_kernel_vmap_range() after reading data for vmalloc'ed memory.
10. Fix buf allocation logic to allocate a buffer for the total size of the copy.
11. Reword patch commit description.
Changes from v2
1. Add emulation support for devices not supporting copy.
2. Add *copy_offload* sysfs entry to enable and disable copy_offload
in devices supporting simple copy.
3. Remove simple copy support for stacked devices.
Changes from v1:
1. Fix memory leak in __blkdev_issue_copy
2. Unmark blk_check_copy inline
3. Fix line break in blk_check_copy_eod
4. Remove p checks and made code more readable
5. Don't use bio_set_op_attrs; remove op and set
bi_opf directly
6. Use struct_size to calculate total_size
7. Fix partition remap of copy destination
8. Remove mcl,mssrl,msrc from nvme_ns
9. Initialize copy queue limits to 0 in nvme_config_copy
10. Remove return in QUEUE_FLAG_COPY check
11. Remove unused OCFS
SelvaKumar S (3):
block: export bio_map_kern()
block: add simple copy support
nvme: add simple copy support
block/blk-core.c | 94 ++++++++++++++--
block/blk-lib.c | 223 ++++++++++++++++++++++++++++++++++++++
block/blk-map.c | 3 +-
block/blk-merge.c | 2 +
block/blk-settings.c | 10 ++
block/blk-sysfs.c | 50 +++++++++
block/blk-zoned.c | 1 +
block/bounce.c | 1 +
block/ioctl.c | 43 ++++++++
drivers/nvme/host/core.c | 87 +++++++++++++++
include/linux/bio.h | 1 +
include/linux/blk_types.h | 15 +++
include/linux/blkdev.h | 15 +++
include/linux/nvme.h | 43 +++++++-
include/uapi/linux/fs.h | 13 +++
15 files changed, 589 insertions(+), 12 deletions(-)
--
2.25.1
Export bio_map_kern() so that copy offload emulation can use
it to add vmalloc'ed memory to a bio.
Signed-off-by: SelvaKumar S <[email protected]>
---
block/blk-map.c | 3 ++-
include/linux/blkdev.h | 2 ++
2 files changed, 4 insertions(+), 1 deletion(-)
diff --git a/block/blk-map.c b/block/blk-map.c
index 21630dccac62..50d61475bb68 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -378,7 +378,7 @@ static void bio_map_kern_endio(struct bio *bio)
* Map the kernel address into a bio suitable for io to a block
* device. Returns an error pointer in case of error.
*/
-static struct bio *bio_map_kern(struct request_queue *q, void *data,
+struct bio *bio_map_kern(struct request_queue *q, void *data,
unsigned int len, gfp_t gfp_mask)
{
unsigned long kaddr = (unsigned long)data;
@@ -428,6 +428,7 @@ static struct bio *bio_map_kern(struct request_queue *q, void *data,
bio->bi_end_io = bio_map_kern_endio;
return bio;
}
+EXPORT_SYMBOL(bio_map_kern);
static void bio_copy_kern_endio(struct bio *bio)
{
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 070de09425ad..81f9e7bec16c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -936,6 +936,8 @@ extern int blk_rq_map_user(struct request_queue *, struct request *,
struct rq_map_data *, void __user *, unsigned long,
gfp_t);
extern int blk_rq_unmap_user(struct bio *);
+extern struct bio *bio_map_kern(struct request_queue *q, void *data,
+ unsigned int len, gfp_t gfp_mask);
extern int blk_rq_map_kern(struct request_queue *, struct request *, void *, unsigned int, gfp_t);
extern int blk_rq_map_user_iov(struct request_queue *, struct request *,
struct rq_map_data *, const struct iov_iter *,
--
2.25.1
Add a new BLKCOPY ioctl that offloads copying of one or more source
ranges to a destination in the device. It accepts a copy_range structure
that contains the destination, the number of sources and a pointer to the
array of source ranges. Each range_entry contains the start and length of
a source range (in bytes).
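From userspace, building the BLKCOPY payload could look like the sketch below. The two structs mirror the ones this patch adds to include/uapi/linux/fs.h; the helper name build_copy is hypothetical, and the ioctl call itself is left as a comment since it needs a copy-capable device.

```c
#include <assert.h>
#include <stdint.h>

/* mirrored from this patch's include/uapi/linux/fs.h additions */
struct range_entry {
	uint64_t src;		/* source offset, bytes */
	uint64_t len;		/* length, bytes */
};

struct copy_range {
	uint64_t dest;		/* destination offset, bytes */
	uint64_t nr_range;	/* number of entries in range_list */
	uint64_t range_list;	/* pointer to struct range_entry[] */
	uint64_t rsvd;
};

/*
 * Fill a copy_range after checking 512-byte sector alignment, the same
 * alignment the ioctl path enforces. Returns 0 on success.
 */
int build_copy(struct copy_range *cr, struct range_entry *ranges,
	       uint64_t nr, uint64_t dest)
{
	uint64_t i;

	if (dest & 511)		/* unaligned destination is rejected */
		return -1;
	for (i = 0; i < nr; i++)
		if ((ranges[i].src | ranges[i].len) & 511)
			return -1;

	cr->dest = dest;
	cr->nr_range = nr;
	cr->range_list = (uint64_t)(uintptr_t)ranges;
	cr->rsvd = 0;
	/* then: ioctl(fd, BLKCOPY, cr); -- needs a copy-capable device */
	return 0;
}
```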
Introduce REQ_OP_COPY, a no-merge copy offload operation. Create a bio
with the control information as payload and submit it to the device.
REQ_OP_COPY (19) is a write op and takes zone_write_lock when submitted
to a zoned device.
If the device doesn't support copy or copy offload is disabled, the copy
is emulated by allocating memory of the total copy size. The source
ranges are read into memory by chaining a bio for each source range and
submitting them asynchronously; the last bio waits for completion. After
the data is read, it is written to the destination.
bio_map_kern() is used to allocate the bio and add pages of the copy
buffer to it. As bio->bi_private and bio->bi_end_io are needed for
chaining the bios and get overwritten, invalidate_kernel_vmap_range()
for the read is called in the caller.
Introduce queue limits for simple copy and other helper functions.
Add device limits as sysfs entries.
- copy_offload
- max_copy_sectors
- max_copy_range_sectors
- max_copy_nr_ranges
copy_offload(= 0) is disabled by default.
max_copy_sectors = 0 indicates the device doesn't support native copy.
Native copy offload is not supported for stacked devices; for them, copy
is done via emulation.
Signed-off-by: SelvaKumar S <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
Signed-off-by: Javier González <[email protected]>
---
block/blk-core.c | 94 ++++++++++++++--
block/blk-lib.c | 223 ++++++++++++++++++++++++++++++++++++++
block/blk-merge.c | 2 +
block/blk-settings.c | 10 ++
block/blk-sysfs.c | 50 +++++++++
block/blk-zoned.c | 1 +
block/bounce.c | 1 +
block/ioctl.c | 43 ++++++++
include/linux/bio.h | 1 +
include/linux/blk_types.h | 15 +++
include/linux/blkdev.h | 13 +++
include/uapi/linux/fs.h | 13 +++
12 files changed, 458 insertions(+), 8 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 96e5fcd7f071..4a5cd3f53cd2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -719,6 +719,17 @@ static noinline int should_fail_bio(struct bio *bio)
}
ALLOW_ERROR_INJECTION(should_fail_bio, ERRNO);
+static inline int bio_check_copy_eod(struct bio *bio, sector_t start,
+ sector_t nr_sectors, sector_t maxsector)
+{
+ if (nr_sectors && maxsector &&
+ (nr_sectors > maxsector || start > maxsector - nr_sectors)) {
+ handle_bad_sector(bio, maxsector);
+ return -EIO;
+ }
+ return 0;
+}
+
/*
* Check whether this bio extends beyond the end of the device or partition.
* This may well happen - the kernel calls bread() without checking the size of
@@ -737,6 +748,65 @@ static inline int bio_check_eod(struct bio *bio, sector_t maxsector)
return 0;
}
+/*
+ * Check for copy limits and remap source ranges if needed.
+ */
+static int blk_check_copy(struct bio *bio)
+{
+ struct block_device *p = NULL;
+ struct request_queue *q = bio->bi_disk->queue;
+ struct blk_copy_payload *payload;
+ sector_t maxsector, start_sect = 0;
+ int i, ret = -EIO;
+ unsigned short nr_range;
+
+ rcu_read_lock();
+
+ p = __disk_get_part(bio->bi_disk, bio->bi_partno);
+ if (unlikely(!p))
+ goto out;
+ if (unlikely(should_fail_request(p, bio->bi_iter.bi_size)))
+ goto out;
+ if (unlikely(bio_check_ro(bio, p)))
+ goto out;
+
+ maxsector = bdev_nr_sectors(p);
+ start_sect = p->bd_start_sect;
+
+ payload = bio_data(bio);
+ nr_range = payload->copy_range;
+
+ /* cannot handle copy crossing nr_ranges limit */
+ if (payload->copy_range > q->limits.max_copy_nr_ranges)
+ goto out;
+
+ /* cannot handle copy more than copy limits */
+ if (payload->copy_size > q->limits.max_copy_sectors)
+ goto out;
+
+ /* check if copy length crosses eod */
+ if (unlikely(bio_check_copy_eod(bio, bio->bi_iter.bi_sector,
+ payload->copy_size, maxsector)))
+ goto out;
+ bio->bi_iter.bi_sector += start_sect;
+
+ for (i = 0; i < nr_range; i++) {
+ if (unlikely(bio_check_copy_eod(bio, payload->range[i].src,
+ payload->range[i].len, maxsector)))
+ goto out;
+
+ /* single source range length limit */
+ if (payload->range[i].len > q->limits.max_copy_range_sectors)
+ goto out;
+ payload->range[i].src += start_sect;
+ }
+
+ bio->bi_partno = 0;
+ ret = 0;
+out:
+ rcu_read_unlock();
+ return ret;
+}
+
/*
* Remap block n of partition p to block n+start(p) of the disk.
*/
@@ -826,14 +896,16 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
if (should_fail_bio(bio))
goto end_io;
- if (bio->bi_partno) {
- if (unlikely(blk_partition_remap(bio)))
- goto end_io;
- } else {
- if (unlikely(bio_check_ro(bio, bio->bi_disk->part0)))
- goto end_io;
- if (unlikely(bio_check_eod(bio, get_capacity(bio->bi_disk))))
- goto end_io;
+ if (likely(!op_is_copy(bio->bi_opf))) {
+ if (bio->bi_partno) {
+ if (unlikely(blk_partition_remap(bio)))
+ goto end_io;
+ } else {
+ if (unlikely(bio_check_ro(bio, bio->bi_disk->part0)))
+ goto end_io;
+ if (unlikely(bio_check_eod(bio, get_capacity(bio->bi_disk))))
+ goto end_io;
+ }
}
/*
@@ -857,6 +929,12 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
if (!blk_queue_discard(q))
goto not_supported;
break;
+ case REQ_OP_COPY:
+ if (!blk_queue_copy(q))
+ goto not_supported;
+ if (unlikely(blk_check_copy(bio)))
+ goto end_io;
+ break;
case REQ_OP_SECURE_ERASE:
if (!blk_queue_secure_erase(q))
goto not_supported;
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 752f9c722062..4c0f12e2ed7c 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -150,6 +150,229 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
}
EXPORT_SYMBOL(blkdev_issue_discard);
+int blk_copy_offload(struct block_device *bdev, struct blk_copy_payload *payload,
+ sector_t dest, gfp_t gfp_mask)
+{
+ struct bio *bio;
+ int ret, total_size;
+
+ bio = bio_alloc(gfp_mask, 1);
+ bio->bi_iter.bi_sector = dest;
+ bio->bi_opf = REQ_OP_COPY | REQ_NOMERGE;
+ bio_set_dev(bio, bdev);
+
+ total_size = struct_size(payload, range, payload->copy_range);
+ ret = bio_add_page(bio, virt_to_page(payload), total_size,
+ offset_in_page(payload));
+ if (ret != total_size) {
+ ret = -ENOMEM;
+ goto err;
+ }
+
+ ret = submit_bio_wait(bio);
+err:
+ bio_put(bio);
+ return ret;
+
+}
+
+int blk_read_to_buf(struct block_device *bdev, struct blk_copy_payload *payload,
+ gfp_t gfp_mask, void **buf_p)
+{
+ struct request_queue *q = bdev_get_queue(bdev);
+ struct bio *bio, *parent = NULL;
+ void *buf = NULL;
+ bool is_vmalloc;
+ int i, nr_srcs, copy_len, ret, cur_size, t_len = 0;
+
+ nr_srcs = payload->copy_range;
+ copy_len = payload->copy_size << SECTOR_SHIFT;
+
+ buf = kvmalloc(copy_len, gfp_mask);
+ if (!buf)
+ return -ENOMEM;
+ is_vmalloc = is_vmalloc_addr(buf);
+
+ for (i = 0; i < nr_srcs; i++) {
+ cur_size = payload->range[i].len << SECTOR_SHIFT;
+
+ bio = bio_map_kern(q, buf + t_len, cur_size, gfp_mask);
+ if (IS_ERR(bio)) {
+ ret = PTR_ERR(bio);
+ goto out;
+ }
+
+ bio->bi_iter.bi_sector = payload->range[i].src;
+ bio->bi_opf = REQ_OP_READ;
+ bio_set_dev(bio, bdev);
+ bio->bi_end_io = NULL;
+ bio->bi_private = NULL;
+
+ if (parent) {
+ bio_chain(parent, bio);
+ submit_bio(parent);
+ }
+
+ parent = bio;
+ t_len += cur_size;
+ }
+
+ ret = submit_bio_wait(bio);
+ bio_put(bio);
+ if (is_vmalloc)
+ invalidate_kernel_vmap_range(buf, copy_len);
+ if (ret)
+ goto out;
+
+ *buf_p = buf;
+ return 0;
+out:
+ kvfree(buf);
+ return ret;
+}
+
+int blk_write_from_buf(struct block_device *bdev, void *buf, sector_t dest,
+ int copy_len, gfp_t gfp_mask)
+{
+ struct request_queue *q = bdev_get_queue(bdev);
+ struct bio *bio;
+ int ret;
+
+ bio = bio_map_kern(q, buf, copy_len, gfp_mask);
+ if (IS_ERR(bio)) {
+ ret = PTR_ERR(bio);
+ goto out;
+ }
+ bio_set_dev(bio, bdev);
+ bio->bi_opf = REQ_OP_WRITE;
+ bio->bi_iter.bi_sector = dest;
+
+ bio->bi_end_io = NULL;
+ ret = submit_bio_wait(bio);
+ bio_put(bio);
+out:
+ return ret;
+}
+
+int blk_prepare_payload(struct block_device *bdev, int nr_srcs, struct range_entry *rlist,
+ gfp_t gfp_mask, struct blk_copy_payload **payload_p)
+{
+
+ struct request_queue *q = bdev_get_queue(bdev);
+ struct blk_copy_payload *payload;
+ sector_t bs_mask;
+ sector_t src_sects, len = 0, total_len = 0;
+ int i, ret, total_size;
+
+ if (!q)
+ return -ENXIO;
+
+ if (!nr_srcs)
+ return -EINVAL;
+
+ if (bdev_read_only(bdev))
+ return -EPERM;
+
+ bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
+
+ total_size = struct_size(payload, range, nr_srcs);
+ payload = kmalloc(total_size, gfp_mask);
+ if (!payload)
+ return -ENOMEM;
+
+ for (i = 0; i < nr_srcs; i++) {
+ /* copy payload values are provided in bytes */
+ src_sects = rlist[i].src >> SECTOR_SHIFT;
+ if (src_sects & bs_mask) {
+ ret = -EINVAL;
+ goto err;
+ }
+
+ len = rlist[i].len >> SECTOR_SHIFT;
+ if (len & bs_mask) {
+ ret = -EINVAL;
+ goto err;
+ }
+
+ total_len += len;
+
+ WARN_ON_ONCE((src_sects << 9) > UINT_MAX);
+
+ payload->range[i].src = src_sects;
+ payload->range[i].len = len;
+ }
+
+ /* storing # of source ranges */
+ payload->copy_range = i;
+ /* storing copy len so far */
+ payload->copy_size = total_len;
+
+ *payload_p = payload;
+ return 0;
+err:
+ kfree(payload);
+ return ret;
+}
+
+/**
+ * blkdev_issue_copy - queue a copy
+ * @bdev: source block device
+ * @nr_srcs: number of source ranges to copy
+ * @rlist: array of source ranges (in bytes)
+ * @dest_bdev: destination block device
+ * @dest: destination (in bytes)
+ * @gfp_mask: memory allocation flags (for bio_alloc)
+ *
+ * Description:
+ * Copy an array of source ranges from the source block device to the
+ * destination block device.
+ */
+int blkdev_issue_copy(struct block_device *bdev, int nr_srcs,
+ struct range_entry *src_rlist, struct block_device *dest_bdev,
+ sector_t dest, gfp_t gfp_mask)
+{
+ struct request_queue *q = bdev_get_queue(bdev);
+ struct blk_copy_payload *payload;
+ sector_t bs_mask, dest_sect;
+ void *buf = NULL;
+ int ret;
+
+ ret = blk_prepare_payload(bdev, nr_srcs, src_rlist, gfp_mask, &payload);
+ if (ret)
+ return ret;
+
+ bs_mask = (bdev_logical_block_size(dest_bdev) >> 9) - 1;
+ if (dest & bs_mask) {
+ ret = -EINVAL;
+ goto out;
+ }
+ dest_sect = dest >> SECTOR_SHIFT;
+
+ if (bdev == dest_bdev && q->limits.copy_offload) {
+ ret = blk_copy_offload(bdev, payload, dest_sect, gfp_mask);
+ if (ret)
+ goto out;
+ } else {
+ ret = blk_read_to_buf(bdev, payload, gfp_mask, &buf);
+ if (ret)
+ goto out;
+ ret = blk_write_from_buf(dest_bdev, buf, dest_sect,
+ payload->copy_size << SECTOR_SHIFT, gfp_mask);
+ }
+
+ if (buf)
+ kvfree(buf);
+out:
+ kvfree(payload);
+ return ret;
+}
+EXPORT_SYMBOL(blkdev_issue_copy);
+
/**
* __blkdev_issue_write_same - generate number of bios with same page
* @bdev: target blockdev
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 808768f6b174..4e04f24e13c1 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -309,6 +309,8 @@ void __blk_queue_split(struct bio **bio, unsigned int *nr_segs)
struct bio *split = NULL;
switch (bio_op(*bio)) {
+ case REQ_OP_COPY:
+ break;
case REQ_OP_DISCARD:
case REQ_OP_SECURE_ERASE:
split = blk_bio_discard_split(q, *bio, &q->bio_split, nr_segs);
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 43990b1d148b..93c15ba45a69 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -60,6 +60,10 @@ void blk_set_default_limits(struct queue_limits *lim)
lim->io_opt = 0;
lim->misaligned = 0;
lim->zoned = BLK_ZONED_NONE;
+ lim->copy_offload = 0;
+ lim->max_copy_sectors = 0;
+ lim->max_copy_nr_ranges = 0;
+ lim->max_copy_range_sectors = 0;
}
EXPORT_SYMBOL(blk_set_default_limits);
@@ -565,6 +569,12 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
if (b->chunk_sectors)
t->chunk_sectors = gcd(t->chunk_sectors, b->chunk_sectors);
+ /* simple copy not supported in stacked devices */
+ t->copy_offload = 0;
+ t->max_copy_sectors = 0;
+ t->max_copy_range_sectors = 0;
+ t->max_copy_nr_ranges = 0;
+
/* Physical block size a multiple of the logical block size? */
if (t->physical_block_size & (t->logical_block_size - 1)) {
t->physical_block_size = t->logical_block_size;
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index b513f1683af0..51b35a8311d9 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -166,6 +166,47 @@ static ssize_t queue_discard_granularity_show(struct request_queue *q, char *pag
return queue_var_show(q->limits.discard_granularity, page);
}
+static ssize_t queue_copy_offload_show(struct request_queue *q, char *page)
+{
+ return queue_var_show(q->limits.copy_offload, page);
+}
+
+static ssize_t queue_copy_offload_store(struct request_queue *q,
+ const char *page, size_t count)
+{
+ unsigned long copy_offload;
+ ssize_t ret = queue_var_store(&copy_offload, page, count);
+
+ if (ret < 0)
+ return ret;
+
+ if (copy_offload > 1)
+ return -EINVAL;
+
+ if (q->limits.max_copy_sectors == 0 && copy_offload == 1)
+ return -EINVAL;
+
+ q->limits.copy_offload = copy_offload;
+ return ret;
+}
+
+static ssize_t queue_max_copy_sectors_show(struct request_queue *q, char *page)
+{
+ return queue_var_show(q->limits.max_copy_sectors, page);
+}
+
+static ssize_t queue_max_copy_range_sectors_show(struct request_queue *q,
+ char *page)
+{
+ return queue_var_show(q->limits.max_copy_range_sectors, page);
+}
+
+static ssize_t queue_max_copy_nr_ranges_show(struct request_queue *q,
+ char *page)
+{
+ return queue_var_show(q->limits.max_copy_nr_ranges, page);
+}
+
static ssize_t queue_discard_max_hw_show(struct request_queue *q, char *page)
{
@@ -591,6 +632,11 @@ QUEUE_RO_ENTRY(queue_nr_zones, "nr_zones");
QUEUE_RO_ENTRY(queue_max_open_zones, "max_open_zones");
QUEUE_RO_ENTRY(queue_max_active_zones, "max_active_zones");
+QUEUE_RW_ENTRY(queue_copy_offload, "copy_offload");
+QUEUE_RO_ENTRY(queue_max_copy_sectors, "max_copy_sectors");
+QUEUE_RO_ENTRY(queue_max_copy_range_sectors, "max_copy_range_sectors");
+QUEUE_RO_ENTRY(queue_max_copy_nr_ranges, "max_copy_nr_ranges");
+
QUEUE_RW_ENTRY(queue_nomerges, "nomerges");
QUEUE_RW_ENTRY(queue_rq_affinity, "rq_affinity");
QUEUE_RW_ENTRY(queue_poll, "io_poll");
@@ -636,6 +682,10 @@ static struct attribute *queue_attrs[] = {
&queue_discard_max_entry.attr,
&queue_discard_max_hw_entry.attr,
&queue_discard_zeroes_data_entry.attr,
+ &queue_copy_offload_entry.attr,
+ &queue_max_copy_sectors_entry.attr,
+ &queue_max_copy_range_sectors_entry.attr,
+ &queue_max_copy_nr_ranges_entry.attr,
&queue_write_same_max_entry.attr,
&queue_write_zeroes_max_entry.attr,
&queue_zone_append_max_entry.attr,
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 7a68b6e4300c..02069178d51e 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -75,6 +75,7 @@ bool blk_req_needs_zone_write_lock(struct request *rq)
case REQ_OP_WRITE_ZEROES:
case REQ_OP_WRITE_SAME:
case REQ_OP_WRITE:
+ case REQ_OP_COPY:
return blk_rq_zone_is_seq(rq);
default:
return false;
diff --git a/block/bounce.c b/block/bounce.c
index d3f51acd6e3b..5e052afe8691 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -254,6 +254,7 @@ static struct bio *bounce_clone_bio(struct bio *bio_src, gfp_t gfp_mask,
bio->bi_iter.bi_size = bio_src->bi_iter.bi_size;
switch (bio_op(bio)) {
+ case REQ_OP_COPY:
case REQ_OP_DISCARD:
case REQ_OP_SECURE_ERASE:
case REQ_OP_WRITE_ZEROES:
diff --git a/block/ioctl.c b/block/ioctl.c
index d61d652078f4..d50b6abe2883 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -133,6 +133,47 @@ static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode,
GFP_KERNEL, flags);
}
+static int blk_ioctl_copy(struct block_device *bdev, fmode_t mode,
+ unsigned long arg)
+{
+ struct copy_range crange;
+ struct range_entry *rlist;
+ struct request_queue *q = bdev_get_queue(bdev);
+ sector_t dest;
+ int ret;
+
+ if (!(mode & FMODE_WRITE))
+ return -EBADF;
+
+ if (!blk_queue_copy(q))
+ return -EOPNOTSUPP;
+
+ if (copy_from_user(&crange, (void __user *)arg, sizeof(crange)))
+ return -EFAULT;
+
+ if (crange.dest & ((1 << SECTOR_SHIFT) - 1))
+ return -EINVAL;
+ dest = crange.dest;
+
+ rlist = kmalloc_array(crange.nr_range, sizeof(*rlist),
+ GFP_KERNEL);
+
+ if (!rlist)
+ return -ENOMEM;
+
+ if (copy_from_user(rlist, (void __user *)crange.range_list,
+ sizeof(*rlist) * crange.nr_range)) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+ ret = blkdev_issue_copy(bdev, crange.nr_range, rlist, bdev, dest,
+ GFP_KERNEL);
+out:
+ kfree(rlist);
+ return ret;
+}
+
static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
unsigned long arg)
{
@@ -458,6 +499,8 @@ static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
case BLKSECDISCARD:
return blk_ioctl_discard(bdev, mode, arg,
BLKDEV_DISCARD_SECURE);
+ case BLKCOPY:
+ return blk_ioctl_copy(bdev, mode, arg);
case BLKZEROOUT:
return blk_ioctl_zeroout(bdev, mode, arg);
case BLKREPORTZONE:
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 1edda614f7ce..164313bdfb35 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -71,6 +71,7 @@ static inline bool bio_has_data(struct bio *bio)
static inline bool bio_no_advance_iter(const struct bio *bio)
{
return bio_op(bio) == REQ_OP_DISCARD ||
+ bio_op(bio) == REQ_OP_COPY ||
bio_op(bio) == REQ_OP_SECURE_ERASE ||
bio_op(bio) == REQ_OP_WRITE_SAME ||
bio_op(bio) == REQ_OP_WRITE_ZEROES;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 866f74261b3b..d4d11e9ff814 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -380,6 +380,8 @@ enum req_opf {
REQ_OP_ZONE_RESET = 15,
/* reset all the zone present on the device */
REQ_OP_ZONE_RESET_ALL = 17,
+ /* copy ranges within device */
+ REQ_OP_COPY = 19,
/* SCSI passthrough using struct scsi_request */
REQ_OP_SCSI_IN = 32,
@@ -506,6 +508,11 @@ static inline bool op_is_discard(unsigned int op)
return (op & REQ_OP_MASK) == REQ_OP_DISCARD;
}
+static inline bool op_is_copy(unsigned int op)
+{
+ return (op & REQ_OP_MASK) == REQ_OP_COPY;
+}
+
/*
* Check if a bio or request operation is a zone management operation, with
* the exception of REQ_OP_ZONE_RESET_ALL which is treated as a special case
@@ -565,4 +572,12 @@ struct blk_rq_stat {
u64 batch;
};
+struct blk_copy_payload {
+ sector_t dest;
+ int copy_range;
+ int copy_size;
+ int err;
+ struct range_entry range[];
+};
+
#endif /* __LINUX_BLK_TYPES_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 81f9e7bec16c..4c7e861e57e4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -340,10 +340,14 @@ struct queue_limits {
unsigned int max_zone_append_sectors;
unsigned int discard_granularity;
unsigned int discard_alignment;
+ unsigned int copy_offload;
+ unsigned int max_copy_sectors;
unsigned short max_segments;
unsigned short max_integrity_segments;
unsigned short max_discard_segments;
+ unsigned short max_copy_range_sectors;
+ unsigned short max_copy_nr_ranges;
unsigned char misaligned;
unsigned char discard_misaligned;
@@ -625,6 +629,7 @@ struct request_queue {
#define QUEUE_FLAG_RQ_ALLOC_TIME 27 /* record rq->alloc_time_ns */
#define QUEUE_FLAG_HCTX_ACTIVE 28 /* at least one blk-mq hctx is active */
#define QUEUE_FLAG_NOWAIT 29 /* device supports NOWAIT */
+#define QUEUE_FLAG_COPY 30 /* supports copy */
#define QUEUE_FLAG_MQ_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \
(1 << QUEUE_FLAG_SAME_COMP) | \
@@ -647,6 +652,7 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
#define blk_queue_io_stat(q) test_bit(QUEUE_FLAG_IO_STAT, &(q)->queue_flags)
#define blk_queue_add_random(q) test_bit(QUEUE_FLAG_ADD_RANDOM, &(q)->queue_flags)
#define blk_queue_discard(q) test_bit(QUEUE_FLAG_DISCARD, &(q)->queue_flags)
+#define blk_queue_copy(q) test_bit(QUEUE_FLAG_COPY, &(q)->queue_flags)
#define blk_queue_zone_resetall(q) \
test_bit(QUEUE_FLAG_ZONE_RESETALL, &(q)->queue_flags)
#define blk_queue_secure_erase(q) \
@@ -1061,6 +1067,9 @@ static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
return min(q->limits.max_discard_sectors,
UINT_MAX >> SECTOR_SHIFT);
+ if (unlikely(op == REQ_OP_COPY))
+ return q->limits.max_copy_sectors;
+
if (unlikely(op == REQ_OP_WRITE_SAME))
return q->limits.max_write_same_sectors;
@@ -1335,6 +1344,10 @@ extern int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask, int flags,
struct bio **biop);
+extern int blkdev_issue_copy(struct block_device *bdev, int nr_srcs,
+ struct range_entry *src_rlist, struct block_device *dest_bdev,
+ sector_t dest, gfp_t gfp_mask);
+
#define BLKDEV_ZERO_NOUNMAP (1 << 0) /* do not free blocks */
#define BLKDEV_ZERO_NOFALLBACK (1 << 1) /* don't write explicit zeroes */
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index f44eb0a04afd..5cadb176317a 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -64,6 +64,18 @@ struct fstrim_range {
__u64 minlen;
};
+struct range_entry {
+ __u64 src;
+ __u64 len;
+};
+
+struct copy_range {
+ __u64 dest;
+ __u64 nr_range;
+ __u64 range_list;
+ __u64 rsvd;
+};
+
/* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
#define FILE_DEDUPE_RANGE_SAME 0
#define FILE_DEDUPE_RANGE_DIFFERS 1
@@ -184,6 +196,7 @@ struct fsxattr {
#define BLKSECDISCARD _IO(0x12,125)
#define BLKROTATIONAL _IO(0x12,126)
#define BLKZEROOUT _IO(0x12,127)
+#define BLKCOPY _IOWR(0x12, 128, struct copy_range)
/*
* A jump here: 130-131 are reserved for zoned block devices
* (see uapi/linux/blkzoned.h)
--
2.25.1
Add support for TP 4065a ("Simple Copy Command"), v2020.05.04
("Ratified")
For devices supporting native simple copy, this implementation accepts
the payload passed from the block layer, converts the payload into a
simple copy command and submits it to the device.
The device copy limits are set in the queue limits. By default
copy_offload is disabled.
End-to-end protection is done by setting both PRINFOR and PRINFOW
to 0.
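The conversion from block-layer values to the command's fields follows from the description above: block-layer sectors are 512 bytes, the namespace works in logical blocks of (1 << lba_shift) bytes, and NLB is 0's based. A minimal sketch, with a hypothetical helper name standing in for the arithmetic done in nvme_setup_copy():

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a (start, length) pair given in 512-byte sectors into the
 * SLBA/NLB fields of one copy source-range descriptor.
 */
void sectors_to_copy_range(uint64_t src_sect, uint64_t len_sect,
			   unsigned lba_shift, uint64_t *slba, uint16_t *nlb)
{
	*slba = src_sect >> (lba_shift - 9);
	/* NLB is 0's based: a one-block range is encoded as 0 */
	*nlb = (uint16_t)((len_sect >> (lba_shift - 9)) - 1);
}
```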
Signed-off-by: SelvaKumar S <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
Signed-off-by: Javier González <[email protected]>
---
drivers/nvme/host/core.c | 87 ++++++++++++++++++++++++++++++++++++++++
include/linux/nvme.h | 43 ++++++++++++++++++--
2 files changed, 127 insertions(+), 3 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index ce1b61519441..ea75af3e865a 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -708,6 +708,63 @@ static inline void nvme_setup_flush(struct nvme_ns *ns,
cmnd->common.nsid = cpu_to_le32(ns->head->ns_id);
}
+static inline blk_status_t nvme_setup_copy(struct nvme_ns *ns,
+ struct request *req, struct nvme_command *cmnd)
+{
+ struct nvme_ctrl *ctrl = ns->ctrl;
+ struct nvme_copy_range *range = NULL;
+ struct blk_copy_payload *payload;
+ unsigned short nr_range = 0;
+ u16 control = 0, ssrl;
+ u32 dsmgmt = 0;
+ u64 slba;
+ int i;
+
+ payload = bio_data(req->bio);
+ nr_range = payload->copy_range;
+
+ if (req->cmd_flags & REQ_FUA)
+ control |= NVME_RW_FUA;
+
+ if (req->cmd_flags & REQ_FAILFAST_DEV)
+ control |= NVME_RW_LR;
+
+ cmnd->copy.opcode = nvme_cmd_copy;
+ cmnd->copy.nsid = cpu_to_le32(ns->head->ns_id);
+ cmnd->copy.sdlba = cpu_to_le64(blk_rq_pos(req) >> (ns->lba_shift - 9));
+
+ range = kmalloc_array(nr_range, sizeof(*range),
+ GFP_ATOMIC | __GFP_NOWARN);
+ if (!range)
+ return BLK_STS_RESOURCE;
+
+ for (i = 0; i < nr_range; i++) {
+ slba = payload->range[i].src;
+ slba = slba >> (ns->lba_shift - 9);
+
+ ssrl = payload->range[i].len;
+ ssrl = ssrl >> (ns->lba_shift - 9);
+
+ range[i].slba = cpu_to_le64(slba);
+ range[i].nlb = cpu_to_le16(ssrl - 1);
+ }
+
+ cmnd->copy.nr_range = nr_range - 1;
+
+ req->special_vec.bv_page = virt_to_page(range);
+ req->special_vec.bv_offset = offset_in_page(range);
+ req->special_vec.bv_len = sizeof(*range) * nr_range;
+ req->rq_flags |= RQF_SPECIAL_PAYLOAD;
+
+ if (ctrl->nr_streams)
+ nvme_assign_write_stream(ctrl, req, &control, &dsmgmt);
+
+ cmnd->rw.control = cpu_to_le16(control);
+ cmnd->rw.dsmgmt = cpu_to_le32(dsmgmt);
+
+ return BLK_STS_OK;
+}
+
static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmnd)
{
@@ -890,6 +947,9 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
case REQ_OP_DISCARD:
ret = nvme_setup_discard(ns, req, cmd);
break;
+ case REQ_OP_COPY:
+ ret = nvme_setup_copy(ns, req, cmd);
+ break;
case REQ_OP_READ:
ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_read);
break;
@@ -1917,6 +1977,31 @@ static void nvme_config_discard(struct gendisk *disk, struct nvme_ns *ns)
blk_queue_max_write_zeroes_sectors(queue, UINT_MAX);
}
+static void nvme_config_copy(struct gendisk *disk, struct nvme_ns *ns,
+ struct nvme_id_ns *id)
+{
+ struct nvme_ctrl *ctrl = ns->ctrl;
+ struct request_queue *queue = disk->queue;
+
+ if (!(ctrl->oncs & NVME_CTRL_ONCS_COPY)) {
+ queue->limits.copy_offload = 0;
+ queue->limits.max_copy_sectors = 0;
+ queue->limits.max_copy_range_sectors = 0;
+ queue->limits.max_copy_nr_ranges = 0;
+ blk_queue_flag_clear(QUEUE_FLAG_COPY, queue);
+ return;
+ }
+
+ /* setting copy limits */
+ blk_queue_flag_test_and_set(QUEUE_FLAG_COPY, queue);
+ queue->limits.copy_offload = 0;
+ queue->limits.max_copy_sectors = le64_to_cpu(id->mcl) *
+ (1 << (ns->lba_shift - 9));
+ queue->limits.max_copy_range_sectors = le32_to_cpu(id->mssrl) *
+ (1 << (ns->lba_shift - 9));
+ queue->limits.max_copy_nr_ranges = id->msrc + 1;
+}
+
static void nvme_config_write_zeroes(struct gendisk *disk, struct nvme_ns *ns)
{
u64 max_blocks;
@@ -2112,6 +2197,7 @@ static void nvme_update_disk_info(struct gendisk *disk,
set_capacity_and_notify(disk, capacity);
nvme_config_discard(disk, ns);
+ nvme_config_copy(disk, ns, id);
nvme_config_write_zeroes(disk, ns);
if ((id->nsattr & NVME_NS_ATTR_RO) ||
@@ -4689,6 +4775,7 @@ static inline void _nvme_check_size(void)
BUILD_BUG_ON(sizeof(struct nvme_download_firmware) != 64);
BUILD_BUG_ON(sizeof(struct nvme_format_cmd) != 64);
BUILD_BUG_ON(sizeof(struct nvme_dsm_cmd) != 64);
+ BUILD_BUG_ON(sizeof(struct nvme_copy_command) != 64);
BUILD_BUG_ON(sizeof(struct nvme_write_zeroes_cmd) != 64);
BUILD_BUG_ON(sizeof(struct nvme_abort_cmd) != 64);
BUILD_BUG_ON(sizeof(struct nvme_get_log_page_command) != 64);
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index d92535997687..11ed72a2164d 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -289,7 +289,7 @@ struct nvme_id_ctrl {
__u8 nvscc;
__u8 nwpc;
__le16 acwu;
- __u8 rsvd534[2];
+ __le16 ocfs;
__le32 sgls;
__le32 mnan;
__u8 rsvd544[224];
@@ -314,6 +314,7 @@ enum {
NVME_CTRL_ONCS_WRITE_ZEROES = 1 << 3,
NVME_CTRL_ONCS_RESERVATIONS = 1 << 5,
NVME_CTRL_ONCS_TIMESTAMP = 1 << 6,
+ NVME_CTRL_ONCS_COPY = 1 << 8,
NVME_CTRL_VWC_PRESENT = 1 << 0,
NVME_CTRL_OACS_SEC_SUPP = 1 << 0,
NVME_CTRL_OACS_DIRECTIVES = 1 << 5,
@@ -362,7 +363,10 @@ struct nvme_id_ns {
__le16 npdg;
__le16 npda;
__le16 nows;
- __u8 rsvd74[18];
+ __le16 mssrl;
+ __le32 mcl;
+ __u8 msrc;
+ __u8 rsvd91[11];
__le32 anagrpid;
__u8 rsvd96[3];
__u8 nsattr;
@@ -673,6 +677,7 @@ enum nvme_opcode {
nvme_cmd_resv_report = 0x0e,
nvme_cmd_resv_acquire = 0x11,
nvme_cmd_resv_release = 0x15,
+ nvme_cmd_copy = 0x19,
nvme_cmd_zone_mgmt_send = 0x79,
nvme_cmd_zone_mgmt_recv = 0x7a,
nvme_cmd_zone_append = 0x7d,
@@ -691,7 +696,8 @@ enum nvme_opcode {
nvme_opcode_name(nvme_cmd_resv_register), \
nvme_opcode_name(nvme_cmd_resv_report), \
nvme_opcode_name(nvme_cmd_resv_acquire), \
- nvme_opcode_name(nvme_cmd_resv_release))
+ nvme_opcode_name(nvme_cmd_resv_release), \
+ nvme_opcode_name(nvme_cmd_copy))
/*
@@ -863,6 +869,36 @@ struct nvme_dsm_range {
__le64 slba;
};
+struct nvme_copy_command {
+ __u8 opcode;
+ __u8 flags;
+ __u16 command_id;
+ __le32 nsid;
+ __u64 rsvd2;
+ __le64 metadata;
+ union nvme_data_ptr dptr;
+ __le64 sdlba;
+ __u8 nr_range;
+ __u8 rsvd12;
+ __le16 control;
+ __le16 rsvd13;
+ __le16 dspec;
+ __le32 ilbrt;
+ __le16 lbat;
+ __le16 lbatm;
+};
+
+struct nvme_copy_range {
+ __le64 rsvd0;
+ __le64 slba;
+ __le16 nlb;
+ __le16 rsvd18;
+ __le32 rsvd20;
+ __le32 eilbrt;
+ __le16 elbat;
+ __le16 elbatm;
+};
+
struct nvme_write_zeroes_cmd {
__u8 opcode;
__u8 flags;
@@ -1400,6 +1436,7 @@ struct nvme_command {
struct nvme_download_firmware dlfw;
struct nvme_format_cmd format;
struct nvme_dsm_cmd dsm;
+ struct nvme_copy_command copy;
struct nvme_write_zeroes_cmd write_zeroes;
struct nvme_zone_mgmt_send_cmd zms;
struct nvme_zone_mgmt_recv_cmd zmr;
--
2.25.1
On 2021/01/04 19:48, SelvaKumar S wrote:
> Export bio_map_kern() so that copy offload emulation can use
> it to add vmalloced memory to bio.
>
> Signed-off-by: SelvaKumar S <[email protected]>
> ---
> block/blk-map.c | 3 ++-
> include/linux/blkdev.h | 2 ++
> 2 files changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-map.c b/block/blk-map.c
> index 21630dccac62..50d61475bb68 100644
> --- a/block/blk-map.c
> +++ b/block/blk-map.c
> @@ -378,7 +378,7 @@ static void bio_map_kern_endio(struct bio *bio)
> * Map the kernel address into a bio suitable for io to a block
> * device. Returns an error pointer in case of error.
> */
> -static struct bio *bio_map_kern(struct request_queue *q, void *data,
> +struct bio *bio_map_kern(struct request_queue *q, void *data,
> unsigned int len, gfp_t gfp_mask)
> {
> unsigned long kaddr = (unsigned long)data;
> @@ -428,6 +428,7 @@ static struct bio *bio_map_kern(struct request_queue *q, void *data,
> bio->bi_end_io = bio_map_kern_endio;
> return bio;
> }
> +EXPORT_SYMBOL(bio_map_kern);
Simple copy support is block layer code, so I do not think you need this.
You only need to remove the static declaration of the function.
>
> static void bio_copy_kern_endio(struct bio *bio)
> {
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 070de09425ad..81f9e7bec16c 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -936,6 +936,8 @@ extern int blk_rq_map_user(struct request_queue *, struct request *,
> struct rq_map_data *, void __user *, unsigned long,
> gfp_t);
> extern int blk_rq_unmap_user(struct bio *);
> +extern struct bio *bio_map_kern(struct request_queue *q, void *data,
> + unsigned int len, gfp_t gfp_mask);
> extern int blk_rq_map_kern(struct request_queue *, struct request *, void *, unsigned int, gfp_t);
> extern int blk_rq_map_user_iov(struct request_queue *, struct request *,
> struct rq_map_data *, const struct iov_iter *,
>
--
Damien Le Moal
Western Digital Research
Thanks Damien, will update that.
On 2021/01/04 19:48, SelvaKumar S wrote:
> Add new BLKCOPY ioctl that offloads copying of one or more sources
> ranges to a destination in the device. Accepts copy_ranges that contains
> destination, no of sources and pointer to the array of source
> ranges. Each range_entry contains start and length of source
> ranges (in bytes).
>
> Introduce REQ_OP_COPY, a no-merge copy offload operation. Create
> bio with control information as payload and submit to the device.
> REQ_OP_COPY(19) is a write op and takes zone_write_lock when submitted
> to zoned device.
>
> If the device doesn't support copy or copy offload is disabled, then
> copy is emulated by allocating memory of total copy size. The source
> ranges are read into memory by chaining bio for each source ranges and
> submitting them async and the last bio waits for completion. After data
> is read, it is written to the destination.
>
> bio_map_kern() is used to allocate bio and add pages of copy buffer to
> bio. As bio->bi_private and bio->bi_end_io is needed for chaining the
> bio and over written, invalidate_kernel_vmap_range() for read is called
> in the caller.
>
> Introduce queue limits for simple copy and other helper functions.
> Add device limits as sysfs entries.
> - copy_offload
> - max_copy_sectors
> - max_copy_ranges_sectors
> - max_copy_nr_ranges
>
> copy_offload(= 0) is disabled by default.
> max_copy_sectors = 0 indicates the device doesn't support native copy.
>
> Native copy offload is not supported for stacked devices and is done via
> copy emulation.
>
> Signed-off-by: SelvaKumar S <[email protected]>
> Signed-off-by: Kanchan Joshi <[email protected]>
> Signed-off-by: Nitesh Shetty <[email protected]>
> Signed-off-by: Javier González <[email protected]>
> ---
> block/blk-core.c | 94 ++++++++++++++--
> block/blk-lib.c | 223 ++++++++++++++++++++++++++++++++++++++
> block/blk-merge.c | 2 +
> block/blk-settings.c | 10 ++
> block/blk-sysfs.c | 50 +++++++++
> block/blk-zoned.c | 1 +
> block/bounce.c | 1 +
> block/ioctl.c | 43 ++++++++
> include/linux/bio.h | 1 +
> include/linux/blk_types.h | 15 +++
> include/linux/blkdev.h | 13 +++
> include/uapi/linux/fs.h | 13 +++
> 12 files changed, 458 insertions(+), 8 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 96e5fcd7f071..4a5cd3f53cd2 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -719,6 +719,17 @@ static noinline int should_fail_bio(struct bio *bio)
> }
> ALLOW_ERROR_INJECTION(should_fail_bio, ERRNO);
>
> +static inline int bio_check_copy_eod(struct bio *bio, sector_t start,
> + sector_t nr_sectors, sector_t maxsector)
> +{
> + if (nr_sectors && maxsector &&
> + (nr_sectors > maxsector || start > maxsector - nr_sectors)) {
> + handle_bad_sector(bio, maxsector);
> + return -EIO;
> + }
> + return 0;
> +}
> +
> /*
> * Check whether this bio extends beyond the end of the device or partition.
> * This may well happen - the kernel calls bread() without checking the size of
> @@ -737,6 +748,65 @@ static inline int bio_check_eod(struct bio *bio, sector_t maxsector)
> return 0;
> }
>
> +/*
> + * Check for copy limits and remap source ranges if needed.
> + */
> +static int blk_check_copy(struct bio *bio)
> +{
> + struct block_device *p = NULL;
> + struct request_queue *q = bio->bi_disk->queue;
> + struct blk_copy_payload *payload;
> + int i, maxsector, start_sect = 0, ret = -EIO;
> + unsigned short nr_range;
> +
> + rcu_read_lock();
> +
> + p = __disk_get_part(bio->bi_disk, bio->bi_partno);
> + if (unlikely(!p))
> + goto out;
> + if (unlikely(should_fail_request(p, bio->bi_iter.bi_size)))
> + goto out;
> + if (unlikely(bio_check_ro(bio, p)))
> + goto out;
> +
> + maxsector = bdev_nr_sectors(p);
> + start_sect = p->bd_start_sect;
> +
> + payload = bio_data(bio);
> + nr_range = payload->copy_range;
> +
> + /* cannot handle copy crossing nr_ranges limit */
> + if (payload->copy_range > q->limits.max_copy_nr_ranges)
> + goto out;
If payload->copy_range indicates the number of ranges to be copied, you should
name it payload->copy_nr_ranges.
> +
> + /* cannot handle copy more than copy limits */
Why ? You could loop... Similarly to discard, zone reset etc.
> + if (payload->copy_size > q->limits.max_copy_sectors)
> + goto out;
Shouldn't the list of source ranges give the total size to be copied ?
Otherwise, if payload->copy_size is user provided, you would need to check that
the sum of the source ranges actually is equal to copy_size, no ? That means
that dropping copy_size sound like the right thing to do here.
> +
> + /* check if copy length crosses eod */
> + if (unlikely(bio_check_copy_eod(bio, bio->bi_iter.bi_sector,
> + payload->copy_size, maxsector)))
> + goto out;
> + bio->bi_iter.bi_sector += start_sect;
> +
> + for (i = 0; i < nr_range; i++) {
> + if (unlikely(bio_check_copy_eod(bio, payload->range[i].src,
> + payload->range[i].len, maxsector)))
> + goto out;
> +
> + /* single source range length limit */
> + if (payload->range[i].src > q->limits.max_copy_range_sectors)
> + goto out;
> + payload->range[i].src += start_sect;
> + }
> +
> + bio->bi_partno = 0;
> + ret = 0;
> +out:
> + rcu_read_unlock();
> + return ret;
> +}
> +
> /*
> * Remap block n of partition p to block n+start(p) of the disk.
> */
> @@ -826,14 +896,16 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> if (should_fail_bio(bio))
> goto end_io;
>
> - if (bio->bi_partno) {
> - if (unlikely(blk_partition_remap(bio)))
> - goto end_io;
> - } else {
> - if (unlikely(bio_check_ro(bio, bio->bi_disk->part0)))
> - goto end_io;
> - if (unlikely(bio_check_eod(bio, get_capacity(bio->bi_disk))))
> - goto end_io;
> + if (likely(!op_is_copy(bio->bi_opf))) {
> + if (bio->bi_partno) {
> + if (unlikely(blk_partition_remap(bio)))
> + goto end_io;
> + } else {
> + if (unlikely(bio_check_ro(bio, bio->bi_disk->part0)))
> + goto end_io;
> + if (unlikely(bio_check_eod(bio, get_capacity(bio->bi_disk))))
> + goto end_io;
> + }
> }
>
> /*
> @@ -857,6 +929,12 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> if (!blk_queue_discard(q))
> goto not_supported;
> break;
> + case REQ_OP_COPY:
> + if (!blk_queue_copy(q))
> + goto not_supported;
This check could be inside blk_check_copy(). In any case, since you added the
read-write emulation, why are you even checking this ? That will prevent the use
of the read-write emulation for devices that lack the simple copy support, no ?
> + if (unlikely(blk_check_copy(bio)))
> + goto end_io;
> + break;
> case REQ_OP_SECURE_ERASE:
> if (!blk_queue_secure_erase(q))
> goto not_supported;
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index 752f9c722062..4c0f12e2ed7c 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -150,6 +150,229 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
> }
> EXPORT_SYMBOL(blkdev_issue_discard);
>
> +int blk_copy_offload(struct block_device *bdev, struct blk_copy_payload *payload,
> + sector_t dest, gfp_t gfp_mask)
> +{
> + struct bio *bio;
> + int ret, total_size;
> +
> + bio = bio_alloc(gfp_mask, 1);
> + bio->bi_iter.bi_sector = dest;
> + bio->bi_opf = REQ_OP_COPY | REQ_NOMERGE;
> + bio_set_dev(bio, bdev);
> +
> + total_size = struct_size(payload, range, payload->copy_range);
> + ret = bio_add_page(bio, virt_to_page(payload), total_size,
How is payload allocated ? If it is a structure on-stack in the caller, I am not
sure it would be wise to do an IO using the thread stack page...
> + offset_in_page(payload));
> + if (ret != total_size) {
> + ret = -ENOMEM;
> + bio_put(bio);
> + goto err;
> + }
> +
> + ret = submit_bio_wait(bio);
> +err:
> + bio_put(bio);
> + return ret;
> +
> +}
> +
> +int blk_read_to_buf(struct block_device *bdev, struct blk_copy_payload *payload,
> + gfp_t gfp_mask, void **buf_p)
> +{
> + struct request_queue *q = bdev_get_queue(bdev);
> + struct bio *bio, *parent = NULL;
> + void *buf = NULL;
> + bool is_vmalloc;
> + int i, nr_srcs, copy_len, ret, cur_size, t_len = 0;
> +
> + nr_srcs = payload->copy_range;
> + copy_len = payload->copy_size << SECTOR_SHIFT;
> +
> + buf = kvmalloc(copy_len, gfp_mask);
> + if (!buf)
> + return -ENOMEM;
> + is_vmalloc = is_vmalloc_addr(buf);
> +
> + for (i = 0; i < nr_srcs; i++) {
> + cur_size = payload->range[i].len << SECTOR_SHIFT;
> +
> + bio = bio_map_kern(q, buf + t_len, cur_size, gfp_mask);
> + if (IS_ERR(bio)) {
> + ret = PTR_ERR(bio);
> + goto out;
> + }
> +
> + bio->bi_iter.bi_sector = payload->range[i].src;
> + bio->bi_opf = REQ_OP_READ;
> + bio_set_dev(bio, bdev);
> + bio->bi_end_io = NULL;
> + bio->bi_private = NULL;
> +
> + if (parent) {
> + bio_chain(parent, bio);
> + submit_bio(parent);
> + }
> +
> + parent = bio;
> + t_len += cur_size;
> + }
> +
> + ret = submit_bio_wait(bio);
> + bio_put(bio);
> + if (is_vmalloc)
> + invalidate_kernel_vmap_range(buf, copy_len);
> + if (ret)
> + goto out;
> +
> + *buf_p = buf;
> + return 0;
> +out:
> + kvfree(buf);
> + return ret;
> +}
> +
> +int blk_write_from_buf(struct block_device *bdev, void *buf, sector_t dest,
> + int copy_len, gfp_t gfp_mask)
> +{
> + struct request_queue *q = bdev_get_queue(bdev);
> + struct bio *bio;
> + int ret;
> +
> + bio = bio_map_kern(q, buf, copy_len, gfp_mask);
> + if (IS_ERR(bio)) {
> + ret = PTR_ERR(bio);
> + goto out;
> + }
> + bio_set_dev(bio, bdev);
> + bio->bi_opf = REQ_OP_WRITE;
> + bio->bi_iter.bi_sector = dest;
> +
> + bio->bi_end_io = NULL;
> + ret = submit_bio_wait(bio);
> + bio_put(bio);
> +out:
> + return ret;
> +}
> +
> +int blk_prepare_payload(struct block_device *bdev, int nr_srcs, struct range_entry *rlist,
> + gfp_t gfp_mask, struct blk_copy_payload **payload_p)
> +{
> +
> + struct request_queue *q = bdev_get_queue(bdev);
> + struct blk_copy_payload *payload;
> + sector_t bs_mask;
> + sector_t src_sects, len = 0, total_len = 0;
> + int i, ret, total_size;
> +
> + if (!q)
> + return -ENXIO;
> +
> + if (!nr_srcs)
> + return -EINVAL;
> +
> + if (bdev_read_only(bdev))
> + return -EPERM;
> +
> + bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
> +
> + total_size = struct_size(payload, range, nr_srcs);
> + payload = kmalloc(total_size, gfp_mask);
> + if (!payload)
> + return -ENOMEM;
OK. So this is what goes into the bio. The function blk_copy_offload() assumes
this is at most one page, so total_size <= PAGE_SIZE. Where is that checked ?
> +
> + for (i = 0; i < nr_srcs; i++) {
> + /* copy payload provided are in bytes */
> + src_sects = rlist[i].src;
> + if (src_sects & bs_mask) {
> + ret = -EINVAL;
> + goto err;
> + }
> + src_sects = src_sects >> SECTOR_SHIFT;
> +
> + if (len & bs_mask) {
> + ret = -EINVAL;
> + goto err;
> + }
> +
> + len = rlist[i].len >> SECTOR_SHIFT;
> +
> + total_len += len;
> +
> + WARN_ON_ONCE((src_sects << 9) > UINT_MAX);
> +
> + payload->range[i].src = src_sects;
> + payload->range[i].len = len;
> + }
> +
> + /* storing # of source ranges */
> + payload->copy_range = i;
> + /* storing copy len so far */
> + payload->copy_size = total_len;
The comments here make the code ugly. Rename the variables and structure fields
to have something self explanatory.
> +
> + *payload_p = payload;
> + return 0;
> +err:
> + kfree(payload);
> + return ret;
> +}
> +
> +/**
> + * blkdev_issue_copy - queue a copy
> + * @bdev: source block device
> + * @nr_srcs: number of source ranges to copy
> + * @rlist: array of source ranges (in bytes)
> + * @dest_bdev: destination block device
> + * @dest: destination (in bytes)
> + * @gfp_mask: memory allocation flags (for bio_alloc)
> + *
> + * Description:
> + * Copy array of source ranges from source block device to
> + * destination block device.
> + */
> +
> +
> +int blkdev_issue_copy(struct block_device *bdev, int nr_srcs,
> + struct range_entry *src_rlist, struct block_device *dest_bdev,
> + sector_t dest, gfp_t gfp_mask)
> +{
> + struct request_queue *q = bdev_get_queue(bdev);
> + struct blk_copy_payload *payload;
> + sector_t bs_mask, dest_sect;
> + void *buf = NULL;
> + int ret;
> +
> + ret = blk_prepare_payload(bdev, nr_srcs, src_rlist, gfp_mask, &payload);
> + if (ret)
> + return ret;
> +
> + bs_mask = (bdev_logical_block_size(dest_bdev) >> 9) - 1;
> + if (dest & bs_mask) {
> + return -EINVAL;
> + goto out;
> + }
> + dest_sect = dest >> SECTOR_SHIFT;
dest should already be a 512B sector as all block layer function interfaces use
this unit. What is this doing ?
> +
> + if (bdev == dest_bdev && q->limits.copy_offload) {
> + ret = blk_copy_offload(bdev, payload, dest_sect, gfp_mask);
> + if (ret)
> + goto out;
> + } else {
> + ret = blk_read_to_buf(bdev, payload, gfp_mask, &buf);
> + if (ret)
> + goto out;
> + ret = blk_write_from_buf(dest_bdev, buf, dest_sect,
> + payload->copy_size << SECTOR_SHIFT, gfp_mask);
Arg... This is really not pretty. At the very least, this should all be in a
blk_copy_emulate() helper or something named like that.
My worry though is that this likely slow for large copies, not to mention that
buf is likely to fail to allocate for large copy cases. As commented previously,
dm-kcopyd already does such copy well, with a read-write streaming pipeline and
zone support too for the write side. This should really be reused, at least
partly, instead of re-implementing this read-write here.
> + }
> +
> + if (buf)
> + kvfree(buf);
> +out:
> + kvfree(payload);
> + return ret;
> +}
> +EXPORT_SYMBOL(blkdev_issue_copy);
> +
> /**
> * __blkdev_issue_write_same - generate number of bios with same page
> * @bdev: target blockdev
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 808768f6b174..4e04f24e13c1 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -309,6 +309,8 @@ void __blk_queue_split(struct bio **bio, unsigned int *nr_segs)
> struct bio *split = NULL;
>
> switch (bio_op(*bio)) {
> + case REQ_OP_COPY:
> + break;
> case REQ_OP_DISCARD:
> case REQ_OP_SECURE_ERASE:
> split = blk_bio_discard_split(q, *bio, &q->bio_split, nr_segs);
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 43990b1d148b..93c15ba45a69 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -60,6 +60,10 @@ void blk_set_default_limits(struct queue_limits *lim)
> lim->io_opt = 0;
> lim->misaligned = 0;
> lim->zoned = BLK_ZONED_NONE;
> + lim->copy_offload = 0;
> + lim->max_copy_sectors = 0;
> + lim->max_copy_nr_ranges = 0;
> + lim->max_copy_range_sectors = 0;
> }
> EXPORT_SYMBOL(blk_set_default_limits);
>
> @@ -565,6 +569,12 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
> if (b->chunk_sectors)
> t->chunk_sectors = gcd(t->chunk_sectors, b->chunk_sectors);
>
> + /* simple copy not supported in stacked devices */
> + t->copy_offload = 0;
> + t->max_copy_sectors = 0;
> + t->max_copy_range_sectors = 0;
> + t->max_copy_nr_ranges = 0;
> +
> /* Physical block size a multiple of the logical block size? */
> if (t->physical_block_size & (t->logical_block_size - 1)) {
> t->physical_block_size = t->logical_block_size;
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index b513f1683af0..51b35a8311d9 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -166,6 +166,47 @@ static ssize_t queue_discard_granularity_show(struct request_queue *q, char *pag
> return queue_var_show(q->limits.discard_granularity, page);
> }
>
> +static ssize_t queue_copy_offload_show(struct request_queue *q, char *page)
> +{
> + return queue_var_show(q->limits.copy_offload, page);
> +}
> +
> +static ssize_t queue_copy_offload_store(struct request_queue *q,
> + const char *page, size_t count)
> +{
> + unsigned long copy_offload;
> + ssize_t ret = queue_var_store(&copy_offload, page, count);
> +
> + if (ret < 0)
> + return ret;
> +
> + if (copy_offload < 0 || copy_offload > 1)
err... "copy_offload != 1" ?
> + return -EINVAL;
> +
> + if (q->limits.max_copy_sectors == 0 && copy_offload == 1)
> + return -EINVAL;
> +
> + q->limits.copy_offload = copy_offload;
> + return ret;
> +}
> +
> +static ssize_t queue_max_copy_sectors_show(struct request_queue *q, char *page)
> +{
> + return queue_var_show(q->limits.max_copy_sectors, page);
> +}
> +
> +static ssize_t queue_max_copy_range_sectors_show(struct request_queue *q,
> + char *page)
> +{
> + return queue_var_show(q->limits.max_copy_range_sectors, page);
> +}
> +
> +static ssize_t queue_max_copy_nr_ranges_show(struct request_queue *q,
> + char *page)
> +{
> + return queue_var_show(q->limits.max_copy_nr_ranges, page);
> +}
> +
> static ssize_t queue_discard_max_hw_show(struct request_queue *q, char *page)
> {
>
> @@ -591,6 +632,11 @@ QUEUE_RO_ENTRY(queue_nr_zones, "nr_zones");
> QUEUE_RO_ENTRY(queue_max_open_zones, "max_open_zones");
> QUEUE_RO_ENTRY(queue_max_active_zones, "max_active_zones");
>
> +QUEUE_RW_ENTRY(queue_copy_offload, "copy_offload");
> +QUEUE_RO_ENTRY(queue_max_copy_sectors, "max_copy_sectors");
> +QUEUE_RO_ENTRY(queue_max_copy_range_sectors, "max_copy_range_sectors");
> +QUEUE_RO_ENTRY(queue_max_copy_nr_ranges, "max_copy_nr_ranges");
> +
> QUEUE_RW_ENTRY(queue_nomerges, "nomerges");
> QUEUE_RW_ENTRY(queue_rq_affinity, "rq_affinity");
> QUEUE_RW_ENTRY(queue_poll, "io_poll");
> @@ -636,6 +682,10 @@ static struct attribute *queue_attrs[] = {
> &queue_discard_max_entry.attr,
> &queue_discard_max_hw_entry.attr,
> &queue_discard_zeroes_data_entry.attr,
> + &queue_copy_offload_entry.attr,
> + &queue_max_copy_sectors_entry.attr,
> + &queue_max_copy_range_sectors_entry.attr,
> + &queue_max_copy_nr_ranges_entry.attr,
> &queue_write_same_max_entry.attr,
> &queue_write_zeroes_max_entry.attr,
> &queue_zone_append_max_entry.attr,
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 7a68b6e4300c..02069178d51e 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -75,6 +75,7 @@ bool blk_req_needs_zone_write_lock(struct request *rq)
> case REQ_OP_WRITE_ZEROES:
> case REQ_OP_WRITE_SAME:
> case REQ_OP_WRITE:
> + case REQ_OP_COPY:
> return blk_rq_zone_is_seq(rq);
> default:
> return false;
> diff --git a/block/bounce.c b/block/bounce.c
> index d3f51acd6e3b..5e052afe8691 100644
> --- a/block/bounce.c
> +++ b/block/bounce.c
> @@ -254,6 +254,7 @@ static struct bio *bounce_clone_bio(struct bio *bio_src, gfp_t gfp_mask,
> bio->bi_iter.bi_size = bio_src->bi_iter.bi_size;
>
> switch (bio_op(bio)) {
> + case REQ_OP_COPY:
> case REQ_OP_DISCARD:
> case REQ_OP_SECURE_ERASE:
> case REQ_OP_WRITE_ZEROES:
> diff --git a/block/ioctl.c b/block/ioctl.c
> index d61d652078f4..d50b6abe2883 100644
> --- a/block/ioctl.c
> +++ b/block/ioctl.c
> @@ -133,6 +133,47 @@ static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode,
> GFP_KERNEL, flags);
> }
>
> +static int blk_ioctl_copy(struct block_device *bdev, fmode_t mode,
> + unsigned long arg)
> +{
> + struct copy_range crange;
> + struct range_entry *rlist;
> + struct request_queue *q = bdev_get_queue(bdev);
> + sector_t dest;
> + int ret;
> +
> + if (!(mode & FMODE_WRITE))
> + return -EBADF;
> +
> + if (!blk_queue_copy(q))
> + return -EOPNOTSUPP;
But you added the read-write emulation. So what is the point here ? This ioctl
should work for any device, with or without simple copy support. Did you test that ?
> +
> + if (copy_from_user(&crange, (void __user *)arg, sizeof(crange)))
> + return -EFAULT;
> +
> + if (crange.dest & ((1 << SECTOR_SHIFT) - 1))
> + return -EFAULT;
> + dest = crange.dest;
> +
> + rlist = kmalloc_array(crange.nr_range, sizeof(*rlist),
> + GFP_KERNEL);
> +
Unnecessary blank line here.
> + if (!rlist)
> + return -ENOMEM;
> +
> + if (copy_from_user(rlist, (void __user *)crange.range_list,
> + sizeof(*rlist) * crange.nr_range)) {
> + ret = -EFAULT;
> + goto out;
> + }
> +
> + ret = blkdev_issue_copy(bdev, crange.nr_range, rlist, bdev, dest,
> + GFP_KERNEL);
> +out:
> + kfree(rlist);
> + return ret;
> +}
> +
> static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
> unsigned long arg)
> {
> @@ -458,6 +499,8 @@ static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
> case BLKSECDISCARD:
> return blk_ioctl_discard(bdev, mode, arg,
> BLKDEV_DISCARD_SECURE);
> + case BLKCOPY:
> + return blk_ioctl_copy(bdev, mode, arg);
> case BLKZEROOUT:
> return blk_ioctl_zeroout(bdev, mode, arg);
> case BLKREPORTZONE:
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 1edda614f7ce..164313bdfb35 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -71,6 +71,7 @@ static inline bool bio_has_data(struct bio *bio)
> static inline bool bio_no_advance_iter(const struct bio *bio)
> {
> return bio_op(bio) == REQ_OP_DISCARD ||
> + bio_op(bio) == REQ_OP_COPY ||
> bio_op(bio) == REQ_OP_SECURE_ERASE ||
> bio_op(bio) == REQ_OP_WRITE_SAME ||
> bio_op(bio) == REQ_OP_WRITE_ZEROES;
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 866f74261b3b..d4d11e9ff814 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -380,6 +380,8 @@ enum req_opf {
> REQ_OP_ZONE_RESET = 15,
> /* reset all the zone present on the device */
> REQ_OP_ZONE_RESET_ALL = 17,
> + /* copy ranges within device */
> + REQ_OP_COPY = 19,
>
> /* SCSI passthrough using struct scsi_request */
> REQ_OP_SCSI_IN = 32,
> @@ -506,6 +508,11 @@ static inline bool op_is_discard(unsigned int op)
> return (op & REQ_OP_MASK) == REQ_OP_DISCARD;
> }
>
> +static inline bool op_is_copy(unsigned int op)
> +{
> + return (op & REQ_OP_MASK) == REQ_OP_COPY;
> +}
> +
> /*
> * Check if a bio or request operation is a zone management operation, with
> * the exception of REQ_OP_ZONE_RESET_ALL which is treated as a special case
> @@ -565,4 +572,12 @@ struct blk_rq_stat {
> u64 batch;
> };
>
> +struct blk_copy_payload {
> + sector_t dest;
> + int copy_range;
> + int copy_size;
> + int err;
> + struct range_entry range[];
> +};
> +
> #endif /* __LINUX_BLK_TYPES_H */
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 81f9e7bec16c..4c7e861e57e4 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -340,10 +340,14 @@ struct queue_limits {
> unsigned int max_zone_append_sectors;
> unsigned int discard_granularity;
> unsigned int discard_alignment;
> + unsigned int copy_offload;
> + unsigned int max_copy_sectors;
>
> unsigned short max_segments;
> unsigned short max_integrity_segments;
> unsigned short max_discard_segments;
> + unsigned short max_copy_range_sectors;
> + unsigned short max_copy_nr_ranges;
>
> unsigned char misaligned;
> unsigned char discard_misaligned;
> @@ -625,6 +629,7 @@ struct request_queue {
> #define QUEUE_FLAG_RQ_ALLOC_TIME 27 /* record rq->alloc_time_ns */
> #define QUEUE_FLAG_HCTX_ACTIVE 28 /* at least one blk-mq hctx is active */
> #define QUEUE_FLAG_NOWAIT 29 /* device supports NOWAIT */
> +#define QUEUE_FLAG_COPY 30 /* supports copy */
I think this should be called QUEUE_FLAG_SIMPLE_COPY to indicate more precisely
the type of copy supported. SCSI XCOPY is more advanced...
>
> #define QUEUE_FLAG_MQ_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \
> (1 << QUEUE_FLAG_SAME_COMP) | \
> @@ -647,6 +652,7 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
> #define blk_queue_io_stat(q) test_bit(QUEUE_FLAG_IO_STAT, &(q)->queue_flags)
> #define blk_queue_add_random(q) test_bit(QUEUE_FLAG_ADD_RANDOM, &(q)->queue_flags)
> #define blk_queue_discard(q) test_bit(QUEUE_FLAG_DISCARD, &(q)->queue_flags)
> +#define blk_queue_copy(q) test_bit(QUEUE_FLAG_COPY, &(q)->queue_flags)
> #define blk_queue_zone_resetall(q) \
> test_bit(QUEUE_FLAG_ZONE_RESETALL, &(q)->queue_flags)
> #define blk_queue_secure_erase(q) \
> @@ -1061,6 +1067,9 @@ static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
> return min(q->limits.max_discard_sectors,
> UINT_MAX >> SECTOR_SHIFT);
>
> + if (unlikely(op == REQ_OP_COPY))
> + return q->limits.max_copy_sectors;
> +
> if (unlikely(op == REQ_OP_WRITE_SAME))
> return q->limits.max_write_same_sectors;
>
> @@ -1335,6 +1344,10 @@ extern int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
> sector_t nr_sects, gfp_t gfp_mask, int flags,
> struct bio **biop);
>
> +extern int blkdev_issue_copy(struct block_device *bdev, int nr_srcs,
> + struct range_entry *src_rlist, struct block_device *dest_bdev,
> + sector_t dest, gfp_t gfp_mask);
> +
> #define BLKDEV_ZERO_NOUNMAP (1 << 0) /* do not free blocks */
> #define BLKDEV_ZERO_NOFALLBACK (1 << 1) /* don't write explicit zeroes */
>
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index f44eb0a04afd..5cadb176317a 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -64,6 +64,18 @@ struct fstrim_range {
> __u64 minlen;
> };
>
> +struct range_entry {
> + __u64 src;
> + __u64 len;
> +};
> +
> +struct copy_range {
> + __u64 dest;
> + __u64 nr_range;
> + __u64 range_list;
> + __u64 rsvd;
> +};
> +
> /* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
> #define FILE_DEDUPE_RANGE_SAME 0
> #define FILE_DEDUPE_RANGE_DIFFERS 1
> @@ -184,6 +196,7 @@ struct fsxattr {
> #define BLKSECDISCARD _IO(0x12,125)
> #define BLKROTATIONAL _IO(0x12,126)
> #define BLKZEROOUT _IO(0x12,127)
> +#define BLKCOPY _IOWR(0x12, 128, struct copy_range)
> /*
> * A jump here: 130-131 are reserved for zoned block devices
> * (see uapi/linux/blkzoned.h)
>
--
Damien Le Moal
Western Digital Research
SelvaKumar S: This didn't show up on dm-devel, sorry for the OT reply...
On Mon, Jan 04, 2021 at 12:47:11PM +0000, Damien Le Moal wrote:
> On 2021/01/04 19:48, SelvaKumar S wrote:
> > Add new BLKCOPY ioctl that offloads copying of one or more sources
> > ranges to a destination in the device. Accepts copy_ranges that contains
> > destination, no of sources and pointer to the array of source
> > ranges. Each range_entry contains start and length of source
> > ranges (in bytes).
> >
> > Introduce REQ_OP_COPY, a no-merge copy offload operation. Create
> > bio with control information as payload and submit to the device.
> > REQ_OP_COPY(19) is a write op and takes zone_write_lock when submitted
> > to zoned device.
> >
> > If the device doesn't support copy or copy offload is disabled, then
> > copy is emulated by allocating memory of total copy size. The source
> > ranges are read into memory by chaining bio for each source ranges and
> > submitting them async and the last bio waits for completion. After data
> > is read, it is written to the destination.
> >
> > bio_map_kern() is used to allocate bio and add pages of copy buffer to
> > bio. As bio->bi_private and bio->bi_end_io is needed for chaining the
> > bio and over written, invalidate_kernel_vmap_range() for read is called
> > in the caller.
> >
> > Introduce queue limits for simple copy and other helper functions.
> > Add device limits as sysfs entries.
> > - copy_offload
> > - max_copy_sectors
> > - max_copy_ranges_sectors
> > - max_copy_nr_ranges
> >
> > copy_offload(= 0) is disabled by default.
> > max_copy_sectors = 0 indicates the device doesn't support native copy.
> >
> > Native copy offload is not supported for stacked devices and is done via
> > copy emulation.
> >
> > Signed-off-by: SelvaKumar S <[email protected]>
> > Signed-off-by: Kanchan Joshi <[email protected]>
> > Signed-off-by: Nitesh Shetty <[email protected]>
> > Signed-off-by: Javier González <[email protected]>
> > ---
> > block/blk-core.c | 94 ++++++++++++++--
> > block/blk-lib.c | 223 ++++++++++++++++++++++++++++++++++++++
> > block/blk-merge.c | 2 +
> > block/blk-settings.c | 10 ++
> > block/blk-sysfs.c | 50 +++++++++
> > block/blk-zoned.c | 1 +
> > block/bounce.c | 1 +
> > block/ioctl.c | 43 ++++++++
> > include/linux/bio.h | 1 +
> > include/linux/blk_types.h | 15 +++
> > include/linux/blkdev.h | 13 +++
> > include/uapi/linux/fs.h | 13 +++
This series should also be cc'd to linux-api since you're adding a new
userspace api.
Alternately, save yourself the trouble of passing userspace API review
by hooking this up to the existing copy_file_range call in the vfs. /me
hopes you sent blktests to check the operation of this thing, since none
of the original patches made it to this list.
If you really /do/ need a new kernel call for this, then please send in
a manpage documenting the fields of struct range_entry and copy_range,
and how users are supposed to use this.
<now jumping to the ioctl definition because Damien already reviewed the
plumbing...>
> > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > index f44eb0a04afd..5cadb176317a 100644
> > --- a/include/uapi/linux/fs.h
> > +++ b/include/uapi/linux/fs.h
> > @@ -64,6 +64,18 @@ struct fstrim_range {
> > __u64 minlen;
> > };
> >
> > +struct range_entry {
> > + __u64 src;
Is this an offset? Or the fd of an open bdev?
> > + __u64 len;
> > +};
> > +
> > +struct copy_range {
> > + __u64 dest;
> > + __u64 nr_range;
> > + __u64 range_list;
Hm, I think this is a pointer? Why not just put the range_entry array
at the end of struct copy_range?
> > + __u64 rsvd;
Also needs a flags argument so we don't have to add BLKCOPY2 in like 3
months.
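For illustration, the layout being suggested here, with the range array trailing the header as a flexible array member and the reserved word repurposed as a flags field, might look like the following userspace sketch (field names and helper are hypothetical, not part of the proposed uapi):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical reshaping of the uapi structs per the review comments:
 * the range array trails the header as a flexible array member, and a
 * flags word replaces the bare reserved field. */
struct range_entry {
	uint64_t src;	/* byte offset of the source range */
	uint64_t len;	/* length of the source range, in bytes */
};

struct copy_range {
	uint64_t dest;		/* destination byte offset */
	uint64_t flags;		/* future extensions; must be 0 for now */
	uint64_t nr_range;	/* number of trailing range entries */
	struct range_entry ranges[];	/* flexible array member */
};

/* One allocation covers the header and all ranges. */
static struct copy_range *alloc_copy_range(uint64_t nr)
{
	struct copy_range *cr;

	cr = calloc(1, sizeof(*cr) + nr * sizeof(cr->ranges[0]));
	if (cr)
		cr->nr_range = nr;
	return cr;
}
```

With this shape the kernel copies one contiguous user buffer, and `flags` gives a forward-compatibility escape hatch without needing a BLKCOPY2.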
--D
> > +};
> > +
> > /* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
> > #define FILE_DEDUPE_RANGE_SAME 0
> > #define FILE_DEDUPE_RANGE_DIFFERS 1
> > @@ -184,6 +196,7 @@ struct fsxattr {
> > #define BLKSECDISCARD _IO(0x12,125)
> > #define BLKROTATIONAL _IO(0x12,126)
> > #define BLKZEROOUT _IO(0x12,127)
> > +#define BLKCOPY _IOWR(0x12, 128, struct copy_range)
> > /*
> > * A jump here: 130-131 are reserved for zoned block devices
> > * (see uapi/linux/blkzoned.h)
> >
>
>
> --
> Damien Le Moal
> Western Digital Research
>
>
>
>
Thanks for the review, Damien.
On Mon, Jan 4, 2021 at 6:17 PM Damien Le Moal <[email protected]> wrote:
>
> On 2021/01/04 19:48, SelvaKumar S wrote:
> > Add new BLKCOPY ioctl that offloads copying of one or more sources
> > ranges to a destination in the device. Accepts copy_ranges that contains
> > destination, no of sources and pointer to the array of source
> > ranges. Each range_entry contains start and length of source
> > ranges (in bytes).
> >
> > Introduce REQ_OP_COPY, a no-merge copy offload operation. Create
> > bio with control information as payload and submit to the device.
> > REQ_OP_COPY(19) is a write op and takes zone_write_lock when submitted
> > to zoned device.
> >
> > If the device doesn't support copy or copy offload is disabled, then
> > copy is emulated by allocating memory of total copy size. The source
> > ranges are read into memory by chaining bio for each source ranges and
> > submitting them async and the last bio waits for completion. After data
> > is read, it is written to the destination.
> >
> > bio_map_kern() is used to allocate bio and add pages of copy buffer to
> > bio. As bio->bi_private and bio->bi_end_io is needed for chaining the
> > bio and over written, invalidate_kernel_vmap_range() for read is called
> > in the caller.
> >
> > Introduce queue limits for simple copy and other helper functions.
> > Add device limits as sysfs entries.
> > - copy_offload
> > - max_copy_sectors
> > - max_copy_ranges_sectors
> > - max_copy_nr_ranges
> >
> > copy_offload(= 0) is disabled by default.
> > max_copy_sectors = 0 indicates the device doesn't support native copy.
> >
> > Native copy offload is not supported for stacked devices and is done via
> > copy emulation.
> >
> > Signed-off-by: SelvaKumar S <[email protected]>
> > Signed-off-by: Kanchan Joshi <[email protected]>
> > Signed-off-by: Nitesh Shetty <[email protected]>
> > Signed-off-by: Javier González <[email protected]>
> > ---
> > block/blk-core.c | 94 ++++++++++++++--
> > block/blk-lib.c | 223 ++++++++++++++++++++++++++++++++++++++
> > block/blk-merge.c | 2 +
> > block/blk-settings.c | 10 ++
> > block/blk-sysfs.c | 50 +++++++++
> > block/blk-zoned.c | 1 +
> > block/bounce.c | 1 +
> > block/ioctl.c | 43 ++++++++
> > include/linux/bio.h | 1 +
> > include/linux/blk_types.h | 15 +++
> > include/linux/blkdev.h | 13 +++
> > include/uapi/linux/fs.h | 13 +++
> > 12 files changed, 458 insertions(+), 8 deletions(-)
> >
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 96e5fcd7f071..4a5cd3f53cd2 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -719,6 +719,17 @@ static noinline int should_fail_bio(struct bio *bio)
> > }
> > ALLOW_ERROR_INJECTION(should_fail_bio, ERRNO);
> >
> > +static inline int bio_check_copy_eod(struct bio *bio, sector_t start,
> > + sector_t nr_sectors, sector_t maxsector)
> > +{
> > + if (nr_sectors && maxsector &&
> > + (nr_sectors > maxsector || start > maxsector - nr_sectors)) {
> > + handle_bad_sector(bio, maxsector);
> > + return -EIO;
> > + }
> > + return 0;
> > +}
> > +
> > /*
> > * Check whether this bio extends beyond the end of the device or partition.
> > * This may well happen - the kernel calls bread() without checking the size of
> > @@ -737,6 +748,65 @@ static inline int bio_check_eod(struct bio *bio, sector_t maxsector)
> > return 0;
> > }
> >
> > +/*
> > + * Check for copy limits and remap source ranges if needed.
> > + */
> > +static int blk_check_copy(struct bio *bio)
> > +{
> > + struct block_device *p = NULL;
> > + struct request_queue *q = bio->bi_disk->queue;
> > + struct blk_copy_payload *payload;
> > + int i, maxsector, start_sect = 0, ret = -EIO;
> > + unsigned short nr_range;
> > +
> > + rcu_read_lock();
> > +
> > + p = __disk_get_part(bio->bi_disk, bio->bi_partno);
> > + if (unlikely(!p))
> > + goto out;
> > + if (unlikely(should_fail_request(p, bio->bi_iter.bi_size)))
> > + goto out;
> > + if (unlikely(bio_check_ro(bio, p)))
> > + goto out;
> > +
> > + maxsector = bdev_nr_sectors(p);
> > + start_sect = p->bd_start_sect;
> > +
> > + payload = bio_data(bio);
> > + nr_range = payload->copy_range;
> > +
> > + /* cannot handle copy crossing nr_ranges limit */
> > + if (payload->copy_range > q->limits.max_copy_nr_ranges)
> > + goto out;
>
> If payload->copy_range indicates the number of ranges to be copied, you should
> name it payload->copy_nr_ranges.
>
Agreed. Will rename the entries.
> > +
> > + /* cannot handle copy more than copy limits */
>
> Why ? You could loop... Similarly to discard, zone reset etc.
>
Sure. Will add support for handling copies larger than the device limits.
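Such looping would mirror what __blkdev_issue_discard() does with max_discard_sectors; the chunking arithmetic alone can be sketched in userspace like this (function names and callback shape are illustrative only, not the patch's code):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative chunking only: split a large copy into pieces no
 * bigger than the queue's max_copy_sectors, the way discard and
 * zone reset loop over their per-command limits. */
static int split_copy(uint64_t nr_sects, uint64_t max_copy_sectors,
		      void (*issue)(uint64_t start, uint64_t len))
{
	uint64_t done = 0;

	if (!max_copy_sectors)
		return -1;	/* device does not support copy */

	while (done < nr_sects) {
		uint64_t chunk = nr_sects - done;

		if (chunk > max_copy_sectors)
			chunk = max_copy_sectors;
		issue(done, chunk);	/* one copy command per chunk */
		done += chunk;
	}
	return 0;
}
```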
>
> > + if (payload->copy_size > q->limits.max_copy_sectors)
> > + goto out;
>
> Shouldn't the list of source ranges give the total size to be copied ?
> Otherwise, if payload->copy_size is user provided, you would need to check that
> the sum of the source ranges actually is equal to copy_size, no ? That means
> that dropping copy_size sound like the right thing to do here.
>
payload->copy_size is used in copy emulation to allocate the buffer. Let me
check once more whether it is possible to do without it.
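If the field were dropped, the emulation buffer size could be derived from the range list itself, along these lines (standalone sketch, not the patch's code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct range_entry {
	uint64_t src;	/* in sectors, as stored in the payload */
	uint64_t len;	/* in sectors */
};

/* Sum of all source range lengths. With this, a separate copy_size
 * field becomes redundant and can never disagree with the list. */
static uint64_t total_copy_sectors(const struct range_entry *r, size_t n)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += r[i].len;
	return sum;
}
```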
> > +
> > + /* check if copy length crosses eod */
> > + if (unlikely(bio_check_copy_eod(bio, bio->bi_iter.bi_sector,
> > + payload->copy_size, maxsector)))
> > + goto out;
> > + bio->bi_iter.bi_sector += start_sect;
> > +
> > + for (i = 0; i < nr_range; i++) {
> > + if (unlikely(bio_check_copy_eod(bio, payload->range[i].src,
> > + payload->range[i].len, maxsector)))
> > + goto out;
> > +
> > + /* single source range length limit */
> > + if (payload->range[i].src > q->limits.max_copy_range_sectors)
> > + goto out;
> > + payload->range[i].src += start_sect;
> > + }
> > +
> > + bio->bi_partno = 0;
> > + ret = 0;
> > +out:
> > + rcu_read_unlock();
> > + return ret;
> > +}
> > +
> > /*
> > * Remap block n of partition p to block n+start(p) of the disk.
> > */
> > @@ -826,14 +896,16 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> > if (should_fail_bio(bio))
> > goto end_io;
> >
> > - if (bio->bi_partno) {
> > - if (unlikely(blk_partition_remap(bio)))
> > - goto end_io;
> > - } else {
> > - if (unlikely(bio_check_ro(bio, bio->bi_disk->part0)))
> > - goto end_io;
> > - if (unlikely(bio_check_eod(bio, get_capacity(bio->bi_disk))))
> > - goto end_io;
> > + if (likely(!op_is_copy(bio->bi_opf))) {
> > + if (bio->bi_partno) {
> > + if (unlikely(blk_partition_remap(bio)))
> > + goto end_io;
> > + } else {
> > + if (unlikely(bio_check_ro(bio, bio->bi_disk->part0)))
> > + goto end_io;
> > + if (unlikely(bio_check_eod(bio, get_capacity(bio->bi_disk))))
> > + goto end_io;
> > + }
> > }
> >
> > /*
> > @@ -857,6 +929,12 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> > if (!blk_queue_discard(q))
> > goto not_supported;
> > break;
> > + case REQ_OP_COPY:
> > + if (!blk_queue_copy(q))
> > + goto not_supported;
>
> This check could be inside blk_check_copy(). In any case, since you added the
> read-write emulation, why are you even checking this ? That will prevent the use
> of the read-write emulation for devices that lack the simple copy support, no ?
>
Makes sense. Will remove this check.
>
> > + if (unlikely(blk_check_copy(bio)))
> > + goto end_io;
> > + break;
> > case REQ_OP_SECURE_ERASE:
> > if (!blk_queue_secure_erase(q))
> > goto not_supported;
> > diff --git a/block/blk-lib.c b/block/blk-lib.c
> > index 752f9c722062..4c0f12e2ed7c 100644
> > --- a/block/blk-lib.c
> > +++ b/block/blk-lib.c
> > @@ -150,6 +150,229 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
> > }
> > EXPORT_SYMBOL(blkdev_issue_discard);
> >
> > +int blk_copy_offload(struct block_device *bdev, struct blk_copy_payload *payload,
> > + sector_t dest, gfp_t gfp_mask)
> > +{
> > + struct bio *bio;
> > + int ret, total_size;
> > +
> > + bio = bio_alloc(gfp_mask, 1);
> > + bio->bi_iter.bi_sector = dest;
> > + bio->bi_opf = REQ_OP_COPY | REQ_NOMERGE;
> > + bio_set_dev(bio, bdev);
> > +
> > + total_size = struct_size(payload, range, payload->copy_range);
> > + ret = bio_add_page(bio, virt_to_page(payload), total_size,
>
> How is payload allocated ? If it is a structure on-stack in the caller, I am not
> sure it would be wise to do an IO using the thread stack page...
>
> > + offset_in_page(payload));
> > + if (ret != total_size) {
> > + ret = -ENOMEM;
> > + bio_put(bio);
> > + goto err;
> > + }
> > +
> > + ret = submit_bio_wait(bio);
> > +err:
> > + bio_put(bio);
> > + return ret;
> > +
> > +}
> > +
> > +int blk_read_to_buf(struct block_device *bdev, struct blk_copy_payload *payload,
> > + gfp_t gfp_mask, void **buf_p)
> > +{
> > + struct request_queue *q = bdev_get_queue(bdev);
> > + struct bio *bio, *parent = NULL;
> > + void *buf = NULL;
> > + bool is_vmalloc;
> > + int i, nr_srcs, copy_len, ret, cur_size, t_len = 0;
> > +
> > + nr_srcs = payload->copy_range;
> > + copy_len = payload->copy_size << SECTOR_SHIFT;
> > +
> > + buf = kvmalloc(copy_len, gfp_mask);
> > + if (!buf)
> > + return -ENOMEM;
> > + is_vmalloc = is_vmalloc_addr(buf);
> > +
> > + for (i = 0; i < nr_srcs; i++) {
> > + cur_size = payload->range[i].len << SECTOR_SHIFT;
> > +
> > + bio = bio_map_kern(q, buf + t_len, cur_size, gfp_mask);
> > + if (IS_ERR(bio)) {
> > + ret = PTR_ERR(bio);
> > + goto out;
> > + }
> > +
> > + bio->bi_iter.bi_sector = payload->range[i].src;
> > + bio->bi_opf = REQ_OP_READ;
> > + bio_set_dev(bio, bdev);
> > + bio->bi_end_io = NULL;
> > + bio->bi_private = NULL;
> > +
> > + if (parent) {
> > + bio_chain(parent, bio);
> > + submit_bio(parent);
> > + }
> > +
> > + parent = bio;
> > + t_len += cur_size;
> > + }
> > +
> > + ret = submit_bio_wait(bio);
> > + bio_put(bio);
> > + if (is_vmalloc)
> > + invalidate_kernel_vmap_range(buf, copy_len);
> > + if (ret)
> > + goto out;
> > +
> > + *buf_p = buf;
> > + return 0;
> > +out:
> > + kvfree(buf);
> > + return ret;
> > +}
> > +
> > +int blk_write_from_buf(struct block_device *bdev, void *buf, sector_t dest,
> > + int copy_len, gfp_t gfp_mask)
> > +{
> > + struct request_queue *q = bdev_get_queue(bdev);
> > + struct bio *bio;
> > + int ret;
> > +
> > + bio = bio_map_kern(q, buf, copy_len, gfp_mask);
> > + if (IS_ERR(bio)) {
> > + ret = PTR_ERR(bio);
> > + goto out;
> > + }
> > + bio_set_dev(bio, bdev);
> > + bio->bi_opf = REQ_OP_WRITE;
> > + bio->bi_iter.bi_sector = dest;
> > +
> > + bio->bi_end_io = NULL;
> > + ret = submit_bio_wait(bio);
> > + bio_put(bio);
> > +out:
> > + return ret;
> > +}
> > +
> > +int blk_prepare_payload(struct block_device *bdev, int nr_srcs, struct range_entry *rlist,
> > + gfp_t gfp_mask, struct blk_copy_payload **payload_p)
> > +{
> > +
> > + struct request_queue *q = bdev_get_queue(bdev);
> > + struct blk_copy_payload *payload;
> > + sector_t bs_mask;
> > + sector_t src_sects, len = 0, total_len = 0;
> > + int i, ret, total_size;
> > +
> > + if (!q)
> > + return -ENXIO;
> > +
> > + if (!nr_srcs)
> > + return -EINVAL;
> > +
> > + if (bdev_read_only(bdev))
> > + return -EPERM;
> > +
> > + bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
> > +
> > + total_size = struct_size(payload, range, nr_srcs);
> > + payload = kmalloc(total_size, gfp_mask);
> > + if (!payload)
> > + return -ENOMEM;
>
> OK. So this is what goes into the bio. The function blk_copy_offload() assumes
> this is at most one page, so total_size <= PAGE_SIZE. Where is that checked ?
>
Correct me if I'm wrong: since the payload is allocated via kmalloc(), it is a
single physically contiguous segment. If it crosses a page boundary, we rely on
multi-page bio_vecs to cover it.
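The one-page concern can be sanity-checked with struct_size()-style arithmetic; a standalone sketch (sector_t approximated as uint64_t, PAGE_SIZE assumed to be 4096):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096	/* assumed; arch-dependent in the kernel */

struct range_entry {
	uint64_t src;
	uint64_t len;
};

/* Mirrors the payload from the patch, with sector_t as uint64_t. */
struct blk_copy_payload {
	uint64_t dest;
	int copy_range;
	int copy_size;
	int err;
	struct range_entry range[];
};

/* struct_size()-style arithmetic: bytes needed for nr ranges, i.e.
 * how large the kmalloc'd payload grows with the range count. */
static size_t payload_bytes(size_t nr)
{
	return sizeof(struct blk_copy_payload) +
	       nr * sizeof(struct range_entry);
}
```

On common 64-bit layouts a couple hundred ranges already push the payload past one page, which is why the single bio_add_page() call depends on multi-page bio_vec support.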
> > +
> > + for (i = 0; i < nr_srcs; i++) {
> > + /* copy payload provided are in bytes */
> > + src_sects = rlist[i].src;
> > + if (src_sects & bs_mask) {
> > + ret = -EINVAL;
> > + goto err;
> > + }
> > + src_sects = src_sects >> SECTOR_SHIFT;
> > +
> > + if (len & bs_mask) {
> > + ret = -EINVAL;
> > + goto err;
> > + }
> > +
> > + len = rlist[i].len >> SECTOR_SHIFT;
> > +
> > + total_len += len;
> > +
> > + WARN_ON_ONCE((src_sects << 9) > UINT_MAX);
> > +
> > + payload->range[i].src = src_sects;
> > + payload->range[i].len = len;
> > + }
> > +
> > + /* storing # of source ranges */
> > + payload->copy_range = i;
> > + /* storing copy len so far */
> > + payload->copy_size = total_len;
>
> The comments here make the code ugly. Rename the variables and structure fields
> to have something self explanatory.
>
Agreed.
> > +
> > + *payload_p = payload;
> > + return 0;
> > +err:
> > + kfree(payload);
> > + return ret;
> > +}
> > +
> > +/**
> > + * blkdev_issue_copy - queue a copy
> > + * @bdev: source block device
> > + * @nr_srcs: number of source ranges to copy
> > + * @rlist: array of source ranges (in bytes)
> > + * @dest_bdev: destination block device
> > + * @dest: destination (in bytes)
> > + * @gfp_mask: memory allocation flags (for bio_alloc)
> > + *
> > + * Description:
> > + * Copy array of source ranges from source block device to
> > + * destination block device.
> > + */
> > +
> > +
> > +int blkdev_issue_copy(struct block_device *bdev, int nr_srcs,
> > + struct range_entry *src_rlist, struct block_device *dest_bdev,
> > + sector_t dest, gfp_t gfp_mask)
> > +{
> > + struct request_queue *q = bdev_get_queue(bdev);
> > + struct blk_copy_payload *payload;
> > + sector_t bs_mask, dest_sect;
> > + void *buf = NULL;
> > + int ret;
> > +
> > + ret = blk_prepare_payload(bdev, nr_srcs, src_rlist, gfp_mask, &payload);
> > + if (ret)
> > + return ret;
> > +
> > + bs_mask = (bdev_logical_block_size(dest_bdev) >> 9) - 1;
> > + if (dest & bs_mask) {
> > + return -EINVAL;
> > + goto out;
> > + }
> > + dest_sect = dest >> SECTOR_SHIFT;
>
> dest should already be a 512B sector as all block layer functions interface use
> this unit. What is this doing ?
>
As the source ranges and lengths received are in bytes, dest was kept in bytes
as well for consistency. That seems wrong; will change it to 512B sector units.
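For reference, the byte-offset validation and conversion being discussed reduces to the following (illustrative userspace sketch; names are not from the patch):

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9

/* Convert a byte offset to a 512B sector number, rejecting offsets
 * that are not aligned to the device's logical block size (lbs is
 * assumed to be a power of two, as it is for block devices). */
static int bytes_to_sector(uint64_t bytes, uint32_t lbs, uint64_t *sect)
{
	uint64_t bs_mask = lbs - 1;

	if (bytes & bs_mask)
		return -1;	/* misaligned */
	*sect = bytes >> SECTOR_SHIFT;
	return 0;
}
```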
>
> > +
> > + if (bdev == dest_bdev && q->limits.copy_offload) {
> > + ret = blk_copy_offload(bdev, payload, dest_sect, gfp_mask);
> > + if (ret)
> > + goto out;
> > + } else {
> > + ret = blk_read_to_buf(bdev, payload, gfp_mask, &buf);
> > + if (ret)
> > + goto out;
> > + ret = blk_write_from_buf(dest_bdev, buf, dest_sect,
> > + payload->copy_size << SECTOR_SHIFT, gfp_mask);
>
> Arg... This is really not pretty. At the very least, this should all be in a
> blk_copy_emulate() helper or something named like that.
>
Okay. Will move this to a helper.
> My worry though is that this likely slow for large copies, not to mention that
> buf is likely to fail to allocate for large copy cases. As commented previously,
> dm-kcopyd already does such copy well, with a read-write streaming pipeline and
> zone support too for the write side. This should really be reused, at least
> partly, instead of re-implementing this read-write here.
>
I was a bit hesitant to use dm layer code in the block layer, but it makes
sense to reuse code as much as possible. Will try to reuse the dm-kcopyd code
for the copy emulation part.
>
> > + }
> > +
> > + if (buf)
> > + kvfree(buf);
> > +out:
> > + kvfree(payload);
> > + return ret;
> > +}
> > +EXPORT_SYMBOL(blkdev_issue_copy);
> > +
> > /**
> > * __blkdev_issue_write_same - generate number of bios with same page
> > * @bdev: target blockdev
> > diff --git a/block/blk-merge.c b/block/blk-merge.c
> > index 808768f6b174..4e04f24e13c1 100644
> > --- a/block/blk-merge.c
> > +++ b/block/blk-merge.c
> > @@ -309,6 +309,8 @@ void __blk_queue_split(struct bio **bio, unsigned int *nr_segs)
> > struct bio *split = NULL;
> >
> > switch (bio_op(*bio)) {
> > + case REQ_OP_COPY:
> > + break;
> > case REQ_OP_DISCARD:
> > case REQ_OP_SECURE_ERASE:
> > split = blk_bio_discard_split(q, *bio, &q->bio_split, nr_segs);
> > diff --git a/block/blk-settings.c b/block/blk-settings.c
> > index 43990b1d148b..93c15ba45a69 100644
> > --- a/block/blk-settings.c
> > +++ b/block/blk-settings.c
> > @@ -60,6 +60,10 @@ void blk_set_default_limits(struct queue_limits *lim)
> > lim->io_opt = 0;
> > lim->misaligned = 0;
> > lim->zoned = BLK_ZONED_NONE;
> > + lim->copy_offload = 0;
> > + lim->max_copy_sectors = 0;
> > + lim->max_copy_nr_ranges = 0;
> > + lim->max_copy_range_sectors = 0;
> > }
> > EXPORT_SYMBOL(blk_set_default_limits);
> >
> > @@ -565,6 +569,12 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
> > if (b->chunk_sectors)
> > t->chunk_sectors = gcd(t->chunk_sectors, b->chunk_sectors);
> >
> > + /* simple copy not supported in stacked devices */
> > + t->copy_offload = 0;
> > + t->max_copy_sectors = 0;
> > + t->max_copy_range_sectors = 0;
> > + t->max_copy_nr_ranges = 0;
> > +
> > /* Physical block size a multiple of the logical block size? */
> > if (t->physical_block_size & (t->logical_block_size - 1)) {
> > t->physical_block_size = t->logical_block_size;
> > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> > index b513f1683af0..51b35a8311d9 100644
> > --- a/block/blk-sysfs.c
> > +++ b/block/blk-sysfs.c
> > @@ -166,6 +166,47 @@ static ssize_t queue_discard_granularity_show(struct request_queue *q, char *pag
> > return queue_var_show(q->limits.discard_granularity, page);
> > }
> >
> > +static ssize_t queue_copy_offload_show(struct request_queue *q, char *page)
> > +{
> > + return queue_var_show(q->limits.copy_offload, page);
> > +}
> > +
> > +static ssize_t queue_copy_offload_store(struct request_queue *q,
> > + const char *page, size_t count)
> > +{
> > + unsigned long copy_offload;
> > + ssize_t ret = queue_var_store(&copy_offload, page, count);
> > +
> > + if (ret < 0)
> > + return ret;
> > +
> > + if (copy_offload < 0 || copy_offload > 1)
>
> err... "copy_offload != 1" ?
The initial thought was to restrict it to either 0 or 1. I'll change the check as suggested.
>
> > + return -EINVAL;
> > +
> > + if (q->limits.max_copy_sectors == 0 && copy_offload == 1)
> > + return -EINVAL;
> > +
> > + q->limits.copy_offload = copy_offload;
> > + return ret;
> > +}
> > +
> > +static ssize_t queue_max_copy_sectors_show(struct request_queue *q, char *page)
> > +{
> > + return queue_var_show(q->limits.max_copy_sectors, page);
> > +}
> > +
> > +static ssize_t queue_max_copy_range_sectors_show(struct request_queue *q,
> > + char *page)
> > +{
> > + return queue_var_show(q->limits.max_copy_range_sectors, page);
> > +}
> > +
> > +static ssize_t queue_max_copy_nr_ranges_show(struct request_queue *q,
> > + char *page)
> > +{
> > + return queue_var_show(q->limits.max_copy_nr_ranges, page);
> > +}
> > +
> > static ssize_t queue_discard_max_hw_show(struct request_queue *q, char *page)
> > {
> >
> > @@ -591,6 +632,11 @@ QUEUE_RO_ENTRY(queue_nr_zones, "nr_zones");
> > QUEUE_RO_ENTRY(queue_max_open_zones, "max_open_zones");
> > QUEUE_RO_ENTRY(queue_max_active_zones, "max_active_zones");
> >
> > +QUEUE_RW_ENTRY(queue_copy_offload, "copy_offload");
> > +QUEUE_RO_ENTRY(queue_max_copy_sectors, "max_copy_sectors");
> > +QUEUE_RO_ENTRY(queue_max_copy_range_sectors, "max_copy_range_sectors");
> > +QUEUE_RO_ENTRY(queue_max_copy_nr_ranges, "max_copy_nr_ranges");
> > +
> > QUEUE_RW_ENTRY(queue_nomerges, "nomerges");
> > QUEUE_RW_ENTRY(queue_rq_affinity, "rq_affinity");
> > QUEUE_RW_ENTRY(queue_poll, "io_poll");
> > @@ -636,6 +682,10 @@ static struct attribute *queue_attrs[] = {
> > &queue_discard_max_entry.attr,
> > &queue_discard_max_hw_entry.attr,
> > &queue_discard_zeroes_data_entry.attr,
> > + &queue_copy_offload_entry.attr,
> > + &queue_max_copy_sectors_entry.attr,
> > + &queue_max_copy_range_sectors_entry.attr,
> > + &queue_max_copy_nr_ranges_entry.attr,
> > &queue_write_same_max_entry.attr,
> > &queue_write_zeroes_max_entry.attr,
> > &queue_zone_append_max_entry.attr,
> > diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> > index 7a68b6e4300c..02069178d51e 100644
> > --- a/block/blk-zoned.c
> > +++ b/block/blk-zoned.c
> > @@ -75,6 +75,7 @@ bool blk_req_needs_zone_write_lock(struct request *rq)
> > case REQ_OP_WRITE_ZEROES:
> > case REQ_OP_WRITE_SAME:
> > case REQ_OP_WRITE:
> > + case REQ_OP_COPY:
> > return blk_rq_zone_is_seq(rq);
> > default:
> > return false;
> > diff --git a/block/bounce.c b/block/bounce.c
> > index d3f51acd6e3b..5e052afe8691 100644
> > --- a/block/bounce.c
> > +++ b/block/bounce.c
> > @@ -254,6 +254,7 @@ static struct bio *bounce_clone_bio(struct bio *bio_src, gfp_t gfp_mask,
> > bio->bi_iter.bi_size = bio_src->bi_iter.bi_size;
> >
> > switch (bio_op(bio)) {
> > + case REQ_OP_COPY:
> > case REQ_OP_DISCARD:
> > case REQ_OP_SECURE_ERASE:
> > case REQ_OP_WRITE_ZEROES:
> > diff --git a/block/ioctl.c b/block/ioctl.c
> > index d61d652078f4..d50b6abe2883 100644
> > --- a/block/ioctl.c
> > +++ b/block/ioctl.c
> > @@ -133,6 +133,47 @@ static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode,
> > GFP_KERNEL, flags);
> > }
> >
> > +static int blk_ioctl_copy(struct block_device *bdev, fmode_t mode,
> > + unsigned long arg)
> > +{
> > + struct copy_range crange;
> > + struct range_entry *rlist;
> > + struct request_queue *q = bdev_get_queue(bdev);
> > + sector_t dest;
> > + int ret;
> > +
> > + if (!(mode & FMODE_WRITE))
> > + return -EBADF;
> > +
> > + if (!blk_queue_copy(q))
> > + return -EOPNOTSUPP;
>
> But you added the read-write emulation. So what is the point here ? This ioctl
> should work for any device, with or without simple copy support. Did you test that ?
>
Sorry. It was a mistake. Will fix this.
> > +
> > + if (copy_from_user(&crange, (void __user *)arg, sizeof(crange)))
> > + return -EFAULT;
> > +
> > + if (crange.dest & ((1 << SECTOR_SHIFT) - 1))
> > + return -EFAULT;
> > + dest = crange.dest;
> > +
> > + rlist = kmalloc_array(crange.nr_range, sizeof(*rlist),
> > + GFP_KERNEL);
> > +
>
> Unnecessary blank line here.
>
> > + if (!rlist)
> > + return -ENOMEM;
> > +
> > + if (copy_from_user(rlist, (void __user *)crange.range_list,
> > + sizeof(*rlist) * crange.nr_range)) {
> > + ret = -EFAULT;
> > + goto out;
> > + }
> > +
> > + ret = blkdev_issue_copy(bdev, crange.nr_range, rlist, bdev, dest,
> > + GFP_KERNEL);
> > +out:
> > + kfree(rlist);
> > + return ret;
> > +}
> > +
> > static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
> > unsigned long arg)
> > {
> > @@ -458,6 +499,8 @@ static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
> > case BLKSECDISCARD:
> > return blk_ioctl_discard(bdev, mode, arg,
> > BLKDEV_DISCARD_SECURE);
> > + case BLKCOPY:
> > + return blk_ioctl_copy(bdev, mode, arg);
> > case BLKZEROOUT:
> > return blk_ioctl_zeroout(bdev, mode, arg);
> > case BLKREPORTZONE:
> > diff --git a/include/linux/bio.h b/include/linux/bio.h
> > index 1edda614f7ce..164313bdfb35 100644
> > --- a/include/linux/bio.h
> > +++ b/include/linux/bio.h
> > @@ -71,6 +71,7 @@ static inline bool bio_has_data(struct bio *bio)
> > static inline bool bio_no_advance_iter(const struct bio *bio)
> > {
> > return bio_op(bio) == REQ_OP_DISCARD ||
> > + bio_op(bio) == REQ_OP_COPY ||
> > bio_op(bio) == REQ_OP_SECURE_ERASE ||
> > bio_op(bio) == REQ_OP_WRITE_SAME ||
> > bio_op(bio) == REQ_OP_WRITE_ZEROES;
> > diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> > index 866f74261b3b..d4d11e9ff814 100644
> > --- a/include/linux/blk_types.h
> > +++ b/include/linux/blk_types.h
> > @@ -380,6 +380,8 @@ enum req_opf {
> > REQ_OP_ZONE_RESET = 15,
> > /* reset all the zone present on the device */
> > REQ_OP_ZONE_RESET_ALL = 17,
> > + /* copy ranges within device */
> > + REQ_OP_COPY = 19,
> >
> > /* SCSI passthrough using struct scsi_request */
> > REQ_OP_SCSI_IN = 32,
> > @@ -506,6 +508,11 @@ static inline bool op_is_discard(unsigned int op)
> > return (op & REQ_OP_MASK) == REQ_OP_DISCARD;
> > }
> >
> > +static inline bool op_is_copy(unsigned int op)
> > +{
> > + return (op & REQ_OP_MASK) == REQ_OP_COPY;
> > +}
> > +
> > /*
> > * Check if a bio or request operation is a zone management operation, with
> > * the exception of REQ_OP_ZONE_RESET_ALL which is treated as a special case
> > @@ -565,4 +572,12 @@ struct blk_rq_stat {
> > u64 batch;
> > };
> >
> > +struct blk_copy_payload {
> > + sector_t dest;
> > + int copy_range;
> > + int copy_size;
> > + int err;
> > + struct range_entry range[];
> > +};
> > +
> > #endif /* __LINUX_BLK_TYPES_H */
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index 81f9e7bec16c..4c7e861e57e4 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -340,10 +340,14 @@ struct queue_limits {
> > unsigned int max_zone_append_sectors;
> > unsigned int discard_granularity;
> > unsigned int discard_alignment;
> > + unsigned int copy_offload;
> > + unsigned int max_copy_sectors;
> >
> > unsigned short max_segments;
> > unsigned short max_integrity_segments;
> > unsigned short max_discard_segments;
> > + unsigned short max_copy_range_sectors;
> > + unsigned short max_copy_nr_ranges;
> >
> > unsigned char misaligned;
> > unsigned char discard_misaligned;
> > @@ -625,6 +629,7 @@ struct request_queue {
> > #define QUEUE_FLAG_RQ_ALLOC_TIME 27 /* record rq->alloc_time_ns */
> > #define QUEUE_FLAG_HCTX_ACTIVE 28 /* at least one blk-mq hctx is active */
> > #define QUEUE_FLAG_NOWAIT 29 /* device supports NOWAIT */
> > +#define QUEUE_FLAG_COPY 30 /* supports copy */
>
> I think this should be called QUEUE_FLAG_SIMPLE_COPY to indicate more precisely
> the type of copy supported. SCSI XCOPY is more advanced...
>
Agreed.
> >
> > #define QUEUE_FLAG_MQ_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \
> > (1 << QUEUE_FLAG_SAME_COMP) | \
> > @@ -647,6 +652,7 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
> > #define blk_queue_io_stat(q) test_bit(QUEUE_FLAG_IO_STAT, &(q)->queue_flags)
> > #define blk_queue_add_random(q) test_bit(QUEUE_FLAG_ADD_RANDOM, &(q)->queue_flags)
> > #define blk_queue_discard(q) test_bit(QUEUE_FLAG_DISCARD, &(q)->queue_flags)
> > +#define blk_queue_copy(q) test_bit(QUEUE_FLAG_COPY, &(q)->queue_flags)
> > #define blk_queue_zone_resetall(q) \
> > test_bit(QUEUE_FLAG_ZONE_RESETALL, &(q)->queue_flags)
> > #define blk_queue_secure_erase(q) \
> > @@ -1061,6 +1067,9 @@ static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
> > return min(q->limits.max_discard_sectors,
> > UINT_MAX >> SECTOR_SHIFT);
> >
> > + if (unlikely(op == REQ_OP_COPY))
> > + return q->limits.max_copy_sectors;
> > +
> > if (unlikely(op == REQ_OP_WRITE_SAME))
> > return q->limits.max_write_same_sectors;
> >
> > @@ -1335,6 +1344,10 @@ extern int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
> > sector_t nr_sects, gfp_t gfp_mask, int flags,
> > struct bio **biop);
> >
> > +extern int blkdev_issue_copy(struct block_device *bdev, int nr_srcs,
> > + struct range_entry *src_rlist, struct block_device *dest_bdev,
> > + sector_t dest, gfp_t gfp_mask);
> > +
> > #define BLKDEV_ZERO_NOUNMAP (1 << 0) /* do not free blocks */
> > #define BLKDEV_ZERO_NOFALLBACK (1 << 1) /* don't write explicit zeroes */
> >
> > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > index f44eb0a04afd..5cadb176317a 100644
> > --- a/include/uapi/linux/fs.h
> > +++ b/include/uapi/linux/fs.h
> > @@ -64,6 +64,18 @@ struct fstrim_range {
> > __u64 minlen;
> > };
> >
> > +struct range_entry {
> > + __u64 src;
> > + __u64 len;
> > +};
> > +
> > +struct copy_range {
> > + __u64 dest;
> > + __u64 nr_range;
> > + __u64 range_list;
> > + __u64 rsvd;
> > +};
> > +
> > /* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
> > #define FILE_DEDUPE_RANGE_SAME 0
> > #define FILE_DEDUPE_RANGE_DIFFERS 1
> > @@ -184,6 +196,7 @@ struct fsxattr {
> > #define BLKSECDISCARD _IO(0x12,125)
> > #define BLKROTATIONAL _IO(0x12,126)
> > #define BLKZEROOUT _IO(0x12,127)
> > +#define BLKCOPY _IOWR(0x12, 128, struct copy_range)
> > /*
> > * A jump here: 130-131 are reserved for zoned block devices
> > * (see uapi/linux/blkzoned.h)
> >
>
>
> --
> Damien Le Moal
> Western Digital Research
Thanks,
Selva
Hi Darrick,
On Tue, Jan 5, 2021 at 12:33 AM Darrick J. Wong <[email protected]> wrote:
>
> SelvaKumar S: This didn't show up on dm-devel, sorry for the OT reply...
>
> On Mon, Jan 04, 2021 at 12:47:11PM +0000, Damien Le Moal wrote:
> > On 2021/01/04 19:48, SelvaKumar S wrote:
> > > Add new BLKCOPY ioctl that offloads copying of one or more sources
> > > ranges to a destination in the device. Accepts copy_ranges that contains
> > > destination, no of sources and pointer to the array of source
> > > ranges. Each range_entry contains start and length of source
> > > ranges (in bytes).
> > >
> > > Introduce REQ_OP_COPY, a no-merge copy offload operation. Create
> > > bio with control information as payload and submit to the device.
> > > REQ_OP_COPY(19) is a write op and takes zone_write_lock when submitted
> > > to zoned device.
> > >
> > > If the device doesn't support copy or copy offload is disabled, then
> > > copy is emulated by allocating memory of total copy size. The source
> > > ranges are read into memory by chaining bio for each source ranges and
> > > submitting them async and the last bio waits for completion. After data
> > > is read, it is written to the destination.
> > >
> > > bio_map_kern() is used to allocate bio and add pages of copy buffer to
> > > bio. As bio->bi_private and bio->bi_end_io is needed for chaining the
> > > bio and over written, invalidate_kernel_vmap_range() for read is called
> > > in the caller.
> > >
> > > Introduce queue limits for simple copy and other helper functions.
> > > Add device limits as sysfs entries.
> > > - copy_offload
> > > - max_copy_sectors
> > > - max_copy_ranges_sectors
> > > - max_copy_nr_ranges
> > >
> > > copy_offload(= 0) is disabled by default.
> > > max_copy_sectors = 0 indicates the device doesn't support native copy.
> > >
> > > Native copy offload is not supported for stacked devices and is done via
> > > copy emulation.
> > >
> > > Signed-off-by: SelvaKumar S <[email protected]>
> > > Signed-off-by: Kanchan Joshi <[email protected]>
> > > Signed-off-by: Nitesh Shetty <[email protected]>
> > > Signed-off-by: Javier González <[email protected]>
> > > ---
> > > block/blk-core.c | 94 ++++++++++++++--
> > > block/blk-lib.c | 223 ++++++++++++++++++++++++++++++++++++++
> > > block/blk-merge.c | 2 +
> > > block/blk-settings.c | 10 ++
> > > block/blk-sysfs.c | 50 +++++++++
> > > block/blk-zoned.c | 1 +
> > > block/bounce.c | 1 +
> > > block/ioctl.c | 43 ++++++++
> > > include/linux/bio.h | 1 +
> > > include/linux/blk_types.h | 15 +++
> > > include/linux/blkdev.h | 13 +++
> > > include/uapi/linux/fs.h | 13 +++
>
> This series should also be cc'd to linux-api since you're adding a new
> userspace api.
>
Sure, will cc linux-api.
>
> Alternately, save yourself the trouble of passing userspace API review
> by hooking this up to the existing copy_file_range call in the vfs. /me
> hopes you sent blktests to check the operation of this thing, since none
> of the original patches made it to this list.
>
As the initial version had only a source bdev, a copy_file_range call was not
viable. With this version, we have two bdevs for source and destination, so
I'll take another look at that. Sure, will add new blktests for simple copy.
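For context, copy_file_range() already gives userspace a two-fd copy interface today. A minimal sketch of driving it is below (regular-file fds here; extending it to raw block devices is what is being discussed, not what exists):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Copy len bytes from in_fd to out_fd via copy_file_range(2), looping
 * because the kernel may copy fewer bytes per call than requested.
 * Returns the total bytes copied, or -1 on error.
 */
static ssize_t do_copy(int in_fd, int out_fd, size_t len)
{
	size_t total = 0;

	while (total < len) {
		ssize_t n = copy_file_range(in_fd, NULL, out_fd, NULL,
					    len - total, 0);
		if (n < 0)
			return -1;
		if (n == 0)		/* EOF on the source */
			break;
		total += n;
	}
	return (ssize_t)total;
}
```

Both fds use their implicit file offsets here (the NULL offset arguments); the kernel picks an offloaded path when the filesystem supports one and falls back to a generic copy otherwise.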
>
> If you really /do/ need a new kernel call for this, then please send in
> a manpage documenting the fields of struct range_entry and copy_range,
> and how users are supposed to use this.
>
Sure, will send manpage documentation once the plumbing is concrete.
> <now jumping to the ioctl definition because Damien already reviewed the
> plumbing...>
>
> > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > > index f44eb0a04afd..5cadb176317a 100644
> > > --- a/include/uapi/linux/fs.h
> > > +++ b/include/uapi/linux/fs.h
> > > @@ -64,6 +64,18 @@ struct fstrim_range {
> > > __u64 minlen;
> > > };
> > >
> > > +struct range_entry {
> > > + __u64 src;
>
> Is this an offset? Or the fd of an open bdev?
This is the source offset.
>
> > > + __u64 len;
> > > +};
> > > +
> > > +struct copy_range {
> > > + __u64 dest;
> > > + __u64 nr_range;
> > > + __u64 range_list;
>
> Hm, I think this is a pointer? Why not just put the range_entry array
> at the end of struct copy_range?
>
As the size of the range_entry array can change dynamically depending on
nr_range, it was difficult to do copy_from_user() on the copy_range structure
in the ioctl. So I decided to keep a pointer to the range_entry array instead
of keeping the array at the end.
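A userspace caller of the proposed ioctl would therefore build the argument like the sketch below (structure layouts copied from this patch; build_copy() is a hypothetical helper, not part of the series):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Layouts as proposed in this patch (include/uapi/linux/fs.h). */
struct range_entry {
	uint64_t src;		/* source offset, in bytes */
	uint64_t len;		/* length, in bytes */
};

struct copy_range {
	uint64_t dest;		/* destination offset, in bytes */
	uint64_t nr_range;	/* number of entries in range_list */
	uint64_t range_list;	/* user pointer to struct range_entry[] */
	uint64_t rsvd;		/* reserved, must be zero */
};

#define BLKCOPY _IOWR(0x12, 128, struct copy_range)

/* Package nr source extents and a destination into a BLKCOPY argument. */
static void build_copy(struct copy_range *cr, const struct range_entry *ranges,
		       uint64_t nr, uint64_t dest)
{
	memset(cr, 0, sizeof(*cr));
	cr->dest = dest;
	cr->nr_range = nr;
	cr->range_list = (uint64_t)(uintptr_t)ranges;
}
```

Actual submission would be `ioctl(fd, BLKCOPY, &cr)` against a block-device fd opened for writing, on a kernel carrying this series.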
> > > + __u64 rsvd;
>
> Also needs a flags argument so we don't have to add BLKCOPY2 in like 3
> months.
>
The 'rsvd' field is kept to support future extensions. Will rename it.
> --D
>
> > > +};
> > > +
> > > /* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
> > > #define FILE_DEDUPE_RANGE_SAME 0
> > > #define FILE_DEDUPE_RANGE_DIFFERS 1
> > > @@ -184,6 +196,7 @@ struct fsxattr {
> > > #define BLKSECDISCARD _IO(0x12,125)
> > > #define BLKROTATIONAL _IO(0x12,126)
> > > #define BLKZEROOUT _IO(0x12,127)
> > > +#define BLKCOPY _IOWR(0x12, 128, struct copy_range)
> > > /*
> > > * A jump here: 130-131 are reserved for zoned block devices
> > > * (see uapi/linux/blkzoned.h)
> > >
> >
> >
> > --
> > Damien Le Moal
> > Western Digital Research
> >
> >
> >
> > --
> > dm-devel mailing list
> > [email protected]
> > https://www.redhat.com/mailman/listinfo/dm-devel
> >
Thanks,
Selva
On 2021/01/05 21:24, Selva Jove wrote:
> Thanks for the review, Damien.
>
> On Mon, Jan 4, 2021 at 6:17 PM Damien Le Moal <[email protected]> wrote:
>>
>> On 2021/01/04 19:48, SelvaKumar S wrote:
>>> Add new BLKCOPY ioctl that offloads copying of one or more sources
>>> ranges to a destination in the device. Accepts copy_ranges that contains
>>> destination, no of sources and pointer to the array of source
>>> ranges. Each range_entry contains start and length of source
>>> ranges (in bytes).
>>>
>>> Introduce REQ_OP_COPY, a no-merge copy offload operation. Create
>>> bio with control information as payload and submit to the device.
>>> REQ_OP_COPY(19) is a write op and takes zone_write_lock when submitted
>>> to zoned device.
>>>
>>> If the device doesn't support copy or copy offload is disabled, then
>>> copy is emulated by allocating memory of total copy size. The source
>>> ranges are read into memory by chaining bio for each source ranges and
>>> submitting them async and the last bio waits for completion. After data
>>> is read, it is written to the destination.
>>>
>>> bio_map_kern() is used to allocate bio and add pages of copy buffer to
>>> bio. As bio->bi_private and bio->bi_end_io is needed for chaining the
>>> bio and over written, invalidate_kernel_vmap_range() for read is called
>>> in the caller.
>>>
>>> Introduce queue limits for simple copy and other helper functions.
>>> Add device limits as sysfs entries.
>>> - copy_offload
>>> - max_copy_sectors
>>> - max_copy_ranges_sectors
>>> - max_copy_nr_ranges
>>>
>>> copy_offload(= 0) is disabled by default.
>>> max_copy_sectors = 0 indicates the device doesn't support native copy.
>>>
>>> Native copy offload is not supported for stacked devices and is done via
>>> copy emulation.
>>>
>>> Signed-off-by: SelvaKumar S <[email protected]>
>>> Signed-off-by: Kanchan Joshi <[email protected]>
>>> Signed-off-by: Nitesh Shetty <[email protected]>
>>> Signed-off-by: Javier González <[email protected]>
>>> ---
>>> block/blk-core.c | 94 ++++++++++++++--
>>> block/blk-lib.c | 223 ++++++++++++++++++++++++++++++++++++++
>>> block/blk-merge.c | 2 +
>>> block/blk-settings.c | 10 ++
>>> block/blk-sysfs.c | 50 +++++++++
>>> block/blk-zoned.c | 1 +
>>> block/bounce.c | 1 +
>>> block/ioctl.c | 43 ++++++++
>>> include/linux/bio.h | 1 +
>>> include/linux/blk_types.h | 15 +++
>>> include/linux/blkdev.h | 13 +++
>>> include/uapi/linux/fs.h | 13 +++
>>> 12 files changed, 458 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index 96e5fcd7f071..4a5cd3f53cd2 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -719,6 +719,17 @@ static noinline int should_fail_bio(struct bio *bio)
>>> }
>>> ALLOW_ERROR_INJECTION(should_fail_bio, ERRNO);
>>>
>>> +static inline int bio_check_copy_eod(struct bio *bio, sector_t start,
>>> + sector_t nr_sectors, sector_t maxsector)
>>> +{
>>> + if (nr_sectors && maxsector &&
>>> + (nr_sectors > maxsector || start > maxsector - nr_sectors)) {
>>> + handle_bad_sector(bio, maxsector);
>>> + return -EIO;
>>> + }
>>> + return 0;
>>> +}
>>> +
>>> /*
>>> * Check whether this bio extends beyond the end of the device or partition.
>>> * This may well happen - the kernel calls bread() without checking the size of
>>> @@ -737,6 +748,65 @@ static inline int bio_check_eod(struct bio *bio, sector_t maxsector)
>>> return 0;
>>> }
>>>
>>> +/*
>>> + * Check for copy limits and remap source ranges if needed.
>>> + */
>>> +static int blk_check_copy(struct bio *bio)
>>> +{
>>> + struct block_device *p = NULL;
>>> + struct request_queue *q = bio->bi_disk->queue;
>>> + struct blk_copy_payload *payload;
>>> + int i, maxsector, start_sect = 0, ret = -EIO;
>>> + unsigned short nr_range;
>>> +
>>> + rcu_read_lock();
>>> +
>>> + p = __disk_get_part(bio->bi_disk, bio->bi_partno);
>>> + if (unlikely(!p))
>>> + goto out;
>>> + if (unlikely(should_fail_request(p, bio->bi_iter.bi_size)))
>>> + goto out;
>>> + if (unlikely(bio_check_ro(bio, p)))
>>> + goto out;
>>> +
>>> + maxsector = bdev_nr_sectors(p);
>>> + start_sect = p->bd_start_sect;
>>> +
>>> + payload = bio_data(bio);
>>> + nr_range = payload->copy_range;
>>> +
>>> + /* cannot handle copy crossing nr_ranges limit */
>>> + if (payload->copy_range > q->limits.max_copy_nr_ranges)
>>> + goto out;
>>
>> If payload->copy_range indicates the number of ranges to be copied, you should
>> name it payload->copy_nr_ranges.
>>
>
> Agreed. Will rename the entries.
>
>>> +
>>> + /* cannot handle copy more than copy limits */
>>
>> Why ? You could loop... Similarly to discard, zone reset etc.
>>
>
> Sure. Will add support for handling copies larger than the device limits.
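That loop could mirror what __blkdev_issue_discard() does for oversized discards. Below is a pure-logic sketch of batching source ranges under the two queue limits (hypothetical helper name, sector units, userspace C purely for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct src_range {
	uint64_t src;	/* start, in 512B sectors */
	uint64_t len;	/* length, in 512B sectors */
};

/*
 * Count the copy commands needed to cover all source ranges without
 * exceeding max_sectors total per command or max_nr_ranges entries per
 * command. Over-long ranges are split into max_sectors-sized chunks.
 */
static size_t copy_cmds_needed(const struct src_range *r, size_t nr,
			       uint64_t max_sectors, uint32_t max_nr_ranges)
{
	size_t cmds = 0, cur_nr = 0;
	uint64_t cur_sectors = 0;

	if (!max_sectors || !max_nr_ranges)
		return 0;	/* device does not support copy */

	for (size_t i = 0; i < nr; i++) {
		uint64_t left = r[i].len;

		while (left) {
			uint64_t chunk = left < max_sectors ? left : max_sectors;

			if (cur_nr == max_nr_ranges ||
			    cur_sectors + chunk > max_sectors) {
				cmds++;	/* flush the current command */
				cur_nr = 0;
				cur_sectors = 0;
				continue;
			}
			cur_nr++;
			cur_sectors += chunk;
			left -= chunk;
		}
	}
	return cur_nr ? cmds + 1 : cmds;
}
```

The real code would issue one payload per batch instead of counting them, but the batching boundaries are the same.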
>
>>
>>> + if (payload->copy_size > q->limits.max_copy_sectors)
>>> + goto out;
>>
>> Shouldn't the list of source ranges give the total size to be copied ?
>> Otherwise, if payload->copy_size is user provided, you would need to check that
>> the sum of the source ranges actually is equal to copy_size, no ? That means
>> that dropping copy_size sound like the right thing to do here.
>>
>
> payload->copy_size is used in copy emulation to allocate the buffer. Let me
> check one more time if it is possible to do without it.
If this is used for the emulation only, then it should be a local variable in
the emulation helper.
[...]
>>
>>> + if (unlikely(blk_check_copy(bio)))
>>> + goto end_io;
>>> + break;
>>> case REQ_OP_SECURE_ERASE:
>>> if (!blk_queue_secure_erase(q))
>>> goto not_supported;
>>> diff --git a/block/blk-lib.c b/block/blk-lib.c
>>> index 752f9c722062..4c0f12e2ed7c 100644
>>> --- a/block/blk-lib.c
>>> +++ b/block/blk-lib.c
>>> @@ -150,6 +150,229 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>>> }
>>> EXPORT_SYMBOL(blkdev_issue_discard);
>>>
>>> +int blk_copy_offload(struct block_device *bdev, struct blk_copy_payload *payload,
>>> + sector_t dest, gfp_t gfp_mask)
>>> +{
>>> + struct bio *bio;
>>> + int ret, total_size;
>>> +
>>> + bio = bio_alloc(gfp_mask, 1);
>>> + bio->bi_iter.bi_sector = dest;
>>> + bio->bi_opf = REQ_OP_COPY | REQ_NOMERGE;
>>> + bio_set_dev(bio, bdev);
>>> +
>>> + total_size = struct_size(payload, range, payload->copy_range);
>>> + ret = bio_add_page(bio, virt_to_page(payload), total_size,
>>
>> How is payload allocated ? If it is a structure on-stack in the caller, I am not
>> sure it would be wise to do an IO using the thread stack page...
>>
>>> + offset_in_page(payload));
>>> + if (ret != total_size) {
>>> + ret = -ENOMEM;
>>> + bio_put(bio);
>>> + goto err;
>>> + }
>>> +
>>> + ret = submit_bio_wait(bio);
>>> +err:
>>> + bio_put(bio);
>>> + return ret;
>>> +
>>> +}
>>> +
>>> +int blk_read_to_buf(struct block_device *bdev, struct blk_copy_payload *payload,
>>> + gfp_t gfp_mask, void **buf_p)
>>> +{
>>> + struct request_queue *q = bdev_get_queue(bdev);
>>> + struct bio *bio, *parent = NULL;
>>> + void *buf = NULL;
>>> + bool is_vmalloc;
>>> + int i, nr_srcs, copy_len, ret, cur_size, t_len = 0;
>>> +
>>> + nr_srcs = payload->copy_range;
>>> + copy_len = payload->copy_size << SECTOR_SHIFT;
>>> +
>>> + buf = kvmalloc(copy_len, gfp_mask);
>>> + if (!buf)
>>> + return -ENOMEM;
>>> + is_vmalloc = is_vmalloc_addr(buf);
>>> +
>>> + for (i = 0; i < nr_srcs; i++) {
>>> + cur_size = payload->range[i].len << SECTOR_SHIFT;
>>> +
>>> + bio = bio_map_kern(q, buf + t_len, cur_size, gfp_mask);
>>> + if (IS_ERR(bio)) {
>>> + ret = PTR_ERR(bio);
>>> + goto out;
>>> + }
>>> +
>>> + bio->bi_iter.bi_sector = payload->range[i].src;
>>> + bio->bi_opf = REQ_OP_READ;
>>> + bio_set_dev(bio, bdev);
>>> + bio->bi_end_io = NULL;
>>> + bio->bi_private = NULL;
>>> +
>>> + if (parent) {
>>> + bio_chain(parent, bio);
>>> + submit_bio(parent);
>>> + }
>>> +
>>> + parent = bio;
>>> + t_len += cur_size;
>>> + }
>>> +
>>> + ret = submit_bio_wait(bio);
>>> + bio_put(bio);
>>> + if (is_vmalloc)
>>> + invalidate_kernel_vmap_range(buf, copy_len);
>>> + if (ret)
>>> + goto out;
>>> +
>>> + *buf_p = buf;
>>> + return 0;
>>> +out:
>>> + kvfree(buf);
>>> + return ret;
>>> +}
>>> +
>>> +int blk_write_from_buf(struct block_device *bdev, void *buf, sector_t dest,
>>> + int copy_len, gfp_t gfp_mask)
>>> +{
>>> + struct request_queue *q = bdev_get_queue(bdev);
>>> + struct bio *bio;
>>> + int ret;
>>> +
>>> + bio = bio_map_kern(q, buf, copy_len, gfp_mask);
>>> + if (IS_ERR(bio)) {
>>> + ret = PTR_ERR(bio);
>>> + goto out;
>>> + }
>>> + bio_set_dev(bio, bdev);
>>> + bio->bi_opf = REQ_OP_WRITE;
>>> + bio->bi_iter.bi_sector = dest;
>>> +
>>> + bio->bi_end_io = NULL;
>>> + ret = submit_bio_wait(bio);
>>> + bio_put(bio);
>>> +out:
>>> + return ret;
>>> +}
>>> +
>>> +int blk_prepare_payload(struct block_device *bdev, int nr_srcs, struct range_entry *rlist,
>>> + gfp_t gfp_mask, struct blk_copy_payload **payload_p)
>>> +{
>>> +
>>> + struct request_queue *q = bdev_get_queue(bdev);
>>> + struct blk_copy_payload *payload;
>>> + sector_t bs_mask;
>>> + sector_t src_sects, len = 0, total_len = 0;
>>> + int i, ret, total_size;
>>> +
>>> + if (!q)
>>> + return -ENXIO;
>>> +
>>> + if (!nr_srcs)
>>> + return -EINVAL;
>>> +
>>> + if (bdev_read_only(bdev))
>>> + return -EPERM;
>>> +
>>> + bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
>>> +
>>> + total_size = struct_size(payload, range, nr_srcs);
>>> + payload = kmalloc(total_size, gfp_mask);
>>> + if (!payload)
>>> + return -ENOMEM;
>>
>> OK. So this is what goes into the bio. The function blk_copy_offload() assumes
>> this is at most one page, so total_size <= PAGE_SIZE. Where is that checked ?
>>
>
> Correct me if I'm wrong: as the payload was allocated via kmalloc, it would be
> represented by a single contiguous segment. In case it crosses one page, we
> rely on multi-page bio_vecs to cover it.
That is not how I understand the code. If the allocation is more than one page,
you still need to add ALL pages to the BIO. What the multi-page bvec code will
do is to use a single bvec for all physically contiguous pages instead of adding
one bvec per page.
Thinking more about it, I think the function blk_copy_offload() could simply use
bio_map_kern() to allocate and prepare the BIO. That will avoid the need for the
add page loop.
>
>>> +
>>> + for (i = 0; i < nr_srcs; i++) {
>>> + /* copy payload provided are in bytes */
>>> + src_sects = rlist[i].src;
>>> + if (src_sects & bs_mask) {
>>> + ret = -EINVAL;
>>> + goto err;
>>> + }
>>> + src_sects = src_sects >> SECTOR_SHIFT;
>>> +
>>> + if (len & bs_mask) {
>>> + ret = -EINVAL;
>>> + goto err;
>>> + }
>>> +
>>> + len = rlist[i].len >> SECTOR_SHIFT;
>>> +
>>> + total_len += len;
>>> +
>>> + WARN_ON_ONCE((src_sects << 9) > UINT_MAX);
>>> +
>>> + payload->range[i].src = src_sects;
>>> + payload->range[i].len = len;
>>> + }
>>> +
>>> + /* storing # of source ranges */
>>> + payload->copy_range = i;
>>> + /* storing copy len so far */
>>> + payload->copy_size = total_len;
>>
>> The comments here make the code ugly. Rename the variables and structure fields
>> to have something self explanatory.
>>
>
> Agreed.
>
>>> +
>>> + *payload_p = payload;
>>> + return 0;
>>> +err:
>>> + kfree(payload);
>>> + return ret;
>>> +}
>>> +
>>> +/**
>>> + * blkdev_issue_copy - queue a copy
>>> + * @bdev: source block device
>>> + * @nr_srcs: number of source ranges to copy
>>> + * @rlist: array of source ranges (in bytes)
>>> + * @dest_bdev: destination block device
>>> + * @dest: destination (in bytes)
>>> + * @gfp_mask: memory allocation flags (for bio_alloc)
>>> + *
>>> + * Description:
>>> + * Copy array of source ranges from source block device to
>>> + * destination block device.
>>> + */
>>> +
>>> +
>>> +int blkdev_issue_copy(struct block_device *bdev, int nr_srcs,
>>> + struct range_entry *src_rlist, struct block_device *dest_bdev,
>>> + sector_t dest, gfp_t gfp_mask)
>>> +{
>>> + struct request_queue *q = bdev_get_queue(bdev);
>>> + struct blk_copy_payload *payload;
>>> + sector_t bs_mask, dest_sect;
>>> + void *buf = NULL;
>>> + int ret;
>>> +
>>> + ret = blk_prepare_payload(bdev, nr_srcs, src_rlist, gfp_mask, &payload);
>>> + if (ret)
>>> + return ret;
>>> +
>>> + bs_mask = (bdev_logical_block_size(dest_bdev) >> 9) - 1;
>>> + if (dest & bs_mask) {
>>> + return -EINVAL;
>>> + goto out;
>>> + }
>>> + dest_sect = dest >> SECTOR_SHIFT;
>>
>> dest should already be a 512B sector as all block layer functions interface use
>> this unit. What is this doing ?
>>
>
> As the source ranges and lengths received were in bytes, dest was also chosen
> to be in bytes to keep things consistent. That seems wrong; will change it to
> the 512B sector unit.
Using a byte interface seems very dangerous since writes can only be at best LBA
sized, and must be physical-sector-size aligned for zoned block devices. I
strongly suggest that the interface use the sector_t 512B unit.
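The check-and-convert step a byte interface forces on every offset can be factored as below (a sketch of the same arithmetic the patch performs, not the patch's actual helper); with a sector_t interface both the mask check and the shift disappear from the caller:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SHIFT 9

/*
 * Convert a byte offset to a 512B sector, failing when the offset is
 * not aligned to the device's logical block size (lbs, a power of two).
 */
static bool bytes_to_sector(uint64_t bytes, uint32_t lbs, uint64_t *sect)
{
	if (bytes & (uint64_t)(lbs - 1))
		return false;
	*sect = bytes >> SECTOR_SHIFT;
	return true;
}
```

Note that masking against `lbs - 1` in bytes also rejects offsets that are not even 512B multiples, which a sector-granular mask alone would miss.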
>
>>
>>> +
>>> + if (bdev == dest_bdev && q->limits.copy_offload) {
>>> + ret = blk_copy_offload(bdev, payload, dest_sect, gfp_mask);
>>> + if (ret)
>>> + goto out;
>>> + } else {
>>> + ret = blk_read_to_buf(bdev, payload, gfp_mask, &buf);
>>> + if (ret)
>>> + goto out;
>>> + ret = blk_write_from_buf(dest_bdev, buf, dest_sect,
>>> + payload->copy_size << SECTOR_SHIFT, gfp_mask);
>>
>> Arg... This is really not pretty. At the very least, this should all be in a
>> blk_copy_emulate() helper or something named like that.
>>
>
> Okay. Will move this to a helper.
>
>> My worry though is that this likely slow for large copies, not to mention that
>> buf is likely to fail to allocate for large copy cases. As commented previously,
>> dm-kcopyd already does such copy well, with a read-write streaming pipeline and
>> zone support too for the write side. This should really be reused, at least
>> partly, instead of re-implementing this read-write here.
>>
>
> I was a bit hesitant to use DM-layer code in the block layer. It makes sense
> to reuse the code as much as possible, so I will try to reuse the dm-kcopyd
> code for the copy emulation part.
You missed my point. I never suggested that you use DM code in the block layer.
That would be wrong. What I suggested is that you move the dm-kcopyd code from
DM into the block layer, changing the function names to blk_copy_xxx(). See the
current interface in include/linux/dm-kcopyd.h: there is absolutely nothing that
is DM specific in there, so moving that code into block/blk-copy.c should be
straightforward, mostly.
The way I see it, your series should look something like this:
1) A prep patch that moves dm-kcopyd to the block layer, changing the API names
and converting all DM driver users to the new API. This may be a very large
patch though, so splitting it into multiple pieces may be required to facilitate
review.
2) A prep patch that introduces queue flags for devices to advertize their
support for simple copy and a generic api for simple copy BIOs
3) A patch that adds the use of simple copy BIO into the moved dm-kcopyd code
4) The NVMe driver bits that probably will look like what you have now
With this, all DM drivers that currently use dm-kcopyd (RAID and dm-zoned at
least) will get free offload using simple copy commands for sector copies within
the same device. All other copy cases will still work as per kcopyd code. That
is very neat I think.
And you can add one more patch that use this generic copy block interface to
implement copy_file_range for raw block devices as Darrick suggested.
[...]
>>> +static ssize_t queue_copy_offload_store(struct request_queue *q,
>>> + const char *page, size_t count)
>>> +{
>>> + unsigned long copy_offload;
>>> + ssize_t ret = queue_var_store(&copy_offload, page, count);
>>> +
>>> + if (ret < 0)
>>> + return ret;
>>> +
>>> + if (copy_offload < 0 || copy_offload > 1)
>>
>> err... "copy_offload != 1" ?
>
> Initial thought was to keep it to either 0 or 1. I'll change the check as
> suggested.
If you use 0 and 1, then make copy_offload a bool. That is more natural given
the variable name.
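With a bool, the store side would lean on the kernel's kstrtobool() helper. A userspace stand-in for the subset of inputs it accepts (the kernel helper also takes "on"/"off" and case variants; this sketch only mimics the 0/1/y/n forms):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Minimal stand-in for the kernel's kstrtobool(): map "1"/"y" to true
 * and "0"/"n" to false, rejecting everything else with -EINVAL (-22).
 */
static int parse_bool(const char *s, bool *res)
{
	if (!strcmp(s, "1") || !strcmp(s, "y")) {
		*res = true;
		return 0;
	}
	if (!strcmp(s, "0") || !strcmp(s, "n")) {
		*res = false;
		return 0;
	}
	return -22;
}
```

This avoids the signed-comparison oddity of range-checking an unsigned long against 0 and 1 by hand.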
[...]
>>> +static int blk_ioctl_copy(struct block_device *bdev, fmode_t mode,
>>> + unsigned long arg)
>>> +{
>>> + struct copy_range crange;
>>> + struct range_entry *rlist;
>>> + struct request_queue *q = bdev_get_queue(bdev);
>>> + sector_t dest;
>>> + int ret;
>>> +
>>> + if (!(mode & FMODE_WRITE))
>>> + return -EBADF;
>>> +
>>> + if (!blk_queue_copy(q))
>>> + return -EOPNOTSUPP;
>>
>> But you added the read-write emulation. So what is the point here ? This ioctl
>> should work for any device, with or without simple copy support. Did you test that ?
>>
>
> Sorry. It was a mistake. Will fix this.
Please make sure to test to catch such mistakes before sending. It does sound
like your read-write manual copy has not been tested.
--
Damien Le Moal
Western Digital Research