2023-01-12 15:14:00

by Nitesh Shetty

Subject: [PATCH v6 0/9] Implement copy offload support

This patch series covers the points discussed in the November 2021 virtual
call [LSF/MM/BPF TOPIC] Storage: Copy Offload [0].
We have covered the initially agreed requirements in this patchset and
further additional features suggested by the community.
The patchset borrows Mikulas's token-based approach for the two-bdev
implementation.

This is on top of our previous patchset v5[1].

Overall series supports:
========================
1. Driver
- NVMe Copy command (single NS, TP 4065), including support
in nvme-target (for block and file backends).

2. Block layer
- Block-generic copy (REQ_COPY flag), with an interface
accommodating two block devices and multiple
source/destination ranges
- Emulation, when offload is natively absent
- dm-linear support (for cases not requiring split)

3. User-interface
- new ioctl

4. In-kernel user
- dm-kcopyd (a usage sketch follows below)
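
As a rough illustration of the in-kernel interface, below is a minimal
sketch of a blkdev_issue_copy() call based on the prototype added in this
series. The helper names, range values and completion callback are
illustrative only and not part of the series; offsets and lengths are in
bytes.

#include <linux/blkdev.h>

/*
 * Hypothetical completion callback; signature inferred from the call
 * sites in this series (private pointer plus error code).
 */
static void example_copy_endio(void *private, int error)
{
	pr_info("copy completed, err=%d\n", error);
}

static int example_copy(struct block_device *src_bdev,
			struct block_device *dst_bdev)
{
	/* One range; comp_len is filled in with the completed length. */
	struct range_entry range = {
		.src = 0,
		.dst = 1 << 20,
		.len = 1 << 20,
	};

	/*
	 * Passing an end_io callback makes the call asynchronous;
	 * passing NULL (as the BLKCOPY ioctl does) waits for completion.
	 */
	return blkdev_issue_copy(src_bdev, dst_bdev, &range, 1,
				 example_copy_endio, NULL, GFP_KERNEL);
}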

Testing
=======
Copy offload can be tested on:
a. QEMU: NVMe simple copy (TP 4065), by setting the nvme-ns
parameters mssrl, mcl and msrc. For more info see [2].
b. Fabrics loopback.
c. blktests[3] (tests block/032,033, nvme/046,047,048,049)

Emulation can be tested on any device.

A sample application using the ioctl is present in the patch description.
A fio[4] fork with copy offload support can also be used.

Performance
===========
With the asynchronous design of copy emulation/offload, using fio[4]
we were able to see the following improvements compared to
userspace read and write on an NVMe-oF TCP setup:
Setup1: Network Speed: 1000Mb/s
Host PC: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
Target PC: AMD Ryzen 9 5900X 12-Core Processor
block size 8k, range 1:
635% improvement in IO BW (107 MiB/s to 787 MiB/s).
Network utilisation drops from 97% to 14%.
block-size 2M, range 16:
2555% improvement in IO BW (100 MiB/s to 2655 MiB/s).
Network utilisation drops from 89% to 0.62%.
Setup2: Network Speed: 100Gb/s
Server: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz, 72 cores
(host and target have the same configuration)
block-size 8k, range 1:
6.5% improvement in IO BW (791 MiB/s to 843 MiB/s).
Network utilisation drops from 6.75% to 0.14%.
block-size 2M, range 16:
15% improvement in IO BW (1027 MiB/s to 1183 MiB/s).
Network utilisation drops from 8.42% to ~0%.
block-size 8k, 8 ranges:
18% drop in IO BW (from 798 MiB/s to 647 MiB/s)
Network utilisation drops from 6.66% to 0.13%.

At present we see a drop in performance for bs 8k, 16k with higher
range counts (8, 16), so there is something more to investigate there.
Overall, in these tests, kernel copy emulation performs better than
userspace read+write.

Blktests[3]
======================
tests/block/032,033: Run copy offload and emulation on a block device.
tests/nvme/046,047,048,049: Create a loop-backed fabrics device and
run copy offload and emulation.

Future Work
===========
- nullblk: copy-offload emulation.
- generic copy_file_range (CFR):
We explored the possibility of using the block device
def_blk_ops, but we saw a major disadvantage for in-kernel
users: an fd is not available to in-kernel users [5].
- loopback device copy offload support
- upstream fio to use copy offload

These are to be taken up after we reach consensus on the
plumbing of current elements that are part of this series.


Additional links:
=================
[0] https://lore.kernel.org/linux-nvme/CA+1E3rJ7BZ7LjQXXTdX+-0Edz=zT14mmPGMiVCzUgB33C60tbQ@mail.gmail.com/
[1] https://lore.kernel.org/lkml/20221130041450.GA17533@test-zns/T/
[2] https://qemu-project.gitlab.io/qemu/system/devices/nvme.html#simple-copy
[3] https://github.com/nitesh-shetty/blktests/tree/feat/copy_offload/v6
[4] https://github.com/vincentkfu/fio/tree/copyoffload
[5] https://lore.kernel.org/lkml/20221130041450.GA17533@test-zns/T/#m0e2754202fc2223e937c8e7ba3cf7336a93f97a3

Changes since v5:
=================
- Addition of blktests (Chaitanya Kulkarni)
- Minor fix for fabrics file backed path
- Remove buggy zonefs copy file range implementation.

Changes since v4:
=================
- make the offload and emulation design asynchronous (Hannes
Reinecke)
- fabrics loopback support
- sysfs naming improvements (Damien Le Moal)
- use kfree() instead of kvfree() in cio_await_completion
(Damien Le Moal)
- use ranges instead of rlist to represent range_entry (Damien
Le Moal)
- change argument ordering in blk_copy_offload as suggested (Damien
Le Moal)
- removed multiple copy limit and merged into only one limit
(Damien Le Moal)
- wrap overly long lines (Damien Le Moal)
- other naming improvements and cleanups (Damien Le Moal)
- correctly format the code example in description (Damien Le
Moal)
- mark blk_copy_offload as static (kernel test robot)

Changes since v3:
=================
- added copy_file_range support for zonefs
- added documentation about new sysfs entries
- incorporated review comments on v3
- minor fixes

Changes since v2:
=================
- fixed possible race condition reported by Damien Le Moal
- new sysfs controls as suggested by Damien Le Moal
- fixed possible memory leak reported by Dan Carpenter, lkp
- minor fixes

Nitesh Shetty (9):
block: Introduce queue limits for copy-offload support
block: Add copy offload support infrastructure
block: add emulation for copy
block: Introduce a new ioctl for copy
nvme: add copy offload support
nvmet: add copy command support for bdev and file ns
dm: Add support for copy offload.
dm: Enable copy offload for dm-linear target
dm kcopyd: use copy offload support

Documentation/ABI/stable/sysfs-block | 36 ++
block/blk-lib.c | 597 +++++++++++++++++++++++++++
block/blk-map.c | 4 +-
block/blk-settings.c | 24 ++
block/blk-sysfs.c | 64 +++
block/blk.h | 2 +
block/ioctl.c | 36 ++
drivers/md/dm-kcopyd.c | 56 ++-
drivers/md/dm-linear.c | 1 +
drivers/md/dm-table.c | 42 ++
drivers/md/dm.c | 7 +
drivers/nvme/host/constants.c | 1 +
drivers/nvme/host/core.c | 106 ++++-
drivers/nvme/host/fc.c | 5 +
drivers/nvme/host/nvme.h | 7 +
drivers/nvme/host/pci.c | 27 +-
drivers/nvme/host/rdma.c | 7 +
drivers/nvme/host/tcp.c | 16 +
drivers/nvme/host/trace.c | 19 +
drivers/nvme/target/admin-cmd.c | 9 +-
drivers/nvme/target/io-cmd-bdev.c | 79 ++++
drivers/nvme/target/io-cmd-file.c | 52 +++
drivers/nvme/target/loop.c | 6 +
drivers/nvme/target/nvmet.h | 2 +
include/linux/blk_types.h | 44 ++
include/linux/blkdev.h | 18 +
include/linux/device-mapper.h | 5 +
include/linux/nvme.h | 43 +-
include/uapi/linux/fs.h | 27 ++
29 files changed, 1324 insertions(+), 18 deletions(-)


base-commit: 469a89fd3bb73bb2eea628da2b3e0f695f80b7ce
--
2.35.1.500.gb896f729e2


2023-01-12 15:16:49

by Nitesh Shetty

Subject: [PATCH v6 3/9] block: add emulation for copy

Copy emulation is added for devices which do not support copy. It is
implemented by reading from the source ranges into memory and writing
to the corresponding destinations asynchronously.
For zoned devices we maintain a linked list of read submissions and try
to submit the corresponding writes in the same order.
Emulation is also used if copy offload fails or partially completes.

Signed-off-by: Nitesh Shetty <[email protected]>
Signed-off-by: Vincent Fu <[email protected]>
Signed-off-by: Anuj Gupta <[email protected]>
---
block/blk-lib.c | 241 ++++++++++++++++++++++++++++++++++++++++-
block/blk-map.c | 4 +-
include/linux/blkdev.h | 3 +
3 files changed, 245 insertions(+), 3 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 2ce3c872ca49..43b1d0ef5732 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -428,6 +428,239 @@ static inline int blk_copy_sanity_check(struct block_device *src_bdev,
return 0;
}

+static void *blk_alloc_buf(sector_t req_size, sector_t *alloc_size,
+ gfp_t gfp_mask)
+{
+ int min_size = PAGE_SIZE;
+ void *buf;
+
+ while (req_size >= min_size) {
+ buf = kvmalloc(req_size, gfp_mask);
+ if (buf) {
+ *alloc_size = req_size;
+ return buf;
+ }
+ /* retry half the requested size */
+ req_size >>= 1;
+ }
+
+ return NULL;
+}
+
+static void blk_copy_emulate_write_end_io(struct bio *bio)
+{
+ struct copy_ctx *ctx = bio->bi_private;
+ struct cio *cio = ctx->cio;
+ sector_t clen;
+ int ri = ctx->range_idx;
+
+ if (bio->bi_status) {
+ cio->io_err = blk_status_to_errno(bio->bi_status);
+ clen = (bio->bi_iter.bi_sector << SECTOR_SHIFT) -
+ cio->ranges[ri].dst;
+ cio->ranges[ri].comp_len = min_t(sector_t, clen,
+ cio->ranges[ri].comp_len);
+ }
+ kvfree(page_address(bio->bi_io_vec[0].bv_page));
+ bio_map_kern_endio(bio);
+ if (atomic_dec_and_test(&ctx->refcount))
+ kfree(ctx);
+ if (atomic_dec_and_test(&cio->refcount)) {
+ if (cio->endio) {
+ cio->endio(cio->private, cio->io_err);
+ kfree(cio);
+ } else
+ blk_wake_io_task(cio->waiter);
+ }
+}
+
+static void blk_copy_emulate_read_end_io(struct bio *read_bio)
+{
+ struct copy_ctx *ctx = read_bio->bi_private;
+ struct cio *cio = ctx->cio;
+ sector_t clen;
+ int ri = ctx->range_idx;
+ unsigned long flags;
+
+ if (read_bio->bi_status) {
+ cio->io_err = blk_status_to_errno(read_bio->bi_status);
+ goto err_rw_bio;
+ }
+
+ /* For zoned device, we check if completed bio is first entry in linked
+ * list,
+ * if yes, we start the worker to submit write bios.
+ * if not, then we just update status of bio in ctx,
+ * once the worker gets scheduled, it will submit writes for all
+ * the consecutive REQ_COPY_READ_COMPLETE bios.
+ */
+ if (bdev_is_zoned(ctx->write_bio->bi_bdev)) {
+ spin_lock_irqsave(&cio->list_lock, flags);
+ ctx->status = REQ_COPY_READ_COMPLETE;
+ if (ctx == list_first_entry(&cio->list,
+ struct copy_ctx, list)) {
+ spin_unlock_irqrestore(&cio->list_lock, flags);
+ schedule_work(&ctx->dispatch_work);
+ goto free_read_bio;
+ }
+ spin_unlock_irqrestore(&cio->list_lock, flags);
+ } else
+ schedule_work(&ctx->dispatch_work);
+
+free_read_bio:
+ kfree(read_bio);
+
+ return;
+
+err_rw_bio:
+ clen = (read_bio->bi_iter.bi_sector << SECTOR_SHIFT) -
+ cio->ranges[ri].src;
+ cio->ranges[ri].comp_len = min_t(sector_t, clen,
+ cio->ranges[ri].comp_len);
+ __free_page(read_bio->bi_io_vec[0].bv_page);
+ bio_map_kern_endio(read_bio);
+ if (atomic_dec_and_test(&ctx->refcount))
+ kfree(ctx);
+ if (atomic_dec_and_test(&cio->refcount)) {
+ if (cio->endio) {
+ cio->endio(cio->private, cio->io_err);
+ kfree(cio);
+ } else
+ blk_wake_io_task(cio->waiter);
+ }
+}
+
+/*
+ * If native copy offload feature is absent, this function tries to emulate,
+ * by copying data from source to a temporary buffer and from buffer to
+ * destination device.
+ */
+static int blk_copy_emulate(struct block_device *src_bdev,
+ struct block_device *dst_bdev, struct range_entry *ranges,
+ int nr, cio_iodone_t end_io, void *private, gfp_t gfp_mask)
+{
+ struct request_queue *sq = bdev_get_queue(src_bdev);
+ struct request_queue *dq = bdev_get_queue(dst_bdev);
+ struct bio *read_bio, *write_bio;
+ void *buf = NULL;
+ struct copy_ctx *ctx;
+ struct cio *cio;
+ sector_t src, dst, offset, buf_len, req_len, rem = 0;
+ int ri = 0, ret = 0;
+ unsigned long flags;
+ sector_t max_src_hw_len = min_t(unsigned int, queue_max_hw_sectors(sq),
+ queue_max_segments(sq) << (PAGE_SHIFT - SECTOR_SHIFT))
+ << SECTOR_SHIFT;
+ sector_t max_dst_hw_len = min_t(unsigned int, queue_max_hw_sectors(dq),
+ queue_max_segments(dq) << (PAGE_SHIFT - SECTOR_SHIFT))
+ << SECTOR_SHIFT;
+ sector_t max_hw_len = min_t(unsigned int,
+ max_src_hw_len, max_dst_hw_len);
+
+ cio = kzalloc(sizeof(struct cio), GFP_KERNEL);
+ if (!cio)
+ return -ENOMEM;
+ cio->ranges = ranges;
+ atomic_set(&cio->refcount, 1);
+ cio->waiter = current;
+ cio->endio = end_io;
+ cio->private = private;
+
+ if (bdev_is_zoned(dst_bdev)) {
+ INIT_LIST_HEAD(&cio->list);
+ spin_lock_init(&cio->list_lock);
+ }
+
+ for (ri = 0; ri < nr; ri++) {
+ offset = ranges[ri].comp_len;
+ src = ranges[ri].src + offset;
+ dst = ranges[ri].dst + offset;
+ /* If IO fails, we truncate comp_len */
+ ranges[ri].comp_len = ranges[ri].len;
+
+ for (rem = ranges[ri].len - offset; rem > 0; rem -= buf_len) {
+ req_len = min_t(int, max_hw_len, rem);
+
+ buf = blk_alloc_buf(req_len, &buf_len, gfp_mask);
+ if (!buf) {
+ ret = -ENOMEM;
+ goto err_alloc_buf;
+ }
+
+ ctx = kzalloc(sizeof(struct copy_ctx), gfp_mask);
+ if (!ctx) {
+ ret = -ENOMEM;
+ goto err_ctx;
+ }
+
+ read_bio = bio_map_kern(sq, buf, buf_len, gfp_mask);
+ if (IS_ERR(read_bio)) {
+ ret = PTR_ERR(read_bio);
+ goto err_read_bio;
+ }
+
+ write_bio = bio_map_kern(dq, buf, buf_len, gfp_mask);
+ if (IS_ERR(write_bio)) {
+ ret = PTR_ERR(write_bio);
+ goto err_write_bio;
+ }
+
+ ctx->cio = cio;
+ ctx->range_idx = ri;
+ ctx->write_bio = write_bio;
+ atomic_set(&ctx->refcount, 1);
+
+ read_bio->bi_iter.bi_sector = src >> SECTOR_SHIFT;
+ read_bio->bi_iter.bi_size = buf_len;
+ read_bio->bi_opf = REQ_OP_READ | REQ_SYNC;
+ bio_set_dev(read_bio, src_bdev);
+ read_bio->bi_end_io = blk_copy_emulate_read_end_io;
+ read_bio->bi_private = ctx;
+
+ write_bio->bi_iter.bi_size = buf_len;
+ write_bio->bi_opf = REQ_OP_WRITE | REQ_SYNC;
+ bio_set_dev(write_bio, dst_bdev);
+ write_bio->bi_end_io = blk_copy_emulate_write_end_io;
+ write_bio->bi_iter.bi_sector = dst >> SECTOR_SHIFT;
+ write_bio->bi_private = ctx;
+
+ if (bdev_is_zoned(dst_bdev)) {
+ INIT_WORK(&ctx->dispatch_work,
+ blk_zoned_copy_dispatch_work_fn);
+ INIT_LIST_HEAD(&ctx->list);
+ spin_lock_irqsave(&cio->list_lock, flags);
+ ctx->status = REQ_COPY_READ_PROGRESS;
+ list_add_tail(&ctx->list, &cio->list);
+ spin_unlock_irqrestore(&cio->list_lock, flags);
+ } else
+ INIT_WORK(&ctx->dispatch_work,
+ blk_copy_dispatch_work_fn);
+
+ atomic_inc(&cio->refcount);
+ submit_bio(read_bio);
+
+ src += buf_len;
+ dst += buf_len;
+ }
+ }
+
+ /* Wait for completion of all IO's*/
+ return cio_await_completion(cio);
+
+err_write_bio:
+ bio_put(read_bio);
+err_read_bio:
+ kfree(ctx);
+err_ctx:
+ kvfree(buf);
+err_alloc_buf:
+ ranges[ri].comp_len -= min_t(sector_t,
+ ranges[ri].comp_len, (ranges[ri].len - rem));
+
+ cio->io_err = ret;
+ return cio_await_completion(cio);
+}
+
static inline bool blk_check_copy_offload(struct request_queue *src_q,
struct request_queue *dst_q)
{
@@ -460,15 +693,21 @@ int blkdev_issue_copy(struct block_device *src_bdev,
struct request_queue *src_q = bdev_get_queue(src_bdev);
struct request_queue *dst_q = bdev_get_queue(dst_bdev);
int ret = -EINVAL;
+ bool offload = false;

ret = blk_copy_sanity_check(src_bdev, dst_bdev, ranges, nr);
if (ret)
return ret;

- if (blk_check_copy_offload(src_q, dst_q))
+ offload = blk_check_copy_offload(src_q, dst_q);
+ if (offload)
ret = blk_copy_offload(src_bdev, dst_bdev, ranges, nr,
end_io, private, gfp_mask);

+ if (ret || !offload)
+ ret = blk_copy_emulate(src_bdev, dst_bdev, ranges, nr,
+ end_io, private, gfp_mask);
+
return ret;
}
EXPORT_SYMBOL_GPL(blkdev_issue_copy);
diff --git a/block/blk-map.c b/block/blk-map.c
index 19940c978c73..bcf8db2b75f1 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -363,7 +363,7 @@ static void bio_invalidate_vmalloc_pages(struct bio *bio)
#endif
}

-static void bio_map_kern_endio(struct bio *bio)
+void bio_map_kern_endio(struct bio *bio)
{
bio_invalidate_vmalloc_pages(bio);
bio_uninit(bio);
@@ -380,7 +380,7 @@ static void bio_map_kern_endio(struct bio *bio)
* Map the kernel address into a bio suitable for io to a block
* device. Returns an error pointer in case of error.
*/
-static struct bio *bio_map_kern(struct request_queue *q, void *data,
+struct bio *bio_map_kern(struct request_queue *q, void *data,
unsigned int len, gfp_t gfp_mask)
{
unsigned long kaddr = (unsigned long)data;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 48e9160b7195..c5621550e5b4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1066,6 +1066,9 @@ int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
int blkdev_issue_copy(struct block_device *src_bdev,
struct block_device *dst_bdev, struct range_entry *ranges,
int nr, cio_iodone_t end_io, void *private, gfp_t gfp_mask);
+struct bio *bio_map_kern(struct request_queue *q, void *data, unsigned int len,
+ gfp_t gfp_mask);
+void bio_map_kern_endio(struct bio *bio);

#define BLKDEV_ZERO_NOUNMAP (1 << 0) /* do not free blocks */
#define BLKDEV_ZERO_NOFALLBACK (1 << 1) /* don't write explicit zeroes */
--
2.35.1.500.gb896f729e2

2023-01-12 15:17:46

by Hannes Reinecke

Subject: Re: [PATCH v6 3/9] block: add emulation for copy

On 1/12/23 12:58, Nitesh Shetty wrote:
> Copy emulation is added for devices which do not support copy. It is
> implemented by reading from the source ranges into memory and writing
> to the corresponding destinations asynchronously.
> For zoned devices we maintain a linked list of read submissions and try
> to submit the corresponding writes in the same order.
> Emulation is also used if copy offload fails or partially completes.
>
> Signed-off-by: Nitesh Shetty <[email protected]>
> Signed-off-by: Vincent Fu <[email protected]>
> Signed-off-by: Anuj Gupta <[email protected]>
> ---
> block/blk-lib.c | 241 ++++++++++++++++++++++++++++++++++++++++-
> block/blk-map.c | 4 +-
> include/linux/blkdev.h | 3 +
> 3 files changed, 245 insertions(+), 3 deletions(-)
>
I'm not sure if I agree with this one.

You just submitted a patch for device-mapper to implement copy offload,
which (to all intents and purposes) _is_ an emulation.

So why do we need to implement it in the block layer as an emulation?
Or, if we have to, why do we need the device-mapper emulation?
This emulation will be doing the same thing, no?

Cheers,

Hannes

2023-01-12 15:18:04

by Hannes Reinecke

Subject: Re: [PATCH v6 3/9] block: add emulation for copy

On 1/12/23 15:46, Hannes Reinecke wrote:
> On 1/12/23 12:58, Nitesh Shetty wrote:
>> Copy emulation is added for devices which do not support copy. It is
>> implemented by reading from the source ranges into memory and writing
>> to the corresponding destinations asynchronously.
>> For zoned devices we maintain a linked list of read submissions and try
>> to submit the corresponding writes in the same order.
>> Emulation is also used if copy offload fails or partially completes.
>>
>> Signed-off-by: Nitesh Shetty <[email protected]>
>> Signed-off-by: Vincent Fu <[email protected]>
>> Signed-off-by: Anuj Gupta <[email protected]>
>> ---
>>   block/blk-lib.c        | 241 ++++++++++++++++++++++++++++++++++++++++-
>>   block/blk-map.c        |   4 +-
>>   include/linux/blkdev.h |   3 +
>>   3 files changed, 245 insertions(+), 3 deletions(-)
>>
> I'm not sure if I agree with this one.
>
> You just submitted a patch for device-mapper to implement copy offload,
> which (to all intents and purposes) _is_ an emulation.
>
> So why do we need to implement it in the block layer as an emulation?
> Or, if we have to, why do we need the device-mapper emulation?
> This emulation will be doing the same thing, no?
>
Sheesh. One should read the entire patchset.

Disregard the above comment.

Cheers,

Hannes

2023-01-12 16:01:42

by Nitesh Shetty

Subject: [PATCH v6 4/9] block: Introduce a new ioctl for copy

Add a new BLKCOPY ioctl that offloads copying of one or more source
ranges to one or more destinations on a device. The ioctl accepts a
'copy_range' structure that contains the number of ranges and a
reserved field, followed by an array of ranges. Each range is
represented by a 'range_entry' that contains the source start offset,
the destination start offset and the length of the range (in bytes).

MAX_COPY_NR_RANGE limits the number of entries and
MAX_COPY_TOTAL_LENGTH limits the total copy length the ioctl can handle.

Example code to issue BLKCOPY:

/* Sample program copying three entries with [dst, src, len]:
 * [32768, 0, 4096] [36864, 4096, 4096] [40960, 8192, 4096] on the same
 * device. Includes are listed for completeness.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(void)
{
	int i, ret, fd;
	unsigned long src = 0, dst = 32768, len = 4096;
	struct copy_range *cr;

	cr = malloc(sizeof(*cr) + (sizeof(struct range_entry) * 3));
	if (!cr)
		return 1;
	cr->nr_range = 3;
	cr->reserved = 0;
	for (i = 0; i < cr->nr_range; i++, src += len, dst += len) {
		cr->ranges[i].dst = dst;
		cr->ranges[i].src = src;
		cr->ranges[i].len = len;
		cr->ranges[i].comp_len = 0;
	}

	fd = open("/dev/nvme0n1", O_RDWR);
	if (fd < 0)
		return 1;

	ret = ioctl(fd, BLKCOPY, cr);
	if (ret != 0)
		printf("copy failed, ret = %d\n", ret);

	for (i = 0; i < cr->nr_range; i++)
		if (cr->ranges[i].len != cr->ranges[i].comp_len)
			printf("Partial copy for entry %d: requested %llu, completed %llu\n",
			       i, cr->ranges[i].len, cr->ranges[i].comp_len);
	close(fd);
	free(cr);
	return ret;
}

Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
Signed-off-by: Javier González <[email protected]>
Signed-off-by: Anuj Gupta <[email protected]>
---
block/ioctl.c | 36 ++++++++++++++++++++++++++++++++++++
include/uapi/linux/fs.h | 9 +++++++++
2 files changed, 45 insertions(+)

diff --git a/block/ioctl.c b/block/ioctl.c
index 96617512982e..d636bc1f0047 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -120,6 +120,40 @@ static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode,
return err;
}

+static int blk_ioctl_copy(struct block_device *bdev, fmode_t mode,
+ unsigned long arg)
+{
+ struct copy_range ucopy_range, *kcopy_range = NULL;
+ size_t payload_size = 0;
+ int ret;
+
+ if (!(mode & FMODE_WRITE))
+ return -EBADF;
+
+ if (copy_from_user(&ucopy_range, (void __user *)arg,
+ sizeof(ucopy_range)))
+ return -EFAULT;
+
+ if (unlikely(!ucopy_range.nr_range || ucopy_range.reserved ||
+ ucopy_range.nr_range >= MAX_COPY_NR_RANGE))
+ return -EINVAL;
+
+ payload_size = (ucopy_range.nr_range * sizeof(struct range_entry)) +
+ sizeof(ucopy_range);
+
+ kcopy_range = memdup_user((void __user *)arg, payload_size);
+ if (IS_ERR(kcopy_range))
+ return PTR_ERR(kcopy_range);
+
+ ret = blkdev_issue_copy(bdev, bdev, kcopy_range->ranges,
+ kcopy_range->nr_range, NULL, NULL, GFP_KERNEL);
+ if (copy_to_user((void __user *)arg, kcopy_range, payload_size))
+ ret = -EFAULT;
+
+ kfree(kcopy_range);
+ return ret;
+}
+
static int blk_ioctl_secure_erase(struct block_device *bdev, fmode_t mode,
void __user *argp)
{
@@ -482,6 +516,8 @@ static int blkdev_common_ioctl(struct file *file, fmode_t mode, unsigned cmd,
return blk_ioctl_discard(bdev, mode, arg);
case BLKSECDISCARD:
return blk_ioctl_secure_erase(bdev, mode, argp);
+ case BLKCOPY:
+ return blk_ioctl_copy(bdev, mode, arg);
case BLKZEROOUT:
return blk_ioctl_zeroout(bdev, mode, arg);
case BLKGETDISKSEQ:
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 9248b6d259de..8af10b926a6f 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -82,6 +82,14 @@ struct range_entry {
__u64 comp_len;
};

+struct copy_range {
+ __u64 nr_range;
+ __u64 reserved;
+
+ /* Ranges always must be at the end */
+ struct range_entry ranges[];
+};
+
/* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
#define FILE_DEDUPE_RANGE_SAME 0
#define FILE_DEDUPE_RANGE_DIFFERS 1
@@ -203,6 +211,7 @@ struct fsxattr {
#define BLKROTATIONAL _IO(0x12,126)
#define BLKZEROOUT _IO(0x12,127)
#define BLKGETDISKSEQ _IOR(0x12,128,__u64)
+#define BLKCOPY _IOWR(0x12, 129, struct copy_range)
/*
* A jump here: 130-136 are reserved for zoned block devices
* (see uapi/linux/blkzoned.h)
--
2.35.1.500.gb896f729e2